Some new flexibilizations of Bregman divergences and their asymptotics

07/11/2017
Publication GSI2017
OAI : oai:www.see.asso.fr:17410:22588
 

Abstract

Ordinary Bregman divergences (distances) OBD are widely used in statistics, machine learning, and information theory (see e.g. [18], [5], [7], [14], [4], [23], [6], [16], [22], [25], [15]). They can be flexibilized in various different ways. For instance, there are the Scaled Bregman divergences SBD of Stummer [20] and Stummer & Vajda [21], which contain both the OBDs as well as the Csiszar-Ali-Silvey φ-divergences as special cases. On the other hand, the OBDs are subsumed by the Total Bregman divergences of Liu et al. [12], [13], Vemuri et al. [24] and the more general Conformal Divergences COD of Nock et al. [17]. The latter authors also indicated the possibility to combine the concepts of SBD and COD, under the name "Conformal Scaled Bregman divergences" CSBD. In this paper, we introduce some new divergences between (non-)probability distributions which particularly cover the corresponding OBD, SBD, COD and CSBD (for separable situations) as special cases. Non-convex generators are employed, too. Moreover, for the case of i.i.d. sampling we derive the asymptotics of a useful new-divergence-based test statistic.

License

Creative Commons None (All rights reserved)

DOI : 10.23723/17410/22588
Publisher : SEE
Publication year : 2018

Some new flexibilizations of Bregman divergences and their asymptotics

Wolfgang Stummer¹ and Anna-Lena Kißlinger²

¹ Department of Mathematics, University of Erlangen–Nürnberg, Cauerstrasse 11, 91058 Erlangen, Germany, as well as Affiliated Faculty Member of the School of Business and Economics, University of Erlangen–Nürnberg, Lange Gasse 20, 90403 Nürnberg, Germany
² Corresponding Author. Chair of Statistics and Econometrics, University of Erlangen–Nürnberg, Lange Gasse 20, 90403 Nürnberg, Germany

Abstract. Ordinary Bregman divergences (distances) OBD are widely used in statistics, machine learning, and information theory (see e.g. [18], [5], [7], [14], [4], [23], [6], [16], [22], [25], [15]). They can be flexibilized in various different ways. For instance, there are the Scaled Bregman divergences SBD of Stummer [20] and Stummer & Vajda [21], which contain both the OBDs as well as the Csiszar-Ali-Silvey φ-divergences as special cases. On the other hand, the OBDs are subsumed by the Total Bregman divergences of Liu et al. [12], [13], Vemuri et al. [24] and the more general Conformal Divergences COD of Nock et al. [17]. The latter authors also indicated the possibility to combine the concepts of SBD and COD, under the name "Conformal Scaled Bregman divergences" CSBD. In this paper, we introduce some new divergences between (non-)probability distributions which particularly cover the corresponding OBD, SBD, COD and CSBD (for separable situations) as special cases. Non-convex generators are employed, too. Moreover, for the case of i.i.d. sampling we derive the asymptotics of a useful new-divergence-based test statistic.

Keywords: Bregman divergences (distances), total Bregman divergences, conformal divergences, asymptotics of goodness-of-fit divergence.

1 Introduction and results

Let us assume that the modeled respectively observed random data take values in a state space $\mathcal{X}$ (with at least two distinct values), equipped with a system $\mathcal{A}$ of admissible events ($\sigma$-algebra). On this, we want to quantify the divergence (distance, dissimilarity, proximity) $D(P,Q)$ between two probability distributions $P, Q$ (our concept can be analogously worked out for non-probability distributions, i.e. non-negative measures, $P, Q$). Since the ultimate purposes of a (divergence-based) statistical inference or machine learning task may vary from case to case, it is of fundamental importance to have at hand a flexible, far-reaching toolbox $\mathcal{D} := \{D_{\phi,M_1,M_2,M_3}(P,Q) : \phi \in \Phi,\ M_1, M_2, M_3 \in \mathcal{M}\}$ of divergences which allows for goal-oriented, situation-based applicability; in the following, we present such a new toolbox, where the flexibility is controlled by various different choices of a "generator" $\phi \in \Phi$ and of scalings $M_1, M_2, M_3 \in \mathcal{M}$.

In order to achieve this goal, we use the following ingredients: (i) for the class $\mathcal{F}$ of all (measurable) functions from $\mathcal{Y} = (0,\infty)$ to $\bar{\mathbb{R}} := \mathbb{R} \cup \{\infty\} \cup \{-\infty\}$ and for a fixed subclass $\mathcal{U} \subset \mathcal{F}$, the divergence-generator family $\Phi = \Phi_{\mathcal{U}}$ is supposed to consist of all functions $\phi \in \mathcal{F}$ which are $\mathcal{U}$-convex and for which the strict $\mathcal{U}$-subdifferential $\partial_{\mathcal{U}}\phi|_{y_0}$ is non-empty for all $y_0 \in \mathcal{Y}$. Typically, the family $\mathcal{U}$ contains ("approximating") functions which are "less complicated" than $\phi$. Recall that (see e.g. [19]) a function $u : \mathcal{Y} \mapsto \bar{\mathbb{R}}$ is called a strict $\mathcal{U}$-subgradient of $\phi$ at a point $y_0 \in \mathcal{Y}$ if $u \in \mathcal{U}$ and $\phi(y) - \phi(y_0) \geq u(y) - u(y_0)$ for all $y \in \mathcal{Y}$, where the inequality is strict (i.e. $>$) for all $y \neq y_0$; the set of all strict $\mathcal{U}$-subgradients of $\phi$ at a point $y_0 \in \mathcal{Y}$ is called the strict $\mathcal{U}$-subdifferential of $\phi$ at $y_0$ and is denoted by $\partial_{\mathcal{U}}\phi|_{y_0}$. In case of $\partial_{\mathcal{U}}\phi|_{y_0} \neq \emptyset$ for all $y_0 \in \mathcal{Y}$, a function $\phi$ is characterized to be $\mathcal{U}$-convex if

$$\phi(y) = \max\{u(y) + c \,:\, u \in \mathcal{U},\ c \in \mathbb{R},\ u(z) + c \leq \phi(z) \text{ for all } z \in \mathcal{Y}\} \quad \text{for all } y \in \mathcal{Y}, \qquad (1)$$

and if furthermore the class $\mathcal{U}$ is invariant under addition of constants (i.e. if $\mathcal{U} + \mathrm{const} := \{u + c : u \in \mathcal{U},\ c \in \mathbb{R}\} = \mathcal{U}$), then (1) can be further simplified to $\phi(y) = \max\{u(y) : u \in \mathcal{U} \text{ and } u(z) \leq \phi(z) \text{ for all } z \in \mathcal{Y}\}$ for all $y \in \mathcal{Y}$ ("curved lower envelope" at $y$). The most prominent special case is the class $\mathcal{U} = \mathcal{U}_{al}$ of all affine-linear functions, for which the divergence-generator family $\Phi = \Phi_{\mathcal{U}_{al}}$ is the class of all "usual" strictly convex lower semicontinuous functions on $(0,\infty)$ (a small numerical illustration follows below).
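To make the simplified "lower envelope" representation in (1) concrete for the affine-linear class $\mathcal{U}_{al}$, the following minimal Python sketch recovers a classically convex generator as the pointwise maximum of its tangent lines on a grid. The particular generator $\phi(y) = y\log y + 1 - y$ and all grid choices are merely illustrative assumptions, not prescribed by the paper.

```python
import numpy as np

# "Lower envelope" view of U-convexity for the affine-linear class U_al:
# a strictly convex phi is recovered as the pointwise maximum of its tangent lines.
phi = lambda y: y * np.log(y) + 1 - y          # illustrative generator
phi_prime = lambda y: np.log(y)

anchors = np.linspace(0.05, 4.0, 200)          # points y0 at which tangent lines are taken
ys = np.linspace(0.1, 3.0, 50)                 # evaluation grid

# tangent line at y0 evaluated at y: phi(y0) + phi'(y0)*(y - y0), an element of U_al (+ constant)
tangents = phi(anchors)[None, :] + phi_prime(anchors)[None, :] * (ys[:, None] - anchors[None, :])
envelope = tangents.max(axis=1)                # max over tangent lines lying below phi, cf. (1)

print(np.max(np.abs(envelope - phi(ys))))      # small: the envelope recovers phi on the grid
```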
(ii) As a second group of ingredients, the two probability distributions $P, Q$ are supposed to be described by their probability densities $x \mapsto p(x) \geq 0$, $x \mapsto q(x) \geq 0$ via $P[A] = \int_A p(x)\, d\lambda(x)$, $Q[A] = \int_A q(x)\, d\lambda(x)$ ($A \in \mathcal{A}$), where $\lambda$ is a fixed (maybe non-probability, sigma-finite) distribution, and one has the normalizations $\int_{\mathcal{X}} p(x)\, d\lambda(x) = \int_{\mathcal{X}} q(x)\, d\lambda(x) = 1$. The set of all such probability distributions will be denoted by $\mathcal{M}^1_\lambda$. We also employ the set $\mathcal{M}_\lambda$ of all general (maybe non-probability) distributions $M$ of the form $M[A] = \int_A m(x)\, d\lambda(x)$ ($A \in \mathcal{A}$) with density $x \mapsto m(x) \geq 0$. For instance, in the discrete setup where $\mathcal{X} = \mathcal{X}_{count}$ has countably many elements and $\lambda := \lambda_{count}$ is the counting measure (i.e. $\lambda_{count}[\{x\}] = 1$ for all $x \in \mathcal{X}_{count}$), $p(\cdot), q(\cdot)$ are (e.g. binomial) probability mass functions and $m(\cdot)$ is a (e.g. unnormalized-histogram-related) general mass function. If $\lambda$ is the Lebesgue measure on $\mathcal{X} = \mathbb{R}$, then $p(\cdot), q(\cdot)$ are (e.g. Gaussian) probability density functions and $m(\cdot)$ is a general (possibly unnormalized) density function. Within such a context, we introduce the following framework of statistical distances:

Definition 1. Let $\phi \in \Phi_{\mathcal{U}}$. Then the divergence (distance) of $P, Q \in \mathcal{M}^1_\lambda$ scaled by $M_1, M_2 \in \mathcal{M}_\lambda$ and aggregated by $M_3 \in \mathcal{M}_\lambda$ is defined by

$$0 \leq D_{\phi,M_1,M_2,M_3}(P,Q) := \int_{\mathcal{X}} \Big[ \inf_{u \in \partial_{\mathcal{U}}\phi|_{q(x)/m_2(x)}} \Big( \phi\big(\tfrac{p(x)}{m_1(x)}\big) - u\big(\tfrac{p(x)}{m_1(x)}\big) - \phi\big(\tfrac{q(x)}{m_2(x)}\big) + u\big(\tfrac{q(x)}{m_2(x)}\big) \Big) \Big]\, m_3(x)\, d\lambda(x). \qquad (2)$$

In (2), we can also extend $[\ldots]$ to $G([\ldots])$ for some nonnegative scalar function $G$ satisfying $G(z) = 0$ iff $z = 0$. To guarantee the existence of the integrals in (2) (with possibly infinite values), the zeros of $p, q, m_1, m_2, m_3$ have to be combined by proper conventions (taking into account the limit of $\phi(y)$ at $y = 0$); the full details will appear elsewhere.

Notice that $D_{\phi,M_1,M_2,M_3}(P,Q) \geq 0$, with equality iff $p(x) = \frac{m_1(x)}{m_2(x)} \cdot q(x)$ for all $x$ (in case of absence of zeros). For the special case of the discrete setup $(\mathcal{X}_{count}, \lambda_{count})$, (2) becomes

$$0 \leq D_{\phi,M_1,M_2,M_3}(P,Q) = \sum_{x \in \mathcal{X}} \Big[ \inf_{u \in \partial_{\mathcal{U}}\phi|_{q(x)/m_2(x)}} \Big( \phi\big(\tfrac{p(x)}{m_1(x)}\big) - u\big(\tfrac{p(x)}{m_1(x)}\big) - \phi\big(\tfrac{q(x)}{m_2(x)}\big) + u\big(\tfrac{q(x)}{m_2(x)}\big) \Big) \Big]\, m_3(x).$$
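For orientation, here is a minimal numerical sketch of this discrete version of (2) for the classical case $\mathcal{U} = \mathcal{U}_{al}$ with a differentiable, strictly convex generator: there the strict subgradient at $q(x)/m_2(x)$ is the tangent line, so the bracket reduces to $\phi(a) - \phi(b) - \phi'(b)(a-b)$ with $a = p(x)/m_1(x)$, $b = q(x)/m_2(x)$ (cf. (5) below). The toy vectors and the omission of zero-handling are our own illustrative choices, not part of the paper.

```python
import numpy as np

def discrete_divergence(p, q, m1, m2, m3, phi, phi_prime):
    """Discrete version of Definition 1 for a classically convex, differentiable generator:
    the integrand reduces to phi(a) - phi(b) - phi'(b)*(a - b) with a = p/m1, b = q/m2.
    All inputs are nonnegative arrays of equal length; zeros are not handled here."""
    a = p / m1
    b = q / m2
    per_point = phi(a) - phi(b) - phi_prime(b) * (a - b)
    return np.sum(per_point * m3)

# Example: Kullback-Leibler divergence as the special case m1 = m2 = m3 = q
# with generator phi(y) = y*log(y) + 1 - y (cf. Section 2).
phi = lambda y: y * np.log(y) + 1.0 - y
phi_prime = lambda y: np.log(y)
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(discrete_divergence(p, q, q, q, q, phi, phi_prime))   # = KL(P, Q)
```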
In the following, we illuminate several special cases, in a "structured" manner:

(I) Let $\tilde\phi$ be from the class $\Phi := \Phi_{C1} \subset \Phi_{\mathcal{U}_{al}}$ of functions $\tilde\phi : (0,\infty) \mapsto \mathbb{R}$ which are continuously differentiable with derivative $\tilde\phi'$, strictly convex, continuously extended to $y = 0$, and (say) satisfy $\tilde\phi(1) = 0$. Moreover, let $h : \mathbb{R} \mapsto \mathbb{R}$ be a function which is strictly increasing on the range $\mathcal{R}_{\tilde\phi}$ of $\tilde\phi$ and which satisfies $h(0) = 0$ as well as $h(z) < \inf_{s \in \mathcal{R}_{\tilde\phi}} h(s)$ for all $z \notin \mathcal{R}_{\tilde\phi}$. For the generator $\phi(y) := h(\tilde\phi(y))$ we choose $\mathcal{U} = \mathcal{U}_h := \{h(a + b \cdot y) : a \in \mathbb{R},\ b \in \mathbb{R},\ y \in [0,\infty)\}$ to obtain

$$0 \leq D_{\phi,M_1,M_2,M_3}(P,Q) = \int_{\mathcal{X}} \Big[ \phi\big(\tfrac{p(x)}{m_1(x)}\big) - h\Big( \tilde\phi\big(\tfrac{q(x)}{m_2(x)}\big) + \tilde\phi'\big(\tfrac{q(x)}{m_2(x)}\big) \cdot \big(\tfrac{p(x)}{m_1(x)} - \tfrac{q(x)}{m_2(x)}\big) \Big) \Big]\, m_3(x)\, d\lambda(x). \qquad (3)$$

As a first example, take $\tilde\phi(y) := (y-1)^2/2$ ($y \geq 0$) with $\mathcal{R}_{\tilde\phi} = [0,\infty)$ and $h(z) := (z-1)^3 + 1$ ($z \in \mathbb{R}$). The generator $\phi(y) := h(\tilde\phi(y)) = (0.5 \cdot y^2 - y - 0.5)^3 + 1$ is a degree-6 polynomial which is neither convex nor concave in the classical sense, and $u_{y_0}(y) := h(\tilde\phi(y_0) + \tilde\phi'(y_0)\cdot(y - y_0)) = (y \cdot y_0 - y - 0.5 \cdot y_0^2 - 0.5)^3 + 1 \in \mathcal{U}_h$ is a degree-3 polynomial which is a strict $\mathcal{U}_h$-subgradient of $\phi$ at $y_0 \geq 0$ (see the numerical sketch below).

As a second example, let $\tilde\phi \in \Phi_{C1}$ have a continuous second derivative and let $h$ be twice continuously differentiable and strictly convex on $\mathcal{R}_{\tilde\phi}$ with $h'(0) = 1$ (in addition to the above assumptions). Then $\phi(y) := h(\tilde\phi(y))$ is in $\Phi_{C1}$ and has strictly larger curvature than $\tilde\phi$ (except at $y = 1$). Especially, for $h(z) := \exp(z) - 1$ ($z \in \mathbb{R}$) the generator $\phi$ is basically strictly log-convex and the divergence in (3) becomes

$$\int_{\mathcal{X}} \Big[ \exp\tilde\phi\big(\tfrac{p(x)}{m_1(x)}\big) - \exp\tilde\phi\big(\tfrac{q(x)}{m_2(x)}\big) \cdot \exp\Big( \tilde\phi'\big(\tfrac{q(x)}{m_2(x)}\big) \cdot \big(\tfrac{p(x)}{m_1(x)} - \tfrac{q(x)}{m_2(x)}\big) \Big) \Big]\, m_3(x)\, d\lambda(x). \qquad (4)$$
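The following small Python sketch illustrates the first example of (I): it checks numerically that $u_{y_0}$ is a strict $\mathcal{U}_h$-subgradient of the non-convex degree-6 generator, and evaluates the resulting divergence (3) in the discrete setup with $m_1 = m_2 = m_3 = 1$. The toy probability vectors and the grid are hypothetical choices of ours.

```python
import numpy as np

# Generator built from phi_tilde(y) = (y-1)^2/2 and h(z) = (z-1)^3 + 1:
# phi(y) = (0.5*y**2 - y - 0.5)**3 + 1 is a degree-6 polynomial, neither convex nor concave.
phi = lambda y: (0.5 * y**2 - y - 0.5)**3 + 1.0
# Strict U_h-subgradient of phi at y0 (a degree-3 polynomial in y):
u = lambda y, y0: (y * y0 - y - 0.5 * y0**2 - 0.5)**3 + 1.0

# Numerical check of the strict subgradient property
# phi(y) - phi(y0) > u(y, y0) - u(y0, y0) for y != y0 (small tolerance for rounding):
ys = np.linspace(0.01, 5.0, 500)
y0 = 1.7
assert np.all(phi(ys[ys != y0]) - phi(y0) > u(ys[ys != y0], y0) - u(y0, y0) - 1e-12)

# Divergence (3) in the discrete setup with m1 = m2 = m3 = 1, for toy p and q:
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
D = np.sum(phi(p) - u(p, q))   # integrand phi(p) - u_{q}(p), summed over the three states
print(D)
```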
(II) If $\phi$ itself is in the subclass $\Phi := \Phi_{C1} \subset \Phi_{\mathcal{U}_{al}}$, we obtain from (2)

$$0 \leq D_{\phi,M_1,M_2,M_3}(P,Q) = \int_{\mathcal{X}} \Big[ \phi\big(\tfrac{p(x)}{m_1(x)}\big) - \phi\big(\tfrac{q(x)}{m_2(x)}\big) - \phi'\big(\tfrac{q(x)}{m_2(x)}\big) \cdot \big(\tfrac{p(x)}{m_1(x)} - \tfrac{q(x)}{m_2(x)}\big) \Big]\, m_3(x)\, d\lambda(x). \qquad (5)$$

In contrast, if $\phi$ has a non-differentiable "cusp" at $y_0 = q(x)/m_2(x)$, then one has to take the smaller of the two deviations (at $y = p(x)/m_1(x)$) from the right-hand respectively left-hand tangent line at $y_0$. Notice that in (5) one obtains the same divergence if $\phi$ is replaced by $\phi(y) + c_1 + c_2 \cdot y$ ($y \in (0,\infty)$) for any $c_1, c_2 \in \mathbb{R}$. In the subcase $\phi(y) := \exp(\tilde\phi(y)) - 1$ of (I), the divergence in (5) becomes

$$\int_{\mathcal{X}} \Big[ \exp\tilde\phi\big(\tfrac{p(x)}{m_1(x)}\big) - \exp\tilde\phi\big(\tfrac{q(x)}{m_2(x)}\big) \cdot \Big( 1 + \tilde\phi'\big(\tfrac{q(x)}{m_2(x)}\big) \cdot \big(\tfrac{p(x)}{m_1(x)} - \tfrac{q(x)}{m_2(x)}\big) \Big) \Big]\, m_3(x)\, d\lambda(x),$$

which is larger than (4); the latter uses the additional information of log-convexity. This holds analogously for the more general $h$ leading to larger curvature.

(III) By further specializing $\phi \in \Phi_{C1}$, $m_1(x) = m_2(x) =: m_\ell(x)$, $m_3(x) = m_\ell(x) \cdot H\big((m_g(x))_{x \in \mathcal{X}}\big)$ for some (measurable) function $m_g : \mathcal{X} \mapsto [0,\infty)$ and some strictly positive scalar functional $H$ thereupon, we deduce from (5)

$$0 \leq B_{\phi,M_g,H}(P,Q \mid M_\ell) := D_{\phi,M_1,M_2,M_3}(P,Q) = H\big((m_g(x))_{x \in \mathcal{X}}\big) \cdot \int_{\mathcal{X}} \Big[ \phi\big(\tfrac{p(x)}{m_\ell(x)}\big) - \phi\big(\tfrac{q(x)}{m_\ell(x)}\big) - \phi'\big(\tfrac{q(x)}{m_\ell(x)}\big) \cdot \big(\tfrac{p(x)}{m_\ell(x)} - \tfrac{q(x)}{m_\ell(x)}\big) \Big]\, m_\ell(x)\, d\lambda(x). \qquad (6)$$

The term $H\big((m_g(x))_{x \in \mathcal{X}}\big)$ can be viewed as a "global steepness tuning" multiplier of the generator $\phi$, in the sense of $B_{\phi,M_g,H}(P,Q \mid M_\ell) = B_{c \cdot \phi,\, M_g,\, 1}(P,Q \mid M_\ell)$ with $c := H\big((m_g(x))_{x \in \mathcal{X}}\big)$, where $1$ denotes the functional with constant value 1. This becomes non-trivial for the subcase where the "global" density $m_g$ depends on the probability distributions $P, Q$ whose distance we want to quantify, e.g. if $M_g = W_g(P,Q)$ in the sense of $m_g(x) = w_g(p(x), q(x)) \geq 0$ for some (measurable) "global scale-connector" $w_g : [0,\infty) \times [0,\infty) \mapsto [0,\infty]$ between the densities $p(x)$ and $q(x)$. Analogously, one can also use "local" scaling distributions of the form $M_\ell = W_\ell(P,Q)$ in the sense that $m_\ell(x) = w_\ell(p(x), q(x)) \geq 0$ ($\lambda$-a.a. $x \in \mathcal{X}$) for some "local scale-connector" $w_\ell : [0,\infty) \times [0,\infty) \mapsto [0,\infty]$ between the densities $p(x)$ and $q(x)$ (where $w_\ell$ is strictly positive on $(0,\infty) \times (0,\infty)$). Accordingly, (6) turns into

$$B_{\phi,M_g,H}(P,Q \mid M_\ell) = B_{\phi,W_g(P,Q),H}(P,Q \mid W_\ell(P,Q)) = H\big((w_g(p(x),q(x)))_{x \in \mathcal{X}}\big) \cdot \int_{\mathcal{X}} \Big[ \phi\big(\tfrac{p(x)}{w_\ell(p(x),q(x))}\big) - \phi\big(\tfrac{q(x)}{w_\ell(p(x),q(x))}\big) - \phi'\big(\tfrac{q(x)}{w_\ell(p(x),q(x))}\big) \cdot \big(\tfrac{p(x)}{w_\ell(p(x),q(x))} - \tfrac{q(x)}{w_\ell(p(x),q(x))}\big) \Big]\, w_\ell(p(x),q(x))\, d\lambda(x). \qquad (7)$$

In the discrete setup $(\mathcal{X}_{count}, \lambda_{count})$, (7) leads to

$$B_{\phi,M_g,H}(P,Q \mid M_\ell) = B_{\phi,W_g(P,Q),H}(P,Q \mid W_\ell(P,Q)) = H\big((w_g(p(x),q(x)))_{x \in \mathcal{X}}\big) \cdot \sum_{x \in \mathcal{X}} \Big[ \phi\big(\tfrac{p(x)}{w_\ell(p(x),q(x))}\big) - \phi\big(\tfrac{q(x)}{w_\ell(p(x),q(x))}\big) - \phi'\big(\tfrac{q(x)}{w_\ell(p(x),q(x))}\big) \cdot \big(\tfrac{p(x)}{w_\ell(p(x),q(x))} - \tfrac{q(x)}{w_\ell(p(x),q(x))}\big) \Big] \cdot w_\ell(p(x),q(x)). \qquad (8)$$

Returning to the general setup, from (7) we can extract the following well-known, widely used distances as special subcases of our universal framework (a numerical illustration of (IIIa)-(IIIc) follows after this list):

(IIIa) Ordinary Bregman divergences OBD between probability distributions (see e.g. Pardo & Vajda [18]): $\int_{\mathcal{X}} [\phi(p(x)) - \phi(q(x)) - \phi'(q(x)) \cdot (p(x) - q(x))]\, d\lambda(x) = B_{\phi,I,1}(P,Q \mid I) = B_{\phi,M_g,H}(P,Q \mid M_\ell)$, where $M_g = I$, $M_\ell = I$ means $m_g(x) = 1$, $m_\ell(x) = 1$ for all $x$.

(IIIb) Csiszar-Ali-Silvey $\phi$-divergences CASD (cf. Csiszar [8], Ali & Silvey [3]): $\int_{\mathcal{X}} q(x) \cdot \phi\big(\tfrac{p(x)}{q(x)}\big)\, d\lambda(x) = B_{\phi,I,1}(P,Q \mid Q)$. This includes in particular the Kullback-Leibler information divergence and Pearson's chi-square divergence (see Section 2 for explicit formulas).

(IIIc) Scaled Bregman divergences SBD (cf. Stummer [20], Stummer & Vajda [21]): $\int_{\mathcal{X}} \big[ \phi\big(\tfrac{p(x)}{m_\ell(x)}\big) - \phi\big(\tfrac{q(x)}{m_\ell(x)}\big) - \phi'\big(\tfrac{q(x)}{m_\ell(x)}\big) \cdot \big(\tfrac{p(x)}{m_\ell(x)} - \tfrac{q(x)}{m_\ell(x)}\big) \big]\, m_\ell(x)\, d\lambda(x) = B_{\phi,I,1}(P,Q \mid M_\ell)$. The sub-setup $m_\ell(x) = w_\ell(p(x), q(x)) \geq 0$ was used in Kißlinger & Stummer [11] for comprehensive investigations on robustness; see also [9], [10].

(IIId) Total Bregman divergences (cf. Liu et al. [12], [13], Vemuri et al. [24]): $\frac{1}{\sqrt{1 + \int_{\mathcal{X}} (\phi'(q(x)))^2\, d\lambda(x)}} \cdot \int_{\mathcal{X}} [\phi(p(x)) - \phi(q(x)) - \phi'(q(x)) \cdot (p(x) - q(x))]\, d\lambda(x) = B_{\phi,M_g^{to},H^{to}}(P,Q \mid I)$, where $M_g^{to} := W_g^{to}(P,Q)$ in the sense of $m_g^{to}(x) := w_g^{to}(p(x), q(x)) := (\phi'(q(x)))^2$, and $H^{to}\big((h(x))_{x \in \mathcal{X}}\big) := \frac{1}{\sqrt{1 + \int_{\mathcal{X}} h(x)\, d\lambda(x)}}$. For example, in the special case of the discrete setup $(\mathcal{X}_{fin}, \lambda_{count})$ where $\mathcal{X} = \mathcal{X}_{fin}$ has only finitely (rather than countably) many elements, Liu et al. [12], [13], Vemuri et al. [24] also deal with non-probability vectors and non-additive aggregations and show that their total Bregman divergences have the advantage of being invariant against certain transformations, e.g. those from the special linear group (matrices whose determinant is equal to 1, for instance rotations).

(IIIe) Conformal divergences: $H\big((w_g(q(x)))_{x \in \mathcal{X}}\big) \cdot \int_{\mathcal{X}} [\phi(p(x)) - \phi(q(x)) - \phi'(q(x)) \cdot (p(x) - q(x))]\, d\lambda(x) = B_{\phi,W_g(Q),H}(P,Q \mid I)$. (9) For the special case of the finite discrete setup $(\mathcal{X}_{fin}, \lambda_{count})$, (9) reduces to the conformal Bregman divergences of Nock et al. [17]; within this $(\mathcal{X}_{fin}, \lambda_{count})$ they also consider non-probability vectors and non-additive aggregations.

(IIIf) Scaled conformal divergences: $H\big((w_g(\tfrac{q(x)}{m_\ell(x)}))_{x \in \mathcal{X}}\big) \cdot \int_{\mathcal{X}} \big[ \phi\big(\tfrac{p(x)}{m_\ell(x)}\big) - \phi\big(\tfrac{q(x)}{m_\ell(x)}\big) - \phi'\big(\tfrac{q(x)}{m_\ell(x)}\big) \cdot \big(\tfrac{p(x)}{m_\ell(x)} - \tfrac{q(x)}{m_\ell(x)}\big) \big]\, m_\ell(x)\, d\lambda(x) = B_{\phi,W_g(Q/M_\ell),H}(P,Q \mid M_\ell)$. (10) In the special finite discrete setup $(\mathcal{X}_{fin}, \lambda_{count})$, (10) leads to the scaled conformal Bregman divergences indicated in Nock et al. [17]; within $(\mathcal{X}_{fin}, \lambda_{count})$ they also employ non-probability vectors and non-additive aggregations.

(IIIg) Generalized Burbea-Rao divergences with $\beta \in (0,1)$: $H\big((m_g(x))_{x \in \mathcal{X}}\big) \cdot \int_{\mathcal{X}} \big[ \beta \cdot \phi(p(x)) + (1-\beta) \cdot \phi(q(x)) - \phi(\beta p(x) + (1-\beta) q(x)) \big]\, d\lambda(x) = B_{\phi_2,M_g,H}\big(P,Q \mid M_\ell^{(\beta,\phi)}\big)$, where $\phi_2(y) := (y-1)^2/2$ and $M_\ell^{(\beta,\phi)} = W_\ell^{(\beta,\phi)}(P,Q)$ in the sense that $m_\ell^{(\beta,\phi)}(x) = w_\ell^{(\beta,\phi)}(p(x), q(x))$ with $w_\ell^{(\beta,\phi)}(u,v) := \frac{(u-v)^2}{2 \cdot (\beta \cdot \phi(u) + (1-\beta) \cdot \phi(v) - \phi(\beta u + (1-\beta) v))}$. In analogy with the considerations in (IIIe) above, one may call the special case $B_{\phi_2,W_g(Q),H}\big(P,Q \mid M_\ell^{(\beta,\phi)}\big)$ a conformal Burbea-Rao divergence.
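The following self-contained Python sketch illustrates (IIIa)-(IIIc) in the finite discrete setup: the scaled Bregman form $B_{\phi,I,1}(P,Q \mid M)$ with the generators $\phi_1(y) = y\log y + 1 - y$ and $\phi_2(y) = (y-1)^2/2$ used in Section 2, recovering the ordinary Bregman divergence, the Kullback-Leibler divergence, and Pearson's chi-square divergence. The toy probability vectors are arbitrary assumptions, and zero entries are not handled.

```python
import numpy as np

def scaled_bregman(p, q, m, phi, phi_prime):
    """B_{phi,I,1}(P, Q | M): scaled Bregman divergence (IIIc) in the discrete setup,
    i.e. sum over x of [phi(p/m) - phi(q/m) - phi'(q/m)*(p/m - q/m)] * m."""
    a, b = p / m, q / m
    return np.sum((phi(a) - phi(b) - phi_prime(b) * (a - b)) * m)

phi1, phi1_prime = (lambda y: y * np.log(y) + 1 - y), np.log        # KL generator (Section 2)
phi2, phi2_prime = (lambda y: 0.5 * (y - 1)**2), (lambda y: y - 1)  # Pearson generator

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])

obd     = scaled_bregman(p, q, np.ones_like(p), phi2, phi2_prime)  # (IIIa): ordinary Bregman divergence
kl      = scaled_bregman(p, q, q, phi1, phi1_prime)                # (IIIb): Kullback-Leibler divergence
pearson = scaled_bregman(p, q, q, phi2, phi2_prime)                # (IIIb): Pearson chi-square divergence

assert np.isclose(kl, np.sum(p * np.log(p / q)))
assert np.isclose(pearson, np.sum((p - q)**2 / (2 * q)))
print(obd, kl, pearson)
```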
To conclude Section 1, let us mention that there is a well-known interplay between the geometry of parameters for exponential families and divergences in the setups (IIIa)-(IIIe) (see e.g. [2], [4], [21], [9], [1], [17]). To gain further insights, it would be illuminating to extend this to the context of Definition 1.

2 General asymptotic results for the finite discrete case

In this section, we deal with the above-mentioned setup (8) and assume additionally that the function $\phi \in \Phi_{C1}$ is thrice continuously differentiable on $(0,\infty)$, and that all three functions $w_\ell(u,v)$, $w_1(u,v) := \frac{\partial w_\ell}{\partial u}(u,v)$ and $w_{11}(u,v) := \frac{\partial^2 w_\ell}{\partial u^2}(u,v)$ are continuous in all $(u,v)$ of some (maybe tiny) neighbourhood of the diagonal $\{(t,t) : t \in (0,1)\}$ (so that the behaviour for $u \approx v$ is technically appropriate). In such a setup, we consider the following context: for $i \in \mathbb{N}$, let the observation of the $i$-th data point be represented by the random variable $X_i$ which takes values in some finite space $\mathcal{X} := \{x_1, \ldots, x_s\}$ with $s := |\mathcal{X}| \geq 2$ outcomes; thus, we choose the counting distribution $\lambda := \lambda_{count}$ as reference distribution (i.e. $\lambda_{count}[\{x_k\}] = 1$ for all $k$). Accordingly, let $X_1, \ldots, X_N$ represent a random sample of independent and identically distributed observations generated from an unknown true distribution $P_{\theta_{true}}$, which is supposed to be a member of a parametric family $\mathcal{P}_\Theta := \{P_\theta \in \mathcal{M}^1_\lambda : \theta \in \Theta\}$ of hypothetical, potential candidate distributions with probability mass functions $p_\theta$. Here, $\Theta \subset \mathbb{R}^\ell$ is an $\ell$-dimensional parameter set. Moreover, we denote by $P := P^{emp}_N := \frac{1}{N} \sum_{i=1}^N \delta_{X_i}[\cdot]$ the empirical distribution, whose probability mass function $p^{emp}_N$ consists of the relative frequencies $p(x) = p^{emp}_N(x) = \frac{1}{N} \cdot \#\{i \in \{1,\ldots,N\} : X_i = x\}$ (i.e. the "histogram entries"). If the sample size $N$ tends to infinity, it is intuitively plausible that the divergence (cf. (8))

$$0 \leq \frac{T^\phi_N(P^{emp}_N, P_\theta)}{2N} := B_{\phi,W_g(P^{emp}_N,P_\theta),H}\big(P^{emp}_N, P_\theta \mid W_\ell(P^{emp}_N, P_\theta)\big) = H\big((w_g(p^{emp}_N(x), p_\theta(x)))_{x \in \mathcal{X}}\big) \cdot \sum_{x \in \mathcal{X}} \Big[ \phi\big(\tfrac{p^{emp}_N(x)}{w_\ell(p^{emp}_N(x), p_\theta(x))}\big) - \phi\big(\tfrac{p_\theta(x)}{w_\ell(p^{emp}_N(x), p_\theta(x))}\big) - \phi'\big(\tfrac{p_\theta(x)}{w_\ell(p^{emp}_N(x), p_\theta(x))}\big) \cdot \big(\tfrac{p^{emp}_N(x)}{w_\ell(p^{emp}_N(x), p_\theta(x))} - \tfrac{p_\theta(x)}{w_\ell(p^{emp}_N(x), p_\theta(x))}\big) \Big] \cdot w_\ell(p^{emp}_N(x), p_\theta(x)) =: H\big((w_g(p^{emp}_N(x), p_\theta(x)))_{x \in \mathcal{X}}\big) \cdot \zeta_N =: \Upsilon_N \cdot \zeta_N \qquad (11)$$

between the data-derived empirical distribution $P^{emp}_N$ and the candidate model $P_\theta$ converges to zero, provided that we have found the correct model in the sense that $P_\theta$ is equal to the true data-generating distribution $P_{\theta_{true}}$, and that $H\big((w_g(p^{emp}_N(x), p_\theta(x)))_{x \in \mathcal{X}}\big)$ converges a.s. to a constant $a_\theta > 0$. In the same line of argumentation, $B_{\phi,W_g(P^{emp}_N,P_\theta),H}(P^{emp}_N, P_\theta \mid W_\ell(P^{emp}_N, P_\theta))$ becomes close to zero if $P_\theta$ is close to $P_{\theta_{true}}$. Notice that (say, for $p^{emp}_N$ and $p_\theta$ without zeros) in the Kullback-Leibler divergence KL case with $\phi_1(y) := y \log y + 1 - y \geq 0$ ($y > 0$),

$$B_{\phi_1,I,1}(P^{emp}_N, P_\theta \mid P_\theta) = \sum_{x \in \mathcal{X}} p_\theta(x) \cdot \phi_1\Big(\frac{p^{emp}_N(x)}{p_\theta(x)}\Big) = \sum_{x \in \mathcal{X}} p^{emp}_N(x) \cdot \log\Big(\frac{p^{emp}_N(x)}{p_\theta(x)}\Big)$$

is nothing but the (multiple of the) very prominent likelihood ratio test statistic (likelihood disparity); minimizing it over $\theta$ produces the maximum likelihood estimate $\hat\theta_{MLE}$. Moreover, by employing $\phi_2(y) := (y-1)^2/2$, the divergence $B_{\phi_2,I,1}(P^{emp}_N, P_\theta \mid P_\theta) = \sum_{x \in \mathcal{X}} \frac{(p^{emp}_N(x) - p_\theta(x))^2}{2 p_\theta(x)}$ represents the (multiple of the) Pearson chi-square test statistic.
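As a concrete illustration of (11) under simple choices, the Python sketch below simulates an i.i.d. sample from a hypothesized finite model and evaluates $T^\phi_N = 2N \cdot B(\ldots)$ for the two generators just mentioned, using the local scale-connector $w_\ell(u,v) = v$ and $H \equiv 1$ (so $a_\theta = 1$). The model $p_\theta$, the sample size, and these scaling choices are our own illustrative assumptions, not prescribed by the paper; zero cells are not handled (cf. the zero conventions mentioned after (2)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothesized model P_theta on s = 4 outcomes and a simulated i.i.d. sample from it:
p_theta = np.array([0.1, 0.2, 0.3, 0.4])
N = 5000
sample = rng.choice(4, size=N, p=p_theta)
p_emp = np.bincount(sample, minlength=4) / N      # relative frequencies ("histogram entries")

def B(p, q, m, phi, phi_prime):
    """Scaled Bregman divergence (8) with w_g == 1 (so H(...) = 1) and local scaling m."""
    a, b = p / m, q / m
    return np.sum((phi(a) - phi(b) - phi_prime(b) * (a - b)) * m)

phi1, phi1_prime = (lambda y: y * np.log(y) + 1 - y), np.log         # likelihood-disparity generator
phi2, phi2_prime = (lambda y: 0.5 * (y - 1)**2), (lambda y: y - 1)   # Pearson generator

# Divergence test statistic T_N = 2N * B(P_emp, P_theta | P_theta), i.e. w_l(u, v) = v:
T_lr      = 2 * N * B(p_emp, p_theta, p_theta, phi1, phi1_prime)     # likelihood ratio statistic
T_pearson = 2 * N * B(p_emp, p_theta, p_theta, phi2, phi2_prime)     # Pearson chi-square statistic
print(T_lr, T_pearson)   # under H0, an ordinary chi-square limit (here with s - 1 = 3 degrees of
                         # freedom); cf. Theorem 1 and the chi-square remark below
```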
Concerning the above-mentioned conjectures where the sample size $N$ tends to infinity, in case of $P_{\theta_{true}} = P_\theta$ one can even derive the limit distribution of the divergence test statistic $T^\phi_N(P^{emp}_N, P_\theta)$ in quite "universal generality":

Theorem 1. Under the null hypothesis "$H_0$: $P_{\theta_{true}} = P_\theta$ with $p_\theta(x) > 0$ for all $x \in \mathcal{X}$" and the existence of $\text{a.s.-}\lim_{N \to \infty} H\big((w_g(p^{emp}_N(x), p_\theta(x)))_{x \in \mathcal{X}}\big) =: a_\theta > 0$, the asymptotic distribution (as $N \to \infty$) of $T^\phi_N(P^{emp}_N, P_\theta) = 2N \cdot B_{\phi,W_g(P^{emp}_N,P_\theta),H}(P^{emp}_N, P_\theta \mid W_\ell(P^{emp}_N, P_\theta))$ has the following density $f_{s^*}$ (with respect to the one-dimensional Lebesgue measure):

$$f_{s^*}(y; \gamma^{\phi,\theta}) = \frac{y^{\frac{s^*}{2}-1}}{2^{\frac{s^*}{2}}} \sum_{k=0}^{\infty} \frac{c_k \cdot \big(\tfrac{-y}{2}\big)^k}{\Gamma\big(\tfrac{s^*}{2} + k\big)}, \qquad y \in [0,\infty),$$

with $c_0 = \prod_{j=1}^{s^*} \big(\gamma^{\phi,\theta}_j\big)^{-0.5}$ and $c_k = \frac{1}{2k} \sum_{r=0}^{k-1} c_r \sum_{j=1}^{s^*} \big(\gamma^{\phi,\theta}_j\big)^{r-k}$ ($k \in \mathbb{N}$), where $s^* := \mathrm{rank}(\Sigma A \Sigma)$ is the number of the strictly positive eigenvalues $(\gamma^{\phi,\theta}_i)_{i=1,\ldots,s^*}$ of the matrix $A\Sigma = \big(\bar{c}_i \cdot (\delta_{ij} - p_\theta(x_j))\big)_{i,j=1,\ldots,s}$, consisting of

$$\Sigma = \big(p_\theta(x_i) \cdot (\delta_{ij} - p_\theta(x_j))\big)_{i,j=1,\ldots,s}, \qquad A = \Bigg( \frac{a_\theta \cdot \phi''\big(\frac{p_\theta(x_i)}{w(p_\theta(x_i), p_\theta(x_i))}\big)}{w(p_\theta(x_i), p_\theta(x_i))}\, \delta_{ij} \Bigg)_{i,j=1,\ldots,s}, \qquad \bar{c}_i = a_\theta \cdot \phi''\Big(\frac{p_\theta(x_i)}{w(p_\theta(x_i), p_\theta(x_i))}\Big) \cdot \frac{p_\theta(x_i)}{w(p_\theta(x_i), p_\theta(x_i))}.$$

Here we have used Kronecker's delta $\delta_{ij}$, which is 1 iff $i = j$ and 0 else. In particular, the asymptotic distribution (as $N \to \infty$) of $T_N := T^\phi_N(P^{emp}_N, P_\theta)$ coincides with the distribution of a weighted linear combination of standard-chi-square-distributed random variables, where the weights are the $\gamma^{\phi,\theta}_i$ ($i = 1, \ldots, s^*$).

Notice that Theorem 1 extends a theorem of Kißlinger & Stummer [11], who deal with the subcase (IIIc) of scaled Bregman divergences. The proof of the latter can be straightforwardly adapted to verify Theorem 1, due to the representation $T_N = \Upsilon_N \cdot (2N \cdot \zeta_N)$ in (11) and the assumption $\text{a.s.-}\lim_{N \to \infty} \Upsilon_N = a_\theta > 0$. The details will appear elsewhere. Remarkably, within (IIIc) the limit distribution of $T_N$ is even a parameter-free "ordinary" chi-square distribution, provided that the condition $w(v,v) = v$ holds for all $v$ (cf. [11]).
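A small Monte-Carlo sketch of Theorem 1, under the illustrative choices $\phi = \phi_2$, $w \equiv 1$ (the ordinary-Bregman subcase (IIIa)) and $a_\theta = 1$: it computes the weights $\gamma^{\phi,\theta}_i$ as the strictly positive eigenvalues of $A\Sigma$ and compares simulated values of $T_N$ with draws from the weighted chi-square limit. The model $p_\theta$ and all tuning parameters are hypothetical choices of ours, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p_theta = np.array([0.1, 0.2, 0.3, 0.4])
s = len(p_theta)

# Generator phi2(y) = (y-1)^2/2, local scale-connector w(u,v) = 1, a_theta = 1 (H == 1);
# these are illustrative choices only.
phi2_second = lambda y: np.ones_like(y)           # phi2''(y) = 1
w_diag = np.ones(s)                               # w(p_theta(x), p_theta(x)) = 1
c_bar = phi2_second(p_theta / w_diag) * p_theta / w_diag

# Matrix A*Sigma from Theorem 1 and its strictly positive eigenvalues (the weights gamma):
A_Sigma = c_bar[:, None] * (np.eye(s) - p_theta[None, :])
gamma = np.sort(np.linalg.eigvals(A_Sigma).real)
gamma = gamma[gamma > 1e-12]                      # the s* strictly positive eigenvalues

# Monte-Carlo check: T_N against the weighted chi-square limit (weights gamma, chi^2_1 variables).
N, reps = 2000, 20000
counts = rng.multinomial(N, p_theta, size=reps) / N
T = 2 * N * np.sum(0.5 * (counts - p_theta)**2, axis=1)           # T_N for phi2 with w == 1
limit = (rng.chisquare(1, size=(reps, len(gamma))) * gamma).sum(axis=1)
print(T.mean(), limit.mean())                     # both should be close to sum(gamma) = 0.70
```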
Acknowledgement: We are grateful to the three referees for their useful suggestions.

References

1. Amari, S.-I.: Information Geometry and Its Applications. Springer, Japan (2016)
2. Amari, S.-I., Nagaoka, H.: Methods of Information Geometry. Oxford University Press (2000)
3. Ali, S.M., Silvey, S.D.: A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. B-28, 131-140 (1966)
4. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705-1749 (2005)
5. Basu, A., Shioya, H., Park, C.: Statistical Inference: The Minimum Distance Approach. CRC Press, Boca Raton (2011)
6. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning and Games. Cambridge University Press (2006)
7. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48, 253-285 (2002)
8. Csiszar, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. A-8, 85-108 (1963)
9. Kißlinger, A.-L., Stummer, W.: Some decision procedures based on scaled Bregman distance surfaces. In: Nielsen, F., Barbaresco, F. (eds.): GSI 2013, LNCS 8085, pp. 479-486. Springer, Berlin (2013)
10. Kißlinger, A.-L., Stummer, W.: New model search for nonlinear recursive models, regressions and autoregressions. In: Nielsen, F., Barbaresco, F. (eds.): GSI 2015, LNCS 9389, pp. 693-701. Springer, Berlin (2015)
11. Kißlinger, A.-L., Stummer, W.: Robust statistical engineering by means of scaled Bregman distances. In: Agostinelli, C., Basu, A., Filzmoser, P., Mukherjee, D. (eds.): Recent Advances in Robust Statistics: Theory and Applications, pp. 81-113. Springer India (2016)
12. Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to shape retrieval. Proc. 23rd IEEE CVPR, 3463-3468 (2010)
13. Liu, M., Vemuri, B.C., Amari, S.-I., Nielsen, F.: Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Analysis and Machine Intelligence 34(12), 2407-2419 (2012)
14. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of U-boost and Bregman divergence. Neural Comput. 16(7), 1437-1481 (2004)
15. Nock, R., Menon, A.K., Ong, C.S.: A scaled Bregman theorem with applications. Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 19-27 (2016)
16. Nock, R., Nielsen, F.: Bregman divergences and surrogates for learning. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2048-2059 (2009)
17. Nock, R., Nielsen, F., Amari, S.-I.: On conformal divergences and their population minimizers. IEEE Transactions on Information Theory 62(1), 527-538 (2016)
18. Pardo, M.C., Vajda, I.: On asymptotic properties of information-theoretic divergences. IEEE Transactions on Information Theory 49(7), 1860-1868 (2003)
19. Pallaschke, D., Rolewicz, S.: Foundations of Mathematical Optimization. Kluwer Acad. Publ., Dordrecht (1997)
20. Stummer, W.: Some Bregman distances between financial diffusion processes. Proc. Appl. Math. Mech. 7(1), 1050503-1050504 (2007)
21. Stummer, W., Vajda, I.: On Bregman distances and divergences of probability measures. IEEE Transactions on Information Theory 58(3), 1277-1288 (2012)
22. Sugiyama, M., Suzuki, T., Kanamori, T.: Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 64, 1009-1044 (2012)
23. Tsuda, K., Rätsch, G., Warmuth, M.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995-1018 (2005)
24. Vemuri, B.C., Liu, M., Amari, S.-I., Nielsen, F.: Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med. Imag. 30(2), 475-483 (2011)
25. Wu, L., Hoi, S.C.H., Jin, R., Zhu, J., Yu, N.: Learning Bregman distance functions for semi-supervised clustering. IEEE Transactions on Knowledge and Data Engineering 24(3), 478-491 (2012)