## 3D insights to some divergences for robust statistics and machine learning

07/11/2017
Publication GSI2017
OAI : oai:www.see.asso.fr:17410:22365

## Abstract

Divergences (distances) which measure the similarity respectively proximity between two probability distributions have turned out to be very useful for several different tasks in statistics, machine learning, information theory, etc. Some prominent examples are the Kullback-Leibler information, the Csiszar-Ali-Silvey Φ-divergences (CASD) for convex functions Φ, the “classical” (i.e., unscaled) Bregman distances and the more general scaled Bregman distances (SBD) of [26], [27]. By means of 3D plots we show several properties and pitfalls of the geometries of SBDs, also for non-probability distributions; robustness of corresponding minimum-distance concepts will also be covered. For these investigations, we construct a special SBD subclass which covers both the often used power divergences (of CASD type) as well as their robustness-enhanced extensions with non-convex non-concave Φ.

## Collection

3D insights to some divergences for robust statistics and machine learning Birgit Roensch, Wolfgang Stummer



```
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns="http://datacite.org/schema/kernel-4"
xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
<identifier identifierType="DOI">10.23723/17410/22365</identifier>
<creators>
<creator><creatorName>Wolfgang Stummer</creatorName></creator>
<creator><creatorName>Birgit Roensch</creatorName></creator>
</creators>
<titles>
<title>3D insights to some divergences for robust statistics and machine learning</title>
</titles>
<publisher>SEE</publisher>
<publicationYear>2018</publicationYear>
<resourceType resourceTypeGeneral="Text">Text</resourceType>
<subjects>
<subject>Scaled Bregman distances</subject>
<subject>Φ-divergences</subject>
<subject>power divergences</subject>
<subject>robustness</subject>
<subject>minimum distance estimation</subject>
</subjects>
<dates>
<date dateType="Created">Sun 18 Feb 2018</date>
<date dateType="Updated">Sun 18 Feb 2018</date>
<date dateType="Submitted">Thu 14 Mar 2019</date>
</dates>
<alternateIdentifiers>
<alternateIdentifier alternateIdentifierType="bitstream">18ca48695a3c8d0daa8c3a8251e93f2d7fa7f5e5</alternateIdentifier>
</alternateIdentifiers>
<formats>
<format>application/pdf</format>
</formats>
<version>37030</version>
<descriptions>
<description descriptionType="Abstract">Divergences (distances) which measure the similarity respectively proximity between two probability distributions have turned out to be very useful for several different tasks in statistics, machine learning, information theory, etc. Some prominent examples are the Kullback-Leibler information, the Csiszar-Ali-Silvey Φ-divergences (CASD) for convex functions Φ, the “classical” (i.e., unscaled) Bregman distances and the more general scaled Bregman distances (SBD) of [26], [27]. By means of 3D plots we show several properties and pitfalls of the geometries of SBDs, also for non-probability distributions; robustness of corresponding minimum-distance concepts will also be covered. For these investigations, we construct a special SBD subclass which covers both the often used power divergences (of CASD type) as well as their robustness-enhanced extensions with non-convex non-concave Φ.
</description>
</descriptions>
</resource>
```

3D insights to some divergences for robust statistics and machine learning

Birgit Roensch (1) and Wolfgang Stummer (1,2)

(1) Department of Mathematics, University of Erlangen–Nürnberg, Cauerstrasse 11, 91058 Erlangen, Germany

(2) Corresponding author, stummer@math.fau.de; also Affiliated Faculty Member of the School of Business and Economics, University of Erlangen–Nürnberg, Lange Gasse 20, 90403 Nürnberg, Germany

Keywords: φ-divergences, power divergences, scaled Bregman distances, robustness, minimum distance estimation.

## 1 Introduction and results

Divergences (distances, (dis)similarity measures, discrepancy measures) between probability distributions have been successfully used for parameter estimation, goodness-of-fit testing, various machine learning tasks, procedures in information theory, the detection of changes, pattern recognition, etc. For example, from a (strictly) convex function $\phi$ one can construct the $\phi$-divergences of [9] and [1], as well as the classical "unscaled" Bregman distances (see e.g. [23]), which also include the density power divergences of [3]. Comprehensive treatments of their statistical use can be found e.g. in [22] and [4]. Machine learning applications are given e.g. in [11], [18], [5], [30], [8], [21], [29], [31]. Recently, [27] (cf. also [26]) introduced the concept of scaled Bregman distances (SBD), which cover all the above-mentioned distances as special cases; see [12], [13], [15] for some applications of SBD to simultaneous parameter estimation and goodness-of-fit investigations, and [14] for utilizations in robust change point detection. See also [20] for some potential applications of SBD to machine learning tasks, in connection with $v$-conformal divergences; a special sub-setup of the latter was also employed by [19]. In the present paper, we visualize some interesting properties and pitfalls of the SBD geometries induced by the involved divergence balls.

To start with, let us assume that the modeled or observed random data take values in a state space $\mathcal{Y}$ equipped with a system $\mathcal{A}$ of admissible events ($\sigma$-algebra). On this, let us consider the similarity/proximity of two probability distributions $P$, $Q$ described by their probability densities $y \mapsto p(y) \ge 0$, $y \mapsto q(y) \ge 0$ via

$$P[A] = \int_A p(y)\,d\lambda(y), \qquad Q[A] = \int_A q(y)\,d\lambda(y) \qquad (A \in \mathcal{A}),$$

where $\lambda$ is a fixed (maybe nonprobability) distribution and one has the normalizations $\int_{\mathcal{Y}} p(y)\,d\lambda(y) = \int_{\mathcal{Y}} q(y)\,d\lambda(y) = 1$. The set of all such probability distributions will be denoted by $M^1_\lambda$. We also employ the set $M_\lambda$ of all general (maybe nonprobability) distributions $\nu$ of the form $\nu[A] = \int_A n(y)\,d\lambda(y)$ ($A \in \mathcal{A}$) with density $y \mapsto n(y) \ge 0$ satisfying $\int_{\mathcal{Y}} n(y)\,d\lambda(y) < \infty$.
For instance, if $\lambda$ is the counting distribution (attributing the value 1 to each outcome $y \in \mathcal{Y}$), then $p(\cdot)$, $q(\cdot)$ are (e.g. binomial) probability mass functions and $n(\cdot)$ is a general (e.g. unnormalized-histogram-related) mass function; if $\lambda$ is the Lebesgue measure on $\mathcal{Y} = \mathbb{R}$, then $p(\cdot)$, $q(\cdot)$ are (e.g. Gaussian) probability density functions and $n(\cdot)$ is a general (possibly unnormalized) density function. In such a context, one can use the general concept of distances (divergences, (dis)similarity measures) between distributions introduced by [27] (see also [26], [15]):

Definition 1. Let $\phi : (0, \infty) \to \mathbb{R}$ be a (for the sake of this paper) strictly convex, differentiable function, continuously extended to $t = 0$. Its derivative is denoted by $\phi'$. The Bregman distance of the two probability distributions $P, Q \in M^1_\lambda$ scaled by the general distribution $W \in M_\lambda$ (with density $w$) is defined by

$$0 \le B_\phi(P, Q \,\|\, W) = \int_{\mathcal{Y}} \left[ \phi\!\left(\frac{p(y)}{w(y)}\right) - \phi\!\left(\frac{q(y)}{w(y)}\right) - \phi'\!\left(\frac{q(y)}{w(y)}\right) \cdot \left(\frac{p(y)}{w(y)} - \frac{q(y)}{w(y)}\right) \right] dW(y) \tag{1}$$

$$= \int_{\mathcal{Y}} w(y) \cdot \left[ \phi\!\left(\frac{p(y)}{w(y)}\right) - \phi\!\left(\frac{q(y)}{w(y)}\right) - \phi'\!\left(\frac{q(y)}{w(y)}\right) \cdot \left(\frac{p(y)}{w(y)} - \frac{q(y)}{w(y)}\right) \right] d\lambda(y). \tag{2}$$

To guarantee the existence of the integrals in (1), (2), the zeros of $p(\cdot)$, $q(\cdot)$, $w(\cdot)$ have to be combined by proper conventions. Analogously, we define the scaled Bregman distance $B_\phi(\mu, \nu \,\|\, W) \ge 0$ of two general distributions $\mu, \nu \in M_\lambda$ scaled by the general distribution $W \in M_\lambda$, where we additionally assume $\phi(t) \ge 0$.

The papers [26], [27] show that if $\phi$ is (say) strictly convex on $[0, \infty)$, continuous on $(0, \infty)$ with $\phi(1) = 0$, then for the special case $p(y) > 0$, $q(y) = w(y) > 0$ ($y \in \mathcal{Y}$) the scaled Bregman distance (2) becomes

$$B_\phi(P, Q \,\|\, Q) = \int_{\mathcal{Y}} q(y) \cdot \phi\!\left(\frac{p(y)}{q(y)}\right) d\lambda(y) =: D_\phi(P, Q), \tag{3}$$

which is nothing but the well-known $\phi$-divergence between $P$ and $Q$. The latter was first studied by [9] as well as [1]; see e.g. also [28] for pitfalls on $\phi$-divergences $D_\phi(\mu, \nu)$ between general distributions $\mu, \nu \in M_\lambda$, and [6] for recent applications of the latter for bootstrapping purposes.
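To make Definition 1 concrete, the following minimal sketch evaluates (2) for the counting measure on a three-point state space and checks numerically that scaling by $W = Q$ reproduces the $\phi$-divergence (3). It assumes strictly positive mass functions, so no zero conventions are needed; the function names and the example vectors `p`, `q` are illustrative choices, not taken from the paper.

```python
import math


def scaled_bregman(p, q, w, phi, dphi):
    """Discrete version of (2) with lambda = counting measure.

    Assumption: all entries of p, q, w are strictly positive.
    """
    total = 0.0
    for py, qy, wy in zip(p, q, w):
        a, b = py / wy, qy / wy
        total += wy * (phi(a) - phi(b) - dphi(b) * (a - b))
    return total


def phi_kl(t):
    """Kullback-Leibler generator t*log(t) + 1 - t, which satisfies phi(1) = 0."""
    return t * math.log(t) + 1.0 - t


def dphi_kl(t):
    """Derivative of phi_kl."""
    return math.log(t)


def phi_divergence(p, q, phi):
    """Right-hand side of (3): sum over y of q(y) * phi(p(y) / q(y))."""
    return sum(qy * phi(py / qy) for py, qy in zip(p, q))


# Hypothetical probability mass functions on a 3-point state space.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

sbd_scaled_by_q = scaled_bregman(p, q, q, phi_kl, dphi_kl)  # B_phi(P, Q || Q)
csiszar = phi_divergence(p, q, phi_kl)                      # D_phi(P, Q)
kl = sum(py * math.log(py / qy) for py, qy in zip(p, q))    # textbook KL formula
print(sbd_scaled_by_q, csiszar, kl)
```

The three printed numbers agree up to rounding, because for probability vectors the extra terms $1 - t$ in the generator integrate to zero.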
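For the "generator" $\phi(t) = \phi_1(t) := t \log t + 1 - t \ge 0$ ($t > 0$) one ends up with the Kullback-Leibler (KL) divergence $D_{\phi_1}(P, Q)$ (see footnote 3). The special choice $\phi(t) = \phi_0(t) := -\log t + t - 1 \ge 0$ leads to the reversed KL divergence $D_{\phi_0}(P, Q) = D_{\phi_1}(Q, P)$, and the function

$$\phi(t) = \phi_\alpha(t) := \frac{t^\alpha - 1}{\alpha(\alpha - 1)} - \frac{t - 1}{\alpha - 1} \ge 0 \qquad (\alpha \in \mathbb{R} \setminus \{0, 1\})$$

generates the other power divergences $D_{\phi_\alpha}(P, Q)$ (cf. [16], [24]), where $\alpha = 2$ gives Pearson's chi-square divergence and $\alpha = 1/2$ the squared Hellinger distance.

Footnote 3: $D_{\phi_1}(P, Q)$ is equal to $D_{\check\phi_1}(P, Q)$ with $\check\phi_1(t) := t \log t \in [-e^{-1}, \infty[$, but generally $D_{\phi_1}(\mu, \nu) \ne D_{\check\phi_1}(\mu, \nu)$, where the latter can be negative and thus is not a distance.

So far, $\phi$ has been (strictly) convex. However, scaled Bregman distances $B_\phi(P, Q \,\|\, W)$ can also be used to construct $\tilde\phi$-divergences $D_{\tilde\phi}(P, Q)$ with non-convex non-concave generator $\tilde\phi$ (this new approach contrasts e.g. with the construction method of [25], [7]). As an example, let $\phi := \phi_\alpha$ and let $W := \widetilde{W}_{\beta,r}(P, Q)$ be given by the "locally adaptive" scaling density $w(y) = \tilde w_{\beta,r}(p(y), q(y)) \ge 0$ defined by the $r$-th power mean

$$\tilde w_{\beta,r}(u, v) := \left(\beta\, u^r + (1 - \beta)\, v^r\right)^{1/r}, \qquad \beta \in [0, 1],\ r \in \mathbb{R} \setminus \{0\},\ u \ge 0,\ v \ge 0.$$

Accordingly, for $\alpha \cdot (\alpha - 1) \ne 0$ we derive the corresponding scaled Bregman distance

$$B_{\phi_\alpha}\big(P, Q \,\|\, \widetilde{W}_{\beta,r}(P, Q)\big) = \int_{\mathcal{Y}} \tilde w_{\beta,r}(p(y), q(y))^{1-\alpha} \cdot \frac{p(y)^\alpha + (\alpha - 1)\, q(y)^\alpha - \alpha\, p(y)\, q(y)^{\alpha - 1}}{\alpha \cdot (\alpha - 1)}\, d\lambda(y)$$

$$= \int_{\mathcal{Y}} q(y) \cdot \left(\beta \cdot \Big(\frac{p(y)}{q(y)}\Big)^{r} + 1 - \beta\right)^{(1-\alpha)/r} \cdot \frac{\big(\frac{p(y)}{q(y)}\big)^{\alpha} + \alpha - 1 - \alpha \cdot \frac{p(y)}{q(y)}}{\alpha \cdot (\alpha - 1)}\, d\lambda(y) =: D_{\tilde\phi_{\alpha,\beta,r}}(P, Q), \tag{4}$$

where the generator

$$\tilde\phi_{\alpha,\beta,r}(t) := \phi_\alpha(t) \cdot \left(\beta\, t^r + 1 - \beta\right)^{(1-\alpha)/r} = \frac{t^\alpha + \alpha - 1 - \alpha \cdot t}{\alpha \cdot (\alpha - 1)} \cdot \left(\beta\, t^r + 1 - \beta\right)^{(1-\alpha)/r} \ge 0, \qquad t > 0, \tag{5}$$

(with equality only at $t = 1$) can be non-convex non-concave in $t$; see e.g. Figure 1(d), which shows $t \mapsto \tilde\phi_{\alpha,\beta,r}(t)$ for $\alpha = 7.5$, $\beta = 0.05$, $r = 7.5$. Analogously, we construct the more general $B_{\phi_\alpha}(\mu, \nu \,\|\, \widetilde{W}_{\beta,r}(\mu, \nu)) =: D_{\tilde\phi_{\alpha,\beta,r}}(\mu, \nu)$ for general distributions $\mu$, $\nu$.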
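The two representations in (4) can be checked against each other numerically. The sketch below assumes the counting measure and strictly positive mass functions; it computes the Bregman form (power-mean scaling plugged into the discrete version of (2)) and the $\tilde\phi$-divergence form from (5). All function names and the test vectors are illustrative, not from the paper.

```python
def phi_alpha(t, alpha):
    """Power-divergence generator; requires alpha not in {0, 1}."""
    return (t ** alpha - 1.0) / (alpha * (alpha - 1.0)) - (t - 1.0) / (alpha - 1.0)


def dphi_alpha(t, alpha):
    """Derivative of phi_alpha with respect to t."""
    return (t ** (alpha - 1.0) - 1.0) / (alpha - 1.0)


def power_mean(u, v, beta, r):
    """r-th power mean used as the locally adaptive scaling density."""
    return (beta * u ** r + (1.0 - beta) * v ** r) ** (1.0 / r)


def phi_tilde(t, alpha, beta, r):
    """Generator (5)."""
    return phi_alpha(t, alpha) * (beta * t ** r + 1.0 - beta) ** ((1.0 - alpha) / r)


def sbd_power(p, q, alpha, beta, r):
    """Bregman form: phi_alpha scaled by the power mean of p(y) and q(y)."""
    total = 0.0
    for py, qy in zip(p, q):
        w = power_mean(py, qy, beta, r)
        a, b = py / w, qy / w
        total += w * (phi_alpha(a, alpha) - phi_alpha(b, alpha)
                      - dphi_alpha(b, alpha) * (a - b))
    return total


def div_phi_tilde(p, q, alpha, beta, r):
    """Divergence form: sum over y of q(y) * phi_tilde(p(y) / q(y))."""
    return sum(qy * phi_tilde(py / qy, alpha, beta, r) for py, qy in zip(p, q))


# Hypothetical mass functions; parameters as in Figure 1(d).
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
bregman_form = sbd_power(p, q, 7.5, 0.05, 7.5)
divergence_form = div_phi_tilde(p, q, 7.5, 0.05, 7.5)

# beta = 0 removes the adaptive scaling; alpha = 2 then gives sum (p - q)^2 / (2 q).
half_chi_square = sbd_power(p, q, 2.0, 0.0, 1.0)
print(bregman_form, divergence_form, half_chi_square)
```

With this normalization of $\phi_\alpha$, the $\alpha = 2$ case equals one half of the usual Pearson chi-square statistic.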
The subcase $\beta = 0$, $\alpha \cdot (\alpha - 1) \ne 0$, leads to the power divergences $D_{\tilde\phi_{\alpha,0,r}}(P, Q) = D_{\phi_\alpha}(P, Q)$, where the function $t \mapsto \tilde\phi_{\alpha,0,r}(t) = \phi_\alpha(t)$ is strictly convex. We shall see in the RAF discussion in (6) below that $\beta \ne 0$ opens the gate to enhanced robustness properties; interesting divergence geometries can be achieved, too.

Returning to the general context, with each scaled Bregman divergence $B_\phi(\cdot, \cdot \,\|\, W)$ one can associate a divergence ball $\mathbb{B}_\phi(P, \rho)$ with "center" $P \in M^1_\lambda$ and "radius" $\rho \in\, ]0, \infty[$, defined by $\mathbb{B}_\phi(P, \rho) := \{Q \in M^1_\lambda : B_\phi(P, Q \,\|\, W) \le \rho\}$, whereas the corresponding divergence sphere is given by $\mathbb{S}_\phi(P, \rho) := \{Q \in M^1_\lambda : B_\phi(P, Q \,\|\, W) = \rho\}$; see e.g. [10] for a use of some divergence balls with strictly convex generators as a constraint in financial-risk related decisions. Analogously, we define the general-distribution versions $\mathbb{B}^g_\phi(\mu, \rho) := \{\nu \in M_\lambda : B_\phi(\mu, \nu \,\|\, W) \le \rho\}$ and $\mathbb{S}^g_\phi(\mu, \rho) := \{\nu \in M_\lambda : B_\phi(\mu, \nu \,\|\, W) = \rho\}$.

Of course, the "geometry/topology" induced by these divergence balls and spheres is far from obvious. To help build a corresponding intuition, we show several concrete effects in the following. For the sake of brevity, and in preparation for the robustness investigations below, we confine ourselves to the flexible divergence family $B_{\phi_\alpha}(P, Q \,\|\, \widetilde{W}_{\beta,r}(P, Q)) = D_{\tilde\phi_{\alpha,\beta,r}}(P, Q)$ and to

$$P := P_{\varepsilon,\theta_0} = (1 - \varepsilon)\, \mathrm{Bin}(2, \theta_0) + \varepsilon\, \delta_2,$$

where $\varepsilon \in\, ]0, 1[$, $\delta_y$ denotes Dirac's distribution at $y$ (i.e. $\delta_y[A] = 1$ iff $y \in A$ and $\delta_y[A] = 0$ else), and $\mathrm{Bin}(2, \theta_0) =: \tilde P$ is a binomial distribution with parameters $n = 2$ and $\theta_0 \in\, ]0, 1[$ (which amounts to $\mathcal{Y} = \{0, 1, 2\}$, $\tilde p(0) = (1 - \theta_0)^2$, $\tilde p(1) = 2 \theta_0 \cdot (1 - \theta_0)$, $\tilde p(2) = \theta_0^2$). In other words, $P_{\varepsilon,\theta_0}$ is a binomial distribution which is contaminated at the state $y = 2$ with percentage-degree $\varepsilon \in\, ]0, 1[$.
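As a small illustration of the contamination model and of divergence-ball membership, the sketch below builds the mass function of the contaminated binomial on $\{0, 1, 2\}$ and tests whether a candidate distribution lies in the ball around it. The parameter values are those quoted for Figure 1(a); the helper names and the candidate vector `far_point` are hypothetical, and strictly positive candidate masses are assumed.

```python
def binomial2(theta):
    """Probability mass function of Bin(2, theta) on the states 0, 1, 2."""
    return [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]


def contaminated(eps, theta0):
    """Mass function of (1 - eps) * Bin(2, theta0) + eps * Dirac mass at 2."""
    dirac2 = [0.0, 0.0, 1.0]
    return [(1.0 - eps) * b + eps * d for b, d in zip(binomial2(theta0), dirac2)]


def phi_tilde(t, alpha, beta, r):
    """Generator (5); requires alpha not in {0, 1} and r != 0."""
    phi_a = (t ** alpha - 1.0) / (alpha * (alpha - 1.0)) - (t - 1.0) / (alpha - 1.0)
    return phi_a * (beta * t ** r + 1.0 - beta) ** ((1.0 - alpha) / r)


def divergence(p, q, alpha, beta, r):
    """Divergence form of (4) for strictly positive q on a finite state space."""
    return sum(qy * phi_tilde(py / qy, alpha, beta, r) for py, qy in zip(p, q))


def in_ball(center, q, rho, alpha, beta, r):
    """True if q lies in the divergence ball with the given center and radius rho."""
    return divergence(center, q, alpha, beta, r) <= rho


center = contaminated(0.44, 0.32)   # eps = 0.44, theta0 = 0.32
far_point = [0.8, 0.1, 0.1]         # hypothetical candidate distribution
print(center, sum(center))
print(in_ball(center, center, 0.05, 7.5, 0.05, 7.5))
print(in_ball(center, far_point, 0.05, 7.5, 0.05, 7.5))
```

The center always belongs to its own ball, since the generator vanishes at $t = 1$.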
For the visualization of the divergence spheres $\mathbb{S}_\phi(P_{\varepsilon,\theta_0}, \rho)$, all the involved probability distributions (say) $P$ can be identified, as usual, with the 3D column vectors $\vec{P} = (p(0), p(1), p(2))'$ of the three components of their probability mass functions. Thus, each $(p(0), p(1), p(2))'$ lies in the "probability simplex" $\Pi := \{(\pi_1, \pi_2, \pi_3)' \in \mathbb{R}^3 : \pi_1 \ge 0, \pi_2 \ge 0, \pi_3 \ge 0, \pi_1 + \pi_2 + \pi_3 = 1\}$. Analogously, each general distribution (say) $\nu$ can be identified with the 3D column vector $\vec{\nu} = (n(0), n(1), n(2))'$ of the three components of its mass function. Hence, each $(n(0), n(1), n(2))'$ is a point in the first octant $\Sigma := \{(\sigma_1, \sigma_2, \sigma_3)' \in \mathbb{R}^3 : \sigma_1 \ge 0, \sigma_2 \ge 0, \sigma_3 \ge 0\}$.

Data-derived randomness can enter this context, for instance, in the following way: for index $m \in \tau := \mathbb{N}$ let the generation of the $m$-th data point be represented by the random variable $Y_m$ which takes values in the state space $\mathcal{Y}$. The associated family of random variables $(Y_m, m \in \tau)$ is supposed to be independent and identically distributed (i.i.d.) under the probability distribution $P_{\varepsilon,\theta_0}$. For each concrete sample $(Y_1, \ldots, Y_N)$ of size $N$ one builds the corresponding (random) empirical distribution

$$P_N^{emp}[\,\cdot\,] := \frac{1}{N} \cdot \sum_{i=1}^{N} \delta_{Y_i}[\,\cdot\,],$$

which under the correct model converges (in distribution) to $P_{\varepsilon,\theta_0}$ as the sample size $N$ tends to $\infty$. Notice that the 3D vector $(p_N^{emp}(0), p_N^{emp}(1), p_N^{emp}(2))'$ of the probability-mass-function components (where $p_N^{emp}(y) := \frac{1}{N} \cdot \#\{i \in \{1, \ldots, N\} : Y_i = y\}$) moves randomly in the probability simplex $\Pi$ as $N$ increases. However, for large $N$ one can (approximately) identify $P_N^{emp} \approx P_{\varepsilon,\theta_0}$, which we do in the following. Within this special context, let us explain the following effects by example.

Effect 1: divergence spheres $\mathbb{S}_\phi(P, \rho)$ can take quite different kinds of shapes, e.g. triangles (with rounded edges), rectangles (with rounded edges), and non-convex non-concave "blobs"; this can even appear with fixed center $P$ ($= P_{\varepsilon,\theta_0}$) when only the radius $\rho$ changes; see Figure 1(a)-(c). As comparative preparation for other effects below, we draw $\theta_0 \mapsto P_{\varepsilon,\theta_0}$ as an orange curve, as well as a dark-blue curve which represents the set $\mathcal{C} := \{\mathrm{Bin}(2, \theta) : \theta \in\, ]0, 1[\}$ of all binomial distributions, and in Figure 1(a)-(c) we aim for spheres (in red) which are fully in the green-coloured probability simplex $\Pi$ (i.e., no need for cutoff on the $\Pi$-boundaries) and "on the left of $\mathcal{C}$". Notice that we use viewing angles with "minimal visual distortion". The corresponding, interestingly shaped, "non-simplex-restricted" spheres $\mathbb{S}^g_\phi(P_{\varepsilon,\theta_0}, \rho)$ are plotted, too (cf. Figure 1(e)-(g)).

Effect 2: unlike Euclidean balls, even for fixed radius $\rho$ the divergence spheres $\mathbb{S}_\phi(P, \rho)$ can change their shape considerably as the center $P$ moves in the probability space (e.g., along the orange "contamination" curve $\theta_0 \mapsto P_{\varepsilon,\theta_0}$); see Figure 1(h)-(i).

Effect 3: for fixed center $P$, increasing the radius $\rho$ may lead to a quite nonlinear growth of the divergence spheres $\mathbb{S}_\phi(P, \rho)$ (with $P = P_{\varepsilon,\theta_0}$); see Figure 1(j)-(l). Notice that the principal shape remains the same (as opposed to Effect 1).

Effect 4: for an i.i.d. sample $(Y_1, \ldots, Y_N)$ under the probability distribution $P_0$, the corresponding minimum-distance parameter estimator is given by any $\hat\theta$ from the possibly multi-valued set $\arg\min_{\theta \in \Theta} B_\phi(P_N^{emp}, Q_\theta \,\|\, W)$, where $\mathcal{C} := \{Q_\theta \in M^1_\lambda : \theta \in \Theta\}$ is a parametric family of probability distributions. At the same time, the (distribution of the random) size of $\min_{\theta \in \Theta} B_\phi(P_N^{emp}, Q_\theta \,\|\, W)$ is an indicator for the goodness-of-fit. To visualize some corresponding robustness-related geometric effects, we confine ourselves to the above-mentioned contamination context $P_0 = P_{\varepsilon,\theta_0} = (1 - \varepsilon)\, \mathrm{Bin}(2, \theta_0) + \varepsilon\, \delta_2$ and $\mathcal{C} := \{\mathrm{Bin}(2, \theta) : \theta \in\, ]0, 1[\}$, and to the special-SBD-subfamily minimization (cf. (4)) $\arg\min_{\theta \in \Theta} D_{\tilde\phi_{\alpha,\beta,r}}(P_N^{emp}, Q_\theta)$. In fact, for the sake of brevity we only consider in the following the (for large sample sizes $N$ reasonable) deterministic proxy

$$T(\varepsilon) := \arg\min_{\theta \in \Theta} D_{\tilde\phi_{\alpha,\beta,r}}(P_{\varepsilon,\theta_0}, Q_\theta),$$

and discuss robustness against contamination in terms of nearness of $T(\varepsilon)$ to $\theta_0$ even for "large" contaminations reflected by "large" $\varepsilon$; furthermore, we discuss abrupt changes of $\varepsilon \mapsto T(\varepsilon)$. A formal robustness treatment is given in terms of the RAF below (cf. (6)).

As can be seen from Figure 1(m) for $\alpha = 0.05$, $\beta = 0$ (and thus, $D_{\tilde\phi_{0.05,0,r}}(\cdot, \cdot)$ is the classical 0.05-power divergence independently of $r \ne 0$) and $\theta_0 = 0.08$, the function $\varepsilon \mapsto T(\varepsilon)$ is quite robust for contamination percentage-degrees $\varepsilon \in [0, 0.45]$ (i.e. $T(\varepsilon) \approx \theta_0$), but it exhibits a sharp breakdown at $\varepsilon \approx 0.46$; this contrasts with the non-robust ("uniformly much steeper" but smooth) behaviour in the case $\alpha = 1.0$, $\beta = 0$ of minimum-Kullback-Leibler-divergence estimation, which is in one-to-one correspondence with maximum likelihood estimation. A plot similar to Figure 1(m) was first shown by [17] (see also [2]) for the squared Hellinger distance (HD) $D_{\phi_{0.5}}(\cdot, \cdot)$, which in our extended framework corresponds to $D_{\tilde\phi_{0.5,0,r}}(\cdot, \cdot)$, for the larger, non-visualizable state space $\mathcal{Y} = \{0, 1, \ldots, 12\}$ and the contamination $P_0 := (1 - \varepsilon)\, \mathrm{Bin}(12, \tfrac{1}{2}) + \varepsilon\, \delta_{12}$ (see footnote 4).

Because of our low-dimensional state space, we can give further geometric insights into such robustness effects. Indeed, Figure 1(n) respectively (o) show those spheres $\mathbb{S}_{\phi_{0.05}}(P_{0.45,0.08}, \rho_{\min})$ respectively $\mathbb{S}_{\phi_{0.05}}(P_{0.46,0.08}, \tilde\rho_{\min})$ which touch the binomial projection set $\mathcal{C}$ (i.e., the dark-blue coloured curve) for the "first time" as the radius $\rho$ grows. The corresponding touching points, which represent the resulting estimated probability distributions $\mathrm{Bin}(2, T(0.45))$ respectively $\mathrm{Bin}(2, T(0.46))$, are marked as red dots on the blue curve, and are "very far apart".
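The deterministic proxy $T(\varepsilon)$ can be approximated by a plain grid search over $\theta$. The sketch below does this for the classical power divergence ($\beta = 0$) with $\alpha = 0.05$ and $\theta_0 = 0.08$, the setting described for Figure 1(m). The grid resolution and the function names are assumptions made for illustration; the paper does not state how its curves were computed, and no output values are claimed here beyond the uncontaminated case.

```python
def binomial2(theta):
    """Probability mass function of Bin(2, theta) on the states 0, 1, 2."""
    return [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]


def contaminated(eps, theta0):
    """Mass function of (1 - eps) * Bin(2, theta0) + eps * Dirac mass at 2."""
    base = binomial2(theta0)
    return [(1.0 - eps) * base[0], (1.0 - eps) * base[1], (1.0 - eps) * base[2] + eps]


def phi_alpha(t, alpha):
    """Power-divergence generator; requires alpha not in {0, 1}."""
    return (t ** alpha - 1.0) / (alpha * (alpha - 1.0)) - (t - 1.0) / (alpha - 1.0)


def power_divergence(p, q, alpha):
    """D_{phi_alpha}(P, Q), the beta = 0 case of (4); q must be strictly positive."""
    return sum(qy * phi_alpha(py / qy, alpha) for py, qy in zip(p, q))


def min_distance_theta(eps, theta0, alpha, grid_size=1000):
    """Grid approximation of T(eps) over theta in {1/grid_size, ..., 1 - 1/grid_size}."""
    p = contaminated(eps, theta0)
    best_theta, best_value = None, float("inf")
    for i in range(1, grid_size):
        theta = i / grid_size
        value = power_divergence(p, binomial2(theta), alpha)
        if value < best_value:
            best_theta, best_value = theta, value
    return best_theta


# Without contamination the minimizer is theta0 itself (divergence zero there).
t_clean = min_distance_theta(0.0, 0.08, 0.05)

# Contamination levels on both sides of the breakdown reported for Figure 1(m).
for eps in (0.10, 0.45, 0.46):
    print(eps, min_distance_theta(eps, 0.08, 0.05))
```

A finer grid or a local optimizer started from several points would be needed to resolve the switch between the two local minima more precisely.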
This is also consistent with Figure 1(p) respectively 1(q), which show the functions $]0, 1[\, \ni \theta \mapsto D_{\tilde\phi_{0.05,0,r}}(P_{0.45,0.08}, \mathrm{Bin}(2, \theta))$ respectively $]0, 1[\, \ni \theta \mapsto D_{\tilde\phi_{0.05,0,r}}(P_{0.46,0.08}, \mathrm{Bin}(2, \theta))$, where one can see the "global switching between the two local-minimum values". Furthermore, in Figure 1(n) the red dot lies "robustly close" to the green dot on the dark-blue curve which represents the uncontaminated $\mathrm{Bin}(2, \theta_0)$ to be "found out". This contrasts with the corresponding behaviour in Figure 1(o), where the (only 1% higher) contamination percentage-degree is already in the non-robust range. Additionally, with our new divergence family we can produce similar variants with non-convex divergence spheres (see Figures 1(r)-(t)) resp. with smoother (non-sharp) breakdown ("smooth rolling-over of the red dots", see e.g. Figures 1(u)-(w)). Further transition effects, e.g. of "cascade type", are omitted for the sake of brevity.

Due to the minimization step (in our discrete setup with scalar parameter $\theta$)

$$0 = -\frac{\partial}{\partial\theta} D_\phi(P, Q_\theta) = -\sum_{x \in \mathcal{X}} \frac{\partial}{\partial v}\left[ v \cdot \phi\!\left(\frac{p(x)}{v}\right) \right]_{v = q_\theta(x)} \cdot \frac{\partial}{\partial\theta} q_\theta(x) =: \sum_{x \in \mathcal{X}} a_\phi\!\left(\frac{p(x)}{q_\theta(x)} - 1\right) \cdot \frac{\partial}{\partial\theta} q_\theta(x), \quad \text{with } P = P_N^{emp} \approx P_{\varepsilon,\theta_0}, \tag{6}$$

the robustness-degree of minimum-distance estimation by $\phi$-divergences $D_\phi(\cdot, \cdot)$ can be quantified in terms of the residual adjustment function (RAF)

$$a_\phi(\delta) := (\delta + 1) \cdot \phi'(\delta + 1) - \phi(\delta + 1)$$

with Pearson residual $\delta := \frac{u}{v} - 1 \in [-1, \infty[$ (cf. [17], [2]; see also its generalization to density-pair adjustment functions for general scaled Bregman distances given in [15]). In more detail, for both large $\delta$ (reflecting outliers) and small $\delta$ (reflecting inliers) the RAF $a_\phi(\delta)$ should ideally be closer to zero (i.e. more dampening) than that of the Kullback-Leibler (i.e. maximum-likelihood estimation) benchmark $a_{\phi_1}(\delta) = \delta$.
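The RAF can be evaluated directly from its definition. The sketch below does so for the Kullback-Leibler generator, for the negative-exponential generator $\exp(1 - t) + t - 2$ that the paper uses as a comparison, and for the generator (5) with the parameters of Figure 1(x). Applying the same RAF formula to the non-convex generator via a central-difference derivative is an assumption of this sketch, and all function names are illustrative.

```python
import math


def raf(phi, dphi, delta):
    """Residual adjustment function (delta + 1) * phi'(delta + 1) - phi(delta + 1)."""
    t = delta + 1.0
    return t * dphi(t) - phi(t)


def phi_kl(t):
    return t * math.log(t) + 1.0 - t


def dphi_kl(t):
    return math.log(t)


def phi_ned(t):
    """Negative-exponential generator exp(1 - t) + t - 2."""
    return math.exp(1.0 - t) + t - 2.0


def dphi_ned(t):
    return 1.0 - math.exp(1.0 - t)


def phi_tilde(t, alpha=10.0, beta=0.25, r=10.0):
    """Generator (5) with the parameters quoted for Figure 1(x)."""
    phi_a = (t ** alpha - 1.0) / (alpha * (alpha - 1.0)) - (t - 1.0) / (alpha - 1.0)
    return phi_a * (beta * t ** r + 1.0 - beta) ** ((1.0 - alpha) / r)


def dphi_tilde(t, h=1e-5):
    """Central-difference approximation of the derivative (sketch assumption)."""
    return (phi_tilde(t + h) - phi_tilde(t - h)) / (2.0 * h)


# Evaluate at an outlier-type residual delta = 10 (and delta > -1 for the KL case).
raf_kl_10 = raf(phi_kl, dphi_kl, 10.0)
raf_ned_10 = raf(phi_ned, dphi_ned, 10.0)
raf_new_10 = raf(phi_tilde, dphi_tilde, 10.0)
print(raf_kl_10, raf_ned_10, raf_new_10)
```

For the negative-exponential generator the definition simplifies to $2 - (\delta + 2)\, e^{-\delta}$, which stays bounded for large $\delta$, whereas the Kullback-Leibler RAF grows linearly.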
Concerning this, for various different $(\alpha, \beta, r)$-constellations our new divergences $D_{\tilde\phi_{\alpha,\beta,r}}(P, Q)$ are much more robust against outliers and inliers even than the very-well-performing negative-exponential divergence (NED) $D_{\phi_{NED}}(P, Q)$ of [17], [2] with $\phi(t) = \phi_{NED}(t) := \exp(1 - t) + t - 2$ ($t > 0$); see Figure 1(x) for an exemplary demonstration.

Footnote 4: also notice that the HD together with $\theta_0 = 0.5$ does not exhibit such an effect for our smaller 3-element state space, due to the lack of outliers.

Concluding remarks: By means of exemplary 3D plots we have shown some properties and pitfalls of divergence geometries. For this, we have used one special case of scaled Bregman distances (SBD), namely power-function-type generators and power-mean-type scalings, which can be represented as $\phi$-divergences with possibly non-convex non-concave generator $\phi$; classical Csiszar-Ali-Silvey-type power divergences are covered as a subcase, too. By exploiting the full flexibility of SBD, e.g. those which are not rewriteable as $\phi$-divergences, one can construct further interesting geometric effects. Those contrast principally with the geometric behaviour of the balls constructed from the Bregman divergences with "global NMO-scaling" of [19], defined (in the separable setup) by

$$H(P) \cdot \int_{\mathcal{Y}} \left[ \phi\!\left(\frac{p(y)}{H(P)}\right) - \phi\!\left(\frac{q(y)}{H(Q)}\right) - \phi'\!\left(\frac{q(y)}{H(Q)}\right) \cdot \left(\frac{p(y)}{H(P)} - \frac{q(y)}{H(Q)}\right) \right] d\lambda(y), \tag{7}$$

where $H(P) := H((p(y))_{y \in \mathcal{Y}})$, $H(Q) = H((q(y))_{y \in \mathcal{Y}})$ are real-valued "global" functionals of the (not necessarily probability) density functions $p(\cdot)$, $q(\cdot)$, e.g. $H(P) := \int_{\mathcal{Y}} h(p(y))\, d\lambda(y)$ for some function $h$. Notice the very substantial difference to SBD, i.e. to the Bregman divergences with the "local SV-scaling" of [26], [27] given in (2) (even in the locally adaptive subcase $w(y) = \tilde w_{\beta,r}(p(y), q(y))$).
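For comparison with the locally scaled form (2), a discrete sketch of the globally scaled expression (7) is given below. It assumes the counting measure, takes the functional $H$ as a user-supplied function of the whole mass vector, and uses total mass ($h(t) = t$) and the generator $t^2$ purely as illustrative choices; none of these names or values come from [19] or from the paper.

```python
def nmo_bregman(p, q, phi, dphi, H):
    """Discrete version of (7): each density is divided by one global number H(.)."""
    hp, hq = H(p), H(q)
    total = 0.0
    for py, qy in zip(p, q):
        a, b = py / hp, qy / hq
        total += phi(a) - phi(b) - dphi(b) * (a - b)
    return hp * total


def phi_square(t):
    """Illustrative convex generator t^2, whose Bregman distance is squared Euclidean."""
    return t * t


def dphi_square(t):
    return 2.0 * t


def total_mass(density):
    """Global functional H with h(t) = t, i.e. the total mass of the density."""
    return sum(density)


def constant_one(density):
    """Trivial functional H = 1, which switches the scaling off."""
    return 1.0


p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
mu = [0.4, 1.0, 0.6]   # unnormalized version of p with total mass 2

unscaled = nmo_bregman(p, q, phi_square, dphi_square, constant_one)
globally_scaled = nmo_bregman(mu, q, phi_square, dphi_square, total_mass)
print(unscaled, globally_scaled)
```

With $H \equiv 1$ the expression is the classical unscaled Bregman distance, here $\sum_y (p(y) - q(y))^2 = 0.06$; with total-mass scaling the unnormalized `mu` is first normalized to `p` and the result is multiplied by its mass. In contrast, (2) divides by a different number $w(y)$ at each state $y$.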
Amongst other things, this difference is reflected by the fact that (under some assumptions) the "NMO-scaled" Bregman divergences can be represented as unscaled Bregman distances with possibly non-convex generator $\phi$ (cf. [19]), whereas some "SV-scaled" Bregman divergences can e.g. be represented as Csiszar-Ali-Silvey $\phi$-divergences (which are never unscaled Bregman distances except for KL) with non-convex non-concave generator $\phi := \tilde\phi$, cf. (4). To gain further insights, it would be illuminating to work out closer connections and differences between these two scaling types (under duality, reparametrization, ambient-space aspects) and to incorporate further, structurally different examples.

Acknowledgement: We are grateful to all three referees for their useful suggestions. W. Stummer thanks A.L. Kißlinger for valuable discussions.

## References

1. Ali, M.S., Silvey, D.: A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. B-28, 131–140 (1966)
2. Basu, A., Lindsay, B.G.: Minimum disparity estimation for continuous models: efficiency, distributions and robustness. Ann. Inst. Statist. Math. 46(4), 683–705 (1994)
3. Basu, A., Harris, I.R., Hjort, N.L., Jones, M.C.: Robust and efficient estimation by minimising a density power divergence. Biometrika 85, 549–559 (1998)
4. Basu, A., Shioya, H., Park, C.: Statistical Inference: The Minimum Distance Approach. CRC Press, Boca Raton (2011)
5. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)
6. Broniatowski, M.: A weighted bootstrap procedure for divergence minimization problems. In: Antoch, J., Jurečková, J., Maciak, M., Pešta, M. (eds.): Analytical Methods in Statistics, pp. 1–22. Springer, Cham (2017)
7. Cerone, P., Dragomir, S.S.: Approximation of the integral mean divergence and f-divergence via mean results. Math. Comp. Model. 42, 207–219 (2005)
8. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning & Games. Cambridge University Press (2006)
9. Csiszar, I.: Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. A-8, 85–108 (1963)
10. Csiszar, I., Breuer, Th.: Measuring distribution model risk. Mathematical Finance 26(2), 395–411 (2016)
11. Collins, M., Schapire, R.E., Singer, Y.: Logistic regression, AdaBoost and Bregman distances. Mach. Learn. 48, 253–285 (2002)
12. Kißlinger, A.-L., Stummer, W.: Some Decision Procedures Based on Scaled Bregman Distance Surfaces. In: Nielsen, F., Barbaresco, F. (eds.): GSI 2013, LNCS 8085, pp. 479–486. Springer, Berlin (2013)
13. Kißlinger, A.-L., Stummer, W.: New model search for nonlinear recursive models, regressions and autoregressions. In: Nielsen, F., Barbaresco, F. (eds.): GSI 2015, LNCS 9389, pp. 693–701. Springer, Berlin (2015)
14. Kißlinger, A.-L., Stummer, W.: A New Information-Geometric Method of Change Detection. Preprint.
15. Kißlinger, A.-L., Stummer, W.: Robust statistical engineering by means of scaled Bregman distances. In: Agostinelli, C., Basu, A., Filzmoser, P., Mukherjee, D. (eds.): Recent Advances in Robust Statistics – Theory and Applications, pp. 81–113. Springer India (2016)
16. Liese, F., Vajda, I.: Convex Statistical Distances. Teubner, Leipzig (1987)
17. Lindsay, B.G.: Efficiency versus robustness: the case for minimum Hellinger distance and related methods. Ann. Statist. 22(2), 1081–1114 (1994)
18. Murata, N., Takenouchi, T., Kanamori, T., Eguchi, S.: Information geometry of U-boost and Bregman divergence. Neural Comput. 16(7), 1437–1481 (2004)
19. Nock, R., Menon, A.K., Ong, C.S.: A scaled Bregman theorem with applications. Advances in Neural Information Processing Systems 29 (NIPS 2016), pp. 19–27 (2016)
20. Nock, R., Nielsen, F., Amari, S.-I.: On Conformal Divergences and their Population Minimizers. IEEE Transactions on Information Theory 62(1), 527–538 (2016)
21. Nock, R., Nielsen, F.: Bregman divergences and surrogates for learning. IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 2048–2059 (2009)
22. Pardo, L.: Statistical Inference Based on Divergence Measures. Chapman & Hall (2006)
23. Pardo, M.C., Vajda, I.: On asymptotic properties of information-theoretic divergences. IEEE Transactions on Information Theory 49(7), 1860–1868 (2003)
24. Read, T.R.C., Cressie, N.A.C.: Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer, New York (1988)
25. Shioya, H., Da-te, T.: A generalisation of Lin divergence and the derivation of a new information divergence measure. Electr. Commun. Japan 78(7), 34–40 (1995)
26. Stummer, W.: Some Bregman distances between financial diffusion processes. Proc. Appl. Math. Mech. 7(1), 1050503–1050504 (2007)
27. Stummer, W., Vajda, I.: On Bregman Distances and Divergences of Probability Measures. IEEE Transactions on Information Theory 58(3), 1277–1288 (2012)
28. Stummer, W., Vajda, I.: On divergences of finite measures and their applicability in statistics and information theory. Statistics 44, 169–187 (2010)
29. Sugiyama, M., Suzuki, T., Kanamori, T.: Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 64, 1009–1044 (2012)
30. Tsuda, K., Rätsch, G., Warmuth, M.: Matrix exponentiated gradient updates for on-line learning and Bregman projection. J. Mach. Learn. Res. 6, 995–1018 (2005)
31. Wu, L., Hoi, S.C.H., Jin, R., Zhu, J., Yu, N.: Learning Bregman distance functions for semi-supervised clustering. IEEE Trans. Knowl. Data Eng. 24(3), 478–491 (2012)

Fig. 1. Panels (a)-(x):

- (a)-(c): divergence spheres $\mathbb{S}_\phi(P_{\varepsilon,\theta_0}, \rho)$ (in red) for $\phi = \tilde\phi_{\alpha,\beta,r}$ with $\theta_0 = 0.32$, $\varepsilon = 0.44$, $\alpha = 7.5$, $\beta = 0.05$, $r = 7.5$ and radii $\rho = 0.05$, $0.08$, $0.085$; the center $P_{\varepsilon,\theta_0}$ is marked as a green dot on the orange curve.
- (d): non-convex non-concave $t \mapsto \tilde\phi_{\alpha,\beta,r}(t)$ with $\alpha = 7.5$, $\beta = 0.05$, $r = 7.5$ (in blue), its first derivative (in magenta) and second derivative (in green).
- (e)-(g): the spheres $\mathbb{S}^g_\phi(P_{\varepsilon,\theta_0}, \rho)$ corresponding to plots (a)-(c), with $\rho = 0.05$, $0.08$, $0.085$ (in different viewing angles), shown as blue surfaces.
- (h)-(i): divergence spheres $\mathbb{S}_\phi(P_{\varepsilon,\theta_0}, \rho)$ for $\phi = \tilde\phi_{\alpha,\beta,r}$ with $\varepsilon = 0.44$, $\alpha = 3.35$, $\beta = 0.65$, $r = -6.31$, radius $\rho = 0.2$ and $\theta_0 = 0.07$, $0.33$.
- (j)-(l): divergence spheres $\mathbb{S}_\phi(P_{\varepsilon,\theta_0}, \rho)$ for $\phi = \tilde\phi_{\alpha,\beta,r}$ with $\theta_0 = 0.02$, $\varepsilon = 0.44$, $\alpha = 7.5$, $\beta = 0$, arbitrary $r$ (has no effect), and radii $\rho = 0.5$, $1.4$, $100000$.
- (m): $\varepsilon \mapsto T(\varepsilon)$ for $\alpha = 0.05$, $\beta = 0$, arbitrary $r$ (no effect) and $\theta_0 = 0.08$ (the dotted line is the KL case $\alpha = 1$, $\beta = 0$).
- (n)-(o): corresponding "minimizing" (touching) divergence spheres (in red) for $\varepsilon = 0.45$ resp. $\varepsilon = 0.46$.
- (p)-(q): corresponding $\theta \mapsto D_{\tilde\phi_{0.05,0,r}}(P_{0.45,0.08}, \mathrm{Bin}(2, \theta))$ resp. $\theta \mapsto D_{\tilde\phi_{0.05,0,r}}(P_{0.46,0.08}, \mathrm{Bin}(2, \theta))$.
- (r): $\varepsilon \mapsto T(\varepsilon)$ for $\alpha = 4$, $\beta = 0.35$, $r = 7.5$ and $\theta_0 = 0.08$.
- (s)-(t): corresponding "minimizing" divergence spheres for $\varepsilon = 0.45$ resp. $\varepsilon = 0.46$.
- (u): $\varepsilon \mapsto T(\varepsilon)$ for $\alpha = 2$, $\beta = 0.35$, $r = 7.5$ and $\theta_0 = 0.08$.
- (v)-(w): corresponding "minimizing" divergence spheres for $\varepsilon = 0.45$ resp. $\varepsilon = 0.46$.
- (x): residual adjustment functions of the KL divergence (in black), of the negative-exponential divergence (in blue), and of $D_{\tilde\phi_{\alpha,\beta,r}}(\cdot, \cdot)$ (in dotted red) with $\alpha = 10$, $\beta = 0.25$, $r = 10$.