Information geometry: Dualistic manifold structures and their uses

28/08/2018
Authors: Frank Nielsen
OAI: oai:www.see.asso.fr:14642:23438
DOI: 10.23723/14642/23438

License

Creative Commons: None (All rights reserved)

Information geometry: Dualistic manifold structures and their uses
Frank Nielsen, Sony Computer Science Laboratories Inc, Japan
@FrnkNlsn, slides: FrankNielsen.github.com
15th July 2018, GiMLi
http://arxiv.org/abs/1808.08271

From Information Theory (IT) to Information Sciences (IS)
- Shannon kicked off information sciences with his mathematical theory of communication in 1948: the birth of Information Theory (IT).
- More generally, Information Sciences (IS) study the "communication" between data and models (assuming a family of models a priori). (Diagram: data, variates, inference, model.)
- Information sciences include statistics, information theory, (statistical) machine learning (ML), AI (neural networks), etc.
- Inference θ̂_n(x_1, ..., x_n): extract information (= model parameter) from data about the model via estimators.
- Handle uncertainty due to limited or potentially corrupted data (Wald framework): minimize the risk in decision making and find the best decision rule.

What is information geometry? A broad definition
IG = geometry of decision making (and model fitting): geometry from distances, and geometry without explicit distances.
- Goodness-of-fit distance between data and inferred models via data-model estimators (model fitting): statistics (likelihood), machine learning (classifiers with loss functions), mathematical programming (constraints with objective functions).
- But also distances between models: decide between models (hypothesis testing/classification).
- Model inference is a decision problem too: decide which (parametric) model in a family of models via decision rules (Wald). (Figure: a manifold M of models m_θ1, m_θ2, m_θ̂n(D).)

Outline of this talk
1. Minimal introduction to differential geometry: (M, g, ∇) = manifold M equipped with a metric tensor g and an affine connection ∇ defining ∇-geodesics, parallel transport and covariant derivatives.
2. Theory: dual information-geometric structure (M, g, ∇, ∇*) and the fundamental theorem of information geometry.
3. Examples of dually metric-coupled connection geometry:
   A. dual geometry induced by a divergence,
   B. dually flat Pythagorean geometry (from Bregman divergences),
   C. expected α-geometry (from invariant statistical f-divergences).
4. Some applications of these structures and an overview of principled divergences.
5. Summary and conclusion.

1. Basics of Riemannian and non-Riemannian differential geometry

Manifold: Geometry and local coordinates
- Geometric objects/entities are defined on smooth manifolds: tensors (fields), differential operators.
- Geometric objects can also be expressed in local coordinate systems, using an atlas of overlapping charts covering the manifold.
- Geometric calculations are coordinate-free and do not depend on the chosen coordinate system: coordinate-free and local-coordinate equations.

Two essential geometric concepts on a manifold: g and ∇
- Metric tensor (field) g: allows one to measure, on tangent planes, angles between vectors and vector magnitudes ("lengths"); see the numerical sketch after this slide.
- Connection ∇: a differential operator (hence the "gradient" notation ∇) that allows one to
  1. calculate the covariant derivative Z = ∇_X Y :=: ∇(X, Y) of a vector field Y by another vector field X,
  2. "parallel transport" vectors between different tangent planes along smooth curves (path dependent),
  3. define ∇-geodesics as autoparallel curves.
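As a concrete complement to the metric-tensor bullet above (an illustration added here, not part of the original slides), the following minimal sketch assumes the hyperbolic upper half-plane metric G(p) = (1/y²) diag(1, 1) as a worked example of how g measures lengths and angles of tangent vectors at a point.

```python
import numpy as np

def poincare_metric(p):
    """Metric tensor G(p) of the hyperbolic upper half-plane at p = (x, y)."""
    x, y = p
    return np.diag([1.0 / y**2, 1.0 / y**2])

def inner(u, v, p, metric=poincare_metric):
    """Inner product <u, v>_g(p) = u^T G(p) v on the tangent plane T_p."""
    return u @ metric(p) @ v

def norm(u, p, metric=poincare_metric):
    return np.sqrt(inner(u, u, p, metric))

def angle(u, v, p, metric=poincare_metric):
    """Angle between the tangent vectors u and v at p, measured with g."""
    c = inner(u, v, p, metric) / (norm(u, p, metric) * norm(v, p, metric))
    return np.arccos(np.clip(c, -1.0, 1.0))

p = np.array([0.0, 2.0])   # base point with y > 0
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
# Lengths shrink as y grows; angles coincide with Euclidean angles (the metric is conformal).
print(norm(u, p), np.degrees(angle(u, v, p)))
```

Since this example metric is conformal to the Euclidean one, angles are unchanged while lengths are rescaled by 1/y, matching the conformal-geometry remark on the next slide.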
Smooth vector fields X ∈ X(M)
- The tangent plane T_p linearly approximates the manifold M at the point p.
- Tangent bundle TM = ∪_p T_p = {(p, v) : p ∈ M, v ∈ T_p} (more generally, fiber bundles like tensor bundles).
- A tangent vector v plays the role of a directional derivative: vf means the derivative of a smooth function f ∈ F(M) along the direction v.
- Technically, a smooth vector field X is defined as a "cross-section" of the tangent bundle: X ∈ X(M) = Γ(TM).
- In local coordinates, X = Σ_i X^i e_i = X^i e_i (Einstein convention on dummy indices), where (X)_B = (X^i) are the contravariant vector components in the natural basis B = {e_i = ∂_i}_i, with ∂_i :=: ∂/∂x^i in the chart (U, x).

Reciprocal basis and contravariant/covariant components
A tangent plane (= vector space) equipped with an inner product ⟨·, ·⟩ induces a reciprocal basis B* = {e*^i = ∂^i}_i of B = {e_i = ∂_i}_i, so that vectors can also be expressed using the covariant vector components in the natural reciprocal basis. The primal and reciprocal bases are mutually orthogonal by construction:
⟨e_i, e*^j⟩ = δ_i^j,   v^i = ⟨v, e*^i⟩,   v_i = ⟨v, e_i⟩,
g_ij = ⟨e_i, e_j⟩,   g*^ij = ⟨e*^i, e*^j⟩,   g* = g^{-1},   e*^i = g*^{ij} e_j,   e_i = g_ij e*^j.

Metric tensor field g ("metric" g for short)
- Symmetric positive-definite bilinear form: for u, v ∈ T_p, g_p(u, v) ∈ R with g_p(u, u) ≥ 0 (and = 0 iff u = 0); also written g_p(u, v) = ⟨u, v⟩_g = ⟨u, v⟩_g(p) = ⟨u, v⟩.
- Orthogonal vectors: u ⊥ v iff ⟨u, v⟩ = 0.
- Vector length from the induced norm: ‖u‖_p = √⟨u, u⟩_g(p).
- In the local coordinates of a chart (U, x), using matrix/vector algebra: g(u, v) = (u)_B^⊤ G(x(p)) (v)_B = (u)_{B*}^⊤ G(x(p))^{-1} (v)_{B*} (with primal/reciprocal bases, so that G × G^{-1} = I).
- The geometry is conformal when ⟨·, ·⟩_p = κ(p) ⟨·, ·⟩_Euclidean: angles/orthogonality are preserved without deformation.
- Technically, g = g_ij dx^i ⊗ dx^j is a 2-covariant tensor field, interpreted as a bilinear function (eating two contravariant vectors).

Defining an affine connection ∇ via Christoffel symbols Γ
- ∇ is a differential operator that defines the derivative Z of a vector field Y by another vector field X (or of tensors): Z = ∇_X Y :=: ∇(X, Y).
- For a D-dimensional manifold, define D³ smooth functions, called the Christoffel symbols (of the second kind), that induce the connection ∇: Γ^k_ij := (∇_∂i ∂_j)^k ∈ F(M) ⇒ ∇, where (·)^k denotes the k-th component of a vector.
- The Christoffel symbols are not tensors because their transformation rules under a change of basis do not obey the tensor contravariant/covariant rules.

Parallel transport ∏^∇_c along a smooth curve c(t)
- An affine connection ∇ defines a way to "translate" vectors, i.e., to parallel transport vectors on tangent planes along a smooth curve c(t). (Manifolds are not embedded.)
- Parallel transport of vectors between the tangent planes T_p and T_q linked by c (defined by ∇), with c(0) = p and c(1) = q: ∀v ∈ T_p, v_t = ∏^∇_{c(0)→c(t)} v ∈ T_c(t).

∇-geodesics
- ∇-geodesics are autoparallel curves, i.e., curves γ whose velocity vector γ̇ = (d/dt) γ(t) moves along the curve (via ∇) parallel to itself: ∇-geodesics are "straight" lines, ∇_γ̇ γ̇ = 0.
- In local coordinates γ^l(t) = x^l ∘ γ(t), autoparallelism amounts to solving a system of second-order Ordinary Differential Equations (ODEs):
  d²γ^k(t)/dt² + Γ^k_ij (dγ^i(t)/dt) (dγ^j(t)/dt) = 0.
- In general, ∇-geodesics are not available in closed form (and require numerical schemes like the Schild or pole ladders). But when Γ^k_ij = 0, the ODE is easy to solve: geodesics are straight lines in the local chart! (A numerical integration sketch follows below.)
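To make the autoparallel ODE above concrete (an added illustration, not from the slides), the sketch below integrates d²γ^k/dt² + Γ^k_ij γ̇^i γ̇^j = 0 with a crude Euler scheme, assuming the Christoffel symbols of the hyperbolic upper half-plane used in the previous sketch.

```python
import numpy as np

def christoffel_poincare(p):
    """Christoffel symbols Gamma[k, i, j] of the hyperbolic upper half-plane at p = (x, y).
    Nonzero symbols: G^x_xy = G^x_yx = -1/y, G^y_xx = 1/y, G^y_yy = -1/y."""
    _, y = p
    G = np.zeros((2, 2, 2))
    G[0, 0, 1] = G[0, 1, 0] = -1.0 / y
    G[1, 0, 0] = 1.0 / y
    G[1, 1, 1] = -1.0 / y
    return G

def geodesic(p0, v0, christoffel, t_max=1.0, steps=1000):
    """Integrate the autoparallel ODE  gamma''^k + Gamma^k_ij gamma'^i gamma'^j = 0
    with a simple Euler scheme, returning the discretized curve."""
    dt = t_max / steps
    p, v = np.asarray(p0, float), np.asarray(v0, float)
    curve = [p.copy()]
    for _ in range(steps):
        G = christoffel(p)
        a = -np.einsum('kij,i,j->k', G, v, v)   # acceleration induced by the connection
        p, v = p + dt * v, v + dt * a
        curve.append(p.copy())
    return np.array(curve)

# Hyperbolic geodesics are half-circles orthogonal to the x-axis (or vertical lines):
curve = geodesic(p0=[0.0, 1.0], v0=[1.0, 0.0], christoffel=christoffel_poincare)
print(curve[::250])
```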
Affine connection ∇ compatible with the metric tensor g
- Definition of metric compatibility of an affine connection ∇ with g:
  Geometric coordinate-free definition: X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇_X Z⟩.
  Definition in local coordinates {∂_i}: ∂_k g_ij = ⟨∇_∂k ∂_i, ∂_j⟩ + ⟨∂_i, ∇_∂k ∂_j⟩.
- That is, parallel transport is compatible with the metric ⟨·, ·⟩_g:
  ⟨u, v⟩_c(0) = ⟨∏^∇_{c(0)→c(t)} u, ∏^∇_{c(0)→c(t)} v⟩_c(t) for all t,
  so angles and lengths of tangent vectors are preserved when transported along a smooth curve.

Fundamental theorem of Riemannian geometry
Theorem (Levi-Civita connection from the metric tensor). There exists a unique torsion-free affine connection compatible with the metric, called the Levi-Civita connection ^LC∇.
- The Christoffel symbols of the Levi-Civita connection can be expressed from the metric tensor (see the numerical sketch after this slide):
  ^LC Γ^k_ij = (1/2) g^{kl} (∂_i g_jl + ∂_j g_il − ∂_l g_ij),
  where g^{kl} denotes the matrix elements of the inverse matrix g^{-1}.
- The geometric equation defining the Levi-Civita connection is the Koszul formula:
  2 g(∇_X Y, Z) = X(g(Y, Z)) + Y(g(X, Z)) − Z(g(X, Y)) + g([X, Y], Z) − g([X, Z], Y) − g([Y, Z], X).

Curvature tensor R of an affine connection ∇
- Riemann-Christoffel 4D curvature tensor R = (R^i_jkl), geometric equation:
  R(X, Y)Z = ∇_X ∇_Y Z − ∇_Y ∇_X Z − ∇_[X,Y] Z,
  where [X, Y](f) = X(Y(f)) − Y(X(f)) (for all f ∈ F(M)) is the Lie bracket of vector fields.
- The manifold is flat (= ∇-flat, i.e., R = 0) when there exists a coordinate system x such that Γ^k_ij(x) = 0, i.e., all connection coefficients vanish.
- The connection is torsion-free when it is symmetric, Γ^k_ij = Γ^k_ji; geometric equation: ∇_X Y − ∇_Y X = [X, Y].

Curvature and parallel transport along a loop (closed curve)
- In general, parallel transport is path-dependent.
- The angle defect of a vector transported along a loop is related to the curvature.
- For a flat connection, parallel transport is path-independent (e.g., the Riemannian cylinder manifold is flat).

2. Information-geometric structure: dual connections compatible with the metric and dual parallel transport

Conjugate connections {∇, ∇*}_g and dual parallel transport {∏^∇, ∏^∇*}_g
- Definition of the conjugate connection ∇* with respect to the metric tensor g (coupled with g), ∇* := C_g(∇):
  X⟨Y, Z⟩ = ⟨∇_X Y, Z⟩ + ⟨Y, ∇*_X Z⟩.
- Involution: ∇** = C_g(∇*) = ∇.
- Dual parallel transport of vectors is metric-compatible:
  ⟨u, v⟩_c(0) = ⟨∏^∇_{c(0)→c(t)} u, ∏^∇*_{c(0)→c(t)} v⟩_c(t).
- First studied by Norden (1945, relative geometry); also used in affine differential geometry.
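The Levi-Civita formula above can be checked numerically. The sketch below (an added illustration, not from the slides) estimates the partial derivatives ∂_l g_ij by central finite differences for an assumed example metric and assembles ^LC Γ^k_ij; for the hyperbolic upper half-plane at y = 2 the nonzero symbols should come out close to ±1/2, matching the symbols used in the geodesic sketch above.

```python
import numpy as np

def poincare_metric(p):
    """Metric matrix G(p) of the hyperbolic upper half-plane at p = (x, y)."""
    return np.diag([1.0 / p[1]**2, 1.0 / p[1]**2])

def levi_civita_christoffel(metric, p, eps=1e-5):
    """Christoffel symbols Gamma[k, i, j] of the Levi-Civita connection at p,
    from LC Gamma^k_ij = (1/2) g^{kl} (d_i g_jl + d_j g_il - d_l g_ij),
    with the partial derivatives d_l g_ij estimated by central differences."""
    p = np.asarray(p, float)
    D = p.size
    Ginv = np.linalg.inv(metric(p))
    dG = np.zeros((D, D, D))                 # dG[l, i, j] = d_l g_ij
    for l in range(D):
        e = np.zeros(D); e[l] = eps
        dG[l] = (metric(p + e) - metric(p - e)) / (2 * eps)
    # S[l, i, j] = d_i g_jl + d_j g_il - d_l g_ij
    S = np.transpose(dG, (2, 0, 1)) + np.transpose(dG, (2, 1, 0)) - dG
    return 0.5 * np.einsum('kl,lij->kij', Ginv, S)

# At (x, y) = (0, 2): the expected nonzero symbols are -1/y = -0.5 and 1/y = 0.5
print(np.round(levi_civita_christoffel(poincare_metric, [0.0, 2.0]), 4))
```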
A 1D family of dualistic structures of information geometry
- Given a dualistic structure (M, g, ∇, ∇*) with {∇, ∇*}_g, it follows that ∇̄ := (∇ + ∇*)/2 = ^LC∇ coincides with the Levi-Civita connection induced by g.
- Define the totally symmetric cubic tensor C, called the Amari-Chentsov tensor:
  C_ijk = Γ*_ijk − Γ_ijk = C_σ(i)σ(j)σ(k),   C(X, Y, Z) = ⟨∇*_X Y − ∇_X Y, Z⟩,   C(∂_i, ∂_j, ∂_k) = ⟨∇*_∂i ∂_j − ∇_∂i ∂_j, ∂_k⟩.
- Then define a 1-parameter family of connections {∇^α}_{α∈R} dually coupled to the metric, with ∇^0 = ∇̄ = ^LC∇ (and ∇^1 = ∇, ∇^{-1} = ∇*):
  Γ^α_ijk = Γ^0_ijk − (α/2) C_ijk,   Γ^{-α}_ijk = Γ^0_ijk + (α/2) C_ijk,   where Γ_kij = Γ^l_ij g_lk.
- Theorem: (M, g, {∇^{-α}, ∇^α}_g) is an information-geometric dualistic structure (with dual parallel transport).

The fundamental theorem of information geometry
Theorem (Dually constant-curvature manifolds). If a torsion-free affine connection ∇ has constant curvature κ, then its conjugate torsion-free connection ∇* = C_g(∇) has the same constant curvature κ.
- Corollary I (dually α-flat manifolds): ∇^α-flat ⇔ ∇^{-α}-flat, i.e., (M, g, ∇^α) flat ⇔ (M, g, ∇^{-α} = (∇^α)*) flat.
- Corollary II (dually flat manifolds, α = ±1): 1-flat ⇔ −1-flat, i.e., (M, g, ∇) flat ⇔ (M, g, ∇*) flat.

Panorama of information-geometric structures
(Overview diagram: smooth manifolds ⊃ conjugate connection manifolds (M, g, ∇, ∇*) = (M, g, C = Γ* − Γ), whose distances are non-metric divergences, and Riemannian manifolds (M, g) = (M, g, ^LC∇), whose distance is the metric geodesic length; divergence manifolds (M, ^D g, ^D∇, ^D∇* = ^D*∇) with ^D∇-flat ⇔ ^D∇*-flat; expected manifolds (M, ^Fisher g, ∇^{-α}, ∇^α) (α-geometry) with Fisher g_ij = E[∂_i l ∂_j l] and cubic skewness tensor C_ijk = E[∂_i l ∂_j l ∂_k l]; dually flat (Hessian) manifolds (M, F, F*) with dual Legendre potentials and the Bregman Pythagorean theorem; self-dual Fisher-Riemannian, Euclidean, spherical and hyperbolic manifolds; parametric families (location, location-scale, multinomial); f-divergences, Bregman divergences, KL* on exponential families, KL on mixture families, conformal divergences on deformed families, etc.; with ∇^α = ((1+α)/2) ∇ + ((1−α)/2) ∇*, Γ^{±α} = Γ̄ ∓ (α/2) C, and (M, g, ∇^{-α}, ∇^α) ≡ (M, g, αC).)
- A divergence D induces the structure (M, ^D g, ^D∇^α, ^D∇^{-α}) ≡ (M, ^D g, ^D αC), with canonical divergence I[p_θ : p_θ′] = D(θ : θ′).
- A convex potential F induces the structure (M, ^F g, ^F∇^α, ^F∇^{-α}) ≡ (M, ^F g, ^F αC) (via the Bregman divergence B_F).

Dually flat manifold = Bregman dual Pythagorean geometry

Dually flat manifold
- Consider a strictly convex smooth function F, called a potential function.
- Associate its Bregman divergence (parameter divergence): B_F(θ : θ′) := F(θ) − F(θ′) − (θ − θ′)^⊤ ∇F(θ′) (see the sketch after this slide).
- The induced information-geometric structure is (M, ^F g, ^F C) = (M, ^BF g, ^BF C) with:
  ^F g := ^BF g = −[∂_i ∂′_j B_F(θ : θ′)|_{θ′=θ}] = ∇²F(θ),
  ^F Γ := ^BF Γ_ijk(θ) = 0,
  ^F C_ijk := ^BF C_ijk = ∂_i ∂_j ∂_k F(θ).
- The manifold is ^F∇-flat.
- The Levi-Civita connection ^LC∇ is obtained from the metric tensor ^F g (usually not flat), and we get the conjugate connection (^F∇)* = ^F∇^{-1} from (M, ^F g, ^F C).

Duality via the Legendre-Fenchel transformation
- The Legendre-Fenchel transformation gives the convex conjugate F*, the dual potential function: F*(η) := sup_{θ∈Θ} {θ^⊤ η − F(θ)}.
- Dual affine coordinate systems: η = ∇F(θ) and θ = ∇F*(η). (Function convexity is not changed by an affine transformation.)
- The Crouzeix identity ∇²F(θ) ∇²F*(η) = I reveals that {∂_i}_i and {∂^j}_j are reciprocal bases.
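A small sketch of the Bregman divergence construction above (added here for illustration, not from the slides): with a numerically estimated gradient, B_F recovers the squared Euclidean half-distance for the quadratic potential and the Kullback-Leibler divergence for the negative Shannon entropy restricted to the probability simplex.

```python
import numpy as np
from scipy.optimize import approx_fprime

def bregman(F, theta, theta_prime, eps=1e-7):
    """Bregman divergence B_F(theta : theta') = F(theta) - F(theta') - <theta - theta', grad F(theta')>,
    with the gradient of F estimated numerically."""
    grad = approx_fprime(theta_prime, F, eps)
    return F(theta) - F(theta_prime) - (theta - theta_prime) @ grad

# F(theta) = (1/2)||theta||^2  ->  B_F is half the squared Euclidean distance
F_quad = lambda t: 0.5 * np.dot(t, t)
# F(theta) = sum_i theta_i log theta_i (negative Shannon entropy)  ->  B_F is the KL divergence on the simplex
F_negent = lambda t: np.sum(t * np.log(t))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(bregman(F_quad, p, q), 0.5 * np.sum((p - q)**2))      # approximately equal
print(bregman(F_negent, p, q), np.sum(p * np.log(p / q)))   # approximately equal
```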
- The Bregman divergence is reinterpreted, using the Young-Fenchel (in)equality, as the canonical divergence:
  B_F(θ : θ′) = A_{F,F*}(θ : η′) = F(θ) + F*(η′) − θ^⊤ η′ = A_{F*,F}(η′ : θ).
- The dual Bregman divergence B*_F(θ : θ′) := B_F(θ′ : θ) = B_{F*}(η : η′) yields
  ^F g^{ij}(η) = ∂^i ∂^j F*(η) with ∂^l :=: ∂/∂η_l,   ^F Γ*^{ijk}(η) = 0,   ^F C^{ijk} = ∂^i ∂^j ∂^k F*(η).
- The manifold is both ^F∇-flat and ^F∇*-flat: a dually flat manifold.

Dual Pythagorean theorems
Any pair of points can be linked either by the ∇-geodesic (θ-straight [PQ]) or by the ∇*-geodesic (η-straight [PQ]*): there are 2³ = 8 types of geodesic triangles.
- If γ(P, Q) ⊥_F γ*(Q, R), then D(P : R) = D(P : Q) + D(Q : R), i.e., B_F(θ(P) : θ(R)) = B_F(θ(P) : θ(Q)) + B_F(θ(Q) : θ(R)).
- Dually, if γ*(P, Q) ⊥_F γ(Q, R), then D*(P : R) = D*(P : Q) + D*(Q : R), i.e., B*_F(η(P) : η(R)) = B*_F(η(P) : η(Q)) + B*_F(η(Q) : η(R)).
- Orthogonality is expressed in mixed coordinates:
  γ*(P, Q) ⊥_F γ(Q, R) ⇔ ⟨η(P) − η(Q), θ(Q) − θ(R)⟩ = 0,
  γ(P, Q) ⊥_F γ*(Q, R) ⇔ ⟨θ(P) − θ(Q), η(Q) − η(R)⟩ = 0.

Dual Bregman projection: a uniqueness theorem
- A submanifold S ⊂ M is said to be ∇-flat (∇*-flat) if it corresponds to an affine subspace in θ (in η, respectively).
- In a dually flat space, there is a canonical (Bregman) divergence D.
- The ∇-projection P_S of P onto S is unique if S is ∇*-flat; it minimizes the ∇-divergence D(θ(P) : θ(Q)):
  ∇-projection: P_S = arg min_{Q∈S} D(θ(P) : θ(Q)).
- The dual ∇*-projection P*_S is unique if S is ∇-flat; it minimizes the ∇*-divergence D*(θ(P) : θ(Q)) = D(θ(Q) : θ(P)):
  ∇*-projection: P*_S = arg min_{Q∈S} D(θ(Q) : θ(P)).

Expected α-geometry from the Fisher information metric and the skewness cubic tensor of a parametric family of distributions

Fisher Information Matrix (FIM), positive semi-definite
- Parametric family of probability distributions P := {p_θ(x)}_θ for θ ∈ Θ.
- Likelihood function L(θ; x) and log-likelihood function l(θ; x) := log L(θ; x).
- The score vector s_θ = ∇_θ l = (∂_i l)_i indicates the sensitivity of the likelihood, with ∂_i l :=: ∂l(θ; x)/∂θ^i.
- Fisher information matrix (FIM), of size D × D for dim(Θ) = D (a Monte Carlo sketch follows below):
  ^P I(θ) := E_θ[∂_i l ∂_j l]_ij ≽ 0.
- Cramér-Rao lower bound [1] on the variance of any unbiased estimator: V_θ[θ̂_n(X)] ≽ (1/n) ^P I^{-1}(θ).
- The FIM is invariant under reparameterization of the sample space X, and covariant under reparameterization of the parameter space Θ.
[1] F. Nielsen. "Cramér-Rao lower bound and information geometry". In: Connected at Infinity II. Springer, 2013.
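The FIM definition ^P I(θ) = E_θ[∂_i l ∂_j l] can be illustrated by a Monte Carlo estimate (an added sketch, assuming a univariate Gaussian family parameterized by (μ, σ)); the empirical outer product of score vectors should approach the closed-form FIM diag(1/σ², 2/σ²).

```python
import numpy as np

rng = np.random.default_rng(0)

def score_gaussian(x, mu, sigma):
    """Score vector s_theta = (d/dmu, d/dsigma) of log N(x; mu, sigma^2)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu)**2 - sigma**2) / sigma**3
    return np.stack([d_mu, d_sigma], axis=-1)

def empirical_fim(mu, sigma, n=200000):
    """Monte Carlo estimate of I(theta) = E_theta[ s_theta s_theta^T ]."""
    x = rng.normal(mu, sigma, size=n)
    s = score_gaussian(x, mu, sigma)
    return s.T @ s / n

mu, sigma = 1.0, 2.0
print(np.round(empirical_fim(mu, sigma), 3))
print(np.array([[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]]))  # closed-form FIM in (mu, sigma)
```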
Some examples of Fisher information matrices
- FIM of an exponential family (includes the Gaussian, Beta, the probability simplex ∆_D, etc.):
  E = { p_θ(x) = exp( Σ_{i=1}^D t_i(x) θ^i − F(θ) + k(x) ) : θ ∈ Θ },
  where F is the strictly convex cumulant function, and
  ^E I(θ) = Cov_θ[t(X)] = ∇²F(θ) = (∇²F*(η))^{-1} ≻ 0.
- FIM of a mixture family (includes statistical mixtures, ∆_D):
  M = { p_θ(x) = Σ_{i=1}^D θ^i F_i(x) + C(x) : θ ∈ Θ },
  with {F_i(x)}_i linearly independent on X, ∫ F_i(x) dμ(x) = 0 and ∫ C(x) dμ(x) = 1, and
  ^M I(θ) = E_θ[ F_i(x) F_j(x) / p_θ²(x) ] = ∫_X (F_i(x) F_j(x) / p_θ(x)) dμ(x) ≻ 0.

Expected α-geometry from expected dual α-connections
- Fisher "information metric" tensor from the FIM (regular models): ^P g(u, v) := (u)_θ^⊤ ^P I(θ) (v)_θ, with ^P I(θ) ≻ 0.
- Exponential connection and mixture connection, defined through their Christoffel symbols:
  ^P∇^e: E_θ[(∂_i ∂_j l)(∂_k l)],   ^P∇^m: E_θ[(∂_i ∂_j l + ∂_i l ∂_j l)(∂_k l)].
- Dualistic structure (P, ^P g, ^P∇^m, ^P∇^e) with the cubic skewness tensor C_ijk = E_θ[∂_i l ∂_j l ∂_k l].
- It follows a one-parameter family of α-CCMs {(P, ^P g, ^P∇^{-α}, ^P∇^{+α})}_α with
  ^P Γ^α_{ij,k} := E_θ[(∂_i ∂_j l + ((1−α)/2) ∂_i l ∂_j l)(∂_k l)] = ^P Γ^0_{ij,k} − (α/2) C_ijk,
  (^P∇^{-α} + ^P∇^α)/2 = ^LC_P∇ := ^LC∇(^P g).
- For an exponential family E or a mixture family M we get dually flat manifolds (Bregman geometry / standard f-divergence): the e- and m-connections are flat, with Christoffel symbols vanishing in the natural and mixture affine coordinate systems, respectively.

Statistical invariance criteria
- Which metric tensors g make sense? Which affine connections ∇ make sense? Which statistical divergences make sense?
- An invariant metric g shall preserve the inner product under important statistical mappings (Markov mappings).
Theorem (Uniqueness of the Fisher metric). The Fisher metric is the unique invariant metric, up to a scaling constant. [2]
[2] N. N. Chentsov. Statistical decision rules and optimal inference. 53. AMS, 1982 (Russian 1972); L. L. Campbell. "An extended Čencov characterization of the information metric". In: American Mathematical Society (1986); H. Vân Lê. "The uniqueness of the Fisher metric as information metric". In: Annals of the Institute of Statistical Mathematics 69.4 (2017), pp. 879-896.

Divergence: Information monotonicity and f-divergences
- A divergence satisfies information monotonicity iff D(θ_Ā : θ′_Ā) ≤ D(θ : θ′) for any coarse-grained partition A (A-lumping). (Example of coarse graining: p = (p_1, ..., p_8) ↦ p_A = (p_1 + p_2, p_3 + p_4 + p_5, p_6, p_7 + p_8).)
- The only invariant and decomposable divergences are the f-divergences [3] (when D > 1):
  I_f(θ : θ′) := Σ_i θ_i f(θ′_i / θ_i) ≥ f(1), with f(1) = 0.
- Standard f-divergences are defined for f-generators satisfying f′(1) = 0 (choose f_λ(u) := f(u) + λ(u − 1), since I_{f_λ} = I_f) and f′′(1) = 1 (fixing the scale). A generic implementation sketch follows after this slide block.
[3] J. Jiao et al. "Information measures: the curious case of the binary alphabet". In: IEEE Transactions on Information Theory 60.12 (2014), pp. 7616-7626.

Properties of Csiszár-Ali-Silvey f-divergences
- Statistical f-divergences are invariant [4] under one-to-one/sufficient statistic transformations y = t(x) of the sample space, p(x; θ) = q(y(x); θ):
  I_f[p(x; θ) : p(x; θ′)] = ∫_X p(x; θ) f(p(x; θ′)/p(x; θ)) dμ(x) = ∫_Y q(y; θ) f(q(y; θ′)/q(y; θ)) dμ(y) = I_f[q(y; θ) : q(y; θ′)].
- Dual f-divergences for reference duality:
  I*_f[p(x; θ) : p(x; θ′)] = I_f[p(x; θ′) : p(x; θ)] = I_{f⋄}[p(x; θ) : p(x; θ′)]
  for the standard conjugate f-generator (diamond f⋄ generator): f⋄(u) := u f(1/u).
[4] Y. Qiao and N. Minematsu. "A Study on Invariance of f-Divergence and Its Application to Speech Recognition". In: IEEE Transactions on Signal Processing 58.7 (2010), pp. 3884-3890.
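As referenced in the f-divergence bullet above, here is a minimal generic sketch (added, not from the slides) of I_f on discrete distributions, together with the reference duality via the conjugate generator f⋄(u) = u f(1/u).

```python
import numpy as np

def f_divergence(p, q, f):
    """Csiszar f-divergence I_f(p : q) = sum_i p_i f(q_i / p_i) on discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * f(q / p))

# A few classical generators (all with f(1) = 0):
f_kl   = lambda u: -np.log(u)            # Kullback-Leibler KL(p : q)
f_rkl  = lambda u: u * np.log(u)         # reverse Kullback-Leibler KL(q : p)
f_hell = lambda u: (np.sqrt(u) - 1)**2   # squared Hellinger divergence
f_tv   = lambda u: 0.5 * np.abs(u - 1)   # total variation

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
print(f_divergence(p, q, f_kl), np.sum(p * np.log(p / q)))            # identical

# Reference duality via the conjugate (diamond) generator f_diamond(u) = u f(1/u):
f_diamond = lambda f: (lambda u: u * f(1.0 / u))
print(f_divergence(p, q, f_diamond(f_kl)), f_divergence(q, p, f_kl))  # identical
```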
Some common examples of f-divergences
- The family of α-divergences:
  I_α[p : q] := (4 / (1 − α²)) (1 − ∫ p^{(1−α)/2}(x) q^{(1+α)/2}(x) dμ(x)),
  obtained for f(u) = (4 / (1 − α²)) (1 − u^{(1+α)/2}); the family includes:
  - the Kullback-Leibler divergence KL[p : q] = ∫ p(x) log(p(x)/q(x)) dμ(x) for f(u) = −log u (α = 1),
  - the reverse Kullback-Leibler divergence KL*[p : q] = ∫ q(x) log(q(x)/p(x)) dμ(x) for f(u) = u log u (α = −1),
  - the squared Hellinger divergence H²[p : q] = ∫ (√p(x) − √q(x))² dμ(x) for f(u) = (√u − 1)² (α = 0),
  - the Pearson and Neyman chi-squared divergences.
- The Jensen-Shannon divergence (bounded):
  (1/2) ∫ ( p(x) log(2p(x)/(p(x) + q(x))) + q(x) log(2q(x)/(p(x) + q(x))) ) dμ(x) for f(u) = −(u + 1) log((1 + u)/2) + u log u.
- The Total Variation distance (a metric): (1/2) ∫ |p(x) − q(x)| dμ(x) for f(u) = (1/2)|u − 1|.

Invariant f-divergences yield the Fisher metric/α-connections
- Invariant (separable) standard f-divergences relate infinitesimally to the Fisher metric:
  I_f[p(x; θ) : p(x; θ + dθ)] = ∫ p(x; θ) f(p(x; θ + dθ)/p(x; θ)) dμ(x) = (1/2) ^F g_ij(θ) dθ^i dθ^j.
- A statistical divergence on a parametric family of distributions yields an equivalent parameter divergence: ^P D(θ : θ′) := D_P[p(x; θ) : p(x; θ′)]. Thus we can build the information-geometric structure induced by this parameter divergence ^P D(· : ·).
- For ^P D(· : ·) = I_f[· : ·], the induced ±1-divergence connections ^{I_f}_P∇ and ^{(I_f)*}_P∇ are the expected ±α-connections with α = 2f′′′(1) + 3.

4. Some applications of the information-geometric manifolds (M, g, ∇, ∇*) ≡ (M, g, C), (M, g, ∇^{-α}, ∇^α) ≡ (M, g, αC)

Broad view of applications of Information Geometry [6]
- Statistics: asymptotic inference, Expectation-Maximization (EM/em), time series ARMA models.
- Machine learning: Restricted Boltzmann Machines (RBMs), neuromanifolds (deep learning) and the natural gradient [5].
- Signal processing: PCA, ICA, Non-negative Matrix Factorization (NMF).
- Mathematical programming: barrier functions of interior point methods.
- Game theory: score functions.
[5] Ke Sun and Frank Nielsen. "Relative Fisher Information and Natural Gradient for Learning Large Modular Models". In: ICML. 2017, pp. 3289-3298.
[6] Shun-ichi Amari. Information geometry and its applications. Springer, 2016.

Hypothesis testing in the dually flat exponential family manifold

Bayesian hypothesis testing / binary classification [7]
- Given two distributions P0 and P1, classify observations X_{1:n} (a decision problem) as either i.i.d. sampled from P0 or from P1. For example, P0 = signal and P1 = noise (assume the same prior, w1 = w2 = 1/2).
- Assume P0 ∼ P_θ0 and P1 ∼ P_θ1 belong to an exponential family manifold E = {P_θ} with the dually flat structure (E, ^E g, ^E∇^e, ^E∇^m). This structure can also be derived from a divergence manifold structure (E, KL*). Therefore KL[P_θ : P_θ′] amounts to a Bregman divergence (for the cumulant function of the exponential family): KL[P_θ : P_θ′] = B_F(θ′ : θ).

Hypothesis testing (HT) / binary classification [10]
- The best error exponent α* [8] of the best Maximum A Posteriori (MAP) decision rule is found by minimizing the Bhattacharyya distance to get the Chernoff information (see the numerical sketch after this slide):
  C[P1, P2] = −log min_{α∈(0,1)} ∫_X p_1^α(x) p_2^{1−α}(x) dμ(x) ≥ 0.
- On E, the Bhattacharyya distance amounts to a skew Jensen parameter divergence [9]:
  J_F^{(α)}(θ1 : θ2) = αF(θ1) + (1 − α)F(θ2) − F(αθ1 + (1 − α)θ2).
- Theorem: the Chernoff information is a Bregman divergence for exponential families at the optimal exponent value:
  C[P_θ1 : P_θ2] = B(θ1 : θ_12^{(α*)}) = B(θ2 : θ_12^{(α*)}).
[8] G.-T. Pham, R. Boyer, and F. Nielsen. "Computational Information Geometry for Binary Classification of High-Dimensional Random Tensors". In: Entropy 20.3 (2018), p. 203.
[9] F. Nielsen and S. Boltz. "The Burbea-Rao and Bhattacharyya centroids". In: IEEE Transactions on Information Theory 57.8 (2011).
[10] Nielsen, "An Information-Geometric Characterization of Chernoff Information".
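A numerical sketch of the Chernoff information above (added illustration, not from the slides): the skew Bhattacharyya coefficient ρ(α) = ∫ p_1^α p_2^{1−α} dμ is approximated on a grid and minimized over α ∈ (0, 1); the two densities below are assumed univariate Gaussians chosen only for illustration.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def chernoff_information(p1, p2, x):
    """Chernoff information C[P1, P2] = -log min_{a in (0,1)} int p1^a p2^(1-a) dmu,
    with the integral approximated by the trapezoidal rule on the grid x."""
    rho = lambda a: trapezoid(p1(x)**a * p2(x)**(1 - a), x)   # skew Bhattacharyya coefficient
    res = minimize_scalar(rho, bounds=(1e-6, 1 - 1e-6), method='bounded')
    return -np.log(res.fun), res.x   # Chernoff information and the optimal exponent alpha*

x = np.linspace(-15, 15, 20001)
p1 = lambda t: norm.pdf(t, loc=0.0, scale=1.0)
p2 = lambda t: norm.pdf(t, loc=3.0, scale=2.0)
C, a_star = chernoff_information(p1, p2, x)
print(C, a_star)
```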
Geometry of the best error exponent: binary hypothesis testing on the exponential family manifold [11]
- P* = P_θ*12 = G_e(P_θ1, P_θ2) ∩ Bi_m(P_θ1, P_θ2).
  (Figure, in the η-coordinate system: the e-geodesic G_e(P_θ1, P_θ2) intersects the m-bisector Bi_m(P_θ1, P_θ2) at P_θ*12, and C(θ1 : θ2) = B(θ1 : θ*_12).)
- Synthetic information geometry ("Hellinger arc"): exact characterization, but not necessarily a closed-form formula.
[11] Nielsen, "An Information-Geometric Characterization of Chernoff Information".

Information geometry of multiple hypothesis testing [13]
- Bregman Voronoi diagrams [12] on E of P1, ..., Pn ∈ E. (Figure, in the η-coordinate system: Chernoff distributions between natural neighbours.)
[12] J.-D. Boissonnat, F. Nielsen, and R. Nock. "Bregman Voronoi diagrams". In: Discrete & Computational Geometry 44.2 (2010), pp. 281-307.
[13] F. Nielsen. "Hypothesis Testing, Information Divergence and Computational Geometry". In: GSI. 2013, pp. 241-248; Pham, Boyer, and Nielsen, "Computational Information Geometry for Binary Classification of High-Dimensional Random Tensors".

Clustering statistical w-mixtures on the dually flat mixture family manifold M

Dually flat mixture family manifold [14]
- Consider D + 1 prescribed component distributions {p_0, ..., p_D} and form the mixture manifold M by considering all their convex combinations: M = {m_θ(x) = Σ_{i=0}^D w_i p_i(x)}. (Figure: two such mixtures M1 and M2 of the prescribed components Gaussian(-2,1), Cauchy(2,1) and Laplace(0,1).)
- Information-geometric structure: (M, ^M g, ^M∇^{-1}, ^M∇^1) is dually flat and equivalent to (M_θ, KL). That is, the KL divergence between two mixtures with prescribed components is equivalent to a Bregman divergence:
  KL[m_θ : m_θ′] = B_F(θ : θ′),   F(θ) = −h(m_θ) = ∫ m_θ(x) log m_θ(x) dμ(x).
[14] F. Nielsen and R. Nock. "On the geometry of mixtures of prescribed distributions". In: IEEE ICASSP. 2018.

Clustering on the dually flat w-mixture family manifold
- Apply Bregman k-means [15] in Monte Carlo dually flat spaces [16] of w-GMMs (Gaussian Mixture Models).
- Because F = −h(m(x; θ)) is the negative differential entropy of a mixture (not available in closed form [17]), we approximate F by a Monte Carlo convex estimate F̃_X and get a sequence of tractable dually flat manifolds converging to the ideal mixture manifold (see the sketch after this slide).
[15] F. Nielsen and R. Nock. "Sided and symmetrized Bregman centroids". In: IEEE Transactions on Information Theory 55.6 (2009).
[16] F. Nielsen and G. Hadjeres. "Monte Carlo Information Geometry: The dually flat case". In: CoRR abs/1803.07225 (2018).
[17] F. Nielsen and K. Sun. "Guaranteed Bounds on Information-Theoretic Measures of Univariate Mixtures Using Piecewise Log-Sum-Exp Inequalities". In: Entropy 18.12 (2016), p. 442.
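A rough sketch of the Monte Carlo approximation mentioned above (added here; the components Gaussian(-2,1), Cauchy(2,1) and Laplace(0,1) mirror the slide's example, while the heavy-tailed proposal distribution is an arbitrary assumption): F̃_X estimates F(θ) = ∫ m_θ log m_θ dμ by importance sampling, and the same sample gives a plug-in estimate of KL[m_θ : m_θ′].

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Prescribed component densities (a toy example): Gaussian(-2,1), Cauchy(2,1), Laplace(0,1)
components = [stats.norm(-2, 1).pdf, stats.cauchy(2, 1).pdf, stats.laplace(0, 1).pdf]

def mixture_pdf(w, x):
    """Density m_w(x) = sum_i w_i p_i(x) of a w-mixture with prescribed components."""
    return sum(wi * pi(x) for wi, pi in zip(w, components))

# Fixed importance-sampling proposal (heavy-tailed, to cover the Cauchy component) and sample X
q = stats.cauchy(0, 5)
x = q.rvs(size=100000, random_state=rng)

def F_tilde(w):
    """Monte Carlo estimate of F(w) = -h(m_w) = int m_w log m_w dmu."""
    m = mixture_pdf(w, x)
    return np.mean(m * np.log(m) / q.pdf(x))

w1 = np.array([0.5, 0.2, 0.3])
w2 = np.array([0.1, 0.6, 0.3])
# Plug-in estimate of KL[m_w1 : m_w2] from the same sample:
m1, m2 = mixture_pdf(w1, x), mixture_pdf(w2, x)
kl_mc = np.mean(m1 * np.log(m1 / m2) / q.pdf(x))
print(F_tilde(w1), F_tilde(w2), kl_mc)
```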
Principled classes of distances and divergences
(Taxonomy diagram of dissimilarity measures, with an axiomatic approach and exhaustivity characteristics: conformal divergences [18], projective divergences [19], etc.)
- Csiszár f-divergence: I_f(P : Q) = ∫ p(x) f(q(x)/p(x)) dν(x).
- Bregman divergence: B_F(P : Q) = F(P) − F(Q) − ⟨P − Q, ∇F(Q)⟩.
- Total Bregman divergence: tB_F(P : Q) = B_F(P : Q) / √(1 + ‖∇F(Q)‖²).
- Conformal divergence: C_{D,g}(P : Q) = g(Q) D(P : Q); scaled conformal divergence C_{D,g}(· : · ; ·).
- Scaled Bregman divergence: B_F(P : Q; W) = W B_F(P/W : Q/W).
- v-divergence: D_v(P : Q) = D(v(P) : v(Q)).
- Projective divergences (e.g., the γ-divergence, Hyvärinen SM/RM): double-sided D(λp : λ′p′) = D(p : p′), one-sided D(λp : p′) = D(p : p′).
(A small sketch of a total Bregman divergence follows below.)
[18] R. Nock, F. Nielsen, and S.-i. Amari. "On Conformal Divergences and Their Population Minimizers". In: IEEE TIT 62.1 (2016), pp. 527-538; F. Nielsen and R. Nock. "Total Jensen divergences: Definition, properties and clustering". In: ICASSP. 2015, pp. 2016-2020.
[19] F. Nielsen and R. Nock. "Patch Matching with Polynomial Exponential Families and Projective Divergences". In: SISAP. 2016, pp. 109-116; F. Nielsen, K. Sun, and S. Marchand-Maillet. "On Hölder Projective Divergences".

Summary I: Information-geometric structures
- The fundamental information-geometric structure (M, g, ∇, ∇*) is defined by a pair of conjugate affine connections coupled to the metric tensor. It yields a dual parallel transport that is metric-compatible.
- It is not directly related to any application field (but can be realized by statistical models [20]).
- It is not directly related to any distance/divergence (but can be realized by divergences, or canonical divergences can be defined).
- It is not unique, since we can always build a 1D family of α-geometries.
- Conjugate flat connections have geodesics corresponding to straight lines in dual affine coordinates (Legendre convex conjugate potential functions), and a dual Pythagorean theorem characterizes the uniqueness of information projections [21] (Hessian manifolds).
- These pure geometric structures (M, g, ∇, ∇*) ≡ (M, g, C) can be applied in the information sciences (statistics, mathematical programming, etc.). But is the structure fit to the application field?
[20] H. Vân Lê. "Statistical manifolds are statistical models". In: Journal of Geometry 84.1-2 (2006), pp. 83-93.
[21] F. Nielsen. "What is... an Information Projection?" In: Notices of the AMS 65.3 (2018).

Summary II: Application of the structures in Statistics
- In statistics, the Fisher metric is the unique invariant metric tensor.
- In statistics, the expected α-connections arise from invariant f-divergences, and at the infinitesimal scale standard f-divergences are expressed using the Fisher information matrix.
- Principled distances/divergences get an underlying information geometry: conformal divergences [22], projective divergences [23], etc.
- A Riemannian distance is never a divergence (it is not smooth), and a (symmetrized) divergence J = (D + D*)/2 cannot always be powered as J^β to yield a metric distance (although √JS is a metric).
- Information geometry has a major role to play in machine learning/deep learning by studying neuromanifolds.
[22] Nock, Nielsen, and Amari, "On Conformal Divergences and Their Population Minimizers".
[23] Nielsen, Sun, and Marchand-Maillet, "On Hölder Projective Divergences".

Tutorial: "An elementary introduction to information geometry" [24]
https://arxiv.org/abs/1808.08271
[24] Frank Nielsen. "An elementary introduction to information geometry". In: CoRR abs/1808.08271 (2018). arXiv: 1808.08271. URL: http://arxiv.org/abs/1808.08271.
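To illustrate one entry of the taxonomy above (an added sketch, not from the slides): the total Bregman divergence is a conformal Bregman divergence with conformal factor 1/√(1 + ‖∇F(Q)‖²), and with the quadratic generator it becomes invariant under rotations applied jointly to both arguments.

```python
import numpy as np

def bregman(F, gradF, p, q):
    """Bregman divergence B_F(p : q) = F(p) - F(q) - <p - q, grad F(q)>."""
    return F(p) - F(q) - (p - q) @ gradF(q)

def total_bregman(F, gradF, p, q):
    """Total Bregman divergence tB_F(p : q) = B_F(p : q) / sqrt(1 + ||grad F(q)||^2),
    i.e., a conformal divergence C_{B_F, g} with conformal factor g(q) = 1 / sqrt(1 + ||grad F(q)||^2)."""
    return bregman(F, gradF, p, q) / np.sqrt(1.0 + np.dot(gradF(q), gradF(q)))

# With F(x) = (1/2)||x||^2 the total Bregman divergence is invariant to rotating the pair (p, q):
F, gradF = lambda x: 0.5 * x @ x, lambda x: x
p, q = np.array([1.0, 2.0]), np.array([0.0, 1.0])
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
print(total_bregman(F, gradF, p, q), total_bregman(F, gradF, R @ p, R @ q))  # equal values
```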