An elementary introduction to information geometry
28/08/2018
Abstract
We describe the fundamental differential-geometric structures of information manifolds, state
the fundamental theorem of information geometry, and illustrate some uses of these information
manifolds in information sciences. The exposition is self-contained by concisely introducing the
necessary concepts of differential geometry with proofs omitted for brevity.
Authors
Frank Nielsen
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd"> <identifier identifierType="DOI">10.23723/14642/23437</identifier><creators><creator><creatorName>Frank Nielsen</creatorName></creator></creators><titles> <title>An elementary introduction to information geometry</title></titles> <publisher>SEE</publisher> <publicationYear>2018</publicationYear> <resourceType resourceTypeGeneral="Text">Text</resourceType><subjects><subject>statistical manifold</subject><subject>Diﬀerential geometry</subject><subject>metric tensor</subject><subject>aﬃne connection</subject><subject>metric compatibility</subject><subject>conjugate connections</subject><subject>dual metric-compatible parallel transport</subject><subject>information manifold</subject><subject>curvature and ﬂatness</subject><subject>dually ﬂat manifolds</subject><subject>exponential family</subject><subject>mixture family</subject><subject>statistical di- vergence</subject><subject>parameter divergence</subject><subject>separable divergence</subject><subject>Fisher-Rao distance</subject><subject>statistical invariance</subject><subject>Bayesian hypothesis testing</subject><subject>mixture clustering</subject></subjects><dates> <date dateType="Created">Tue 28 Aug 2018</date> <date dateType="Updated">Tue 28 Aug 2018</date> <date dateType="Submitted">Tue 12 Feb 2019</date> </dates> <alternateIdentifiers> <alternateIdentifier alternateIdentifierType="bitstream">fbfd93c42c119dec21ada90c7c6831c4ce86e164</alternateIdentifier> </alternateIdentifiers> <formats> <format>application/pdf</format> </formats> <version>38867</version> <descriptions> <description descriptionType="Abstract"> We describe the fundamental diﬀerential-geometric structures of information manifolds, state<br /> the fundamental theorem of information geometry, and illustrate some uses of these 
information<br /> manifolds in information sciences. The exposition is self-contained by concisely introducing the<br /> necessary concepts of diﬀerential geometry with proofs omitted for brevity. </description> </descriptions> </resource>
An elementary introduction to information geometry
Frank Nielsen
Sony Computer Science Laboratories Inc, Japan
arXiv:1808.08271v1 [cs.LG] 17 Aug 2018

Abstract
We describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some uses of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry with proofs omitted for brevity.

Keywords: Differential geometry, metric tensor, affine connection, metric compatibility, conjugate connections, dual metric-compatible parallel transport, information manifold, statistical manifold, curvature and flatness, dually flat manifolds, exponential family, mixture family, statistical divergence, parameter divergence, separable divergence, Fisher-Rao distance, statistical invariance, Bayesian hypothesis testing, mixture clustering.

Contents
1 Introduction
1.1 Overview of information geometry
1.2 Outline
2 Prerequisite: Basics of differential geometry
2.1 Overview of differential geometry
2.2 Metric tensor fields $g$
2.3 Affine connections $\nabla$
2.3.1 Covariant derivatives $\nabla_X Y$ of vector fields
2.3.2 Parallel transport $\prod_c^{\nabla}$ along a smooth curve $c$
2.3.3 $\nabla$-geodesics $\gamma_\nabla$: Autoparallel curves
2.3.4 Curvature and torsion of a manifold
2.4 The fundamental theorem of Riemannian geometry: The Levi-Civita metric connection
2.5 Preview: Information geometry versus Riemannian geometry
3 Information manifolds
3.1 Overview
3.2 Conjugate connection manifolds: $(M, g, \nabla, \nabla^*)$
3.3 Statistical manifolds: $(M, g, C)$
3.4 A family $\{(M, g, \nabla^{-\alpha}, \nabla^{\alpha} = (\nabla^{-\alpha})^*)\}_{\alpha\in\mathbb{R}}$ of conjugate connection manifolds
3.5 The fundamental theorem of information geometry: $\nabla$ $\kappa$-curved $\Leftrightarrow$ $\nabla^*$ $\kappa$-curved
3.6 Conjugate connections from divergences: $(M, D) \equiv (M, {}^{D}g, {}^{D}\nabla, {}^{D}\nabla^* = {}^{D^*}\nabla)$
3.7 Dually flat manifolds (Bregman geometry): $(M, F) \equiv (M, {}^{B_F}g, {}^{B_F}\nabla, {}^{B_F}\nabla^* = {}^{B_F^*}\nabla)$
3.8 Expected $\alpha$-manifolds of a family of parametric probability distributions: $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{-\alpha}, {}_{\mathcal{P}}\nabla^{\alpha})$
3.9 Criteria for statistical invariance
3.10 Fisher-Rao expected Riemannian manifolds: $(\mathcal{P}, {}_{\mathcal{P}}g)$
3.11 The monotone $\alpha$-embeddings
4 Some illustrating applications of dually flat manifolds
4.1 Hypothesis testing in the dually flat exponential family manifold $(\mathcal{E}, \mathrm{KL}^*)$
4.2 Clustering mixtures in the dually flat mixture family manifold $(\mathcal{M}, \mathrm{KL})$
5 Conclusion: Summary, historical background, and perspectives
5.1 Summary
5.2 A brief historical review of information geometry
5.3 Perspectives
Notations

1 Introduction
1.1 Overview of information geometry
We present a concise and modern view of the basic structures lying at the heart of Information Geometry (IG), and report some applications of those information-geometric manifolds (termed “information manifolds”) in statistics (Bayesian hypothesis testing) and machine learning (statistical mixture clustering).
By analogy to Information Theory (IT) pioneered by Claude Shannon [62] (in 1948) which considers primarily the communication of messages over noisy transmission channels, we may define Information Sciences as the fields that study “communication” between (noisy/imperfect) data and families of models (postulated as a priori knowledge). In short, Information Sciences (IS) seek methods to distill information from data to models. Thus, information sciences encompass information theory but also include Probability & Statistics, Machine Learning (ML), Artificial Intelligence (AI), Mathematical Programming, just to name a few areas.
In §5.2, we review some key milestones of information geometry and report some definitions of the field by its pioneers. A modern and broad definition of information geometry can be stated as the field that studies the geometry of decision making. This definition also includes model fitting (inference) that can be interpreted as a decision problem as illustrated in Figure 1: Namely, deciding which model parameter to choose from a family of parametric models. This framework was advocated by Abraham Wald [72, 73, 17] who considered all statistical problems as statistical decision problems. Distances play a crucial role not only for measuring the goodness-of-fit of data to model (say, likelihood in statistics, classifier loss functions in ML, objective functions in mathematical programming, etc.)
but also for measuring the discrepancy (or deviance) between models.
Why adopt a geometric approach? Geometry allows one to study invariance and equivariance¹ of “figures” in a coordinate-free approach. The geometric language (e.g., ball or projection) also provides affordances that help us reason intuitively about problems. Note that although figures can be visualized (i.e., plotted in coordinate charts), they should be thought of as purely abstract objects, namely, geometric figures.

¹ For example, the triangle centroid is equivariant under affine transformation. In Statistics, the Maximum Likelihood Estimator (MLE) is equivariant. Let $t(\theta)$ denote a monotonic transformation of the model parameter $\theta$. Then we have $\widehat{t(\theta)} = t(\hat{\theta})$, where the MLE is denoted by $\hat{\cdot}$.

Figure 1: The parameter inference $\hat{\theta}$ of a model from data $D$ can also be interpreted as a decision making problem: Decide which parameter of a parametric family of models $M = \{m_\theta\}_{\theta\in\Theta}$ best suits the data. Information geometry provides a differential-geometric manifold structure to $M$ useful for designing and studying decision rules.

1.2 Outline
The paper is organized as follows:
In the first part (§2), we start by concisely introducing the necessary background of differential geometry in order to define a manifold $(M, g, \nabla)$ equipped with a metric tensor $g$ and an affine connection $\nabla$. We explain how this framework generalizes the Riemannian manifolds $(M, g)$ by stating the fundamental theorem of Riemannian geometry that defines a unique torsion-free metric-compatible Levi-Civita connection from the metric tensor.
In the second part (§3), we explain the dualistic structures of information manifolds: We present the conjugate connection manifolds $(M, g, \nabla, \nabla^*)$, the statistical manifolds $(M, g, C)$ where $C$ is a cubic tensor, and show how to derive a family of information manifolds $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ for $\alpha \in \mathbb{R}$ provided any given pair $(\nabla = \nabla^{-1}, \nabla^* = \nabla^{1})$ of conjugate connections. We explain how to get conjugate connections from any smooth (potentially asymmetric) distances (called divergences), present the dually flat manifolds obtained when considering Bregman divergences, and define, when dealing with a parametric family of probability models, the exponential connection ${}^{e}\nabla$ and the mixture connection ${}^{m}\nabla$ that are coupled to the Fisher information metric. We discuss the concept of statistical invariance for the metric tensor and the notion of information monotonicity for statistical divergences. It follows that the Fisher metric is the unique invariant metric (up to a scaling factor), and that the $f$-divergences are the unique separable invariant divergences.
In the third part (§4), we illustrate these information-geometric structures with two simple applications: In the first application, we consider Bayesian hypothesis testing and show how Chernoff information, which defines the best error exponent, can be geometrically characterized on the dually flat structure of an exponential family manifold. In the second application, we show how to cluster statistical mixtures sharing the same component distributions on the dually flat mixture family manifold.
Finally, we conclude in §5 by summarizing the important concepts and structures of information geometry, and by providing references and textbooks [12, 4] on more advanced structures and applications for further reading. We mention recent studies of generic classes of distances/divergences.
At the beginning of each part, we outline its contents. A summary of notations is provided on page 34.

2 Prerequisite: Basics of differential geometry
In §2.1, we review the basics of Differential Geometry (DG) for defining a manifold $(M, g, \nabla)$ equipped with both a metric tensor $g$ and an affine connection $\nabla$. We explain these two independent metric/connection structures in §2.2 and in §2.3, respectively.
From a connection $\nabla$, we show how to derive the notion of covariant derivative in §2.3.1, parallel transport in §2.3.2 and geodesics in §2.3.3. We further explain the intrinsic curvature and torsion of manifolds induced by the connection in §2.3.4, and state the fundamental theorem of Riemannian geometry in §2.4: The existence of a unique torsion-free Levi-Civita metric connection ${}^{LC}\nabla$ that can be calculated from the metric. Thus Riemannian geometry $(M, g)$ is obtained as a special case of the more general manifold structure $(M, g, \nabla)$ by taking $\nabla = {}^{LC}\nabla$: $(M, g) \equiv (M, g, {}^{LC}\nabla)$. Information geometry shall further consider a dual structure $(M, g, \nabla^*)$ associated to $(M, g, \nabla)$, and the pair of dual structures shall form an information manifold $(M, g, \nabla, \nabla^*)$.

2.1 Overview of differential geometry
Informally speaking, a smooth $D$-dimensional manifold $M$ is a topological space that locally behaves like the Euclidean space $\mathbb{R}^D$. Geometric objects (e.g., points and vector fields) and entities (e.g., functions and differential operators) live on $M$, and are coordinate-free but can conveniently be expressed in any local coordinate² system of an atlas $A = \{(U_i, x_i)\}_i$ of charts $(U_i, x_i)$ (fully covering the manifold) for calculations. A $C^k$ manifold is obtained when the change-of-chart transformations are $C^k$. The manifold is said to be smooth when it is $C^\infty$. At each point $p \in M$, a tangent plane $T_p$ locally best linearizes the manifold. On any smooth manifold $M$, we can define two independent structures:
1. a metric tensor $g$, and
2. an affine connection $\nabla$.
The metric tensor $g$ induces on each tangent plane $T_p$ an inner product space that allows one to measure vector magnitudes (vector “lengths”) and angles/orthogonality between vectors. The affine connection $\nabla$ is a differential operator that allows one to define:
1. the covariant derivative operator which provides a way to calculate differentials of a vector field $Y$ with respect to another vector field $X$: Namely, the covariant derivative $\nabla_X Y$,
2. the parallel transport $\prod_c^{\nabla}$ which defines a way to transport vectors on tangent planes along any smooth curve $c$,
3. the notion of $\nabla$-geodesics $\gamma_\nabla$ which are defined as autoparallel curves, thus extending the ordinary notion of Euclidean straightness,
4. the intrinsic curvature and torsion of the manifold.

² René Descartes (1596-1650) allegedly invented the Cartesian coordinate system while wondering how to locate a fly on the ceiling from his bed. In practice, we shall use the most expedient coordinate system to facilitate calculations.

2.2 Metric tensor fields $g$
The tangent bundle³ of $M$ is defined as the “union” of all tangent spaces:
$$TM := \cup_p T_p = \{(p, v),\ p \in M,\ v \in T_p\} \tag{1}$$
A tangent vector $v$ plays the role of a directional derivative⁴, with $vf$ informally meaning the derivative of a smooth function $f$ (belonging to the space of smooth functions $\mathcal{F}(M)$) along the direction $v$. A smooth vector field $X$ is defined as a “cross-section” of the tangent bundle: $X \in \mathfrak{X}(M) = \Gamma(TM)$, where $\mathfrak{X}(M)$ or $\Gamma(TM)$ denote the space of smooth vector fields. A basis $B = \{b_1, \ldots, b_D\}$ of a finite $D$-dimensional vector space is a maximal linearly independent set of vectors.⁵ Tangent spaces carry algebraic structures of vector spaces.⁶ Using local coordinates on a chart $(U, x)$, the vector field $X$ can be expressed as $X = \sum_{i=1}^{D} X^i e_i \overset{\Sigma}{=} X^i e_i$ using the Einstein summation convention on dummy indices (using notation $\overset{\Sigma}{=}$), where $(X)_B := (X^i)$ denotes the contravariant vector components (manipulated as “column vectors” in algebra) in the natural basis $B = \{e_1 = \partial_1, \ldots, e_D = \partial_D\}$ with $\partial_i :=: \frac{\partial}{\partial x^i}$. A tangent plane (vector space) equipped with an inner product $\langle\cdot,\cdot\rangle$ yields an inner product space.
We define a reciprocal basis $B^* = \{e^{*i} = \partial^i\}_i$ of $B = \{e_i = \partial_i\}_i$ so that vectors can also be expressed using the covariant vector components in the natural reciprocal basis. The primal and reciprocal bases are mutually orthogonal by construction as illustrated in Figure 2.
For any vector $v$, its contravariant components $v^i$ (superscript notation) and its covariant components $v_i$ (subscript notation) can be retrieved from $v$ using the inner product with the use of the reciprocal and primal basis, respectively:
$$v^i = \langle v, e^{*i}\rangle, \tag{2}$$
$$v_i = \langle v, e_i\rangle. \tag{3}$$
The inner product defines a metric tensor $g$ and a dual metric tensor $g^*$:
$$g_{ij} := \langle e_i, e_j\rangle, \tag{4}$$
$$g^{*ij} := \langle e^{*i}, e^{*j}\rangle. \tag{5}$$

³ The tangent bundle is a particular example of a fiber bundle with base manifold $M$.
⁴ Since the manifolds are abstract and not embedded in some Euclidean space, we do not view a vector as an “arrow” anchored on the manifold. Rather, vectors can be understood in several ways in differential geometry, like directional derivatives or equivalence classes of smooth curves at a point. That is, tangent spaces shall be considered abstract too, like the manifold.
⁵ A set of vectors $B = \{b_1, \ldots, b_D\}$ is linearly independent iff $\sum_{i=1}^{D} \lambda_i b_i = 0$ implies $\lambda_i = 0$ for all $i \in [D]$. That is, in a linearly independent vector set, no vector of the set can be represented as a linear combination of the remaining vectors. A linearly independent vector set is maximal when we cannot add another linearly independent vector.
⁶ Furthermore, to any vector space $V$, we can associate a dual covector space $V^*$ which is the vector space of real-valued linear mappings. We do not enter into details here to preserve this gentle introduction to information geometry with as little intricacy as possible.

Figure 2: Primal and reciprocal basis of an inner product $\langle\cdot,\cdot\rangle$ space, with $\langle e_i, e^{*j}\rangle = \delta_i^j$. The primal/reciprocal bases are mutually orthogonal: $e^{*1}$ is orthogonal to $e_2$, and $e_1$ is orthogonal to $e^{*2}$.
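As a concrete check of Eqs. 2-5, the following minimal sketch builds the reciprocal basis of a non-orthogonal basis of the Euclidean plane. It assumes that the inner product is the ordinary dot product and obtains the reciprocal vectors from the inverse Gram matrix (the relation $G^* = G^{-1}$ that the text states just below). The basis vectors, the test vector and all helper names are illustrative choices, not taken from the paper.

```python
# Sketch: primal and reciprocal bases in the Euclidean plane (illustrative values).

def dot(u, v):
    """Ordinary Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# A non-orthogonal primal basis {e_1, e_2} (hypothetical example values).
e = [(1.0, 0.0), (1.0, 2.0)]

# Metric components g_ij = <e_i, e_j> (Eq. 4) and their matrix inverse.
G = [[dot(e[i], e[j]) for j in range(2)] for i in range(2)]
G_inv = inverse_2x2(G)

# Reciprocal basis vectors e^{*i} = sum_j (G^{-1})^{ij} e_j.
e_star = [tuple(sum(G_inv[i][j] * e[j][k] for j in range(2)) for k in range(2))
          for i in range(2)]

# Mutual orthogonality: <e_i, e^{*j}> should be the Kronecker delta.
delta = [[dot(e[i], e_star[j]) for j in range(2)] for i in range(2)]

# Contravariant components v^i = <v, e^{*i}> (Eq. 2) and
# covariant components v_i = <v, e_i> (Eq. 3) of a vector v.
v = (3.0, 4.0)
v_contra = [dot(v, e_star[i]) for i in range(2)]
v_co = [dot(v, e[i]) for i in range(2)]

# Either set of components reconstructs v in the matching basis.
v_from_contra = tuple(sum(v_contra[i] * e[i][k] for i in range(2)) for k in range(2))
v_from_co = tuple(sum(v_co[i] * e_star[i][k] for i in range(2)) for k in range(2))
```

The contravariant components pair with the primal basis and the covariant components pair with the reciprocal basis, which is why both reconstructions return the same vector.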
Technically speaking, the metric tensor $g$ is a 2-covariant tensor⁷ field:
$$g \overset{\Sigma}{=} g_{ij}\, dx^i \otimes dx^j, \tag{6}$$
where $\otimes$ is the dyadic tensor product performed on the pairwise covector basis $\{dx^i\}_i$ (the covectors corresponding to the reciprocal vector basis). Let $G = [g_{ij}]$ and $G^* = [g^{*ij}]$ denote the $D \times D$ matrices. It follows by construction of the reciprocal basis that $G^* = G^{-1}$. The reciprocal basis vectors $e^{*i}$ and primal basis vectors $e_i$ can be expressed using the dual metric $g^*$ and metric $g$ on the primal basis vectors $e_j$ and reciprocal basis vectors $e^{*j}$, respectively:
$$e^{*i} \overset{\Sigma}{=} g^{*ij} e_j, \tag{7}$$
$$e_i \overset{\Sigma}{=} g_{ij} e^{*j}. \tag{8}$$
The metric tensor field $g$ (“metric tensor” or “metric” for short) defines a smooth symmetric positive-definite bilinear form on the tangent bundle so that for $u, v \in T_p$, $g(u, v) \in \mathbb{R}$ and $g(u, u) \geq 0$. We can also write equivalently $g_p(u, v) :=: \langle u, v\rangle_p :=: \langle u, v\rangle_{g(p)} :=: \langle u, v\rangle$. Two vectors $u$ and $v$ are said to be orthogonal, denoted by $u \perp v$, iff $\langle u, v\rangle = 0$. The length of a vector is induced from the norm $\|u\|_p :=: \|u\|_{g(p)} = \sqrt{\langle u, u\rangle_{g(p)}}$. Using local coordinates of a chart $(U, x)$, we get the vector contravariant/covariant components, and compute the metric tensor using matrix algebra (with column vectors by convention) as follows:
$$g(u, v) = (u)_B^\top \times G_{x(p)} \times (v)_B = (u)_{B^*}^\top \times G^{-1}_{x(p)} \times (v)_{B^*}, \tag{9}$$
since it follows from the primal/reciprocal basis that $G \times G^* = I$, the identity matrix. Thus on any tangent plane $T_p$, we get a Mahalanobis distance:
$$M_G(u, v) := \|u - v\|_G = \sqrt{\sum_{i,j} (u^i - v^i)(u^j - v^j) G_{ij}}. \tag{10}$$
The inner product of two vectors $u$ and $v$ is a scalar (a 0-tensor) that can be equivalently calculated as:
$$\langle u, v\rangle := g(u, v) \overset{\Sigma}{=} u_i v^i \overset{\Sigma}{=} u^i v_i. \tag{11}$$

⁷ We do not describe tensors in detail for the sake of brevity. A tensor is a geometric entity of a tensor space that can also be interpreted as a multilinear map. A contravariant vector lives in a vector space while a covariant vector lives in the dual covector space. We recommend the book [31] for a concise and well-explained description of tensors.

A metric tensor $g$ of manifold $M$ is said to be conformal when $\langle\cdot,\cdot\rangle_p = \kappa(p)\langle\cdot,\cdot\rangle_{\mathrm{Euclidean}}$. That is, when the inner product at each point is the Euclidean dot product scaled by a positive scalar function $\kappa(\cdot)$. In conformal geometry, we can measure angles between vectors in tangent planes as if we were in a Euclidean space, without any deformation. This is handy for checking orthogonality (in charts). For example, the Poincaré disk model of hyperbolic geometry is conformal but the Klein disk model is not conformal (except at the origin), see [44].

2.3 Affine connections $\nabla$
An affine connection $\nabla$ is a differential operator defined on a manifold that allows us to define a covariant derivative of vector fields, a parallel transport of vectors on tangent planes along a smooth curve, and geodesics. Furthermore, an affine connection fully characterizes the curvature and torsion of a manifold.

2.3.1 Covariant derivatives $\nabla_X Y$ of vector fields
A connection defines a covariant derivative operator that tells us how to differentiate a vector field $Y$ according to another vector field $X$. The covariant derivative operator is denoted using the traditional gradient symbol $\nabla$. Thus a covariant derivative $\nabla$ is a function:
$$\nabla : \mathfrak{X}(M) \times \mathfrak{X}(M) \to \mathfrak{X}(M), \tag{12}$$
that has its own special subscript notation $\nabla_X Y :=: \nabla(X, Y)$ for indicating that it is differentiating a vector field $Y$ according to another vector field $X$.
By prescribing $D^3$ smooth functions $\Gamma^k_{ij} = \Gamma^k_{ij}(p)$, called the Christoffel symbols of the second kind, we define the unique affine connection $\nabla$ that satisfies in local coordinates of chart $(U, x)$ the following equations:
$$\nabla_{\partial_i} \partial_j = \Gamma^k_{ij} \partial_k. \tag{13}$$
The Christoffel symbols can also be written as $\Gamma^k_{ij} := (\nabla_{\partial_i} \partial_j)^k$, where $(\cdot)^k$ denotes the $k$-th coordinate. The $k$-th component $(\nabla_X Y)^k$ of the covariant derivative of vector field $Y$ with respect to vector field $X$ is given by:
$$(\nabla_X Y)^k \overset{\Sigma}{=} X^i (\nabla_i Y)^k \overset{\Sigma}{=} X^i \left( \frac{\partial Y^k}{\partial x^i} + \Gamma^k_{ij} Y^j \right). \tag{14}$$
The Christoffel symbols are not tensors (fields) because the transformation rules induced by a change of basis do not obey the tensor contravariant/covariant rules.

2.3.2 Parallel transport $\prod_c^{\nabla}$ along a smooth curve $c$
Since the manifold is not embedded⁸ in a Euclidean space, we cannot add a vector $v \in T_p$ to a vector $v' \in T_{p'}$ as the tangent vector spaces are unrelated to each other without a connection.⁹ Thus a connection $\nabla$ defines how to associate vectors between infinitesimally close tangent planes $T_p$ and $T_{p+dp}$. Then the connection allows us to smoothly transport a vector $v \in T_p$ by sliding it (with infinitesimal moves) along a smooth curve $c(t)$ (with $c(0) = p$ and $c(1) = q$), so that the vector $v_p \in T_p$ “corresponds” to a vector $v_q \in T_q$: This is called the parallel transport. This mathematical prescription is necessary in order to study dynamics on manifolds (e.g., study the motion of a particle on the manifold). We can express the parallel transport along the smooth curve $c$ as:
$$\forall v \in T_p,\ \forall t \in [0, 1],\quad v_{c(t)} = \prod^{\nabla}_{c(0) \to c(t)} v \in T_{c(t)} \tag{15}$$
The parallel transport is schematically illustrated in Figure 3.

⁸ Whitney embedding theorem states that any $D$-dimensional Riemannian manifold can be embedded into $\mathbb{R}^{2D}$.
⁹ When embedded, we can implicitly use the ambient Euclidean connection ${}^{\mathrm{Euc}}\nabla$, see [1].

Figure 3: Illustration of the parallel transport of vectors on tangent planes along a smooth curve. For a smooth curve $c$, with $c(0) = p$ and $c(1) = q$, a vector $v_p \in T_p$ is parallel transported smoothly to a vector $v_q = \prod_c^{\nabla} v_p \in T_q$ such that for any $t \in [0, 1]$, we have $v_{c(t)} \in T_{c(t)}$.

2.3.3 $\nabla$-geodesics $\gamma_\nabla$: Autoparallel curves
A connection $\nabla$ allows one to define $\nabla$-geodesics as autoparallel curves, that are curves $\gamma$ such that we have:
$$\nabla_{\dot\gamma} \dot\gamma = 0. \tag{16}$$
That is, the velocity vector $\dot\gamma$ is moving along the curve parallel to itself: In other words, $\nabla$-geodesics generalize the notion of “straight Euclidean” lines.
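To make the autoparallel condition of Eq. 16 concrete, here is a minimal numerical sketch. It assumes the Euclidean plane written in polar coordinates $(r, \theta)$ with its usual flat connection, whose only non-zero coefficients are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$, and it integrates the coordinate form of the geodesic equation, $\ddot\gamma^k + \Gamma^k_{ij}\dot\gamma^i\dot\gamma^j = 0$, with a hand-written Runge-Kutta step. The function names, step count and initial conditions are illustrative assumptions. If the connection coefficients are right, the integrated curve should be a straight line once mapped back to Cartesian coordinates.

```python
import math

def geodesic_rhs(state):
    """Right-hand side of the geodesic ODE in polar coordinates.

    state = (r, theta, dr, dtheta). Assumed connection coefficients:
    Gamma^r_{theta theta} = -r, Gamma^theta_{r theta} = Gamma^theta_{theta r} = 1/r.
    """
    r, theta, dr, dtheta = state
    ddr = r * dtheta * dtheta           # -Gamma^r_{theta theta} * dtheta^2
    ddtheta = -2.0 * dr * dtheta / r    # -2 * Gamma^theta_{r theta} * dr * dtheta
    return (dr, dtheta, ddr, ddtheta)

def rk4_step(f, state, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * h * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + h * k for s, k in zip(state, k3)))
    return tuple(s + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def integrate_geodesic(state, t_end=1.0, steps=1000):
    """Integrate the geodesic ODE from t = 0 to t = t_end."""
    h = t_end / steps
    for _ in range(steps):
        state = rk4_step(geodesic_rhs, state, h)
    return state

# Start at the Cartesian point (1, 0) with Cartesian velocity (0, 1),
# i.e. r = 1, theta = 0, dr = 0, dtheta = 1 in polar coordinates.
r_end, theta_end, _, _ = integrate_geodesic((1.0, 0.0, 0.0, 1.0))
x_end = r_end * math.cos(theta_end)
y_end = r_end * math.sin(theta_end)
```

With these initial conditions the Euclidean straight line is $x(t) = 1$, $y(t) = t$, so the end point at $t = 1$ should be close to $(1, 1)$ even though neither polar coordinate varies linearly in $t$.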
In local coordinates $(U, x)$, $\gamma(t) = (\gamma^k(t))_k$, the autoparallelism amounts to solving the following second-order Ordinary Differential Equations (ODEs):
$$\ddot\gamma^k(t) + \Gamma^k_{ij}\, \dot\gamma^i(t)\, \dot\gamma^j(t) = 0, \quad \gamma^l(t) = x^l \circ \gamma(t), \tag{17}$$
where $\Gamma^k_{ij}$ are the Christoffel symbols of the second kind, with:
$$\Gamma^k_{ij} \overset{\Sigma}{=} \Gamma_{ij,l}\, g^{lk}, \quad \Gamma_{ij,k} \overset{\Sigma}{=} g_{lk} \Gamma^l_{ij}, \tag{18}$$
where $\Gamma_{ij,l}$ are the Christoffel symbols of the first kind. Geodesics are 1D autoparallel submanifolds and $\nabla$-hyperplanes are defined similarly as autoparallel submanifolds of dimension $D - 1$. We may specify in subscript the connection that yields the geodesic $\gamma$: $\gamma_\nabla$.

Figure 4: Parallel transport with respect to the metric connection: Curvature effect can be visualized as the angle defect along the parallel transport on smooth (infinitesimal) loops. For a sphere manifold, a vector parallel-transported along a loop does not coincide with itself, while it always coincides with itself for a (flat) manifold. Drawings are courtesy of © CNRS, http://images.math.cnrs.fr/Visualiser-la-courbure.html

2.3.4 Curvature and torsion of a manifold
An affine connection $\nabla$ defines a 4D¹⁰ Riemann-Christoffel curvature tensor $R$ (expressed using components $R^i_{jkl}$ of a $(1, 3)$-tensor). The coordinate-free equation of the curvature tensor is given by:
$$R(X, Y)Z := \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X,Y]} Z, \tag{19}$$
where $[X, Y](f) = X(Y(f)) - Y(X(f))$ ($\forall f \in \mathcal{F}(M)$) is the Lie bracket of vector fields.
A manifold $M$ equipped with a connection $\nabla$ is said to be flat (meaning $\nabla$-flat) when $R = 0$. This holds in particular when there exists a particular¹¹ coordinate system $x$ of a chart $(U, x)$ such that $\Gamma^k_{ij} = 0$, i.e., when all connection coefficients vanish.
A manifold is torsion-free when the connection is symmetric. A symmetric connection satisfies the following coordinate-free equation:
$$\nabla_X Y - \nabla_Y X = [X, Y]. \tag{20}$$
Using local chart coordinates, this amounts to checking that $\Gamma^k_{ij} = \Gamma^k_{ji}$. The torsion tensor is a $(1, 2)$-tensor defined by:
$$T(X, Y) := \nabla_X Y - \nabla_Y X - [X, Y]. \tag{21}$$
In general, the parallel transport is path-dependent. The angle defect of a vector transported on an infinitesimal closed loop (a smooth curve with coinciding extremities) is related to the curvature. However, for a flat connection, the parallel transport does not depend on the path. Figure 4 illustrates the parallel transport along a curve for a curved manifold (the sphere manifold) and a flat manifold (the cylinder manifold¹²).

¹⁰ It follows from symmetry constraints that the number of independent components of the Riemann tensor is $\frac{D^2(D^2-1)}{12}$ in $D$ dimensions.
¹¹ For example, the Christoffel symbols vanish in a rectangular coordinate system of a plane but not in the polar coordinate system of it.
¹² The Gaussian curvature at a point of a surface is the product of the minimal and maximal principal curvatures: $\kappa_G := \kappa_{\min}\kappa_{\max}$. For a cylinder, since $\kappa_{\min} = 0$, it follows that the Gaussian curvature of a cylinder is 0.
Gauss’s Theorema Egregium (meaning “remarkable theorem”) proved that the Gaussian curvature is intrinsic and does not depend on how the surface is embedded into the ambient Euclidean space.

2.4 The fundamental theorem of Riemannian geometry: The Levi-Civita metric connection
By definition, an affine connection $\nabla$ is said to be metric compatible with $g$ when it satisfies for any triple $(X, Y, Z)$ of vector fields the following equation:
$$X\langle Y, Z\rangle = \langle \nabla_X Y, Z\rangle + \langle Y, \nabla_X Z\rangle, \tag{22}$$
which can be written equivalently as:
$$X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z) \tag{23}$$
Using local coordinates and natural basis $\{\partial_i\}$ for vector fields, the metric-compatibility property amounts to checking that we have:
$$\partial_k g_{ij} = \langle \nabla_{\partial_k} \partial_i, \partial_j\rangle + \langle \partial_i, \nabla_{\partial_k} \partial_j\rangle \tag{24}$$
A property of using a metric-compatible connection is that the parallel transport $\prod^{\nabla}$ of vectors preserves the metric:
$$\langle u, v\rangle_{c(0)} = \left\langle \prod^{\nabla}_{c(0) \to c(t)} u,\ \prod^{\nabla}_{c(0) \to c(t)} v \right\rangle_{c(t)} \quad \forall t. \tag{25}$$
That is, the parallel transport preserves angles (and orthogonality) and lengths of vectors in tangent planes when transported along a smooth curve.
The fundamental theorem of Riemannian geometry states the existence of a unique torsion-free metric compatible connection:

Theorem 1 (Levi-Civita metric connection) There exists a unique torsion-free affine connection compatible with the metric called the Levi-Civita connection ${}^{LC}\nabla$. The Christoffel symbols of the Levi-Civita connection can be expressed from the metric tensor $g$ as follows:
$${}^{LC}\Gamma^k_{ij} \overset{\Sigma}{=} \frac{1}{2} g^{kl} \left( \partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij} \right), \tag{26}$$
where $g^{ij}$ denote the matrix elements of the inverse matrix $g^{-1}$.

The Levi-Civita connection can also be defined coordinate-free with the Koszul formula:
$$2g(\nabla_X Y, Z) = X(g(Y, Z)) + Y(g(X, Z)) - Z(g(X, Y)) + g([X, Y], Z) - g([X, Z], Y) - g([Y, Z], X). \tag{27}$$
There exist metric-compatible connections with torsion, studied in theoretical physics. See for example the flat Weitzenböck connection [9].
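The formula of Theorem 1 can be checked numerically. The sketch below assumes the index-balanced form ${}^{LC}\Gamma^k_{ij} = \frac{1}{2} g^{kl}(\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij})$ and applies it to the Euclidean plane in polar coordinates, where $g = \mathrm{diag}(1, r^2)$ and the expected non-zero coefficients are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = 1/r$. The partial derivatives of the metric are approximated by central finite differences; the function names, the step size and the evaluation point are illustrative assumptions, and the helper only handles two-dimensional metrics.

```python
def polar_metric(x):
    """Matrix [g_ij] of the Euclidean plane in polar coordinates x = (r, theta)."""
    r, _ = x
    return [[1.0, 0.0], [0.0, r * r]]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def metric_derivatives(metric, x, h=1e-5):
    """Return dg with dg[l][i][j] ~ d g_ij / d x^l (central differences)."""
    dim = len(x)
    dg = []
    for l in range(dim):
        xp, xm = list(x), list(x)
        xp[l] += h
        xm[l] -= h
        gp, gm = metric(xp), metric(xm)
        dg.append([[(gp[i][j] - gm[i][j]) / (2.0 * h) for j in range(dim)]
                   for i in range(dim)])
    return dg

def levi_civita_christoffel(metric, x):
    """Gamma[k][i][j] = 1/2 g^{kl} (d_i g_jl + d_j g_il - d_l g_ij), for dim == 2."""
    dim = len(x)
    g_inv = inverse_2x2(metric(x))
    dg = metric_derivatives(metric, x)
    return [[[0.5 * sum(g_inv[k][l] * (dg[i][j][l] + dg[j][i][l] - dg[l][i][j])
                        for l in range(dim))
              for j in range(dim)]
             for i in range(dim)]
            for k in range(dim)]

# Evaluate at an arbitrary point (r, theta) = (2.0, 0.3); index 0 is r, index 1 is theta.
r0 = 2.0
Gamma = levi_civita_christoffel(polar_metric, (r0, 0.3))
gamma_r_theta_theta = Gamma[0][1][1]   # expected close to -r0
gamma_theta_r_theta = Gamma[1][0][1]   # expected close to 1 / r0
```

The same routine can be pointed at any other two-dimensional metric function; only `polar_metric` is specific to this example.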
The metric tensor $g$ induces the torsion-free metric-compatible Levi-Civita connection that determines the local structure of the manifold. However, the metric $g$ does not fix the global topological structure: For example, although a cone and a cylinder have locally the same flat Euclidean metric, they exhibit different global structures.

2.5 Preview: Information geometry versus Riemannian geometry
In information geometry, we consider a pair of conjugate affine connections $\nabla$ and $\nabla^*$ (often but not necessarily torsion-free) that are coupled to the metric $g$: The structure is conventionally written as $(M, g, \nabla, \nabla^*)$. The key property is that those conjugate connections are metric compatible as a dual pair, and therefore the induced dual parallel transport preserves the metric:
$$\langle u, v\rangle_{c(0)} = \left\langle \prod^{\nabla}_{c(0) \to c(t)} u,\ \prod^{\nabla^*}_{c(0) \to c(t)} v \right\rangle_{c(t)}. \tag{28}$$
Thus the Riemannian manifold $(M, g)$ can be interpreted as the self-dual information-geometric manifold obtained for $\nabla = \nabla^* = {}^{LC}\nabla$, the unique torsion-free Levi-Civita metric connection: $(M, g) \equiv (M, g, {}^{LC}\nabla, {}^{LC}\nabla^* = {}^{LC}\nabla)$. However, let us point out that for a pair of self-dual Levi-Civita conjugate connections, the information-geometric manifold does not induce a distance. This contrasts with the Riemannian modeling $(M, g)$ which provides a Riemannian metric distance $D_\rho(p, q)$ defined by the length of the geodesic $\gamma$ connecting the two points $p = \gamma(0)$ and $q = \gamma(1)$ (shortest path):
$$D_\rho(p, q) := \int_0^1 \|\gamma'(t)\|_{\gamma(t)}\, dt, \tag{29}$$
$$= \int_0^1 \sqrt{\dot\gamma(t)^\top g_{\gamma(t)}\, \dot\gamma(t)}\, dt. \tag{30}$$
Usually, this Riemannian geodesic distance is not available in closed-form (and needs to be approximated or bounded) because the geodesics cannot be explicitly parameterized (see geodesic shooting methods [7]).
We are now ready to introduce the key geometric structures of information geometry.

3 Information manifolds
3.1 Overview
In this part, we explain the dualistic structures of manifolds in information geometry.
In §3.2, we first present the core Conjugate Connection Manifolds (CCMs) $(M, g, \nabla, \nabla^*)$, and show how to build Statistical Manifolds (SMs) $(M, g, C)$ from a CCM in §3.3. From any statistical manifold, we can build a 1-parameter family $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ of CCMs, the information $\alpha$-manifolds. We state the fundamental theorem of information geometry in §3.5. These CCM and SM structures are not related to any distance a priori but require at first a pair $(\nabla, \nabla^*)$ of conjugate connections coupled to a metric tensor $g$. We show two methods to build an initial pair of conjugate connections.
A first method consists in building a pair of conjugate connections $({}^{D}\nabla, {}^{D}\nabla^*)$ from any divergence $D$ in §3.6. Thus we obtain self-conjugate connections when the divergence is symmetric: $D(\theta_1 : \theta_2) = D(\theta_2 : \theta_1)$. When the divergences are Bregman divergences (i.e., $D = B_F$ for a strictly convex and differentiable Bregman generator), we obtain Dually Flat Manifolds (DFMs) $(M, \nabla^2 F, {}^{F}\nabla, {}^{F}\nabla^*)$ in §3.7. DFMs nicely generalize the Euclidean geometry and exhibit Pythagorean theorems. We further characterize when the orthogonal ${}^{F}\nabla$-projection and the dual ${}^{F}\nabla^*$-projection of a point onto a submanifold are unique.¹³
A second method to get a pair of conjugate connections $({}^{e}\nabla, {}^{m}\nabla)$ consists in defining these connections from a regular parametric family of probability distributions $\mathcal{P} = \{p_\theta(x)\}_\theta$. In that case, the ‘e’xponential connection ${}^{e}\nabla$ and the ‘m’ixture connection ${}^{m}\nabla$ are coupled to the Fisher information metric ${}_{\mathcal{P}}g$. A statistical manifold $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}C)$ can be recovered by considering the skewness Amari-Chentsov cubic tensor ${}_{\mathcal{P}}C$, and it follows a 1-parameter family of CCMs, $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{-\alpha}, {}_{\mathcal{P}}\nabla^{+\alpha})$, the statistical expected $\alpha$-manifolds. In this parametric statistical context, these information manifolds are called expected information manifolds because the various quantities are expressed from statistical expectations $E_\cdot[\cdot]$.
Notice that these information manifolds can be used in information sciences in general, beyond the traditional fields of statistics. In statistics, we motivate the choice of the connections, metric tensors and divergences by studying statistical invariance criteria, in §3.9. We explain how to recover the expected $\alpha$-connections from standard $f$-divergences, which are the only separable divergences that satisfy the property of information monotonicity. Finally, in §3.10, we recall the Fisher-Rao expected Riemannian manifolds, that is, Riemannian manifolds $(P, {}_P g)$ equipped with a geodesic metric distance called the Fisher-Rao distance, or Rao distance for short.

3.2 Conjugate connection manifolds: $(M, g, \nabla, \nabla^*)$

We begin with a definition:

Definition 1 (Conjugate connections) A connection $\nabla^*$ is said to be conjugate to a connection $\nabla$ with respect to the metric tensor $g$ if and only if, for any triple $(X, Y, Z)$ of smooth vector fields, the following identity is satisfied:
$X\langle Y, Z\rangle = \langle \nabla_X Y, Z\rangle + \langle Y, \nabla^*_X Z\rangle, \quad \forall X, Y, Z \in \mathfrak{X}(M)$. (31)

We can notationally rewrite Eq. 31 as:
$X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla^*_X Z)$, (32)
and further make explicit that for each point $p \in M$, we have:
$X_p\, g_p(Y_p, Z_p) = g_p((\nabla_X Y)_p, Z_p) + g_p(Y_p, (\nabla^*_X Z)_p)$. (33)
We check that the right-hand side is a scalar and that the left-hand side is a directional derivative of a real-valued function, which is also a scalar. Conjugation is an involution: $(\nabla^*)^* = \nabla$.

Definition 2 (Conjugate Connection Manifold) The structure of the Conjugate Connection Manifold (CCM) is denoted by $(M, g, \nabla, \nabla^*)$, where $(\nabla, \nabla^*)$ are conjugate connections with respect to the metric $g$.

A remarkable property is that the dual parallel transport of vectors preserves the metric. That is, for any smooth curve $c(t)$, the inner product is conserved when we transport one vector $u$ using the primal parallel transport $\prod^{\nabla}_c$ and the other vector $v$ using the dual parallel transport $\prod^{\nabla^*}_c$.
$\langle u, v\rangle_{c(0)} = \left\langle \prod^{\nabla}_{c(0)\to c(t)} u, \prod^{\nabla^*}_{c(0)\to c(t)} v \right\rangle_{c(t)}$. (34)

Footnote 13: In Euclidean geometry, the orthogonal projection of a point $p$ onto an affine subspace $S$ is proved to be unique using the Pythagorean theorem.

Property 1 (Dual parallel transport preserves the metric) A pair $(\nabla, \nabla^*)$ of conjugate connections preserves the metric $g$ if and only if:
$\forall t \in [0, 1], \quad \left\langle \prod^{\nabla}_{c(0)\to c(t)} u, \prod^{\nabla^*}_{c(0)\to c(t)} v \right\rangle_{c(t)} = \langle u, v\rangle_{c(0)}$. (35)

Property 2 Given a connection $\nabla$ on $(M, g)$ (i.e., a structure $(M, g, \nabla)$), there exists a unique conjugate connection $\nabla^*$ (i.e., a dual structure $(M, g, \nabla^*)$).

We consider a manifold $M$ equipped with a pair of conjugate connections $\nabla$ and $\nabla^*$ that are coupled with the metric tensor $g$ so that the dual parallel transport preserves the metric. We define the mean connection $\bar\nabla$:
$\bar\nabla = \frac{\nabla + \nabla^*}{2}$, (36)
with corresponding Christoffel coefficients denoted by $\bar\Gamma$. This mean connection coincides with the Levi-Civita metric connection:
$\bar\nabla = {}^{LC}\nabla$. (37)

Property 3 The mean connection $\bar\nabla$ is self-conjugate, and coincides with the Levi-Civita metric connection.

3.3 Statistical manifolds: $(M, g, C)$

Lauritzen introduced this cornerstone structure [30] of information geometry in 1987. Beware that although it bears the name "statistical manifold," it is a purely geometric construction that may be used outside of the field of Statistics. However, as we shall mention later, we can always find a statistical model $P$ corresponding to a statistical manifold [69]. We shall see how we can convert a conjugate connection manifold into such a statistical manifold, and how we can subsequently derive an infinite family of CCMs from a statistical manifold. In other words, once we have a pair of conjugate connections, we will be able to build a family of pairs of conjugate connections.

We define a totally symmetric (see footnote 14) cubic $(0, 3)$-tensor (i.e., 3-covariant tensor) called the Amari-Chentsov tensor:
$C_{ijk} := \Gamma^k_{ij} - \Gamma^{*k}_{ij}$, (38)
or in coordinate-free form:
$C(X, Y, Z) := \langle \nabla_X Y - \nabla^*_X Y, Z\rangle$. (39)
Using the local basis, this cubic tensor can be expressed as:
$C_{ijk} = C(\partial_i, \partial_j, \partial_k) = \langle \nabla_{\partial_i}\partial_j - \nabla^*_{\partial_i}\partial_j, \partial_k\rangle$. (40)

Definition 3 (Statistical manifold [30]) A statistical manifold $(M, g, C)$ is a manifold $M$ equipped with a metric tensor $g$ and a totally symmetric cubic tensor $C$.

Footnote 14: This means that $C_{ijk} = C_{\sigma(i)\sigma(j)\sigma(k)}$ for any permutation $\sigma$. The metric tensor is totally symmetric.

3.4 A family $\{(M, g, \nabla^{-\alpha}, \nabla^{\alpha} = (\nabla^{-\alpha})^*)\}_{\alpha\in\mathbb{R}}$ of conjugate connection manifolds

For any pair $(\nabla, \nabla^*)$ of conjugate connections, we can define a 1-parameter family of connections $\{\nabla^\alpha\}_{\alpha\in\mathbb{R}}$, called the $\alpha$-connections, such that $(\nabla^{-\alpha}, \nabla^{\alpha})$ are dually coupled to the metric, with $\nabla^0 = \bar\nabla = {}^{LC}\nabla$, $\nabla^1 = \nabla$ and $\nabla^{-1} = \nabla^*$. By observing that the scaled cubic tensor $\alpha C$ is also a totally symmetric cubic 3-covariant tensor, we can derive the $\alpha$-connections from a statistical manifold $(M, g, C)$ as:
$\Gamma^{\alpha}_{ij,k} = \Gamma^{0}_{ij,k} - \frac{\alpha}{2} C_{ij,k}$, (41)
$\Gamma^{-\alpha}_{ij,k} = \Gamma^{0}_{ij,k} + \frac{\alpha}{2} C_{ij,k}$, (42)
where $\Gamma^0_{ij,k}$ are the Levi-Civita Christoffel symbols, and $\Gamma_{ij,k} \stackrel{\Sigma}{=} \Gamma^l_{ij} g_{lk}$ (by index juggling; $\stackrel{\Sigma}{=}$ indicates the Einstein summation convention).

The $\alpha$-connection $\nabla^\alpha$ can also be defined as follows:
$g(\nabla^\alpha_X Y, Z) = g({}^{LC}\nabla_X Y, Z) + \frac{\alpha}{2} C(X, Y, Z), \quad \forall X, Y, Z \in \mathfrak{X}(M)$. (43)

Theorem 2 (Family of information $\alpha$-manifolds) For any $\alpha \in \mathbb{R}$, $(M, g, \nabla^{-\alpha}, \nabla^{\alpha} = (\nabla^{-\alpha})^*)$ is a conjugate connection manifold.

The $\alpha$-connections $\nabla^\alpha$ can also be constructed directly from a pair $(\nabla, \nabla^*)$ of conjugate connections by taking the following weighted combination:
$\Gamma^{\alpha}_{ij,k} = \frac{1+\alpha}{2}\Gamma_{ij,k} + \frac{1-\alpha}{2}\Gamma^*_{ij,k}$. (44)

3.5 The fundamental theorem of information geometry: $\nabla$ $\kappa$-curved $\Leftrightarrow$ $\nabla^*$ $\kappa$-curved

We now state the fundamental theorem of information geometry and its corollaries:

Theorem 3 (Dually constant curvature manifolds) If a torsion-free affine connection $\nabla$ has constant curvature $\kappa$, then its conjugate torsion-free connection $\nabla^*$ necessarily has the same constant curvature $\kappa$.

The proof is reported in [12] (Proposition 8.1.4, page 226).
We get the following two corollaries:

Corollary 1 (Dually $\alpha$-flat manifolds) A manifold $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ is $\nabla^{\alpha}$-flat if and only if it is $\nabla^{-\alpha}$-flat.

Corollary 2 (Dually flat manifolds ($\alpha = \pm 1$)) A manifold $(M, g, \nabla, \nabla^*)$ is $\nabla$-flat if and only if it is $\nabla^*$-flat.

Thus once we are given a pair of conjugate connections, we can always build a 1-parametric family of manifolds. Manifolds with constant curvature $\kappa$ are interesting from the computational viewpoint as dual geodesics have simple closed-form expressions.

3.6 Conjugate connections from divergences: $(M, D) \equiv (M, {}^D g, {}^D\nabla, {}^D\nabla^* = {}^{D^*}\nabla)$

Loosely speaking, a divergence $D(\cdot : \cdot)$ is a smooth distance [74], potentially asymmetric. In order to define a divergence precisely, let us first introduce the following handy notations: $\partial_{i,\cdot} f(x, y) = \frac{\partial}{\partial x^i} f(x, y)$, $\partial_{\cdot,j} f(x, y) = \frac{\partial}{\partial y^j} f(x, y)$, $\partial_{ij,k} f(x, y) = \frac{\partial^2}{\partial x^i \partial x^j}\frac{\partial}{\partial y^k} f(x, y)$ and $\partial_{i,jk} f(x, y) = \frac{\partial}{\partial x^i}\frac{\partial^2}{\partial y^j \partial y^k} f(x, y)$, etc.

Definition 4 (Divergence) A divergence $D : M \times M \to [0, \infty)$ on a manifold $M$ with respect to a local chart $\Theta \subset \mathbb{R}^D$ is a $C^3$-function satisfying the following properties:
1. $D(\theta : \theta') \geq 0$ for all $\theta, \theta' \in \Theta$, with equality holding iff $\theta = \theta'$ (law of the indiscernibles),
2. $\partial_{i,\cdot} D(\theta : \theta')|_{\theta=\theta'} = \partial_{\cdot,j} D(\theta : \theta')|_{\theta=\theta'} = 0$ for all $i, j \in [D]$,
3. $-\partial_{i,\cdot}\partial_{\cdot,j} D(\theta : \theta')|_{\theta=\theta'}$ is positive-definite.

The dual divergence is defined by swapping the arguments:
$D^*(\theta : \theta') := D(\theta' : \theta)$, (45)
and is also called the reverse divergence (reference duality in information geometry). Reference duality of divergences is an involution: $(D^*)^* = D$.

The Euclidean distance is a metric distance but not a divergence. The squared Euclidean distance is a non-metric symmetric divergence. The metric tensor $g$ yields the Riemannian metric distance $D_\rho$, but it is never a divergence.
From any given divergence $D$, we can define a conjugate connection manifold following the construction of Eguchi [20] (1983):

Theorem 4 (Manifold from divergence) $(M, {}^D g, {}^D\nabla, {}^{D^*}\nabla)$ is an information manifold with:
${}^D g := -\partial_{i,j} D(\theta : \theta')|_{\theta=\theta'} = {}^{D^*} g$, (46)
${}^D\Gamma_{ijk} := -\partial_{ij,k} D(\theta : \theta')|_{\theta=\theta'}$, (47)
${}^{D^*}\Gamma_{ijk} := -\partial_{k,ij} D(\theta : \theta')|_{\theta=\theta'}$. (48)
The associated statistical manifold is $(M, {}^D g, {}^D C)$ with:
${}^D C_{ijk} = {}^{D^*}\Gamma_{ijk} - {}^D\Gamma_{ijk}$. (49)

Since $\alpha\,{}^D C$ is a totally symmetric cubic tensor for any $\alpha \in \mathbb{R}$, we can derive a one-parameter family of conjugate connection manifolds:
$\left\{ (M, {}^D g, {}^D C^{\alpha}) \equiv (M, {}^D g, {}^D\nabla^{-\alpha}, ({}^D\nabla^{-\alpha})^* = {}^D\nabla^{\alpha}) \right\}_{\alpha\in\mathbb{R}}$. (50)
In the remainder, we use the shortcut $(M, D)$ to denote the divergence-induced information manifold $(M, {}^D g, {}^D\nabla, {}^D\nabla^*)$. Notice that it follows from the construction that:
${}^D\nabla^* = {}^{D^*}\nabla$. (51)

3.7 Dually flat manifolds (Bregman geometry): $(M, F) \equiv (M, {}^{B_F} g, {}^{B_F}\nabla, {}^{B_F}\nabla^* = {}^{B_F^*}\nabla)$

We consider dually flat manifolds that satisfy asymmetric Pythagorean theorems. These flat manifolds can be obtained from a canonical Bregman divergence. Consider a strictly convex smooth function $F(\theta)$ called a potential function, with $\theta \in \Theta$ where $\Theta$ is an open convex domain. Notice that the convexity of the function is preserved under an affine transformation. We associate to the potential function $F$ a corresponding Bregman divergence (parameter divergence):
$B_F(\theta : \theta') := F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta')$. (52)
We also write the Bregman divergence between point $P$ and point $Q$ as $D(P : Q) := B_F(\theta(P) : \theta(Q))$, where $\theta(P)$ denotes the coordinates of a point $P$.

The induced information-geometric structure is $(M, {}^F g, {}^F C) := (M, {}^{B_F} g, {}^{B_F} C)$ with:
${}^F g := {}^{B_F} g = -\partial_{i,j} B_F(\theta : \theta')|_{\theta'=\theta} = \nabla^2 F(\theta)$, (53)
${}^F\Gamma := {}^{B_F}\Gamma_{ij,k}(\theta) = 0$, (54)
${}^F C_{ijk} := {}^{B_F} C_{ijk} = \partial_i\partial_j\partial_k F(\theta)$. (55)
Since all coefficients of the Christoffel symbols vanish (Eq. 54), the information manifold is ${}^F\nabla$-flat.
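As a small numerical illustration of Eq. 52 and Eq. 53, the sketch below evaluates a Bregman divergence and recovers the induced metric by finite differences of the mixed second derivative. The generator $F(\theta) = \sum_i \exp(\theta_i)$, the step size `h`, and all function names are assumptions chosen for illustration; they are not taken from the text.

```python
import math

def F(theta):
    """Illustrative strictly convex generator (an assumption): F(theta) = sum_i exp(theta_i)."""
    return sum(math.exp(t) for t in theta)

def grad_F(theta):
    """Gradient of F, available in closed form for this generator."""
    return [math.exp(t) for t in theta]

def bregman(theta, theta_p):
    """B_F(theta : theta') = F(theta) - F(theta') - (theta - theta')^T grad F(theta')  (Eq. 52)."""
    inner = sum((a - b) * g for a, b, g in zip(theta, theta_p, grad_F(theta_p)))
    return F(theta) - F(theta_p) - inner

def shifted(v, k, step):
    """Copy of v with its k-th coordinate moved by step."""
    w = list(v)
    w[k] += step
    return w

def metric_from_divergence(div, theta, h=1e-3):
    """g_ij = -(d/dtheta_i)(d/dtheta'_j) D(theta : theta') at theta' = theta,
    approximated with a four-point central difference stencil."""
    n = len(theta)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            mixed = (div(shifted(theta, i, h), shifted(theta, j, h))
                     - div(shifted(theta, i, h), shifted(theta, j, -h))
                     - div(shifted(theta, i, -h), shifted(theta, j, h))
                     + div(shifted(theta, i, -h), shifted(theta, j, -h))) / (4 * h * h)
            g[i][j] = -mixed
    return g

theta = [0.3, -0.5]
g = metric_from_divergence(bregman, theta)
# For this separable generator the Hessian of F is diagonal with entries exp(theta_i).
hessian = [[math.exp(theta[0]), 0.0], [0.0, math.exp(theta[1])]]
```

Up to discretization error, `g` agrees with `hessian`, which is the statement ${}^F g = \nabla^2 F(\theta)$ of Eq. 53. The divergence is not symmetric in its arguments, as can be seen by comparing `bregman([0.0, 0.0], [1.0, 1.0])` with the swapped call.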
The Levi-Civita connection ${}^{LC}\nabla$ is obtained from the metric tensor ${}^F g$ (usually not flat), and we get the conjugate connection $({}^F\nabla)^* = {}^F\nabla^{1}$ from $(M, {}^F g, {}^F C)$.

The Legendre-Fenchel transformation yields the convex conjugate $F^*$ that is interpreted as the dual potential function:
$F^*(\eta) := \sup_{\theta\in\Theta} \{\theta^\top \eta - F(\theta)\}$. (56)

Theorem 5 (Fenchel-Moreau biconjugation [24]) If $F$ is a lower semicontinuous (see footnote 15) and convex function, then its Legendre-Fenchel transformation is involutive: $(F^*)^* = F$ (biconjugation).

In a dually flat manifold, there exist two dual affine coordinate systems $\eta = \nabla F(\theta)$ and $\theta = \nabla F^*(\eta)$.

We have the Crouzeix [15] identity relating the Hessians of the potential functions:
$\nabla^2 F(\theta)\,\nabla^2 F^*(\eta) = I$, (57)
where $I$ denotes the $D \times D$ identity matrix. This Crouzeix identity reveals that $B = \{\partial_i\}_i$ and $B^* = \{\partial^j\}_j$ are the primal and reciprocal bases, respectively.

The Bregman divergence can be reinterpreted using the Young-Fenchel (in)equality as the canonical divergence $A_{F,F^*}$ [8]:
$B_F(\theta : \theta') = A_{F,F^*}(\theta : \eta') = F(\theta) + F^*(\eta') - \theta^\top \eta' = A_{F^*,F}(\eta' : \theta)$. (58)

Footnote 15: A function $f$ is lower semicontinuous (lsc) at $x_0$ iff $f(x_0) \leq \liminf_{x\to x_0} f(x)$. A function $f$ is lsc if it is lsc at $x$ for all $x$ in the function domain.

Figure 5: Dual Pythagorean theorems in a dually flat space. (Panel annotations: a geodesic triangle $PQR$ with $D(P : R) = D(P : Q) + D(Q : R)$, i.e., $B_F(\theta(P) : \theta(R)) = B_F(\theta(P) : \theta(Q)) + B_F(\theta(Q) : \theta(R))$; and $D^*(P : R) = D^*(P : Q) + D^*(Q : R)$, i.e., $B_{F^*}(\eta(P) : \eta(R)) = B_{F^*}(\eta(P) : \eta(Q)) + B_{F^*}(\eta(Q) : \eta(R))$; orthogonality conditions $\gamma^*(P, Q) \perp_F \gamma(Q, R)$ and $\gamma(P, Q) \perp_F \gamma^*(Q, R)$.)

The dual Bregman divergence $B_F^*(\theta : \theta') := B_F(\theta' : \theta) = B_{F^*}(\eta : \eta')$ yields
${}^F g^{ij}(\eta) = \partial^i\partial^j F^*(\eta), \quad \partial^l :=: \frac{\partial}{\partial\eta_l}$, (59)
${}^F\Gamma^{*ijk}(\eta) = 0, \quad {}^F C^{ijk} = \partial^i\partial^j\partial^k F^*(\eta)$. (60)
Thus the information manifold is both ${}^F\nabla$-flat and ${}^F\nabla^*$-flat: this structure is called a dually flat manifold (DFM). In a DFM, we have two global affine coordinate systems $\theta(\cdot)$ and $\eta(\cdot)$ related by the Legendre-Fenchel transformation of a pair of potential functions $F$ and $F^*$.
That is, $(M, F) \equiv (M, F^*)$, and the dual atlases are $A = \{(M, \theta)\}$ and $A^* = \{(M, \eta)\}$.

In a dually flat manifold, any pair of points $P$ and $Q$ can be linked either by the $\nabla$-geodesic (which is $\theta$-straight) or by the $\nabla^*$-geodesic (which is $\eta$-straight). In general, there are $2^3 = 8$ types of geodesic triangles in a dually flat manifold. Moreover, the dual Pythagorean theorems [36] illustrated in Figure 5 hold. Let $\gamma(P, Q) = \gamma_\nabla(P, Q)$ denote the $\nabla$-geodesic passing through points $P$ and $Q$, and $\gamma^*(P, Q) = \gamma_{\nabla^*}(P, Q)$ denote the $\nabla^*$-geodesic passing through points $P$ and $Q$. Two curves $\gamma_1$ and $\gamma_2$ are orthogonal at point $p = \gamma_1(t_1) = \gamma_2(t_2)$ with respect to the metric tensor $g$ when $g(\dot\gamma_1(t_1), \dot\gamma_2(t_2)) = 0$.

Theorem 6 (Dual Pythagorean identities)
$\gamma^*(P, Q) \perp \gamma(Q, R) \Leftrightarrow (\eta(P) - \eta(Q))^\top(\theta(Q) - \theta(R)) \stackrel{\Sigma}{=} (\eta_i(P) - \eta_i(Q))(\theta^i(Q) - \theta^i(R)) = 0$,
$\gamma(P, Q) \perp \gamma^*(Q, R) \Leftrightarrow (\theta(P) - \theta(Q))^\top(\eta(Q) - \eta(R)) \stackrel{\Sigma}{=} (\theta^i(P) - \theta^i(Q))(\eta_i(Q) - \eta_i(R)) = 0$.

We can define dual Bregman projections and characterize when these projections are unique: a submanifold $S \subset M$ is said to be $\nabla$-flat ($\nabla^*$-flat) iff it corresponds to an affine subspace in the $\theta$-coordinate system (in the $\eta$-coordinate system, respectively).

Theorem 7 (Uniqueness of projections) The $\nabla$-projection $P_S$ of $P$ on $S$ is unique if $S$ is $\nabla^*$-flat and minimizes the divergence $D(\theta(P) : \theta(Q))$:
$\nabla$-projection: $P_S = \arg\min_{Q\in S} D(\theta(P) : \theta(Q))$. (61)
The dual $\nabla^*$-projection $P^*_S$ is unique if $S \subseteq M$ is $\nabla$-flat and minimizes the divergence $D(\theta(Q) : \theta(P))$:
$\nabla^*$-projection: $P^*_S = \arg\min_{Q\in S} D(\theta(Q) : \theta(P))$. (62)

Let $S \subset M$ and $S' \subset M$; then we define the divergence between $S$ and $S'$ as
$D(S : S') := \min_{s\in S, s'\in S'} D(s : s')$. (63)
When $S$ is a $\nabla$-flat submanifold and $S'$ is a $\nabla^*$-flat submanifold, the divergence $D(S : S')$ between the submanifolds $S$ and $S'$ can be calculated using the method of alternating projections [4]. Let us remark that Kurose [29] reported a Pythagorean theorem for dually constant curvature manifolds that generalizes the Pythagorean theorems of dually flat spaces.
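The Legendre duality and the Pythagorean identity for Bregman divergences can be checked numerically on a small example. The sketch below assumes the separable generator $F(\theta) = \sum_i \exp(\theta_i)$, whose convex conjugate is $F^*(\eta) = \sum_i (\eta_i \log\eta_i - \eta_i)$; the three points $P$, $Q$, $R$ are hypothetical and are chosen so that $(\theta(P) - \theta(Q))^\top(\eta(Q) - \eta(R)) = 0$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def F(theta):
    """Primal potential (illustrative assumption): F(theta) = sum_i exp(theta_i)."""
    return sum(math.exp(t) for t in theta)

def grad_F(theta):
    """Dual coordinates eta = grad F(theta)."""
    return [math.exp(t) for t in theta]

def F_star(eta):
    """Convex conjugate of F: F*(eta) = sum_i (eta_i log eta_i - eta_i), for eta_i > 0."""
    return sum(e * math.log(e) - e for e in eta)

def grad_F_star(eta):
    """Primal coordinates theta = grad F*(eta)."""
    return [math.log(e) for e in eta]

def bregman(theta, theta_p):
    """B_F(theta : theta') as in Eq. 52."""
    diff = [a - b for a, b in zip(theta, theta_p)]
    return F(theta) - F(theta_p) - dot(diff, grad_F(theta_p))

def canonical(theta, eta_p):
    """Canonical divergence A_{F,F*}(theta : eta') = F(theta) + F*(eta') - theta^T eta' (Eq. 58)."""
    return F(theta) + F_star(eta_p) - dot(theta, eta_p)

# Hypothetical triple: Q at the origin of the theta-chart, P moved along theta_1,
# R specified through its eta-coordinates (1, 2).
theta_Q = [0.0, 0.0]
theta_P = [1.0, 0.0]
theta_R = grad_F_star([1.0, 2.0])
eta_P, eta_Q, eta_R = grad_F(theta_P), grad_F(theta_Q), grad_F(theta_R)

# Orthogonality condition (second line of Theorem 6).
orthogonality = dot([a - b for a, b in zip(theta_P, theta_Q)],
                    [a - b for a, b in zip(eta_Q, eta_R)])

# Pythagorean gap: B_F(P : R) - B_F(P : Q) - B_F(Q : R).
pythagoras_gap = (bregman(theta_P, theta_R)
                  - bregman(theta_P, theta_Q)
                  - bregman(theta_Q, theta_R))

# Crouzeix identity (Eq. 57): both Hessians are diagonal for this separable generator.
crouzeix = [math.exp(t) * (1.0 / e) for t, e in zip(theta_P, eta_P)]
```

For Bregman divergences the gap equals $(\theta(P) - \theta(Q))^\top(\eta(Q) - \eta(R))$ in general, so it vanishes exactly when the orthogonality condition holds. The same script also confirms that `canonical(theta_P, eta_R)` matches `bregman(theta_P, theta_R)`, in line with Eq. 58.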
The dually flat geometry can be investigated under the wider scope of Hessian manifolds [63], which consider locally defined potential functions. We now consider information manifolds induced by parametric statistical models.

3.8 Expected $\alpha$-manifolds of a family of parametric probability distributions: $(P, {}_P g, {}_P\nabla^{-\alpha}, {}_P\nabla^{\alpha})$

Informally speaking, an expected manifold is an information manifold built on a regular parametric family of distributions. It is sometimes called "expected" manifold or "expected" geometry in the literature [76] because the components of the metric tensor $g$ and the Amari-Chentsov cubic tensor $C$ are expressed using statistical expectations $E_\cdot[\cdot]$.

Let $P$ be a parametric family of probability distributions:
$P := \{p_\theta(x)\}_{\theta\in\Theta}$, (64)
with $\theta$ belonging to the open parameter space $\Theta$. The order of the family is the dimension of its parameter space. We define the likelihood function (see footnote 16) $L(\theta; x) := p_\theta(x)$ as a function of $\theta$, and its corresponding log-likelihood function:
$l(\theta; x) := \log L(\theta; x) = \log p_\theta(x)$. (65)
The score vector:
$s_\theta = \nabla_\theta l = (\partial_i l)_i$, (66)
indicates the sensitivity of the likelihood, where $\partial_i l :=: \frac{\partial}{\partial\theta^i} l(\theta; x)$.

The $D \times D$ Fisher information matrix (FIM), for $\dim(\Theta) = D$, is defined by:
${}_P I(\theta) := E_\theta[\partial_i l\,\partial_j l]_{ij} \succeq 0$, (67)
where $\succeq$ denotes the Löwner order. That is, for two symmetric positive-definite matrices $A$ and $B$, $A \succeq B$ if and only if the matrix $A - B$ is positive semidefinite. For regular models [12], the FIM is positive definite: ${}_P I(\theta) \succ 0$, where $A \succ B$ if and only if the matrix $A - B$ is positive-definite.

Footnote 16: The likelihood function is an equivalence class of functions defined modulo a positive scaling factor.

In statistics, the FIM plays a role in the attainable precision of unbiased estimators. For any unbiased estimator, the Cramér-Rao lower bound [33] on the variance of the estimator is:
$\mathrm{Var}_\theta[\hat\theta_n(X)] \succeq \frac{1}{n}\,{}_P I^{-1}(\theta)$. (68)
The FIM is invariant by reparameterization of the sample space $X$, and covariant by reparameterization of the parameter space $\Theta$, see [12].
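To make Eq. 67 concrete, the following sketch computes the Fisher information matrix of a categorical model with three outcomes by taking the exact expectation of the outer product of the score. The parameterization $\theta = (\theta_1, \theta_2)$ with $p_3 = 1 - \theta_1 - \theta_2$, the numerical values, and the function names are illustrative assumptions.

```python
def score(theta, x):
    """Score vector (d/dtheta_i) log p(x; theta) of a 3-outcome categorical model.
    Outcomes are x in {0, 1, 2}; the last outcome has probability 1 - theta_1 - theta_2."""
    p3 = 1.0 - theta[0] - theta[1]
    s = [0.0, 0.0]
    for i in range(2):
        if x == i:
            s[i] += 1.0 / theta[i]
        if x == 2:
            s[i] -= 1.0 / p3
    return s

def fisher_information(theta):
    """I_ij(theta) = E_theta[ d_i l * d_j l ], with the expectation taken as an exact finite sum."""
    probs = [theta[0], theta[1], 1.0 - theta[0] - theta[1]]
    fim = [[0.0, 0.0], [0.0, 0.0]]
    for x, px in enumerate(probs):
        s = score(theta, x)
        for i in range(2):
            for j in range(2):
                fim[i][j] += px * s[i] * s[j]
    return fim

theta = (0.2, 0.3)
fim = fisher_information(theta)
# Closed form for this model: I_ij = delta_ij / theta_i + 1 / p3.
determinant = fim[0][0] * fim[1][1] - fim[0][1] * fim[1][0]
```

The resulting matrix is symmetric with a positive diagonal and a positive determinant, hence positive-definite, as expected for a regular model.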
We report the expression of the FIM for two important generic parametric families of probability distributions: (1) an exponential family, and (2) a mixture family.

Example 1 (FIM of an exponential family $E$) An exponential family [41] $E$ is defined for a sufficient statistic vector $t(x) = (t_1(x), \ldots, t_D(x))$ and an auxiliary carrier measure $k(x)$ by the following canonical density:
$E = \left\{ p_\theta(x) = \exp\left( \sum_{i=1}^{D} t_i(x)\theta^i - F(\theta) + k(x) \right) \text{ such that } \theta \in \Theta \right\}$, (69)
where $F$ is the strictly convex cumulant function. Exponential families include the Gaussian family, the Gamma and Beta families, the probability simplex $\Delta$, etc. The FIM of an exponential family is given by:
${}_E I(\theta) = \mathrm{Cov}_{X\sim p_\theta(x)}[t(x)] = \nabla^2 F(\theta) = (\nabla^2 F^*(\eta))^{-1} \succ 0$. (70)

Example 2 (FIM of a mixture family $M$) A mixture family is defined for $D + 1$ functions $F_1, \ldots, F_D$ and $C$ as:
$M = \left\{ p_\theta(x) = \sum_{i=1}^{D} \theta_i F_i(x) + C(x) \text{ such that } \theta \in \Theta \right\}$, (71)
where the functions $\{F_i(x)\}_i$ are linearly independent on the common support $X$ and satisfy $\int F_i(x)\,d\mu(x) = 0$. The function $C$ is such that $\int C(x)\,d\mu(x) = 1$. Mixture families include statistical mixtures with prescribed component distributions and the probability simplex $\Delta$. The FIM of a mixture family is given by:
${}_M I(\theta) = E_{X\sim p_\theta(x)}\left[ \frac{F_i(x)F_j(x)}{(p_\theta(x))^2} \right] = \int_X \frac{F_i(x)F_j(x)}{p_\theta(x)}\,d\mu(x) \succ 0$. (72)

Notice that the probability simplex of discrete distributions can be modeled both as an exponential family and as a mixture family [4].

The expected $\alpha$-geometry is built from the expected dual $\pm\alpha$-connections. The Fisher "information metric" tensor is built from the FIM as follows:
${}_P g(u, v) := (u)_\theta^\top\,{}_P I(\theta)\,(v)_\theta$. (73)
The expected exponential connection and expected mixture connection are given by
${}_P^{e}\nabla := E_\theta[(\partial_i\partial_j l)(\partial_k l)]$, (74)
${}_P^{m}\nabla := E_\theta[(\partial_i\partial_j l + \partial_i l\,\partial_j l)(\partial_k l)]$. (75)
The dualistic structure is denoted by $(P, {}_P g, {}_P^{m}\nabla, {}_P^{e}\nabla)$, with the Amari-Chentsov cubic tensor called the skewness tensor:
$C_{ijk} := E_\theta[\partial_i l\,\partial_j l\,\partial_k l]$. (76)

It follows that we can build a one-parameter family of expected information $\alpha$-manifolds:
$\left\{ (P, {}_P g, {}_P\nabla^{-\alpha}, {}_P\nabla^{+\alpha}) \right\}_{\alpha\in\mathbb{R}}$, (77)
with
${}_P\Gamma^{\alpha}_{ij,k} := -\frac{1+\alpha}{2} C_{ijk} = E_\theta\left[ \left( \partial_i\partial_j l + \frac{1-\alpha}{2}\,\partial_i l\,\partial_j l \right)(\partial_k l) \right]$. (78)
The Levi-Civita metric connection is recovered as follows:
${}_P\bar\nabla = \frac{{}_P\nabla^{-\alpha} + {}_P\nabla^{\alpha}}{2} = {}_P^{LC}\nabla := {}^{LC}\nabla({}_P g)$. (79)

In the case of an exponential family $E$ or a mixture family $M$ equipped with the dual exponential/mixture connections, we get dually flat manifolds (Bregman geometry). Indeed, for the exponential/mixture families, it is easy to check that the Christoffel symbols of $\nabla^e$ and $\nabla^m$ vanish:
${}_M^{e}\Gamma = {}_M^{m}\Gamma = {}_E^{e}\Gamma = {}_E^{m}\Gamma = 0$. (80)

3.9 Criteria for statistical invariance

So far we have explained how to build an information manifold (or information $\alpha$-manifold) from a pair of conjugate connections. Then we reported two ways to obtain such a pair of conjugate connections: (1) from a parametric divergence, or (2) by using the predefined expected exponential/mixture connections. We now ask the following question: which information manifold makes sense in Statistics? We can refine the question as follows:
• Which metric tensors $g$ make sense in statistics?
• Which affine connections $\nabla$ make sense in statistics?
• Which statistical divergences make sense in statistics (from which we can get the metric tensor and dual connections)?

By definition, an invariant metric tensor $g$ shall preserve the inner product under important statistical mappings called Markov embeddings. Informally, we embed $\Delta_D$ into $\Delta_{D'}$ with $D' > D$ and the induced metric should be preserved (see [4], page 62).

Theorem 8 (Uniqueness of Fisher information metric [13, 70]) The Fisher information metric is the unique invariant metric tensor under Markov embeddings up to a scaling constant.

Figure 6: A divergence satisfies the property of information monotonicity iff $D(\theta_{\bar A} : \theta'_{\bar A}) \leq D(\theta : \theta')$. Here, parameter $\theta$ represents a discrete distribution. (The figure shows the coarse-graining of $p = (p_1, \ldots, p_8)$ into $p_A = (p_1 + p_2,\ p_3 + p_4 + p_5,\ p_6,\ p_7 + p_8)$.)
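The coarse-graining of Figure 6 can be tried numerically: lumping a discrete distribution means summing its probabilities over each block of a partition of the outcomes. The sketch below applies the partition of Figure 6 to two hypothetical distributions and compares the Kullback-Leibler divergence (an $f$-divergence) before and after lumping. As a contrast, it also evaluates the squared Euclidean distance on a second hypothetical pair, for which lumping increases the value. All numbers and function names are assumptions made for illustration.

```python
import math

def kl(p, q):
    """Discrete Kullback-Leibler divergence sum_i p_i log(p_i / q_i), with 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def squared_euclidean(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def coarse_grain(p, partition):
    """Lump the bins of p: one output bin per block of the partition."""
    return [sum(p[j] for j in block) for block in partition]

# Partition of Figure 6 (0-based indices): {p1,p2}, {p3,p4,p5}, {p6}, {p7,p8}.
partition = [[0, 1], [2, 3, 4], [5], [6, 7]]
p = [0.05, 0.15, 0.10, 0.20, 0.10, 0.10, 0.25, 0.05]
q = [0.125] * 8

kl_fine = kl(p, q)
kl_coarse = kl(coarse_grain(p, partition), coarse_grain(q, partition))

# A pair for which the squared Euclidean distance grows under lumping.
u = [1 / 3, 1 / 3, 1 / 3, 0.0]
v = [0.0, 0.0, 0.0, 1.0]
lump = [[0, 1, 2], [3]]
se_fine = squared_euclidean(u, v)
se_coarse = squared_euclidean(coarse_grain(u, lump), coarse_grain(v, lump))
```

Here `kl_coarse` does not exceed `kl_fine`, whereas `se_coarse` is larger than `se_fine`, so the squared Euclidean distance is separable but not information monotone.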
A $D$-dimensional parameter (discrete) divergence satisfies the information monotonicity if and only if:
$D(\theta_{\bar A} : \theta'_{\bar A}) \leq D(\theta : \theta')$ (81)
for any coarse-grained partition $A = \{A_i\}_{i=1}^{E}$ of $[D] = \{1, \ldots, D\}$ ($A$-lumping [16]) with $E \leq D$, where $\theta^i_{\bar A} = \sum_{j\in A_i} \theta^j$ for $i \in [E]$. This concept of coarse-graining is illustrated in Figure 6.

A separable divergence $D(\theta_1 : \theta_2)$ is a divergence that can be expressed as the sum of elementary scalar divergences $d(x : y)$:
$D(\theta_1 : \theta_2) := \sum_i d(\theta_1^i : \theta_2^i)$. (82)
For example, the squared Euclidean distance $D(\theta_1 : \theta_2) = \sum_i (\theta_1^i - \theta_2^i)^2$ is a separable divergence for the scalar Euclidean divergence $d(x : y) = (x - y)^2$. The Euclidean distance $D_E(\theta_1, \theta_2) = \sqrt{\sum_i (\theta_1^i - \theta_2^i)^2}$ is not separable because of the square root operation.

The only invariant and decomposable divergences when $D > 1$ are $f$-divergences [27] defined for a convex functional generator $f$:
$I_f(\theta : \theta') := \sum_{i=1}^{D} \theta^i f\left( \frac{\theta'^i}{\theta^i} \right) \geq f(1), \quad f(1) = 0$. (83)
The standard $f$-divergences are defined for $f$-generators satisfying $f'(1) = 0$ (choose $f_\lambda(u) := f(u) + \lambda(u - 1)$ since $I_{f_\lambda} = I_f$), and $f''(1) = 1$ (scale fixed).

Statistical $f$-divergences are invariant [58] under one-to-one/sufficient statistic transformations $y = t(x)$ of the sample space, $p(x; \theta) = q(y(x); \theta)$:
$I_f[p(x; \theta) : p(x; \theta')] = \int_X p(x; \theta)\, f\left( \frac{p(x; \theta')}{p(x; \theta)} \right) d\mu(x) = \int_Y q(y; \theta)\, f\left( \frac{q(y; \theta')}{q(y; \theta)} \right) d\mu(y) = I_f[q(y; \theta) : q(y; \theta')]$.

The dual $f$-divergence for reference duality is
$I_f^*[p(x; \theta) : p(x; \theta')] = I_f[p(x; \theta') : p(x; \theta)] = I_{f^\diamond}[p(x; \theta) : p(x; \theta')]$ (84)
for the standard conjugate $f$-generator (diamond $f$-generator) with:
$f^\diamond(u) := u f\left( \frac{1}{u} \right)$. (85)
One can check that $f^\diamond$ is a standard $f$-generator when $f$ is standard. Let us report some common examples of $f$-divergences:

• The family of $\alpha$-divergences:
$I_\alpha[p : q] := \frac{4}{1 - \alpha^2}\left( 1 - \int p^{\frac{1-\alpha}{2}}(x)\, q^{\frac{1+\alpha}{2}}(x)\,d\mu(x) \right)$, (86)
obtained for $f(u) = \frac{4}{1-\alpha^2}(1 - u^{\frac{1+\alpha}{2}})$. The $\alpha$-divergences include:
– the Kullback-Leibler divergence when $\alpha \to -1$:
$\mathrm{KL}[p : q] = \int p(x) \log\frac{p(x)}{q(x)}\,d\mu(x)$, (87)
for $f(u) = -\log u$.
– the reverse Kullback-Leibler divergence when $\alpha \to 1$:
$\mathrm{KL}^*[p : q] = \int q(x) \log\frac{q(x)}{p(x)}\,d\mu(x) = \mathrm{KL}[q : p]$, (88)
for $f(u) = u \log u$.
– the symmetric squared Hellinger divergence:
$H^2[p : q] = \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 d\mu(x)$, (89)
for $f(u) = (\sqrt{u} - 1)^2$ (corresponding to $\alpha = 0$).
– the Pearson and Neyman chi-squared divergences [45], etc.
• the Jensen-Shannon divergence:
$\mathrm{JS}[p : q] = \frac{1}{2}\int \left( p(x) \log\frac{2p(x)}{p(x) + q(x)} + q(x) \log\frac{2q(x)}{p(x) + q(x)} \right) d\mu(x)$, (90)
for $f(u) = -(u + 1)\log\frac{1+u}{2} + u \log u$.
• the Total Variation:
$\mathrm{TV}[p : q] = \frac{1}{2}\int |p(x) - q(x)|\,d\mu(x)$, (91)
for $f(u) = \frac{1}{2}|u - 1|$. The total variation distance is the only metric $f$-divergence.

A remarkable property is that invariant standard $f$-divergences yield the Fisher information matrix and the $\alpha$-connections. Indeed, the invariant standard $f$-divergences are related infinitesimally to the Fisher metric as follows:
$I_f[p(x; \theta) : p(x; \theta + d\theta)] = \int p(x; \theta)\, f\left( \frac{p(x; \theta + d\theta)}{p(x; \theta)} \right) d\mu(x)$ (92)
$\stackrel{\Sigma}{=} \frac{1}{2}\,{}_F g_{ij}(\theta)\,d\theta^i d\theta^j$, (93)
where ${}_F g$ denotes the Fisher information metric.

A statistical parameter divergence $D$ on a parametric family of distributions $P$ yields an equivalent parameter divergence ${}_P D$:
${}_P D(\theta : \theta') := D[p(x; \theta) : p(x; \theta')]$. (94)
Thus we can build the information manifold induced by this parameter divergence ${}_P D(\cdot : \cdot)$. For ${}_P D(\cdot : \cdot) = I_f[\cdot : \cdot]$, the induced $\pm 1$-divergence connections ${}_P^{I_f}\nabla := {}^{{}_P I_f}\nabla$ and ${}_P^{(I_f)^*}\nabla := {}^{{}_P I_f^*}\nabla$ are precisely the expected $\pm\alpha$-connections (derived from the exponential/mixture connections) with:
$\alpha = 2f'''(1) + 3$. (95)
Thus the invariant connections which coincide with the connections induced by the invariant divergences are the expected $\alpha$-connections.

3.10 Fisher-Rao expected Riemannian manifolds: $(P, {}_P g)$

Historically, a first manifold modeling of a regular parametric family of distributions $P = \{p_\theta(x)\}_\theta$ was to consider the Fisher Information Matrix (FIM) as the Riemannian metric tensor $g$ (see [25, 60]), with:
${}_P I(\theta) := E_{p_\theta}[\partial_i l\,\partial_j l]$, (96)
where $\partial_i l :=: \frac{\partial}{\partial\theta^i} \log p(x; \theta)$. Under some regularity conditions, we can rewrite the FIM as:
${}_P I(\theta) = -E_{p_\theta}[\partial_i\partial_j l]$. (97)
The Riemannian geodesic metric distance $D_\rho$ is commonly called the Fisher-Rao distance:
$D_\rho(p_{\theta_1}, p_{\theta_2}) = \int_0^1 \sqrt{\dot\gamma(t)^\top g_{\gamma(t)}\,\dot\gamma(t)}\,dt$, (98)
where $\gamma$ denotes the geodesic passing through $\gamma(0) = \theta_1$ and $\gamma(1) = \theta_2$.

Definition 5 (Fisher-Rao distance) The Fisher-Rao distance is the geodesic metric distance of the Fisher-Riemannian manifold $(P, {}_P g)$.

Let us give some examples of Fisher-Riemannian manifolds:
• The Fisher-Riemannian manifold of the family of categorical distributions (also called finite discrete distributions in [4]) amounts to the spherical geometry [28] (spherical manifold).
• The Fisher-Riemannian manifold of the bivariate location-scale families amounts to hyperbolic geometry (hyperbolic manifold).
• The Fisher-Riemannian manifold of the location families amounts to Euclidean geometry (Euclidean manifold).

The first fundamental form of the Riemannian geometry is $ds^2 = \langle dx, dx\rangle \stackrel{\Sigma}{=} g_{ij}\,dx^i dx^j$, where $ds$ denotes the line element.

This Riemannian geometric structure applied to a family of parametric probability distributions was first proposed by Harold Hotelling [25] (in a handwritten note of 1929, reprinted typeset in [65]) and independently later by C. R. Rao [60] (1945, reprinted in [59]). In a similar vein, Jeffreys [26] proposed in 1946 to use the volume element of a manifold as an invariant prior: the eponymous Jeffreys prior.

Notice that for a parametric family of probability distributions $P$, the Riemannian structure $(P, {}_P g)$ coincides with the self-dual conjugate connection manifold $(P, {}_P g, {}_P^{I_f}\nabla, {}_P^{I_f}\nabla^*)$ induced by a symmetric $f$-divergence like the squared Hellinger divergence.

3.11 The monotone $\alpha$-embeddings

Another common mathematically equivalent expression of the FIM [12] is given by:
$I_{ij}(\theta) := 4\int \partial_i\sqrt{p(x; \theta)}\;\partial_j\sqrt{p(x; \theta)}\,d\mu(x)$. (99)
This form of the FIM is well-suited to prove that the FIM is always a positive semi-definite matrix [12] ($I(\theta) \succeq 0$).
It turns out that we can define a family of equivalent representations of the FIM using the $\alpha$-embedding [75] of the parametric family.

First, we define the $\alpha$-representation of densities $l^\alpha(x; \theta) := k_\alpha(p(x; \theta))$ with:
$k_\alpha(u) := \begin{cases} \frac{2}{1-\alpha}\,u^{\frac{1-\alpha}{2}}, & \text{if } \alpha \neq 1, \\ \log u, & \text{if } \alpha = 1. \end{cases}$ (100)
The function $l^\alpha(x; \theta)$ is called the $\alpha$-likelihood function. Then the $\alpha$-representation of the FIM, the $\alpha$-FIM for short, is expressed as:
$I^\alpha_{ij}(\theta) := \int \partial_i l^\alpha(x; \theta)\,\partial_j l^{-\alpha}(x; \theta)\,d\mu(x)$. (101)
We can rewrite the $\alpha$-FIM compactly as $I^\alpha_{ij}(\theta) = \int \partial_i l^\alpha\,\partial_j l^{-\alpha}\,d\mu(x)$. Expanding the $\alpha$-FIM, we get:
$I^\alpha_{ij}(\theta) = \begin{cases} \frac{4}{1-\alpha^2}\int \partial_i p(x; \theta)^{\frac{1-\alpha}{2}}\,\partial_j p(x; \theta)^{\frac{1+\alpha}{2}}\,d\mu(x) & \text{for } \alpha \neq \pm 1, \\ \int \partial_i \log p(x; \theta)\,\partial_j p(x; \theta)\,d\mu(x) & \text{for } \alpha \in \{-1, 1\}. \end{cases}$ (102)
The 1-representation of the FIM is called the logarithmic representation and its 0-representation is called the square root representation. The set of $\alpha$-score vectors $B_\alpha := \{\partial_i l^\alpha\}_i$ is interpreted as the tangent basis vectors of the $\alpha$-base $B_\alpha$. Thus the FIM is $\alpha$-independent.

Furthermore, the $\alpha$-representation of the FIM can be rewritten under mild conditions [12] as:
$I^\alpha_{ij}(\theta) = -\frac{2}{1+\alpha}\int p(x; \theta)^{\frac{1+\alpha}{2}}\,\partial_i\partial_j l^\alpha(x; \theta)\,d\mu(x)$. (103)
Since we have:
$\partial_i\partial_j l^\alpha(x; \theta) = p^{\frac{1-\alpha}{2}}\left( \partial_i\partial_j l + \frac{1-\alpha}{2}\,\partial_i l\,\partial_j l \right)$, (104)
it follows that:
$I^\alpha_{ij}(\theta) = -\frac{2}{1+\alpha}\left( -I_{ij}(\theta) + \frac{1-\alpha}{2}\,I_{ij}(\theta) \right) = I_{ij}(\theta)$. (105)
Notice that when $\alpha = 1$, we recover the equivalent expression of the FIM (under mild conditions):
$I^1_{ij}(\theta) = -E[\nabla^2 \log p(x; \theta)]$. (106)
In particular, when the family is an exponential family [41] with cumulant function $F(\theta)$ (satisfying the mild conditions), we have:
$I(\theta) = \nabla^2 F(\theta)$. (107)

The $\alpha$-embeddings can be generalized by considering a pair of strictly increasing real-valued functions (see footnote 17) $\rho$ and $\tau$ (the conjugate embeddings) to yield the $(\rho, \tau)$-geometry [75, 51]. Zhang [75] further discussed the representation/reference biduality which was confounded in the $\alpha$-geometry.

Figure 7 displays the main types of information manifolds encountered in information geometry with their relationships.
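The $\alpha$-independence of the $\alpha$-FIM (Eq. 101) can be checked on a toy model. The sketch below uses a Bernoulli family with parameter $\theta$ (an illustrative assumption, for which the integral becomes a sum over $x \in \{0, 1\}$) and differentiates the $\alpha$-likelihood by central finite differences, so that only the definition of $k_\alpha$ in Eq. 100 is used. The step size and function names are hypothetical.

```python
import math

def k_alpha(u, alpha):
    """alpha-representation of Eq. 100: (2 / (1 - alpha)) u^((1 - alpha) / 2), or log u when alpha = 1."""
    if alpha == 1:
        return math.log(u)
    return 2.0 / (1.0 - alpha) * u ** ((1.0 - alpha) / 2.0)

def density(x, theta):
    """Bernoulli probability mass function, used here as a toy one-parameter model."""
    return theta if x == 1 else 1.0 - theta

def d_alpha_likelihood(x, theta, alpha, h=1e-5):
    """Central finite difference of the alpha-likelihood l^alpha(x; theta) with respect to theta."""
    upper = k_alpha(density(x, theta + h), alpha)
    lower = k_alpha(density(x, theta - h), alpha)
    return (upper - lower) / (2.0 * h)

def alpha_fim(theta, alpha):
    """alpha-FIM of Eq. 101, with the integral replaced by a sum over the two outcomes."""
    return sum(d_alpha_likelihood(x, theta, alpha) * d_alpha_likelihood(x, theta, -alpha)
               for x in (0, 1))

theta = 0.3
values = {a: alpha_fim(theta, a) for a in (-1, -0.5, 0, 0.5, 1)}
# Reference value: the Fisher information of a Bernoulli model, 1 / (theta (1 - theta)).
reference = 1.0 / (theta * (1.0 - theta))
```

All entries of `values` agree with `reference` up to discretization error, whichever $\alpha$ is used.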
4 Some illustrating applications of dually flat manifolds

Information geometry [4] has found broad applications in information sciences. For example, we can mention:
• Statistics: asymptotic inference, Expectation-Maximization (EM and the novel information-geometric em), time series models (AutoRegressive Moving Average, ARMA),
• Machine learning: Restricted Boltzmann machines (RBMs), neuromanifolds and natural gradient [66],
• Signal processing: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization (NMF),
• Mathematical programming: barrier functions of interior point methods,
• Game theory: score functions.

In this part, we describe how to use the dually flat structures for handling an exponential family $E$ (in a hypothesis testing problem detailed in §4.1) and a mixture family $M$ (clustering statistical mixtures, §4.2). Note that for a general divergence, neither $(E, D)$ nor $(M, D)$ is dually flat. However, when $D = \mathrm{KL}$, the Kullback-Leibler divergence, we get dually flat spaces that are computationally attractive since the primal/dual geodesics are straight lines in the corresponding global affine coordinate system.

Footnote 17: The set of strictly increasing real-valued univariate functions has a group structure for the group operation chosen as the functional composition $\circ$.
Figure 7: Overview of the main types of information manifolds with their relationships in information geometry. (Diagram nodes: smooth manifolds; Riemannian manifolds $(M, g) = (M, g, {}^{LC}\nabla)$, whose distance is the metric geodesic length, including the Fisher-Riemannian, spherical, hyperbolic and Euclidean manifolds; conjugate connection manifolds $(M, g, \nabla, \nabla^*)$, whose distance is a non-metric divergence, including divergence manifolds, dually flat (Hessian) manifolds $(M, F, F^*)$ and expected manifolds with the $\alpha$-geometry.)

Figure 8: Statistical Bayesian hypothesis testing: the best Maximum A Posteriori (MAP) rule chooses to classify an observation from the class that yields the maximum likelihood. (The plot shows the two densities $p_0(x)$ and $p_1(x)$ and two observations $x_1$ and $x_2$.)

4.1 Hypothesis testing in the dually flat exponential family manifold $(E, \mathrm{KL}^*)$

Given two probability distributions $P_0 \sim p_0(x)$ and $P_1 \sim p_1(x)$, we ask whether a set of i.i.d. observations $X_{1:n} = \{x_1, \ldots, x_n\}$ was sampled from $P_0$ or from $P_1$. This is a statistical decision problem [35]. For example, $P_0$ can represent the signal distribution and $P_1$ the noise distribution. Figure 8 displays the probability distributions and the unavoidable error that is made by any statistical decision rule (on observations $x_1$ and $x_2$).
Assume that both distributions $P_0 \sim P_{\theta_0}$ and $P_1 \sim P_{\theta_1}$ belong to the same exponential family $E = \{P_\theta : \theta \in \Theta\}$, and consider the exponential family manifold with the dually flat structure $(E, {}_E g, {}_E\nabla^e, {}_E\nabla^m)$, that is, the manifold equipped with the Fisher information metric tensor field, the expected exponential connection and the conjugate expected mixture connection. This structure can also be derived from a divergence manifold structure by choosing the reverse Kullback-Leibler divergence $\mathrm{KL}^*$:
$(E, {}_E g, {}_E\nabla^e, {}_E\nabla^m) \equiv (E, \mathrm{KL}^*)$. (108)
Therefore, the Kullback-Leibler divergence $\mathrm{KL}[P_\theta : P_{\theta'}]$ amounts to a Bregman divergence (for the cumulant function $F$ of the exponential family):
$\mathrm{KL}^*[P_{\theta'} : P_\theta] = \mathrm{KL}[P_\theta : P_{\theta'}] = B_F(\theta' : \theta)$. (109)

The best error exponent $\alpha^*$ of the best Maximum A Posteriori (MAP) decision rule is found by maximizing the skewed Bhattacharyya distance over $\alpha$ (equivalently, minimizing the Bhattacharyya affinity coefficient), which yields the Chernoff information [56]:
$C[P_1, P_2] = -\log \min_{\alpha\in(0,1)} \int_{x\in X} p_1^{\alpha}(x)\,p_2^{1-\alpha}(x)\,d\mu(x) \geq 0$. (110)
On the exponential family manifold $E$, the Bhattacharyya distance:
$B_\alpha[p_1 : p_2] = -\log \int_{x\in X} p_1^{\alpha}(x)\,p_2^{1-\alpha}(x)\,d\mu(x)$, (111)
amounts to a skew Jensen parameter divergence [40] (also called Burbea-Rao divergence):
$J_F^{\alpha}(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1 - \alpha)F(\theta_2) - F(\alpha\theta_1 + (1 - \alpha)\theta_2)$. (112)
It can be shown that the Chernoff information (attained at the optimal value of $\alpha$) is equivalent to a Bregman divergence: namely, the Bregman divergence for exponential families at the optimal exponent value $\alpha^*$.

Figure 9: Exact geometric characterization (not necessarily in closed form) of the best error exponent $\alpha^*$. (In the $\eta$-coordinate system, the figure shows $p_{\theta_1}$, $p_{\theta_2}$, the e-geodesic $G_e(P_{\theta_1}, P_{\theta_2})$, the m-bisector $\mathrm{Bi}_m(P_{\theta_1}, P_{\theta_2})$, and their intersection $P_{\theta^*_{12}}$, with $C(\theta_1 : \theta_2) = B(\theta_1 : \theta^*_{12})$.)

Theorem 9 (Chernoff information [35]) The Chernoff information between two distributions belonging to the same exponential family amounts to a Bregman divergence:
$C[P_{\theta_1} : P_{\theta_2}] = B(\theta_1 : \theta^{\alpha^*}_{12}) = B(\theta_2 : \theta^{\alpha^*}_{12})$, (113)
where $\theta^{\alpha}_{12} = (1 - \alpha)\theta_1 + \alpha\theta_2$, and $\alpha^*$ denotes the best error exponent.
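A minimal numerical sketch of Theorem 9 follows, assuming the Poisson family as the exponential family (natural parameter $\theta = \log\lambda$, cumulant function $F(\theta) = \exp(\theta)$). The optimal exponent is located by bisection on the derivative of the skew Jensen divergence of Eq. 112, which is one simple choice among many root-finding schemes. The code interpolates as $\alpha\theta_1 + (1 - \alpha)\theta_2$, following Eq. 112; this differs from the convention $\theta^\alpha_{12}$ of Theorem 9 by the exchange $\alpha \leftrightarrow 1 - \alpha$. Rates and function names are hypothetical.

```python
import math

def F(theta):
    """Cumulant function of the Poisson family (illustrative choice): F(theta) = exp(theta)."""
    return math.exp(theta)

def grad_F(theta):
    return math.exp(theta)

def bregman(a, b):
    """Scalar Bregman divergence B_F(a : b)."""
    return F(a) - F(b) - (a - b) * grad_F(b)

def skew_jensen(alpha, t1, t2):
    """Skew Jensen divergence of Eq. 112."""
    return alpha * F(t1) + (1.0 - alpha) * F(t2) - F(alpha * t1 + (1.0 - alpha) * t2)

def best_exponent(t1, t2, iterations=100):
    """Maximize the concave map alpha -> J_F^alpha(t1 : t2) on (0, 1) by bisection on its
    derivative F(t1) - F(t2) - (t1 - t2) F'(alpha t1 + (1 - alpha) t2), which decreases in alpha."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        slope = F(t1) - F(t2) - (t1 - t2) * grad_F(mid * t1 + (1.0 - mid) * t2)
        if slope > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Two Poisson distributions with hypothetical rates 2 and 5.
theta1, theta2 = math.log(2.0), math.log(5.0)
alpha_star = best_exponent(theta1, theta2)
theta_star = alpha_star * theta1 + (1.0 - alpha_star) * theta2
chernoff = skew_jensen(alpha_star, theta1, theta2)
```

At the optimum, `bregman(theta1, theta_star)` and `bregman(theta2, theta_star)` coincide with `chernoff`, which is the equality stated in Eq. 113.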
Let $\theta_{12}^{*} := \theta_{12}^{\alpha^*}$ denote the natural parameter obtained at the optimal exponent $\alpha^*$. The geometry [35] of the best error exponent can be explained on the dually flat exponential family manifold as follows:

$$P^* = P_{\theta_{12}^*} = G_e(P_1, P_2) \cap \mathrm{Bi}_m(P_1, P_2), \tag{114}$$

where $G_e$ denotes the exponential geodesic $\gamma_{\nabla^e}$ and $\mathrm{Bi}_m$ the m-bisector:

$$\mathrm{Bi}_m(P_1, P_2) = \{P : F(\theta_1) - F(\theta_2) + \eta(P)^\top (\theta_2 - \theta_1) = 0\}. \tag{115}$$

Figure 9 illustrates how to retrieve the best error exponent from an exponential arc (θ-geodesic) intersecting the m-bisector.

Furthermore, instead of considering two distributions for this statistical binary decision problem, we may consider a set of $n$ distributions $P_1, \ldots, P_n \in \mathcal{E}$. The geometry of the error exponent in this multiple hypothesis testing setting has been investigated in [34]. On the dually flat exponential family manifold, it amounts to checking the exponential arcs between natural neighbors (sharing Voronoi subfaces) of a Bregman Voronoi diagram [11]. See Figure 10 for an illustration.

4.2 Clustering mixtures in the dually flat mixture family manifold $(\mathcal{M}, \mathrm{KL})$

Given a set of $k$ prescribed statistical distributions $p_0(x), \ldots, p_{k-1}(x)$, all sharing the same support $\mathcal{X}$ (say, $\mathbb{R}$), a mixture family $\mathcal{M}$ of order $D = k - 1$ consists of all strictly convex combinations of these component distributions [48]:

$$\mathcal{M} := \left\{ m(x; \theta) = \sum_{i=1}^{k-1} \theta_i p_i(x) + \left(1 - \sum_{i=1}^{k-1} \theta_i\right) p_0(x) \ \text{ such that } \ \theta_i > 0,\ \sum_{i=1}^{k-1} \theta_i < 1 \right\}. \tag{116}$$

Figure 10: Geometric characterization of the best error exponent in the multiple hypothesis testing case (η-coordinate system; Chernoff distributions between natural neighbours).

Figure 11: Example of a mixture family of order D = 2 (3 components: Laplacian, Gaussian and Cauchy prefixed distributions). Legend: mixtures M1 and M2; components Gaussian(-2,1), Cauchy(2,1), Laplace(0,1).

Figure 12: Example of w-GMM clustering into k = 2 clusters.
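The definition of a mixture family in (116) can be made concrete with a small sketch that evaluates $m(x; \theta)$ for the three prescribed components named in the legend of Figure 11. Which component plays the role of $p_0$ is not stated in the text, so the ordering below, like all function names, is an assumption made for illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def cauchy_pdf(x, location, scale):
    z = (x - location) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))

def laplace_pdf(x, mu, scale):
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)

# Prescribed components p0, p1, p2 (this ordering is an assumption).
COMPONENTS = [
    lambda x: laplace_pdf(x, 0.0, 1.0),    # p0
    lambda x: gaussian_pdf(x, -2.0, 1.0),  # p1
    lambda x: cauchy_pdf(x, 2.0, 1.0),     # p2
]

def mixture_pdf(x, theta, components=COMPONENTS):
    """Evaluate m(x; theta) of Eq. (116) for theta in the open simplex."""
    if len(theta) != len(components) - 1:
        raise ValueError("theta must have k - 1 coordinates")
    if any(t <= 0.0 for t in theta) or sum(theta) >= 1.0:
        raise ValueError("theta must satisfy theta_i > 0 and sum(theta) < 1")
    value = (1.0 - sum(theta)) * components[0](x)  # weight of p0
    for weight, pdf in zip(theta, components[1:]):
        value += weight * pdf(x)
    return value

print(mixture_pdf(0.0, (0.2, 0.3)))
```

Because the components are fixed once and for all, a mixture is entirely described by its $D = k - 1$ free weights $\theta$, which serve as the coordinate system of the mixture family manifold.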
Figure 11 displays two mixtures obtained as convex combinations of prescribed Laplacian, Gaussian and Cauchy component distributions (D = 2). When considering a set of prescribed Gaussian component distributions, we obtain a w-Gaussian Mixture Model, or w-GMM for short.

We consider the expected information manifold $(\mathcal{M}, {}_{\mathcal{M}}g, {}_{\mathcal{M}}\nabla^m, {}_{\mathcal{M}}\nabla^e)$, which is dually flat and equivalent to $(\mathcal{M}_\Theta, \mathrm{KL})$. That is, the KL divergence between two mixtures with prescribed components (w-mixtures, for short) is equivalent to a Bregman divergence for $F(\theta) = -h(m_\theta)$, where $h(p) = -\int p(x) \log p(x)\, \mathrm{d}\mu(x)$ is the differential Shannon entropy, so that $F$ is the negative entropy [48]:

$$\mathrm{KL}[m_{\theta_1} : m_{\theta_2}] = B_F(\theta_1 : \theta_2). \tag{117}$$

Consider a set $\{m_{\theta_1}, \ldots, m_{\theta_n}\}$ of $n$ w-mixtures [48]. Because $F(\theta) = -h(m(x; \theta))$ is the negative differential entropy of a mixture (not available in closed form [49]), we approximate the intractable $F$ by a close tractable generator $\tilde{F}$. We use Monte Carlo stochastic sampling to get a convex Monte Carlo generator $\tilde{F}_S$ for an independent and identically distributed sample $S$.

Thus we can build a nested sequence $(\mathcal{M}, \tilde{F}_{S_1}), \ldots, (\mathcal{M}, \tilde{F}_{S_m})$ of tractable dually flat manifolds for nested sample sets $S_1 \subset \ldots \subset S_m$ converging to the ideal mixture manifold $(\mathcal{M}, F)$: $\lim_{m \to \infty} (\mathcal{M}, \tilde{F}_{S_m}) = (\mathcal{M}, F)$ (where convergence is defined with respect to the induced canonical Bregman divergence). A key advantage of this approach is that, for a given sample $S$, all computations carried out inside the dually flat manifold $(\mathcal{M}, \tilde{F}_S)$ are consistent; see [48]. For example, we can apply Bregman k-means [43] on these Monte Carlo dually flat spaces [42] of w-GMMs (Gaussian Mixture Models) to cluster a set of w-GMMs. Figure 12 displays the result of such a clustering.
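To make the Monte Carlo construction tangible, here is a minimal sketch for a w-GMM with two prescribed components, so that $D = 1$ and $\theta$ is a single weight. It assumes the importance-sampling form $\tilde{F}_S(\theta) = \frac{1}{|S|}\sum_{x \in S} \frac{1}{q(x)}\, m(x;\theta)\log m(x;\theta)$ with proposal $q = (p_0 + p_1)/2$; this estimator, the component parameters, and all names are illustrative assumptions and may differ from the exact scheme of [42].

```python
import math
import random

COMPONENTS = [(-2.0, 1.0), (2.0, 1.0)]  # hypothetical (mean, std) of p0 and p1

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def component_pdfs(x):
    return [gaussian_pdf(x, mu, sigma) for mu, sigma in COMPONENTS]

def mixture_pdf(x, theta):
    """w-mixture m(x; theta) = theta * p1(x) + (1 - theta) * p0(x)."""
    p0, p1 = component_pdfs(x)
    return theta * p1 + (1.0 - theta) * p0

def draw_sample(size, seed=0):
    """Draw i.i.d. points from the proposal q = (p0 + p1) / 2."""
    rng = random.Random(seed)
    points = []
    for _ in range(size):
        mu, sigma = rng.choice(COMPONENTS)
        points.append(rng.gauss(mu, sigma))
    return points

def make_generator(sample):
    """Return the Monte Carlo generator F_S and its derivative for a fixed S."""
    weights = [1.0 / mixture_pdf(x, 0.5) for x in sample]  # 1 / q(x)

    def F_S(theta):
        total = 0.0
        for x, w in zip(sample, weights):
            m = mixture_pdf(x, theta)
            total += w * m * math.log(m)
        return total / len(sample)

    def dF_S(theta):
        total = 0.0
        for x, w in zip(sample, weights):
            p0, p1 = component_pdfs(x)
            m = theta * p1 + (1.0 - theta) * p0
            total += w * (p1 - p0) * (math.log(m) + 1.0)
        return total / len(sample)

    return F_S, dF_S

def bregman(F, dF, theta1, theta2):
    """Bregman divergence B_F(theta1 : theta2) for a scalar parameter."""
    return F(theta1) - F(theta2) - (theta1 - theta2) * dF(theta2)

F_S, dF_S = make_generator(draw_sample(20000))
# Monte Carlo surrogate of KL[m_0.2 : m_0.7], cf. Eq. (117).
print(bregman(F_S, dF_S, 0.2, 0.7))
```

Since $u \mapsto u \log u$ is convex and $m(x;\theta)$ is affine in $\theta$, each summand is convex, so $\tilde{F}_S$ is a valid Bregman generator for any fixed sample; this is what keeps all computations consistent inside $(\mathcal{M}, \tilde{F}_S)$.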
We have briefly described two applications using dually flat manifolds: (1) the dually flat exponential manifold induced by the statistical reverse Kullback-Leibler divergence on an exponential family (structure $(\mathcal{E}, \mathrm{KL}^*)$), and (2) the dually flat mixture manifold induced by the statistical Kullback-Leibler divergence on a mixture family (structure $(\mathcal{M}, \mathrm{KL})$). Many other dually flat structures can be met in a statistical context. For example, two other dually flat structures for the D-dimensional probability simplex $\Delta_D$ are reported in Amari's textbook [4]: (1) the conformal deformation of the α-geometry (page 88, Eq. 4.95 of [4]), and (2) the χ-escort geometry (page 91, Eq. 4.114 of [4]).

5 Conclusion: Summary, historical background, and perspectives

5.1 Summary

We explained the dualistic nature of information manifolds $(M, g, \nabla, \nabla^*)$ in information geometry. The dualistic structure is defined by a pair of conjugate connections coupled with the metric tensor, which provides a dual parallel transport that preserves the metric. We showed how to extend this structure to a 1-parameter family of structures. From a pair of conjugate connections, the pipeline can be informally summarized as:

$$(M, g, \nabla, \nabla^*) \Rightarrow (M, g, C) \Rightarrow (M, g, \alpha C) \Rightarrow (M, g, \nabla^{-\alpha}, \nabla^{\alpha}), \quad \forall \alpha \in \mathbb{R}. \tag{118}$$

We stated the fundamental theorem of information geometry on dual constant-curvature manifolds, including the special but important case of dually flat manifolds, on which there exist two potential functions and global affine coordinate systems related by the Legendre-Fenchel transformation.
Although information geometry historically started with the Riemannian modeling $(\mathcal{P}, {}_{\mathcal{P}}g)$ of a parametric family of probability distributions $\mathcal{P}$ by letting the metric tensor be the Fisher information matrix, we have emphasized the dualistic view of information geometry, which considers non-Riemannian manifolds that can be derived from any divergence and are not necessarily tied to a statistical context (e.g., information manifolds can be used in mathematical programming [54]). Note that for any symmetric divergence (e.g., any symmetrized f-divergence like the squared Hellinger divergence), the induced conjugate connections coincide with the Levi-Civita connection, but the Fisher-Rao metric distance does not coincide with the squared Hellinger divergence. On the one hand, a Riemannian metric distance $D_\rho$ is never a divergence because the rooted distance functions fail to be smooth at the extremities, but a squared Riemannian metric distance is always a divergence. On the other hand, taking the power $\delta$ of a divergence $D$ (i.e., $D^\delta$) for some $\delta > 0$ may yield a metric distance (e.g., the square root of the Jensen-Shannon divergence [21]), but this is not always the case: the powered Jeffreys divergence $J^\delta$ is never a metric distance (see [68], page 889).

Recently, Optimal Transport (OT) theory [71] has gained interest in statistics and machine learning. However, the optimal transport distance between two members of the same elliptically-contoured family is given by the same formula whatever the family (see [18], Eq. 16 and Eq. 17), although the corresponding Kullback-Leibler divergences differ. Another essential difference is that the Fisher-Rao manifold of location-scale families is hyperbolic, whereas the Wasserstein manifold of location-scale families has positive curvature [18, 67].
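The distinction drawn above between the Fisher-Rao metric distance and the squared Hellinger divergence can be checked numerically on the probability simplex. The sketch below uses the closed form $\rho(p, q) = 2\arccos\left(\sum_i \sqrt{p_i q_i}\right)$ of the Fisher-Rao distance between categorical distributions, and the convention $H^2(p, q) = 1 - \sum_i \sqrt{p_i q_i}$ for the squared Hellinger divergence (other normalizations exist); the three test distributions are arbitrary choices made for illustration.

```python
import math

def bhattacharyya_coefficient(p, q):
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between categorical distributions."""
    bc = min(1.0, bhattacharyya_coefficient(p, q))  # guard against rounding
    return 2.0 * math.acos(bc)

def squared_hellinger(p, q):
    """Squared Hellinger divergence with the convention 1 - BC(p, q)."""
    return 1.0 - bhattacharyya_coefficient(p, q)

p = (0.98, 0.01, 0.01)
q = (0.495, 0.495, 0.01)
r = (0.01, 0.98, 0.01)

# The Fisher-Rao distance is a metric: the triangle inequality holds.
print(fisher_rao_distance(p, r), fisher_rao_distance(p, q) + fisher_rao_distance(q, r))

# The squared Hellinger divergence is symmetric but violates the triangle
# inequality on this triple, so it is a divergence and not a metric distance.
print(squared_hellinger(p, r), squared_hellinger(p, q) + squared_hellinger(q, r))
```

Both quantities are functions of the same Bhattacharyya coefficient, which makes the contrast easy to see: one is a geodesic length on the sphere of square-root densities, the other is a smooth symmetric dissimilarity.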
5.2 A brief historical review of information geometry

The field of Information Geometry (IG) was historically motivated by providing some differential-geometric structure to statistical models in order to reason geometrically about statistical problems, with the ultimate goal of geometrizing mathematical statistics [14, 3, 32, 28, 5]. Harold Hotelling [25] first considered in the late 1920s the Fisher Information Matrix (FIM) $I$ as a Riemannian metric tensor $g$, and interpreted a parametric family of probability distributions $M$ as a Riemannian manifold $(M, g)$.¹⁸ In this pioneering work, Hotelling mentioned that location-scale probability families yield manifolds of constant negative curvature. This Riemannian modeling of parametric families of densities was further independently studied by Calyampudi Radhakrishna Rao in his celebrated paper [60] (1945), which also includes the Cramér-Rao lower bound [33] and the Rao-Blackwellization technique. Nowadays the induced Riemannian metric distance is often called the Fisher-Rao distance [64] or Rao distance [61]. Another use of Riemannian geometry in statistics was pioneered by Harold Jeffreys [26], who proposed to use as an invariant prior the normalized volume element of the expected Fisher-Riemannian manifold. In those seminal papers, there was no theoretical justification for using the Fisher information matrix as a metric tensor (besides the fact that it is a positive-definite matrix for regular identifiable models). Nowadays, this Riemannian metric tensor is called the information metric for short. Information geometry considers a generalization of this approach using a non-Riemannian dualistic modeling $(M, g, \nabla, \nabla^*)$ that coincides with the Riemannian manifold when $\nabla = \nabla^* = {}^{LC}\nabla$, the Levi-Civita connection (the unique torsion-free connection compatible with the metric tensor).

¹⁸ Hotelling attended the American Mathematical Society's Annual Meeting in Bethlehem (Pennsylvania, USA) on December 26–29, 1929, but left before his scheduled talk on December 27. His handwritten notes on the "Spaces of Statistical Parameters" were read by a colleague and are fully typeset in [65]. We warmly thank Professor Stigler for sending us the scanned handwritten notes and for discussing by email historical aspects of the birth of information geometry.

In the 1960s, Nikolai Chentsov (also commonly written Čencov) studied the algebraic category of all statistical decision rules with its induced geometric structures: namely, the expected α-geometries ("equivalent differential geometry") and the dually flat manifolds ("Nonsymmetric Pythagorean geometry" of the exponential families with respect to the Kullback-Leibler divergence). In the preface of the English translation of his 1972 Russian monograph [14], the field of investigation is defined as "geometrical statistics." However, in the original Russian monograph, Chentsov used the Russian term geometrostatistics. The term geometrostatistics was allegedly coined¹⁹ by Andrey Kolmogorov to define the field of differential geometry of statistical models. In the monograph of Chentsov, the Fisher information metric is shown to be the unique metric tensor (up to a scaling factor) yielding statistical invariance under Markov morphisms (see [13] for a simpler proof that generalizes to positive measures).

¹⁹ We thank Professor Alexander Holevo for email correspondence on this matter.

The dual nature of the geometry was thoroughly investigated²⁰ by Shun-ichi Amari. In the preface of his 1985 monograph [3], Professor Amari coined the term information geometry as follows: "The differential-geometrical method developed in statistics is also applicable to other fields of sciences such as information theory and systems theory... They together will open a new field, which I would like to call information geometry." The role of differential geometry in statistics has been discussed in [10].

²⁰ Professor Amari mentioned in [3] that he considered the Gaussian Riemannian manifold as a hyperbolic manifold in 1959, and was strongly influenced by Efron's paper on statistical curvature [19] (1975) to study the family of α-connections in the 1980s [2].

Note that the dual affine connections of information geometry have also been investigated independently in affine differential geometry [52], which considers invariance under volume-preserving affine transformations by defining a volume form (instead of a metric form for Riemannian geometry). The notion of dual parallel transport compatible with the metric is due to Aleksandr Norden [53].

We summarize the main fundamental structures of information manifolds below:

$(M, g)$: Riemannian manifold
$(\mathcal{P}, {}_{\mathcal{P}}g)$: Fisher-Riemannian expected Riemannian manifold
$(M, g, \nabla)$: Manifold with affine connection
$(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}^{e}\nabla^{\alpha})$: Chentsov's manifold with affine α-connection
$(M, g, \nabla, \nabla^*)$: Amari's dualistic information manifold
$(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{-\alpha}, {}_{\mathcal{P}}\nabla^{\alpha})$: Amari's expected information α-manifold, α-geometry
$(M, g, C)$: Lauritzen's statistical manifold [30]
$(M, {}^{D}g, {}^{D}\nabla, {}^{D^*}\nabla)$: Eguchi's conjugate connection manifold induced by divergence $D$
$(M, {}^{F}g, {}^{F}C)$: Chentsov/Amari's dually flat manifold induced by convex potential $F$

We use the ≡ symbol to denote the equivalence of geometric structures. For example, we have $(M, g) \equiv (M, g, {}^{LC}\nabla, {}^{LC}\nabla^* = {}^{LC}\nabla)$.

5.3 Perspectives

We recommend the two recent textbooks [12, 4] for an in-depth coverage of (parametric) information geometry, and the book [22] for a thorough description of some infinite-dimensional statistical models.
We did not report the various coefficients of the metric tensors, Christoffel symbols and skewness tensors for the expected α-geometry of common parametric models like the multivariate Gaussian distributions, the Gamma/Beta distributions, etc. They can be found in [6, 12] and in various articles dealing with less common families of distributions [77]. Although we have focused on the finite parametric setting, information geometry also considers non-parametric families of distributions [57] and quantum information geometry [23].

We have shown that we can always create an information manifold $(M, D)$ from any divergence function $D$. It is therefore important to consider generic classes of divergences in applications, ideally axiomatized and shown to have exhaustive characteristics. Beyond the three main Bregman/Csiszár/Jensen classes (these classes overlap [55]), we may also mention the class of conformal divergences [51, 46], the class of projective divergences [47, 50], etc. Figure 13 illustrates the relationships between the principal classes of distances.

There are many perspectives on information geometry, as attested by the new Springer journal²¹ and the biennial international conference "Geometric Sciences of Information" (GSI) [37, 38, 39].

²¹ 'Information Geometry', https://www.springer.com/mathematics/geometry/journal/41884

Acknowledgments

FN would like to thank the organizers of the Geometry In Machine Learning workshop 2018 (GiMLi²²) for their kind talk invitation, and especially Prof. Søren Hauberg (DTU, Denmark). This document is based on the talk given at that event. I am very thankful to Prof. Stigler (University of Chicago, USA) and Prof. Holevo (Steklov Mathematical Institute, Russia) for providing me feedback on the historical development of the field. I express my thanks to Gaëtan Hadjeres (Sony CSL, Paris) for his careful proofreading and feedback.

²² http://gimli.cc/2018/

Figure 13: Principled classes of distances/divergences (dissimilarity measures, divergences, Csiszár f-divergences, Bregman, total Bregman, scaled Bregman, conformal and scaled conformal divergences, v-divergences, and one-sided or double-sided projective divergences).

Notations

In order to avoid confusion between the different types of dualities, we could have denoted by $D^r$ the dual divergence, $\nabla^*$ the dual/conjugate connection, and $F^\star$ the convex conjugate via the Legendre-Fenchel transformation. In doing so, we express the dual divergence $D^r$ (reverse divergence/reference duality, $(D^r)^r = D$) as:

$$D^r(\theta_1 : \theta_2) := D(\theta_2 : \theta_1),$$

the dual connection $\nabla^*$ (with $(\nabla^*)^* = \nabla$) as:

$$X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla^*_X Z),$$

and the Legendre-Fenchel convex conjugate $F^\star$ (with Moreau biconjugation theorem $(F^\star)^\star = F$):

$$F^\star(\eta) := \sup_{\theta \in \Theta} \{\theta^\top \eta - F(\theta)\}.$$

Then the divergence reference duality yields conjugate divergence-based connections: ${}^{D^r}\nabla = {}^{D}\nabla^*$. In particular, when $D$ is a Bregman divergence $B_F$ with $B_F^{\,r}(\theta_1 : \theta_2) = B_F(\theta_2 : \theta_1) = B_{F^\star}(\nabla F(\theta_1) : \nabla F(\theta_2))$ and ${}^{F^\star}\nabla := {}^{B_F^{\,r}}\nabla$, the connections ${}^{F}\nabla$ and ${}^{F^\star}\nabla$ are both flat, and we have ${}^{F}\nabla^* = {}^{F^\star}\nabla$. However, this makes the equations look a bit clumsy, so we preferred to stick with the same star symbol ∗ for expressing these different dualities, which should be clear from the context.

We use :=: to define a notational convention, and distinguish it from :=. For example, $f(x) := \frac{1}{2}x^2$ defines a function $f(x)$ by its right-hand side expression, and $\sum_{i=1}^{d} x_i :=: x_1 + \ldots + x_d$. The symbol $\stackrel{\Sigma}{=}$ denotes Einstein summation on dummy indices, often used in tensor analysis. For example, $\langle\theta, \eta\rangle \stackrel{\Sigma}{=} \theta^i \eta_i$ is a contraction for $\sum_{i=1}^{D} \theta^i \eta_i$.
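The Legendre-Fenchel conjugate and the dual Bregman identity $B_F(\theta_2 : \theta_1) = B_{F^\star}(\nabla F(\theta_1) : \nabla F(\theta_2))$ stated above can be verified on a one-dimensional example. The sketch assumes the generator $F(\theta) = \exp(\theta)$, whose conjugate is $F^\star(\eta) = \eta\log\eta - \eta$; this choice of generator, the grid search, and the function names are illustrative assumptions.

```python
import math

def F(theta):
    return math.exp(theta)

def grad_F(theta):
    return math.exp(theta)

def F_star(eta):
    """Closed-form convex conjugate of F(theta) = exp(theta), for eta > 0."""
    return eta * math.log(eta) - eta

def grad_F_star(eta):
    return math.log(eta)

def bregman(G, grad_G, a, b):
    """Bregman divergence B_G(a : b) for a scalar argument."""
    return G(a) - G(b) - (a - b) * grad_G(b)

def conjugate_by_grid(eta, lo=-10.0, hi=10.0, steps=200001):
    """Brute-force sup over theta of {theta * eta - F(theta)} on a grid."""
    best = -math.inf
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        best = max(best, theta * eta - F(theta))
    return best

theta1, theta2 = 0.3, 1.2
eta1, eta2 = grad_F(theta1), grad_F(theta2)

# Dual Bregman identity: both printed values should agree.
print(bregman(F, grad_F, theta2, theta1), bregman(F_star, grad_F_star, eta1, eta2))

# The supremum definition should match the closed-form conjugate.
print(conjugate_by_grid(eta1), F_star(eta1))
```

The same numbers also illustrate the canonical divergence listed in the notations: $F(\theta_1) + F^\star(\eta_2) - \theta_1\eta_2$ equals $B_F(\theta_1 : \theta_2)$ when $\eta_2 = \nabla F(\theta_2)$.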
Below is a list of notations we used in this document:

$[D]$: $[D] := \{1, \ldots, D\}$
$\langle\cdot, \cdot\rangle$: inner product
$M_Q(u, v) = \|u - v\|_Q$: Mahalanobis distance $M_Q(u, v) = \sqrt{\sum_{i,j} (u_i - v_i)(u_j - v_j) Q_{ij}}$, $Q \succ 0$
$D(\theta : \theta')$: parameter divergence
$D[p(x) : p'(x)]$: statistical divergence
$D, D^*$: divergence and dual (reverse) divergence
Csiszár divergence $I_f$: $I_f(\theta : \theta') := \sum_{i=1}^{D} \theta_i f\left(\frac{\theta'_i}{\theta_i}\right)$ with $f(1) = 0$
Bregman divergence $B_F$: $B_F(\theta : \theta') := F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta')$
Canonical divergence $A_{F,F^*}$: $A_{F,F^*}(\theta : \eta') = F(\theta) + F^*(\eta') - \theta^\top \eta'$
Bhattacharyya distance: $B_\alpha[p_1 : p_2] = -\log \int_{x \in \mathcal{X}} p_1^{\alpha}(x)\, p_2^{1-\alpha}(x)\, \mathrm{d}\mu(x)$
Jensen/Burbea-Rao divergence: $J_F^{(\alpha)}(\theta_1 : \theta_2) = \alpha F(\theta_1) + (1-\alpha) F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2)$
Chernoff information: $C[P_1, P_2] = -\log \min_{\alpha \in (0,1)} \int_{x \in \mathcal{X}} p_1^{\alpha}(x)\, p_2^{1-\alpha}(x)\, \mathrm{d}\mu(x)$
$F, F^*$: potential functions related by Legendre-Fenchel transformation
$D_\rho(p, q)$: Riemannian distance $D_\rho(p, q) := \int_0^1 \|\gamma'(t)\|_{\gamma(t)}\, \mathrm{d}t$
$B, B^*$: basis, reciprocal basis
$B = \{e_1 = \partial_1, \ldots, e_D = \partial_D\}$: natural basis
$\{\mathrm{d}x^i\}_i$: covector basis (one-forms)
$(v)_B := (v^i)$: contravariant components of vector $v$
$(v)_{B^*} := (v_i)$: covariant components of vector $v$
$u \perp v$: vector $u$ is perpendicular to vector $v$ ($\langle u, v\rangle = 0$)
$\|v\| = \sqrt{\langle v, v\rangle}$: induced norm, length of a vector $v$
$M, S$: manifold, submanifold
$T_p$: tangent plane at $p$
$TM$: tangent bundle $TM = \cup_p T_p = \{(p, v),\ p \in M,\ v \in T_p\}$
$\mathcal{F}(M)$: space of smooth functions on $M$
$\mathfrak{X}(M) = \Gamma(TM)$: space of smooth vector fields on $M$
$vf$: directional derivative of $f$ with respect to vector $v$
$X, Y, Z \in \mathfrak{X}(M)$: vector fields
$g \stackrel{\Sigma}{=} g_{ij}\, \mathrm{d}x^i \otimes \mathrm{d}x^j$: metric tensor (field)
$(U, x)$: local coordinates $x$ in a chart $U$
$\partial_i :=: \frac{\partial}{\partial x^i}$: natural basis vector
$\partial^i :=: \frac{\partial}{\partial x_i}$: natural reciprocal basis vector
$\nabla$: affine connection
$\nabla_X Y$: covariant derivative
$\prod_c^{\nabla}$: parallel transport of vectors along a smooth curve $c$
$\prod_c^{\nabla} v$: parallel transport of $v \in T_{c(0)}$ along a smooth curve $c$
$\gamma, \gamma_\nabla$: geodesic, geodesic with respect to connection $\nabla$
$\Gamma_{ij,l}$: Christoffel symbols of the first kind (functions)
$\Gamma_{ij}^{k}$: Christoffel symbols of the second kind (functions)
$R$: Riemann-Christoffel curvature tensor
$[X, Y]$: Lie bracket $[X, Y](f) = X(Y(f)) - Y(X(f))$, $\forall f \in \mathcal{F}(M)$
$\nabla$-projection: $P_S = \arg\min_{Q \in S} D(\theta(P) : \theta(Q))$
$\nabla^*$-projection: $P_S^* = \arg\min_{Q \in S} D(\theta(Q) : \theta(P))$
$C$: Amari-Chentsov totally symmetric cubic 3-covariant tensor
$\mathcal{P} = \{p_\theta(x)\}_{\theta \in \Theta}$: parametric family of probability distributions
$\mathcal{E}, \mathcal{M}, \Delta_D$: exponential family, mixture family, probability simplex
${}_{\mathcal{P}}I(\theta)$: Fisher Information Matrix (FIM) for a parametric family $\mathcal{P}$
${}_{\mathcal{P}}g$: Fisher information metric tensor field
exponential connection ${}_{\mathcal{P}}^{e}\nabla$: ${}_{\mathcal{P}}^{e}\nabla := E_\theta[(\partial_i \partial_j l)(\partial_k l)]$
mixture connection ${}_{\mathcal{P}}^{m}\nabla$: ${}_{\mathcal{P}}^{m}\nabla := E_\theta[(\partial_i \partial_j l + \partial_i l\, \partial_j l)(\partial_k l)]$
expected skewness tensor $C_{ijk}$: $C_{ijk} := E_\theta[\partial_i l\, \partial_j l\, \partial_k l]$
expected α-connections: ${}_{\mathcal{P}}\Gamma^{\alpha}_{ij,k} := {}_{\mathcal{P}}^{m}\Gamma_{ij,k} - \frac{1+\alpha}{2} C_{ijk} = E_\theta\left[\left(\partial_i \partial_j l + \frac{1-\alpha}{2} \partial_i l\, \partial_j l\right)(\partial_k l)\right]$
$\equiv$: equivalence of geometric structures

References

[1] Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization algorithms on matrix manifolds. Princeton University Press, 2009.
[2] Shun-ichi Amari. Theory of information spaces: A differential-geometrical foundation of statistics. Post. Res. Assoc. Appl. Geom. Mem. Report, 106:64–67, 1980.
[3] Shun-ichi Amari. Differential-geometrical methods in statistics. Lecture Notes on Statistics, 28, 1985. Second edition in 1990.
[4] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences. Springer Japan, 2016.
[5] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry. American Mathematical Society, 2007.
[6] Khadiga A. Arwini and Christopher Terence John Dodson. Information Geometry: Near Randomness and Near Independence. Springer, 2008.
[7] John Ashburner and Karl J. Friston. Diffeomorphic registration using geodesic shooting and Gauss-Newton optimisation. NeuroImage, 55(3):954–967, 2011.
[8] Nihat Ay and Shun-ichi Amari. A novel approach to canonical divergences within information geometry. Entropy, 17(12):8111–8129, 2015.
[9] John C. Baez and Derek K. Wise. Teleparallel gravity as a higher gauge theory.
Communications in Mathematical Physics, 333(1):153–186, 2015.
[10] Ole E. Barndorff-Nielsen, David Roxbee Cox, and Nancy Reid. The role of differential geometry in statistical theory. International Statistical Review, pages 83–96, 1986.
[11] Jean-Daniel Boissonnat, Frank Nielsen, and Richard Nock. Bregman Voronoi diagrams. Discrete & Computational Geometry, 44(2):281–307, 2010.
[12] Ovidiu Calin and Constantin Udriste. Geometric Modeling in Probability and Statistics. Mathematics and Statistics. Springer International Publishing, 2014.
[13] L. Lorne Campbell. An extended Čencov characterization of the information metric. Proceedings of the American Mathematical Society, 98(1):135–141, 1986.
[14] Nikolai N. Chentsov. Statistical decision rules and optimal inference. Monographs, American Mathematical Society, Providence, RI, 1982.
[15] Jean-Pierre Crouzeix. A relationship between the second derivatives of a convex function and of its conjugate. Mathematical Programming, 13(1):364–365, 1977.
[16] Imre Csiszár and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
[17] Anand Ganesh Dabak. A geometry for detection theory. PhD thesis, Rice University, 1993.
[18] D. C. Dowson and Basil V. Landau. The Fréchet distance between multivariate normal distributions. Journal of multivariate analysis, 12(3):450–455, 1982.
[19] Bradley Efron et al. Defining the curvature of a statistical problem (with applications to second order efficiency). The Annals of Statistics, 3(6):1189–1242, 1975.
[20] Shinto Eguchi. Second order efficiency of minimum contrast estimators in a curved exponential family. The Annals of Statistics, pages 793–803, 1983.
[21] Bent Fuglede and Flemming Topsøe. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory (ISIT), page 31. IEEE.
[22] Evarist Giné and Richard Nickl. Mathematical foundations of infinite-dimensional statistical models, volume 40. Cambridge University Press, 2015.
[23] Masahito Hayashi. Quantum information. Springer, 2006.
[24] Jean-Baptiste Hiriart-Urruty and Claude Lemaréchal. Fundamentals of convex analysis. Springer Science & Business Media, 2012.
[25] Harold Hotelling. Spaces of statistical parameters. Bulletin of the American Mathematical Society (AMS), 36:191, 1930.
[26] Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. A, 186(1007):453–461, 1946.
[27] Jiantao Jiao, Thomas A. Courtade, Albert No, Kartik Venkat, and Tsachy Weissman. Information measures: the curious case of the binary alphabet. IEEE Transactions on Information Theory, 60(12):7616–7626, 2014.
[28] Robert E. Kass and Paul W. Vos. Geometrical Foundations of Asymptotic Inference. Wiley-Interscience, July 1997.
[29] Takashi Kurose. On the divergences of 1-conformally flat statistical manifolds. Tohoku Mathematical Journal, Second Series, 46(3):427–433, 1994.
[30] Stefan L. Lauritzen. Statistical manifolds. Differential geometry in statistical inference, 10:163–216, 1987.
[31] Uwe Mühlich. Fundamentals of tensor calculus for engineers with a primer on smooth manifolds, volume 230. Springer, 2017.
[32] Michael Murray and John Rice. Differential geometry and statistics. Number 48 in Monographs on Statistics and Applied Probability. Chapman and Hall, 1993.
[33] Frank Nielsen. Cramér-Rao lower bound and information geometry. In Connected at Infinity II, pages 18–37. Springer, 2013.
[34] Frank Nielsen. Hypothesis testing, information divergence and computational geometry. In GSI, pages 241–248, 2013.
[35] Frank Nielsen. An information-geometric characterization of Chernoff information. IEEE SPL, 20(3):269–272, 2013.
[36] Frank Nielsen. What is... an information projection? Notices of the AMS, 65(10):321–324, 2018.
[37] Frank Nielsen and Frédéric Barbaresco, editors. Geometric Science of Information, volume 8085 of Lecture Notes in Computer Science. Springer, 2013.
[38] Frank Nielsen and Frédéric Barbaresco, editors. Geometric Science of Information, volume 9389 of Lecture Notes in Computer Science. Springer, 2015.
[39] Frank Nielsen and Frédéric Barbaresco, editors. Geometric Science of Information, volume 10589 of Lecture Notes in Computer Science. Springer, 2017.
[40] Frank Nielsen and Sylvain Boltz. The Burbea-Rao and Bhattacharyya centroids. IEEE Transactions on Information Theory, 57(8), 2011.
[41] Frank Nielsen and Vincent Garcia. Statistical exponential families: A digest with flash cards. arXiv preprint arXiv:0911.4863, 2009.
[42] Frank Nielsen and Gaëtan Hadjeres. Monte Carlo information geometry: The dually flat case. CoRR, abs/1803.07225, 2018.
[43] Frank Nielsen and Richard Nock. Sided and symmetrized Bregman centroids. IEEE Transactions on Information Theory, 55(6), 2009.
[44] Frank Nielsen and Richard Nock. Hyperbolic Voronoi diagrams made easy. In International Conference on Computational Science and Its Applications (ICCSA), pages 74–80. IEEE, 2010.
[45] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating f-divergences. IEEE SPL, 21(1):10–13, 2014.
[46] Frank Nielsen and Richard Nock. Total Jensen divergences: Definition, properties and clustering. In ICASSP, pages 2016–2020, 2015.
[47] Frank Nielsen and Richard Nock. Patch matching with polynomial exponential families and projective divergences. In SISAP, pages 109–116, 2016.
[48] Frank Nielsen and Richard Nock. On the geometry of mixtures of prescribed distributions. In IEEE ICASSP, 2018.
[49] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.
[50] Frank Nielsen, Ke Sun, and Stéphane Marchand-Maillet. On Hölder projective divergences. Entropy, 19(3):122, 2017.
[51] Richard Nock, Frank Nielsen, and Shun-ichi Amari. On conformal divergences and their population minimizers. IEEE TIT, 62(1):527–538, 2016.
[52] Katsumi Nomizu and Takeshi Sasaki. Affine differential geometry: Geometry of affine immersions. Cambridge University Press, 1994.
[53] Aleksandr Petrovich Norden. On pairs of conjugate parallel displacements in multidimensional spaces. In Doklady Akademii nauk SSSR, volume 49, pages 1345–1347, 1945. Kazan State University, Comptes rendus de l'Académie des sciences de l'URSS.
[54] Atsumi Ohara and Takashi Tsuchiya. An information geometric approach to polynomial-time interior-point algorithms: Complexity bound via curvature integral. The Institute of Statistical Mathematics, 1055, 2007. Research Memorandum.
[55] María del Carmen Pardo and Igor Vajda. About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE Transactions on Information Theory, 43(4):1288–1293, 1997.
[56] Gia-Thuy Pham, Rémy Boyer, and Frank Nielsen. Computational information geometry for binary classification of high-dimensional random tensors. Entropy, 20(3):203, 2018.
[57] Giovanni Pistone. Nonparametric information geometry. In Geometric Science of Information, pages 5–36. Springer, 2013.
[58] Yu Qiao and Nobuaki Minematsu. A study on invariance of f-divergence and its application to speech recognition. IEEE Transactions on Signal Processing, 58(7):3884–3890, 2010.
[59] C. Radhakrishna Rao. Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in statistics, pages 235–247. Springer, 1992.
[60] Radhakrishna C. Rao. Information and the accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81–91, 1945.
[61] Ferran Reverter and Josep M. Oller. Computing the Rao distance for Gamma distributions. Journal of computational and applied mathematics, 157(1):155–167, 2003.
[62] Claude Elwood Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:623–656, 1948.
[63] Hirohiko Shima. The Geometry of Hessian Structures. World Scientific, 2007.
[64] Anuj Srivastava, Wei Wu, Sebastian Kurtek, Eric Klassen, and James Stephen Marron. Registration of Functional Data Using Fisher-Rao Metric. ArXiv e-prints, March 2011.
[65] Stephen M. Stigler. The epic story of maximum likelihood. Statistical Science, pages 598–620, 2007.
[66] Ke Sun and Frank Nielsen. Relative Fisher information and natural gradient for learning large modular models. In ICML, pages 3289–3298, 2017.
[67] Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.
[68] Igor Vajda. On metric divergences of probability measures. Kybernetika, 45(6):885–900, 2009.
[69] Hông Vân Lê. Statistical manifolds are statistical models. Journal of Geometry, 84(1-2):83–93, 2006.
[70] Hông Vân Lê. The uniqueness of the Fisher metric as information metric. Annals of the Institute of Statistical Mathematics, 69(4):879–896, 2017.
[71] Cédric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.
[72] Abraham Wald. Statistical decision functions. The Annals of Mathematical Statistics, pages 165–205, 1949.
[73] Abraham Wald. Statistical decision functions. Wiley, 1950.
[74] Jun Zhang. Divergence functions and geometric structures they induce on a manifold. In Frank Nielsen, editor, Geometric Theory of Information, pages 1–30. Springer, 2014.
[75] Jun Zhang. On monotone embedding in information geometry. Entropy, 17(7):4485–4499, 2015.
[76] Jun Zhang. Reference duality and representation duality in information geometry. In AIP Conference Proceedings, volume 1641, pages 130–146. AIP, 2015.
[77] Zhenning Zhang, Huafei Sun, and Fengwei Zhong. Information geometry of the power inverse Gaussian distribution. Applied Sciences, 9, 2007.