The exponential family in abstract information theory

DOI: 10.23723/2552/5068



Séminaire Léon Brillouin

The exponential family in abstract information theory

Jan Naudts and Ben Anthonis
Universiteit Antwerpen
Paris, August 2013

Outline

◮ Fisher information
◮ Example
◮ Abstract information theory
◮ Assumptions
◮ Examples of generalized divergences
◮ Deformed exponential families
◮ Conclusions

Fisher information

The standard expression for the Fisher information matrix,

  $I_{k,l}(\theta) = \mathbb{E}_\theta\!\left[\frac{\partial}{\partial\theta^k}\ln p_\theta\,\frac{\partial}{\partial\theta^l}\ln p_\theta\right],$

is the relevant quantity when $p_\theta$ belongs to the exponential family. In the general case a different quantity is needed. It involves the Kullback-Leibler divergence

  $D(p\|p_\theta) = \mathbb{E}_p \ln\frac{p}{p_\theta}.$

Remember that

  $I_{k,l}(\theta) = \left.\frac{\partial^2}{\partial\theta^k\,\partial\theta^l}\, D(p\|p_\theta)\right|_{p=p_\theta}.$

The divergence $D(p\|p_\theta)$ is a measure for the distance between an arbitrary state $p$ and a point $p_\theta$ of the statistical manifold. Let $D(p\|M)$ denote the minimal 'distance'. Let $F_\theta$ denote the 'fiber' of all points $p$ for which $D(p\|M) = D(p\|p_\theta)$ (the minimum contrast leaf, Eguchi 1992).

Definition. The extended Fisher information of a pdf $p$ (not necessarily in $M$) is

  $I_{k,l}(p) = \left.\frac{\partial^2}{\partial\theta^k\,\partial\theta^l}\, D(p\|p_\theta)\right|_{p\in F_\theta}.$

Note that on the manifold the two definitions coincide: $I_{k,l}(\theta) = I_{k,l}(p_\theta)$.

Proposition. $I_{k,l}(p)$ is covariant.

Proof. Let $\eta$ be a function of $\theta$. One calculates

  $\frac{\partial^2}{\partial\theta^k\,\partial\theta^l}\, D(x\|\theta) = \frac{\partial^2}{\partial\eta^m\,\partial\eta^n}\, D(x\|\theta)\,\frac{\partial\eta^m}{\partial\theta^k}\frac{\partial\eta^n}{\partial\theta^l} + \frac{\partial}{\partial\eta^m}\, D(x\|\theta)\,\frac{\partial^2\eta^m}{\partial\theta^k\,\partial\theta^l}.$

The latter term vanishes because $p \in F_\theta$. The former term is manifestly covariant.

Proposition. If $p_\theta$ belongs to the exponential family then $I_{k,l}(p)$ is constant on the fiber $F_\theta$.

Proof. $p$ and $p_\theta$ satisfy the Pythagorean relation

  $D(p\|p_\eta) = D(p\|p_\theta) + D(p_\theta\|p_\eta).$

Hence, taking derivatives w.r.t. $\eta$ only involves $D(p_\theta\|p_\eta)$. Only afterwards put $\eta = \theta$. One concludes that $I_{k,l}(p) = I_{k,l}(p_\theta)$.

This gives a coordinate-independent method to verify that $M$ is not an exponential family!

Example

Suggested by H. Matsuzoe. Consider the manifold of normal distributions $p_{\mu,\sigma}(x)$ with mean $\mu$ and standard deviation $\sigma$,

  $p_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x-\mu)^2/2\sigma^2}.$
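As a quick numerical sanity check (my own addition, not part of the slides), the identity $I(\theta) = \partial^2_\theta D(p\|p_\theta)|_{p=p_\theta}$ can be verified for the one-parameter family of unit-variance normals, whose Fisher information in the mean parameter equals 1:

```python
import math

def kl_normal(mu1, s1, mu2, s2):
    # closed-form Kullback-Leibler divergence between univariate normals
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu, h = 0.7, 1e-4
# central second difference of D(p || p_theta) w.r.t. theta, evaluated at p = p_theta
d2 = (kl_normal(mu, 1.0, mu + h, 1.0)
      - 2.0 * kl_normal(mu, 1.0, mu, 1.0)
      + kl_normal(mu, 1.0, mu - h, 1.0)) / h**2
print(abs(d2 - 1.0) < 1e-6)  # True: matches the Fisher information I(theta) = 1
```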
Consider the submanifold $M$ of normal distributions for which $\mu = \sigma$,

  $p_\theta(x) = \frac{1}{\sqrt{2\pi\theta^2}}\, e^{-(x-\theta)^2/2\theta^2}.$

Question. Is $M$ an exponential family?

Answer. It is known to be curved (Efron, 1975). Let us show that $I(p_{\mu,\sigma})$ is not constant along the fibers $F_\theta$.

The Kullback-Leibler divergence $D(p_{\mu,\sigma}\|p_\theta)$ is minimal when $\theta$ is the positive root of the equation

  $\theta^2 + \mu\theta = \mu^2 + \sigma^2.$

The Fisher information $I(p_{\mu,\sigma})$ equals

  $I(p_{\mu,\sigma}) = \frac{\theta^2 + \mu^2 + \sigma^2}{\theta^4}.$

It is not constant on $F_\theta$: it cannot be written as a function of $\theta$ alone. This implies that $M$ is not an exponential family.

Abstract information theory

Our aims:

◮ Formulate the notion of an exponential family in the context of abstract information theory.
◮ If $M$ is not an exponential family w.r.t. the Kullback-Leibler divergence, can it be exponential w.r.t. some other divergence?

Abstract information theory does not rely on probability theory. We try to bring classical and quantum information theory together in a single formalism.

A generalized divergence is a map $D : X \times M \to [0, +\infty]$ between two different spaces.

◮ A divergence is generically asymmetric in its two arguments. This is an indication that the two arguments play different roles: $X$ is the space of data sets, $M$ is a manifold of models.
◮ $D(x\|m)$ has the meaning of a loss of information when the data set $x$ is replaced by the model point $m$.
◮ In the classical setting $X$ is the space of empirical measures and $M$ is a statistical manifold. One has in this case $M \subset X$.

Assumptions

Let $Q$ denote a linear space of continuous real functions on $X$. Instead of $q(x)$ we write $\langle x|q\rangle$ to stress that $Q$ is not an algebra. In the classical setting $Q$ is the space of random variables. In the quantum setting $Q$ is a space of operators on a Hilbert space.

We consider a class of generalized divergences which can be written in the form

  $D(x\|m) = \xi(m) - \zeta(x) - \langle x|Lm\rangle,$

where $\xi$ and $\zeta$ are real functions and $L : M \to Q$ is a map from the manifold $M$ into the linear space $Q$.
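To make this decomposition concrete, here is a small sketch (my own illustration, assuming finite discrete distributions) checking that the Kullback-Leibler divergence fits the form $D(x\|m) = \xi(m) - \zeta(x) - \langle x|Lm\rangle$ with $\xi = 0$, $\zeta(p) = -\mathbb{E}_p \ln p$ and $Lm = \ln m$:

```python
import math

def kl(p, q):
    # discrete Kullback-Leibler divergence D(p || q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def pair(p, q):
    # <p | q> = E_p q, the pairing of a state with a "random variable"
    return sum(pi * qi for pi, qi in zip(p, q))

p = [0.2, 0.5, 0.3]
m = [0.4, 0.4, 0.2]

xi = 0.0                                       # corrector: xi(m) = 0 for KL
zeta = -pair(p, [math.log(pi) for pi in p])    # zeta(p) = -E_p ln p (the entropy)
L_m = [math.log(mi) for mi in m]               # logarithmic map: (L m)(a) = ln m(a)
decomposed = xi - zeta - pair(p, L_m)
print(abs(decomposed - kl(p, m)) < 1e-12)  # True
```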
We assume in addition a compatibility and a consistency condition (formulated below).

For instance, the quantities $\ln p$ and $\ln p_\theta$ appearing in the Kullback-Leibler divergence

  $D(p\|p_\theta) = \mathbb{E}_p \ln p - \mathbb{E}_p \ln p_\theta = \langle p|\ln p\rangle - \langle p|\ln p_\theta\rangle$

are used as random variables and belong to $Q$. One can define a map $L : M \to Q$ by $Lp_\theta = \ln p_\theta$ and write the divergence as

  $D(p\|p_\theta) = \xi(p_\theta) - \zeta(p) - \langle p|Lp_\theta\rangle$

with $\xi(p_\theta) = 0$, $\zeta(p) = -\mathbb{E}_p \ln p$ and $\langle p|q\rangle = \mathbb{E}_p q$. The quantity $\xi(p_\theta)$ has been called the corrector by Flemming Topsøe; $\zeta(p)$ is the entropy. We call $L$ the logarithmic map.

Compatibility condition. For each $x \in X$ there exists a unique point $m \in M$ which minimizes the divergence $D(x\|m)$. This means that each point of $X$ belongs to some fiber $F_m$.

Consistency condition. Each point $m$ of $M$ can be approached by points $x$ of $F_m$, in the sense that $D(x\|m)$ can be made arbitrarily small.

Example: Bregman divergence

A divergence of the Bregman type is defined by

  $D(x\|m) = \sum_a \left[F(x(a)) - F(m(a)) - (x(a) - m(a))\, f(m(a))\right] = \sum_a \int_{m(a)}^{x(a)} du\,[f(u) - f(m(a))],$

where $F$ is any strictly convex function defined on the interval $(0, 1]$ and $f = F'$ is its derivative.

L.M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Comput. Math. Math. Phys. 7 (1967) 200-217.

In the notations of our abstract information theory one has:

◮ $\langle x|q\rangle = \mathbb{E}_x q$;
◮ $Lm(a) = f(m(a))$;
◮ $\zeta(x) = -\sum_a F(x(a))$;
◮ $\xi(m) = \sum_a m(a) f(m(a)) - \sum_a F(m(a))$.

Note that the Bregman divergence can also be written as

  $D(x\|m) = \sum_a \int_{f(x(a))}^{f(m(a))} du\,[g(u) - x(a)],$

where $g$ is the inverse function of $f$.

N. Murata, T. Takenouchi, T. Kanamori, S. Eguchi, Information geometry of U-Boost and Bregman divergence, Neural Computation 16 (2004) 1437-1481.

In the language of non-extensive statistical physics, $f$ is the deformed logarithm and $g$ is the deformed exponential function. The Kullback-Leibler divergence is recovered by taking $F(u) = u \ln u - 1$. This implies $f(u) = \ln u$ and $g(u) = e^u$.
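A minimal sketch of the Bregman construction (my own, assuming normalized discrete distributions), checking that $F(u) = u\ln u - 1$ with the deformed logarithm $f(u) = \ln u$ reproduces the Kullback-Leibler divergence. (The constant by which $\ln u$ differs from the literal derivative $F'(u) = \ln u + 1$ cancels for normalized distributions.)

```python
import math

def bregman(x, m, F, f):
    # D(x||m) = sum_a [ F(x(a)) - F(m(a)) - (x(a) - m(a)) f(m(a)) ]
    return sum(F(xa) - F(ma) - (xa - ma) * f(ma) for xa, ma in zip(x, m))

def kl(x, m):
    # discrete Kullback-Leibler divergence, for comparison
    return sum(xa * math.log(xa / ma) for xa, ma in zip(x, m))

# the deformed-logarithm pair of the slides: f(u) = ln u, g(u) = exp(u)
F = lambda u: u * math.log(u) - 1.0
f = lambda u: math.log(u)

x = [0.1, 0.6, 0.3]
m = [0.3, 0.3, 0.4]
print(abs(bregman(x, m, F, f) - kl(x, m)) < 1e-12)  # True for normalized x, m
```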
Deformed exponential families

A parametrized exponential family is of the form

  $m_\theta(a) = c(a)\, \exp(-\alpha(\theta) - \theta^k H_k(a))$   (physicists' notation).

This implies a logarithmic map of the form

  $Lm_\theta(a) = \ln\frac{m_\theta(a)}{c(a)} = -\alpha(\theta) - \theta^k H_k(a).$

It is natural to generalize this definition by replacing the exponential function by a deformed exponential function.

J. Naudts, J. Ineq. Pure Appl. Math. 5 (2004) 102.
S. Eguchi, Sugaku Expositions (Amer. Math. Soc.) 19 (2006) 197-216.
P.D. Grünwald and A.Ph. Dawid, Ann. Statist. 32 (2004) 1367-1433.

Can we give a definition of a deformed exponential family
- which relies only on the divergence?
- which does not involve canonical coordinates?
- which has a geometric interpretation?

Lafferty (1999): additive models; $m_\theta$ minimizes $d(m\|m_0) + \theta^k \mathbb{E}_m H_k$. This is a constrained maximum entropy principle.

Our proposal: the extended Fisher information $I(x)$ is constant along the fibers of minimal divergence. This property is a minimum requirement for a family of distributions to be a (deformed) exponential family. It is satisfied for the deformed exponential families based on Bregman-type divergences.

Csiszár-type divergences

A divergence of the Csiszár type is of the form

  $D(x\|m) = \sum_a m(a)\, F\!\left(\frac{x(a)}{m(a)}\right).$

The choice $F(u) = u \ln u$ reproduces Kullback-Leibler.

Example. In the context of non-extensive statistical mechanics both Csiszár- and Bregman-type divergences are being used. Fix the deformation parameter $q \ne 1$, $0 < q < 2$.

Csiszár: $D_q(x\|m) = \frac{1}{q-1} \sum_a x(a) \left[\left(\frac{x(a)}{m(a)}\right)^{q-1} - 1\right],$

Bregman: $D_q(x\|m) = \frac{1}{q-1} \sum_a x(a) \left[m(a)^{1-q} - x(a)^{1-q}\right] + \sum_a \left[m(a) - x(a)\right] m(a)^{1-q}.$

Introduce the q-deformed exponential function

  $\exp_q(u) = [1 + (1-q)u]_+^{1/(1-q)}.$

A distribution of the form

  $m_\theta(a) = \exp_q(-\alpha(\theta) - \theta^k H_k(a))$

is a deformed exponential family relative to the Bregman-type divergence, but not relative to the Csiszár-type divergence. In the latter case the extended Fisher information is given by

  $I_{k,l}(x) = z(x)\, \frac{\partial^2\alpha}{\partial\theta^k\,\partial\theta^l}$   with   $\frac{\partial\alpha}{\partial\theta^k} = -\frac{1}{z(\theta)} \sum_a x(a)^q H_k(a)$   and   $z(x) = \sum_a x(a)^q.$
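A small numerical sketch (my own addition) of the q-deformed exponential and of the normalization sum $z(x)$: $\exp_q$ tends to the ordinary exponential as $q \to 1$, and $z(x) = 1$ exactly when $q = 1$ for a normalized distribution:

```python
import math

def exp_q(u, q):
    # q-deformed exponential: [1 + (1-q) u]_+ ** (1/(1-q)); q = 1 gives exp(u)
    if q == 1.0:
        return math.exp(u)
    base = 1.0 + (1.0 - q) * u
    return base ** (1.0 / (1.0 - q)) if base > 0.0 else 0.0

def z(x, q):
    # z(x) = sum_a x(a)^q
    return sum(xa ** q for xa in x)

x = [0.2, 0.5, 0.3]
print(abs(exp_q(-0.4, 1.0001) - math.exp(-0.4)) < 1e-4)  # True: exp_q -> exp as q -> 1
print(abs(z(x, 1.0) - 1.0) < 1e-12)                      # True: z(x) = 1 when q = 1
print(abs(z(x, 1.5) - 1.0) > 0.1)                        # True: z deviates from 1 when q != 1
```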
If $q = 1$ then $z(x) = 1$ and the extended Fisher information is constant along $F_\theta$. If $q \ne 1$ it is generically not constant along $F_\theta$.

Conclusions

◮ We consider Fisher information not only on the statistical manifold of model states but also for empirical measures.
◮ If the model is an exponential family then the Fisher information is constant along fibers of minimal divergence.
◮ We extend the notion of an exponential family to an abstract setting of information theory.
◮ In the abstract setting the definition of a generalized exponential family depends only on the choice of the divergence.