Second-order Optimization over the Multivariate Gaussian Distribution

28/10/2015
Publication: GSI2015
OAI: oai:www.see.asso.fr:11784:14292
DOI: 10.23723/11784/14292

Abstract

We discuss the optimization of the stochastic relaxation of a real-valued function, i.e., we introduce a new search space given by a statistical model and we optimize the expected value of the original function with respect to a distribution in the model. From the point of view of Information Geometry, statistical models are Riemannian manifolds of distributions endowed with the Fisher information metric, thus the stochastic relaxation can be seen as a continuous optimization problem defined over a differentiable manifold. In this paper we explore the second-order geometry of the exponential family, with applications to the multivariate Gaussian distributions, to generalize second-order optimization methods. Besides the Riemannian Hessian, we introduce the exponential and the mixture Hessians, which come from the dually flat structure of an exponential family. This allows us to obtain different Taylor formulæ according to the choice of the Hessian and of the geodesic used, and thus different approaches to the design of second-order methods, such as the Newton method.
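
As a concrete illustration of the stochastic relaxation just described, here is a minimal Monte Carlo sketch, not taken from the paper: the quadratic test function, the fixed isotropic covariance, and the helper names are illustrative assumptions, and the update is a plain first-order gradient step on the mean rather than one of the second-order methods developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # illustrative test function (an assumption, not from the paper): a shifted quadratic
    return np.sum((x - 1.0) ** 2, axis=-1)

def relaxed_objective_and_grad(mu, sigma2, n_samples=10_000):
    """Monte Carlo estimate of F(mu) = E_{N(mu, sigma2*I)}[f(X)] and of its
    gradient in mu, via the score-function identity
    grad F(mu) = E[f(X) * grad_mu log p(X; mu)]."""
    x = rng.normal(mu, np.sqrt(sigma2), size=(n_samples, mu.size))
    fx = f(x)
    score = (x - mu) / sigma2            # grad_mu log N(x; mu, sigma2*I)
    return fx.mean(), (fx[:, None] * score).mean(axis=0)

mu = np.zeros(2)
for _ in range(200):                     # plain gradient descent on the relaxation
    _, g = relaxed_objective_and_grad(mu, sigma2=0.5)
    mu -= 0.05 * g
print(mu)                                # drifts toward the minimizer of f, here (1, 1)
```

The score-function identity used here, grad F(mu) = E[f(X) grad_mu log p(X; mu)], is what makes the relaxed objective differentiable in the distribution parameters even when f itself is not.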

Authors

Luigi Malagò, Giovanni Pistone

License

Creative Commons Attribution-ShareAlike 4.0 International


GSI2015, 2nd Conference on Geometric Science of Information, 28-30 Oct 2015, École Polytechnique, Paris-Saclay

Second-order Optimization over the Multivariate Gaussian Distribution

Luigi Malagò¹, Giovanni Pistone²
¹ Shinshu University, JP & INRIA Saclay, FR
² de Castro Statistics, Collegio Carlo Alberto, Moncalieri, IT

Introduction

• This is the presentation by Giovanni of the paper with the same title in the Proceedings.
• Unfortunately, Giovanni is the less qualified of the two authors to present this specific application of Information Geometry, his field of expertise being non-parametric Information Geometry and its applications in Probability and Statistical Physics. Luigi is currently working in Japan and could not make it.
• Of the two of us, Luigi is responsible for the idea of using gradient methods and, later, Newton methods in black-box optimization. Our collaboration started with the preparation of the FOGA 2011 paper:
• L. Malagò, M. Matteucci, and G. Pistone. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, FOGA '11, pages 230–242, New York, NY, USA, 2011. ACM.

Summary

1. Geometry of the Exponential Family
2. Second-Order Optimization: The Newton Method
3. Applications to the Gaussian Distribution
4. Discussion and Future Work

• A short introduction to Taylor formulæ on Gaussian exponential families is provided. The binary case has been previously discussed in:
• L. Malagò and G. Pistone. Combinatorial optimization with information geometry: Newton method. Entropy, 16:4260–4289, 2014.
• Riemannian Newton methods are discussed in a session of this conference; cf.
• P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ, 2008. With a foreword by Paul Van Dooren.
• The focus of this short presentation is on a specific framework for Information Geometry that we call the statistical bundle.

Hilbert vs Tangent vs Statistical Bundle

• S. Amari. Dual connections on the Hilbert bundles of statistical models. In Geometrization of Statistical Theory (Lancaster, 1987), pages 123–151, Lancaster, 1987. ULDM Publ.
• R. E. Kass and P. W. Vos. Geometrical Foundations of Asymptotic Inference. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York, 1997.

Statistical Bundle: Gaussian case

• $H_\alpha(x)$, $x \in \mathbb{R}^m$, are the Hermite polynomials of orders 1 and 2.
• E.g., for $m = 3$: $H_{010}(x) = x_2$, $H_{011}(x) = x_2 x_3$, $H_{020}(x) = x_2^2 - 1$.
• The Gaussian model with sufficient statistics $B = \{X_1, \dots, X_n\} \subset \{H_\alpha : |\alpha| = 1, 2\}$ is
  $$N = \left\{ p(x;\theta) = \exp\Big( \sum_{j=1}^n \theta_j X_j - \psi(\theta) \Big) \right\}$$
• The fibers are $V_p = \operatorname{Span}\big( X_j - \mathbb{E}_p[X_j] \mid j = 1, \dots, n \big)$.
• The statistical bundle is $SN = \{ (p, U) \mid p \in N,\ U \in V_p \}$.
• Each $U \in V_p$, $p \in N$, is a polynomial of degree up to 2, and $t \mapsto \mathbb{E}_q\big[ e^{tU} \big]$ is finite around 0 for every $q \in N$.
• Every polynomial $X$ belongs to $\bigcap_{q \in N} L^2(q)$.

Parallel transports

Definition
• e-transport (subscript: starting fiber, superscript: arrival fiber): ${}^{\mathrm e}\mathrm{U}^q_p : V_p \ni U \mapsto U - \mathbb{E}_q[U] \in V_q$.
• m-transport: ${}^{\mathrm m}\mathrm{U}^p_q : V_q \to V_p$ is defined by duality: for each $U \in V_p$ and $V \in V_q$,
  $$\big\langle U,\ {}^{\mathrm m}\mathrm{U}^p_q V \big\rangle_p = \big\langle {}^{\mathrm e}\mathrm{U}^q_p U,\ V \big\rangle_q .$$

Properties
• ${}^{\mathrm e}\mathrm{U}^r_q \, {}^{\mathrm e}\mathrm{U}^q_p = {}^{\mathrm e}\mathrm{U}^r_p$
• ${}^{\mathrm m}\mathrm{U}^r_q \, {}^{\mathrm m}\mathrm{U}^q_p = {}^{\mathrm m}\mathrm{U}^r_p$
• $\big\langle {}^{\mathrm e}\mathrm{U}^q_p U,\ {}^{\mathrm m}\mathrm{U}^q_p V \big\rangle_q = \big\langle U, V \big\rangle_p$
• If $\frac{q}{p} V \in L^2(p)$, then ${}^{\mathrm m}\mathrm{U}^p_q V$ is its orthogonal projection onto $V_p$.
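As a numerical sanity check of the transport properties listed above, here is a small sketch, not from the slides, for the univariate case $m = 1$ with sufficient statistics $x$ and $x^2$ (which span the same fibers as the Hermite statistics of orders 1 and 2). Fiber elements are stored as polynomial coefficients, expectations are computed from exact Gaussian moments, the m-transport is built as the stated orthogonal projection, and the duality $\langle {}^{\mathrm e}\mathrm U^q_p U, {}^{\mathrm m}\mathrm U^q_p V\rangle_q = \langle U, V\rangle_p$ is checked. All function names are my own.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# E[x^k], k = 0..4, for N(mu, s2) (standard Gaussian moment formulas)
def moments(mu, s2):
    return np.array([1.0, mu, mu**2 + s2,
                     mu**3 + 3*mu*s2,
                     mu**4 + 6*mu**2*s2 + 3*s2**2])

# A fiber element a1*(x - E[x]) + a2*(x^2 - E[x^2]) at the law with moments `mom`,
# stored as the coefficients (c0, c1, c2) of the polynomial c0 + c1*x + c2*x^2
def fiber_element(a, mom):
    return np.array([-a[0]*mom[1] - a[1]*mom[2], a[0], a[1]])

def expect(poly, mom):                  # E[poly(x)] under the given moments
    return float(np.dot(poly, mom[:len(poly)]))

def inner(pa, pb, mom):                 # <A, B> = E[A(x) B(x)]
    return expect(P.polymul(pa, pb), mom)

mom_p, mom_q = moments(0.0, 1.0), moments(1.0, 2.0)   # p = N(0, 1), q = N(1, 2)
U = fiber_element([0.7, -0.3], mom_p)                  # U, V in the fiber V_p
V = fiber_element([1.1,  0.4], mom_p)

# e-transport of U into V_q: subtract E_q[U]
eU = U - np.array([expect(U, mom_q), 0.0, 0.0])

# m-transport of V into V_q: orthogonal projection of (p/q) V onto V_q; its
# coordinates c in the basis {x - E_q[x], x^2 - E_q[x^2]} solve
# Cov_q(T) c = E_q[(p/q) V (T - E_q[T])] = E_p[V (T - E_q[T])]
basis_q = [fiber_element(e, mom_q) for e in np.eye(2)]
gram_q = np.array([[inner(bi, bj, mom_q) for bj in basis_q] for bi in basis_q])
rhs = np.array([inner(V, bi, mom_p) for bi in basis_q])
c = np.linalg.solve(gram_q, rhs)
mV = c[0]*basis_q[0] + c[1]*basis_q[1]

# duality: <eU, mV>_q equals <U, V>_p
print(inner(eU, mV, mom_q), inner(U, V, mom_p))
```

The two printed numbers coincide up to rounding, as the third property above predicts.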
Parallel transports in coordinates I

We define on the statistical bundle $SN$ a system of moving frames.

1. The exponential frame of the fiber $S_pN = V_p$ is the vector basis $B_p = \{ X_j - \mathbb{E}_p[X_j] \mid j = 1, \dots, n \}$.
2. Each element $U \in V_p$ is uniquely written as
   $$U = \sum_{j=1}^n \alpha_j(U)\,(X_j - \mathbb{E}_p[X_j]) = \alpha(U)^T (\mathbf{X} - \mathbb{E}_p[\mathbf{X}]).$$
3. The expression in the exponential frame of the scalar product is the Fisher information matrix:
   $${}^{\mathrm e}I_{ij}(p) = \big\langle X_i - \mathbb{E}_p[X_i],\ X_j - \mathbb{E}_p[X_j] \big\rangle_p = \operatorname{Cov}_p(X_i, X_j) = \frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\psi(\theta).$$
4. $U \mapsto \alpha(U) = {}^{\mathrm e}I(p)^{-1} \operatorname{Cov}_p(\mathbf{X}, U)$.

Parallel transports in coordinates II

5. The mixture frame of the fiber $S_pN = V_p$ is
   $${}^{\mathrm e}I(p)^{-1} B_p = \left\{ \sum_{i=1}^n {}^{\mathrm e}I^{ij}(p)\,(X_i - \mathbb{E}_p[X_i]) \ \middle|\ j = 1, \dots, n \right\},$$
   where ${}^{\mathrm e}I^{ij}(p)$ are the entries of ${}^{\mathrm e}I(p)^{-1}$.
6. Each element $V \in V_p$ is uniquely written as
   $$V = \sum_{j=1}^n \beta_j(V) \sum_{i=1}^n {}^{\mathrm e}I^{ij}(p)\,(X_i - \mathbb{E}_p[X_i]) = \beta(V)^T\, {}^{\mathrm e}I(p)^{-1} (\mathbf{X} - \mathbb{E}_p[\mathbf{X}]).$$
7. The coordinates in the mixture basis are given in matrix form by $V \mapsto \beta(V) = \operatorname{Cov}_p(\mathbf{X}, V)$.
8. The matrix ${}^{\mathrm m}I(p) = {}^{\mathrm e}I(p)^{-1}$ is the matrix expression of the metric in the mixture frame:
   $$\alpha(U) = {}^{\mathrm m}I(p)\,\beta(U), \qquad \beta(U) = {}^{\mathrm e}I(p)\,\alpha(U).$$

Parallel transports in the moving frames

• The e-transport acts on the exponential coordinates as the identity, $\alpha\big( {}^{\mathrm e}\mathrm{U}^q_p U \big) = \alpha(U)$.
• Equivalently, ${}^{\mathrm e}I(q)^{-1} \operatorname{Cov}_q(\mathbf{X}, U) = {}^{\mathrm e}I(p)^{-1} \operatorname{Cov}_p(\mathbf{X}, U)$.
• The m-transport acts on the mixture coordinates as the identity, $\beta\big( {}^{\mathrm m}\mathrm{U}^q_p V \big) = \beta(V)$.

REMARK. A section, or vector field, of the statistical bundle is a mapping $F : N \ni p \mapsto F(p) \in V_p$. As there are two distinguished charts on the model (exponential $p \mapsto \theta(p)$ and mixture $p \mapsto \eta(p) = \nabla\psi(\theta(p))$) and two distinguished frames on each fiber, there are in general four distinguished expressions of each section.

Score and statistical gradient

Definition. Let $t \mapsto p(t)$ be a curve in the model $N$ and $f : N \to \mathbb{R}$.
• The score of the curve $t \mapsto p(t)$ is a curve in the statistical bundle $t \mapsto (p(t), Dp(t)) \in SN$ such that for all $X \in \operatorname{Span}(1, X_1, \dots, X_n)$ it holds
  $$\frac{d}{dt}\,\mathbb{E}_{p(t)}[X] = \big\langle X - \mathbb{E}_{p(t)}[X],\ Dp(t) \big\rangle_{p(t)}.$$
• Usually, $Dp(t) = \dot p(t)/p(t) = \frac{d}{dt}\log p(t)$.
• The statistical gradient of $f : N \to \mathbb{R}$ is a section of the statistical bundle, $p \mapsto (p, \operatorname{grad} f(p)) \in SN$, such that for each regular curve $t \mapsto p(t)$ it holds
  $$\frac{d}{dt}\, f(p(t)) = \big\langle \operatorname{grad} f(p(t)),\ Dp(t) \big\rangle_{p(t)}.$$

Score and statistical gradient in coordinates

• Let the regular curve $t \mapsto p(t)$ be expressed in the exponential coordinates by $t \mapsto \theta(t)$. The score $t \mapsto Dp(t)$ is expressed in the exponential frame by $t \mapsto \dot\theta(t)$, that is,
  $$Dp(t) = \sum_{j=1}^n \dot\theta_j(t)\Big( X_j - \frac{\partial}{\partial\theta_j}\psi(\theta(t)) \Big).$$
• Let the regular curve $t \mapsto p(t)$ be expressed in the mixture coordinates by $t \mapsto \eta(t) = \nabla\psi(\theta(t))$. The score is expressed in the mixture frame as $t \mapsto \dot\eta(t)$.
• Let $X$ be a random variable which belongs to every $L^2(p)$, $p \in N$, and let $f(p) = \mathbb{E}_p[X]$. Then $p \mapsto \operatorname{grad} f(p)$ exists and equals the orthogonal projection of $X$ onto $V_p$, namely
  $$\operatorname{grad}\big( p \mapsto \mathbb{E}_p[X] \big) = \big( {}^{\mathrm e}I(p)^{-1} \operatorname{Cov}_p(\mathbf{X}, X) \big)^T (\mathbf{X} - \mathbb{E}_p[\mathbf{X}]), \qquad \mathbf{X} = (X_1, \dots, X_n).$$
• The expressions of $\operatorname{grad} f$ are of interest in optimization.

Taylor formula in the Statistical Bundle

• For a curve $t \mapsto p(t) \in N$ connecting $p = p(0)$ to $q = p(1)$ and a function $f : N \to \mathbb{R}$, the Taylor formula is
  $$f(q) = f(p) + \frac{d}{dt} f(p(t))\Big|_{t=0} + \frac12\, \frac{d^2}{dt^2} f(p(t))\Big|_{t=0} + R_2(f, p, q).$$
• The first derivative is computed with the statistical gradient and the score:
  $$f(q) = f(p) + \big\langle \operatorname{grad} f(p(0)),\ Dp(0) \big\rangle_p + \frac12\, \frac{d}{dt} \big\langle \operatorname{grad} f(p(t)),\ Dp(t) \big\rangle_{p(t)}\Big|_{t=0} + R_2(f, p, q).$$

Acceleration and Hessian

$$\frac{d}{dt} \big\langle \operatorname{grad} f(p(t)),\ Dp(t) \big\rangle_{p(t)}\Big|_{t=0} = \frac{d}{dt} \Big\langle {}^{\mathrm e}\mathrm{U}^{p(0)}_{p(t)} \operatorname{grad} f(p(t)),\ {}^{\mathrm m}\mathrm{U}^{p(0)}_{p(t)} Dp(t) \Big\rangle_{p(0)}\Big|_{t=0}$$
$$= \frac{d}{dt} \Big\langle {}^{\mathrm m}\mathrm{U}^{p(0)}_{p(t)} \operatorname{grad} f(p(t)),\ {}^{\mathrm e}\mathrm{U}^{p(0)}_{p(t)} Dp(t) \Big\rangle_{p(0)}\Big|_{t=0}$$
$$= \frac{d}{dt}\, \mathbb{E}_{p(0)}\!\left[ \frac{p(t)}{p(0)}\, \operatorname{grad} f(p(t))\, Dp(t) \right]\Big|_{t=0}$$
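Item 3 above identifies the Fisher information matrix ${}^{\mathrm e}I(p)$ both with the covariance of the sufficient statistics and with the Hessian of the cumulant function $\psi$. A quick numerical check for the univariate case is sketched below; it is illustrative rather than taken from the slides, and it uses the sufficient statistics $x, x^2$ with Lebesgue reference measure (the slides use Hermite statistics, which describe the same Gaussian model), a closed-form $\psi$, and central finite differences.

```python
import numpy as np

# Univariate Gaussian in the natural parameters theta = (t1, t2), t2 < 0, with
# sufficient statistics T(x) = (x, x^2) and Lebesgue reference measure (an
# assumption made for this sketch).  psi is the cumulant (log-partition) function.
def psi(t1, t2):
    return -t1**2 / (4 * t2) + 0.5 * np.log(np.pi / (-t2))

def mean_var(t1, t2):
    s2 = -1.0 / (2 * t2)              # sigma^2
    return t1 * s2, s2                # mu, sigma^2

def fisher_as_covariance(t1, t2):
    # e_I(p)_ij = Cov_p(T_i, T_j), from the Gaussian moment formulas
    mu, s2 = mean_var(t1, t2)
    return np.array([[s2,          2 * mu * s2],
                     [2 * mu * s2, 4 * mu**2 * s2 + 2 * s2**2]])

def fisher_as_hessian(t1, t2, h=1e-5):
    # e_I(p)_ij = d^2 psi / (d theta_i d theta_j), by central finite differences
    t, H = np.array([t1, t2]), np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
            H[i, j] = (psi(*(t + ei + ej)) - psi(*(t + ei - ej))
                       - psi(*(t - ei + ej)) + psi(*(t - ei - ej))) / (4 * h**2)
    return H

t1, t2 = 1.0, -0.25                   # i.e. mu = 2, sigma^2 = 2
print(fisher_as_covariance(t1, t2))
print(fisher_as_hessian(t1, t2))      # the two matrices agree up to finite-difference error
```

The same matrix is what enters the coordinate expression of the statistical gradient above, ${}^{\mathrm e}I(p)^{-1}\operatorname{Cov}_p(\mathbf X, X)$.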
Accelerations

• Let us define the acceleration at $t$ of a curve $t \mapsto p(t) \in N$. The velocity is defined to be
  $$t \mapsto (p(t), Dp(t)) = \Big( p(t),\ \frac{d}{dt}\log p(t) \Big) \in SN.$$
• The exponential acceleration is ${}^{\mathrm e}D^2 p(t) = \frac{d}{ds}\, {}^{\mathrm e}\mathrm{U}^{p(t)}_{p(s)} Dp(s)\Big|_{s=t}$.
• The mixture acceleration is ${}^{\mathrm m}D^2 p(t) = \frac{d}{ds}\, {}^{\mathrm m}\mathrm{U}^{p(t)}_{p(s)} Dp(s)\Big|_{s=t}$.
• The Riemannian acceleration is ${}^{0}D^2 p(t) = \frac12 \big( {}^{\mathrm e}D^2 p(t) + {}^{\mathrm m}D^2 p(t) \big)$.

Covariant derivatives I

• $p \mapsto (p, F(p))$ and $p \mapsto (p, G(p))$ are sections of $SN$, with expressions in the moving frames
  $$F(p) = \sum_{j=1}^n \alpha_j(p)\,(X_j - \mathbb{E}_p[X_j]) = \sum_{j=1}^n \beta_j(p) \sum_{i=1}^n {}^{\mathrm e}I^{ij}(p)\,(X_i - \mathbb{E}_p[X_i]),$$
  $$G(p) = \sum_{j=1}^n \gamma_j(p)\,(X_j - \mathbb{E}_p[X_j]) = \sum_{j=1}^n \delta_j(p) \sum_{i=1}^n {}^{\mathrm e}I^{ij}(p)\,(X_i - \mathbb{E}_p[X_i]).$$

Covariant derivatives II

• The exponential covariant derivative is the vector field $p \mapsto (p, {}^{\mathrm e}D_G F(p))$, where
  $${}^{\mathrm e}D_G F(p) = \sum_{j=1}^n \big\langle \operatorname{grad} \alpha_j(p),\ G(p) \big\rangle_p\, (X_j - \mathbb{E}_p[X_j]) = \sum_{j=1}^n \sum_{i=1}^n \gamma_i(p)\, \partial_i \alpha_j(p)\, (X_j - \mathbb{E}_p[X_j]),$$
  with $\partial_i = \partial/\partial\theta_i$.
• The mixture covariant derivative is the vector field $p \mapsto (p, {}^{\mathrm m}D_G F(p))$, where
  $${}^{\mathrm m}D_G F(p) = \sum_{j=1}^n \big\langle \operatorname{grad} \beta_j(p),\ G(p) \big\rangle_p \sum_{i=1}^n {}^{\mathrm e}I^{ij}(p)\,(X_i - \mathbb{E}_p[X_i]) = \sum_{j=1}^n \sum_{k=1}^n \gamma_k(p)\, \big\langle \operatorname{grad} \beta_j(p),\ X_k - \mathbb{E}_p[X_k] \big\rangle_p \sum_{i=1}^n {}^{\mathrm e}I^{ij}(p)\,(X_i - \mathbb{E}_p[X_i]).$$

Covariant derivatives III

• The Riemannian covariant derivative is the vector field $p \mapsto (p, {}^{0}D_G F(p))$ with
  $${}^{0}D_G F = \tfrac12 \big( {}^{\mathrm e}D_G F + {}^{\mathrm m}D_G F \big).$$

Hessians

• Let $f : N \to \mathbb{R}$ be a mapping with gradient $p \mapsto (p, \operatorname{grad} f(p))$. Let $p \mapsto (p, G(p))$ be a vector field (section) of $SN$.
• The exponential Hessian of $f$ is the vector field $p \mapsto (p, {}^{\mathrm e}\operatorname{Hess}_G f(p))$, with ${}^{\mathrm e}\operatorname{Hess}_G f(p) = {}^{\mathrm e}D_G \operatorname{grad} f(p)$.
• The mixture Hessian of $f$ is the vector field $p \mapsto (p, {}^{\mathrm m}\operatorname{Hess}_G f(p))$, with ${}^{\mathrm m}\operatorname{Hess}_G f(p) = {}^{\mathrm m}D_G \operatorname{grad} f(p)$.
• The Riemannian Hessian of $f$ is the vector field $p \mapsto (p, {}^{0}\operatorname{Hess}_G f(p))$, with ${}^{0}\operatorname{Hess}_G f(p) = {}^{0}D_G \operatorname{grad} f(p)$.

Taylor's formulæ I

1. $t \mapsto p(t)$ is the mixture geodesic connecting $p = p(0)$ to $q = p(1)$:
   $$f(q) = f(p) + \big\langle \operatorname{grad} f(p),\ Dp(0) \big\rangle_p + \frac12 \big\langle {}^{\mathrm e}\operatorname{Hess}_{Dp(0)} f(p),\ Dp(0) \big\rangle_p + R^+_2(p, q),$$
   $$R^+_2(p, q) = \int_0^1 dt\,(1 - t)\, \big\langle {}^{\mathrm e}\operatorname{Hess}_{Dp(t)} f(p(t)),\ Dp(t) \big\rangle_{p(t)} - \frac12 \big\langle {}^{\mathrm e}\operatorname{Hess}_{Dp(0)} f(p),\ Dp(0) \big\rangle_p.$$

Taylor's formulæ II

2. $t \mapsto p(t)$ is the exponential geodesic connecting $p = p(0)$ to $q = p(1)$:
   $$f(q) = f(p) + \big\langle \operatorname{grad} f(p),\ Dp(0) \big\rangle_p + \frac12 \big\langle {}^{\mathrm m}\operatorname{Hess}_{Dp(0)} f(p),\ Dp(0) \big\rangle_p + R^-_2(p, q),$$
   $$R^-_2(p, q) = \int_0^1 dt\,(1 - t)\, \big\langle {}^{\mathrm m}\operatorname{Hess}_{Dp(t)} f(p(t)),\ Dp(t) \big\rangle_{p(t)} - \frac12 \big\langle {}^{\mathrm m}\operatorname{Hess}_{Dp(0)} f(p),\ Dp(0) \big\rangle_p.$$

Taylor's formulæ III

3. $t \mapsto p(t)$ is the Riemannian geodesic connecting $p = p(0)$ to $q = p(1)$:
   $$f(q) = f(p) + \big\langle \operatorname{grad} f(p),\ Dp(0) \big\rangle_p + \frac12 \big\langle {}^{0}\operatorname{Hess}_{Dp(0)} f(p),\ Dp(0) \big\rangle_p + R^0_2(p, q),$$
   where
   $$R^0_2(p, q) = \int_0^1 dt\,(1 - t)\, \big\langle {}^{0}\operatorname{Hess}_{Dp(t)} f(p(t)),\ Dp(t) \big\rangle_{p(t)} - \frac12 \big\langle {}^{0}\operatorname{Hess}_{Dp(0)} f(p),\ Dp(0) \big\rangle_p.$$

Newton step

• Let $t \mapsto p(t)$ be the exponential geodesic starting at $p = p(0)$ with $Dp(0) = U$.
• Assume $U$ is a critical point of
  $$V_{p(0)} \ni U \mapsto f(p) + \big\langle \operatorname{grad} f(p(0)),\ U \big\rangle_{p(0)} + \frac12 \big\langle {}^{\mathrm m}\operatorname{Hess}_U f(p),\ U \big\rangle_{p(0)},$$
  that is, $\operatorname{grad} f(p(0)) + {}^{\mathrm m}\operatorname{Hess}_U f(p) = 0$.
• If $q = p(1)$, then
  $$f(q) = f(p) - \frac12 \big\langle {}^{0}\operatorname{Hess}_U f(p),\ U \big\rangle_p + R^0_2(p, q).$$

Conclusion and work in progress

• Comparisons between the Riemannian Newton method, e.g., Absil et al., and the statistical bundle setup are being performed.
• In particular, the use of alternative Hessians is of special interest.
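
To make the Taylor formulæ and the Newton step above concrete, here is a small numerical sketch, again mine rather than from the slides. It works in the exponential chart of the univariate Gaussian family, under the working assumption that along an exponential geodesic $\theta_0 + t\Delta$ the term $\langle {}^{\mathrm m}\operatorname{Hess}_{Dp(0)} f(p), Dp(0)\rangle_p$ coincides with the ordinary second derivative of $\theta \mapsto \mathbb{E}_\theta[f]$; the test function $f(x) = (x - 3)^2$, the base point, and the direction are illustrative choices.

```python
import numpy as np

# f(x) = (x - 3)^2: its Gaussian relaxation has the closed form
# F(theta) = E_theta[f] = (mu - 3)^2 + sigma^2, written here in the natural
# parameters theta = (t1, t2), with mu = -t1/(2 t2) and sigma^2 = -1/(2 t2).
def F(t):
    s2 = -1.0 / (2 * t[1])
    mu = t[0] * s2
    return (mu - 3.0) ** 2 + s2

def grad_hess(fun, t, h=1e-4):
    # gradient and Hessian of fun in the exponential chart, by central differences
    g, H, I = np.zeros(2), np.zeros((2, 2)), np.eye(2)
    for i in range(2):
        g[i] = (fun(t + h * I[i]) - fun(t - h * I[i])) / (2 * h)
        for j in range(2):
            H[i, j] = (fun(t + h * I[i] + h * I[j]) - fun(t + h * I[i] - h * I[j])
                       - fun(t - h * I[i] + h * I[j]) + fun(t - h * I[i] - h * I[j])) / (4 * h * h)
    return g, H

t0 = np.array([0.0, -0.5])            # the standard Gaussian N(0, 1)
g, H = grad_hess(F, t0)

# Taylor formula along the exponential geodesic t0 + t*d: the remainder of the
# second-order model is third order in the step, so it shrinks by roughly 8x
# each time the step is halved.
d = np.array([0.4, -0.2])
for t in (1.0, 0.5, 0.25):
    model = F(t0) + t * (g @ d) + 0.5 * t**2 * (d @ H @ d)
    print(t, F(t0 + t * d) - model)

# Exponential-frame coordinates of the Newton increment U, i.e. the critical
# point of the quadratic model on the Newton-step slide: grad f + mHess_U f = 0
# becomes H @ u = -g in this chart.  (At this base point H happens to be
# indefinite, so the critical point is a saddle of the model.)
print(np.linalg.solve(H, -g))
```

The printed remainders decrease by roughly a factor of eight per halving of the step, consistent with a third-order remainder, and the last line is the exponential-frame coordinate vector of the Newton increment $U$.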