The information geometry of mirror descent

28/10/2015
Publication GSI2015
OAI : oai:www.see.asso.fr:11784:14293

Abstract

We prove the equivalence of two online learning algorithms, mirror descent and natural gradient descent. Both mirror descent and natural gradient descent are generalizations of online gradient descent when the parameter of interest lies on a non-Euclidean manifold. Natural gradient descent selects the steepest descent direction along a Riemannian manifold by multiplying the standard gradient by the inverse of the metric tensor. Mirror descent induces non-Euclidean structure by solving iterative optimization problems using different proximity functions. In this paper, we prove that mirror descent induced by a Bregman divergence proximity function is equivalent to the natural gradient descent algorithm on the Riemannian manifold in the dual coordinate system. We use techniques from convex analysis and connections between Riemannian manifolds, Bregman divergences and convexity to prove this result. This equivalence between natural gradient descent and mirror descent implies that (1) mirror descent is the steepest descent direction along the Riemannian manifold corresponding to the choice of Bregman divergence and (2) mirror descent with log-likelihood loss applied to parameter estimation in exponential families asymptotically achieves the classical Cramér-Rao lower bound.
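For intuition about the equivalence claimed above, here is a short worked derivation (standard convex analysis, not quoted from the paper; it assumes the unconstrained case $\Theta = \mathbb{R}^p$ and a strictly convex, differentiable mirror map $G$). Writing the mirror descent step with Bregman proximity $B_G(\theta, \theta') = G(\theta) - G(\theta') - \langle \nabla G(\theta'), \theta - \theta' \rangle$ as
$$\theta_{t+1} = \arg\min_{\theta} \left\{ \langle \theta, \nabla f(\theta_t) \rangle + \tfrac{1}{\alpha_t} B_G(\theta, \theta_t) \right\},$$
its first-order optimality condition gives
$$\nabla f(\theta_t) + \tfrac{1}{\alpha_t}\big(\nabla G(\theta_{t+1}) - \nabla G(\theta_t)\big) = 0
\quad\Longrightarrow\quad
\nabla G(\theta_{t+1}) = \nabla G(\theta_t) - \alpha_t \nabla f(\theta_t),$$
i.e. the update is an ordinary gradient step in the dual coordinates $\mu = \nabla G(\theta)$, where the Hessian $\nabla^2 G$ plays the role of the metric tensor.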

Authors

Garvesh Raskutti, Sayan Mukherjee

License

Creative Commons Attribution-ShareAlike 4.0 International

DOI : 10.23723/11784/14293

Information geometry of mirror descent
Geometric Science of Information
Anthea Monod, Department of Statistical Science, Duke University, Information Initiative at Duke
Joint work of G. Raskutti (UW Madison) and S. Mukherjee (Duke)
29 Oct 2015

Optimization of large-scale problems

Optimization of a function $f(\theta)$ where $\theta \in \mathbb{R}^p$ is a central problem in modern optimization, e.g. machine learning. Standard subgradient descent has an $O(\sqrt{p})$ convergence rate. Mirror descent [A. Nemirovski, 1979; A. Beck & M. Teboulle, 2003] improves this to an $O(\log p)$ convergence rate and is a widely used tool in optimization and machine learning.

Differential geometry in statistics

(1) Cramér-Rao lower bound (Rao 1945): the lower bound on the variance of an estimator is a function of curvature. Sometimes called the Cramér-Rao-Fréchet-Darmois lower bound.
(2) Invariant (non-informative) priors (Jeffreys 1946): an uninformative prior distribution for a parameter space is based on a differential form.
(3) Information geometry (Amari 1985): the differential geometry of probability distributions.

Stochastic gradient descent

Given a convex differentiable cost function $f : \Theta \to \mathbb{R}$, generate a sequence of parameters $\{\theta_t\}_{t=1}^{\infty}$ which incur losses $f(\theta_t)$ and minimize the regret at time $T$, $\sum_{t=1}^{T} f(\theta_t)$. One solution is
$$\theta_{t+1} = \theta_t - \alpha_t \nabla f(\theta_t),$$
where $(\alpha_t)_{t=0}^{\infty}$ denotes a sequence of step sizes.

Natural gradient

For certain cost functions (log-likelihoods of exponential family models) the set of parameters $\Theta$ is supported on a $p$-dimensional Riemannian manifold $(\mathcal{M}, H)$. Typically the metric tensor $H = (h_{jk})$ is determined by the Fisher information matrix
$$(I(\theta))_{ij} = \mathbb{E}_{\mathrm{Data}}\left[\frac{\partial}{\partial \theta_i} f(x;\theta)\,\frac{\partial}{\partial \theta_j} f(x;\theta) \,\middle|\, \theta\right], \qquad i, j = 1, \dots, p.$$

Given a cost function $f$ on the Riemannian manifold, $f : \mathcal{M} \to \mathbb{R}$, the natural gradient descent step is
$$\theta_{t+1} = \theta_t - \alpha_t H^{-1}(\theta_t)\,\nabla f(\theta_t),$$
where $H^{-1}$ is the inverse of the Riemannian metric. The natural gradient algorithm steps in the direction of steepest descent along the Riemannian manifold $(\mathcal{M}, H)$; it requires a matrix inversion.

Mirror descent

Gradient descent can be written
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \left\{ \langle \theta, \nabla f(\theta_t) \rangle + \frac{1}{2\alpha_t} \|\theta - \theta_t\|_2^2 \right\}.$$
For a (strictly) convex proximity function $\Psi : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}_+$, mirror descent is
$$\theta_{t+1} = \arg\min_{\theta \in \Theta} \left\{ \langle \theta, \nabla f(\theta_t) \rangle + \frac{1}{\alpha_t} \Psi(\theta, \theta_t) \right\}.$$

Bregman divergence

Let $G : \Theta \to \mathbb{R}$ be a strictly convex, twice-differentiable function; the Bregman divergence is
$$B_G(\theta, \theta') = G(\theta) - G(\theta') - \langle \nabla G(\theta'), \theta - \theta' \rangle.$$

Bregman divergences for exponential families

Family | $G(\theta)$ | $B_G(\theta, \theta')$
$N(\theta, I_{p \times p})$ | $\frac{1}{2}\|\theta\|_2^2$ | $\frac{1}{2}\|\theta - \theta'\|_2^2$
$\mathrm{Poi}(e^{\theta})$ | $\exp(\theta)$ | $\exp(\theta) - \exp(\theta') - \langle \exp(\theta'), \theta - \theta' \rangle$
$\mathrm{Be}\!\left(\frac{1}{1+e^{-\theta}}\right)$ | $\log(1 + \exp(\theta))$ | $\log\frac{1+e^{\theta}}{1+e^{\theta'}} - \left\langle \frac{e^{\theta'}}{1+e^{\theta'}}, \theta - \theta' \right\rangle$

Mirror descent with a Bregman divergence

Mirror descent using the Bregman divergence as the proximity function is
$$\theta_{t+1} = \arg\min_{\theta} \left\{ \langle \theta, \nabla f(\theta_t) \rangle + \frac{1}{\alpha_t} B_G(\theta, \theta_t) \right\}.$$
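As a concrete illustration of the update above, here is a minimal Python sketch (mine, not part of the slides; the function names and the scalar Poisson instance $G(\theta) = \exp(\theta)$ from the table are illustrative assumptions). In the unconstrained case the Bregman-proximal argmin has the closed form $\theta_{t+1} = (\nabla G)^{-1}\!\big(\nabla G(\theta_t) - \alpha_t \nabla f(\theta_t)\big)$:

```python
import numpy as np

def mirror_descent_step(theta_t, grad_f_t, grad_G, grad_G_inv, alpha_t):
    """One unconstrained mirror descent step with Bregman proximity B_G:
    map to dual coordinates with grad_G, take a plain gradient step there,
    and map back with the inverse mirror map grad_G_inv."""
    mu_t = grad_G(theta_t)               # mu_t = grad G(theta_t), dual coordinates
    mu_next = mu_t - alpha_t * grad_f_t  # gradient step in the dual space
    return grad_G_inv(mu_next)           # theta_{t+1} = (grad G)^{-1}(mu_next)

# Poisson row of the table: G(theta) = exp(theta), so grad G = exp and
# (grad G)^{-1} = log (scalar case; the numbers below are arbitrary).
theta_t, grad_f_t, alpha_t = 0.5, 0.3, 0.1
theta_next = mirror_descent_step(theta_t, grad_f_t, np.exp, np.log, alpha_t)
print(theta_next)
```

For the Gaussian row, $\nabla G$ is the identity and the step reduces to ordinary gradient descent, which is the sanity check one would expect.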
Convex duals

The convex conjugate function for a function $G$ is defined to be
$$H(\mu) := \sup_{\theta \in \Theta} \{ \langle \theta, \mu \rangle - G(\theta) \}.$$
Let $\mu = g(\theta) \in \Phi$ be the extremal point of the dual, where $g = \nabla G$. The dual Bregman divergence $B_H : \Phi \times \Phi \to \mathbb{R}_+$ is
$$B_H(\mu, \mu') = H(\mu) - H(\mu') - \langle \nabla H(\mu'), \mu - \mu' \rangle.$$

Dual Bregman divergences for exponential families

$G(\theta)$ | $H(\mu)$ | $B_H(\mu, \mu')$
$\frac{1}{2}\|\theta\|_2^2$ | $\frac{1}{2}\|\mu\|_2^2$ | $\frac{1}{2}\|\mu - \mu'\|_2^2$
$\exp(\theta)$ | $\langle \mu, \log \mu \rangle - \mu$ | $\mu \log\frac{\mu}{\mu'} - \mu + \mu'$
$\log(1 + \exp(\theta))$ | $\mu \log \mu + (1-\mu)\log(1-\mu)$ | $\mu \log\frac{\mu}{\mu'} + (1-\mu)\log\frac{1-\mu}{1-\mu'}$

Manifolds in primal and dual coordinates

$B_G(\cdot, \cdot)$ induces a Riemannian manifold $(\Theta, \nabla^2 G)$ in the primal coordinates. Let $\Phi$ be the image of $\Theta$ under the continuous map $g = \nabla G$. Then $B_H : \Phi \times \Phi \to \mathbb{R}_+$ induces the same Riemannian manifold $(\Phi, \nabla^2 H)$ in the dual coordinates $\Phi$.

Equivalence

Theorem (Raskutti, Mukherjee). The mirror descent step with the Bregman divergence defined by $G$, applied to a function $f$ in the space $\Theta$, is equivalent to the natural gradient step along the Riemannian manifold $(\Phi, \nabla^2 H)$ in dual coordinates.

Consequences

Consider an exponential family with density $p(y \mid \theta) = h(y) \exp(\langle \theta, y \rangle - G(\theta))$, and the following mirror descent step given $y_t$:
$$\theta_{t+1} = \arg\min_{\theta} \left\{ \left\langle \theta, \nabla_{\theta} B_G(\theta, h(y_t))\big|_{\theta = \theta_t} \right\rangle + \frac{1}{\alpha_t} B_G(\theta, \theta_t) \right\}.$$
In dual coordinates one would minimize $f_t(\mu; y_t) = -\log p(y_t \mid \mu) = B_H(y_t, \mu)$. The natural gradient step is
$$\mu_{t+1} = \mu_t - \alpha_t [\nabla^2 H(\mu_t)]^{-1} \nabla B_H(y_t, \mu_t) = \mu_t - \alpha_t (\mu_t - y_t),$$
since the curvature of the loss $B_H(y_t, \mu_t)$ matches the metric tensor $\nabla^2 H(\mu)$.

Statistical efficiency

Given independent samples $Y^T = (y_1, \dots, y_T)$, a sequence of unbiased estimators $\widehat{\mu}_T$ is Fisher efficient if
$$\lim_{T \to \infty} \mathbb{E}_{Y^T}\!\left[(\widehat{\mu}_T - \mu)(\widehat{\mu}_T - \mu)^{\top}\right] \to \frac{1}{T}\, \nabla^2 H,$$
where $\nabla^2 H$ is the inverse of the Fisher information matrix.

Theorem (Raskutti, Mukherjee). The mirror descent step applied to the log loss above with step sizes $\alpha_t = 1/t$ asymptotically achieves the Cramér-Rao lower bound.

Challenges

(1) Information geometry on mixtures of manifolds.
(2) Proximity functions for functions over the Grassmannian.
(3) EM algorithms for mixtures.

Acknowledgements

Funding: Center for Systems Biology at Duke, NSF DMS and CCF, DARPA, AFOSR, NIH.
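To make the statistical efficiency slide concrete, here is a small numerical sketch (mine, not from the slides; the Poisson family, starting point, random seed, and sample size are arbitrary choices). In dual (mean) coordinates, the natural gradient step $\mu_{t+1} = \mu_t - \alpha_t(\mu_t - y_t)$ with step sizes $\alpha_t = 1/t$ reproduces the running sample mean, the efficient maximum likelihood estimator of $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 3.0
y = rng.poisson(mu_true, size=10_000)  # i.i.d. Poisson samples with mean mu_true

mu = 1.0                               # arbitrary start in the dual (mean) coordinates
for t, y_t in enumerate(y, start=1):
    alpha_t = 1.0 / t                  # step sizes from the efficiency theorem
    mu = mu - alpha_t * (mu - y_t)     # natural gradient step: mu_{t+1} = mu_t - alpha_t (mu_t - y_t)

print(mu, y.mean())                    # the final iterate matches the sample mean (up to floating point)
```

Because $\alpha_1 = 1$, the first step replaces the starting point by $y_1$, and induction gives $\mu_{T+1} = \frac{1}{T}\sum_{t=1}^{T} y_t$, which is the estimator that asymptotically attains the Cramér-Rao bound for this family.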