Information Geometry of Wasserstein Divergence

07/11/2017
Publication GSI2017
OAI : oai:www.see.asso.fr:17410:22606
Abstract

There are two geometrical structures in a manifold of probability distributions. One is invariant, based on the Fisher information, and the other is based on the Wasserstein distance of optimal transportation. We propose a unified framework which connects the Wasserstein distance and the Kullback-Leibler (KL) divergence to give a new information-geometrical theory. We consider the discrete case consisting of n elements and study the geometry of the probability simplex S_{n-1}, the set of all probability distributions over n atoms. The Wasserstein distance is introduced in S_{n-1} by the optimal transportation of commodities from distribution p ∈ S_{n-1} to q ∈ S_{n-1}. We relax the optimal transportation by using entropy, introduced by M. Cuturi (2013), and show that the entropy-relaxed transportation plan naturally defines the exponential family and the dually flat structure of information geometry.
Although the optimal cost does not define a distance function, we introduce a novel divergence function in S_{n-1}, which connects the relaxed Wasserstein distance to the KL-divergence by one parameter.


License

Creative Commons: None (all rights reserved)

DOI: 10.23723/17410/22606
Publisher: SEE
Publication year: 2018

Information Geometry of Wasserstein Divergence

Ryo Karakida (1) and Shun-ichi Amari (2)

1 National Institute of Advanced Industrial Science and Technology, Koto-ku, Tokyo 135-0064, JAPAN, karakida.ryo@aist.go.jp
2 RIKEN Brain Science Institute, Wako-shi, Saitama 351-0198, JAPAN, amari@brain.riken.jp

Abstract. There are two geometrical structures in a manifold of probability distributions. One is invariant, based on the Fisher information, and the other is based on the Wasserstein distance of optimal transportation. We propose a unified framework which connects the Wasserstein distance and the Kullback-Leibler (KL) divergence to give a new information-geometrical theory. We consider the discrete case consisting of n elements and study the geometry of the probability simplex S_{n-1}, the set of all probability distributions over n atoms. The Wasserstein distance is introduced in S_{n-1} by the optimal transportation of commodities from distribution p ∈ S_{n-1} to q ∈ S_{n-1}. We relax the optimal transportation by using entropy, introduced by M. Cuturi (2013), and show that the entropy-relaxed transportation plan naturally defines the exponential family and the dually flat structure of information geometry. Although the optimal cost does not define a distance function, we introduce a novel divergence function in S_{n-1}, which connects the relaxed Wasserstein distance to the KL-divergence by one parameter.

1 Introduction

Information geometry studies invariant properties of a manifold of probability distributions, which are useful for various applications in statistics, machine learning, signal processing, optimization and others. Two geometrical structures have been introduced from two different backgrounds. One is constructed based on the invariance principle: the geometry is invariant under reversible transformations of random variables. We then have the Fisher information matrix as the unique invariant Riemannian metric (Rao, 1945; Chentsov, 1972; Amari, 2016). Moreover, two dually coupled affine connections are given as invariant connections. These structures are useful for various applications. Another geometrical structure is introduced through the transportation problem. A distribution of commodities in a manifold is transported to another distribution. The transportation with the minimal cost defines a distance between the two distributions, called the Monge-Kantorovich-Wasserstein distance or earth-mover distance. This gives a tool to study the geometry of distributions taking the metric of the supporting manifold into account.

Let X = {1, ..., n} be the support of a probability measure p. The invariant geometry gives a structure which is invariant under permutations of elements of X. It leads to an efficient estimator in statistical estimation. On the other hand, when we consider a picture over n² pixels X = {(i, j); i, j = 1, ..., n}, neighboring pixels are close. A permutation of X destroys such a neighboring structure, so the invariance should not be required. The Wasserstein distance, in contrast, respects such a structure; therefore, it is useful for problems having a neighboring structure in the support X.

An interesting question is how these two geometrical structures are related. They are useful structures in their own right, but it is intriguing to find a unified framework that includes the two.
With this purpose in mind, the present paper treats the discrete case over n elements, such that a probability distribution is given by a probability vector p = (p_1, ..., p_n) in the probability simplex S_{n-1}, leaving the general case of continuous distributions over a manifold for future study.

M. Cuturi (2013) modified the transportation problem such that the cost is minimized under an entropy constraint. This is called the entropy-relaxed optimal transportation problem. In many applications, his group showed that the quasi-distance defined by the entropy-constrained optimal solution has superior properties to information-geometric distances such as the KL divergence or the Hellinger distance. As an application, consider a set of normalized histograms over X. A clustering problem categorizes them into classes such that a class consists of similar histograms. Since a histogram is regarded as an empirical probability distribution, the problem is formulated within the probability simplex S_{n-1} in the discrete case, and the distances among supporting pixels play a fundamental role.

We follow the entropy-relaxed framework of Cuturi (2013), Cuturi and Avis (2014), Cuturi and Peyre (2016), etc., and introduce a Lagrangian function which is a linear combination of the transportation cost and the entropy. Given distribution p of commodities at the sender and q at the receiver, the optimal transportation plan is the minimizer of the Lagrangian function. We show that the resulting minimal value is a convex function of p and q, so it defines a dually flat geometric structure in S_{n-1} × S_{n-1}. The m-flat coordinates are (p, q), and their dual, e-flat coordinates (α, β), are given from the Lagrangian duality of nonlinear optimization problems. The set of the optimal transportation plans is an exponential family with the canonical parameters (α, β), where the expectation parameters are (p, q). Furthermore, we introduce a novel divergence between p and q in S_{n-1}. It connects the relaxed Wasserstein distance to the KL-divergence by a one-parameter family. Our divergence is expected to be useful for practical applications, because a divergence is a general concept that includes the square of a distance and is more flexible, admitting asymmetry between p and q.

2 Entropy-Constrained Transportation Problem

Let us consider n terminals X = (X_1, ..., X_n) at which amounts p_1, ..., p_n of commodities are stocked. We transport them within X such that amounts q_1, ..., q_n are newly stored at X_1, ..., X_n. We normalize the total amount to be equal to 1, so p = (p_1, ..., p_n) and q = (q_1, ..., q_n) are regarded as probability distributions in the probability simplex S_{n-1},

    ∑ p_i = 1,  ∑ q_i = 1,  p_i > 0,  q_i > 0.    (1)

We consider a transportation plan P = (P_ij), where P_ij is the amount of commodity transported from X_i to X_j. A plan P is regarded as a joint probability distribution of commodities flowing from X_i to X_j, satisfying the sender and receiver constraints,

    ∑_j P_ij = p_i,  ∑_i P_ij = q_j.    (2)

The set of P's satisfying (2) is denoted by U(p, q). Let c = (c_ij) be the cost matrix, where c_ij denotes the cost of transporting one unit of commodity from X_i to X_j. The transportation cost is defined by

    c(P) = ⟨c, P⟩ = ∑ c_ij P_ij.    (3)

The Wasserstein distance is defined by the minimal cost of transporting distribution p at the senders to q at the receivers,

    c(p, q) = min_{P ∈ U(p,q)} ⟨c, P⟩,    (4)

where the minimum is taken over all P satisfying constraints (2). See, e.g., Villani (2013).
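For a concrete sense of (4), note that the unrelaxed problem is an ordinary linear program over the plans in U(p, q). The sketch below is one way to solve it with SciPy's linprog; the three-point marginals and the |i − j| ground cost are illustrative choices, not taken from the paper.

```python
# Minimal sketch of the transportation problem (4) as a linear program.
# NumPy/SciPy are assumed; the example p, q and cost matrix are illustrative.
import numpy as np
from scipy.optimize import linprog

def wasserstein_cost(p, q, c):
    """Minimal transport cost <c, P> over plans P with row sums p and column sums q."""
    n = len(p)
    A_rows = np.kron(np.eye(n), np.ones(n))   # sum_j P_ij = p_i  (sender constraints)
    A_cols = np.kron(np.ones(n), np.eye(n))   # sum_i P_ij = q_j  (receiver constraints)
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([p, q])
    res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, n)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
c = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)  # ground cost |i - j|
cost, plan = wasserstein_cost(p, q, c)
print(cost)  # c(p, q) of equation (4)
```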
Given p and q, let us consider a special transportation plan P_D defined by the direct product of p and q,

    P_D = p ⊗ q = (p_i q_j).    (5)

This plan transports commodities from each sender to the receivers according to the receiver distribution q, irrespective of c. The entropy of P_D,

    H(P_D) = − ∑ P_D,ij log P_D,ij = H(p) + H(q),    (6)

is the maximum among all P's belonging to U(p, q), because H(P) ≤ H(p) + H(q), where H(P), H(p) and H(q) are the respective entropies and equality holds for P = P_D. We consider a constrained problem of searching for the P that minimizes ⟨c, P⟩ within a KL-divergence ball centered at P_D,

    KL[P : P_D] ≤ d    (7)

for constant d. As d increases, P is allowed to deviate further from P_D, so the entropy of the optimal P decreases. This is equivalent to the entropy-constrained problem that minimizes a linear combination of the transportation cost ⟨c, P⟩ and the entropy H(P),

    F_λ(P) = (1/(1+λ)) ⟨c, P⟩ − (λ/(1+λ)) H(P)    (8)

for constant λ (Cuturi, 2013). Here, λ is a Lagrangian multiplier, and λ becomes smaller as d becomes larger.

3 Solution of Entropy-Constrained Problem

Since P satisfies constraints (2), by using Lagrange multipliers α_i, β_j, minimization of (8) is formulated in the Lagrangian form,

    L_λ(P) = (1/(1+λ)) ⟨c, P⟩ − (λ/(1+λ)) H(P) − ∑_{i,j} (α_i + β_j) P_ij.    (9)

Let us fix λ, considering it as a parameter controlling the magnitude of the entropy or the size of the KL-ball. By differentiating (9) with respect to P_ij, we have the following solution,

    P_ij = exp{ −c_ij/λ + ((1+λ)/λ)(α_i + β_j) − 1 }.    (10)

Let us put

    a_i = exp( ((1+λ)/λ) α_i ),  b_j = exp( ((1+λ)/λ) β_j ),  K_ij = exp( −c_ij/λ ),    (11)

and the optimal solution is written as

    P*_λ,ij ∝ a_i b_j K_ij,    (12)

where a_i and b_j correspond to the Lagrange multipliers α_i and β_j to be determined from the constraints (2). Note that the 2n constraints (2) are not independent, since ∑ p_i = ∑ q_j = 1. Further, we note that μa and b/μ give the same answer for any μ > 0, where a = (a_i) and b = (b_j). Therefore, the degrees of freedom of a and b are 2(n − 1), which are to be determined from p and q, whose degrees of freedom are also 2(n − 1). Therefore, we may choose a and b such that they satisfy

    ∑ a_i = 1,  ∑ b_j = 1.    (13)

Then a, b ∈ S_{n-1} and we have the following theorem.

Theorem 1. The optimal transportation plan P*_λ is given by

    P*_λ,ij = c a_i b_j K_ij,    (14)
    c = 1 / ∑ a_i b_j K_ij,    (15)

where the two vectors a and b are determined from p and q.

Cuturi (2013) obtained the above P*_λ,ij and applied the Sinkhorn-Knopp algorithm to iteratively compute a and b.
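The following is a minimal sketch of how such a Sinkhorn-Knopp scaling can be written in NumPy: the rows and columns of K_ij = exp(−c_ij/λ) are rescaled alternately until the plan has marginals p and q. The value of λ, the tolerance and the example inputs are illustrative, and the scaling vectors u, v are only proportional to a, b (the normalization (13) is not imposed).

```python
# Minimal sketch of Sinkhorn-Knopp scaling for the entropy-relaxed plan of
# Theorem 1: P*_ij proportional to a_i b_j K_ij with K_ij = exp(-c_ij / lambda).
# NumPy is assumed; lambda, the tolerance and the example inputs are illustrative.
import numpy as np

def sinkhorn_plan(p, q, cost, lam, n_iter=1000, tol=1e-9):
    """Entropy-relaxed transportation plan via alternating row/column rescaling."""
    K = np.exp(-cost / lam)          # K_ij = exp(-c_ij / lambda), equation (11)
    u = np.ones_like(p)              # row scaling, proportional to a_i
    v = np.ones_like(q)              # column scaling, proportional to b_j
    for _ in range(n_iter):
        u = p / (K @ v)              # enforce the sender constraint   sum_j P_ij = p_i
        v = q / (K.T @ u)            # enforce the receiver constraint sum_i P_ij = q_j
        P = u[:, None] * K * v[None, :]
        if np.abs(P.sum(axis=1) - p).max() < tol:
            break
    return P

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
cost = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
P = sinkhorn_plan(p, q, cost, lam=0.5)
print(P.sum(axis=1), P.sum(axis=0))  # should reproduce p and q, constraints (2)
print((cost * P).sum())              # entropy-relaxed transport cost <c, P*>
```

By (8), a small λ puts most weight on the cost term, so the relaxed plan approaches a minimizer of the unrelaxed cost (4), while a large λ puts most weight on the entropy term, so the plan approaches the direct product P_D = p ⊗ q.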
The following lemma is useful for later calculations.

Lemma 1. The optimal value

    φ_λ(p, q) = min F_λ(P)    (16)

is given by

    φ_λ(p, q) = (λ/(1+λ)) ( ∑ p_i log a_i + ∑ q_j log b_j + log c ).    (17)

Proof. We first calculate H(P*_λ). Substituting (14) in H(P*_λ), we have

    H(P*_λ) = − ∑_{i,j} P*_λ,ij ( −c_ij/λ + log(c a_i b_j) )    (18)
            = (1/λ) ⟨c, P*_λ⟩ − ∑ p_i log a_i − ∑ q_j log b_j − log c.    (19)

Hence, (17) follows.

4 Exponential Family of Optimal Transportation Plans

A transportation plan P is a probability distribution over branches (i, j) connecting terminals i and j. Let x denote a branch and let δ_ij(x) = 1 when x is (i, j) and 0 otherwise. Then P is a probability distribution of the random variable x,

    P(x) = ∑_{i,j=1}^{n} P_ij δ_ij(x).    (20)

By introducing new parameters

    θ_ij = log(P_ij / P_nn),  θ = (θ_ij),    (21)

it is rewritten in a parameterized form as

    P(x, θ) = exp{ ∑_{i,j} θ_ij δ_ij(x) + log P_nn }.    (22)

This shows that the set of transportation plans is an exponential family, where θ_ij are the canonical parameters and η_ij = P_ij are the expectation parameters. They form an (n² − 1)-dimensional manifold denoted by S_TP, because θ_nn = 0.

An optimal transportation plan is specified by (α, β) in (10), α = (α_i), β = (β_j), which are determined from (p, q). It is written as

    P(x, α, β) = exp{ ∑_{i,j} [ ((1+λ)/λ)(α_i + β_j) − c_ij/λ ] δ_ij(x) − ((1+λ)/λ) ψ },    (23)

where

    ψ(α, β) = (λ/(1+λ)) log ∑_{i,j} exp{ ((1+λ)/λ)(α_i + β_j) − c_ij/λ }    (24)

is the normalization factor. By putting

    θ_ij = ((1+λ)/λ)(α_i + β_j) − c_ij/λ,    (25)

we see that the set S_OTP of optimal transportation plans is a submanifold of S_TP. Because (25) is linear in α and β, S_OTP itself is an exponential family, where the canonical parameters are (1+λ)/λ times (α, β) and the expectation parameters are (p, q) ∈ S_{n-1} × S_{n-1}, since

    E[ ∑_j δ_ij(x) ] = p_i,    (26)
    E[ ∑_i δ_ij(x) ] = q_j.    (27)

Since each of p, q ∈ S_{n-1} has n − 1 degrees of freedom, S_OTP is a 2(n − 1)-dimensional dually flat manifold. We may put α_n = β_n = 0 without loss of generality, which corresponds to putting a_n