Maximal Information Divergence from Statistical Models defined by Neural Networks

28/08/2013
OAI : oai:www.see.asso.fr:2552:5129
DOI : 10.23723/2552/5129

Licence

Creative Commons None (All rights reserved)


Maximal Information Divergence from Statistical Models defined by Neural Networks
Guido Montúfar (1), Johannes Rauh (2), Nihat Ay (2)
(1) Pennsylvania State University, (2) Max Planck Institute for Mathematics in the Sciences
GSI 2013, 28–30 August, MINES ParisTech (École des Mines de Paris)

How well are neural networks able to approximate probability distributions?

1 Introduction
2 Maximal Information Divergence
3 Discussion

Example 1
A Restricted Boltzmann Machine trained on the MNIST database of 60,000 handwritten digits (28 × 28 pixels).
[Figures: RBM schematic with sigmoid units; MNIST digit samples; samples from trained RBMs as the number of hidden units ranges from 0 to 100.]

Example 2
Maximizing the mutual information of sensor values S_t and S_{t+1} produces interesting coordinated behaviour. This is the divergence from P(S_t, S_{t+1}) to the product distribution P(S_t) · P(S_{t+1}). [Zahedi, Ay, and Der, 2010]
[Diagram: sensorimotor loop over the variables S_t, S_{t+1}, A_t, A_{t+1} with maps α and π.]

Simplex of probability distributions
$\Delta = \{ p \in \mathbb{R}^{|X|} : p_x \ge 0,\ \sum_{x \in X} p_x = 1 \}$
[Figures: the simplices $S^{2-1} \subset \mathbb{R}^2$ and $S^{3-1} \subset \mathbb{R}^3$ with coordinates $(p_1, p_2)$ and $(p_1, p_2, p_3)$.]

Kullback-Leibler divergence
$D(p \,\|\, q) = \sum_x p_x \log \frac{p_x}{q_x}$

Example 3
Consider the independence model $\{ q : q(x, y) = q(x)\, q(y) \}$. The divergence minimizer for a given $p$ is the product of its marginals,
$p(x, y) \;\mapsto\; \Big( \sum_{y} p(x, y) \Big) \Big( \sum_{x} p(x, y) \Big) = \operatorname{arginf}_q D(p \,\|\, q) .$
Which $p$ lies farthest from the model? That is, what is $\operatorname{argsup}_p \inf_q D(p \,\|\, q)$?

The independence model $\mathcal{E}^1_n$ is the set of factorizing probability distributions
$p(x) = \prod_{i \in [n]} p_i(x_i)$ for all $x = (x_1, \ldots, x_n) \in X_1 \times \cdots \times X_n$.
• Simplest neural network model (fully observed, non-interacting units).
• Closed formula for the divergence $\inf_{q \in \mathcal{E}^1_n} D(\cdot \,\|\, q)$ (a difference of entropies).
Lemma [Ay and Knauf, 2006]
The maximal divergence from the q-ary independence model is $D_{\mathcal{E}^1_n} = (n - 1) \log(q)$, and the maximizers of the divergence are the uniform distributions on maximal maximal-distance codes.

• The maximal divergence from a set of distributions $M$ bounds the model approximation errors:
$D_M := \sup_{p \in \Delta} \inf_{q \in M} D(p \,\|\, q) = \sup_{p \in \Delta} D(p \,\|\, M) .$
When $D_M = 0$, then $M$ is a universal approximator.
• Compute $D(\cdot \,\|\, M)$ and $D_M$!

                      exponential family      neural network
  $D(\cdot \| M)$     sometimes easy          mostly difficult
  MLE                 unique                  not unique
  $D_M$               often difficult         ?

Neural networks: [e.g., Hornik and Stinchcombe, 1989; Freund and Haussler, 1991; Sutskever and Hinton, 2008; Le Roux and Bengio, 2010; M. and Ay, 2011]
Exponential families: [e.g., Matúš and Ay, 2003; Ay and Knauf, 2006; Rauh, 2012]
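The projection onto the independence model used above is just the product of marginals, so the lemma is straightforward to check numerically. The following Python sketch is our own illustration, not part of the slides; the helper names and the choice n = 3 are assumptions. It computes $D(p\,\|\,q)$, projects onto $\mathcal{E}^1_n$, and confirms that the uniform distribution on the two constant strings, a binary code of cardinality 2 and minimum distance n, attains the value $(n-1)\log 2$ from the lemma.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p||q), with the convention 0 * log(0/q) = 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def independence_projection(p, n):
    """rI-projection of p onto the independence model E^1_n: the product of its marginals."""
    joint = p.reshape((2,) * n)
    q = np.ones_like(joint)
    for i in range(n):
        axes = tuple(j for j in range(n) if j != i)
        marginal = joint.sum(axis=axes)          # marginal distribution of variable i
        shape = [1] * n
        shape[i] = 2
        q = q * marginal.reshape(shape)          # multiply the factors together
    return q.reshape(-1)

n = 3
# Uniform distribution on the repetition code {00...0, 11...1}: a binary code of
# cardinality 2 and minimum distance n.  In C-order, flat index 0 is the all-zeros
# string and index 2**n - 1 is the all-ones string.
p = np.zeros(2 ** n)
p[0] = p[-1] = 0.5

print(kl(p, independence_projection(p, n)))   # equals (n - 1) * log(2) ~ 1.386 for n = 3
print((n - 1) * np.log(2))
```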
Naïve Bayes Model, Restricted Boltzmann Machine, Deep Belief Network
[Diagrams: a single hidden node Y connected to visible nodes X_1, ..., X_n (naïve Bayes); hidden nodes Y_1, ..., Y_m fully connected to visible nodes X_1, ..., X_n (RBM); a layered network (DBN).]
Naïve Bayes model:
$q(x_1, \ldots, x_n) = \sum_{y} \prod_{i=1}^{n} P(x_i \mid y)\, Q(y)$
Restricted Boltzmann machine:
$q(x_1, \ldots, x_n) = \sum_{y_1, \ldots, y_m} \frac{1}{Z(W, b, c)} \exp\Big( \sum_{i,j} w_{ij} x_i y_j + \sum_i b_i x_i + \sum_j c_j y_j \Big)$

Example 4
• Naïve Bayes model with three visible and one hidden binary variable. Simplest (non-trivial) neural network with hidden units. Has full dimension but small volume. [Zwiernik and Smith, 2011]
• What is $\operatorname{arginf}_{q \in M_{3,2}} D(p \,\|\, q)$ for the uniform distribution $p$ on even-parity strings?
[Plots: candidate projections in terms of the moments $\mu_{12}, \mu_{13}, \mu_{23}$ with $\mu_1 = \mu_2 = \mu_3 = 0.5$ and $\mu_{123} = -0.0245$; mixtures of $\delta_{000}$ and i.i.d. distributions over the mixture weight and the i.i.d. parameter; divergence values $K \approx 0.8307$ and $K_{CD} \approx 0.83348$.]

Estimating the Information Divergence: Our Approach
1. Identify a family of exponential sub-models of neural network models.
2. Bound the divergence from the union of these sub-models from above.
• If $M' \subseteq M$, then $D_M \le D_{M'}$. In special cases equality is possible.
• Conversely, simple sub-models can be used to study larger models.

Estimating the Information Divergence: Sub-models
• The partition model $P_\rho$ with partition $\rho = \{A_1, \ldots, A_m\}$ is the set of probability distributions with constant value on each partition block $A_i$.
  • Simplest exponential families (simplices).
  • Closed formula for the divergence and the maximal divergence.
  • Optimally approximating exponential families (for fixed dimension). [Rauh, 2013]
• We also use mixtures of products with disjoint supports.
  • Exponential families.
  • Closed formula for the divergence and the maximal divergence.
  • Similar to mixture models.
• Neural networks can be analyzed in terms of such sub-models:
  • Show containment of large such models.
  • Find optimal projection components.

Maximal Divergence: Naïve Bayes Models
The mixture of product distributions $M_{n,k}$, or naïve Bayes model, is a star-graphical model with visible leaves and a hidden internal node.
Theorem
The maximal divergence from the k-mixture of product distributions on q-ary arrays of length n is bounded by
$D_{M_{n,k}} \le n - 1 - \log_q(k) .$
This theorem subsumes independence models (k = 1).
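There is no closed formula for the projection onto $M_{n,k}$ in general, but it can be approximated numerically. The sketch below is our own illustration, not the authors' code: a small EM loop that locally minimizes $D(p\,\|\,q)$ over k-mixtures of products of Bernoulli distributions and evaluates it on the parity target of Example 4. The function names, the random initialization, and the clipping guard are assumptions, and the slide leaves the base of the logarithm in the bound implicit, so the two printed numbers are only meant as a rough comparison.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def mixture_of_products(weights, means, states):
    # component likelihoods: comp[j, x] = prod_i mu_ji^{x_i} (1 - mu_ji)^{1 - x_i}
    comp = np.prod(np.where(states[None, :, :] == 1,
                            means[:, None, :], 1.0 - means[:, None, :]), axis=2)
    return weights @ comp, comp              # mixture q over all states, and comp

def em_projection(p, n, k, iters=500):
    """Locally minimize D(p||q) over q in M_{n,k} by EM on the weighted log-likelihood."""
    states = np.array(list(itertools.product([0, 1], repeat=n)))
    weights = np.full(k, 1.0 / k)
    means = rng.uniform(0.2, 0.8, size=(k, n))
    for _ in range(iters):
        q, comp = mixture_of_products(weights, means, states)
        resp = weights[:, None] * comp / q[None, :]            # E-step: responsibilities
        weighted = resp * p[None, :]                           # weight by the target p
        weights = np.maximum(weighted.sum(axis=1), 1e-12)      # guard against dying components
        means = np.clip((weighted @ states) / weights[:, None], 1e-9, 1 - 1e-9)   # M-step
    q, _ = mixture_of_products(weights, means, states)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

n, k = 3, 2
p = np.array([1.0 if sum(x) % 2 == 0 else 0.0 for x in itertools.product([0, 1], repeat=n)])
p /= p.sum()                                                   # uniform on even-parity strings
print("EM divergence from M_{3,2} (nats):", em_projection(p, n, k))
print("slide bound n - 1 - log_2(k)      :", n - 1 - np.log2(k))
```

Since EM only finds a local optimum, the printed divergence is an upper estimate of the true projection value, and different seeds may land in different local optima.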
Maximal Divergence: Restricted Boltzmann Machines
The restricted Boltzmann machine $RBM_{n,m}$ is a full bipartite undirected stochastic network (a Hadamard product of $M_{n,k}$'s).
Theorem
The maximal divergence from the q-ary RBM with n visible and m hidden units is bounded by
$D_{RBM_{n,m}} \le n - 1 - \log_q\big( m(q - 1) + 1 \big) .$
This theorem subsumes naïve Bayes models (m = 1). [M., Rauh, Ay, 2011; M. and Morton, 2013]

Maximal Divergence: Deep Belief Networks
A deep belief network (DBN) is a layered stochastic network with directed and undirected interactions (an RBM fed into a directed network).
Theorem
The maximal divergence from the q-ary DBN with L layers of width $\tilde n$, where $\tilde n \le n + \log_q(n) + 1$, is bounded by
$D_{DBN} \le n - \log_q\big( (L - 2)(q - 1) + 1 \big) .$
This theorem subsumes restricted Boltzmann machines (L = 2).

Maximal Divergence: Qualitative Illustration
The maximal model approximation error of NBMs, RBMs, and DBNs
• increases at most logarithmically with the number of visible states, and
• decreases at least logarithmically with the number of parameters.
[Schematic plot: the bound D as a function of n and $\log_q(L)$.]

Maximal Divergence: Small Examples
[Plots: KL divergence versus the number of hidden units (3 and 4 visible units, curves maxCD+ML, meanCD+ML, maxth, meanth) and versus the number of hidden layers (4 and 5 visible units, curves maxCD, meanCD, maxth, meanth); 500 targets, 2000 samples each.]

Summary
• We bound the maximal Kullback-Leibler divergence of various statistical models defined by neural networks with hidden units. We proceed by:
  1. Identifying families of exponential sub-models.
  2. Upper-bounding the maximal divergence of the union of the sub-models.
• Our bounds show a logarithmic error decay in the number of parameters. This behaviour is optimal for exponential families. [Rauh, 2013]
• Vanishing bounds show universal approximation.
• The exponential sub-model approach also allows us to compute expectation values of the divergence. [M. and Rauh, 2012]
• Open problems: lower bounds for the maximal divergence, and optimality.

Thanks!
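As a rough illustration of the behaviour summarized above, the sketch below evaluates the three bounds exactly as stated on the slides (without the floor refinements given in the appendix); the helper names and the example sizes are ours.

```python
import math

def bound_nb(n, k, q=2):
    """Slide bound for the k-mixture of products (naive Bayes): n - 1 - log_q(k)."""
    return n - 1 - math.log(k, q)

def bound_rbm(n, m, q=2):
    """Slide bound for RBM_{n,m}: n - 1 - log_q(m(q - 1) + 1)."""
    return n - 1 - math.log(m * (q - 1) + 1, q)

def bound_dbn(n, L, q=2):
    """Slide bound for a DBN with L layers: n - log_q((L - 2)(q - 1) + 1)."""
    return n - math.log((L - 2) * (q - 1) + 1, q)

n = 10
for k in (1, 2, 4, 8, 16):
    print(f"NBM  n={n} k={k:3d}  bound = {bound_nb(n, k):6.3f}")
for m in (1, 3, 7, 15, 31):
    print(f"RBM  n={n} m={m:3d}  bound = {bound_rbm(n, m):6.3f}")
for L in (2, 3, 5, 9, 17):
    print(f"DBN  n={n} L={L:3d}  bound = {bound_dbn(n, L):6.3f}")
```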
Appendix: Information Divergence
• Let $p$ and $q$ be probability distributions on a finite set $X$. The information divergence of $q$ from $p$,
$D(p \,\|\, q) := \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} ,$
measures the information loss when $q$ is used to approximate $p$.
• The maximal information divergence of a model $M \subseteq \Delta$,
$D_M := \max_{p \in \Delta} \inf_{q \in M} D(p \,\|\, q) ,$
is a measure of the approximation capabilities of $M$.

Appendix: Estimating the Information Divergence
1. If $M' \subseteq M$, then $D_M \le D_{M'}$. In special cases equality is possible:
Lemma
If $M' \subseteq M$ and $p$ is a maximizer of the divergence from $M'$ such that $M'$ contains an rI-projection $p_M$ of $p$ to $M$, then $p$ maximizes the divergence from $M$ among the set $\{ q \in \Delta : q_M \in M' \text{ for some rI-projection } q_M \}$.
Theorem
The maximal divergence from the multinomial model of n q-ary variables is equal to $(n - 1) \log(q)$. [see also Juríček, 2010]
2. Conversely, simple sub-models can be used to study larger models:
Lemma
Let $\{M_i\}$ be sub-models of an exponential family $\mathcal{E}$ with $D_{M_i} = K$ and divergence maximizers $\{G_i\}$. If there is a $p \in \cap_i G_i$ with $p_{\mathcal{E}} \in \cup_i M_i$, then $D_{\mathcal{E}} = K$ and the maximizers are the distributions $p \in \cap_i G_i$ with $p_{\mathcal{E}} \in \cap_i M_i$.
Theorem
The maximal divergence from the homogeneous independence model $\mathcal{E}^1_n$ is $(n - 1) \log(q)$, and the maximizers are the uniform distributions on q-ary codes of cardinality q and minimum distance n.

Appendix: Neural Networks
• Neural networks define models of probability distributions on the states of their neurons, parametrized by interaction and bias weights. For example,
$p(x_1, \ldots, x_n; w, b) = \exp\Big( \sum_{i,j} w_{ij} x_i x_j + \sum_i b_i x_i - \psi(w, b) \Big) .$
• $\Delta(X)$ denotes the simplex of distributions on $X := X_1 \times \cdots \times X_n$.
• We consider finite state spaces $X_i$, $|X_i| = N_i$, for all $i \in [n]$.
[Diagrams: the graphical structures of $M_{n,k}$, $RBM_{n,m}$, and a DBN.]

Maximal Divergence: Partition Models
Lemma (Matúš and Ay, 2003)
The maximal divergence from a partition model $P_\rho$ of coarseness $c(\rho) := \max_i |A_i|$ is $D_{P_\rho} = \log(c(\rho))$. The global maximizers $p$ are the distributions with $|\operatorname{supp}(p) \cap A_i| \le 1$ for all $i \in [m]$, with equality only if $|A_i| = c(\rho)$.

Maximal Divergence: Independence Models
Lemma (Ay and Knauf, 2006)
Let $|X_i| = N_i$ and $|X_1 \times \cdots \times X_n| = N$. The maximal divergence from $\mathcal{E}^1_n$ is bounded by
$D_{\mathcal{E}^1_n} \le \log\big( N / \max_{i \in [n]} N_i \big) .$
If all units are q-ary ($N_i = q$), then $D_{\mathcal{E}^1_n} = (n - 1) \log(q)$, and the maximizers of the divergence, $\operatorname{argsup}_p D(p \,\|\, \mathcal{E}^1_n)$, are the uniform distributions on q-ary codes of cardinality q and minimum distance n.
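Since the rI-projection onto a partition model simply spreads the mass of each block uniformly over that block, the partition-model lemma is easy to verify numerically. The following sketch is our own illustration; the example partition and the helper names are assumptions.

```python
import math
import numpy as np

def partition_projection(p, blocks):
    """rI-projection onto the partition model: q(x) = p(A_i) / |A_i| for x in block A_i."""
    q = np.empty_like(p)
    for block in blocks:
        q[block] = p[block].sum() / len(block)
    return q

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A partition of the 6 states {0, ..., 5} into blocks of sizes 3, 2, 1 (coarseness c = 3).
blocks = [[0, 1, 2], [3, 4], [5]]

p = np.zeros(6)
p[0] = 1.0                                                     # point mass in the largest block
print(kl(p, partition_projection(p, blocks)), math.log(3))    # both equal log(3) = log(c)

p_rand = np.random.default_rng(1).dirichlet(np.ones(6))        # a generic distribution
print(kl(p_rand, partition_projection(p_rand, blocks)) <= math.log(3))   # True, by the lemma
```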
Maximal Divergence: Naïve Bayes Models (appendix)
The mixture of product distributions $M_{n,k}$, or naïve Bayes model, is a star-graphical model with visible leaves and a hidden internal node.
Theorem
Let $A \subseteq [n]$. If $k \ge N_{[n] \setminus A}$, then $D_{M_{n,k}}$ is bounded by
$D_{M_{n,k}} \le \log\big( N_A / \max_{j \in A} N_j \big) .$
When all visible variables are binary, we have the tighter bound
$D_{M_{n,k}} \le \Big( n - \lfloor \log_2(k) \rfloor - \frac{k}{2^{\lfloor \log_2(k) \rfloor}} \Big) \log(2) .$

Maximal Divergence: Restricted Boltzmann Machines (appendix)
The restricted Boltzmann machine $RBM_{n,m}$ is the undirected stochastic network with full bipartite interactions between the hidden and visible units.
Theorem
Let $A \subseteq [n]$, and let $M_1, \ldots, M_m$ be the sizes of the state spaces of the hidden variables. If $1 + \sum_{j \in [m]} (M_j - 1) \ge N_{[n] \setminus A}$, then
$D_{RBM_{n,m}} \le \log\big( N_A / \max_{j \in A} N_j \big) .$
When all units are binary, and $m \le 2^{n-1} - 1$, we have the tighter bound
$D_{RBM_{n,m}} \le \Big( n - \lfloor \log_2(m + 1) \rfloor - \frac{m + 1}{2^{\lfloor \log_2(m + 1) \rfloor}} \Big) \log(2) .$

Maximal Divergence: Deep Belief Networks (appendix)
A deep belief network (DBN) is a layered stochastic network with undirected bipartite interactions between the units in the deepest two layers, and directed bipartite interactions between all other pairs of subsequent layers, directed towards the first, visible, layer.
Theorem
Consider a DBN with L layers, each layer containing n units with state spaces of cardinalities $q_1, \ldots, q_n$. Let $m$ be any integer with $\prod_{j=m+2}^{n} q_j \le m \le n$, and let $q_1 \ge \cdots \ge q_m$. If $L \ge 2 + \frac{q_1^S - 1}{q_1 - 1}$ for some $S \in \{0, 1, \ldots, m\}$, then
$D_{DBN} \le \log\big( N_{[m-S]} \big) .$
In particular, when all units are binary and the network has $L \ge 1 + 2^S$ layers of size $n = 2^{k-1} + k$, for some $S \in \{0, 1, \ldots, 2^{k-1}\}$, then
$D_{DBN} \le \big( 2^{k-1} - S \big) \log(2) .$
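The binary refinements above are simple to tabulate. The sketch below encodes our reading of these formulas (the floor operations are reconstructed, since the extraction of the slides garbled them); the bounds vanish at $k = 2^{n-1}$ and $m = 2^{n-1} - 1$, consistent with the summary's remark that vanishing bounds show universal approximation.

```python
import math

def nb_bound_binary(n, k):
    """Refined binary bound for M_{n,k}: (n - floor(log2 k) - k / 2**floor(log2 k)) * log(2)."""
    f = math.floor(math.log2(k))
    return (n - f - k / 2 ** f) * math.log(2)

def rbm_bound_binary(n, m):
    """Refined binary bound for RBM_{n,m}: the same expression with k = m + 1."""
    return nb_bound_binary(n, m + 1)

n = 5
for k in (1, 2, 4, 8, 16):
    print(f"M_{{{n},{k}}}    bound = {nb_bound_binary(n, k):.4f}")
for m in (0, 1, 3, 7, 15):
    print(f"RBM_{{{n},{m}}}  bound = {rbm_bound_binary(n, m):.4f}")
# The bounds reach 0 at k = 2**(n-1) = 16 and m = 2**(n-1) - 1 = 15, matching the
# condition m <= 2**(n-1) - 1 stated in the theorem above.
```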