Online k-MLE for mixture modeling with exponential families

28/10/2015
Publication GSI2015
OAI : oai:www.see.asso.fr:11784:14291

Abstract

This paper addresses the problem of online learning of finite statistical mixtures of exponential families. A short review of the Expectation-Maximization (EM) algorithm and its online extensions is given. Building on these extensions and on the description of the k-Maximum Likelihood Estimator (k-MLE), three online extensions of the latter are proposed. To illustrate them, we consider the case of mixtures of Wishart distributions, giving details and providing some experiments.

Authors

Christophe Saint-Jean, Frank Nielsen

License

Creative Commons Attribution-ShareAlike 4.0 International

DOI: 10.23723/11784/14291
Publisher: SEE
Publication year: 2015
Keywords: Mixture modeling, Online learning, k-MLE, Wishart distribution

Online k-MLE for mixture modelling with exponential families
Christophe Saint-Jean, Frank Nielsen
Geometric Science of Information (GSI) 2015, Oct 28-30, 2015, Ecole Polytechnique, Paris-Saclay

Application Context (2/27)

We are interested in building a system (a model) that evolves as new data become available: $x_1, x_2, \ldots, x_N, \ldots$
- The time needed to process a new observation must be constant w.r.t. the number of observations.
- The memory required by the system is bounded.
Denote by $\pi$ the unknown distribution of $X$.

Outline of this talk (3/27)

1. Online learning of exponential families
2. Online learning of mixtures of exponential families: introduction, EM, k-MLE; recursive EM, online EM; stochastic approximations of k-MLE; experiments
3. Conclusions

Reminder: (Regular) Exponential Family (4/27)

First, $\pi$ will be approximated by a member of a (regular) exponential family (EF):
$$\mathrm{EF} = \big\{ f(x;\theta) = \exp\{\langle s(x), \theta\rangle + k(x) - F(\theta)\} \mid \theta \in \Theta \big\}$$
Terminology:
- $\lambda$: source parameters
- $\theta$: natural parameters
- $\eta$: expectation parameters
- $s(x)$: sufficient statistic
- $k(x)$: auxiliary carrier measure
- $F(\theta)$: the log-normalizer, differentiable and strictly convex, with $\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < \infty\}$ an open convex set
Almost all common distributions are EF members, with exceptions such as the uniform and Cauchy distributions.

Reminder: Maximum Likelihood Estimate (MLE) (5/27)

Maximum likelihood estimate for a general p.d.f., assuming a sample $\chi = \{x_1, x_2, \ldots, x_N\}$ of i.i.d. observations:
$$\hat\theta^{(N)} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} f(x_i;\theta) = \operatorname*{argmin}_{\theta} -\frac{1}{N}\sum_{i=1}^{N} \log f(x_i;\theta)$$
Maximum likelihood estimate for an EF:
$$\hat\theta^{(N)} = \operatorname*{argmin}_{\theta} -\Big\langle \frac{1}{N}\sum_i s(x_i), \theta\Big\rangle - \mathrm{cst}(\chi) + F(\theta)$$
which is solved exactly in $H$, the space of expectation parameters:
$$\hat\eta^{(N)} = \nabla F(\hat\theta^{(N)}) = \frac{1}{N}\sum_i s(x_i) \iff \hat\theta^{(N)} = (\nabla F)^{-1}\Big(\frac{1}{N}\sum_i s(x_i)\Big)$$

Exact Online MLE for an exponential family (6/27)

A recursive formulation is easily obtained.

Algorithm 1: Exact online MLE for an EF
Input: a sequence $S$ of observations
Input: functions $s$ and $(\nabla F)^{-1}$ for some EF
Output: a sequence of MLEs for all observations seen so far
$\hat\eta^{(0)} = 0$; $N = 1$;
for $x_N \in S$ do
    $\hat\eta^{(N)} = \hat\eta^{(N-1)} + N^{-1}(s(x_N) - \hat\eta^{(N-1)})$;
    yield $\hat\eta^{(N)}$ or yield $(\nabla F)^{-1}(\hat\eta^{(N)})$;
    $N = N + 1$;

Analytical expressions of $(\nabla F)^{-1}$ exist for most EFs (but not all).

Case of the multivariate normal distribution (MVN) (7/27)

Probability density function of the MVN:
$$N(x;\mu,\Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\Big(-\tfrac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\Big)$$
One possible decomposition:
$$N(x;\theta_1,\theta_2) = \exp\Big\{\langle\theta_1, x\rangle + \langle\theta_2, -xx^{T}\rangle_F - \tfrac{1}{4}\theta_1^{T}\theta_2^{-1}\theta_1 - \tfrac{d}{2}\log(\pi) + \tfrac{1}{2}\log|\theta_2|\Big\}$$
$$\implies s(x) = (x, -xx^{T}), \qquad (\nabla F)^{-1}(\eta_1,\eta_2) = \Big((-\eta_1\eta_1^{T} - \eta_2)^{-1}\eta_1,\ \tfrac{1}{2}(-\eta_1\eta_1^{T} - \eta_2)^{-1}\Big)$$

Case of the Wishart distribution (8/27)

See details in the paper.
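To make slides 6-7 concrete, here is a minimal Python/NumPy sketch (not part of the paper) of Algorithm 1 specialized to the MVN decomposition above; the function names, the use of NumPy and the test stream are assumptions made only for this illustration.

```python
import numpy as np

def mvn_sufficient_stat(x):
    """s(x) = (x, -x x^T) for the MVN decomposition of slide 7."""
    return x, -np.outer(x, x)

def mvn_grad_F_inverse(eta1, eta2):
    """(nabla F)^{-1}(eta1, eta2) = ((-eta1 eta1^T - eta2)^{-1} eta1, 0.5 (-eta1 eta1^T - eta2)^{-1}),
    i.e. the natural parameters (theta1, theta2) = (Sigma^{-1} mu, 0.5 Sigma^{-1})."""
    P = np.linalg.inv(-np.outer(eta1, eta1) - eta2)   # equals the inverse empirical covariance
    return P @ eta1, 0.5 * P

def online_mle_mvn(stream, d):
    """Algorithm 1 (exact online MLE) in expectation parameters:
    eta^(N) = eta^(N-1) + N^{-1} (s(x_N) - eta^(N-1))."""
    eta1, eta2 = np.zeros(d), np.zeros((d, d))
    for N, x in enumerate(stream, start=1):
        s1, s2 = mvn_sufficient_stat(x)
        eta1 += (s1 - eta1) / N
        eta2 += (s2 - eta2) / N
        yield eta1.copy(), eta2.copy()    # map through (nabla F)^{-1} once N > d

# Usage: the natural parameters recovered from the running expectation parameters
# converge to (Sigma^{-1} mu, 0.5 Sigma^{-1}) of the sampling distribution.
rng = np.random.default_rng(0)
stream = (rng.multivariate_normal([1.0, -2.0], np.eye(2)) for _ in range(10000))
for eta1, eta2 in online_mle_mvn(stream, d=2):
    pass                                              # keep only the final estimate here
theta1, theta2 = mvn_grad_F_inverse(eta1, eta2)       # approx. ([1, -2], 0.5 * I)
```

Yielding the expectation parameters and converting at the end avoids inverting a singular matrix during the first $d$ observations.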
Finite (parametric) mixture models (9/27)

Now, $\pi$ will be approximated by a finite (parametric) mixture $f(\cdot;\theta)$ indexed by $\theta$:
$$\pi(x) \approx f(x;\theta) = \sum_{j=1}^{K} w_j f_j(x;\theta_j), \qquad 0 \le w_j \le 1, \qquad \sum_{j=1}^{K} w_j = 1$$
where the $w_j$ are the mixing proportions and the $f_j$ are the component distributions. When all the $f_j$'s are EFs, it is called a mixture of EFs (MEF).

[Figure: the density 0.1*dnorm(x) + 0.6*dnorm(x,4,2) + 0.3*dnorm(x,-2,0.5), showing the unknown true distribution f*, the mixture distribution f and the component density functions f_j]

Incompleteness in mixture models (10/27)

The incomplete, observable data $\chi = \{x_1, \ldots, x_N\}$ are obtained deterministically from the complete, unobservable data $\chi_c = \{y_1 = (x_1, z_1), \ldots, y_N = (x_N, z_N)\}$, with
$$Z_i \sim \mathrm{Cat}_K(w), \qquad X_i \mid Z_i = j \sim f_j(\cdot;\theta_j)$$
For a MEF, the joint density $p(x,z;\theta)$ is an EF:
$$\log p(x,z;\theta) = \sum_{j=1}^{K} [z=j]\,\{\log(w_j) + \langle\theta_j, s_j(x)\rangle + k_j(x) - F_j(\theta_j)\}$$
$$= \sum_{j=1}^{K} \Big\langle \begin{pmatrix}[z=j] \\ [z=j]\, s_j(x)\end{pmatrix}, \begin{pmatrix}\log w_j - F_j(\theta_j) \\ \theta_j\end{pmatrix} \Big\rangle + k(x,z)$$

Expectation-Maximization (EM) [1] (11/27)

The EM algorithm iteratively maximizes $Q(\theta; \hat\theta^{(t)}, \chi)$.

Algorithm 2: EM algorithm
Input: $\hat\theta^{(0)}$, initial parameters of the model
Input: $\chi^{(N)} = \{x_1, \ldots, x_N\}$
Output: a (local) maximizer $\hat\theta^{(t^*)}$ of $\log f(\chi;\theta)$
$t \leftarrow 0$;
repeat
    Compute $Q(\theta; \hat\theta^{(t)}, \chi) := E_{\hat\theta^{(t)}}[\log p(\chi_c;\theta) \mid \chi]$ ;  // E-step
    Choose $\hat\theta^{(t+1)} = \operatorname*{argmax}_\theta Q(\theta; \hat\theta^{(t)}, \chi)$ ;  // M-step
    $t \leftarrow t + 1$;
until convergence of the complete log-likelihood;

EM for a MEF (12/27)

For a mixture, the E-step is always explicit:
$$\hat z^{(t)}_{i,j} = \hat w^{(t)}_j f_j(x_i; \hat\theta^{(t)}_j) \Big/ \sum_{j'} \hat w^{(t)}_{j'} f_{j'}(x_i; \hat\theta^{(t)}_{j'})$$
For a MEF, the M-step then reduces to:
$$\hat\theta^{(t+1)} = \operatorname*{argmax}_{\{w_j,\theta_j\}} \sum_{j=1}^{K} \Big\langle \begin{pmatrix}\sum_i \hat z^{(t)}_{i,j} \\ \sum_i \hat z^{(t)}_{i,j} s_j(x_i)\end{pmatrix}, \begin{pmatrix}\log w_j - F_j(\theta_j) \\ \theta_j\end{pmatrix} \Big\rangle$$
$$\hat w^{(t+1)}_j = \frac{1}{N}\sum_{i=1}^{N} \hat z^{(t)}_{i,j}, \qquad \hat\eta^{(t+1)}_j = \nabla F_j(\hat\theta^{(t+1)}_j) = \frac{\sum_i \hat z^{(t)}_{i,j} s_j(x_i)}{\sum_i \hat z^{(t)}_{i,j}} \quad \text{(weighted average of the SS)}$$

k-Maximum Likelihood Estimator (k-MLE) [2] (13/27)

The k-MLE introduces a geometric split $\chi = \bigcup_{j=1}^{K} \hat\chi^{(t)}_j$ to accelerate EM:
$$\tilde z^{(t)}_{i,j} = \big[\operatorname*{argmax}_{j'} \hat w^{(t)}_{j'} f_{j'}(x_i; \hat\theta^{(t)}_{j'}) = j\big]$$
Equivalently, it amounts to maximizing $Q$ over the partition $Z$ [3]. For a MEF, the M-step of the k-MLE then reduces to:
$$\hat\theta^{(t+1)} = \operatorname*{argmax}_{\{w_j,\theta_j\}} \sum_{j=1}^{K} \Big\langle \begin{pmatrix}|\hat\chi^{(t)}_j| \\ \sum_{x_i\in\hat\chi^{(t)}_j} s_j(x_i)\end{pmatrix}, \begin{pmatrix}\log w_j - F_j(\theta_j) \\ \theta_j\end{pmatrix} \Big\rangle$$
$$\hat w^{(t+1)}_j = |\hat\chi^{(t)}_j| / N, \qquad \hat\eta^{(t+1)}_j = \nabla F_j(\hat\theta^{(t+1)}_j) = \frac{\sum_{x_i\in\hat\chi^{(t)}_j} s_j(x_i)}{|\hat\chi^{(t)}_j|} \quad \text{(cluster-wise unweighted average of the SS)}$$

Online learning of mixtures (14/27)

Consider now the online setting $x_1, x_2, \ldots, x_N, \ldots$. Denote by $\hat\theta^{(N)}$ or $\hat\eta^{(N)}$ the parameter estimate after processing $N$ observations, and by $\hat\theta^{(0)}$ or $\hat\eta^{(0)}$ their initial values.
Remark: for a fixed-size dataset $\chi$, one may apply multiple passes (with shuffling) over $\chi$. The increase of the likelihood function is no longer guaranteed after an iteration.

Stochastic approximations of EM (1) (15/27)

There are two main approaches to online EM-like estimation.

Stochastic M-step: Recursive EM (1984) [5]
$$\hat\theta^{(N)} = \hat\theta^{(N-1)} + \{N\, I_c(\hat\theta^{(N-1)})\}^{-1} \nabla_\theta \log f(x_N; \hat\theta^{(N-1)})$$
where $I_c$ is the Fisher information matrix for the complete data:
$$I_c(\hat\theta^{(N-1)}) = -E_{\hat\theta^{(N-1)}}\Big[\frac{\partial^2 \log p(x,z;\theta)}{\partial\theta\,\partial\theta^{T}}\Big]$$
A justification for this formula comes from Fisher's identity:
$$\nabla_\theta \log f(x;\theta) = E_\theta[\nabla_\theta \log p(x,z;\theta) \mid x]$$
One recognizes a second-order stochastic gradient ascent, which requires updating and inverting $I_c$ after each iteration.

Stochastic approximations of EM (2) (16/27)

Stochastic E-step: Online EM (2009) [7]
$$\hat Q^{(N)}(\theta) = \hat Q^{(N-1)}(\theta) + \alpha^{(N)}\Big(E_{\hat\theta^{(N-1)}}[\log p(x_N, z_N; \theta) \mid x_N] - \hat Q^{(N-1)}(\theta)\Big)$$
In the case of a MEF, the algorithm works only with the conditional expectation of the sufficient statistics for the complete data:
$$\hat z_{N,j} = E_{\hat\theta^{(N-1)}}[z_{N,j} \mid x_N]$$
$$\begin{pmatrix}\hat S^{(N)}_{w_j} \\ \hat S^{(N)}_{\theta_j}\end{pmatrix} = \begin{pmatrix}\hat S^{(N-1)}_{w_j} \\ \hat S^{(N-1)}_{\theta_j}\end{pmatrix} + \alpha^{(N)}\Bigg(\begin{pmatrix}\hat z_{N,j} \\ \hat z_{N,j}\, s_j(x_N)\end{pmatrix} - \begin{pmatrix}\hat S^{(N-1)}_{w_j} \\ \hat S^{(N-1)}_{\theta_j}\end{pmatrix}\Bigg)$$
The M-step is unchanged:
$$\hat w^{(N)}_j = \hat\eta^{(N)}_{w_j} = \hat S^{(N)}_{w_j}, \qquad \hat\theta^{(N)}_j = (\nabla F_j)^{-1}\big(\hat\eta^{(N)}_{\theta_j} = \hat S^{(N)}_{\theta_j} / \hat S^{(N)}_{w_j}\big)$$

Stochastic approximations of EM (3) (17/27)

Some properties:
- Initial values $\hat S^{(0)}$ may be used to introduce a "prior": $\hat S^{(0)}_{w_j} = w_j$, $\hat S^{(0)}_{\theta_j} = w_j\,\eta^{(0)}_j$.
- Parameter constraints are automatically respected.
- No matrix to invert!
- A policy for $\alpha^{(N)}$ has to be chosen (see [7]).
- Consistent, and asymptotically equivalent to recursive EM.

Stochastic approximations of k-MLE (1) (18/27)

In order to keep the previous advantages of online EM in an online k-MLE, the only remaining choice concerns the way $x_N$ is assigned to a cluster.

Strategy 1: maximize the likelihood of the complete data $(x_N, z_N)$:
$$\tilde z_{N,j} = \big[\operatorname*{argmax}_{j'} \hat w^{(N-1)}_{j'} f_{j'}(x_N; \hat\theta^{(N-1)}_{j'}) = j\big]$$
This is equivalent to online CEM and similar to MacQueen's iterative k-means.
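As an illustration only (not the authors' implementation), the sketch below combines the sufficient-statistic recursion of slide 16 with the hard assignment of Strategy 1 for a mixture of univariate Gaussians; the parameterization via $s(x) = (x, x^2)$, the step-size policy $\alpha^{(N)} = 1/(N+1)$, the variance floor and all names are assumptions made for this example.

```python
import numpy as np

def online_kmle_strategy1(stream, w0, mu0, var0, alpha=lambda n: 1.0 / (n + 1.0)):
    """Online k-MLE, Strategy 1 (hard assignment), sketched for a mixture of
    K univariate Gaussians with sufficient statistic s(x) = (x, x^2)."""
    w = np.array(w0, dtype=float)
    mu = np.array(mu0, dtype=float)
    var = np.array(var0, dtype=float)
    S_w = w.copy()                      # S_w^(0)     = w_j  ("prior" initialization, slide 17)
    S_x = S_w * mu                      # S_theta^(0) = w_j * eta_j^(0), first coordinate
    S_xx = S_w * (var + mu ** 2)        # second coordinate of the expectation parameter
    for N, x in enumerate(stream, start=1):
        # Stochastic E-step, Strategy 1: z~_{N,j} = [argmax_j w_j f_j(x_N; theta_j) = j]
        logp = np.log(w) - 0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var
        z = np.zeros_like(w)
        z[np.argmax(logp)] = 1.0
        # Sufficient-statistic recursion of online EM, with the hard z instead of the soft z
        a = alpha(N)
        S_w += a * (z - S_w)
        S_x += a * (z * x - S_x)
        S_xx += a * (z * x * x - S_xx)
        # M-step: w_j = S_w_j, eta_theta_j = S_theta_j / S_w_j, then back to (mu_j, var_j).
        # The ratios are unchanged for the non-selected components, so only the parameters
        # of the selected component actually move (plus the mixing proportions).
        w = S_w.copy()
        mu = S_x / S_w
        var = np.maximum(S_xx / S_w - mu ** 2, 1e-8)   # guards against degenerate components
        yield w, mu, var
```

Keeping the last yielded triple on a shuffled stream drawn from, e.g., the two-component target $\pi = 0.5\,N(0,1) + 0.5\,N(\mu_2, \sigma_2^2)$ used later in the experiments gives the Strategy 1 estimate; replacing the argmax by the posterior responsibilities $\hat z_{N,j}$ recovers the online EM update of slide 16, and sampling the assignment from those responsibilities gives Strategy 3 below.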
Stochastic approximations of k-MLE (2) (19/27)

Strategy 2: maximize the likelihood of the complete data $(x_N, z_N)$ after the M-step:
$$\tilde z_{N,j} = \big[\operatorname*{argmax}_{j'} \hat w^{(N)}_{j'} f_{j'}(x_N; \hat\theta^{(N)}_{j'}) = j\big]$$
This is similar to Hartigan's method for k-means. Additional cost: all possible M-steps have to be pre-computed for the stochastic E-step.

Stochastic approximations of k-MLE (3) (20/27)

Strategy 3: draw $\tilde z_N$ from the categorical distribution
$$\tilde z_N \sim \mathrm{Cat}_K\big(\{p_j \propto \hat w^{(N-1)}_j f_j(x_N; \hat\theta^{(N-1)}_j)\}_j\big)$$
This is similar to the sampling step in Stochastic EM [3]. The motivation is to try to break the inconsistency of k-MLE.
For Strategies 1 and 3, the M-step reduces to the update of the parameters of a single component.

Experiments (21/27)

True distribution: $\pi = 0.5\,N(0,1) + 0.5\,N(\mu_2, \sigma_2^2)$, with different values of $\mu_2, \sigma_2$ giving more or less overlap between the components. A small subset of the observations is used for initialization (k-MLE++ / k-MLE). A video illustrates the inconsistency of online k-MLE.

Experiments on Wishart (22/27)

Conclusions - Future works (23/27)

On consistency:
- EM and online EM are consistent.
- k-MLE and online k-MLE (Strategies 1 and 2) are inconsistent (due to the Bayes error incurred by maximizing the classification likelihood).
- Online stochastic k-MLE (Strategy 3): consistency is an open question.
So, when components overlap, online EM > k-MLE > online k-MLE for parameter learning. It remains to study how the dimension influences the inconsistency/convergence rate of online k-MLE. The convergence rate is lower for online methods (sub-linear convergence of SGD).
Time for an update vs. sample size: online k-MLE (1, 3) < online EM < online k-MLE (2) << k-MLE.

(24/27) Online EM appears to be the best compromise!

References I (25/27)

[1] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
[2] Nielsen, F.: On learning statistical mixtures maximizing the complete likelihood. Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), AIP Conference Proceedings, 1641, pp. 238-245, 2014.
[3] Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions. Computational Statistics and Data Analysis, 14(3), pp. 315-332, 1992.

References II (26/27)

[4] Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Statistics and Computing, 17(3), pp. 209-218, 2007.
[5] Titterington, D.M.: Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B (Methodological), 46(2), pp. 257-267, 1984.
[6] Amari, S.-I.: Natural gradient works efficiently in learning. Neural Computation, 10(2), pp. 251-276, 1998.
[7] Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society, Series B (Methodological), 71(3), pp. 593-613, 2009.

References III (27/27)

[8] Neal, R.M., Hinton, G.E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M.I. (ed.), Learning in Graphical Models, pp. 355-368. MIT Press, Cambridge, 1999.
[9] Bottou, L.: Online algorithms and stochastic approximations. In Saad, D. (ed.), Online Learning and Neural Networks, Cambridge University Press, 1998.