Online k-MLE for mixture modeling with exponential families

Publication: GSI 2015


This paper addresses the problem of online learning of finite statistical mixtures of exponential families. We briefly review the Expectation-Maximization (EM) algorithm and its online extensions. Building on these extensions and on the k-Maximum Likelihood Estimator (k-MLE), we propose three online extensions of the latter. To illustrate them, we consider the case of mixtures of Wishart distributions, giving details and providing experiments.



Watch the video




Creative Commons Attribution-ShareAlike 4.0 International





DOI: 10.23723/11784/14291
Creators: Frank Nielsen, Christophe Saint-Jean
Subjects: Mixture modeling; Online learning; k-MLE; Wishart distribution
Created: Sun 8 Nov 2015. Updated: Wed 31 Aug 2016. Submitted: Sat 24 Feb 2018.


Online k-MLE for mixture modelling with exponential families
Christophe Saint-Jean, Frank Nielsen
Geometric Science of Information (GSI) 2015, Oct 28-30, 2015, Ecole Polytechnique, Paris-Saclay

Application context (2/27)
We are interested in building a system (a model) which evolves as new data become available: $x_1, x_2, \ldots, x_N, \ldots$
- The time needed to process a new observation must be constant w.r.t. the number of observations.
- The memory required by the system is bounded.
Denote by $\pi$ the unknown distribution of $X$.

Outline of this talk (3/27)
1. Online learning of exponential families
2. Online learning of mixtures of exponential families: introduction, EM, k-MLE; recursive EM, online EM; stochastic approximations of k-MLE; experiments
3. Conclusions

Reminder: (regular) exponential family (4/27)
Firstly, $\pi$ will be approximated by a member of a (regular) exponential family (EF):
$$\mathrm{EF} = \big\{ f(x;\theta) = \exp\big(\langle s(x), \theta\rangle + k(x) - F(\theta)\big) \;\big|\; \theta \in \Theta \big\}$$
Terminology: $\lambda$ the source parameters, $\theta$ the natural parameters, $\eta$ the expectation parameters, $s(x)$ the sufficient statistic, $k(x)$ the auxiliary carrier measure, and $F(\theta)$ the log-normalizer, which is differentiable and strictly convex on $\Theta = \{\theta \in \mathbb{R}^D \mid F(\theta) < \infty\}$, an open convex set. Almost all common distributions are EF members, with exceptions such as the uniform and Cauchy distributions.

Reminder: maximum likelihood estimate (MLE) (5/27)
The MLE for a general p.d.f., assuming a sample $\chi = \{x_1, x_2, \ldots, x_N\}$ of i.i.d. observations, is
$$\hat\theta^{(N)} = \operatorname*{argmax}_{\theta} \prod_{i=1}^{N} f(x_i;\theta) = \operatorname*{argmin}_{\theta}\; -\frac{1}{N}\sum_{i=1}^{N} \log f(x_i;\theta).$$
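As a quick numerical sanity check of the MLE definition above (an illustrative sketch, not part of the slides; the simulated sample and its parameters are our own), the closed-form Gaussian MLE indeed minimizes the average negative log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # i.i.d. sample chi

def avg_nll(mu, sigma):
    """Average negative log-likelihood of a univariate Gaussian over chi."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (x - mu) ** 2 / (2 * sigma**2))

# Closed-form MLE: sample mean and (biased) sample standard deviation.
mu_hat, sigma_hat = x.mean(), x.std()

# The MLE scores better than nearby parameter values.
assert avg_nll(mu_hat, sigma_hat) < avg_nll(mu_hat + 0.1, sigma_hat)
assert avg_nll(mu_hat, sigma_hat) < avg_nll(mu_hat, sigma_hat + 0.1)
```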
The MLE for an EF is
$$\hat\theta^{(N)} = \operatorname*{argmin}_{\theta}\; -\Big\langle \frac{1}{N}\sum_i s(x_i),\, \theta\Big\rangle - \mathrm{cst}(\chi) + F(\theta),$$
which is solved exactly in $H$, the space of expectation parameters:
$$\hat\eta^{(N)} = \nabla F(\hat\theta^{(N)}) = \frac{1}{N}\sum_i s(x_i) \;\Longleftrightarrow\; \hat\theta^{(N)} = (\nabla F)^{-1}\Big(\frac{1}{N}\sum_i s(x_i)\Big).$$

Exact online MLE for an exponential family (6/27)
A recursive formulation is easily obtained.
Algorithm 1: exact online MLE for an EF
    Input: a sequence $S$ of observations; the functions $s$ and $(\nabla F)^{-1}$ for some EF
    Output: a sequence of MLEs for all observations seen so far
    $\hat\eta^{(0)} = 0$; $N = 1$
    for $x_N \in S$:
        $\hat\eta^{(N)} = \hat\eta^{(N-1)} + N^{-1}\big(s(x_N) - \hat\eta^{(N-1)}\big)$
        yield $\hat\eta^{(N)}$, or yield $(\nabla F)^{-1}(\hat\eta^{(N)})$
        $N = N + 1$
Analytical expressions of $(\nabla F)^{-1}$ exist for most EFs (but not all).

Case of the multivariate normal distribution (MVN) (7/27)
Probability density function of the MVN:
$$N(x;\mu,\Sigma) = (2\pi)^{-d/2}\,|\Sigma|^{-1/2}\exp\big(-\tfrac12 (x-\mu)^T \Sigma^{-1} (x-\mu)\big)$$
One possible decomposition:
$$N(x;\theta_1,\theta_2) = \exp\big\{\langle\theta_1, x\rangle + \langle\theta_2, -xx^T\rangle_F - \tfrac14\,\theta_1^T\theta_2^{-1}\theta_1 - \tfrac{d}{2}\log\pi + \tfrac12\log|\theta_2|\big\}$$
$$\Longrightarrow\quad s(x) = (x, -xx^T), \qquad (\nabla F)^{-1}(\eta_1,\eta_2) = \big((-\eta_1\eta_1^T - \eta_2)^{-1}\eta_1,\; \tfrac12(-\eta_1\eta_1^T - \eta_2)^{-1}\big)$$

Case of the Wishart distribution (8/27)
See details in the paper.

Finite (parametric) mixture models (9/27)
Now, $\pi$ will be approximated by a finite (parametric) mixture $f(\cdot;\theta)$ indexed by $\theta$:
$$\pi(x) \approx f(x;\theta) = \sum_{j=1}^{K} w_j\, f_j(x;\theta_j), \qquad 0 \le w_j \le 1,\quad \sum_{j=1}^{K} w_j = 1,$$
where the $w_j$ are the mixing proportions and the $f_j$ are the component distributions. When all $f_j$'s are EFs, it is called a mixture of EFs (MEF).
[Figure: the mixture density 0.1*dnorm(x) + 0.6*dnorm(x,4,2) + 0.3*dnorm(x,-2,0.5), showing the unknown true distribution f*, the mixture distribution f, and the component density functions f_j.]

Incompleteness in mixture models (10/27)
The incomplete, observable data χ = {x_1, ..., x_N} are a deterministic function of the complete, unobservable data χ_c = {y_1 = (x_1, z_1), ...
..., y_N = (x_N, z_N)}, where $Z_i \sim \mathrm{Cat}_K(w)$ and $X_i \mid Z_i = j \sim f_j(\cdot\,;\theta_j)$.
For a MEF, the joint density $p(x,z;\theta)$ is an EF:
$$\log p(x,z;\theta) = \sum_{j=1}^{K} [z=j]\big\{\log w_j + \langle \theta_j, s_j(x)\rangle + k_j(x) - F_j(\theta_j)\big\} = \sum_{j=1}^{K} \Big\langle \binom{[z=j]}{[z=j]\, s_j(x)},\, \binom{\log w_j - F_j(\theta_j)}{\theta_j} \Big\rangle + k(x,z)$$

Expectation-Maximization (EM) [1] (11/27)
The EM algorithm iteratively maximizes $Q(\theta; \hat\theta^{(t)}, \chi)$.
Algorithm 2: EM algorithm
    Input: $\hat\theta^{(0)}$, the initial parameters of the model; $\chi^{(N)} = \{x_1, \ldots, x_N\}$
    Output: a (local) maximizer $\hat\theta^{(t^*)}$ of $\log f(\chi;\theta)$
    $t \leftarrow 0$
    repeat
        compute $Q(\theta; \hat\theta^{(t)}, \chi) := E_{\hat\theta^{(t)}}[\log p(\chi_c;\theta) \mid \chi]$   // E-step
        choose $\hat\theta^{(t+1)} = \operatorname{argmax}_\theta Q(\theta; \hat\theta^{(t)}, \chi)$   // M-step
        $t \leftarrow t + 1$
    until convergence of the complete log-likelihood

EM for a MEF (12/27)
For a mixture, the E-step is always explicit:
$$\hat z^{(t)}_{i,j} = \hat w^{(t)}_j f(x_i;\hat\theta^{(t)}_j) \Big/ \sum_{j'} \hat w^{(t)}_{j'} f(x_i;\hat\theta^{(t)}_{j'})$$
For a MEF, the M-step then reduces to:
$$\hat\theta^{(t+1)} = \operatorname*{argmax}_{\{w_j,\theta_j\}} \sum_{j=1}^{K} \Big\langle \binom{\sum_i \hat z^{(t)}_{i,j}}{\sum_i \hat z^{(t)}_{i,j}\, s_j(x_i)},\, \binom{\log w_j - F_j(\theta_j)}{\theta_j} \Big\rangle$$
$$\hat w^{(t+1)}_j = \frac{1}{N}\sum_{i=1}^{N} \hat z^{(t)}_{i,j}, \qquad \hat\eta^{(t+1)}_j = \nabla F(\hat\theta^{(t+1)}_j) = \frac{\sum_i \hat z^{(t)}_{i,j}\, s_j(x_i)}{\sum_i \hat z^{(t)}_{i,j}} \quad\text{(a weighted average of the sufficient statistics)}$$

k-Maximum Likelihood Estimator (k-MLE) [2] (13/27)
The k-MLE introduces a geometric split $\chi = \bigcup_{j=1}^{K} \hat\chi^{(t)}_j$ to accelerate EM:
$$\tilde z^{(t)}_{i,j} = \big[\operatorname*{argmax}_{j'}\, \hat w^{(t)}_{j'} f(x_i;\hat\theta^{(t)}_{j'}) = j\big]$$
Equivalently, this amounts to maximizing $Q$ over the partition $Z$ [3]. For a MEF, the M-step of the k-MLE then reduces to:
$$\hat\theta^{(t+1)} = \operatorname*{argmax}_{\{w_j,\theta_j\}} \sum_{j=1}^{K} \Big\langle \binom{|\hat\chi^{(t)}_j|}{\sum_{x_i\in\hat\chi^{(t)}_j} s_j(x_i)},\, \binom{\log w_j - F_j(\theta_j)}{\theta_j} \Big\rangle$$
$$\hat w^{(t+1)}_j = |\hat\chi^{(t)}_j|/N, \qquad \hat\eta^{(t+1)}_j = \nabla F(\hat\theta^{(t+1)}_j) = \frac{\sum_{x_i\in\hat\chi^{(t)}_j} s_j(x_i)}{|\hat\chi^{(t)}_j|} \quad\text{(a cluster-wise unweighted average of the sufficient statistics)}$$

Online learning of mixtures (14/27)
Consider now the online setting $x_1, x_2, \ldots, x_N, \ldots$ Denote by $\hat\theta^{(N)}$ or $\hat\eta^{(N)}$ the parameter estimate after processing $N$ observations, and by $\hat\theta^{(0)}$ or $\hat\eta^{(0)}$ their initial values.
Remark: for a fixed-size dataset $\chi$, one may apply multiple passes (with shuffling) over $\chi$; however, an increase of the likelihood function after each iteration is no longer guaranteed.
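Before moving to the stochastic approximations, Algorithm 1's running-average recursion can be sketched for a single univariate Gaussian. The stream parameters and the univariate specialization of $s$ and $(\nabla F)^{-1}$ below are our own worked example, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
stream = rng.normal(loc=-1.0, scale=2.0, size=50_000)  # observations x_1, x_2, ...

def s(x):
    """Sufficient statistic of the univariate Gaussian: s(x) = (x, -x^2)."""
    return np.array([x, -x * x])

def grad_F_inv(eta):
    """(grad F)^{-1}: expectation parameters -> natural parameters (d = 1)."""
    eta1, eta2 = eta
    var = -eta1 * eta1 - eta2            # recovered variance sigma^2
    return np.array([eta1 / var, 1.0 / (2.0 * var)])

# Algorithm 1: eta^(N) = eta^(N-1) + N^{-1} (s(x_N) - eta^(N-1)).
eta = np.zeros(2)
for n, x in enumerate(stream, start=1):
    eta += (s(x) - eta) / n

theta = grad_F_inv(eta)                              # natural-parameter MLE
mu_hat, var_hat = eta[0], -eta[0] ** 2 - eta[1]      # source parameters
```

Because the recursion is an exact running average, after N observations eta equals the batch estimate $\frac{1}{N}\sum_i s(x_i)$ up to floating-point error, so this online estimator is exact, not approximate.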
Stochastic approximations of EM (1) (15/27)
There are two main approaches to online EM-like estimation.
Stochastic M-step: recursive EM (1984) [5]:
$$\hat\theta^{(N)} = \hat\theta^{(N-1)} + \big\{N\, I_c(\hat\theta^{(N-1)})\big\}^{-1} \nabla_\theta \log f(x_N; \hat\theta^{(N-1)})$$
where $I_c$ is the Fisher information matrix for the complete data:
$$I_c(\hat\theta^{(N-1)}) = -E_{\hat\theta^{(N-1)}}\left[\frac{\partial^2 \log p(x,z;\theta)}{\partial\theta\,\partial\theta^T}\right]$$
A justification for this formula comes from Fisher's identity: $\nabla_\theta \log f(x;\theta) = E_\theta[\nabla_\theta \log p(x,z;\theta) \mid x]$.
One can recognize a second-order stochastic gradient ascent, which requires updating and inverting $I_c$ after each iteration.

Stochastic approximations of EM (2) (16/27)
Stochastic E-step: online EM (2009) [7]:
$$\hat Q^{(N)}(\theta) = \hat Q^{(N-1)}(\theta) + \alpha^{(N)}\Big( E_{\hat\theta^{(N-1)}}[\log p(x_N, z_N; \theta) \mid x_N] - \hat Q^{(N-1)}(\theta)\Big)$$
In the case of a MEF, the algorithm works only with the conditional expectation of the sufficient statistics for the complete data:
$$\hat z_{N,j} = E_{\hat\theta^{(N-1)}}[z_{N,j} \mid x_N], \qquad \binom{\hat S^{(N)}_{w_j}}{\hat S^{(N)}_{\theta_j}} = \binom{\hat S^{(N-1)}_{w_j}}{\hat S^{(N-1)}_{\theta_j}} + \alpha^{(N)}\left(\binom{\hat z_{N,j}}{\hat z_{N,j}\, s_j(x_N)} - \binom{\hat S^{(N-1)}_{w_j}}{\hat S^{(N-1)}_{\theta_j}}\right)$$
The M-step is unchanged:
$$\hat w^{(N)}_j = \hat\eta^{(N)}_{w_j} = \hat S^{(N)}_{w_j}, \qquad \hat\theta^{(N)}_j = (\nabla F_j)^{-1}\Big(\hat\eta^{(N)}_{\theta_j} = \hat S^{(N)}_{\theta_j}/\hat S^{(N)}_{w_j}\Big)$$

Stochastic approximations of EM (3) (17/27)
Some properties:
- The initial values $\hat S^{(0)}$ may be used to introduce a "prior": $\hat S^{(0)}_{w_j} = w_j$, $\hat S^{(0)}_{\theta_j} = w_j\, \eta^{(0)}_j$.
- The parameter constraints are automatically respected.
- There is no matrix to invert!
- A policy for $\alpha^{(N)}$ has to be chosen (see [7]).
- The estimator is consistent and asymptotically equivalent to the recursive EM.

Stochastic approximations of k-MLE (1) (18/27)
To keep the above advantages of online EM in an online k-MLE, the only remaining choice concerns the way to assign $x_N$ to a cluster.
Strategy 1: maximize the likelihood of the complete data $(x_N, z_N)$:
$$\tilde z_{N,j} = \big[\operatorname*{argmax}_{j'}\, \hat w^{(N-1)}_{j'} f(x_N; \hat\theta^{(N-1)}_{j'}) = j\big]$$
This is equivalent to online CEM, and similar to MacQueen's iterative k-means.

Stochastic approximations of k-MLE (2) (19/27)
Strategy 2: maximize the likelihood of the complete data $(x_N, z_N)$ after the M-step:
$$\tilde z_{N,j} = \big[\operatorname*{argmax}_{j'}\, \hat w^{(N)}_{j'} f(x_N; \hat\theta^{(N)}_{j'}) = j\big]$$
This is similar to Hartigan's method for k-means.
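Strategy 1 (hard assignment of $x_N$ to the most likely component, followed by an M-step that touches only that component) can be sketched for a two-component univariate Gaussian mixture. The stream, the pseudo-counts acting as a crude prior, and the initialization near the true components are illustrative assumptions of this sketch, not prescriptions from the paper; online CEM needs a reasonable initialization:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy stream from a well-separated mixture: 0.3 N(-5, 1) + 0.7 N(5, 1).
z = rng.random(20_000) < 0.3
stream = np.where(z, rng.normal(-5.0, 1.0, 20_000), rng.normal(5.0, 1.0, 20_000))

# Per-component state: pseudo-counts (acting as a prior) and running averages
# of the sufficient statistic s(x) = (x, -x^2), i.e. expectation parameters.
counts = np.array([1.0, 1.0])
S = np.array([[-5.0, -26.0],   # eta for component 0: mean -5, variance 1
              [5.0, -26.0]])   # eta for component 1: mean  5, variance 1

def log_density(x, eta):
    """Univariate Gaussian log-density from expectation parameters."""
    mu, var = eta[0], -eta[0] ** 2 - eta[1]
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

for x in stream:
    w = counts / counts.sum()
    # Strategy 1: hard-assign x_N to argmax_j w_j f_j(x_N; theta_j) ...
    j = int(np.argmax([np.log(w[k]) + log_density(x, S[k]) for k in range(2)]))
    # ... then update only that component's count and running average.
    counts[j] += 1.0
    S[j] += (np.array([x, -x * x]) - S[j]) / counts[j]

w_hat = counts / counts.sum()
means = S[:, 0]
variances = -S[:, 0] ** 2 - S[:, 1]
```

Because only one component is touched per observation, the cost per update is constant in N, matching the requirement stated in the application context.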
For strategy 2, there is an additional cost: all possible M-steps must be pre-computed for the stochastic E-step.

Stochastic approximations of k-MLE (3) (20/27)
Strategy 3: draw $\tilde z_N$ from the categorical distribution $\mathrm{Cat}_K(\{p_j\}_j)$ with $p_j \propto \hat w^{(N-1)}_j f_j(x_N; \hat\theta^{(N-1)}_j)$, similarly to the sampling in stochastic EM [3]. The motivation is to try to break the inconsistency of k-MLE.
For strategies 1 and 3, the M-step reduces to updating the parameters of a single component.

Experiments (21/27)
True distribution: $\pi = 0.5\,N(0,1) + 0.5\,N(\mu_2,\sigma_2^2)$, with different values of $\mu_2, \sigma_2$ giving more or less overlap between the components. A small subset of the observations is used for initialization (k-MLE++ / k-MLE). A video illustrates the inconsistency of online k-MLE.

Experiments on Wishart (22/27)

Conclusions and future work (23/27)
On consistency:
- EM and online EM are consistent.
- k-MLE and online k-MLE (strategies 1 and 2) are inconsistent, due to the Bayes error incurred by maximizing the classification likelihood.
- Online stochastic k-MLE (strategy 3): consistency is an open question.
So, when components overlap, online EM > k-MLE > online k-MLE for parameter learning. How the dimension influences the inconsistency/convergence rate of online k-MLE remains to be studied. The convergence rate is lower for online methods (sub-linear convergence of SGD).

(24/27) Time for an update vs. sample size: online k-MLE (1, 3) < online EM < online k-MLE (2) << k-MLE. Online EM appears to be the best compromise!

References I (25/27)
[1] Dempster, A. P., Laird, N. M., Rubin, D. B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pp. 1-38, 1977.
[2] Nielsen, F.: On learning statistical mixtures maximizing the complete likelihood. Bayesian Inference and Maximum Entropy Methods in Science and Engineering (MaxEnt 2014), AIP Conference Proceedings, 1641, pp. 238-245, 2014.
[3] Celeux, G., Govaert, G.: A classification EM algorithm for clustering and two stochastic versions.
Computational Statistics and Data Analysis, 14(3), pp. 315-332, 1992.

References II (26/27)
[4] Samé, A., Ambroise, C., Govaert, G.: An online classification EM algorithm based on the mixture model. Statistics and Computing, 17(3), pp. 209-218, 2007.
[5] Titterington, D. M.: Recursive parameter estimation using incomplete data. Journal of the Royal Statistical Society, Series B (Methodological), 46(2), pp. 257-267, 1984.
[6] Amari, S.-I.: Natural gradient works efficiently in learning. Neural Computation, 10(2), pp. 251-276, 1998.
[7] Cappé, O., Moulines, E.: On-line expectation-maximization algorithm for latent data models. Journal of the Royal Statistical Society, Series B, 71(3), pp. 593-613, 2009.

References III (27/27)
[8] Neal, R. M., Hinton, G. E.: A view of the EM algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I. (ed.), Learning in Graphical Models, pp. 355-368. MIT Press, Cambridge, 1999.
[9] Bottou, L.: Online algorithms and stochastic approximations. In Saad, D. (ed.), Online Learning and Neural Networks. Cambridge University Press, 1998.