A new implementation of k-MLE for mixture modelling of Wishart distributions

28/08/2013
OAI : oai:www.see.asso.fr:2552:5071
DOI : 10.23723/2552/5071

Abstract

A new implementation of k-MLE for mixture modelling of Wishart distributions

Authors

Frank Nielsen
Christophe Saint-Jean

Metrics

90
4
1.77 MB
application/pdf
bitcache://9b5f9c1a0c5993766901dfc1c198ef6e5a97490f

License

Creative Commons: None (All rights reserved)

Sponsors

Scientific sponsors

SMF

Financial sponsors

GDR MIA
Inria
Thales

Logistics sponsors

Mines ParisTech
Université Paris-Sud
Supélec
Séminaire Léon Brillouin
Sony CSL
CNRS
IRCAM
IMB
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns="http://datacite.org/schema/kernel-4"
          xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
    <identifier identifierType="DOI">10.23723/2552/5071</identifier>
    <creators>
        <creator><creatorName>Frank Nielsen</creatorName></creator>
        <creator><creatorName>Christophe Saint-Jean</creatorName></creator>
    </creators>
    <titles>
        <title>A new implementation of k-MLE for mixture modelling of Wishart distributions</title>
    </titles>
    <publisher>SEE</publisher>
    <publicationYear>2013</publicationYear>
    <resourceType resourceTypeGeneral="Text">Text</resourceType>
    <dates>
        <date dateType="Created">Sat 5 Oct 2013</date>
        <date dateType="Updated">Mon 25 Jul 2016</date>
        <date dateType="Submitted">Mon 12 Nov 2018</date>
    </dates>
    <alternateIdentifiers>
        <alternateIdentifier alternateIdentifierType="bitstream">9b5f9c1a0c5993766901dfc1c198ef6e5a97490f</alternateIdentifier>
    </alternateIdentifiers>
    <formats>
        <format>application/pdf</format>
    </formats>
    <version>25056</version>
    <descriptions>
        <description descriptionType="Abstract"></description>
    </descriptions>
</resource>

A new implementation of k-MLE for mixture modelling of Wishart distributions
Christophe Saint-Jean, Frank Nielsen
Geometric Science of Information 2013, August 28, 2013, Mines ParisTech

Application Context (1)

We are interested in clustering varying-length sets of multivariate observations of the same dimension p:

$X_1 = \begin{pmatrix} 3.6 & 0.05 & -4 \\ 3.6 & 0.05 & -4 \\ 3.6 & 0.05 & -4 \end{pmatrix}, \quad \ldots, \quad X_N = \begin{pmatrix} 5.3 & -0.5 & 2.5 \\ 3.6 & 0.5 & 3.5 \\ 1.6 & -0.5 & 4.6 \\ -1.6 & 0.5 & 5.1 \\ -2.9 & -0.5 & 6.1 \end{pmatrix}$

The sample mean is a good but not sufficiently discriminative feature. Second-order cross-product matrices ${}^t X_i X_i$ may capture some relations between the (column) variables.

Application Context (2)

The problem is now the clustering of a set of $p \times p$ PSD matrices:
$\chi = \{ x_1 = {}^t X_1 X_1,\; x_2 = {}^t X_2 X_2,\; \ldots,\; x_N = {}^t X_N X_N \}$
Examples of applications: multispectral/DTI/radar imaging, motion retrieval systems, ...

Outline of this talk

1. MLE and Wishart distribution: exponential family and maximum likelihood estimate; Wishart distribution; two sub-families of the Wishart distribution.
2. Mixture modeling with k-MLE: original k-MLE; k-MLE for Wishart distributions; heuristics for the initialization.
3. Application to motion retrieval.

Reminder: Exponential Family (EF)

An exponential family is a set of parametric probability distributions
$\mathrm{EF} = \{ p(x; \lambda) = p_F(x; \theta) = \exp\big( \langle t(x), \theta \rangle + k(x) - F(\theta) \big) \mid \theta \in \Theta \}$
Terminology:
$\lambda$: source parameters.
$\theta$: natural parameters.
$t(x)$: sufficient statistic.
$k(x)$: auxiliary carrier measure.
$F(\theta)$: the log-normalizer, differentiable and strictly convex; $\Theta = \{ \theta \in \mathbb{R}^D \mid F(\theta) < \infty \}$ is an open convex set.
Almost all commonly used distributions are EF members, except for the uniform and Cauchy distributions.

Reminder: Maximum Likelihood Estimate (MLE)

The maximum likelihood principle is a very common approach for fitting the parameters of a distribution:
$\hat{\theta} = \arg\max_\theta L(\theta; \chi) = \arg\max_\theta \prod_{i=1}^N p(x_i; \theta) = \arg\min_\theta -\frac{1}{N} \sum_{i=1}^N \log p(x_i; \theta)$
assuming a sample $\chi = \{x_1, x_2, \ldots, x_N\}$ of i.i.d. observations.
The log-density has a convenient expression for EF members:
$\log p_F(x; \theta) = \langle t(x), \theta \rangle + k(x) - F(\theta)$
It follows that
$\hat{\theta} = \arg\max_\theta \sum_{i=1}^N \log p_F(x_i; \theta) = \arg\max_\theta \Big\langle \sum_{i=1}^N t(x_i), \theta \Big\rangle - N F(\theta)$

MLE with EF

Since F is a strictly convex, differentiable function, the MLE exists and is unique:
$\nabla F(\hat{\theta}) = \frac{1}{N} \sum_{i=1}^N t(x_i)$
Ideally, we have a closed form:
$\hat{\theta} = (\nabla F)^{-1}\Big( \frac{1}{N} \sum_{i=1}^N t(x_i) \Big)$
Otherwise, numerical methods such as Newton-Raphson can be successfully applied.

Wishart Distribution

Definition (central Wishart distribution). The Wishart distribution characterizes empirical covariance matrices of zero-mean Gaussian samples:
$W_d(X; n, S) = \frac{ |X|^{\frac{n-d-1}{2}} \exp\big( -\frac{1}{2} \operatorname{tr}(S^{-1} X) \big) }{ 2^{\frac{nd}{2}} \, |S|^{\frac{n}{2}} \, \Gamma_d\big( \frac{n}{2} \big) }$
where, for $x > 0$, $\Gamma_d(x) = \pi^{\frac{d(d-1)}{4}} \prod_{j=1}^{d} \Gamma\big( x - \frac{j-1}{2} \big)$ is the multivariate gamma function.
Remarks: $n > d - 1$ and $E[X] = nS$. It is the multivariate generalization of the chi-square distribution.
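As a quick numerical illustration of this definition and of the moment identity $E[X] = nS$, the sketch below uses SciPy's built-in Wishart distribution; it is illustrative only (not the authors' code), and the chosen values of d, n and S are arbitrary.

```python
import numpy as np
from scipy.stats import wishart

# Illustrative check of the Wishart definition: sample from W_d(.; n, S)
# and compare the empirical mean with n * S (toy values, not from the talk).
d, n = 3, 10                      # dimension d and degrees of freedom n > d - 1
S = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])   # scale matrix (symmetric positive definite)

rng = np.random.default_rng(0)
W = wishart(df=n, scale=S)
samples = W.rvs(size=5000, random_state=rng)

print("empirical mean:\n", samples.mean(axis=0))
print("n * S:\n", n * S)
print("log-density at X = n * S:", W.logpdf(n * S))
```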
Wishart Distribution as an EF

It is an exponential family:
$\log W_d(X; \theta_n, \theta_S) = \langle \theta_n, \log|X| \rangle_{\mathbb{R}} + \langle \theta_S, -\tfrac{1}{2} X \rangle_{HS} + k(X) - F(\theta_n, \theta_S)$
with $k(X) = 0$ and
$(\theta_n, \theta_S) = \big( \tfrac{n-d-1}{2}, S^{-1} \big), \quad t(X) = \big( \log|X|, -\tfrac{1}{2} X \big),$
$F(\theta_n, \theta_S) = \big( \theta_n + \tfrac{d+1}{2} \big)\big( d \log 2 - \log|\theta_S| \big) + \log \Gamma_d\big( \theta_n + \tfrac{d+1}{2} \big)$

MLE for the Wishart Distribution

In the case of the Wishart distribution, a closed form would be obtained by solving the system $\nabla F(\hat{\theta}) = \frac{1}{N} \sum_{i=1}^N t(x_i)$, i.e.
$d \log 2 - \log|\theta_S| + \Psi_d\big( \theta_n + \tfrac{d+1}{2} \big) = \eta_n$
$-\big( \theta_n + \tfrac{d+1}{2} \big) \theta_S^{-1} = \eta_S$      (1)
with $\eta_n$ and $\eta_S$ the expectation parameters and $\Psi_d$ the derivative of $\log \Gamma_d$. Unfortunately, no closed-form solution is known.

Two sub-families of the Wishart Distribution (1)

Case n fixed ($n = 2\theta_n + d + 1$):
$F_n(\theta_S) = \tfrac{nd}{2} \log 2 - \tfrac{n}{2} \log|\theta_S| + \log \Gamma_d\big( \tfrac{n}{2} \big), \quad k_n(X) = \tfrac{n-d-1}{2} \log|X|$
Case S fixed ($S = \theta_S^{-1}$):
$F_S(\theta_n) = \big( \theta_n + \tfrac{d+1}{2} \big) \log|2S| + \log \Gamma_d\big( \theta_n + \tfrac{d+1}{2} \big), \quad k_S(X) = -\tfrac{1}{2} \operatorname{tr}(S^{-1} X)$

Two sub-families of the Wishart Distribution (2)

Both are exponential families and their MLE equations are solvable!
Case n fixed:
$-\tfrac{n}{2} \hat{\theta}_S^{-1} = \frac{1}{N} \sum_{i=1}^N -\tfrac{1}{2} X_i \;\Longrightarrow\; \hat{\theta}_S = N n \Big( \sum_{i=1}^N X_i \Big)^{-1}$      (2)
Case S fixed:
$\hat{\theta}_n = \Psi_d^{-1}\Big( \frac{1}{N} \sum_{i=1}^N \log|X_i| - \log|2S| \Big) - \tfrac{d+1}{2}, \quad \hat{\theta}_n > 0$      (3)
with $\Psi_d^{-1}$ the functional reciprocal of $\Psi_d$.

An iterative estimator for the Wishart Distribution

Algorithm 1: An estimator for the parameters of the Wishart distribution
Input: a sample $X_1, X_2, \ldots, X_N$ of $S_d^{++}$
Output: final values of $\hat{\theta}_n$ and $\hat{\theta}_S$
Initialize $\hat{\theta}_n$ with some value > 0;
repeat
    update $\hat{\theta}_S$ using Eq. (2) with $n = 2\hat{\theta}_n + d + 1$;
    update $\hat{\theta}_n$ using Eq. (3) with S the inverse matrix of $\hat{\theta}_S$;
until convergence of the likelihood;
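Below is a minimal numerical sketch of Algorithm 1, assuming NumPy/SciPy: $\Psi_d$ is the sum of digamma terms obtained by differentiating $\log \Gamma_d$, and its functional reciprocal is computed by root finding. The function names (psi_d, wishart_mle) and the fixed iteration count (instead of monitoring the likelihood) are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def psi_d(x, d):
    """Psi_d(x) = d/dx log Gamma_d(x) = sum_{j=1}^d digamma(x - (j-1)/2)."""
    return sum(digamma(x - (j - 1) / 2.0) for j in range(1, d + 1))

def psi_d_inv(y, d):
    """Functional reciprocal of Psi_d, by root finding on x > (d-1)/2."""
    return brentq(lambda x: psi_d(x, d) - y, (d - 1) / 2.0 + 1e-8, 1e8)

def wishart_mle(X_list, n_iter=100, theta_n=1.0):
    """Algorithm 1: alternate the two closed-form sub-family updates (2) and (3)."""
    d = X_list[0].shape[0]
    N = len(X_list)
    sum_X = sum(X_list)
    mean_logdet = np.mean([np.linalg.slogdet(X)[1] for X in X_list])
    for _ in range(n_iter):
        n = 2.0 * theta_n + d + 1.0                        # current degrees of freedom
        theta_S = N * n * np.linalg.inv(sum_X)             # Eq. (2): case n fixed
        S = np.linalg.inv(theta_S)
        log_det_2S = np.linalg.slogdet(2.0 * S)[1]
        theta_n = psi_d_inv(mean_logdet - log_det_2S, d) - (d + 1) / 2.0  # Eq. (3): case S fixed
        theta_n = max(theta_n, 1e-8)                       # keep theta_n > 0
    return theta_n, theta_S
```

The recovered natural parameters map back to the source parameters through $n = 2\hat{\theta}_n + d + 1$ and $S = \hat{\theta}_S^{-1}$.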
Questions and open problems

From a sample of Wishart matrices, the distribution parameters are recovered in a few iterations.
Major question: is this an MLE? Probably...
Minor question: what about sample size N = 1? The system is under-determined; regularize by sampling around $X_1$.

Mixture Models (MM)

An additive (finite) mixture is a flexible tool to model a more complex distribution m:
$m(x) = \sum_{j=1}^{k} w_j p_j(x), \quad 0 \le w_j \le 1, \quad \sum_{j=1}^{k} w_j = 1$
where the $p_j$ are the component distributions of the mixture and the $w_j$ the mixing proportions.
In our case, we consider each $p_j$ as a member of some parametric family (EF):
$m(x; \Psi) = \sum_{j=1}^{k} w_j p_{F_j}(x; \theta_j)$ with $\Psi = (w_1, w_2, \ldots, w_{k-1}, \theta_1, \theta_2, \ldots, \theta_k)$
Expectation-Maximization is not fast enough [5]...

Original k-MLE (primal form) in one slide

Algorithm 2: k-MLE
Input: a sample $\chi = \{x_1, x_2, \ldots, x_N\}$; Bregman generators $F_1, F_2, \ldots, F_k$
Output: estimate $\hat{\Psi}$ of the mixture parameters
A good initialization for $\Psi$ (see later);
repeat
    repeat
        foreach $x_i \in \chi$ do $z_i = \arg\max_j \log\big( \hat{w}_j p_{F_j}(x_i; \hat{\theta}_j) \big)$;
        foreach $C_j := \{ x_i \in \chi \mid z_i = j \}$ do $\hat{\theta}_j = \mathrm{MLE}_{F_j}(C_j)$;
    until convergence of the complete likelihood;
    update the mixing proportions: $\hat{w}_j = |C_j| / N$;
until further convergence of the complete likelihood;

k-MLE's properties

Another formulation comes from the connection between EFs and Bregman divergences [3]:
$\log p_F(x; \theta) = -B_{F^*}(t(x) : \eta) + F^*(t(x)) + k(x)$
where $B_F(\cdot : \cdot)$ is the Bregman divergence associated with a strictly convex and differentiable function F:
$B_F(p : q) = F(p) - F(q) - \langle p - q, \nabla F(q) \rangle$

Original k-MLE (dual form) in one slide

Algorithm 3: k-MLE
Input: a sample $\chi = \{ y_1 = t(x_1), y_2 = t(x_2), \ldots, y_N = t(x_N) \}$; Bregman generators $F^*_1, F^*_2, \ldots, F^*_k$
Output: $\hat{\Psi} = (\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_{k-1}, \hat{\theta}_1 = \nabla F^*(\hat{\eta}_1), \ldots, \hat{\theta}_k = \nabla F^*(\hat{\eta}_k))$
A good initialization for $\Psi$ (see later);
repeat
    repeat
        foreach $x_i \in \chi$ do $z_i = \arg\min_j \big( B_{F^*_j}(y_i : \hat{\eta}_j) - \log \hat{w}_j \big)$;
        foreach $C_j := \{ x_i \in \chi \mid z_i = j \}$ do $\hat{\eta}_j = \sum_{x_i \in C_j} y_i / |C_j|$;
    until convergence of the complete likelihood;
    update the mixing proportions: $\hat{w}_j = |C_j| / N$;
until further convergence of the complete likelihood;
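To make the structure of Algorithms 2 and 3 concrete, here is a compact generic sketch of the primal k-MLE loop (Lloyd-style assignments alternated with per-cluster MLEs). The function names, the callable interface and the fixed iteration counts are illustrative assumptions, not the authors' pyMEF-based implementation; for the Wishart case the callables would be the Wishart log-density and the iterative estimator sketched earlier.

```python
import numpy as np

def k_mle(data, k, log_pdf, mle, init_labels, outer_iters=10, inner_iters=20):
    """Generic sketch of primal k-MLE (Algorithm 2).

    data        : list of observations x_i
    log_pdf     : callable(x, theta) -> log p_F(x; theta)
    mle         : callable(list of x) -> theta maximizing the component likelihood
    init_labels : initial cluster labels z_i in {0, ..., k-1}
    Note: empty clusters are not handled here (see the Hartigan and Wang variant below).
    """
    z = np.asarray(init_labels)
    w = np.full(k, 1.0 / k)
    theta = [mle([x for x, zi in zip(data, z) if zi == j]) for j in range(k)]
    for _ in range(outer_iters):              # "until further convergence" (fixed count here)
        for _ in range(inner_iters):
            # assignment step: z_i = argmax_j log(w_j p_{F_j}(x_i; theta_j))
            z = np.asarray([int(np.argmax([np.log(w[j]) + log_pdf(x, theta[j])
                                           for j in range(k)]))
                            for x in data])
            # per-cluster maximum likelihood estimation
            theta = [mle([x for x, zi in zip(data, z) if zi == j]) for j in range(k)]
        # update the mixing proportions
        w = np.asarray([(z == j).sum() / len(data) for j in range(k)])
    return w, theta, z
```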
k-MLE for Wishart distributions

Practical considerations impose modifications of the algorithm. During the assignment step, empty clusters may appear (high-dimensional data makes this worse). A possible solution is to use Hartigan and Wang's strategy [6] instead of Lloyd's strategy:
Optimally transfer one observation at a time.
Update the parameters of the clusters involved.
Stop when no transfer is possible.
This should guarantee non-empty clusters [7], but it does not work when considering weighted clusters... so we get back to an "old school" criterion: $|C_{z_i}| > 1$. Experimentally, this strategy performs better in high dimension than Lloyd's strategy.

k-MLE - Hartigan and Wang

Criterion for a potential transfer (max form):
$\frac{ \log\big( \hat{w}_{z_i} p_{F_{z_i}}(x_i; \hat{\theta}_{z_i}) \big) }{ \log\big( \hat{w}_{z_i^*} p_{F_{z_i^*}}(x_i; \hat{\theta}_{z_i^*}) \big) } < 1$, with $z_i^* = \arg\max_j \log \hat{w}_j p_{F_j}(x_i; \hat{\theta}_j)$
Update rules: $\hat{\theta}_{z_i} = \mathrm{MLE}_{F_{z_i}}(C_{z_i} \setminus \{x_i\})$, $\quad \hat{\theta}_{z_i^*} = \mathrm{MLE}_{F_{z_i^*}}(C_{z_i^*} \cup \{x_i\})$
Or, equivalently, the criterion for a potential transfer (min form):
$\frac{ B_{F^*}(y_i : \eta_{z_i^*}) - \log w_{z_i^*} }{ B_{F^*}(y_i : \eta_{z_i}) - \log w_{z_i} } < 1$, with $z_i^* = \arg\min_j \big( B_{F^*}(y_i : \eta_j) - \log w_j \big)$
Update rules: $\eta_{z_i} = \frac{ |C_{z_i}| \eta_{z_i} - y_i }{ |C_{z_i}| - 1 }$, $\quad \eta_{z_i^*} = \frac{ |C_{z_i^*}| \eta_{z_i^*} + y_i }{ |C_{z_i^*}| + 1 }$

Towards a good initialization...

Classical initialization methods: random centers, random partition, furthest point (a 2-approximation), ...
A better approach is k-means++ [8]: sample proportionally to the squared distance to the nearest center.
It is a fast and greedy approximation: $\Theta(kN)$.
Probabilistic guarantee of a good initialization: $\mathrm{OPT}_F \le \text{k-means}_F \le O(\log k)\,\mathrm{OPT}_F$.
The dual Bregman divergence $B_{F^*}$ may replace the squared distance.

Heuristic to avoid fixing k

k-means requires fixing k, the number of clusters. We propose on-the-fly cluster creation together with k-MLE++ (inspired by DP-k-means [9]): create a cluster when some observations contribute too much to the loss function with the already selected centers. It may overestimate the number of clusters...

Initialization with DP-k-MLE++

Algorithm 4: DP-k-MLE++
Input: a sample $y_1 = t(X_1), \ldots, y_N = t(X_N)$; $F$; $\lambda > 0$
Output: C, a subset of $\{y_1, \ldots, y_N\}$, and k, the number of clusters
Choose the first seed $C = \{y_j\}$, for j uniformly random in $\{1, 2, \ldots, N\}$;
repeat
    foreach $y_i$, compute $p_i = B_{F^*}(y_i : C) / \sum_{i'=1}^{N} B_{F^*}(y_{i'} : C)$, where $B_{F^*}(y_i : C) = \min_{c \in C} B_{F^*}(y_i : c)$;
    if $\exists\, p_i > \lambda$ then
        choose the next seed s among $y_1, y_2, \ldots, y_N$ with probability $p_i$;
        add the selected seed to C: $C = C \cup \{s\}$;
until all $p_i \le \lambda$;
$k = |C|$;
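A small self-contained sketch of this seeding procedure; the squared Euclidean distance below is only a stand-in for the dual Bregman divergence $B_{F^*}$, and the function name is illustrative, not the authors' code.

```python
import numpy as np

def dp_k_mle_pp_seeding(Y, lam, divergence=None, rng=None):
    """DP-k-MLE++-style seeding (Algorithm 4): add seeds k-means++-style until no
    observation contributes more than a fraction `lam` of the total loss."""
    rng = np.random.default_rng() if rng is None else rng
    if divergence is None:
        divergence = lambda a, b: float(np.sum((a - b) ** 2))  # stand-in for B_{F*}
    Y = [np.asarray(y, dtype=float) for y in Y]
    C = [Y[rng.integers(len(Y))]]                              # first seed chosen uniformly
    while True:
        # loss contribution of each point: divergence to its nearest selected seed
        d = np.array([min(divergence(y, c) for c in C) for y in Y])
        if d.sum() == 0.0:
            break
        p = d / d.sum()
        if np.all(p <= lam):                                   # no point contributes too much
            break
        C.append(Y[rng.choice(len(Y), p=p)])                   # sample next seed with prob. p_i
    return C, len(C)
```

The returned seeds and their number k then initialize the k-MLE loop above; smaller values of λ create more clusters.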
Motion capture

Real dataset: motion capture of contemporary dancers (15 sensors in 3D).

Application to motion retrieval (1)

Motion capture data can be viewed as matrices $X_i$ with different row sizes but the same column size d. The idea is to describe each $X_i$ through the parameters $\hat{\Psi}_i$ of one mixture model.
Remark: the size of each sub-motion is known (and so is its $\theta_n$).
The mixture parameters can be viewed as a sparse representation of the local dynamics in $X_i$.

Application to motion retrieval (2)

Comparing two movements amounts to computing a dissimilarity measure between $\hat{\Psi}_i$ and $\hat{\Psi}_j$.
Remark 1: with DP-k-MLE++, the two mixtures will probably not have the same number of components.
Remark 2: when both mixtures have a single component, a natural choice is
$\mathrm{KL}\big( W_d(\cdot; \hat{\theta}) \,\|\, W_d(\cdot; \hat{\theta}') \big) = B_{F^*}(\hat{\eta} : \hat{\eta}') = B_F(\hat{\theta}' : \hat{\theta})$
and a closed form is always available! No closed form exists for the KL divergence between general mixtures.

Application to motion retrieval (3)

A possible solution is to use the Cauchy-Schwarz (CS) divergence [10] (see the sketch after the references):
$\mathrm{CS}(m : m') = -\log \frac{ \int m(x) m'(x)\, dx }{ \sqrt{ \int m(x)^2 dx } \, \sqrt{ \int m'(x)^2 dx } }$
It has an analytic formula since
$\int m(x) m'(x)\, dx = \sum_{j=1}^{k} \sum_{j'=1}^{k'} w_j w'_{j'} \exp\Big( F(\theta_j + \theta'_{j'}) - \big( F(\theta_j) + F(\theta'_{j'}) \big) \Big)$
Note that this expression is well defined since the natural parameter space $\Theta = \mathbb{R}^*_+ \times S_p^{++}$ is a convex cone.

Implementation

Early specific code in MATLAB. The current implementation is in Python (based on pyMEF [2]). Ongoing proof of concept (with F. Herranz and A. Beurivé).

Conclusions - Future works

Still some mathematical work to be done: solve the MLE equations to get $\nabla F^* = (\nabla F)^{-1}$ and then $F^*$; characterize our estimator for the full Wishart distribution.
Complete and validate the prototype system for motion retrieval.
Speed up the algorithm: computational/numerical/algorithmic tricks; a library for learning with Bregman divergences?
Possible extensions: reintroduce the mean vector in the model (Gaussian-Wishart); online k-means -> online k-MLE; ...

References

[1] Nielsen, F.: k-MLE: A fast algorithm for learning statistical mixture models. In: International Conference on Acoustics, Speech and Signal Processing (2012) pp. 869–872.
[2] Schwander, O., Nielsen, F.: pyMEF - A framework for exponential families in Python. In: Proceedings of the 2011 IEEE Workshop on Statistical Signal Processing.
[3] Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. Journal of Machine Learning Research 6 (2005) 1705–1749.
[4] Nielsen, F., Garcia, V.: Statistical exponential families: A digest with flash cards. http://arxiv.org/abs/0911.4863 (November 2009).
[5] Hidot, S., Saint-Jean, C.: An Expectation-Maximization algorithm for the Wishart mixture model: Application to movement clustering. Pattern Recognition Letters 31(14) (2010) 2318–2324.
[6] Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1) (1979) 100–108.
[7] Telgarsky, M., Vattani, A.: Hartigan's method: k-means clustering without Voronoi. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS) (2010) pp. 820–827.
[8] Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007) pp. 1027–1035.
[9] Kulis, B., Jordan, M.I.: Revisiting k-means: New algorithms via Bayesian nonparametrics. In: International Conference on Machine Learning (ICML) (2012).
[10] Nielsen, F.: Closed-form information-theoretic divergences for statistical mixtures. In: International Conference on Pattern Recognition (ICPR) (2012) pp. 1723–1726.
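Referring back to the Cauchy-Schwarz divergence used in "Application to motion retrieval (3)", here is a generic sketch of its closed form for two mixtures of the same exponential family, assuming a zero carrier measure (k(x) = 0, as for the Wishart parametrization above) and a caller-supplied log-normalizer F; the function names are illustrative, not the authors' code.

```python
import numpy as np

def mixture_product_integral(w_a, thetas_a, w_b, thetas_b, F):
    """int m_a(x) m_b(x) dx for EF mixtures with carrier k(x) = 0, via
    sum_j sum_l w_a[j] w_b[l] exp(F(theta_j + theta_l) - F(theta_j) - F(theta_l)).
    The natural parameters must support elementwise '+', e.g. numpy arrays."""
    return sum(w_a[j] * w_b[l] * np.exp(F(thetas_a[j] + thetas_b[l])
                                        - F(thetas_a[j]) - F(thetas_b[l]))
               for j in range(len(w_a)) for l in range(len(w_b)))

def cs_divergence(w, thetas, w_p, thetas_p, F):
    """Closed-form Cauchy-Schwarz divergence between two EF mixtures."""
    mm = mixture_product_integral(w, thetas, w_p, thetas_p, F)
    m2 = mixture_product_integral(w, thetas, w, thetas, F)
    mp2 = mixture_product_integral(w_p, thetas_p, w_p, thetas_p, F)
    return float(-np.log(mm / np.sqrt(m2 * mp2)))
```

As noted above, the formula is only valid when the sums of natural parameters stay inside the natural parameter space, which holds here because that space is a convex cone.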