Variational Bayesian Approximation method for Classification and Clustering with a mixture of Student-t model

28/10/2015
Publication GSI2015
DOI: 10.23723/11784/14270
OAI: oai:www.see.asso.fr:11784:14270

Abstract

Clustering, classification and pattern recognition in a set of data are among the most important tasks in statistical research and in many applications. In this paper, we propose to use a mixture of Student-t distributions to model the data, via a hierarchical graphical model and the Bayesian framework, to perform these tasks. The main advantages of this model are that it accounts for the uncertainties of the variances and covariances, and that Variational Bayesian Approximation (VBA) methods can be used to obtain fast algorithms able to handle large data sets.


License

Creative Commons Attribution-ShareAlike 4.0 International


Variational Bayesian Approximation method for Classification and Clustering with a mixture of Student-t model

Ali Mohammad-Djafari
Laboratoire des Signaux et Systèmes (L2S), UMR8506 CNRS-CentraleSupélec-Univ Paris-Sud
SUPELEC, 91192 Gif-sur-Yvette, France
http://lss.centralesupelec.fr
Email: djafari@lss.supelec.fr
http://djafari.free.fr
http://publicationslist.org/djafari

GSI2015, October 28-30, 2015, Polytechnique, France

Contents

1. Mixture models
2. Different problems related to classification and clustering
   - Training
   - Supervised classification
   - Semi-supervised classification
   - Clustering or unsupervised classification
3. Mixture of Student-t
4. Variational Bayesian Approximation
5. VBA for Mixture of Student-t
6. Conclusion

Mixture models

General mixture model:

  p(x | a, \Theta, K) = \sum_{k=1}^{K} a_k \, p_k(x | \theta_k), \quad 0 < a_k < 1, \quad \sum_{k=1}^{K} a_k = 1.

Same family: p_k(x | \theta_k) = p(x | \theta_k) for all k. Gaussian case: p(x | \theta_k) = \mathcal{N}(x | \mu_k, \Sigma_k) with \theta_k = (\mu_k, \Sigma_k).

Data X = \{x_n, n = 1, ..., N\}, where each element x_n belongs to one of the classes c_n. With a_k = p(c_n = k), a = \{a_k, k = 1, ..., K\} and \Theta = \{\theta_k, k = 1, ..., K\}:

  p(X, c | a, \Theta) = \prod_{n=1}^{N} p(x_n, c_n | a, \Theta).

Different problems

- Training: given a set of (training) data X and classes c, estimate the parameters a and \Theta.
- Supervised classification: given a sample x_m and the parameters K, a and \Theta, determine its class

  k^* = \arg\max_k \, p(c_m = k | x_m, a, \Theta, K).

- Semi-supervised classification (proportions a are not known): given a sample x_m and the parameters K and \Theta, determine its class

  k^* = \arg\max_k \, p(c_m = k | x_m, \Theta, K).

- Clustering or unsupervised classification (number of classes K is not known): given a set of data X, determine K and c.

Training

Given a set of (training) data X and classes c, estimate the parameters a and \Theta.

Maximum Likelihood (ML):

  (\hat{a}, \hat{\Theta}) = \arg\max_{(a, \Theta)} \, p(X, c | a, \Theta, K).

Bayesian: assign priors p(a | K) and p(\Theta | K) = \prod_{k=1}^{K} p(\theta_k) and write the expression of the joint posterior law

  p(a, \Theta | X, c, K) = \frac{p(X, c | a, \Theta, K) \, p(a | K) \, p(\Theta | K)}{p(X, c | K)},

where

  p(X, c | K) = \iint p(X, c | a, \Theta, K) \, p(a | K) \, p(\Theta | K) \, da \, d\Theta.

Infer on a and \Theta either as the Maximum A Posteriori (MAP) or the Posterior Mean (PM).

Supervised classification

Given a sample x_m and the parameters K, a and \Theta, determine

  p(c_m = k | x_m, a, \Theta, K) = \frac{p(x_m, c_m = k | a, \Theta, K)}{p(x_m | a, \Theta, K)},

where

  p(x_m, c_m = k | a, \Theta, K) = a_k \, p(x_m | \theta_k) \quad \text{and} \quad p(x_m | a, \Theta, K) = \sum_{k=1}^{K} a_k \, p(x_m | \theta_k).

Best class k^*:

  k^* = \arg\max_k \, p(c_m = k | x_m, a, \Theta, K).

Semi-supervised classification

Given a sample x_m and the parameters K and \Theta (not the proportions a), determine the probabilities

  p(c_m = k | x_m, \Theta, K) = \frac{p(x_m, c_m = k | \Theta, K)}{p(x_m | \Theta, K)},

where

  p(x_m, c_m = k | \Theta, K) = \int p(x_m, c_m = k | a, \Theta, K) \, p(a | K) \, da

and

  p(x_m | \Theta, K) = \sum_{k=1}^{K} p(x_m, c_m = k | \Theta, K).

Best class k^*, for example the MAP solution:

  k^* = \arg\max_k \, p(c_m = k | x_m, \Theta, K).
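As a quick illustration of the supervised rule above, here is a minimal sketch (not from the paper) that evaluates p(c_m = k | x_m, a, \Theta, K) \propto a_k \mathcal{N}(x_m | \mu_k, \Sigma_k) for a Gaussian mixture and returns the MAP class; the function name `classify` and the two-component parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

def classify(x, a, mus, Sigmas):
    """Posterior class probabilities p(c_m = k | x_m, a, Theta, K) and MAP class k*."""
    # Joint probabilities a_k * N(x | mu_k, Sigma_k), one per class
    joint = np.array([a_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
                      for a_k, mu_k, S_k in zip(a, mus, Sigmas)])
    post = joint / joint.sum()            # normalize by p(x_m | a, Theta, K)
    return post, int(np.argmax(post))     # k* = arg max_k p(c_m = k | x_m, ...)

# Illustrative two-class Gaussian mixture (made-up parameters, not the paper's)
a = [0.3, 0.7]
mus = [np.zeros(2), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 2.0 * np.eye(2)]
post, k_star = classify(np.array([2.5, 2.8]), a, mus, Sigmas)
print(post, k_star)
```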
Clustering or unsupervised classification

Given a set of data X, determine K and c. Determination of the number of classes:

  p(K = L | X) = \frac{p(X, K = L)}{p(X)} = \frac{p(X | K = L) \, p(K = L)}{p(X)}

and

  p(X) = \sum_{L=1}^{L_0} p(K = L) \, p(X | K = L),

where L_0 is the a priori maximum number of classes and

  p(X | K = L) = \iint \prod_n \sum_{k=1}^{L} a_k \, p(x_n | \theta_k) \, p(a | K) \, p(\Theta | K) \, da \, d\Theta.

When K and c are determined, we can also determine the characteristics a and \Theta of those classes.

Mixture of Student-t model

Student-t and its Infinite Gaussian Scaled Model (IGSM):

  \mathcal{T}(x | \nu, \mu, \Sigma) = \int_0^\infty \mathcal{N}(x | \mu, z^{-1} \Sigma) \, \mathcal{G}(z | \tfrac{\nu}{2}, \tfrac{\nu}{2}) \, dz,

where

  \mathcal{N}(x | \mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right] = |2\pi\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2} \mathrm{Tr}\left((x - \mu)(x - \mu)^T \Sigma^{-1}\right)\right]

and

  \mathcal{G}(z | \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} z^{\alpha - 1} \exp[-\beta z].

(A sampling sketch for this representation is given at the end of this part.) Mixture of Student-t:

  p(x_n | \{\nu_k, a_k, \mu_k, \Sigma_k, k = 1, ..., K\}, K) = \sum_{k=1}^{K} a_k \, \mathcal{T}(x_n | \nu_k, \mu_k, \Sigma_k).

Introducing z_{nk}, with z_k = \{z_{nk}, n = 1, ..., N\}, Z = \{z_{nk}\}, c = \{c_n, n = 1, ..., N\}, \theta_k = \{\nu_k, a_k, \mu_k, \Sigma_k\} and \Theta = \{\theta_k, k = 1, ..., K\}, and assigning the priors p(\Theta) = \prod_k p(\theta_k), we can write:

  p(X, c, Z, \Theta | K) = \prod_n \prod_k \left[ a_k \, \mathcal{N}(x_n | \mu_k, z_{nk}^{-1} \Sigma_k) \, \mathcal{G}(z_{nk} | \tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}) \right] \prod_k p(\theta_k).

Joint posterior law:

  p(c, Z, \Theta | X, K) = \frac{p(X, c, Z, \Theta | K)}{p(X | K)}.

The main task now is to propose approximations to it that can be used easily in all the above-mentioned classification and clustering tasks.

Variational Bayesian Approximation (VBA)

Main idea: propose a computationally easy approximation q(c, Z, \Theta) for p(c, Z, \Theta | X, K). Criterion: KL(q : p). Interestingly, noting that p(c, Z, \Theta | X, K) = p(X, c, Z, \Theta | K) / p(X | K), we have

  KL(q : p) = -F(q) + \ln p(X | K),

where

  F(q) = \left\langle \ln \frac{p(X, c, Z, \Theta | K)}{q(c, Z, \Theta)} \right\rangle_q

is called the free energy of q, and we have the following properties:
- Maximizing F(q) and minimizing KL(q : p) are equivalent, and F(q) is a lower bound on the log-evidence \ln p(X | K) of the model.
- When the optimum q^* is obtained, F(q^*) can be used as a criterion for model selection.

VBA: choosing the good families

Using KL(q : p) has the very interesting property that the means computed with q are the same as those obtained with p (conservation of the means). Unfortunately, this is not the case for variances or other moments. If p is in the exponential family, then, choosing appropriate conjugate priors, the structure of q will be the same, and we can obtain appropriately fast optimization algorithms.

Hierarchical graphical model

[Figure: graphical representation of the model. The hyperparameters \xi_0, (\gamma_0, \Sigma_0), (\mu_0, \eta_0) and k_0 generate the parameters \alpha_k, \beta_k, \Sigma_k, \mu_k and a, which together with the latent variables z_{nk} generate the data x_n.]
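Returning to the IGSM representation above: drawing z ~ \mathcal{G}(z | \nu/2, \nu/2) and then x ~ \mathcal{N}(\mu, z^{-1}\Sigma) yields a multivariate Student-t sample. The sketch below is a minimal illustration with made-up values for \nu, \mu and \Sigma (not from the paper).

```python
import numpy as np

rng = np.random.default_rng(0)
nu = 4.0                                   # degrees of freedom nu_k (illustrative)
mu = np.zeros(2)                           # mean mu_k
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])             # scale matrix Sigma_k

def sample_student_t(n):
    # z ~ G(z | nu/2, nu/2): numpy's gamma takes shape and scale = 1/rate
    z = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
    # x | z ~ N(mu, z^{-1} Sigma): scale a N(0, Sigma) draw by 1/sqrt(z)
    eps = rng.multivariate_normal(np.zeros(2), Sigma, size=n)
    return mu + eps / np.sqrt(z)[:, None]

X = sample_student_t(100_000)
# For nu > 2 the empirical covariance should approach nu/(nu-2) * Sigma
print(np.cov(X.T))
```

Small values of z inflate the covariance z^{-1}\Sigma, which is exactly where the heavy tails of the Student-t come from.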
VBA for mixture of Student-t

In our case, noting that

  p(X, c, Z, \Theta | K) = \prod_n \prod_k p(x_n, c_n, z_{nk} | a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k) \, \prod_k \left[ p(\alpha_k) \, p(\beta_k) \, p(\mu_k | \Sigma_k) \, p(\Sigma_k) \right],

with

  p(x_n, c_n = k, z_{nk} | a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k) = a_k \, \mathcal{N}(x_n | \mu_k, z_{nk}^{-1} \Sigma_k) \, \mathcal{G}(z_{nk} | \alpha_k, \beta_k),

is separable, on one side in [c, Z] and on the other side in the components of \Theta, we propose to use

  q(c, Z, \Theta) = q(c, Z) \, q(\Theta).

With this decomposition, the expression of the Kullback-Leibler divergence becomes

  KL(q_1(c, Z) q_2(\Theta) : p(c, Z, \Theta | X, K)) = \sum_c \iint q_1(c, Z) q_2(\Theta) \ln \frac{q_1(c, Z) q_2(\Theta)}{p(c, Z, \Theta | X, K)} \, dZ \, d\Theta,

and the expression of the free energy becomes

  F(q_1(c, Z) q_2(\Theta)) = \sum_c \iint q_1(c, Z) q_2(\Theta) \ln \frac{p(X, c, Z | \Theta, K) \, p(\Theta | K)}{q_1(c, Z) q_2(\Theta)} \, dZ \, d\Theta.

Proposed VBA prior model for the mixture of Student-t

Using a generalized Student-t, obtained by replacing \mathcal{G}(z_{nk} | \tfrac{\nu_k}{2}, \tfrac{\nu_k}{2}) by \mathcal{G}(z_{nk} | \alpha_k, \beta_k), it is easier to propose conjugate priors for \alpha_k, \beta_k than for \nu_k:

  p(x_n, c_n = k, z_{nk} | a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k, K) = a_k \, \mathcal{N}(x_n | \mu_k, z_{nk}^{-1} \Sigma_k) \, \mathcal{G}(z_{nk} | \alpha_k, \beta_k).

In the following, noting \Theta = \{(a_k, \mu_k, \Sigma_k, \alpha_k, \beta_k), k = 1, ..., K\}, we propose to use the factorized prior laws

  p(\Theta) = p(a) \prod_k \left[ p(\alpha_k) \, p(\beta_k) \, p(\mu_k | \Sigma_k) \, p(\Sigma_k) \right]

with the following components:

  p(a) = \mathcal{D}(a | k_0), \quad k_0 = [k_0, ..., k_0] = k_0 \mathbf{1}
  p(\alpha_k) = \mathcal{E}(\alpha_k | \zeta_0) = \mathcal{G}(\alpha_k | 1, \zeta_0)
  p(\beta_k) = \mathcal{E}(\beta_k | \zeta_0) = \mathcal{G}(\beta_k | 1, \zeta_0)
  p(\mu_k | \Sigma_k) = \mathcal{N}(\mu_k | \mu_0 \mathbf{1}, \eta_0^{-1} \Sigma_k)
  p(\Sigma_k) = \mathcal{IW}(\Sigma_k | \gamma_0, \gamma_0 \Sigma_0),

where

  \mathcal{D}(a | k) = \frac{\Gamma(\sum_l k_l)}{\prod_l \Gamma(k_l)} \prod_l a_l^{k_l - 1}

is the Dirichlet pdf,

  \mathcal{E}(t | \zeta_0) = \zeta_0 \exp[-\zeta_0 t]

is the Exponential pdf,

  \mathcal{G}(t | a, b) = \frac{b^a}{\Gamma(a)} t^{a-1} \exp[-b t]

is the Gamma pdf, and

  \mathcal{IW}(\Sigma | \gamma, \gamma\Delta) = \frac{\left|\tfrac{1}{2}\Delta\right|^{\gamma/2} \exp\left[-\tfrac{1}{2} \mathrm{Tr}\left(\Delta \Sigma^{-1}\right)\right]}{\Gamma_D(\gamma/2) \, |\Sigma|^{(\gamma + D + 1)/2}}

is the Inverse Wishart pdf. With these prior laws and the likelihood, the joint posterior law is

  p(c, Z, \Theta | X, K) = \frac{p(X, c, Z, \Theta | K)}{p(X | K)}.

Expressions of q

  q(c, Z, \Theta) = q(c, Z) \, q(\Theta) = \prod_n \prod_k \left[ q(c_n = k | z_{nk}) \, q(z_{nk}) \right] \, \prod_k \left[ q(\alpha_k) \, q(\beta_k) \, q(\mu_k | \Sigma_k) \, q(\Sigma_k) \right] \, q(a),

with:

  q(a) = \mathcal{D}(a | \tilde{k}), \quad \tilde{k} = [\tilde{k}_1, ..., \tilde{k}_K]
  q(\alpha_k) = \mathcal{G}(\alpha_k | \tilde{\zeta}_k, \tilde{\eta}_k)
  q(\beta_k) = \mathcal{G}(\beta_k | \tilde{\zeta}_k, \tilde{\eta}_k)
  q(\mu_k | \Sigma_k) = \mathcal{N}(\mu_k | \tilde{\mu}_k, \tilde{\eta}_k^{-1} \Sigma_k)
  q(\Sigma_k) = \mathcal{IW}(\Sigma_k | \tilde{\gamma}_k, \tilde{\gamma}_k \tilde{\Sigma}_k).

With these choices, we have

  F(q(c, Z, \Theta)) = \langle \ln p(X, c, Z, \Theta | K) \rangle_{q(c, Z, \Theta)} = \sum_k \sum_n F_{1kn} + \sum_k F_{2k},
  F_{1kn} = \langle \ln p(x_n, c_n, z_{nk}, \theta_k) \rangle_{q(c_n = k | z_{nk}) q(z_{nk})},
  F_{2k} = \langle \ln p(x_n, c_n, z_{nk}, \theta_k) \rangle_{q(\theta_k)}.

VBA algorithm steps

The updating expressions of the tilded parameters are obtained by alternating three steps:

- E step: optimizing F with respect to q(c, Z) while keeping q(\Theta) fixed, we obtain the expressions of q(c_n = k | z_{nk}) = \tilde{a}_k and q(z_{nk}) = \mathcal{G}(z_{nk} | \tilde{\alpha}_k, \tilde{\beta}_k).
- M step: optimizing F with respect to q(\Theta) while keeping q(c, Z) fixed, we obtain the expressions of q(a) = \mathcal{D}(a | \tilde{k}), q(\alpha_k) = \mathcal{G}(\alpha_k | \tilde{\zeta}_k, \tilde{\eta}_k), q(\beta_k) = \mathcal{G}(\beta_k | \tilde{\zeta}_k, \tilde{\eta}_k), q(\mu_k | \Sigma_k) = \mathcal{N}(\mu_k | \tilde{\mu}_k, \tilde{\eta}_k^{-1} \Sigma_k) and q(\Sigma_k) = \mathcal{IW}(\Sigma_k | \tilde{\gamma}_k, \tilde{\gamma}_k \tilde{\Sigma}_k), which gives the updating algorithm for the corresponding tilded parameters.
- F evaluation: after each E step and M step, we can also evaluate the expression of F(q), which can be used as a stopping rule for the iterative algorithm. The final value of F(q) for each value of K, noted F_K, can be used as a criterion for model selection, i.e., the determination of the number of clusters.
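The slides give the shape of the iteration but not the closed-form tilded updates. As a structural stand-in only (this is the classical EM loop for a plain Gaussian mixture, not the paper's VBA updates), the sketch below shows the same alternation of an E step, an M step and the evaluation of a criterion used as a stopping rule, with the log-likelihood playing the role of F(q); the function name `em_gmm` and all numerical settings are assumptions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture: same E step / M step / criterion loop as VBA."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    a = np.full(K, 1.0 / K)                   # proportions a_k
    mu = X[rng.choice(N, K, replace=False)]   # means mu_k, initialized at data points
    Sig = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    F_old = -np.inf
    for _ in range(n_iter):
        # "E step": responsibilities r_nk proportional to a_k N(x_n | mu_k, Sigma_k)
        joint = np.column_stack([a[k] * multivariate_normal.pdf(X, mu[k], Sig[k])
                                 for k in range(K)])
        F = np.log(joint.sum(axis=1)).sum()   # criterion, used as F(q) is used
        r = joint / joint.sum(axis=1, keepdims=True)
        # "M step": re-estimate (a_k, mu_k, Sigma_k) from the responsibilities
        Nk = r.sum(axis=0)
        a = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sig[k] = (r[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
        if F - F_old < tol:                   # stopping rule on the criterion
            break
        F_old = F
    return a, mu, Sig, F

# Illustrative usage on synthetic two-cluster data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(4.0, 1.0, (200, 2))])
a_hat, mu_hat, Sig_hat, F_final = em_gmm(X, K=2)
print(a_hat, F_final)
```

Running such a loop for several values of K and comparing the final criterion values mirrors the F_K-based model-selection step; for plain EM the raw likelihood always grows with K, so a penalized criterion (e.g. BIC) would stand in for F_K.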
Conclusions

- Clustering and classification of a set of data are among the most important tasks in statistical research, with many applications such as data mining in biology.
- Mixture models, and in particular mixtures of Gaussians, are classical models for these tasks.
- We proposed to use a mixture of generalized Student-t distributions to model the data, via a hierarchical graphical model.
- To obtain fast algorithms and be able to handle large data sets, we used conjugate priors everywhere it was possible.
- The proposed algorithm has been used for clustering, classification and discriminant analysis of some biological data (cancer research related), but in this paper we only presented the main algorithm.