k-Means Clustering with Hölder divergences

07/11/2017
Publication GSI2017
Abstract

We introduced two novel classes of Hölder divergences and Hölder pseudo-divergences that are both invariant to rescaling, and that both encapsulate the Cauchy-Schwarz divergence and the skew Bhattacharyya divergences. We review the elementary concepts of those parametric divergences, and perform a clustering analysis on two synthetic datasets. It is shown experimentally that the symmetrized Hölder divergences consistently and significantly outperform the Cauchy-Schwarz divergence in clustering tasks.

k-Means Clustering with Hölder divergences

Frank Nielsen (1,2), Ke Sun (3), and Stéphane Marchand-Maillet (4)
1 École Polytechnique
2 Sony Computer Science Laboratories Inc.
3 King Abdullah University of Science and Technology (KAUST)
4 University of Geneva

Abstract. We introduced two novel classes of Hölder divergences and Hölder pseudo-divergences that are both invariant to rescaling, and that both encapsulate the Cauchy-Schwarz divergence and the skew Bhattacharyya divergences. We review the elementary concepts of those parametric divergences, and perform a clustering analysis on two synthetic datasets. It is shown experimentally that the symmetrized Hölder divergences consistently and significantly outperform the Cauchy-Schwarz divergence in clustering tasks.

1 Introduction

To build dissimilarity measures between p and q in a common domain, one can use bi-parametric inequalities (Mitrinovic et al., 2013) lhs(p, q) ≤ rhs(p, q), and measure the tightness of the inequality. When lhs(p, q) > 0, a dissimilarity can be constructed from the log-ratio gap:

    D(p : q) = -\log\left(\frac{\mathrm{lhs}(p, q)}{\mathrm{rhs}(p, q)}\right) = \log\left(\frac{\mathrm{rhs}(p, q)}{\mathrm{lhs}(p, q)}\right) \geq 0.    (1)

Notice that this divergence construction allows one to consider the equivalence class of scaled inequalities λ × lhs(p, q) ≤ λ × rhs(p, q), ∀λ > 0. Following this divergence construction principle, we defined Hölder divergences based on Hölder's inequality, and presented the basic properties of this divergence family (Nielsen et al., 2017). In this paper, we further extend the empirical clustering study with respect to Hölder divergences, and show that symmetrized Hölder divergences consistently and significantly outperform the Cauchy-Schwarz divergence (Hasanbelliu et al., 2014). We build Hölder divergences that are invariant under rescaling: such divergences D are called projective divergences and satisfy D(λp : λ′q) = D(p : q), ∀λ, λ′ > 0.

The term "Hölder divergence" was coined previously based on the definition of the Hölder score (Kanamori et al., 2014; Kanamori, 2014): the score-induced Hölder divergence D(p : q) is a proper gap divergence that yields a scale-invariant divergence. A key difference with our work is that the score-induced divergence is not projective and does not include the Cauchy-Schwarz (CS) divergence, while our definition is projective and includes the CS divergence.

This paper is organized as follows: Section 2 reviews the definitions of the Hölder pseudo-divergence and the proper Hölder divergence. Section 3 gives algorithms for clustering based on Hölder divergences, and presents the experimental clustering results. Section 4 concludes this work.

2 Hölder Divergences: Definitions and Properties

Let (X, F, µ) be a measurable space where µ is the Lebesgue measure, and let L^γ(X, µ) denote the space of functions whose γ-th power of absolute value is Lebesgue integrable, for any γ > 0. When γ ≥ 1, this is a Lebesgue space, but we consider the wider scope γ > 0 in this work. Hölder's inequality states that ‖pq‖_1 ≤ ‖p‖_α ‖q‖_β for conjugate exponents α > 0 and β > 0 (satisfying 1/α + 1/β = 1), p ∈ L^α(X, µ) and q ∈ L^β(X, µ).

Let p(x) ∈ L^{ασ}(X, µ) and q(x) ∈ L^{βτ}(X, µ) be positive measures, where σ > 0 and τ > 0 are prescribed parameters. We define (Nielsen et al., 2017) a tri-parametric family of divergences as follows:

Definition 1 (Hölder pseudo-divergence). The Hölder pseudo-divergence (HPD) between p(x) and q(x) is the log-ratio gap:

    D^H_{\alpha,\sigma,\tau}(p : q) := -\log\left(\frac{\int_X p(x)^{\sigma} q(x)^{\tau}\,\mathrm{d}x}{\left(\int_X p(x)^{\alpha\sigma}\,\mathrm{d}x\right)^{1/\alpha}\left(\int_X q(x)^{\beta\tau}\,\mathrm{d}x\right)^{1/\beta}}\right).
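To make Definition 1 concrete, here is a small numerical sketch (not part of the paper) that approximates the HPD when p and q are given as positive values sampled on a common grid; the Riemann-sum approximation and the function name holder_pseudo_divergence are illustrative choices, not the authors' implementation.

```python
import numpy as np

def holder_pseudo_divergence(p, q, alpha, sigma, tau, dx=1.0):
    """Riemann-sum approximation of the Holder pseudo-divergence
    D^H_{alpha,sigma,tau}(p:q) of Definition 1.

    p, q       : positive values of two (possibly unnormalized) densities on a common grid
    alpha      : exponent > 1, with conjugate beta satisfying 1/alpha + 1/beta = 1
    sigma, tau : prescribed positive parameters
    dx         : grid spacing
    """
    beta = alpha / (alpha - 1.0)                        # conjugate exponent
    num = np.sum(p**sigma * q**tau) * dx                # integral of p^sigma q^tau
    den_p = (np.sum(p**(alpha * sigma)) * dx) ** (1.0 / alpha)
    den_q = (np.sum(q**(beta * tau)) * dx) ** (1.0 / beta)
    return -np.log(num / (den_p * den_q))               # >= 0 by Holder's inequality

# Example: two unnormalized Gaussian bumps on [-10, 10]; rescaling either input
# leaves the value unchanged (projectivity).
x = np.linspace(-10.0, 10.0, 2001)
p = np.exp(-0.5 * x**2)
q = 3.0 * np.exp(-0.5 * (x - 1.0)**2)
print(holder_pseudo_divergence(p, q, alpha=2.0, sigma=1.0, tau=1.0, dx=x[1] - x[0]))
```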
The non-negativity follows straightforwardly from Hölder's inequality (1889). However, the symmetry, the triangle inequality, and the law of the indiscernibles (the self-distance equals zero) are not satisfied by HPDs. Therefore these dissimilarity measures are said to belong to the broader class of "pseudo-divergences." In order to obtain a proper divergence satisfying the law of the indiscernibles, we note that the equality D^H_{α,σ,τ}(p : q) = 0 holds if and only if p(x)^{ασ} ∝ q(x)^{βτ} (almost everywhere). To turn this equality condition into p(x) = q(x) (a.e.) for probability distributions, we take γ := ασ = βτ.

Let p(x) and q(x) be positive measures in L^γ(X, µ) for a prescribed scalar γ > 0, and let α, β > 0 be conjugate exponents. We define (Nielsen et al., 2017) a bi-parametric divergence family, a sub-family of the HPDs that satisfies both non-negativity and the law of the indiscernibles, as follows:

Definition 2 (Proper Hölder divergence). The proper Hölder divergence (HD) between two densities p(x) and q(x) is:

    D^H_{\alpha,\gamma}(p : q) = D^H_{\alpha,\,\gamma/\alpha,\,\gamma/\beta}(p : q) := -\log\left(\frac{\int_X p(x)^{\gamma/\alpha} q(x)^{\gamma/\beta}\,\mathrm{d}x}{\left(\int_X p(x)^{\gamma}\,\mathrm{d}x\right)^{1/\alpha}\left(\int_X q(x)^{\gamma}\,\mathrm{d}x\right)^{1/\beta}}\right).

Notice that D^H is used to denote both the HPD and the HD; the number of subscripts (three versus two) distinguishes the pseudo case from the proper case.

An important fact about Hölder divergences is that they encapsulate both the Cauchy-Schwarz divergence and the one-parameter family of skew Bhattacharyya divergences (Nielsen and Boltz, 2011). In the definition of the HD, setting α = β = γ = 2 yields the CS divergence:

    D^H_{2,2}(p : q) = D^H_{2,1,1}(p : q) = \mathrm{CS}(p : q) := -\log\left(\frac{\int_X p(x) q(x)\,\mathrm{d}x}{\left(\int_X p(x)^{2}\,\mathrm{d}x\right)^{1/2}\left(\int_X q(x)^{2}\,\mathrm{d}x\right)^{1/2}}\right).

In the definition of the HD, setting γ = 1 yields the skew Bhattacharyya divergences:

    D^H_{\alpha,1}(p : q) = D^H_{\alpha,\,1/\alpha,\,1/\beta}(p : q) = -\log\int_X p(x)^{1/\alpha} q(x)^{1/\beta}\,\mathrm{d}x =: B_{1/\alpha}(p : q).

It is easy to check from Definition 1 that the HPD is a projective divergence, invariant to rescaling of its parameter densities: D^H_{α,σ,τ}(λp : λ′q) = D^H_{α,σ,τ}(p : q) for all λ, λ′ > 0. Figure 1 illustrates the relationships between these divergence families.

[Fig. 1: Hölder proper divergence (bi-parametric) and Hölder improper pseudo-divergence (tri-parametric) encompass the Cauchy-Schwarz divergence and the skew Bhattacharyya divergences. The diagram nests the families as: projective divergences ⊇ Hölder pseudo-divergence D^H_{α,σ,τ} ⊇ Hölder divergence (proper) D^H_{α,γ} ⊇ {skew Bhattacharyya divergence B_{1/α} (proper), Cauchy-Schwarz (CS) divergence (proper)}.]

By definition, the HPD is asymmetric and satisfies the reference duality D^H_{α,σ,τ}(p : q) = D^H_{β,τ,σ}(q : p) for conjugate exponents α and β. Similarly, the HD satisfies D^H_{α,γ}(p : q) = D^H_{β,γ}(q : p).

The HPD and the HD admit closed-form formulas for exponential family distributions. For example, consider p(x) = exp(θ_p^⊤ t(x) − F(θ_p)) and q(x) = exp(θ_q^⊤ t(x) − F(θ_q)), where t(x) is a vector of sufficient statistics and F(θ) is the convex cumulant generating function. Then, from straightforward derivations, the symmetrized Hölder divergence is:

    S^H_{\alpha,\gamma}(p : q) := \frac{1}{2}\left(D^H_{\alpha,\gamma}(p : q) + D^H_{\alpha,\gamma}(q : p)\right) = \frac{1}{2}\left(F(\gamma\theta_p) + F(\gamma\theta_q) - F\left(\frac{\gamma}{\alpha}\theta_p + \frac{\gamma}{\beta}\theta_q\right) - F\left(\frac{\gamma}{\beta}\theta_p + \frac{\gamma}{\alpha}\theta_q\right)\right).

This S^H_{α,γ} has the key advantage that its centroid can be computed efficiently using the concave-convex procedure (CCCP) (Nielsen and Boltz, 2011).
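As a side illustration not found in the paper, the closed form above is straightforward to evaluate once the cumulant function F is known. The sketch below uses the exponential-distribution family p(x) = exp(θx − F(θ)) on (0, ∞), with natural parameter θ = −rate < 0 and F(θ) = −log(−θ); this family is chosen because it has no carrier term, so the assumption p(x) = exp(θ^⊤ t(x) − F(θ)) behind the derivation holds exactly. The function names are illustrative only.

```python
import numpy as np

def F_exponential(theta):
    """Cumulant function of the exponential-distribution family
    p(x) = exp(theta*x - F(theta)) on (0, inf), theta = -rate < 0: F(theta) = -log(-theta)."""
    return -np.log(-theta)

def symmetrized_holder_expfam(theta_p, theta_q, alpha, gamma, F):
    """Closed-form symmetrized Holder divergence S^H_{alpha,gamma} between two
    members of an exponential family with cumulant function F (sketch)."""
    beta = alpha / (alpha - 1.0)  # conjugate exponent of alpha
    return 0.5 * (F(gamma * theta_p) + F(gamma * theta_q)
                  - F(gamma / alpha * theta_p + gamma / beta * theta_q)
                  - F(gamma / beta * theta_p + gamma / alpha * theta_q))

# alpha = gamma = 2 recovers the (symmetric) Cauchy-Schwarz divergence; here between
# exponential distributions with rates 1 and 3, giving 0.5*log(4/3).
print(symmetrized_holder_expfam(-1.0, -3.0, alpha=2.0, gamma=2.0, F=F_exponential))
```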
Consider a set of fixed densities of the same exponential family with natural parameters {θ_1, ..., θ_n} and positive weights {w_1, ..., w_n} (Σ_{i=1}^{n} w_i = 1). The symmetrized HD centroid with respect to α, γ > 0 is defined as:

    O_{\alpha,\gamma} := \arg\min_{\theta} \sum_{i=1}^{n} w_i\, S^H_{\alpha,\gamma}(\theta_i : \theta) = \arg\min_{\theta} \sum_{i=1}^{n} w_i \left( F(\gamma\theta) - F\left(\frac{\gamma}{\alpha}\theta_i + \frac{\gamma}{\beta}\theta\right) - F\left(\frac{\gamma}{\beta}\theta_i + \frac{\gamma}{\alpha}\theta\right)\right).    (2)

Because F(θ) is a strictly convex function, the energy function to be minimized in eq. (2) is a difference of two convex functions. Setting the derivative to zero, we get the CCCP iterations:

    O^{t+1}_{\alpha,\gamma} = \frac{1}{\gamma}(\nabla F)^{-1}\left[\sum_{i=1}^{n} w_i\left(\frac{1}{\beta}\nabla F\left(\frac{\gamma}{\alpha}\theta_i + \frac{\gamma}{\beta}O^{t}_{\alpha,\gamma}\right) + \frac{1}{\alpha}\nabla F\left(\frac{\gamma}{\beta}\theta_i + \frac{\gamma}{\alpha}O^{t}_{\alpha,\gamma}\right)\right)\right],

where ∇F and (∇F)^{-1} are the forward and reverse transformations between the natural parameters and the dual expectation parameters, respectively.

3 Clustering Based on Symmetric Hölder Divergences

Given a set of densities {p_1, ..., p_n}, we can perform variational k-means clustering (Nielsen and Nock, 2015) based on S^H_{α,γ}. The cost function is the Hölder information:

    E := \sum_{i=1}^{n} S^H_{\alpha,\gamma}(p_i : O_{l_i}),    (3)

where O_1, ..., O_L are the cluster centers and l_i ∈ {1, ..., L} is the cluster label of p_i. Algorithm 1 presents a revision of the clustering algorithm given in (Nielsen et al., 2017) with k-means++ initialization (Arthur and Vassilvitskii, 2007).

Algorithm 1: Hölder variational k-means.
Input: p_1, ..., p_n; number of clusters L; 1 < α ≤ 2; γ > 0
Output: A clustering scheme assigning each p_i to a label in {1, ..., L}
1: Randomly pick one center O_1 ∈ {p_i}_{i=1}^{n}, then sequentially pick O_k (2 ≤ k ≤ L), choosing p_i with probability proportional to min_{1≤j≤k−1} S^H_{α,γ}(p_i : O_j)
2: while not converged do
3:     for i = 1, ..., n do
4:         Assign l_i = arg min_l S^H_{α,γ}(p_i : O_l)
5:     for l = 1, ..., L do
           /* carry CCCP iterations until the current center improves the former cluster Hölder information */
6:         Compute the centroid O_l = arg min_O Σ_{i: l_i=l} S^H_{α,γ}(p_i : O)
7: return {l_i}_{i=1}^{n}

We investigate two different datasets. The first (Nielsen et al., 2017) consists of n random 2D Gaussians forming two or three clusters. In the first cluster, the mean of each Gaussian G(µ, Σ) has the prior distribution µ ∼ G((−2, 0), I); the covariance matrix is obtained by first generating σ_1 ∼ Γ(7, 0.01) and σ_2 ∼ Γ(7, 0.003), where Γ denotes a gamma distribution with prescribed shape and scale, and then rotating the covariance matrix diag(σ_1, σ_2) so that the resulting Gaussian has a "radial direction" with respect to the center (−2, 0). The second and third clusters are similar to the first, the only difference being that their µ's are centered around (2, 0) and (0, 2√3), respectively.

The second dataset consists of multinomial distributions in ∆^9, the 9D probability simplex, and presents two or three clusters. For each cluster, we first pick a random center (c_0, ..., c_d) from the uniform distribution on ∆^9. Then we randomly generate a distribution (p_0, ..., p_d) with

    p_i = \frac{\exp(\log c_i + \sigma\epsilon_i)}{\sum_{j=0}^{d}\exp(\log c_j + \sigma\epsilon_j)},

where σ > 0 is a noise level parameter and each ε_i independently follows a standard Gaussian distribution. We repeat generating random samples for each cluster center, and make sure that different clusters have almost the same number of samples.
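The sketch below (an illustration, not the authors' released code) puts these pieces together for the same exponential-distribution family used in the previous sketch: a CCCP loop implementing the centroid update above, and a bare-bones variational k-means in the spirit of Algorithm 1, except that it uses plain random seeding instead of k-means++ and fixed iteration counts instead of convergence tests. The helper names (sh, cccp_centroid, holder_kmeans) are chosen for illustration.

```python
import numpy as np

# Exponential-distribution family (natural parameter theta = -rate < 0), as above:
F      = lambda th: -np.log(-th)   # cumulant function F(theta) = -log(-theta)
gradF  = lambda th: -1.0 / th      # natural -> expectation parameters
gradFi = lambda et: -1.0 / et      # expectation -> natural parameters (inverse of gradF)

def sh(tp, tq, alpha, gamma):
    """Closed-form symmetrized Holder divergence between natural parameters tp, tq."""
    beta = alpha / (alpha - 1.0)
    return 0.5 * (F(gamma * tp) + F(gamma * tq)
                  - F(gamma / alpha * tp + gamma / beta * tq)
                  - F(gamma / beta * tp + gamma / alpha * tq))

def cccp_centroid(thetas, weights, alpha, gamma, iters=50):
    """CCCP iterations for the symmetrized Holder centroid of eq. (2);
    thetas is a 1-D array of negative natural parameters."""
    beta = alpha / (alpha - 1.0)
    O = np.average(thetas, weights=weights)   # initialization (an arbitrary choice)
    for _ in range(iters):
        eta = np.sum(weights * ((1.0 / beta) * gradF(gamma / alpha * thetas + gamma / beta * O)
                                + (1.0 / alpha) * gradF(gamma / beta * thetas + gamma / alpha * O)))
        O = gradFi(eta) / gamma
    return O

def holder_kmeans(thetas, L, alpha, gamma, n_iter=20, seed=0):
    """Bare-bones variational k-means with S^H (random seeding, fixed iterations)."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(thetas, size=L, replace=False)
    labels = np.zeros(len(thetas), dtype=int)
    for _ in range(n_iter):
        labels = np.array([np.argmin([sh(t, c, alpha, gamma) for c in centers]) for t in thetas])
        for l in range(L):
            members = thetas[labels == l]
            if len(members) > 0:
                w = np.full(len(members), 1.0 / len(members))
                centers[l] = cccp_centroid(members, w, alpha, gamma)
    return labels, centers
```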
Our clustering algorithm involves two additional hyper-parameters, γ and α, compared with standard k-means clustering, so it is meaningful to study how these two hyper-parameters affect performance. We extend the experiments reported previously (Nielsen et al., 2017), where α = γ was applied for simplicity, to a grid of α and γ values. Notice that the reference duality gives S^H_{α,γ} = S^H_{β,γ} for conjugate exponents α and β; therefore we assume 1 < α ≤ 2 without loss of generality. If we choose α = γ = 2, then S^H_{α,γ} becomes the CS divergence, and Algorithm 1 reduces to traditional CS clustering.

We performed clustering experiments by setting the number of clusters k ∈ {2, 3} and the sample size n ∈ {50, 100}. Tables 1 and 2 show the clustering accuracy measured by the Normalized Mutual Information (NMI). The large variance of the clustering accuracy is due to different runs being based on different random datasets. We see that the symmetric Hölder divergence can give strikingly better clustering than CS clustering. An empirical range of well-performing parameter values is γ ∈ [0.5, 1.5] and α ∈ (1, 2]. In practice, one has to set up a configuration grid and apply cross-validation to find the best α and γ values.

This suggests that one should use the general Hölder divergence in place of CS in similar clustering applications (Hasanbelliu et al., 2014; Rami et al., 2016). Although one faces the problem of tuning the parameters α and γ, Hölder divergences can potentially give much better results.
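On synthetic benchmarks where the ground-truth labels are known (as in Tables 1 and 2), such a grid over (α, γ) can be scored directly with NMI; in an unlabeled application one would substitute an internal validation criterion. The following sketch is an illustration only: cluster_fn stands for any clustering routine with the interface of the holder_kmeans sketch above, and it is not the authors' evaluation code.

```python
from sklearn.metrics import normalized_mutual_info_score

def select_alpha_gamma(thetas, true_labels, L, cluster_fn,
                       alphas=(1.01, 1.2, 1.4, 1.6, 1.8, 2.0),
                       gammas=(0.25, 0.5, 0.75, 1.0, 1.5, 2.0)):
    """Grid search over (alpha, gamma), scoring each clustering by NMI against
    known labels; cluster_fn(thetas, L, alpha, gamma) -> (labels, centers)."""
    best = (None, None, -1.0)
    for alpha in alphas:
        for gamma in gammas:
            labels, _ = cluster_fn(thetas, L, alpha, gamma)
            nmi = normalized_mutual_info_score(true_labels, labels)
            if nmi > best[2]:
                best = (alpha, gamma, nmi)
    return best   # (alpha, gamma, NMI) of the best configuration
```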
4 Conclusion

We experimentally confirmed the usefulness of the novel parametric Hölder classes of statistical divergences and pseudo-divergences. In general, one should use Hölder clustering instead of Cauchy-Schwarz clustering to get much better results. These new concepts can open up new insights and applications in statistics and information sciences. Reproducible source code is available online at: https://www.lix.polytechnique.fr/~nielsen/HPD/

Bibliography

Mitrinovic, D.S.; Pecaric, J.; Fink, A.M. Classical and New Inequalities in Analysis; Springer Science & Business Media: New York, NY, USA, 2013; Volume 61.
Kanamori, T.; Fujisawa, H. Affine invariant divergences associated with proper composite scoring rules and their applications. Bernoulli 2014, 20, 2278–2304.
Kanamori, T. Scale-invariant divergences for density functions. Entropy 2014, 16, 2611–2628.
Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms (SODA), 2007; pp. 1027–1035.
Nielsen, F.; Sun, K.; Marchand-Maillet, S. On Hölder projective divergences. Entropy 2017, 19, 122.
Hölder, O.L. Über einen Mittelwertssatz. Nachr. Akad. Wiss. Göttingen Math. Phys. Kl. 1889, 44, 38–47.
Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and clustering. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Queensland, Australia, 19–24 April 2015; pp. 2016–2020.
Hasanbelliu, E.; Giraldo, L.S.; Principe, J.C. Information theoretic shape matching. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 2436–2451.
Rami, H.; Belmerhnia, L.; Drissi El Maliani, A.; El Hassouni, M. Texture Retrieval Using Mixtures of Generalized Gaussian Distribution and Cauchy-Schwarz Divergence in Wavelet Domain. Image Commun. 2016, 42, 45–58.

Table 1: Performance in NMI (mean ± std) when clustering 2D Gaussians, based on 1000 independent runs per configuration. In the original table, bold numbers indicate the best obtained performance and boxed numbers correspond to Cauchy-Schwarz (CS) clustering (α = γ = 2).

(a) k = 2; n = 50
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.91±0.10    0.89±0.13    0.86±0.14    0.85±0.14    0.85±0.15    0.85±0.16
γ=0.5     0.92±0.09    0.86±0.16    0.84±0.17    0.84±0.17    0.82±0.19    0.82±0.17
γ=0.75    0.92±0.10    0.85±0.16    0.83±0.17    0.82±0.18    0.82±0.19    0.81±0.18
γ=1       0.92±0.10    0.84±0.18    0.81±0.20    0.82±0.18    0.82±0.20    0.81±0.20
γ=1.5     0.92±0.10    0.82±0.18    0.80±0.20    0.81±0.19    0.81±0.19    0.80±0.21
γ=2       0.92±0.10    0.81±0.20    0.82±0.19    0.80±0.21    0.80±0.20    0.81±0.20

(b) k = 2; n = 100
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.91±0.07    0.88±0.08    0.87±0.09    0.86±0.10    0.86±0.10    0.86±0.10
γ=0.5     0.91±0.07    0.87±0.12    0.85±0.11    0.85±0.13    0.84±0.14    0.84±0.12
γ=0.75    0.91±0.07    0.86±0.11    0.84±0.13    0.84±0.13    0.84±0.14    0.84±0.14
γ=1       0.92±0.07    0.86±0.12    0.83±0.15    0.83±0.13    0.83±0.14    0.84±0.12
γ=1.5     0.92±0.07    0.84±0.14    0.83±0.14    0.83±0.15    0.83±0.14    0.83±0.13
γ=2       0.91±0.08    0.84±0.14    0.82±0.15    0.83±0.14    0.83±0.14    0.83±0.14

(c) k = 3; n = 50
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.88±0.12    0.83±0.14    0.81±0.15    0.80±0.15    0.79±0.14    0.80±0.15
γ=0.5     0.88±0.12    0.80±0.15    0.77±0.16    0.77±0.15    0.77±0.15    0.76±0.16
γ=0.75    0.89±0.12    0.80±0.14    0.77±0.15    0.76±0.16    0.75±0.15    0.76±0.16
γ=1       0.88±0.12    0.78±0.15    0.76±0.16    0.75±0.16    0.75±0.16    0.76±0.15
γ=1.5     0.88±0.13    0.76±0.16    0.76±0.16    0.76±0.15    0.76±0.16    0.76±0.16
γ=2       0.88±0.12    0.76±0.16    0.75±0.16    0.74±0.16    0.75±0.16    0.76±0.16

(d) k = 3; n = 100
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.89±0.08    0.84±0.11    0.82±0.12    0.82±0.11    0.82±0.11    0.82±0.12
γ=0.5     0.89±0.08    0.83±0.11    0.81±0.12    0.79±0.12    0.78±0.14    0.79±0.13
γ=0.75    0.89±0.09    0.81±0.12    0.79±0.13    0.78±0.13    0.77±0.14    0.78±0.14
γ=1       0.88±0.10    0.80±0.12    0.78±0.14    0.78±0.14    0.78±0.13    0.78±0.13
γ=1.5     0.89±0.09    0.78±0.13    0.77±0.14    0.77±0.14    0.76±0.14    0.77±0.14
γ=2       0.89±0.09    0.78±0.13    0.77±0.13    0.77±0.14    0.77±0.13    0.78±0.13

Table 2: Performance in NMI (mean ± std) when clustering multinomial distributions in ∆^9, based on 1000 independent runs per configuration. In the original table, bold numbers indicate the best obtained performance and boxed numbers correspond to Cauchy-Schwarz (CS) clustering (α = γ = 2).
(a) k = 2; n = 50
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.93±0.14    0.93±0.15    0.93±0.13    0.93±0.15    0.92±0.16    0.93±0.13
γ=0.5     0.91±0.16    0.92±0.15    0.90±0.18    0.91±0.17    0.91±0.16    0.91±0.16
γ=0.75    0.87±0.20    0.86±0.21    0.87±0.20    0.87±0.20    0.88±0.19    0.88±0.19
γ=1       0.83±0.23    0.83±0.23    0.83±0.23    0.82±0.24    0.81±0.23    0.82±0.23
γ=1.5     0.75±0.26    0.71±0.28    0.72±0.27    0.70±0.28    0.71±0.28    0.71±0.28
γ=2       0.68±0.28    0.65±0.29    0.64±0.29    0.62±0.29    0.62±0.30    0.61±0.30

(b) k = 2; n = 100
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.93±0.12    0.93±0.12    0.93±0.11    0.93±0.12    0.94±0.11    0.93±0.12
γ=0.5     0.92±0.14    0.91±0.14    0.92±0.13    0.91±0.15    0.92±0.14    0.91±0.14
γ=0.75    0.89±0.16    0.88±0.16    0.89±0.17    0.89±0.16    0.88±0.16    0.89±0.15
γ=1       0.83±0.20    0.84±0.19    0.84±0.19    0.83±0.19    0.84±0.19    0.84±0.19
γ=1.5     0.77±0.24    0.74±0.25    0.74±0.23    0.73±0.24    0.74±0.23    0.74±0.24
γ=2       0.70±0.26    0.67±0.26    0.65±0.27    0.64±0.27    0.63±0.27    0.63±0.27

(c) k = 3; n = 50
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.87±0.16    0.87±0.16    0.87±0.16    0.87±0.15    0.87±0.16    0.86±0.16
γ=0.5     0.84±0.17    0.84±0.17    0.84±0.17    0.83±0.17    0.84±0.17    0.84±0.18
γ=0.75    0.80±0.18    0.79±0.18    0.79±0.18    0.78±0.19    0.79±0.18    0.78±0.19
γ=1       0.73±0.20    0.72±0.20    0.73±0.20    0.73±0.20    0.72±0.20    0.71±0.21
γ=1.5     0.65±0.21    0.63±0.20    0.61±0.19    0.59±0.20    0.59±0.20    0.60±0.20
γ=2       0.57±0.20    0.55±0.20    0.53±0.19    0.51±0.18    0.52±0.18    0.51±0.18

(d) k = 3; n = 100
          α=1.01       α=1.2        α=1.4        α=1.6        α=1.8        α=2
γ=0.25    0.90±0.13    0.88±0.14    0.88±0.14    0.88±0.13    0.89±0.13    0.89±0.12
γ=0.5     0.87±0.14    0.86±0.14    0.86±0.15    0.86±0.14    0.86±0.14    0.86±0.14
γ=0.75    0.82±0.16    0.82±0.17    0.83±0.15    0.82±0.16    0.82±0.16    0.82±0.16
γ=1       0.77±0.18    0.77±0.17    0.77±0.18    0.75±0.18    0.76±0.18    0.76±0.18
γ=1.5     0.66±0.19    0.63±0.19    0.64±0.19    0.63±0.19    0.63±0.19    0.63±0.19
γ=2       0.57±0.18    0.56±0.18    0.54±0.18    0.53±0.18    0.53±0.19    0.53±0.18