Abstract

We present in this paper a novel non-parametric approach for clustering independent and identically distributed stochastic processes. We introduce a pre-processing step that maps multivariate independent and identically distributed samples of random variables to a generic non-parametric representation which factorizes the dependency and the marginal distributions apart without losing any information. An associated metric is defined in which the balance between dependency and distribution information is controlled by a single parameter. This mixing parameter can be learned, or tuned by a practitioner; such use is illustrated on the case of clustering financial time series. Experiments, implementation and results obtained on public financial time series are available online on the web portal http://www.datagrapple.com .

TS-GNPR: Clustering Random Walk Time Series


Authors

Gautier Marti, Frank Nielsen, Philippe Very, Philippe Donnat


Media

Watch the video


License

Creative Commons Attribution-ShareAlike 4.0 International

Metadata

DOI: 10.23723/11784/14294
Publisher: SEE
Publication year: 2015
Resource type: Text (application/pdf)

Clustering Random Walk Time Series
GSI 2015 - Geometric Science of Information
Gautier Marti, Frank Nielsen, Philippe Very, Philippe Donnat
29 October 2015

Outline
1. Introduction
2. Geometry of Random Walk Time Series
3. The Hierarchical Block Model
4. Conclusion

1. Introduction

Context (data from www.datagrapple.com)

What is a clustering program?

Definition. Clustering is the task of grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups.

Example of a clustering program. We aim at finding k groups by positioning k group centers {c_1, ..., c_k} such that the data points {x_1, ..., x_n} minimize

$$\min_{c_1,\ldots,c_k} \sum_{i=1}^{n} \min_{1 \le j \le k} d(x_i, c_j)^2.$$

But what is the distance d between two random walk time series?

What are clusters of Random Walk Time Series?
French banks and building materials CDS over 2006-2015.

2. Geometry of Random Walk Time Series

Geometry of RW TS ≡ Geometry of Random Variables

i.i.d. observations:
X_1: X_1^1, X_1^2, ..., X_1^T
X_2: X_2^1, X_2^2, ..., X_2^T
...
X_N: X_N^1, X_N^2, ..., X_N^T

Which distance d(X_i, X_j) between dependent random variables?

Pitfalls of a basic distance

Let (X, Y) be a bivariate Gaussian vector, with X ∼ N(µ_X, σ_X²), Y ∼ N(µ_Y, σ_Y²) and correlation ρ(X, Y) ∈ [−1, 1]. Then

$$\mathbb{E}[(X - Y)^2] = (\mu_X - \mu_Y)^2 + (\sigma_X - \sigma_Y)^2 + 2\,\sigma_X \sigma_Y\,(1 - \rho(X, Y)).$$

Now, consider the following values for the correlation:
- ρ(X, Y) = 0, so E[(X − Y)²] = (µ_X − µ_Y)² + σ_X² + σ_Y². Assume µ_X = µ_Y and σ_X = σ_Y. For σ_X = σ_Y ≫ 1, we obtain E[(X − Y)²] ≫ 1 instead of the distance 0 expected from comparing two equal Gaussians.
- ρ(X, Y) = 1, so E[(X − Y)²] = (µ_X − µ_Y)² + (σ_X − σ_Y)²: the distance is driven entirely by the marginal parameters, even though the two variables are perfectly co-monotone.

[Figure] Probability density functions of Gaussians N(−5, 1) and N(5, 1), Gaussians N(−5, 3) and N(5, 3), and Gaussians N(−5, 10) and N(5, 10). The green, red and blue pairs of Gaussians are equidistant using the L2 geometry on the parameter space (µ, σ).
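To make the pitfall concrete, here is a small numerical check of the closed-form expression above. This sketch is not part of the original slides; the parameters (σ = 10 for the independent pair, means ±5 and scales 1 and 3 for the co-monotone pair, echoing the figure) are arbitrary illustrations.

    import numpy as np

    rng = np.random.default_rng(0)
    T = 100_000

    # Two Gaussians with identical marginals N(0, 10^2) but independent (rho = 0):
    x = rng.normal(0.0, 10.0, size=T)
    y = rng.normal(0.0, 10.0, size=T)
    print(np.mean((x - y) ** 2))    # ~ 2 * 10^2 = 200, although the two distributions are equal

    # Perfectly co-monotone Gaussians (rho = 1) with different means and scales:
    z = rng.normal(size=T)
    x1, y1 = -5.0 + 1.0 * z, 5.0 + 3.0 * z
    print(np.mean((x1 - y1) ** 2))  # ~ (mu_X - mu_Y)^2 + (sigma_X - sigma_Y)^2 = 100 + 4 = 104

In both cases the L2-type distance reflects the marginal parameters rather than the dependence structure, which motivates separating dependence from margins in what follows.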
Sklar's Theorem

Theorem (Sklar, 1959). For any random vector X = (X_1, ..., X_N) having continuous marginal cdfs P_i, 1 ≤ i ≤ N, its joint cumulative distribution P is uniquely expressed as

$$P(X_1, \ldots, X_N) = C(P_1(X_1), \ldots, P_N(X_N)),$$

where C, the multivariate distribution with uniform marginals, is known as the copula of X.

The Copula Transform

Definition (The Copula Transform). Let X = (X_1, ..., X_N) be a random vector with continuous marginal cumulative distribution functions (cdfs) P_i, 1 ≤ i ≤ N. The random vector

$$U = (U_1, \ldots, U_N) := P(X) = (P_1(X_1), \ldots, P_N(X_N))$$

is known as the copula transform.

The U_i, 1 ≤ i ≤ N, are uniformly distributed on [0, 1] (the probability integral transform): for P_i the cdf of X_i, we have

$$x = P_i(P_i^{-1}(x)) = \Pr(X_i \le P_i^{-1}(x)) = \Pr(P_i(X_i) \le x),$$

thus P_i(X_i) ∼ U[0, 1].

[Figure] The copula transform is invariant to strictly increasing transformations: for X ∼ U[0, 1] and Y = ln(X), the correlation is ρ ≈ 0.84, whereas after the transform ρ(P_X(X), P_Y(Y)) = 1.
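The invariance illustrated in the figure above can be checked numerically. The sketch below is my own illustration, not taken from the slides: it applies the empirical copula transform via normalized ranks and compares correlations before and after.

    import numpy as np
    from scipy.stats import rankdata

    rng = np.random.default_rng(0)
    T = 10_000

    x = rng.uniform(0.0, 1.0, size=T)   # X ~ U[0, 1]
    y = np.log(x)                       # Y = ln(X), a strictly increasing transform of X

    # Pearson correlation is affected by the non-linear change of margin:
    print(np.corrcoef(x, y)[0, 1])      # well below 1 (about 0.87 here; the slide reports ~0.84 on its sample)

    # Empirical copula transform = normalized ranks, mapping both margins to U[0, 1]:
    u, v = rankdata(x) / len(x), rankdata(y) / len(y)
    print(np.corrcoef(u, v)[0, 1])      # exactly 1: ranks are preserved by increasing transforms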
Deheuvels' Empirical Copula Transform

Let (X_1^t, ..., X_N^t), 1 ≤ t ≤ T, be T observations from a random vector (X_1, ..., X_N) with continuous margins. One cannot directly obtain the copula observations (U_1^t, ..., U_N^t) = (P_1(X_1^t), ..., P_N(X_N^t)), t = 1, ..., T, without knowing the margins (P_1, ..., P_N) a priori; one can instead estimate them empirically.

Definition (The Empirical Copula Transform). Estimate the N empirical margins

$$P_i^T(x) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}(X_i^t \le x), \quad 1 \le i \le N,$$

to obtain the T empirical observations

$$(\tilde{U}_1^t, \ldots, \tilde{U}_N^t) = (P_1^T(X_1^t), \ldots, P_N^T(X_N^t)).$$

Equivalently, since \tilde{U}_i^t = R_i^t / T, where R_i^t is the rank of observation X_i^t, the empirical copula transform can be seen as the normalized rank transform. In practice:

    from scipy.stats import rankdata
    x_transform = rankdata(x) / len(x)

Generic Non-Parametric Distance

$$d_\theta^2(X_i, X_j) = \theta \cdot 3\,\mathbb{E}\!\left[|P_i(X_i) - P_j(X_j)|^2\right] + (1 - \theta) \cdot \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dP_i}{d\lambda}} - \sqrt{\frac{dP_j}{d\lambda}} \right)^2 d\lambda$$

(i) 0 ≤ d_θ ≤ 1;
(ii) for 0 < θ < 1, d_θ is a metric;
(iii) d_θ is invariant under diffeomorphism.

The two extreme cases recover known quantities (a sketch of an empirical estimator of d_θ is given after the model description below):

$$d_0^2(X_i, X_j) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{dP_i}{d\lambda}} - \sqrt{\frac{dP_j}{d\lambda}} \right)^2 d\lambda = \text{Hellinger}^2,$$

$$d_1^2(X_i, X_j) = 3\,\mathbb{E}\!\left[|P_i(X_i) - P_j(X_j)|^2\right] = \frac{1 - \rho_S}{2} = 2 - 6 \int_0^1 \!\!\int_0^1 C(u, v)\, du\, dv,$$

with ρ_S the Spearman correlation.

Remark: if f(x; θ) = c_Φ(u_1, ..., u_N; Σ) ∏_{i=1}^N f_i(x_i; ν_i), then ds² = ds²_GaussCopula + ∑_{i=1}^N ds²_margins.

3. The Hierarchical Block Model

A model of nested partitions. The nested partitions defined by the model can be seen on the distance matrix, for a proper distance and the right permutation of the data points. In practice, one observes and works with an unordered distance matrix, which is identical to the block-structured one up to a permutation of the data points.
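As referenced above, here is a rough empirical estimator of d_θ² between two samples. This is my own sketch, not the authors' code: the dependence term uses normalized ranks, the Hellinger term is approximated with histograms on a shared binning, and the helper name gnpr_distance as well as the binning choice are assumptions.

    import numpy as np
    from scipy.stats import rankdata

    def gnpr_distance(x, y, theta=0.5, n_bins=100):
        """Rough empirical estimate of d_theta^2 between two 1-D samples of equal length:
        rank-based dependence term + histogram-based squared Hellinger term."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        T = len(x)
        # Dependence part: 3 E[|P_i(X_i) - P_j(X_j)|^2], estimated with normalized ranks.
        u, v = rankdata(x) / T, rankdata(y) / T
        dependence = 3.0 * np.mean((u - v) ** 2)
        # Distribution part: squared Hellinger distance between the two margins,
        # estimated on a common histogram binning.
        bins = np.linspace(min(x.min(), y.min()), max(x.max(), y.max()), n_bins + 1)
        p = np.histogram(x, bins=bins)[0] / T
        q = np.histogram(y, bins=bins)[0] / T
        distribution = 0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)
        return theta * dependence + (1.0 - theta) * distribution

For two comonotone samples with different margins the dependence term vanishes while the distribution term does not, and conversely for independent samples with identical margins; θ interpolates between the two behaviours.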
Results: Data from the Hierarchical Block Model

Adjusted Rand Index (HC-AL: hierarchical clustering with average linkage; AP: affinity propagation):

Algo.  Distance        Distrib      Correl       Correl+Distrib
HC-AL  (1 − ρ)/2       0.00 ±0.01   0.99 ±0.01   0.56 ±0.01
       E[(X − Y)²]     0.00 ±0.00   0.09 ±0.12   0.55 ±0.05
       GPR θ = 0       0.34 ±0.01   0.01 ±0.01   0.06 ±0.02
       GPR θ = 1       0.00 ±0.01   0.99 ±0.01   0.56 ±0.01
       GPR θ = .5      0.34 ±0.01   0.59 ±0.12   0.57 ±0.01
       GNPR θ = 0      1            0.00 ±0.00   0.17 ±0.00
       GNPR θ = 1      0.00 ±0.00   1            0.57 ±0.00
       GNPR θ = .5     0.99 ±0.01   0.25 ±0.20   0.95 ±0.08
AP     (1 − ρ)/2       0.00 ±0.00   0.99 ±0.07   0.48 ±0.02
       E[(X − Y)²]     0.14 ±0.03   0.94 ±0.02   0.59 ±0.00
       GPR θ = 0       0.25 ±0.08   0.01 ±0.01   0.05 ±0.02
       GPR θ = 1       0.00 ±0.01   0.99 ±0.01   0.48 ±0.02
       GPR θ = .5      0.06 ±0.00   0.80 ±0.10   0.52 ±0.02
       GNPR θ = 0      1            0.00 ±0.00   0.18 ±0.01
       GNPR θ = 1      0.00 ±0.01   1            0.59 ±0.00
       GNPR θ = .5     0.39 ±0.02   0.39 ±0.11   1

Results: Application to Credit Default Swap Time Series

Distance matrices computed on CDS time series exhibit a hierarchical block structure (Marti, Very, Donnat, Nielsen, IEEE ICMLA 2015).

[Figures] Instability of clusters with the L2 distance vs. stability of clusters with the proposed distance.

Consistency

Definition (Consistency of a clustering algorithm). A clustering algorithm A is consistent with respect to the Hierarchical Block Model defining a set of nested partitions P if the probability that A recovers all the partitions in P converges to 1 as T → ∞.

Definition (Space-conserving algorithm). A space-conserving algorithm does not distort the space, i.e. the distance D_ij between two clusters C_i and C_j satisfies

$$D_{ij} \in \left[ \min_{x \in C_i,\, y \in C_j} d(x, y),\ \max_{x \in C_i,\, y \in C_j} d(x, y) \right].$$

Theorem (Consistency of space-conserving algorithms; Andler, Marti, Nielsen, Donnat, 2015). Space-conserving algorithms (e.g., Single, Average and Complete Linkage) are consistent with respect to the Hierarchical Block Model.

[Figures] Cluster recovery for T = 100, T = 1000, T = 10000.

4. Conclusion

Discussion and questions.

Avenues for research:
- distances on (copula, margins)
- clustering using multivariate dependence information
- clustering using multi-wise dependence information

Reference: Optimal Copula Transport for Clustering Multivariate Time Series, Marti, Nielsen, Donnat, 2015.
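For completeness, the evaluation protocol behind the results table and the consistency theorem above can be mimicked with standard tooling. The sketch below is my own illustration (the function name cluster_and_score is an assumption, not the authors' code): it clusters a precomputed pairwise distance matrix with average linkage, a space-conserving algorithm covered by the theorem, and scores the recovered partition with the Adjusted Rand Index.

    import numpy as np
    from scipy.cluster.hierarchy import average, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.metrics import adjusted_rand_score

    def cluster_and_score(distance_matrix, true_labels, n_clusters):
        """Average-linkage (HC-AL) clustering of a precomputed distance matrix,
        scored against the ground-truth partition with the Adjusted Rand Index."""
        condensed = squareform(np.asarray(distance_matrix), checks=False)  # condensed pairwise distances
        linkage_matrix = average(condensed)                                # space-conserving linkage
        labels = fcluster(linkage_matrix, t=n_clusters, criterion='maxclust')
        return adjusted_rand_score(true_labels, labels)

The distance matrix can be filled with any of the distances compared in the table, for instance the empirical d_θ sketched earlier.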