Bootstrapping Descriptors for Non-Euclidean Data

07/11/2017
Publication GSI2017
OAI : oai:www.see.asso.fr:17410:22570
 

Abstract

For data carrying a non-Euclidean geometric structure it is natural to perform statistics via geometric descriptors. Typical candidates are means, geodesics, or more generally, lower dimensional subspaces, which carry specific structure. Asymptotic theory for such descriptors is slowly unfolding and its application to statistical testing usually requires one more step: Assessing the distribution of such descriptors.
To this end, one may use the bootstrap that has proven to be a very successful tool to extract inferential information from small samples. In this communication we review asymptotics for descriptors of manifold valued data and study a non-parametric bootstrap test that aims at a high power, also under the alternative.


Collection

Bootstrapping Descriptors for Non-Euclidean Data, by Benjamin Eltzner and Stephan Huckemann


License

Creative Commons None (All rights reserved)

<resource  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xmlns="http://datacite.org/schema/kernel-4"
                xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
        <identifier identifierType="DOI">10.23723/17410/22570</identifier><creators><creator><creatorName>Benjamin Eltzner</creatorName></creator><creator><creatorName>Stephan Huckemann</creatorName></creator></creators><titles>
            <title>Bootstrapping Descriptors for Non-Euclidean Data</title></titles>
        <publisher>SEE</publisher>
        <publicationYear>2018</publicationYear>
        <resourceType resourceTypeGeneral="Text">Text</resourceType><dates>
	    <date dateType="Created">Thu 8 Mar 2018</date>
	    <date dateType="Updated">Thu 8 Mar 2018</date>
            <date dateType="Submitted">Mon 12 Nov 2018</date>
	</dates>
        <alternateIdentifiers>
	    <alternateIdentifier alternateIdentifierType="bitstream">3c5d2727f77435f178ac3f4e5e7805d5ece37ab3</alternateIdentifier>
	</alternateIdentifiers>
        <formats>
	    <format>application/pdf</format>
	</formats>
	<version>37291</version>
        <descriptions>
            <description descriptionType="Abstract">For data carrying a non-Euclidean geometric structure it is natural to perform statistics via geometric descriptors. Typical candidates are means, geodesics, or more generally, lower dimensional subspaces, which carry specific structure. Asymptotic theory for such descriptors is slowly unfolding and its application to statistical testing usually requires one more step: Assessing the distribution of such descriptors.<br />
To this end, one may use the bootstrap that has proven to be a very successful tool to extract inferential information from small samples. In this communication we review asymptotics for descriptors of manifold valued data and study a non-parametric bootstrap test that aims at a high power, also under the alternative.
</description>
        </descriptions>
    </resource>

Bootstrapping Descriptors for Non-Euclidean Data

Benjamin Eltzner and Stephan Huckemann
University of Goettingen, Germany, Felix-Bernstein-Institute for Mathematical Statistics in the Biosciences
Acknowledging the Niedersachsen Vorab of the Volkswagen Foundation

Abstract. For data carrying a non-Euclidean geometric structure it is natural to perform statistics via geometric descriptors. Typical candidates are means, geodesics, or more generally, lower dimensional subspaces, which carry specific structure. Asymptotic theory for such descriptors is slowly unfolding, and its application to statistical testing usually requires one more step: assessing the distribution of such descriptors. To this end, one may use the bootstrap, which has proven to be a very successful tool for extracting inferential information from small samples. In this communication we review asymptotics for descriptors of manifold valued data and study a non-parametric bootstrap test that aims at high power, also under the alternative.

1 Introduction

In recent years, the study of data on non-Euclidean spaces has found increasing attention in statistics. Non-Euclidean data spaces have led to a surge of specialized fields: directional statistics is concerned with data on spheres of different dimensions (e.g. [15]); shape analysis leads to data on quotient spaces (e.g. [6]), some of which are manifolds and some of which are non-manifold stratified spaces; and applications in population genetics have led to increasing interest in data on non-manifold phylogenetic tree spaces (e.g. [4]) and in graph data in general. As a basis for statistics on these spaces, it is important to investigate asymptotic consistency of estimators, as has been done for intrinsic and extrinsic Fréchet means on manifolds by [8, 3], and more generally for a class of descriptors called generalized Fréchet means by [11, 12].
Examples of such generalized Fréchet means are not only Procrustes means on non-manifold shape spaces ([6, 11]) but also geodesic principal components on such spaces (cf. [10]), or more generally, barycentric subspaces by [17], see also [16] for a similar approach on phylogenetic tree spaces, or more specifically, small and great subspheres for spherical data by [14, 18].

In particular, the question of asymptotic consistency and normality of principal nested spheres analysis [14], say, goes beyond generalized Fréchet means analysis. In all nested schemes, several estimators are determined sequentially, where each estimation depends on all previous ones. Recently, asymptotic consistency of nested generalized Fréchet means was introduced in [13] as a generalization of classical PCA's asymptotics, e.g. by [1], where nestedness of approximating subspaces is not an issue because it is trivially given.

Based on asymptotic consistency of nested and non-nested descriptors, hypothesis tests, like the two-sample test, can be considered. Since by construction every sample determines only one single descriptor and not its distribution, resampling techniques like the bootstrap are necessary to produce confidence sets. Notably, this is a very generic technique, independent of specific sample spaces and descriptors. In the following, after introducing non-nested and nested generalized Fréchet means, we elaborate on bootstrapping quantiles for a two-sample test. We show that a separated approach in general leads to greatly increased power of the test in comparison to a pooled approach, both with correct asymptotic size. Also, we illustrate the benefit of nested over non-nested descriptors.

2 Descriptors for Manifold Valued Data

2.1 Single Descriptors

With a silently underlying probability space $(\Omega, \mathcal{A}, P)$, random elements on a topological space $Q$ are mappings $X : \Omega \to Q$ that are measurable with respect to the Borel $\sigma$-algebra of $Q$.
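As a concrete toy instance before the general definition below, consider the intrinsic Fréchet mean on the unit sphere $S^2$, i.e. the minimizer of summed squared geodesic distances. The following is a minimal numerical sketch, not taken from the paper: the log/exp maps and the fixed-point iteration are standard constructions, and we assume the data are concentrated enough for the minimizer to be unique.

```python
import numpy as np

def sphere_log(p, x):
    # Log map at p on the unit sphere: tangent vector at p pointing to x,
    # with length equal to the geodesic distance arccos(<x, p>).
    v = x - np.dot(x, p) * p
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(x, p), -1.0, 1.0)) * v / nv

def sphere_exp(p, v):
    # Exp map at p: follow the geodesic in direction v for length |v|.
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p
    return np.cos(nv) * p + np.sin(nv) * v / nv

def frechet_mean(X, iters=100):
    # Fixed-point iteration for argmin_p sum_i d(X_i, p)^2:
    # repeatedly move to the exp of the average log of the data.
    mu = X[0] / np.linalg.norm(X[0])
    for _ in range(iters):
        grad = np.mean([sphere_log(mu, x) for x in X], axis=0)
        mu = sphere_exp(mu, grad)
    return mu
```

For four points placed symmetrically around the north pole, the iteration recovers the pole; for spread-out or antipodal data, uniqueness (and hence this sketch) breaks down, which is exactly why the asymptotic theory below requires uniqueness assumptions.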
For a topological space $Q$ we say that a continuous function $d : Q \times Q \to [0, \infty)$ is a loss function if $d(q, q') = 0$ if and only if $q = q'$.

Definition 1 (Generalized Fréchet Means [11]). Let $Q$ be a separable topological space, called the data space, and $P$ a separable topological space, called the descriptor space, with loss function $d : P \times P \to [0, \infty)$ and a continuous map $\rho : Q \times P \to [0, \infty)$. Random elements $X_1, \ldots, X_n \overset{i.i.d.}{\sim} X$ on $Q$ give rise to population and sample descriptors

  $\mu \in \operatorname*{argmin}_{p \in P} \mathbb{E}[\rho(X, p)^2], \qquad \mu_n \in \operatorname*{argmin}_{p \in P} \sum_{j=1}^n \rho(X_j, p)^2.$

The descriptors are also called generalized $\rho$-Fréchet means. The sample descriptor is a least squares M-estimator.

Asymptotic theory for generalized $\rho$-Fréchet means, under additional assumptions, among them that the means be unique and attained on a twice differentiable manifold part of $P$, has been established by [11, 12].

2.2 Nested Descriptors

For nested descriptors, we need to establish a notion of nestedness and the relations between the successive descriptor spaces.

Definition 2 ([13]). A separable topological data space $Q$ admits backward nested families of descriptors (BNFDs) if

(i) there is a collection $P^j$ ($j = 0, \ldots, m$) of separable topological spaces with loss functions $d^j : P^j \times P^j \to [0, \infty)$;
(ii) $P^m = \{Q\}$;
(iii) every $p \in P^j$ ($j = 1, \ldots, m$) is itself a topological space and gives rise to a topological space $\emptyset \neq S_p \subset P^{j-1}$ which comes with a continuous map $\rho_p : p \times S_p \to [0, \infty)$;
(iv) for every pair $p \in P^j$ ($j = 1, \ldots, m$) and $s \in S_p$ there is a measurable projection map $\pi_{p,s} : p \to s$.

For $j \in \{1, \ldots, m-2\}$ call a family $f = \{p^j, \ldots, p^{m-1}\}$, with $p^{k-1} \in S_{p^k}$, $k = j+1, \ldots, m$, a backward nested family of descriptors (BNFD) ending in $P^j$, where we ignore the unique $p^m = Q \in P^m$. The space of all BNFDs ending in $P^j$ is given by

  $T^j = \big\{ f = \{p^k\}_{k=j}^{m-1} : p^{k-1} \in S_{p^k},\ k = j+1, \ldots, m \big\} \subseteq \prod_{k=j}^{m-1} P^k.$

For $j \in \{1, \ldots, m\}$, given a BNFD $f = \{p^k\}_{k=j}^{m-1}$, set

  $\pi_f = \pi_{p^{j+1}, p^j} \circ \ldots \circ \pi_{p^m, p^{m-1}} : p^m \to p^j,$

which projects along each descriptor. For another BNFD $f' = \{p'^k\}_{k=j}^{m-1} \in T^j$ set

  $d^j(f, f') = \sqrt{\sum_{k=j}^{m-1} d^k(p^k, p'^k)^2}.$

Building on this notion, we can now define nested population and sample descriptors similar to Definition 1.

Definition 3 (Nested Generalized Fréchet Means [13]). Random elements $X_1, \ldots, X_n \overset{i.i.d.}{\sim} X$ on a data space $Q$ admitting BNFDs give rise to backward nested population and sample descriptors (abbreviated as BN descriptors)

  $\{E^{f^j} : j = m-1, \ldots, 0\}, \qquad \{E_n^{f_n^j} : j = m-1, \ldots, 0\},$

recursively defined using $p^m = Q = p_n^m$ via

  $E^{f^j} = \operatorname*{argmin}_{s \in S_{p^{j+1}}} \mathbb{E}[\rho_{p^{j+1}}(\pi_{f^{j+1}} \circ X, s)^2], \qquad f^j = \{p^k\}_{k=j}^{m-1},$

  $E_n^{f_n^j} = \operatorname*{argmin}_{s \in S_{p_n^{j+1}}} \sum_{i=1}^n \rho_{p_n^{j+1}}(\pi_{f_n^{j+1}} \circ X_i, s)^2, \qquad f_n^j = \{p_n^k\}_{k=j}^{m-1},$

where $p^j \in E^{f^j}$ and $p_n^j \in E_n^{f_n^j}$ is a measurable choice for $j = 1, \ldots, m-1$. We say that a BNFD $f = \{p^k\}_{k=0}^{m-1}$ gives unique BN population descriptors if $E^{f^j} = \{p^j\}$ with $f^j = \{p^k\}_{k=j}^{m-1}$ for all $j = 0, \ldots, m-1$.

Each of the $E^{f^j}$ and $E_n^{f_n^j}$ is called a nested generalized Fréchet mean, and $E_n^{f_n^j}$ can be viewed as a nested least squares M-estimator.

Asymptotic theory for such backward nested families of descriptors, again under additional assumptions, among them the means being attained on twice-differentiable manifold parts, has been established in [13].

In order to assess asymptotics of single elements in a family of nested generalized $\rho$-Fréchet means, the last element, say, a key ingredient is the following definition from [13].

Definition 4 (Factoring Charts [13]). Let $W \subset T^j$ and $U \subset P^j$ be open subsets with $C^2$ manifold structure, $f' = (p'^{m-1}, \ldots, p'^j) \in W$ with $p'^j \in U$, and with local chart

  $\psi : W \to \psi(W) \subset \mathbb{R}^{\dim(W)}, \qquad f = (p^{m-1}, \ldots, p^j) \mapsto \eta = (\theta, \xi).$

The chart $\psi$ factors if there are a chart $\phi$ and projections $\pi_U$, $\pi_{\phi(U)}$,

  $\phi : U \to \phi(U) \subset \mathbb{R}^{\dim(U)}, \qquad p^j \mapsto \theta,$
  $\pi_U : W \to U, \qquad f \mapsto p^j,$
  $\pi_{\phi(U)} : \psi(W) \to \phi(U), \qquad (\theta, \xi) \mapsto \theta,$

such that the diagram commutes, i.e.

  $\pi_{\phi(U)} \circ \psi = \phi \circ \pi_U. \qquad (1)$

In case factoring charts exist, from the asymptotics of an entire backward nested descriptor family it is possible to project to a chart describing the last element descriptor only, and such a projection preserves asymptotic Gaussianity, cf. [13].

3 Bootstrap Testing

Based on the central limit theorems proved in [11, 13], it is possible to introduce a $T^2$-like two-sample test for non-nested descriptors, BNFDs and single nested descriptors.

3.1 The Test Statistic

Suppose that we have two independent i.i.d. samples $X_1, \ldots, X_n \sim X \in Q$ and $Y_1, \ldots, Y_m \sim Y \in Q$ in a data space $Q$ admitting non-nested descriptors, BNFDs and single nested descriptors in $P$, and we want to test

  $H_0 : X \sim Y$ versus $H_1 : X \not\sim Y$

using descriptors $p \in P$. Here, $p \in P$ stands either for a single $p^k \in P^k$ or for a suitable sequence $f \in T^j$. We assume that the first sample gives rise to $\hat p_n^X \in P$, the second to $\hat p_m^Y \in P$, and that these are unique. We introduce shorthand notation to simplify the following complex expressions:

  $d_{n,b}^{X,*} = \phi(\hat p_{n,b}^{X,*}) - \phi(\hat p_n^X), \qquad d_{m,b}^{Y,*} = \phi(\hat p_{m,b}^{Y,*}) - \phi(\hat p_m^Y),$

  $\Sigma_{\phi,n,b}^{X,*} := \frac{1}{B} \sum_{b=1}^{B} d_{n,b}^{X,*} \big(d_{n,b}^{X,*}\big)^T, \qquad \Sigma_{\phi,m,b}^{Y,*} := \frac{1}{B} \sum_{b=1}^{B} d_{m,b}^{Y,*} \big(d_{m,b}^{Y,*}\big)^T.$

Define the statistic

  $T^2 := \big(\phi(\hat p_n^X) - \phi(\hat p_m^Y)\big)^T \big(\Sigma_{\phi,n,b}^{X,*} + \Sigma_{\phi,m,b}^{Y,*}\big)^{-1} \big(\phi(\hat p_n^X) - \phi(\hat p_m^Y)\big). \qquad (2)$

Under $H_0$ and the assumptions of the CLTs shown in [11, 13], this is asymptotically Hotelling $T^2$ distributed if the corresponding bootstrapped covariance matrices exist. Notably, under slightly stronger regularity assumptions, which are needed for the bootstrap, this estimator is asymptotically consistent, cf. [5, Corollary 1].
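Given the bootstrap deviations in chart coordinates, the statistic (2) is a plain quadratic form. A hedged numerical sketch follows; the function name and the array layout ($B$ bootstrap replicates of $d$-dimensional chart coordinates) are our own illustration, not the paper's code.

```python
import numpy as np

def t2_statistic(phi_x, phi_y, dev_x, dev_y):
    """Statistic (2).

    phi_x, phi_y : chart coordinates phi(p_hat) of the two sample
                   descriptors, shape (d,)
    dev_x, dev_y : bootstrap deviations d^{X,*}_{n,b} and d^{Y,*}_{m,b},
                   shape (B, d)
    """
    # Bootstrapped covariance matrices Sigma^{X,*} and Sigma^{Y,*}:
    # averages of outer products of the deviations.
    sigma_x = dev_x.T @ dev_x / dev_x.shape[0]
    sigma_y = dev_y.T @ dev_y / dev_y.shape[0]
    diff = phi_x - phi_y
    # Quadratic form diff^T (Sigma_X + Sigma_Y)^{-1} diff, solved rather
    # than explicitly inverting for numerical stability.
    return float(diff @ np.linalg.solve(sigma_x + sigma_y, diff))
```

Solving the linear system instead of forming the inverse is a standard numerical choice; the statistic is well defined only when the summed covariance matrix is invertible, matching the invertibility assumption discussed below.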
3.2 Pooled Bootstrapped Quantiles

Since the test statistic (2) is only asymptotically $T^2$ distributed, and especially deeply nested estimators may have sizable bias for finite sample size, it can be advantageous to use the bootstrap to simulate quantiles, whose covering rate usually has better convergence properties, cf. [7].

A pooled approach to simulated quantiles runs as follows. From $X_1, \ldots, X_n, Y_1, \ldots, Y_m$, sample $Z_{1,b}, \ldots, Z_{n+m,b}$ and compute the corresponding $T_b^{*2}$ ($b = 1, \ldots, B$) following (2) from $X_{i,b}^* = Z_{i,b}$, $Y_{j,b}^* = Z_{n+j,b}$ ($i = 1, \ldots, n$, $j = 1, \ldots, m$). From these, for a given level $\alpha \in (0, 1)$, we compute the empirical quantile $c_{1-\alpha}^*$ such that

  $P\big(T^{*2} \le c_{1-\alpha}^* \mid X_1, \ldots, X_n, Y_1, \ldots, Y_m\big) = 1 - \alpha.$

We then have under $H_0$ that $c_{1-\alpha}^*$ gives an asymptotic coverage of $1-\alpha$ for $T^2$, i.e. $P\{T^2 \le c_{1-\alpha}^*\} \to 1-\alpha$ as $n, m \to \infty$ if $n/m \to c$ with a fixed $c \in (0, \infty)$.

Under $H_1$, however, the bootstrap samples $X_{i,b}^*$ and $Y_{j,b}^*$ have substantially higher variance than both the original $X_i$ and $Y_j$. This leads to a large spread between the values of the quantiles and thus to diminished power of the test. This will be exemplified in the simulations below.

3.3 Separated Bootstrapped Quantiles

To improve the power of the test while still achieving the asymptotic size, we simulate a slightly changed statistic under $H_0$, by again bootstrapping, but now separately, from $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ (for $b = 1, \ldots, B$):

  $T^{*2} = \big(d_{n,b}^{X,*} - d_{m,b}^{Y,*}\big)^T \big(\Sigma_{\phi,n,b}^{X,*} + \Sigma_{\phi,m,b}^{Y,*}\big)^{-1} \big(d_{n,b}^{X,*} - d_{m,b}^{Y,*}\big). \qquad (3)$

From these values, for a given level $\alpha \in (0, 1)$, we compute the empirical quantile $c_{1-\alpha}^*$ such that

  $P\big(T^{*2} \le c_{1-\alpha}^* \mid X_1, \ldots, X_n, Y_1, \ldots, Y_m\big) = 1 - \alpha.$

Then, in consequence of [2, Theorems 3.2 and 3.5], the asymptotic normality of $\sqrt{n}\big(\phi(\hat p_n^X) - \phi(\hat p^X)\big)$ and $\sqrt{m}\big(\phi(\hat p_m^Y) - \phi(\hat p^Y)\big)$, guaranteed by the CLT in [13], extends to the same asymptotic normality for $\sqrt{n}\, d_{n,b}^{X,*}$ and $\sqrt{m}\, d_{m,b}^{Y,*}$, respectively.
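In Euclidean chart coordinates the separated procedure can be sketched end to end. This is our own illustration under simplifying assumptions: `fit` stands for whatever descriptor estimator is in use (here exercised with a plain sample mean, not the paper's manifold descriptors), and all names are hypothetical.

```python
import numpy as np

def separated_bootstrap_test(X, Y, fit, B=1000, alpha=0.05, seed=0):
    """Two-sample test with separately bootstrapped quantiles (Section 3.3 sketch).

    X, Y : arrays of observations, shapes (n, ...) and (m, ...)
    fit  : function mapping a sample to chart coordinates phi(p_hat), shape (d,)
    """
    rng = np.random.default_rng(seed)
    px, py = fit(X), fit(Y)
    # Bootstrap each sample separately; deviations d^{X,*}_{n,b}, d^{Y,*}_{m,b}.
    dX = np.array([fit(X[rng.integers(0, len(X), len(X))]) - px for _ in range(B)])
    dY = np.array([fit(Y[rng.integers(0, len(Y), len(Y))]) - py for _ in range(B)])
    # Bootstrapped covariances and their summed inverse.
    Sx = dX.T @ dX / B
    Sy = dY.T @ dY / B
    Sinv = np.linalg.inv(Sx + Sy)
    diff = px - py
    t2 = diff @ Sinv @ diff                                      # statistic (2)
    t2_null = np.einsum('bi,ij,bj->b', dX - dY, Sinv, dX - dY)   # statistic (3)
    c = np.quantile(t2_null, 1 - alpha)  # separated bootstrapped quantile
    return float(t2), float(c), bool(t2 > c)
```

Because the null statistic (3) is built from centered deviations of each sample around its own descriptor, its quantiles are not inflated under the alternative, which is the power advantage over the pooled scheme.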
We then have under $H_0$ that $c_{1-\alpha}^*$ gives an asymptotic coverage of $1-\alpha$ for $T^2$ from equation (2), i.e. $P\{T^{*2} \le c_{1-\alpha}^*\} \to 1-\alpha$ as $B, n, m \to \infty$ if $n/m \to c$ with a fixed $c \in (0, \infty)$. We note that the argument from [3, Corollary 2.3 and Remark 2.6] also extends at once to our setup, as we assume that the corresponding population covariance matrix $\Sigma_\psi$ or $\Sigma_\phi$, respectively, is invertible.

4 Simulations

We perform simulations to illustrate two important points. For our simulations we use the nested descriptors of Principal Nested Great Spheres (PNGS) analysis [14] and the intrinsic Fréchet mean [3]. In all tests and simulated quantiles we use B = 1000 bootstrap samples for each data set.

4.1 Differences Between Pooled and Separated Bootstrap

The first simulated example uses the nested mean and first geodesic principal component (GPC) to compare the two different bootstrapped quantiles with $T^2$-distribution quantiles, in order to illustrate the benefits provided by separated quantiles. The two data sets we use are concentrated along two great circle arcs on an $S^2$ which are perpendicular to each other. The data sets are normally distributed along these clearly different great circles with common nested mean and have sample sizes of 60 and 50 points, respectively, cf. Fig. 1a. We simulate 100 samples from the two distributions and compare the p-values for the different quantiles.

By design, we expect a roughly uniform distribution of p-values for the nested mean, indicating correct size of the test, and a clear rejection of the null for the first GPC, showing the power of the test. Both are satisfied for the separated quantiles and $T^2$-quantiles but not for the pooled quantiles, which lead to diminished power under the alternative, cf. Fig. 1c. Under closer inspection, Fig. 1b shows that separated quantile p-values are closer to $T^2$-quantile p-values than pooled quantile p-values, which are systematically higher due to the different covariance structures, rendering the test too conservative.

Fig. 1: Simulated data set I on $S^2$ (a) with correct size under the null hypothesis of equal nested means (b) and power under the alternative of different first GPCs (c). The red sample has 50 points, the blue 60 points; we use p-values for 100 simulations each. (Panels: (a) data set I, (b) p-values for nested mean, (c) p-values for first GPC.)

4.2 Nested Descriptors May Outperform Non-Nested Descriptors

The second point we highlight is that the nested mean of PNGS analysis is generically much closer to the data than the ordinary intrinsic mean and can thus, in specific situations, be more suitable to distinguish two populations. The same may also hold true for other nested estimators in comparison with their non-nested kin. The data set II considered here provides an example for such a situation. It consists of two samples of 300 and 100 points, respectively, on an $S^2$ with coinciding intrinsic mean but different nested mean.

Fig. 2: Simulated data set II (red: 100 points, blue: 300 points) on $S^2$ (left), and box plots displaying the distribution of 100 p-values for the PNGS nested mean and the intrinsic mean (right) from the two-sample test.

Here we only consider separated simulated quantiles, for both nested and intrinsic means. For the intrinsic mean two-sample test, we also use the bootstrap to estimate covariances for simplicity, as outlined by [3], although closed forms for variance estimates exist, cf. [9]. Data set II and the distribution of resulting p-values are displayed in Figure 2. These values are in perfect agreement with the intuition guiding the design of the data, showing that the nested mean is suited to distinguish the data sets where the intrinsic mean fails to do so.

References

1. Anderson, T.: Asymptotic theory for principal component analysis. Ann. Math. Statist. 34(1), 122–148 (1963)
2. Arcones, M.A., Giné, E.: On the bootstrap of M-estimators and other statistical functionals. In: LePage, R., Billard, L. (eds.) Exploring the Limits of Bootstrap, pp. 13–47. Wiley (1992)
3. Bhattacharya, R.N., Patrangenaru, V.: Large sample theory of intrinsic and extrinsic sample means on manifolds II. The Annals of Statistics 33(3), 1225–1259 (2005)
4. Billera, L., Holmes, S., Vogtmann, K.: Geometry of the space of phylogenetic trees. Advances in Applied Mathematics 27(4), 733–767 (2001)
5. Cheng, G.: Moment consistency of the exchangeably weighted bootstrap for semiparametric M-estimation. Scandinavian Journal of Statistics 42(3), 665–684 (2015)
6. Dryden, I.L., Mardia, K.V.: Statistical Shape Analysis. Wiley, Chichester (1998)
7. Fisher, N.I., Hall, P., Jing, B.Y., Wood, A.T.: Improved pivotal methods for constructing confidence regions with directional data. Journal of the American Statistical Association 91(435), 1062–1070 (1996)
8. Hendriks, H., Landsman, Z.: Asymptotic behaviour of sample mean location for manifolds. Statistics & Probability Letters 26, 169–178 (1996)
9. Huckemann, S., Hotz, T., Munk, A.: Intrinsic MANOVA for Riemannian manifolds with an application to Kendall's space of planar shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(4), 593–603 (2010)
10. Huckemann, S., Hotz, T., Munk, A.: Intrinsic shape analysis: Geodesic principal component analysis for Riemannian manifolds modulo Lie group actions (with discussion). Statistica Sinica 20(1), 1–100 (2010)
11. Huckemann, S.: Inference on 3D Procrustes means: Tree boles growth, rank-deficient diffusion tensors and perturbation models. Scandinavian Journal of Statistics 38(3), 424–446 (2011)
12. Huckemann, S.: Intrinsic inference on the mean geodesic of planar shapes and tree discrimination by leaf growth. The Annals of Statistics 39(2), 1098–1124 (2011)
13. Huckemann, S.F., Eltzner, B.: Backward nested descriptors asymptotics with inference on stem cell differentiation (2017), arXiv:1609.00814
14. Jung, S., Dryden, I.L., Marron, J.S.: Analysis of principal nested spheres. Biometrika 99(3), 551–568 (2012)
15. Mardia, K.V., Jupp, P.E.: Directional Statistics. Wiley, New York (2000)
16. Nye, T., Tang, X., Weyenberg, G., Yoshida, R.: Principal component analysis and the locus of the Fréchet mean in the space of phylogenetic trees. arXiv:1609.03045 (2016)
17. Pennec, X.: Barycentric subspace analysis on manifolds. arXiv preprint arXiv:1607.02833 (2016)
18. Schulz, J., Jung, S., Huckemann, S., Pierrynowski, M., Marron, J., Pizer, S.M.: Analysis of rotational deformations from directional data. Journal of Computational and Graphical Statistics 24(2), 539–560 (2015)