Geodesic Least Squares Regression on the Gaussian Manifold with an Application in Astrophysics
07/11/2017- Accès libre pour les ayants-droit
Résumé
Collection
- Accès libre pour les ayants-droit
Auteurs
Geert Verdoolaege |
Média
Métriques
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd"> <identifier identifierType="DOI">10.23723/17410/22528</identifier><creators><creator><creatorName>Geert Verdoolaege</creatorName></creator></creators><titles> <title>Geodesic Least Squares Regression on the Gaussian Manifold with an Application in Astrophysics</title></titles> <publisher>SEE</publisher> <publicationYear>2018</publicationYear> <resourceType resourceTypeGeneral="Text">Text</resourceType><subjects><subject>robust regression</subject><subject>geodesic least squares</subject><subject>Rao geodesic dis- tance</subject><subject>Tully-Fisher scaling</subject></subjects><dates> <date dateType="Created">Wed 7 Mar 2018</date> <date dateType="Updated">Wed 7 Mar 2018</date> <date dateType="Submitted">Wed 23 Jan 2019</date> </dates> <alternateIdentifiers> <alternateIdentifier alternateIdentifierType="bitstream">cf32775b8ad3c08d3a2d9d8817f5cd8f314aca48</alternateIdentifier> </alternateIdentifiers> <formats> <format>application/pdf</format> </formats> <version>37246</version> <descriptions> <description descriptionType="Abstract">We present a new regression method called geodesic least squares (GLS), which is particularly robust against data and model uncertainty. It is based on minimization of the Rao geodesic distance on a probabilistic manifold. We apply GLS to Tully-Fisher scaling of the total baryonic mass vs. the rotation velocity in disk galaxies and we show the excellent robustness properties of GLS for estimating the coefficients and the tightness of the scaling. </description> </descriptions> </resource>
Geodesic Least Squares Regression on the Gaussian Manifold with an Application in Astrophysics Geert Verdoolaege1,2 1 Department of Applied Physics, Ghent University, B-9000 Ghent, Belgium 2 Laboratory for Plasma Physics – Royal Military Academy (LPP–ERM/KMS), B-1000 Brussels, Belgium geert.verdoolaege@ugent.be Abstract. We present a new regression method called geodesic least squares (GLS), which is particularly robust against data and model un- certainty. It is based on minimization of the Rao geodesic distance on a probabilistic manifold. We apply GLS to Tully-Fisher scaling of the total baryonic mass vs. the rotation velocity in disk galaxies and we show the excellent robustness properties of GLS for estimating the coefficients and the tightness of the scaling. Keywords: robust regression, geodesic least squares, Rao geodesic dis- tance, Tully-Fisher scaling 1 Introduction Many natural phenomena can be described by means of scaling laws, often in the form of a power law, e.g. in astrophysics, fluid and plasma dynamics, biology, geology, climatology and finance. However, in many application fields relatively simple or outdated statistical techniques are frequently used to estimate power laws. In the vast majority of cases, ordinary least squares (OLS) is applied to estimate the exponents (coefficients) of the power law on a logarithmic scale, despite its often poor performance in all but the simplest regression problems. Indeed, in more realistic settings, particularly when the goal is extrapolation of the scaling law, robustness is at least as important a quality compared to goodness-of-fit. This can become an issue in the presence of model uncertainty, heterogeneous data, atypical measurements (outliers) and skewed likelihoods [1]. Astrophysical data are often relatively complex from the statistical per- spective and it has long been recognized that various assumptions of ordinary least squares regression are not valid in many applications in the field. Accor- dingly, several techniques from the domains of frequentist statistics and Bayesian probability theory have been applied to address the shortcomings of OLS. Howe- ver, presently most techniques are designed to address one or a few shortcomings of OLS, but not all. In addition, judicious application of these techniques may require considerable expertise from the practitioner in statistics or probability theory, which can be an issue in various physics-centered application fields. Pre- sently, in many application domains there is a need for a robust general-purpose regression technique for estimating scaling laws. For these reasons we have developed a new, robust regression method that is simple to implement, called geodesic least squares regression (GLS). It is based on minimization of the Rao geodesic distance between, on the one hand, the pro- bability distribution of the response variable predicted by the regression model, and, on the other hand, a more data-driven distribution model of the response variable. GLS has recently been tested and applied in the field of magnetic con- finement fusion [2, 3], showing its enhanced robustness over various traditional methods. In this contribution, we apply GLS regression to estimate a key scaling law in astrophysics: the baryonic Tully-Fisher relation. This is a remarkably tight relation between the total baryonic mass of disk galaxies and their rotational velocity, of great practical and theoretical significance in astrophysics and cos- mology. 2 Geodesic least squares regression 2.1 Principles of GLS We here provide a brief overview of the GLS regression method. A more detailed description can be found in [1]. Implicitly, GLS performs regression on a pro- babilistic manifold characterized by the Fisher information. However, it is not directly based on a manifold regression technique like geodesic regression [4], where the relation between a manifold-valued response variable and a scalar predictor variable is modeled as a geodesic curve on the manifold. Rather, the idea behind GLS is to consider two different proposals for the distribution of a real-valued response variable y, conditional on the real-valued predictor varia- bles, all of which can be affected by uncertainty. On the one hand, there is the distribution that one would expect if all assumptions were correct regarding the deterministic component of the regression model (regression function) and the stochastic component. We call this the modeled distribution. On the other hand, we try to capture the distribution of y by relying as little as possible on the mo- del assumptions, and much more on the actual measurements of y. For this we will use the term observed distribution. In this sense, GLS is similar to minimum distance estimation (MDE), where the Hellinger distance is a popular similarity measure [5], but there are several differences. First and foremost, GLS calcula- tes the geodesic distance between each individual pair of modeled and observed distributions of the response variable. This often corresponds to an individual measurement point, together with an estimate of its error bar, provided by the experimentalist. The error bar estimate may have been obtained from previous experiments, or from a time series obtained at fixed (or stationary) values of the predictor variables. As such, each single data point is replaced by a probability density function describing the distribution of the response variable under fixed measured values of the predictor variables. In contrast, MDE usually considers a distance between a kernel density estimate of the distribution of residuals on the one hand, and the parametric model on the other hand, but based on the entire data sample. Secondly, we explicitly model all parameters of the modeled distribution, similar to the idea behind the link function in the generalized linear model. In the present work this will be accomplished by explicitly modeling both the mean and standard deviation of the Gaussian modeled distribution. A final difference is that we use the Rao geodesic distance as a similarity measure. 2.2 The GLS algorithm We start from a parametric multiple regression model between m predictor va- riables ξj (j = 1, . . . , m) and a single response variable η, all assumed to be infinitely precise. For n realizations of these variables, the regression model can be written as follows: ηi = f(ξi1, . . . , ξim, β1, . . . , βp) ≡ f({ξij}, {βk}), ∀i = 1, . . . , n. (1) Here, f is the regression model function, in general nonlinear and characterized by p parameters βk (k = 1, . . . , p). In regression analysis within the astronomy community, it is customary to add a noise variable to the idealized relation (1). This so-called intrinsic scatter serves to model the intrinsic uncertainty on the theoretical relation, i.e. uncertainty not related to the measurement process. We take another route for capturing model uncertainty, however. In any realistic situation, we have no access to the quantities ξij and ηi. Instead, a series of noisy measurements xij, resp. yi is acquired for the predictor and response variables: yi = ηi + y,i, y,i ∼ N 0, σ2 y,i , xij = ξij + x,ij, x,ij ∼ N 0, σ2 x,ij . We have assumed independent Gaussian noise, but this can be generalized to any distribution. Also, in general the standard deviations are different for each point. For instance, in many real-world situations, such as the one discussed in this paper, there is a constant relative error on the measurements, so the standard deviation can be modeled to be proportional to the measurement itself. Under this model, the distribution of the variable y, conditional on measured values xij of the m predictor variables (fixed i), as well as the parameters βk, is given by pmod(y|{xij}, {βk}) = 1 √ 2πσmod,i exp − 1 2 h yi − f {xij}, {βk} i2 σ2 mod,i . (2) This is the modeled distribution, where we suppose that estimates of the stan- dard deviations σx,ij and σy,i are available. The uncertainty on the predictor variables propagates through the function f and adds to the conditional uncer- tainty on the response variable, determined by σmod,i. We use standard Gaussian error propagation theory as a practical solution for this purpose. For example, referring to f {xij}, {βk} as the modeled mean µmod,i, for a linear model we have (with relabeled βk): µmod,i ≡ β0 + β1xi1 + . . . + βmxim, σ2 mod,i ≡ σ2 y,i + β2 1σ2 x,i1 + . . . + β2 mσ2 x,im. Relying on the maximum likelihood method, one would proceed to estimate the parameters βk by maximizing (2), or, under the assumption of symmetry of the likelihood distribution and homoscedasticity, by minimizing the sum of squared differences (Euclidean distances) between each measured yi and pre- dicted µmod,i. However, this assumes that the model is exact, specifically that σmod,i is the only source of data variability. In order to take into account ad- ditional uncertainty sources, in particular model uncertainty, we therefore also consider the observed distribution of y, relying on as few assumptions as possible regarding the regression model. Specifically, we replace each data point yi by a distribution pobs(y|yi). In the context of the GLM, this is known as the saturated model. In the present application, we choose again the normal distribution, but centered on each data point: N yi, σ2 obs,i , where σobs,i is to be estimated from the data. The extra parameters σobs,i give the method added flexibility, since they are not a priori required to equal σmod,i. As a result, GLS is less sensitive to incorrect model assumptions. Choosing a Gaussian form for both the modeled and observed distribution offers a computational advantage, since the correspon- ding expression for the GD has a closed form [6]. Also, in principle, σobs,i can be different for each point, although in practice it is clear that we will need to introduce some sort of regularization to render the model identifiable. In this paper we either assume σobs,i a constant sobs, or proportional to the response variable, σobs,i = robs|ȳi|. The parameters sobs or robs have to be estimated from the data. More complicated (parametrized) relations between σobs,i and the re- sponse variable or other data would be possible too, but one should be careful not to put too many restrictions on pobs, thereby defeating its purpose. GLS now proceeds by minimizing the total GD between, on the one hand, the joint observed distribution of the n realizations of the variable y and, on the other hand, the joint modeled distribution. Owing to the independence assumption in this example, we can write this in terms of products of the corresponding marginal distributions (including all dependencies and with γobs either sobs or robs): n β̂k, γ̂obs o = argmin βk,γobs∈R GD2 " n Y i=1 pobs (y|yi, γobs) , n Y i=1 pmod y|{xij}, {βk}, σyi , {σxij } # = argmin βk,γobs∈R n X n=1 GD2 h pobs (y|yi, γobs) , pmod y|{xij}, {βk}, σyi , {σxij } i . (3) Note that the parameters βk occur both in the mean and the variance of the mo- deled distribution. The last equality in (3) entails a considerable simplification, owing to the property that the squared GD between products of distributions can be written as the sum of squared GDs between the corresponding factors [6]. Hence, the optimization procedure involves, on the level of each measurement, matching not only yi with µmod,i, but also σobs,i with σmod,i, in a way dictated by the geometry of the likelihood distribution. As will be shown in the experi- ments, the result is that GLS is relatively insensitive to uncertainties in both the stochastic and deterministic components of the regression model. The same quality renders the method also robust against outliers. In the experiments below, we employed a classic active-set algorithm to carry out the optimization [7]. Furthermore, presently the GLS method does not di- rectly offer confidence (or credible) intervals on the estimated quantities. Future work will address this issue in more detail, but for now error estimates were de- rived by a bootstrap procedure. The bootstrapping involved creating, from the measured data set, 100 artificial data sets of the same size, by resampling with replacement. The regression analysis was then carried out on each of the data sets and the mean and standard deviation, over all data sets, of each estima- ted regression parameter were used as estimates of the parameter and its error bar, respectively. This scheme typically results in rather conservative error bars, which could possibly be narrowed down using more sophisticated methods. Incidentally, forcing σobs,i ≡ σmod,i in (3), ∀ i, would take us back to standard maximum likelihood estimation, since the Rao GD between two Gaussians p1 and p2 with means yi, resp. f {xij}, {βk} , but with identical standard deviations σi (fixed along the geodesic path), is precisely the Mahalanobis distance [8]: GD(p1, p2) = yi − f {xij}, {βk} σi . (a) (b) Fig. 1: Baryonic mass Mb vs. rotation velocity Vf for 47 gas-rich galaxies and the fitted BTFR using OLS, Bayesian inference and GLS. (a) On the logarithmic scale and (b) on the original scale. 3 Application of GLS to Tully-Fisher scaling 3.1 The baryonic Tully-Fisher relation The baryonic Tully-Fisher relation (BTFR) between the total (stellar + gase- ous) baryonic mass Mb of disk galaxies and their rotational velocity Vf is of fundamental importance in astrophysics and cosmology [9]. It is a remarkably simple and tight empirical relation of the form Mb = β0V β1 f . (4) The BTFR serves as a tool for determining cosmic distances, provides constraints on galaxy formation and evolution models, and serves as a test for the Lambda cold dark matter paradigm (ΛCDM) in cosmology. In this scaling problem, we use data from 47 gas-rich galaxies, as plotted in Figure 1 and detailed in [9]. The data also contain estimates of the observational errors, which we treat here as a single standard deviation. Figure 2 shows a scatter plot of σmod,i, which is almost entirely determined by σMb , vs. Mb for the 47 galaxies in the database. This suggests a measurement error on the response variable proportional to Mb, about 38%, i.e. a constant error bar on the logarithmic scale. Table 1: Regression estimates for the BTFR parameters using loglinear and nonlinear OLS, Bayesian inference and GLS. Loglinear β̂0 β̂1 Nonlinear β̂0 β̂1 OLS 310 3.56 OLS 0.063 5.37 Bayes 160 3.72 Bayes 91 3.80 GLS 110 3.81 GLS 79 3.83 3.2 Regression analysis Owing to the power law character of most scaling laws, they are often estimated by linear regression on a logarithmic scale. However, it is known that this may lead to unreliable estimates, as the logarithm (heavily) distorts the distribution of the data [1]. This is in particular the case when the estimation is done using simple OLS or when there are outliers in the data. In contrast, we will show that GLS regression produces consistent results on both the logarithmic and original scales, demonstrating its robustness. In view of the proportional error on Mb, the observed standard deviation in GLS is modeled here as σobs,i = robsMb, with robs an unknown scale factor to be estimated from the data using the optimization routine. We compare the results of GLS regression with OLS and a standard Bayesian method. For the latter we choose the likelihood given in (2), allowing uncertainty on the standard deviation through a scale factor with a Jeffreys prior. We also use uninformative prior distributions for the regression parameters. The scalings obtained using the various methods are shown in Figure 1a for the case of linear regression on the logarithmic scale, and in Figure 1b for power- law regression on the original scale. The coefficient estimates are given in Table 1. It is clear that GLS yields estimates that are much more consistent compared to OLS. In particular, whereas the data point corresponding to the largest Vf and Mb does not have the characteristics of an outlier on the logarithmic scale, it may be considered as such on the original scale. The nonlinear OLS estimate for the exponent β1 is heavily influenced by this point, causing the discrepancy with the estimate on the logarithmic scale. Furthermore, the results of GLS are verified by the Bayesian method. Next, 100 bootstrap samples were created from the data, yielding average parameter estimates and 95% confidence intervals on the basis of the OLS and GLS results, shown in Table 2. Again, the enhanced robustness of GLS compared to OLS stands out. Finally, Figure 2 shows the plot of σobs = robsMb vs. Mb, with the scale factor robs (observed relative error) amounting to 63%. This is considerably larger than the value of 38% predicted by the model, possibly indicating that the scatter on the scaling law is not due to measurement error alone. This will be an important area of further investigation, as it may provide evidence for the ΛCDM vs. MOND cosmological models. Table 2: Average regression estimates and 95% confidence intervals for the BTFR using loglinear and nonlinear OLS, Bayesian inference and GLS, obtained from 100 bootstrap samples. Loglinear β̂0 β̂1 OLS 360 ± 220 3.57 ± 0.15 Bayes 220 ± 220 3.72 ± 0.19 GLS 140 ± 82 3.80 ± 0.16 Nonlinear β̂0 β̂1 OLS (3.6 ± 6.2) × 103 4.56 ± 1.19 Bayes 130 ± 160 3.80 ± 0.21 GLS 390 ± 280 3.85 ± 0.18 Fig. 2: Plot of σMb (≈ σmod) and robsMb (= σobs) vs. Mb, as estimated by GLS. 4 Conclusion We have introduced geodesic least squares, a versatile and robust regression met- hod based on regression between probability distributions. Part of the strength of the method is its simplicity, allowing straightforward application by users in various application fields, without the need for parameter tuning. We have applied GLS to baryonic Tully-Fisher scaling, thereby demonstrating the robus- tness of the method and providing an alternative means for testing cosmological models based on the estimated intrinsic scatter. References 1. G. Verdoolaege. A new robust regression method based on minimization of geo- desic distances on a probabilistic manifold: Application to power laws. Entropy, 17(7):4602–4626, 2015. 2. G. Verdoolaege and J.-M. Noterdaeme. Robust scaling in fusion science: case study for the L-H power threshold. Nucl. Fusion, 55(11):113019 (19 pp.), 2015. 3. G. Verdoolaege, A. Shabbir, and G. Hornung. Robust analysis of trends in noisy to- kamak confinement data using geodesic least squares regression. Rev. Sci. Instrum., 87(11):11D422 (3 pp.), 2016. 4. P.T. Fletcher. Geodesic regression and the theory of least squares on Riemannian manifolds. Int. J. Comput. Vis., 105(2):171–185, 2013. 5. R.J. Pak. Minimum Hellinger distance estimation in simple regression models; dis- tribution and efficiency. Stat. Probab. Lett., 26(3):263–269, 1996. 6. J. Burbea and C.R. Rao. Entropy differential metric, distance and divergence mea- sures in probability spaces: a unified approach. J. Multivariate Anal., 12(4):575–596, 1982. 7. P.E. Gill, W. Murray, and M.H. Wright. Numerical linear algebra and optimization, Vol. 1. Addison Wesley, Boston, MA, 1991. 8. C.R. Rao. Differential metrics in probability spaces. In Differential Geometry in Statistical Inference. Institute of Mathematical Statistics, Hayward, CA, 1987. 9. S.S. McGaugh. The baryonic Tully-Fisher relation of gas-rich galaxies as a test of ΛCDM and MOND. Astron. J., 143(2):40 (15 pp.), 2012.