Information Geometry of Predictor Functions in a Regression Model

07/11/2017
Publication GSI2017
OAI : oai:www.see.asso.fr:17410:22635
 

Abstract

We discuss an information-geometric framework for a regression model, in which the regression function is accompanied by the predictor function and the conditional density function. We introduce the e-geodesic and m-geodesic on the space of all predictor functions; this pair leads to the Pythagorean identity for a right triangle spanned by the two geodesics. Further, a statistical modeling that combines predictor functions in a nonlinear fashion is discussed via generalized averages, and in particular we observe the flexible behavior of the log-exp average.


Collection

Information Geometry of Predictor Functions in a Regression Model (slides, PDF)
Information Geometry of Predictor Functions in a Regression Model (paper, PDF), Shinto Eguchi and Katsuhiro Omae



License

Creative Commons: none (all rights reserved)

Article details: DOI 10.23723/17410/22635, published by SEE, 2018 (application/pdf).

Information Geometry of Predictor Functions in a Regression Model

Shinto Eguchi (1,2) and Katsuhiro Omae (2)

1 The Institute of Statistical Mathematics, Tachikawa, Tokyo 190-8562, Japan
2 Department of Statistical Science, The Graduate University for Advanced Studies, Tachikawa, Tokyo 190-8562, Japan

Abstract. We discuss an information-geometric framework for a regression model, in which the regression function is accompanied by the predictor function and the conditional density function. We introduce the e-geodesic and m-geodesic on the space of all predictor functions; this pair leads to the Pythagorean identity for a right triangle spanned by the two geodesics. Further, a statistical modeling that combines predictor functions in a nonlinear fashion is discussed via generalized averages, and in particular we observe the flexible behavior of the log-exp average.

1 Introduction

We discuss a framework of information geometry for a regression model that is compatible with the framework for a probability density model. It consists of three spaces, those of the regression, predictor and conditional density functions, which are connected in one-to-one correspondence. In this way, the framework of regression analysis is more involved than that of a density function model alone. The key to building the new framework is to keep compatibility with the information geometry established for density function models, in which the dualistic pair of e-geodesics and m-geodesics plays a central role under the information metric, cf. [1].

Let X be a p-dimensional explanatory vector and Y a response variable; our interest is focused on the association of X with Y. We write the regression function as

    µ(x) = E(Y | X = x).    (1)

A major goal of regression analysis is inference on the regression function, which is described by the conditional density function; the predictor function models the conditional density function. We adopt the formulation of the generalized linear model (GLM) to state more precisely the relation among the regression, predictor and conditional density functions. The GLM is a standard model for statistical regression analysis, giving a comprehensive unification of the Gaussian, Poisson, Gamma, logistic and other regression models, cf. [4].

We begin with a predictor function f(x) rather than the regression function µ(x), because µ(x) is sometimes confined to a finite range, as in binary regression, which makes it difficult to model µ(x) directly by a parametric form such as a linear model. Let F be the space of all predictor functions f(x) satisfying mild smoothness conditions. In the formulation of the GLM, f(x) and µ(x) are connected by a one-to-one function ℓ, called the mean link function, such that µ(x) = ℓ(f(x)). The conditional density function of Y given X = x is assumed to be

    p(y | x, f) = exp{ y ϑ(f(x)) − ψ(ϑ(f(x))) },    (2)

where f ∈ F and ϑ(f) = (∂ψ/∂θ)^{−1}(ℓ(f)) is called the canonical link function. Note that the canonical and mean parameters θ and µ are connected by µ = (∂ψ/∂θ)(θ), a basic property of the exponential model (2). Typically, if (2) is a Bernoulli distribution, then a logistic model results, with µ(x) = 1/(1 + exp{−f(x)}), or equivalently f(x) = log{µ(x)/(1 − µ(x))}. In the standard framework we write a linear predictor function as

    f(x) = β^⊤ x + α    (3)

with slope vector β and intercept α, in which case the conditional density function reduces to a parametric model. Thus, the logistic model is written as

    E(Y | X = x, α, β) = 1 / (1 + exp{−(β^⊤ x + α)})    (4)

via the mean link function.
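As a concrete illustration of the Bernoulli case of (2), consider the following minimal Python sketch (ours, not part of the paper; the function names psi, mean_link, canonical_link and linear_predictor are hypothetical). It implements the log-partition ψ, the mean link ℓ and the canonical link ϑ for logistic regression, where the canonical link reduces to the identity, ϑ(f) = f.

```python
import numpy as np

def psi(theta):
    # Log-partition function of the Bernoulli model: psi(theta) = log(1 + e^theta),
    # computed stably as logaddexp(0, theta).
    return np.logaddexp(0.0, theta)

def mean_link(f):
    # l(f): predictor -> regression function mu(x); here the logistic function.
    return 1.0 / (1.0 + np.exp(-f))

def canonical_link(f):
    # theta(f) = (psi')^{-1}(l(f)); with the logit mean link this equals f
    # up to floating-point error.
    mu = mean_link(f)
    return np.log(mu / (1.0 - mu))

def linear_predictor(x, beta, alpha):
    # The standard linear predictor (3): f(x) = beta^T x + alpha.
    return x @ beta + alpha

f = linear_predictor(np.array([0.5, -1.0]), np.array([2.0, 1.0]), 0.3)
print(mean_link(f), canonical_link(f), psi(canonical_link(f)))
```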
In practice, the model density function (2) is often augmented with a dispersion parameter in order to fit data with overdispersion reasonably; we omit this for notational simplicity. Based on this formulation, the GLM has been extended to the hierarchical Bayesian GLM and the generalized additive model; see [4] for recent developments.

If we do not assume the parametric form (3), the conditional density function lies in the semiparametric model

    P = { p(y | x, f) : f ∈ F },    (5)

where f is a nonparametric component of the exponential density model. We give an information-geometric framework for F in association with this semiparametric model P. In practice, we would like to explore more flexible forms than the linear predictor function (3). Let f_0 and f_1 be in F and let ϕ be a strictly increasing function defined on ℝ. We introduce the one-parameter family

    f_t^{(ϕ)}(x) = ϕ^{−1}((1 − t)ϕ(f_0(x)) + t ϕ(f_1(x)))    (6)

for t ∈ [0, 1], and call f_t^{(ϕ)} the ϕ-geodesic connecting f_0 and f_1. Note that if f_0 and f_1 are both linear predictor functions as in (3) and ϕ is the identity function, ϕ(f) = f, then f_t^{(ϕ)}(x) is also a linear predictor function, because

    f_t^{(ϕ)}(x) = {(1 − t)β_0 + t β_1}^⊤ x + (1 − t)α_0 + t α_1

for f_a(x) = β_a^⊤ x + α_a with a = 0, 1. The ϕ-geodesic induces a one-parameter density family p(y | x, f_t^{(ϕ)}) in the space P, connecting p(y | x, f_0) with p(y | x, f_1). Specifically, if we take the canonical link function ϑ(f) as ϕ(f), then

    f_t^{(ϑ)}(x) = ϑ^{−1}((1 − t)ϑ(f_0(x)) + t ϑ(f_1(x))),    (7)

which induces the conditional density function

    p(y | x, f_t^{(ϑ)}) = exp{ y θ_t(x) − ψ(θ_t(x)) },    (8)

where θ_t(x) = (1 − t)ϑ(f_0(x)) + t ϑ(f_1(x)). This is nothing but the e-geodesic connecting p(y | x, f_0) and p(y | x, f_1) in P, since

    p(y | x, f_t^{(ϑ)}) = z_t p_0(y | x)^{1−t} p_1(y | x)^t,    (9)

where z_t is a normalizing factor and p_a(y | x) = p(y | x, f_a) for a = 0, 1. Returning to the standard case of the GLM with the canonical link function, we have the linear model θ(x) = β^⊤ x + α. If f_0(x) and f_1(x) are in the linear model, then f_t^{(ϑ)}(x) is also in the linear model, with

    θ_t(x) = {(1 − t)β_0 + t β_1}^⊤ x + (1 − t)α_0 + t α_1

for ϑ(f_a(x)) = β_a^⊤ x + α_a (a = 0, 1).

[Fig. 1. The e-geodesic and m-geodesic in F are injectively connected with the e-geodesic in P and the m-geodesic in R, respectively.]

Alternatively, we can consider a connection between F and the space of regression functions, say R. If we take the mean link function ℓ(f) as ϕ(f), then

    f_t^{(ℓ)}(x) = ℓ^{−1}((1 − t)ℓ(f_0(x)) + t ℓ(f_1(x))),    (10)

which leads to the mixture geodesic

    µ_t^{(ℓ)}(x) = (1 − t)µ_0(x) + t µ_1(x)    (11)

in R, where µ_a(x) = ℓ(f_a(x)) for a = 0, 1. Here we implicitly assume that µ_t^{(ℓ)}(x) belongs to R for any f_0(x) and f_1(x) of F. Through the discussion above, the canonical link function induces the exponential geodesic in P, and the mean link function induces the mixture geodesic in R; see Figure 1. Henceforth, we refer to (7) and (10) as the e-geodesic and the m-geodesic in F, respectively.
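The following sketch (our illustration, again assuming the Bernoulli/logistic model, so that ϑ is the identity and ℓ is the logistic function; the names e_geodesic and m_geodesic are ours) computes the e-geodesic (7) and the m-geodesic (10) between two predictor values, making the difference between the two interpolations concrete.

```python
import numpy as np

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))

def logit(mu):
    return np.log(mu / (1.0 - mu))

def e_geodesic(f0, f1, t):
    # f_t^{(theta)} of (7); with the canonical (identity) link of the logistic
    # model, the e-geodesic is linear interpolation on the predictor scale.
    return (1.0 - t) * f0 + t * f1

def m_geodesic(f0, f1, t):
    # f_t^{(l)} of (10): interpolate on the mean (regression-function) scale,
    # i.e. the mixture geodesic (11) pulled back to predictor space.
    return logit((1.0 - t) * sigmoid(f0) + t * sigmoid(f1))

f0, f1 = np.array([-2.0, 0.5]), np.array([1.0, 3.0])
for t in (0.0, 0.5, 1.0):
    print(t, e_geodesic(f0, f1, t), m_geodesic(f0, f1, t))
# The two curves share endpoints (t = 0, 1) but differ for 0 < t < 1.
```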
We next discuss a triangle associated with three points in F. Let C_1 = {f_t : t ∈ [0, 1]} and C_2 = {g_s : s ∈ [0, 1]} be curves intersecting at (s, t) = (1, 1), so that g_1 = f_1; call this common point f. Then C_1 and C_2 are said to intersect orthogonally at f if

    E{ (∂/∂t) ℓ(f_t(X)) (∂/∂s) ϑ(g_s(X)) } |_{(s,t)=(1,1)} = 0.    (12)

The orthogonality is induced from that for curves of density functions defined by the information metric; see [1]. We would like to consider a divergence measure between f and g in F, so we define

    D_KL(f, g) = E{ D̃_KL(p(·|X, f), p(·|X, g)) },    (13)

where D̃_KL is the Kullback-Leibler divergence in P. Thus we can write

    D_KL(f, g) = E[ ℓ(f(X)){ϑ(f(X)) − ϑ(g(X))} − ψ(ϑ(f(X))) + ψ(ϑ(g(X))) ].

Hence we can discuss the Pythagorean identity on F as in the space of density functions.

Proposition 1. Let f, g and h be in F. Consider two curves: the m-geodesic connecting f and g,

    C^{(m)} = { f_t^{(ℓ)}(x) := ℓ^{−1}((1 − t)ℓ(f(x)) + t ℓ(g(x))) : t ∈ [0, 1] },    (14)

and the e-geodesic connecting h and g,

    C^{(e)} = { h_s^{(ϑ)}(x) := ϑ^{−1}((1 − s)ϑ(h(x)) + s ϑ(g(x))) : s ∈ [0, 1] }.    (15)

Then the triangle with vertices f, g and h satisfies the Pythagorean identity

    D_KL(f, g) + D_KL(g, h) = D_KL(f, h)    (16)

if and only if the curves defined in (14) and (15) intersect orthogonally at g.

The proof follows readily from the fact that

    E{ (∂/∂t) ℓ(f_t^{(ℓ)}(X)) (∂/∂s) ϑ(h_s^{(ϑ)}(X)) } |_{s=t=1} = D_KL(f, h) − {D_KL(f, g) + D_KL(g, h)}.

We note that the orthogonality implies the further identity

    D_KL(f_t^{(ℓ)}, g) + D_KL(g, h_s^{(ϑ)}) = D_KL(f_t^{(ℓ)}, h_s^{(ϑ)})    (17)

for any (t, s) ∈ [0, 1]^2.

In accordance with Proposition 1, the information geometry for the predictor space F is induced from that for the density space P by way of that for the regression space R, where the distributional assumption (2) plays a central role in the induction from P and R to F. If the regression function (1) is degenerate, that is, constant in x, then every predictor function must be constant as well, so both R and F are singletons and P is just a one-parameter exponential family such as the Gaussian, Bernoulli or Poisson family. In effect, R should be a sufficiently rich space to cover a flexible association of the explanatory vector X with the response variable Y.

2 Log-exp means

We have discussed the geometric property associated with the pair of the e-geodesic (7) and the m-geodesic (10) as specific ϕ-geodesics in F. There is another potential extension of ϕ-geodesics which enables flexible modeling of nonlinear predictor functions in F. Let f_k(x) be a predictor function for k = 1, …, K. We propose the generalized mean

    f_{τ,π}^{(ϕ)}(x) = (1/τ) ϕ^{−1}( Σ_{k=1}^{K} π_k ϕ(τ f_k(x)) )    (18)

for a generator function ϕ assumed to be strictly increasing, where τ is a shape parameter and the π_k are the proportions of the k-th predictor, so that Σ_{k=1}^{K} π_k = 1. Two properties of f_{τ,π}^{(ϕ)}(x) are as follows.

Proposition 2. Let f_{τ,π}^{(ϕ)}(x) be the generalized mean defined in (18). Then

    min_{1≤k≤K} f_k(x) ≤ f_{τ,π}^{(ϕ)}(x) ≤ max_{1≤k≤K} f_k(x)    (19)

for any τ, and

    lim_{τ→0} f_{τ,π}^{(ϕ)}(x) = Σ_{k=1}^{K} π_k f_k(x).

We focus on a typical example of the ϕ-mean other than those underlying the e-geodesic and m-geodesic, taking ϕ = exp, so that

    f_{τ,π}^{(exp)}(x) = (1/τ) log{ Σ_{k=1}^{K} π_k exp(τ f_k(x)) },    (20)

which we call the log-exp mean, cf. the geometric mean for density functions discussed in [3]. Its behavior with respect to τ is as follows:

    lim_{τ→−∞} f_{τ,π}^{(exp)}(x) = min_{1≤k≤K} f_k(x),  lim_{τ→∞} f_{τ,π}^{(exp)}(x) = max_{1≤k≤K} f_k(x).    (21)

This shows that the combined predictor function f_{τ,π}^{(exp)}(x) attains the two bounds in (19) as τ goes to −∞ or ∞. In this way f_{τ,π}^{(exp)}(x) can express flexible prediction behavior via an appropriate selection of the tuning parameter τ.
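A short numerical sketch (ours, assuming numpy and scipy are available) of the log-exp mean (20) illustrates Proposition 2 and the limits (21): large |τ| pushes the mean toward the max or min of the component predictors, while τ near 0 recovers the weighted average.

```python
import numpy as np
from scipy.special import logsumexp

def log_exp_mean(f_vals, pi, tau):
    # f^{(exp)}_{tau,pi}(x) = (1/tau) log sum_k pi_k exp(tau f_k(x)),  eq. (20);
    # logsumexp with weights b=pi keeps the computation numerically stable.
    return logsumexp(tau * f_vals, b=pi) / tau

f_vals = np.array([-1.0, 0.2, 2.5])   # K = 3 predictor values at a fixed x
pi = np.array([0.5, 0.3, 0.2])        # proportions, summing to 1

for tau in (-50.0, -1.0, 1e-6, 1.0, 50.0):
    print(tau, log_exp_mean(f_vals, pi, tau))
# tau = -50 -> close to min_k f_k = -1.0
# tau -> 0  -> close to the weighted average sum_k pi_k f_k = 0.06
# tau = +50 -> close to max_k f_k = 2.5
```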
3 Statistical model of log-exp means

We discuss a practical application of ϕ-geodesics, focusing on the log-exp mean. Typical candidates for the K predictors are linear predictor functions; however, such a combination can suffer from identifiability problems, as in a Gaussian mixture model. Further, it is then difficult to obtain a reasonable interpretation of the association between X and Y, because we obtain K different slope vectors and intercepts for the explanatory vector. This problem is closely related to that of the multilayer perceptron, in which causal interpretation is confounded by the many links among the multilayer components. As a solution, we discuss a parsimonious model in the following subsection.

3.1 Parsimonious model

We consider a parsimonious model for combining linear predictor functions that keeps the model identifiable, as follows. Assume that the explanatory vector x is partitioned into sub-vectors x_(k), k = 1, …, K, in an unsupervised-learning manner, for example by K-means on an empirical dataset, so that

    x = (x_(1), …, x_(K)).    (22)

We then employ f_k(x) = β_(k)^⊤ x_(k) + α for 1 ≤ k ≤ K, so that

    f_τ(x, β, π, α) = (1/τ) log[ Σ_{k=1}^{K} π_k exp{τ(β_(k)^⊤ x_(k) + α)} ],    (23)

where β = (β_(1), …, β_(K)); see [7] for a detailed discussion. We note that α is the threshold of the integrated predictor, since

    f_τ(x, β, π, α) = f_τ(x, β, π, 0) + α,    (24)

cf. [6, 5] for the log-sum-exp trick. We retain an interpretation similar to the linear model (3), since

    Σ_{k=1}^{K} β_(k)^⊤ x_(k) = β^⊤ x.    (25)

Thus the model (23) connects K linear predictor functions in a flexibly nonlinear manner, including the linear combination discussed in Proposition 2. In effect, we can treat the proportions π_k as known, because they are estimated in preprocessing by unsupervised learning such as K-means; the dimension of the model (23) is then exactly equal to that of the linear predictor model (3). Subsequently, we adopt the alternative parametrization

    f_τ(x, β, γ) = (1/τ) log[ Σ_{k=1}^{K} exp{τ(β_(k)^⊤ x_(k) + γ_k)} ],    (26)

where γ_k = α + τ^{−1} log π_k.

We investigate the local behavior of (26). The gradient vectors are

    (∂/∂β_(k)) f_τ(x, β, γ) = w_k^τ(x, β, γ) x_(k),    (27)
    (∂/∂γ_k) f_τ(x, β, γ) = w_k^τ(x, β, γ),    (28)

where

    w_k^τ(x, β, γ) = exp{τ(β_(k)^⊤ x_(k) + γ_k)} / Σ_{ℓ=1}^{K} exp{τ(β_(ℓ)^⊤ x_(ℓ) + γ_ℓ)}.    (29)

We remark that

    lim_{τ→∞} w_k^τ(x, β, γ) = 1 if k = k_max, and 0 otherwise,    (30)

where k_max = argmax_{1≤k≤K} {β_(k)^⊤ x_(k) + γ_k}. On the other hand, as τ goes to −∞, w_k^τ(x, β, γ) converges to the weight vector degenerate at the index minimizing β_(k)^⊤ x_(k) + γ_k over k. As τ tends to 0, w_k^τ(x, β, γ) converges to the weight π_k.

We now incorporate the model (26) into the generalized linear model (2). Let D = {(x_i, y_i) : 1 ≤ i ≤ n} be a data set. The log-likelihood function is written as

    L_D(β, γ) = Σ_{i=1}^{n} { y_i ϑ(f_τ(x_i, β, γ)) − ψ(ϑ(f_τ(x_i, β, γ))) }.    (31)

The gradient vector is given in weighted form as

    (∂/∂β_(k)) L_D(β, γ) = Σ_{i=1}^{n} w̃_k^τ(x_i, β, γ) x_{i(k)} { y_i − ℓ(f_τ(x_i, β, γ)) },    (32)
    (∂/∂γ_k) L_D(β, γ) = Σ_{i=1}^{n} w̃_k^τ(x_i, β, γ) { y_i − ℓ(f_τ(x_i, β, γ)) },    (33)

where

    w̃_k^τ(x_i, β, γ) = ℓ′(f_τ(x_i, β, γ)) w_k^τ(x_i, β, γ) / Var(ϑ(f_τ(x_i, β, γ)))    (34)

with Var(θ) = (∂²/∂θ²)ψ(θ). Hence the MLE of (β, γ) can be obtained by a gradient-type algorithm or by Fisher scoring in a straightforward manner.
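To sketch how these gradients drive estimation, the following code (our illustration for the logistic case only, with hypothetical helper names softmax_weights, predictor and grad_step) evaluates the weights (29) as a softmax and takes one gradient-ascent step on the log-likelihood (31). With the canonical link, ℓ′(f) = Var(ϑ(f)), so the weight w̃_k^τ in (34) reduces to w_k^τ.

```python
import numpy as np

def softmax_weights(x_parts, betas, gamma, tau):
    # w_k^tau of (29): a softmax over the K scores tau*(beta_(k)' x_(k) + gamma_k).
    scores = tau * (np.array([b @ xk for b, xk in zip(betas, x_parts)]) + gamma)
    scores -= scores.max()            # stabilize before exponentiating
    w = np.exp(scores)
    return w / w.sum()

def predictor(x_parts, betas, gamma, tau):
    # f_tau(x, beta, gamma) of (26), computed via a stable log-sum-exp.
    scores = tau * (np.array([b @ xk for b, xk in zip(betas, x_parts)]) + gamma)
    m = scores.max()
    return (m + np.log(np.exp(scores - m).sum())) / tau

def grad_step(data, betas, gamma, tau, lr=0.1):
    # One ascent step on (31) for the logistic case: theta(f) = f, l = sigmoid,
    # so the per-sample contribution is (y_i - mu_i) * w_k * x_(k), cf. (32)-(33).
    g_betas = [np.zeros_like(b) for b in betas]
    g_gamma = np.zeros_like(gamma)
    for x_parts, y in data:
        f = predictor(x_parts, betas, gamma, tau)
        resid = y - 1.0 / (1.0 + np.exp(-f))          # y_i - mu_i
        w = softmax_weights(x_parts, betas, gamma, tau)
        for k in range(len(betas)):
            g_betas[k] += resid * w[k] * x_parts[k]
            g_gamma[k] += resid * w[k]
    return [b + lr * g for b, g in zip(betas, g_betas)], gamma + lr * g_gamma

# Toy usage: K = 2 blocks of the explanatory vector, two observations.
data = [((np.array([1.0, 0.0]), np.array([0.5])), 1.0),
        ((np.array([-1.0, 2.0]), np.array([1.5])), 0.0)]
betas, gamma = grad_step(data, [np.zeros(2), np.zeros(1)], np.zeros(2), tau=1.0)
print(betas, gamma)
```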
4 Discussion

The framework of information geometry for a regression model utilizes the close relation among the three function spaces F, R and P in one-to-one correspondence, where the e-geodesic and m-geodesic on F are induced from those in P and R, respectively. The information metric on P is naturally translated to the space F by taking the marginal expectation. Consider a parametric model

    M = { f_ω(x) : ω ∈ Ω }    (35)

embedded in F with parameter vector ω. Then the e-connection and m-connection on M are induced from those on the model

    M̃ = { p(y | x, f_ω) : ω ∈ Ω }    (36)

embedded in P. If we consider a divergence D other than D_KL, then another pair of linear connections on M is associated with it; see [2] for detailed formulae.

The discussion in this paper strongly depends on the assumption (2) on the conditional density function p(y|x, f), in accordance with the GLM formulation. If p(y|x, f) does not belong to such an exponential model but to another type of model, then the framework of information geometry should be adapted. For example, if p(y|x, f) is in a deformed exponential model, then the geometry is suggested by the deformation. However, the structure remains valid, including dual flatness and the Pythagorean relation associated with the canonical divergence.

References

1. Amari, Shun-ichi and Nagaoka, Hiroshi. Methods of Information Geometry. Oxford University Press: Oxford, UK, 2000.
2. Eguchi, Shinto. Geometry of minimum contrast. Hiroshima Math. J. 1992, 22, 631–647.
3. Eguchi, Shinto and Komori, Osamu. Path connectedness on a space of probability density functions. In Geometric Science of Information. Springer International Publishing, 2015, 615–624.
4. Hastie, Trevor, Tibshirani, Robert and Friedman, Jerome. The Elements of Statistical Learning. Springer, New York, 2009.
5. Nielsen, Frank and Sun, Ke. Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities. arXiv preprint arXiv:1606.05850, 2016.
6. Murphy, Kevin. Naive Bayes classifiers. University of British Columbia, 2006.
7. Omae, Katsuhiro, Komori, Osamu and Eguchi, Shinto. Quasi-linear score for capturing heterogeneous structure in biomarkers. BMC Bioinformatics 2017, 18(1):308.