Information Criterion for Selection of Ubiquitous Factors

21/09/2014
Authors: Hellinton H. Takada, Julio M. Stern
Publication: MaxEnt 2014
OAI: oai:www.see.asso.fr:9603:11329
DOI: 10.23723/9603/11329


License: Creative Commons None (All rights reserved)


Information Criterion for Selection of Ubiquitous Factors

Hellinton H. Takada(a,b) and Julio M. Stern(b)
(a) Quantitative Research, Itaú Asset Management, São Paulo, Brazil
(b) Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil

Abstract. Factor analysis is a statistical procedure that describes observed data in terms of unobserved variables called factors. Naturally, it is necessary to determine the number of factors needed to represent the system, and several criteria exist to deal with the tradeoff between reducing the approximation error and avoiding overparameterization. However, given the factors, there is no established approach to verify whether they are really equally inherent to the entire data set. In this paper, the term ubiquitous factors is coined to describe such equally omnipresent factors, and an information criterion is proposed to fill this gap. Additionally, it is shown that the criterion can be used to compare the ubiquity of factors obtained from two different techniques: principal component analysis and non-negative matrix factorization. Finally, the proposed criterion is extended to identify factors more suitable to describe only a partition of the data.

Keywords: Information theory, Entropy, Financial markets.
PACS: 89.70.-a, 89.70.Cf, 89.65.Gh

INTRODUCTION

Originally developed in the social sciences and psychology, factor analysis (FA) is a statistical procedure that describes observed data in terms of unobserved variables called factors [1-3]. The objective of FA is to reduce the dimensionality of the original data, a matrix X with T rows (observations) and n columns (variables), using an approximation such that

    X ≈ F Λ,    (1)

where F (T × k) is the matrix of factors or unobserved (latent) variables, Λ (k × n) is the matrix of factor loadings or weights, k represents the number of factors and k < n. In the literature, there are several factorization techniques to find F and Λ. The most popular approach is principal component analysis (PCA), introduced by Pearson [4] and developed by Hotelling [5].
An example of a more recent technique is non-negative matrix factorization (NNMF), introduced by Paatero and Tapper [6] and popularized by Lee and Seung [7].

In exploratory FA, it is necessary to determine the number of factors k. PCA has a long list of possible approaches to select k: the Akaike information criterion [8], minimum description length [9], imbedded error function [10], cumulative percent variance [11], scree test on residual percent variance [12], average eigenvalue [13], parallel analysis [14], autocorrelation [15], cross-validation based on the PRESS and R ratio [16], variance of the reconstruction error [17], etc. NNMF also offers some alternatives for the choice: three Bayesian information criteria [18], relative root of sum of squared differences [19], a volume-based method [20], the cophenetic correlation coefficient method [21], the bi-cross-validation method [22], etc. Obviously, the existing criteria deal with the tradeoff between reduction of the approximation error and avoidance of overparameterization. However, it is not guaranteed that the factors produced using the mentioned criteria are equally inherent to all of the data. In the FA literature, the factors are usually referred to as common trends; this is not always accurate, however, because sometimes the obtained factors describe only part of the columns of X. In this paper, given the factors, a criterion is presented to find the factor or factors most ubiquitous (or omnipresent) across all of the columns of X. Additionally, the proposed criterion can be used to compare the degree of ubiquity of factors obtained from different factorization techniques. The paper is organized as follows: first, the ubiquitous factor criterion (UFC) is introduced. Then, the UFC is applied to PCA and NNMF in the context of financial time series to find the most ubiquitous factors. Next, the UFC is extended to enable the identification of factors specific to partitions of the columns of X. Finally, the conclusions, together with further comments on the results, are given at the end.
UBIQUITOUS FACTORS

Ubiquitous Factor Criterion

In this section, the ubiquitous factor criterion (UFC) is introduced. The factor model given by Eq. (1) is usually implemented with the following restrictions on the factor loadings:

    λ_{ij} ≥ 0 and Σ_{j=1}^{n} λ_{ij} = 1, for i = 1, …, k.    (2)

Considering restriction (2) and noticing that 0 ≤ λ_{ij} ≤ 1, it is possible to define, for each factor i, the discrete Shannon entropy [23]

    H_i = − Σ_{j=1}^{n} λ_{ij} ln λ_{ij}.    (3)

The Shannon entropy quantifies the expected value of the information contained in the sequence λ_{i1}, …, λ_{in}. In this definition, it is usual to consider 0 ln 0 = 0. Using H_i, it is possible to state the UFC: given a number of factors k and calculating H_1, …, H_k, the higher the value of H_i, the more ubiquitous (or omnipresent) the factor i. Conversely, the lower the value of H_i, the more specific the factor i. In the next section, a sample application using financial time series is presented.

Sample Application

In this section, the UFC is applied to PCA and NNMF to find the most ubiquitous factors in financial time series. PCA has been applied to several problems in finance, from yield curves to investment risk factors. NNMF, in turn, was applied in [24] to identify factors in stock market data. The prices considered here are from exchange-traded funds (ETFs) listed on the Brazilian stock exchange (BM&F Bovespa) for the period from 01/02/2012 to 03/19/2014. Specifically, the ETFs chosen are: 1) BOVA11, 2) BRAX11, 3) CSMO11, 4) DIVO11, 5) FIND11, 6) GOVE11, 7) ISUS11, 8) MATB11, 9) MILA11, 10) MOBI11, 11) PIBB11 and 12) SMAL11. Consequently, n = 12. Additionally, all the prices were normalized to a common initial value, the resulting factors are in order of decreasing variance, restriction (2) is respected and, for comparison purposes, k = 3 is adopted for both PCA and NNMF. Singular value decomposition (SVD) is a technique from linear algebra used to obtain the principal components [25].
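To make the computation concrete, the steps above can be sketched in Python/numpy: SVD of the mean-centered data, loadings normalized to satisfy restriction (2), and the entropy of Eq. (3). This is a minimal sketch, not the paper's code: the data below are a synthetic stand-in for the T × n matrix of normalized ETF prices, and squaring the (unit-norm) right-singular-vector entries is an assumed way of enforcing restriction (2) for PCA.

```python
import numpy as np

def ufc(loadings, eps=1e-12):
    """UFC statistics: Shannon entropy of each factor's loading
    distribution across the n data columns (Eq. 3).
    `loadings` is k x n; each row is non-negative and sums to 1."""
    p = np.clip(loadings, eps, 1.0)         # clip so that 0 ln 0 -> 0
    return -np.sum(loadings * np.log(p), axis=1)

def pca_factors(X, k):
    """Factors and loadings via SVD of the mean-centered data.
    Loadings are squared right-singular-vector entries, so each
    row is non-negative and sums to one (an assumed normalization)."""
    Xc = X - X.mean(axis=0)                 # mean-center the columns
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    F = U[:, :k] * s[:k]                    # T x k matrix of factors
    L = Vt[:k, :] ** 2                      # k x n loadings, rows sum to 1
    return F, L

# Synthetic stand-in for the T x n matrix of normalized prices.
rng = np.random.default_rng(0)
T, n, k = 500, 12, 3
X = np.cumsum(rng.normal(size=(T, n)), axis=0) / 100 + 1.0

F, L = pca_factors(X, k)
H = ufc(L)
print("UFC statistics:", H)                 # higher H_i -> more ubiquitous
print("most ubiquitous factor:", H.argmax() + 1)
```

Each H_i lies between 0 (a factor loading on a single column) and ln n (a factor loading uniformly on all columns), which is what makes it a ubiquity score.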
The SVD factorization results in

    X̃ = U S V',    (4)

where X̃ is obtained by mean-centering the data matrix X, the columns of U (T × n) and V (n × n) are orthonormal eigenvectors of X̃X̃' and X̃'X̃, respectively, and S (n × n) is a diagonal matrix containing the square roots of the corresponding eigenvalues, ordered such that s_1 ≥ s_2 ≥ … ≥ s_n ≥ 0, since usually T > n. Given k, the PCA k-factor model is

    X̃ ≈ F_k Λ_k,    (5)

where F_k = U_k S_k and Λ_k = V_k', with U_k and V_k containing the first k columns of U and V and S_k the leading k × k block of S. The columns of F_k are the factors and the columns of V_k are the corresponding factor loadings. Since the columns of V_k have unit norm but may contain negative entries, the squared entries are used to respect restriction (2), and the UFC statistics for PCA are given by

    H_i = − Σ_{j=1}^{n} v_{ji}^2 ln v_{ji}^2, i = 1, …, k.    (6)

The obtained factors and factor loadings for PCA are shown in FIGURE 1 and FIGURE 2, respectively, and the UFC statistics are in TABLE 1. The first factor is the most ubiquitous one; the third factor is the second most ubiquitous, while the second factor is third in terms of ubiquity.

FIGURE 1. Factors obtained using PCA.
FIGURE 2. Factor loadings obtained using PCA.

Since the matrix of historical prices X is nonnegative, given the integer k the NNMF problem is to find the approximation

    X ≈ W H,    (7)

where W (T × k) and H (k × n) have nonnegative entries. The columns of W represent the factors and the rows of H the factor loadings. The NNMF optimization procedure minimizes the approximation error between X and WH. In a generalized way, the Bregman divergence is used as the objective function to be minimized [26,27]. Considering only separable Bregman divergences,

    D_φ(X ‖ Y) = Σ_{t=1}^{T} Σ_{j=1}^{n} [ φ(x_{tj}) − φ(y_{tj}) − φ'(y_{tj}) (x_{tj} − y_{tj}) ],    (8)

where φ is a strictly convex function with a continuous first derivative. Formally, the resulting optimization problems are

    min_{W ≥ 0, H ≥ 0} D_φ(X ‖ WH) + P_W(W) + P_H(H)    (9)

or

    min_{W ≥ 0, H ≥ 0} D_φ(WH ‖ X) + P_W(W) + P_H(H),    (10)

where P_W and P_H are penalty functions used to enforce certain application-dependent characteristics of the solution, such as sparsity and/or smoothness. It is also important to remember that Bregman divergences are not symmetric in general; consequently, the form D_φ(X ‖ WH) of Eq. (9) is considered here.
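For the squared Euclidean case φ(x) = x²/2 with no penalty terms, the minimization of D_φ(X ‖ WH) can be sketched with a projected alternating least squares loop. This is a minimal illustration on synthetic nonnegative data, not the paper's exact algorithm or data; the clip-to-zero projection after each unconstrained least-squares solve is an assumption.

```python
import numpy as np

def nnmf_als(X, k, n_iter=200, seed=0):
    """Minimal projected ALS sketch for X ~ W H with W >= 0 (T x k)
    and H >= 0 (k x n) under a squared-error objective. Negative
    entries of each unconstrained least-squares solution are clipped
    to zero; not a production implementation."""
    rng = np.random.default_rng(seed)
    T, n = X.shape
    W = rng.random((T, k))
    for _ in range(n_iter):
        # Fix W, solve W H ~ X for H, then project onto H >= 0.
        H = np.clip(np.linalg.lstsq(W, X, rcond=None)[0], 0, None)
        # Fix H, solve H' W' ~ X' for W, then project onto W >= 0.
        W = np.clip(np.linalg.lstsq(H.T, X.T, rcond=None)[0].T, 0, None)
    return W, H

rng = np.random.default_rng(1)
X = rng.random((100, 12))            # nonnegative stand-in data
W, H = nnmf_als(X, 3)
err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print("relative approximation error:", err)
```

To compute the UFC statistics afterwards, each row of H can be divided by its sum so that the loadings satisfy restriction (2).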
Adopting the squared Euclidean case φ(x) = x²/2 and no penalty terms, there are some known classes of algorithms to solve the NNMF problem [28]: gradient descent algorithms, multiplicative update algorithms and alternating least squares (ALS) algorithms. Here, ALS is adopted (the use of other algorithms does not produce great differences for the sample example presented here) and, with the rows of H normalized to respect restriction (2), the UFC statistics for NNMF are

    H_i = − Σ_{j=1}^{n} h_{ij} ln h_{ij}, i = 1, …, k.    (11)

The obtained factors and factor loadings for NNMF are shown in FIGURE 3 and FIGURE 4, respectively, and the UFC statistics are in TABLE 1. It is possible to notice that the NNMF factors are already in order of decreasing ubiquity.

FIGURE 3. Factors obtained using NNMF.
FIGURE 4. Factor loadings obtained using NNMF.

TABLE 1. UFC and SFC statistics for PCA and NNMF factors.

                          H_i (PCA)   H_i (NNMF)   S_i (PCA)   S_i (NNMF)
  first factor (i = 1)    2.1599      2.3991       0.3250      0.0858
  second factor (i = 2)   1.6152      2.3628       0.8697      0.1221
  third factor (i = 3)    1.6904      2.1394       0.7945      0.3455

Finally, it is also possible to notice that the ubiquity degrees of the NNMF factors are higher than the corresponding statistics for PCA. Consequently, for the considered data, the NNMF factors are more ubiquitous than the PCA factors; in other words, the NNMF factors are better common trends than the PCA factors.

SPECIFIC FACTOR CRITERION

Cluster analysis has the objective of grouping objects into partitions. In the literature, there are several related algorithms: hierarchical clustering and k-means are popular examples. Additionally, the use of information theory in cluster analysis is not new; in particular, the Kullback-Leibler divergence has already been applied [29]. However, the problem here is quite different: given the factors, a criterion is proposed to select the factor that best describes a given partition of the columns of X. For each factor i, it is possible to define a statistic based on the discrete Kullback-Leibler divergence [30]:

    S_i(q) = Σ_{j=1}^{n} λ_{ij} ln ( λ_{ij} / q_j ),    (12)

where q = (q_1, …, q_n) is a reference mass distribution over the columns of X. The discrete Kullback-Leibler divergence is a non-symmetric measure of the difference between two mass distributions.
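The statistic of Eq. (12) is straightforward to compute once the reference distribution q is fixed. The sketch below uses hypothetical loading rows and a hypothetical value of ε; the column indices {3, 5, 7} are only an illustration, mirroring the positions of CSMO11, FIND11 and ISUS11 in the ETF ordering above.

```python
import numpy as np

def sfc(loadings, q, eps=1e-12):
    """SFC statistics: Kullback-Leibler divergence D(lambda_i || q)
    of each factor's loading distribution from a reference
    distribution q over the n columns (Eq. 12)."""
    lam = np.asarray(loadings, dtype=float)
    q = np.asarray(q, dtype=float)
    p = np.clip(lam, eps, 1.0)              # clip so that 0 ln 0 -> 0
    return np.sum(lam * (np.log(p) - np.log(q)), axis=1)

n = 12
# Uniform reference: the SFC then equals ln(n) - H_i, so it ranks
# factors exactly as the UFC does (lower S_i = more ubiquitous).
q_uniform = np.full(n, 1.0 / n)

# Reference concentrated, up to a small epsilon, on columns 3, 5, 7
# (hypothetical choice of epsilon for illustration).
epsilon = 1e-6
q1 = np.full(n, epsilon)
q1[[2, 4, 6]] = (1 - (n - 3) * epsilon) / 3

# Hypothetical loading rows (non-negative, each summing to one).
lam = np.array([
    np.full(n, 1.0 / n),                                        # ubiquitous
    0.4 * np.eye(n)[2] + 0.3 * np.eye(n)[4] + 0.3 * np.eye(n)[6],  # specific
])
s_u = sfc(lam, q_uniform)
s_1 = sfc(lam, q1)
print("SFC vs uniform q:   ", s_u)   # first row scores lower: ubiquitous
print("SFC vs partition q1:", s_1)   # second row scores lower: specific
```

With the uniform reference the perfectly ubiquitous row attains S_i = 0, while against q1 the row concentrated on columns {3, 5, 7} scores far lower than the uniform one, which is exactly the selection behavior the SFC exploits.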
Using S_i(q), it is possible to state the specific factor criterion (SFC): given a number of factors k and calculating S_1(q), …, S_k(q), the lower the value of S_i(q), the more specific the factor i to the partition of the columns of X described by q. The vector q is chosen to create partitions of the columns of X. In the following, some particular cases of q are empirically studied using the same data as in the previous section. Considering the uniform vector

    q_j = 1/n, j = 1, …, n,    (13)

the SFC acts as the UFC, since in this case S_i(q) = ln n − H_i. The SFC statistics obtained are presented in TABLE 1 and they lead to the same conclusions as the UFC statistics. Choosing a vector q^(1) concentrating its mass on the columns of CSMO11, FIND11 and ISUS11,

    q^(1)_j = (1 − 9ε)/3 for j ∈ {3, 5, 7}, q^(1)_j = ε otherwise,    (14)

and a second vector q^(2) concentrating its mass on the columns of BOVA11, MOBI11 and SMAL11,

    q^(2)_j = (1 − 9ε)/3 for j ∈ {1, 10, 12}, q^(2)_j = ε otherwise,    (15)

where ε is a very small positive number, the SFC statistics were calculated and the results are in TABLE 2. Clearly, the factor that best describes the partition given by q^(1) is factor 3 and, for the partition given by q^(2), factor 2. Observing FIGURE 3, it is possible to notice an increasing trend (given by factor 3) and a decreasing trend (given by factor 2). Indeed, the ETFs CSMO11, FIND11 and ISUS11 predominantly increased, while BOVA11, MOBI11 and SMAL11 predominantly decreased over the considered historical period. Consequently, the SFC identified the factors that best describe the common trend of each chosen set of ETFs.

TABLE 2. SFC statistics for the NNMF factors with partition vectors q^(1) and q^(2).

                          S_i(q^(1))   S_i(q^(2))
  first factor (i = 1)    8.8522       9.1667
  second factor (i = 2)   8.4927       6.1651
  third factor (i = 3)    4.6169       10.3602

CONCLUSIONS

In the literature, there are several criteria to find the number of factors, dealing with the tradeoff between reduction of the approximation error and avoidance of overfitting. However, given the factors, there was a lack of an approach to verify whether they are really ubiquitous across the entire data set. In this paper, the ubiquitous factor criterion was introduced to fill this gap. Additionally, a criterion was also proposed to identify factors more suitable to describe only a partition of the data.
Applications of the criteria to financial time series show their usefulness in selecting the best overall and partition-specific trends and in comparing different factorization techniques such as PCA and NNMF.

REFERENCES

1. C. Spearman, "'General intelligence,' objectively determined and measured," Am. J. Psychol. 15 (2), 201-292 (1904).
2. Z. Ghahramani and G. E. Hinton, "The EM algorithm for mixtures of factor analyzers," Tech. Rep. CRG-TR-96-1, Dept. of Computer Science, Univ. of Toronto, 1997.
3. B. S. Everitt, Latent Variable Models, London: Chapman and Hall, 1984.
4. K. Pearson, "On lines and planes of closest fit to systems of points in space," Philos. Mag. 2 (11), 559-572 (1901).
5. H. Hotelling, "Analysis of a complex of statistical variables into principal components," J. Educational Psychol. 24 (6), 417-441 (1933).
6. P. Paatero and U. Tapper, "Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values," Environmetrics 5 (2), 111-126 (1994).
7. D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature 401, 788-791 (1999).
8. H. Akaike, "Information theory and an extension of the maximum likelihood principle," Proc. 2nd International Symposium on Information Theory, 267-281 (1974).
9. J. Rissanen, "Modeling by shortest data description," Automatica 14, 465-471 (1978).
10. E. R. Malinowski, "Determination of the number of factors and the experimental error in a data matrix," Analytical Chemistry 49 (4), 612-617 (1977).
11. E. R. Malinowski, Factor Analysis in Chemistry, New York: Wiley-Interscience, 1991.
12. R. B. Cattell, "The scree test for the number of factors," Multivariate Behavioral Research 1, 245-276 (1966).
13. H. F. Kaiser, "The application of electronic computers to factor analysis," Educational and Psychological Measurement 20 (1), 141-151 (1960).
14. W. R. Zwick and W. F. Velicer, "Comparison of five rules for determining the number of components to retain," Psychological Bulletin 99 (3), 432-442 (1986).
15. R. I. Shrager and R. W. Hendler, "Titration of individual components in a mixture with resolution of difference spectra, pKs, and redox transitions," Analytical Chemistry 54 (7), 1147-1152 (1982).
16. S. Wold, "Cross validatory estimation of the number of components in factor and principal components analysis," Technometrics 20, 397-406 (1978).
17. S. J. Qin and R. Dunia, "Determining the number of principal components for best reconstruction," J. Process Control 10, 245-250 (2000).
18. J. Bai and S. Ng, "Determining the number of factors in approximate factor models," Econometrica 70 (1), 191-221 (2002).
19. X. Shao, G. Wang, S. Wang and Q. Su, "Extraction of mass spectra and chromatographic profiles from overlapping GC/MS signal with background," Analytical Chemistry 76 (17), 5143-5148 (2004).
20. P. Fogel, S. S. Young, D. M. Hawkins and N. Ledirac, "Inferential, robust non-negative matrix factorization analysis of microarray data," Bioinformatics 23 (1), 44-49 (2007).
21. J. Brunet, P. Tamayo, T. R. Golub and J. P. Mesirov, "Metagenes and molecular pattern discovery using matrix factorization," Proc. National Academy of Sciences of the United States of America 101 (12), 4164-4169 (2004).
22. A. B. Owen and P. O. Perry, "Bi-cross-validation of the SVD and the non-negative matrix factorization," Tech. Rep., Stanford Univ., 2008.
23. C. E. Shannon, "A mathematical theory of communication," Bell System Tech. J. 27, 379-423 (1948).
24. K. Drakakis, S. Rickard, R. de Fréin and A. Cichocki, "Analysis of financial data using non-negative matrix factorization," International Mathematical Forum 3 (38), 1853-1870 (2008).
25. G. H. Golub and C. F. Van Loan, Matrix Computations, The Johns Hopkins Univ. Press, 1996.
26. I. S. Dhillon and S. Sra, "Generalized nonnegative matrix approximations with Bregman divergences," Advances in Neural Information Processing Systems 18, 283-290 (2005).
27. L. Li, G. Lebanon and H. Park, "Fast Bregman divergence NMF using Taylor expansion and coordinate descent," Proc. 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2012.
28. M. W. Berry, M. Browne, A. N. Langville, V. P. Pauca and R. J. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics and Data Analysis 52, 155-173 (2007).
29. A. Sheehy, "Maximal Kullback-Leibler divergence cluster analysis," Tech. Rep. 113, Dept. of Statistics, Univ. of Washington, 1987.
30. S. Kullback and R. A. Leibler, "On information and sufficiency," Annals of Mathematical Statistics 22, 79-86 (1951).