In simple terms, principal component analysis (PCA) is a method of extracting the important variables (in the form of components) from a large set of variables. It extracts a low-dimensional set of features from a high-dimensional data set with the aim of capturing as much information as possible. With fewer variables, visualization also becomes much more meaningful. This is why PCA is called a dimension reduction technique. PCA is most useful when the data are high-dimensional and the variables are significantly correlated with one another.
Principal components analysis is one of the simplest multivariate methods. The objective of the analysis is to take p variables (x1, x2, x3, ..., xp) and find linear combinations of these to produce transformed variables (z1, z2, z3, ..., zp) that are uncorrelated, ordered by importance, and together describe the overall variation in the data set.
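In symbols, the linear combinations described above can be written as follows. The coefficients a_{ij} and the unit-length constraint on them are the usual textbook convention and are introduced here as assumed notation; they do not appear in the text above.

```latex
z_i = a_{i1} x_1 + a_{i2} x_2 + \cdots + a_{ip} x_p, \qquad i = 1, \dots, p,
\qquad \text{with} \quad a_{i1}^2 + a_{i2}^2 + \cdots + a_{ip}^2 = 1,
\qquad \operatorname{cov}(z_i, z_j) = 0 \quad \text{for } i \neq j .
```

Without a constraint such as the unit-length one, the variance of any z_i could be made arbitrarily large simply by scaling its coefficients, so comparing the variances of the components would be meaningless.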
The lack of correlation means that the indices are measuring different “dimensions” of the data, and the ordering is such that var(z1) ≥ var(z2) ≥ var(z3) ≥ ... ≥ var(zp), where var(zi) denotes the variance of zi. The Z indices are then the principal components. When doing principal components analysis, there is always the hope that the variances of most of the indices will be so low as to be negligible. In that case, most of the variation in the full data set can be adequately described by the few Z variables with variances that are not negligible, and some degree of economy is then achieved. For this reason PCA is also called a dimension reduction technique. Often a Z variable that explains a significant share of the variance has dominant loadings on a few of the original X variables, so that it describes a specific quantitative or qualitative aspect of the X attributes. Hence such newly formed Z variables are often interpreted as latent factors.
Principal components analysis does not always work, in the sense that a large number of original variables are reduced to a small number of transformed variables. Indeed, if the original variables are uncorrelated, then the analysis achieves nothing. The best results are obtained when the original variables are very highly correlated, positively or negatively. If that is the case, then it is quite conceivable that for example 20 or more original variables can be adequately represented by two or three principal components. If this desirable state of affairs does occur, then the important principal components will be of some interest as measures of the underlying dimensions in the data. It will also be of value to know that there is a good deal of redundancy in the original variables, with most of them measuring similar things.
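The contrast between uncorrelated and highly correlated variables can be illustrated with a small simulation. This is a minimal sketch on synthetic data, assuming NumPy is available; the helper name explained_variance_ratio, the sample sizes, and the noise level are all illustrative choices, not part of the original discussion.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of the total variance carried by each principal component."""
    # The eigenvalues of the covariance matrix are the component variances.
    eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eigenvalues = eigenvalues[::-1]  # eigvalsh returns ascending order
    return eigenvalues / eigenvalues.sum()

rng = np.random.default_rng(0)
n, p = 500, 20  # hypothetical: 500 observations of 20 variables

# Case 1: 20 independent variables, so there is nothing for PCA to compress.
X_uncorrelated = rng.normal(size=(n, p))

# Case 2: 20 variables driven by 2 hidden signals plus a little noise.
signals = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, p))
X_correlated = signals @ mixing + 0.1 * rng.normal(size=(n, p))

ratio_uncorrelated = explained_variance_ratio(X_uncorrelated)
ratio_correlated = explained_variance_ratio(X_correlated)

print("uncorrelated, share of first 2 PCs:", ratio_uncorrelated[:2].sum())
print("correlated,   share of first 2 PCs:", ratio_correlated[:2].sum())
```

In the correlated case the first two components should carry nearly all of the variance, while in the uncorrelated case they carry only slightly more than the 2/20 share that any two variables would.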
Where is it used?
A multi-dimensional hyperspace is often difficult to visualize. The main objectives of unsupervised learning methods are to reduce dimensionality, to score all observations on a composite index, and to cluster similar observations together based on their multivariate attributes. Summarizing multivariate attributes by two or three variables that can be displayed graphically with minimal loss of information is useful in knowledge discovery. Because it is hard to visualize a multi-dimensional space, PCA is mainly used to reduce d multivariate attributes to two or three dimensions.
PCA summarizes the variation in correlated multivariate attributes as a set of uncorrelated components, each of which is a particular linear combination of the original variables. The extracted uncorrelated components are called principal components (PCs) and are estimated from the eigenvectors of the covariance matrix of the original variables. The objective of PCA is therefore to achieve parsimony and reduce dimensionality by extracting the smallest number of components that account for most of the variation in the original multivariate data, summarizing the data with little loss of information.
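The eigenvector-based estimation described above can be sketched in a few lines. This is an illustrative implementation assuming NumPy, not a reference one; the function name pca_eig and the synthetic data are hypothetical.

```python
import numpy as np

def pca_eig(X, n_components):
    """PCA via eigendecomposition of the covariance matrix (illustrative helper)."""
    # Center each variable so the components describe variation around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    order = np.argsort(eigenvalues)[::-1]      # largest variance first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]

    # Each column of `components` holds the coefficients of one linear combination.
    components = eigenvectors[:, :n_components]
    scores = X_centered @ components           # the principal component values
    return scores, eigenvalues, components

rng = np.random.default_rng(42)
# Synthetic correlated data: 200 observations of 5 variables.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

scores, eigenvalues, components = pca_eig(X, n_components=2)
print("component variances:", eigenvalues)
print("share explained by 2 PCs:", eigenvalues[:2].sum() / eigenvalues.sum())
```

The variance of each column of scores equals the corresponding eigenvalue, and the columns are uncorrelated with each other, which matches the two properties the text attributes to principal components.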
A few use cases for PCA:
Survey data: Any market survey data collected on a Likert scale (0-5, 0-10, etc.) can be used to derive principal components that describe a specific sentiment of the customers or participants in the survey. As a common rule of thumb, the principal components with an eigenvalue greater than 1 (computed from the correlation matrix) are the important ones to consider.
Market mix model: A market mix model is usually developed from 52-104 weeks of sales and marketing spend data, along with many brand image variables measured on a monthly or quarterly basis, to derive the contribution of marketing spend to revenue as part of the overall ROI calculation. Realized sales, revenue, or pipeline sales are modeled with the help of many spend-related attributes and their various derived adstock values. In such a scenario, PCA is used to reduce the overall dimension of the data.
Brand image: PCA is often used to combine many brand variables into a single brand value index.
NPS score calculation: In calculating the NPS (Net Promoter Score) from customer survey data, PCA is often used to capture the overall effect of all the variables considered.
CSAT score calculation: Similarly, PCA is used in CSAT (customer satisfaction) score calculation.
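The eigenvalue cutoff mentioned under survey data can be sketched as follows. The rule is usually applied to the eigenvalues of the correlation matrix, where each standardized variable contributes a variance of 1, so a component with an eigenvalue above 1 explains more than any single original variable. The survey below is entirely synthetic, and the function name kaiser_components and the two "sentiments" are assumptions made for illustration.

```python
import numpy as np

def kaiser_components(X):
    """Count components whose correlation-matrix eigenvalue exceeds 1."""
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # descending order
    keep = int(np.sum(eigenvalues > 1.0))
    return keep, eigenvalues

rng = np.random.default_rng(7)
n = 300  # hypothetical number of respondents

# Two made-up underlying sentiments, each driving three survey items.
sentiment_a = rng.normal(size=(n, 1))
sentiment_b = rng.normal(size=(n, 1))
raw = np.hstack([
    sentiment_a + 0.3 * rng.normal(size=(n, 3)),
    sentiment_b + 0.3 * rng.normal(size=(n, 3)),
])

# Map to whole-number answers on a 0-5 scale.
responses = np.clip(np.rint(2.5 + raw), 0, 5)

keep, eigenvalues = kaiser_components(responses)
print("eigenvalues:", np.round(eigenvalues, 2))
print("components with eigenvalue > 1:", keep)
```

The cutoff is a rule of thumb rather than a test; in practice it is often checked against a scree plot or the cumulative share of variance explained before deciding how many components to keep.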