I am doing PCA and I am interested in which original features were most important. Let me illustrate this with an example:
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[1,-1, -1,-1], [1,-2, -1,-1], [1,-3, -2,-1], [1,1, 1,-1], [1,2,1,-1], [1,3, 2,-0.5]])
print(X)
Which outputs:
[[ 1.  -1.  -1.  -1. ]
[ 1.  -2.  -1.  -1. ]
[ 1.  -3.  -2.  -1. ]
[ 1.   1.   1.  -1. ]
[ 1.   2.   1.  -1. ]
[ 1.   3.   2.  -0.5]]
Intuitively, one could already say that feature 1 and feature 4 are not very important due to their low variance. Let's apply PCA to this set:
pca = PCA(n_components=2)
pca.fit_transform(X)
comps = pca.components_
Output:
array([[ 0.        ,  0.8376103 ,  0.54436943,  0.04550712],
       [-0.        ,  0.54564656, -0.8297757 , -0.11722679]])
This output represents the importance of each original feature for each of the two principal components (see this for reference). In other words, for the first principal component, feature 2 is most important, then feature 3. For the second principal component, feature 3 looks most important.
The question is: which feature is most important, which one is second most important, and so on? Can I use the components_ attribute for this? Or am I wrong, and PCA is not the correct method for this kind of analysis (should I use a feature selection method instead)?
PCA is a dimensionality reduction technique that has four main parts: feature covariance, eigendecomposition, principal component transformation, and choosing components in terms of explained variance.
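The four parts listed above can be sketched with plain NumPy on the example matrix from the question. This is only an illustrative sketch, not how scikit-learn implements PCA internally (it uses an SVD), and the variable names are made up for this example. Eigenvector signs may also differ from what sklearn reports.

```python
import numpy as np

# Example data from the question (rows are observations, columns are features).
X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])

# 1. Feature covariance: center the data, then compute the covariance matrix.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# 2. Eigendecomposition: eigh is meant for symmetric matrices and returns
#    eigenvalues in ascending order, so reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Principal component transformation: project onto the top k eigenvectors.
k = 2
scores = X_centered @ eigvecs[:, :k]

# 4. Explained variance: each eigenvalue's share of the total variance.
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio[:k])
```

The printed ratios should be close to what `pca.explained_variance_ratio_` reports for the same data.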
PCA helps you interpret your data, but it will not always find the important patterns. Principal component analysis (PCA) simplifies the complexity in high-dimensional data while retaining trends and patterns. It does this by transforming the data into fewer dimensions, which act as summaries of features.
For each principal component, the features with the largest absolute loadings are the ones that matter most to that component. To sum up, we look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues.
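As a small sketch of that idea, the loadings printed in the question can be ranked by absolute value within each component. The array below is copied from the question's output; the name `ranking` is just for this example.

```python
import numpy as np

# Loadings copied from the components_ output shown in the question.
components = np.array([[0.0, 0.8376103, 0.54436943, 0.04550712],
                       [-0.0, 0.54564656, -0.8297757, -0.11722679]])

# For each component (row), sort feature indices by descending absolute loading.
ranking = np.argsort(-np.abs(components), axis=1)

for i, row in enumerate(ranking):
    # Add 1 so the printed numbers match the 1-based feature names in the text.
    print(f"PC{i + 1}:", [int(j) + 1 for j in row])
```

This ranks features within one component only; it says nothing yet about how much that component matters overall.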
The first principal component (PC1) is the line that best accounts for the shape of the point swarm. It represents the maximum variance direction in the data. Each observation may be projected onto this line in order to get a coordinate value along the PC-line. This value is known as a score.
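To make the notion of a score concrete, here is a minimal sketch, assuming scikit-learn's PCA with its default settings (no whitening). It projects the centered observations onto the PC1 direction by hand and compares the result with what `transform` returns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Example data from the question.
X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])

pca = PCA(n_components=2).fit(X)

# A score is the coordinate of a centered observation along a component direction.
pc1_scores_manual = (X - pca.mean_) @ pca.components_[0]

# sklearn computes the same projection in transform().
pc1_scores_sklearn = pca.transform(X)[:, 0]

print(np.allclose(pc1_scores_manual, pc1_scores_sklearn))
```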
The components_ attribute on its own is not the right spot to look for feature importance. The loadings in the two arrays (i.e. the two components PC1 and PC2) tell you how each original feature is weighted when your original matrix is transformed (taken together, they form a rotation matrix). But they don't tell you how much each component contributes to describing the transformed feature space, so you don't yet know how to compare the loadings across the two components.
However, the answer that you linked actually tells you what to use instead: the explained_variance_ratio_ attribute. This attribute tells you how much of the variance in your feature space is explained by each principal component:
In [5]: pca.explained_variance_ratio_
Out[5]: array([ 0.98934303,  0.00757996])
This means that the first principal component explains almost 99 percent of the variance. You know from components_ that PC1 has the highest loading for the second feature. It follows, therefore, that feature 2 is the most important feature in your data space. Feature 3 is the next most important feature, as it has the second highest loading in PC1.
In PC2, the absolute loadings are nearly swapped between feature 2 and feature 3. But as PC2 explains next to nothing of the overall variance, this can be neglected.
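One way to turn this reasoning into a single ranking is to weight each component's absolute loadings by its explained variance ratio and sum over components. This is only a heuristic sketch, not something the PCA API provides or that the answer above prescribes. The names `weighted` and `order` are made up for this example, and other weightings (for instance squared loadings) would be equally defensible.

```python
import numpy as np
from sklearn.decomposition import PCA

# Example data from the question.
X = np.array([[1, -1, -1, -1], [1, -2, -1, -1], [1, -3, -2, -1],
              [1, 1, 1, -1], [1, 2, 1, -1], [1, 3, 2, -0.5]])

pca = PCA(n_components=2).fit(X)

# Heuristic (an assumption of this sketch): weight each component's absolute
# loadings by the share of variance that component explains, then sum.
weighted = np.abs(pca.components_).T @ pca.explained_variance_ratio_

# Feature indices from most to least important under this heuristic.
order = np.argsort(-weighted)
print([int(j) + 1 for j in order])  # 1-based feature numbers
```

Because PC1 explains almost all of the variance here, the result is dominated by the PC1 loadings, which matches the reasoning above.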