
Apply PCA and keep a percentage of the total variance

I want to perform Principal Component Analysis on a particular dataset and then feed the principal components to a LogisticRegression classifier.

Specifically, I want to apply PCA and keep 90% of the total variance, using the function computePrincipalComponentsAndExplainedVariance.

Here's the code for reading the dataset:

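import org.apache.spark.mllib.linalg.Vectors

// Load the data: one comma-separated record of doubles per line
val text = sparkSession.sparkContext.textFile("dataset.data")
val data = text.map(line => line.split(',').map(_.toDouble))
// Separate into label (column 57) and features (columns 0 to 56)
val dataLP = data.map(t => (t(57), Vectors.dense(t.take(57))))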

I am not sure how to perform PCA so that 90% of the total variance is retained.

Giorgos Myrianthous asked Feb 02 '26 02:02


1 Answer

The function computePrincipalComponentsAndExplainedVariance returns two values: a matrix of principal components and a vector whose entries indicate how much variance each principal component explains. From the documentation:

Returns: a matrix of size n-by-k, whose columns are principal components, and a vector of values which indicate how much variance each principal component explains

Pass a large enough k as input (at most the number of features, 57 in your dataset), add up the values in the vector in order until the running total reaches 90% or more, and then keep that many leading columns of the matrix.
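
A minimal sketch of that procedure, assuming dataLP is an RDD[(Double, Vector)] as in the question and that the returned variance values are fractions of the total variance (so they sum to 1 when k equals the number of columns). The names PcaByVariance, componentsNeeded and reduce are hypothetical helpers for illustration, not part of Spark:

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

object PcaByVariance {
  // Smallest number of leading components whose cumulative explained
  // variance reaches `threshold`; falls back to all of them if none does.
  def componentsNeeded(explained: Array[Double], threshold: Double): Int = {
    val cumulative = explained.scanLeft(0.0)(_ + _).tail
    val idx = cumulative.indexWhere(_ >= threshold)
    if (idx >= 0) idx + 1 else explained.length
  }

  // Projects each labelled feature vector onto the leading components
  // that together explain at least `threshold` of the total variance.
  def reduce(dataLP: RDD[(Double, Vector)],
             threshold: Double = 0.9): RDD[(Double, Vector)] = {
    val mat = new RowMatrix(dataLP.map(_._2))
    val n = mat.numCols().toInt

    // Ask for every component so the explained-variance vector is complete.
    val (pc, explained) = mat.computePrincipalComponentsAndExplainedVariance(n)
    val k = componentsNeeded(explained.toArray, threshold)

    // toArray is column-major, so the first n * k values are the first k columns.
    val pcK: Matrix = Matrices.dense(n, k, pc.toArray.take(n * k))

    // (k-by-n) * (n-vector) gives the k-dimensional projection of each row.
    val projector = pcK.transpose
    dataLP.map { case (label, features) =>
      (label, projector.multiply(features): Vector)
    }
  }
}
```

Calling PcaByVariance.reduce(dataLP) would then give label and reduced-feature pairs that can be converted to whatever input type the classifier expects. Projecting with the transposed matrix, rather than RowMatrix.multiply, is a choice made here so that each label stays attached to its row.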

Shaido answered Feb 04 '26 15:02


