
How to do feature selection/feature importance using PySpark?

I am trying to get feature selection/feature importances from my dataset using PySpark, but I am having trouble doing it.

This is how I do it with Python Pandas, but I would like to accomplish the same thing using PySpark:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

cols = [col for col in new_result.columns if col not in ['treatment']]
data = new_result[cols]
target = new_result['treatment']

model = ExtraTreesClassifier()
model.fit(data, target)
print(model.feature_importances_)

# Plot the 10 most important features
feat_importances = pd.Series(model.feature_importances_, index=data.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

This is what I have tried, but I don't feel the PySpark code has achieved what I wanted. I know the model is different, but I would like to get the same result as I did with Pandas:

from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

assembler = VectorAssembler(
  inputCols=['Primary_ID',
             'Age',
             'Gender',
             'Country',
             'self_employed',
             'family_history',
             'work_interfere',
             'no_employees',
             'remote_work',
             'tech_company',
             'benefits',
             'care_options',
             'wellness_program',
             'seek_help',
             'anonymity',
             'leave',
             'mental_health_consequence',
             'phys_health_consequence',
             'coworkers',
             'supervisor',
             'mental_vs_physical',
             'obs_consequence',
             'mental_issue_in_tech'],
                outputCol="features")

output = assembler.transform(new_result)

from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="treatment", outputCol="treatment_index")
output_fixed = indexer.fit(output).transform(output)
final_data = output_fixed.select("features",'treatment_index')
train_data,test_data = final_data.randomSplit([0.7,0.3])

rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="treatment_index", seed=42)
model = rf.fit(train_data)

model.featureImportances

This returns SparseVector(23, {2: 0.0961, 5: 0.1798, 6: 0.3232, 11: 0.0006, 14: 0.1307, 22: 0.2696}). What does this mean? Please advise, and thank you in advance for all the help!

devaaron2 asked Dec 01 '25 10:12
1 Answer

Spark represents vectors internally in two flavours:

  1. DenseVector
    • Takes more memory, since every element is stored in an Array[Double].
  2. SparseVector
    • A memory-efficient way of storing a vector. The representation has 3 parts:
      • the size of the vector
      • an array of indices - contains only the indices whose values are non-zero.
      • an array of values - contains the actual values associated with those indices.

Example (Scala) -

import org.apache.spark.ml.linalg.Vectors

val sparseVector = Vectors.sparse(4, Array(1, 3), Array(3.0, 4.0))
println(sparseVector.toArray.mkString(", "))
// 0.0, 3.0, 0.0, 4.0

All indices missing from the index array are treated as 0.

Regarding your problem-

You can map the sparse vector of feature importances back to the VectorAssembler's input columns. Note that the size of the feature vector and the number of feature importances are the same.

val vectorToIndex = vectorAssembler.getInputCols.zipWithIndex.map(_.swap).toMap
val featureToWeight = rf.fit(trainingData).featureImportances.toArray.zipWithIndex.toMap.map {
  case (featureWeight, index) => vectorToIndex(index) -> featureWeight
}
println(featureToWeight)

Similar code should work in Python too.

Som answered Dec 04 '25 00:12

