I'm using PySpark's ChiSqSelector to select the most important features. The code runs fine, but I can't verify which features were selected, either by index or by name.
So my question is: how can I identify what the values in selectedFeatures refer to?
The sample code below uses only four columns to keep the example easy to follow, but I have to do the same thing for a DF with almost 100 columns.
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

df = df.select("IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall", "HasDetections")

vec_assembler = VectorAssembler(inputCols=["IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall"], outputCol="features")
vec_df = vec_assembler.transform(df)

selector = ChiSqSelector(featuresCol="features", fpr=0.05, outputCol="selectedFeatures", labelCol="HasDetections")
result = selector.fit(vec_df).transform(vec_df)
result.show()

And even when I apply the solution I found in this question, I still cannot tell which columns are selected, by name or by index. That is, I cannot tell which features are being selected.
model = selector.fit(vec_df)
model.selectedFeatures

First: please don't use one-hot encoded features; ChiSqSelector should be applied directly to categorical (non-encoded) columns, as you can see here. Without the one-hot encoding, using the selector is straightforward.
Now let's look at how ChiSqSelector is used and how to find the relevant features by name. For the example I'll create a df with only 2 relevant columns (AVProductStatesIdentifier and Firewall); the other 2 (IsBeta and IsProtected) will be constant:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
import numpy as np
import pandas as pd

# create the example DataFrame
df_p = pd.DataFrame([np.ones(1000, dtype=int),
                     np.ones(1000, dtype=int),
                     np.random.randint(0, 500, 1000, dtype=int),
                     np.random.randint(0, 2, 1000, dtype=int)
                     ], index=['IsBeta', 'IsProtected', 'Firewall', 'HasDetections']).T
df_p['AVProductStatesIdentifier'] = np.random.choice(['a', 'b', 'c'], 1000)

schema = StructType([StructField("IsBeta", IntegerType(), True),
                     StructField("AVProductStatesIdentifier", StringType(), True),
                     StructField("IsProtected", IntegerType(), True),
                     StructField("Firewall", IntegerType(), True),
                     StructField("HasDetections", IntegerType(), True),
                     ])

df = spark.createDataFrame(
    df_p[['IsBeta', 'AVProductStatesIdentifier', 'IsProtected', 'Firewall', 'HasDetections']],
    schema
)
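As an optional sanity check, you can confirm the generated DataFrame has the expected schema and values before assembling:

# quick look at the generated data
df.printSchema()
df.show(3)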
First, let's make the column AVProductStatesIdentifier categorical:
mapping = {l.AVProductStatesIdentifier:i for i,l in enumerate(df.select('AVProductStatesIdentifier').distinct().collect())}
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
df = df.withColumn("AVProductStatesIdentifier", mapping_expr.getItem(col("AVProductStatesIdentifier")))
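The manual mapping above is just one way to do this; Spark's StringIndexer achieves the same thing, turning string categories into numeric category indices that ChiSqSelector can consume. A minimal sketch (the output column name AVProductStatesIdentifier_idx is my choice, not something from the original code):

from pyspark.ml.feature import StringIndexer

# alternative to the manual mapping: index the string categories directly
indexer = StringIndexer(inputCol="AVProductStatesIdentifier", outputCol="AVProductStatesIdentifier_idx")
df = indexer.fit(df).transform(df)

If you go this route, point the VectorAssembler below at AVProductStatesIdentifier_idx instead of the mapped original column.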
Now, let's assemble that and select the 2 most important columns
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

vec_assembler = VectorAssembler(inputCols=["IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall"], outputCol="features")
vec_df = vec_assembler.transform(df)
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features", fpr=0.05, outputCol="selectedFeatures", labelCol="HasDetections")
model = selector.fit(vec_df)
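If you also want the reduced feature vectors themselves, the fitted model can transform the assembled DataFrame, mirroring the transform step from the question:

# apply the fitted selector to get the selectedFeatures column
result = model.transform(vec_df)
result.select("features", "selectedFeatures").show(5, truncate=False)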
Now execute:
np.array(df.columns)[model.selectedFeatures]
which results in
array(['AVProductStatesIdentifier', 'Firewall'], dtype='<U25')
The two non-constant columns.
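One caveat: indexing df.columns only works here because the feature columns come first and in the same order as the assembler's inputCols, with the label column last. For your real DataFrame with ~100 columns, it is safer to index the assembler's input columns directly, since model.selectedFeatures are indices into the assembled feature vector:

# selectedFeatures are indices into the assembler's inputCols, not into df.columns
selected = [vec_assembler.getInputCols()[i] for i in model.selectedFeatures]
print(selected)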