I'm using PySpark's ChiSqSelector to select the most important features. The code runs fine, but I can't verify which features were selected, either by index or by name.
So my question is: how can I identify what the values in selectedFeatures refer to?
The sample code below uses only four columns to keep the example easy to follow, but I have to do the same thing for a DF with almost 100 columns.
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

df = df.select("IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall", "HasDetections")

vec_assembler = VectorAssembler(inputCols=["IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall"], outputCol="features")
vec_df = vec_assembler.transform(df)

selector = ChiSqSelector(featuresCol="features", fpr=0.05, outputCol="selectedFeatures", labelCol="HasDetections")
result = selector.fit(vec_df).transform(vec_df)
result.show()

And even when I apply the solution I found in this question, I still cannot tell which columns are selected, by name or by index. That is, I cannot tell which features are being selected.
model = selector.fit(vec_df)
model.selectedFeatures

First: please don't use one-hot encoded features; ChiSqSelector should be applied directly to categorical (non-encoded) columns, as you can see here. Without the one-hot encoding, using the selector is straightforward.
Now let's look at how ChiSqSelector is used and how to find the relevant features by name. For the example I'll create a df with only 2 relevant columns (AVProductStatesIdentifier and Firewall); the other 2 (IsBeta and IsProtected) will be constant:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
import numpy as np
import pandas as pd

# create the example DataFrame
df_p = pd.DataFrame([np.ones(1000, dtype=int),
                     np.ones(1000, dtype=int),
                     np.random.randint(0, 500, 1000, dtype=int),
                     np.random.randint(0, 2, 1000, dtype=int)
                     ], index=['IsBeta', 'IsProtected', 'Firewall', 'HasDetections']).T
df_p['AVProductStatesIdentifier'] = np.random.choice(['a', 'b', 'c'], 1000)

schema = StructType([StructField("IsBeta", IntegerType(), True),
                     StructField("AVProductStatesIdentifier", StringType(), True),
                     StructField("IsProtected", IntegerType(), True),
                     StructField("Firewall", IntegerType(), True),
                     StructField("HasDetections", IntegerType(), True),
                     ])

df = spark.createDataFrame(
    df_p[['IsBeta', 'AVProductStatesIdentifier', 'IsProtected', 'Firewall', 'HasDetections']],
    schema
)
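As an optional sanity check, you can confirm the generated DataFrame has the expected schema and values before assembling:

# quick look at the generated data
df.printSchema()
df.show(3)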
First, let's make the column AVProductStatesIdentifier categorical:
mapping = {l.AVProductStatesIdentifier:i for i,l in enumerate(df.select('AVProductStatesIdentifier').distinct().collect())}
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
df = df.withColumn("AVProductStatesIdentifier", mapping_expr.getItem(col("AVProductStatesIdentifier")))
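The manual mapping above is just one way to do this; Spark's StringIndexer achieves the same thing, turning string categories into numeric category indices that ChiSqSelector can consume. A minimal sketch (the output column name AVProductStatesIdentifier_idx is my choice, not something from the original code):

from pyspark.ml.feature import StringIndexer

# alternative to the manual mapping: index the string categories directly
indexer = StringIndexer(inputCol="AVProductStatesIdentifier", outputCol="AVProductStatesIdentifier_idx")
df = indexer.fit(df).transform(df)

If you go this route, point the VectorAssembler below at AVProductStatesIdentifier_idx instead of the mapped original column.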
Now, let's assemble that and select the 2 most important columns
from pyspark.ml.feature import VectorAssembler, ChiSqSelector

vec_assembler = VectorAssembler(inputCols=["IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall"], outputCol="features")
vec_df = vec_assembler.transform(df)
selector = ChiSqSelector(numTopFeatures=2, featuresCol="features", fpr=0.05, outputCol="selectedFeatures", labelCol="HasDetections")
model = selector.fit(vec_df)
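If you also want the reduced feature vectors themselves, the fitted model can transform the assembled DataFrame, mirroring the transform step from the question:

# apply the fitted selector to get the selectedFeatures column
result = model.transform(vec_df)
result.select("features", "selectedFeatures").show(5, truncate=False)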
Now execute:
np.array(df.columns)[model.selectedFeatures]
which results in
array(['AVProductStatesIdentifier', 'Firewall'], dtype='<U25')
The two non-constant columns.
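One caveat: indexing df.columns only works here because the feature columns come first and in the same order as the assembler's inputCols, with the label column last. For your real DataFrame with ~100 columns, it is safer to index the assembler's input columns directly, since model.selectedFeatures are indices into the assembled feature vector:

# selectedFeatures are indices into the assembler's inputCols, not into df.columns
selected = [vec_assembler.getInputCols()[i] for i in model.selectedFeatures]
print(selected)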