
How to check selected features with PySpark's ChiSqSelector?

I'm using PySpark's ChiSqSelector to select the most important features. The code runs fine, but I can't tell which features were selected, either by index or by name.

So my question is: how can I identify what the values in selectedFeatures refer to?

In the sample code below I use only four columns to keep the output easy to read; in practice, though, I have to do this for a DataFrame with almost 100 columns.

df = df.select("IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall", "HasDetections")


from pyspark.ml.feature import ChiSqSelector, VectorAssembler

vec_assembler = VectorAssembler(
    inputCols=["IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall"],
    outputCol="features")
vec_df = vec_assembler.transform(df)

selector = ChiSqSelector(featuresCol="features", fpr=0.05,
                         outputCol="selectedFeatures", labelCol="HasDetections")
result = selector.fit(vec_df).transform(vec_df)
result.show()

(screenshot of the result.show() output)

Even when applying the solution I found in this question, I still cannot tell which columns are selected, by name or by index; that is, which features are actually being chosen.

model = selector.fit(vec_df)
model.selectedFeatures

(screenshot of the model.selectedFeatures output)

asked Oct 24 '25 by Tazz

1 Answer

First: please don't use one-hot encoded features. ChiSqSelector should be applied directly to categorical (non-encoded) columns, as you can see here. Without the one-hot encoding, using the selector is straightforward.

Now let's look at how ChiSqSelector is used and how to find the selected features by name. For an example I'll create a df with only 2 relevant columns (AVProductStatesIdentifier and Firewall); the other 2 (IsBeta and IsProtected) will be constant:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
import numpy as np
import pandas as pd

# create the test data in pandas: IsBeta and IsProtected are constant
df_p = pd.DataFrame([np.ones(1000, dtype=int),
                     np.ones(1000, dtype=int),
                     np.random.randint(0, 500, 1000, dtype=int),
                     np.random.randint(0, 2, 1000, dtype=int)
                     ], index=['IsBeta', 'IsProtected', 'Firewall', 'HasDetections']).T
df_p['AVProductStatesIdentifier'] = np.random.choice(['a', 'b', 'c'], 1000)

schema = StructType([StructField("IsBeta", IntegerType(), True),
                     StructField("AVProductStatesIdentifier", StringType(), True),
                     StructField("IsProtected", IntegerType(), True),
                     StructField("Firewall", IntegerType(), True),
                     StructField("HasDetections", IntegerType(), True),
                     ])

df = spark.createDataFrame(
    df_p[['IsBeta', 'AVProductStatesIdentifier', 'IsProtected', 'Firewall', 'HasDetections']],
    schema
)

First, let's make the column AVProductStatesIdentifier categorical:

# build a string -> index mapping from the distinct values
mapping = {l.AVProductStatesIdentifier: i
           for i, l in enumerate(df.select('AVProductStatesIdentifier').distinct().collect())}

# turn the mapping into a Spark map expression and apply it to the column
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df = df.withColumn("AVProductStatesIdentifier", mapping_expr.getItem(col("AVProductStatesIdentifier")))
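As an aside, the same string-to-index step can also be done with StringIndexer from pyspark.ml.feature. Here is a minimal sketch of that alternative, applied to the original string column before the manual mapping above (the output column name is made up for illustration):

from pyspark.ml.feature import StringIndexer

# one-step alternative to the manual create_map approach above;
# "AVProductStatesIdentifier_idx" is just an illustrative output name
indexer = StringIndexer(inputCol="AVProductStatesIdentifier",
                        outputCol="AVProductStatesIdentifier_idx")
df_indexed = indexer.fit(df).transform(df)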

Now, let's assemble the features and select the 2 most important columns:

from pyspark.ml.feature import ChiSqSelector, VectorAssembler

vec_assembler = VectorAssembler(
    inputCols=["IsBeta", "AVProductStatesIdentifier", "IsProtected", "Firewall"],
    outputCol="features")
vec_df = vec_assembler.transform(df)

selector = ChiSqSelector(numTopFeatures=2, featuresCol="features", fpr=0.05,
                         outputCol="selectedFeatures", labelCol="HasDetections")
model = selector.fit(vec_df)

Now execute:

# map the selected vector indices back to column names
np.array(df.columns)[model.selectedFeatures]

which results in

array(['AVProductStatesIdentifier', 'Firewall'], dtype='<U25')

The two non-constant columns.
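Note that model.selectedFeatures holds indices into the assembled features vector, not into df.columns; the line above only works because the assembler's input columns happen to be the first four columns of df. A safer variant, assuming the vec_assembler from above is still in scope, is to index the assembler's own input list:

# selectedFeatures are positions in the assembled vector, so index the
# assembler's input column list rather than df.columns
selected_names = [vec_assembler.getInputCols()[i] for i in model.selectedFeatures]
print(selected_names)  # ['AVProductStatesIdentifier', 'Firewall']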

answered Oct 25 '25 by pythonic833

