Say I have categorical features in a dataframe. To do ML on the dataframe, I one-hot encode the categorical columns using OneHotEncoderEstimator() and then use VectorAssembler() to assemble all the features into a single column. While reading the Spark docs I've seen VectorIndexer() used to index categorical features in a features vector column. If I have already one-hot encoded the categorical columns before building the features vector column, is there any point in applying VectorIndexer() to it?
OneHotEncoder(Estimator) and VectorIndexer are quite different beasts and are not interchangeable. OneHotEncoder(Estimator) is used primarily when the downstream process uses a linear model (it can also be used with Naive Bayes).
Let's consider a simple Dataset
val df = Seq(1.0, 2.0, 3.0).toDF
and a Pipeline
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._
val m1 = new Pipeline().setStages(Array(
  new OneHotEncoderEstimator()
    .setInputCols(Array("value"))
    .setOutputCols(Array("features"))
)).fit(df)
If such a model is applied to our data, the data will be one-hot encoded (depending on its configuration, OneHotEncoderEstimator supports both one-hot encoding and dummy encoding). In other words, each level, excluding the reference level, will be represented as a separate binary column:
m1.transform(df).schema("features").metadata
org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"binary":[{"idx":0,"name":"0"},{"idx":1,"name":"1"},{"idx":2,"name":"2"}]},"num_attrs":3}}
Please note that such a representation is inefficient and impractical to use with algorithms that handle categorical features natively.
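To make the "depending on configuration" remark concrete, here is a minimal sketch of switching between dummy encoding and full one-hot encoding with setDropLast. It assumes Spark 2.3 or 2.4, where the class is named OneHotEncoderEstimator (in Spark 3.x it was renamed to OneHotEncoder); the helper name encode is made up for illustration.

```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

// Reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("ohe-drop-last").getOrCreate()

// Same toy data as above: category indices stored as doubles.
val df = spark.createDataFrame(Seq(Tuple1(1.0), Tuple1(2.0), Tuple1(3.0))).toDF("value")

// Hypothetical helper: fit and apply the encoder with the given dropLast setting.
def encode(dropLast: Boolean) = new OneHotEncoderEstimator()
  .setInputCols(Array("value"))
  .setOutputCols(Array("features"))
  .setDropLast(dropLast) // true (the default) = dummy encoding, false = full one-hot
  .fit(df)
  .transform(df)

// Length of the encoded vector under each setting.
val dummySize = encode(dropLast = true).head.getAs[Vector]("features").size
val oneHotSize = encode(dropLast = false).head.getAs[Vector]("features").size

println(s"dropLast=true: $dummySize columns, dropLast=false: $oneHotSize columns")
```

With dropLast set to false the reference level gets its own column, so the vector should be exactly one element longer than in the default case.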
In contrast, VectorIndexer only analyzes the data and adjusts the metadata accordingly:
val m2 = new Pipeline().setStages(Array(
  new VectorAssembler().setInputCols(Array("value")).setOutputCol("raw"),
  new VectorIndexer().setInputCol("raw").setOutputCol("features")
)).fit(df)
m2.transform(df).schema("features").metadata
org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"nominal":[{"ord":false,"vals":["1.0","2.0","3.0"],"idx":0,"name":"value"}]},"num_attrs":1}}
In other words, it is more or less equivalent to a vectorized variant of StringIndexer (you can achieve a similar result, with more control over the output, using a set of StringIndexers followed by VectorAssembler).
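The StringIndexer alternative mentioned above could look roughly like the following sketch. The column names (color, size) and the toy rows are invented for illustration; the idea is one StringIndexer per categorical column, followed by a VectorAssembler that collects the indexed columns.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("string-indexers").getOrCreate()

// Hypothetical toy data with two categorical string columns.
val df = spark.createDataFrame(Seq(
  ("red", "small"), ("green", "large"), ("red", "large")
)).toDF("color", "size")

val categoricalCols = Array("color", "size")

// One StringIndexer per categorical column; each writes "<name>_idx".
val indexers = categoricalCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx")
}

// Assemble the indexed columns into a single features vector.
val assembler = new VectorAssembler()
  .setInputCols(categoricalCols.map(c => s"${c}_idx"))
  .setOutputCol("features")

val stages: Array[PipelineStage] = indexers ++ Array[PipelineStage](assembler)
val indexed = new Pipeline().setStages(stages).fit(df).transform(df)

// Inspect the attribute metadata carried by the assembled vector column.
println(indexed.schema("features").metadata)
```

The extra control comes from configuring each StringIndexer separately, for example its handling of unseen labels, which a single VectorIndexer does not offer per column.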
Such features are unsuitable for linear models, but are valid input for decision trees and tree ensembles.
To summarize: in practice OneHotEncoder(Estimator) and VectorIndexer are mutually exclusive, and the choice of which one to use depends on the downstream process.
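As a rough illustration of that choice, the sketch below builds two separate pipelines over the same column: one that uses VectorIndexer ahead of a decision tree, and one that uses OneHotEncoderEstimator ahead of a logistic regression. The data, column names, and parameter values (maxCategories, regParam) are assumptions made up for the example, and the class name again assumes Spark 2.3 or 2.4.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{DecisionTreeClassifier, LogisticRegression}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, VectorAssembler, VectorIndexer}
import org.apache.spark.sql.SparkSession

// Reuses the existing session when run inside spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("encoder-choice").getOrCreate()

// Hypothetical toy data: a binary label and one category stored as an index.
val df = spark.createDataFrame(Seq(
  (0.0, 0.0), (1.0, 1.0), (0.0, 2.0), (1.0, 1.0), (0.0, 0.0), (1.0, 2.0)
)).toDF("label", "category")

// Tree model: keep a single column and let VectorIndexer mark it as categorical.
val treePipeline = new Pipeline().setStages(Array(
  new VectorAssembler().setInputCols(Array("category")).setOutputCol("raw"),
  new VectorIndexer().setInputCol("raw").setOutputCol("features").setMaxCategories(10),
  new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")
))

// Linear model: expand the category into binary columns first.
val linearPipeline = new Pipeline().setStages(Array(
  new OneHotEncoderEstimator()
    .setInputCols(Array("category"))
    .setOutputCols(Array("category_vec")),
  new VectorAssembler().setInputCols(Array("category_vec")).setOutputCol("features"),
  new LogisticRegression().setLabelCol("label").setFeaturesCol("features").setRegParam(0.1)
))

val treePredictions = treePipeline.fit(df).transform(df)
val linearPredictions = linearPipeline.fit(df).transform(df)

treePredictions.select("category", "prediction").show()
linearPredictions.select("category", "prediction").show()
```

Neither pipeline applies both transformers to the same column, which is the practical meaning of "mutually exclusive" here.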