
Addressing categorical features with one hot encoding and vector assembler vs vector indexer

Say I have categorical features in a DataFrame. To do ML on it, I one-hot encode the categorical columns using OneHotEncoderEstimator() and then use VectorAssembler() to assemble all the features into a single column. Reading the Spark docs, I've seen VectorIndexer() used to index categorical features in a features vector column. If I have already one-hot encoded the categorical columns before assembling the features vector column, is there any point in applying VectorIndexer() to it?

asked Sep 06 '25 by rasthiya


1 Answer

OneHotEncoder(Estimator) and VectorIndexer are quite different beasts and are not interchangeable. OneHotEncoder(Estimator) is used primarily when the downstream process uses a linear model (it can also be used with Naive Bayes).

Let's consider a simple Dataset

val df = Seq(1.0, 2.0, 3.0).toDF

and a Pipeline

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val m1 = new Pipeline().setStages(Array(
  new OneHotEncoderEstimator()
   .setInputCols(Array("value")).setOutputCols(Array("features"))
)).fit(df)

If such a model is applied to our data, it will be one-hot encoded (depending on its configuration, OneHotEncoderEstimator supports both one-hot encoding and dummy encoding) - in other words, each level, excluding the reference level, will be represented as a separate binary column:

m1.transform(df).schema("features").metadata
 org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"binary":[{"idx":0,"name":"0"},{"idx":1,"name":"1"},{"idx":2,"name":"2"}]},"num_attrs":3}}

Please note that such a representation is inefficient and impractical to use with algorithms that handle categorical features natively.
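To make the dummy-encoding semantics concrete, here is a rough illustration in plain Scala (no Spark involved; the `oneHot` helper and its `dropLast` flag are hypothetical, mimicking the behavior of OneHotEncoderEstimator's `dropLast` option): each of k levels becomes a binary vector of length k - 1, with the last level serving as the all-zeros reference.

```scala
// Plain-Scala sketch of dummy encoding (not Spark API).
// With dropLast = true, k levels map to vectors of length k - 1,
// and the last level becomes the all-zeros reference vector.
def oneHot(level: Int, numLevels: Int, dropLast: Boolean = true): Array[Double] = {
  val size = if (dropLast) numLevels - 1 else numLevels
  val vec  = Array.fill(size)(0.0)
  if (level < size) vec(level) = 1.0 // reference level stays all-zero when dropLast
  vec
}

// Three levels (0, 1, 2) with dropLast:
// level 0 -> [1.0, 0.0], level 1 -> [0.0, 1.0], level 2 -> [0.0, 0.0]
```

This also shows why the representation is wasteful for tree-based learners: one categorical column explodes into many binary columns, each carrying only a single bit of the original information.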

In contrast, VectorIndexer only analyzes the data and adjusts the metadata accordingly:

val m2 = new Pipeline().setStages(Array(
  new VectorAssembler().setInputCols(Array("value")).setOutputCol("raw"),
  new VectorIndexer().setInputCol("raw").setOutputCol("features")
)).fit(df)

m2.transform(df).schema("features").metadata
org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"nominal":[{"ord":false,"vals":["1.0","2.0","3.0"],"idx":0,"name":"value"}]},"num_attrs":1}}

In other words, it is more or less equivalent to a vectorized variant of StringIndexer (you can achieve a similar result, with more control over the output, using a set of StringIndexers followed by a VectorAssembler).
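The indexing step itself can be sketched in plain Scala (again, no Spark; the `indexCategories` helper is hypothetical, mimicking how an indexer maps distinct values to consecutive category indices - here ordered by value, matching the sorted `vals` seen in the metadata above):

```scala
// Plain-Scala sketch of categorical indexing (not Spark API).
// Distinct values are mapped to consecutive indices 0..n-1;
// the column stays a single column, only its interpretation changes.
def indexCategories(values: Seq[Double]): Map[Double, Int] =
  values.distinct.sorted.zipWithIndex.toMap

// Seq(1.0, 2.0, 3.0, 2.0) -> Map(1.0 -> 0, 2.0 -> 1, 3.0 -> 2)
```

The key point is that nothing is widened: the feature remains one column, and downstream tree algorithms read the category information from metadata rather than from extra binary columns.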

Such features are unsuitable for linear models, but are valid input for decision trees and tree ensembles.

To summarize - in practice OneHotEncoder(Estimator) and VectorIndexer are mutually exclusive, and the choice of which one to use depends on the downstream process.

answered Sep 09 '25 by 10465355