
Addressing categorical features with one hot encoding and vector assembler vs vector indexer

Say I have categorical features in a DataFrame. To do ML on it, I one-hot encode the categorical columns using OneHotEncoderEstimator() and then use VectorAssembler() to assemble all the features into a single column. Reading the Spark docs, I've seen VectorIndexer() used to index categorical features in a features vector column. If I have already one-hot encoded the categorical columns before assembling the features vector column, is there any point in applying VectorIndexer() to it?

asked Sep 06 '25 by rasthiya


1 Answer

OneHotEncoder(Estimator) and VectorIndexer are quite different beasts and are not interchangeable. OneHotEncoder(Estimator) is used primarily when the downstream process uses a linear model (it can also be used with Naive Bayes).

Let's consider a simple Dataset

val df = Seq(1.0, 2.0, 3.0).toDF

and a Pipeline

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature._

val m1 = new Pipeline().setStages(Array(
  new OneHotEncoderEstimator()
   .setInputCols(Array("value")).setOutputCols(Array("features"))
)).fit(df)

If such a model is applied to our data, it will be one-hot encoded (depending on its configuration, OneHotEncoderEstimator supports both one-hot encoding and dummy encoding) - in other words, each level, excluding the reference level, will be represented as a separate binary column:

m1.transform(df).schema("features").metadata
 org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"binary":[{"idx":0,"name":"0"},{"idx":1,"name":"1"},{"idx":2,"name":"2"}]},"num_attrs":3}}

Please note that such a representation is inefficient and impractical to use with algorithms that handle categorical features natively.
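To make the dummy-encoding semantics concrete, here is a rough illustration in plain Scala (no Spark involved; the `oneHot` helper and its `dropLast` flag are hypothetical, mimicking the behavior of OneHotEncoderEstimator's `dropLast` option): each of k levels becomes a binary vector of length k - 1, with the last level serving as the all-zeros reference.

```scala
// Plain-Scala sketch of dummy encoding (not Spark API).
// With dropLast = true, k levels map to vectors of length k - 1,
// and the last level becomes the all-zeros reference vector.
def oneHot(level: Int, numLevels: Int, dropLast: Boolean = true): Array[Double] = {
  val size = if (dropLast) numLevels - 1 else numLevels
  val vec  = Array.fill(size)(0.0)
  if (level < size) vec(level) = 1.0 // reference level stays all-zero when dropLast
  vec
}

// Three levels (0, 1, 2) with dropLast:
// level 0 -> [1.0, 0.0], level 1 -> [0.0, 1.0], level 2 -> [0.0, 0.0]
```

This also shows why the representation is wasteful for tree-based learners: one categorical column explodes into many binary columns, each carrying only a single bit of the original information.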

In contrast, VectorIndexer only analyzes the data and adjusts the metadata accordingly:

val m2 = new Pipeline().setStages(Array(
  new VectorAssembler().setInputCols(Array("value")).setOutputCol("raw"),
  new VectorIndexer().setInputCol("raw").setOutputCol("features")
)).fit(df)

m2.transform(df).schema("features").metadata
org.apache.spark.sql.types.Metadata = {"ml_attr":{"attrs":{"nominal":[{"ord":false,"vals":["1.0","2.0","3.0"],"idx":0,"name":"value"}]},"num_attrs":1}}

In other words, it is more or less equivalent to a vectorized variant of StringIndexer (you can achieve a similar result, with more control over the output, using a set of StringIndexers followed by a VectorAssembler).
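The indexing step itself can be sketched in plain Scala (again, no Spark; the `indexCategories` helper is hypothetical, mimicking how an indexer maps distinct values to consecutive category indices - here ordered by value, matching the sorted `vals` seen in the metadata above):

```scala
// Plain-Scala sketch of categorical indexing (not Spark API).
// Distinct values are mapped to consecutive indices 0..n-1;
// the column stays a single column, only its interpretation changes.
def indexCategories(values: Seq[Double]): Map[Double, Int] =
  values.distinct.sorted.zipWithIndex.toMap

// Seq(1.0, 2.0, 3.0, 2.0) -> Map(1.0 -> 0, 2.0 -> 1, 3.0 -> 2)
```

The key point is that nothing is widened: the feature remains one column, and downstream tree algorithms read the category information from metadata rather than from extra binary columns.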

Such features are unsuitable for linear models, but are valid input for decision trees and tree ensembles.

To summarize - in practice OneHotEncoder(Estimator) and VectorIndexer are mutually exclusive, and the choice of which one to use depends on the downstream process.

answered Sep 09 '25 by 10465355