This may be a novice question, but I'm unable to work out whether there is any specific advantage to using QuantileDiscretizer over Bucketizer in Spark 2.1.
I understand that QuantileDiscretizer is an estimator and handles NaN values, whereas Bucketizer is a transformer and raises an error if the data contains NaN values.
From the Spark documentation, the code below produces similar outputs:
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.ml.feature import Bucketizer
data = [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)]
df = spark.createDataFrame(data, ["id", "hour"])
result_discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="result").fit(df).transform(df)
result_discretizer.show()
splits = [-float("inf"), 3, 10, float("inf")]
result_bucketizer = Bucketizer(splits=splits, inputCol="hour", outputCol="result").transform(df)
result_bucketizer.show()
Output:
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.2| 0.0|
+---+----+------+
+---+----+------+
| id|hour|result|
+---+----+------+
| 0|18.0| 2.0|
| 1|19.0| 2.0|
| 2| 8.0| 1.0|
| 3| 5.0| 1.0|
| 4| 2.2| 0.0|
+---+----+------+
Please let me know if there is any significant advantage of one over the other.
Bucketizer is used to transform a column of continuous features into a column of feature buckets. We specify n+1 split points to map the continuous features into n buckets. The splits must be in strictly increasing order. Typically, we add Double.NegativeInfinity and Double.PositiveInfinity as the outer bounds of the splits to cover all possible values.
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter.
Bucketizer: the Bucketizer transforms a column of continuous features into a column of feature buckets. The buckets are determined by the splits parameter. A bucket defined by splits x, y holds values in the range [x, y), except the last bucket, which also includes y.
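The [x, y) semantics can be sketched in plain Python. This mimics Bucketizer's assignment rule for illustration only; it is not Spark code, and the `assign_bucket` helper is a hypothetical name:

```python
import bisect
import math

def assign_bucket(value, splits):
    """Mimic Bucketizer's rule: bucket i holds [splits[i], splits[i+1]);
    the last bucket also includes its upper bound."""
    if value == splits[-1]:          # last bucket is closed on the right
        return len(splits) - 2
    return bisect.bisect_right(splits, value) - 1

splits = [-math.inf, 3.0, 10.0, math.inf]
hours = [18.0, 19.0, 8.0, 5.0, 2.2]
print([assign_bucket(h, splits) for h in hours])  # [2, 2, 1, 1, 0]
```

Note how a value equal to an interior split (e.g. 3.0) falls into the bucket on its right, matching the half-open [x, y) intervals.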
The number of bins is set using the numBuckets parameter. The number of buckets actually used may be smaller than this value, for example if there are too few distinct values of the input to create enough distinct quantiles. Since 3.0.0, QuantileDiscretizer can map multiple columns at once by setting the inputCols parameter; if both the inputCol and inputCols parameters are set, an Exception will be thrown.
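The "fewer distinct quantiles means fewer buckets" behavior can be illustrated in plain Python. This sketch uses statistics.quantiles as a stand-in for Spark's approxQuantile, so the exact split values will differ from Spark's; `quantile_splits` is a hypothetical helper, not a Spark API:

```python
import math
import statistics

def quantile_splits(values, num_buckets):
    """Sketch of how QuantileDiscretizer derives splits from the data:
    take num_buckets - 1 quantile cut points, drop duplicates, and pad
    with +/- infinity. (Spark uses approxQuantile; this is illustrative.)"""
    cuts = statistics.quantiles(values, n=num_buckets, method="inclusive")
    distinct = sorted(set(cuts))
    return [-math.inf] + distinct + [math.inf]

# Plenty of distinct values: we get the requested 3 buckets (4 splits).
print(len(quantile_splits([2.2, 5.0, 8.0, 18.0, 19.0], 3)) - 1)  # 3

# Too few distinct values: duplicate quantile cut points collapse,
# leaving fewer buckets than numBuckets requested.
print(len(quantile_splits([1.0, 1.0, 1.0, 1.0, 2.0], 3)) - 1)  # 2
```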
Note that "bucketing" in Spark SQL and Hive is a different technique, used to optimize task performance: there, the bucketing (clustering) columns determine how data is partitioned and prevent data shuffles, with rows allocated to a predefined number of buckets based on the values of one or more bucketing columns. It is unrelated to the ML Bucketizer discussed here.
QuantileDiscretizer determines the bucket splits based on the data.
Bucketizer puts data into buckets that you specify via splits.
So use Bucketizer when you know the buckets you want, and QuantileDiscretizer to estimate the splits for you.
That the outputs are similar in the example is due to the contrived data and the splits chosen. Results may vary significantly in other scenarios.
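To see the two diverge, compare fixed splits against data-driven splits on skewed data. This is again a plain-Python sketch: the assignment rule mimics Bucketizer, and statistics.quantiles stands in for Spark's approxQuantile, so only the qualitative difference carries over:

```python
import bisect
import math
import statistics

def assign_bucket(value, splits):
    # Bucketizer-style assignment: bucket i holds [splits[i], splits[i+1]);
    # the last bucket also includes its upper bound.
    if value == splits[-1]:
        return len(splits) - 2
    return bisect.bisect_right(splits, value) - 1

skewed = [1.0, 1.0, 1.0, 1.0, 100.0]

# Bucketizer-style: splits fixed up front, oblivious to the distribution.
fixed = [-math.inf, 3.0, 10.0, math.inf]
print([assign_bucket(v, fixed) for v in skewed])      # [0, 0, 0, 0, 2]

# QuantileDiscretizer-style: splits estimated from the data itself.
cuts = sorted(set(statistics.quantiles(skewed, n=3, method="inclusive")))
estimated = [-math.inf] + cuts + [math.inf]
print([assign_bucket(v, estimated) for v in skewed])  # [1, 1, 1, 1, 1]
```

With heavy skew, the quantile-based splits collapse around the repeated value while the hand-picked splits do not, which is exactly why the two tools suit different situations.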