How to extract average metrics with Cross-Validation in PySpark

I'm trying to perform cross-validation over a Random Forest in Spark 1.6.0 and I'm finding it hard to obtain the evaluation metrics (precision, recall, f1...). I want the average of the metrics over all folds. Is it possible to obtain them with CrossValidator and MulticlassClassificationEvaluator?

I have only found examples where the evaluation is performed later on an independent test dataset, using the best model from the cross-validation. I'm not planning to use separate train and test sets; I want to use the whole dataframe (df) for the cross-validation, let it make the splits, and then take the average metrics.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Empty grid: the pipeline is run with its default parameters on every fold.
paramGrid = ParamGridBuilder().build()
evaluator = MulticlassClassificationEvaluator()  # defaults to the f1 metric

crossval = CrossValidator(
    estimator=pipeline,  # a previously built Pipeline ending in a RandomForestClassifier
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=5)

model = crossval.fit(df)

evaluator.evaluate(model.transform(df))

For now, I obtain the best model's metric with the last line of the code above, evaluator.evaluate(model.transform(df)), but I'm not totally sure that I'm doing it correctly.

asked Oct 24 '25 by Ed.


1 Answer

In Spark 2.x, you can get the average metrics with model.avgMetrics. This returns an array of doubles with one entry per parameter map in the grid; each entry is the evaluator's metric for that parameter combination, averaged over the folds.

For MulticlassClassificationEvaluator the metric defaults to f1; it can be switched to weightedPrecision, weightedRecall, or accuracy with the setMetricName setter on the evaluator (see the MulticlassClassificationEvaluator documentation).
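
As a minimal sketch of that workflow, reusing the pipeline and df names from the question (both assumed to be defined as there):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Average weighted recall over the folds instead of the default f1.
evaluator = MulticlassClassificationEvaluator().setMetricName("weightedRecall")

crossval = CrossValidator(
    estimator=pipeline,                        # as defined in the question
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=evaluator,
    numFolds=5)

cvModel = crossval.fit(df)

# With an empty grid there is exactly one entry: the weightedRecall
# averaged over the 5 folds.
print(cvModel.avgMetrics[0])

Note that CrossValidator takes a single evaluator, so to collect several metrics this way you would run the cross-validation once per metric.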

If you also need the best model parameters chosen by the cross-validator, see my answer here.
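
In the same spirit, a small hedged sketch of that idea: pair avgMetrics with the parameter grid and take the best-scoring combination (this assumes a larger-is-better metric, which holds for all MulticlassClassificationEvaluator metrics):

# Pair each parameter map with its cross-validated average metric
# and pick the highest-scoring combination.
best_params, best_metric = max(
    zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics),
    key=lambda pair: pair[1])
print(best_metric, best_params)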

answered Oct 25 '25 by Algorithman


