I wish to apply cross-validation to an LDA algorithm to determine the number of topics (K). My question is about the evaluator, since I want to use the log-likelihood as the metric. What do I set in .setEvaluator(????) when creating the cross-validator?
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// Define a simple LDA
val lda = new LDA()
  .setMaxIter(10)
  .setFeaturesCol("features")
// We use a ParamGridBuilder to construct a grid of parameters to search over.
val range = 2 to 20 // Spark's LDA requires k > 1
val paramGrid = new ParamGridBuilder()
  .addGrid(lda.k, range.toArray)
  .build()
// Create a CrossValidator
val cv = new CrossValidator()
  .setEstimator(lda)
  .setEvaluator(????)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
Cross-validation isn't going to be straightforward to apply when you are effectively doing unsupervised learning. Unless you have labelled training data, the interfaces provided by the CrossValidator are unlikely to be appropriate. The fact that you're trying different values of k, the number of topics produced by LDA, suggests that you may not have this sort of labelled training data.
If you were to try re-purposing the CrossValidator, I don't think there is any suitable Evaluator available (at least as of Spark 2.2). Moreover, if you are exploring models of different dimensionality (such as varying the number of topics, k), the log-likelihood of the data is not trivial to compare across models: as you increase the number of topics, you'd expect the likelihood of the data to increase, but at the risk of overfitting. One standard approach is to use something like the Akaike Information Criterion (AIC) to penalize models with more complexity (e.g. greater k). Again, I don't think that's currently supported by the CrossValidator.
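As a sketch of this idea, you could skip the CrossValidator entirely and sweep k yourself, fitting each model on a training split and scoring it by held-out log-likelihood via LDAModel.logLikelihood, with an AIC-style penalty. The helper names (aic, selectK), the 80/20 split, the starting point of k = 2, and the rough parameter count of k × vocabSize are my assumptions, not anything from Spark itself:

```scala
import org.apache.spark.ml.clustering.{LDA, LDAModel}
import org.apache.spark.sql.DataFrame

// AIC = 2 * numParams - 2 * logLikelihood; lower is better.
// (Pure helper, independent of Spark.)
def aic(logLikelihood: Double, numParams: Long): Double =
  2.0 * numParams - 2.0 * logLikelihood

// Hypothetical manual sweep over k in place of a CrossValidator.
// `dataset` must have a "features" column; `vocabSize` is the vocabulary
// size used to build the feature vectors.
def selectK(dataset: DataFrame, vocabSize: Long, ks: Seq[Int]): Int = {
  // Hold out a validation split so the log-likelihood is out-of-sample.
  val Array(train, valid) = dataset.randomSplit(Array(0.8, 0.2), seed = 42L)
  val scored = ks.map { k =>
    val model: LDAModel = new LDA()
      .setK(k)                       // k must be > 1 in Spark's LDA
      .setMaxIter(10)
      .setFeaturesCol("features")
      .fit(train)
    val ll = model.logLikelihood(valid)   // held-out log-likelihood
    // Rough complexity estimate: one weight per (topic, word) pair.
    val numParams = k.toLong * vocabSize
    (k, aic(ll, numParams))
  }
  scored.minBy(_._2)._1                   // k with the lowest AIC wins
}
```

This is just one way to trade likelihood against complexity; you could equally compare models by logPerplexity on the held-out split, or use BIC instead of AIC.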