I am using TrainTestSplit in ML.NET to repeatedly split my data set into a training set and a test set. In sklearn, for example, the corresponding function takes a seed as input, so it is possible to obtain different splits, but in ML.NET repeated calls to TrainTestSplit seem to return the same split. Is it possible to change the random seed used by TrainTestSplit?
Right now TrainTestSplit doesn't take a random seed. There is an open issue in ML.NET to fix this: https://github.com/dotnet/machinelearning/issues/1635
As a short-term workaround, I recommend manually adding a random column to the data view, and using it as a stratificationColumn in TrainTestSplit:
data = new GenerateNumberTransform(mlContext, new GenerateNumberTransform.Arguments
{
    Column = new[] { new GenerateNumberTransform.Column { Name = "random" } },
    Seed = 42 // change seed to get a different split
}, data);
(var train, var test) = mlContext.Regression.TrainTestSplit(data, stratificationColumn: "random");
This code will work with ML.NET 0.7, and we will fix the seed issue in 0.8.
As of today (ML.NET v1.0), this has been solved. TrainTestSplit takes a seed as input, and it also supports stratification by setting samplingKeyColumnName:
TrainTestSplit(IDataView data, double testFraction = 0.1, string samplingKeyColumnName = null, Nullable<int> seed = null);
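For illustration, here is a minimal sketch of how the seed parameter might be used to get different splits in ML.NET 1.0 or later. It assumes the Microsoft.ML NuGet package is referenced. The Row and SplitDemo classes, the SplitCounts helper, and the 100-row toy data set are made up for this example and are not part of ML.NET.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type, used only for this illustration.
public class Row
{
    public float Value { get; set; }
}

public static class SplitDemo
{
    // Splits a toy 100-row data set with the given seed and
    // returns the number of rows in the train and test sets.
    public static (int Train, int Test) SplitCounts(int seed)
    {
        var mlContext = new MLContext();

        // Toy data: 100 rows with values 0..99 (assumption for the example).
        var rows = Enumerable.Range(0, 100)
                             .Select(i => new Row { Value = i })
                             .ToArray();
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Changing the seed changes which rows land in the test set.
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2, seed: seed);

        int train = mlContext.Data
            .CreateEnumerable<Row>(split.TrainSet, reuseRowObject: false)
            .Count();
        int test = mlContext.Data
            .CreateEnumerable<Row>(split.TestSet, reuseRowObject: false)
            .Count();
        return (train, test);
    }

    public static void Main()
    {
        foreach (int seed in new[] { 1, 2, 3 })
        {
            var (train, test) = SplitCounts(seed);
            Console.WriteLine($"seed={seed}: train={train}, test={test}");
        }
    }
}
```

Note that testFraction is only approximate, since rows are assigned to the test set at random, so the counts can vary a little from seed to seed. Passing the same seed again should reproduce the same split.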