I have a very large folder of images, as well as a CSV file containing the class labels for each of those images. Because it's all in one giant folder, I'd like to split them up into training/test/validation sets; maybe create three new folders and move images into each based on a Python script of some kind. I'd like to do stratified sampling so I can keep the % of classes the same across all three sets.
How would I go about writing a script that does this?
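With scikit-learn, we can achieve this by setting the “stratify” argument of train_test_split() to the y component (the class labels) of the original dataset. The function then ensures that both the train and test sets have the same proportion of examples in each class as the provided “y” array. Since each call produces only two sets, a train/validation/test split takes two calls: one to split off the test set, and a second to split the remainder into training and validation.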
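To make the idea concrete for a folder of images plus a CSV, here is a minimal sketch that does the stratified split by hand with only the standard library. It is a substitute for train_test_split(), not a call to it: it shuffles each class separately and cuts every class with the same ratios. The function names, the CSV column names (filename, label), and the train/val/test folder names are assumptions made for illustration.

```python
import csv
import random
import shutil
from collections import defaultdict
from pathlib import Path


def read_labels(csv_path, file_col="filename", label_col="label"):
    """Read {filename: label} from the CSV (column names are assumed)."""
    with open(csv_path, newline="") as f:
        return {row[file_col]: row[label_col] for row in csv.DictReader(f)}


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=1337):
    """Split {filename: label} into train/val/test, one class at a time."""
    by_class = defaultdict(list)
    for name, label in labels.items():
        by_class[label].append(name)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        names = sorted(by_class[label])
        rng.shuffle(names)
        n_train = int(len(names) * ratios[0])
        n_val = int(len(names) * ratios[1])
        splits["train"] += names[:n_train]
        splits["val"] += names[n_train:n_train + n_val]
        splits["test"] += names[n_train + n_val:]  # remainder goes to test
    return splits


def copy_splits(splits, src_dir, dst_dir):
    """Copy each file into dst_dir/<split>/, leaving the originals in place."""
    for split_name, names in splits.items():
        target = Path(dst_dir) / split_name
        target.mkdir(parents=True, exist_ok=True)
        for name in names:
            shutil.copy2(Path(src_dir) / name, target / name)
```

A typical run would be copy_splits(stratified_split(read_labels("labels.csv")), "images", "output"), where labels.csv and images are placeholder names. Copying rather than moving keeps the original folder intact until the result has been checked; very small classes may end up with no validation or test images because the counts are rounded down.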
Using the file panel, select the zip folder that you want to split. Click Add to Zip and select the split option. Choose the save location and split the folder.
Use the Python library split-folders.
pip install split-folders
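Put all the images in a folder named Data, with one subfolder per class (for example Data/cat/ and Data/dog/). splitfolders takes the class of each image from its subfolder and splits every class with the same ratio.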
Then apply as follows:
import splitfolders
splitfolders.ratio('Data', output='output', seed=1337, ratio=(.8, .1, .1))
Running the snippet above creates three folders in the output directory: train, val, and test.
The share of images in each folder is set by the values of the ratio argument, given in the order (train, val, test).
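The question starts from one flat folder plus a CSV, while splitfolders expects one subfolder per class, so the images have to be sorted by label first. A minimal sketch of that step, using only the standard library, could look like the following. The function name and the CSV column names (filename, label) are assumptions, not part of split-folders.

```python
import csv
import shutil
from pathlib import Path


def sort_into_class_folders(csv_path, src_dir, dst_dir,
                            file_col="filename", label_col="label"):
    """Copy each image listed in the CSV into dst_dir/<label>/.

    Returns the number of files copied. Column names are assumed.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    copied = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            source = src_dir / row[file_col]
            if not source.is_file():
                continue  # skip CSV rows that have no matching image
            class_dir = dst_dir / row[label_col]
            class_dir.mkdir(parents=True, exist_ok=True)
            shutil.copy2(source, class_dir / source.name)
            copied += 1
    return copied
```

After something like sort_into_class_folders("labels.csv", "images", "Data"), where labels.csv and images are placeholder names, the Data folder has the layout that splitfolders.ratio() reads.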