What should the format of the input dataset be for Google AutoML Natural Language multi-label text classification? I know that for multi-class classification I need one column of text and another column for labels, where the labels column includes one label per row.
I have multiple labels for each text and I want to do multi-label classification. I tried having one column per label and one-hot encoding but I got this error message: Max 1000 labels supported. Found 9823 labels.
It was very confusing at first, but I later managed to find the format in the documentation. It is a CSV file with the text in the first column followed by that row's labels, so rows can have different lengths:
text1, label1, label2
text2, label2
text3, label3, label2, label1
The parser doesn't understand a table with NULL cells saved as a standard CSV file, where shorter rows are padded with trailing commas, like:
text1, label1, label2,
text2, label2,,
text3, label3, label2, label1
I had to manually remove extra commas from the CSV file generated by Pandas.
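Instead of editing the file by hand, the trailing commas can be avoided by writing each row with only its non-empty labels. The following is a minimal sketch using Python's standard `csv` module; the function name `to_ragged_csv` and the sample rows are hypothetical, and it assumes each row is the text followed by label cells that may be `None`, NaN, or blank.

```python
import csv
import io


def to_ragged_csv(rows):
    """Serialize rows as CSV, dropping empty label cells from each row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for text, *labels in rows:
        # Keep only non-blank string labels; None and NaN (a float) are skipped.
        kept = [label.strip() for label in labels
                if isinstance(label, str) and label.strip()]
        writer.writerow([text, *kept])
    return buffer.getvalue()


# Hypothetical sample data: a padded table with empty cells.
rows = [
    ("text1", "label1", "label2", None),
    ("text2", "label2", None, None),
    ("text3", "label3", "label2", "label1"),
]
print(to_ragged_csv(rows), end="")
```

With a Pandas DataFrame, the rows could be supplied by `df.itertuples(index=False)`, assuming the text is in the first column and the labels are in the remaining ones.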
Google AutoML has since updated its parser. The following format, with trailing commas, is fine:
text1, label1, label2, label3,
text1, label1, label2, ,
text1, label1, label2, , ,
At least that worked for me on 27th Jan 2019.