
Format of the input dataset for Google AutoML Natural Language multi-label text classification

What format should the input dataset have for Google AutoML Natural Language multi-label text classification? I know that multi-class classification needs one column of text and another column of labels, with one label per row.

I have multiple labels for each text and I want to do multi-label classification. I tried one column per label with one-hot encoding, but I got this error message: "Max 1000 labels supported. Found 9823 labels."

Behzad asked Oct 20 '25 10:10



2 Answers

It was confusing at first, but I eventually found the format in the documentation. It is a CSV file in which each row holds the text followed by only that row's labels:

text1, label1, label2
text2, label2
text3, label3, label2, label1

The parser doesn't accept a table with NULL cells saved as a standard CSV file, where short rows are padded with empty cells:

text1, label1, label2,
text2, label2,,
text3, label3, label2, label1

I had to manually remove extra commas from the CSV file generated by Pandas.
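
The same cleanup can probably be scripted instead of done by hand. Below is a minimal sketch, assuming the padded file is small enough to hold in memory as a string; the function name strip_empty_cells is hypothetical, and only Python's standard csv module is used, so it is not tied to any AutoML or pandas API.

```python
import csv
import io


def strip_empty_cells(padded_csv: str) -> str:
    """Return the CSV text with empty cells removed from every row.

    Hypothetical helper: it turns a rectangular, comma-padded table
    into the ragged "text, label, label, ..." layout shown above.
    """
    reader = csv.reader(io.StringIO(padded_csv), skipinitialspace=True)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # Keep only cells that contain something other than whitespace.
        cells = [cell.strip() for cell in row if cell.strip()]
        if cells:  # skip rows that are entirely empty
            writer.writerow(cells)
    return out.getvalue()


# Example input: the padded layout a rectangular table produces.
padded = "text1,label1,label2,\ntext2,label2,,\ntext3,label3,label2,label1\n"
ragged = strip_empty_cells(padded)
print(ragged)
```

Going through csv.reader and csv.writer rather than stripping commas with a string replace keeps text cells that themselves contain commas intact, because the writer quotes them again on output.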

Behzad answered Oct 22 '25 05:10



Google AutoML has updated its parser. The following format, with trailing empty cells, is now accepted:

text1, label1, label2, label3,
text1, label1, label2, ,
text1, label1, label2, , ,

At least, that worked for me on 27th Jan 2019.

Tim Hong answered Oct 22 '25 03:10



