Format of the input dataset for Google AutoML Natural Language multi-label text classification

Question

What should the format of the input dataset be for Google AutoML Natural Language multi-label text classification? I know that for multi-class classification I need a column of text and another column for labels. The labels column includes one label per row.

I have multiple labels for each text and I want to do multi-label classification. I tried one-hot encoding with one column per label, but I got this error message: Max 1000 labels supported. Found 9823 labels.

Behzad · Accepted Answer

It was very confusing at first but later I managed to find the format in the documentation, which is a CSV file like:

text1, label1, label2
text2, label2
text3, label3, label2, label1

The parser doesn't understand a table with NULL cells saved as a standard CSV file, which is like:

text1, label1, label2,
text2, label2,,
text3, label3, label2, label1

I had to manually remove extra commas from the CSV file generated by Pandas.
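Instead of removing the commas by hand, the file can be written row by row so that each line holds only the labels that are actually present. The following is a minimal sketch using Python's standard `csv` module, not something taken from the AutoML documentation; the helper names `is_missing` and `write_ragged_csv` and the sample rows are made up for illustration. It assumes each example is available as a text plus a list of label cells, where missing cells are `None`, blank strings, or NaN (as Pandas would produce).

```python
import csv
import io


def is_missing(cell):
    """Return True for None, blank strings, and float NaN (NaN != NaN)."""
    if cell is None:
        return True
    if isinstance(cell, float) and cell != cell:
        return True
    return str(cell).strip() == ""


def write_ragged_csv(rows, out):
    """Write one line per example: the text, then only its non-missing labels."""
    writer = csv.writer(out, lineterminator="\n")
    for text, labels in rows:
        kept = [str(label).strip() for label in labels if not is_missing(label)]
        # Rows may have different lengths, so no trailing commas are emitted.
        writer.writerow([text, *kept])


# Hypothetical sample data mirroring the three rows above.
rows = [
    ("text1", ["label1", "label2", None]),
    ("text2", ["label2", None, None]),
    ("text3", ["label3", "label2", "label1"]),
]

buffer = io.StringIO()
write_ragged_csv(rows, buffer)
print(buffer.getvalue())
```

To write to disk, pass a file opened with `open(path, "w", newline="")` instead of the `StringIO` buffer. The `csv` writer also quotes any text that itself contains a comma, which hand-editing the file can easily get wrong.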

Tim Hong · Answer

Google AutoML has updated their parser. The following format is fine:

text1, label1, label2, label3,
text1, label1, label2, ,
text1, label1, label2, , ,

At least that worked for me on 27 January 2019.

Tags:

google-cloud-nl

google-natural-language

google-cloud-automl-nl
