How can I handle these datasets to create a DatasetDict?

Question

I'm trying to build a DatasetDict object to train a QA model in PyTorch. I have these two datasets:

test_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 21489
})

and

train_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 54159
})

I didn't find anything in the datasets documentation. I'm quite new to this, so the solution may be really easy. What I want to obtain is something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

I can't find how to combine two datasets into a DatasetDict or how to set the keys. I would also like to split the train set in two, a train set and a validation set, but I can't work out this step either. The final result should look something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159 - x
    })
    validation: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: x
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

Thank you in advance and pardon me for being a noob :)

Lin · Accepted Answer

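To get the validation dataset, you can split the training set like this:

train_dataset, validation_dataset = train_dataset.train_test_split(test_size=0.1).values()

train_test_split returns a DatasetDict with the keys "train" and "test", so unpacking .values() gives the two parts in that order. With test_size=0.1, a random 10% of the rows goes to validation_dataset and the remaining 90% stays in train_dataset.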

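To build the DatasetDict, pass a plain dict that maps each split name to its dataset. The keys of that dict become the split names:

import datasets

# The keys are the split names; validation_dataset comes from the split above.
dd = datasets.DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset,
})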


Tags:

python

dataset

deep-learning

pytorch

nlp-question-answering

Peppe95
