
How can I combine these datasets to create a DatasetDict?

I'm trying to build a DatasetDict object to train a QA model in PyTorch. I have these two different datasets:

test_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 21489
})

and

train_dataset

Dataset({
    features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
    num_rows: 54159
})

I didn't find anything about this in the datasets documentation. I'm quite a noob, so the solution may be really easy. What I want to obtain is something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

I can't figure out how to combine two datasets into a DatasetDict or how to set its keys. Moreover, I want to "cut" the train set in two, a train set and a validation set, but this step is also hard for me to handle. The final result should be something like this:

dataset

DatasetDict({
    train: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 54159 - x
    })
    validation: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: x
    })
    test: Dataset({
        features: ['answer_text', 'answer_start', 'title', 'context', 'question', 'answers', 'id'],
        num_rows: 21489
    })
})

Thank you in advance and pardon me for being a noob :)

Peppe95 asked Dec 06 '25 17:12



1 Answer

To get the validation dataset, you can split the train dataset like this:

train_dataset, validation_dataset = train_dataset.train_test_split(test_size=0.1).values()

This call shuffles the train dataset, moves 10% of its rows into the validation dataset, and keeps the remaining 90% as the new train dataset.

To obtain the DatasetDict, you can do this:

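import datasets
# Keys are the split names; include the validation split created above.
dd = datasets.DatasetDict({"train": train_dataset, "validation": validation_dataset, "test": test_dataset})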
Lin answered Dec 08 '25 07:12



