I have a very large dataset: 7.9 GB of CSV files, 80% of which will serve as training data and the remaining 20% as test data. When I load the training data (6.2 GB), I get a MemoryError at the 80th iteration (80th file). Here's the script I'm using to load the data:
import pandas as pd
import os
col_names = ['duration', 'service', 'src_bytes', 'dest_bytes', 'count', 'same_srv_rate',
        'serror_rate', 'srv_serror_rate', 'dst_host_count', 'dst_host_srv_count',
        'dst_host_same_src_port_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate',
        'flag', 'ids_detection', 'malware_detection', 'ashula_detection', 'label', 'src_ip_add',
        'src_port_num', 'dst_ip_add', 'dst_port_num', 'start_time', 'protocol']
# create a list to store the filenames
files = []
# create a dataframe to store the contents of CSV files
df = pd.DataFrame()
# get the filenames in the specified PATH
for (dirpath, dirnames, filenames) in os.walk(path):
    ''' Append to the list the filenames under the subdirectories of the <path> '''
    files.extend(os.path.join(dirpath, filename) for filename in filenames)
for file in files:
    df = df.append(pd.read_csv(filepath_or_buffer=file, names=col_names, engine='python'))
    print('Appending file : {file}'.format(file=file))
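# None disables column-width truncation (a negative value such as -1 is deprecated in newer pandas)
pd.set_option('display.max_colwidth', None)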
print(df)
There are 130 files in the 6.2 GB worth of CSV files.
Money-costing solution: One possible solution is to buy a new computer with a more powerful CPU and enough RAM to hold the entire dataset. Or rent cloud or virtual machines and set up a cluster to share the workload.
Note: As our dataset is too large to fit in memory, we have to load it from the hard disk in batches. To do so, we are going to create a custom generator that reads the dataset from disk one batch at a time.
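A minimal sketch of such a generator, assuming the CSV files have no header row (as in the question, where column names are supplied separately). It uses only the standard-library `csv` module rather than pandas, and the name `batch_generator` and the `batch_size` parameter are illustrative, not taken from the original post.

```python
import csv
from itertools import islice


def batch_generator(files, batch_size):
    """Yield lists of at most `batch_size` rows, reading one file at a time.

    Only the current batch is held in memory. Each row is a list of strings,
    and a batch never spans two files.
    """
    for file in files:
        with open(file, newline='') as handle:
            reader = csv.reader(handle)
            while True:
                # islice pulls the next `batch_size` rows without reading the rest
                batch = list(islice(reader, batch_size))
                if not batch:
                    break
                yield batch
```

The caller would loop over `batch_generator(files, batch_size)` and hand each batch to the training step, so memory use depends on `batch_size` rather than on the total size of the dataset.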
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning).
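As a rough illustration of out-of-core learning, the sketch below updates a logistic-regression model one mini-batch at a time, so only the current batch has to be in memory. The choice of model, the `OnlineLogisticRegression` class, and the learning rate are all illustrative assumptions, not something the original post prescribes.

```python
import math


class OnlineLogisticRegression:
    """Binary logistic regression trained with per-sample gradient steps."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        # Clamp z so math.exp cannot overflow on extreme inputs
        z = max(-30.0, min(30.0, z))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, batch):
        """Update the model from one batch of (features, label) pairs."""
        for x, y in batch:
            error = self.predict_proba(x) - y
            self.weights = [w - self.learning_rate * error * xi
                            for w, xi in zip(self.weights, x)]
            self.bias -= self.learning_rate * error
```

Each call to `partial_fit` sees only one batch, so batches can be streamed from disk and discarded after the update.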
For large datasets - and we may already count 6.2GB as large - reading all the data in at once might not be the best idea. As you are going to train your network batch by batch anyway, it is sufficient to only load the data you need for the batch which is going to be used next.
The TensorFlow documentation provides a good overview of how to implement a data reading pipeline. According to the linked documentation, the stages are:
- The list of filenames
- Optional filename shuffling
- Optional epoch limit
- Filename queue
- A Reader for the file format
- A decoder for a record read by the reader
- Optional preprocessing
- Example queue
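The stages listed above can be sketched with plain Python generators. This is not the TensorFlow API; it only mirrors the order of the stages, and every function name here (`filename_stream`, `read_records`, `decode`, `batches`, `input_pipeline`) is hypothetical.

```python
import csv
import random


def filename_stream(filenames, num_epochs=1, shuffle=True, seed=None):
    """Filename list, optional shuffling, epoch limit, and filename queue."""
    rng = random.Random(seed)
    names = list(filenames)
    for _ in range(num_epochs):
        if shuffle:
            rng.shuffle(names)
        yield from names


def read_records(filename_iter):
    """Reader: yield one raw CSV row (a list of strings) at a time."""
    for name in filename_iter:
        with open(name, newline='') as handle:
            yield from csv.reader(handle)


def decode(record):
    """Decoder: convert numeric fields to float, leave the rest as strings."""
    decoded = []
    for field in record:
        try:
            decoded.append(float(field))
        except ValueError:
            decoded.append(field)
    return decoded


def batches(examples, batch_size):
    """Stand-in for the example queue: group examples into batches."""
    batch = []
    for example in examples:
        batch.append(example)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def input_pipeline(filenames, batch_size, num_epochs=1, shuffle=True,
                   seed=None, preprocess=None):
    """Chain all stages lazily; nothing is read until batches are consumed."""
    names = filename_stream(filenames, num_epochs, shuffle, seed)
    examples = (decode(record) for record in read_records(names))
    if preprocess is not None:
        examples = (preprocess(example) for example in examples)
    return batches(examples, batch_size)
```

Because every stage is a generator, at most one open file and one batch of decoded rows are in memory at any time.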
I second Nyps's answer; I just don't have enough reputation to add a comment yet. Additionally, it might be interesting for you to open Task Manager or an equivalent and watch your system's memory usage as you run this. I would guess that you get the error at the point when your RAM fills up entirely.
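The same check can be made from inside the script with the standard-library `tracemalloc` module. Note that it reports memory allocated through Python's own allocators, not the whole system's RAM usage, so it complements Task Manager rather than replacing it. The `measure_peak` helper below is a hypothetical name used for illustration.

```python
import tracemalloc


def measure_peak(func, *args, **kwargs):
    """Run func and return (result, peak bytes traced while it ran)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Wrapping the loading step for a handful of files in `measure_peak` gives a rough idea of how memory grows per file before the full run hits a MemoryError.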
TensorFlow supports queues, which let you read only a portion of the data at a time so that you don't exhaust your memory. Examples of this are in the documentation that Nyps linked. TensorFlow has also recently added a new way to handle input data, the Dataset API; see the TensorFlow Dataset docs.
Also, I would suggest converting all your data to TensorFlow's TFRecord format, as it saves space and can speed up data access by more than 100 times compared with converting CSV files to tensors at training time.