I want to run a simple MLP classifier (scikit-learn) on the following data set.
The data set consists of 100 files containing sound signals. Each file has two columns (two signals), and its rows are the samples of those signals. The number of rows varies from file to file, between 70 and 80, so each file has dimensions between 70 x 2 and 80 x 2. Each file represents one complete record.
The problem I am facing is how to train a simple MLP on variable-length data, with training and test sets of 75 and 25 files respectively.
One solution is to concatenate all the files into a single file, i.e. 7500 x 2, and train the MLP on that. But the important information in the individual signals is lost in that case.
Here are three approaches, in order of usefulness. Approach 1 is strongly recommended.
1st Approach - LSTM/GRU
Don't use a simple MLP. The data you're dealing with is sequential, and recurrent networks (LSTM/GRU) were created for this purpose: they can process variable-length sequences.
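As a rough illustration of this approach, here is a minimal sketch of an LSTM classifier that accepts records of different lengths. It assumes PyTorch, since scikit-learn does not provide recurrent layers; the class name `SignalClassifier`, the hidden size, the two-class output and the random stand-in data are all hypothetical choices, not something from the question.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence


class SignalClassifier(nn.Module):
    """Hypothetical LSTM classifier for (length, 2) records of varying length."""

    def __init__(self, n_channels=2, hidden_size=16, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_channels, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, padded, lengths):
        # Packing tells the LSTM where each record really ends,
        # so the padded time steps are not processed.
        packed = pack_padded_sequence(
            padded, lengths, batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n[-1] is the last hidden state of each record: (batch, hidden_size).
        return self.head(h_n[-1])


torch.manual_seed(0)

# Random stand-in for four files with 70 to 80 rows and 2 columns each.
lengths = torch.randint(70, 81, (4,))
sequences = [torch.randn(int(n), 2) for n in lengths]

# Padding is only needed to batch the records into one tensor.
padded = pad_sequence(sequences, batch_first=True)

model = SignalClassifier()
logits = model(padded, lengths)  # one row of class scores per record
```

Training would follow the usual PyTorch loop (for example `nn.CrossEntropyLoss` on `logits` with an optimizer), which is omitted here.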
2nd Approach - Embeddings
Find a function that transforms your data into a fixed-length representation, called an embedding. An example of a network that produces time-series embeddings is TimeNet. However, since TimeNet is itself a recurrent network, that essentially brings us back to the first approach.
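To make the idea of a fixed-length representation concrete, here is a much simpler stand-in than a learned embedding such as TimeNet: a hand-crafted vector of per-channel summary statistics. The function name `summary_embedding` and the choice of statistics are assumptions for illustration only, and such features may well discard information a learned embedding would keep.

```python
import numpy as np


def summary_embedding(signal):
    """Map a (length, n_channels) record to a fixed-length vector.

    Hypothetical hand-crafted embedding: mean, standard deviation,
    minimum and maximum of each channel, concatenated.
    """
    signal = np.asarray(signal, dtype=float)
    stats = [
        signal.mean(axis=0),
        signal.std(axis=0),
        signal.min(axis=0),
        signal.max(axis=0),
    ]
    return np.concatenate(stats)


rng = np.random.default_rng(0)

# Two random stand-in records of different lengths.
short_record = rng.normal(size=(70, 2))
long_record = rng.normal(size=(80, 2))

emb_short = summary_embedding(short_record)
emb_long = summary_embedding(long_record)
# Both embeddings have 4 statistics x 2 channels = 8 values,
# regardless of the record length.
```

Because every record now maps to the same number of features, the resulting vectors could be stacked into a matrix and passed to an ordinary MLP.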
3rd Approach - Padding
If you can find a reasonable upper bound for the sequence length, you can pad shorter series to the length of the longest one (pad with 0 at the beginning or end of the series, or interpolate/forecast the remaining values), or cut longer series to the length of the shortest one. Obviously you will either introduce noise or lose information, respectively.
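A minimal sketch of the padding variant with scikit-learn's `MLPClassifier` might look like this. It assumes zero-padding at the end up to 80 rows, flattens each record into one feature row, and uses random signals and random binary labels as placeholders for the real files; the helper names `pad_or_truncate` and `to_feature_matrix` are hypothetical.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier


def pad_or_truncate(signal, target_len):
    """Return a (target_len, n_channels) array, zero-padded at the end or cut."""
    signal = np.asarray(signal, dtype=float)
    out = np.zeros((target_len, signal.shape[1]))
    n = min(target_len, signal.shape[0])
    out[:n] = signal[:n]
    return out


def to_feature_matrix(signals, target_len):
    """Flatten each padded record into one row of length target_len * n_channels."""
    return np.stack([pad_or_truncate(s, target_len).ravel() for s in signals])


rng = np.random.default_rng(0)

# Placeholder data: 100 records with 70 to 80 rows and 2 columns each.
lengths = rng.integers(70, 81, size=100)
signals = [rng.normal(size=(n, 2)) for n in lengths]
labels = rng.integers(0, 2, size=100)  # placeholder labels, not real classes

X = to_feature_matrix(signals, target_len=80)  # shape (100, 160)
X_train, X_test = X[:75], X[75:]
y_train, y_test = labels[:75], labels[75:]

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```

Passing `target_len=70` instead would apply the truncation variant, cutting every record to the length of the shortest one.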