I have an array x_train and targets targets_train. I want to shuffle the training data, split it into smaller batches, and use the batches as training data. My original data has 1000 rows, and each time I try to use 250 of them:
x_train = np.memmap('/home/usr/train', dtype='float32', mode='r', shape=(1000, 1, 784))
# print(x_train)
targets_train = np.memmap('/home/usr/train_label', dtype='int32', mode='r', shape=(1000, 1))
train_idxs = [i for i in range(x_train.shape[0])]
np.random.shuffle(train_idxs)
num_batches_train = 4
def next_batch(start, train, labels, batch_size=250):
    newstart = start + batch_size
    if newstart > train.shape[0]:
        newstart = 0
    idxs = train_idxs[start:start + batch_size]
    # print(idxs)
    return train[idxs, :], labels[idxs, :], newstart
# x_train_lab = x_train[:200]
# # x_train = np.array(targets_train)
# targets_train_lab = targets_train[:200]
for i in range(num_batches_train):
    x_train, targets_train, newstart = next_batch(i*batch_size, x_train, targets_train, batch_size=250)
The problem is, when I shuffle the training data and try to access the batches, I get an error saying:
return train[idxs, :], labels[idxs, :], newstart
IndexError: index 250 is out of bounds for axis 0 with size 250
Does anybody know what I am doing wrong?
(edit - first guess about newstart deleted)
In this line:
x_train, targets_train, newstart = next_batch(i*batch_size, x_train, targets_train, batch_size=250)
you change the size of x_train with each iteration, yet you continue to use the train_idxs array that you created for the full-size array. It's one thing to pull random values out of x_train in batches, but you have to keep the selection arrays consistent.
========================
Reducing your code to a simple case
import numpy as np
x_train = np.arange(20).reshape(20,1)
train_idxs = np.arange(x_train.shape[0])
np.random.shuffle(train_idxs)
num_batches_train = 4
batch_size=5
def next_batch(start, train):
    idxs = train_idxs[start:start + batch_size]
    print(train.shape, idxs)
    return train[idxs, :]

for i in range(num_batches_train):
    x_train = next_batch(i*batch_size, x_train)
    print(x_train)
a run produces:
1658:~/mypy$ python3 stack39919181.py
(20, 1) [ 7 18 3 0 9]
[[ 7]
[18]
[ 3]
[ 0]
[ 9]]
(5, 1) [13 5 2 15 1]
Traceback (most recent call last):
File "stack39919181.py", line 14, in <module>
x_train = next_batch(i*batch_size, x_train)
File "stack39919181.py", line 11, in next_batch
return train[idxs, :]
IndexError: index 13 is out of bounds for axis 0 with size 5
I fed the (5,1) x_train back to next_batch but tried to index it as though it were the original.
Changing the iteration to:
for i in range(num_batches_train):
    x_batch = next_batch(i*batch_size, x_train)
    print(x_batch)
lets it run through producing 4 batches of 5 rows.
The problem also shows up in this line in the function definition:
idxs = train_idxs[start:start + batch_size]
You can change it to:
idxs = train_idxs[start:newstart]
since newstart is already start + batch_size, but that alone is only cosmetic; what actually fixes the error is not overwriting x_train inside the loop, as below.
Also, please change the variable names in the for loop to something like:
batch_size = 250
for i in range(num_batches_train):
    x_train_split, targets_train_split, newstart = next_batch(i*batch_size,
                                                              x_train,
                                                              targets_train,
                                                              batch_size=250)
    print(x_train_split.shape, targets_train_split.shape, newstart)
Sample output:
(250, 1, 784) (250, 1) 250
(250, 1, 784) (250, 1) 500
(250, 1, 784) (250, 1) 750
(250, 1, 784) (250, 1) 1000
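As an alternative sketch (using toy in-memory arrays with the same shapes as the question's memmaps, which is an assumption here), you can shuffle once and split the index array up front with np.array_split, which avoids tracking start/newstart entirely:

```python
import numpy as np

# Toy stand-ins for the memory-mapped arrays; 1000 rows matches the question.
x_train = np.zeros((1000, 1, 784), dtype='float32')
targets_train = np.zeros((1000, 1), dtype='int32')

num_batches_train = 4
train_idxs = np.random.permutation(x_train.shape[0])  # shuffled row indices

# Split the shuffled indices into 4 chunks of 250 and index with each chunk.
for batch_idxs in np.array_split(train_idxs, num_batches_train):
    x_batch = x_train[batch_idxs, :]        # (250, 1, 784)
    y_batch = targets_train[batch_idxs, :]  # (250, 1)
    print(x_batch.shape, y_batch.shape)
```

Fancy indexing with batch_idxs copies just those rows into a new array, so this also works when x_train is an np.memmap opened in mode='r'.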