
Numpy split with percentage on a matrix

I am new to Python and have trouble understanding the following code:

data_a, data_b, data_C = np.split(original_data.sample(frac=1, random_state=1729), 
                               [int(0.7 * len(original_data)), int(0.9*len(original_data))])

My original data set has a total of 38000 rows. After this split, data_a has 26600 rows, data_b has 7600 rows, and data_c has 3800 rows. I understand that 70% of original_data is 26600 rows. But why does data_b have 7600 rows and data_c 3800? I read the documentation for the split method, and from my reading of the code I would have expected 90% of the remaining 30% of my initial 38000 rows to go into data_b, which would be 10260 rows, not 7600.

MaradonaAtCoding asked Jan 25 '26 11:01



2 Answers

You have to do it sequentially if you want to split the remaining 30% into 90-10. Try this:

data_a, remaining_data = np.split(original_data.sample(frac=1, random_state=1729), 
                                   [int(0.7 * len(original_data))])
data_b, data_C = np.split(remaining_data,[int(0.9 * len(remaining_data))])

data_a.shape, data_b.shape, data_C.shape

output:

((26600,), (10260,), (1140,))
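The sequential idea can be wrapped in a small helper. This is only a sketch: split_sequentially is a hypothetical name, not a NumPy function, and a plain NumPy array stands in for the shuffled data so that only the row counts matter.

```python
import numpy as np

def split_sequentially(data, fractions):
    """Cut int(f * len(remaining)) rows off the front for each f in turn.

    Hypothetical helper for illustration: every fraction is relative to
    what is still left, not to the original data set.
    """
    parts = []
    remaining = data
    for f in fractions:
        cut = int(f * len(remaining))
        head, remaining = np.split(remaining, [cut])
        parts.append(head)
    parts.append(remaining)  # whatever is left becomes the last part
    return parts

# Stand-in for the 38000-row data set (assumption: only the length matters).
rows = np.arange(38000)
data_a, data_b, data_c = split_sequentially(rows, [0.7, 0.9])

sizes = (len(data_a), len(data_b), len(data_c))
print(sizes)
```

Here 0.7 applies to all 38000 rows and 0.9 applies to the 11400 rows that remain, which gives the same row counts as the output above.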
Venkatachalam answered Jan 27 '26 23:01



The split percentages there are relative to the original data set, so if you want data_b to be 90% of the 30% left after the first split, the second split point has to sit at 0.7 + 0.9 * 0.3 = 0.97 of the original length. You need to do something like this:

data_a, data_b, data_C = np.split(original_data.sample(frac=1, random_state=1729), [int(0.7 * len(original_data)), int(0.97*len(original_data))])

That is because you specify the split points rather than the ratios of the resulting data sets.
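A quick way to see the difference is to compare both sets of split points on a stand-in array. This is a minimal sketch: part_sizes is a hypothetical helper, and np.arange(38000) is assumed to behave like the shuffled data as far as row counts go.

```python
import numpy as np

# Stand-in for the shuffled 38000-row data set; only the row count matters.
rows = np.arange(38000)
n = len(rows)

def part_sizes(cut_fractions):
    """Return the cut positions and the resulting part sizes for np.split."""
    cuts = [int(f * n) for f in cut_fractions]
    return cuts, [len(part) for part in np.split(rows, cuts)]

# Cuts at 70% and 90% of the whole: the middle part is the 20% in between.
cuts_question, sizes_question = part_sizes([0.7, 0.9])

# Cuts at 70% and 97% of the whole, since 0.97 = 0.7 + 0.9 * 0.3.
cuts_answer, sizes_answer = part_sizes([0.7, 0.97])

print(cuts_question, sizes_question)
print(cuts_answer, sizes_answer)
```

The first call reproduces the 26600 / 7600 / 3800 split from the question, and the second gives 26600 / 10260 / 1140.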

Alex answered Jan 28 '26 01:01



