Python random sample selection based on multiple conditions

Tags:

python

pandas

I want to draw a random sample in Python from the following df such that at least 65% of the rows in the resulting sample have color yellow and the cumulative sum of the selected quantities is less than or equal to 18.

Original Dataset:

Date        Id      color       qty
02-03-2018  A       red         5
03-03-2018  B       blue        2
03-03-2018  C       green       3
04-03-2018  D       yellow      4
04-03-2018  E       yellow      7
04-03-2018  G       yellow      6
04-03-2018  H       orange      8
05-03-2018  I       yellow      1
06-03-2018  J       yellow      5

I have the total qty condition covered, but I am stuck on how to integrate the percentage condition:

df2 = df1.sample(n=df1.shape[0])

df3 = df2[df2.qty.cumsum() <= 18]
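
For anyone reproducing this, here is a minimal sketch of the attempt above, assuming the frame is rebuilt by hand from the table. The `random_state` argument and the `yellow_share` name are additions for illustration; the last lines only measure the yellow share, they do not enforce it.

```python
import pandas as pd

# The dataset from the table above, typed in by hand so the snippet runs on its own.
df1 = pd.DataFrame({
    "Date": ["02-03-2018", "03-03-2018", "03-03-2018", "04-03-2018", "04-03-2018",
             "04-03-2018", "04-03-2018", "05-03-2018", "06-03-2018"],
    "Id": list("ABCDEGHIJ"),
    "color": ["red", "blue", "green", "yellow", "yellow",
              "yellow", "orange", "yellow", "yellow"],
    "qty": [5, 2, 3, 4, 7, 6, 8, 1, 5],
})

# Shuffle all rows, then keep the leading rows whose running total stays <= 18.
df2 = df1.sample(frac=1, random_state=0)  # random_state only makes the run repeatable
df3 = df2[df2.qty.cumsum() <= 18]

# Share of yellow rows in the result; the requirement is for this to be >= 0.65.
yellow_share = (df3.color == "yellow").mean()
print(df3)
print(f"total qty = {df3.qty.sum()}, yellow share = {yellow_share:.2f}")
```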

Required dataset:

Date        Id      color       qty
03-03-2018  B       blue        2
04-03-2018  D       yellow      4
04-03-2018  G       yellow      6
06-03-2018  J       yellow      5

Or something like this:

Date        Id      color       qty
02-03-2018  A       red         5
04-03-2018  D       yellow      4
04-03-2018  E       yellow      7
05-03-2018  I       yellow      1

Any help would be really appreciated!

Thanks in advance.

Analytics_TM Avatar asked Oct 28 '25 11:10



1 Answer

  1. Filter the rows with 'yellow' and draw a random sample that makes up at least 65% of your total sample size (sample_size, the number of rows you want, must be defined first).

    import math
    import random
    yellow_size = random.randint(65, 100) / 100
    # .sample() needs an integer count; round up so yellow stays >= 65%
    n_yellow = math.ceil(yellow_size * sample_size)
    df_yellow = df3[df3['color'] == 'yellow'].sample(n_yellow)
    
  2. Filter the rows with other colors and draw a random sample for the remainder of your sample size.

    n_others = sample_size - n_yellow
    df_others = df3[df3['color'] != 'yellow'].sample(n_others)
    
  3. Combine the two and shuffle the rows.

    import pandas as pd
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)
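
The three steps above can be sketched as a single function. This is only an illustration under some assumptions: `stratified_sample`, `min_yellow` and `seed` are hypothetical names, not pandas API, and the yellow count is drawn directly as an integer so that it never exceeds the rows actually available.

```python
import math
import random

import pandas as pd


def stratified_sample(df, sample_size, min_yellow=0.65, seed=None):
    """Hypothetical helper: draw sample_size rows, at least min_yellow of them yellow."""
    rng = random.Random(seed)
    yellow = df[df["color"] == "yellow"]
    others = df[df["color"] != "yellow"]

    # Smallest yellow count that meets the share, and the largest one available.
    lo = math.ceil(min_yellow * sample_size)
    hi = min(sample_size, len(yellow))
    # Never leave more non-yellow slots than there are non-yellow rows.
    lo = max(lo, sample_size - len(others))
    if lo > hi:
        raise ValueError("not enough rows to build such a sample")

    n_yellow = rng.randint(lo, hi)
    parts = [
        yellow.sample(n=n_yellow, random_state=rng.randint(0, 2**32 - 1)),
        others.sample(n=sample_size - n_yellow, random_state=rng.randint(0, 2**32 - 1)),
    ]
    # frac=1 reshuffles the combined rows.
    return pd.concat(parts).sample(frac=1, random_state=rng.randint(0, 2**32 - 1))


# The question's data, reduced to the columns the function needs.
df = pd.DataFrame({
    "Id": list("ABCDEGHIJ"),
    "color": ["red", "blue", "green", "yellow", "yellow",
              "yellow", "orange", "yellow", "yellow"],
    "qty": [5, 2, 3, 4, 7, 6, 8, 1, 5],
})

sample = stratified_sample(df, sample_size=4, seed=1)
print(sample)
```

Note that this sketch handles the color share only; it says nothing about the qty total.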
    

UPDATE:

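If you want to check both conditions simultaneously, this could be one way to do it. Here sample_size is the number of rows you want and must be defined first. The loop redraws until the total qty is at most 18, so it never terminates if no sample of that size can meet the limit:

import math
import random

import pandas as pd

df_sample = df

while sum(df_sample['qty']) > 18:
    yellow_size = random.randint(65, 100) / 100
    # .sample() needs an integer count; round up so yellow stays >= 65%
    n_yellow = math.ceil(yellow_size * sample_size)
    df_yellow = df[df['color'] == 'yellow'].sample(n_yellow)
    df_others = df[df['color'] != 'yellow'].sample(sample_size - n_yellow)
    df_sample = pd.concat([df_yellow, df_others]).sample(frac=1)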
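
A bounded variant may be safer than an open-ended loop. The sketch below is a different technique from the loop above: plain rejection sampling that reuses the question's shuffle-and-cumsum trim and retries until the yellow share also holds. `sample_with_constraints`, `max_tries` and `seed` are hypothetical names chosen for illustration, and the resulting samples are not guaranteed to be uniformly distributed over all valid subsets.

```python
import pandas as pd


def sample_with_constraints(df, max_qty=18, min_yellow=0.65, max_tries=1000, seed=None):
    """Hypothetical helper: retry a shuffle-and-trim until both conditions hold."""
    for attempt in range(max_tries):
        # Use a different but reproducible seed per attempt when a seed is given.
        state = None if seed is None else seed + attempt
        shuffled = df.sample(frac=1, random_state=state)
        # Keep the leading rows whose running qty total stays within the limit.
        candidate = shuffled[shuffled["qty"].cumsum() <= max_qty]
        if len(candidate) and (candidate["color"] == "yellow").mean() >= min_yellow:
            return candidate
    raise RuntimeError(f"no valid sample found in {max_tries} tries")


# The question's data, reduced to the columns the function needs.
df = pd.DataFrame({
    "Id": list("ABCDEGHIJ"),
    "color": ["red", "blue", "green", "yellow", "yellow",
              "yellow", "orange", "yellow", "yellow"],
    "qty": [5, 2, 3, 4, 7, 6, 8, 1, 5],
})

result = sample_with_constraints(df, seed=0)
print(result)
```

Because the number of attempts is capped, an impossible request ends with an error instead of hanging.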
panktijk Avatar answered Oct 31 '25 00:10