Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Column with random, increasing numbers in pandas

I have a dataframe of about 10,000 rows. Each record includes a Recordid and an integer that represents the number of days since the start for that RecordID. For example:

 df = pd.DataFrame( { 
"RecordID" : ["id1", "id2", "id3", "id4", "id5" , "id1", "id2", "id3", "id4", "id5" ] , 
"number_of_days" : [1,1,1,1,1,2,2,2,2,2]})
df
RecordID    number_of_days
0   id1     1
1   id2     1
2   id3     1
3   id4     1
4   id5     1
5   id1     2
6   id2     2
7   id3     2
8   id4     2
9   id5     2

I'd like to add a column that contains a random number between 1 and 100. But I need the random to be higher than the previous random value for each ID. This is an example of what I would like to get:

RecordID    number_of_days  random_value
0   id1     1               10
1   id1     2               13
2   id1     3               45
3   id1     4               50
4   id1     5               62
5   id1     6               80
6   id1     7               81
7   id1     8               82
8   id1     9               92
9   id1     10              99
10  id2     2               12
11  id2     4               31

I see posts about creating a field with random values. I'm not finding any that address the need for generating random values that increase, though.

like image 407
JamesMiller Avatar asked Oct 22 '25 03:10

JamesMiller


2 Answers

You can generate random numbers, sort and then assign to df,

df = pd.DataFrame( { 

"RecordID" : ["id1", "id2", "id3", "id4", "id5" , "id1", "id2", "id3", "id4", "id5" ] , 
"number_of_days" : [1,1,1,1,1,2,2,2,2,2]})

df['random_value'] = np.sort(np.random.randint(1,100, len(df)))


    RecordID    number_of_days  random_value
0   id1         1               5
1   id2         1               7
2   id3         1               19
3   id4         1               34
4   id5         1               45
5   id1         2               53
6   id2         2               67
7   id3         2               72
8   id4         2               72
9   id5         2               80

Edit: If you want the random_value by group, you can group data by ID and then assign sorted random numbers,

df.groupby('RecordID').apply(lambda x: pd.Series(np.sort(np.random.randint(1,100, len(x))))).reset_index(name = 'random_value')
like image 147
Vaishali Avatar answered Oct 24 '25 16:10

Vaishali


Generate all random numbers, slice it properly based on the group sizes, sort each slice, and assign back. First we need to sort the DataFrame so that assignment occurs properly.

import numpy as np
import pandas as pd

df = df.sort_values('RecordID')

arr = np.array_split(np.random.randint(1, 100, len(df)),
                     df.groupby('RecordID').size().cumsum()[:-1])

df['Random_Value'] = np.sort(arr, axis=1).ravel()

Output

  RecordID  number_of_days  Random_Value
0      id1               1            19
5      id1               2            41
1      id2               1            53
6      id2               2            56
2      id3               1            33
7      id3               2            68
3      id4               1            57
8      id4               2            67
4      id5               1            39
9      id5               2            49

As always, it's best to avoid groupby.apply(lambda x: ... as this is a slow loop over the groups.

N = 10000
df = pd.DataFrame({"RecordID": list(range(N))*10,
                   "number_of_days": np.repeat(range(10), N)})

def ALollz(df):
    df = df.sort_values(['RecordID', 'number_of_days'])

    arr = np.array_split(np.random.randint(1, 100, len(df)),
                         df.groupby('RecordID').size().cumsum()[:-1])

    df['Random_Value'] = np.sort(arr, axis=1).ravel()

    return df

%timeit ALollz(df)
#54 ms ± 1.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.assign(random_value=df.groupby('RecordID').transform(lambda x: np.sort(np.random.randint(1,100, len(x))))).sort_values('RecordID')
#15.9 s ± 124 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.groupby('RecordID').apply(lambda x: pd.Series(np.sort(np.random.randint(1,100, len(x))))).reset_index()
#1.23 s ± 25.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 32
ALollz Avatar answered Oct 24 '25 16:10

ALollz