Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Lookup value in dictionary between range pandas

Tags:

python

pandas

I have a "lookup" table formatted as such:

Min | Max | Val
  1 |  99 | "Principal"
100 | 199 | "Partner"
... | ... | ...

There is a CURRENT_POINTS series in my dataframe that is between Min or Max (inclusive).

Question: how do I create a VAL column that is based on the above lookup table? My initial thought was to use df.lookup, but there are 800K rows in df and so the two tables are not equally sized.

Thanks in advance for your help!

Any thoughts?

like image 614
user791411 Avatar asked Nov 28 '25 05:11

user791411


1 Answers

I would use cut() method.

Assuming you have the following DFs:

In [187]: lkp
Out[187]:
   Min  Max  Val
0    1   99  AAA
1  100  199  BBB
2  200  299  CCC
3  300  399  DDD

In [188]: df
Out[188]:
   CURRENT_POINTS
0              55
1              10
2              20
3             144
4             194
5             143
6             397
7             233
8             128
9             215

Using cut() method we can produce a new column of a category dtype, which might save a lot of memory:

In [189]: df['Val'] = pd.cut(df.CURRENT_POINTS,
     ...:                    bins=[0] + lkp[['Min','Max']].stack()[1::2].tolist(),
     ...:                    labels=lkp.Val.tolist())
     ...:

In [190]: df
Out[190]:
   CURRENT_POINTS  Val
0              55  AAA
1              10  AAA
2              20  AAA
3             144  BBB
4             194  BBB
5             143  BBB
6             397  DDD
7             233  CCC
8             128  BBB
9             215  CCC

In [191]: df.dtypes
Out[191]:
CURRENT_POINTS       int32
Val               category
dtype: object

Category dtype can save a lot of memory:

In [192]: big = pd.concat([df] * 10**5, ignore_index=True)

In [193]: big.shape
Out[193]: (1000000, 2)

In [194]: big['str_col'] = 'AAA'

In [198]: big.dtypes
Out[198]:
CURRENT_POINTS       int32
Val               category
str_col             object
dtype: object

In [195]: big.memory_usage()
Out[195]:
Index                  80
CURRENT_POINTS    4000000
Val               1000032     # <--- `category` column takes 1 byte per row (plus 32 bytes overhead)
str_col           8000000

In [197]: big.head()
Out[197]:
   CURRENT_POINTS  Val str_col
0              55  AAA     AAA
1              10  AAA     AAA
2              20  AAA     AAA
3             144  BBB     AAA
4             194  BBB     AAA

NOTE: pay attention at memory usage for the category column Val and for the str_col column (dtype: object)

Explanation:

bins:

In [199]: lkp[['Min','Max']]
Out[199]:
   Min  Max
0    1   99
1  100  199
2  200  299
3  300  399

In [200]: lkp[['Min','Max']].stack()
Out[200]:
0  Min      1
   Max     99
1  Min    100
   Max    199
2  Min    200
   Max    299
3  Min    300
   Max    399
dtype: int64

In [201]: lkp[['Min','Max']].stack()[1::2].tolist()
Out[201]: [99, 199, 299, 399]

In [202]: [0] + lkp[['Min','Max']].stack()[1::2].tolist()
Out[202]: [0, 99, 199, 299, 399]

labels:

In [203]: lkp.Val.tolist()
Out[203]: ['AAA', 'BBB', 'CCC', 'DDD']

NOTE: lkp must be sorted by ['Min', 'Max'] before using it for bins and labels.

Here is a small demo for sorting:

In [2]: lkp
Out[2]:
   Min  Max  Val
0  300  399  DDD
1  100  199  BBB
2    1   99  AAA
3  200  299  CCC

In [4]: lkp = lkp.sort_values(['Min','Max'])

In [5]: lkp
Out[5]:
   Min  Max  Val
2    1   99  AAA
1  100  199  BBB
3  200  299  CCC
0  300  399  DDD
like image 197
MaxU - stop WAR against UA Avatar answered Nov 30 '25 18:11

MaxU - stop WAR against UA



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!