How to get the start and end indexes of subgroups in a DataFrame

Tags:

pandas

import pandas as pd
df = pd.DataFrame({"C1":['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],'C2':['A','B','A','A','A','A','A','A','B','A']})

    C1      C2
0   USA     A
1   USA     B
2   USA     A
3   USA     A
4   USA     A
5   JAPAN   A
6   JAPAN   A
7   JAPAN   A
8   USA     B
9   USA     A

This is a watered-down version of my problem. My objective is to iterate over the sub-groups of the dataframe where C2 contains a B. If a B is in C2, I look at C1 and need the entire consecutive group. In this example, I see USA, and the group starts at index 0 and finishes at 4. Another one is between 8 and 9.

So my desired result would be the indexes such that:

[[0,4],[8,9]]

I tried to use groupby, but it doesn't work because it groups all the USA rows together.

my_index = list(df[df['C2']=='B'].index)
my_index

would give 1, 8, but how do I get the start/finish?


asked Apr 18 '21 16:04

ProcolHarum

2 Answers

Here is one approach: first mask the dataframe to the groups that contain at least one B, then take the index of those rows, split it wherever consecutive index values are not adjacent, and keep the first and last value of each piece:

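import numpy as np

s = df['C1'].ne(df['C1'].shift()).cumsum()               # label each consecutive run of C1
idx = df.index[s.isin(s[df['C2'].eq("B")])].to_numpy()   # index of rows in runs containing a B
p = np.where(np.diff(idx) > 1)[0] + 1                    # positions where the index jumps
split_ = np.split(idx, p)
out = [[int(g[0]), int(g[-1])] for g in split_]          # first and last index of each piece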

print(out)
[[0, 4], [8, 9]]
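
Since the stated goal was to iterate over each sub-group, the index pairs can be passed back to df.loc. This is only a minimal sketch, assuming the default integer index from the question (label slices with df.loc include both ends). The out list is hard-coded here so the snippet runs on its own, and the groups name is just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "C1": ['USA', 'USA', 'USA', 'USA', 'USA', 'JAPAN', 'JAPAN', 'JAPAN', 'USA', 'USA'],
    "C2": ['A', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'A'],
})

out = [[0, 4], [8, 9]]  # assumed: the [start, finish] pairs computed above

# df.loc[start:end] is label-based and inclusive on both ends
groups = [df.loc[start:end] for start, end in out]

for g in groups:
    print(g.index[0], g.index[-1], len(g))
```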


answered Sep 27 '22 19:09

anky

Solution

b = df['C1'].ne(df['C1'].shift()).cumsum()  # id for each consecutive block of C1
m = b.isin(b[df['C2'].eq('B')])             # True for rows in blocks that contain a B
i = m.index[m].to_series().groupby(b).agg(['first', 'last']).values.squeeze()

Explanations

Shift column C1 and compare the shifted column with the original one to create a boolean mask, then take the cumulative sum of this mask to label the blocks of consecutive rows where the value in column C1 stays the same.

>>> b

0    1
1    1
2    1
3    1
4    1
5    2
6    2
7    2
8    3
9    3
Name: C1, dtype: int64
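
To make the shift/compare/cumsum step easier to follow, the intermediate mask can be inspected on its own. This is a small sketch using only the C1 values from the question; the names c1 and changed are introduced here for illustration:

```python
import pandas as pd

c1 = pd.Series(
    ['USA', 'USA', 'USA', 'USA', 'USA', 'JAPAN', 'JAPAN', 'JAPAN', 'USA', 'USA'],
    name='C1',
)

# True wherever the value differs from the previous row
# (the first row is compared with NaN, so it is always True)
changed = c1.ne(c1.shift())

# Each True starts a new block, so the running sum is a block id
b = changed.cumsum()

print(changed.tolist())
print(b.tolist())
```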

Create a boolean mask m to identify the blocks of rows that contain at least one B.

>>> m

0     True
1     True
2     True
3     True
4     True
5    False
6    False
7    False
8     True
9     True
Name: C1, dtype: bool

Filter the index by using boolean masking with mask m, then group the filtered index by the identified blocks b and aggregate using first and last to get the indices.

>>> i

array([[0, 4],
       [8, 9]])
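
The same steps could be wrapped in a helper that returns a plain list of pairs. This is a hedged sketch, not part of the answer above: the function name block_bounds and its parameters are hypothetical. It leaves out squeeze so that a single matching block still yields a nested list, and because it groups by block id, two adjacent blocks that both contain a B stay separate.

```python
import pandas as pd

def block_bounds(df, key='C1', flag_col='C2', flag='B'):
    """Return [first, last] index labels for each consecutive run of `key`
    that contains `flag` in `flag_col`. (Hypothetical helper.)"""
    blocks = df[key].ne(df[key].shift()).cumsum()      # id per consecutive run
    hit = blocks.isin(blocks[df[flag_col].eq(flag)])   # rows in runs containing flag
    idx = df.index[hit].to_series()                    # index labels as values
    bounds = idx.groupby(blocks[hit]).agg(['first', 'last'])
    return bounds.to_numpy().tolist()

df = pd.DataFrame({
    "C1": ['USA', 'USA', 'USA', 'USA', 'USA', 'JAPAN', 'JAPAN', 'JAPAN', 'USA', 'USA'],
    "C2": ['A', 'B', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'A'],
})
result = block_bounds(df)
print(result)

# Adjacent blocks that both contain a B are reported separately
df2 = pd.DataFrame({"C1": ['USA', 'USA', 'JAPAN', 'JAPAN'], "C2": ['B', 'A', 'B', 'A']})
result2 = block_bounds(df2)
print(result2)
```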

answered Sep 27 '22 19:09

Shubham Sharma
