import pandas as pd
df = pd.DataFrame({"C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'], "C2": ['A','B','A','A','A','A','A','A','B','A']})
C1 C2
0 USA A
1 USA B
2 USA A
3 USA A
4 USA A
5 JAPAN A
6 JAPAN A
7 JAPAN A
8 USA B
9 USA A
This is a watered-down version of my problem. To keep it simple: my objective is to iterate over the subgroups of the DataFrame where C2 has a B in it. If a B is in C2, I look at C1 and need the entire consecutive group. In this example, I see USA, and the group starts at index 0 and finishes at 4. Another one runs from 8 to 9.
So my desired result would be the indexes such that:
[[0,4],[8,9]]
I tried to use groupby, but it wouldn't work because it groups all the USA rows together, even when they are not consecutive.
my_index = list(df[df['C2']=='B'].index)
my_index
would give [1, 8], but how do I get the start and finish of each group?
Here is one approach: first mask the DataFrame to the groups that have at least one B, then grab the index and split it into consecutive runs to collect the first and last index values of each group:
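import numpy as np

# Label each run of consecutive equal C1 values
s = df['C1'].ne(df['C1'].shift()).cumsum()
# Index labels of the rows whose run contains at least one "B"
i = df.index[s.isin(s[df['C2'].eq("B")])]
# Split wherever the kept index labels are not consecutive
p = np.where(np.diff(i) > 1)[0] + 1
split_ = np.split(i.to_numpy(), p)
out = [[int(g[0]), int(g[-1])] for g in split_]
print(out)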
[[0, 4], [8, 9]]
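One caveat with splitting on index gaps: if two neighbouring runs both contain a B, there is no gap between them and they come back as a single span. The sketch below is a possible variation, not part of the original answer, that splits where the run label changes instead. The helper name `spans_by_run` and its parameters are hypothetical, and it assumes an integer index.

```python
import numpy as np
import pandas as pd

def spans_by_run(df, key="C1", col="C2", value="B"):
    """Return [first, last] index pairs for each consecutive run of `key`
    that contains `value` in `col` (hypothetical helper, integer index assumed)."""
    run = df[key].ne(df[key].shift()).cumsum()   # label consecutive runs of `key`
    keep = run.isin(run[df[col].eq(value)])      # rows belonging to runs that hold `value`
    idx = df.index[keep].to_numpy()
    if idx.size == 0:
        return []
    labels = run[keep].to_numpy()
    # Split where the run label changes rather than where the index jumps.
    cuts = np.where(np.diff(labels) != 0)[0] + 1
    return [[int(part[0]), int(part[-1])] for part in np.split(idx, cuts)]

# Two adjacent runs that both contain a B: no index gap separates them.
adjacent = pd.DataFrame({"C1": ["USA", "USA", "JAPAN", "JAPAN"],
                         "C2": ["B", "A", "B", "A"]})
print(spans_by_run(adjacent))  # [[0, 1], [2, 3]]
```

On the question's sample data the same helper gives [[0, 4], [8, 9]].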
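# Alternative: label the runs, then take the first and last index of each kept run
b = df['C1'].ne(df['C1'].shift()).cumsum()
m = b.isin(b[df['C2'].eq('B')])
# to_numpy() keeps the result 2-D even when only one block matches
i = m.index[m].to_series().groupby(b).agg(['first', 'last']).to_numpy()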
Shift column C1 and compare the shifted column with the original one to create a boolean mask, then take the cumulative sum of this mask to label the blocks of consecutive rows where the value in column C1 stays the same.
>>> b
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
8 3
9 3
Name: C1, dtype: int64
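The run-labelling idiom can be tried in isolation on a small Series. This is only an illustrative sketch; the variable names `changed` and `block` are not from the answer.

```python
import pandas as pd

s = pd.Series(["USA", "USA", "JAPAN", "USA"])
changed = s.ne(s.shift())   # True wherever the value differs from the previous row
block = changed.cumsum()    # running count of changes, used as the block label
print(changed.tolist())  # [True, False, True, True]
print(block.tolist())    # [1, 1, 2, 3]
```

The first row always compares unequal to the missing value produced by the shift, so the labels start at 1.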
Create a boolean mask m to identify the blocks of rows that contain at least one B.
>>> m
0 True
1 True
2 True
3 True
4 True
5 False
6 False
7 False
8 True
9 True
Name: C1, dtype: bool
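As a cross-check, the same kind of mask can also be built with a per-block "any" via groupby and transform. This is an alternative formulation shown on a smaller, made-up frame, not what the answer itself uses.

```python
import pandas as pd

df = pd.DataFrame({"C1": ["USA", "USA", "JAPAN", "USA"],
                   "C2": ["A", "B", "A", "A"]})
b = df["C1"].ne(df["C1"].shift()).cumsum()            # block labels: 1, 1, 2, 3
m_isin = b.isin(b[df["C2"].eq("B")])                  # mask built as in the answer
m_any = df["C2"].eq("B").groupby(b).transform("any")  # same mask via a per-block "any"
print(m_any.tolist())  # [True, True, False, False]
```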
Filter the index by using boolean masking with mask m, then group the filtered index by the identified blocks b and aggregate using first and last to get the indices.
>>> i
array([[0, 4],
       [8, 9]])
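Putting the three steps together, a minimal end-to-end sketch might look like the following. The function name `block_bounds` and its parameters are hypothetical, and the explicit empty-result branch is an added assumption about how a caller would want the no-match case handled.

```python
import pandas as pd

def block_bounds(df, key="C1", col="C2", value="B"):
    """Return [[first, last], ...] index pairs for each consecutive block of
    `key` that contains `value` in `col` (hypothetical helper)."""
    b = df[key].ne(df[key].shift()).cumsum()   # step 1: label consecutive blocks
    m = b.isin(b[df[col].eq(value)])           # step 2: blocks containing `value`
    if not m.any():
        return []
    idx = df.index[m].to_series()              # step 3: first/last index per block
    bounds = idx.groupby(b[m]).agg(["first", "last"])
    return bounds.to_numpy().tolist()

df = pd.DataFrame({"C1": ['USA','USA','USA','USA','USA','JAPAN','JAPAN','JAPAN','USA','USA'],
                   "C2": ['A','B','A','A','A','A','A','A','B','A']})
print(block_bounds(df))           # [[0, 4], [8, 9]]
print(block_bounds(df.iloc[:5]))  # [[0, 4]]
```

Returning a nested list keeps the shape consistent when only one block matches, which makes the result easier to iterate over.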