Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Struggle with iterrows using python pandas [duplicate]

Tags:

python

pandas

I have a dataframe mixed state and region together. Those value have [edit] means states in US.

    RegionName
0   Alabama[edit]
1   Auburn [1]
2   Florence
3   Jacksonville [2]
4   Livingston [2]
5   Montevallo [2]
6   Troy [2]
7   Tuscaloosa [3][4]
8   Tuskegee [5]
9   Alaska[edit]    

The result I want is

    State               RegionName
0   Alabama[edit]       Auburn[1]
1                       Florence
2                       Jacksonville [2]
3                          ...
4   Alaska[edit]           ...   

I tried to use the code below but it failed

for row in df.iterrows():
    if row['RegionName'][-6:] == '[edit]':
        row['state'] = row[:-6]

The error message is

TypeError: tuple indices must be integers or slices, not str

Could someone give me some advice? Thank

like image 352
Bradley Avatar asked Apr 01 '26 07:04

Bradley


1 Answers

You can use mask, for select last six chars indexing with str:

mask = df.RegionName.str[-6:] != '[edit]'
print (mask)
0    False
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9    False
Name: RegionName, dtype: bool

#filter by mask and replace NaN by forward filling
df['State'] = df.RegionName.mask(mask).ffill()
#remove same values in both columns
df = df[df.State != df.RegionName]
print (df)
          RegionName          State
1         Auburn [1]  Alabama[edit]
2           Florence  Alabama[edit]
3   Jacksonville [2]  Alabama[edit]
4     Livingston [2]  Alabama[edit]
5     Montevallo [2]  Alabama[edit]
6           Troy [2]  Alabama[edit]
7  Tuscaloosa [3][4]  Alabama[edit]
8       Tuskegee [5]  Alabama[edit]
#keep only first duplicates, another replace by empty string
df['State'] = df.State.mask(df.State.duplicated(), '')
#change order of columns
df = df[['State','RegionName']].reset_index(drop=True)
print (df)
           State         RegionName
0  Alabama[edit]         Auburn [1]
1                          Florence
2                  Jacksonville [2]
3                    Livingston [2]
4                    Montevallo [2]
5                          Troy [2]
6                 Tuscaloosa [3][4]
7                      Tuskegee [5]

But if need remove [] and numbers is possible use a bit modified answer:

df.insert(0, 'State', df['RegionName'].str.extract('(.*)\[edit\]', expand=False).ffill())
df = df[~df['RegionName'].str.contains('\[edit\]')].reset_index(drop=True)
#change ( to [
df['RegionName'] = df['RegionName'].str.replace(r' \[.+$', '')
print (df)
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo
5  Alabama          Troy
6  Alabama    Tuscaloosa
7  Alabama      Tuskegee

df['State'] = df.State.mask(df.State.duplicated(), '')
print (df)
     State    RegionName
0  Alabama        Auburn
1               Florence
2           Jacksonville
3             Livingston
4             Montevallo
5                   Troy
6             Tuscaloosa
7               Tuskegee

EDIT by comment:

There are multiple problems, if need very slow loop solution:

#add i for index value else get tuples
for i, row in df.iterrows():
    print (row)
    if row['RegionName'][-6:] == '[edit]':
        #for appending new column with values use loc 
        df.loc[i, 'state'] = row['RegionName'][:-6]

print (df)
         RegionName    state
0     Alabama[edit]  Alabama
1        Auburn [1]      NaN
2          Florence      NaN
3  Jacksonville [2]      NaN
4    Livingston [2]      NaN
5    Montevallo [2]      NaN
like image 196
jezrael Avatar answered Apr 03 '26 21:04

jezrael