Extract specific words from a dataframe - Python

Question

The first dataframe I have is the following:

String1
Table 671usa50452.tab has been created as of the process date (12-19-22).
Table 643usa50552.tab has been created as of the process date (12-19-22).
Table 681usa50532.tab has been created as of the process date (12-19-22).
Table 621usa56452.tab has been created as of the process date (12-19-22).
Table 547usa67452.tab has been created as of the process date (12-19-22).

I would like to extract all the accounts that contain 'usa' in between and the date specified in each row to have something like this:

String1	Account	Date
Table 671usa50452.tab has been created as of the process date (12-19-22).	671usa50452	12-19-22
Table 643usa50552.tab has been created as of the process date (12-19-22).	643usa50552	12-19-22
Table 681usa50532.tab has been created as of the process date (12-19-22).	681usa50532	12-19-22
Table 621usa56452.tab has been created as of the process date (12-19-22).	621usa56452	12-19-22
Table 547usa67452.tab has been created as of the process date (12-19-22).	547usa67452	12-19-22

I've been trying to use the following, but the information is not getting into the columns of my new dataframe:

df_list1[['Account', 'Date']] = df_list1['String'].str.extract(r'\b(\d+usa\d+)\b.*?(\d{2}-\d{2}-\d{4})')

The second dataframe is something similar:

String2
3203usa34088 : Asset USA1 / asd011245
3203usa34088 : Asset USA2 / ghf023345
3203usa34088 : Asset USA3 / hgf012735
3203usa34088 : Asset USA4 / wet012455
3203usa34088 : Asset USA5 / nbj012245

And I would like to have the following:

String2	Account2
3200usa34088 : Asset USA1 / asd011245	3200usa34088
3201usa34088 : Asset USA2 / ghf023345	3201usa34088
3202usa34088 : Asset USA3 / hgf012735	3202usa34088
3203usa34088 : Asset USA4 / wet012455	3203usa34088
3204usa34088 : Asset USA5 / nbj012245	3204usa34088

Tim Biegeleisen · Accepted Answer

For the first dataframe we can use str.extract as follows:

df["Account"] = df["String1"].str.extract(r'(\w+)\.tab\b')
df["Date"] = df["String1"].str.extract(r'$(\d{2}-\d{2}-\d{2})$')

For the second dataframe:

df["Account2"] = df["String2"].str.extract(r'^(\w+)')

João Victor Fernandes · Answer

I think this can work:

# Pandas lib
import pandas as pd


# -------------------------------------------------------------- FIRST DATAFRAME

# I will suppose youre importing the df from excel ok?
df1 = pd.read_excel("First_df.xlsx")

#First case:
list_account = []
list_date = []
for string in df1['String1']:
    if "usa" in string:
        new_string = string.split()
        newnew_string = new_string[1].split(".")
        date_string = new_string[10].split("(")
        datedate_string = date_string[0].split(")")
        
        list_account.append(newnew_string[0])
        list_date.append(datedate_string[0])

df_output = pd.DataFrame({'Account': list_account})
df_output['Date'] = list_date

# -------------------------------------------------------------- SECOND DATAFRAME

df2 = pd.read_excel("Second_df.xlsx")

list_account2 = []

for string in df2['String2']:
    if "usa" in string:
        new_string = string.split()
        list_account2.append(new_string[0])
        
df_output2 = pd.DataFrame({'Account2': list_account2})

Extract specific words from a dataframe - Python

Tags:

python

pandas

exenlediatán

2 Answers

Tim Biegeleisen

João Victor Fernandes

Recent Activity

Donate For Us

Extract specific words from a dataframe - Python

Tags:

python

pandas

exenlediatán

2 Answers

Tim Biegeleisen

João Victor Fernandes

Related questions

Recent Activity

Donate For Us