Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract specific words from a dataframe - Python

Tags:

python

pandas

  1. The first dataframe I have is the following:
String1
Table 671usa50452.tab has been created as of the process date (12-19-22).
Table 643usa50552.tab has been created as of the process date (12-19-22).
Table 681usa50532.tab has been created as of the process date (12-19-22).
Table 621usa56452.tab has been created as of the process date (12-19-22).
Table 547usa67452.tab has been created as of the process date (12-19-22).

I would like to extract all the accounts that contain 'usa' in between and the date specified in each row to have something like this:

String1 Account Date
Table 671usa50452.tab has been created as of the process date (12-19-22). 671usa50452 12-19-22
Table 643usa50552.tab has been created as of the process date (12-19-22). 643usa50552 12-19-22
Table 681usa50532.tab has been created as of the process date (12-19-22). 681usa50532 12-19-22
Table 621usa56452.tab has been created as of the process date (12-19-22). 621usa56452 12-19-22
Table 547usa67452.tab has been created as of the process date (12-19-22). 547usa67452 12-19-22

I've been trying to use the following, but the information is not getting into the columns of my new dataframe:

df_list1[['Account', 'Date']] = df_list1['String'].str.extract(r'\b(\d+usa\d+)\b.*?(\d{2}-\d{2}-\d{4})')

  1. The second dataframe is something similar:
String2
3203usa34088 : Asset USA1 / asd011245
3203usa34088 : Asset USA2 / ghf023345
3203usa34088 : Asset USA3 / hgf012735
3203usa34088 : Asset USA4 / wet012455
3203usa34088 : Asset USA5 / nbj012245

And I would like to have the following:

String2 Account2
3200usa34088 : Asset USA1 / asd011245 3200usa34088
3201usa34088 : Asset USA2 / ghf023345 3201usa34088
3202usa34088 : Asset USA3 / hgf012735 3202usa34088
3203usa34088 : Asset USA4 / wet012455 3203usa34088
3204usa34088 : Asset USA5 / nbj012245 3204usa34088
like image 460
exenlediatán Avatar asked Sep 19 '25 13:09

exenlediatán


2 Answers

For the first dataframe we can use str.extract as follows:

df["Account"] = df["String1"].str.extract(r'(\w+)\.tab\b')
df["Date"] = df["String1"].str.extract(r'\((\d{2}-\d{2}-\d{2})\)')

For the second dataframe:

df["Account2"] = df["String2"].str.extract(r'^(\w+)')
like image 133
Tim Biegeleisen Avatar answered Sep 22 '25 04:09

Tim Biegeleisen


I think this can work:

# Pandas lib
import pandas as pd


# -------------------------------------------------------------- FIRST DATAFRAME

# I will suppose youre importing the df from excel ok?
df1 = pd.read_excel("First_df.xlsx")

#First case:
list_account = []
list_date = []
for string in df1['String1']:
    if "usa" in string:
        new_string = string.split()
        newnew_string = new_string[1].split(".")
        date_string = new_string[10].split("(")
        datedate_string = date_string[0].split(")")
        
        list_account.append(newnew_string[0])
        list_date.append(datedate_string[0])

df_output = pd.DataFrame({'Account': list_account})
df_output['Date'] = list_date

# -------------------------------------------------------------- SECOND DATAFRAME

df2 = pd.read_excel("Second_df.xlsx")

list_account2 = []

for string in df2['String2']:
    if "usa" in string:
        new_string = string.split()
        list_account2.append(new_string[0])
        
df_output2 = pd.DataFrame({'Account2': list_account2})


    
like image 44
João Victor Fernandes Avatar answered Sep 22 '25 06:09

João Victor Fernandes