String1 |
---|
Table 671usa50452.tab has been created as of the process date (12-19-22). |
Table 643usa50552.tab has been created as of the process date (12-19-22). |
Table 681usa50532.tab has been created as of the process date (12-19-22). |
Table 621usa56452.tab has been created as of the process date (12-19-22). |
Table 547usa67452.tab has been created as of the process date (12-19-22). |
I would like to extract all the accounts that contain 'usa' in between and the date specified in each row to have something like this:
String1 | Account | Date |
---|---|---|
Table 671usa50452.tab has been created as of the process date (12-19-22). | 671usa50452 | 12-19-22 |
Table 643usa50552.tab has been created as of the process date (12-19-22). | 643usa50552 | 12-19-22 |
Table 681usa50532.tab has been created as of the process date (12-19-22). | 681usa50532 | 12-19-22 |
Table 621usa56452.tab has been created as of the process date (12-19-22). | 621usa56452 | 12-19-22 |
Table 547usa67452.tab has been created as of the process date (12-19-22). | 547usa67452 | 12-19-22 |
I've been trying to use the following, but the information is not getting into the columns of my new dataframe:
df_list1[['Account', 'Date']] = df_list1['String'].str.extract(r'\b(\d+usa\d+)\b.*?(\d{2}-\d{2}-\d{4})')
String2 |
---|
3203usa34088 : Asset USA1 / asd011245 |
3203usa34088 : Asset USA2 / ghf023345 |
3203usa34088 : Asset USA3 / hgf012735 |
3203usa34088 : Asset USA4 / wet012455 |
3203usa34088 : Asset USA5 / nbj012245 |
And I would like to have the following:
String2 | Account2 |
---|---|
3200usa34088 : Asset USA1 / asd011245 | 3200usa34088 |
3201usa34088 : Asset USA2 / ghf023345 | 3201usa34088 |
3202usa34088 : Asset USA3 / hgf012735 | 3202usa34088 |
3203usa34088 : Asset USA4 / wet012455 | 3203usa34088 |
3204usa34088 : Asset USA5 / nbj012245 | 3204usa34088 |
For the first dataframe we can use str.extract
as follows:
df["Account"] = df["String1"].str.extract(r'(\w+)\.tab\b')
df["Date"] = df["String1"].str.extract(r'\((\d{2}-\d{2}-\d{2})\)')
For the second dataframe:
df["Account2"] = df["String2"].str.extract(r'^(\w+)')
I think this can work:
# Pandas lib
import pandas as pd
# -------------------------------------------------------------- FIRST DATAFRAME
# I will suppose youre importing the df from excel ok?
df1 = pd.read_excel("First_df.xlsx")
#First case:
list_account = []
list_date = []
for string in df1['String1']:
if "usa" in string:
new_string = string.split()
newnew_string = new_string[1].split(".")
date_string = new_string[10].split("(")
datedate_string = date_string[0].split(")")
list_account.append(newnew_string[0])
list_date.append(datedate_string[0])
df_output = pd.DataFrame({'Account': list_account})
df_output['Date'] = list_date
# -------------------------------------------------------------- SECOND DATAFRAME
df2 = pd.read_excel("Second_df.xlsx")
list_account2 = []
for string in df2['String2']:
if "usa" in string:
new_string = string.split()
list_account2.append(new_string[0])
df_output2 = pd.DataFrame({'Account2': list_account2})
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With