Extract multiword expressions from annotated text

I have two columns in my dataframe: one with text, and one with an annotation for each multiword expression (MWE) in that text. Each annotation gives the type of the MWE and the range of characters it spans. For example,

Text column:

Barack Obama was president of the United States in 2008.

Annotation column:

MWE_type 0 12

This covers characters 0 to 12, so the expression is Barack Obama. Similarly,

MWE_type 34 47

covers United States.

How can I use the annotations to extract these expressions from the text and save them in a new column? For the example text, the result would be something like [Barack Obama, United States].

Thank you for your time! If you need something more specific I will be glad to add some information!

Asked by Radix, Nov 22 '25 at 19:11


1 Answer

If I've understood you correctly, based on your description in the comments on the question, here is a way to do the job.

First, based on your description, the data looks like this:

data = {'text' : ['Barack Obama was president of the United States in 2008.'],
    'annotation' : ['MWE_type 0 12 MWE_type 34 47']}

We will maintain a final_list, a list of lists in which each inner list holds the output for one row.

We can iterate over the rows with df.iterrows() and build the result for each row from row['text'] and row['annotation'].

for index, row in df.iterrows():

We can extract the pairs of indexes with a regular expression:

index_list = re.findall(r'\d+ \d+', row['annotation'])
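To make the regular expression step concrete, here is a minimal check on the example annotation. The second pattern, which also captures the type label, is an illustrative addition and assumes labels contain no whitespace. Note that the plain `\d+ \d+` pattern assumes a label never ends in digits followed by a space, otherwise those digits would be matched as a start offset.

```python
import re

annotation = 'MWE_type 0 12 MWE_type 34 47'

# Each match is a "start end" string.
pairs = re.findall(r'\d+ \d+', annotation)
print(pairs)  # ['0 12', '34 47']

# Assumed variant: capture (label, start, end) tuples instead.
labelled = re.findall(r'(\S+) (\d+) (\d+)', annotation)
print(labelled)  # [('MWE_type', '0', '12'), ('MWE_type', '34', '47')]
```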

We can iterate over this list of index pairs and append the corresponding substring to the result list for the current row.

for indexes in index_list:
    start, end = map(int, indexes.split())
    result.append(row['text'][start:end])
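The slice works here because, in the example, the end offset behaves as exclusive, which matches Python slicing. A quick check on one pair, assuming every annotation follows that convention (if your end offsets were inclusive, you would slice to `end + 1` instead):

```python
text = 'Barack Obama was president of the United States in 2008.'

# Parse one "start end" pair and slice; the slice stops before index `end`.
start, end = map(int, '34 47'.split())
span = text[start:end]

print(span)                      # United States
print(len(span) == end - start)  # True
```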

After processing a row, we append its result list to final_list:

final_list.append(result)

Finally, assign the final_list to df['result']:

df['result'] = final_list

The whole program is:

import pandas as pd
import re

data = {'text' : ['Barack Obama was president of the United States in 2008.'],
    'annotation' : ['MWE_type 0 12 MWE_type 34 47']}

df = pd.DataFrame(data)

# One inner list of extracted expressions per row
final_list = []

for index, row in df.iterrows():
    result = []
    # Find every "start end" pair in the annotation string
    index_list = re.findall(r'\d+ \d+', row['annotation'])
    for indexes in index_list:
        start, end = map(int, indexes.split())
        # Slice the expression out of the text
        result.append(row['text'][start:end])
    final_list.append(result)

df['result'] = final_list

print(df)

And you'll get:

                                                text  ...                         result
0  Barack Obama was president of the United State...  ...  [Barack Obama, United States]
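If you would rather keep the extraction logic in one reusable place, the same steps can be wrapped in a small function. This is only a sketch: `extract_mwes` and `SPAN_RE` are hypothetical names, and it assumes the same space-separated "start end" format with exclusive end offsets as above.

```python
import re

# Hypothetical helper; the name and pattern are illustrative.
SPAN_RE = re.compile(r'(\d+) (\d+)')

def extract_mwes(text, annotation):
    """Return the substrings of `text` named by the 'start end' pairs in `annotation`."""
    return [text[int(s):int(e)] for s, e in SPAN_RE.findall(annotation)]

text = 'Barack Obama was president of the United States in 2008.'
annotation = 'MWE_type 0 12 MWE_type 34 47'

print(extract_mwes(text, annotation))  # ['Barack Obama', 'United States']
```

With a dataframe that has 'text' and 'annotation' columns, something like `df['result'] = [extract_mwes(t, a) for t, a in zip(df['text'], df['annotation'])]` should fill the same column without `iterrows()`.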
Answered by devReddit, Nov 25 '25 at 11:11