This question might look long, but I promise it is really not complicated.
I have a DF with text blocks and some ID columns. I want to create a new DF that contains each sentence as its own row.
original_df = pd.DataFrame(data={"year":[2018,2019], "text_nr":[1,2], "text":["This is one sentence. This is another!","Please help me. I am lost. "]})
original_df
>>>
   year  text_nr  text
0  2018  1        "This is one sentence. This is another!"
1  2019  2        "Please help me. I am lost."
I would like to split each text block into individual sentences using spacy and create a new DF that looks like this:
sentences_df
>>>
   year  text_nr  sent_nr  sentence
0  2018  1        1        "This is one sentence."
1  2018  1        2        "This is another!"
2  2019  2        1        "Please help me."
3  2019  2        2        "I am lost."
I have found a way to do it like this:
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_sm")

sentences_list = []
for i, row in original_df.iterrows():
    doc = nlp(row["text"])
    sentences = [(row["year"], row["text_nr"], str(j + 1), sent.text.replace('\n', '').replace('\t', '').strip())
                 for j, sent in enumerate(doc.sents)]
    sentences_list = sentences_list + sentences
sentences_df = pd.DataFrame(sentences_list, columns=["year", "text_nr", "sent_nr", "sentence"])
But it is not very elegant, and I read that the df.apply(lambda x: ...) method is much faster. However, when I try it, I never manage to get the correct result. I tried these two ways:
nlp = spacy.load("en_core_web_sm")

def sentencizer(x, nlp_model):
    sentences = {}
    doc = nlp_model(x["text"])
    for i, sent in enumerate(doc.sents):
        sentences["year"] = x["year"]
        sentences["text_nr"] = x["text_nr"]
        sentences["sent_nr"] = str(i + 1)
        sentences["sentence"] = sent.text.replace('\n', '').replace('\t', '').strip()
    return sentences

sentences_df = original_df.head().apply(lambda x: pd.Series(sentencizer(x, nlp)), axis=1)
This only ever gets the last sentence:
sentences_df
>>>
   year  text_nr  sent_nr  sentence
0  2018  1        2        "This is another!"
1  2019  2        2        "I am lost."
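I think the reason is that sentencizer builds a single dict and overwrites the same keys on every pass through the loop, so only the last sentence survives:
d = {}
for s in ["Please help me.", "I am lost."]:
    d["sentence"] = s  # the same key is reassigned on every iteration
print(d)  # {'sentence': 'I am lost.'} - only the last value is kept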
nlp = spacy.load("en_core_web_sm")

def sentencizer(x, nlp_model):
    sentences = {"year": [], "text_nr": [], "sent_nr": [], "sentence": []}
    doc = nlp_model(x["text"])
    for i, sent in enumerate(doc.sents):
        sentences["year"].append(x["year"])
        sentences["text_nr"].append(x["text_nr"])
        sentences["sent_nr"].append(str(i + 1))
        sentences["sentence"].append(sent.text.replace('\n', '').replace('\t', '').strip())
    return sentences

sentences_df = original_df.apply(lambda x: pd.Series(sentencizer(x, nlp)), axis=1)
This yields a DF with lists as entries, since sentencizer returns a single dict of lists for each row:
sentences_df
>>>
   year          text_nr  sent_nr  sentence
0  [2018, 2018]  [1, 1]   [1, 2]   ["This is one sentence.", "This is another!"]
1  [2019, 2019]  [2, 2]   [1, 2]   ["Please help me.", "I am lost."]
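I believe a lists-DF like this can be flattened with DataFrame.explode (a sketch; multi-column explode needs pandas >= 1.3):
expanded = (sentences_df
            .explode(["year", "text_nr", "sent_nr", "sentence"])  # one row per list element
            .reset_index(drop=True))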
I could probably expand this last DF along those lines, but I am sure there is a way to do this correctly in one go. I would like to use spacy for the splitting, as it has more advanced sentence boundary detection than plain regex/string splitting. You do not need to download spacy to help me (string.split() is fine for the dummy data here). I just need to find a logic that works along the same lines as the following, so I can rewrite it to use spacy:
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.\n This is another! ")
sentences = [sent.text.strip() for sent in doc.sents]  # doc.sents is a generator
sentences
>>>
["This is a sentence", "This is another!"]
So something along the lines of this would be great:
text = "This is a sentence.\n This is another! "
sentences = [sent.replace("\n","").strip() for sent in text.split(".")]
sentences
>>>
["This is a sentence", "This is another!"]
Thanks a lot for any help. I am quite new to programming so please have mercy :)
Found a solution that works:
import numpy as np

nlp = spacy.load("en_core_web_sm")
rows_list = []  # must exist before apply, since splitter appends to it

def splitter(x, nlp):
    doc = nlp(x["text"])
    a = [str(sent) for sent in doc.sents]
    b = len(a)
    dictionary = {"text_nr": np.repeat(x["text_nr"], b), "sentence_nr": list(range(1, b + 1)), "sentence": a}
    dictionaries = [{key: value[i] for key, value in dictionary.items()} for i in range(b)]
    for dictionary in dictionaries:
        rows_list.append(dictionary)

original_df.apply(lambda x: splitter(x, nlp), axis=1)
new_df = pd.DataFrame(rows_list, columns=['text_nr', 'sentence_nr', 'sentence'])
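For the record, a variant of the same idea that avoids appending to a global list from inside apply: build one small DataFrame per text block and concatenate them (just a sketch, reusing the nlp object from above):
def split_row(x, nlp):
    # one small DataFrame per text block; scalar columns broadcast to the list length
    sents = [sent.text.strip() for sent in nlp(x["text"]).sents]
    return pd.DataFrame({"year": x["year"],
                         "text_nr": x["text_nr"],
                         "sent_nr": range(1, len(sents) + 1),
                         "sentence": sents})

new_df = pd.concat([split_row(x, nlp) for _, x in original_df.iterrows()],
                   ignore_index=True)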
Something along these lines might work:
# update punctuations list if needed
punctuations = r'\.\!\?'
(original_df.drop('text', axis=1)
            .merge(original_df.text
                              .str.extractall(rf'(?P<sentence>[^{punctuations}]+[{punctuations}])\s?')
                              .reset_index('match'),
                   left_index=True, right_index=True, how='left')
)
Output:
   year  text_nr  match  sentence
0  2018  1        0      This is one sentence.
0  2018  1        1      This is another!
1  2019  2        0      Please help me.
1  2019  2        1      I am lost.
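If you want to keep spacy for the sentence boundary detection, the same one-go result can be had with explode (a sketch, assuming pandas >= 0.25 and the nlp object from the question):
out = (original_df
       .assign(sentence=original_df["text"].apply(
           lambda t: [sent.text.strip() for sent in nlp(t).sents]))
       .drop(columns="text")   # each cell in "sentence" is now a list
       .explode("sentence")    # one row per list element
       .reset_index(drop=True))
out["sent_nr"] = out.groupby("text_nr").cumcount() + 1  # number sentences within each text block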