
How to get all the words around a word within a fixed proximity

I have texts of variable size (1k-100k characters). I want to get all the words around a given word within a fixed proximity. The given word is found with a regex, so I have the start and end offsets of the match.

For example:

import re

PROXIMITY_LENGTH = 10  # the fixed proximity, in characters
my_text = 'some random words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()

print(f'start = {start}, stop = {stop}')
print(my_text[start - PROXIMITY_LENGTH: start]) 
print(my_text[stop: stop + PROXIMITY_LENGTH])

left_limit = my_text[:start - PROXIMITY_LENGTH].rfind(' ') + 1
right_limit = stop + PROXIMITY_LENGTH + my_text[stop + PROXIMITY_LENGTH:].find(' ') 

print('\n')
print(my_text[left_limit: start]) 
print(my_text[stop: right_limit])

output:

start = 18, stop = 22
dom words 
 word1 wor


random words 
 word1 word123

The problem is at the limits: the fixed proximity can cut a word in half at the left or right edge. In the example above I tried to work around this by extending each limit to the nearest space, but that fails when words are separated by tabs or newlines. For example:

for my_text = 'some\trandom words 1123 word1 word123 a', my solution returns some\trandom words on the left side, which is wrong.

Any help is appreciated! Thx!

kederrac asked Dec 29 '25 22:12


2 Answers

Instead of counting characters, I would count words. That way you find the target and take N words before and after it:

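PROXIMITY_LENGTH = 2  # the fixed proximity, in words
# split() with no argument splits on any whitespace (spaces, tabs, newlines)
my_text = 'some random words 1123 word1 word123 a \t1123 this too will work'.split()

# 0 where a word starts with the target, another value otherwise
found = [x.find('1123') for x in my_text]

# max(0, ...) keeps the slice start from going negative when the match is near the beginning
k = [' '.join(my_text[max(0, index - PROXIMITY_LENGTH):index + PROXIMITY_LENGTH + 1]) for index, item in enumerate(found) if item == 0]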


print(k)

# ['random words 1123 word1 word123', 'word123 a 1123 this too']

Using a regex, we can build the found list like this instead:

import re

found = []
for x in my_text:
    if re.search(r'\b1123\b', x):
        found.append(0)
    else:
        found.append(-1)

The only thing I do is split the string into a list :)
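The two snippets above can be folded into one helper. This is only a sketch: the function name words_around and its parameters are made up for illustration and are not part of the answer. It assumes that a "word" is any run of non-whitespace characters, and that the window should be clamped when the match sits near the start of the text.

```python
import re

def words_around(text, pattern, n):
    """Return each word matching `pattern` with up to `n` words on either side.

    Hypothetical helper: the name and signature are illustrative only.
    """
    words = text.split()  # no argument: splits on spaces, tabs and newlines alike
    regex = re.compile(pattern)
    windows = []
    for index, word in enumerate(words):
        if regex.search(word):
            start = max(0, index - n)  # clamp so the slice start never goes negative
            windows.append(' '.join(words[start:index + n + 1]))
    return windows

# Tab-separated input from the question
print(words_around('some\trandom words 1123 word1 word123 a', r'\b1123\b', 2))
# Match at the very start of the text
print(words_around('1123 word1 word123 a', r'\b1123\b', 2))
```

Note that joining with ' ' normalizes the original separators, so a tab or newline between words comes back as a single space.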

Prayson W. Daniel answered Dec 31 '25 11:12


This can be done by simply expanding your regex pattern to include the desired number of words around the target match:

import re

L = 2  # using a proximity length of just 2 words for the demo
my_text = 'some random words 1123 word1 word123 a'
print(re.search(r'(\w+\s+){{0,{0}}}\b1123\b(\s+\w+){{0,{0}}}'.format(L), my_text).group())

This outputs:

random words 1123 word1 word123
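A possible variation on the same idea, in case the target is not fixed or occurs more than once. The helper name regex_context is hypothetical. Two things differ from the pattern above and are my assumptions, not the answer's: the target is passed through re.escape, and context words are matched with \S+ instead of \w+ so that words with attached punctuation still count.

```python
import re

def regex_context(text, target, n):
    """Return every occurrence of `target` with up to `n` words of context per side.

    Hypothetical helper: the name and signature are illustrative only.
    """
    word = re.escape(target)  # treat the target literally, even if it has regex metacharacters
    # \S+ (assumption) lets context words contain punctuation; the answer above uses \w+
    pattern = rf'(?:\S+\s+){{0,{n}}}\b{word}\b(?:\s+\S+){{0,{n}}}'
    return [m.group() for m in re.finditer(pattern, text)]

print(regex_context('some\trandom words 1123 word1 word123 a', '1123', 2))
print(regex_context('x 1123 y z w 1123 q', '1123', 1))
```

Unlike the split-and-join approach, the matched text keeps its original whitespace. re.finditer returns non-overlapping matches, so if two occurrences are closer together than the window, the second one can end up inside the first match's context instead of being reported on its own.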

blhsing answered Dec 31 '25 12:12


