I have texts of variable size (1k-100k characters). I want to get all the words around a given word within a fixed proximity. The given word is found with a regex, so I have its start and end positions.
For example:
import re

PROXIMITY_LENGTH = 10  # the fixed proximity
my_text = 'some random words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()
print(f'start = {start}, stop = {stop}')
print(my_text[start - PROXIMITY_LENGTH: start])
print(my_text[stop: stop + PROXIMITY_LENGTH])
left_limit = my_text[:start - PROXIMITY_LENGTH].rfind(' ') + 1
right_limit = stop + PROXIMITY_LENGTH + my_text[stop + PROXIMITY_LENGTH:].find(' ')
print('\n')
print(my_text[left_limit: start])
print(my_text[stop: right_limit])
output:
start = 18, stop = 22
dom words
word1 wor
random words
word1 word123
The problem is at the limits: the fixed proximity can cut a word in half at the left or right edge. In the example above I tried to come up with a solution, but it fails if the words are delimited by tabs or newlines. For example:
with my_text = 'some\trandom words 1123 word1 word123 a' my solution returns some random words on the left side, which is wrong.
Any help is appreciated! Thx!
Instead of looking at characters, I would look at words. That way you find the target and take N words before and after it:
PROXIMITY_LENGTH = 2 # the fixed proximity
my_text = 'some random words 1123 word1 word123 a \t1123 this too will work'.split()
found = [x.find('1123') for x in my_text]
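# max(0, ...) keeps the slice start from going negative when the target is near the beginning
k = [' '.join(my_text[max(0, index - PROXIMITY_LENGTH):index + PROXIMITY_LENGTH + 1]) for index, item in enumerate(found) if item == 0]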
print(k)
# ['random words 1123 word1 word123', 'word123 a 1123 this too']
Using a regex, we can build the found list like this instead:
import re

found = []
for x in my_text:
    if re.search(r'\b1123\b', x):
        found.append(0)
    else:
        found.append(-1)
The only thing I do is split the string into a list :)
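The same idea can be wrapped in a small function. This is only a sketch: the name words_around and its parameters are made up for illustration, and it assumes the target is a single whitespace-delimited token. Because str.split() with no arguments splits on any run of whitespace, tabs and newlines are handled the same way as spaces.

```python
import re

def words_around(text, pattern, n):
    """Return up to n words on each side of every token matching pattern."""
    tokens = text.split()  # splits on spaces, tabs and newlines alike
    regex = re.compile(pattern)
    windows = []
    for i, token in enumerate(tokens):
        if regex.search(token):
            lo = max(0, i - n)  # clamp so a negative index cannot wrap around
            windows.append(' '.join(tokens[lo:i + n + 1]))
    return windows

text = '1123 some\trandom words 1123 word1\nword123 a'
print(words_around(text, r'\b1123\b', 2))
# ['1123 some random', 'random words 1123 word1 word123']
```

Note that the original whitespace is lost, since the window is rebuilt with single spaces.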
This can be done by simply expanding your regex pattern to include the desired number of words around the target match:
import re

L = 2  # using a proximity length of just 2 for demo
my_text = 'some random words 1123 word1 word123 a'
print(re.search(r'(\w+\s+){{0,{0}}}\b1123\b(\s+\w+){{0,{0}}}'.format(L), my_text).group())
This outputs:
random words 1123 word1 word123
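If the proximity really has to be counted in characters, as in the question, one possible approach is to keep the slicing but widen each edge until it reaches whitespace of any kind rather than searching for ' ' only. The helper below is a sketch, not a tested solution for every input: context_around is a hypothetical name, and it assumes a non-empty match whose span comes from re.search.

```python
import re

def context_around(text, start, stop, proximity):
    """Return (left, right) context of about `proximity` characters,
    widened so that no word is cut at either edge."""
    left = max(0, start - proximity)
    # If the cut falls inside a word, move left to the start of that word.
    while left > 0 and not text[left].isspace() and not text[left - 1].isspace():
        left -= 1
    right = min(len(text), stop + proximity)
    # If the cut falls inside a word, move right to the end of that word.
    while 0 < right < len(text) and not text[right].isspace() and not text[right - 1].isspace():
        right += 1
    return text[left:start].strip(), text[stop:right].strip()

my_text = 'some\trandom words 1123 word1\nword123 a'
match = re.search(r'\b1123\b', my_text)
print(context_around(my_text, *match.span(), 10))
# ('random words', 'word1\nword123')
```

Unlike the word-based answers, this keeps the original tabs and newlines inside the returned context; only the outer edges are stripped.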