How to get all the words around a word within a fixed proximity

Question

I have texts of variable size (1k-100k characters). I want to get all the words around a given word within a fixed proximity. The given word is found with a regex, so I have the start and end positions of the match.

For example:

import re

PROXIMITY_LENGTH = 10  # the fixed proximity, in characters
my_text = 'some random words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()

print(f'start = {start}, stop = {stop}')
print(my_text[start - PROXIMITY_LENGTH: start])
print(my_text[stop: stop + PROXIMITY_LENGTH])

# extend each cut to the nearest space so that no word is split
left_limit = my_text[:start - PROXIMITY_LENGTH].rfind(' ') + 1
right_limit = stop + PROXIMITY_LENGTH + my_text[stop + PROXIMITY_LENGTH:].find(' ')

print('\n')
print(my_text[left_limit: start])
print(my_text[stop: right_limit])

output:

start = 18, stop = 22
dom words 
 word1 wor


random words 
 word1 word123

The problem is at the limits: the fixed proximity can cut the last word in half (at the left or right limit). In the example above I tried to come up with a solution, but it fails if the words are delimited by tabs or newlines instead of spaces, e.g.:

for my_text = 'some random words 1123 word1 word123 a' my solution returns some random words on the left side, which is wrong.

Any help is appreciated! Thx!
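
One way to keep the proximity in characters while treating tabs and newlines as delimiters is to walk outward from each cut point until a whitespace character is reached. The following is only a minimal sketch: `context_by_chars` is a hypothetical helper name, and the tab in the sample text is an assumed input chosen to reproduce the failure described above.

```python
import re


def context_by_chars(text, start, stop, proximity):
    """Return (left, right) context around text[start:stop].

    Each side covers at least `proximity` characters when available and is
    extended outward so that it never ends in the middle of a word.
    """
    left_cut = max(0, start - proximity)
    right_cut = min(len(text), stop + proximity)

    # Move the left cut back to the beginning of the word it landed in.
    while left_cut > 0 and not text[left_cut - 1].isspace():
        left_cut -= 1

    # Move the right cut forward to the end of the word it landed in.
    while right_cut < len(text) and not text[right_cut].isspace():
        right_cut += 1

    return text[left_cut:start], text[stop:right_cut]


# Assumed sample: a tab instead of a space after the first word.
sample = 'some\trandom words 1123 word1 word123 a'
match = re.search(r'\b1123\b', sample)
left, right = context_by_chars(sample, *match.span(), 10)
print(repr(left))
print(repr(right))
```

Because `str.isspace()` is true for spaces, tabs and newlines alike, the helper does not depend on which delimiter separates the words.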

Prayson W. Daniel · Accepted Answer

Instead of looking at characters, I would look at words. That way you can say: find my target and take N words before and after it:

PROXIMITY_LENGTH = 2  # the fixed proximity, in words
my_text = 'some random words 1123 word1 word123 a \t1123 this too will work'.split()

# find() returns 0 only for words that start with the target
found = [x.find('1123') for x in my_text]

# max(0, ...) keeps the left slice bound from going negative near the start of the list
k = [' '.join(my_text[max(0, index - PROXIMITY_LENGTH):index + PROXIMITY_LENGTH + 1]) for index, item in enumerate(found) if item == 0]

print(k)

# ['random words 1123 word1 word123', 'word123 a 1123 this too']

Using a regex, we can build the found list like this instead:

import re

found = []
for x in my_text:
    # 0 marks a word that matches the target, -1 marks everything else
    if re.search(r'\b1123\b', x):
        found.append(0)
    else:
        found.append(-1)

The only thing I do is split the string into a list :)
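
The two steps above (split into words, then slice around each matching word) could be wrapped into one function. This is a minimal sketch rather than the answer's own code: `words_around` and its parameter names are hypothetical, and the left bound is clamped to 0 on the assumption that a target near the start of the text should still return its context.

```python
import re


def words_around(text, pattern, n_words):
    """Return every word matching `pattern` with up to n_words words on each side."""
    words = text.split()  # split() without arguments splits on any whitespace
    target = re.compile(pattern)
    contexts = []
    for index, word in enumerate(words):
        if target.search(word):
            # Clamp the left bound so a negative index never wraps around.
            left = max(0, index - n_words)
            contexts.append(' '.join(words[left:index + n_words + 1]))
    return contexts


sample = 'some random words 1123 word1 word123 a \t1123 this too will work'
print(words_around(sample, r'\b1123\b', 2))
# ['random words 1123 word1 word123', 'word123 a 1123 this too']
```

Note that joining with a single space normalizes the original whitespace, so tabs and newlines between words are not preserved in the result.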

blhsing · Answer

This can be done by simply expanding your regex pattern to include the desired number of words around the target match:

import re

L = 2  # using a proximity length of just 2 words for demo
my_text = 'some random words 1123 word1 word123 a'
# {{ and }} are literal braces for str.format, so the pattern contains {0,2}
print(re.search(r'(\w+\s+){{0,{0}}}\b1123\b(\s+\w+){{0,{0}}}'.format(L), my_text).group())

This outputs:

random words 1123 word1 word123
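
If the target can occur more than once, the same expanded pattern could be applied with `re.finditer`. A minimal sketch, assuming the target is a literal string that begins and ends with a word character; `all_contexts` is a hypothetical helper name:

```python
import re


def all_contexts(text, target, n_words):
    """Return each occurrence of `target` with up to n_words words on each side."""
    pattern = re.compile(
        r'(?:\w+\s+){{0,{n}}}\b{t}\b(?:\s+\w+){{0,{n}}}'.format(
            n=n_words,
            t=re.escape(target),  # treat the target as literal text
        )
    )
    return [m.group() for m in pattern.finditer(text)]


sample = 'some random words 1123 word1 word123 a \t1123 this too will work'
print(all_contexts(sample, '1123', 2))
# ['random words 1123 word1 word123', 'a \t1123 this too']
```

`re.finditer` returns non-overlapping matches, so a word consumed by one match (here `word123`) is not reused as left context for the next one. Since `\w+` only matches word characters, punctuation attached to a neighbouring word ends the context early.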

Tags: python, python-3.x, text-processing

kederrac
