How can I find the positions of a list of substrings within a string?
Given a string:
"The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
And a list of substrings:
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
Desired output:
>>> s = "The plane, bound for St Petersburg, crashed in Egypt's Sinai desert just 23 minutes after take-off from Sharm el-Sheikh on Saturday."
>>> tokens = ['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
>>> find_offsets(tokens, s)
[(0, 3), (4, 9), (9, 10), (11, 16), (17, 20), (21, 23), (24, 34),
(34, 35), (36, 43), (44, 46), (47, 52), (52, 54), (55, 60), (61, 67),
(68, 72), (73, 75), (76, 83), (84, 89), (90, 98), (99, 103), (104, 109),
(110, 119), (120, 122), (123, 131), (131, 132)]
Explanation of the output: each tuple is a (start, end) pair of indices into the string s. For example, the first tuple in the desired output is (0, 3), and s[0:3] is the first substring, "The".
So if we loop through all the tuples of integers in the desired output (called out here), we get back the list of substrings, i.e.
>>> [s[start:end] for start, end in out]
['The', 'plane', ',', 'bound', 'for', 'St', 'Petersburg', ',', 'crashed', 'in', 'Egypt', "'s", 'Sinai', 'desert', 'just', '23', 'minutes', 'after', 'take-off', 'from', 'Sharm', 'el-Sheikh', 'on', 'Saturday', '.']
I've tried:
def find_offset(tokens, s):
    index = 0
    offsets = []
    for token in tokens:
        start = s[index:].index(token) + index
        index = start + len(token)
        offsets.append((start, index))
    return offsets
Is there another way to find the positions of a list of substrings within a string?
First solution:
# Use a list comprehension with str.index (t is the list of tokens).
[(s.index(e), s.index(e) + len(e)) for e in t]
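A small check of where this one-liner can go wrong, assuming a sentence in which a token repeats (the same s and t as in the second test further down). The name naive is only used here for illustration. Because str.index always returns the first occurrence, the second 'plane' gets the offsets of the first one:

```python
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']

# str.index(e) searches from the start of s every time,
# so a repeated token is always mapped to its first occurrence.
naive = [(s.index(e), s.index(e) + len(e)) for e in t]
print(naive)  # [(0, 3), (4, 9), (9, 10), (4, 9)]
```

The last tuple should be (11, 16), so the one-liner is only safe when every token occurs once and is not a substring of an earlier part of the string.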
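Second solution, which corrects the issue in the first solution (str.index always returns the first occurrence, so a repeated token such as ',' gets the same offsets each time):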
def find_offsets(tokens, s):
    tid = [list(e) for e in tokens]
    i = 0
    for id_token, token in enumerate(tid):
        while (token[0] != s[i]):
            i += 1
        tid[id_token] = tuple((i, i + len(token)))
        i += len(token)
    return tid
find_offsets(tokens, s)
Out[201]:
[(0, 3),
(4, 9),
(9, 10),
(11, 16),
(17, 20),
(21, 23),
(24, 34),
(34, 35),
(36, 43),
(44, 46),
(47, 52),
(52, 54),
(55, 60),
(61, 67),
(68, 72),
(73, 75),
(76, 83),
(84, 89),
(90, 98),
(99, 103),
(104, 109),
(110, 119),
(120, 122),
(123, 131),
(131, 132)]
#another test
s = 'The plane, plane'
t = ['The', 'plane', ',', 'plane']
find_offsets(t,s)
Out[212]: [(0, 3), (4, 9), (9, 10), (11, 16)]
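One more possible variant, as a sketch rather than a definitive answer: the loop above advances until the first character of the token matches, which works when the tokens cover every non-whitespace character of s. If some text is not covered by the tokens, matching the whole token with str.find and a start position may be safer. The function name find_offsets_checked is hypothetical, and raising ValueError for a missing token is an assumption about the desired behaviour.

```python
def find_offsets_checked(tokens, s):
    """Return (start, end) offsets of each token, searching left to right."""
    offsets = []
    pos = 0  # search for the next token only after the previous one
    for token in tokens:
        # str.find takes a start position, so no slice copy of s is needed
        start = s.find(token, pos)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {pos}")
        end = start + len(token)
        offsets.append((start, end))
        pos = end
    return offsets


print(find_offsets_checked(['The', 'plane', ',', 'plane'], 'The plane, plane'))
# [(0, 3), (4, 9), (9, 10), (11, 16)]

# Text that is not covered by the tokens ('cx') is skipped correctly
print(find_offsets_checked(['cat', 'car'], 'cat cx car'))
# [(0, 3), (7, 10)]
```

Slicing s with the returned offsets gives back the original tokens, which is a quick way to validate the result.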