After a long search I couldn't find an answer to my question, so I decided to ask it here. I am trying to achieve a specific result with the re module and NLTK.
Given a sentence, I have to tag each character in the BIS format: B (beginning of a token), I (intermediate or end position of a token), or S (space).
For instance, given the sentence:
The pen is on the table.
The system will have to provide the following output:
BIISBIISBISBISBIISBIIIIB
which can be read as:
<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <1-char token>
My result is close, but instead of:
BIISBIISBISBISBIISBIIIIB
I get:
BIISBIISBISBISBIISBIIIISB
That is, I get an extra S (space) between "table" and the final dot.
The output should be:
<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <1-char token>
Mine is:
<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <space> <1-char token>
My code so far:
from nltk.tokenize import word_tokenize
import re

p = "The pen is on the table."

# Split text into words using NLTK
text = word_tokenize(p)
print(text)

initial_char = [x.replace(x[0],'B') for x in text]
print(initial_char)

def listToString(s):
    # initialize an empty string
    str1 = " "
    # return string
    return (str1.join(s))

new = listToString(initial_char)
print(new)

def start_from_sec(my_text):
    return ' '.join([f'{word[0]}{(len(word) - 1) * "I"}' for word in my_text.split()])

res = start_from_sec(new)
p = re.sub(' ', 'S', res)
print(p)
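The extra S most likely comes from the tokenize-then-join round trip: the tokenizer splits "table." into "table" and ".", and joining the tokens with " " puts a space between them that the original sentence never had. Below is a minimal sketch of that effect, followed by one possible way to keep a tokenizer and still avoid the problem by locating each token in the original string. The token list is hard-coded as the assumed word_tokenize output so the snippet runs without NLTK, and bis_from_tokens is a hypothetical helper name, not part of any library.

```python
# Tokens as word_tokenize is assumed to return them for this sentence
# (hard-coded here so the snippet runs without NLTK).
text = "The pen is on the table."
tokens = ["The", "pen", "is", "on", "the", "table", "."]

# Joining with " " inserts a separator between every pair of tokens,
# including "table" and ".", which were adjacent in the original text.
rejoined = " ".join(tokens)
tagged_after_join = "S".join("B" + "I" * (len(w) - 1) for w in rejoined.split())
print(repr(rejoined))
print(tagged_after_join)


def bis_from_tokens(text, tokens):
    """Tag text by locating each token in it (hypothetical helper).

    Assumes the tokens appear in order, unchanged, in the text and that
    everything between two tokens is whitespace.
    """
    out, pos = [], 0
    for tok in tokens:
        start = text.find(tok, pos)
        if start < 0:
            raise ValueError(f"token {tok!r} not found in text")
        out.append("S" * (start - pos))          # gap before the token
        out.append("B" + "I" * (len(tok) - 1))   # the token itself
        pos = start + len(tok)
    out.append("S" * (len(text) - pos))          # trailing whitespace, if any
    return "".join(out)


aligned = bis_from_tokens(text, tokens)
print(aligned)
```

Because the tags are derived from positions in the original string, no space is invented between "table" and ".". A tokenizer that rewrites characters (NLTK's Treebank-style tokenization changes double quotes, for example) would need extra handling before this alignment could work.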
You may use a single regex to tokenize the string:
(\w)(\w*)|([^\w\s])|\s
Pattern details
(\w)(\w*) - Group 1: any word char (letter, digit or _) and then Group 2: any 0 or more word chars
| - or
([^\w\s]) - Group 3: any char other than a word or whitespace char
| - or
\s - a whitespace char

If Group 1 matches, the return value is B followed by as many Is as there are chars in Group 2. If Group 3 matches, replace with B. Otherwise a whitespace char was matched; replace it with S.
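To see which alternative fires for each piece of the input, the matches can be listed with re.finditer. This is only an illustrative sketch; describe and the branch labels "word", "punct" and "space" are made-up names for this example.

```python
import re

PATTERN = re.compile(r'(\w)(\w*)|([^\w\s])|\s')


def describe(text):
    """Return (matched_text, branch) pairs, one per regex match."""
    out = []
    for m in PATTERN.finditer(text):
        if m.group(1):
            branch = "word"    # Groups 1 and 2 took part in the match
        elif m.group(3):
            branch = "punct"   # Group 3 took part in the match
        else:
            branch = "space"   # bare \s alternative, no group is set
        out.append((m.group(0), branch))
    return out


for piece, branch in describe("the table."):
    print(repr(piece), branch)
```

Each match corresponds to exactly one branch, which is why a replacement callback can decide between "B...I", "B" and "S" by checking the groups in that order.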
This can be customized further, e.g.
To treat _ as punctuation only: r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s'
To also replace each run of one or more whitespace chars with a single S: r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s+'
See the Python demo:
import re

p = "The pen is on the table."

def repl(x):
    if x.group(1):
        return "B{}".format("I"*len(x.group(2)))
    elif x.group(3):
        return "B"
    else:
        return "S"

print( re.sub(r'(\w)(\w*)|([^\w\s])|\s', repl, p) )
# => BIISBIISBISBISBIISBIIIIB
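The customized patterns differ from the basic one only on input that contains underscores or runs of whitespace. A small sketch comparing the three follows; bis, the constant names and the sample string "a_b  c." (with two spaces) are chosen purely for illustration.

```python
import re


def bis(text, pattern):
    """Apply the B/I/S replacement using the given three-group pattern."""
    def repl(m):
        if m.group(1):
            return "B" + "I" * len(m.group(2))
        if m.group(3):
            return "B"
        return "S"
    return re.sub(pattern, repl, text)


DEFAULT = r'(\w)(\w*)|([^\w\s])|\s'
UNDERSCORE_PUNCT = r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s'
COLLAPSE_SPACES = r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s+'

sample = "a_b  c."  # note the two consecutive spaces
print(bis(sample, DEFAULT))           # a_b is one token, each space gives S
print(bis(sample, UNDERSCORE_PUNCT))  # _ becomes its own one-char token
print(bis(sample, COLLAPSE_SPACES))   # the two spaces give a single S
```

Note that with the \s+ variant the output is no longer guaranteed to have the same length as the input, so it only fits if a one-tag-per-character alignment is not required.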