Split the sentence into its tokens as a character annotation Python

Question

After a long search I couldn't find an answer to my question, which is why I decided to ask it here. I am trying to achieve a specific result with RE and NLTK. Given a sentence, I have to tag each character using the BIS format: B (beginning of a token), I (intermediate or end position of a token), or S (space). For instance, given the sentence:

The pen is on the table.

The system will have to provide the following output:

BIISBIISBISBISBIISBIIIIB

which can be read as:

<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <1-char token>

My result is close, but instead of:

BIISBIISBISBISBIISBIIIIB

I get:

BIISBIISBISBISBIISBIIIISB

That is, I get an extra S (space) between "table" and the dot. The output should be:

<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <1-char token>

Mine is:

<3-char token> <space> <3-char token> <space> <2-char token> <space> <2-char token> <space> <3-char token> <space> <5-char token> <space> <1-char token>

My code so far:

from nltk.tokenize import word_tokenize
import re
p = "The pen is on the table."
# Split text into words using NLTK
text = word_tokenize(p)
print(text)
initial_char = [x.replace(x[0],'B') for x in text]
print(initial_char)
def listToString(s):
    # separator: a single space (not an empty string)
    str1 = " "
    # return the tokens joined by that separator
    return (str1.join(s))
new = listToString(initial_char)
print(new)
def start_from_sec(my_text):
    return ' '.join([f'{word[0]}{(len(word) - 1) * "I"}' for word in my_text.split()])
res = start_from_sec(new)
p = re.sub(' ', 'S', res)
print(p)

Wiktor Stribiżew · Accepted Answer

You may use a single regex to tokenize the string:

(\w)(\w*)|([^\w\s])|\s

See the regex demo

Pattern details

(\w)(\w*) - Group 1: any word char (letter, digit or _) and then Group 2: any 0 or more word chars
| - or
([^\w\s]) - Group 3: any char other than a word or whitespace char
| - or
\s - a whitespace char
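
To make the three alternatives concrete, here is a minimal sketch that lists what each match of the pattern captures. The helper name describe_matches and the labels "word", "punctuation" and "whitespace" are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# The pattern from the answer; group numbers follow the description above.
TOKEN_RE = re.compile(r'(\w)(\w*)|([^\w\s])|\s')

def describe_matches(text):
    """Return one (matched_text, kind) pair per regex match (illustrative helper)."""
    described = []
    for m in TOKEN_RE.finditer(text):
        if m.group(1):        # Groups 1 and 2 matched: a run of word chars
            kind = "word"
        elif m.group(3):      # Group 3 matched: a single non-word, non-space char
            kind = "punctuation"
        else:                 # no group set: the bare \s alternative matched
            kind = "whitespace"
        described.append((m.group(0), kind))
    return described

for matched, kind in describe_matches("on the table."):
    print(repr(matched), kind)
```

Note that "table" and "." come out as two adjacent matches with no whitespace match between them, which is what keeps the S out of that position.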

If Group 1 matches, the replacement is B followed by as many I characters as there are chars in Group 2. If Group 3 matches, the replacement is B. Otherwise a whitespace char was matched, and it is replaced with S.

This can be customized further, e.g.

Treat _ only as punctuation: r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s'
Replace 1 or more whitespaces with a single S: r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s+'
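
A small sketch comparing the variants side by side. The helper bis and the constant names are hypothetical, introduced only so the same replacement logic can be run with each pattern; the patterns are the ones listed above.

```python
import re

def bis(text, pattern):
    """Tag text with B/I/S using a three-group pattern shaped like the ones above."""
    def repl(m):
        if m.group(1):                      # word: B plus one I per remaining char
            return "B" + "I" * len(m.group(2))
        if m.group(3):                      # single punctuation char
            return "B"
        return "S"                          # whitespace (one char, or a run with \s+)
    return re.sub(pattern, repl, text)

DEFAULT = r'(\w)(\w*)|([^\w\s])|\s'
UNDERSCORE_AS_PUNCT = r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s'
COLLAPSE_SPACES = r'([^\W_])([^\W_]*)|([^\w\s]|_)|\s+'

print(bis("a_b", DEFAULT))               # BII  (underscore is part of the word)
print(bis("a_b", UNDERSCORE_AS_PUNCT))   # BBB  (underscore is its own token)
print(bis("a  b", UNDERSCORE_AS_PUNCT))  # BSSB (one S per space)
print(bis("a  b", COLLAPSE_SPACES))      # BSB  (run of spaces becomes one S)
```

With the \s+ variant the tag string can be shorter than the input, so it no longer lines up character by character with the original sentence.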

A Python demo:

import re
p = "The pen is on the table."
def repl(x):
    if x.group(1):
        return "B{}".format("I"*len(x.group(2)))
    elif x.group(3):
        return "B"
    else:
        return "S"

print(re.sub(r'(\w)(\w*)|([^\w\s])|\s', repl, p))
# => BIISBIISBISBISBIISBIIIIB
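
For context, the extra S in the question's output most likely comes from rejoining the tokens with a space, which discards the original spacing. The sketch below reproduces that effect without NLTK. The token list is hardcoded on the assumption that word_tokenize splits the final dot into its own token, and tag_from_joined is a hypothetical stand-in for the question's join-then-substitute steps.

```python
# Assumed token list: what nltk's word_tokenize is expected to return for this
# sentence (hardcoded so the sketch runs without NLTK installed).
tokens = ['The', 'pen', 'is', 'on', 'the', 'table', '.']

# Joining with a space puts one between "table" and ".", which the
# original sentence does not have.
joined = " ".join(tokens)
print(joined)  # The pen is on the table .

def tag_from_joined(tokens):
    """Mimic the question's approach: tag each token, with an S between every pair."""
    return "S".join("B" + "I" * (len(t) - 1) for t in tokens)

print(tag_from_joined(tokens))  # BIISBIISBISBISBIISBIIIISB
```

The regex approach avoids this because it tags the raw string directly, so an S appears only where the input actually contains whitespace.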

Tags: python, python-3.x, tokenize, nltk, python-re

Elias
