I am performing the following operations on lists of words: I read lines in from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing. I am unsure how to replace every single quote with a tag while leaving all apostrophes alone. My current method is to use a compiled regex:
apo = re.compile(r"[A-Za-z]'[A-Za-z]")
and perform the following operation:
if "'" in word and not apo.search(word):
    word = word.replace("'", "\n<singlequote>")
but this ignores cases where a single quote is used around a word that also contains an apostrophe. It also does not tell me whether the single quote abuts the start or the end of a word.
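A minimal reproduction of the failure (wrapping my check in a function, `tag_quotes`, just for illustration): because `didn't.'` contains an in-word apostrophe, the guard skips the whole word and the trailing quote is never tagged.

```python
import re

# The apostrophe pattern from above.
apo = re.compile(r"[A-Za-z]'[A-Za-z]")

def tag_quotes(word):
    # Original all-or-nothing guard: any in-word apostrophe
    # suppresses tagging for the entire word.
    if "'" in word and not apo.search(word):
        word = word.replace("'", "\n<singlequote>")
    return word

print(tag_quotes("'George"))   # quote is tagged
print(tag_quotes("didn't.'"))  # trailing quote slips through untouched
```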
Example input:
don't
'George
ma'am
end.'
didn't.'
'Won't
Example output (after processing and printing to file):
don't
<opensingle>
George
ma'am
end
<period>
<closesingle>
didn't
<period>
<closesingle>
<opensingle>
Won't
I do have a further question in relation to this task: since distinguishing <opensingle> from <closesingle> seems rather difficult, would it be wiser to perform substitutions like
word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')
after performing the replacement operation?
I suggest working smart here: use NLTK or another NLP toolkit instead of hand-rolling regexes.
Tokenize words like this:
import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
You may not like the fact that contractions like don't are separated into two tokens. Actually, this is expected behavior; see NLTK Issue 401.
However, TweetTokenizer can help with that:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize("The code didn't work!")
If it gets more involved a RegexpTokenizer could be helpful:
from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York. Please don't buy me\njust one of them."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)
Then it should be much easier to annotate the tokenized words correctly.
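As a rough sketch of that annotation step, here is a dependency-free version (the tag names follow the question; the token pattern is my own assumption, modeled on the RegexpTokenizer pattern above):

```python
import re

# Words may carry one internal apostrophe (don't, ma'am);
# everything else that is not whitespace becomes its own token.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

# Map punctuation tokens to the question's tags.
TAGS = {'.': '<period>', ',': '<comma>', "'": '<singlequote>'}

def annotate(text):
    return [TAGS.get(tok, tok) for tok in TOKEN.findall(text)]

print(annotate("end.' didn't,"))
```

Note that this flat pass still tags every stray quote the same way; telling an opening quote from a closing one needs positional anchors, as shown in the other answer.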
What you really need to properly replace a starting and ending ' is a regex. To match them you should use ^' for a starting ' (opensingle) and '$ for an ending ' (closesingle). Unfortunately, the replace method does not support regexes, so you should use re.sub instead.
Below you have an example program, printing your desired output (in Python 3):
import re
text = "don't 'George ma'am end.' didn't.' 'Won't"
words = text.split(" ")
for word in words:
    word = re.sub(r"^'", '<opensingle>\n', word)
    word = re.sub(r"'$", '\n<closesingle>', word)
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    print(word)
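The same substitutions wrapped in a function (the name tag_word is mine) make the behavior easy to check on the tricky cases from the question:

```python
import re

def tag_word(word):
    # Anchored patterns tag quotes by position: ^' opens, '$ closes.
    # In-word apostrophes match neither anchor, so they survive.
    word = re.sub(r"^'", '<opensingle>\n', word)
    word = re.sub(r"'$", '\n<closesingle>', word)
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    return word

print(tag_word("didn't.'"))
print(tag_word("'Won't"))
```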