There already is an approaching answer in R gsub("[^[:alnum:]['-]", " ", my_string), but it does not work in Python:
my_string = 'compactified on a calabi-yau threefold @ ,.'
re.sub("[^[:alnum:]['-]", " ", my_string)
gives 'compactified on a calab yau threefold  @ ,.'
So not only does it remove the intra-word dash, it also removes the last letter of the word preceding the dash. And it does not remove punctuation
Expected result (string without any punctuation but intra-word dash): 'compactified on a calabi-yau threefold'
R uses TRE (POSIX) or PCRE regex engine depending on the perl option (or function used). Python uses a modified, much poorer Perl-like version as re library. Python does not support POSIX character classes, as [:alnum:] that matches alpha (letters) and num (digits).
In Python, [:alnum:] can be replaced with [^\W_] (or ASCII only [a-zA-Z0-9]) and the negated [^[:alnum:]] - with [\W_] ([^a-zA-Z0-9] ASCII only version).
The [^[:alnum:]['-] matches any 1 symbol other than alphanumeric (letter or digit), [, ', or -. That means the R question you refer to does not provide a correct answer.
You can use the following solution:
import re
p = re.compile(r"(\b[-']\b)|[\W_]")
test_str = "No -  d'Ante compactified on a calabi-yau threefold @ ,."
result = p.sub(lambda m: (m.group(1) if m.group(1) else " "), test_str)
print(result)
The (\b[-']\b)|[\W_] regex matches and captures intraword - and ' and we restore them in the re.sub by checking if the capture group matched and re-inserting it with m.group(1), and the rest (all non-word characters and underscores) are just replaced with a space. 
If you want to remove sequences of non-word characters with one space, use
p = re.compile(r"(\b[-']\b)|[\W_]+") 
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With