My goal is to emulate the Java slug creation function (makeSlug below) in Python. However, the combination of Java's Normalizer together with the regex pattern is giving me a headache.
My solution using the unidecode module in Python works fine in most cases, but not always, as shown below (e.g. the German letter ß causes problems).
Java Code
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;

public class Example {
    private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");

    public static String makeSlug(String input)
    {
        String normalized = Normalizer.normalize(input, Form.NFD);
        String noNonlatinNormalized = NONLATIN.matcher(normalized).replaceAll("");
        return noNonlatinNormalized;
    }

    public static void main(String[] args)
    {
        String testString = "thiß-täst";
        String slug = makeSlug(testString);
        String noNormalize = NONLATIN.matcher(testString).replaceAll("");
        System.out.println(String.format("Start string \t'%s'", testString));
        System.out.println(String.format("Slug creation \t'%s'", slug));
        System.out.println(String.format("Without normalize \t'%s'", noNormalize));
    }
}
Java Output
# Start string 'thiß-täst'
# Slug creation 'thi-tast'
# Without normalize 'thi-tst'
Python Code
import regex
import unidecode

NONLATIN = regex.compile("[^[:ascii:]-]")  # works better (i.e. closer to Java) than [^\w-]

def make_slug(string: str) -> str:
    unidecoded = unidecode.unidecode(string)
    no_nonlatin_unidecoded = NONLATIN.sub("", unidecoded)
    return no_nonlatin_unidecoded

if __name__ == "__main__":
    test_string = "thiß-täst"
    slug = make_slug(test_string)
    no_unidecode = NONLATIN.sub("", test_string)
    print("Start string \t'%s'" % test_string)
    print("Slug creation \t'%s'" % slug)
    print("Without unidecode \t'%s'" % no_unidecode)
Python Output
# Start string 'thiß-täst' # Same start string
# Slug creation 'thiss-tast' # PROBLEM -> unidecode turns "ß" to "ss"
# Without unidecode 'thi-tst' # Regex Java-to-Python translation is OK
What is more, the behaviour of Java's Normalizer.normalize is peculiar:
- Normalizer.normalize("thiß-täst", Form.NFD) returns "thiß-täst"
- NONLATIN.matcher(normalized).replaceAll("") returns "thi-tast" (as returned by makeSlug)
- NONLATIN.matcher("thiß-täst").replaceAll("") returns "thi-tst" (as shown in the Java Output)

This shows that Normalizer.normalize clearly has an impact, even though it appears to leave the string untouched.
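A likely explanation for this, sketched in Python with the standard unicodedata module: NFD splits the precomposed ä (U+00E4) into a plain a followed by a combining diaeresis (U+0308). The two forms render identically, so the normalized string only looks unchanged. ß has no canonical decomposition and stays as it is. The helper name describe is hypothetical and only there for illustration.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' entry per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# "thiß-täst" written with escapes so the ä is guaranteed to be precomposed.
original = "thi\u00df-t\u00e4st"
normalized = unicodedata.normalize("NFD", original)

# The strings render the same but differ in length and content.
print(len(original), len(normalized))
print(original == normalized)

# Listing the code points exposes the extra combining mark after the "a".
for entry in describe(normalized):
    print(entry)
```

Once the combining mark is a separate code point, a pattern that removes every non-ASCII character strips the mark and the ß but keeps the base letter a, which would account for "thi-tast".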
On the other hand, Python's unidecode.unidecode turns "thiß-täst" to thiss-tast. Turning ä to a is not problematic since the same is done in Java eventually. However, turning ß to ss causes problems.
PS: I would prefer to avoid quick fixes of the form string.replace("ß", ""), since my goal is to adhere to the Java behaviour as closely as possible.
The module unicodedata might be of interest here:
import regex
import unicodedata

NONLATIN = regex.compile("[^[:ascii:]-]")

def make_slug(string: str) -> str:
    normalized = unicodedata.normalize("NFD", string)
    slug = NONLATIN.sub("", normalized)
    return slug

if __name__ == "__main__":
    test_string = "thiß-täst"
    slug = make_slug(test_string)
    print("Start string \t'%s'" % test_string, "Slug creation \t'%s'" % slug, sep="\n")
# Start string 'thiß-täst'
# Slug creation 'thi-tast'
I think it is fair to assume that Python's unicodedata.normalize("NFD", string) is fairly analogous to Java's Normalizer.normalize(string, Form.NFD) (or at least comes closer than unidecode.unidecode(string)).
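If the aim is to stay as close to the Java pattern as possible, it may also be worth translating [^\w-] directly instead of using [^[:ascii:]-]. The sketch below uses only the standard library and assumes that Java's \w is ASCII-only here, which holds when the pattern is compiled without Pattern.UNICODE_CHARACTER_CLASS, as in the question. The function name make_slug_java_like is hypothetical.

```python
import re
import unicodedata

# Assumption: Java's \w means [a-zA-Z_0-9] because the Java pattern is compiled
# without Pattern.UNICODE_CHARACTER_CLASS. re.ASCII gives Python's \w that meaning.
NONLATIN = re.compile(r"[^\w-]", re.ASCII)


def make_slug_java_like(string: str) -> str:
    """Decompose with NFD, then drop everything except ASCII word chars and '-'."""
    normalized = unicodedata.normalize("NFD", string)
    return NONLATIN.sub("", normalized)


if __name__ == "__main__":
    # Escapes stand for "thiß-täst" and "hello, wörld!" with precomposed letters.
    print(make_slug_java_like("thi\u00df-t\u00e4st"))   # thi-tast
    print(make_slug_java_like("hello, w\u00f6rld!"))    # helloworld
```

The difference only shows up for ASCII characters that are not word characters: this version removes spaces and punctuation the way the Java pattern does, whereas a pattern that strips only non-ASCII characters would leave them in place.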