
Java Slug Creation to Python

Tags:

java

python

Problem

My goal is to emulate the Java slug creation function (makeSlug below) in Python. However, the combination of Java's Normalizer together with the regex pattern is giving me a headache.

My solution using the unidecode module in Python works fine in most cases, but not always, as highlighted below (e.g. the German letter ß causes problems).

Code

Java Code

import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.regex.Pattern;


public class Example {
  private static final Pattern NONLATIN = Pattern.compile("[^\\w-]");

  public static String makeSlug(String input)
  {
      String normalized = Normalizer.normalize(input, Form.NFD);
      String noNonlatinNormalized = NONLATIN.matcher(normalized).replaceAll("");
      return noNonlatinNormalized;
  }

  public static void main(String[] args)
  {
    String testString = "thiß-täst";   
    String slug = makeSlug(testString);
    String noNormalize = NONLATIN.matcher(testString).replaceAll("");
    System.out.println(String.format("Start string \t'%s'", testString));
    System.out.println(String.format("Slug creation \t'%s'", slug));
    System.out.println(String.format("Without normalize \t'%s'", noNormalize));
  }
}

Java Output

# Start string  'thiß-täst'
# Slug creation     'thi-tast'
# Without normalize     'thi-tst'

Python Code

import regex
import unidecode


NONLATIN = regex.compile("[^[:ascii:]-]")    # works better (i.e. closer to Java) than [^\w-]

def make_slug(string: str) -> str:
    unidecoded = unidecode.unidecode(string)
    no_nonlatin_unidecoded = NONLATIN.sub("", unidecoded)
    return no_nonlatin_unidecoded


if __name__ == "__main__":
    test_string = "thiß-täst"
    slug = make_slug(test_string)
    no_unidecode = NONLATIN.sub("", test_string)
    print("Start string \t'%s'" % test_string)
    print("Slug creation \t'%s'" % slug)
    print("Without unidecode \t'%s'" % no_unidecode)

Python Output

# Start string    'thiß-täst'        # Same start string
# Slug creation   'thiss-tast'       # PROBLEM -> unidecode turns "ß" to "ss"
# Without unidecode       'thi-tst'  # Regex Java-to-Python translation is OK

Notes

What is more, the behaviour of Java's Normalizer.normalize is peculiar:

  • One can check that Normalizer.normalize("thiß-täst", Form.NFD) returns "thiß-täst"
  • NONLATIN.matcher(normalized).replaceAll("") returns "thi-tast" (as returned by makeSlug)
  • NONLATIN.matcher("thiß-täst").replaceAll("") returns "thi-tst" (as shown in the Java Output)

This shows that Normalizer.normalize clearly has an impact even though it seems as if it left the string untouched.
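The effect is easier to see by inspecting code points rather than the printed string. The following is a minimal Python sketch, assuming that Python's unicodedata.normalize("NFD", ...) follows the same Unicode NFD definition as Java's Normalizer; the variable names are purely illustrative:

```python
import unicodedata

# The test string written with escapes so the input is unambiguously precomposed:
# \u00df is "ß", \u00e4 is "ä" as a single code point.
original = "thi\u00df-t\u00e4st"
decomposed = unicodedata.normalize("NFD", original)

# The two strings render identically but are not equal.
print(original == decomposed)
print(len(original), len(decomposed))

# Listing the code points shows "ä" split into "a" plus a combining mark,
# while "ß" is left as it is (it has no canonical decomposition).
for ch in decomposed:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The decomposed string is one code point longer: the "ä" becomes "a" followed by U+0308 (COMBINING DIAERESIS). That combining mark is what the regex removes, which is why the "a" survives only in the normalized case.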

On the other hand, Python's unidecode.unidecode turns "thiß-täst" into "thiss-tast". Turning ä into a is not problematic since the same is done in Java eventually. However, turning ß into ss causes problems.

PS: I would prefer to avoid quick fixes of the form string.replace("ß", "") - my goal is to adhere to the Java behaviour as closely as possible.

user101 asked May 19 '26 12:05


1 Answer

The module unicodedata might be of interest here:

import regex
import unicodedata


NONLATIN = regex.compile("[^[:ascii:]-]")

def make_slug(string: str) -> str:
    normalized = unicodedata.normalize("NFD", string)
    slug = NONLATIN.sub("", normalized)
    return slug


if __name__ == "__main__":
    test_string = "thiß-täst"
    slug = make_slug(test_string)
    print("Start string \t'%s'" % test_string, "Slug creation \t'%s'" % slug, sep="\n")

# Start string    'thiß-täst'
# Slug creation   'thi-tast'

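I think it is fair to assume that Python's unicodedata.normalize("NFD", string) is analogous to Java's Normalizer.normalize(string, Form.NFD), since both implement the NFD form defined by the Unicode standard (it certainly comes closer than unidecode.unidecode(string)). NFD splits "ä" into a plain "a" followed by a combining diaeresis (U+0308), which the regex then strips; "ß" has no decomposition, so it is removed entirely.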
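One remaining difference is the pattern itself: Java's \w matches only [a-zA-Z_0-9] unless Pattern.UNICODE_CHARACTER_CLASS is set, so the Java [^\w-] also removes ASCII spaces and punctuation, whereas [^[:ascii:]-] keeps them. If closer parity is wanted, a possible variant (a sketch, not a verified drop-in replacement) uses the standard-library re module with the re.ASCII flag. The name NON_WORD is made up for this example:

```python
import re
import unicodedata

# re.ASCII restricts \w to [a-zA-Z0-9_], which is what Java's \w matches
# when Pattern.UNICODE_CHARACTER_CLASS is not set (as in the question's code).
NON_WORD = re.compile(r"[^\w-]", re.ASCII)


def make_slug(string: str) -> str:
    # Decompose first so accented letters keep their base letter.
    normalized = unicodedata.normalize("NFD", string)
    # Drop everything that is not an ASCII word character or a hyphen.
    return NON_WORD.sub("", normalized)


if __name__ == "__main__":
    print(make_slug("thi\u00df-t\u00e4st"))  # thi-tast
    print(make_slug("Hello, World!"))        # HelloWorld
```

This also removes the dependency on the third-party regex module. Inputs beyond the ones shown here have not been compared against the Java output, so it is worth checking a few more strings on both sides.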

niko answered May 22 '26 02:05


