
Kotlin check for words in string

I have a NSFW class that scans texts like item names and descriptions against a list of known NSFW-words.

What would be the best approach to test a list of strings like

    val nsfw = listOf(
    "badword",
    "curseword",
    "ass",
    ... 200+ more
    )

against a string like:

This is the text that contains a badword // returns true

Please note that I need to check for full words, not parts of words.

So the sentence:

The grass is green // returns false

Because grass is not a bad word.

I've tried something like this, but it doesn't check for full words:

    val result = nsfw.filter { it in sentence.toLowerCase() }

sn0ep asked Oct 16 '25 04:10


2 Answers

You may build a regex like

\b(?:word1|word2|word3...)\b

Then use it with the Regex.containsMatchIn method:

val nsfw = listOf(
    "badword",
    "curseword",
    "ass"
)
val s1 = "This is the text that contains a badword"
val s2 = "The grass is green"
val rx = Regex("\\b(?:${nsfw.joinToString(separator="|")})\\b")
println(rx.containsMatchIn(s1)) // => true
println(rx.containsMatchIn(s2)) // => false

Here, nsfw.joinToString(separator="|") joins the words with a pipe (the alternation operator), and the surrounding "\\b(?:...)\\b" wraps that alternation in a non-capturing group between two word boundaries, so only whole words match.
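
The question lower-cases the sentence before checking it, and it also wants the matching words back. Both could be done with the same pattern. The following is a minimal sketch, assuming case-insensitive matching is wanted; findNsfwWords is a hypothetical helper name, not part of the answer above:

```kotlin
val nsfw = listOf("badword", "curseword", "ass")

// IGNORE_CASE makes the match case-insensitive, so the input
// does not have to be lower-cased before matching.
val rx = Regex("\\b(?:${nsfw.joinToString("|")})\\b", RegexOption.IGNORE_CASE)

// Hypothetical helper: returns every listed word found in the sentence,
// lower-cased so the result can be compared against the list.
fun findNsfwWords(sentence: String): List<String> =
    rx.findAll(sentence).map { it.value.lowercase() }.toList()

fun main() {
    println(findNsfwWords("This text contains a BadWord and an ass")) // [badword, ass]
    println(findNsfwWords("The grass is green"))                      // []
}
```

Building the Regex once and reusing it avoids recompiling a 200+ word alternation for every string that is checked.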

If your words may contain special regex metacharacters, like +, ?, (, ), etc., you need to "preprocess" the nsfw values with the Regex.escape method:

val rx = Regex("\\b(?:${nsfw.map{Regex.escape(it)}.joinToString("|")})\\b")
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^

One more thing: if the keywords may start or end with characters other than letters, digits, and underscores, you cannot rely on \b word boundaries. Instead, you may:

  • Use whitespace boundaries: val rx = Regex("(?<!\\S)(?:${nsfw.map{Regex.escape(it)}.joinToString("|")})(?!\\S)")
  • Use unambiguous word boundaries: val rx = Regex("(?<!\\w)(?:${nsfw.map{Regex.escape(it)}.joinToString("|")})(?!\\w)")
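
The difference between the three boundary styles is easiest to see side by side. The sketch below assumes a made-up keyword, "@ss", that starts with a non-word character; the variable names are illustrative only:

```kotlin
// Hypothetical keyword list; "@ss" starts with a non-word character.
val keywords = listOf("badword", "@ss")

// Escape each keyword, then join them into one alternation.
val alternation = keywords.joinToString("|") { Regex.escape(it) }

val wordBoundary = Regex("\\b(?:$alternation)\\b")
val whitespaceBoundary = Regex("(?<!\\S)(?:$alternation)(?!\\S)")
val unambiguousBoundary = Regex("(?<!\\w)(?:$alternation)(?!\\w)")

fun main() {
    val text = "what an @ss, really"
    // \b needs a word character on one side; a space followed by '@' has none.
    println(wordBoundary.containsMatchIn(text))        // false
    // The trailing comma is not whitespace, so (?!\S) rejects the match.
    println(whitespaceBoundary.containsMatchIn(text))  // false
    // Only requires that no word character touches the keyword.
    println(unambiguousBoundary.containsMatchIn(text)) // true
    // Still rejects the keyword when it sits inside a longer word.
    println(unambiguousBoundary.containsMatchIn("first cl@ss seat")) // false
}
```

Whitespace boundaries are the stricter choice, since punctuation next to a keyword blocks the match; the (?<!\w)/(?!\w) form tolerates punctuation but still refuses matches inside longer words.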

Wiktor Stribiżew answered Oct 17 '25 18:10


You can call split() on the string you want to check, using whitespace as the delimiter, to get a list of its words. This does not guarantee that every word is extracted cleanly, because other separators such as periods or commas stay attached to the neighboring word. If that suits you, do this:

val nsfw = listOf(
    "badword",
    "curseword",
    "ass"
)

val str = "This is the text that contains a badword"
val words = str.lowercase().split("\\s+".toRegex())
val containsBadWords = words.any { it in nsfw }
println(containsBadWords)

will print

true

If you want a list of the "bad words":

val badWords = words.filter { it in nsfw }
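
The punctuation caveat mentioned above could be handled by splitting on runs of non-word characters instead of whitespace. This is a minimal sketch, assuming ASCII letters, digits, and underscores are the only word characters that matter; nsfwSet and badWordsIn are hypothetical names, not from the answer:

```kotlin
// A set gives constant-time lookups, which helps with 200+ words.
val nsfwSet = setOf("badword", "curseword", "ass")

// Hypothetical helper: split on runs of non-word characters so that
// commas, periods, etc. do not stay attached to the words.
fun badWordsIn(text: String): List<String> =
    text.lowercase()
        .split(Regex("\\W+"))
        .filter { it in nsfwSet }

fun main() {
    println(badWordsIn("Such a badword, really."))  // [badword]
    println(badWordsIn("The grass is green"))       // []
}
```

Any empty strings that split() produces at the edges are dropped by the filter, because the empty string is not in the set.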

forpas answered Oct 17 '25 17:10


