I have an NSFW class that scans text such as item names and descriptions against a list of known NSFW words.
What would be the best approach to test a list of strings like
val nsfw = listOf(
"badword",
"curseword",
"ass",
... 200+ more
)
against a string like:
This is the text that contains a badword // returns true
Please note that I need to check for full words, not parts of words.
So the sentence:
The grass is green // returns false
because grass is not a bad word.
I've tried something like this, but it doesn't check for full words:
val result = nsfw.filter { it in sentence.toLowerCase() }
You may build a regex like
\b(?:word1|word2|word3...)\b
Then use it with the Regex.containsMatchIn method:
val nsfw = listOf(
"badword",
"curseword",
"ass"
)
val s1 = "This is the text that contains a badword"
val s2 = "The grass is green"
val rx = Regex("\\b(?:${nsfw.joinToString(separator="|")})\\b")
println(rx.containsMatchIn(s1)) // => true
println(rx.containsMatchIn(s2)) // => false
Here, nsfw.joinToString(separator="|") joins the words with a pipe (the alternation operator), and "\\b(?:${nsfw.joinToString(separator="|")})\\b" creates the correct regex.
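The steps above can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the names buildNsfwRegex and containsNsfw are hypothetical, and RegexOption.IGNORE_CASE is added on the assumption that matching should be case-insensitive, as the lowercasing in the question suggests.

```kotlin
// Hypothetical helper: builds one whole-word regex from the word list.
// Assumes every entry consists only of letters, digits, or underscores.
fun buildNsfwRegex(words: List<String>): Regex =
    Regex("\\b(?:${words.joinToString(separator = "|")})\\b", RegexOption.IGNORE_CASE)

// Hypothetical helper: true if the text contains any listed word as a full word.
fun containsNsfw(text: String, rx: Regex): Boolean = rx.containsMatchIn(text)

fun main() {
    val rx = buildNsfwRegex(listOf("badword", "curseword", "ass"))
    println(containsNsfw("This text contains a BadWord", rx)) // true
    println(containsNsfw("The grass is green", rx))           // false
}
```

Building the regex once and reusing it for every item name or description avoids recompiling the pattern on each check.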
If your words may contain special regex metacharacters, like +, ?, (, ), etc., you need to "preprocess" the nsfw values with the Regex.escape method:
// The added part is .map{Regex.escape(it)}, which escapes each word before joining
val rx = Regex("\\b(?:${nsfw.map{Regex.escape(it)}.joinToString("|")})\\b")
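To illustrate why escaping matters, here is a small sketch. The list entry "b.d" is made up for the example, and wholeWordRegex is a hypothetical helper rather than anything from the original answer.

```kotlin
// Hypothetical helper: builds the whole-word regex, optionally escaping each entry.
fun wholeWordRegex(words: List<String>, escape: Boolean): Regex {
    val alternation = words.joinToString("|") { if (escape) Regex.escape(it) else it }
    return Regex("\\b(?:$alternation)\\b")
}

fun main() {
    // "b.d" is a made-up entry containing the regex metacharacter '.'
    val words = listOf("b.d")

    // Without escaping, '.' matches any character, so "bad" is flagged.
    val unescaped = wholeWordRegex(words, escape = false)
    println(unescaped.containsMatchIn("a bad day")) // true (false positive)

    // With Regex.escape, the entry only matches the literal text "b.d".
    val escaped = wholeWordRegex(words, escape = true)
    println(escaped.containsMatchIn("a bad day")) // false
    println(escaped.containsMatchIn("a b.d day")) // true
}
```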
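One more thing: if the keywords may start or end with characters other than letters, digits, and underscores, you cannot rely on \b word boundaries. You may use whitespace boundaries, which require whitespace or the start/end of the string on both sides of the match:
val rx = Regex("(?<!\\S)(?:${nsfw.map{Regex.escape(it)}.joinToString("|")})(?!\\S)")
or unambiguous word boundaries, which only require that no word character is adjacent to the match:
val rx = Regex("(?<!\\w)(?:${nsfw.map{Regex.escape(it)}.joinToString("|")})(?!\\w)")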
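The difference between the three boundary styles can be seen in a short sketch. The entry "c++" and the helper names below are hypothetical and chosen only to illustrate a keyword that ends with non-word characters.

```kotlin
// Escapes each entry and joins the entries with the alternation operator.
fun alternation(words: List<String>): String =
    words.joinToString("|") { Regex.escape(it) }

// Hypothetical helpers, one per boundary style discussed above.
fun withWordBoundaries(alt: String) = Regex("\\b(?:$alt)\\b")
fun withWhitespaceBoundaries(alt: String) = Regex("(?<!\\S)(?:$alt)(?!\\S)")
fun withNonWordBoundaries(alt: String) = Regex("(?<!\\w)(?:$alt)(?!\\w)")

fun main() {
    // "c++" is a made-up list entry that ends with non-word characters.
    val alt = alternation(listOf("c++"))

    // \b needs a word character on exactly one side, so it fails between '+' and ' '.
    println(withWordBoundaries(alt).containsMatchIn("I like c++ a lot"))       // false
    println(withWhitespaceBoundaries(alt).containsMatchIn("I like c++ a lot")) // true

    // A trailing comma defeats the whitespace version but not the \w version.
    println(withWhitespaceBoundaries(alt).containsMatchIn("I like c++, too"))  // false
    println(withNonWordBoundaries(alt).containsMatchIn("I like c++, too"))     // true
}
```

Which variant fits depends on whether punctuation next to a keyword should still count as a match.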
You can use split() on the string that you want to check, with whitespace as the delimiter, to create a list of its words. This does not guarantee that every word is extracted correctly, since there may be other word separators such as dots or commas. If that suits you, do this:
val nsfw = listOf(
"badword",
"curseword",
"ass"
)
val str = "This is the text that contains a badword"
val words = str.lowercase().split("\\s+".toRegex()) // lowercase() replaces the deprecated toLowerCase() (Kotlin 1.5+)
val containsBadWords = words.any { it in nsfw }
println(containsBadWords)
will print
true
If you want a list of the "bad words":
val badWords = words.filter { it in nsfw }
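If punctuation such as dots or commas is a concern, one possible variation is to split on runs of non-word characters instead of whitespace. The sketch below is an assumption-laden illustration, not part of the original answer: findBadWords is a hypothetical name, the list is held in a Set for faster lookups, and lowercase() assumes Kotlin 1.5 or later.

```kotlin
// Hypothetical helper: returns the listed words found in the text as whole tokens.
fun findBadWords(text: String, nsfw: Set<String>): List<String> =
    text.lowercase()
        .split(Regex("\\W+"))  // split on runs of non-word characters
        .filter { it in nsfw } // empty tokens are never in the set

fun main() {
    val nsfw = setOf("badword", "curseword", "ass")
    println(findBadWords("Well, that is a badword.", nsfw)) // [badword]
    println(findBadWords("The grass is green", nsfw))       // []
}
```

Note that \W is ASCII-based by default on the JVM, so non-ASCII letters would be treated as separators here.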