 

How to compare two paragraphs of text?

I need to remove duplicated paragraphs in a text with many paragraphs.

I use java.security.MessageDigest to calculate each paragraph's MD5 hash value, and then add these hash values to a Set.

If add() returns false, the hash was already in the Set, which means the latest paragraph is a duplicate.
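
A minimal sketch of the approach described above, for illustration only. The class and method names (`ParagraphDeduper`, `md5Hex`, `removeDuplicates`) are hypothetical, and it assumes the paragraphs are already split into a list and encoded as UTF-8. The digest is stored as a hex string because a `byte[]` in a `HashSet` is compared by identity, not by content. Two different paragraphs could in principle share an MD5 value, so a hash match is strong evidence of a duplicate rather than proof.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ParagraphDeduper {

    // Hypothetical helper: MD5 digest of a paragraph as a lowercase hex string.
    public static String md5Hex(String paragraph) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(paragraph.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    // Keeps the first occurrence of each paragraph, in the original order.
    public static List<String> removeDuplicates(List<String> paragraphs) {
        Set<String> seenHashes = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String paragraph : paragraphs) {
            // Set.add() returns true only when the hash was not present yet;
            // a false return marks the paragraph as a duplicate.
            if (seenHashes.add(md5Hex(paragraph))) {
                unique.add(paragraph);
            }
        }
        return unique;
    }
}
```

If false positives from hash collisions are a concern, one option is to confirm a hash match with `String.equals()` before discarding a paragraph.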

Is there any risk in this approach?

Besides String.equals(), is there any other way to do it?

asked Mar 13 '13 10:03 by mojiayi



1 Answer

Before hashing, you could normalize the paragraphs, e.g. by removing punctuation, converting to lower case, and collapsing extra whitespace. After normalization, paragraphs that differ only in those respects get the same hash.
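
The normalization step could be sketched roughly as follows. `ParagraphNormalizer` and `normalize` are hypothetical names, not part of any library. The sketch assumes that ASCII punctuation is all that needs stripping, since `\p{Punct}` matches only ASCII punctuation by default in `java.util.regex`.

```java
import java.util.Locale;

class ParagraphNormalizer {

    // Hypothetical helper: reduces a paragraph to a canonical form before hashing.
    public static String normalize(String paragraph) {
        return paragraph
                .toLowerCase(Locale.ROOT)      // ignore case differences
                .replaceAll("\\p{Punct}", "")  // drop ASCII punctuation
                .replaceAll("\\s+", " ")       // collapse runs of whitespace
                .trim();                       // drop leading/trailing whitespace
    }
}
```

With this sketch, `normalize("Hello,   World!")` and `normalize("hello world")` both yield `hello world`, so hashing the normalized strings would treat the two paragraphs as duplicates. Whether that is desirable depends on how strict the comparison needs to be.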

answered Sep 24 '22 21:09 by Matt



