Jeff Atwood recently tweeted a link to a CodeReview post where he wanted to know if the community could improve his "calculating entropy of a string" code snippet. He explained, "We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality."
The gist of his method seemed to be that if you count the number of unique characters in a string, that signifies entropy (code taken from PieterG's answer):
int uniqueCharacterCount = string.Distinct().Count();
I don't understand how the unique character count signifies entropy of a string, and how the entropy of a string signifies low quality. I was wondering if someone with more knowledge in this area could explain what Mr. Atwood is trying to accomplish.
Thanks!
Entropy of a language is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in a language. When compressing the text, the letters of the text must be translated into binary digits 0 or 1.
The entropy of letters in the English language is 4.11 bits 12] (which is less than log226 = 4:7 bits).
It is a measure of the information content of the string, and can be interpreted as the number of bits required to encode each character of the string given perfect compression. The entropy is maximal when each character is equally likely.
To compute Entropy the frequency of occurrence of each character must be found out. The probability of occurrence of each character can therefore be found out by dividing each character frequency value by the length of the string message.
The confusion seems to be from the idea that this is used to block posts from being posted - it's not.
It is just one of several algorithms used to find possible low-quality posts, displayed on the low quality posts tab (requires 10k rep) of the moderator tools. Actual humans still need to look at the post.
The idea is to catch posts like ~~~~~~No.~~~~~~ or FUUUUUUUU------, not to catch all low-quality posts.
As for "How does the unique character-count signify entropy?" - it doesn't, really. The most upvoted answers completely miss the point.
See https://codereview.stackexchange.com/questions/868#878 and https://codereview.stackexchange.com/questions/868#926
String 'aaaaaaaaaaaaaaaaaaaaaaaaaaa' has very low entropy, and is rather meaningless.
String 'blah blah blah blah blah blah blah blah' has a bit higher entropy, but is still rather silly and can be a part of an attack.
A post or a comment that has entropy comparable to these strings is probably not appropriate; it can't contain any meaningful message, even a spam link. Such a post can be just filtered out or warrant an additional captcha.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With