Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Detecting Retweets using computationally inexpensive Python hashing algorithms

In order to be able to detect RT of a particular tweet, I plan to store hashes of each formatted tweet in the database.

What hashing algorithm should I use. Cryptic is of course not essential. Just a minimal way of storing a data as something which can then be compared if it is the same, in an efficient way.

My first attempt at this was by using md5 hashes. But I figured there can be hashing algorithms that are much more efficient, as security is not required.

like image 745
lprsd Avatar asked May 24 '26 23:05

lprsd


2 Answers

Do you really need to hash at all? Twitter messages are short enough (and disk space cheap enough) that it may be better to just store the whole message, rather than eating up clock cycles to hash it.

like image 67
Chris Upchurch Avatar answered May 27 '26 11:05

Chris Upchurch


I am not familiar with Python (sorry, Ruby guy typing here) however you could try a few things.

Assumptions: You will likely be storing hundreds of thousands of Tweets over time, so comparing one hash against "every record" in the table will be inefficient. Also, RTs are not always carbon copies of the original tweet. After all, the original author's name is usually included and takes up some of the 140 character limit. So perhaps you could use a solution that matches more accurately than a "dumb" hash?

  1. Tagging & Indexing

    Tag and index the component parts of the message in a standard way. This could include treating hashed #...., at-marked @.... and URL strings as "tags". After removing noise words and punctuation, you could also treat the remaining words as tags too.

  2. Fast Searching

    Databases are terrible at finding multiple group membership very quickly (I'll assume your using either Mysql or Postgresql, which are terrible at this). Instead try one of the free text engines like Sphinx Search. They are very very fast at resolving multiple group membership (i.e. checking if keywords are present).

    Using Sphinx or similar, we search on all of the "tags" we extracted. This will probably return a smallish result set of "potential original Tweets". Then compare them one by one using similarity matching algorithm (here is one in Python http://code.google.com/p/pylevenshtein/)

Now let me warmly welcome you to the world of text mining.

Good luck!

like image 22
pchap10k Avatar answered May 27 '26 13:05

pchap10k



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!