Detecting Retweets using computationally inexpensive Python hashing algorithms

Question

In order to be able to detect RT of a particular tweet, I plan to store hashes of each formatted tweet in the database.

What hashing algorithm should I use. Cryptic is of course not essential. Just a minimal way of storing a data as something which can then be compared if it is the same, in an efficient way.

My first attempt at this was by using md5 hashes. But I figured there can be hashing algorithms that are much more efficient, as security is not required.

Chris Upchurch · Accepted Answer

Do you really need to hash at all? Twitter messages are short enough (and disk space cheap enough) that it may be better to just store the whole message, rather than eating up clock cycles to hash it.

pchap10k · Answer

I am not familiar with Python (sorry, Ruby guy typing here) however you could try a few things.

Assumptions: You will likely be storing hundreds of thousands of Tweets over time, so comparing one hash against "every record" in the table will be inefficient. Also, RTs are not always carbon copies of the original tweet. After all, the original author's name is usually included and takes up some of the 140 character limit. So perhaps you could use a solution that matches more accurately than a "dumb" hash?

Tagging & Indexing

Tag and index the component parts of the message in a standard way. This could include treating hashed #...., at-marked @.... and URL strings as "tags". After removing noise words and punctuation, you could also treat the remaining words as tags too.
Fast Searching

Databases are terrible at finding multiple group membership very quickly (I'll assume your using either Mysql or Postgresql, which are terrible at this). Instead try one of the free text engines like Sphinx Search. They are very very fast at resolving multiple group membership (i.e. checking if keywords are present).

Using Sphinx or similar, we search on all of the "tags" we extracted. This will probably return a smallish result set of "potential original Tweets". Then compare them one by one using similarity matching algorithm (here is one in Python http://code.google.com/p/pylevenshtein/)

Now let me warmly welcome you to the world of text mining.

Good luck!

Detecting Retweets using computationally inexpensive Python hashing algorithms

Tags:

python

hash

md5

twitter

lprsd

2 Answers

Chris Upchurch

pchap10k

Recent Activity

Donate For Us

Detecting Retweets using computationally inexpensive Python hashing algorithms

Tags:

python

hash

md5

twitter

lprsd

2 Answers

Chris Upchurch

pchap10k

Related questions

Recent Activity

Donate For Us