Detect Stolen and Duplicate Tweets with Solr

A new “duplicate detection” feature has been implemented for jetwick, and it seems to work pretty well thanks to the great performance of Solr.

To try it, go to this tweet and click on the ‘Find Similar’/‘Guttenberg’ button below the tweet to investigate existing duplicates. With that feature jetwick can skip spam, identify different accounts of the same user, and skip tweets with a wrong retweet or attribution.

It also reveals ‘stolen’ tweets, i.e. tweets posted without attribution or without knowing the original tweet. Another possibility is that all tweeters drew on a common outside source, e.g. a newspaper. Thanks to pannous for pointing this out.

Examples of ‘stolen’ or duplicated tweets:

So this is an example of one user with two Twitter accounts: both tweets were sent from the same Twitter client and posted at the same time.

The following German example looks more like ‘stolen’ tweets:

http://twitter.com/#!/Newsteam_Berlin/status/17881387294003200

and a lot more: ste_pos, Kleines79, …

The oldest tweet, and therefore the original, is:

As you can see, the tweets do not have to be exactly the same string to be detected.
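One common way to match tweets that are not character-for-character identical is to compare their sets of terms. The sketch below is not jetwick's actual Solr query; it is only a minimal illustration, assuming a simple tokenizer that drops URLs, @mentions and the `rt` marker, and a Jaccard threshold of 0.7 picked for the example. The function names are made up for this post.

```python
import re

def terms(text):
    """Return the set of lowercase word terms of a tweet.

    URLs, @mentions and the 'rt' marker are dropped (an assumption of this
    sketch) so that retweets and re-posted links still compare as similar.
    """
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return {t for t in re.findall(r"\w+", text) if t != "rt"}

def jaccard(a, b):
    """Jaccard similarity of two tweets: shared terms / all terms."""
    ta, tb = terms(a), terms(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def is_near_duplicate(a, b, threshold=0.7):
    """Treat two tweets as duplicates if their term overlap is high enough.

    The threshold is an illustrative value, not one taken from jetwick.
    """
    return jaccard(a, b) >= threshold
```

With this, a tweet and its copy with a different link or a prepended `RT @user:` are flagged as duplicates, while unrelated tweets are not. A real index would let Solr find the candidate tweets first instead of comparing every pair.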

Detecting duplicated tweets could be interesting for everyone who wants to attribute a tweet to the ‘correct’ person, because the ‘stolen’ tweet is often more popular (has more retweets) than the original, especially when it comes from an account with many followers.

But it is also useful for “tweet readers” like jetwick, to avoid Twitter noise and reading the same content twice.
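For a tweet reader the idea can be sketched like this: group the tweets that count as duplicates and show only the oldest tweet of each group, regardless of which copy collected the most retweets. The `Tweet` class, `duplicate_key` and `originals` are hypothetical names for this illustration, and grouping on an identical set of terms is a deliberate simplification standing in for the similarity search that jetwick does with Solr.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Tweet:
    user: str
    text: str
    created_at: datetime
    retweets: int = 0

def duplicate_key(text):
    """Stand-in for a real similarity search: the set of lowercase terms."""
    return frozenset(re.findall(r"\w+", text.lower()))

def originals(tweets):
    """Return the oldest tweet of every duplicate group, in input order of the groups."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[duplicate_key(tweet.text)].append(tweet)
    # The oldest tweet of a group is treated as the original,
    # even if a later copy has far more retweets.
    return [min(group, key=lambda t: t.created_at) for group in groups.values()]

# Made-up data: the popular copy was posted a day after the original.
timeline = [
    Tweet("big_account", "Duplicate detection with Solr!", datetime(2010, 12, 24, 12, 0), 600),
    Tweet("small_account", "duplicate detection with solr", datetime(2010, 12, 23, 13, 0), 2),
    Tweet("other_account", "Something else entirely", datetime(2010, 12, 24, 9, 0)),
]
shown = originals(timeline)
```

Here `shown` contains the tweet from `small_account` and the one from `other_account`; the more popular copy from `big_account` is skipped.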

Update: This seems to be the first tweet about Santa and WikiLeaks:

suliz has only 67 followers. Now take the tweet from ihackinjosh, who has over 6000 followers: it has over 600 retweets, although suliz tweeted nearly one day earlier. That is life!

3 thoughts on “Detect Stolen and Duplicate Tweets with Solr”

  1. This is a nice tool.
    If only there were something for blogs as well. Lots of people steal blog posts and pass them off as their own work without attribution to the original source.

    Nice blog BTW 🙂

  2. Thanks. Doing this for blogs would mean indexing a lot of blogs, like every major search engine does … maybe you can ask the guy from duckduckgo.com if he’ll implement a similar feature 🙂
