Algorithm against Twitter Spam

In jetwick we only want to show relevant tweets for a search. No noise, no spam.

The first problem, noise, is solvable when the user sorts by retweets, filters by a criterion of their choice, or refines the search by adding more specific terms.

But how can we get rid of spam on Twitter? First, what is spam on Twitter?

Several years ago Paul Graham gave a nice definition of email spam: ‘unsolicited and automated’. With this definition we can identify the following situations for Twitter spam:

  1. unsolicited tweets which appear in your timeline (e.g. the new ads or even retweets of your followers could be spam too ;-))
  2. unsolicited tweets in your search results that are not relevant to the search (e.g. spammers simply add hashtags from the trending topics to increase the popularity of their tweets)
  3. unsolicited automated tweets which mention you (and not only you …)
  4. unsolicited direct messages
  5. even fast following and unfollowing can be spam, because most users have follow notifications enabled
  6. a very cool spamming technique: some spammers add a mini advertisement to a nice statement about you or your product. If you don’t read the tweet carefully or follow the links, this could (mis)lead you to retweet it and indirectly advertise for them.

With the following algorithm we can try to solve points 2 and 3.

  1. For a new tweet T get the user U
  2. For U get (some) additional tweets and store them in a list L
  3. Go through L and compare the content of each tweet with T.
  4. Use the Jaccard index for the comparison (additionally compare the URL and the title of the linked webpage)
  5. If the Jaccard index is too high or the URLs are identical, decrease the quality of T. Repeat from step 3 while there are still tweets in L; otherwise go to step 6.
  6. Mark T as spam if its quality is below a certain limit
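
To make the steps more concrete, here is a minimal sketch in Python. It is not the jetwick code: the tokenizer, the starting quality of 100, the penalty of 20, the similarity threshold of 0.7 and the quality limit of 50 are all assumed values for illustration, and the function names are hypothetical. The comparison of page titles from step 4 is left out, because it would require fetching the linked pages. "Identical URLs" is interpreted here as "the two tweets share at least one URL".

```python
import re

URL_RE = re.compile(r"https?://\S+")

def urls(text):
    """Set of URLs found in a tweet."""
    return set(URL_RE.findall(text))

def tokens(text):
    """Lowercased word set of a tweet without its URLs (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9#@']+", URL_RE.sub(" ", text).lower()))

def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def quality(tweet, other_tweets, max_similarity=0.7, penalty=20, start=100):
    """Steps 3-5: decrease the quality of `tweet` for every tweet in
    `other_tweets` (the list L) that is too similar or shares a URL."""
    q = start
    t_urls, t_tokens = urls(tweet), tokens(tweet)
    for other in other_tweets:
        same_url = bool(t_urls & urls(other))
        too_similar = jaccard(t_tokens, tokens(other)) > max_similarity
        if same_url or too_similar:
            q -= penalty
    return q

def is_spam(tweet, other_tweets, quality_limit=50):
    """Step 6: spam if the quality falls below the (assumed) limit."""
    return quality(tweet, other_tweets) < quality_limit
```

Calling `is_spam(t, l)` with a new tweet `t` and a list `l` of other tweets from the same user then gives the spam decision. The thresholds would need tuning on real data.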

I applied this algorithm to the data of jetwick and collected the Twitter users with the most spammy tweets over the last week (the number in brackets is the count of spammy tweets):

careerfan (968) -> bot
teamnapalm (587) -> spam
endy_pink (481) -> spam
manypro (312) -> bot
i_want_napalm (294) -> spam
gutlazaro (216) -> spam (canceled from twitter)
appstoreadam (210) -> bot+spam
livralivro (207) -> bot
sigaajesus (195) -> spammy
thakiddunncase (195) -> oops, not spam
lauberte_ (167) -> spam
2dvdlsnorjeuvou (158) -> spam
josialemrossi (152) -> spam (canceled from twitter)
malucomunic (139) -> spam (Was this account hacked!!?? Because none of the followers is a spammer!)

The idea is simple, but the results look promising. There could be a lot of use cases. E.g. Twitter clients like HootSuite could add tweet quality to their available filters … the user-specific Klout score is not useful here, because even less popular tweeters can create great tweets 🙂

Let me know what you think!