Twitter API and Me

I have a love-hate relationship with Twitter. As a user I see the benefits of Twitter when looking at it without the spam, duplicates and senseless tweets, e.g. through jetwick. But as a developer I find the Twitter API very ‘heuristic’ and hand-waving in a lot of areas, which makes it complicated to use. I would have been lost without the nice twitter4j project, so thanks to the author!

Now let me give you some examples of

Strange things of the Twitter API

  • The since id attribute is not supported when paginating in the search API:
    “The since_id parameter will be removed from the next_page element as it is not supported for pagination. If since_id is removed a warning will be added to alert you.”
    So you need to create your own pagination when you do not want to get already visited tweets via the search API.
  • The search API returns matches in URLs. In nearly all cases this is not useful, especially for terms like ‘twitter’ or ‘google’, where the search API returns confusing tweets containing URLs such as search.twitter.com or google.com. Marketing companies need to search URLs and the tweet button also relies on that ‘feature’, but why not disable it and enable ‘link:http://any-link.here’ instead? It would also be more useful to match against the title of the website, as jetwick does, but that’s another topic.
  • The search API does NOT return complete results compared to the streaming API. I.e. results from the streaming API contain all tweets with the specified keywords (without the tweets matched via the URL bug I mentioned in the previous point). The search API, in contrast, can leave out ‘spam’ tweets. I’m unsure whether those tweets have to be really low quality or whatever. I guess this is more a technical issue with the search API, in that it leaves out some tweets the streaming API has.
  • REST API allows one to get only ~3200 old tweets from one user and 800 tweets from your friends (i.e. your homeline).
  • A huge number of different API limits:
    • 350 requests per hour and user for the REST API
    • Searches are restricted per IP (the limit is an unknown number, much higher than the 350 requests per hour)
    • Only 2 filter streams are allowed – this is restricted per IP. And only 200 keywords are possible per stream! But filter streams allow only approx. 50 tweets/s, even if only a few keywords are used (in that case those keywords are high-frequency ones).
    • The search API allows searches into history, but how far back depends on the frequency of the term. I know this is logical for every real-time inverted index of this size, but it should be better documented.
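
To illustrate the ‘own pagination’ from the first point, here is a minimal sketch in Python. It is not how jetwick or twitter4j do it. The callable fetch_page(query, page) is a hypothetical stand-in for whatever client performs the real search request, and I assume it returns tweets newest first as dicts with an "id" key. The max_pages default is an arbitrary assumption too.

```python
from typing import Any, Callable, Dict, List

Tweet = Dict[str, Any]


def collect_new_tweets(
    fetch_page: Callable[[str, int], List[Tweet]],
    query: str,
    since_id: int,
    max_pages: int = 15,  # arbitrary cap, not an official limit
) -> List[Tweet]:
    """Page through search results and keep only tweets newer than since_id.

    fetch_page is a hypothetical callable (query, page) -> tweets, newest
    first; it stands in for the real search API call.
    """
    new_tweets: List[Tweet] = []
    for page in range(1, max_pages + 1):
        tweets = fetch_page(query, page)
        if not tweets:
            break  # no more results for this query
        for tweet in tweets:
            if int(tweet["id"]) <= since_id:
                # We reached tweets we have already visited: stop paging.
                return new_tweets
            new_tweets.append(tweet)
    return new_tweets
```

The point of the sketch is only that the client has to remember the highest id it has seen and stop paging by itself, because since_id is not carried over to the next page.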

Regarding API Terms

Of course Twitter has API terms. This is necessary and nice to protect users from spam sites etc.

But there is also a display style guideline, with which I had ‘fun’ last weekend: I was asked, e.g., to make the hashtag links of jetwick conform to the display guideline. This is annoying. Now I need to pop up a dialog instead of directly triggering a search on jetwick – hey, it is a search engine! But Twitter has to make money. That is ok. But I would like to have an exception for free or open source projects. No chance 😦 … here is my email conversation regarding the minor API term violation:

Dear XY,
ok, I won't provide an API to others. Thanks for the clarification.

I've got a further question. Are the display guidelines a requirement to
be aligned with the API terms of use and to continue running Jetwick? (I
shutted it down to not being evil)

In the terms I can read as the first principle: "Don't surprise users"
which is very important for me and it would disturb the user experience
if a hashtag click (or a click on '@user') in a tweet would result in a
pop up to twitter search or something and not simply trigger a search on
jetwick.

Please do not understand me wrong, I have already several links back to
twitter: the date links to the tweet on twitter, the retweet and reply
links to twitter and finally the user links back to twitter. Jetwick is
a complete read only service (see my API access), so I would be stupid
if I hadn't links back to twitter, which actually allows my users to
share noisefree information via twitter.

Finally: If the layout guides are a requirement, would you make an
exception for Jetwick regarding the hashtag and @user links within a
tweet? Many companies make exceptions when it comes to open source
projects such as Jetbrains (IDEA), Yourkit (Profiler), Attlassian
(Confluence), ... what about Twitter?

Kind Regards,
Peter.

The answer from Twitter makes it crystal clear that Twitter does not grant API term exceptions to open source projects like other companies do. It also indicates that the API guys have a bit too much to do, as the support does not really answer my question and understands neither what github is nor what jetwick means:

Hey Peter,

Thanks for following up. The API Terms of Service, as an overriding
document, do require you to adhere to these display guidelines -- in the
same "Don't Surprise Users" section you referenced. I recommend
adding links of your own, such as "#github on Jetwick" that surface
these results. Again, I'm sorry for the inconvenience this has caused,
and let me know if you have any other questions.

Regards,
XY

A second important thing

you’ll otherwise miss is that you are not allowed to offer an API to other people. Even if your project is open source! Here is the email:
“Returning Twitter data, like tweets, through an API of your own is not allowed, neither for commercial services nor independent or open-source services. We are not looking for partners to formally extend new APIs as you request.”

Conclusion

So, keep all this in mind when you start to build a system using or even relying on the Twitter API. I hope this post clarifies the mysteries of the Twitter API a bit! If you have encountered similar issues: feel free to comment 🙂 !

Twitter Search Tools and more. #Archive #FriendSearch #Trends

There is an overwhelming number of tools for twitter: url shorteners like bit.ly, web clients like hootsuite.com, but today I would like to show you twitter tools which are good for searching tweets, let you archive them, display trends and more. I picked tools which could be useful when you are looking only for relevant information without noise. Let me know your favourites for getting news out of Twitter!

For the Twitter search tools I always made a quick test to get a feeling for how much noise these tools can filter away, and only two tools out of a dozen – see below – made it easy to find the following news one day later:

The tools are

  1. Jetwick – Free to use (Caution: I’m the developer)
  2. Research.ly – Freemium (Caution: it required a bit of selecting the appropriate days to get the news)

Twazzup and What The Trend also showed the news I was looking for, but the news wasn’t in their displayed tweets – they also display blogs and google news in a separate widget :). To my surprise the security news didn’t pop up in SocialMention and Bing Social Search, but other news important to java developers did. So, give them a shot. The problem with the other search engines was that they mostly cover real-time tweets only and do not provide a lot of useful filters, and some of the search tools simply were not designed for this task.

I’ve split the tools into the following subgroups:

  1. Searching & Archiving
  2. Searching
    • The Giants under the Twitter Searches
  3. Archiving
  4. Cool Tools

1 Searching & Archiving

The Archivist

  • Archive a search; Trending URLs; Top Users
  • Alexa rank: 13k
  • Login required to archive, Free to use

Tweet Nest

Jetwick

  • Open Twitter Search without Noise; Sort against retweets; Lets you show only relevant tweets since the last login; Archiving and Searching of any users’ tweets; Search friends only; Filters for duplicate reduction, language or a distinct day
    More Features …
  • Open Source which makes it suitable to do your own research on twitter
  • Alexa rank: 400k
  • Login optional, Free to use

2 Searching

Topsy

IceRocket

  • Searches Blogs, Twitter, MySpace, News, Images, …
  • Alexa rank: 5k
  • Without Login, Free to use

SocialMention

  • Searches Blogs, Twitter, … Every Search shows Top Keywords and Users
  • Alexa rank: 7k
  • Without Login, Free to use

WeFollow

Twellow

  • Twitter directory (‘Twitter yellow pages’)
  • Alexa rank: 9k
  • Login optional, Free to use

Trendistic / HashTags

What The Trend

StateOfSearch

PubSub

Twazzup

Research.ly

  • Search ‘historic’ tweets; Search local business; Map the relationships between you and other users
  • Created from the
  • Alexa rank: 88k
  • Login required, Freemium

Twips

TweetScan

Searchtastic

TwitterLocal

  • Search local business
  • Alexa rank: 331k
  • Does not work at the moment?? As of Feb 2011

SnapBird

TwimeMachine

Tweetzi

TweeFind

  • Twitter Search which shows related search
  • Alexa rank: 1600k
  • Login optional, Free to use

Twippr

  • Search within your friends’ tweets
    More Features?
  • Alexa rank: 1800k
  • Login required, Free to use

Sparrw

The Giants under the Twitter Searches

3 Archiving

BackTweets

Twapper Keeper

TweetBackup

Tweetake

BackupMyTweets

4 Cool Tools

FavStar

  • Search Twitter Users; Up vote users (not tweets)
  • Alexa rank: 6k
  • Login optional, Freemium

Tweepi

TweetStats

  • Trends; Stats for your account
  • Alexa rank: 21k
  • Without login, Free to use

Twitaholic ( TwitterCounter )

  • Most popular users and twitter stats
  • Alexa rank: 23k
  • Login optional, Free to use

What the HashTag?

  • User-editable encyclopedia for hashtags found on Twitter – this way you can find the meaning of a hashtag – Similar to find origin of jetwick
  • Alexa rank: 45k
  • Login optional, Free to use

TweetBeep

Twitturly

Mixero

The Cadmus

TweetMeme

Twitter Power Search

  • Multi Widget View; Determining Trends; filter Audio/Video
  • Alexa rank: 1110k
  • Without login, Free to use

Get more Friends on Twitter with Jetwick

Obviously you won’t need a tool to get more friends aka ‘following’ on twitter, but you’ll add more friends once you have tried our new feature called ‘friend search’. But let me start from the beginning, with our recent, major technology shift for jetwick – our open source twitter search.

We have now moved the search server to ElasticSearch (from Solr) – more on that in a later post. This move will hopefully solve some data freshness problems and also make more tweets available in jetwick. All features should work as before.

To make it even more pleasant for you, my fellow user, I additionally implemented the friend search with all the jetwicked features: sort by retweets, filter by language, filter away spam and duplicates …

Update: you’ll need to install jetwick

Try it on your own

  1. First log in to jetwick. You need to wait ~2 minutes until all your friends are detected …
  2. Then type your query or leave it empty to get all tweets
  3. Finally select ‘Only Friends’ as the user filter.
  4. Now you are able to search all the tweets of the people you follow.
  5. Be sure to click on ‘without duplicates’ (on the left) etc. as appropriate

Here are the friend search results when querying ‘java’:

And now the normal search where the users rubenlirio and tech_it_jobs are not from my ‘friends’:

You don’t need to stay online all the time – jetwick automagically grabs the tweets of your friends for you. And if you use the relaxed saved searches, then you’ll also be notified of all changes in your homeline – even after days!

That was the missing puzzle piece for me to be able to stay away from twitter and PC – no need to check every hour for the latest tweets in my homeline or my “twitter – saved searches”.

Jetwick is free, so you’ll only need to log in and try it! As a side effect of being logged in, your own tweets will be archived and you can search them through the established user search. With that user search you can use twitter as a bookmark application, as I’m already doing … BTW: you’ll notice the orange tweet, which is a free ad: just create a tweet containing #jetwick and it’ll be shown on top of matching searches.

Another improvement is (hopefully) the user interface for jetwick, the search form should be more clear:

before it was:

and the blue logo should now look a bit better, before it was:

What do you think?

A world without RSS would not hurt

This post is a quick reply to the post RSS Is Dying, and You Should Be Very Worried.

I really like RSS readers. I have a lot of subscriptions to blogs. But I don’t think that RSS (Really Simple Syndication) has a future, simply because there will be at least one service similar to twitter. People simply like it more “to follow *people*” and not websites!

Twitter is very powerful. You can create your own RSS-like reader from twitter like I did with jetwick. I mean, what do you want with an RSS reader? You want news, right? Personalized news? With RSS you’ll get news from blogs. With twitter you get personalized news from people as I said before or if you are interested in some topics you can get personalized news via search terms too! That’s very powerful: to get news about topics. That’s what you want, I guess!?

Now there are some good statements in the post on which I want to comment.

  • “IF RSS DIES, WE LOSE THE ABILITY TO READ IN PRIVATE”
    First: why would you use chrome if you want to read in private??? 🙂
    Second: Yes, privacy is a problem with twitter and facebook. But there will be tools like jetwick where you can “silently” follow people if you want. So this is not an issue for me. Of course, you’ll need to host your own version of jetwick to really have no privacy issues 😉
  • “The ability for a website operator to be in control of what he advertise to his users … “
    I don’t understand this. A website operator will always have options to advertise … and with an RSS reader it is even more likely that the reader skips advertisements. Or what did you mean?
  • “If every website on the web has to have a Facebook account in order to exist in practical terms, the web is dead—competition is dead”
    I don’t think that that will be the case. Browsers will add the ability to post to popular services soon. Similar to the RSS icon in the late 20XX 😉
  • “The ability for us to aggregate, mash-up and interpret news without having to go through a closed API that may change on a whim, or disagree with our particular usage”
    “A developer should not have to be fluent in Twitter, Facebook and a million different private APIs just to aggregate content from different websites you read”
    Valid arguments. But you are more likely to get the latest news via twitter rather than with your static set of blogs. In the end there will be something like a big realtime rss like API without the problems you describe 😉 But yes, this is a big argument. To be tied to an external service is bad.

It is also very powerful that with twitter-like web services

  1. even people without a blog can share valuable information (e.g. links)
  2. a lot of people are already there. As a user I find RSS very complicated: you have to search for blogs you are interested in. And because of the massive user base of twitter you don’t need to mash up things, IMHO.

BTW: why do you use the RSS reader of the browser? I’m using Liferea

BTW2: you can follow me at twitter – and see not only what I’m posting here 😉

Detect Stolen and Duplicate Tweets with Solr

A new feature, “duplicate detection”, is implemented for jetwick and seems to work pretty well thanks to the great performance of Solr.

To try it, go to this tweet and click on the ‘Find Similar’/‘Guttenberg’ button below the tweet to investigate existing duplicates. With that feature it is possible for jetwick to skip spam, identify different accounts of the same user and skip tweets with wrong retweet markers or attribution, but also to see stolen tweets, i.e. when users tweet without attribution or without knowing the original tweet. It can also reveal that all tweeters had a common, different source, e.g. a newspaper. Thanks to pannous for pointing this out.

Examples for ‘stolen’ or duplicated tweets:

So this is an example of a user using two twitter accounts, because the tweets were sent from the same twitter client and were posted at identical times.

The following German example looks more like ‘stolen’ tweets:

http://twitter.com/#!/Newsteam_Berlin/status/17881387294003200

and a lot more: ste_pos, Kleines79, …

the oldest tweet and therefore the original is:

As you can see it is not necessary for the successful detection that the tweets have exactly the same string.

Detecting duplicated tweets could be interesting for all people wanting to give the ‘correct’ guy his attribution, because it is often the case that not the original tweet but the ‘stolen’ tweet is more popular (has more retweets). This is especially true for accounts with many followers.
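
To make the idea a bit more concrete, here is a small Python sketch of ‘find the similar tweets and take the oldest one as the original’. Jetwick does this with Solr; the sketch does not. It swaps in difflib.SequenceMatcher from the Python standard library as the similarity measure, and the normalization rules, the Tweet class and the 0.8 threshold are all assumptions of mine for illustration.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from difflib import SequenceMatcher
from typing import List


@dataclass
class Tweet:
    user: str
    text: str
    created_at: datetime


def normalize(text: str) -> str:
    """Lower-case and strip URLs, 'RT', @mentions and punctuation."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"\brt\b|@\w+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def similar(a: Tweet, b: Tweet, threshold: float = 0.8) -> bool:
    """True if the normalized texts are close enough (threshold is a guess)."""
    ratio = SequenceMatcher(None, normalize(a.text), normalize(b.text)).ratio()
    return ratio >= threshold


def find_original(target: Tweet, candidates: List[Tweet],
                  threshold: float = 0.8) -> Tweet:
    """Return the oldest tweet among target and its near-duplicates."""
    group = [target] + [c for c in candidates if similar(target, c, threshold)]
    return min(group, key=lambda t: t.created_at)
```

Because the texts are normalized first, a retweet-style copy with a different short URL still ends up in the same group as the original, and the earliest timestamp decides who gets the attribution.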

But it is also useful for “tweet readers” like jetwick to avoid twitter noise and reading the same content twice.

Update: This seems to be the first tweet about santa and wikileaks:

suliz has only 67 followers .. now take that tweet from ihackinjosh with over 6000 followers. This tweet has over 600 retweets, although suliz has tweeted nearly one day earlier. That is life!

Hootsuite gets a Challenger

We are pleased to announce that starting from today you can save searches.

Go to jetwick.com. Login. Do a search. Save it with the rss icon:

BTW: we’ve redefined RSS to “relaxed saved searches” 😉

Hmmh, ok. Nothing new, you might think. Twitter allows you to do the same: saving searches. And e.g. Hootsuite is really good at storing several searches too. I must admit that jetwick isn’t a big challenger, because it isn’t as proven and well tested, doesn’t have a solid user base etc.

But: neither twitter nor hootsuite allow you to sort the tweets by number of retweets or filter them by language, right?

With jetwick you now can save your favourite searches and come back after days. Read only tweets that are important.

One example

If I had to follow the twitter search for ‘java’ I would get hundreds of tweets per hour. So, logically, I cannot read them all and to be honest: I never read them. It gets even worse when I do not visit twitter for days (yeah, that’s really possible). But I also do not want to miss the most important tweets. And that’s where jetwick comes into play, which allows you to do exactly that: define a search important to you, sort or filter by retweets or language, and then stay informed even after days. The saved search displays the number of new tweets, counted since the last search, in gray brackets. It updates this count at a regular interval if you leave the browser window open.

Jetwick even allows you to search an account and filter it by a keyword. This is especially useful for an account with lots of tweets like heiseonline:

And what’s best about all this: jetwick is free software. Get the Java sources or join the team to make jetwick even more exciting and useful!
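
As a rough illustration of what such a saved search boils down to, here is a tiny Python sketch. It is not jetwick’s code (jetwick is written in Java and queries a search index); the Tweet and SavedSearch classes and all their fields are hypothetical and work on an in-memory list only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Tweet:
    text: str
    created_at: datetime
    retweets: int
    lang: str


@dataclass
class SavedSearch:
    term: str
    last_visit: datetime
    lang: Optional[str] = None  # None means: any language
    min_retweets: int = 0

    def new_tweets(self, tweets: List[Tweet]) -> List[Tweet]:
        """Matching tweets since the last visit, most retweeted first."""
        hits = [
            t for t in tweets
            if t.created_at > self.last_visit
            and self.term.lower() in t.text.lower()
            and t.retweets >= self.min_retweets
            and (self.lang is None or t.lang == self.lang)
        ]
        return sorted(hits, key=lambda t: t.retweets, reverse=True)
```

The number shown in brackets next to a saved search would then simply be len(search.new_tweets(tweets)), recomputed at a regular interval.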

Jetwick Twitter Search is now free software! Wicket and Solr pearls for Developers.

Today we released Jetwick under the Apache 2 license.

Why we made Jetwick free software

I would like to start with an image I made some years ago for TimeFinder:

This is one reason and we are very interested in your contributions as patches or bug reports.

But there are some more interesting opportunities when releasing jetwick as open source:

  • Open architecture: several jetwick-hosters could provide parts of the twitter index and maybe at some day we can have a way to freely explore all tweets at twitter in the jetwicked way
  • Personalized jetwick: every user has different interests. If you only feed in tweets from searches for terms that you are interested in, plus your own timeline, then you will be able to search this personalized twitter, sort by retweets, see personal URL trends, etc.
    This way you’ll be informed faster, more widely and in a more personalized way than with an ordinary rss feed or the ordinary twitter timeline, without reading a lot of unrelated content. If jetwick stayed closed, this task would be too resource intensive, and it might even be impossible to convince every user.

In our further development we will concentrate on the second point, because then jetwick will have at least one user.

Explore Jetwick

Why should you install Jetwick and try it out?

First you can look at the features and see if something interesting – for you as a user – is shown.

For developers the following things (and more) could be worth investigating:

Jetwick can be used as a simple showcase how to use wicket and solr:

  1. show the use of facets and facet queries
  2. show date facets as explained here
  3. get autocompletion working
  4. instant results when you select a suggestion from autocompletion as explained in this video
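
To give an idea of what facets, facet queries and date facets look like on the wire, here is a small Python sketch that only builds a Solr select URL. The facet.* parameter names are the standard Solr ones (facet.date is the date faceting of the Solr versions from that time). The base URL and the field names lang, retweets and created_at are assumptions for illustration and not necessarily jetwick’s schema.

```python
from urllib.parse import urlencode


def build_facet_query(
    query: str,
    base_url: str = "http://localhost:8983/solr/select",  # assumed local Solr
) -> str:
    """Build a Solr select URL with field, query and date facets."""
    params = [
        ("q", query),
        ("rows", "10"),
        ("facet", "true"),
        # Field facet: one count per language (hypothetical field name).
        ("facet.field", "lang"),
        # Facet query: count of hits with at least 10 retweets.
        ("facet.query", "retweets:[10 TO *]"),
        # Date facet: one bucket per day for the last week.
        ("facet.date", "created_at"),
        ("facet.date.start", "NOW/DAY-7DAYS"),
        ("facet.date.end", "NOW/DAY+1DAY"),
        ("facet.date.gap", "+1DAY"),
    ]
    return base_url + "?" + urlencode(params)
```

Calling build_facet_query("java") gives a URL you can paste into a browser against a running Solr to look at the facet_counts section of the response.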

If you are programming some twitter stuff you should keep an eye on the following points:

  • spam detection as explained in this post
  • how you can use OAuth with twitter4j and wicket
  • transform tweets with clickable users, links and hashtags
  • translate tweets with google translate
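
The ‘clickable users, links and hashtags’ point can be sketched in a few lines of Python. This is not the code from jetwick (which is Java); the regular expression is deliberately simple, and the search_url prefix for hashtag links is a hypothetical parameter.

```python
import html
import re

# One combined pattern, so that '#' or '@' inside a URL is never linked twice.
TOKEN_RE = re.compile(
    r'(?P<url>https?://[^\s<"]+)'
    r"|(?<!\w)@(?P<user>\w+)"
    r"|(?<!\w)#(?P<tag>\w+)"
)


def linkify(text: str, search_url: str = "/search?q=") -> str:
    """Turn URLs, @users and #hashtags of a tweet into HTML links."""

    def replace(match: "re.Match[str]") -> str:
        if match.group("url"):
            url = match.group("url")
            return f'<a href="{url}">{url}</a>'
        if match.group("user"):
            user = match.group("user")
            return f'<a href="https://twitter.com/{user}">@{user}</a>'
        tag = match.group("tag")
        # %23 is the URL-encoded '#'; search_url is a hypothetical prefix.
        return f'<a href="{search_url}%23{tag}">#{tag}</a>'

    # Escape &, < and > first so the tweet text cannot inject markup.
    return TOKEN_RE.sub(replace, html.escape(text, quote=False))
```

Doing everything in a single pass is a design choice: with three separate replacements the hashtag rule would also fire on the fragment part of an already linked URL.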

If you are new to wicket

  • Jetwick is configured so that if you run ‘mvn jetty:run’ you can update html and code and hit refresh in the browser 1-2 seconds later to see the updated results. CSS changes are picked up immediately
  • query wikipedia and show the results in a lazyload panel

Some solr gems:

  • simple near realtime set up with solr. And that although we make heavy use of facets, where a lot of autowarming is required.
  • if you re-enable the user search you can use twitter’s person suggestions on your own data. I’m relatively sure that twitter uses the ‘more like this’ feature of lucene that jetwick had implemented with solr.

fluid database and the PermGenSpace checker

  • fluid update of your database from hibernate changes (via liquibase). Hint: at the moment we only use the db for the ytags
  • a simple helper method ‘getPermGenSpace’ to check if reload is possible

The following interesting issues are still open:

  • using a storage for tweets (mysql or redis or …). This will increase the index time dramatically, because we had to switch to a pure solr solution (we had problems with h2 and hibernate for over 4 million tweets)
  • a mature queue like 0mq and protobuf or something else
  • a ‘real’ realtime solution for solr, if we use solr from trunk

Search any Twitter Account

There are a lot of services offering the same, even offering archiving but they often need registration.

Now with jetwick it is easy to search any account you like. E.g. try my account. To do the same for your account go to Jetwick, click ‘login’, allow jetwick access to your account (you can revoke it at any time and we won’t misuse it or even post tweets etc.) and then click “grab tweets”. Then you will see something like

After this procedure you can search the whole history of the tweets easily.

Why would you want to grab tweets from other users? With that you can easily see, on the right side, which topics a user tweets about. Again see “Words related to your query” for my account:

PS: Jetwick is now free software … you can host your own and play around!

Algorithm against Twitter Spam

In jetwick we only want to show relevant tweets for a search. No noise, no spam.

The first problem, noise, is solvable when the user sorts by retweets, filters by a specific criterion of his choice or refines his search by adding more specific terms.

But how can we get rid of spam at twitter? First, what is spam at twitter?

Several years ago Paul Graham gave a nice definition of email spam: ‘unsolicited and automated’. With this definition we can identify the following situations for twitter spam:

  1. unsolicited tweets which appear in your timeline (e.g. the new ads or even retweets of your followers could be spam too ;-))
  2. unsolicited tweets in your searches not relevant to the search (e.g. spammers simply add hashtags from the trending topics to increase popularity of their tweets)
  3. unsolicited automated tweets which mention you (and not only you …)
  4. unsolicited direct messages
  5. even fast following and unfollowing can be spam, because most users have enabled a notification
  6. a very cool spamming technique: some spammers add a mini advertisement to a nice statement about you or your product. If you don’t read the tweet carefully or follow the links, this could (mis)lead you to retweet it and advertise indirectly for them.

With the following algorithm we can try to solve point 2 and 3.

  1. For a new tweet T get the user U
  2. For U get (some) additional tweets and store to a list L
  3. Go through L and compare the content with T.
  4. Use the Jaccard index for comparison (additionally compare URL and title of the linked webpage)
  5. If the Jaccard index is too high or the URLs are identical then decrease the quality. Repeat from step 3 if there are still tweets in L, otherwise go to step 6.
  6. Mark T as spam if quality is under a certain quality limit
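
The steps above can be sketched in Python as follows. This is only a sketch under assumptions, not the jetwick implementation: the tokenization, the threshold of 0.7, the start quality of 100, the penalty of 25 and the quality limit of 50 are all made-up values, and the comparison of the linked page titles from step 4 is left out.

```python
import re
from typing import Iterable, Set

URL_RE = re.compile(r"https?://\S+")


def words(text: str) -> Set[str]:
    """Lower-cased word set of a tweet, with URLs removed."""
    return set(re.findall(r"\w+", URL_RE.sub(" ", text.lower())))


def urls(text: str) -> Set[str]:
    """All URLs contained in a tweet."""
    return set(URL_RE.findall(text))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|, defined as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def quality(tweet: str, other_tweets: Iterable[str],
            max_jaccard: float = 0.7, start: int = 100,
            penalty: int = 25) -> int:
    """Steps 3 to 5: lower the quality for every too-similar tweet in L."""
    score = start
    tweet_words, tweet_urls = words(tweet), urls(tweet)
    for other in other_tweets:
        too_similar = jaccard(tweet_words, words(other)) >= max_jaccard
        same_url = bool(tweet_urls & urls(other))
        if too_similar or same_url:
            score -= penalty
    return score


def is_spam(tweet: str, other_tweets: Iterable[str],
            quality_limit: int = 50) -> bool:
    """Step 6: mark T as spam if its quality is under the limit."""
    return quality(tweet, other_tweets) < quality_limit
```

Here other_tweets plays the role of the list L from step 2, i.e. some additional tweets of the same user U.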

I applied this algorithm on the data of jetwick and grabbed the twitter users with a lot of spammy tweets (the number in brackets) for the last week:

careerfan (968) -> bot
teamnapalm (587) -> spam
endy_pink (481) -> spam
manypro (312) -> bot
i_want_napalm (294) -> spam
gutlazaro (216) -> spam (canceled from twitter)
appstoreadam (210) -> bot+spam
livralivro (207) -> bot
sigaajesus (195) -> spammy
thakiddunncase (195) -> ups, no spam
lauberte_ (167) -> spam
2dvdlsnorjeuvou (158) -> spam
josialemrossi (152) -> spam (canceled from twitter)
malucomunic (139) -> spam (Was this account hacked!!?? Because no one of the followers is a spammer!)

The idea is simple, but the results look promising. There could be a lot of use cases. E.g. twitter clients like hootsuite could add tweet quality to their available filters … the user-specific klout score is not useful, because even less popular tweeters can create great tweets 🙂

Let me know what you think!

Fun and some important Dev-Tweets of the last week, 11th October

Let us start with the fun tweets. Ok, this week there are a lot of Java-bashing tweets, but I like them!

  • maven 3 is out. It now lets you download the internet even faster than before.

  • The world needs to stop hyping “html5” as though it’s markup alone that builds rich web apps. It makes JavaScript angry.

  • “JavaScript is the only language that people feel they dont need to learn before they start using it.” – Crockford

  • Little known fact: JavaScript also has an isNaaN() function for when you aren’t sure if you’re working with Indian food

  • I have seen an app with SQL code in the *views*, looked like a java coder was given a php book and told to make a rails app.

  • Matz on #ruby speed: Build your website in Ruby until you have more traffic than Twitter, then use your riches to hire Java programmers.

  • OH: “Java is just a DSL for turning XML into core dumps.”

  • judging Clojure/Lisp by its parens is like judging Java by its classpath



And last but not least some interesting infos:



Of course this list isn’t complete! So, watch out for more fun and infos at twitter and contact me or comment if you want to add it here or for the next week.