Introducing Jetslide News Reader

Posted on 12 July, 2011 by karussell

Update: Jetsli.de is no longer online. Check out the projects snacktory and jetwick, which were used in Jetslide.

We are proud to announce the release of our Jetslide News Reader today! We know that there are a lot of services aggregating articles from your twitter timeline, such as the really nice tweetedtimes.com or paper.li. But as a hacker you’ll need a more powerful tool. You’ll need Jetslide. Read on to see why Jetslide is different and read this feature overview. By the way: yesterday we open sourced the content extractor called snacktory.

Jetslide is different …

… because it divides your ‘newspaper’ into easily navigable topics and Jetslide prints articles from your timeline first! So you are following topics and not (only) people. See the first article, which was referenced by a twitter friend and others. But Jetslide also prints articles from the public: see the second article, where the highest share count (187) comes from digg. Click to view the reality of today or browse older content with the links under the articles:

Jetslide is smart …

… enough to skip duplicate articles and enhance your topics with related material. The relevance of every article is determined by an advanced algorithm (number of shares, quality, tweed, your browser language …) with the help of my database ElasticSearch – more on this in a later blog post.

And you can use a lot of geeky search queries to get what you want.

Jetslides are social

As pointed out under ‘Jetslide is different’ you’ll see articles posted in your twitter timeline first. But there are other features which make Jetslide more ‘social’. First, you get suggestions of users who have the same or similar interests stored in their Jetslide. Second, Jetslide enables you to see others’ personal jetslide by adding e.g. the parameter owner=timetabling to the URL.

Jetslides means RSS 3.0

You can even use the boring RSS feed:

http://jetsli.de/rss/owner/timetabling/time/today

But this is less powerful. The recommended way to ‘consume’ your topics is via RSS 3.0 😉

Log in to Jetslide and select “Read Mode:Auto”. Then every time you hit the ‘next’ arrow (or CTRL+right) the currently viewed articles will be marked as read and only newer articles will pop up the next time you slide through. This way you can slide through your topics and come back any time you want: after 2 hours or after 2 days (at the moment up to 7 days). In Auto-Read-Mode you’ll always see only what you have missed and what is relevant!

This is the most important point why we do not call Jetslide a search engine but a news service.

Jetslides are easily shareable

… because a Jetslide is just a URL – viewable on desktops, smartphones and even WAP browsers (left):

ElasticSearch vs. Solr #lucene

Posted on 12 May, 2011 by karussell

I prepared a small presentation of ‘Why one should use ElasticSearch over Solr’ **

There is also a German article available in the iX magazine which introduces you to ElasticSearch and compares Apache Solr and ElasticSearch in several aspects.

The presentation is based on my personal opinion and experience with my twitter search jetwick and my news reader jetslide. It should not be used to show that Solr or ElasticSearch is ‘bad’.

Twitter API and Me

Posted on 7 March, 2011 by karussell

I have a love-hate relationship with Twitter. As a user I see the benefits of Twitter when looking at it without the spam, duplicates and senseless tweets, e.g. through jetwick. But as a developer I find the Twitter API very ‘heuristic’ and handwaving in a lot of areas, which makes it complicated to use. I would have been lost without the nice twitter4j project, so thanks to the author!

Now let me give you some examples of

Strange things of the Twitter API

“The user ids in the Search API are different from those in the REST API”

The since_id attribute is not supported when paginating in the search API:
“The since_id parameter will be removed from the next_page element as it is not supported for pagination. If since_id is removed a warning will be added to alert you.”
So you need to create your own pagination if you do not want to get already visited tweets via the search API.
Search API returns matches in URLs. This is not useful in nearly all cases, especially for terms like ‘twitter’ or ‘google’, where the search API returns confusing tweets containing the URLs search.twitter.com or google.com. Marketing companies need to search URLs and the tweet button also relies on that ‘feature’, but why not disable it by default and enable it via ‘link:http://any-link.here’? And it would be more useful to match against the title of the website as jetwick does, but that’s another topic.
Search API does NOT return complete results compared to the streaming API. I.e. results from the streaming API contain all tweets with the specified keywords (without the tweets caused by the URL bug I mentioned in the previous point). The search API in contrast can leave out ‘spam’ tweets. I’m unsure whether those tweets really have to be of low quality. I guess this is more a technical issue of the search API that makes it leave out some tweets the streaming API has.
REST API allows one to get only ~3200 old tweets from one user and 800 tweets from your friends (i.e. your homeline).
Huge amount of different API limits:
- 350 requests per hour and user for the REST API
- Searches are restricted to IP (unknown number much higher than the 350 requests per hour)
- Only 2 filter streams are allowed – this is restricted to the IP. And only 200 keywords are possible per stream! But filter streams allow only approx. 50 tweets/s even if only a few keywords are used (when those keywords are highly frequent).
- Search API allows searches into history, but how far back depends on the frequency of the term. I know this is logical for every real time inverted index of this size, but it should be better documented.
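
The ‘create your own pagination’ point above can be sketched roughly like this. It is only a minimal sketch under assumptions: SearchPage is a hypothetical stand-in for one search call (via twitter4j or plain HTTP), and it assumes that tweet ids grow over time, so that walking downwards from the newest id and stopping at the last id you have already seen is enough.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: client-side pagination that stops at the newest tweet id already seen. */
class SinceIdPager {

    /** Hypothetical stand-in for one search call: up to pageSize ids <= maxId, newest first. */
    interface SearchPage {
        List<Long> fetch(long maxId, int pageSize);
    }

    /** Collects all tweet ids newer than sinceId, newest first. */
    static List<Long> fetchNewerThan(SearchPage api, long sinceId, int pageSize) {
        List<Long> result = new ArrayList<>();
        long maxId = Long.MAX_VALUE;
        while (true) {
            List<Long> page = api.fetch(maxId, pageSize);
            if (page.isEmpty())
                return result; // no older tweets available
            for (long id : page) {
                if (id <= sinceId)
                    return result; // reached tweets we already visited
                result.add(id);
            }
            // continue just below the oldest id of this page
            maxId = page.get(page.size() - 1) - 1;
        }
    }

    public static void main(String[] args) {
        // Fake in-memory "API" with the ids 1..10, for demonstration only
        SearchPage fake = (maxId, pageSize) -> {
            List<Long> page = new ArrayList<>();
            for (long id = Math.min(maxId, 10); id >= 1 && page.size() < pageSize; id--)
                page.add(id);
            return page;
        };
        System.out.println(fetchNewerThan(fake, 6, 3)); // [10, 9, 8, 7]
    }
}
```

In a real client you would store the highest id of each run and pass it as sinceId the next time.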

Regarding API Terms

Of course Twitter has API terms. This is necessary and nice to protect the users from spam sites etc.

But there is also a display style guideline, with which I had ‘fun’ last weekend: I was asked e.g. to make the hashtag links of jetwick conform to the display guideline. This is annoying. Now I need to pop up a dialog instead of directly triggering a search on jetwick – hey, it is a search engine! But twitter has to make money. That is ok. But I would like to have an exception for free or open source projects. No chance 😦 … here is my email conversation regarding the minor API term violation:

Dear XY,
ok, I won't provide an API to others. Thanks for the clarification.

I've got a further question. Are the display guidelines a requirement to
be aligned with the API terms of use and to continue running Jetwick? (I
shutted it down to not being evil)

In the terms I can read as the first principle: "Don't surprise users"
which is very important for me and it would disturb the user experience
if a hashtag click (or a click on '@user') in a tweet would result in a
pop up to twitter search or something and not simply trigger a search on
jetwick.

Please do not understand me wrong, I have already several links back to
twitter: the date links to the tweet on twitter, the retweet and reply
links to twitter and finally the user links back to twitter. Jetwick is
a complete read only service (see my API access), so I would be stupid
if I hadn't links back to twitter, which actually allows my users to
share noisefree information via twitter.

Finally: If the layout guides are a requirement, would you make an
exception for Jetwick regarding the hashtag and @user links within a
tweet? Many companies make exceptions when it comes to open source
projects such as Jetbrains (IDEA), Yourkit (Profiler), Attlassian
(Confluence), ... what about Twitter?

Kind Regards,
Peter.

The answer from Twitter makes it crystal clear that Twitter does not provide API term exceptions to open source projects like other companies do. It also indicates that the API guys have a bit too much to do, as the support does not really answer my question and understands neither what github is nor what jetwick means:

Hey Peter,

Thanks for following up. The API Terms of Service, as an overriding
document, do require you to adhere to these display guidelines -- in the
same "Don't Surprise Users" section you referenced. I recommend
adding links of your own, such as "#github on Jetwick" that surface
these results. Again, I'm sorry for the inconvenience this has caused,
and let me know if you have any other questions.

Regards,
XY

A second important thing

you’ll otherwise miss is that you are not allowed to offer an API to other people. Even if your project is open source! Here is the email:
“Returning Twitter data, like tweets, through an API of your own is not allowed, neither for commercial services nor independent or open-source services. We are not looking for partners to formally extend new APIs as you request.”

Conclusion

So, keep all this in mind when you start to build a system using or even relying on the Twitter API. I hope this post clarifies the mysteries of the Twitter API a bit! If you have encountered similar issues: feel free to comment 🙂 !

Da Guttenberg Button!

Posted on 25 February, 2011 by karussell

Since Jetwick had to get by without the duplicate detection button for a while for technical reasons, I seized the opportunity yesterday and brought the button back to life in a new light. The button is especially interesting for your own tweets or for tweets with many retweets:

The shocking result: not all Twitter users have quoted Reuters!

Why Jetwick moved from Solr to ElasticSearch

Posted on 7 February, 2011 by karussell

I like both technologies, Solr and ElasticSearch, and a lot of work is going into both. So, let me explain why I chose to migrate from Solr to ElasticSearch (ES).

What is elastic?

ES lets you add and remove nodes [Video] and the requests will be handled from the correct node. Nodes will even do ‘zero config’ discovery.
To scale if the load increases you can use replicas. ElasticSearch will automatically play the load balancer and choose the appropriate node.
ES lets you scale if the amount of data increases, because then you can easily use sharding: it’s just a number in ES (either via API or via configuration).

With those features ES is well prepared for the century of the cloud [Blog]!
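
To illustrate the ‘just a number’ remark: the following is only a sketch that builds the JSON body for creating an index with a given number of shards and replicas. number_of_shards and number_of_replicas are the actual ES index settings; the helper class and the index name ‘tweets’ are made up for this example, and sending the request is left out.

```java
/** Sketch: the two numbers that control sharding and replication of an ES index. */
class IndexSettings {

    /** Builds the JSON body for creating an index with the given shard and replica counts. */
    static String toJson(int shards, int replicas) {
        return "{\"settings\":{\"index\":{"
                + "\"number_of_shards\":" + shards + ","
                + "\"number_of_replicas\":" + replicas + "}}}";
    }

    public static void main(String[] args) {
        // e.g. the body of: PUT http://localhost:9200/tweets ("tweets" is a made-up index name)
        System.out.println(toJson(3, 1));
    }
}
```

Note that the number of shards is fixed when the index is created, while the number of replicas can be changed later on a live index.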

What’s the difference to Solr?

Solr wasn’t designed from the ground up with the ‘cloud’ in mind, but of course you can do sharding, use replication and use multiple cores with Solr. It’s just a bit more complicated.

When using Solr Cloud and ZooKeeper this gets better. You’ll also need to invest some time to make Solr near real time to be comparable with ES. This all seemed to be a bit too tricky to me (in Dec 2010) and I don’t have any time for administration work in my free time e.g. to set up backups/replicas, add shards/indices, …

Other Options?

What are my other options? There is Xapian, Sphinx etc. But only the following two options fulfilled my requirements:

Using Solandra or
Moving from Solr to ElasticSearch

I wanted a lucene based solution and a solution where it works out of the box to shard and create indices. I simply wanted more data from Twitter available in Jetwick.

The first option is very nice, no changes to your code are required – only a minor change in your solrconfig.xml and you will get a distributed and real time Solr! So, I tried Solandra and after a lot of support from Jake (Thanks!) I got it running with Jetwick! But in the end I still had performance issues with my indexing strategy, so I tried – in parallel – the second option.

What are the advantages of ElasticSearch?

To be honest Jetwick doesn’t really need to be elastic – I’m only using the sharding feature at the moment as I don’t own capacity on a cloud. BUT ElasticSearch is also elastic in a different area: ES lets you manage indices very easily! A clever thing in ES is that you don’t define the document structure in an index like you do in Solr – no, you define types and then create documents of a specific type in a specific index. And documents in ES don’t need to be flat – they can be nested as they are pure JSON.

That and the ‘elasticity’ could make ES suitable as a hip NoSql storage 😉

Another advantage over Solr is the near real time behaviour, which you’ll get at no cost when switching to ES.

The Move!

Moving to ElasticSearch with Jetwick wasn’t as easy as I had hoped. Although I’m sure one can do a normal migration in one day with the experience I have now ;). It took a lot of time to understand the new technology and, more importantly, to migrate my UI code, which relied too heavily on constructing a SolrQuery object. In the end I created a custom Solr2ElasticHelper utility to avoid this clumsy work at the beginning. ~~And at some day I will fully migrate even this code.~~ This is now migrated to my own query object which makes it easy for me to add and remove filters etc.

When moving to ElasticSearch be sure that it supports all the Solr features you need. Although Shay works really hard to integrate new features into ES he cannot do all the work alone! E.g. I had to integrate Solr’s WordDelimiterFilter, but this wasn’t that difficult – just copy & paste, plus some configuration.

ES uses netty under the hood – no other webserver is necessary. Just start the node either via API or directly via bin/elasticsearch and then query the node via curl or the browser. For example you can use the nice ElasticSearch Head project:

or ElasticSearch-JS which are equivalents to the Solr admin page. To add a node simply start another ES instance and they will automagically discover each other. You can also use curl on the command line to query and feed the index as documented in the REST API documentation.

No technology is perfect so keep in mind the following disadvantages which will disappear over time in my opinion:

Solr has more Analyzers, Filters, etc., but it is relatively easy to use them in ES as well.
Solr has a larger community, a larger user base and more companies offering professional support.
Solr has better documentation and more books. Regarding the docs of ES: they are moving now to the github wiki and the docs will now improve IMO.
Solr has more tooling e.g. solrmonitor, LucidGaze and newrelic, but you still have yourkit and jvisualvm 😉

But also keep in mind the following points not mentioned so far:

Shay fixes bugs very quickly!
ElasticSearch has a more recent Lucene version and releases more frequently.
It is very easy to contribute via github (just a pull request away ;))

For an introduction to ElasticSearch you can read this article.

Get more Friends on Twitter with Jetwick

Posted on 4 February, 2011 by karussell

Obviously you won’t need a tool to get more friends aka ‘following’ on twitter, but you’ll add more friends once you have tried our new feature called ‘friend search’. But let me start from the beginning of our recent, major technology shift for jetwick – our open source twitter search.

We have now moved the search server forward to ElasticSearch (from Solr) – more on that in a later post. This move will hopefully solve some data freshness problems but also make more tweets available in jetwick. All features should work as before.

To make it even more pleasant for you, my fellow user, I additionally implemented the friend search with all the jetwicked features: sort by retweets, filter by language, filter out spam and duplicates …

Update: you’ll need to install jetwick

Try it on your own

First log in to jetwick. You need to wait ~2 minutes until all your friends are detected …
Then type your query or leave it empty to get all tweets
Finally select ‘Only Friends’ as the user filter.
Now you are able to search all the tweets of the people you follow.
Be sure you clicked on ‘without duplicates’ (on the left) etc. as appropriate

Here are the friend search results when querying ‘java’:

And now the normal search where the users rubenlirio and tech_it_jobs are not from my ‘friends’:

You don’t need to stay online all the time – jetwick automagically grabs the tweets of your friends for you. And if you use the relaxed saved searches, then you’ll also be notified of all changes in your homeline – even after days!

That was the missing puzzle piece for me to be able to stay away from twitter and PC – no need to check every hour for the latest tweets in my homeline or my “twitter – saved searches”.

Jetwick is free so you’ll only need to log in and try it! As a side effect of being logged in: your own tweets will be archived and you can search them through the established user search. With that user search you can use twitter as a bookmark application as I’m already doing … BTW: you’ll notice the orange tweet which is a free ad: just create a tweet containing #jetwick and it’ll be shown on top of matching searches.

Another improvement is (hopefully) the user interface for jetwick; the search form should be clearer:

before it was:

and the blue logo should now look a bit better, before it was:

What do you think?

A world without RSS would not hurt

Posted on 2 January, 2011 by karussell

This post is a quick reply to the post RSS Is Dying, and You Should Be Very Worried.

I really like RSS readers. I have a lot of subscriptions to blogs. But I don’t think that RSS (Really Simple Syndication) has a future, simply because there will be at least one service similar to twitter. People simply like it more “to follow *people*” and not websites!

Twitter is very powerful. You can create your own RSS-like reader from twitter like I did with jetwick. I mean, what do you want from an RSS reader? You want news, right? Personalized news? With RSS you’ll get news from blogs. With twitter you get personalized news from people, as I said before, or if you are interested in some topics you can get personalized news via search terms too! That’s very powerful: to get news about topics. That’s what you want, I guess!?

Now there are some good statements in the post to which I want to add my comments.

“IF RSS DIES, WE LOSE THE ABILITY TO READ IN PRIVATE”
First: why would you use chrome if you want to read in private??? 🙂
Second: Yes, privacy is a problem with twitter and facebook. But there will be tools like jetwick where you can “silently” follow people if you want. So this is not an issue for me. Of course, you’ll need to host your own version of jetwick to really have no privacy issues 😉
“The ability for a website operator to be in control of what he advertise to his users … “
I don’t understand this. A website operator will always have options to advertise … and with an RSS reader it is even more likely that the reader skips advertisements. Or what did you mean?
“If every website on the web has to have a Facebook account in order to exist in practical terms, the web is dead—competition is dead“
I don’t think that that will be the case. Browsers will add the ability to post to popular services soon. Similar to the RSS icon in the late 20XX 😉
“The ability for us to aggregate, mash-up and interpret news without having to go through a closed API that may change on a whim, or disagree with our particular usage”
“A developer should not have to be fluent in Twitter, Facebook and a million different private APIs just to aggregate content from different websites you read”
Valid arguments. But you are more likely to get the latest news via twitter rather than with your static set of blogs. In the end there will be something like a big realtime rss like API without the problems you describe 😉 But yes, this is a big argument. To be tied to an external service is bad.

It is also very powerful that with twitter-like web services

even people without a blog can share valuable information (e.g. links)
a lot of people are already there. As a user RSS is very complicated. You have to search for blogs you are interested in. And because of the massive user base of twitter you don’t need to mash up things IMHO.

BTW: why do you use the RSS reader of the browser? I’m using lifera …

BTW2: you can follow me at twitter – and see not only what I’m posting here 😉

Detect Stolen and Duplicate Tweets with Solr

Posted on 23 December, 2010 by karussell

A new feature “duplication detection” is implemented for jetwick and seems to work pretty well thanks to the great performance of Solr.

To try it, go to this tweet and click on the ‘Find Similar’/’Guttenberg’ button below the tweet to investigate existing duplicates. With that feature it is possible for jetwick to skip spam, identify different accounts of the same user and skip tweets with a wrong retweet or attribution, but also to see stolen tweets, i.e. when users tweet without attribution or without knowing the original tweet. Or it can show that all tweeters had a common, different source, e.g. a newspaper. Thanks to pannous for pointing this out.

Examples for ‘stolen’ or duplicated tweets:

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/eeKJWw

— Janell albert (@janellalbert74) December 23, 2010

World Cup hero Donovan files for divorce from actress wife: World Cup hero Landon Donovan has filed for divorce … http://bit.ly/gGfT8R

— Ervin Myers (@jaemas) December 23, 2010

So this is an example of a user using two twitter accounts, because the tweets were sent with the same twitter client and were posted at identical times.

The following German example looks more like ‘stolen’ tweets:

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

— Klaus Redegeld (@muc4u) December 23, 2010

http://twitter.com/#!/Newsteam_Berlin/status/17881387294003200

and a lot more: ste_pos, Kleines79, …

the oldest tweet and therefore the original is:

Lufthansa sagt "Fahrt Bahn". Bahn sagt "Fahrt Auto". ADAC sagt "Fahrt morgen". Morgen sagen alle:"Wären Sie mal gestern gefahren"

— schlenzalot (@schlenzalot) December 23, 2010

As you can see it is not necessary for the successful detection that the tweets have exactly the same string.
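
To give an idea of how tweets can be matched without being identical strings, here is a minimal sketch. Be aware that this is NOT the Solr-based detection used in jetwick; it swaps in a plain Jaccard overlap of the terms of two tweets, ignoring URLs. The class name, the helper methods and the threshold of 0.8 are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Sketch: near-duplicate detection via Jaccard overlap of terms (not jetwick's Solr code). */
class TweetSimilarity {

    /** Lower-cases the text, drops URLs and returns the set of remaining word terms. */
    static Set<String> terms(String tweet) {
        String cleaned = tweet.toLowerCase(Locale.ROOT).replaceAll("https?://\\S+", " ");
        Set<String> set = new HashSet<>(Arrays.asList(cleaned.split("[^\\p{L}\\p{N}]+")));
        set.remove(""); // split can yield an empty leading token
        return set;
    }

    /** Jaccard index: size of the intersection divided by size of the union, in 0..1. */
    static double jaccard(String a, String b) {
        Set<String> ta = terms(a), tb = terms(b);
        if (ta.isEmpty() && tb.isEmpty())
            return 1.0;
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** The threshold 0.8 is an arbitrary assumption for this sketch. */
    static boolean isDuplicate(String a, String b) {
        return jaccard(a, b) >= 0.8;
    }

    public static void main(String[] args) {
        String t1 = "World Cup hero Donovan files for divorce ... http://bit.ly/eeKJWw";
        String t2 = "World Cup hero Donovan files for divorce ... http://bit.ly/gGfT8R";
        System.out.println(isDuplicate(t1, t2)); // true, the tweets differ only in the short URL
    }
}
```

With this measure the two ‘World Cup’ tweets above count as duplicates although their short URLs differ, while unrelated tweets share only a few common words and stay far below the threshold.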

Detecting duplicated tweets could be interesting for everyone who wants to give the ‘correct’ guy his attribution, because it is often the case that not the original tweet but the ‘stolen’ tweet is more popular (has more retweets). This is especially true for accounts with many followers.

But it is also useful for “tweet readers” like jetwick to avoid twitter noise and reading the same content twice.

Update: This seems to be the first tweet about santa and wikileaks:

Dear kids, there is no Santa. Those presents are from your parents. Love, Wikileaks

— Suliman Azzouni (@Suliz) December 22, 2010

suliz has only 67 followers .. now take that tweet from ihackinjosh with over 6000 followers. This tweet has over 600 retweets, although suliz has tweeted nearly one day earlier. That is life!

Hootsuite gets a Challenger

Posted on 9 December, 2010 by karussell

We are pleased to announce that starting from today you can save searches.

Go to jetwick.com. Login. Do a search. Save it with the rss icon:

BTW: we’ve redefined RSS to “relaxed saved searches” 😉

Hmmh, ok. Nothing new, you might think. Twitter allows you to do the same: saving searches. And e.g. Hootsuite is really good at storing several searches too. I must admit that jetwick isn’t a big challenger, because it isn’t as proven and well tested, doesn’t have a solid user base etc.

But: neither twitter nor hootsuite allows you to sort the tweets by number of retweets or language, right?

With jetwick you now can save your favourite searches and come back after days. Read only tweets that are important.

One example

If I had to follow the twitter search for ‘java’ I would get hundreds of tweets per hour. So, logically, I cannot read them all and to be honest: I never read them. It gets even worse when I do not visit twitter for days (yeah, that’s really possible). But I also do not want to miss the most important tweets. And that’s where jetwick comes into the game, which allows you to do exactly that: define a search important to you, sort or filter by retweets or language, and then stay informed after days. The saved search will display the number of new tweets since the last search in the gray brackets. It will update this count at a regular interval if you leave the browser window open.

Jetwick even allows you to search an account and filter this for a keyword. This is especially useful for an account with lots of tweets like heiseonline:

And what’s best about all this: jetwick is free software. Get the Java sources or join the team to make jetwick even more exciting and useful!

Jetwick Twitter Search is now free software! Wicket and Solr pearls for Developers.

Posted on 22 November, 2010 by karussell

Today we released Jetwick under the Apache 2 license.

Why we made Jetwick free software

I would like to start with an image I made some years ago for TimeFinder:

This is one reason and we are very interested in your contributions as patches or bug reports.

But there are some more interesting opportunities when releasing jetwick as open source:

Open architecture: several jetwick-hosters could provide parts of the twitter index and maybe some day we can have a way to freely explore all tweets at twitter in the jetwicked way
Personalized jetwick: every user has different interests. If you only feed tweets from searches of terms that you are interested in and also only your timeline, then you will be able to search this personalized twitter, sort by retweets, see personal URL-trends, etc.
This way you’ll be informed faster, wider and in a more personalized way than with an ordinary rss feed or the ordinary twitter timeline, without reading a lot of unrelated content. If jetwick stayed closed then this task would be too resource intensive, or it would even be impossible to convince every user.

In our further development we will concentrate on the second point, because then jetwick will have at least one user.

Explore Jetwick

Why should you install Jetwick and try it out?

First you can look at the features and see if something interesting – for you as a user – is shown.

For developers there could be the following things (and more) worth investigating:

Jetwick can be used as a simple showcase how to use wicket and solr:

show the use of facets and facet queries
show date facets as explained in here
make autocompletion working
instant results when you select a suggestion from autocompletion as explained in this video

If you are programming some twitter stuff you should keep an eye on the following points:

spam detection as explained in this post
how you can use OAuth with twitter4j and wicket
transform tweets with clickable users, links and hashtags
translate tweets with google translate
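
As a rough idea of what ‘clickable users, links and hashtags’ means, here is a small sketch. It is not the code from the jetwick sources: the class name and the link targets (?user=… and ?q=…) are made up for this example, and HTML escaping of the tweet text is left out. A single regular expression is used in one pass, so that the ‘#’ inside a URL is not mistaken for a hashtag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: turn URLs, @users and #hashtags of a tweet into HTML links in one pass. */
class TweetLinker {

    // group 1 = URL, group 2 = user name, group 3 = hashtag
    private static final Pattern TOKEN = Pattern.compile("(https?://\\S+)|@(\\w+)|#(\\w+)");

    static String linkify(String tweet) {
        Matcher m = TOKEN.matcher(tweet);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String html;
            if (m.group(1) != null)
                html = "<a href=\"" + m.group(1) + "\">" + m.group(1) + "</a>";
            else if (m.group(2) != null)
                // hypothetical link target: a user search
                html = "<a href=\"?user=" + m.group(2) + "\">@" + m.group(2) + "</a>";
            else
                // hypothetical link target: a search for the hashtag (%23 is the encoded '#')
                html = "<a href=\"?q=%23" + m.group(3) + "\">#" + m.group(3) + "</a>";
            m.appendReplacement(sb, Matcher.quoteReplacement(html));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(linkify("hi @bob, look at #java"));
    }
}
```

Keep in mind that this simple pattern would also match the part after the ‘@’ in an email address, so a real implementation needs a few more checks.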

If you are new to wicket

Jetwick is configured so that if you run ‘mvn jetty:run’ you can update html and code and hit refresh in the browser 1-2 seconds later to see the updated results. CSS changes are picked up immediately
query wikipedia and show the results in a lazyload panel

Some solr gems:

simple near realtime setup with solr. And that although we make heavy use of facets, where a lot of autowarming is required.
if you re-enable the user search you can use twitter’s person suggestions on your own data. I’m relatively sure that twitter uses the ‘more like this’ feature of lucene that jetwick had implemented with solr.

fluid database and the PermGenSpace checker

fluid update of your database from hibernate changes (via liquibase). Hint: at the moment we only use the db for the ytags
a simple helper method ‘getPermGenSpace’ to check if reload is possible

The following interesting issues are still open:

using a storage for tweets (mysql or redis or …). This will increase the index time dramatically, because we had to switch to a pure solr solution (we had problems with h2 and hibernate for over 4 million tweets)
a mature queue like 0mq and protobuf or something else
a ‘real’ realtime solution for solr, if we use solr from trunk

Karussell

Thoughts about Java and more

Category Archives: Jetwick