Get Started with ElasticSearch and Wicket


This article shows the most basic steps required to get ElasticSearch working for the simplest scenario with the help of the Java API: installing, indexing and querying.

1. Installation

Either get the sources from github and compile them, or grab the zip file of the latest release and start a node in the foreground via:

bin/elasticsearch -f

To make things easy for you I have prepared a small example with sources derived from jetwick where you can start ElasticSearch directly from your IDE, e.g. just click ‘open projects’ in NetBeans and then run the ElasticNode class. The example should show you how to do indexing via the bulk API, querying, faceting, filtering, sorting and probably some more.

To get started on your own see the sources of the example where I’m actually using ElasticSearch or take a look at the shortest ES example (with Java API) in the last section of this post.

Info: If you want ES to start automatically when your debian system boots, read this documentation.

2. Indexing and Querying

First of all you should define all fields of your document which shouldn’t get the default analyzer (e.g. strings get analyzed by default) and specify that in the tweet.json under the folder es/config/mappings/_default

For example in the elasticsearch example the userName shouldn’t be analyzed:

{ "tweet" : {
   "properties" : {
     "userName": { "type" : "string", "index" : "not_analyzed" }
}}}
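To see why this matters, here is a minimal plain-Java sketch of the difference between an analyzed and a not_analyzed string field. It does not use any ElasticSearch classes; the class and method names are made up for illustration, and the tokenizer is only a crude stand-in for a real analyzer (it lowercases and splits on everything that is not a letter or digit).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only, not part of the ElasticSearch API.
final class AnalyzedVsNotAnalyzed {

    // Crude stand-in for an analyzer: lowercase the value, then split it
    // on every run of characters that are not a-z or 0-9.
    static List<String> analyzed(String value) {
        List<String> terms = new ArrayList<String>();
        for (String token : value.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty())
                terms.add(token);
        }
        return terms;
    }

    // not_analyzed: the whole value is kept as one single, unchanged term.
    static List<String> notAnalyzed(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        System.out.println("analyzed:     " + analyzed("Peter_Karich"));
        System.out.println("not_analyzed: " + notAnalyzed("Peter_Karich"));
    }
}
```

With the analyzed variant an exact match on the full user name would no longer find the document, which is why userName is declared as not_analyzed in the mapping.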

Then start the node:

import static org.elasticsearch.node.NodeBuilder.*;
...
Builder settings = ImmutableSettings.settingsBuilder();
// here you can set the node and index settings via API
settings.build();
NodeBuilder nBuilder = nodeBuilder().settings(settings);
if (testing)
 nBuilder.local(true);

// start it!
node = nBuilder.build().start();

You can get the client directly from the node:

Client client = node.client();

or if you need the client in another JVM you can use the TransportClient:

Settings s = ImmutableSettings.settingsBuilder().put("cluster.name", cluster).build();
TransportClient tmp = new TransportClient(s);
// the TransportClient talks to the transport port (9300 by default), not to the http port 9200
tmp.addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));
client = tmp;

Now create your index:

try {
  client.admin().indices().create(new CreateIndexRequest(indexName)).actionGet();
} catch(Exception ex) {
   logger.warn("already exists", ex);
}

When indexing your documents you’ll need to know where to store (indexName) and what to store (indexType and id):

IndexRequestBuilder irb = client.prepareIndex(getIndexName(), getIndexType(), id).
setSource(b);
irb.execute().actionGet();

where the source b is the jsonBuilder created from your domain object:

import static org.elasticsearch.common.xcontent.XContentFactory.*;
...
XContentBuilder b = jsonBuilder().startObject();
b.field("tweetText", u.getText());
b.field("fromUserId", u.getFromUserId());
if (u.getCreatedAt() != null) // the 'if' is not necessary in >= 0.15
  b.field("createdAt", u.getCreatedAt());
b.field("userName", u.getUserName());
b.endObject();

To get a document via its id you do:

GetResponse rsp = client.prepareGet(getIndexName(), getIndexType(), "" + id).
execute().actionGet();
MyTweet tweet = readDoc(rsp.getSource(), rsp.getId());

Getting multiple documents at once is currently not supported via ‘prepareGet’, but you can create a terms query on the implicit field ‘_id’ to retrieve several documents in one request. For updating lots of documents there is already a bulk API.

In test cases you’ll have to make sure after indexing that the documents are actually ‘committed’ before searching (don’t do this in production):

RefreshResponse rsp = client.admin().indices().refresh(new RefreshRequest(indices)).actionGet();

To write tests which use ES you can take a look at the source code to see how I’m doing this (starting ES in beforeClass etc.).

Now let us search:

SearchRequestBuilder builder = client.prepareSearch(getIndexName());
XContentQueryBuilder qb = QueryBuilders.queryString(queryString).defaultOperator(Operator.AND).
   field("tweetText").field("userName", 0).
   allowLeadingWildcard(false).useDisMax(true);
builder.addSort("createdAt", SortOrder.DESC);
builder.setFrom(page * hitsPerPage).setSize(hitsPerPage);
builder.setQuery(qb);

SearchResponse rsp = builder.execute().actionGet();
SearchHit[] docs = rsp.getHits().getHits();
for (SearchHit sd : docs) {
  //to get explanation you'll need to enable this when querying:
  //System.out.println(sd.getExplanation().toString());

  // if the mapping contains "_source" : {"enabled" : false}
  // we need to include all necessary fields in the query and then use sd.getFields()
  // instead of sd.getSource()
  MyTweet tw = readDoc(sd.getSource(), sd.getId());
  tweets.add(tw);
}
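The paging in the search code is plain arithmetic on a zero-based page number. A minimal sketch of it, using a hypothetical helper class (not part of the ElasticSearch API) and assuming you take the total hit count from the search response:

```java
// Hypothetical helper for illustration, not part of the ElasticSearch API.
final class Paging {

    // Offset of the first hit on a zero-based page; this is the value
    // handed to setFrom(...) in the search code.
    static int from(int page, int hitsPerPage) {
        if (page < 0 || hitsPerPage <= 0)
            throw new IllegalArgumentException("page must be >= 0 and hitsPerPage > 0");
        return page * hitsPerPage;
    }

    // Number of pages needed to display totalHits hits (rounds up).
    static long pageCount(long totalHits, int hitsPerPage) {
        if (totalHits < 0 || hitsPerPage <= 0)
            throw new IllegalArgumentException("totalHits must be >= 0 and hitsPerPage > 0");
        return (totalHits + hitsPerPage - 1) / hitsPerPage;
    }

    public static void main(String[] args) {
        System.out.println("from=" + from(2, 15));
        System.out.println("pages=" + pageCount(31, 15));
    }
}
```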

The helper method readDoc is simple:

public MyTweet readDoc(Map source, String idAsStr) {
  String name = (String) source.get("userName");
  long id = -1;
  try {
     id = Long.parseLong(idAsStr);
  } catch (Exception ex) {
     logger.error("Couldn't parse id:" + idAsStr);
  }

  MyTweet tweet = new MyTweet(id, name);
  tweet.setText((String) source.get("tweetText"));
  tweet.setCreatedAt(Helper.toDateNoNPE((String) source.get("createdAt")));
  tweet.setFromUserId((Integer) source.get("fromUserId"));
  return tweet;
}
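One thing to watch in readDoc: the source map is parsed from JSON, so a numeric field typically arrives as an Integer when the value is small and as a Long when it is large, and a direct (Integer) cast then fails for big ids. A minimal sketch that tolerates both, assuming a hypothetical helper named toLong (it is not part of the example sources):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration, not part of the example sources.
final class SourceNumbers {

    // Reads a numeric field from a parsed JSON source map, accepting any
    // Number subtype; returns defaultValue if the field is missing or
    // not numeric.
    static long toLong(Map<String, Object> source, String field, long defaultValue) {
        Object value = source.get(field);
        if (value instanceof Number)
            return ((Number) value).longValue();
        return defaultValue;
    }

    public static void main(String[] args) {
        Map<String, Object> source = new HashMap<String, Object>();
        source.put("fromUserId", Integer.valueOf(42));
        source.put("bigId", Long.valueOf(5000000000L));
        System.out.println(toLong(source, "fromUserId", -1));
        System.out.println(toLong(source, "bigId", -1));
        System.out.println(toLong(source, "missing", -1));
    }
}
```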

If you want the facets to be returned along with the search results you’ll have to ‘enable’ them when querying:

String facetName = "userName";
String facetField = "userName";
builder.addFacet(FacetBuilders.termsFacet(facetName)
   .field(facetField));

Then you can retrieve all terms facets via:

SearchResponse rsp = ...
if (rsp != null) {
 Facets facets = rsp.facets();
 if (facets != null)
   for (Facet facet : facets.facets()) {
     if (facet instanceof TermsFacet) {
         TermsFacet ff = (TermsFacet) facet;
         // => ff.getEntries() => count per unique value
...

This is done in the FacetPanel.
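If you are new to facets: conceptually a terms facet is nothing more than a count per unique value of a field over the matching documents. The following plain-Java sketch shows only that idea; the class name is made up, and the ordering and size limit of the real facet entries are not modeled.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only, not part of the ElasticSearch API.
final class TermsFacetSketch {

    // Counts how often each distinct value occurs, keeping the order in
    // which the values were first seen.
    static Map<String, Integer> countPerValue(List<String> values) {
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        for (String value : values) {
            Integer old = counts.get(value);
            counts.put(value, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // pretend these are the userName values of the matching tweets
        List<String> userNames = Arrays.asList("karussell", "jetwick", "karussell");
        System.out.println(countPerValue(userNames));
    }
}
```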

I hope you now have a basic understanding of ElasticSearch. Please let me know if you found a bug in the example or if something is not clearly explained!

In my (too?) small Solr vs. ElasticSearch comparison I also listed some useful tools for ES. Also have a look at this!

3. Some hints

  • Use ‘none’ gateway for tests. Gateway is used for long term persistence.
  • The Java API is not well documented at the moment, but now there are several Java API usages in Jetwick code
  • Use scripting for boosting, use JavaScript as language – most performant as of Dec 2010!
  • Restart the node to try a new scripting language
  • To use the snowball stemmer in 0.15, specify language:English (otherwise you get a ClassNotFoundException)
  • See how your terms get analyzed:
    http://localhost:9200/twindexreal/_analyze?analyzer=index_analyzer “this is a #java test => #java + test”
  • Or include the analyzer as a plugin: put the jar under lib/ E.g. see the icu plugin. Be sure you are using the right guice annotation
  • ES uses port 9200 (up to 9300) for http communication and port 9300 (up to 9400) for the transport client.
  • If you have problems with ports: make sure at least a simple put + get is working via curl
  • Scaling-ElasticSearch
    This solution is my preferred solution for handling long term persistency of a cluster since it means
    that node storage is completely temporal. This in turn means that you can store the index in memory for example,
    get the performance benefits that come with it, without sacrificing long term persistency.
  • Too many open files: edit /etc/security/limits.conf
    user soft nofile 15000
    user hard nofile 15000
    ! then log out and log in again !

4. Simplest Java Example

import static org.elasticsearch.node.NodeBuilder.*;
import static org.elasticsearch.common.xcontent.XContentFactory.*;
...
Node node = nodeBuilder().local(true).
settings(ImmutableSettings.settingsBuilder().
put("index.number_of_shards", 4).
put("index.number_of_replicas", 1).
build()).build().start();

String indexName = "tweetindex";
String indexType = "tweet";
String fileAsString = "{"
+ "\"tweet\" : {"
+ "    \"properties\" : {"
+ "         \"longval\" : { \"type\" : \"long\", \"null_value\" : -1}"
+ "}}}";

Client client = node.client();
// create index
client.admin().indices().
create(new CreateIndexRequest(indexName).mapping(indexType, fileAsString)).
actionGet();

client.admin().cluster().health(new ClusterHealthRequest(indexName).waitForYellowStatus()).actionGet();

// jsonBuilder() comes from the static import of XContentFactory above
XContentBuilder docBuilder = jsonBuilder().startObject();
docBuilder.field("longval", 124L);
docBuilder.endObject();

// feed previously created doc
IndexRequestBuilder irb = client.prepareIndex(indexName, indexType, "1").
setConsistencyLevel(WriteConsistencyLevel.DEFAULT).
setSource(docBuilder);
irb.execute().actionGet();

// there is also a bulk API if you have many documents
// make sure the doc is searchable; you shouldn't need this in production, because
// documents become available automatically in (near) real time
client.admin().indices().refresh(new RefreshRequest(indexName)).actionGet();

// create a query to get this document
XContentQueryBuilder qb = QueryBuilders.matchAllQuery();
TermFilterBuilder fb = FilterBuilders.termFilter("longval", 124L);
SearchRequestBuilder srb = client.prepareSearch(indexName).
setQuery(QueryBuilders.filteredQuery(qb, fb));

SearchResponse response = srb.execute().actionGet();

System.out.println("failed shards:" + response.getFailedShards());
Object num = response.getHits().hits()[0].getSource().get("longval");
System.out.println("longval:" + num);

Get more Friends on Twitter with Jetwick

Obviously you won’t need a tool to get more friends aka ‘following’ on twitter, but you’ll add more friends once you have tried our new feature called ‘friend search’. But let me start from the beginning of our recent, major technology shift for jetwick, our open source twitter search.

We have now moved the search server from Solr to ElasticSearch; more on that in a later post. This move will hopefully solve some data freshness problems and also make more tweets available in jetwick. All features should work as before.

To make it even more pleasant for you, my fellow user, I additionally implemented the friend search with all the jetwicked features: sort by retweets, filter by language, filter away spam and duplicates …

Update: you’ll need to install jetwick

Try it on your own

  1. First log in to jetwick. You need to wait ~2 minutes until all your friends are detected …
  2. Then type your query or leave it empty to get all tweets
  3. Finally select ‘Only Friends’ as the user filter.
  4. Now you are able to search all the tweets of the people you follow.
  5. Be sure you clicked on ‘without duplicates’ (on the left) etc. as appropriate

Here are the friend search results when querying ‘java’:

And now the normal search where the users rubenlirio and tech_it_jobs are not from my ‘friends’:

You don’t need to stay on-line all the time: jetwick automagically grabs the tweets of your friends for you. And if you use the relaxed saved searches, then you’ll also be notified of all changes in your homeline, even after days!

That was the missing puzzle piece for me to be able to stay away from twitter and PC – no need to check every hour for the latest tweets in my homeline or my “twitter – saved searches”.

Jetwick is free so you’ll only need to log in and try it! As a side effect of being logged in: your own tweets will be archived and you can search them through the established user search. With that user search you can use twitter as a bookmark application as I’m already doing … BTW: you’ll notice the orange tweet which is a free ad: just create a tweet containing #jetwick and it’ll be shown on top of matching searches.

Another improvement is (hopefully) the user interface for jetwick, the search form should be more clear:

before it was:

and the blue logo should now look a bit better, before it was:

What do you think?

Jetwick Twitter Search is now free software! Wicket and Solr pearls for Developers.

Today we released Jetwick under the Apache 2 license.

Why we made Jetwick free software

I would like to start with an image I made some years ago for TimeFinder:

This is one reason and we are very interested in your contributions as patches or bug reports.

But there are some more interesting opportunities when releasing jetwick as open source:

  • Open architecture: several jetwick-hosters could provide parts of the twitter index and maybe some day we can have a way to freely explore all tweets at twitter in the jetwicked way
  • Personalized jetwick: every user has different interests. If you only feed tweets from searches of terms that you are interested in and also only your timeline, then you will be able to search this personalized twitter, sort by retweets, see personal URL-trends, etc.
    This way you’ll be informed faster, wider and in a more personalized way than with an ordinary rss feed or the ordinary twitter timeline, without reading a lot of unrelated content. If jetwick stayed closed then this task would be too resource intensive, or it would even be impossible to convince every user.

In our further development we will concentrate on the second point, because then jetwick will have at least one user.

Explore Jetwick

Why should you install Jetwick and try it out?

First you can look at the features and see if something interesting – for you as a user – is shown.

For developers the following things (and more) could be worth investigating:

Jetwick can be used as a simple showcase of how to use wicket and solr:

  1. show the use of facets and facet queries
  2. show date facets as explained in here
  3. get autocompletion working
  4. instant results when you select a suggestion from autocompletion as explained in this video

If you are programming some twitter stuff you should keep an eye on the following points:

  • spam detection as explained in this post
  • how you can use OAuth with twitter4j and wicket
  • transform tweets with clickable users, links and hashtags
  • translate tweets with google translate

If you are new to wicket

  • Jetwick is configured so that if you run ‘mvn jetty:run’ you can update html and code and hit refresh in the browser 1-2 seconds later to see the updated results. Css is updated immediately
  • query wikipedia and show the results in a lazyload panel

Some solr gems:

  • simple near realtime setup with solr, even though we make heavy use of facets, where a lot of autowarming is required.
  • if you re-enable the user search you can use twitter’s person suggestions on your own data. I’m relatively sure that twitter uses the ‘more like this’ feature of lucene that jetwick had implemented with solr.

fluid database and the PermGenSpace checker

  • fluid update of your database from hibernate changes (via liquibase). Hint: at the moment we only use the db for the ytags
  • a simple helper method ‘getPermGenSpace’ to check if reload is possible

The following interesting issues are still open:

  • using a storage for tweets (mysql or redis or …). This will increase the index time dramatically, because we had to switch to a pure solr solution (we had problems with h2 and hibernate for over 4 million tweets)
  • a mature queue like 0mq and protobuf or something else
  • a ‘real’ realtime solution for solr, if we use solr from trunk

Jetwick Layout Update

In the next days I will release a slightly changed UI of jetwick.com. It is nice that only css+html changes were necessary for this (I am using wicket btw). A nice consequence is that a lot of “float: left;” and “width: xy;” declarations are now unnecessary.

This will be the three-column style:

Before it was a mix with a lot of white areas:

Barchart with Wicket and pure HTML

I needed to display the tweets per day for my date filter @ jetwick.com

I tried the jfreechart approach but I didn’t like having a generated image with an imagemap, although it worked and looked nice.

So here you have the html, css and java snippet necessary to do the same in pure html. Please comment if something is wrong (I had to edit the working code to remove the unnecessary solrJ stuff that I had within that component).

Html

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"
      xmlns:wicket="http://wicket.apache.org/dtds.data/wicket-xhtml1.4-strict.dtd">
    <head>
        <title>[Panel Test]</title>
    </head>
    <body>
        <wicket:panel>
               <div class="main-bar-chart">
               <div class="bar-chart">
                    <div wicket:id="items">
                        <a wicket:id="itemLink">
                            <span wicket:id="itemLabel">[Text]</span>
                            <div wicket:id="itemSpan"/>
                        </a>
                    </div>
              </div>
              </div>
        </wicket:panel>
    </body>
</html>

Css

.date-filter .main-bar-chart {
    background: #f2f2f2 url('../img/bottom-line.png') bottom left repeat-x;
    padding: 10px;
    width: 610px;
    height: 100px;
}
.date-filter-label {
    padding-bottom: 10px;
}
.date-filter .bar-chart, .main-bar-chart .gray  { color: gray; }

.date-filter .bar-chart .item {
    padding-left: 10px;
    float: left;
}

.date-filter .bar-chart .item span  {
    font-size: 12px;
}

.date-filter .bar-chart .item .item-span {
    background-repeat: repeat-y;
    background-image: url('../img/bar-min.png');
}

Java

    private List<Object[]> entryList = new ArrayList<Object[]>();
    private long max = 1;

    public JSDateFilter(String id) {
        super(id);

        ListView items = new ListView("items", entryList) {

            @Override
            public void populateItem(final ListItem item) {
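                // cast to float first: if MAX_HEIGHT_IN_PX is an integer type the division
                // would otherwise be truncated (to 0 as soon as max > MAX_HEIGHT_IN_PX)
                float zoomer = (float) MAX_HEIGHT_IN_PX / max;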
                final Object[] entry = (Object[]) item.getModelObject();
                String strValue = (String) entry[0];
                Integer count = (Integer) entry[1];
                Label bar = new Label("itemSpan");

                AttributeAppender app = new AttributeAppender("title", new Model(count + " entries"), " ");
                bar.add(app).add(new AttributeAppender("style", new Model("height:" + (int) (zoomer * count) + "px"), " "));
                AjaxFallbackLink link = new AjaxFallbackLink("itemLink") {

                    @Override
                    public void onClick(AjaxRequestTarget target) {
                        //TODO
                    }
                };
                link.add(app);
                Label label = new Label("itemLabel", strValue);
                link.add(bar).add(label);
                if (count == 0) {
                    link.setEnabled(false);
                    link.add(new AttributeAppender("class", new Model("gray"), " "));
                }

//                if (selected)
//                    link.add(new AttributeAppender("class", new Model("filter-rm"), " "));
//                else
//                    link.add(new AttributeAppender("class", new Model("filter-add"), " "));

                item.add(link);
            }
        };

        add(items);
    }

    public void update(Map<String, Integer> map) {
        entryList.clear();
        max = 1;
        for (Entry<String, Integer> e : map.entrySet()) {
            entryList.add(new Object[]{e.getKey(), e.getValue()});
            if (e.getValue() > max)
                max = e.getValue();
        }
    }

You can use this code in your wicket page via the following snippet in the html:

<div wicket:id="dateFilter">[dateFilter]</div>

and add(new JSDateFilter("dateFilter")) in the Java part. The bar image is available here.
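The height calculation from populateItem can be tried in isolation. The following is a minimal sketch, not the component itself: the class name and the value of MAX_HEIGHT_IN_PX are assumptions (the constant is not shown in the snippet above). It scales every count so that the largest one gets the full height, and it divides in float so that the zoom factor is not truncated to 0 for large counts.

```java
import java.util.Arrays;

// Hypothetical standalone sketch of the bar height scaling.
final class BarHeights {

    // Assumed value; the original constant is not shown in the snippet.
    static final int MAX_HEIGHT_IN_PX = 50;

    // Returns the bar height in pixels for every count.
    static int[] heights(int[] counts) {
        long max = 1; // start with 1 to avoid a division by zero, as in update()
        for (int count : counts) {
            if (count > max)
                max = count;
        }
        // float division; an integer division would yield 0 for max > MAX_HEIGHT_IN_PX
        float zoomer = (float) MAX_HEIGHT_IN_PX / max;
        int[] result = new int[counts.length];
        for (int i = 0; i < counts.length; i++) {
            result[i] = (int) (zoomer * counts[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(heights(new int[]{0, 5, 10})));
        System.out.println(Arrays.toString(heights(new int[]{100, 200})));
    }
}
```

Because the result is cast to int, heights are rounded down; use Math.round if you prefer rounding to the nearest pixel.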

Twitter Search Jetwick – powered by Wicket and Solr

How different is a quickstart project from production?

Today we released jetwick. With jetwick I wanted to realize a service to find similar users at twitter based on their tweeted content. Not based on the following-list like it is possible on other platforms:

Not only is the find-similar feature nice, the topics (on the right side of the user name; gray) also give a good impression of which topics a user tweets about. The first usable prototype was ready within one week! I used lucene, vaadin and db4o. But I needed facets so I switched from lucene to solr. The transformation took only ~2 hours. Really! Test based programming rocks 😉 !

Now users told me that jetwick is slow on ‘old’ machines. It took me some time to understand that vaadin uses javascript a lot and that inappropriate usage of layouts could affect performance negatively in some browsers. So I had the choice to stay with vaadin and improve the performance (with different layouts) or switch to another web UI. I switched to wicket (twitter noise). It is amazingly fast. This transformation took some more time: 2 days. After this I was convinced by the performance of the UI. The programming model is quite similar (‘swing like’) although vaadin is easier and so faster to implement. While working on this I could improve the tweet collector which searches twitter for information and stores the results in jetwick.

After this something went wrong with the db. It was very slow for >1 million users. I spent at least one week tweaking db4o to improve its performance (file >1GB). It improved, but it wouldn’t have been sufficient for production. Then I switched to hibernate (yesql!). This switch again took me two weeks and several frustrating nights. Db4o is so great! Ok, now that I know hibernate better I can say: hibernate is great too and I think the most important feature (== disadvantage!) of hibernate is that you can tweak it nearly everywhere: e.g. you can say that you only want to count the results, that you want to fetch some relationships eagerly and some lazily and so on. Db4o wasn’t that flexible. But hibernate has another drawback: you will need to upgrade the db schema yourself or you do it like me: use liquibase, which works perfectly in my case after some tweaking!

Now that we had the search, it turned out that this user-search was quite useful for me, as I wanted to have some users that I can follow. But alpha testers didn’t get the point of it. And then, the shock at the end of July: twitter released a find-similar feature for users! Damn! Why couldn’t they wait two months? It is so important to have a motivation … 😦 And some users seem to really like those user suggestions. Ok, some users felt disgusted when they noticed this new feature. But I like it!

BTW: I’m relatively sure that the user-suggestions are based on the same ‘more like this’ feature (from Lucene) that I was using, because for my account I got nearly the same users suggested and somewhere in a comment I read that twitter uses solr for the user search. Others seem to have got a shock too 😉

Then after the first shock I decided to switch again: from user-search to a regular tweet search where you can get more information out of those tweets. You can see at a glance which topics a user tweets about or search for your original url. Jetwick tries to store expanded URLs where possible. It is also possible to apply topic, date and language filters. One nice consequence of a tweet-based index is that it is possible to search through all my tweets for something I forgot:

Or you could look at all those funny google* accounts.

So, finally. What have I learned?

From a quick-start project to production many if not all things can change: Tools, layout and even the main features … and we’ll see what comes next.

Not A Java Web Frameworks Survey: Just use Wicket!

‘Java Web Frameworks Survey’ was my first blog post which was reposted at dzone. Sadly there never was a follow-up to it, although I planned one with:

jZeno, SpringMVC, Seam, Vaadin (at that time: IT-Mill Toolkit), MyFaces, Stripes, Struts, ItsNat, IWebMvc

Now, today just a short, subjective mini-follow-up, maybe someone is interested after all those months … over the months I have additionally investigated JSF, Rails, Vaadin and one more:

  • No comment on JSF :-/
  • Rails is great! Especially the db migrations and other goodies. Partials are crap: I prefer component based UI frameworks. If you don’t like ruby take a look at grails with autobase.
  • Additionally I highly recommend that everyone take a look at vaadin (‘server-side GWT’) if you need a stateful web application. Loading time was a problem for me. Other client-side performance problems can be solved if you use CssLayout, I think.

But for jetwick.com I chose wicket! There were/are 10 reasons:

The most important thing is: if you use ‘mvn jetty:run’ and NetBeans in combination then the development cycle feels like Rails: modify html, css or even Java code. Save and hit F5 in the browser. Nothing more.

The only problem is the database migration (wicket solves only the UI problems). For that I would use liquibase. Or simply run db4o, a nosql solution ‘or’ solr.