Jetslide uses ElasticSearch as Database

This post explains how one can use the search server ElasticSearch as a database. For Jetslide I use ElasticSearch as my only data storage system, because I want to avoid the maintenance and development overhead of running a separate system, be it a NoSQL, object or plain SQL database.

ElasticSearch is a really powerful search server based on Apache Lucene. So why can you use ElasticSearch as a single point of truth (SPOT)? Let us begin and go through all – or at least my – requirements for a data storage system! Did I forget something? Add a comment 🙂 !

CRUD & Search

You can create, read (see also realtime get), update and delete documents of different types. And of course you can perform full text search!
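
For illustration, here is roughly how the basic operations look via the REST API (the index and type names 'jetslide' and 'user' are just examples):

# create/update (index) a document
curl -XPUT 'http://localhost:9200/jetslide/user/1' -d '{
    "name" : "tester",
    "bio" : "I like ElasticSearch"
}'

# read it back by id
curl -XGET 'http://localhost:9200/jetslide/user/1'

# full text search
curl -XGET 'http://localhost:9200/jetslide/user/_search?q=bio:elasticsearch'

# delete it
curl -XDELETE 'http://localhost:9200/jetslide/user/1'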

Multi tenancy

Multiple indices are very easy to create and to delete. This can be used to support several clients, or simply to put different types into different indices – like one would create separate tables for every type/class.
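
As a minimal sketch (the index name 'client_a' is made up), every client can get its own index, which is created and dropped with a single call:

# create an index for a new client
curl -XPUT 'http://localhost:9200/client_a/'

# and drop it again, data and all, when the client leaves
curl -XDELETE 'http://localhost:9200/client_a/'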

Sharding and Replication

Sharding and replication is just a matter of numbers when creating the index:

curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
    number_of_shards : 3
    number_of_replicas : 2'

You can even update the number of replicas afterwards ‘on the fly’. To update the number of shards of one index you have to reindex (see the reindexing section below).
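
For example, increasing the replica count of the existing 'twitter' index on the fly should look roughly like this, via the update settings API:

curl -XPUT 'http://localhost:9200/twitter/_settings' -d '{
    "index" : { "number_of_replicas" : 3 }
}'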

Distributed & Cloud

ElasticSearch can be distributed across many machines. You can dynamically add and remove nodes (video). Additionally, read this blog post for information about using ElasticSearch in ‘the cloud’.

Fault Tolerance & Reliability

ElasticSearch will recover from the last snapshot of its gateway if something ‘bad’ happens, like an index corruption or even a total cluster fallout – think time machine for search. Watch this video from Berlin Buzzwords (minute 26) to understand how the reliable and asynchronous nature of ElasticSearch are combined.

Nevertheless I still recommend doing a backup to a different system (or at least a different hard disc) from time to time, e.g. in case you hit ElasticSearch or Lucene bugs, or simply to be on the really safe side 🙂

Realtime Get

When using Lucene you have a (near) real time latency, which basically means that if you store a document into the index, you’ll have to wait a bit until it appears in a subsequent search. Although this latency is quite small – only a few milliseconds – it is there, and it grows as the index grows. But ElasticSearch implements a realtime get feature in its latest version, which makes it possible to retrieve a document by its id even if it is not yet searchable!
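
A small sketch (index and type names are again just examples):

# index a document ...
curl -XPUT 'http://localhost:9200/jetslide/user/2' -d '{ "name" : "new user" }'

# ... and fetch it right back by id, no refresh needed –
# even though a search might not find it yet
curl -XGET 'http://localhost:9200/jetslide/user/2'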

Refresh, Commit and Versioning

As I said, you have a realtime latency when creating or updating (aka indexing) a document. To update a document you can use the realtime get, merge the changes and put the result back into the index. Another approach, which avoids further hits on ElasticSearch, would be to call refresh (or commit in Solr) on the index. But this is very problematic (i.e. slow) when the index is not tiny.
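
For completeness, such an explicit refresh is just one call (the index name is an example) – but as said, doing this after every document quickly gets too slow on a non-tiny index:

curl -XPOST 'http://localhost:9200/jetslide/_refresh'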

The good news is that you can again solve this problem with a feature of ElasticSearch – it is called versioning. This is identical to optimistic locking on the ‘application side’ in the database world. Put the document into the index, and if that fails due to a version conflict, merge the old state with the new one and try again. To be honest this requires a bit more thinking (e.g. a failure queue or similar), but now I have a really well working system, secured with unit tests.
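
Roughly, the optimistic locking cycle looks like this (names and version numbers are just examples):

# the get returns the current _version of the document, say 2
curl -XGET 'http://localhost:9200/jetslide/user/1'

# write back only if version 2 is still current; if someone updated
# the document in between, ElasticSearch answers with a version
# conflict and you merge the states and retry
curl -XPUT 'http://localhost:9200/jetslide/user/1?version=2' -d '{
    "name" : "tester",
    "bio" : "merged content"
}'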

If you think about it, this is a really huge benefit over e.g. Solr. Even if Solr’s raw indexing were faster (nobody has really done a good job of comparing the indexing performance of Solr vs. ES), it requires a call to commit to make documents searchable, which slows down the whole indexing process a lot compared to ElasticSearch, where you never really need to call the expensive refresh.

Reindexing

This is not necessary for a normal database, but it is crucial for a search server, e.g. to change an analyzer or the number of shards of an index. Reindexing sounds hard, but it can be implemented easily with ElasticSearch alone, without a separate data storage. For Jetslide I do not store single fields – I store the entire document as JSON in the _source. This is necessary to fetch the documents from the old index and put them into the newly created one (with different settings).

But wait – how can I fetch all documents from the old index? Wouldn’t this be bad in terms of performance or memory for big indices? No, you can use the scan search type, which e.g. avoids scoring.
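
A sketch of such a fetch with the scan search type (scroll timeout and page size are arbitrary):

curl -XGET 'http://localhost:9200/userindex6/_search?search_type=scan&scroll=10m&size=100' -d '{
    "query" : { "match_all" : {} }
}'

# the response contains a _scroll_id; feed it back to get the next batch
curl -XGET 'http://localhost:9200/_search/scroll?scroll=10m' -d '<_scroll_id from the previous response>'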

Ok, but how can I replace my old index with the new one? Can this be done ‘on the fly’? Yes, you can simply switch the alias of the index:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
   { "remove" : { "index" : "userindex6", "alias" : "userindex" } },
   { "add" : { "index" : "userindex7", "alias" : "uindex" } }]
}'

Performance

Well, ElasticSearch is fast. But you’ll have to determine for yourself whether it is fast enough for your use case, and compare it to your existing data storage system.

Feature Rich

ElasticSearch has a lot of features which you will not find in a normal database, e.g. faceting or the powerful percolator, to name only a few.
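
E.g. a simple terms facet (the ‘tags’ field is made up) is just part of a normal search request – something a plain database will not give you out of the box:

curl -XGET 'http://localhost:9200/jetslide/user/_search' -d '{
    "query" : { "match_all" : {} },
    "facets" : { "popular_tags" : { "terms" : { "field" : "tags" } } }
}'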

Conclusion

In this post I explained if and how ElasticSearch can be used as a database replacement. ElasticSearch is very powerful, but e.g. the versioning feature requires a bit of handwork. So working with ElasticSearch is more comparable to the JDBC or SQL world than to the ORM one. I’m sure some ORM-like tools for ElasticSearch will pop up, but I prefer to avoid system complexity and will probably always use the ‘raw’ ElasticSearch.

14 thoughts on “Jetslide uses ElasticSearch as Database”

    • Hi,

      Can you update the elasticsearch index if we update the values in the mysql database after the initial creation of the index for a table?
      If so, can you please let me know the procedure?

      Thanks!

  1. Great post. I have been using CouchDB as a storage layer behind ElasticSearch because, coming from Solr, I’ve always thought of Lucene indexes as rather ephemeral. Really neat to realize that I can just store these JSON docs in ES.

  2. Would love to hear an update of your experiences on ES as the primary store. Any bumps or horrors? It’s awfully appealing in principle, I just created an app with ES as the only store, but I’m not sure how much to trust it.

  3. What do you mean by ‘ala jpa’?

    store/update via bulk or update api

    retrieve via get (realtime) or search

    delete via delete query

    • I was thinking that ES could be listed in the persistence.xml file with a JDBC DSN, username, and password, all underneath the JPA/Hibernate persistence provider.

      Then add annotations to the class files of the entities, and go forward.

      Or something like/close to that.

  4. ElasticSearch has no transaction support nor will it have them in the future (see some discussion in the user group). A common workaround is the real time get combined with the version attribute.
