Rage Against the Android #eclipse

Developing Android applications on Linux with Eclipse sometimes can get really ugly. Sadly neither NetBeans which has a really nice Android plugin, but cannot execute a single test nor IDEA can rescue me or make me switching :( but probably they wouldn’t rescue me due to problems of Android development kit itself – I’m not sure.

So, I’ve collected some of the most common problems I encountered while developing an Android app and how to ‘solve’ them:

  • Problem: Eclipse says ‘Your project contains error(s), please fix it before running it.’ and you cannot find a problem.
    Solution:
    1. Open the problem tab. fix the described errors.
    2. Make sure that you included all necessary jars in your build path
    3. Sometimes even this can help: rm ~/.android/debug.keystore
  • Problem: you cannot debug your application
    Solution:
    1. check if debuggable = true in application tag of manifest xml
    2. if that does not help or if you are getting “Can’t bind to local 8601 for debugger” in the Console tab then read this and make sure you use only the line
    127.0.0.1       localhost
    in your hosts file. if not change the file and restart adb (see below)
  • Problem: Error “AdbCommandRejectedException: device not found”
    Solution: restart adb (see below)
  • Problem: you cannot select one test case to execute
    Solution: run the whole (android) juni test or a package and then select via right click to debug one single test

If nothing seems to help then try one or all of the following steps:
1. restart device
2. restart eclipse
3. restart adb: sudo adb kill-server; sudo adb start-server

Please add your problems and solutions in the comments ;)

Assignment Algorithms to improve Perfomance of Automated Timetabling

Last two days I’ve read the old Java code of a board game. Although the game still compiles and works (it even works on a Zaurus device) the code itself is horrible: no unit tests, thread issues, a lot of static usages and mixed responsibilities. But then I ‘rediscovered’ my old TimeFinder project, which is a lot better – at least it has several unit tests! Then I saw that I even wanted to publish a paper regarding a nice finding I made 3 years ago. But I never published it as the results were too disappointing to me at that time.

Nevertheless I think the idea is still worth to be spread around (you are free to avoid usage ;) ). BTW: now you know why my blog starts with “Find Time …” and why my twitter nick name is called timetabling – its all about history.

My Draft of a paper

So here is my “paper” ‘Optimizing Educational Schedules Using Hungarian Algorithm and Iterated Local Search‘ with the following main result:

The algorithm is bad when it comes to softconstraint optimizations (it wasn’t designed for this though) but it really shines to reduce hard constraints of timetabling problems. E.g. it was an order of magnitude faster than the 5th place (Mueller with UniTime) of the International Timetabling Competition 2007. Of course, I didn’t compare the latest version of UniTtime with TimeFinder yet.

Let me know what you think!

Now I would like to talk a bit about how one could in general use assigment algorithms in timetabling.

Using Assignment Algorithms in Timetabling

In this paper a technic is shown how assignment algorithms could be applied in timetabling. Only a few papers are known to us which uses assignment algorithms based on graph theory [Kin05, Bah04, MD77, TM02].
An assignment algorithm can be used as a subprocedure of a heuristic as it is done in the open source timetabling application called TimeFinder [Kar08]. TimeFinder uses an Hungarian algorithm only to assign a location to an event, where the solution itself is improved with a heuristic based on iterated local search [SLM03], we called this heuristic NoCollisionPrinciple.
Another possibility is to use an assignment algorithms as the ’backbone’ of a heuristic as it is done in [Bah04], where the Hungarian Algorithm is modified to act as a timetabling optimization algorithm.A third approach is to use an assignment algorithm as a pre-solver of another algorithm. This is planed for the TimeFinder project: before using the constraint based algorithm of UniTime an assignment algorithm will be used to give a good and fast starting point where all hard constraints are valid.

In all cases the good performance and optimal results of a mathmatically founded technic are the reason they are choosen and promise a good performing algorithm.

The heuristic tries to add and remove events to specific timeslots very often. So, the check if there is a room available for event E1 in a timeslot has to be very fast. But if there are already a lot of rooms and events in this timeslot it is difficult to test with a simple method if a valid room for E1 exists. Here the assignment algorithms came to my rescue. Other algorithms like the auction algorithm could solve the assignment problem, too, but are not discussed here.

An example for one cost matrix, where we want to calculate the minimal weighted matching (“best assignment”) later, could be the following:

Rooms \ Events | E1    E2
------------------------
R1             | 12    10
R2             | 16    3

The entries of the matrix are the difference of the seats to the visitors. E.g. (E2,R2)=3 means there will be 3 free seats which is better than (E2,R1)=10 free seats.

To optimize room-to-event-assignments it is necessary that the assigned room has nearly the same or slightly more seats as event E1 has visitors. The difference shouldn’t be too large, so that no seats will be ‘wasted’ (and an event with more visitors could be assigned later)

Of course, it could be that an entry is not valid (e.g. too many visitors) then the value in the matrix is +infinity.

The best solution (minimal sum) here would be E1,R1 and E2, R2 with a total sum of 15.

Some informations about the words in graph theory: this cost matrix is equivalent to a weighted, undirected, bi-partitioned, complete graph G. Weighted means with numbers instead of booleans; undirected means the same value for (Ex,Ry) and (Ry,Ex); bi-partitioned means the graph can be divided in two sets A and B where “A and B = empty” and “A or B = G”; complete means the cost matrix has the same number of rows and columns, you can always force this with +infinity values.

Assignment Algorithms

The easiest way to solve an assignment problem correctly is a brute force algorithm, which is very slow for big n: O(n!).
(Okay, it is even slow for small n=10: 10!=3628800)
Where n is in my case the number of rooms available in one timeslot. But this algorithm is always correct, because it tries all combinations of assignments and picks the the one with the minimal total sum. I used this algorithm to check the correctness of other assignment algorithms I implemented later.

Now it is great to know that there is another very fast and correct algorithm which is called Kuhn-Munkres [1] or Hungarian algorithm. The running time is O(n³) (10³=1000) and the ideas used in this algorithm are based on graph theory. Another optimization which one could use only in special cases is the incremental assignment algorithm with O(n²) (E.g. for “add one assignment” or “remove one assignment”)

And there are even faster algorithms such as the algorithm from Edmons and Carp [2]. For bi-partite graphs it runs in O(m n log n).

An interesting approach are the approximation assignment algorithms which can be a lot faster: O(n²) with a small constant. But the resulting assignments are not the best in every case.
A well known and easy to implemented algorithm the path growing algorithm from Drake and Hougardy [3] which works as follows:

currentMatching = matching0 = empty
matching1 = empty
while edges are not empty
   choose a node x from all nodes which has at least one neighbor
   while x has a neighbor
       find the heaviest edge e={x,y} incident to x
       add it to the currentMatching
       if(currentMatching == matching1) currentMatching = matching0
       else currentMatching = matching1
       remove x from the graph
       x:=y
   end
end

The good thing of this algorithm is that you can say sth. about the resulting assignment: the total sum is maximal twice times higher then the sum gained with an optimal algorithm. There are even better guarantees e.g. the one of Pettie and Sanders (~ maximal 3/2 times higher):

matching is empty
repeat k times
  choose a node x from all nodes at random
  matching := matching (+) aug(x)
end
return matching

Resources

You can grab the source of TimeFinder (Apache2 licensed) via svn:

svn checkout https://timefinder.svn.sourceforge.net/svnroot/timefinder/trunk timefinder

Then look in the package

timefinder-core/src/main/java/de/timefinder/core/algo/

To build it you have to use maven.

And finally here is the paper where the idea of a maximum network flow was already presented:
John van den Broek, Cor A. J. Hurkens, and Gerhard J. Woeginger. Timetabling problems at the TU Eindhoven. Electronic Notes in Discrete Mathematics, 25:27–28, 2006.

References

[1] Kuhn Munkres, “Algorithms for the Assignment and Transportation Problems”
[2] Edmonds and Karp, Theoretical improvements in algorithmic efficiency for network flow problems“.
[3] Drake and Hougardy, A Simple Approximation Algorithm for the Weighted Matching Problem, 2002
[4] Pettie and Sanders, A Simpler Linear Time 2/3 – epsilon Approximation for Maximum Weight Matching, Inf. Process. Lett. Vol 91, No. 6, 2004

Bird’s Eye View on ElasticSearch its Query DSL

I’ve copied the whole post into a gist so that you can simply clone, copy and paste the important stuff and even could contribute easily.

Several times per month there pop up questions regarding the query structure on the ElasticSearch user group.

Although there are good docs explaining this in depth probably the bird view of the Query DSL is necessary to understand what is written there. There is even already some good external documentation available. And there were attempts to define a schema but nevertheless I’ll add my 2 cents here. I assume you set up your ElasticSearch instance correctly and on the local machine filled with exactly those 3 articles.

Now we can query ElasticSearch as it is done there. Keep in mind to use the keyword analyzer for tags!

curl -X POST “http://localhost:9200/articles/_search?pretty=true” -d ‘
{“query” : { “query_string” : {“query” : “T*”} },
“facets” : {
“tags” : { “terms” : {“field” : “tags”} }
}}’

But when you now look into the query DSL docs you’ll only find the query part

{“query_string” : {
“default_field” : “content”,
“query” : “this AND that OR thus”
}}

And this query part can be replaced by your favourite query. Be it a filtered, term, a boolean or whatever query.

So what is the main structure of a query? Roughly it is:

curl -X POST “http://localhost:9200/articles/_search?pretty=true” -d ‘
{“from”: 0,
“size”: 10,
“query” : QUERY_JSON,
FILTER_JSON,
FACET_JSON,
SORT_JSON
}’

Keep in mind that the FILTER_JSON only applies to the query not to the facets. Read on how to do this. And now a short example how this nicely maps to the Java API:

SearchRequestBuilder srb = client.prepareSearch(“your_index”);
srb.setQuery(QueryBuilders.queryString(“title:test”));
srb.addSort(“tags”, SortOrder.ASC);
srb.addFacet(FacetBuilders.termsFacet(“tags”));
// etc -> use your IDE autocompletion function ;)

If you install my hack for ElasticSearch Head you can formulate the above query separation directly in your browser/in javascript. E.g.:

q ={ match_all:{} };
req = { query:q }

A more detailed query structure is as follows – you could easily obtain it via Java API, from the navigational elements from the official docs or directly from the source:

curl -X POST “http://localhost:9200/articles/_search?pretty=true” -d ‘
{“query” : QUERY_JSON,
“filter” : FILTER_JSON,
“from”: 0,
“size”: 10,
“sort” : SORT_ARRAY,
“highlight” : HIGHLIGHT_JSON,
“fields” : ["tags", "title"],
“script_fields”: SCRIPT_FIELDS_JSON,
“preference”: “_local”,
“facets” : FACET_JSON,
“search_type”: “query_then_fetch”,
“timeout”: -1,
“version”: true,
“explain”: true,
“min_score”: 0.5,
“partial_fields”: PARTIAL_FIELDS_JSON,
“stats” : ["group1", "group2"]
}’

Let us dig into a simple query with some filters and facets:
curl -XGET ‘http://localhost:9200/articles/_search?pretty=true’ -d ‘
{“query”: {
“filtered” : {
“query” : { “match_all” : {} },
“filter” : {“term” : { “tags” : “bar” }}
}},
“facets” : {
“tags” : { “terms” : {“field” : “tags”} }
}}’

You should get 2 out of the 3 articles and the filter directly applies on the facets as well. If you don’t want that then put the filter part under the query:

curl -XGET ‘http://localhost:9200/articles/_search?pretty=true’ -d ‘
{“query” : { “match_all” : {} },
“filter” : {“term” : { “tags” : “bar” }},
“facets” : {
“tags” : { “terms” : {“field” : “tags”} }
}}’

And how can I only filter on the facets? You’ll need facet_filter:

curl -XGET ‘http://localhost:9200/articles/_search?pretty=true’ -d ‘
{“query” : { “match_all” : {} },
“facets” : {
“mytags” : {
“terms” : {“field” : “tags”},
“facet_filter” : {“term” : { “tags” : “bar”}}
}
}}’

You’ll get 3 documents with filtered facets.

Hope this posts clarifies a bit and reduces your trouble. I’ll update the post according to your comments/suggestions. Let me know if you want something explained which is Query-DSL specific for all the other questions there is the user group.

Java guy reboots with C++ – Trying to understand Memory Mangement

Some random hints if you get started (in my case restarted) with C++ – let me know if I wrote something in wrong in my personal memos:

  1. Use RAII: Resource Acquisition Is Initialization
  2. Use scoped variables which gets destroyed automatically after leaving the block:
    MyClass myVar("Hello Memory");
  3. To use call by reference use the method declaration. But it might be that you are optimizing it when not necessary. Read about Want Speed? Pass by Value.
    void myMethod(MyClass& something)
  4. Understand The Rule of three: If you need to explicitly declare either the destructor, copy constructor or copy assignment operator yourself, you probably need to explicitly declare all three of them.
  5. Understand and avoid traps, understand shallow references.
  6. Understandheap vs. stack allocation. E.g. also that heap allocation times are much slower than allocations off the stack.
  7. Use heap allocation when you dynamically want to change memory usage of an object
  8. … or prefer stl (e.g. vector)
  9. … or when an object should be used outside of a method scope:All dynamically allocated memory must be released before the pointer (except smart pointers) pointing to it goes out of scope. So, if the memory is dynamically allocated for a variable within a function, the memory should be released within the function unless a pointer to it is returned or stored by that function.
  10. The = operator in auto_ptr works in a different to normal way!
  11. Read the FAQ or this light-FAQ
  12. oh my: auto_ptr, shared_ptr, smart_ptr, …!!?
  13. Here is a nice compilation of common possible pitfalls.

… but I’m still fighting a bit to understand the memory management problematic:

1. Are there some rules of thumb?

E.g.

  1. use RAI
  2. if you cannot apply rule 1. use shared_ptr
  3. if you cannot apply rule 1. use new + delete?

2. And how can I solve the problem in C++ when I return a local constructed object (via new) in Java?

E.g. the factory pattern there can look like

public static MyClass createObj() {
  return new MyClass()
}

3. And how would you e.g. put a vector with a lot of data into a different variable?

Can I rely on the ‘Pass by Value‘ thing which boost performance? Should I use tmpVectorObj.swap(vectorObj2) ?

4. How would you fill a vector within a method?

This is forbidden I think:

// declare vector<Node> vectorObj in the class
void addSomething(string tmp) {
  Node n(tmp);
  vectorObj.add(n);
}

5. What are the disadvantages of boosts smart pointers?

Ugly but most efficient setSize for ArrayList


private static final Field sizeField;

static {
    try {
        sizeField = ArrayList.class.getDeclaredField("size");
        sizeField.setAccessible(true);
    } catch (Exception ex) {
        throw new RuntimeException(ex);
    }
}

// 'setSize'
public static <T> void growSize(final ArrayList<T> list, final int maxSize) {
    if (maxSize <= list.size())
        return;

    list.ensureCapacity(maxSize);
    try {
        sizeField.setInt(list, maxSize);
    } catch (Exception ex) {
        throw new RuntimeException("Problem while setting private size field of ArrayList", ex);
    }
}

Here is the reported issue and here a discussion. When you need decreasing the size too then have a look into Vector.setSize

A less hacky but also less efficient version would be:

public static <T> void growSizeSlower(final ArrayList<T> list, final int maxSize) {
    if (maxSize <= list.size())
        return;

    list.addAll(new AbstractList<T>() {

        @Override public Object[] toArray() {
            return new Object[maxSize - list.size()];
        }

        @Override public T get(int index) {
            throw new UnsupportedOperationException("Not supported yet.");
        }

        @Override public int size() {
            throw new UnsupportedOperationException("Not supported yet.");
        }
    });
}

EC2 = Easy Cloud 2.0? Getting started with the Amazon Cloud

If you are command line centric guy like me and you are on Ubuntu this post is for you.Getting starting with Amazon was a pain for me although once you understand the basics it is relativly easy. BTW: there are of course also other cloud systems like Rackspace or Azure.

If you want the official Ubuntu LTS Server (currently 10.04) running in the Amazon Cloud you can do:

ec2-run-instances ami-c00e3cb4 --region eu-west-1 --instance-type m1.small --key amazon-key

or go to this page and pick a different AMI. Hmmh, you are already sick of all the wording like AMI, EC2 and instances? Ok,

lets digg into the amazon world.

Let me know if I have something missing or incorrect:

  • AMI: Amazon Machine Image. This is a highly tuned linux distribution in our case and we can choose from a lot of different types – e.g. on this page.
  • EC2: Elastic Compute Cloud – which is a highly scalable hosting solution where you have root access to the server. You can choose the power and RAM of that instance (‘a server’) and start and stop instances as you like. In Germany Amazon is relative expensive compared to existing hosting solutions (not that case in the US). And since those services can also easy scale there is nearly no advantage of using Amazon or Rackspace.
  • EBS: Elastic block storage – This is where we store our data. An EBS can be attached to any instance but in my case I don’t need a separate volume I just can use the default EBS mounted at /mnt with ~150 GB or even the system partition / with ~8 GB. From wikipedia:
    EBS volumes provide persistent storage independent of the lifetime of the EC2 instance, and act much like hard drives on a real server.
    Also if you choose storage of type ‘ebs’ your instance can be stopped. If it is of type instance-store you could only clone the AMI and terminate. If you try to stop it you’ll get “The instance does not have an ‘ebs’ root device type and cannot be stopped.”
  • A running instance is always attached to one key (a named public key). Once started you cannot change it.
  • S3: Simple Storage Service. Can be used for e.g. backup purposes, has an own API (REST or SOAP). Not covered in this mini post.
  • Availability zone: The datacenter location e.g. eu-west-1 is Ireland or us-west-2 is Oregon. The advantage of having different regions/zones is that if one datacenter crashes you have a fall back in a different. But the big disadvantage of different zones is that e.g. transfering your customized AMIs to a different region is a bit complex and you’ll need to import your keys again etc.

But even now after ‘understanding’ of the wording it is not that easy to get started and e.g. the above command will not work out of the box.

To make the above command working you’ll need:

  1. An Amazon Account and a lot of money ;) or use the micro instance which is free for one year and for a fresh account IMO
  2. The ec2 tools installed locally: sudo apt-get install ec2-api-tools
  3. The amazon credentials stored and added to your ssh-agent:
    export EC2_PRIVATE_KEY=/home/user/.ssh/certificate-privatekey.pem
    export EC2_CERT=/home/user/.ssh/certificate.pem
  4. Test the functionality via
    ec2-describe-instances –region eu-west-1
  5. Now you need to create a key pair and import the public one into your account (choose the right availability zone!)
    Aws Console -> Ec2 -> Network & Security -> Key Pairs -> Import Key Pair and choose amazon-key as name
  6. Then feed your local ssh-agent with the private key:
    ssh-add /home/user/.ssh/amazon-key
  7. Now you should be able to run the above command. To view the instance from the web UI you’ll have to refresh the site.
  8. Open port 22 for the default security group:
    Aws Console -> Ec2 -> Network & Security -> Security Groups -> Click on the default one and then on the ‘inbound’ Tab -> type ’22′ in port range -> Add Rule -> delete the other configurations -> Apply Rule Changes
  9. Now try to login
    ssh ubuntu@ec2-your-machine.amazonaws.com
    For the official amazon AMIs you’ll have to use ec2-user as login

That was easy :) No?

Ok, now you’ll have to configure and install software as you like e.g.
sudo apt-get update && sudo apt-get upgrade -y

To proceed further you could

  • Attach a static IP to the instance so that external applications do not need to be changed after you moved the instance – or use that IP for your load balancer – or use the Amazon load balancer etc:
    Aws Console -> Ec2 -> Network & Security -> Elastic IPs -> Allocate New Address
  • Open some more ports like port 80
  • Or you could create an AMI of your already configured system. You can even publish this custom AMI.
  • Run ElasticSearch as search server in the cloud e.g. even via a debian package which makes it very easy.

Now if you have several instance and you want to

update software on all machines.

How would you do that? Here is one possibility

ips=`ec2-describe-instances --region eu-west-1 | grep running | cut -f17 | tr '\n' ' '`

for IP in $ips
do
 echo UPDATING $IP;
 ssh -A ubuntu@$IP "cd /somewhere; bash ./scripts/update.sh";
done

Shortest Code for a Simple Calculator on Android

String RESULT;
String input = "(1+3)/4 * 2 - 7";
...
webSettings.setJavaScriptEnabled(true);
...
webView.addJavascriptInterface(new JavaScriptInterface() {
   public void returnResult(String o) {
       RESULT = o;
   }}, "JavaCallback"));
webView.loadUrl("javascript:window.JavaCallback"
   + ".returnResult("+input+")");
// now RESULT is -5

Is there a shorter one? BTW: this is only a sketch not sure if I miss a bracket somewhere …

Logitech USB Headset volume is low [Ubuntu] – Fixing this with a script

There is a nasty bug in Ubuntu – even in 10.* and 11.*! But there is a simple fix via alsamixer. The problem is that the volume is always near 0 when when plugging in the device again. So this mini post is how to find out the command to increase the volume which you can execute e.g. on startup. First, you need to find out which card the device has in my case its number 1, then you need to list the controls:

$ amixer -c 1 scontrols
Simple mixer control ‘Speaker’,0
Simple mixer control ‘Mic’,0

As last step increase the volume of one or more control:
$ amixer -c 1 sset ‘Speaker’,0 90% 90%
Simple mixer control ‘Speaker’,0
Capabilities: pvolume pswitch pswitch-joined penum
Playback channels: Front Left – Front Right
Limits: Playback 0 – 44
Mono:
Front Left: Playback 40 [91%] [2.72dB] [on]
Front Right: Playback 40 [91%] [2.72dB] [on]

Now there is only one problem: how to automatically switch from 0 to my USB device 1? Here is the solution

pacmd list-sinks | grep index
pacmd set-default-sink

or the full script:
amixer -c 1 sset ‘Speaker’,0 70% 70%
amixer -c 1 sset ‘Mic’,0 70% 70%

# switch mic
sources=($(pacmd list-sources | grep index | awk ‘{ if ($1 == “*”) print “1″,$3; else print “0″,$2 }’))
[[ ${sources[0]} = 0 ]] && swap=${sources[1]} || swap=${sources[5]}
echo $swap
pacmd set-default-source $swap &> /dev/null

# switch audio
sinks=($(pacmd list-sinks | grep index | awk ‘{ if ($1 == “*”) print “1″,$3; else print “0″,$2 }’))
[[ ${sinks[0]} = 0 ]] && swap=${sinks[1]} || swap=${sinks[3]}
pacmd set-default-sink $swap &> /dev/null

Now have a look here where it is described how to call this script when the device is plugged in.

Jetslide uses ElasticSearch as Database

This post explains how one could use the search server ElasticSearch as a database. I’m using ElasticSearch as my only data storage system, because for Jetslide I want to avoid maintenance and development time overhead, which would be required when using a separate system. Be it NoSQL, object or pure SQL DBs.

ElasticSearch is a really powerfull search server based on Apache Lucene. So why can you use ElasticSearch as a single point of truth (SPOT)? Let us begin and go through all – or at least my – requirements of a data storage system! Did I forget something? Add a comment :) !

CRUD & Search

You can cread, read (see also realtime get), update and delete documents of different types. And of course you can perform full text search!

Multi tenancy

Multiple indices are very easy to create and to delete. This can be used to support several clients or simply to put different types into different indices like one would do when creating multiple tables for every type/class.

Sharding and Replication

Sharding and replication is just a matter of numbers when creating the index:

curl -XPUT 'http://localhost:9200/twitter/' -d '
index :
    number_of_shards : 3
    number_of_replicas : 2'

You can even update the number of replicas afterwards ‘on the fly’. To update the number of shards of one index you have to reindex (see the reindexing section below).

Distributed & Cloud

ElasticSearch can be distributed over a lot of machines. You can dynamically add and remove nodes (video). Additionally read this blog post for information about using ElasticSearch in ‘the cloud’.

Fault tolerant & Reliability

ElasticSearch will recover from the last snapshot of its gateway if something ‘bad’ happens like an index corruption or even a total cluster fallout – think time machine for search. Watch this video from Berlin Buzz Words (minute 26) to understand how the ‘reliable and asyncronous nature’ are combined in ElasticSearch.

Nevertheless I still recommend to do a backup from time to time to a different system (or at least different hard disc), e.g. in case you hit ElasticSearch or Lucene bugs or at least to make it really secure :)

Realtime Get

When using Lucene you have a real time latency. Which basically means that if you store a document into the index you’ll have to wait a bit until it appears when you search afterwards. Altought this latency is quite small: only a few milliseconds it is there and gets bigger if the index gets bigger. But ElasticSearch implements a realtime get feature in its latest version, which makes it now possible to retrieve the object even if it is not searchable by its id!

Refresh, Commit and Versioning

As I said you have a realtime latency when creating or updating (aka indexing) a document. To update a document you can use the realtime get, merge it and put it back in the index. Another approach which avoids further hits on ElasticSearch, would be to call refresh (or commit in Solr) of the index. But this is very problematic (e.g. slow) when the index is not tiny.

The good news is that you can again solve this problem with a feature from ElasticSearch – it is called versioning. This an identical to the ‘application site’ optimistical locking in the database world. Put the document in the index and if it fails e.g. merge the old state with the new and try again. To be honest this requires a bit more thinking using a failure-queue or similar, but now I have a really good working system secured with unit tests.

If you think about it, this is a really huge benefit over e.g. Solr. Even if Solrs’ raw indexing is faster (no one really did a good job in comparing indexing performance of Solr vs. ES) it requires a call of commit to make the documents searchable and slows down the whole indexing process a lot when comparing to ElasticSearch where you never really need to call the expensive refresh.

Reindexing

This is not necessary for a normal database. But it is crucial for a search server, e.g. to change an analyzer or the number of shards for an index. Reindexing sounds hard but can be easily implemented even without a separate data storage in ElasticSearch. For Jetslide I’m storing not single fields I’m storing the entire document as JSON in the _source. This is necessary to fetch the documents from the old index and put them into the newly created (with different settings).

But wait. How can I fetch all documents from the old index? Wouldn’t this be bad in terms of performance or memory for big indices? No, you can use the scan search type, which avoids e.g. scoring.

Ok, but how can I replace my old index with the new one? Can this be done ‘on the fly’? Yes, you can simply switch the alias of the index:

curl -XPOST 'http://localhost:9200/_aliases' -d '{
"actions" : [
   { "remove" : { "index" : "userindex6", "alias" : "userindex" } },
   { "add" : { "index" : "userindex7", "alias" : "uindex" } }]
}'

Performance

Well, ElasticSearch is fast. But you’ll have to determine for youself if it is fast enough for your use case and compare it to your existing data storage system.

Feature Rich

ElasticSearch has a lot of features, which you do not find in a normal database. E.g. faceting or the powerful percolator to name only a few.

Conclusion

In this post I explained if and how ElasticSearch can be used as a database replacement. ElasticSearch is very powerfuly but e.g. the versioning feature requires a bit handwork. So working with ElasticSearch is comparable more to the JDBC or SQL world not to the ORM one. But I’m sure there will pop up some ORM tools for ElasticSearch, although I prefer to avoid system complexity and will always use the ‘raw’ ElasticSearch I guess.

Introducing Jetslide News Reader

Flattr this

We are proud to announce the release of our Jetslide News Reader today! We know that there are a lot services aggregating articles from your twitter timeline such as the really nice tweetedtimes.com or paper.li. But as a hacker you’ll need a more powerful tool. You’ll need Jetslide. Read on to see why Jetslide is different and read this feature overview. By the way: yesterday we open sourced the content extractor called snacktory.

Jetslide is different …

… because it divides your ‘newspaper’ into easily navigatable topics and Jetslide prints articles from your timeline first! So you are following topics and not (only) people. See the first article which was referenced by a twitter friend and others, but it also prints articles from public. See the second article, where the highest share count (187) comes from digg. Click to view the reality of today or browse older content with the links under the articles:

Jetslide is smart …

… enough to skip duplicate articles and enhance your topics with related material. The relavance of every article is determined by an advanced algorithm (number of shares, quality, tweed, your browser language …) with the help of my database ElasticSearch – more on this in a later blog post.

And you can use a lot of geeky search queries to get what you want. Read about it here which lists some more details about this.

Jetslides are social

As pointed out under ‘Jetslide is different’ you’ll see articles posted in your twitter timeline first. But there is another features which make Jetslide more ‘social’. First, you get suggestions of users if they have the same or similar interests stored in their Jetslide. And second, Jetslide enables you to see others’ personal jetslide when adding e.g. the  parameter owner=timetabling to the url:

http://jetsli.de/?owner=timetabling

Jetslides means RSS 3.0

You can even use the boring RSS feed:

http://jetsli.de/rss/owner/timetabling/time/today

But this is less powerful. The recommended way to ‘consume’ your topics is via RSS 3.0 ;)

Log in to Jetslide and select “Read Mode:Auto”. Then every time you hit the ‘next’ arrow (or CTRL+right) the current viewed articles will be marked as read and only newer articles will pop up the next time you slide through. This way you can slide through your topics and come back everytime you want: after 2 hours or after 2 days (at the moment up to 7 days). In Auto-Read-Mode you’ll always see only what you have missed and what is relevant!

This is the most important point why we do not call Jetslide a search engine but a news service.

Jetslides are easily shareable

… because a Jetslide is just an URL – viewable on desktops,  smartphones and even WAP browsers (left):

All suggestions are very welcome! Or chat with us!

Next Page »



Follow

Get every new post delivered to your Inbox.

Join 162 other followers