Digitalizing GPX Points or How to Track Vehicles With GraphHopper

Recently I was asked again if I know a method, or even better a tool, which makes GPX points more align to the existing streets in OpenStreetMap. E.g. currently in Mapillary all recorded streets are not really smooth and digitalized if you look at them in detail. GraphHopper cannot do this out of the box but provides a toolchain which will make this digitalization easier.

Update: this is now open source.

Map Matching

The necessary process before an import would be Map Matching. Quoting (my own words ;)) from Wikipedia:

Map matching is the process of associating a sorted list of user or vehicle positions to the road network on a digital map. The main purposes are to track vehicles, analyze traffic flow and finding the start point of the driving directions.

Or as a picture:


I’m explaining in this blog post the simple steps which can taken to implement something like this with a very simple method. This is not an invention by me, other papers suggesting a similar technique. Also we should use some open testing data additionally to the normal unit tests. Some wordings are necessary before we go deeper:

  • A GPS point is a tuple or object which contains latitude, longitude (, elevation) at which position it was recorded.
  • A GPX point is a tuple or object which contains latitude, longitude (, elevation) and the timestamp at which it was recorded. And a GPX list ist a list of GPX points.
  • In the digital world of maps we need a data structure to handle the road network. This data structure is a graph where the junctions from real world are nodes and the streets between them are the edges.

The question is also why would we need a routing engine at all?


  • firstly you don’t need to write the import process of OpenStreetMap and the closest edge lookup and probably some other tools necessary for map matching and
  • secondly the routing process will fill gaps between larger distances of two GPX points, and will be able to guess those gaps depending on the vehicle and
  • thirdly using a routing engine will make the outcome more realistic, e.g. avoiding going left and then returning the identical street to go further in the straight direction.
  • Lastly with GraphHopper you’ll be easily able to provide such a procedure even for the world wide case as it stores everything very compact and can be used in-memory or on-disc.

We start with an example: example1a

All maps are from Lyrk and data is from OpenStreetMap contributors

In the left map you see the street parts highlighted where only the edges closest to the GPX points are selected. Whereas in the right map the most probable way was taken via a routing engine like GraphHopper.

The simple algorithm

The results for the map on the left side are easy to get, see the gist for example 1. For the map on the right side can try the following procedure explained in pseudocode here or see example 2 as gist for a more complete one:

  1. Find 3 closest edges per GPX point
  2. Associate every edge with a weighting, which is the distance from the edge to the point. The larger the distance the smaller is the probability that the best path goes through it.
  3. Find the best path from the first GPX point to the last, but take into account the weighting and therefor the distance to the GPX points
  4. As a post-processing step you need to assign the GPX points to the resulting path, you can do so via the edge ids and to find the coordinates you can either use path.calcPoints or the lower level API ala edgeState.fetchWayGeometry.

This simple algorithm should work in the most cases and is very fast

The enhancements

There are two limitations of this simple algorithm:

  • If there are loops in the GPX list, then the Dijkstra from step 3 will throw data away
  • and it will happen that another route than the route marked with GPX points is taken. E.g. in the case if the GPX signal is not that frequent and the edge of one point is rather short, which means it will have a small influence on the overall weighting of the taken route.

The following improvements are possible

  • You could do a simple workaround and cut the incoming GPX list when you detect a loop and feed those GPX-list-parts to the algorithm.
  • Or find the best route not from start to end but ‘step by step’. For example from pointX to pointX+4, then pointX+4 to pointX+8 and so on, where you need to merge the results. To avoid suboptimal paths at the intersection (here pointX+4) you’ll need to ‘overlap’ the routes. So you will need to calculate the routes for pointX to pointX+4, then pointX+3 to pointX+7, …. instead. Merging will be more complex probably.
  • A more complex solution could be to maintain a list of e.g. 3 alternative end-results. As you already get 3 edge candidates for every GPX point you can then try to find all routes between them. You could think that this will slow down the search a lot but with GraphHopper you can tweak e.g. the Dijkstra or DijkstraOneToMany algorithm to do this in an efficient manner. E.g. if you want to get the best result but you want to search from multiple destinations you can do: DijkstraBidirectionRef dijkstra = new DijkstraBidirectionRef(hopper.getGraph(), flagEncoder, weighting); dijkstra.initFrom(nodeFrom, distanceToPossibility1); dijkstra.initFrom(node, distanceToPossibility2); … dijkstra.initTo(nodeTo, distanceToEndPossibility1); dijkstra.initTo(nodeTo, distanceToEndPossibility2); …

There are lots of other problems which you’ll encounter, that is the ugly real life with GPS signal errors, tunnels and so on. One example problem is if you are tracking e.g. a truck waiting in front of traffic lights in a one way street. You’ll see several nearly identical GPX points but as they have an error associated they won’t be identical and it could happen that they go ‘backwards’ in the one-way which will result in an extra loop if use the ‘step-by-step’ variant of the shortest path calculation:


Languator: Language Detection for Local Queries

When you have an address lookup service like photon you need language detection for the short local queries like “Berlin Alexanderplatz” to e.g. apply a specific analysis chain. Of course German language is rather simple, but detecting if it is French or Italian etc is much more complex.

I tried the language-detection from google code. And although it has high precision for normal text, it is relative slow (creating the profile and querying) and not specific to local queries although short generic queries should work (e.g. Twitter data). Another problem is the complex and inflexible Java API as it uses lots of statics etc. E.g. the size of the n-grams is not configurable.

So I hacked together a new language detection called languator in Java specific for short local queries and fed it with OpenStreetMap data. E.g. I used the germany.pbf to create the German language profile and france.pbf for the French profile and so on. If you keep in mind that you have lots of foreign names for POI and street in every country the following error values are okayish and better than what I found from other tools:

English: 22.9%, German: 7%, French: 20%, Italian: 18.8%

Another good thing is that you can feed languator with these 4 countries in 20 minutes and create millions of queries from those countries which takes ~10 minutes, which is at least 2 times faster than the google code tool.

I invite you to beat my numbers: fork and have fun!

GraphHopper Maps 0.1 – High Performance and Customizable Routing in Java

Today we’re proud to announce the first stable release of GraphHopper! After over a year of busy development we finally reached version 0.1!

GraphHopper is a fast and Open Source road routing engine written in Java based on OpenStreetMap data. It handles the full planet on a 15GB server but is also scales down and can be embedded into your application! This means you’re able to run Germany-wide queries on Android with only 32MB in a few seconds. You can download the Android offline routing demo or have a look at our web instance which has world wide coverage for car, bike and pedestrian:

GraphHopper Java Routing

The trip to the current state of GraphHopper was rather stony as we had to start from scratch as there is currently no fast Java-based routing engine. What we’ve built is quite interesting as it shows that a Java application can be as fast as Bing or Google Maps (in 2011) and beats YOURS, MapQuest and Cloudmade according to the results outlined in a Blog post from Pascal and with tests against GraphHopper – although OSRM is still ahead. But how can a Java application be so fast? One important side is the used algorithm: Contraction Hierarchies – a ‘simple’ shortcutting technique to speed up especially lengthy queries. But even without this algorithm GraphHopper is fast which is a result of weeks of tuning for less memory consumption (yes, memory has something to do with speed), profiling and tweaking. But not only the routing is fast and memory efficient also the import process. And it should be easy to get started and modify GraphHopper to your needs.

Why would you use GraphHopper?

GraphHopper could be especially useful for more complicated or custom shortest/best path projects. E.g. if you need

  • to embed GraphHopper or only parts of it directly within your Java application, which is easily possible due to the Apache 2 license.
  • offline queries for your Android application
  • highly customized routing (like horse routing – see below) where Google/Bing API calls aren’t sufficient or even possible
  • many to many queries
  • the shortest path tree(s) directly

… you should tell us on the mailing list what you need!


GraphHopper is a young project but it makes great strides and it is already used in GPSies and in more places (cannot disclose all yet).

Last but not least I would like to thank NopMap for his work regarding OSM import and tuning, elevation data and much more! You can try out his horse routing prototype based on GraphHopper at the German (“trail riding map”)!

See the description on how you can contribute.

Have fun!

Make Your Dijkstra Faster

Today I did a bit of research for GraphHopper and I stumbled over yet another minor trick which could speed up the execution of the Dijkstra algorithm. Let me shortly introduce this shortest path algorithm:


If you need the path (and not only the shortest path tree) you will give the method an additional toNode parameter and compare this to distEntry.node to break the loop. When it was found you need to recursivly extract the path from the last distEntry.parent reference.

So, what should we improve?

Regarding performance I’ve already included a Map to directly get the DistanceEntry from a node, otherwise you would need to search it in the PriorityQueue which is too slow. Also Wikipedias says that we could use a Fibonacci heap which are optimal to decrease the key (aka weight) but those are very complicated to implement and memory intensive.

It turned out that you can entirely avoid the ‘decrease key’ operation if you do a visited.contains check after polling from the queue. This makes your heap bigger but you can avoid the costly update operation and use simpler data structures. Read the full paper “Priority Queues and Dijkstra’s Algorithm”.

What else can we improve?

Now we can tune some data structures:

  1. Make sure that you a traversing your graph with full speed. E.g. using just the graph in-memory without any persistence storage dependency could massivly improve performance. Also if you use node indices (pointing to an array) instead of node objects you can reduce memory consumption and e.g. use a BitSet instead of a set for the visited collection.
  2. In case your heap is relative big (>1000 entries) like for multi-dimensional graphs and even for plane graphs then a cached version with 2 or more stages could give you a 30% boost. When you like more complicated and efficient solutions you could implement the probably faster sequence heap and others.
  3. If you have a limited range of weights/keys you can try a TreeMap<Key, Set<Node>> which could speed up your code by roughly 10% if you heavily use the decreaseKey method.

For road networks and others you can apply A* which reduces the amount of visited nodes via guessing where the goal is – still the path is optimal IF the real path is longer than to what you guessed (e.g. use direct linear distance in road networks which is always smaller to the real distance):


Additionally if you accept some less optimal solutions you can apply heuristics like “don’t explore that much more nodes if you’r close the destination”.

If you don’t want less optimal paths and still want it faster you could

Running Shortest-Path Algorithms on the German Road Network within a 1.5GB JVM

Update: With changes introduced in January 2013 you only need 1GB – live demo!

In one of my last blog posts I wrote about memory efficient ways of coding Java. My conclusion was not a bright one for Java: “This time the simplicity of C++ easily beats Java, because in Java you need to operate on bits and bytes”. But I am still convinced that in nearly every other area Java is a good choice. Just some guys need to implement the dirty parts of memory efficient data structures and provide a nice API for the rest of the world. That’s what I had in mind with


GraphHopper does not attack memory efficient data structures like Trove4j etc. Instead it’ll focus on spatial indices, routing algorithms and other “geo-graph” experiments. A road network can already be stored and you can execute Dijkstra, bidirectional Dijkstra, A* etc on it.

Months ago I took the opportunity and tried to import the full road network of Germany via OSM. It failed. I couldn’t make it working in the first days due to massive RAM usage of HashMaps for around 100 mio data points (only 33 mio are associated to roads though). Even using only trove4j brought no success. When using Neo4J the import worked, but it was very slow and when executing algorithms the memory consumption was too high when too many nodes where requested (long paths).

Then after days I created a memory mapped graph implementation in GraphHopper and it worked too. But the implementation is a bit tricky to understand, not thread safe (even not for two reading threads yet), slower compared to a pure in-memory solution. But even more important: the speed was not very predictable and very ugly to debug if off-heap memory got rare.

I’ve now created a ‘safe’ in-memory graph, which saves the data after import and reads once before it starts. At the moment this is read-thread-safe only, as full thread safety would be too slow and is not necessary (yet).

Now performance wise on this big network, well … I won’t talk about the speed of a normal Dijkstra, give me some more time to improve the speed up technics. For a smaller network you can see below that even for this simplistic approach (no edge-contraction or edge-reduction at all) the query time is under 150ms and will be under 100ms for bidirectional A* (w/o approximation!), I guess.

In order to perform realistic route queries on a road network we would like to satisfy two use cases:

  1. Searching addresses (cities, streets etc)
  2. Clicking directly on the map to create a route query

The first one is simple to solve and it is very unlikely to avoid tons of additional RAM. But we can solve it very easy with ElasticSearch or Lucene: just associate the cities, streets etc to the node ids of the graph.

The second use case requires more thinking because we want it memory efficient. A normal quad-tree is not a good choice as it requires too many references. Even for a few million data points it requires several dozens of MB in addition to the graph! E.g. 80MB for only 4 mio nodes.

The solution is to use a raster over the area – which can be a simple array addressed by spatial keys. And per quadrant (aka tile) we store one array index of the graph as entry point. (In fact this is a quad-tree of depth one!) Then when a click on the map happened, we can calculate the spatial key from this point (A), then get the entry point from the array and traverse the graph to get the point in the graph which is the closest one to point A. Here is an implementation, where only one problem remains (which is solved in the new index).

Unfair Comparison

In the last days, just for the sake of fun, I took Neo4J and ran my bidirectional Dijkstra with a small data set – unterfranken (1 mio nodes). GraphHopper is around 8 times faster and uses 5 times less RAM:

The lower the better – it is the mean time in seconds per run on this road network where two of the algorithms (BiDijkstraRef, BiDijkstra) are used. The number in brackets is the actually used memory in GB for the JVM. The lowest possible memory for GraphHopper was around 160MB, but only for the more memory friendly version (BiDikstra).

For all Neo4J-Bashers: this is not a fair comparison as GraphHopper is highly specialized and will only be usable for 2D networks (roads, public transport) and it is also does not have transaction support etc. as pointed out by Michael:

But we can learn that sometimes it is really worth the effort to create a specialized solution. The right tools for the right job.


Although it is not easy to create memory efficient solutions in Java, with GraphHopper it is possible to import (2.5GB) and use (1.5GB) a road network of the size of Germany on a normal sized machine. This makes it possible to process even large road networks on one machine and e.g. lets you run algorithms on even a single, small Amazon instance. If you reduce memory usage of your (routing) application you are also very likely to avoid garbage collection tuning.

There is still a lot room to optimize memory usage and especially speed, because there is a lot of research about road networks! You’re invited to fork & contribute!

Failed Experiment: Memory Efficient Spatial Hashtable

The Background of my Idea

The idea is to use a hash table for points (aka HashMap in Java) and try to implement neighbor searches. First of all you’ll need to understand what a spatial key is. Here you can read the details, but in short it is a binary Geohash where you avoid the memory inefficient base 32 representation.

Now that we have the spatial key, you can think about an array which is used to map from indices (like spatial keys) to values. This is the simplest representation of a spatial index as we don’t need to store the keys at all. But it is only memory efficient iff we would have no empty entries, which is very unlikely for clustered, real-world GIS-data.

If we would solve this with a normal hash table we encounter two problems:

  • It is very unlikely that points in the same area come into the same hash bucket – making neighborhood searches slow i.e. O(n)
  • It would be necessary to store the entire point – not only the associated value. Otherwise it would be impossible in case of a hash-collision to detect which point belongs to which value.

My idea is to use parts of the spatial key for the hashcode and avoid storing the entire key. It is implemented in Java (open source and available at GitHub).

As you can see we’re still using an array of buckets and a “somehow” converted spatial key to get

The Bucket Index

Let me explain the necessary bucket index in more detail (see picture above on the right).

We skip the beginning bits of every spatial key as it is  identical for an area a lot smaller than the world boundaries like Germany.

If we use the first part of the spatial key – in the picture identified as x, then the array is small enough. But with some real world data (also available as osm) this is not sufficient. Too many overflows would happen, some buckets would have several thousands entries! If we move the used part a bit to the right this gets a lot better e.g. for 4 entries per bucket we have a RMS error of about 2.

We now have a form of

A Hashtable & Quadtree Mixture

We can tune if our data structure behaves like a quad tree or a hash table. When moving the bits taken from spatial key to the left we get an quad tree-like characteristics. Taking the bits more from the right we get hash table-like characteristics.

This would be fine if we have massive data. But we need to make this approach practical also e.g. for only 2 mio data points. Because the part of the spatial key is only 19 bits long: if we assume 4 entries per bucket we come to approx. 2 mio (4 * 2^19 = 4 * 524 288). So the bucket index alone is too short. The solution to this problem is to do a bit-operation of the left and the right part of the spatial key:

bucketIndex = x ^ y

Further reduce memory consumption

Besides the fact that we now have some kind of a pointer-less or linear quad tree we can further reduce the memory footprint. We store only the required part (e.g. all bits except y) and not the full spatial key. For this it was necessary that our bit operation (or more generic “hashing scheme”) is reversible. Ie.: we can regain the full spatial key from only the bucket index and the stored part of the key. And in our case the x XOR y it reversible. In fact this memory reduction can be applied to any hashing procedure which fulfills this ‘reversible’ requirement.

Speed of Neighbor Queries is Bad

Neighborhood searches are very slow, slower than I expected. The naiv approach resulted in 60 seconds for a 10km search – 30 times slower as it would take to process all 2 mio entries. When tuning the overflow schema we are now a bit under 2 seconds. Still 10 times slower than a normal quad tree and as slow as processing all entries. The reason for why this storage is only good for get and put operations is that the same bucket needs to be parsed several times: as the same bucket index needs to be the home for several different locations – yeah, exactly as intended.

The good news are:

1. When I was moving the bucket-index-window a lot to the left it gets faster and faster, but took dozens of seconds to create the storage due to the heavy overflow number even for my small data set (2 mio). It could be improved a bit when applying different overflow strategies e.g. not using a linear overflow but skipping every two buckets or others.

2. Even in this state this idea can be used as a memory efficient spatial key-value storage without the neighbor search. E.g. you already have a graph of roads but you need an entrance like a HashMap<Point,NodeId> for it, then our data structure is an efficient hash table. Also doing a simple rectangular neighbor search should be fast: requesting only the 8 surrounding bounding boxes. Then no tree traversal is necessary and every box can be done with just a loop through the bucket array.

3. Another possibility is to use a small quadtree as an entry (mapping spatial keys to ids) for a 2D-graph. Then traversing this graph to find neighbors. This way I’ve finally chosen as I already needed a road-network for Dijkstra. So I only need additional 10MB for the small quadtree inex, see a possible next blog entry.

4. I’m not alone – you can take my idea and try implementing a more efficient neighbor searches yourself 🙂 !


In this post I’ve explained how to create a spatial hash table which is optimized in memory usage. This is achieved combining two ideas: using a hash table-alike data structure still ‘somehow suited’ for neighborhood searches and reducing the amount of memory while storing only parts of the hashkey. This second idea could be applied on every kind of hash table but only if the hashkey creation is reversible.

The ideas are implemented in Java for the GraphHopper project –  see the geohash package. Sadly the perfomance for neighbor searches is really bad. Which created a different solution in my mind (see point 3 of good news).


In the literature similar data structures are called linear or pointer-less quad trees. After this experiment I come to the conclusion that the best way to implement a memory efficient spatial storage which is also able to perform fast neighbor queries could be a prefix quad tree. Still using pointers but storing two bits in very branch node and avoid those bits in the leaf nodes. Ongoing work for this is done currently in Spatial4J & Lucene 4.0 – actually without the use of spatial keys.