The Background of my Idea
The idea is to use a hash table for points (aka HashMap in Java) and try to implement neighbor searches. First of all you’ll need to understand what a spatial key is. Here you can read the details, but in short it is a binary Geohash where you avoid the memory inefficient base 32 representation.
Now that we have the spatial key, you can think about an array which is used to map from indices (like spatial keys) to values. This is the simplest representation of a spatial index as we don’t need to store the keys at all. But it is only memory efficient iff we would have no empty entries, which is very unlikely for clustered, real-world GIS-data.
If we would solve this with a normal hash table we encounter two problems:
- It is very unlikely that points in the same area come into the same hash bucket – making neighborhood searches slow i.e. O(n)
- It would be necessary to store the entire point – not only the associated value. Otherwise it would be impossible in case of a hash-collision to detect which point belongs to which value.
My idea is to use parts of the spatial key for the hashcode and avoid storing the entire key. It is implemented in Java (open source and available at GitHub).
As you can see we’re still using an array of buckets and a “somehow” converted spatial key to get
The Bucket Index
Let me explain the necessary bucket index in more detail (see picture above on the right).
We skip the beginning bits of every spatial key as it is identical for an area a lot smaller than the world boundaries like Germany.
If we use the first part of the spatial key – in the picture identified as x, then the array is small enough. But with some real world data (also available as osm) this is not sufficient. Too many overflows would happen, some buckets would have several thousands entries! If we move the used part a bit to the right this gets a lot better e.g. for 4 entries per bucket we have a RMS error of about 2.
We now have a form of
A Hashtable & Quadtree Mixture
We can tune if our data structure behaves like a quad tree or a hash table. When moving the bits taken from spatial key to the left we get an quad tree-like characteristics. Taking the bits more from the right we get hash table-like characteristics.
This would be fine if we have massive data. But we need to make this approach practical also e.g. for only 2 mio data points. Because the part of the spatial key is only 19 bits long: if we assume 4 entries per bucket we come to approx. 2 mio (4 * 2^19 = 4 * 524 288). So the bucket index alone is too short. The solution to this problem is to do a bit-operation of the left and the right part of the spatial key:
bucketIndex = x ^ y
Further reduce memory consumption
Besides the fact that we now have some kind of a pointer-less or linear quad tree we can further reduce the memory footprint. We store only the required part (e.g. all bits except y) and not the full spatial key. For this it was necessary that our bit operation (or more generic “hashing scheme”) is reversible. Ie.: we can regain the full spatial key from only the bucket index and the stored part of the key. And in our case the x XOR y it reversible. In fact this memory reduction can be applied to any hashing procedure which fulfills this ‘reversible’ requirement.
Speed of Neighbor Queries is Bad
Neighborhood searches are very slow, slower than I expected. The naiv approach resulted in 60 seconds for a 10km search – 30 times slower as it would take to process all 2 mio entries. When tuning the overflow schema we are now a bit under 2 seconds. Still 10 times slower than a normal quad tree and as slow as processing all entries. The reason for why this storage is only good for get and put operations is that the same bucket needs to be parsed several times: as the same bucket index needs to be the home for several different locations – yeah, exactly as intended.
The good news are:
1. When I was moving the bucket-index-window a lot to the left it gets faster and faster, but took dozens of seconds to create the storage due to the heavy overflow number even for my small data set (2 mio). It could be improved a bit when applying different overflow strategies e.g. not using a linear overflow but skipping every two buckets or others.
2. Even in this state this idea can be used as a memory efficient spatial key-value storage without the neighbor search. E.g. you already have a graph of roads but you need an entrance like a HashMap<Point,NodeId> for it, then our data structure is an efficient hash table. Also doing a simple rectangular neighbor search should be fast: requesting only the 8 surrounding bounding boxes. Then no tree traversal is necessary and every box can be done with just a loop through the bucket array.
3. Another possibility is to use a small quadtree as an entry (mapping spatial keys to ids) for a 2D-graph. Then traversing this graph to find neighbors. This way I’ve finally chosen as I already needed a road-network for Dijkstra. So I only need additional 10MB for the small quadtree inex, see a possible next blog entry.
4. I’m not alone – you can take my idea and try implementing a more efficient neighbor searches yourself 🙂 !
In this post I’ve explained how to create a spatial hash table which is optimized in memory usage. This is achieved combining two ideas: using a hash table-alike data structure still ‘somehow suited’ for neighborhood searches and reducing the amount of memory while storing only parts of the hashkey. This second idea could be applied on every kind of hash table but only if the hashkey creation is reversible.
The ideas are implemented in Java for the GraphHopper project – see the geohash package. Sadly the perfomance for neighbor searches is really bad. Which created a different solution in my mind (see point 3 of good news).
In the literature similar data structures are called linear or pointer-less quad trees. After this experiment I come to the conclusion that the best way to implement a memory efficient spatial storage which is also able to perform fast neighbor queries could be a prefix quad tree. Still using pointers but storing two bits in very branch node and avoid those bits in the leaf nodes. Ongoing work for this is done currently in Spatial4J & Lucene 4.0 – actually without the use of spatial keys.