jsii – full text search in 1K LOC of JavaScript!

In the previous blog post I tried to introduce node.js and its nice features. Today I will introduce my little search engine prototype called jsii (javascript inverted index).

jsii provides an in-memory inverted index within approx. 1000 lines of JavaScript. Some more lines are necessary to set up a server via node.js, so that the index is queryable via http and returns Solr compatible json or xml. The sources are available @github:

git clone git@github.com:karussell/jsii.git

Try it out here: http://pannous.info:8124/select?q=google e.g. filter queries works like id:xy or queries with sorting works like &sort=id asc. The paramters start and rows can be used for paging. For those who come too late e.g. my server crashed or sth. ;-), here is an image of the xml response:

The solr compatible xml response format makes it possible to use jsii from applications that are using SolrJ. For example I tried it for Jetwick and the basic search worked – just specify the xml reponse parser:

solrj.setParser(new XMLResponseParser());

His-story

The first thing I needed was a BitSet analogon in JavaScript to perform term look-ups fast and combine them via AND bit-operation. Therefor I took the classes and tests from a GWT patch and made them working for my jasmine specs.

While trying to understand the basics of a scoring function I stumbled over the lucene docs and this thread which mentions ‘Section 6 of a book‘ for a good reference on that subject.

My understanding of the basics is now the following:

  • The term frequency (tf) is to weight documents differently. E.g. document1 contains ‘java’ 10 times but doc2 has it 20 times. So doc2 is more important for a query ‘java’. If you index tweets you should do tf = min(tf, 3). Otherwise you will often get tweets ala ‘java java java java java java…’ instead of important once. So for tweets a higher entropy is also relevant
  • The inverted document frequency (idf) gives certain terms a higher (or lower) weight. So, if a term occurs in all documents the term frequency should be low to make that term of a query not so important compared to other terms where less documents were found

With jsii you can grab docs from a solr index or feed it via the javascript api. jsii is very basic and simple, but it seems to work reasonable fast. I get fair response times of under 50ms with ~20k tweets although I didn’t invest time to improve performance. There are some bugs and yes, jsii is a memory hog, but besides this it is amazing what can be done with a ‘script’ language. BTW: at the moment jsii is a 100% real time search engine because it does not support transactions or warming up ;-)

Hints

About these ads