In the previous blog post I tried to introduce node.js and its nice features. Today I will introduce my little search engine prototype called jsii (javascript inverted index).
jsii provides an in-memory inverted index in roughly 1000 lines of JavaScript. Some more lines are necessary to set up a server via node.js, so that the index can be queried via HTTP and returns Solr-compatible JSON or XML. The sources are available on GitHub:
git clone git@github.com:karussell/jsii.git
Try it out here: http://pannous.info:8124/select?q=google. Filter queries work like id:xy, and sorting works like &sort=id asc. The parameters start and rows can be used for paging. For those who come too late (e.g. because my server crashed or something ;-)), here is an image of the XML response:

Solr XML Response
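To make the paging parameters concrete, here is a minimal sketch of how a client could build such a query URL. The helper buildSelectUrl and the localhost address are made up for illustration and are not part of jsii; only the parameter names q, sort, start and rows are taken from the description above.

```javascript
// Hypothetical helper, not part of jsii: turns a 1-based page number into
// the start/rows parameters and builds the query string for /select.
function buildSelectUrl(baseUrl, query, page, pageSize, sort) {
  const params = new URLSearchParams();
  params.set("q", query);
  if (sort) params.set("sort", sort); // e.g. "id asc"
  params.set("start", String((page - 1) * pageSize)); // offset of the first hit
  params.set("rows", String(pageSize)); // number of hits per page
  return baseUrl + "/select?" + params.toString();
}

// Third page with 10 results per page, sorted by id.
// The host is a placeholder for wherever your jsii server runs.
const pagedUrl = buildSelectUrl("http://localhost:8124", "google", 3, 10, "id asc");
console.log(pagedUrl);
```

URLSearchParams takes care of escaping, so a sort value containing a space does not need to be encoded by hand.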
The Solr-compatible XML response format makes it possible to use jsii from applications that are using SolrJ. For example, I tried it for Jetwick and the basic search worked; just specify the XML response parser:
solrj.setParser(new XMLResponseParser());
His-story
The first thing I needed was a BitSet equivalent in JavaScript to perform term look-ups fast and combine them via a bitwise AND operation. Therefore I took the classes and tests from a GWT patch and made them work with my jasmine specs.
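To illustrate the idea, here is a minimal sketch of a bit set and a tiny inverted index built on top of it. This is not the GWT-derived class that jsii uses; SimpleBitSet, buildIndex and the sample documents are invented for this example, and the tokenizer just splits on spaces.

```javascript
// Minimal fixed-size bit set backed by 32-bit words (illustrative only).
class SimpleBitSet {
  constructor(size) {
    this.size = size;
    this.words = new Uint32Array(Math.ceil(size / 32));
  }
  set(index) {
    this.words[index >>> 5] |= 1 << (index & 31);
  }
  get(index) {
    return (this.words[index >>> 5] & (1 << (index & 31))) !== 0;
  }
  // Returns a new bit set containing only the bits set in both operands.
  and(other) {
    const result = new SimpleBitSet(Math.min(this.size, other.size));
    for (let i = 0; i < result.words.length; i++) {
      result.words[i] = this.words[i] & other.words[i];
    }
    return result;
  }
  // Lists the positions of all set bits, i.e. the matching document ids.
  toArray() {
    const ids = [];
    for (let i = 0; i < this.size; i++) {
      if (this.get(i)) ids.push(i);
    }
    return ids;
  }
}

// Inverted index: maps each term to the bit set of documents containing it.
function buildIndex(docs) {
  const index = new Map();
  docs.forEach((text, docId) => {
    for (const term of text.split(" ")) {
      if (!index.has(term)) index.set(term, new SimpleBitSet(docs.length));
      index.get(term).set(docId);
    }
  });
  return index;
}

const sampleDocs = [
  "java search engine", // doc 0
  "node server",        // doc 1
  "java node search",   // doc 2
  "search index",       // doc 3
];
const sampleIndex = buildIndex(sampleDocs);

// Documents that contain both 'java' AND 'node'.
const javaAndNode = sampleIndex.get("java").and(sampleIndex.get("node"));
console.log(javaAndNode.toArray()); // [ 2 ]
```

Each term look-up is a single map access, and combining two terms costs one AND per 32 documents, which is why a bit set fits this job well.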
While trying to understand the basics of a scoring function I stumbled over the Lucene docs and this thread, which mentions ‘Section 6 of a book’ as a good reference on that subject.
My understanding of the basics is now the following:
- The term frequency (tf) weights documents differently. E.g. document1 contains ‘java’ 10 times but doc2 has it 20 times, so doc2 is more important for the query ‘java’. If you index tweets you should use tf = min(tf, 3). Otherwise you will often get tweets à la ‘java java java java java java…’ instead of important ones. So for tweets a higher entropy is also relevant.
- The inverse document frequency (idf) gives certain terms a higher (or lower) weight. If a term occurs in all documents, its idf should be low, so that this query term is less important than other query terms that match fewer documents.
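The two points above can be sketched in a few lines. This is a simplified textbook variant (score = sum of tf * idf with idf = log(N / df)), not the exact formula of Lucene or jsii; the function names, the sample tweets and the maxTf parameter are assumptions made for this example.

```javascript
// Inverse document frequency: low for terms that occur in many documents,
// and exactly 0 for a term that occurs in every document.
function idf(numDocs, docFreq) {
  return Math.log(numDocs / docFreq);
}

// Counts in how many documents each term occurs.
function documentFrequencies(docs) {
  const df = {};
  for (const text of docs) {
    for (const term of new Set(text.split(" "))) {
      df[term] = (df[term] || 0) + 1;
    }
  }
  return df;
}

// Sum of (capped tf) * idf over all query terms found in the document.
function scoreDoc(queryTerms, text, df, numDocs, maxTf) {
  const terms = text.split(" ");
  let total = 0;
  for (const queryTerm of queryTerms) {
    const tf = terms.filter((t) => t === queryTerm).length;
    if (tf === 0) continue;
    total += Math.min(tf, maxTf) * idf(numDocs, df[queryTerm]);
  }
  return total;
}

const tweets = [
  "java java java java java java", // spam-like tweet
  "java search engine",
  "node search",
  "node server",
];
const tweetDf = documentFrequencies(tweets);

// With a cap of 3 the spam-like tweet counts 'java' only three times.
const cappedScore = scoreDoc(["java"], tweets[0], tweetDf, tweets.length, 3);
const uncappedScore = scoreDoc(["java"], tweets[0], tweetDf, tweets.length, Infinity);
console.log(cappedScore, uncappedScore);
```

Capping tf halves the score of the repetitive tweet here, which leaves more room for documents that match several different query terms.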
With jsii you can grab docs from a Solr index or feed it via the JavaScript API. jsii is very basic and simple, but it seems to work reasonably fast. I get fair response times of under 50 ms with ~20k tweets although I didn’t invest time to improve performance. There are some bugs and yes, jsii is a memory hog, but besides this it is amazing what can be done with a ‘script’ language. BTW: at the moment jsii is a 100% real-time search engine because it does not support transactions or warming up 😉
Hints
- look into the TODO file before posting an issue
- jsii feeding is NOT thread safe
- I read this ‘object oriented JS with node’ article and got some suggestions from node.js users
- As an IDE I’m using NetBeans. I reported an issue to create a ‘pure javascript’ project in NetBeans.
- git cheat sheet
- There is an older, similar project called jssindex
And there are some other projects:
https://github.com/talltyler/node-search
http://search.cpan.org/~dpavlin/jsFind-0.07_01/jsFind.pm