Use cases of faceted search for Apache Solr


In this post I write about some use cases of facets for Apache Solr. Please submit your own ideas in the comments.
This post is split into the following parts:

  • What are facets?
  • How do you enable and use simple facets?
  • What are other use cases?
    1. Category navigation
    2. Autocompletion
    3. Trending keywords or links
    4. Rss feeds
  • Conclusion

What are facets?

In Apache Solr, elements for navigational purposes are named facets. Keep in mind that Solr provides filter queries (specified via the HTTP parameter fq), which filter documents out of the search result. In contrast, facet queries only provide information (a count of documents) and do not change the result documents. I.e. they provide ‘filter queries for future queries’: you define a facet query and see how many documents you can expect if you applied the related filter query.

But a picture – from this great facet-introduction – is worth a thousand words:

What do you see?

  • You see different facets like Manufacturer, Resolution, …
  • Every facet has some constraints with which the user can easily filter the search results
  • The breadcrumb shows all selected constraints and allows removing them

All these values can be extracted from Solr’s search results and can be defined at query time, which may look surprising if you come from FAST ESP. Nevertheless, the fields on which you do faceting need to be indexed and untokenized, e.g. string or integer. In particular, a field you want to facet on must not be of the default ‘text’ type, which is tokenized.

In Solr you have normal (field) facets, facet queries, range queries and date facets.

Normal facets are useful if your documents have, for example, a manufacturer string field, so that a document falls into the ‘Sony’ or the ‘Nikon’ bucket. In contrast, you will need facet queries for numeric fields like the price. For example, if you specify a facet query from 0 to 10 EUR, Solr will calculate on the fly how many documents fall into that bucket. But facet queries become relatively unwieldy if you have several equally sized ranges like 0-10, 10-20, 20-30, … EUR. Then you can use range queries.
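To make the difference concrete, here is a minimal sketch in plain Java (no SolrJ needed) that builds the request parameters for both variants. The field name price and the class PriceBuckets are made up for illustration, and the second method assumes a Solr version that supports the facet.range parameters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PriceBuckets {

    // One facet.query value per bucket: price:[0 TO 10], price:[10 TO 20], ...
    // Both ends are inclusive here, so a boundary value such as 10 is counted in two buckets.
    static List<String> facetQueries(String field, int start, int end, int gap) {
        if (gap <= 0) {
            throw new IllegalArgumentException("gap must be positive");
        }
        List<String> queries = new ArrayList<>();
        for (int from = start; from < end; from += gap) {
            int to = Math.min(from + gap, end);
            queries.add(field + ":[" + from + " TO " + to + "]");
        }
        return queries;
    }

    // The same buckets described once via the range faceting parameters.
    static Map<String, String> rangeParams(String field, int start, int end, int gap) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("facet", "true");
        params.put("facet.range", field);
        params.put("facet.range.start", String.valueOf(start));
        params.put("facet.range.end", String.valueOf(end));
        params.put("facet.range.gap", String.valueOf(gap));
        return params;
    }

    public static void main(String[] args) {
        facetQueries("price", 0, 30, 10).forEach(q -> System.out.println("facet.query=" + q));
        rangeParams("price", 0, 30, 10).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With facet queries the number of parameters grows with the number of buckets, while the range variant always needs the same four parameters.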

Date facets are special range queries. As an example look into this screenshot from jetwick:

Here the interval (which is called the gap) of every bucket is one day.

For a nice introduction to facets have a look at this publication or use the Solr wiki here.

How do you enable and use simple facets?

As stated before, they can be enabled at query time. For the HTTP API you add “&facet=true&facet.field=manu” to your normal query “http://localhost:8983/solr/select?q=*:*”. For SolrJ you do:

new SolrQuery("*:*").setFacet(true).addFacetField("manu");

In the XML returned from the Solr server you will get something like this – again from this post:

<lst name="facet_fields">
    <lst name="manu">
        <int name="Canon USA">17</int>
        <int name="Olympus">12</int>
        <int name="Sony">12</int>
        <int name="Panasonic">9</int>
        <int name="Nikon">4</int>
    </lst>
</lst>

To retrieve this with SolrJ you don’t need to touch any XML, of course. Just get the facet objects:

List<FacetField> facetFields = queryResponse.getFacetFields();

To append facet queries specify them with addFacetQuery:

solrQuery.addFacetQuery("quality:[* TO 10]").addFacetQuery("quality:[11 TO 100]");

And how would you query for documents which do not have a value for that field? This is easy: q=-field_name:[* TO *]

Now I’ll show you how I implemented date facets in jetwick:

q.setFacet(true).set("facet.date", "{!ex=dt}dt").
    set("facet.date.start", "NOW/DAY-6DAYS").
    set("facet.date.end", "NOW/DAY+1DAY").
    set("facet.date.gap", "+1DAY");

With that query you get 7 day buckets, which are visualized like this:

It is important to note that you will have to use local parameters like {!ex=dt} to make sure that, if a user applies a facet (i.e. uses the facet query as a filter query), the other facet queries won’t get a count of 0. In the picture the filter query was fq={!tag=dt}dt:[2010-12-04T00:00:00.000Z+TO+2010-12-05T00:00:00.000Z]. Again: the filter query needs to start with {!tag=dt} for this to work. Take a look into the DateFilter source code or this for more information.
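How the tag and the exclusion belong together is easier to see when both parameters are built side by side. The following is only a sketch in plain Java and not the jetwick code: the class TaggedDateFacet and its method names are made up, and only the parameter names and values come from the examples above.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class TaggedDateFacet {

    // Parameters for a date facet whose counts ignore the filter that carries the given tag.
    static Map<String, String> params(String field, String tag, String from, String to) {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "*:*");
        p.put("facet", "true");
        // {!ex=<tag>}: exclude the tagged filter while counting this facet
        p.put("facet.date", "{!ex=" + tag + "}" + field);
        p.put("facet.date.start", "NOW/DAY-6DAYS");
        p.put("facet.date.end", "NOW/DAY+1DAY");
        p.put("facet.date.gap", "+1DAY");
        // {!tag=<tag>}: mark the filter so that the facet above can exclude it
        p.put("fq", "{!tag=" + tag + "}" + field + ":[" + from + " TO " + to + "]");
        return p;
    }

    // URL-encodes the parameters; needed because of characters like '{', '+' and ' '.
    static String toQueryString(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> p = params("dt", "dt",
                "2010-12-04T00:00:00.000Z", "2010-12-05T00:00:00.000Z");
        System.out.println("http://localhost:8983/solr/select?" + toQueryString(p));
    }
}
```

The tag name is free to choose; it only has to be the same in {!tag=...} and {!ex=...}.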

Be aware that you will have to tune the filterCache in order to keep performance green. It is also important to use warming queries to avoid timeouts and to pre-fill the caches with old, heavily used data.

What are other use cases?

1. Category navigation

The problem: you have a tree of categories and your products are assigned to several of those categories.

There are two relatively similar solutions for this problem. I will describe one of them:

  • Create a multivalued string field called ‘category’. Use the category id (or name if you want to avoid DB queries).
  • You have a category tree. Make sure a document gets not only the leaf category, but all categories up to the root node.
  • Now facet over the category field with ‘-1’ as limit
  • But what if you want to display only the categories of one level? E.g. if you don’t want to show other levels at the same time or if there are too many of them.
    Then index the category field à la <level>_category. For that you will need the complete category tree in RAM while indexing. Then use facet.prefix=<level>_ to filter the category list for the level
  • Clicking on a category entry should result in a filter query à la fq=category:"<level>_categoryId"
  • The slightly tricky part is that your UI or middle tier now has to parse the level, e.g. 2, and then append 2+1=3 to the query: facet.prefix=3_
  • If you filter by level then one question remains:
    Q: how can you display the path from the selected category up to the root category?
    A: Either get the category parents via DB, which is easy if you store the category ids in Solr – not the category names.
    Or get the parents from the parameter list, which is a bit more complicated but doable. In this case you’ll need to store the category names in Solr.
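The level handling from the list above could look roughly like the following sketch in plain Java. The class CategoryLevels, its method names and the sample category ids are made up, and numbering the root as level 1 is an assumption; any consistent numbering works.

```java
import java.util.ArrayList;
import java.util.List;

final class CategoryLevels {

    // Index-time values for the multivalued 'category' field: one "<level>_<id>" token
    // per category on the path from the root (level 1, an assumption) down to the leaf.
    static List<String> indexTokens(List<String> pathFromRoot) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < pathFromRoot.size(); i++) {
            tokens.add((i + 1) + "_" + pathFromRoot.get(i));
        }
        return tokens;
    }

    // Filter query for a clicked category token, e.g. category:"2_cameras".
    static String filterQuery(String token) {
        return "category:\"" + token + "\"";
    }

    // facet.prefix for the next deeper level: parse the level of the clicked token and add one.
    static String nextLevelPrefix(String token) {
        int level = Integer.parseInt(token.substring(0, token.indexOf('_')));
        return (level + 1) + "_";
    }

    public static void main(String[] args) {
        List<String> tokens = indexTokens(List.of("electronics", "cameras", "dslr"));
        System.out.println(tokens);
        System.out.println("fq=" + filterQuery(tokens.get(1)));
        System.out.println("facet.prefix=" + nextLevelPrefix(tokens.get(1)));
    }
}
```

The index tokens are what the document gets at indexing time; the other two methods belong to the UI or the middle tier.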

Please let me know if this explanation makes sense to you or if you want to see that in action – I don’t want to advertise for our customers here :-)

BTW: The second approach I have in mind is: instead of using facet.prefix you can use dynamic fields ala category_<level>_s

Special hint: if there are too many facet values you can even page through them (via facet.offset and facet.limit)!

2. Autocompletion

The problem: you want to show suggestions as the user types.

You’ll need a multivalued ‘tag’ field. For jetwick I’m using a heavy noise word filter to get only terms ‘with information’ from the very noisy tweet text into the tag field. If you are using a shingle filter you can even create phrase suggestions. But I will describe the “one more word” suggestion here, which will only suggest the next word (not a completely different phrase).

To do this, create the following query when the user types in some characters (see the getQueryChoices method of SolrTweetSearch):

  • Use the old query with all filter queries etc. to provide a context-dependent autocomplete (i.e. only give suggestions which will lead to results)
  • Split the query into “completed” terms and one “to do” term. E.g. if you enter “michael jack”
    then michael is complete (ends with a space) and jack should be completed
  • Set the query term of the old query to michael and add facet.prefix=jack
  • Set the facet limit to 10
  • Read the 10 suggestions from the facet field but exclude already completed terms.
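The splitting and the final filtering step can be sketched in plain Java as follows. This is not the getQueryChoices code from jetwick; the class OneMoreWord and its methods are made up, and the facet values are assumed to have been read from the Solr response already.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class OneMoreWord {

    // Result of splitting the typed text.
    static final class Split {
        final List<String> completed; // terms the user has finished typing
        final String prefix;          // the term still being typed ("" if there is none)

        Split(List<String> completed, String prefix) {
            this.completed = completed;
            this.prefix = prefix;
        }
    }

    // "michael jack" -> completed=[michael], prefix=jack; "michael " -> completed=[michael], prefix=""
    static Split split(String input) {
        String trimmed = input.trim();
        List<String> terms = trimmed.isEmpty()
                ? new ArrayList<>()
                : new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
        boolean stillTyping = !input.isEmpty()
                && !Character.isWhitespace(input.charAt(input.length() - 1));
        String prefix = "";
        if (stillTyping && !terms.isEmpty()) {
            prefix = terms.remove(terms.size() - 1); // becomes facet.prefix
        }
        return new Split(terms, prefix);
    }

    // Turns the facet values into full suggestions and skips already completed terms.
    static List<String> suggestions(Split split, List<String> facetValues, int limit) {
        Set<String> done = new HashSet<>(split.completed);
        String head = String.join(" ", split.completed);
        List<String> result = new ArrayList<>();
        for (String value : facetValues) {
            if (result.size() >= limit) {
                break;
            }
            if (done.contains(value)) {
                continue;
            }
            result.add(head.isEmpty() ? value : head + " " + value);
        }
        return result;
    }
}
```

The completed terms form the q parameter and the prefix goes into facet.prefix on the tag field. Excluding completed terms matters for inputs like “michael mi”, where the facet would otherwise suggest michael again.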

The implementation for jetwick, which uses Apache Wicket, is available in the SearchBox source file, which uses MyAutoCompleteTextField and the getQueryChoices method of SolrTweetSearch. But before you implement autocomplete with facets take a look into this documentation. And if you don’t want to use Wicket then there is a jQuery autocomplete library especially for Solr – no UI layer required.

3. Trending keywords or links

Similar to autocomplete, you will need a tag or link field in your index. Then use the facet counts as an indicator of how important a term is. If you now run a query, e.g. ‘solr’, you will get the trending keywords and links depending on the filters. E.g. you can select different days to see the changes:

The keyword panel is implemented in the TagCloudPanel and the link list is available as UrlTrendPanel.

Of course it would be nice to get the accumulated score of every link instead of a simple ‘count’, to prevent spammers from reaching this list. For that, look into this JIRA issue and into the StatsComponent. As I explained in the JIRA issue, this nice feature could be simulated with the result grouping feature.

4. Rss feeds

If you log in at jetwick.com you’ll see this idea implemented. Every user can have different saved searches. For example I have one search for ‘apache solr’ and one for ‘wikileaks’. Every search can contain additional filters, like only German language, or a sort order, like sorting by retweets. Now the task is to transform that query into a facet query:

  • insert ANDs between the query and all the filter queries
  • remove all date filters
  • add one date filter with the date of the last processed search (‘last date’)

Then you will see how many new tweets are available for every saved search:

Update: no need to click refresh to see the counts. The count-update is done in the background via JavaScript.
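The three transformation steps listed above can be sketched in plain Java like this. It is not the jetwick implementation: the class SavedSearchFacet is made up, the date field is assumed to be called dt as in the date facet example, and filter queries are assumed to be simple strings with an optional leading {!tag=...} block. Purely negative filters such as -lang:de would need extra care inside the parentheses.

```java
import java.util.ArrayList;
import java.util.List;

final class SavedSearchFacet {

    // Assumption: the date field is called 'dt', as in the date facet example.
    static final String DATE_FIELD = "dt";

    // Drops a leading local-params block such as {!tag=dt}.
    static String stripLocalParams(String fq) {
        if (fq.startsWith("{!")) {
            int end = fq.indexOf('}');
            if (end >= 0) {
                return fq.substring(end + 1);
            }
        }
        return fq;
    }

    // Builds one facet query that counts the documents newer than lastDate for a saved search.
    static String toFacetQuery(String query, List<String> filterQueries, String lastDate) {
        List<String> clauses = new ArrayList<>();
        clauses.add("(" + query + ")");
        for (String fq : filterQueries) {
            String plain = stripLocalParams(fq).trim();
            if (plain.startsWith(DATE_FIELD + ":")) {
                continue; // remove all date filters
            }
            clauses.add("(" + plain + ")");
        }
        // add one date filter with the date of the last processed search;
        // the lower bound is inclusive, so a document exactly at lastDate is counted again
        clauses.add(DATE_FIELD + ":[" + lastDate + " TO *]");
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(toFacetQuery("apache solr",
                List.of("lang:de", "{!tag=dt}dt:[2010-12-04T00:00:00Z TO 2010-12-05T00:00:00Z]"),
                "2010-12-06T10:00:00Z"));
    }
}
```

Each saved search then becomes one facet query (e.g. added with addFacetQuery), so the counts for all saved searches of a user can be fetched with a single request.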

Conclusion

There are a lot of applications for faceted search. It is very convenient to use them. Okay, the ‘local parameter hack’ is a bit daunting, but hey: it works :-)

It is nice that I can specify different facets for every query in Solr. With that feature you can generate personalized facets, as explained under “Rss feeds”.

One improvement for the facets implemented in Solr could be a feature which does not calculate the count. Instead it would sum up a fieldA for documents with the same value in fieldB, or even return the score for a facet or a facet query. This would improve the use case “Trending keywords or links”.

10 thoughts on “Use cases of faceted search for Apache Solr”

  1. Hi! Great article, thanks. But I’ve got a question: why have you moved jetwick from Solr to ElasticSearch? And did you run into any problems? I would like to use ElasticSearch with Riak and I’m not very comfortable with all these projects. Your experience will be very useful.

  2. I haven’t moved it yet. Still testing with more data.

    Main reason to move: less administration for multiple indices (when sharding, replicating etc.)

    Minor thing: ES uses a newer Lucene which is near real time.

    I’ll write a blog post about this in the near future.

  3. Nice article!

    Is there a way to combine autocompletion with categories?
    For example if you type “canon” in a search box and you’ll get following result:

    1) canon eos 1000D (article)
    2) canon digital cameras (category)
    3) …

    Thanks in advance.

    • I’m starting a new project so I am open for any method.
      Is it maybe better to use a separate index for categories?

  4. Hi Tom,

    let me think about it a bit. I think the n-gram technique would be more appropriate in your case:

    http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

    Then you are getting documents (instead of only counts via facets) and
    you can get directly the category out of each document.
    This would avoid a second query. I don’t think that another index is required for categories.

    Do you have a category tree or only a flat category list?

    Grüße,
    Peter.

    BTW: for a new project also take a look into ElasticSearch :) !

  5. Hi Peter,

    thanks for your help! I will take a look at ElasticSearch.
    Getting documents and not only facets is also a good idea (I have a category tree).

    Grüße
    Tom

    PS: I’ve only just noticed your German too ;)

  6. > PS: I’ve only just noticed your German too ;)

    exactly ;)

    Let me know if you have any more questions (also gladly via mail).
    For Solr the mailing list is very helpful, for ElasticSearch the Google groups or freenode!

  7. Pingback: Search Query Suggestions using ElasticSearch via Shingle Filter and Facets « Code // Comment

  8. Hi,
    When I do facet=true&facet.field=manu, it gives me a list but each int name is just one word. Like in your example, I get Canon separate and USA separate. How can I get both of them together in one int name?
