Detect Stolen and Duplicate Tweets with Solr

A new feature, “duplication detection”, has been implemented for jetwick and seems to work pretty well thanks to the great performance of Solr.

To try it, go to this tweet and click the ‘Find Similar’/’Guttenberg’ button below the tweet to investigate existing duplicates. With that feature jetwick can skip spam, identify different accounts of the same user and skip tweets with wrong retweet or attribution.

But it also lets you see stolen tweets, i.e. when users tweet without attribution or without knowing the original tweet, or when all tweeters had a common external source, e.g. a newspaper. Thanks to pannous for pointing this out.
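
A rough sketch of the idea with SolrJ: Solr’s MoreLikeThis component can find near-duplicates of a given tweet. This is not necessarily jetwick’s actual implementation; the field name tweet_text and the id value are assumptions, and server is a SolrServer as in the other snippets of this blog:

SolrQuery query = new SolrQuery("id:12345");
query.set("mlt", true);            // enable the MoreLikeThis component
query.set("mlt.fl", "tweet_text"); // compare documents on the tweet text
query.set("mlt.mintf", 1);         // minimum term frequency in the source doc
query.set("mlt.mindf", 2);         // a term must occur in at least 2 docs
query.set("mlt.count", 10);        // return up to 10 similar tweets
QueryResponse rsp = server.query(query);
// the similar documents are returned per source document id:
NamedList<?> similar = (NamedList<?>) rsp.getResponse().get("moreLikeThis");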

Examples for ‘stolen’ or duplicated tweets:

Here is an example of a user with two twitter accounts: the tweets were sent with the same twitter client and posted at identical times.

The following German example looks more like ‘stolen’ tweets:

http://twitter.com/#!/Newsteam_Berlin/status/17881387294003200

and a lot more: ste_pos, Kleines79, …

The oldest tweet, and therefore the original, is:

As you can see, successful detection does not require the tweets to contain exactly the same string.

Detecting duplicated tweets could be interesting for everybody who wants to give the ‘correct’ person their attribution, because it is often the case that not the original tweet but the ‘stolen’ tweet is more popular (has more retweets), especially for accounts with many followers.

But it is also useful for “tweet readers” like jetwick to avoid twitter noise and reading the same content twice.

Update: This seems to be the first tweet about santa and wikileaks:

suliz has only 67 followers. Now take the same tweet from ihackinjosh, who has over 6000 followers: it got over 600 retweets, although suliz tweeted nearly one day earlier. That is life!

Use cases of faceted search for Apache Solr

In this post I write about some use cases of facets for Apache Solr. Please submit your own ideas in the comments.
This post is split into the following parts:

  • What are facets?
  • How do you enable and use simple facets?
  • What are other use cases?
    1. Category navigation
    2. Autocompletion
    3. Trending keywords or links
    4. Rss feeds
  • Conclusion

What are facets?

In Apache Solr, elements for navigational purposes are called facets. Keep in mind that Solr provides filter queries (specified via the http parameter fq), which filter documents out of the search result. In contrast, facet queries only provide information (document counts) and do not change the result documents. I.e. they provide ‘filter queries for future queries’: you define a facet query and see how many documents to expect if you applied the related filter query.
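
In SolrJ terms the difference looks like this (a minimal sketch; the manu field is the one used in the example below):

// a filter query shrinks the result set:
SolrQuery filtered = new SolrQuery("camera").addFilterQuery("manu:Sony");
// a facet query leaves the results untouched and only reports
// how many documents would match the corresponding filter:
SolrQuery faceted = new SolrQuery("camera").setFacet(true).addFacetQuery("manu:Sony");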

But a picture – from this great facet-introduction – is worth a thousand words:

What do you see?

  • You see different facets like Manufacturer, Resolution, …
  • Every facet has some constraints, with which the user can easily filter the search results
  • The breadcrumb shows all selected constraints and allows removing them

All these values can be extracted from Solr’s search results and can be defined at query time, which looks surprising if you come from FAST ESP. Nevertheless, the fields on which you facet need to be indexed and untokenized, e.g. string or integer. In particular, a facet field must not be of the default ‘text’ type, which is tokenized.

In Solr you have normal facets, facet queries, range queries and date facets.

Normal facets are useful if your documents have e.g. a manufacturer string field: a document falls into the ‘Sony’ or the ‘Nikon’ bucket. In contrast, you will need facet queries for numeric fields like a price. For example, if you specify a facet query from 0 to 10 EUR, Solr will calculate on the fly all documents which fall into that bucket. But facet queries become relatively unhandy if you have several identical ranges like 0-10, 10-20, 20-30, … EUR. Then you can use range queries.
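
If your Solr version supports it, identical buckets can be expressed in one go via range facets. A sketch, assuming a numeric price field; facet.range is only available in Solr releases newer than 1.4, on older versions you can fall back to repeated addFacetQuery calls:

SolrQuery q = new SolrQuery("*:*");
q.setFacet(true);
q.set("facet.range", "price");
q.set("facet.range.start", "0");
q.set("facet.range.end", "100");
q.set("facet.range.gap", "10"); // buckets 0-10, 10-20, … 90-100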

Date facets are special range queries. As an example look at this screenshot from jetwick:

Here the interval for every bucket (which is called the gap) is one day.

For a nice introduction to facets have a look at this publication or the Solr wiki.

How do you enable and use simple facets?

As stated before, facets can be enabled at query time. For the http API you add “&facet=true&facet.field=manu” to your normal query “http://localhost:8983/solr/select?q=*:*”. For SolrJ you do:

new SolrQuery("*:*").setFacet(true).addFacetField("manu");

In the Xml returned from the Solr server you will get something like this – again from this post:

<lst name="facet_fields">
  <lst name="manu">
    <int name="Canon USA">17</int>
    <int name="Olympus">12</int>
    <int name="Sony">12</int>
    <int name="Panasonic">9</int>
    <int name="Nikon">4</int>
  </lst>
</lst>

To retrieve this with SolrJ you don’t need to touch any Xml, of course. Just get the facet objects:

List<FacetField> facetFields = queryResponse.getFacetFields();

To append facet queries specify them with addFacetQuery:

solrQuery.addFacetQuery("quality:[* TO 10]").addFacetQuery("quality:[11 TO 100]");

And how would you query for documents which do not have a value for that field? This is easy: q=-field_name:[* TO *]

Now I’ll show you how I implemented date facets in jetwick:

q.setFacet(true).set("facet.date", "{!ex=dt}dt").
    set("facet.date.start", "NOW/DAY-6DAYS").
    set("facet.date.end", "NOW/DAY+1DAY").
    set("facet.date.gap", "+1DAY");

With that query you get 7 day buckets, which are visualized via:

It is important to note that you have to use local parameters like {!ex=dt} to make sure that if a user applies a facet (i.e. uses the facet query as a filter query), the other facet queries won’t get a count of 0. In the picture the filter query was fq={!tag=dt}dt:[2010-12-04T00:00:00.000Z+TO+2010-12-05T00:00:00.000Z]. Again: the filter query needs to start with {!tag=dt} to make that work. Take a look at the DateFilter source code or this post for more information.

Be aware that you will have to tune the filterCache in order to keep performance green. It is also important to use warming queries to avoid timeouts and to pre-fill the caches with heavily used data.

What are other use cases?

1. Category navigation

The problem: you have a tree of categories and your products are categorized in multiple of those categories.

There are two relatively similar solutions to this problem. I will describe one of them (see the sketch after this list):

  • Create a multivalued string field called ‘category’. Use the category id (or the name if you want to avoid DB queries).
  • You have a category tree. Make sure a document gets not only the leaf category, but all categories up to the root node.
  • Now facet over the category field with ‘-1’ as limit.
  • But what if you want to display only the categories of one level? E.g. if you don’t want to show other levels at the same time or if there are too many.
    Then index the category field ala <level>_category. For that you will need the complete category tree in RAM while indexing. Then use facet.prefix=<level>_ to filter the category list for the level.
  • Clicking on a category entry should result in a filter query ala fq=category:”<level>_categoryId”
  • The slightly tricky part is that your UI or middle tier now has to parse the level, e.g. 2, and then append 2+1=3 to the query: facet.prefix=3_
  • If you filter the level then one question remains:
    Q: how can you display the path from the selected category up to the root category?
    A: Either get the category parents via DB, which is easy if you store the category ids in Solr – not the category names.
    Or get the parents from the parameter list, which is a bit more complicated but doable. In this case you’ll need to store the category names in Solr.
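
Here is a minimal sketch of this approach with SolrJ; all field values and names are made up:

// index every product with its whole category path, prefixed by level
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "product-1");
doc.addField("category", "1_electronics"); // root
doc.addField("category", "2_cameras");     // child
doc.addField("category", "3_dslr");        // leaf
server.add(doc);
server.commit();

// show only the level-2 entries of the category tree
SolrQuery q = new SolrQuery("*:*");
q.setFacet(true).addFacetField("category").setFacetLimit(-1);
q.set("facet.prefix", "2_");
// clicking an entry then applies fq=category:"2_cameras" and facet.prefix=3_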

Please let me know if this explanation makes sense to you or if you want to see it in action – I don’t want to make advertisements for our customers here 🙂

BTW: The second approach I have in mind is: instead of using facet.prefix you can use dynamic fields ala category_<level>_s

Special hint: if there are too many facets you can even page through them!

2. Autocompletion

The problem: you want to show suggestions as the user types.

You’ll need a multivalued ‘tag’ field. For jetwick I’m using a heavy noise-word filter to get only terms ‘with information’ from the very noisy tweet text into the tag field. If you use a shingle filter you can even create phrase suggestions. But here I will describe the “one more word” suggestion, which only suggests the next word (not a completely different phrase).

To do this, create the following query when the user types in some characters (see the getQueryChoices method of SolrTweetSearch, and the sketch after this list):

  • Use the old query with all filter queries etc. to provide a context dependent autocomplete (i.e. only give suggestions which will lead to results)
  • Split the query into “completed” terms and one “to do” term. E.g. if you enter “michael jack”,
    then michael is complete (ends with a space) and jack should be completed
  • Set the query term of the old query to michael and add facet.prefix=jack
  • Set the facet limit to 10
  • Read the 10 suggestions from the facet field, but exclude already completed terms.
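
A minimal sketch of these steps with SolrJ, assuming the multivalued tag field described above:

String input = "michael jack";
int idx = input.lastIndexOf(' ');
String completed = input.substring(0, idx);   // "michael"
String incomplete = input.substring(idx + 1); // "jack"

SolrQuery q = new SolrQuery(completed);
q.setRows(0); // only the facet counts are needed, no result documents
q.setFacet(true).addFacetField("tag").setFacetLimit(10);
q.set("facet.prefix", incomplete.toLowerCase());

for (FacetField.Count count : server.query(q).getFacetFields().get(0).getValues()) {
    if (!completed.contains(count.getName())) // exclude already completed terms
        System.out.println(completed + " " + count.getName());
}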

The implementation for jetwick, which uses Apache Wicket, is available in the SearchBox source file, which uses MyAutoCompleteTextField and the getQueryChoices method of SolrTweetSearch. But before you implement autocomplete with facets take a look at this documentation. And if you don’t want to use wicket there is a jquery autocomplete library especially for solr – no UI layer required.

3. Trending keywords or links

Similar to autocomplete you will need a tag or link field in your index. Then use the facet counts as an indicator of how important a term is. If you now run a query, e.g. solr, you will get the trending keywords and links depending on the filters. E.g. you can select different days to see the changes:

The keyword panel is implemented in the TagCloudPanel and the link list is available as UrlTrendPanel.
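
A sketch of such a trending query, reusing the dt date field from above; the tag and url field names are assumptions:

SolrQuery q = new SolrQuery("solr");
q.addFilterQuery("dt:[NOW/DAY-1DAY TO NOW/DAY]"); // one selected day
q.setRows(0);
q.setFacet(true).addFacetField("tag").addFacetField("url").setFacetLimit(20);
// the facet counts of tag and url are the trending keywords and links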

Of course it would be nice if we could get the accumulated score of every link instead of a simple ‘count’, to prevent spammers from reaching this list. For that, look into this JIRA issue and into the StatsComponent. As I explained in the JIRA issue, this nice feature could be simulated with the result grouping feature.

4. Rss feeds

If you log in at jetwick.com you’ll see this idea implemented. Every user can have different saved searches. For example I have one search for ‘apache solr’ and one for ‘wikileaks’. Every search can contain additional filters, like only German language, or sort against retweets. Now the task is to transform such a query into a facet query (see the sketch after this list):

  • insert AND’s between the query and all the filter queries
  • remove all date filters
  • add one date filter with the date of the last processed search (‘last date’)
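
As a sketch, such a transformed saved search could be added like this; the field names tw, lang and dt are assumptions, lastDate holds the ‘last date’:

String savedSearch = "tw:(apache solr) AND lang:de AND dt:[" + lastDate + " TO *]";
solrQuery.addFacetQuery(savedSearch);
// the returned facet count is the number of new tweets for this saved search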

Then you will see how many new tweets are available for every saved search:

Update: no need to click refresh to see the counts. The count-update is done in the background via JavaScript.

Conclusion

There are a lot of applications for faceted search, and it is very convenient to use. Okay, the ‘local parameter hack’ is a bit daunting, but hey: it works 🙂

It is nice that I can specify different facets for every query in Solr; with that feature you can generate personalized facets as explained under “Rss feeds”.

One improvement for the facets implemented in Solr would be a feature which does not calculate the count, but instead sums up a fieldA for documents with the same value in fieldB, or even returns the score for a facet or a facet query. That would improve the use case “Trending keywords or links”.

Poor Man’s Monitoring for Solr

For jetwick I’m the developer, PR agent and sadly also the admin ;-). All in one, at once. Here is a minor snippet to get an alert email if your solr index is either not available or contains too few entries, and a resolved mail if all is fine again.


#!/bin/bash
cd /path/
FILE=bla.log
EMAILS="your@email.here"
SUBJECT="OK: jetwick"
STATUS=OK

# grab numFound from the json response; -T 10 lets wget time out after 10 seconds
CNT=`wget --http-user=user --http-password=password -T 10 -q "http://your-host.com/solr/select?q=&rows=1&wt=json" -O - | tr ',' '\n' | grep numFound | tr ':' ' ' | awk '{print $3}'`
# CNT stays empty if solr is down, hence the "x$CNT" == x check
if [ "x$CNT" == x ] || [ "$CNT" -lt 500000 ]; then
  SUBJECT="CRITICAL: check http://your-host.com/solr"
  STATUS=CRITICAL
fi

# 2>/dev/null avoids an error on the very first run when .status does not exist yet
PREV_STAT=`cat .status 2>/dev/null`

# mail only on a status change, not on every cron run
if [ "$STATUS" == "CRITICAL" ]; then
  if [ "$PREV_STAT" == "OK" ]; then
    cat $FILE | mail $EMAILS -a $FILE -s "$SUBJECT. doc count was only $CNT"
  fi
else
  if [ "$PREV_STAT" == "CRITICAL" ]; then
    cat $FILE | mail $EMAILS -a $FILE -s "SOLVED: http://your-host.com/solr"
  fi
fi

echo $STATUS > .status

Add this via crontab -e:

*/2 * * * * /path/check-health.sh

If you look at the code there is one mini hack which is necessary if the solr index is down and CNT is empty:

"x$CNT" == x

Jetwick Twitter Search is now free software! Wicket and Solr pearls for Developers.

Today we released Jetwick under the Apache 2 license.

Why we made Jetwick free software

I would like to start with an image I made some years ago for TimeFinder:

This is one reason, and we are very interested in your contributions as patches or bug reports.

But there are some more interesting opportunities when releasing jetwick as open source:

  • Open architecture: several jetwick hosters could each provide parts of the twitter index, and maybe some day we will have a way to freely explore all tweets at twitter in the jetwicked way
  • Personalized jetwick: every user has different interests. If you only feed tweets from searches of terms that you are interested in, plus your own timeline, then you will be able to search this personalized twitter, sort against retweets, see personal URL-trends, etc.
    This way you’ll be informed faster, wider and more personalized than with an ordinary rss feed or the ordinary twitter timeline, and without reading a lot of unrelated content. If jetwick stayed closed, this task would be too resource intensive, or it would even be impossible to convince every user.

In our further development we will concentrate on the second point, because then jetwick will have at least one user.

Explore Jetwick

Why should you install Jetwick and try it out?

First you can look at the features and see if something interesting – for you as a user – is shown.

For developers, the following things (and more) could be worth investigating:

Jetwick can be used as a simple showcase of how to use wicket and solr:

  1. show the use of facets and facet queries
  2. show date facets as explained here
  3. make autocompletion work
  4. instant results when you select a suggestion from autocompletion, as explained in this video

If you are programming some twitter stuff you should keep an eye on the following points:

  • spam detection as explained in this post
  • how you can use OAuth with twitter4j and wicket
  • transform tweets with clickable users, links and hashtags
  • translate tweets with google translate

If you are new to wicket

  • Jetwick is configured so that if you run ‘mvn jetty:run’ you can update html and code, hit refresh in the browser 1-2 seconds later and see the updated results. Css changes are applied immediately
  • query wikipedia and show the results in a lazyload panel

Some solr gems:

  • a simple near realtime setup with solr, although we make heavy usage of facets, where a lot of autowarming is required.
  • if you re-enable the user search you can use twitter’s person suggestions on your own data. I’m relatively sure that twitter uses the ‘more like this’ feature of lucene, which jetwick had implemented with solr.

fluid database and the PermGenSpace checker

  • fluid updates of your database from hibernate changes (via liquibase). Hint: at the moment we only use the db for the ytags
  • a simple helper method ‘getPermGenSpace’ to check if a reload is possible (see the sketch below)
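
jetwick’s real helper may look different; a minimal sketch of such a check queries the PermGen pool via JMX:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class PermGenCheck {
    /** @return used fraction of the permanent generation, or -1 if unknown */
    public static double permGenUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // the pool is named "PS Perm Gen", "CMS Perm Gen" etc. depending on the GC
            if (pool.getName().contains("Perm Gen") && pool.getUsage().getMax() > 0)
                return pool.getUsage().getUsed() / (double) pool.getUsage().getMax();
        }
        return -1;
    }
}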

The following interesting issues are still open:

  • using a storage for tweets (mysql or redis or …). This will increase the indexing time dramatically, because we had to switch to a pure solr solution (we had problems with h2 and hibernate for over 4 mio tweets)
  • a mature queue like 0mq and protobuf or something else
  • a ‘real’ realtime solution for solr, if we use solr from trunk

jsii – full text search in 1K LOC of JavaScript!

In the previous blog post I tried to introduce node.js and its nice features. Today I will introduce my little search engine prototype called jsii (javascript inverted index).

jsii provides an in-memory inverted index within approx. 1000 lines of JavaScript. Some more lines are necessary to set up a server via node.js, so that the index is queryable via http and returns Solr compatible json or xml. The sources are available @github:

git clone git@github.com:karussell/jsii.git

Try it out here: http://pannous.info:8124/select?q=google – e.g. filter queries work like id:xy and queries with sorting work like &sort=id asc. The parameters start and rows can be used for paging. For those who come too late, e.g. because my server crashed or something ;-), here is an image of the xml response:


Solr XML Response

The solr compatible xml response format makes it possible to use jsii from applications that are using SolrJ. For example I tried it for Jetwick and the basic search worked – just specify the xml response parser:

solrj.setParser(new XMLResponseParser());

His-story

The first thing I needed was a BitSet analogon in JavaScript to perform term look-ups fast and combine them via an AND bit-operation. Therefore I took the classes and tests from a GWT patch and made them work with my jasmine specs.

While trying to understand the basics of a scoring function I stumbled over the lucene docs and this thread, which mentions ‘Section 6 of a book‘ as a good reference on that subject.

My understanding of the basics is now the following:

  • The term frequency (tf) weights documents differently. E.g. document1 contains ‘java’ 10 times but doc2 contains it 20 times, so doc2 is more important for the query ‘java’. If you index tweets you should do tf = min(tf, 3). Otherwise you will often get tweets ala ‘java java java java java java…’ instead of important ones. So for tweets a higher entropy is also relevant
  • The inverse document frequency (idf) gives certain terms a higher (or lower) weight. If a term occurs in all documents its weight should be low, to make that term of a query less important compared to other terms which match fewer documents
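
As a simplified sketch (not Lucene’s exact formula), a term weight combining both ideas could be computed like this:

static double weight(int termFreq, int docFreq, int numDocs) {
    double tf = Math.min(termFreq, 3); // cap the term frequency for tweets
    double idf = 1 + Math.log(numDocs / (double) (docFreq + 1)); // rare terms weigh more
    return tf * idf;
}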

With jsii you can grab docs from a solr index or feed it via the javascript api. jsii is very basic and simple, but it seems to work reasonably fast: I get fair response times of under 50ms with ~20k tweets, although I didn’t invest time to improve performance. There are some bugs, and yes, jsii is a memory hog, but besides this it is amazing what can be done with a ‘script’ language. BTW: at the moment jsii is a 100% real time search engine, because it does not support transactions or warming up 😉

Feeding Solr with its own Logs

I always looked for a simple way to visualize our log data, e.g. from solr. At that time I had in mind a combination of gnuplot and some shell scripts, but this session from the Lucene Revolution changed my idea. (Look here for all videos from Lucene Revolution.)

I thought: “hey, that’s it! Just put the logs into solr!” So I coded something which simply reads the log files, and named it Sogger. Without sharding, without message queues, … but it should work on real systems without any changes to your system (though probably with changes to Sogger).

I hope Sogger doesn’t suck, but it does not come with any warranty, so use it with care! And: It is only a proof of concept – nothing comparable to the guys from loggly.com
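
In SolrJ terms, the core of such a log feeder boils down to the following sketch; the field names are assumptions, not Sogger’s actual schema:

SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
BufferedReader reader = new BufferedReader(new FileReader("solr.2010-10-25.log"));
String line;
int id = 0;
while ((line = reader.readLine()) != null) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id++);
    doc.addField("message", line); // a real feeder also parses date, level, class, …
    server.add(doc);
}
server.commit(); // make the fed lines searchable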

To get your logs sogged:

  • Download the ‘Sogger’ code via:
    hg clone http://timefinder.hg.sourceforge.net/hgroot/timefinder/sogger sogger-code
    
Download Solr from trunk.
    svn co -r  1023329 https://svn.apache.org/repos/asf/lucene/dev/trunk solr-code
    

Sogger doesn’t necessarily need the trunk version, but I haven’t tested it with other versions yet

  • compile solr and Sogger with ant
  • cd solr-code/solr/example/
  • copy solrconfig.xml, schema.xml from Sogger into solr/conf
  • copy the *.vm files from Sogger into the files at solr/conf/velocity/
  • start solr
    java -jar start.jar
  • start feeding your logs
    cd sogger-code/
    java -jar dist/Sogger.jar url=http://localhost:8983/solr logFile=data/solr.2010-10-25.log.gz
    
  • to search your logs do:
    http://localhost:8983/solr/browse?q=twitter

Now you should see something like this

Sogger has several advantages over simple “grep-ing” or scripting with your solr logs:

  • full text search. near real time: ~1min 😉
  • performance. I hope committing every minute does not make solr a lot slower
  • filtering by log level: Quickly find warnings and exceptions
  • filtering by webapp: If you have multiple apps or solr cores which are logging into the same file filtering is really easy with solr (with grep too, but you’ll have to re-grep the whole log …)
  • open source: you can change the feeding method I used and take care of your special needs. Tell me if you need assistance!
  • new log lines will be detected and committed ala tail -f
  • besides text files sogger accepts and detects compressed (zip, gzip/gz) files ala zgrep. So you don’t need to change your log handlers or preprocess the files.

to do’s:

  • make the log format customizable within a property file:
    line1=regular expression pattern1
    line2=regular expression pattern2
  • read and monitor multiple log files
  • make it a solr plugin via special UpdateHandler?
  • an xy plot (or bar chart) in velocity for some facets or facet queries would be nice. Something like I had done before with wicket.
  • I don’t like velocity … although it is sufficient for this … or should we use wicket!?

Twitter Search Jetwick – powered by Wicket and Solr

How different is a quickstart project from production?

Today we released jetwick. With jetwick I wanted to build a service to find similar users at twitter based on their tweeted content, not based on the following-list as is possible on other platforms:

Not only the find-similar feature is nice, the topics (to the right of the user name; gray) also give a good impression of which topics a user tweets about. The first usable prototype was ready within one week! I used lucene, vaadin and db4o. But I needed facets, so I switched from lucene to solr. The transformation took only ~2 hours. Really! Test based programming rocks 😉 !

Now users told me that jetwick is slow on ‘old’ machines. It took me some time to understand that vaadin uses javascript a lot and that inappropriate usage of layouts can affect performance negatively in some browsers. So I had the choice to stay with vaadin and improve the performance (with different layouts) or switch to another web UI. I switched to wicket (twitter noise). It is amazingly fast. This transformation took some more time: 2 days. After this I was convinced by the performance of the UI. The programming model is quite similar (‘swing like’), although vaadin is easier and therefore faster to implement. While working on this I could improve the tweet collector, which searches twitter for information and stores the results in jetwick.

After this something went wrong with the db. It was very slow for >1 mio users. I spent at least one week tweaking the performance of db4o (file >1GB). It improved, but it wouldn’t have been sufficient for production. Then I switched to hibernate (yesql!). This switch took me again two weeks and several frustrating nights. Db4o is so great! Ok, now that I know hibernate better I can say: hibernate is great too, and I think the most important feature (== disadvantage!) of hibernate is that you can tweak it nearly everywhere: e.g. you can say that you only want to count the results, that you want to fetch some relationships eagerly and some lazily, and so on. Db4o wasn’t that flexible. But hibernate has another drawback: you will need to upgrade the db schema yourself. Or you do it like me: use liquibase, which works perfectly in my case after some tweaking!

Now that we had the search, it turned out that this user-search was quite useful for me, as I wanted to have some users I could follow. But alpha testers didn’t get the point of it. And then, the shock at the end of July: twitter released a find-similar feature for users! Damn! Why couldn’t they wait two months? It is so important to have a motivation … 😦 And some users seem to really like those user suggestions. Ok, some users were disgusted when they recognized this new feature. But I like it!

BTW: I’m relatively sure that the user-suggestions are based on the same ‘more like this’ feature (from Lucene) that I was using, because for my account I got nearly the same users suggested, and somewhere in a comment I read that twitter uses solr for the user search. Others seem to have gotten a shock too 😉

After the first shock I decided to switch again: from user-search to a regular tweet search where you can get more information out of those tweets. You can see at a glance which topics a user tweets about, or search for your original url. Jetwick tries to store expanded URLs where possible. It is also possible to apply topic, date and language filters. One nice consequence of a tweet-based index is that it is possible to search through all my tweets for something I forgot:

Or you could look at all those funny google* accounts.

So, finally. What have I learned?

From a quick-start project to production many, if not all, things can change: tools, layout and even the main features … and we’ll see what comes next.

How to Test Apache Solr(J)?


public class SolrSearchTest extends AbstractSolrTestCase {

    private SolrServer server;

    @Override
    public String getSchemaFile() {
        return "solr/conf/schema.xml";
    }

    @Override
    public String getSolrConfigFile() {
        return "solr/conf/solrconfig.xml";
    }

    @Before
    @Override
    public void setUp() throws Exception {
        super.setUp();
        server = new EmbeddedSolrServer(h.getCoreContainer(), h.getCore().getName());
    }

    @Test
    public void testFirstTry() throws Exception {
        // e.g. add some docs via solrJ; createDoc converts one of your
        // entities into a SolrInputDocument
        server.add(createDoc(entity1));
        server.add(createDoc(entity2));
        server.add(createDoc(entity3));
        server.add(createDoc(entity4));
        server.add(createDoc(entity5));
        // without a commit the following query won't see the new docs
        server.commit();

        // now query; MyEntity stands for your own domain class and
        // readDoc converts a SolrDocument back into such an entity
        List<MyEntity> myEntities = new ArrayList<MyEntity>();
        SolrQuery query = new SolrQuery("text:peter").setQueryType("standard");
        QueryResponse rsp = server.query(query);
        SolrDocumentList docs = rsp.getResults();
        for (SolrDocument sd : docs) {
            myEntities.add(readDoc(sd));
        }

        assertEquals("peter", myEntities.get(0).getText());
        assertEquals(5, rsp.getResults().getNumFound());
    }
}

Another approach is documented here.

My Links for Apache Solr 1.4

Here is my Solr/Lucene Link list. Last update: Oct’ 2010

Solr

Feature and Get Started Overview

Query

Multiple Cores

Faceting/Navigators

Grouping/Field Collapsing

Result Highlighting

Config Xml

  • Caching -> performance boost: set HashDocSet to 0.005 of all documents!

Statistics with the StatsComponent

Updating/Indexing

Replication for Solr >1.4

  • See SOLR-561 for more information.
  • Scaling article
  • Dashboard via solr/admin/replication/index.jsp
  • index version via solr/replication?command=details (if we used ?indexversion this would always return 0?)
  • linux script to monitor health of replication
  • bugs: SOLR-1781 (and SOLR-978)

Scaling Solr

SolrJ

Get source via:

Tips and Tricks

  • If you have heavy commits (‘realtime updates’) don’t miss this thread about ‘Tuning Solr caches with high commit rates (NRT)’ from Peter Sturge

Lucene

Lucene FAQ

Did you mean

Highlighting

When to prefer Lucene over Solr? Or should I use Hibernate Search?