Snacktory – Yet another Readability clone. This time in Java.

For Jetslide I needed a Java clone of Readability. There are already some tools, but I wanted more and different features, so I adapted the existing goose and jreadability and added some stuff. Check out the detection quality at Jetslide and fork it to improve it – as of today snacktory is free software 🙂 !

Copied from the README:

Snacktory
This is a small helper utility for people who don’t want to write yet another Java clone of Readability. In most cases it is applied to articles, although it should work for any website to find its main area and extract its text and its most important picture. Have a look at Jetslide, where Snacktory is used. Jetslide is a new way to consume news: it does not only display the website’s title but also a small preview of the site (‘a snack’) and the important image, if available.
License
The software is licensed under the Apache 2 License and comes with NO WARRANTY
Features
Snacktory borrows some ideas from jReadability and goose (ideas + a lot of test cases)
The advantages over jReadability are
  • better article text detection than jReadability
  • only Java deps
  • more tests
The advantages over Goose are
  • similar article text detection, although better detection for non-English sites (German, Japanese, …)
  • snacktory does not depend on the word count in its text detection to support CJK languages
  • no external Services required to run the core tests => faster tests
  • better charset detection
  • with caching support
  • skipping some known filetypes
The disadvantages to Goose are
  • only the detection of the top image and the top text is supported at the moment
  • some tests which passed in Goose do not pass in Snacktory. But a bunch of other useful sites were added (stackoverflow, facebook, other languages …)
Usage
HtmlFetcher fetcher = new HtmlFetcher();
// set cache, e.g. take the map implementation from google collections:
// fetcher.setCache(new MapMaker().concurrencyLevel(20).
//               maximumSize(count).expireAfterWrite(minutes, TimeUnit.MINUTES).makeMap());
JResult res = fetcher.fetchAndExtract(url, resolveTimeout, true);
res.getText(); res.getTitle(); res.getImageUrl();
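
If you don’t want to pull in google collections just for the cache, a size-bounded map can also be sketched with the plain JDK. This is only a rough sketch under the assumption, suggested by the comment above, that the cache may be any java.util.Map; the class BoundedCacheSketch and the method boundedCache are made-up names and not part of Snacktory, and unlike the MapMaker variant this map does not expire entries after a time.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

public class BoundedCacheSketch {

    // Hypothetical helper, not part of Snacktory: a synchronized map that
    // evicts the least recently accessed entry once maxSize is exceeded.
    public static <K, V> Map<K, V> boundedCache(final int maxSize) {
        // third constructor argument 'true' switches to access order (LRU)
        Map<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;
            }
        };
        return Collections.synchronizedMap(lru);
    }

    public static void main(String[] args) {
        Map<String, String> cache = boundedCache(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so that "b" becomes the eldest entry
        cache.put("c", "3"); // exceeds maxSize, so "b" is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```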

Longest Common Substring Algorithm in Java

For jetwick I needed yet another string algorithm and stumbled over this cool and common problem: finding the longest common substring of two strings. Be sure that you understand the difference to the longest common subsequence problem.

For example if we have two strings:

Please, peter go swimming!

and

I’m peter goliswi

The algorithm should print out ‘ peter go’. The longest common substring algorithm can be implemented in an efficient manner with the help of suffix trees.
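
To make the difference to the subsequence problem mentioned at the top a bit more concrete, here is a small sketch that computes only the lengths for both problems. The class and method names are made up for this illustration. A substring must be one contiguous run, a subsequence may have gaps.

```java
public class SubstringVsSubsequence {

    // Longest common substring: matching characters must be adjacent,
    // so a mismatch resets the run length to 0.
    public static int commonSubstringLength(String a, String b) {
        int[][] num = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    num[i][j] = 1 + num[i - 1][j - 1];
                    best = Math.max(best, num[i][j]);
                }
                // on a mismatch the cell keeps its default value 0
            }
        }
        return best;
    }

    // Longest common subsequence: gaps are allowed, so a mismatch
    // carries over the best value from the left or upper neighbour.
    public static int commonSubsequenceLength(String a, String b) {
        int[][] len = new int[a.length() + 1][b.length() + 1];
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    len[i][j] = 1 + len[i - 1][j - 1];
                } else {
                    len[i][j] = Math.max(len[i - 1][j], len[i][j - 1]);
                }
            }
        }
        return len[a.length()][b.length()];
    }

    public static void main(String[] args) {
        // "ab" (or "de") is the longest common substring
        System.out.println(commonSubstringLength("abcde", "abxde"));   // prints 2
        // "abde" is the longest common subsequence
        System.out.println(commonSubsequenceLength("abcde", "abxde")); // prints 4
    }
}
```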

But in this post I’ll try to explain the slightly less efficient ‘dynamic programming’ version of the algorithm. Dynamic programming means that you break the problem into parts and reuse already calculated information in a later step. To understand the algorithm you just need to fill the entries of an integer array with the lengths of the identical substrings. Assume we use i for the horizontal string (please …) and j for the vertical string. Then at some point the algorithm hits i=19 and j=0, where both strings have the identical character ‘i’. Then the line

num[i][j] = 1;

is executed and stores the length 1 for this one-character match.

  please, peter go swimming
i 0000000000000000000100100
' 0000000000000000000000000
m 0000000000000000000011000
  0000000100000100100000000
p 1000000020000000000000000
e 0010010003000000000000000
t 0000000000400000000000000
e 0010010001050000000000000
r 0000000000006000000000000
  0000000100000700100000000
g 0000000000000080000000000
o 0000000000000009000000000
l 0100000000000000000000000
i 0000000000000000000100100
s 0001000000000000010000000
w 0000000000000000002000000
i 0000000000000000000300100
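
If you want to play with such a table yourself, it can be reproduced with a few lines. This is just a sketch with made-up names (LcsTablePrinter, table); it assumes both strings are already lower-cased and that no common substring is longer than 9 characters, otherwise the columns no longer line up.

```java
public class LcsTablePrinter {

    // Fills the table and renders it with str1 as the horizontal header
    // (index i) and one row per character of str2 (index j).
    public static String table(String str1, String str2) {
        int[][] num = new int[str1.length()][str2.length()];
        for (int i = 0; i < str1.length(); i++) {
            for (int j = 0; j < str2.length(); j++) {
                if (str1.charAt(i) == str2.charAt(j)) {
                    num[i][j] = (i == 0 || j == 0) ? 1 : 1 + num[i - 1][j - 1];
                }
            }
        }

        StringBuilder out = new StringBuilder();
        out.append("  ").append(str1).append('\n');
        for (int j = 0; j < str2.length(); j++) {
            out.append(str2.charAt(j)).append(' ');
            for (int i = 0; i < str1.length(); i++) {
                out.append(num[i][j]);
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(table("please, peter go swimming", "i'm peter goliswi"));
    }
}
```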

Later on it hits the m characters and saves 1 to the array twice, but then at i=7 and j=3 it starts our substring and saves 1 for the space character. Some loops later it reaches i=8 and j=4. Now it reuses the already calculated “identical length” of 1. It will do:

num[8][4] = 1 + num[7][3];

and we get 2. So we now know that we have a common substring of 2 characters. And with

if (num[i][j] > maxlen)

we make sure that the longest substring found so far (stored in the StringBuilder) is replaced ONLY IF a longer one is found. In that case we either append the character (if the current substring is the one already in progress):

sb.append(str1.charAt(i));

or we start over with a new, longer substring. See the Java code (mainly from Wikipedia) for yourself:

public static String longestSubstring(String str1, String str2) {
  StringBuilder sb = new StringBuilder();
  if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty())
    return "";

  // ignore case
  str1 = str1.toLowerCase();
  str2 = str2.toLowerCase();

  // java initializes them already with 0
  int[][] num = new int[str1.length()][str2.length()];
  int maxlen = 0;
  int lastSubsBegin = 0;

  for (int i = 0; i < str1.length(); i++) {
    for (int j = 0; j < str2.length(); j++) {
      if (str1.charAt(i) == str2.charAt(j)) {
        if ((i == 0) || (j == 0))
          num[i][j] = 1;
        else
          num[i][j] = 1 + num[i - 1][j - 1];

        if (num[i][j] > maxlen) {
          maxlen = num[i][j];
          // generate substring from str1 => i
          int thisSubsBegin = i - num[i][j] + 1;
          if (lastSubsBegin == thisSubsBegin) {
            // if the current LCS is the same as the last time this block ran
            sb.append(str1.charAt(i));
          } else {
            // this block resets the string builder if a different LCS is found
            lastSubsBegin = thisSubsBegin;
            sb = new StringBuilder();
            sb.append(str1.substring(lastSubsBegin, i + 1));
          }
        }
      }
    }
  }

  return sb.toString();
}
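
The full table is nice for understanding, but each cell only looks at the previous value of i, so two rows are enough. The following is a sketch of such a variant and not the version from Wikipedia: the class name and the variables prev, curr and endInStr1 are my own choices, and instead of maintaining a StringBuilder it cuts the result out of str1 once at the end.

```java
public class LongestSubstringTwoRows {

    // Same recurrence as above, but only the row for i - 1 (prev) and the
    // row for i (curr) are kept, which needs two arrays of str2.length().
    public static String longestSubstring(String str1, String str2) {
        if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty())
            return "";

        // ignore case, as in the original version
        str1 = str1.toLowerCase();
        str2 = str2.toLowerCase();

        int[] prev = new int[str2.length()];
        int[] curr = new int[str2.length()];
        int maxlen = 0;
        int endInStr1 = 0; // exclusive end index of the best match in str1

        for (int i = 0; i < str1.length(); i++) {
            for (int j = 0; j < str2.length(); j++) {
                if (str1.charAt(i) == str2.charAt(j)) {
                    // prev is all zeros for i == 0, so only j == 0 is special
                    curr[j] = (j == 0) ? 1 : 1 + prev[j - 1];
                    if (curr[j] > maxlen) {
                        maxlen = curr[j];
                        endInStr1 = i + 1;
                    }
                } else {
                    // the array is reused, so stale values must be cleared
                    curr[j] = 0;
                }
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return str1.substring(endInStr1 - maxlen, endInStr1);
    }

    public static void main(String[] args) {
        String result = longestSubstring("Please, peter go swimming!", "I'm peter goliswi");
        System.out.println("'" + result + "'"); // prints ' peter go'
    }
}
```

Like the version above, it returns the first longest match in lower case when several substrings share the maximum length.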

3D Rotation in Gimp

  1. Create an additional transparent layer
  2. Now select the layer that should be rotated in 3D
  3. Go to Filters->Map->Map Object
  4. Choose map to box
  5. Click ‘transparent background’
  6. Go to the ‘Box’ tab. One face gets the layer from step 1. All the others get the layer from step 2.
  7. Go to Orientation->Rotation and adjust as desired

 

Re: Firefox add-ons you should consider

This is a short blog post triggered by this post from Jonathan Giles. It is more a list of plugins I consider than of what others should consider. But hopefully at least one plugin of interest is in the list. Feel free to comment and add your own favplug.

  • With All-in-One Sidebar you get fast access to plugins, browser history etc. … to get the Opera feeling in Firefox
  • Add to search bar makes it easy to add any search (e.g. wolfram, topsy.com, …) with one click to your quick search box
  • ScrapBook can take lots of notes and bookmarks (again like Opera)
  • Firebug is the well-known Swiss army knife for web developers
  • Mouse Gestures (again like Opera)
  • Live HTTP headers is interesting for developers
  • FoxyProxy could be interesting for your home office work or something else

Shame on Sourceforge?

Sourceforge restricts access from some more ‘evil’ countries. Read more here. This violates the Open Source Definition of the Open Source Initiative:

“5. No Discrimination Against Persons or Groups
The license must not discriminate against any person or group of persons.”

And if the license must not restrict this, the distributor shouldn’t either! It’s free software and not US software! 😦

So, if they argue with US laws, I have the feeling that they shouldn’t or cannot provide free software hosting any longer. Maybe they should set up a server outside the US?

Or should users from the affected countries use Tor, or should all developers migrate to alternative free software hosting projects?

But it is nice that they allow comments on the post linked above, that’s good:

“sarcastic-man on January 25th, 2010
[…] Maybe one day Americans will wake up and realize that the world is a big place. […]

afsharm on January 25th, 2010
I am an Iranian (an innocent one) and I am not responsible for what ever my government is doing. As nawwark mentioned I’ve sometimes have contributions in SF.NET projects, so why you are denying me from my own works?
It’s against freedom and against FOSS.

bones_0 on January 26th, 2010
Just for the book: I want everybody to access my projects. Beeing a Swiss product (I am Swiss citizend and resident) it’s kind of crazy they are falling under US law now… […]”
But also:
“meonkeys on January 25th, 2010
[…] maybe this will raise awareness of these laws and encourage people to get them changed
lukecrouch on January 26th, 2010
Disclaimer: SourceForge employee.
I speak for all of us here when I say we feel your pain. “Rub us the wrong way” is the nicest possible term for it – very diplomatic on Lee’s part. I host projects here on SourceForge too – ajaxmytop, peardbdeploy, and tangoiconsprite. So blocking these countries not only goes against the free flow of information, but it cuts down my potential audience and collaborators; not to mention that it removes hundreds of thousands of ad impressions from the business!
But at the end of the day, I’m a 20-something web developer with a wife and kid to feed and I have the amazing chance to do that AND try to give my own small contributions to open-source software.  […]”