For jetwick I needed yet another string algorithm and stumbled over this cool and common problem: trying to find the longest substring of two strings. Be sure that you understand the difference to the LC **sequence** problem.

For example if we have two strings:

*Please, peter go swimming!*

and

*I’m peter goliswi*

The algorithm should print out *‘ peter go’*. The longest common substring algorithm can be implemented in an efficient manner with the help of suffix trees.

But in this post I’ll try to explain the bit less efficient ‘dynamic programming‘ version of the algorithm. Dynamic programming means that you can reuse already calculated information in a later step or you break the algorithm into parts to reuse information. To understand the algorithm you just need to fill the entries of an integer-array with the lengths of the identical substrings. Assume we use i for the horizontal string (please …) and j for the vertical string. Then the algorithm hits at some time i=19 and j=0 for one identical character ‘i’. Then the line

num[i][j] = 1;

is executed and saves the lengths of the 1 length identical substring.

please, peter go swimming i 0000000000000000000100100 ' 0000000000000000000000000 m 0000000000000000000011000 0000000100000100100000000 p 1000000020000000000000000 e 0010010003000000000000000 t 0000000000400000000000000 e 0010010001050000000000000 r 0000000000006000000000000 0000000100000700100000000 g 0000000000000080000000000 o 0000000000000009000000000 l 0100000000000000000000000 i 0000000000000000000100100 s 0001000000000000010000000 w 0000000000000000002000000 i 0000000000000000000300100

Later on it hits the m characters and saves 1 two times to the array but then at i=7 and j=3 it starts our substring and saves 1 for the space character. Then some loops later it reaches i=8 and j=4 Now it reuses the already calculated “identical-length” of 1. It will do:

num[8][4] = 1 + num[7][3];

and we get 2. So, we now know we have a substring with two 2 characters. And with

if (num[i][j] > maxlen)

we make sure that we overwrite the existing longest substring (stored in the StringBuilder) ONLY IF there is a longer substring found and either append the character (if it is the current substring in progress):

sb.append(str1.charAt(i));

or we can start a longer substring. See the java code (mainly from wikipedia) for yourself:

public static String longestSubstring(String str1, String str2) { StringBuilder sb = new StringBuilder(); if (str1 == null || str1.isEmpty() || str2 == null || str2.isEmpty()) return ""; // ignore case str1 = str1.toLowerCase(); str2 = str2.toLowerCase(); // java initializes them already with 0 int[][] num = new int[str1.length()][str2.length()]; int maxlen = 0; int lastSubsBegin = 0; for (int i = 0; i < str1.length(); i++) { for (int j = 0; j < str2.length(); j++) { if (str1.charAt(i) == str2.charAt(j)) { if ((i == 0) || (j == 0)) num[i][j] = 1; else num[i][j] = 1 + num[i - 1][j - 1]; if (num[i][j] > maxlen) { maxlen = num[i][j]; // generate substring from str1 => i int thisSubsBegin = i - num[i][j] + 1; if (lastSubsBegin == thisSubsBegin) { //if the current LCS is the same as the last time this block ran sb.append(str1.charAt(i)); } else { //this block resets the string builder if a different LCS is found lastSubsBegin = thisSubsBegin; sb = new StringBuilder(); sb.append(str1.substring(lastSubsBegin, i + 1)); } } } }} return sb.toString(); }

What luck! I was just looking for an algorithm to do just what you describe. Thank you so much.

Cool 🙂

Pingback: Blog bookmarks 04/16/2011 « My Diigo bookmarks

Quadratic-time Algorithm for the String Constrained LCS Problem http://arxiv.org/abs/1106.6342

Hi Karussell! Do note that you don’t need to use m*n space. It is enough to use a 2-dimensional array, as you’re only using the values from the previous row (j – 1).

what does i < str1.length(); mean in the first for loop?

is it the same as i<str1.length();?

sorry i meant i < str1.length();

yes I’ll try to update

what do you mean?

what is the time and space complexity of this algorithm?

m*n for both cases

This doesn’t work for really long strings. I am trying to compare strings of size in the million characters and the line “int[][] num = new int[str1.length()][str2.length()];” tries to make an array so huge that it runs out of memory. Saying that each string is 1m characters, the array size would be 1,000,000,000,000…

This works for long strings as well but you’ll just need a different int storage … of course, arrays are limited to 2gb lengths … BTW: to find common substrings for large text could be better done with lucene