Monday, June 29, 2009

Suffix Trees: Longest Common Substring and Diff Implementation

The Longest Common Substring (LCS) is the longest string that is a substring of two or more strings.
LCS could be found with using of Generalized Suffix Tree (GST).

Steps to find LCS:
  1. build generalized suffix tree:
    1. concatenate original strings;
    2. build suffix tree.
  2. calculate special status for each node of the generalized suffix tree:
    • original strings labels;
    • tree height or suffix length.
  3. find longest common substring - traverse the tree to find a node with maximal tree height and labeled with all original strings.

Download source code of Longest Common Substring and Diff Implementation.

Class diagram:

Helper Inner classes:
  • LCSNodeStatus - Suffix Tree Node Status for Longest Common Substring (LCS)
  • CommonSubstr - Result of searching Longest Common Substring. Also represents partial diff result.
Nodes statuses:

Node statuses could be calculated:
Map<Node, LCSNodeStatus> statuses = new HashMap<Node, LCSNodeStatus>();
getLCSNodeStatus(suffixTree.getRootNode(), 0, statuses);

where
private LCSNodeStatus getLCSNodeStatus(Node node, int height, Map<Node, LCSNodeStatus> statuses) {
LCSNodeStatus nodeStatus = new LCSNodeStatus(node, height);
if (node.getEdges().size() == 0) {
return nodeStatus;
}

for (Edge edge : node.getEdges()) {
LCSNodeStatus status = getLCSNodeStatus(edge.getEndNode(),
height + getEdgeHeightForNodeStatus(edge), statuses);

status.addString(getEdgeStringNumber(edge));
nodeStatus.mergeStatus(status);
}
statuses.put(node, nodeStatus);
return nodeStatus;
}

Find LCS:
private CommonSubstr getLcs(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
int max = 0;
int foundBeginIndexes[] = null;

for (LCSNodeStatus status : statuses.values()) {
if ((status.getHeight() > 0) && (status.isAllStrings()) && (max <= status.getHeight())) {
Node node = status.getNode();
int workingBeginIndexes[] = initFoundBeginIndexes();

updateFoundBeginIndexes(beginIndexes, endIndexes, node, status.getHeight(),
statuses, workingBeginIndexes);

if (verifyFoundBeginIndexes(workingBeginIndexes)) {
foundBeginIndexes = workingBeginIndexes;
max = status.getHeight();
}
}
}

if (foundBeginIndexes == null)
return null;

return new CommonSubstr(foundBeginIndexes, max);
}

private int[] initFoundBeginIndexes() {
int beginIndexes[] = new int[texts.length];
for(int i=0; i<texts.length; i++)
beginIndexes[i] = Integer.MAX_VALUE;
return beginIndexes;
}

private void updateFoundBeginIndexes(int beginIndexes[], int endIndexes[], Node node, int height,
Map<Node, LCSNodeStatus> statuses, int[] foundBeginIndexes) {
for (Edge edge : node.getEdges()) {
LCSNodeStatus nodeStatus = statuses.get(edge.getEndNode());
if ((nodeStatus != null) && nodeStatus.isAllStrings()) {
updateFoundBeginIndexes(beginIndexes, endIndexes, edge.getEndNode(),
height + getEdgeHeight(edge), statuses, foundBeginIndexes);
} else {
int stringNumber = getEdgeStringNumber(edge);
int beginIndex = edge.getBeginIndex() - height;

if ((beginIndex < endIndexes[stringNumber]) &&
(beginIndex >= beginIndexes[stringNumber]) &&
(foundBeginIndexes[stringNumber] > beginIndex)) {

foundBeginIndexes[stringNumber] = beginIndex;
}
}
}
}

private boolean verifyFoundBeginIndexes(int[] beginIndexes) {
for(int i=0; i<texts.length; i++)
if (beginIndexes[i] == Integer.MAX_VALUE)
return false;
return true;
}

Diff Implementation:
  • find LCS;
  • call diff for left part of strings (before LCS);
  • call diff for right part of string (after LCS);
  • join results.

To define strings search parts 2 arrays are used: beginIndexes[], endIndexes[].
private List<CommonSubstr> diff(int beginIndexes[], int endIndexes[], Map<Node, LCSNodeStatus> statuses) {
CommonSubstr commonSubstr = getLcs(beginIndexes, endIndexes, statuses);
if (commonSubstr == null)
return new LinkedList<CommonSubstr>();

List<CommonSubstr> prev = diff(beginIndexes, commonSubstr.beginIndexes, statuses);
List<CommonSubstr> next = diff(incIndexes(commonSubstr.endIndexes), endIndexes, statuses);

prev.add(commonSubstr);
if (next != null)
prev.addAll(next);
return prev;
}

See also: my other Suffix Trees posts

Monday, June 8, 2009

Generalized Suffix Trees: Java Applet

A Generalized Suffix Tree is a Suffix Tree for a set of strings.
Usage of Generalized Suffix Trees Java Applet is similar to Suffix Trees Java Applet:
  • enter 2 strings;
  • # and $ are used as strings terminators;
  • press "Build GST" button;
  • it will build suffix tree and create visual presentation in some auto calculated layout;
  • drag and drop nodes to find best layout;
  • you need Java installed and enabled in your browser;
  • applet uses Java implementation of Ukkonen's Algorithm to build Suffix Tree.




Here are most used examples of Generalized Suffix Trees:



Source code of GeneralizedSuffixTree class that is based on SuffixTree class.
public class GeneralizedSuffixTree {
// SOH - START OF HEADING
public static final char TERMINATOR1 = '\u0001';
// STX - START OF TEXT
public static final char TERMINATOR2 = '\u0002';

private String texts[];
private char terminators[];

private String generalized;
private SuffixTree suffixTree;

public GeneralizedSuffixTree(String text1, String text2) {
this(text1, text2, TERMINATOR1, TERMINATOR2);
}

public GeneralizedSuffixTree(String text1, String text2, char terminator1, char terminator2) {
this(new String[]{text1, text2}, new char[]{terminator1, terminator2});
}

public GeneralizedSuffixTree(String texts[], char terminators[]) {
this.texts = texts;
this.terminators = terminators;

StringBuilder sb = new StringBuilder();
for (int i = 0; i < texts.length; i++) {
sb.append(texts[i]);
sb.append(terminators[i]);
}
generalized = sb.toString();
suffixTree = new SuffixTree(generalized);
fixSpanSuffixes(suffixTree.getRootNode());
}

private void fixSpanSuffixes(Node node) {
for (Edge edge : node.getEdges()) {
for (int i = 0; i < texts.length; i++) {
if ((edge.getBeginIndex() <= getGeneralizedSubstringLength(i)) &&
(edge.getEndIndex() > getGeneralizedSubstringLength(i))) {

edge.setEndIndex(getGeneralizedSubstringLength(i));
continue;
}
}
fixSpanSuffixes(edge.getEndNode());
}
}

private int getGeneralizedSubstringLength(int n) {
int length = 0;
for (int i = 0; i <= n; i++) {
length += texts[i].length() + 1;
}
return length - 1;
}
}

See also: my other Suffix Trees posts

Suffix Trees: Java Applet Sources

Suffix Trees Java Applet is based on JFC/Swing and Applets.

Applet main classes:
  • STApplet - main applet class;
  • STControlPanel - control panel ;
  • STDrawPanel - draw panel.
Applet main classes diagram:

GUI classes:
  • GUIObject - base class;
  • GUIObserver - observer interface;
  • GUINode - suffix tree node;
  • GUIEdge - suffix tree edge;
  • GUILabel - suffix label, linked to edge;
  • GUILink - suffix link;
  • ArrowHead - helper arrow shape.
GUI classes diagram:


Download Suffix Trees Java Applet Sources

See also: my other Suffix Trees posts

Suffix Trees: sample trees

Here are most used samples of suffix trees.
They are built online with Suffix Trees Java Applet









See also: my other Suffix Trees posts

Suffix Trees: Java Applet

How to use Suffix Trees Java Applet:
  • enter string;
  • press "Build Suffix Tree" button;
  • it will build suffix tree and create visual presentation in some auto calculated layout;
  • drag and drop nodes to find best layout;
  • you need Java installed and enabled in your browser;
  • applet uses Java implementation of Ukkonen's Algorithm.
Some samples are available here.

Applet sources are discussed and could be download here.




Please refer previous posts about Suffix Trees: