Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Longest Common Substring without Splitting words

Tags:

java

string

I am trying to implement modified version of Longest Common Substring where i do not want to split the words.

eg:

String s1 = "hi there how are you";
String s2 = "hi there how ar";

So the output should be: o/p: hi there how.

I am first calculating the Longest common Substring and then filtering any words that are split and below is the code for the same:

private static void longestCommonSubstring(String S1, String S2) {
        int Start = 0;
        int Max = 0;
        for (int i = 0; i < S1.length(); i++) {
            for (int j = 0; j < S2.length(); j++) {
                int x = 0;
                while (S1.charAt(i + x) == S2.charAt(j + x)) {
                    x++;
                    if (((i + x) >= S1.length()) || ((j + x) >= S2.length()))
                        break;
                }
                if (x > Max) {
                    Max = x;
                    Start = i;
                }
            }
        }
        System.out.println("S1 " + S1 + " S2 " + S2 + " " + Start + " " + Max);
        System.out.println("ans " + S1.substring(Start, (Start+Max)));


        if(Start != 0){
            if((S1.charAt(Start-1) == ' ')){
                if(Start+Max != S1.length()){
                    if((S1.charAt(Start+Max) == ' ')){
                        System.out.println(S1.substring(Start, (Start + Max)));
                    }
                }
            }
            else if((S1.charAt(Start+Max) == ' ')){
                int index = S1.indexOf(' ', Start);
                System.out.println(S1.substring(index+1, (Start + Max)));
            }
            else{
                int index = S1.indexOf(' ', Start);
                int lastIndex = S1.lastIndexOf(' ', (Start+Max));
                if(index+1 != lastIndex+1){
                    System.out.println(S1.substring(index+1, lastIndex));
                }
                else{
                    System.out.println(" ");
                }
            }
        }
        else if(Start == 0 && Start+Max != S1.length() && (S1.charAt(Start+Max) == ' ')){
            System.out.println(S1.substring(Start, (Start + Max)));
        }
        else if(Start+Max != S1.length()){
            int index = S1.lastIndexOf(' ', (Start+Max));
            System.out.println(S1.substring(Start, index));
        }
        else{
            System.out.println(S1.substring(Start, (Start + Max)));
        }


    }

But it is failing for the cases like:

String s1 = "hi there how are you";
String s2 = "i ere ho ar you";

where the output should have been "you" but is blank, because the longest common Substring is "ere ho".

Thanks for the help.

like image 210
bna Avatar asked Dec 13 '25 18:12

bna


1 Answers

Based on karthik, and after bna's comment that any character-based answer was flawed:

public static ArrayList<String> stringToTokens(String s) {
    ArrayList<String> al = new ArrayList<String>();
    StringBuilder word = new StringBuilder();
    boolean inWord = !s.isEmpty() && Character.isLetter(s.charAt(0));
    for (int i=0; i<s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isLetter(c)) {
            word.append(c);
        } else if (inWord) {
            if (word.length() > 0) {
                al.add(word.toString());
                word.setLength(0);
            }
            al.add(""+c);
        } else {
            al.add(""+c);
        }
    }
    if (word.length() > 0) {
        al.add(word.toString());
    }
    return al;
}

public static String longestCommonWithWords(String sa, String sb) {
    ArrayList<String> a = stringToTokens(sa);
    ArrayList<String> b = stringToTokens(sb);
    int m[][] = new int[a.size()][b.size()];
    int bestLength = 0;
    HashSet<Integer> bestSet = new HashSet<Integer>();
    for (int i=0; i<a.size(); i++) {
        for (int j=0; j<b.size(); j++) {
            if (a.get(i).equals(b.get(j))) {
                m[i][j] = (i==0 || j==0) ? 1 : m[i-1][j-1]+1;
                if (m[i][j] > bestLength) {
                    bestLength = m[i][j];
                    bestSet.clear();
                    bestSet.add(i);
                } else if (m[i][j] == bestLength) {
                    bestSet.add(i);
                }
            } else {
                m[i][j] = 0;
            }
        }
    }
    // return first result (or empty string if none)
    StringBuilder result = new StringBuilder();
    if (bestLength > 0) {
        int end = bestSet.iterator().next();
        int start = end - bestLength;
        for (int i=start; i<end; i++) {
            result.append(a.get(i+1));
        }
    }
    return result.toString();
}

Tokenization is straightforward (a StringTokenizer would have worked too, but this version handles strange characters better). The LCS algorithm comes straight from the pseudocode in https://en.wikipedia.org/wiki/Longest_common_substring_problem#Pseudocode

like image 129
tucuxi Avatar answered Dec 16 '25 07:12

tucuxi