Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dynamic programming quiz or comparing two blocks of text

Task definition:

I try to write my own diff util. I want to implement inline-search.

Means I have two paragraphs of text. I have to constrant strings from first paragraph (p1) to strings in second paragraph (p2) in such way that sum of common words in constranted strings will be maximal.

And one important point, you can't replace the strings: I mean if you constrant p1[i] to p2[j], than you can't constrant p1[k] to p2[v] if k < i and v < j.

Little example:

Input:

You have two paragraphs:

"Very very very very"         "Very very very"
"bla bla bla"                 "Very very very very very"
"looks like a very dump text" "One more sentence"
"simple text"                 "looks like a peace of ..."
                              "quite simple"
                              "bla bla bla bla"

...and matrix where matrix[i][j] = number of common words in string p1[i] and p2[j]

3 4 0 0 0 0
0 0 0 0 0 3
0 0 0 3 0 0
0 0 0 0 1 0

Output:

You need to constrant them in next way:

----------------               "Very very very"
"Very very very very"          "Very very very very very"
"bla bla bla"                  ----------------
----------------               "One more sentence"
"looks like a very dump text"  "looks like a peace of ..."
"simple text"                  "quite simple"
----------------               "bla bla bla bla"

Or you can just form the next matrix:

(indexes of strings which have the constrants)

p1Indexes: [0, 2, 3] p2Indexes: [1, 3 ,4]

Question:

what is an efficient algorith for this task?

[Not obligatory to read] Faced difficulties:

  1. How to pass index-collection to next iteration: I mean you need to copy all indexes on each iteration
  2. If you want to use dynamic programing, you need to store not only a common wors number but also a two indexes collections for each possible iteration.

Solution:

public void genConditionLCS() {
    int i = -1;
    int j = -1;
    while (true) {
        int[] indexes = nextIndexes(i+1, j+1);
        i = indexes[0];
        j = indexes[1];
        if (i == -1 || j == -1) break;
        firstParagraphIndexes.add(i); 
        secondParagraphIndexes.add(j);
    }
}
private int[] nextIndexes(int i, int j) {
    if ((i > (lcs.length-1)) || (j > (lcs[0].length-1)))
        return new int[] {-1, -1};
    int a = maxBenefit(i + 1, j);
    int b = maxBenefit(i, j + 1);
    int c = maxBenefit(i + 1, j + 1) + lcs[i][j];
    if ((a == 0) && (b == 0) && (c == 0))
        return new int[]{-1, -1};
    else if (a >= b && a >= c)
        return nextIndexes(i+1, j);
    else if (b >= a && b >= c)
        return nextIndexes(i, j+1);
    else //if (c >= a && c >= b)
        return new int[]{i, j};
}

private int maxBenefit(int i, int j) {
    if ((i > lcs.length - 1) || (j > lcs[0].length - 1)) return 0;
    int res = maxBenefit[i][j];
    if (res == -1) {
        int a = maxBenefit(i + 1, j);
        int b = maxBenefit(i, j + 1);
        int c = maxBenefit(i + 1, j + 1) + lcs[i][j];
        res = max(a, b, c);
        maxBenefit[i][j] = res;
    }
    return res;
}
like image 720
VB_ Avatar asked Dec 02 '25 11:12

VB_


1 Answers

Given arrays a[m] and b[n] and given a cost function: benefit(i, j) which calculates the number of common words between elements i and j, your problem can be stated as max_benefit(i, j) which means that i and j are aligned/matched and you need to find out the max benefit and alignment of the remaining part, which is: max(benefit(i + 1, j + 1) + max_benefit(i + 2, j + 2), benefit(i + 2, j + 1) + max_benefit(i + 3, j + 1), benefit(i + 3, j + 1) + max_benefit(i + 4, j + 1), ..., benefit(i + 1, j + 2) + max_benefit(i + 2, j + 3), benefit(i + 1, j + 3) + max_benefit(i + 1, j + 4), ...)

Now, when you first compute max_benefit for any pair of indexes, store the result so that you do not need to reompute it. I. e. check if you have a stored value before computing it; if not, compute it and store the value.

Re faced difficulties:

  1. You can have the array references available as globals / class members, or you can pass the array references as 2 extra arguments: e. g. max_benefit(i, j, a, b) and benefit(i, j, a, b). The arrays won't be copied in most languages.
  2. see the main part of this answer, you just recursively compute and store values so that you do not recompute.
like image 61
necromancer Avatar answered Dec 05 '25 09:12

necromancer



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!