Task definition:
I try to write my own diff util. I want to implement inline-search.
Means I have two paragraphs of text. I have to constrant strings from first paragraph (p1) to strings in second paragraph (p2) in such way that sum of common words in constranted strings will be maximal.
And one important point, you can't replace the strings: I mean if you constrant p1[i] to p2[j], than you can't constrant p1[k] to p2[v] if k < i and v < j.
Little example:
Input:
You have two paragraphs:
"Very very very very" "Very very very"
"bla bla bla" "Very very very very very"
"looks like a very dump text" "One more sentence"
"simple text" "looks like a peace of ..."
"quite simple"
"bla bla bla bla"
...and matrix where matrix[i][j] = number of common words in string p1[i] and p2[j]
3 4 0 0 0 0
0 0 0 0 0 3
0 0 0 3 0 0
0 0 0 0 1 0
Output:
You need to constrant them in next way:
---------------- "Very very very"
"Very very very very" "Very very very very very"
"bla bla bla" ----------------
---------------- "One more sentence"
"looks like a very dump text" "looks like a peace of ..."
"simple text" "quite simple"
---------------- "bla bla bla bla"
Or you can just form the next matrix:
(indexes of strings which have the constrants)
p1Indexes: [0, 2, 3]
p2Indexes: [1, 3 ,4]
Question:
what is an efficient algorith for this task?
[Not obligatory to read] Faced difficulties:
Solution:
public void genConditionLCS() {
int i = -1;
int j = -1;
while (true) {
int[] indexes = nextIndexes(i+1, j+1);
i = indexes[0];
j = indexes[1];
if (i == -1 || j == -1) break;
firstParagraphIndexes.add(i);
secondParagraphIndexes.add(j);
}
}
private int[] nextIndexes(int i, int j) {
if ((i > (lcs.length-1)) || (j > (lcs[0].length-1)))
return new int[] {-1, -1};
int a = maxBenefit(i + 1, j);
int b = maxBenefit(i, j + 1);
int c = maxBenefit(i + 1, j + 1) + lcs[i][j];
if ((a == 0) && (b == 0) && (c == 0))
return new int[]{-1, -1};
else if (a >= b && a >= c)
return nextIndexes(i+1, j);
else if (b >= a && b >= c)
return nextIndexes(i, j+1);
else //if (c >= a && c >= b)
return new int[]{i, j};
}
private int maxBenefit(int i, int j) {
if ((i > lcs.length - 1) || (j > lcs[0].length - 1)) return 0;
int res = maxBenefit[i][j];
if (res == -1) {
int a = maxBenefit(i + 1, j);
int b = maxBenefit(i, j + 1);
int c = maxBenefit(i + 1, j + 1) + lcs[i][j];
res = max(a, b, c);
maxBenefit[i][j] = res;
}
return res;
}
Given arrays a[m] and b[n] and given a cost function: benefit(i, j) which calculates the number of common words between elements i and j, your problem can be stated as max_benefit(i, j) which means that i and j are aligned/matched and you need to find out the max benefit and alignment of the remaining part, which is: max(benefit(i + 1, j + 1) + max_benefit(i + 2, j + 2), benefit(i + 2, j + 1) + max_benefit(i + 3, j + 1), benefit(i + 3, j + 1) + max_benefit(i + 4, j + 1), ..., benefit(i + 1, j + 2) + max_benefit(i + 2, j + 3), benefit(i + 1, j + 3) + max_benefit(i + 1, j + 4), ...)
Now, when you first compute max_benefit for any pair of indexes, store the result so that you do not need to reompute it. I. e. check if you have a stored value before computing it; if not, compute it and store the value.
Re faced difficulties:
max_benefit(i, j, a, b) and benefit(i, j, a, b). The arrays won't be copied in most languages.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With