Write a program to remove fragment that occur in "all" strings,where a fragment is 3 or more consecutive word.
Example:
Input::
s1 = "It is raining and I want to drive home.";
s2 = "It is raining and I want to go skiing.";
s3 = "It is hot and I want to go swimming.";
Output::
s1 = "It is raining drive home.";
s2 = "It is raining go skiing.";
s3 = "It is hot go swimming.";
Removed fragment = "and i want to"
The program will be tested again large files. Efficiency will be taken into consideration.
Assumptions: Ignore capitalization ,punctuation. but preserve in output.
Note: Take care of cases like
a a a a a b c b c b c b c where removing would create more fragments.
Hash three word phrases into an int and store them in an array, for all strings. reduces to array of numbers like
1 2 3 4 5 
3 5 7 9 8
9 3 1 7 9
Problem reduces to intersection of arrays.
sort the arrays. (k * nlogn)
keep k pointers. if all equal match found. else increment the pointer pointing to least value. To solve for the Note above. I was thinking of doing a lazy delete, i.e mark phrases for deletion and delete at the end.
Are there cases where my solution might not work? Can we optimize my solution/ find the best solution ?
First observation: replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.
Now you have the problem reduced to remove the longest letter sequence that appears in all words of a given list. So you have to compute the longest common substring for a set of "words". You find it using a generalized suffix tree as this is the most efficient algorithm. This should do the trick and I believe has the best possible complexity.
The first step is as already suggested by izomorphius:
Replace each word with a single "letter" in a big alphabet(i.e. hash the worlds in some way), remove whitespaces and punctuation.
For the second you don't need to know the longest common substring - you just want to erase it from all the strings. Note that this is equivalent to erasing all common substrings of length exactly 3, because if you have a longer commmon substring, then its substrings with length 3 are also common. To do that you can use a hash table (storing key value pairs).
Just iterate over the first string and put all it's 3-substrings into the hash table as keys with values equal to 1. Then iterate over the second string and for each 3-substring x if x is in the hash table and its value is 1, then set the value to 2. Then iterate over the third string and for each 3-substring x, if x is in the hash table and its value is 2, then set the value to 3. ...and so on. At the end the keys that have the value of k are the common 3-substrings.
Now just iterate once more over all the strings and remove those 3-substrings that are common.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With