Suppose we have two strings "abcdefgh" and "abudesh". I want for solution to be a list ["ab", "de", "h"]. So, I want a list of maximally connected substrings that are the same for both strings. Is this has a name and what would be a good approach in solving it?
Edit: I need to say that the order is not important in the way that if we have, for example, two strings "abcdefg" and "defkabc", the result is ["abc", "def"].
Using:
from Bio import pairwise2
from itertools import groupby
def maxConnectedSubstrings(strA, strB):
alignment = pairwise2.align.globalxx(strA, strB)[0]
grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
return [''.join(ca for ca,cb in g) for k,g in grouped if k]
print( maxConnectedSubstrings('abcdefgh', 'abudesh') )
# ['ab', 'de', 'h']
First, we align the sequences. The result of alignment = pairwise2.align.globalxx(strA, strB)[0] is:
alignment.seqA = 'abcdefgh'
alignment.seqB = 'abude-sh'
The alignment algorithm found the best way to add '-' in the sequences to align them.
Then, we use groupby on zip(alignment.seqA, alignment.seqB). The zip(...) is a sequence of pairs (character from seqA, character from seqB). We group these pairs with the key lambda p: p[0] == p[1], which gives the following result:
grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
grouped = [
(True, [('a', 'a'),
('b', 'b')]),
(False, [('c', 'u')]),
(True, [('d', 'd'),
('e', 'e')]),
(False, [('f', '-'),
('g', 's')]),
(True, [('h', 'h')])
]
Finally, we discard the False groups, and we join the letters of every True group.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With