I have a file which contains several lines as follows:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>
I want to extract all lines containing both and including the header lines.
I have tried using grep, but it only extracts the sequence lines but not the header lines.
grep <pattern_1> | grep <pattern_2> input.fasta > output.fasta
How to extract lines containing both the patterns and the headers in Linux? The patterns can be present anywhere in the lines. Not limited to start or end of the lines.
Expected output:
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
$ grep -A 1 header[12] file
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
man grep:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
grep -B 1 pattern_[12]could work also, but you have several pattern_1s in the sample data so... not this time.
You can easily do that with awk like this:
awk '/^>/{h=$0;next}
/<pattern_1>/&&/<pattern_2>/{print h;print}' input.fasta > output.fasta
And here is a sed solution which yields the desired output as well:
sed -n '/^>/{N;/<pattern_1>/{/<pattern_2>/p}}' input.fasta > output.fasta
If it is likely that multiline records exist, you can use this:
awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
/^>/ {r=$0;p=0;next}
!p {r=r ORS $0;if(chk()){print r;p=1};next}
p
function chk( tmp){
tmp=gensub(/\n/,"","g",r)
return (tmp~pat1&&tmp~pat2)
}' input.fasta > output.fasta
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With