
Extract lines containing two patterns

I have a file which contains several lines as follows:

>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
>header3
<pattern_1>ATGGCCACCAACAACCAGAGCTCCC
>header4
GACCGGCACGTACAACCTCCAGGAAATCGTGCCCGGCAGCGTGTGGATGGAGAGGGACGTG
>header5
TGCCCCCACGACCGGCACGTACAAC<pattern_2>

I want to extract every line that contains both patterns, together with its header line.

I have tried using grep, but it extracts only the sequence lines, not the header lines.

grep '<pattern_1>' input.fasta | grep '<pattern_2>' > output.fasta

How can I extract the lines containing both patterns, together with their headers, on Linux? The patterns can appear anywhere in a line, not only at the start or end.

Expected output:

>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>
Callie asked Nov 20 '25 10:11


2 Answers

$ grep -A 1 'header[12]' file
>header1
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCGGGCCTCTTTTCCTGACGGCCGCCCCCACTGCCCCCACGACCGGCCCGTACAAC<pattern_2>
>header2
<pattern_1>CGGCGGGCAGATGGCCACCAACAACCAGAGCTCCCTGGCCTGCAATCACTACTCGTGTTTTGCCACCACTGCCCCCACGACCGGCACGTACAAC<pattern_2>

man grep:

   -A NUM, --after-context=NUM
          Print  NUM  lines  of  trailing  context  after  matching lines.
          Places  a  line  containing  a  group  separator  (--)   between
          contiguous  groups  of  matches.  With the -o or --only-matching
          option, this has no effect and a warning is given.

   -B NUM, --before-context=NUM
          Print NUM  lines  of  leading  context  before  matching  lines.
          Places   a  line  containing  a  group  separator  (--)  between
          contiguous groups of matches.  With the  -o  or  --only-matching
          option, this has no effect and a warning is given.

grep -B 1 'pattern_[12]' could also work, but it matches lines containing either pattern, and the sample data has lines with only one of them (under header3 and header5), so not this time.
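If you want to stay with grep, one option is a single extended regular expression that requires both patterns on the same line, in either order, combined with -B 1 to pull in the header. This is only a sketch: it assumes every sequence sits on exactly one line directly below its header, and the file names demo.fasta and both.fasta and the shortened sequences are made up for illustration.

```shell
# Build a small sample file (hypothetical, shortened sequences).
cat > demo.fasta <<'EOF'
>header1
<pattern_1>CGGCGGGCAG<pattern_2>
>header2
<pattern_1>ATGGCCACC
>header3
<pattern_2>TGCCCCCAC<pattern_1>
EOF

# Match lines that contain both patterns, in either order, and print
# the line before each match (the header). grep inserts "--" between
# non-adjacent groups of matches, so filter those separator lines out.
grep -B 1 -E '<pattern_1>.*<pattern_2>|<pattern_2>.*<pattern_1>' demo.fasta |
  grep -v '^--$' > both.fasta

cat both.fasta
```

With this sample, header1 and header3 are kept with their sequences and header2 is dropped, because its sequence contains only the first pattern.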

James Brown answered Nov 21 '25 23:11


You can easily do that with awk like this:

awk '/^>/{h=$0;next}
     /<pattern_1>/&&/<pattern_2>/{print h;print}' input.fasta > output.fasta

And here is a sed solution which yields the desired output as well:

sed -n '/^>/{N;/<pattern_1>/{/<pattern_2>/p}}' input.fasta > output.fasta

If multiline records are likely, you can use this (gensub() is specific to GNU awk, so run it with gawk):

awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
/^>/ {r=$0;p=0;next}
!p {r=r ORS $0;if(chk()){print r;p=1};next}
p

function chk(   tmp){
    tmp=gensub(/\n/,"","g",r)
    return (tmp~pat1&&tmp~pat2)
}' input.fasta > output.fasta
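For awk implementations without gensub(), the same idea can be sketched in a different way: buffer each record, join its sequence lines, and print the record only when the joined sequence contains both patterns. This is a variant and not the script above. It holds a whole record in memory before printing, it treats the patterns as fixed strings via index() instead of regular expressions, and the file names multi.fasta and multi_out.fasta and the helper emit() are made up for illustration.

```shell
# Sample with one record whose patterns sit on different lines
# (hypothetical, shortened sequences).
cat > multi.fasta <<'EOF'
>rec1
<pattern_1>CGGCGG
GCAGATGGCC
ACCAAC<pattern_2>
>rec2
<pattern_1>ATGGCC
ACCAACAACC
EOF

awk -v pat1='<pattern_1>' -v pat2='<pattern_2>' '
# Print the buffered record if its joined sequence holds both patterns.
function emit() {
    if (hdr != "" && index(seq, pat1) && index(seq, pat2))
        printf "%s\n%s", hdr, body
}
/^>/ { emit(); hdr = $0; seq = ""; body = ""; next }  # new record starts
     { seq = seq $0; body = body $0 "\n" }            # collect sequence lines
END  { emit() }                                       # flush the last record
' multi.fasta > multi_out.fasta

cat multi_out.fasta
```

Because the sequence lines are concatenated before the check, a pattern that is split across a line break is still found, while the record itself is printed with its original line breaks.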
oguz ismail answered Nov 21 '25 23:11


