Extract multiple independent regex matches per line

Tags: regex, bash, sed, awk

For the file below, I want to extract the two strings following "XC:Z:" and "XM:Z:". For example:

  • 1st line output should be this: "TGGTCGGCGCGT, GAGTCCGT"
  • 2nd line output should be this: "GAAGCCGCTTCC, ACCGACGG"

The real file has a few more columns and millions of rows, but the following example should give you the idea:

    MOUSE_10        XC:Z:TGGTCGGCGCGT       RG:Z:A  XM:Z:GAGTCCGT   ZP:i:33
    MOUSE_10        XC:Z:GAAGCCGCTTCC       NM:i:0  XM:Z:ACCGACGG   AS:i:16
    MOUSE_10        ZP:i:36 XC:Z:TCCCCGGGTACA       NM:i:0  XM:Z:GGGACGGG   ZP:i:28
    MOUSE_10        XC:Z:CAAATTTGGAAA       RG:Z:A  NM:i:1  XM:Z:GCAGATAG

In addition, each of the following criteria would be a bonus, but none is mandatory:

  • use standard bash tools: awk, sed, grep, etc. (no GAWK, csvtools,...)
  • assume we don't know the order in which XC and XM appear (I'm fairly certain XC almost always comes first, but I am unsure how to check). In the output, however, the XC string should always come before the XM string, if at all possible.

The answers to "awk extract multiple groups from each line" come awfully close, but whenever I try using match(...) I get a "syntax error near unexpected token" message.

Looking forward to your solutions!

Thanks,

Felix

asked Dec 13 '25 23:12 by Felix


1 Answer

With sed, you can capture the run of non-blank characters after XC:Z: and after XM:Z:, then print only the lines where the substitution succeeded:

    sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p' file
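As a quick check, the command can be run against rows taken from the question. This is only a sketch: the function name extract_xc_xm is made up for illustration, and the rows are fed through a here-document instead of a file.

```shell
# Hypothetical wrapper around the sed command above; reads the named
# files, or stdin when called without arguments.
extract_xc_xm() {
    sed -n 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p' "$@"
}

# Two rows copied from the question; the second has an extra column before XC.
extract_xc_xm <<'EOF'
MOUSE_10        XC:Z:TGGTCGGCGCGT       RG:Z:A  XM:Z:GAGTCCGT   ZP:i:33
MOUSE_10        ZP:i:36 XC:Z:TCCCCGGGTACA       NM:i:0  XM:Z:GGGACGGG   ZP:i:28
EOF
# prints:
# TGGTCGGCGCGT, GAGTCCGT
# TCCCCGGGTACA, GGGACGGG
```

Because of -n and the p flag, lines that do not contain both tags in this order produce no output at all.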

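To also handle lines where XM comes before XC, add a second s command for the reversed order and swap the groups in its replacement, so that the XC string is still printed first. The p flag on each s command prints a line only when that substitution was made:

    sed -n -e 's/.*XC:Z:\([^[:blank:]]*\).*XM:Z:\([^[:blank:]]*\).*/\1, \2/p' -e 's/.*XM:Z:\([^[:blank:]]*\).*XC:Z:\([^[:blank:]]*\).*/\2, \1/p' file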
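If the order of the two tags is unknown, another option is to scan the fields in plain POSIX awk instead of encoding both orders in a regex. The following is a minimal sketch, assuming the columns are separated by blanks or tabs and that each tag appears at most once per line (if a tag repeats, the last occurrence wins). The function name xc_xm_awk is hypothetical.

```shell
# Hypothetical helper: prints "XC, XM" for every line that carries both tags,
# regardless of the order in which they appear.
xc_xm_awk() {
    awk '{
        xc = ""; xm = ""
        for (i = 1; i <= NF; i++) {
            # "XC:Z:" and "XM:Z:" are 5 characters long, so the value starts at 6
            if ($i ~ /^XC:Z:/)      xc = substr($i, 6)
            else if ($i ~ /^XM:Z:/) xm = substr($i, 6)
        }
        # skip rows where either tag is missing or has an empty value
        if (xc != "" && xm != "") print xc ", " xm
    }' "$@"
}

# XM deliberately placed before XC; the XC value is still printed first.
printf 'MOUSE_10\tXM:Z:GAGTCCGT\tRG:Z:A\tXC:Z:TGGTCGGCGCGT\n' | xc_xm_awk
# prints: TGGTCGGCGCGT, GAGTCCGT
```

Note that the awk program is wrapped in single quotes. A message such as "syntax error near unexpected token" comes from the shell rather than from awk, so it usually means the parentheses of a call like match(...) reached bash unquoted.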
answered Dec 16 '25 16:12 by SLePort


