I have a big tab file like this:
rs980 INPP5B missense I 749 NP_001284363.1
rs980 INPP5B reference L 749 NP_001284363.1
rs980 INPP5B missense F 749 NP_001284363.1
rs980 INPP5B missense I 913 NP_005531.2
rs980 INPP5B reference L 913 NP_005531.2
rs980 INPP5B missense P 913 NP_005531.2
rs1921 ISG15 reference S 83 NP_005092.1
rs1921 ISG15 missense N 83 NP_005092.1
rs1921 ISG15 missense T 83 NP_005092.1
rs2607 HEATR1 missense K 2133 NP_060542.4
rs2607 HEATR1 missense Q 2133 NP_060542.4
rs2607 HEATR1 reference E 2133 NP_060542.4
and I would like take all rows with 'reference' pattern and move one column of them right after the rows without reference pattern and moreover that have the commonly strings in first, fifth and sixth columns. So the output should be:
rs980 INPP5B missense I 749 NP_001284363.1 L
rs980 INPP5B missense F 749 NP_001284363.1 L
rs980 INPP5B missense I 913 NP_005531.2 L
rs980 INPP5B missense P 913 NP_005531.2 L
rs1921 ISG15 missense N 83 NP_005092.1 S
rs1921 ISG15 missense T 83 NP_005092.1 S
rs2607 HEATR1 missense K 2133 NP_060542.4 E
rs2607 HEATR1 missense Q 2133 NP_060542.4 E
And so on... I got it by an array with awk through two files in whose I split the references from the others but is necessary do it in the same file in bash. Are there any possibility to do this in this more simple way? Thanks in advance.
Yes @Inian. I perform two files, separating references and others. In this way I do by awk the following command something like
awk 'BEGIN {FS=OFS="\t"} NR==FNR {h[$2] = $1; next} {print $1,$2,$3,$4,$5,$6,h[$4] file_1 file_2 > output_file
In which join together the columns than must be commonly in both files and then I create an array by awk to relate these columns with the other column in references that I want to print in the output file. In summary one file is like:
INPP5B missense I rs980;749;NP_001284363.1
INPP5B missense F rs980;913NP_005531.2
and the references:
INPP5B reference L rs980;749;NP_001284363.1
and so on and I apply
awk 'BEGIN {FS=OFS="\t"} NR==FNR {h[$4] = $3; next} {print $1,$2,$3,$4,h[$4]}' file_others file_reference > output_file
This is a brief summary of how I get it, but I need to do in the same original file @Inian. Thanks in advance
Assuming "file" contains all the lines, including the references (your "big tab file")
awk -F '\t' -v OFS='\t' '
{key = $1 FS $2 FS $5 FS $6}
NR == FNR {
if ($3 == "reference") val[key] = $4
next
}
$3 != "reference" {print $0, val[key]}
' file file
Here, we process the file twice, once to collect the references, and then to print the information.
Output:
rs980 INPP5B missense I 749 NP_001284363.1 L
rs980 INPP5B missense F 749 NP_001284363.1 L
rs980 INPP5B missense I 913 NP_005531.2 L
rs980 INPP5B missense P 913 NP_005531.2 L
rs1921 ISG15 missense N 83 NP_005092.1 S
rs1921 ISG15 missense T 83 NP_005092.1 S
rs2607 HEATR1 missense K 2133 NP_060542.4 E
rs2607 HEATR1 missense Q 2133 NP_060542.4 E
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With