Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filtering rows between two files based on correlation in awk

Tags:

text

bash

awk

After reading several topics such as this, this and this, I am still confused about how to tackle the following issue. I have a bunch of files I would like to filter based on the correlation (Pearson) of their rows (eg. using a threshold greater than +0.9). For instance, two files:

file1.txt -> 15 columns and 4 rows

file2.txt -> 16 columns and 10 rows

So, the first row in file1.txt is correlated with all rows in file2.txt, and then it continues sequentially with the next rows in file1.txt. One caveat here is that only values between column 1 and column 14 are considered for correlation (ie. column 15 in file1.txt and columns 15 and 16 in file2.txt are not taken into account). If correlated rows give a correlation greater than +0.9, then it prints (or saves in a new file) the value from column 15 in file1.txt and values from columns 15 and 16 in file2.txt, so:

output file -> 3 columns and number of rows depend on correlation values > +0.9

I was playing with this code to understand how to do the correlation but no luck so far:

 awk '{
  a = 0; for (i = 1; i <= NF; ++i) a += $i; a /= NF-1
  b = 0; for (i = 1; i <= NF; ++i) b += ($i - a) ^ 2; b = sqrt(b)
  if (b <= 0) next
  for (i = 1; i <= NF; ++i) x[NR, i] = ($i - a) / b
  n[NR] = $1
  for (i = 1; i <= NR; ++i) {
    if (!(i in n)) continue
    a = 0
    for (k = 1; k <= NF; ++k)
      a += x[NR, k] * x[i, k]
    print n[NR], n[i], a
  }}' << eof
r1 1 2 3 4 5 6 7 8 9 10 11 12 13 14
r2 1 2 3 4 5 6 7 8 9 10 11 12 13 14
r3 1 2 3 4 5 6 7 8 9 10 11 12 13 14
eof

it obviously gives 1 indicating the correlated rows:

r1 r1 1
r2 r1 1
r2 r2 1
r3 r1 1
r3 r2 1
r3 r3 1

Not sure how to adapt/modify/improve this piece of code to use two separated files as well as achieving what I would like to obtain as result, any help is much appreciated.

Some real values of the files I have are posted below:

real_file1.txt (15 columns and 4 rows):

17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  110.0MN
16684.4  38376.4  24.5148  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  66.21622MN
17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  60.0MN
17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  20.0MN

real_file2.txt (16 columns and 10 rows):

20985.4  38466.8  57.9789  27394.6  941.422  10109.7  11.326   6077.4   3729.51  87.5948  25.9344  4477.65  23.4117  0.00158122  -84.4166666667  -33.33333333333
21020    38438.3  60.0672  27570.8  1268.44  9737.09  10.7524  7888.51  3609.84  87.7387  25.9116  5785.01  23.3214  0.00191128  -84.3333333333  -33.33333333333
21138.4  38396.4  62.0438  27570.8  1268.44  9737.09  10.7524  7888.51  3609.84  87.7387  25.9116  5785.01  23.3214  0.00191128  -84.25          -33.33333333333
21250.5  38333.5  61.3351  27570.8  1268.44  9737.09  10.7524  7888.51  3609.84  87.7387  25.9116  5785.01  23.3214  0.00191128  -84.1666666667  -33.33333333333
21294.3  38261    59.1823  27703.4  1589.07  9294.17  10.2936  9471.09  3462.93  87.8499  25.8908  7098.47  23.2781  0.00219644  -84.0833333333  -33.33333333333
21263.2  38208.1  60.0892  27703.4  1589.07  9294.17  10.2936  9471.09  3462.93  87.8499  25.8908  7098.47  23.2781  0.00219644  -84.0           -33.33333333333
21208.9  38186.3  63.612   27703.4  1589.07  9294.17  10.2936  9471.09  3462.93  87.8499  25.8908  7098.47  23.2781  0.00219644  -83.9166666667  -33.33333333333
21165    38178.8  67.6346  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.8333333333  -33.33333333333
21131.9  38183.1  72.5871  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.75          -33.33333333333
21096.6  38187.6  72.6827  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.6666666667  -33.33333333333

for instance, expected result based on both files (not tested with correlation, just file structure as example):

110.0MN -83.6666666667  -3.33333333333
20.0MN  -84.25          -3.33333333333

Here a MWE composed of rows from the examples previously mentioned, where:

  • the 1st row of file1.txt is correlated with the 2 rows of file2.txt and the result.txt is created (ie. correlation > 0.9 for this case, because it is the same row for testing purposes, so XOR is 1).
  • the 2nd row of file1.txt is correlated with the 2 rows of file2.txt and no output is generated (ie. I am assuming here that correlation is < 0.9 for all correlated rows).

file1.txt:

17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  110.0MN
16684.4  38376.4  24.5148  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  66.21622MN

file2.txt

21096.6  38187.6  72.6827  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.6666666667  -33.33333333333
17336.2  38352.3  23.3161  16838.3  4127.13  16173    20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639   0.0138022  -80.7777777777  -30.22222222222

result.txt

110.0MN -80.7777777777  -30.22222222222
like image 668
Gery Avatar asked Oct 15 '25 03:10

Gery


2 Answers

With any POSIX awk:

$ cat foo.awk
function stat(    sum, sum2, i) {
  sum = sum2 = 0
  for(i = 1; i <= max; i++) {
    sum += $i
    sum2 += $i * $i
  }
  avg = sum / max
  std = sqrt(sum2 / max - avg * avg)
}

function pcc(i,    sum_ab, j) {
  sum_ab = 0;
  for(j = 1; j <= max; j++) sum_ab += a[i,j] * $j
  return (sum_ab / max - avg_a[i] * avg) / (std_a[i] * std)
}

BEGIN {
  max = 14
}

NR == FNR {
  cnt += 1
  stat()
  for(i = 1; i <= NF; i++) a[cnt,i] = $i
  avg_a[cnt] = avg
  std_a[cnt] = std
  next
}

{
  stat()
  if(std == 0) next
  for(i = 1; i <= cnt; i++) {
    if(std_a[i] == 0) continue
    if(pcc(i) > 0.9) print a[i,max+1], $(max+1), $(max+2);
  }
}

$ cat file1.txt
17336.2  38352.3  23.3161  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  110.0TN
16684.4  38376.4  24.5148  16838.3  4127.13  16173  20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639  0.0138022  66.21622TN

$ cat file2.txt
21096.6  38187.6  72.6827  27192.5  1842.54  9394.1   10.2802  10428.4  3771.67  87.8443  25.8791  8149.04  23.2957  0.00247546  -83.6666666667  -33.33333333333
17336.2  38352.3  23.3161  16838.3  4127.13  16173    20.6492  17695.3  16673.1  85.8239  26.0635  17247.9  24.639   0.0138022  -80.7777777777  -30.22222222222

$ awk -f foo.awk file1.txt file2.txt 
110.0TN -80.7777777777 -30.22222222222
66.21622TN -80.7777777777 -30.22222222222

max is the size of your samples. It is set to 14 in the BEGIN block.

Pseudo 2D matrix a stores the values of the first file: a[i,j] is column j of row i. It is populated by the second block (NR == FNR) while parsing the first file. The second block also uses function stat to compute the average (avg) and the standard deviation (std) of the sample (the max first fields of each row). They are stored in arrays avg_a and std_a, respectively.

The third block parses the rows of the second file. For each row, function stat computes the average (avg) and the standard deviation (std) of the sample, and function pcc is called to compute the Pearson Correlation Coefficient (PCC) of the row with each row of matrix a. If it is strictly larger than 0.9 the max+1-th column of the correlated row of a, and columns max+1 and max+2 of current row are printed.

like image 150
Renaud Pacalet Avatar answered Oct 18 '25 06:10

Renaud Pacalet


Pearson, I updated the post, thanks for it. About the tool, I use bash scripting so linux tools such as awk are easy to include, so code using those great tools are the way to go, but I think I can also include python solutions.

I would then suggest to use existing implementation for computing Pearson correlation.

If you are okay with installing GNU datamash then you might make use of ppearson following way, let file.tsv content be

1   1   -1  1
1.5 1.5 -1.5    2.25
2   2   -2  4
2.5 2.5 -2.5    6.25
3   3   -3  9
3.5 3.5 -3.5    12.25
4   4   -4  16
4.5 4.5 -4.5    20.25
5   5   -5  25

then

datamash 'ppearson 1:2 ppearson 1:3 ppearson 1:4' < file.tsv

gives output

1   -1  0.98263874354359

which is correlation between columns 1 and 2, between columns 1 and 3, between columns 1 and 4. You might either craft string describing what to compute or call datamash for each correlation you wish to know. For example to find every against every column (Cartesian product) then you might do

for i in $(seq 4)
do
  for j in $(seq 4)
  do
    corr=$(datamash "ppearson $i:$j" < file.tsv)
    echo "$i vs $j correlation is $corr"
  done
done

which will result in

1 vs 1 correlation is 1
1 vs 2 correlation is 1
1 vs 3 correlation is -1
1 vs 4 correlation is 0.98263874354359
2 vs 1 correlation is 1
2 vs 2 correlation is 1
2 vs 3 correlation is -1
2 vs 4 correlation is 0.98263874354359
3 vs 1 correlation is -1
3 vs 2 correlation is -1
3 vs 3 correlation is 1
3 vs 4 correlation is -0.98263874354359
4 vs 1 correlation is 0.98263874354359
4 vs 2 correlation is 0.98263874354359
4 vs 3 correlation is -0.98263874354359
4 vs 4 correlation is 1

You would need to first process your data into GNU datamash-compliant format then get values and act depending on their values.

(tested in GNU datamash 1.9)

python has statistics.correlation as part of standard library, but be warned it was added in 3.10, so if your python --version is lower than that you will not have access to it.

like image 26
Daweo Avatar answered Oct 18 '25 07:10

Daweo