
Bash code not counting identical lines properly

Tags:

bash

I need a count of how many unique BLAST hits I've obtained for my list of OTUs/ASVs, but in what appear to be random cases, lines that are identical are not being added up properly. Other times, however, the code works fine and sums identical lines with no problem. I have already run dos2unix on my files, I'm running the command on Linux, and I don't know what else to check for.

Here is a reproducible example:

Input file called check_text.csv with lines below:

ASV Genus_Species   Taxid
cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
46c5e68b0c1230a4305382e0ddf34c5d    Moscharia solbrigii 460333
46c5e68b0c1230a4305382e0ddf34c5d    Marticorenia foliosa    1910960
46c5e68b0c1230a4305382e0ddf34c5d    Leucheria candidissima  1911693
46c5e68b0c1230a4305382e0ddf34c5d    Moscharia solbrigii 460333
46c5e68b0c1230a4305382e0ddf34c5d    Oxyphyllum ulicinum 540062

Here is the code I run:

dos2unix check_text.csv
dos2unix: converting file check_text.csv to Unix format...

cat check_text.csv | uniq -c > hitcounts.csv

But here is the output I keep getting in my hitcounts.csv file:

  1 ASV Genus_Species   Taxid
  5 cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
  1 46c5e68b0c1230a4305382e0ddf34c5d    Moscharia solbrigii 460333
  1 46c5e68b0c1230a4305382e0ddf34c5d    Marticorenia foliosa    1910960
  1 46c5e68b0c1230a4305382e0ddf34c5d    Leucheria candidissima  1911693
  1 46c5e68b0c1230a4305382e0ddf34c5d    Moscharia solbrigii 460333
  1 46c5e68b0c1230a4305382e0ddf34c5d    Oxyphyllum ulicinum 540062

The first line is fine; those are just the headers, so that's not a problem. But I should have seen the following:

  1 ASV Genus_Species   Taxid
  5 cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
  2 46c5e68b0c1230a4305382e0ddf34c5d    Moscharia solbrigii 460333
  1 46c5e68b0c1230a4305382e0ddf34c5d    Marticorenia foliosa    1910960
  1 46c5e68b0c1230a4305382e0ddf34c5d    Leucheria candidissima  1911693
  1 46c5e68b0c1230a4305382e0ddf34c5d    Oxyphyllum ulicinum 540062

The line with Moscharia solbrigii comes up twice, as if those lines were not identical, but they are. What could be the problem? I manually checked some of these lines for extra whitespace, but there isn't any.

asked Dec 11 '25 by Katherine Chau


2 Answers

Note the uniq man page:

'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'
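
So the simplest fix for the command in the question is to sort the body of the file so that duplicate lines become adjacent before counting. A minimal sketch using the asker's file names; the header is written out separately since sorting would move it:

head -n 1 check_text.csv > hitcounts.csv
tail -n +2 check_text.csv | sort | uniq -c >> hitcounts.csv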

Alternatively, you can use Ruby with its CSV parser and generator:

ruby -r csv -e '
opts = {:headers=>true, :col_sep=>"\t"}
puts CSV.generate(**opts){ |csv|
        tbl = CSV.parse($<.read, **opts)                     # read the whole TSV, keeping headers
        tally = tbl.map{|row| row.to_a.map(&:last)}.tally    # count each distinct row of values
        csv << ["1"] + tbl.to_a[0]                           # emit the header row with a count of 1
        tally.each{|x,cnt| csv << [cnt] + x }                # emit each unique row prefixed by its count
        }
' file.tsv

You can also use two awk invocations and a sort:

# print the header with a count of 1, then stop reading
awk 'BEGIN{FS=OFS="\t"}
FNR==1{print 1,$0; exit}' file.tsv

# count every non-header line, then sort by count, highest first
awk 'BEGIN{FS=OFS="\t"}
FNR==1{next}
{cnt[$0]++}
END{for (e in cnt) print cnt[e],e}
' file.tsv | sort -k1,1rn

Either prints:

# header first, then ordered by decreasing frequency
1   ASV Genus_Species   Taxid   
5   cb692dc35226a27b1dc4709405077606    Gymnanthemum    amygdalinum 82755
2   46c5e68b0c1230a4305382e0ddf34c5d    Moscharia   solbrigii   460333
1   46c5e68b0c1230a4305382e0ddf34c5d    Marticorenia    foliosa 1910960
1   46c5e68b0c1230a4305382e0ddf34c5d    Leucheria   candidissima    1911693
1   46c5e68b0c1230a4305382e0ddf34c5d    Oxyphyllum  ulicinum    540062   
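
As an aside, if you have GNU awk the two commands and the sort can be collapsed into a single invocation. A sketch assuming gawk's PROCINFO["sorted_in"] extension, which is not in POSIX awk:

awk 'BEGIN{FS=OFS="\t"}
FNR==1{print 1, $0; next}                    # header gets a count of 1
{cnt[$0]++}
END{
    PROCINFO["sorted_in"] = "@val_num_desc"  # gawk-only: iterate by decreasing count
    for (e in cnt) print cnt[e], e
}' file.tsv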

Or, if you want to print them in the order they were first encountered in the file rather than by count:

ruby -lne 'BEGIN{cnt=Hash.new(0)}         # hash with a default count of 0
if $.==1 then
    puts "cnt\t#{$_}"                     # pass the header through with a label
else
    cnt[$_]+=1                            # count each body line
end
END{ cnt.each{|k,v| puts "#{v}\t#{k}" } } # Ruby hashes keep insertion order
' file.tsv

Or awk:

awk 'BEGIN{FS=OFS="\t"}
FNR==1{print "cnt", $0; next}          # label and pass through the header
!seen[$0]++ {order[++count] = $0}      # remember each unique line in first-seen order
END {
    for (i=1; i<=count; i++)
        print seen[order[i]], order[i] # seen[] now holds the total count per line
}' file.tsv

Or a two-pass awk (for massive files with lots of duplicates):

awk 'BEGIN{FS=OFS="\t"}
FNR==1{if (NR==1) print "cnt",$0; next}  # print the header once, on the first pass only
FNR==NR{cnt[$0]++; next}                 # pass 1: count every line
!seen[$0]++{print cnt[$0], $0}           # pass 2: print each unique line with its count
' file.tsv file.tsv

All three of those print:

# ordered by when line first encountered - not by count.
cnt ASV Genus_Species   Taxid
5   cb692dc35226a27b1dc4709405077606    Gymnanthemum    amygdalinum 82755
2   46c5e68b0c1230a4305382e0ddf34c5d    Moscharia   solbrigii   460333
1   46c5e68b0c1230a4305382e0ddf34c5d    Marticorenia    foliosa 1910960
1   46c5e68b0c1230a4305382e0ddf34c5d    Leucheria   candidissima    1911693
1   46c5e68b0c1230a4305382e0ddf34c5d    Oxyphyllum  ulicinum    540062 
answered Dec 13 '25 by dawg


As others have already mentioned, uniq works on groups of contiguous duplicates, not duplicates across the whole of the input.
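
You can see that adjacency requirement with a toy input (a quick illustration, not the asker's data):

$ printf 'a\nb\na\n' | uniq -c
      1 a
      1 b
      1 a
$ printf 'a\nb\na\n' | sort | uniq -c
      2 a
      1 b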

Using any awk with a 2-pass approach:

$ awk '
    NR==FNR {
        cnt[$0]++
        next
    }
    $0 in cnt {
        print cnt[$0], $0
        delete cnt[$0]
    }
' file file
1 ASV Genus_Species   Taxid
5 cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
2 46c5e68b0c1230a4305382e0ddf34c5d    Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d    Marticorenia foliosa    1910960
1 46c5e68b0c1230a4305382e0ddf34c5d    Leucheria candidissima  1911693
1 46c5e68b0c1230a4305382e0ddf34c5d    Oxyphyllum ulicinum 540062

or 1-pass but storing the input in memory:

$ awk '
    {
        cnt[$0]++
        ordr[NR] = $0
    }
    END {
        for ( i=1; i<=NR; i++ ) {
            rec = ordr[i]
            if ( rec in cnt ) {
                print cnt[rec], rec
                delete cnt[rec]
            }
        }
    }
' file
1 ASV Genus_Species   Taxid
5 cb692dc35226a27b1dc4709405077606    Gymnanthemum amygdalinum    82755
2 46c5e68b0c1230a4305382e0ddf34c5d    Moscharia solbrigii 460333
1 46c5e68b0c1230a4305382e0ddf34c5d    Marticorenia foliosa    1910960
1 46c5e68b0c1230a4305382e0ddf34c5d    Leucheria candidissima  1911693
1 46c5e68b0c1230a4305382e0ddf34c5d    Oxyphyllum ulicinum 540062

I see that @dawg has similar scripts in their answer, and I think the differences are interesting and subtle enough to be worth discussing. Here are @dawg's scripts written in the same style and with the same variable names as I'm using above, where applicable, to make the similarities and differences as obvious as possible, and with one tweak where I think it's useful (IMO we should not use an array named seen[] to store and print counts; it should only test whether its index has been seen before):

Ed 1-pass:

awk '
    {
        cnt[$0]++
        ordr[NR] = $0
    }
    END {
        for ( i=1; i<=NR; i++ ) {
            rec = ordr[i]
            if ( rec in cnt ) {
                print cnt[rec], rec
                delete cnt[rec]
            }
        }
    }
' file

dawg 1-pass:

awk '
    !cnt[$0]++ {
        ordr[++numRecs] = $0
    }
    END { 
        for ( i=1; i<=numRecs; i++) {
            rec = ordr[i]
            print cnt[rec], rec
        }
    }
' file

Comparison: dawg only stores the order of each unique $0 and only loops over the number of unique records, whereas I store $0 for every line, loop over every line, and delete non-unique records as I go. They do one extra increment and comparison per line while loading the arrays, but that's probably negligible overhead. Given that, dawg's script will use less memory and probably be more efficient than mine, so it is the better approach.
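
If you want to put rough numbers on that, you could time both 1-pass scripts on a large, duplicate-heavy file. A sketch; the generated test data is an assumption, not the asker's workload:

# ~2,000,000 lines, ~20,000 unique values, so each line repeats ~100 times
seq 2000000 | awk '{print $0 % 20000}' > big.txt

time awk '{cnt[$0]++; ordr[NR]=$0}
    END{for (i=1; i<=NR; i++) {rec=ordr[i]; if (rec in cnt) {print cnt[rec], rec; delete cnt[rec]}}}
' big.txt > /dev/null

time awk '!cnt[$0]++{ordr[++numRecs]=$0}
    END{for (i=1; i<=numRecs; i++) {rec=ordr[i]; print cnt[rec], rec}}
' big.txt > /dev/null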

Ed 2-pass:

awk '
    NR==FNR {
        cnt[$0]++
        next
    }
    $0 in cnt {
        print cnt[$0], $0
        delete cnt[$0]
    }
' file file

dawg 2-pass:

awk '
    NR==FNR {
        cnt[$0]++
        next
    }
    !seen[$0]++ {
        print cnt[$0], $0
    }
' file file

Comparison: we both store the record counts the same way on the first pass, but on the second pass dawg builds an array seen[] of all the unique records from the input, so they essentially hold two copies of the input in memory (which means dawg's 2-pass script isn't actually using less memory than their 1-pass script), whereas my approach keeps only one copy of the input in memory. dawg does a hash lookup plus arithmetic while I do a hash lookup plus a deletion; in terms of execution speed that's probably not different enough either way to matter. Since we'd only adopt this 2-pass approach when the input could be huge and we need to avoid memory issues (otherwise we'd use the 1-pass approach), my approach is better here: it uses half the memory of both 1-pass scripts and of dawg's 2-pass script.
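
One way to check those memory claims is GNU time's verbose mode, comparing the "Maximum resident set size" it reports for each script (a sketch assuming /usr/bin/time is GNU time, as it usually is on Linux, and reusing the big.txt generated above):

/usr/bin/time -v awk '
    NR==FNR { cnt[$0]++; next }
    $0 in cnt { print cnt[$0], $0; delete cnt[$0] }
' big.txt big.txt > /dev/null 2> mem_2pass.log

grep 'Maximum resident' mem_2pass.log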

answered Dec 13 '25 by Ed Morton



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!