Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Best way to do frequency counting in Julia

I have a binary file and I am doing frequency counting in julia.

using PyPlot
import StatsBase
const stb=StatsBase

function getall(fname)
    b=Mmap.mmap(fname,Vector{Int32})
    #a=open(fname)
    #b=reinterpret(Int32,readbytes(a))
    d=stb.countmap(b)
    x=collect(keys(d)) & 0x7ffffff
    y=collect(values(d))
    #plot(x,y,"r^")
    #xlim(0,3000)
    #ylim(0,3e5)
    #grid("on")
    return x,y
end

In python, I use numpy.unique, numpy.memmap and get the similar performance ( 550ms). Could Julia code be faster? Is there any other way to count instead of using StatBases.

like image 481
Aung Avatar asked Oct 16 '25 23:10

Aung


1 Answers

The countmap operation is a standard operation in any programming language. Additionally, it is also "raw", like sorting, which means it has to do a basic popular operation over the input data. Operations of this kind are hard to optimize, as they are done similarly in most languages - and if they are not fast enough in the source language, a specialized routine is called (read C/Cpp written).

Julia is no exception. Some "raw" linear algebra is outsourced to heavily optimized libraries.

To put a productive (and Julia positive) spin on this answer, there are algorithmic methods to treat special cases of input which would yield a speedup over the general algorithm (i.e. using a hash-based counter Dict). The ability to code these special cases within Julia represents its speed and the attempt to solve the so-called 2-language problem.

Concretely, the following tries to optimize for files with an uneven distribution of 32-bit words (like text files for example), by by-passing a general hash-based Dict and using a faster simple hash and a 16-bit lookup table.

On my test file, it achieved a 10% speedup over the countmap implementation in the OP. A modest improvement :).

using DataStructures
function getall4(fname)
    b=Mmap.mmap(fname,Vector{UInt32})
    c = zeros(Int,2^16)
    v = Array(UInt16,2^16)
    l = length(b)
    for i=1:l
        d1 = b[i]&0xFFFF
        d2 = d1 $ (b[i]>>16)
        if d1==v[d2+1]
            c[d2+1] += 1
        else
            c[d2+1] -= 1
        end
        if (c[d2+1]<=0)
            c[d2+1] = 1
            v[d2+1] = d1
        end
    end
    cc = DataStructures.counter(UInt32)
    fill!(c,0)
    for i=1:l
        d1 = b[i]&0xFFFF
        d2 = d1 $ (b[i]>>16)
        if v[d2+1]==d1
            c[d2+1] += 1
        end
    end
    for i=1:l
        d1 = b[i]&0xFFFF
        d2 = d1 $ (b[i]>>16)
        if !(v[d2+1]==d1)
            push!(cc,b[(i+1)>>1])
        end
    end
    x = UInt32[]
    y = Int[]
    for i=1:(1<<16)
        if c[i]>0
            push!(x,(UInt32(i)<<16)+v[i])
            push!(y,c[i])
        end
    end
    append!(x,collect(keys(cc.map)))
    append!(y,collect(values(cc.map)))
    x,y
end
like image 117
Dan Getz Avatar answered Oct 19 '25 12:10

Dan Getz



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!