I have a binary file and I am doing frequency counting in julia.
using PyPlot
import StatsBase
const stb=StatsBase
function getall(fname)
b=Mmap.mmap(fname,Vector{Int32})
#a=open(fname)
#b=reinterpret(Int32,readbytes(a))
d=stb.countmap(b)
x=collect(keys(d)) & 0x7ffffff
y=collect(values(d))
#plot(x,y,"r^")
#xlim(0,3000)
#ylim(0,3e5)
#grid("on")
return x,y
end
In python, I use numpy.unique
, numpy.memmap
and get the similar performance ( 550ms). Could Julia code be faster? Is there any other way to count instead of using StatBases.
The countmap
operation is a standard operation in any programming language. Additionally, it is also "raw", like sorting, which means it has to do a basic popular operation over the input data. Operations of this kind are hard to optimize, as they are done similarly in most languages - and if they are not fast enough in the source language, a specialized routine is called (read C/Cpp written).
Julia is no exception. Some "raw" linear algebra is outsourced to heavily optimized libraries.
To put a productive (and Julia positive) spin on this answer, there are algorithmic methods to treat special cases of input which would yield a speedup over the general algorithm (i.e. using a hash-based counter Dict). The ability to code these special cases within Julia represents its speed and the attempt to solve the so-called 2-language problem.
Concretely, the following tries to optimize for files with an uneven distribution of 32-bit words (like text files for example), by by-passing a general hash-based Dict and using a faster simple hash and a 16-bit lookup table.
On my test file, it achieved a 10% speedup over the countmap
implementation in the OP. A modest improvement :).
using DataStructures
function getall4(fname)
b=Mmap.mmap(fname,Vector{UInt32})
c = zeros(Int,2^16)
v = Array(UInt16,2^16)
l = length(b)
for i=1:l
d1 = b[i]&0xFFFF
d2 = d1 $ (b[i]>>16)
if d1==v[d2+1]
c[d2+1] += 1
else
c[d2+1] -= 1
end
if (c[d2+1]<=0)
c[d2+1] = 1
v[d2+1] = d1
end
end
cc = DataStructures.counter(UInt32)
fill!(c,0)
for i=1:l
d1 = b[i]&0xFFFF
d2 = d1 $ (b[i]>>16)
if v[d2+1]==d1
c[d2+1] += 1
end
end
for i=1:l
d1 = b[i]&0xFFFF
d2 = d1 $ (b[i]>>16)
if !(v[d2+1]==d1)
push!(cc,b[(i+1)>>1])
end
end
x = UInt32[]
y = Int[]
for i=1:(1<<16)
if c[i]>0
push!(x,(UInt32(i)<<16)+v[i])
push!(y,c[i])
end
end
append!(x,collect(keys(cc.map)))
append!(y,collect(values(cc.map)))
x,y
end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With