I’m writing Julia code that reads JSON files, performs analysis (in the field of mathematical finance), and writes its results as JSON. The code is a port from R, made in the hope of improving performance.
I parse the input files using JSON.parsefile. This returns a Dict in which I observe that all vectors are of type Array{Any,1}. As it happens, I know that the input file will never contain vectors of mixed type, such as some Strings and some Numbers.
So I wrote the following code, which seems to work well and is “safe” in the sense that if the calls to convert fail then a vector continues to have type Array{Any,1}.
function typenarrow!(d::Dict)
    for k in keys(d)
        if d[k] isa Array{Any,1}
            d[k] = typenarrow(d[k])
        elseif d[k] isa Dict
            typenarrow!(d[k])
        end
    end
end
function typenarrow(v::Array{Any,1})
    for T in [String,Int64,Float64,Bool,Vector{Float64}]
        try
            return(convert(Vector{T},v))
        catch; end
    end
    return(v)
end
My question is: Is this worth doing? Can I expect code that processes the contents of the Dict to execute faster if I do this type narrowing? I think the answer is yes, in that the Julia performance tips recommend that you “Annotate values taken from untyped locations”, and this approach ensures there are no “untyped locations”.
There are two levels to the answer to this question:
Level 1
Yes, it will help the performance of the code. See for instance the following benchmark:
julia> using BenchmarkTools
julia> x = Any[1 for i in 1:10^6];
julia> y = [1 for i in 1:10^6];
julia> @btime sum($x)
26.507 ms (477759 allocations: 7.29 MiB)
1000000
julia> @btime sum($y)
226.184 μs (0 allocations: 0 bytes)
1000000
You can write your typenarrow function in a somewhat simpler way, like this:
typenarrow(x) = [v for v in x]
because the comprehension produces a vector with a concrete element type (assuming your source vector is homogeneous).
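To illustrate how the comprehension behaves on different inputs, here is a small sketch. The sample vectors and the name narrow are made up for illustration; they only mimic the Array{Any,1} values that JSON.parse returns. The last line shows one way the convert-based loop from the question differs: convert coerces values whenever a conversion exists, so a vector of Bools matches Int64 before Bool is ever tried.

```julia
# Hypothetical sample vectors, typed the way JSON.parse returns them.
ints  = Any[1, 2, 3]
bools = Any[true, false]
mixed = Any[1, 2.5]

narrow(x) = [v for v in x]   # same comprehension as typenarrow above

# Homogeneous input: the element type becomes concrete.
@assert narrow(ints) isa Vector{Int}
@assert narrow(bools) isa Vector{Bool}

# Mixed input: no value is coerced; the element type is only widened
# as far as needed, so it stays abstract.
@assert !isconcretetype(eltype(narrow(mixed)))
@assert narrow(mixed) == [1, 2.5]

# For comparison, convert coerces values when it can, so Bools become Ints:
@assert convert(Vector{Int64}, bools) == [1, 0]
```

The comprehension therefore keeps each value's own type and never throws, which also removes the need for the try/catch loop.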
Level 2
This is not fully optimal. The remaining problem is that your Dict is a container with an abstract type parameter (see https://docs.julialang.org/en/latest/manual/performance-tips/#Avoid-containers-with-abstract-type-parameters-1), so the compiler cannot infer the type of a value read from it. Therefore, for the computations to be fast, you have to use a barrier function (see https://docs.julialang.org/en/latest/manual/performance-tips/#kernel-functions-1) or add type annotations to the variables you introduce (see https://docs.julialang.org/en/v1/manual/types/index.html#Type-Declarations-1).
In an ideal world your Dict would have keys and values of homogeneous types and everything would then be maximally fast, but if I understand your code correctly the values in your case are not homogeneous.
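The two remedies mentioned above can be sketched as follows. This is a minimal illustration rather than code from the question: the key "prices" and the function names total, analyse and analyse_annotated are hypothetical.

```julia
# Hypothetical data standing in for a parsed and narrowed JSON document.
d = Dict{String,Any}("prices" => [101.5, 102.0, 99.75])

# Kernel (barrier) function: Julia compiles a specialized version of it
# for each concrete argument type it is called with.
function total(v::AbstractVector{<:Real})
    s = zero(eltype(v))
    for x in v
        s += x
    end
    return s
end

# Option 1, function barrier: d["prices"] is inferred as Any here, so one
# dynamic dispatch happens at the call, but the loop inside total runs in
# code specialized for Vector{Float64}.
analyse(d::Dict) = total(d["prices"])

# Option 2, type annotation: assert the type when the value is taken out.
# This throws a TypeError if the assumption about the data is wrong.
function analyse_annotated(d::Dict)
    v = d["prices"]::Vector{Float64}
    return sum(v) / length(v)
end
```

Either way, the narrowing step still matters: if the vector stored in the Dict were an Array{Any,1}, the kernel would be specialized for Array{Any,1} and each element access would remain dynamically typed.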
EDIT
To solve the Level 2 issue you can convert the Dict into a NamedTuple like this (this is a minimal example assuming that Dicts only nest in Dicts directly, but it should be easy enough to extend if you want more flexibility).
First, the function performing the conversion looks like:
function typenarrow!(d::Dict)
    for k in keys(d)
        if d[k] isa Array{Any,1}
            d[k] = [v for v in d[k]]
        elseif d[k] isa Dict
            d[k] = typenarrow!(d[k])
        end
    end
    NamedTuple{Tuple(Symbol.(keys(d)))}(values(d))
end
Now a MWE of its use:
julia> using JSON
julia> x = """
{
  "name": "John",
  "age": 27,
  "values": {
    "v1": [1,2,3],
    "v2": [1.5,2.5,3.5]
  },
  "v3": [1,2,3]
}
""";
julia> j1 = JSON.parse(x)
Dict{String,Any} with 4 entries:
"name" => "John"
"values" => Dict{String,Any}("v2"=>Any[1.5, 2.5, 3.5],"v1"=>Any[1, 2, 3])
"age" => 27
"v3" => Any[1, 2, 3]
julia> j2 = typenarrow!(j1)
(name = "John", values = (v2 = [1.5, 2.5, 3.5], v1 = [1, 2, 3]), age = 27, v3 = [1, 2, 3])
julia> dump(j2)
NamedTuple{(:name, :values, :age, :v3),Tuple{String,NamedTuple{(:v2, :v1),Tuple{Array{Float64,1},Array{Int64,1}}},Int64,Array{Int64,1}}}
  name: String "John"
  values: NamedTuple{(:v2, :v1),Tuple{Array{Float64,1},Array{Int64,1}}}
    v2: Array{Float64}((3,)) [1.5, 2.5, 3.5]
    v1: Array{Int64}((3,)) [1, 2, 3]
  age: Int64 27
  v3: Array{Int64}((3,)) [1, 2, 3]
The beauty of this approach is that Julia knows the types of all fields in j2, so if you pass j2 to a function as an argument, all calculations inside that function will be fast.
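As a small illustration of this point, the sketch below builds a NamedTuple with the same shape as j2 by hand (so that it runs without the JSON package) and passes it to a function. The function weighted is a hypothetical example, not part of the code above.

```julia
# Hand-built stand-in with the same structure as j2 from the example above.
j2 = (name = "John",
      values = (v2 = [1.5, 2.5, 3.5], v1 = [1, 2, 3]),
      age = 27,
      v3 = [1, 2, 3])

# Hypothetical computation: the field types are part of the type of nt,
# so the compiler can resolve nt.values.v1 and nt.values.v2 statically.
weighted(nt) = sum(nt.values.v1 .* nt.values.v2)

result = weighted(j2)
```

In the REPL, `@code_warntype weighted(j2)` is one way to check that no value inside the function is inferred as Any.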
The downside of this approach is that a function taking j2 has to be compiled for its specific type, which might be problematic if the structure of j2 is huge (as the resulting NamedTuple type is then complex) and the amount of work your function does is relatively small. But for small JSON documents (small in the sense of structure; the vectors held in them can be large, as their size does not add to the complexity) this approach has proven efficient in several applications I have developed.
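If the input also contains Dicts nested inside vectors, the conversion can be extended along the following lines. This is only a sketch under that assumption: deepnarrow is a hypothetical, non-mutating variant that relies on dispatch instead of isa checks, and the input below is made up to mimic what JSON.parse returns.

```julia
# Hypothetical helper: recursive, non-mutating narrowing.
deepnarrow(x) = x                                           # scalars pass through
deepnarrow(v::AbstractVector) = [deepnarrow(e) for e in v]  # narrow, recursing into elements
function deepnarrow(d::AbstractDict)
    # keys(d) and values(d) iterate in the same order for a given Dict.
    names = Tuple(Symbol(k) for k in keys(d))
    vals  = Tuple(deepnarrow(v) for v in values(d))
    return NamedTuple{names}(vals)
end

# Made-up input shaped like JSON.parse output, with Dicts inside a vector.
raw = Dict{String,Any}(
    "id" => 7,
    "legs" => Any[
        Dict{String,Any}("rates" => Any[0.01, 0.02]),
        Dict{String,Any}("rates" => Any[0.03]),
    ],
)

nt = deepnarrow(raw)
```

Here nt.legs becomes a vector of NamedTuples. Its element type is concrete only because both inner Dicts have the same keys and value types; records with differing structure would leave the element type abstract.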