I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.
This function becomes extremely slow (seconds or more) when the string becomes longer than 1 million characters. Time seems to increase more than linearly, maybe quadratic. I don't understand why. Can someone explain?
This is some reproducible code:
function chop(s; nc=80)
nr = ceil(Int64, length(s)/nc)
l(i) = 1+(nc*(i-1))
r(i) = min(nc*i, length(s))
rows = [String(s[l(i):r(i)]) for i in 1:nr]
return join(rows,'\n')
end
s = "A"^500000
chop(s)
It seems that this row is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]
Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.
I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl
Can someone help me understand why this code above is so slow nonetheless?
Weirdly, the below is much faster, just by adding s = collect(s):
function chop(s; nc=80)
s = collect(s) #this line is new
nr = ceil(Int64, length(s)/nc)
l(i) = 1+(nc*(i-1))
r(i) = min(nc*i, length(s))
rows = [String(s[l(i):r(i)]) for i in 1:nr]
return join(rows,'\n')
end
My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity not speed):
chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')
The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.
How does the given solution work:
findall(Regex(".{1,$nc}") returns a vector of ranges eagerly matching up to nc characters;SubString(s, r) which avoids allocation, using the returned ranges that are iterated by r.\n as separator.First attempt:
chop is not recommended to be used as it overshadows the function from Base Julia with the same name;length(s) is called many times and it is an expensive function; it should be called only once and stored as a variable;length is incorrect as Julia uses byte indexing not character indexing (see here for an explanation)String(s[l(i):r(i)]) is inefficient as it allocates String twice (actually the outer String is not needed)Second attempt:
s = collect(s) resolves the issue of calling length many times and incorrect use of byte indexing, but is inefficient as it unnecessarily allocates Vector{Char} and also it makes your code type-unstable (as you assign to variable s value of different type than it originally stored);String(s[l(i):r(i)]) first allocates a small Vector{Char} and next allocates StringIf you want something faster than regex and correct you can use this code:
function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
@assert nc > 0
isempty(s) && return s
sz = sizeof(s)
cu = codeunits(s)
buf_sz = sz + div(sz, nc)
buf = Vector{UInt8}(undef, buf_sz)
start = 1
buf_loc = 1
while true
stop = min(nextind(s, start, nc), sz + 1)
copyto!(buf, buf_loc, cu, start, stop - start)
buf_loc += stop - start
if stop == sz + 1
resize!(buf, buf_loc - 1)
break
else
start = stop
buf[buf_loc] = UInt8('\n')
buf_loc += 1
end
end
return String(buf)
end
String is immutable in Julia. If you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With