Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is string creation so slow in Julia?

Tags:

string

julia

I'm maintaining a Julia library that contains a function to insert a new line after every 80 characters in a long string.

This function becomes extremely slow (seconds or more) when the string becomes longer than 1 million characters. Time seems to increase more than linearly, maybe quadratic. I don't understand why. Can someone explain?

This is some reproducible code:

function chop(s; nc=80)
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end

s = "A"^500000

chop(s)

It seems that this row is where most of the time is spent: rows = [String(s[l(i):r(i)]) for i in 1:nr]

Does that mean it takes long to initialize a new String? That wouldn't really explain the super-linear run time.

I know the canonical fast way to build strings is to use IOBuffer or the higher-level StringBuilders package: https://github.com/davidanthoff/StringBuilders.jl

Can someone help me understand why this code above is so slow nonetheless?

Weirdly, the below is much faster, just by adding s = collect(s):

function chop(s; nc=80)
    s = collect(s) #this line is new
    nr   = ceil(Int64, length(s)/nc)
    l(i) = 1+(nc*(i-1)) 
    r(i) = min(nc*i, length(s))
    rows = [String(s[l(i):r(i)]) for i in 1:nr]
    return join(rows,'\n')
end
like image 656
Cornelius Roemer Avatar asked Dec 21 '25 10:12

Cornelius Roemer


2 Answers

My preference would be to use a generic one-liner solution, even if it is a bit slower than what Przemysław proposes (I have optimized it for simplicity not speed):

chop_and_join(s::Union{String,SubString{String}}; nc::Integer=80) =
    join((SubString(s, r) for r in findall(Regex(".{1,$nc}"), s)), '\n')

The benefit is that it correctly handles all Unicode characters and will also work with SubString{String}.

How the solution works

How does the given solution work:

  • findall(Regex(".{1,$nc}") returns a vector of ranges eagerly matching up to nc characters;
  • next I create a SubString(s, r) which avoids allocation, using the returned ranges that are iterated by r.
  • finally all is joined with \n as separator.

What is wrong in the OP solutions

First attempt:

  • the function name you choose chop is not recommended to be used as it overshadows the function from Base Julia with the same name;
  • length(s) is called many times and it is an expensive function; it should be called only once and stored as a variable;
  • in general using length is incorrect as Julia uses byte indexing not character indexing (see here for an explanation)
  • String(s[l(i):r(i)]) is inefficient as it allocates String twice (actually the outer String is not needed)

Second attempt:

  • doing s = collect(s) resolves the issue of calling length many times and incorrect use of byte indexing, but is inefficient as it unnecessarily allocates Vector{Char} and also it makes your code type-unstable (as you assign to variable s value of different type than it originally stored);
  • doing String(s[l(i):r(i)]) first allocates a small Vector{Char} and next allocates String

What would be a fast solution

If you want something faster than regex and correct you can use this code:

function chop4(s::Union{String, SubString{String}}; nc::Integer=80)
    @assert nc > 0
    isempty(s) && return s
    sz = sizeof(s)
    cu = codeunits(s)
    buf_sz = sz + div(sz, nc)
    buf = Vector{UInt8}(undef, buf_sz)
    start = 1
    buf_loc = 1
    while true
        stop = min(nextind(s, start, nc), sz + 1)
        copyto!(buf, buf_loc, cu, start, stop - start)
        buf_loc += stop - start
        if stop == sz + 1
            resize!(buf, buf_loc - 1)
            break
        else
            start = stop
            buf[buf_loc] = UInt8('\n')
            buf_loc += 1
        end
    end
    return String(buf)
end
like image 128
Bogumił Kamiński Avatar answered Dec 23 '25 02:12

Bogumił Kamiński


String is immutable in Julia. If you need to work with a string in this way, it's much better to make a Vector{Char} first, to avoid repeatedly allocating new, big strings.

like image 36
jling Avatar answered Dec 23 '25 01:12

jling