C++ Tokenizer Complexity vs strtok_r

Question

I'm making this question because I moved my tokenizer from strtok_r to an equivalent version in C++. I have to use strtok_r in place of strtok, because I have 2 nested tokenizations to perform most of the time.

The strtok_r algorithm is something like this:

char *end_token, *token, *word ;
// fill 'word'
token = strtok_r (word, " ", &end_token) ;
while (token != NULL) {
  // do something
  token = strtok_r (NULL, " ", &end_token) ;
}

And the C++ version is like this (taken from another post here):

string mystring, token ;
size_t next_token ;
// fill 'mystring'
while (token != mystring) {
    next_token = mystring.find_first_of (" ") ;
    token = mystring.substr (0, next_token) ;
    mystring = mystring.substr (next_token + 1) ;
    // do something
}

Now the question: why the C++ version is so heavy respect to the C version? For long strings I have to wait about 10 seconds with the C++ version, while the C version is instantaneous with the same strings. So, it seems like the C++ version has higher complexity... What do you think about?

Christophe · Accepted Answer

strtok() modifies the string, replacing the token delimiter with a null terminator. If your long string has n tokens, the function just iterates through the string changing n chars to null, which is extremely fast.

In your C++ alternative you are making 2*n string copies, which means potentially 2*n allocation operations, plus the sheer copy of the (very long) remaining string, which is much heavier than the first alternative. The difference is that you're not obliged to change the original string.

You could improve by keeping the string you're iterating trough unchanged and for example use offsets for the search:

string mystring, token ;
size_t cur_token=0, next_token ;
// fill 'mystring'
do {
    next_token = mystring.find_first_of (" ", cur_token) ;
    token = mystring.substr (cur_token, next_token-cur_token);  // next_token-(nex_token==string::npos ? 0:cur_token) would be cleaner
    if (next_token!=string::npos) 
        cur_token = next_token+1; 
    // do something with token;
} while (next_token!=string::npos);

Live demo

C++ Tokenizer Complexity vs strtok_r

Tags:

c++

c

tokenize

strtok

nicolati

1 Answers

Christophe

Recent Activity

Donate For Us

C++ Tokenizer Complexity vs strtok_r

Tags:

c++

c

tokenize

strtok

nicolati

1 Answers

Christophe

Related questions

Recent Activity

Donate For Us