Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C++ Tokenizer Complexity vs strtok_r

I'm making this question because I moved my tokenizer from strtok_r to an equivalent version in C++. I have to use strtok_r in place of strtok, because I have 2 nested tokenizations to perform most of the time.

The strtok_r algorithm is something like this:

char *end_token, *token, *word ;
// fill 'word'
token = strtok_r (word, " ", &end_token) ;
while (token != NULL) {
  // do something
  token = strtok_r (NULL, " ", &end_token) ;
}

And the C++ version is like this (taken from another post here):

string mystring, token ;
size_t next_token ;
// fill 'mystring'
while (token != mystring) {
    next_token = mystring.find_first_of (" ") ;
    token = mystring.substr (0, next_token) ;
    mystring = mystring.substr (next_token + 1) ;
    // do something
}

Now the question: why the C++ version is so heavy respect to the C version? For long strings I have to wait about 10 seconds with the C++ version, while the C version is instantaneous with the same strings. So, it seems like the C++ version has higher complexity... What do you think about?

like image 288
nicolati Avatar asked Dec 02 '25 16:12

nicolati


1 Answers

strtok() modifies the string, replacing the token delimiter with a null terminator. If your long string has n tokens, the function just iterates through the string changing n chars to null, which is extremely fast.

In your C++ alternative you are making 2*n string copies, which means potentially 2*n allocation operations, plus the sheer copy of the (very long) remaining string, which is much heavier than the first alternative. The difference is that you're not obliged to change the original string.

You could improve by keeping the string you're iterating trough unchanged and for example use offsets for the search:

string mystring, token ;
size_t cur_token=0, next_token ;
// fill 'mystring'
do {
    next_token = mystring.find_first_of (" ", cur_token) ;
    token = mystring.substr (cur_token, next_token-cur_token);  // next_token-(nex_token==string::npos ? 0:cur_token) would be cleaner
    if (next_token!=string::npos) 
        cur_token = next_token+1; 
    // do something with token;
} while (next_token!=string::npos);

Live demo

like image 195
Christophe Avatar answered Dec 04 '25 06:12

Christophe



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!