How to restore string after using strtok()

Tags: c, sorting, strtok

I have a project in which I need to sort multiple lines of text based on the second, third, etc. word in each line rather than the first. For example, if the input is

this line is first

but this line is second

finally there is this line

and you choose to sort by the second word, it would turn into

this line is first

finally there is this line

but this line is second

(since "line" sorts before "there", which sorts before "this")

I have an array of char pointers, one per line. So far I have used strtok() to split each line up to the second word, but that overwrites my array entry with a pointer to just that word. My tokenizing code looks like this:

 for (i = 0; i < numLines; i++) {
   char* token = strtok(labels[i], " ");
   token = strtok(NULL, " ");
   labels[i] = token;
 }

This gives me the second word in each line, since I call strtok() twice. Then I sort those words (line, this, there). However, I need to put the string back together in its original form. I'm aware that strtok turns the tokens into '\0', but I've yet to find a way to get the original string back.

I'm sure the answer lies in using pointers, but I'm confused what exactly I need to do next.

I should mention I'm reading in the lines from an input file as shown:

for (i = 0; i < numLines && fgets(buffer, sizeof(buffer), fp) != NULL; i++) {
  labels[i] = strdup(buffer);
}

Edit: my find_offset method

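size_t find_offset(const char *s, int n) {
  const char *p = s;
  p += strspn(p, " ");       // skip leading spaces
  while (n > 1 && *p != '\0') {
     p += strcspn(p, " ");   // skip the current word
     p += strspn(p, " ");    // skip the spaces after it
     n--;                    // one word fewer to skip
  }

  return (size_t)(p - s);    // offset of the nth word (n assumed 1-based)
}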

Edit 2: The relevant code used to sort

//Getting the line and offset
for (i = 0; i < numLines && fgets(buffer, sizeof(buffer), fp) != 0; i++) {
   labels[i].line = strdup(buffer);
   labels[i].offset = find_offset(labels[i].line, nth);
}


int n = sizeof(labels) / sizeof(labels[0]);
qsort(labels, n, sizeof(*labels), myCompare);
for (i = 0; i < numLines; i++)
  printf("%d: %s", i, labels[i].line); //Print the sorted lines


int myCompare(const void* a, const void* b) { //Compare function
  const xline *xlineA = a;
  const xline *xlineB = b;

  return strcmp(xlineA->line + xlineA->offset, xlineB->line + xlineB->offset);
}

nhlyoung asked Sep 02 '25 02:09


2 Answers

Rather than fighting strtok(), consider parsing the tokens with strspn() and strcspn(). Neither function modifies its input, so the original string can even be const.

#include <stdio.h>
#include <string.h>

int main(void) {
  const char str[] = "this line is first";
  const char *s = str;
  while (*(s += strspn(s, " ")) != '\0') {
    size_t len = strcspn(s, " ");

    // Instead of printing, use the nth parsed token for key sorting
    printf("<%.*s>\n", (int) len, s);

    s += len;
  }
}

Output

<this>
<line>
<is>
<first>

Or

Do not sort the lines themselves. Sort structures that pair each line with the offset of its key word:

typedef struct {
  char *line;
  size_t offset;
} xline;

Pseudo code

int fcmp(a, b) {
  return strcmp(a->line + a->offset, b->line + b->offset);
}

size_t find_offset_of_nth_word(const char *s, n) {
  while (n > 0) {
    use strspn(), strcspn() like above
  }
}

main() {
  int nth = ...;
  xline labels[numLines];
  for (i = 0; i < numLines && fgets(buffer, sizeof(buffer), fp) != 0; i++) {
     labels[i].line = strdup(buffer);
     labels[i].offset = find_offset_of_nth_word(labels[i].line, nth);
  }

  qsort(labels, i, sizeof *labels, fcmp);

}
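
The pseudo code above could be fleshed out roughly as follows. This is only a sketch under a few assumptions: words are separated by spaces only, `nth` is 1-based, and `sort_by_nth_word` is a hypothetical helper (not part of the original answer) that takes lines already read with fgets()/strdup().

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
  char *line;     /* the untouched original line */
  size_t offset;  /* where the key word starts inside line */
} xline;

/* Offset of the nth word (1-based, an assumption) in s.
   If s has fewer than n words, the offset of the terminating '\0'
   is returned. */
size_t find_offset_of_nth_word(const char *s, int n) {
  const char *p = s + strspn(s, " ");   /* skip leading spaces */
  while (n > 1 && *p != '\0') {
    p += strcspn(p, " ");               /* skip the current word */
    p += strspn(p, " ");                /* skip the spaces after it */
    n--;
  }
  return (size_t)(p - s);
}

/* qsort comparator: compare from the key word to the end of each line. */
int fcmp(const void *a, const void *b) {
  const xline *xa = a;
  const xline *xb = b;
  return strcmp(xa->line + xa->offset, xb->line + xb->offset);
}

/* Hypothetical helper: fill labels[] from lines[] and sort by the nth word. */
void sort_by_nth_word(xline *labels, char **lines, size_t count, int nth) {
  for (size_t i = 0; i < count; i++) {
    labels[i].line = lines[i];
    labels[i].offset = find_offset_of_nth_word(lines[i], nth);
  }
  qsort(labels, count, sizeof *labels, fcmp);
}
```

After the call, printing `labels[i].line` in order gives the full, unmodified lines sorted by the chosen word. Note that strcmp() here compares everything from the key word to the end of the line, so later words act as tie-breakers.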

Or

After reading each line, find the nth token with strspn() and strcspn(), then re-form the line from "aaa bbb ccc ddd \n" to "ccc ddd \naaa bbb ", sort, and afterwards restore each line to its original word order.
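
One possible way to sketch this rotate, sort, restore idea is shown here. It assumes every line is a writable buffer that ends with exactly one '\n' (as fgets() typically produces), that words are separated by spaces only, and that `nth` is 1-based. All function names are hypothetical, and the in-place rotation by three reversals is just one option; a temporary buffer would work as well.

```c
#include <stdlib.h>
#include <string.h>

/* Offset of the nth word (1-based, an assumption) in s. */
size_t nth_word_offset(const char *s, int n) {
  const char *p = s + strspn(s, " ");
  while (n > 1 && *p != '\0') {
    p += strcspn(p, " ");   /* skip the current word */
    p += strspn(p, " ");    /* skip the spaces after it */
    n--;
  }
  return (size_t)(p - s);
}

void reverse_bytes(char *s, size_t len) {
  if (len < 2) return;
  for (size_t i = 0, j = len - 1; i < j; i++, j--) {
    char t = s[i];
    s[i] = s[j];
    s[j] = t;
  }
}

/* Move the first k bytes of s to its end, in place. */
void rotate_left(char *s, size_t k) {
  size_t len = strlen(s);
  if (len == 0) return;
  k %= len;
  reverse_bytes(s, k);
  reverse_bytes(s + k, len - k);
  reverse_bytes(s, len);
}

/* Undo rotate_left(): the '\n' marks the end of the original line. */
void restore_line(char *s) {
  const char *nl = strchr(s, '\n');
  if (nl != NULL) rotate_left(s, (size_t)(nl - s) + 1);
}

int cmp_lines(const void *a, const void *b) {
  char *const *pa = a;
  char *const *pb = b;
  return strcmp(*pa, *pb);
}

void sort_lines_by_nth_word(char **lines, size_t count, int nth) {
  for (size_t i = 0; i < count; i++)
    rotate_left(lines[i], nth_word_offset(lines[i], nth));
  qsort(lines, count, sizeof *lines, cmp_lines);
  for (size_t i = 0; i < count; i++)
    restore_line(lines[i]);
}
```

The appeal of this variant is that a plain strcmp() comparator suffices and no extra per-line bookkeeping is stored; the cost is that each line is rewritten twice.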


In all cases, do not use strtok(): too much information is lost.

chux - Reinstate Monica answered Sep 04 '25 16:09


I need to put the string back together in its original form. I'm aware that strtok turns the tokens into '\0', but I've yet to find a way to get the original string back.

Far better would be to avoid damaging the original strings in the first place if you want to keep them, and especially to avoid losing the pointers to them. Provided that it is safe to assume that there are at least three words in each line and that the second is separated from the first and third by exactly one space on each side, you could undo strtok()'s replacement of delimiters with string terminators. However, there is no safe or reliable way to recover the start of the overall string once you lose it.
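
If strtok() must be used anyway, the undo step might look like the sketch below. It is a variation rather than exactly the scheme described above: it records the line's length before tokenizing, so every '\0' found inside that range afterwards must be a delimiter that strtok() overwrote. It assumes the space is the only delimiter passed to strtok() and that the caller keeps the pointer to the start of the line. The function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Put back every space that strtok() overwrote with '\0'.
   orig_len must be strlen(line) measured BEFORE the first strtok() call. */
void restore_spaces(char *line, size_t orig_len) {
  for (size_t i = 0; i < orig_len; i++) {
    if (line[i] == '\0') line[i] = ' ';
  }
}

/* Offset of the second word, found with strtok() and then undone.
   Returns the line's length if it has fewer than two words. */
size_t second_word_offset(char *line) {
  size_t orig_len = strlen(line);
  size_t offset = orig_len;
  if (strtok(line, " ") != NULL) {
    char *second = strtok(NULL, " ");
    if (second != NULL) offset = (size_t)(second - line);
  }
  restore_spaces(line, orig_len);  /* line is intact again */
  return offset;
}
```

Because the original pointer is never overwritten, the start of the string is not lost, and the returned offset can serve as the sort key position.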

I suggest creating an auxiliary array in which you record information about the second word of each sentence (obtained without damaging the original sentences) and then co-sorting the auxiliary array and the sentence array. The information recorded for each sentence could be a copy of its second word, that word's offset and length, or something similar.
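
A minimal sketch of the auxiliary-array idea follows. Since qsort() sorts only one array, this version stores each sentence's original index next to its key and reorders the sentence array afterwards; that is one way to get the effect of co-sorting, not the only one. The `word_key` type and the function names are hypothetical, and words are assumed to be separated by spaces.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
  const char *word;  /* start of the second word inside the original line */
  size_t len;        /* length of that word */
  size_t index;      /* position of the line in the sentence array */
} word_key;

/* Compare two keys by their word only, not by the rest of the line. */
int cmp_keys(const void *a, const void *b) {
  const word_key *ka = a;
  const word_key *kb = b;
  size_t n = ka->len < kb->len ? ka->len : kb->len;
  int c = memcmp(ka->word, kb->word, n);
  if (c != 0) return c;
  return (ka->len > kb->len) - (ka->len < kb->len);  /* shorter word first */
}

/* Reorder lines[0..count) by second word; the strings are never modified.
   Returns 0 on success, -1 if memory allocation fails. */
int sort_by_second_word(char **lines, size_t count) {
  if (count == 0) return 0;
  word_key *keys = malloc(count * sizeof *keys);
  char **sorted = malloc(count * sizeof *sorted);
  if (keys == NULL || sorted == NULL) {
    free(keys);
    free(sorted);
    return -1;
  }
  for (size_t i = 0; i < count; i++) {
    const char *p = lines[i] + strspn(lines[i], " ");
    p += strcspn(p, " ");             /* skip the first word */
    p += strspn(p, " ");              /* skip the spaces after it */
    keys[i].word = p;
    keys[i].len = strcspn(p, " \n");  /* word ends at space, newline or '\0' */
    keys[i].index = i;
  }
  qsort(keys, count, sizeof *keys, cmp_keys);
  for (size_t i = 0; i < count; i++)
    sorted[i] = lines[keys[i].index];
  memcpy(lines, sorted, count * sizeof *lines);
  free(keys);
  free(sorted);
  return 0;
}
```

Lines whose second words are equal may come out in either order, because qsort() is not guaranteed to be stable; comparing `index` as a final tie-breaker would make the order deterministic.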

John Bollinger answered Sep 04 '25 16:09