Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fast string search to find string array element which matches the givven pattern

Tags:

c

fastsearch

I have an array of constant strings which I iterate through to find an index of element which is a string that contains a search pattern. Which search algorithm should I choose to improve the speed of finding this element? I am not limited in time before running the application for preparing the look up tables if any are necessary.

I corrected a question - I am not doing exact string match - I am searching for pattern inside the element, which is in an array

array of strings:
[0] Red fox jumps over the fence
[1] Blue table
[2] Red flowers on the fence

I need to find an element which contains word 'table' - in this case its element 1

I do like 50000 iterations of a set of 30 array which could contain up to 30000 strings of not less than 128 characters. Now I am using good-old strstr brute force which is too slow...

Ok, posting a part of my function, the first strstr - looks up in an uncut array of lines if there are any occurrences, then the brute search follows. I know I can speed this part, but I am not doing optimization on this approach...

// sheets[i].buffer - contains a page of a text which is split into lines
// fullfunccall - is the pattern
// sheets[i].cells[k] - are the pointers to lines in a buffer

for (i = 0; i < sheetcount; i++) {
    if (i != currsheet && sheets[i].name && sheets[i].name[0] != '\0') {
        if (strstr(sheets[i].buffer, fullfunccall)) {
            usedexternally = 1;

            int foundatleastone = 0;
            for (k = 0; k < sheets[i].numcells; k++) {
                strncpy_s(testline, MAX_LINE_SIZE,
                          sheets[i].cells[k]->line,
                          sheets[i].cells[k]->linesize);
                testline[sheets[i].cells[k]->linesize] = '\0';

                if (strstr(testline, fullfunccall)) {
                    dependency_num++;

                    if (dependency_num >= MAX_CELL_DEPENDENCIES-1) {
                        printf("allocation for sheet cell dependencies is insufficient\n");
                        return;
                    }

                    sheets[currsheet].cells[currcellposinsheet]->numdeps = dependency_num + 1;
                    foundatleastone++;
                    sheets[currsheet].cells[currcellposinsheet]->deps[dependency_num] = &sheets[i].cells[k];
                }
            }

            if (foundatleastone == 0) {
                printf("error locating dependency for external func: %s\n", fullfunccall);
                return;
            }
        }
    };
}
like image 301
Ulterior Avatar asked Jan 31 '26 15:01

Ulterior


1 Answers

You wrote that your 'haystack' (the set of strings to search through) is roughly 30000 strings with approx. 200 characters each. You also wrote that the 'needle' (the term to search for) is either a string of 5 or 20 characters.

Based on this, you could precompute a hashtable which maps any 5-character subsequence to the string(s) in the haystack it occurs in. For 30000 strings (200 characters each) there are at most 30000 * (200 - 5) = 5.850.000 different 5-character substrings. If you hash each of it to a 16bit checksum you'd need a minimum 11MB of memory (for the hash keys) plus some pointers pointing to the string(s) in which the substring occurs.

For instance, given a simplfied haystack of

static const char *haystack[] = { "12345", "123456", "23456", "345678" };

you precompute a hash map which maps any possible 5-character string such that

12345 => haystack[0], haystack[1]
23456 => haystack[1], haystack[2]
34567 => haystack[3]
45678 => haystack[4]

With this, you could take the first five characters of your given key (either 5 or 20 characters long), hash it and then do a normal strstr through all the strings to which the key is mapped by the hash.

like image 155
Frerich Raabe Avatar answered Feb 03 '26 04:02

Frerich Raabe