
C regex performance

I'm writing C code to run some regexes over enwik8 and enwik9. I'm also implementing the same algorithm in other languages for benchmarking purposes. The problem is that I'm doing something wrong in my C code, because it takes 40 seconds while Python and the others take just 10 seconds.

What am I forgetting?

#include <stdio.h>
#include <regex.h>

#define size 1024

int main(int argc, char **argv){
    FILE *fp;
    char line[size];
    regex_t re;
    int x;
    const char *filename = "enwik8";
    const char *strings[] = {"\bhome\b", "\bdear\b", "\bhouse\b", "\bdog\b", "\bcat\b", "\bblue\b", "\bred\b", "\bgreen\b", "\bbox\b", "\bwoman\b", "\bman\b", "\bwomen\b", "\bfull\b", "\bempty\b", "\bleft\b", "\bright\b", "\btop\b", "\bhelp\b", "\bneed\b", "\bwrite\b", "\bread\b", "\btalk\b", "\bgo\b", "\bstay\b", "\bupper\b", "\blower\b", "\bI\b", "\byou\b", "\bhe\b", "\bshe\b", "\bwe\b", "\bthey\b"};   

    for(x = 0; x < 33; x++){
        if(regcomp(&re, strings[x], REG_EXTENDED) != 0){
            printf("Failed to compile regex '%s'\n", strings[x]);

            return -1;
        }

        fp = fopen(filename, "r");

        if(fp == 0){
            printf("Failed to open file %s\n", filename);

            return -1;
        }

        while((fgets(line, size, fp)) != NULL){
            regexec(&re, line, 0, NULL, 0);
        } 
    }

    return 0;
}
Frederico Schardong asked Jan 23 '26 10:01



1 Answer

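Repeated file access and regex compilation are probably the culprits: the loop reopens and rereads the whole file once per pattern. Make a single pass over the file instead:

  • compile your regexes once and store them in an array
  • open the file
  • read a line
  • run each compiled regex over it, then repeat for the next line
  • close the file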
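
The steps above could be sketched roughly as follows. This is only an illustration, not the asker's final code: the function name count_matching_lines, the LINE_SIZE constant, and the per-pattern counts array are hypothetical, and REG_NOSUB is added on the assumption that only a yes/no match result is needed.

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_SIZE 1024

/* Hypothetical helper: for each pattern, counts how many fgets() chunks
 * of the file match it. Lines longer than LINE_SIZE - 1 characters are
 * split into several chunks, as in the question's code.
 * Returns 0 on success, -1 on any failure. */
int count_matching_lines(const char *filename, const char *const *patterns,
                         size_t npatterns, long *counts)
{
    regex_t *compiled = malloc(npatterns * sizeof *compiled);
    if (compiled == NULL)
        return -1;

    /* Step 1: compile every pattern once, up front. */
    size_t ncompiled = 0;
    int status = 0;
    for (; ncompiled < npatterns; ncompiled++) {
        if (regcomp(&compiled[ncompiled], patterns[ncompiled],
                    REG_EXTENDED | REG_NOSUB) != 0) {
            fprintf(stderr, "Failed to compile regex '%s'\n",
                    patterns[ncompiled]);
            status = -1;
            break;
        }
        counts[ncompiled] = 0;
    }

    /* Step 2: open the file once (skipped if compilation failed). */
    FILE *fp = NULL;
    if (status == 0) {
        fp = fopen(filename, "r");
        if (fp == NULL) {
            fprintf(stderr, "Failed to open file %s\n", filename);
            status = -1;
        }
    }

    if (fp != NULL) {
        char line[LINE_SIZE];
        /* Steps 3 and 4: read each line once, run every regex over it. */
        while (fgets(line, sizeof line, fp) != NULL) {
            for (size_t i = 0; i < npatterns; i++) {
                if (regexec(&compiled[i], line, 0, NULL, 0) == 0)
                    counts[i]++;
            }
        }
        /* Step 5: close the file. */
        fclose(fp);
    }

    /* Release only the regexes that were successfully compiled. */
    for (size_t i = 0; i < ncompiled; i++)
        regfree(&compiled[i]);
    free(compiled);
    return status;
}
```

A caller would pass the pattern array and its real element count, for example sizeof strings / sizeof strings[0], rather than a hard-coded number; the question's array holds 32 patterns while its loop runs to 33. Note also that "\b" inside a C string literal is a backspace character, so a word-boundary escape has to be written "\\b", and that escape is not part of POSIX extended regular expressions; some implementations, glibc among them, accept it as an extension.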
Josh Petitt answered Jan 26 '26 00:01



