Frequency of all substrings

Question

I want to find frequency of all substrings in a string in C++. Currently I am using the following :

    unordered_map<string,int> mp;
    string s;// the string of which we want all substrings... n is length
    cin>>s;
    string t;
    for(int i=0,i<=n-1;++i) // starting point of a substring
    {
        t="";
        for(int j=i;j<=n-1;++j) // all substrings startings at i
        {
            t+=s[j];
            ++mp[t];
        }
    }

I want to improve its time complexity. Can we do any better? Does there exist a better algorithm? Sorry if this is not on topic here... I will close it if that's the case.

Edit :

Here's what I came up with... Maintain a trie of all suffixes of the string. Then traverse all the substrings starting at i so that searching is O(1).

Each node specifies a substring(prefix of suffix). Now maintain frequency at each node and update it accordingly. Though this method is O(n^2) but the constants are fairly large due to memory allocation and resetting each next pointers of node(26 times) to NULL. Can I optimize it further? Also can there be any faster alternatives to store a trie than a linked list? I was able to squeeze my solution but it was very close to the time limit.

Brent Washburne · Accepted Answer

Here's a version in raw C code. It uses two arrays that are the same length as the input string (s_len) to count the number of matches and the position of duplicates. The advantage is that the string is never duplicated in a map, and it saves the time needed to create map entries (which you found was much slower). Another advantage is that it doesn't require n^2 memory, it prints out the information immediately like a map/reduce function, to be processed in a later step. It uses native C memory functions like calloc(), bzero() and memcmp() for efficient memory allocation, zeroing and comparing.

The algorithm works like this:

For each string of length len, as len goes from 1 to s_len-1:
Clear out the arrays matches and dups;
Walk along the string (using i) from the start (position 0) to the end:
If this position was already counted as a duplicate, skip it;
For each position further down the string (j = i+1 to stop), compare it with the current position;
If this is a match, increment the number of matches at position i and mark position j as a duplicate;
At the end of this walk, print out the number of matches for length len.

Here is the code:

#include <stdio.h>
#include <stdlib.h>    /* for calloc() */
#include <strings.h>   /* for bzero() */

/* Find the number of matching substrings in the string s */
void sub(char *s)
{
    size_t s_len = strlen(s);
    short *matches = (short *) calloc(s_len, sizeof(short));
    short *dups = (short *) calloc(s_len, sizeof(short));
    size_t n = s_len * sizeof(short);    /* used by bzero() */
    size_t len, i, j, stop;

    /* Find all substrings of length 1..s_len */
    for (len=1; len<s_len; ++len)
    {
        bzero((void *) matches, n);    /* zero out the number of matches */
        bzero((void *) dups, n);       /* zero out the duplicates */
        stop = s_len - len + 1;
        for (i=0; i<stop; ++i)
        {   
            if (dups[i])    /* this is a duplicate (was already counted) */
                continue;   
            for (j=i+1; j<stop; ++j)
            {       
                if (memcmp(s+i, s+j, len))    /* substring comparison */
                    continue;    /* not a match? continue */
                matches[i]++;
                dups[j] = 1;
            }       
            if (matches[i])
                printf("%d: %.*s
", matches[i]+1, (int) len, s+i);
        }   
    }
}

int main()
{
    sub("abcabcabcabc");
    return 0;
}

Here is the output:

4: a
4: b
4: c
4: ab
4: bc
3: ca
4: abc
3: bca
3: cab
3: abca
3: bcab
3: cabc
3: abcab
3: bcabc
2: cabca
3: abcabc
2: bcabca
2: cabcab
2: abcabca
2: bcabcab
2: cabcabc
2: abcabcab
2: bcabcabc
2: abcabcabc

Frequency of all substrings

Tags:

c++

string

algorithm

evil999man

1 Answers

Brent Washburne

Recent Activity

Donate For Us

Frequency of all substrings

Tags:

c++

string

algorithm

evil999man

1 Answers

Brent Washburne

Related questions

Recent Activity

Donate For Us