I want to find frequency of all substrings in a string in C++. Currently I am using the following :
unordered_map<string,int> mp;
string s;// the string of which we want all substrings... n is length
cin>>s;
string t;
for(int i=0,i<=n-1;++i) // starting point of a substring
{
t="";
for(int j=i;j<=n-1;++j) // all substrings startings at i
{
t+=s[j];
++mp[t];
}
}
I want to improve its time complexity. Can we do any better? Does there exist a better algorithm? Sorry if this is not on topic here... I will close it if that's the case.
Edit :
Here's what I came up with... Maintain a trie of all suffixes of the string. Then traverse all the substrings starting at i so that searching is O(1).
Each node specifies a substring(prefix of suffix). Now maintain frequency at each node and update it accordingly. Though this method is O(n^2) but the constants are fairly large due to memory allocation and resetting each next pointers of node(26 times) to NULL. Can I optimize it further? Also can there be any faster alternatives to store a trie than a linked list? I was able to squeeze my solution but it was very close to the time limit.
Here's a version in raw C code. It uses two arrays that are the same length as the input string (s_len) to count the number of matches and the position of duplicates. The advantage is that the string is never duplicated in a map, and it saves the time needed to create map entries (which you found was much slower). Another advantage is that it doesn't require n^2 memory, it prints out the information immediately like a map/reduce function, to be processed in a later step. It uses native C memory functions like calloc(), bzero() and memcmp() for efficient memory allocation, zeroing and comparing.
The algorithm works like this:
len, as len goes from 1 to s_len-1:matches and dups;i) from the start (position 0) to the end:j = i+1 to stop), compare it with the current position;i and mark position j as a duplicate;len.Here is the code:
#include <stdio.h>
#include <stdlib.h> /* for calloc() */
#include <strings.h> /* for bzero() */
/* Find the number of matching substrings in the string s */
void sub(char *s)
{
size_t s_len = strlen(s);
short *matches = (short *) calloc(s_len, sizeof(short));
short *dups = (short *) calloc(s_len, sizeof(short));
size_t n = s_len * sizeof(short); /* used by bzero() */
size_t len, i, j, stop;
/* Find all substrings of length 1..s_len */
for (len=1; len<s_len; ++len)
{
bzero((void *) matches, n); /* zero out the number of matches */
bzero((void *) dups, n); /* zero out the duplicates */
stop = s_len - len + 1;
for (i=0; i<stop; ++i)
{
if (dups[i]) /* this is a duplicate (was already counted) */
continue;
for (j=i+1; j<stop; ++j)
{
if (memcmp(s+i, s+j, len)) /* substring comparison */
continue; /* not a match? continue */
matches[i]++;
dups[j] = 1;
}
if (matches[i])
printf("%d: %.*s\n", matches[i]+1, (int) len, s+i);
}
}
}
int main()
{
sub("abcabcabcabc");
return 0;
}
Here is the output:
4: a
4: b
4: c
4: ab
4: bc
3: ca
4: abc
3: bca
3: cab
3: abca
3: bcab
3: cabc
3: abcab
3: bcabc
2: cabca
3: abcabc
2: bcabca
2: cabcab
2: abcabca
2: bcabcab
2: cabcabc
2: abcabcab
2: bcabcabc
2: abcabcabc
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With