BASH GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)
Given a string str, which can store only any lower, UPPER or number values only.
How can I find the first character (with the LOWEST occurrence) in a non-empty string str? The focus of the problem is to print letter 'z' if the script is this (as fast as possible, without getting any errors if the data is in a string or a file): https://repl.it/@asangal/find1stleastoccurrencecharmaintainorderanyleastsize
or Sample example values of str:
str=aa, output should be 'a' (as 'a' is the only one char in string - occurred 2 times)
str=aa1, output should be '1' (as '1' is the first character with lowest occurrence count of 1)
str=aa1c1deef, output should be 'c' (as 'c' came before 'd' and both had 1 as lowest occurrence count of 1)
str=abcdeeddAbac, output should be 'A' (as 'A' is the first character with lower occurrence count of 1)
str=abcdeeddAbacA, output should be 'a' (as 'a' is the first character with lower occurrence count of 2)
str=abcdeeddAbacAabc, output should be 'e' (as 'e' is the first character with lower occurrence count of 2)
Other large size example value can be:
str=axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh, output should be 'g' (as 'g' is the first character with lowest occurrence count of 1)
Constraints / Context:
str string with lowest occurrence count.PS: I know system level commands does call all these things behind the scene, but I'm looking for minimal code if possible at command line i.e. $ prompt.
I tried the following ugly-looking non-one-liner attempt as listed below, here I have the for loop which I want to avoid if possible and the sort command is helping but also making me loose the order and doesn't cover all the conditions.
I don't like my current attempt that I have listed below, but seems like I'm close.
str="axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh";
for char in $(echo $str | sed "s/\(.\)/\1\n/g" | grep .| tr '\012' ' ');
do
echo -n "$char=$(echo ${str} | sed "s/\(.\)/\1\n/g" | grep . | grep -c $char)";echo;
done | sort -u
I believe it's possible to achieve what I'm looking for as a One-liner (i.e. by using bunch of common Linux commands and pipes |) in BASH; just wanted to pick your brain! I know there are better shell experts than me.
Most of the solutions I found online don't maintain the order (which is important to me) and just give the highest/lower occurrence/count of a character.
EDIT2: In case anyone needs to know minimum value first appeared character/integer etc from whole Input_file then try following.
awk '
{
num=split($0,array,"")
for(i=1;i<=num;i++){
++count[array[i]]
}
for(j=1;j<=num;j++){
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
}
for(i in count){
min=min<=count[i]?(min?min:count[i]):count[i]
}
}
END{
print "Minimum value found is:" min
split(tot_ind[min],actual," ")
print "All item(s) with same minimum values are:" actual[1]
}
' Input_file
EDIT: Since OP is getting an error so in spite of reading from a variable lets read from an Input_file, in case OP reading values from Input_file then try following.
awk '
{
delete tot_ind
delete array
delete count
delete actual
min=""
num=split($0,array,"")
for(i=1;i<=num;i++){
++count[array[i]]
}
for(j=1;j<=num;j++){
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
}
for(i in count){
min=min<=count[i]?(min?min:count[i]):count[i]
}
print "Minimum value found is:" min
split(tot_ind[min],actual," ")
print "All item(s) with same minimum values are:" actual[1]
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
num=split($0,array,"") ##Splitting current line into arrray with NULL delimiter.
for(i=1;i<=num;i++){ ##Running loop to run till num here.
++count[array[i]] ##Creating count array with index of valueof array and keep incrementing its value with 1.
}
for(j=1;j<=num;j++){ ##Running for loop till num here.
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j] ##Creating tot_ind with index of value of count array, this will have all values of minimum number here.
}
for(i in count){ ##Traversing in array count here.
min=min<=count[i]?(min?min:count[i]):count[i] ##Looking to get minimum value by comparing its value to each element.
}
print "Minimum value found is:" min ##Printing Minimum value here.
split(tot_ind[min],actual," ") ##Splitting tot_ind into actual array to get very first element of minimum value out of all values which have same minimum number.
print "All item(s) with same minimum values are:" actual[1] ##Printing very first minimum number here.
}' Input_file ##Mentioning Input_file name here.
To get very first minimum value which is occurring in Input_file(BTW by this solution all items which have same minimum values could be printed too, with minor change in this code's last print statement). Written and tested in GNU awk.
str="abcdeeddAbacA"
awk -v str="$str" '
BEGIN{
num=split(str,array,"")
for(i=1;i<=num;i++){
++count[array[i]]
}
for(j=1;j<=num;j++){
tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
}
for(i in count){
min=min<=count[i]?(min?min:count[i]):count[i]
}
print "Minimum value found is:" min
split(tot_ind[min],actual," ")
print "All item(s) with same minimum values are:" actual[1]
}'
PROOF of concept: Ran above with OP's examples.
./script.ksh aa
Minimum value found is:2
All item(s) with same minimum values are:a
./script.ksh aa1
Minimum value found is:1
All item(s) with same minimum values are:1
./script.ksh aa1c1deef
Minimum value found is:1
All item(s) with same minimum values are:c
./script.ksh abcdeeddAbac
Minimum value found is:1
All item(s) with same minimum values are:A
./script.ksh abcdeeddAbacA
Minimum value found is:2
All item(s) with same minimum values are:a
./script.ksh abcdeeddAbacAabc
Minimum value found is:2
All item(s) with same minimum values are:e
NOTE: I saved above solution in a script file and passing OP's sample inputs as argument to script, OP could use in whatever way he wants, this was done to show how its working.
ANSWER #1 - string/variable based solution
Assuming the desired string is stored in a variable str, here's one awk solution:
awk -v str="${str}" '
BEGIN { num = split(str,token,"") # split str into an array of single letter/number elements
for ( i=1; i<=num; i++ ) { # get a count of occurrences of each letter/number
count[token[i]]++
}
min = 10000000
for ( i in count ) {
min = count[i]<min?count[i]:min # keep track of the lowest/minimum count
}
for ( i=1; i<=num; i++ ) { # loop through array of letter/numbers
if ( min == count[token[i]] ) { # for the first letter/number we find where count = min
print token[i], min # print the letter/number and count and
break # then break out of our loop
}
}
}'
Running the above against the different sample strings:
++++++++++++++++ str = aa
a 2
++++++++++++++++ str = aa1
1 1
++++++++++++++++ str = aa1c1deef
c 1
++++++++++++++++ str = abcdeeddAbac
A 1
++++++++++++++++ str = abcdeeddAbacA
a 2
++++++++++++++++ str = abcdeeddAbacAabc
e 2
++++++++++++++++ str = axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh
g 1
ANSWER #2 - file/array based solution
Looking at a comment OP made to RavinderSingh13's answer re: a really large string resides in a file, and assuming that file's name is giga.txt ...
We should be able to make some minor mods to the previous awk solution like such:
awk '
BEGIN { RS = "\0" } # address files with no cr/lf
{ num = split($0,token,"") # split line/$0 into an array of single letter/number elements
for( i=1; i<=num; i++ ) { # get a count of occurrences of each letter/number
all[NR i] = token[i] # token array is for current line/$0 while all array is for entire file
count[token[i]]++
}
}
END { min = 10000000
for ( i in count ) {
min = count[i]<min?count[i]:min # find the lowest/minimum count
}
for ( i in all ) { # loop through array of letter/numbers
if ( min == count[all[i]] ) { # for the first letter/number we find where count = min
print all[i], min # print the letter/number and count and
break # then break out of our loop
}
}
}
' giga.txt
Placing the longer str sample into giga.txt:
$ cat giga.txt
axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh
Running the above awk solution against giga.txt gives us:
$ awk '....' giga.txt
g 1
ANSWER #3 - file/substr() based solution
OP provided additional details on how to generate a 'large' file of data:
$ ls lR / > giga.txt # I hit ^C after ~20 secs
$ sed "s/\(.\)/\1\n/g" giga.txt | grep -o [a-zA-Z0-9] | tr -d '\012' > newgiga.txt # remove all but letters and numbers
This gave me a 14 million character file (newgiga.txt).
I ran several timing tests, along with a new awk solution (see below), against the 14 million character file and came up with the following timings:
awk solution (see my previous answer - above)sed/grep/echo/uniq/tr/sort answerawk solution (actually hit ^C after 4 mins)awk solution (see below)NOTE: For all solutions run against my particular newgiga.txt file the final answer was the letter Z (with 365 occurrences).
By replacing the split/array code with a series of substr() calls, and making a small change to how the all array is indexed, I was able to cut ~60% off the run time of the previous file/array based awk solution:
awk '
BEGIN { RS = "\0" }
{ len=length($0)
for( i=1; i<=len; i++ ) { # get a count of occurrences of each letter/number
token=substr($0,i,1)
a++
all[a] = token # token array is for current line/$0 while all array is for entire file
count[token]++
}
}
END { min=10000000
for( i in count ) {
min = count[i]<min?count[i]:min # find the lowest/minimum count
}
for( i in all ) { # loop through array of letter/numbers
if ( min == count[all[i]] ) { # for the first letter/number we find where count = min
print all[i], min # print the letter/number and count and
break # break out of our loop
}
}
}
' newgiga.txt
NOTE: To be honest I didn't expect the substr() calls to be faster than the split/array method but I'm guessing awk has a pretty fast builtin method for running substr() calls.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With