Print / Find the first character (with the LOWEST occurrence) in a non-empty string and order is important [closed]

Question

BASH GNU bash, version 4.2.46(2)-release (x86_64-redhat-linux-gnu)

Given a string str, which can store only any lower, UPPER or number values only.

How can I find the first character (with the LOWEST occurrence) in a non-empty string str? The focus of the problem is to print letter 'z' if the script is this (as fast as possible, without getting any errors if the data is in a string or a file): https://repl.it/@asangal/find1stleastoccurrencecharmaintainorderanyleastsize

or Sample example values of str:

str=aa, output should be 'a' (as 'a' is the only one char in string - occurred 2 times)

str=aa1, output should be '1' (as '1' is the first character with lowest occurrence count of 1)

str=aa1c1deef, output should be 'c' (as 'c' came before 'd' and both had 1 as lowest occurrence count of 1)

str=abcdeeddAbac, output should be 'A' (as 'A' is the first character with lower occurrence count of 1)

str=abcdeeddAbacA, output should be 'a' (as 'a' is the first character with lower occurrence count of 2)

str=abcdeeddAbacAabc, output should be 'e' (as 'e' is the first character with lower occurrence count of 2)

Other large size example value can be:

str=axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh, output should be 'g' (as 'g' is the first character with lowest occurrence count of 1)

Constraints / Context:

Value can be a set of either a lower, UPPER or a Number
String will always be non-empty; we can ignore any space type characters in the value for now.
Find the first letter([a-zA-Z0-9]) present in the str string with lowest occurrence count.
If possible, I don't want to use any Statements (ex: if-then-else), loops (For/While) or user defined functions. Use of commands, library functions (if any available to a user out of the box) is OK.

PS: I know system level commands does call all these things behind the scene, but I'm looking for minimal code if possible at command line i.e. $ prompt.

I tried the following ugly-looking non-one-liner attempt as listed below, here I have the for loop which I want to avoid if possible and the sort command is helping but also making me loose the order and doesn't cover all the conditions.

I don't like my current attempt that I have listed below, but seems like I'm close.

str="axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh";
for char in $(echo $str | sed "s/$.$/\1\n/g" | grep .| tr '\012' ' '); 
do
    echo -n "$char=$(echo ${str} | sed "s/$.$/\1\n/g" | grep . | grep -c $char)";echo;
done | sort -u

I believe it's possible to achieve what I'm looking for as a One-liner (i.e. by using bunch of common Linux commands and pipes |) in BASH; just wanted to pick your brain! I know there are better shell experts than me.

Most of the solutions I found online don't maintain the order (which is important to me) and just give the highest/lower occurrence/count of a character.

RavinderSingh13 · Accepted Answer

EDIT2: In case anyone needs to know minimum value first appeared character/integer etc from whole Input_file then try following.

awk '
{
  num=split($0,array,"")
  for(i=1;i<=num;i++){
    ++count[array[i]]
  }
  for(j=1;j<=num;j++){
    tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
  }
  for(i in count){
    min=min<=count[i]?(min?min:count[i]):count[i]
  }
}    
END{
  print "Minimum value found is:" min
  split(tot_ind[min],actual," ")
  print "All item(s) with same minimum values are:" actual[1]
}
'  Input_file

EDIT: Since OP is getting an error so in spite of reading from a variable lets read from an Input_file, in case OP reading values from Input_file then try following.

awk '
{
  delete tot_ind
  delete array
  delete count
  delete actual
  min=""
  num=split($0,array,"")
  for(i=1;i<=num;i++){
    ++count[array[i]]
  }
  for(j=1;j<=num;j++){
    tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
  }
  for(i in count){
    min=min<=count[i]?(min?min:count[i]):count[i]
  }
  print "Minimum value found is:" min
  split(tot_ind[min],actual," ")
  print "All item(s) with same minimum values are:" actual[1]
}'  Input_file

Explanation: Adding detailed explanation for above.

awk '                                                            ##Starting awk program from here.
{
  num=split($0,array,"")                                         ##Splitting current line into arrray with NULL delimiter.
  for(i=1;i<=num;i++){                                           ##Running loop to run till num here.
    ++count[array[i]]                                            ##Creating count array with index of valueof array and keep incrementing its value with 1.
  }
  for(j=1;j<=num;j++){                                           ##Running for loop till num here.
    tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]   ##Creating tot_ind with index of value of count array, this will have all values of minimum number here.
  }
  for(i in count){                                               ##Traversing in array count here.
    min=min<=count[i]?(min?min:count[i]):count[i]                ##Looking to get minimum value by comparing its value to each element.
  }
  print "Minimum value found is:" min                            ##Printing Minimum value here.
  split(tot_ind[min],actual," ")                                 ##Splitting tot_ind into actual array to get very first element of minimum value out of all values which have same minimum number.
  print "All item(s) with same minimum values are:" actual[1]    ##Printing very first minimum number here.
}' Input_file                                                    ##Mentioning Input_file name here.

To get very first minimum value which is occurring in Input_file(BTW by this solution all items which have same minimum values could be printed too, with minor change in this code's last print statement). Written and tested in GNU awk.

str="abcdeeddAbacA"
awk -v str="$str" '
BEGIN{
  num=split(str,array,"")
  for(i=1;i<=num;i++){
    ++count[array[i]]
  }
  for(j=1;j<=num;j++){
    tot_ind[count[array[j]]]=(tot_ind[count[array[j]]]?tot_ind[count[array[j]]] OFS:"")array[j]
  }
  for(i in count){
    min=min<=count[i]?(min?min:count[i]):count[i]
  }
    print "Minimum value found is:" min
    split(tot_ind[min],actual," ")
    print "All item(s) with same minimum values are:" actual[1]
}'

PROOF of concept: Ran above with OP's examples.

./script.ksh aa
Minimum value found is:2
All item(s) with same minimum values are:a

./script.ksh aa1
Minimum value found is:1
All item(s) with same minimum values are:1

./script.ksh aa1c1deef
Minimum value found is:1
All item(s) with same minimum values are:c

./script.ksh abcdeeddAbac
Minimum value found is:1
All item(s) with same minimum values are:A

./script.ksh abcdeeddAbacA
Minimum value found is:2
All item(s) with same minimum values are:a

./script.ksh abcdeeddAbacAabc
Minimum value found is:2
All item(s) with same minimum values are:e

NOTE: I saved above solution in a script file and passing OP's sample inputs as argument to script, OP could use in whatever way he wants, this was done to show how its working.

markp-fuso · Answer

ANSWER #1 - string/variable based solution

Assuming the desired string is stored in a variable str, here's one awk solution:

awk -v str="${str}" '
BEGIN { num = split(str,token,"")                # split str into an array of single letter/number elements

        for ( i=1; i<=num; i++ ) {               # get a count of occurrences of each letter/number
            count[token[i]]++
        }

        min = 10000000

        for ( i in count ) {
            min = count[i]<min?count[i]:min      # keep track of the lowest/minimum count
        }

        for ( i=1; i<=num; i++ ) {               # loop through array of letter/numbers
            if ( min == count[token[i]] ) {      # for the first letter/number we find where count = min
                print token[i], min              # print the letter/number and count and 
                break                            # then break out of our loop
            }
        }
      }'

Running the above against the different sample strings:

++++++++++++++++ str = aa
a 2
++++++++++++++++ str = aa1
1 1
++++++++++++++++ str = aa1c1deef
c 1
++++++++++++++++ str = abcdeeddAbac
A 1
++++++++++++++++ str = abcdeeddAbacA
a 2
++++++++++++++++ str = abcdeeddAbacAabc
e 2
++++++++++++++++ str = axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh
g 1

ANSWER #2 - file/array based solution

Looking at a comment OP made to RavinderSingh13's answer re: a really large string resides in a file, and assuming that file's name is giga.txt ...

We should be able to make some minor mods to the previous awk solution like such:

awk '
BEGIN { RS = "\0" }                            # address files with no cr/lf
{ num = split($0,token,"")                     # split line/$0 into an array of single letter/number elements

  for( i=1; i<=num; i++ ) {                    # get a count of occurrences of each letter/number
      all[NR i] = token[i]                     # token array is for current line/$0 while all array is for entire file
      count[token[i]]++
  }
}

END { min = 10000000

      for ( i in count ) {
          min = count[i]<min?count[i]:min      # find the lowest/minimum count
      }

      for ( i in all ) {                       # loop through array of letter/numbers
          if ( min == count[all[i]] ) {        # for the first letter/number we find where count = min
              print all[i], min                # print the letter/number and count and 
              break                            # then break out of our loop
          }
      }
    }
' giga.txt

Placing the longer str sample into giga.txt:

$ cat giga.txt
axavzzzfdfdsldfnasdlkfjasdlkfjaslkfjasldkfjaslfjlasjkflasdkjfasdlfjasdljfasdkjfgio23yoryoiasyfoiywoerihlkdfhlaskdnkasdnvxcnvjzxkiivhaslyqwoyroiqwyroqwroqwlkasddlkkhaslkfjasdldkfjalsdkfashoqwiyroiqwyroiqwhrkjhajkdfhaslfkhasldkfh

Running the above awk solution against giga.txt gives us:

$ awk '....' giga.txt
g 1

ANSWER #3 - file/substr() based solution

OP provided additional details on how to generate a 'large' file of data:

$ ls lR / > giga.txt       # I hit ^C after ~20 secs
$ sed  "s/$.$/\1\n/g" giga.txt | grep -o [a-zA-Z0-9] | tr -d '\012' > newgiga.txt       # remove all but letters and numbers

This gave me a 14 million character file (newgiga.txt).

I ran several timing tests, along with a new awk solution (see below), against the 14 million character file and came up with the following timings:

15 secs with file/array based awk solution (see my previous answer - above)
25 secs with OPs sed/grep/echo/uniq/tr/sort answer
4+ mins with RavinderSingh13's awk solution (actually hit ^C after 4 mins)
6 secs with a new file/substr() based awk solution (see below)

NOTE: For all solutions run against my particular newgiga.txt file the final answer was the letter Z (with 365 occurrences).

By replacing the split/array code with a series of substr() calls, and making a small change to how the all array is indexed, I was able to cut ~60% off the run time of the previous file/array based awk solution:

awk '
BEGIN { RS = "\0" }
{ len=length($0)

  for( i=1; i<=len; i++ ) {                    # get a count of occurrences of each letter/number
      token=substr($0,i,1)
      a++
      all[a] = token                           # token array is for current line/$0 while all array is for entire file
      count[token]++
  }
}

END { min=10000000

      for( i in count ) {
          min = count[i]<min?count[i]:min      # find the lowest/minimum count
      }

      for( i in all ) {                        # loop through array of letter/numbers
          if ( min == count[all[i]] ) {        # for the first letter/number we find where count = min
              print all[i], min                # print the letter/number and count and
              break                            # break out of our loop
          }
      }
    }
' newgiga.txt

NOTE: To be honest I didn't expect the substr() calls to be faster than the split/array method but I'm guessing awk has a pretty fast builtin method for running substr() calls.

Print / Find the first character (with the LOWEST occurrence) in a non-empty string and order is important [closed]

Tags:

linux

bash

sed

awk

one-liner

AKS

2 Answers

RavinderSingh13

markp-fuso

Recent Activity

Donate For Us

Print / Find the first character (with the LOWEST occurrence) in a non-empty string and order is important [closed]

Tags:

linux

bash

sed

awk

one-liner

AKS

2 Answers

RavinderSingh13

markp-fuso

Related questions

Recent Activity

Donate For Us