
Shell format and substitution task

Tags: bash, sed

I want to perform the operations described below on structured text using a bash script. However, with my current knowledge the task is very challenging.

Input sample:

"4-QUEIJOS": Mucarela Provolone Catupiry Ricota Oregano
"A-MODA": Mucarela Presunto Calabresa Bacon Tomate Milho Oregano
"ALHO-E-OLEO": Mucarela Alho oleo Oregano
"PEITO-DE-PERU-ESPECIAL": Mucarela Peito-de-Peru Catupiry Oregano

Output sample:

"4-QUEIJOS": ["mucarela", "provolone", "catupiry", "ricota", "oregano"],
"A-MODA": ["mucarela", "presunto", "calabresa", "bacon", "tomate", "milho", "oregano"],
"ALHO-E-OLEO": ["mucarela", "alho", "oleo", "oregano"],
"PEITO-DE-PERU-ESPECIAL": ["mucarela", "peito-de-peru", "catupiry", "oregano"]

As you can see above, we need to:

  1. Convert the words after the ":" character to lower case;
  2. Add commas between those words;
  3. Put them between brackets [...]

The cherry on top is the comma at the end of every line except the last.

Bruno Peixoto asked Oct 20 '25 04:10


1 Answer

I decided to do a deep dive on sed and, more specifically, to understand @HatLess's sed kung fu in the answer above. I ran the command as posted with --debug and spent some time digging into the regex and other sed-isms. It's one thing to get a one-liner that solves the problem; it's another to grok what just happened and why it works. So here is my play-by-play of that answer, because I am not satisfied with just memorizing formulas or patterns!

Lifting the hood to see how the sausage is made is the only way this stuff really sinks in, especially with something like sed. It's like learning the fundamentals of music: once you figure out the patterns, composing your own symphony is not a far stretch.

Let's break this sed command/script down and go step by step:

sed --debug -E ':a;s/([^ ]*) ([^ ]*)/\1"\2",/;ta;s/(:)(.*")/\1 [\L\2]/;$s/,$//;s/,/& /g' input_file

Note that the single quotes enclose the whole set of commands sed will run, and the semicolons separate the individual sed commands. Both --debug and the \L used further down are GNU sed features.
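If you want to follow along, here is a minimal sketch that recreates the question's sample data and runs the same script without --debug. The temp file and the `result` variable are just my scaffolding (assumptions, not part of the original answer), and it assumes GNU sed because of `\L`.

```shell
# Recreate the sample input from the question in a scratch file.
input_file=$(mktemp)
cat > "$input_file" <<'EOF'
"4-QUEIJOS": Mucarela Provolone Catupiry Ricota Oregano
"A-MODA": Mucarela Presunto Calabresa Bacon Tomate Milho Oregano
"ALHO-E-OLEO": Mucarela Alho oleo Oregano
"PEITO-DE-PERU-ESPECIAL": Mucarela Peito-de-Peru Catupiry Oregano
EOF

# Same script as in the answer, minus --debug (GNU sed assumed for \L).
result=$(sed -E ':a;s/([^ ]*) ([^ ]*)/\1"\2",/;ta;s/(:)(.*")/\1 [\L\2]/;$s/,$//;s/,/& /g' "$input_file")
printf '%s\n' "$result"

# Clean up the scratch file.
rm -f "$input_file"
```

Adding --debug back to the sed call prints the annotated trace that the rest of this answer walks through.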

Annotate program execution

--debug

Use extended regular expressions

-E

Mark a spot to jump to for iterations

:a defines a label named a; the later t a command can jump back to it

Add quotes and commas to the values

s/([^ ]*) ([^ ]*)/\1"\2",/

On the first pass, the pattern matches "4-QUEIJOS": Mucarela

  • "capture group 1" = "4-QUEIJOS":
  • "capture group 2" = Mucarela
  • removes the space between the two groups (for further processing)
  • adds double quotes around group 2
  • adds a comma after the quoted group 2

Removing the space is a clever "trick": on the next iteration, group 1 swallows everything already processed and group 2 lands on the next word, and so on until all the values are formatted. The spaces between values are added back later.

  s                           # sed substitute command, (i.e.: s/old/new/)
   /                          # start search pattern from here
    (                         # start a capture group (group 1)
     [^                       # begin negated set (matches anything NOT listed)
                              # the listed character: a single space
        ]                     # end of set, i.e. "any non-space character"
         *                    # zero or more of those characters (greedy)
          )                   # end capture group (group 1)
                              # space between first and second capture group
            (                 # begin capture group (group 2)
             [^               # begin negated set (matches anything NOT listed)
                              # the listed character: a single space
                ]             # end of set, i.e. "any non-space character"
                 *            # zero or more of those characters (greedy)
                  )           # end capture group (group 2)
                   /          # replace above found items with items below
                    \1        # represents string in capture group 1
                      "\2",   # surround group 2 w/ quotes and trailing comma
                           /  # end the replacement
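To see what one pass of this substitution does on its own, here is a small sketch without the loop. The shortened sample line and the variable names are mine, chosen just for illustration.

```shell
# A shortened sample line (assumption: two values are enough to show the effect).
line='"4-QUEIJOS": Mucarela Provolone'

# One substitution only: the first "word space word" pair is rewritten.
one_pass=$(printf '%s\n' "$line" | sed -E 's/([^ ]*) ([^ ]*)/\1"\2",/')
printf '%s\n' "$one_pass"
# prints: "4-QUEIJOS":"Mucarela", Provolone
```

Only the first value is quoted, and the space in front of it is gone. Everything after it is untouched, which is why the command has to be repeated.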

Branch to label 'a'...

  t a                         # if the s/// above made a substitution, jump
                              # back to label 'a' and run it again, quoting
                              # the next word each time; once there is
                              # nothing left to substitute, move on
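Here is a sketch of the label/branch loop in isolation, again on a shortened sample line of my own. Splitting the script into separate -e expressions is my choice rather than the original answer's: as far as I know, some non-GNU seds do not accept a semicolon right after a label, while one command per -e is widely accepted.

```shell
# Shortened sample line (assumption: three values are enough to show the loop).
line='"4-QUEIJOS": Mucarela Provolone Catupiry'

# :a marks the loop start, s/// quotes one word, t a repeats while s/// succeeds.
looped=$(printf '%s\n' "$line" |
  sed -E -e ':a' -e 's/([^ ]*) ([^ ]*)/\1"\2",/' -e 'ta')
printf '%s\n' "$looped"
# prints: "4-QUEIJOS":"Mucarela","Provolone","Catupiry",
```

The loop ends by itself: once no space is left on the line, the s/// fails, t a does not branch, and sed carries on with the next command.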

Restore the space after the colon, wrap the values in brackets and make them lowercase s/(:)(.*")/\1 [\L\2]/

  # matches ':"Mucarela","Provolone","Catupiry","Ricota","Oregano"'
  # group 1 = ':'
  # group 2 = '"Mucarela","Provolone","Catupiry","Ricota","Oregano"'

  s                           # sed substitute command, (i.e.: s/old/new/)
   /                          # start search pattern from here
    (                         # start a capture group (group 1)
     :                        # look for the colon char
      )                       # end capture group (group 1)
       (                      # start a capture group (group 2)
        .                     # match any char, including space
         *"                   # match any number of chars up to last quote
           )                  # end capture group (group 2)
            /                 # replace above found groups with items below
             \1               # represents string in capture group 1
                              # output a space after group 1 (the colon)
                [\L\2]        # set group2 lowercase + surround w/ brackets
                      /       # end the replacement
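A quick sketch of this step by itself, fed with a line that has already been through the quoting loop. The sample string and variable names are my own, and `\L` assumes GNU sed.

```shell
# A line as it looks after the quoting loop (shortened sample, my assumption).
quoted='"4-QUEIJOS":"Mucarela","Provolone",'

# Group 1 is the colon, group 2 runs up to the last double quote.
# \L lowercases what follows it in the replacement (GNU sed extension).
bracketed=$(printf '%s\n' "$quoted" | sed -E 's/(:)(.*")/\1 [\L\2]/')
printf '%s\n' "$bracketed"
# prints: "4-QUEIJOS": ["mucarela","provolone"],
```

The trailing comma is not part of group 2 (the group ends at the last quote), so it stays outside the closing bracket. The key before the colon is not part of the match at all, which is why it keeps its upper case.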

Remove last comma on last line $ s/,$//

  # matches ',' at the end of the last line read from the file and removes it

  $                           # match on the last line in the input file
    s                         # sed substitute command, (i.e.: s/old/new/)
     /                        # start search pattern from here
      ,$                      # match comma at end of line
        //                    # replace with nothing (delete)
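The two different meanings of $ here are easy to mix up, so here is a tiny sketch on made-up three-line input (the input and the variable name are mine): the leading $ is an address meaning "last input line", while the $ inside the pattern means "end of line".

```shell
# Three lines that all end in a comma; only the last one should lose it.
trimmed=$(printf '%s\n' 'a,' 'b,' 'c,' | sed '$s/,$//')
printf '%s\n' "$trimmed"
# prints:
# a,
# b,
# c
```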

Replace all commas with a comma and a space s/,/& /g

  # matches ',' and replaces with `, `

  s                           # sed substitute command, (i.e.: s/old/new/)
   /,/                        # match on a comma
      & /                     # replace comma with itself (comma) and space
         g                    # do this for all commas on the line
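A small sketch of the & replacement on its own, using a made-up value list. One thing it makes visible: because every comma is replaced, the comma at the end of the line also gets a space after it. The second command is a variant of my own (not from the original answer) that only touches commas followed by a double quote.

```shell
# Made-up sample that ends in a comma, like every line except the last.
values='["a","b","c"],'

# As in the answer: & stands for the matched text, so ',' becomes ', '.
spaced=$(printf '%s\n' "$values" | sed 's/,/& /g')
printf '<%s>\n' "$spaced"    # angle brackets make the trailing space visible

# Hypothetical variant: only add the space between two values.
tidy=$(printf '%s\n' "$values" | sed 's/,"/, "/g')
printf '<%s>\n' "$tidy"
```

Whether the trailing space matters depends on what consumes the output; for most uses it is harmless.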

And now for the play-by-play (only going to show the first line processed and part of the last line for brevity). For this exercise, I added the original data to a file called "input_file" and ran the sed command on it, just like the above answer provided.

First few lines of the --debug output

This shows the commands as interpreted by sed (described in detail above)

SED PROGRAM:
  :a
  s/([^ ]*) ([^ ]*)/\1"\2",/
  t a
  s/(:)(.*")/\1 [\L\2]/
  $ s/,$//
  s/,/& /g

Read in the first line of data from the input file

INPUT:   'input_file' line 1

The rest is mostly self-explanatory; where it is not, I added "comments" based on the sed man page.

# pattern to be operated on from the input file
PATTERN: "4-QUEIJOS": Mucarela Provolone Catupiry Ricota Oregano

# :a is a label for 'b' and 't' commands
#    'b a' means to branch to label 'a', unconditionally.
#       if 'a' is omitted, branch to end of script.
#    't a' means to branch to label 'a', conditioned on
#       a s/// doing a successful substitution since the last
#       input line was read and since the last t or T command,
#       if label 'a' is omitted, branch to end of script.
COMMAND: :a 

# look for this pattern and do the quotes and comma thing...
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
  regex[0] = 0-21 '"4-QUEIJOS": Mucarela'
  regex[1] = 0-12 '"4-QUEIJOS":'
  regex[2] = 13-21 'Mucarela'

Next - do it again

# the above produced this as an output
PATTERN: "4-QUEIJOS":"Mucarela", Provolone Catupiry Ricota Oregano

# because s/// did a successful substitution since the last input line 
# was read and since the last t or T command, branch to label 'a'
COMMAND: t a

# starting back at label 'a' for another iteration
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
  regex[0] = 0-33 '"4-QUEIJOS":"Mucarela", Provolone'
  regex[1] = 0-23 '"4-QUEIJOS":"Mucarela",'
  regex[2] = 24-33 'Provolone'

Another iteration... another substitution

# the above produced this as an output
PATTERN: "4-QUEIJOS":"Mucarela","Provolone", Catupiry Ricota Oregano

# branch to label 'a' for another iteration 
COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
  regex[0] = 0-44 '"4-QUEIJOS":"Mucarela","Provolone", Catupiry'
  regex[1] = 0-35 '"4-QUEIJOS":"Mucarela","Provolone",'
  regex[2] = 36-44 'Catupiry'

...

# the above produced this as an output
PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry", Ricota Oregano

# branch to label 'a' for another iteration 
COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
  regex[0] = 0-53 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry", Ricota'
  regex[1] = 0-46 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry",'
  regex[2] = 47-53 'Ricota'

Finish up that last word..

#... last word of the line to add quotes and a comma to coming right up
PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota", Oregano
COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/
MATCHED REGEX REGISTERS
  regex[0] = 0-63 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota", Oregano'
  regex[1] = 0-55 '"4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota",'
  regex[2] = 56-63 'Oregano'

Substitution was made.. branch again...

PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota","Oregano",

COMMAND: t a
COMMAND: :a
COMMAND: s/([^ ]*) ([^ ]*)/\1"\2",/

No substitution was made this time, so the next t a will not branch and sed moves on

PATTERN: "4-QUEIJOS":"Mucarela","Provolone","Catupiry","Ricota","Oregano",
COMMAND: t a

# did not branch back to a... now let's enclose the values in a list/ brackets
COMMAND: s/(:)(.*")/\1 [\L\2]/
MATCHED REGEX REGISTERS
  regex[0] = 11-64 ':"Mucarela","Provolone","Catupiry","Ricota","Oregano"'
  regex[1] = 11-12 ':'
  regex[2] = 12-64 '"Mucarela","Provolone","Catupiry","Ricota","Oregano"'

Moving on

# last command produced this... good job
PATTERN: "4-QUEIJOS": ["mucarela","provolone","catupiry","ricota","oregano"],

# if this is the last line in the file, remove the trailing comma
COMMAND: $ s/,$//

# no match for this one, must not be the last line from file.. moving on

# look for commas and add a space after them
COMMAND: s/,/& /g
MATCHED REGEX REGISTERS
  regex[0] = 24-25 ','

# result
PATTERN: "4-QUEIJOS": ["mucarela", "provolone", "catupiry", "ricota", "oregano"],

Would you look at that.. all done on this line!

END-OF-CYCLE:
"4-QUEIJOS": ["mucarela", "provolone", "catupiry", "ricota", "oregano"],

New line read from the file to be iterated on...

INPUT:   'input_file' line 2
PATTERN: "A-MODA": Mucarela Presunto Calabresa Bacon Tomate Milho Oregano

The cycle repeats until we get to the last part of the last line...

# result from previous operation
PATTERN: "PEITO-DE-PERU-ESPECIAL": ["mucarela","peito-de-peru","catupiry","oregano"],

# are we on the last line in the file? yes? k, remove comma at end of line
COMMAND: $ s/,$//
MATCHED REGEX REGISTERS
  regex[0] = 75-76 ','

Nice - the end-of-line comma is gone from the last line; only the spaces still need to be added

PATTERN: "PEITO-DE-PERU-ESPECIAL": ["mucarela","peito-de-peru","catupiry","oregano"]

# check for commas, and replace ',' with ', '
COMMAND: s/,/& /g
MATCHED REGEX REGISTERS
  regex[0] = 37-38 ','
PATTERN: "PEITO-DE-PERU-ESPECIAL": ["mucarela", "peito-de-peru", "catupiry", "oregano"]

And there it is... last line.

END-OF-CYCLE:
"PEITO-DE-PERU-ESPECIAL": ["mucarela", "peito-de-peru", "catupiry", "oregano"]

TheAnalogyGuy answered Oct 22 '25 21:10