
Measure field/column width in fixed width output - Finding delimiters? [closed]

In the context of the bash shell and command output:

  1. Is there a process/approach to help determine/measure the width of fields that appear to be fixed width? (apart from the mark one human eyeball and counting on the screen method....)
  2. If the output appears to be fixed width, is it possible/likely that it's actually delimited by some sort of non-printing character(s)?
  3. If so, how would I go about hunting down said character?

I'm mostly after a way to do this in bash shell/script, but I'm not averse to a programming language approach.

Sample Worst Case Data:

Name                   value 1    empty_col    simpleHeader  complex multi-header
foo                    bar                     -someVal1     1someOtherVal       
monty python           circus                  -someVal2     2someOtherVal       
exactly the field_widthNextVal                 -someVal3     3someOtherVal       

My current approach: The best I have come up with is redirecting the output to a file, then using a ruler/index type of feature in the editor to manually work out field widths. I'm hoping there is a smarter/faster way...

What I'm thinking:

  • With Headers:

  • Perhaps an approach that measures from the first character 'to the next character that is encountered, after having already encountered multiple spaces'?

  • Without Headers:

  • Drawing a bit of a blank on this one....?

This strikes me as the kind of problem that was cracked about 40 years ago though, so I'm guessing there are better solutions than mine to this stuff...

Some Helpful Information:

Column Widths

fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')

This is proving to be helpful for determining column widths. I don't fully understand how it works yet, so I can't provide a complete explanation, but it might be helpful to a future someone else. Source: https://unix.stackexchange.com/questions/465170/parse-output-with-dynamic-col-widths-and-empty-fields
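
To illustrate what it does, here is the same pipeline run on a small, made-up header line (the header and the resulting widths below are hypothetical, not taken from the sample data above). grep -Po '\S+\s*' splits the header into chunks of 'word plus its trailing spaces', and awk prints the length of each chunk:

# 'ID' padded to 6 characters, 'Name' padded to 10, then 'Status'
printf 'ID    Name      Status\n' | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}'
# Output: 6 10 6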

File Examination

Redirect output to a file: command > file.data

Use hexdump or xxd against file.data to look at its raw content. See links for some basics on those tools:

hexdump output vs xxd output

https://nwsmith.blogspot.com/2012/07/hexdump-and-xxd-output-compared.html?m=1

hexdump

https://man7.org/linux/man-pages/man1/hexdump.1.html

https://linoxide.com/linux-how-to/linux-hexdump-command-examples/

https://www.geeksforgeeks.org/hexdump-command-in-linux-with-examples/

xxd

https://linux.die.net/man/1/xxd

https://www.howtoforge.com/linux-xxd-command/

Chris asked Sep 13 '25 07:09

1 Answer

tl;dr:

# Determine Column Widths
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(echo "$(appropriate-command)" | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')

# Iterate
while IFS= read -r line
do
    # You can put the awk command on a separate line if that is clearer to you
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"
    
    # Or do it all in one line if you prefer:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"    

        *** Code Stuff Here ***

done <<< "$(appropriate-command)"

Some explanation of the above - for newbies (like me)

Okay, so I'm a complete newbie, but this is my answer, based on a grand total of about two days of clawing around in the dark. This answer is relevant to those who are also new and trying to process data in the bash shell and bash scripts.

Unlike the *nix wizards and warlocks who have presented many of the solutions you will find to specific problems (some impressively complex), this is just a simple outline to help people understand what it is that they probably don't know that they don't know. You will have to go and look this stuff up separately; it's way too big to cover it all here.

EDIT:

I would strongly suggest just buying a book/video/course for shell scripting. You do learn a lot doing it the school-of-hard-knocks way as I have for the last couple of days, but it's proving to be painfully slow. The devil is very much in the details with this stuff. A good structured course probably instils good habits from the get-go too, rather than you potentially developing your own habits/shorthand 'that seems to work' but will likely, and unwittingly, bite you later on.

Resources:

Bash references:

https://linux.die.net/man/1/bash

https://tldp.org/LDP/Bash-Beginners-Guide

https://www.gnu.org/software/bash/manual/html_node

Common Bash Mistakes, Traps and Pitfalls:

https://mywiki.wooledge.org/BashPitfalls

http://www.softpanorama.org/Scripting/Shellorama/Bash_debugging/typical_mistakes_in_bash_scripts.shtml

https://wiki.bash-hackers.org/scripting/newbie_traps

My take is that there is no 'one right way that works for everything' to achieve this particular task of processing fixed-width command output. Notably, the fixed widths are dynamic and might change each time the command is run. It can be done somewhat haphazardly using standard bash tools (it depends on the types of values in each field, particularly whether they contain whitespace or unusual/control characters). That said, expect any fringe cases to trip up the 'one bash pipeline to parse them all' approach, unless you have really looked at your data and it's quite well sanitised.

My uninformed, basic approach:

Pre-reqs:

To get much out of all this:

  • Learn the basics of how IFS= read -r line (and its variants) works; it's one way of processing multiple lines of data, one line at a time. When doing this, you need to be aware of how things are expanded differently by the shell (a small sketch follows this list).
  • Grasp the basics of process substitution and command substitution, understand when data is being manipulated in a sub-shell, otherwise it disappears on you when you think you can recall it later.
  • It helps to grasp what Regular Expressions (regex) are. Half of the hieroglyphics that you encounter are probably regex in action.
  • Even further, it helps to understand when/what/why you need to 'escape' certain characters at certain times, as this is why there are even more backslashes (\) than you would expect amongst the hieroglyphics.
  • When doing redirection, be aware of the difference in > (overwrites without prompting) and >> (which appends to any existing data).
  • Understand differences in comparison operators and conditional tests (such as used with if statements and loop conditions).
  • if [ cond ] is not necessarily the same as if [[ cond ]]
  • look into the basics of arrays, and how to load, iterate over and query their elements.
  • bash -x script.sh is useful for debugging. Targeted debugging of specific lines is done by wrapping just those lines between set -x and set +x within the script.
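
To tie a few of those points together, here is a minimal, hypothetical sketch (the data is made up, nothing to do with any particular command): it reads line by line, preserves leading/trailing whitespace thanks to IFS= and the double quotes, and turns tracing on only around the line of interest.

#!/usr/bin/env bash
# Hypothetical input; the second line has significant leading/trailing spaces.
data=$'first line\n   indented line   '

while IFS= read -r line
do
    set -x                      # start targeted tracing
    printf '[%s]\n' "$line"     # quoted, so the whitespace survives intact
    set +x                      # stop tracing
done <<< "$data"                # quoted here-string keeps the data as-is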

As for the fixed width data:

If it's delimited:

Use the delimiter. Most *nix tools use a single white space as a default delimiter, but you can typically also set a specific delimiter (google how to do it for the specific tool).
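
For example, both cut and awk can be told to use an explicit delimiter. Here /etc/passwd, which is ':'-delimited, just stands in for your own data:

cut -d: -f1 /etc/passwd                # -d sets the delimiter, -f picks the field
awk -F: '{print $1, $7}' /etc/passwd   # -F sets awk's field separator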

Optional Step:

If there is no obvious delimiter, you can check to see if there is some secret hidden delimiter to take advantage of. There probably isn't, but you can feel good about yourself for checking. This is done by looking at the hex data in the file. Redirect the output of a command to a file (if you don't have the data in a file already). Do it using command > file.data and then explore file.data using hexdump -Cv file.data (another tool is xxd).
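
As a rough, hypothetical illustration of what you are looking for: in hexdump -C output a tab shows up as 09 and a space as 20, so a 'hidden' tab delimiter stands out immediately.

printf 'foo\tbar  baz\n' > file.data
hexdump -Cv file.data
# Output (roughly): 00000000  66 6f 6f 09 62 61 72 20  20 62 61 7a 0a  |foo.bar  baz.|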

If you're stuck with fixed width:

Basically to do something useful, you need to:

  • Read line by line (i.e. record by record).
  • Split the lines into their columns (i.e. field by field, this is the fixed-width aspect)
  • Check that you are really doing what you think you are doing; particularly if expanding or redirecting data. What you see in the shell as command output might not be exactly what you are presenting to your script/pipe (most commonly due to differences in how the shell expands args/variables, and its tendency to quietly manipulate whitespace without telling you). A quick way to check this is shown just after this list.
  • Once you know exactly what your processing pipe/script is seeing, you can then tidy up any unwanted whitespace and so forth.
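
A cheap way to do that check (GNU cat; -A shows tabs as ^I and marks line ends with $, so stray or vanished whitespace becomes visible):

printf '  padded\tvalue  \n' | cat -A
# Output:   padded^Ivalue  $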

Starting Guidelines:

  • Feed the pipe/script an entire line at a time, then chop up fields (unless you really know what you are doing). Doing the field separation inside loops such as while IFS= read -r line; do stuff; done is less error prone (in terms of the 'what is my pipe actually seeing' problem). When I was doing it outside, it tended to produce more scenarios where the data was modified without me understanding that it was being altered (let alone why) before it even reached the pipe/script. This obviously meant I got extremely confused as to why a pipe that worked in one setting on the command line fell over when I 'fed the same data' in a script or by some other method (the pipe really wasn't getting the same data). This comes back to preserving whitespace with fixed-width data, particularly during expansion, redirection, process substitution and command substitution. Typically it amounts to liberal use of double quotes when calling a variable, i.e. not $someData but "$someData". Use braces to make clear which variable you mean, i.e. ${var}bar. Do the same when capturing the entire output of a command.

  • If there is nothing to leverage as a delimiter, you have some choices. Hack away directly at the fixed width data using tools like:

  • cut -c n1-n2 this directly cuts things out, starting from character n1 through to n2.

  • awk '{print $1}' this splits fields on runs of whitespace by default (not at a fixed column position) and prints the first field. (A short example of both approaches follows this list.)
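
A tiny illustration of both, using the sample data above (the first column there is 23 characters wide, as the 'exactly the field_width' row is meant to hint):

line='exactly the field_widthNextVal                 -someVal3     3someOtherVal'
echo "$line" | cut -c1-23       # -> 'exactly the field_width'
echo "$line" | cut -c24-        # -> everything from character 24 onwards
echo "$line" | awk '{print $1}' # -> 'exactly' (whitespace splitting breaks on this data)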

Or, you can try to be a bit more scientific and 'measure twice, cut once'.

  1. You can work out the field widths fairly easily if there are headers. This line is particularly helpful (sourced from an answer I link below):
fieldwidths=$(head -n 1 file | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}')
echo $fieldwidths

You can also look at all the data to see what length of data you are seeing in each field, and if you are actually getting the number of fields you expect (Thanks to David C. Rankin for this one!):

awk '{ for (i=1; i<=NF; i++) printf "%d\t",length($i) } {print ""}' file.data

  2. With that information, you can then set about chopping fields up with a bit more certainty that you are actually capturing the entire field (and only the entire field). Tool options are many and varied, but I'm finding GNU awk (gawk) and perl's unpack to be the clearest. As part of a pipe/script, consider this (sub in your relevant field widths and whichever field you want out in the {print $fieldnumber}, obviously):

awk 'BEGIN {FIELDWIDTHS="10 20 30 10"}{print $1}'

For command output with dynamic field widths, if you feed it into a while IFS= read -r line; do; done loop, you will need to parse the output using the awk above, as each time the field widths might have changed. Since I originally couldn't get the expansion right, I built the awk command on a separate line and stored it in a variable, which I then called in the pipe. Once you have it figured out though, you can just shove it all back into one line if you want:

# Determine Column Widths:
# Source for this voodoo:
# https://unix.stackexchange.com/a/465178/266125
fieldwidths=$(echo "$(appropriate-command)" | head -n 1 | grep -Po '\S+\s*' | awk '{printf "%d ", length($0)}' | sed 's/^[ ]*//;s/[ ]*$//')

# Iterate
while IFS= read -r line
do
    # Separate the awk command if you want:
    # This uses GNU awk's FIELDWIDTHS to split the line into fixed-width columns, then sed removes leading and trailing spaces.
    awkcmd="BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$1}"
    field1="$(echo "$line" | awk "$awkcmd" | sed 's/^[ ]*//;s/[ ]*$//')"

    # Or do it all in one line, rather than two:
    field2="$(echo "$line" | awk "BEGIN {FIELDWIDTHS=\"$fieldwidths\"}{print \$2}" | sed 's/^[ ]*//;s/[ ]*$//')"    

    if [ "${DELETIONS[0]}" == 'all' ] && [ "${#DELETIONS[@]}" -eq 1 ] && [ "$field1" != 'UUID' ]; then 
        *** Code Stuff ***
    fi
    
    *** More Code Stuff ***

done <<< "$(appropriate-command)"

Remove excess whitespace using various approaches:

  • tr -d '[:blank:]' and/or tr -d '[:space:]' (the latter eliminates newlines and vertical whitespace, not just horizontal whitespace like [:blank:] does; both also remove internal whitespace).
  • sed 's/^[ ]*//;s/[ ]*$//' this cleans up only leading and trailing whitespace (a quick before/after follows this list).
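
A quick, hypothetical before/after for those two:

val='   monty python   '
echo "$val" | sed 's/^[ ]*//;s/[ ]*$//'   # -> 'monty python' (internal space kept)
echo "$val" | tr -d '[:blank:]'           # -> 'montypython' (internal space removed too)
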
  3. Now you should basically have clean, separated fields to work with one at a time, having started from multi-field, multi-line command output.

  4. Once you get what is going on fairly well with the above, you can start to look into other, more elegant approaches as presented in these answers:

Finding Dynamic Field Widths:

https://unix.stackexchange.com/a/465178/266125

Using perl's unpack:

https://unix.stackexchange.com/a/465204/266125

Awk and other good answers:

https://unix.stackexchange.com/questions/352185/awk-fixed-width-columns

  5. Some stuff just can't be done in a single pass. Like the perl answer above, it basically breaks the problem down into two parts. The first is turning the fixed-width data into delimited data (just choose a delimiter that doesn't occur within any of the values in your fields/records!). Once you have it as delimited data, processing is substantially easier from there on out (a sketch of this idea follows below).
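
Staying with bash and GNU awk rather than perl, a minimal sketch of that two-pass idea might look like this (assuming $fieldwidths was computed as above, that '|' never appears in the data, and delimited.data is just a hypothetical scratch file):

# Pass 1: convert fixed-width records into '|'-delimited records,
# trimming the padding from each field along the way.
appropriate-command | awk -v FIELDWIDTHS="$fieldwidths" -v OFS='|' '
    { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i); print }
' > delimited.data

# Pass 2: from here on, ordinary delimited-data tools apply.
cut -d'|' -f2 delimited.data    # e.g. the second field of every record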

Chris answered Sep 15 '25 05:09