
awk/printf number formatting: strange behaviour

Tags: bash, printf, awk

The following piece of code is supposed to left-pad the number in column 1.

input="/home/user_install/folders_LIST_SIZES_MY_FOLDERS.txt"
while IFS= read -r line ; do
  sudo du -sm "$line" | LC_NUMERIC=fr_FR.UTF8 awk '{printf "% 11 '\''d  : %s\n",  $1 , $2}'
done < "$input"

The expected result should be :

    155 283  : /
          0  : /100_samba
        462  : /backup_sys
         62  : /backup_sys_data
          0  : /bdd
          0  : /data
          0  : /data1_pub
          1  : /data3_dwnld_pub
          0  : /data4_mk_dvd_pub
          0  : /data5_my_tmp_pub
      1 211  : /home
          0  : /local
         13  : /root
          0  : /srv
     33 313  : /virtual_0_backup_vdi
     14 689  : /virtual_linux_1
     99 116  : /virtual_win_1
        300  : /win_linux_echange_1

The real result is :

  155 283  : /
          0  : /100_samba
        462  : /backup_sys
         62  : /backup_sys_data
          0  : /bdd
          0  : /data
          0  : /data1_pub
          1  : /data3_dwnld_pub
          0  : /data4_mk_dvd_pub
          0  : /data5_my_tmp_pub
    1 211  : /home
          0  : /local
         13  : /root
          0  : /srv
   33 313  : /virtual_0_backup_vdi
   14 689  : /virtual_linux_1
   99 116  : /virtual_win_1
        300  : /win_linux_echange_1

When a number has more than 3 digits, there is a 1 digit shift to the left.

jcdole asked Oct 23 '25 16:10


1 Answer

Using input numbers of 1234567, 123456 and 123 ...

  • if I use LC_NUMERIC=en_US everything lines up correctly with a comma as the 1000s delimiter

    1,234,567 :
      123,456 :
          123 :
    
  • if I use LC_NUMERIC=fr_FR.UTF8 I get the same shifting as seen in OP's output

    1 234 567 :
       123 456 :
            123 :
    

With the 2nd set of data it looks as if the space delimiter is followed by a backspace, or is perhaps represented by a multi-byte character.

Piping the 2nd set of data to od -c I get:

0000000   1 302 240   2   3   4 302 240   5   6   7       :  \n
0000020       1   2   3 302 240   4   5   6       :  \n
0000040                   1   2   3       :  \n
0000052

So the 1000s delimiter is not a plain 1-byte space but a 2-byte sequence: octal 302 240 (hex c2 a0) is the UTF-8 encoding of a single character, U+00A0 NO-BREAK SPACE.

Since the printf field width is evidently counted in bytes and not in displayed characters here, the extra byte of each delimiter 'eats up' one output position, causing the final output to shift (or shrink) by one visible position per delimiter.
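As a rough, locale-independent illustration of that byte/character mismatch, the sketch below builds the string 1 234 567 by hand, using the two bytes seen in the od output as the delimiter, and then counts bytes versus characters. The variable names are made up for this example, and counting characters by dropping UTF-8 continuation bytes is an assumption that only holds for valid UTF-8 input.

```shell
# "1 234 567" with U+00A0 (octal bytes 302 240) as the 1000s delimiter.
s=$(printf '1\302\240234\302\240567')

# Number of bytes: 7 digits + 2 delimiters * 2 bytes each.
bytes=$(printf '%s' "$s" | wc -c | tr -d ' ')

# Number of characters: drop UTF-8 continuation bytes (octal 200-277),
# which leaves exactly one byte per character, then count what remains.
chars=$(printf '%s' "$s" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' ')

echo "bytes=$bytes chars=$chars extra=$((bytes - chars))"
# prints: bytes=11 chars=9 extra=2
```

A width of 11 counted in bytes is therefore already used up by this string, even though it occupies only 9 columns on screen.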


One workaround consists of splitting the formatting into 2 separate operations, eg:

printf '1234567\n123456\n123\n' |
LC_NUMERIC=fr_FR.UTF8 awk '{ x = sprintf("%\04711d",$1 )  # format just the number
                             printf "%13s :\n",x          # feed formatted number to basic string format
                           }'

This generates:

    1 234 567 :
      123 456 :
          123 :

NOTE: the 2 bytes are still there, ie, piping this latest set of data to od -c generates

0000000                   1 302 240   2   3   4 302 240   5   6   7
0000020   :  \n                           1   2   3 302 240   4   5   6
0000040       :  \n                                           1   2   3
0000060       :  \n
0000063

Some caveats:

  • if OP expects to deal with 11-digit numbers then the (s)printf formats may need to be modified from 11d/13s

  • as highlighted in the comments section (KamilCuk, Ed Morton, me), the actual bytes, and even the number of bytes (2 vs 3), that make up the 'space delimiter' can vary with the version of glibc used to build the fr_FR.UTF-8 locale. With a 3-byte character, each delimiter shifts the output by 2 (printing) characters. awk is designed to work on characters and not bytes, so dynamically determining the number of bytes in the 1000s separator takes additional effort (eg: making system() calls to wc; pre-calculating in bash and passing the result to awk as a -v awk_var=byte_count arg; awk -b)
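A different way to sidestep the byte-count question entirely is to skip the locale's grouping and insert the delimiter by hand, counting display columns explicitly. This is a swapped-in technique, not the workaround shown above. The function name group_pad and the variables width and sep are hypothetical, and the sketch assumes non-negative integers in column 1 with a plain ASCII space as the delimiter.

```shell
# group_pad WIDTH: read integers on stdin, group the digits in threes and
# right-align each result in WIDTH display columns (hypothetical helper).
group_pad() {
  awk -v width="$1" -v sep=" " '
  {
    rest = $1; out = ""
    cols = length(rest)                 # digits are ASCII: one column each
    while (length(rest) > 3) {          # peel off 3 digits from the right
      out  = sep substr(rest, length(rest) - 2) out
      rest = substr(rest, 1, length(rest) - 3)
      cols++                            # one column per delimiter
    }
    out = rest out
    pad = ""
    for (i = cols; i < width; i++) pad = pad " "
    print pad out " :"
  }'
}

printf '1234567\n123456\n123\n' | group_pad 11
```

Because the padding is derived from the column count and not from the byte length of the formatted string, the result should not depend on how the locale encodes its 1000s separator.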

markp-fuso answered Oct 26 '25 07:10


