
How do I fix sed commands becoming extremely slow when load is high?

Tags: linux, bash, sed

I have a bash script that takes a simple properties file and substitutes the values into another file. (Property file is just lines of 'foo=bar' type properties)

INPUT=`cat $INPUT_FILE`
while read line; do
   PROP_NAME=$(echo "$line" | cut -f1 -d'=')
   PROP_VALUE=$(echo "$line" | cut -f2- -d'=' | sed 's/\$/\\\$/g')
   time INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done <<<$(cat "$PROPERTIES_FILE")
# Do more stuff with INPUT

However, when my machine is under high load (upper forties), each sed call takes much longer:

real  0m0.169s
user  0m0.001s
sys  0m0.006s

Low load:

real  0m0.011s
user  0m0.002s
sys  0m0.004s

Normally losing 0.1 seconds isn't a huge deal, but both the properties file and the input files are hundreds or thousands of lines long, and those 0.1 seconds add up to over an hour of wasted time.

What can I do to fix this? Do I just need more CPUs?

Sample properties (each name starts with a special character, which marks the places in the input that reference a property):

$foo=bar
$hello=world
^hello=goodbye

Sample input

This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Expected result

This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"
Bryan Tan asked Oct 26 '25 18:10


2 Answers

One approach using bash and sed; you could try something like:

#!/usr/bin/env bash

while IFS='=' read -r prop_name prop_value; do
  if [[ "$prop_name" == "^"* ]]; then
     prop_name="\\${prop_name}"
  fi
  input_value+=("s/${prop_name}\\b/${prop_value}/g")
done < properties.txt

sed_input="$(IFS=';'; printf '%s' "${input_value[*]}")"

sed "$sed_input" sample_input.txt

One way to check the value of sed_input is

declare -p sed_input

Or

printf '%s\n' "$sed_input"

  • Calling external utilities such as cut and sed inside a shell loop should be avoided, because every iteration forks new processes. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice

  • The sed invocation above runs only once, even if the file to be edited has 500+ lines.

  • See How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?

  • See How can I use array variables in bash?

  • See Parameter Expansion

  • See Howto_Parameter_Expansion

  • See How_do_I_do_string_manipulation_in_bash
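
As a small illustration of the parameter expansion links above, here is a minimal sketch of how a 'name=value' line can be split without forking echo and cut. The sample line and its contents are made up for the example; in a real script the line would come from read -r.

```shell
#!/usr/bin/env bash
# Hypothetical sample line; the value deliberately contains a second '='.
line='$greeting=a=b'

# Remove the longest suffix that starts with '=': keeps the text before the first '='
# (same result as: cut -f1 -d'=').
prop_name=${line%%=*}

# Remove the shortest prefix that ends with '=': keeps the text after the first '='
# (same result as: cut -f2- -d'=').
prop_value=${line#*=}

printf 'name=[%s] value=[%s]\n' "$prop_name" "$prop_value"
```

Both expansions are evaluated by the shell itself, so no child process is started per line. The while IFS='=' read loop shown earlier achieves the same split in a different way.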

Jetchisel answered Oct 28 '25 06:10


Adding additional lines to OP's input file to demonstrate word boundary matching and a property name occurring more than once in a line:

$ cat input.txt
This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Leave first 2 matches alone: $foobar $hellow ^hello
^hello $foo $hello ^hello $foo $hello

Assumptions:

  • for word boundary matching it is sufficient to verify that the character immediately after a matching property name is not an alphanumeric character ([[:alnum:]]); otherwise we can expand the next_char testing (see awk code, below)

General idea:

  • read all properties.txt entries into an array (map[name]=value)
  • for each line from input.txt, loop through all names, checking for any word boundary matches to replace

One idea using awk:

$ cat replace.awk

FNR==NR { split($0,arr,"=")                             # 1st file: split on "=" delimiter
          map[arr[1]]=arr[2]                            # build map[name]=value array, eg: map[$foo]=bar
          len[arr[1]]=length(arr[1])                    # save length of "name" so we do not have to repeatedly calculate later
          next
        }

NF      { newline = $0                                  # 2nd file: if we have at least one non white space field then make copy of current input line

          for (name in map) {                           # loop through all "names" to search for 
              line    = newline                         # start over copy of current line
              newline = ""

              while ( pos = index(line,name) ) {        # while we have a match ...

                    # find next_character after "name"; if it is an
                    # alpha/numeric character we do not have a word
                    # boundary otherwise we do have a word boundary
                    # and we need to make the replacement with 
                    # map[name]=value
                    
                    next_char = substr(line,pos+len[name],1)

                    if (next_char ~ /[[:alnum:]]/)
                       newline = newline substr(line,1,pos+len[name]-1)
                    else
                       newline = newline substr(line,1,pos-1) map[name]

                    line = substr(line,pos+len[name])   # strip off rest of line to test for additional matches of "name"
              }
              newline = newline line                    # append remaining contents of line
          }
          $0 = newline                                  # overwrite current input line with "newline"
        }
1                                                       # print current line

NOTES:

  • most awk string matching functions (eg, sub(), gsub(), match()) treat the search pattern as a regex
  • this means those non-alphabetic characters in OP's properties file (eg, $, ^) will need to be escaped before trying to use sub() / gsub() / match()
  • instead of jumping through hoops to escape all special characters I've opted to use the index() function, which treats the search pattern as literal text (so no need to escape special characters)
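
To make the difference in the notes above concrete, here is a small sketch comparing index() and match() on a property name that starts with a regex metacharacter. The sample text and the variable names (text, literal_pos, regex_pos) are made up for the illustration.

```shell
# Hypothetical sample text containing a name that begins with '^'.
text='say ^hello'

# index() searches for its second argument as a literal substring.
literal_pos=$(awk -v s="$text" -v n='^hello' 'BEGIN { print index(s, n) }')

# match() treats its second argument as a regex, so the leading '^' becomes
# a start-of-string anchor instead of a literal caret.
regex_pos=$(awk -v s="$text" -v n='^hello' 'BEGIN { print match(s, n) }')

printf 'index: %s  match: %s\n' "$literal_pos" "$regex_pos"
```

Here index() reports position 5, where the literal ^hello begins, while match() reports 0 because the text does not start with "hello".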

Taking for a test drive:

$ awk -f replace.awk properties.txt input.txt
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world

For timing purposes I created a couple larger files from OP's properties file and my input.txt file (see above):

$ awk 'BEGIN {FS=OFS="="} {map[$1]=$2} END {for (i=1;i<=300;i++) {for (name in map) {nn=name x;print nn,map[name]};x++}}' properties.txt > properties.900.txt

$ for ((i=1;i<=250;i++)); do cat input.txt; done > input.1500.txt

$ wc -l properties.900.txt input.1500.txt
  900 properties.900.txt
 1500 input.1500.txt

Timing for the larger data files:

$ time awk -f replace.awk properties.900.txt input.1500.txt > output

real    0m0.126s
user    0m0.122s
sys     0m0.004s

$ head -12 output
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world

NOTE: timing is from an Ubuntu 22.04 system (metal, vm) running on an Intel i7-1260P

markp-fuso answered Oct 28 '25 06:10


