How do I fix sed commands becoming extremely slow when load is high?

Question

I have a bash script that takes a simple properties file and substitutes the values into another file. (Property file is just lines of 'foo=bar' type properties)

INPUT=`cat $INPUT_FILE`
while read line; do
   PROP_NAME=`echo $line | cut -f1 -d'='`
   PROP_VALUE=`echo $line | cut -f2- -d'=' | sed 's/\$/\\$/g'`
   time INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done <<<$(cat "$PROPERTIES_FILE")
# Do more stuff with INPUT

However, when my machine is under high load (upper forties), each sed call takes much longer:

real  0m0.169s
user  0m0.001s
sys  0m0.006s

Low load:

real  0m0.011s
user  0m0.002s
sys  0m0.004s

Normally losing 0.1 seconds isn't a huge deal, but both the properties file and the input files are hundreds or thousands of lines long, and those 0.1 seconds add up to over an hour of wasted time.

What can I do to fix this? Do I just need more CPUs?

Sample properties (each line starts with a special character to indicate that something in the input is trying to access a property)

$foo=bar
$hello=world
^hello=goodbye

Sample input

This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Expected result

This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Jetchisel · Accepted Answer

One approach using bash and sed; you could try something like:

#!/usr/bin/env bash

while IFS='=' read -r prop_name prop_value; do
  if [[ "$prop_name" == "^"* ]]; then
     prop_name="\\${prop_name}"
  fi
  input_value+=("s/${prop_name}\b/${prop_value}/g")
done < properties.txt

sed_input="$(IFS=';'; printf '%s' "${input_value[*]}")"

sed "$sed_input" sample_input.txt

One way to check the value of sed_input is

declare -p sed_input

Or

printf '%s\n' "$sed_input"

Calling external utilities such as cut and sed inside a shell loop should be avoided. See why-is-using-a-shell-loop-to-process-text-considered-bad-practice
The sed invocation above runs only once, even if the file that needs to be edited has 500+ lines.
See How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
See How can I use array variables in bash?
See Parameter Expansion
See Howto_Parameter_Expansion
See How_do_I_do_string_manipulation_in_bash
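
As a small illustration of the parameter expansion links above, a name=value line can also be split inside bash itself, without starting cut or echo for each line. This is only a sketch: the helper name split_prop is made up for the example and is not part of the answer's script.

```shell
#!/usr/bin/env bash

# split_prop is a hypothetical helper: it splits "name=value" at the FIRST '='
# using parameter expansion only, so no external process is started.
split_prop() {
  local line=$1
  prop_name=${line%%=*}    # remove the longest suffix that starts with '='  -> name
  prop_value=${line#*=}    # remove the shortest prefix that ends with '='   -> value (may itself contain '=')
}

split_prop '$foo=a=b'
printf '%s -> %s\n' "$prop_name" "$prop_value"
```

The IFS='=' read -r prop_name prop_value form used in the loop above gives the same split; the expansion form is handy when the line is already in a variable.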

markp-fuso · Answer

Adding additional lines to OP's input file to demonstrate word boundary matching and a property name occurring more than once in a line:

$ cat input.txt
This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Leave first 2 matches alone: $foobar $hellow ^hello
^hello $foo $hello ^hello $foo $hello

Assumptions:

for word boundary matching it is sufficient to verify that the character immediately after a matching property name is not an alphanumeric character ([[:alnum:]]); otherwise we can expand the next_char testing (see awk code, below)
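
One place where this assumption differs from sed's \b is the underscore. The sketch below assumes GNU sed (for \b) and uses a made-up input string: \b counts '_' as a word character, while the [[:alnum:]] class does not, so a name followed by '_' is left alone by the sed approach but would be replaced by a check based on [[:alnum:]].

```shell
#!/usr/bin/env bash

# GNU sed: '_' is a word character, so there is no word boundary after "$foo" in "$foo_x".
# The second occurrence is followed by '.', which is a boundary, so it is replaced.
sed_result=$(printf '%s\n' '$foo_x $foo.' | sed 's/$foo\b/bar/g')

# The [[:alnum:]] class does not include '_'.
if [[ '_' =~ [[:alnum:]] ]]; then
  underscore_is_alnum=yes
else
  underscore_is_alnum=no
fi

printf '%s\n' "$sed_result" "$underscore_is_alnum"
```

If property names may be followed by underscores in the input, the next_char test can be widened to cover that case.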

General idea:

read all properties.txt entries into an array (map[name]=value)
for each line from input.txt, loop through all names, checking for any word boundary matches to replace

One idea using awk:

$ cat replace.awk

FNR==NR { split($0,arr,"=")                             # 1st file: split on "=" delimiter
          map[arr[1]]=arr[2]                            # build map[name]=value array, eg: map[$foo]=bar
          len[arr[1]]=length(arr[1])                    # save length of "name" so we do not have to repeatedly calculate later
          next
        }

NF      { newline = $0                                  # 2nd file: if we have at least one non white space field then make copy of current input line

          for (name in map) {                           # loop through all "names" to search for 
              line    = newline                         # start over copy of current line
              newline = ""

              while ( pos = index(line,name) ) {        # while we have a match ...

                    # find next_character after "name"; if it is an
                    # alpha/numeric character we do not have a word
                    # boundary otherwise we do have a word boundary
                    # and we need to make the replacement with 
                    # map[name]=value
                    
                    next_char = substr(line,pos+len[name],1)

                    if (next_char ~ /[[:alnum:]]/)
                       newline = newline substr(line,1,pos+len[name]-1)
                    else
                       newline = newline substr(line,1,pos-1) map[name]

                    line = substr(line,pos+len[name])   # strip off rest of line to test for additional matches of "name"
              }
              newline = newline line                    # append remaining contents of line
          }
          $0 = newline                                  # overwrite current input line with "newline"
        }
1                                                       # print current line

NOTES:

most awk string matching functions (eg, sub(), gsub(), match()) treat the search pattern as a regex
this means those non-alphabetic characters in OP's properties file (eg, $, ^) will need to be escaped before trying to use sub() / gsub() / match()
instead of jumping through hoops to escape all special characters I've opted to use the index() function, which treats the search pattern as literal text (so no need to escape special characters)
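
The difference described in these notes can be seen with a one-line comparison. This is a minimal sketch using a made-up test string; the shell variable names are only for illustration.

```shell
#!/usr/bin/env bash

s='It starts at a $foo'

# index() looks for the literal text "$foo" and returns its 1-based position.
literal_pos=$(awk -v s="$s" 'BEGIN { print index(s, "$foo") }')

# match() treats "$foo" as a regex: "$" anchors at the end of the string,
# so "foo" can never follow it and no match is found.
regex_pos=$(awk -v s="$s" 'BEGIN { print match(s, "$foo") }')

printf 'index: %s, match: %s\n' "$literal_pos" "$regex_pos"
```

Here index() reports position 16, while match() reports 0 because the unescaped $ is read as an end-of-string anchor.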

Taking for a test drive:

$ awk -f replace.awk properties.txt input.txt
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world

For timing purposes I created a couple larger files from OP's properties file and my input.txt file (see above):

$ awk 'BEGIN {FS=OFS="="} {map[$1]=$2} END {for (i=1;i<=300;i++) {for (name in map) {nn=name x;print nn,map[name]};x++}}' properties.txt > properties.900.txt

$ for ((i=1;i<=250;i++)); do cat input.txt; done > input.1500.txt

$ wc -l properties.900.txt input.1500.txt
  900 properties.900.txt
 1500 input.1500.txt

Timing for the larger data files:

$ time awk -f replace.awk properties.900.txt input.1500.txt > output

real    0m0.126s
user    0m0.122s
sys     0m0.004s

$ head -12 output
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world

NOTE: timing is from an Ubuntu 22.04 system (metal, vm) running on an Intel i7-1260P

How do I fix sed commands becoming extremely slow when load is high?

Tags:

linux

bash

sed

Bryan Tan

2 Answers

Jetchisel

markp-fuso

