Improve performance of bash script

Question

I'm looping over hundreds of thousands of CSV files to generate more files from them. The requirement is to extract the previous 1 month, 3 months, 6 months, 1 year and 2 years of data from every file and generate new files from them.

I've written the below script which gets the job done but is super slow. This script will need to be run quite frequently which makes my life cumbersome. Is there a better way to achieve the outcome I'm after or possibly enhance the performance of this script please?

for k in *.csv; do
    sed -n '/'"$(date -d "2 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.2years.csv"
    sed -n '/'"$(date -d "1 year ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1year.csv"
    sed -n '/'"$(date -d "6 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.6months.csv"
    sed -n '/'"$(date -d "3 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.3months.csv"
    sed -n '/'"$(date -d "1 month ago" '+%Y-%m')"'/,$p' "${k}" > "temp_data_store/${k}.1month.csv"
done

ceving · Accepted Answer

You read each CSV five times. It would be better to read each CSV only once.

You extract the same data multiple times. Because each sed prints from the first matching line to the end of the file, every output but one is a subset of another:

1 month ago is a subset of 3 months ago, 6 months ago, 1 year ago and 2 years ago.
3 months ago is a subset of 6 months ago, 1 year ago and 2 years ago.
6 months ago is a subset of 1 year ago and 2 years ago.
1 year ago is a subset of 2 years ago.

This means every line in "1year.csv" is also in "2years.csv". So it is sufficient to extract "1year.csv" from "2years.csv". You can cascade the searches with tee, starting with the widest range.

The following assumes that the contents of your files are ordered chronologically and that every file contains a line for each of the five start months; if a month is missing, that stage and every narrower one come out empty. (I simplified the quoting a bit.)

sed -n "/$(date -d '2 year ago' '+%Y-%m')/,\$p" "${k}" |
tee "temp_data_store/${k}.2years.csv" |
sed -n "/$(date -d '1 year ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.1year.csv" |
sed -n "/$(date -d '6 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.6months.csv" |
sed -n "/$(date -d '3 month ago' '+%Y-%m')/,\$p" |
tee "temp_data_store/${k}.3months.csv" |
sed -n "/$(date -d '1 month ago' '+%Y-%m')/,\$p" > "temp_data_store/${k}.1month.csv"
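
Two further savings may be worth trying; this is a sketch of a swapped-in technique, not part of the answer above. The date calls return the same value for every file, so they can be computed once before the loop, and a single awk process per file can read the CSV once and write all five outputs, replacing the chain of sed and tee processes. The function names split_once and split_all are hypothetical, and the sketch assumes GNU date (as in the question) and chronologically ordered files.

```shell
#!/usr/bin/env bash
# split_once is a hypothetical helper, not part of the original answer.
# Arguments: input CSV, output directory, then five YYYY-MM patterns
# ordered 2 years, 1 year, 6 months, 3 months, 1 month.
split_once() {
    local src=$1 out=$2
    shift 2
    awk -v prefix="$out/$(basename "$src")" -v pats="$*" '
        BEGIN {
            n = split(pats, pat, " ")
            split("2years 1year 6months 3months 1month", name, " ")
        }
        {
            for (i = 1; i <= n; i++) {
                # Once a pattern has been seen, this line and every later
                # line go to that output, like a sed range ending at $.
                if (!started[i] && index($0, pat[i])) started[i] = 1
                if (started[i]) print > (prefix "." name[i] ".csv")
            }
        }
    ' "$src"
}

# split_all is also hypothetical: it computes each pattern once and
# reuses it for every CSV in the current directory.
split_all() {
    local out=${1:-temp_data_store} k
    local y2 y1 m6 m3 m1
    y2=$(date -d '2 year ago' '+%Y-%m')
    y1=$(date -d '1 year ago' '+%Y-%m')
    m6=$(date -d '6 month ago' '+%Y-%m')
    m3=$(date -d '3 month ago' '+%Y-%m')
    m1=$(date -d '1 month ago' '+%Y-%m')
    mkdir -p "$out"
    for k in *.csv; do
        [ -e "$k" ] || continue   # the glob matched nothing
        split_once "$k" "$out" "$y2" "$y1" "$m6" "$m3" "$m1"
    done
}
```

Calling split_all in the directory that holds the CSV files writes the results to temp_data_store. Each range is tracked independently here, so a missing start month only affects its own output. Unlike the sed redirects, awk creates an output file only when at least one line is written to it, and the patterns are matched as literal substrings rather than regular expressions.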

Improve performance of bash script

Tags:

bash

shell

usert4jju7

1 Answers

ceving
