I use this bash command to search for files and run md5sum on them on my local system. In my opinion it performs badly on large vendor directories. Is there a faster alternative to chaining pipe after pipe?
find ./vendor -type f -print0 | sort -z | xargs -0 md5sum | grep -vf /usr/local/bin/vchecker_ignore > MD5sums
sort introduces blocking here: it has to wait until find completes before it can output anything. find on a large filesystem, especially on an HDD or over NFS, may take a while.
You may want to sort at the very end instead, so that md5sum can run in parallel with find, e.g.:
find ./vendor -type f -print0 | xargs -0 md5sum | grep -vf /usr/local/bin/vchecker_ignore | sort -k2 > MD5sums
md5sum may take some time on large files. You may want to run it with GNU parallel instead of xargs if there are many files or the files are large.
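Spreading the hashing across workers can be sketched with GNU xargs's -P flag, which behaves much like GNU parallel for this job (the throwaway demo/vendor tree below stands in for the real ./vendor):

```shell
# Demo setup: a tiny stand-in for the real ./vendor tree.
mkdir -p demo/vendor/a demo/vendor/b
printf 'one' > demo/vendor/a/one.txt
printf 'two' > demo/vendor/b/two.txt

# Hash batches of up to 50 files with up to 4 concurrent workers (GNU xargs -P).
# The GNU parallel equivalent would be:
#   find demo/vendor -type f -print0 | parallel -0 -X md5sum
find demo/vendor -type f -print0 \
  | xargs -0 -P4 -n50 md5sum \
  | sort -k2 > MD5sums

cat MD5sums
```

Because concurrent workers finish in arbitrary order, the trailing sort -k2 is what restores a deterministic listing.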
You may also want to experiment with line-buffered mode. In that case you have to switch from NUL delimiters to newline delimiters for the filenames (which prohibits newlines in filenames, a rather unusual restriction to hit in practice), since line buffering only works with newline-terminated records. E.g.:
stdbuf -oL find ./vendor -type f | stdbuf -oL grep -vf /usr/local/bin/vchecker_ignore | xargs -n50 -d'\n' md5sum | sort -k2 > MD5sums
The above command filters each filename through grep first and then runs md5sum on batches of 50 files. For small files you may want larger batches (and perhaps to drop both stdbuf -oL entirely); for large files, smaller ones.
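If you want the early filtering without giving up NUL delimiters, GNU grep's -z (--null-data) option treats input and output records as NUL-terminated, so unusual filenames stay safe. A sketch, again with a throwaway demo tree and a hypothetical ignore-pattern file standing in for /usr/local/bin/vchecker_ignore:

```shell
# Demo setup: one file to keep, one whose path matches an ignore pattern.
mkdir -p demo2/vendor
printf 'keep' > demo2/vendor/keep.txt
printf 'skip' > demo2/vendor/skip.txt
printf 'skip' > demo2/ignore_patterns   # stand-in for vchecker_ignore

# grep -z keeps the stream NUL-delimited end to end (GNU grep),
# so xargs -0 still handles every possible filename correctly.
find demo2/vendor -type f -print0 \
  | grep -z -vf demo2/ignore_patterns \
  | xargs -0 -n50 md5sum \
  | sort -k2 > MD5sums2

cat MD5sums2
```

This filters before hashing, like the line-buffered variant above, but keeps the original 0-delimiter semantics throughout the pipeline.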