Monday, 1 June 2015

xargs

Unix xargs parallel execution of commands:
--------------------------------------------------------------

xargs has an option that lets you take advantage of the multiple cores in your machine: -P, which makes xargs invoke the specified command multiple times in parallel. From the XARGS(1) man page:
-P max-procs
   Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the -n option
   with -P; otherwise chances are that only one exec will be done.

-n max-args
    Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x
    option is given, in which case xargs will exit.

-i[replace-str]
    This option is a synonym for -Ireplace-str if replace-str is specified, and for -I{} otherwise.  This option is deprecated; use -I instead.
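
To see what -n changes in practice, here is a minimal sketch that uses echo as a stand-in command. The letters a to d are illustrative input, not part of the log example.

```shell
# Without -n, xargs packs all arguments into a single command line.
all_at_once=$(printf '%s\n' a b c d | xargs echo)
echo "$all_at_once"

# With -n 2, xargs starts a new echo for every two arguments.
in_pairs=$(printf '%s\n' a b c d | xargs -n 2 echo)
echo "$in_pairs"
```

The first call runs echo once with four arguments; the second runs it twice with two arguments each. This is why -P needs -n (or -I): with a single invocation there is nothing to run in parallel.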

Here is an example that makes use of the parallel option available in xargs. I have 8 log files (each about 1.5 GB) and need to run a script named count_pipeline.sh on each of them; it does some calculations on the lines in the log file.
$ ls -1 *.out
log1.out
log2.out
log3.out
log4.out
log5.out
log6.out
log7.out
log8.out

The script count_pipeline.sh takes about 20 seconds for a single log file:
$ time ./count_pipeline.sh log1.out

real 0m20.509s
user 0m20.967s
sys 0m0.467s

If we run count_pipeline.sh on each of the 8 log files one after the other, the total time is:
$ time ls *.out | xargs -i ./count_pipeline.sh {}           

real 2m45.862s
user 2m48.152s
sys 0m5.358s
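
The man page excerpt marks -i as deprecated in favour of -I. A minimal sketch of the same one-file-per-invocation pattern with -I{}, where echo only prints the command line that would be run and the file names are fed in by printf instead of ls:

```shell
# -I{} substitutes each input line for {} and runs the command once per line.
# echo stands in for actually executing ./count_pipeline.sh.
commands=$(printf '%s\n' log1.out log2.out | xargs -I{} echo ./count_pipeline.sh {})
echo "$commands"
```

Dropping the echo would run the script itself, one log file per invocation, just like the -i form.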

Running 4 processes in parallel (my machine has 4 CPU cores):
$ time ls *.out | xargs -i -P4 ./count_pipeline.sh {} 

real 0m44.764s
user 2m55.020s
sys 0m6.224s
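
Instead of hard-coding 4, the process count can be derived from the machine. A minimal sketch, assuming `getconf _NPROCESSORS_ONLN` is supported on your system (it is on Linux and macOS, but it is not guaranteed everywhere); echo is a placeholder for ./count_pipeline.sh:

```shell
# Number of online CPU cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# One argument per invocation, up to $cores invocations at a time.
# Parallel jobs may finish in any order, so sort makes the output stable.
result=$(printf '%s\n' log1.out log2.out log3.out | xargs -n 1 -P "$cores" echo processed | sort)
echo "$result"
```

Whether one process per core is the best setting depends on the workload; a script that mostly waits on disk may behave differently from a CPU-bound one.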
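
We saved time! Wall-clock time dropped from 2m45s to about 45s, while the total CPU (user) time stayed roughly the same. Isn't this useful? You can also use the -n1 option instead of the -i option used above. -n1 passes one argument at a time to the command being run (instead of the xargs default of passing as many arguments as fit on one command line).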
$ time ls *.out | xargs -n1 -P4 ./count_pipeline.sh

real 0m43.229s
user 2m56.718s
sys 0m6.353s
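
One caveat with piping ls into xargs: by default xargs splits its input on whitespace, so a file name containing a space would be broken into two arguments. A minimal sketch of a more robust variant using find -print0 with xargs -0, run in a scratch directory with two hypothetical file names; basename is a placeholder for ./count_pipeline.sh:

```shell
# Work in a scratch directory so the sketch is self-contained.
workdir=$(mktemp -d)
touch "$workdir/log 1.out" "$workdir/log2.out"

# -print0 and -0 delimit names with NUL bytes, so "log 1.out" stays one argument.
names=$(find "$workdir" -name '*.out' -print0 | xargs -0 -n 1 -P 4 basename | sort)
echo "$names"

rm -rf "$workdir"
```

With log names like log1.out to log8.out the plain ls form works fine; the NUL-delimited form only matters once names can contain spaces or other special characters.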
