Parallelising Jobs with GNU Parallel

In the cloud, you can easily scale the size of your computing resources to fit your needs, meaning you have the ability to run many commands or scripts in parallel. This is where GNU parallel comes in handy!

One of the biggest advantages of running your analyses in the cloud is that you can easily scale the size of your computing resources to fit your needs. This means that you can run multiple commands or scripts simultaneously, completing many jobs in the same time it takes to run one.

For example, let's say you needed to convert an output file to a different format, but you needed to do this for 30 different files. On a standard laptop or personal computer, you might have to convert each file one at a time, but in the cloud you could instead select a machine with (say) 32 virtual CPUs (vCPUs) and convert your 30 output files at once by running each file conversion on a separate vCPU in parallel. Another common example is when you have to run the same script on 50 different samples — your first thought might be to create a loop to perform this task, but why wait to run each of your samples one after the other when you could just run them all at the same time?

This is where GNU parallel comes in. GNU parallel is a simple shell tool for executing jobs in parallel, where a job is either a single command or a small script that needs to be repeated over several inputs. The input can either be a list of files, a list of sample names, or even a list of commands. GNU parallel will then take each line of input and use it as an argument for a command that you specify, or will execute the line if no command is given. Each line will then be run in parallel across the available vCPUs on the machine (or the number of vCPUs you specify - keep reading for more information).

This blog post will teach you the basics of installing and using GNU parallel.

Installing GNU parallel

Fortunately, installing GNU parallel on Ubuntu is as simple as running the following command:

sudo apt install parallel

Otherwise you can follow these steps to install the latest version:

# Download the latest version of GNU parallel
wget http://ftp.jaist.ac.jp/pub/GNU/parallel/parallel-latest.tar.bz2

# Unpack the download - this will create a new folder: parallel-yyyymmdd based on the version
tar -xjf parallel-latest.tar.bz2

# Move into the new folder
cd parallel-yyyymmdd

# Build the software
./configure && make

# Install GNU parallel
sudo make install

# Test it is working
parallel -h

Note: If you receive errors during the build step, you will likely need to install the packages required for compiling software first, e.g. sudo apt install build-essential on Ubuntu or sudo yum groupinstall 'Development Tools' on Red Hat distributions.

Using GNU parallel

Specifying inputs

There are a number of different ways you can specify your inputs to GNU parallel.

One of the most common ways is using ::: to separate your command from your inputs. The inputs can either be a list separated by spaces, or a range using brace expansions, or a list that is generated using wildcards:

# Echo 3 different file names
parallel echo ::: A.txt B.txt C.txt

# Echo numbers 1-20
parallel echo ::: {1..20}

# Echo all .txt files in the current directory
parallel echo ::: *.txt

Note: ::: is the default argument separator, but this can be changed using the --arg-sep flag.

The above method works well when you want to perform a command across a number of files. But sometimes you might instead have a list of sample names or other variables that you would like to use as input for parallel instead.

To input a file (where each line will be used as input by parallel) use the -a flag:

# Echo each line of the file names.txt
parallel -a names.txt echo

Or, you can read a file in from standard input:

# Print the contents of a file and pipe this to parallel as standard input
cat names.txt | parallel echo

# Input the file using input redirection
parallel echo < names.txt
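
As a quick check that these input methods are interchangeable, the sketch below builds a small names.txt in a temporary directory and feeds it to the same echo command three ways. The sample names and the variable names are made up for illustration, and -k (the --keep-order flag described later in this post) is added only so that the output order is predictable.

```shell
# Create a throwaway input file (the sample names are hypothetical)
workdir=$(mktemp -d)
printf '%s\n' sampleA sampleB sampleC > "$workdir/names.txt"

# The same three lines, supplied three different ways
from_flag=$(parallel -k -a "$workdir/names.txt" echo)
from_pipe=$(cat "$workdir/names.txt" | parallel -k echo)
from_redirect=$(parallel -k echo < "$workdir/names.txt")

# Each variable now holds sampleA, sampleB and sampleC on separate lines
echo "$from_flag"
```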

The input method you use largely depends on what you are trying to run and what makes the most sense to you. Your input can also be a list of commands you wish to run in parallel — in this case ensure parallel is run without any commands so that it will instead execute each line of your file as a command:

# Using pipes
cat mycommands.txt | parallel

# Using input redirection
parallel < mycommands.txt
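
To make the command-list form concrete, here is a minimal sketch that writes a hypothetical mycommands.txt and lets parallel execute it. The three commands and the .out file names are invented for this example; any line that is a valid shell command should work the same way.

```shell
# Build a small command file; each line is a complete shell command
workdir=$(mktemp -d)
cat > "$workdir/mycommands.txt" <<'EOF'
echo first > one.out
echo second > two.out
echo third | tr a-z A-Z > three.out
EOF

# No command is given to parallel, so each line is executed as written
(cd "$workdir" && parallel < mycommands.txt)

# Show what each command wrote
cat "$workdir/one.out" "$workdir/two.out" "$workdir/three.out"
```

Because every line is handed to the shell as-is, pipes and redirections inside the file need no extra quoting.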

By default, your inputs will always be appended to the end of the parallel command; however, sometimes you might need to specify your input within the command. This can easily be accomplished by adding {} wherever your input should go. For example:

# Echo sample1.txt through to sample50.txt
parallel echo sample{}.txt ::: {1..50}

Note: The output of parallel commands will be printed as soon as each command completes. This means the output may be in a different order than the inputs. In the example above, sample22.txt may end up printing before sample1.txt. If the order of the output is important, you could just sort it after everything has completed. Alternatively, you can force GNU parallel to print the output in the order of the input values using the --keep-order/-k flag. The commands will be run in parallel, but the output of later jobs will be delayed until the earlier jobs are printed.
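
The effect of -k is easiest to see when jobs take different amounts of time. In the sketch below each job sleeps for its input value before echoing it, so the shortest sleep finishes first. It assumes your sleep accepts fractional seconds (GNU coreutils does), and -j 3 is set so that all three jobs can start together.

```shell
# Each job sleeps for its input value, so shorter sleeps finish first
unordered=$(parallel -j 3 'sleep {}; echo {}' ::: 0.6 0.2 0.4)

# With -k the jobs still run concurrently, but output follows the input order
ordered=$(parallel -j 3 -k 'sleep {}; echo {}' ::: 0.6 0.2 0.4)

# Unquoted expansion joins the lines with spaces for compact printing
echo "without -k:" $unordered   # typically: 0.2 0.4 0.6
echo "with -k:   " $ordered     # 0.6 0.2 0.4
```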

Stripping paths and extensions

When you are working with filenames in GNU parallel, sometimes you might want to remove the file extension or the path to the file. You can do this with replacement strings. The default {} will always include the whole input you passed to parallel including any paths and file extensions. Special characters can be used within the {} to edit that input and remove any information you don't want or need:

# Keep the entire input using {}
parallel echo {} ::: mydir/mysubdir/myfile.myext
# Prints: mydir/mysubdir/myfile.myext

# Strip the file extension using {.}
parallel echo {.} ::: mydir/mysubdir/myfile.myext
# Prints: mydir/mysubdir/myfile

# Strip the path to the file {/}
parallel echo {/} ::: mydir/mysubdir/myfile.myext
# Prints: myfile.myext

# Strip both the path and file extension using {/.}
parallel echo {/.} ::: mydir/mysubdir/myfile.myext
# Prints: myfile

# Keep only the path to the file (strip the file and extension) using {//}
parallel echo {//} ::: mydir/mysubdir/myfile.myext
# Prints: mydir/mysubdir

This is really helpful when renaming or converting files. For example, below we strip the .txt extension from each filename and replace it with a .csv extension.

# Rename .txt files to .csv
parallel mv {} {.}.csv ::: *.txt
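
Replacement strings can also be mixed within one command, for instance to read from a nested input path while writing to a flat results folder. The following is only a sketch: the raw/ and results/ folders, the sample files and the .count suffix are all assumptions made for this example.

```shell
# Set up hypothetical inputs in two run folders, plus an output folder
workdir=$(mktemp -d)
mkdir -p "$workdir/raw/run1" "$workdir/raw/run2" "$workdir/results"
printf 'a\nb\n' > "$workdir/raw/run1/s1.txt"
printf 'c\n' > "$workdir/raw/run2/s2.txt"

# {} is the full input path; {/.} is the bare file name used to label the result
(cd "$workdir" && parallel 'wc -l < {} > results/{/.}.count' ::: raw/*/*.txt)

# results/ now contains s1.count and s2.count
ls "$workdir/results"
```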

Keeping track of jobs

There are two important replacement strings that allow you to keep track of your jobs running in parallel.

The first is the job number {#}. When a job is started it gets a job number. The job numbers start at 1 and increase by 1 for each new job. The number of jobs is based on the number of inputs you give to GNU parallel:

# Echo the job numbers instead of the inputs

parallel echo {#} ::: A.txt B.txt C.txt D.txt E.txt

# Prints:
1
2
3
4
5

# Note the order may be different based on which job runs/completes first

The second is the job slot number {%}. The job slot number is different from the job number above, because job slot numbers are based on the number of jobs GNU parallel can run concurrently and not the number of inputs. For example, if you give GNU parallel 10 inputs, but you only have 2 vCPUs available, GNU parallel can only run 2 jobs at a time. This means the job slot numbers will be 1-2 and the job numbers will be 1-10. The job slot number is unique among the running jobs, but it is re-used as soon as one of the jobs/inputs in the list has finished.

# Echo the job slot numbers of 5 inputs that are being run across 2 available CPUs

parallel echo {%} ::: A.txt B.txt C.txt D.txt E.txt

# Prints:
1
2
1
2
1

# Note the order may be different based on which job runs/completes first
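
Both replacement strings can be combined with {} in the same command, which is handy for labelling log lines or naming per-job output files. A small sketch, with the label text chosen purely for illustration; -j 2 fixes the number of slots and -k keeps the lines in input order:

```shell
# Label every input with its job number and the slot it ran in
labels=$(parallel -j 2 -k echo 'input={} job={#} slot={%}' ::: A.txt B.txt C.txt D.txt)
echo "$labels"

# The job numbers run from 1 to 4, while every slot number is either 1 or 2;
# which slot a given input lands in can vary from run to run
```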

Limiting the number of jobs

By default, GNU parallel will automatically detect the number of available vCPUs on the machine and will run as many jobs concurrently as there are available vCPUs. For example, if you have 8 vCPUs available, GNU parallel will process 8 inputs in parallel at a time. You can instead limit or specify the number of jobs/vCPUs you want to run in parallel by using the -j flag, for example:

# Echo the job slot numbers of 5 inputs on a machine where 5 vCPUs are available

parallel echo {%} ::: A.txt B.txt C.txt D.txt E.txt

# Prints:
1
2
3
4
5

# Note the order may be different based on which job runs/completes first

# Echo the job slot numbers of 5 inputs on a machine with 5 vCPUs, limiting parallel to 2 concurrent jobs so that only 2 vCPUs are used

parallel -j 2 echo {%} ::: A.txt B.txt C.txt D.txt E.txt

# Prints:
1
2
1
2
1

# Note the order may be different based on which job runs/completes first
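
One rough way to see what -j changes is to time the same jobs with different limits. The sketch below uses sleep as a stand-in for real work and date +%s for whole-second timing, so the numbers are approximate; the variable names are arbitrary.

```shell
# Three one-second jobs, forced to run one at a time
start=$(date +%s)
parallel -j 1 sleep ::: 1 1 1
serial_secs=$(( $(date +%s) - start ))

# The same jobs with three slots, so they can all run at once
start=$(date +%s)
parallel -j 3 sleep ::: 1 1 1
parallel_secs=$(( $(date +%s) - start ))

# Expect roughly 3 seconds for the first run and roughly 1 second for the second
echo "serial: ${serial_secs}s, parallel: ${parallel_secs}s"
```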

Quoting complex commands

When working with GNU parallel, all of your inputs are automatically quoted, so you only need to quote characters within your command that have special meaning in the shell:

( ) $ ` ' " < > ; | \

For instance, if you want to redirect output to a file within your parallel commands, you will need to quote "" or escape \ the output redirection symbol >:

# Echo numbers 1-20 to their own text files
parallel echo {} ">" {}.txt ::: {1..20}

The following symbols might also need to be quoted depending on the context:

~ & # ! ? space * {

For commands with complex quoting, like awk commands, it is usually easier to save the required syntax to a variable first, and then use this variable within your parallel command, for example:

# Rearrange some columns in a number of tsv files using awk
my_awk='{OFS="\t"; print $3, $2, $1}'
parallel "cat {} | awk '$my_awk' > {.}_rearranged.tsv" ::: *.tsv

You will notice in the parallel command above that the whole command that is being executed is encapsulated in double quotes — this is sometimes easier than having to quote each special character within the command, e.g. the > or the |.
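
To try the awk approach safely before pointing it at real data, you could run it on a throwaway file first. This sketch assumes a hypothetical two-line demo.tsv in a temporary directory and reuses the my_awk variable from above.

```shell
# Create a small tab-separated test file
workdir=$(mktemp -d)
printf 'a\tb\tc\n1\t2\t3\n' > "$workdir/demo.tsv"

# Same pattern as above: the awk program is stored in a variable first
my_awk='{OFS="\t"; print $3, $2, $1}'
(cd "$workdir" && parallel "cat {} | awk '$my_awk' > {.}_rearranged.tsv" ::: *.tsv)

# The columns should now be reversed: "c b a" then "3 2 1" (tab-separated)
cat "$workdir/demo_rearranged.tsv"
```

Note that the output files also end in .tsv, so re-running the command in the same folder would pick them up as inputs.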

The -q flag can also be used to quote complex commands automatically, particularly those that use regular expressions, though be careful when using this flag as it will escape all special characters, meaning characters like > or | may no longer work as expected.
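
As an illustration of where -q fits, consider a simple awk command with no pipes or redirections. Without -q, the shell started by parallel would try to expand $2 itself; with -q, the awk program should reach awk untouched. The pairs.txt file here is made up for the example.

```shell
# Create a small two-column test file
workdir=$(mktemp -d)
printf 'alpha 1\nbeta 2\n' > "$workdir/pairs.txt"

# -q quotes the awk program for us, so $2 is interpreted by awk and not the shell
second_column=$(parallel -q awk '{print $2}' ::: "$workdir/pairs.txt")

# Prints the second column: 1 then 2
echo "$second_column"
```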

Quoting can take some time to get used to, so start simple and work your way up. Sometimes testing your parallel command before you run it can be a big help too — we cover this in the section below.

Testing your parallel commands

Sometimes you may be unsure if you have formulated your GNU parallel command successfully, especially when things start to get complicated with flags, replacement strings and quotes. So before you run your parallel jobs, you may want to see what commands will actually be run.

To test whether your command looks right, you can use the --dryrun flag. This flag will print to the screen the exact commands that will be run without actually running them e.g.:

parallel --dryrun echo sample{}.txt ::: {1..3}

# Prints:
echo sample1.txt
echo sample2.txt
echo sample3.txt
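
A dry run is especially worthwhile before commands that move or overwrite files. As a sketch, here is the earlier rename previewed for two hypothetical file names; since nothing is executed, the files do not need to exist. The -k flag is only there to keep the preview in input order.

```shell
# Preview a bulk rename without touching any files
preview=$(parallel --dryrun -k mv {} {.}.csv ::: a.txt b.txt)
echo "$preview"

# Prints:
# mv a.txt a.csv
# mv b.txt b.csv
```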

Some real world examples

The best way to understand how to use GNU parallel is to see some real-world examples. The previous examples shown in this blog post were simple, to make them easier to read, but don't really showcase just how useful GNU parallel can be.

So here are a few real-world examples that may apply to you, or at least show you what tasks can be parallelised with this helpful tool:

# Convert tab separated files to csv files
parallel "tr '\t' ',' < {} > {.}.csv" ::: *.tsv

# Find files with the extension .bam and gzip those files in parallel
find . -type f -name '*.bam' -print | parallel gzip

# Run some summary stats on bam files in parallel across 8 CPUs with the samtools flagstat command
parallel -j 8 'samtools flagstat {} > {.}.stat' ::: *.bam

# Run myscript.sh with values 1 through to 100 as input
parallel ./myscript.sh ::: {1..100} #script uses command line arguments
# OR
parallel "echo {} | ./myscript.sh" ::: {1..100} #script uses stdin
# Note: for these commands to work you will need to make sure myscript.sh can accept/handle the right type of inputs and also ensure myscript.sh is executable (otherwise use "bash ./myscript.sh")

You can find a lot of other examples in the GNU parallel manual, and there are further worked examples online for genomics and neuroscience.

Rewriting loops

If you have any for-loops or while-loops in your code, chances are that they could run much more efficiently in parallel with GNU parallel. Loops can easily be rewritten into GNU parallel commands as follows:

# Standard for-loop that does something with each line in a file called 'list'
(for x in `cat list` ; do
  do_something $x
done) | process_output

# Standard while-read-loop that does something with each line in a file called 'list'
cat list | (while read x ; do
  do_something $x
done) | process_output

# Converted into GNU parallel command
cat list | parallel do_something | process_output

You will notice in the above command that the output of GNU parallel commands can be piped directly into other commands that process the output. This is yet another helpful feature of GNU parallel!
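
To convince yourself that the loop and the parallel version are equivalent, you can run both on a small input and compare the results. In this sketch seq stands in for do_something and wc -l for process_output; both are placeholders chosen so that the example is self-contained.

```shell
# A small input list: one value per line
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$workdir/list"

# Loop version: seq prints the numbers 1..x, and wc -l counts all printed lines
loop_total=$(while read -r x; do seq "$x"; done < "$workdir/list" | wc -l)

# GNU parallel version of the same pipeline
parallel_total=$(cat "$workdir/list" | parallel seq | wc -l)

# Both totals should be 6 (1 + 2 + 3 lines)
echo "loop: $loop_total, parallel: $parallel_total"
```

Counting lines does not depend on the order in which jobs finish, which is why no -k is needed here; if process_output cares about order, add -k to the parallel command.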

So there you have it, all the basics of GNU parallel. Of course there is loads more you can do with this helpful shell tool, so please check out the official GNU parallel tutorial and manual. Also don't forget to cite GNU parallel if you end up using it in your analysis:

O. Tange (2011): GNU Parallel – The Command-Line Power Tool, The USENIX Magazine, February 2011:42-47.

Now go ahead and work all those CPUs you have lying around!