Parallelising freebayes with snakemake

Published on Nov 1, 2021

3 min read


freebayes is a bayesian haplotype-based variant caller, used widely in genomics. As with many variant callers, it is not readily parallelised, but can be done so by splitting the genome into smaller chunks, calling them separately, and subsequently combining the chunks together.

A wrapper for freebayes, freebayes-parallel, does exactly this, making use of gnu-parallel. However, this approach has a major limitation:

  • When a chunk is completed, that cpu core will not move onto the next region until all cores have completed their respective chunk. This is particularly problematic in regions of variable coverage, and so one can attempt to split the genome into regions of roughly equal coverage. Unfortunately, this still results in many cores being unused for substantial periods of time.

I was implementing a freebayes variant calling step in a snakemake RNA-Sequencing pipeline I was writing (more on this later), and wanted to parallelise freebayes, without the above limitation.

To do so, we can write a snakemake rule (below) which runs an R script to read in the genome index (.fai) file, and output multiple bed files, breaking the genome into chunks of equal size. By using an extra snakemake wildcard, the index of each genome chunk, we can produce, and supply freebayes with different bed files. Finally, after concatenating the vcfs with bcftools concat it is also important to stream the output through vcfuniq, to ensure there are no duplicate calls at the region overlaps.

The benefit of this, is that snakemake will automatically run the next job when each chunk is complete, reducing overall computation time as compared with freebayes-parallel. Importantly, it also allows us to perform joint, multi-sample calling, which is one of the main benefits of using freebayes in the first place.

1# Read in the desired number of genome chunks from the config.yaml, and arange a sequence 1-n.
2chunks = np.arange(1, config['chunks'])
4# Note - in this case the script also produces some other files for Freebayes
5rule GenerateFreebayesParams:
6 input:
7 ref = config['ref']['genome'],
8 index = config['ref']['genome'] + ".fai",
9 bams = expand("resources/alignments/{sample}.bam", sample=samples)
10 output:
11 bamlist = "resources/bam.list",
12 pops = "resources/populations.tsv",
13 regions = expand("resources/regions/genome.{chrom}.region.{i}.bed", chrom=config['chroms'], i = chunks) # bed files
14 log:
15 "logs/GenerateFreebayesParams.log"
16 params:
17 metadata = config['samples'],
18 chroms = config['chroms'],
19 chunks = config['chunks']
20 conda:
21 "../envs/diffsnps.yaml"
22 script:
23 "../scripts/GenerateFreebayesParams.R"

Update - 09/03/2021 - In the beautiful nature of open source software development, I have now written a parallelisation section in the freebayes documentation :)

I'll leave you with a song I've been enjoying recently. Happy variant calling.

More Posts

Browse all posts