---
title: Using diplotype clustering to discover mutations causing insecticide resistance in malaria mosquitoes
shorttitle: Diplotype clustering for genomic surveillance of mosquitoes
slug: diplotype-clustering
date: 08/15/2024
thumbnail: /thumbnails/dendro.png
tag: genetics 
canonicalUrl: https://sanjaycnagi.com/blog/diplotype-clustering/
---

Vectors of disease evolve rapidly in response to the interventions we throw at them. By monitoring genetic changes in these populations over time, we can detect emerging insecticide resistance mechanisms and monitor the spread of known mechanisms, with the aim of informing vector control strategies. In a world of limited active ingredients, this surveillance will be crucial for maintaining the effectiveness of front-line interventions like long-lasting insecticide-treated bed nets (LLINs) and indoor residual spraying (IRS).

To facilitate this, we have been developing systems and tools for genomic surveillance. The MalariaGEN Vector Observatory have developed the Python package [`malariagen_data`](https://malariagen.github.io/malariagen-data-python/latest/), which provides tools for accessing and analysing genomic datasets from major malaria vectors. By developing innovative software, we can help to build capacity in genomic research, allowing more people to perform robust, complex genomic analyses that would otherwise be limited to a select few.

In this blog post, we wanted to share a new function that we have recently added to `malariagen_data`: [plot_diplotype_clustering_advanced()](https://malariagen.github.io/malariagen-data-python/latest/generated/malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced.html#malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced). This method allows us to rapidly zoom in on a genome region of interest and identify selective sweeps, assess their size, detect potential gene flow events between countries or species, and investigate whether sweeps are driven by copy number variants (CNVs), amino acid mutations or both.

But what exactly are diplotypes, and why are they useful? A diplotype, sometimes referred to as a multi-locus genotype, is essentially the combination of two haplotypes from a single mosquito - one from each homologous chromosome - at a particular genomic region. By analysing diplotypes rather than haplotypes, we can better capture the full genetic variation present in an individual, including complex structural variants like CNVs that can be difficult to phase onto haplotypes. Often, CNVs and multiallelic SNPs are ignored when analysing haplotype data. The more mosquitoes we sequence, the worse this problem gets: *An. gambiae s.l.* is so genetically diverse that, eventually, a significant proportion of all SNPs become multiallelic.

![diplotype](/blog/diplotype.png)
*Figure 1. Illustration of the relationship between diplotypes and haplotypes*
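
To make the definition concrete, here is a minimal sketch in plain Python. The `diplotype` helper and the toy haplotypes are illustrative assumptions and are not part of `malariagen_data`. It shows that a diplotype discards phase, so two different haplotype pairs can give the same diplotype, and that a multiallelic site (allele code `2`) needs no special treatment.

```python
def diplotype(hap1, hap2):
    """Combine two haplotypes into an unphased diplotype.

    Each haplotype is a sequence of allele codes (0 = reference; 1, 2, ... = alternates).
    At every site the two alleles are sorted, so which chromosome carried which
    allele (the phase) is deliberately forgotten.
    """
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same sites")
    return tuple(tuple(sorted(pair)) for pair in zip(hap1, hap2))


# Two hypothetical phasings of the same three sites; the third site is multiallelic.
phasing_a = diplotype([0, 1, 0], [1, 0, 2])
phasing_b = diplotype([0, 0, 2], [1, 1, 0])

print(phasing_a)               # per-site unphased genotypes
print(phasing_a == phasing_b)  # do both phasings give the same diplotype?
```

Because nothing here depends on knowing the phase, the same representation works for data that cannot be phased reliably.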

The new diplotype clustering functionality in `malariagen_data` performs hierarchical clustering on diplotypes from a specified genomic region. It then visualises the results, displaying:  

&nbsp;&nbsp;&nbsp; 1. The clustering dendrogram  
&nbsp;&nbsp;&nbsp; 2. Sample metadata (e.g., species, collection location)  
&nbsp;&nbsp;&nbsp; 3. Heterozygosity of each sample (within this genomic region)  
&nbsp;&nbsp;&nbsp; 4. Copy number at genes of interest  
&nbsp;&nbsp;&nbsp; 5. Amino acid variants in a specified transcript  

#### A case study

To illustrate the power of this approach, let's look at a case study of the *Gste2* gene from some recent whole-genome data of *An. gambiae s.l* from Obuasi, central Ghana (Figure 2). *Anopheles* mosquitoes from this area are highly resistant to multiple classes of insecticides [[1](https://bmcinfectdis.biomedcentral.com/articles/10.1186/s12879-022-07795-4)]. The *Gste2* gene is known to be involved in resistance to DDT (and potentially other insecticides), through either copy number variation [[2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6673711/)], amino acid mutations [[3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3968025/)], or both. *Gste2*-I114T and *Gste2*-L119V are the major amino acid mutations at this locus known to confer resistance.

Here is an example of how to use the [plot_diplotype_clustering_advanced()](https://malariagen.github.io/malariagen-data-python/latest/generated/malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced.html#malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced) function:  


``` python
import malariagen_data

# Set up access to the Ag3 (Anopheles gambiae complex) data resource
ag3 = malariagen_data.Ag3()

ag3.plot_diplotype_clustering_advanced(
    region="3R:28,597,000-28,600,000",         # The genomic region for clustering
    cnv_region="3R:28,594,000-28,605,000",     # The genomic region for CNV data
    snp_transcript="AGAP009194-RA",            # The transcript for amino acid variants
    sample_sets="1244-VO-GH-YAWSON-VMF00149",  # The sample set
    sample_query=None,                         # A query to filter samples
    site_mask="gamb_colu",                     # The site mask to use
    linkage_method="complete",                 # The linkage method to use
    color="taxon",                             # The metadata column to determine color
)
```

<br></br>
<br></br>
<br></br>
<br></br>

There are many more optional parameters for the user to configure - see the [API docs](https://malariagen.github.io/malariagen-data-python/latest/generated/malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced.html#malariagen_data.ag3.Ag3.plot_diplotype_clustering_advanced) for more information. Here is the figure it produces:


![dipclust](/blog/dipclust-gste2.png)
*Figure 2. Diplotype clustering at Gste2.*

We've annotated this figure with some diplotype clusters which are particularly interesting. For example, cluster B contains a large number of diplotypes which are all genetically identical and have very low heterozygosity - this is what you expect when a selective sweep has occurred, and you find many individuals that are homozygous for the haplotype under selection. All individuals in cluster B also carry the *Gste2*-I114T substitution. If we did not already know something about this mutation, this figure would give us a clue that the mutation is a potential driver of selection and insecticide resistance. In fact, we already know this mutation causes insecticide resistance, so it is no surprise to find it linked to a selective sweep in this dataset.
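
The reasoning applied to cluster B can be written down as a rough heuristic. The sketch below is an assumption-laden illustration and not something `malariagen_data` does for you: the diplotypes, the `carries_variant` flags and both thresholds are hypothetical. It cuts a complete-linkage tree so that only identical diplotypes share a cluster, then reports clusters that are both large and nearly free of heterozygous sites, along with the fraction of members carrying a variant of interest.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy diplotypes as alt-allele counts per site (rows = samples). Hypothetical data.
diplotypes = np.array([
    [2, 2, 0, 2, 0],
    [2, 2, 0, 2, 0],
    [2, 2, 0, 2, 0],
    [2, 1, 0, 2, 0],
    [0, 1, 1, 0, 2],
    [1, 0, 2, 1, 1],
])
# Whether each sample carries the amino acid variant of interest (hypothetical flags).
carries_variant = np.array([True, True, True, True, False, False])

# A count of 1 means one ref and one alt allele, i.e. a heterozygous site.
heterozygosity = (diplotypes == 1).mean(axis=1)

Z = linkage(pdist(diplotypes, metric="cityblock"), method="complete")
# Distances are whole numbers here, so a cut at 0.5 groups only identical diplotypes.
labels = fcluster(Z, t=0.5, criterion="distance")

MIN_CLUSTER_SIZE = 3  # hypothetical threshold
MAX_MEAN_HET = 0.05   # hypothetical threshold

candidates = []
for label in np.unique(labels):
    members = labels == label
    summary = {
        "cluster": int(label),
        "n_samples": int(members.sum()),
        "mean_heterozygosity": float(heterozygosity[members].mean()),
        "carrier_fraction": float(carries_variant[members].mean()),
    }
    if (summary["n_samples"] >= MIN_CLUSTER_SIZE
            and summary["mean_heterozygosity"] <= MAX_MEAN_HET):
        candidates.append(summary)

print(candidates)
```

A cluster flagged this way is only a lead: a high carrier fraction tells us the variant sits on the swept haplotype, not that it is the cause of the sweep.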

We can also see another large cluster (cluster C) which harbours neither I114T nor L119V but, instead, the *Gste2*-F120L mutation. This cluster is homozygous for F120L and shows low heterozygosity, again indicative of diplotypes which carry two copies of the same swept haplotype. Two things make *Gste2*-F120L a convincing candidate driver of resistance. Firstly, it is in very close physical proximity to the known resistance mutations at codons 114 and 119. According to Riveron et al., codon 120 is located at the active site of the enzyme and is therefore likely to interact with the insecticide [[4](https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r27)]. Secondly, there are no CNVs associated with this sweep, and no other amino acid variants except N3K, which is less likely to be causative because it lies away from the active site.

We also observe a small cluster of individuals (cluster A) which harbour a copy number variant (CNV) spanning *Gste2, Gste1, Gste3* and *Gste7*. CNVs could be driving insecticide resistance by increasing the expression of the genes they encompass, allowing the mosquito to detoxify more of the insecticide as a result. This CNV, an amplification, seems to exist at a variable copy number. In total, we can see six or seven distinct CNVs in these samples.

This case study demonstrates how diplotype clustering can provide insights into the mutations causing insecticide resistance; in a single snapshot, we can explore amino acid and CNV data and really understand the nature of selection at a genomic region. 

We hope that others will find it useful for their own research. Please feel free to [get in touch](mailto:sanjay.c.nagi@gmail.com?subject=diplotype-clustering) if you have any questions or feedback :)

[Sanjay C Nagi](https://www.sanjaycnagi.com/) & [Alistair Miles](https://alimanfoo.github.io/)


---
title: "Book Club"
shorttitle: "Book Club"
slug: "book-club"
date: "05/14/2024"
thumbnail: '/thumbnails/books.png'
tag: books
canonicalUrl: 'https://sanjaycnagi.com/blog/book-club/'
---

![library](/blog/library.png)
*My kind of library*

I recently got back into reading after a few months' hiatus, and realised this blog would be an ideal place to keep track of the books I've been reading, and to share any recommendations with the world (should anyone actually be reading this!).  

### Books

Currently reading *Pachinko* by Min Jin Lee

**Atomic Habits by James Clear**  
I started reading this after a stag do, just when I was contemplating all the things that I should really be doing in life 🥲. The book argues that real change comes from the compound effect of hundreds of small decisions which over time create a new identity - initially, it might be doing two push-ups a day, or waking up five minutes early. Clear calls these small changes "atomic habits." The book draws from proven ideas in biology, psychology, and neuroscience to create an easy-to-understand guide for making good habits inevitable and bad habits impossible.

**The Lean Startup by Eric Ries**  
Steve Jobs led me to The Lean Startup, a book I'd been meaning to read for many years. The book is based on applying the tenets of lean manufacturing to startups. Lean manufacturing is a process developed in Japan at Toyota, which aims to eliminate waste to increase efficiency. Incidentally, I had previously had some exposure to lean manufacturing when I was interning at Illumina, where I was tasked with building automated software to highlight waste in their lab processes to reduce turnaround times for their sequencing of NHS samples. The Lean Startup aims to shorten product development cycles and rapidly discover if a proposed business model is viable. This is achieved by adopting rapid scientific experimentation, early product releases, and what Eric Ries calls 'validated learning'.

**Steve Jobs by Walter Isaacson**  
A fantastic read. Before this, I knew very little about Steve Jobs. He certainly was an odd fellow. So fascinating to hear about the beginnings of Apple; the excitement that Jobs and Wozniak must have felt in those early days is palpable. As someone who works in Tropical Medicine - a field heavily funded by the [BMGF](https://www.gatesfoundation.org/) - his relationship with Bill Gates is also particularly interesting. 

**The Mountains Sing by Nguyễn Phan Quế Mai**  
A really wonderful book about the multigenerational saga of the Trần family. It depicts the struggles and triumphs of the Vietnamese people as they navigate the challenges of colonialism, communism, and the war with the United States. I realised when reading this, I knew so little about the Vietnam War, and the atrocities committed by the US government. There is so much loss and heartbreak in the story, and yet life and love endure on, in a really beautiful way.  

**The Odyssey by Homer, translation by Robert Fagles**  
After reading a few books about Greek mythology, including Stephen Fry's excellent Mythos and Heroes, I decided to read an actual classic itself, beginning with the Odyssey, one of Homer's two epic poems. I thoroughly enjoyed it, and was pleasantly surprised by its readability; I don't read a lot of poetry and the book is written in verse, but for the most part, it reads like prose.  

**The Unfolding of Language by Guy Deutscher**  
What an awesome book - this has been blowing my mind for the last couple of weeks. It's about the evolution of language, and the destructive and creative forces which cause it to change, such as economy, expressiveness, and analogy. I'm learning Hindi at the minute, and it's actually really helped me to understand why some of the things in English and Hindi are the way they are. 

**Running with the Kenyans by Adharanand Finn**  
In 2023, I really got into running. At the time of writing, I'm also in Kenya, and after a recommendation from a friend, figured this could be a good shout. It was. Although the book purports to be about finding the 'secrets' to the exceptional feats of Kenyan runners, it really is just about the author's journey to Kenya with his family, and the wonderful people he meets there. In reality, there are no 'secrets'. And it's a really lovely read. Get me to Iten!!


---

### A few favourites that pre-date the blog
**Nelson Mandela - Long Walk to Freedom**  
Everyone on earth should read this book! It's been some years since I read it, but I always remember it having a profound impact. The world would be a better place if we all had to read it.

**Richard Dawkins - The Selfish Gene**  
I owe a lot to this book. I read it during a formative period, in between my Bachelor's and joining LSTM to study for a Master's. It really opened my eyes to the wonderful world of evolutionary biology, and I've been hooked ever since. It also helped to awaken a thirst for knowledge which has remained with me. 

**The Ramayana - Linda Egenes and Kumuda Reddy**  
This was the first (and only) version of the Ramayana I've read. The Ramayana literally means the Journey of Ram, and tells the story of Rama, the prince of Ayodhya, who wages a war against the demon king Ravana to rescue his wife, Sita. And it is this victory of light over evil that we celebrate at Diwali. It's a really beautiful book. 



---
title: "My favourite polyfluorinated pyrethroid"
shorttitle: "My favourite polyfluorinated pyrethroid"
slug: "transfluthrin-resistance"
date: "09/07/2023"
thumbnail: '/thumbnails/tft.png'
tag: repellent 
canonicalUrl: 'https://sanjaycnagi.com/blog/transfluthrin-resistance/'
---

![mosquito_shield](/blog/mosquito-shield.jpg)        
*Mosquito Shield™* - a novel transfluthrin-based spatial repellent product for the malaria vector control market (source: [SC Johnson](https://www.scjohnson.com/en/a-healthier-world/sc-johnson-combats-malaria))

---

A few words on the greatest polyfluorinated pyrethroid I’ve ever written a thesis about – Transfluthrin. Back in my first foray into vector control, I was working in Mumbai at [Godrej](https://godrej.com), looking at resistance to the main active ingredient in their flagship insect repellent brand, [GoodKnight™](https://www.goodknight.in/), for my MSc dissertation. With randomised-controlled trials (RCTs) of the SC Johnson product *Mosquito Shield™* ongoing in sub-Saharan Africa, now seems an opportune time to discuss some of those findings.

Transfluthrin is a vapour-phase pyrethroid often used in domestic household products, such as sprays and liquid vapourisers. It repels mosquitoes, as well as incapacitating them to prevent host-seeking and blood-feeding. It's been particularly popular in the South Asian market for several years, with South America seemingly playing catch up. And, in recent years, it's also been explored as a novel vector control tool for malaria.

#### Metabolic resistance to transfluthrin?

As well as resulting in its high vapour pressure compared to common pyrethroids, the fluorination of transfluthrin may make it somewhat resistant to metabolic attack from cytochrome P450s. Typically, pyrethroids are metabolised at the 4' position of the phenoxybenzyl ring. In transfluthrin, however, the electronegative fluorines pull electrons away from its benzyl ring, in theory preventing attack by electron-hungry cytochrome P450s. 

---
![tft_metabolism](/blog/tft_metabolism.png)  
A figure from my thesis '*Mechanisms of resistance to transfluthrin in mosquitoes', 2017*, supervised by Dr. David Weetman and Dr. Mark Paine. 

---

Earlier research from Bayer [[1]](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0149738) had shown limited to no synergism with PBO in the FuMoz strain of *An. funestus*, suggesting that P450s could not metabolise transfluthrin. This was followed by a later study [[2]](https://www.sciencedirect.com/science/article/pii/S0048357523000214) showing that CYP6P9a/b could only metabolise transfluthrin very weakly, by targeting the gem-dimethyl group - as predicted in my thesis ;)

We demonstrated that this wasn't the case across species, however, with PBO synergising volatile transfluthrin in an Indian strain of *Culex quinquefasciatus*, and showed that the *An. gambiae* P450 CYP6P3 can metabolise transfluthrin *in vitro*. *In vitro* metabolism was much lower than for deltamethrin, however, demonstrating transfluthrin's comparative ability to resist metabolic attack from P450s. The role that other gene families, such as carboxylesterases or chemosensory proteins, will play in transfluthrin resistance is not yet clear.

Given that PBO should still synergise transfluthrin in most resistant mosquito strains, the combination of PBO nets and a transfluthrin-based spatial repellent could be a useful combination for vector control. 


#### The effect of *VGSC* knockdown mutations on transfluthrin 

The "resistance-breaking" potential of transfluthrin doesn't end there. Although there is evidence that *Kdr* may reduce the sensitivity of mosquitoes to transfluthrin's repellent effects [[3]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4400042/) [[4]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8266078/), there are suggestions that *Kdr* mutations may not confer resistance to transfluthrin and other polyfluorinated pyrethroids to the same degree as they do to typical pyrethroids. One study showed that *Kdr* mutations in *Aedes aegypti* lead to lower levels of resistance to transfluthrin than to other pyrethroids [[5]](https://www.sciencedirect.com/science/article/pii/S0048357513001478), whilst other research has shown that house fly *Super-Kdr* does not confer resistance to transfluthrin at all, potentially due to its shorter length [[6]](https://pubmed.ncbi.nlm.nih.gov/26691197/). 

It has even been hypothesised that vapour-phase pyrethroids may bypass cuticular resistance, via direct entry to the nervous system through insect spiracles [[7]](https://link.springer.com/article/10.1007/s13355-016-0443-2). Together, the above factors result in relatively low resistance ratios for transfluthrin when compared with standard pyrethroids [[8]](https://parasitesandvectors.biomedcentral.com/articles/10.1186/s13071-021-04997-8#Sec7), something we have also found with a range of pyrethroid-resistant mosquito strains at the School of Tropical Medicine (unpublished). It is important to note that resistance to transfluthrin is still likely to develop. 

#### Spatial repellent mixtures?

If spatial repellents are shown to be an effective tool for vector control, it will be important to discuss how to maintain and increase the longevity of these products. Whilst writing this, I saw a study which found that, unlike most repellents (such as DEET), transfluthrin does not activate olfactory neurons. Instead, its repellent properties are dependent on sodium channel activation [[9]](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8266078/pdf/pntd.0009546.pdf). Interestingly, they also found that minuscule concentrations of transfluthrin synergise the effects of DEET and several other repellents! [[10]](https://www.sciencedirect.com/science/article/pii/S0048357523000524)

This finding suggests that using transfluthrin in a mixture with a non-pyrethroid vapour-phase repellent could be extremely effective, as well as extending the shelf-life of the products themselves. Another recent study identified repellent compounds with greater activity than the gold-standard DEET, and which have similar vapour pressures to transfluthrin [[11]](https://www.sciencedirect.com/science/article/pii/S0048357518303900) which could be ideal within such a mixture.

---

In 2023, we find ourselves in desperate need of novel vector control tools. Let's pray that spatial repellents can play an important role in reducing the burden of malaria. 



---
title: "Ultra user-friendly bioinformatics pipelines pt.1"
shorttitle: "Ultra user-friendly bioinformatics pipelines pt.1"
slug: "Ultra-user-friendly-bioinformatics-pipelines-pt1"
date: 06/25/2023
thumbnail: '/thumbnails/jb.png'
tag: snakemake
canonicalUrl: 'https://sanjaycnagi.com/blog/ultra-user-friendly-bioinformatics-pipelines-pt1/'
---

Workflow managers such as [Snakemake](https://snakemake.github.io/) and [Nextflow](https://www.nextflow.io/) are wonderful tools - they allow us to build complex pipelines to reproducibly analyse genomic data with relative ease. These workflows run command line tools or scripts, performing some processing and analysis on input data, and writing outputs, tables and figures to results directories for the user to explore. Interpreting these genomic analyses, however, can be challenging, particularly for those who are less familiar with computational biology. To compound that, bioinformatic pipelines rarely have sufficient documentation, if any. 

In this post, I wanted to share an interesting approach I've been using to present the results of computational workflows :) 

**Results web-books with Papermill, Notebooks, and Jupyter Book**

The approach involves the combination of a few semi-recent developments - in particular - [Papermill](https://github.com/nteract/papermill) and [Jupyter Book](https://jupyterbook.org/en/stable/intro.html), combined with Jupyter Notebooks.  

<details>
    <summary><em><b>What is a Jupyter Notebook?</b></em></summary>
  
    A Jupyter Notebook is an interactive computing environment that allows you to create and share documents containing live code, visualizations, and explanatory text. For those familiar with R, it is similar to R Markdown. It provides a web-based interface where you can write and execute code, typically Python. Jupyter Notebooks enable data analysis, experimentation, and collaboration in a convenient and flexible manner.
</details>

Papermill is a tool which allows Jupyter Notebooks to be parameterised and run from the command line - when we run the notebook, we can pass through some parameters. Surprisingly, standard Jupyter Notebooks do not support this - they are intended to be run interactively, cell by cell. With Papermill, we can use Jupyter Notebooks in workflows directly, like Python scripts, and store the executed notebooks as outputs.

This is useful for a few reasons... 

Many people develop and debug in a Jupyter Notebook, and so this approach removes the need to convert to and from Python scripts, saving valuable developer time. It also means that if you would like to perform a specific part of the analysis, it's easy to pull out a single notebook and apply it to your data. 
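
As a rough sketch of what a parameterised run looks like, the snippet below builds a call to the `papermill` command-line tool, where each `-p NAME VALUE` pair overrides a default set in the notebook cell tagged `parameters`. The helper function, notebook paths and parameter names are all hypothetical, chosen only for illustration.

```python
import shlex


def papermill_command(input_nb, output_nb, parameters):
    """Build the argument list for a `papermill INPUT OUTPUT -p NAME VALUE ...` call."""
    cmd = ["papermill", str(input_nb), str(output_nb)]
    for name, value in parameters.items():
        cmd += ["-p", str(name), str(value)]
    return cmd


# Hypothetical notebook paths and parameter names, for illustration only.
cmd = papermill_command(
    "notebooks/coverage.ipynb",
    "results/notebooks/coverage.ipynb",
    {"metadata_path": "config/metadata.tsv", "min_depth": 10},
)
print(shlex.join(cmd))

# To actually execute the notebook (requires papermill to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

The same thing can be done from Python with `papermill.execute_notebook(input_path, output_path, parameters={...})`. Inside a workflow, the command form drops straight into a shell step, with the executed notebook declared as the step's output.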

But the coolest thing comes when you integrate Jupyter Book. Jupyter Book is an awesome tool which builds html web pages from a collection of Jupyter Notebooks, a table of contents, and a configuration file. It's now widely used for building software documentation, such as in [malariagen_data](https://malariagen.github.io/vector-data/ag3/api.html), or the [Jupyter Book docs](https://jupyterbook.org/en/stable/start/example-book.html) themselves! Importantly, the Jupyter Notebooks can contain executed code with tables and figures, as well as markdown text. This means we can include our results notebooks for each step of the analysis, with clear descriptions on what the analysis does, and how to interpret the results! Interactive plots can be generated using the powerful plotly and bokeh libraries, giving end-users the chance to dive deep into the data, all within the familiar realm of a web page.

Together with our collaborators at UVRI, Trevor Mugoya and Edward Lukyamezi, I have recently been exploring this idea in [AmpSeeker](https://github.com/sanjaynagi/AmpSeeker), a workflow we are developing for amplicon sequencing data. I must say, it feels really nice to be able to browse all the analyses in one place. An example of the results book is shown below. 

<figure>
    <div align="center">
    <iframe width="560" height="315" src="https://www.youtube.com/embed/mt-AZeYz50k" title="YouTube video player" frameBorder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowFullScreen></iframe>
    </div> 
    <figcaption><em>An example of the (draft) AmpSeeker results book. If you like Hindi music or Anime, I'd recommend checking out the user guide placeholder, or the film that it's from :) </em></figcaption>
</figure>

I hope that others might find this a useful approach to building workflows. Exciting as it is, it now means I have the task of converting [rna-seq-pop](https://github.com/sanjaynagi/rna-seq-pop) to this style of workflow! Wish me luck!



---
title: "Parantha reviews"
shorttitle: "Parantha reviews"
slug: "Paranthas-Are-The-Best"
date: "03/30/2023"
thumbnail: '/thumbnails/parantha.png'
tag: food
canonicalUrl: 'https://sanjaycnagi.dev/blog/year-rewind-2020/'
---

![parantha](/blog/parantha_mj.png)
*Parantha heaven.*

Hello, fellow paratha/parantha/parotta enthusiasts! I'm here to share my thoughts 
and reviews on some of the best paranthas I've had the pleasure of 
devouring. As a self-proclaimed parantha addict, I have traveled far 
and wide in search of the perfect parantha. And let me tell you, it's 
been a delicious journey.

<br></br>

**Reviews**

**Anands sweets, LS6 Leeds**  
Anands is my favourite place to visit any time I return to Leeds. It used to be an Indian sweet shop, but a few years back they turned it into a cafe, serving a whole range of vegetarian dishes. It's an amazing place which brings me joy just to sit in, and the channa masala is the best I've ever tasted. The last time I went, however, the paranthas could have been better (I'm sorry, Anands!).  
Rating: 6.5/10

**Chaiwala, Bold Street L1 Liverpool**  
These Indian street food places seem to have popped up everywhere. I'm not complaining. The chai was delicious, and so was the samosa pav I had. The parantha could have done with more flavour (I suppose this has to cater to the British palate), but the yoghurt and raita were strangely good. Enjoyable.  
Rating: 6.5/10

**Chai walay, LS6 Leeds**  
I actually found this place whilst trying to visit Anands, on the way back to Liverpool one morning. Anands was closed, so I stopped in my car and desperately googled 'Paranthas Leeds' and this place came up. I was excited. I remember it clearly, a really sunny winter's day. And it was good. A massive aloo parantha, very flaky and very delicious. A bit too oily, though I suspect that may have aided the overall taste. The chai was delicious.  
Rating: 7.5/10

**Zaffran, L8 Liverpool**  
This is one of those ghost restaurants - it's just a kitchen featured on JustEat and Deliveroo, and you can't sit inside. This poses a problem for me, as I like to eat the parantha fresh out of the tandoor. The only option - drive right up to the entrance, and sit in my car to eat it. Greasy fingers everywhere, not ideal. The kulchas were nice, but the aloo parantha was fried (not cooked in a tandoor) and I'm fairly sure it came from a frozen packet.  
Rating: 6.1/10

**Paratha Hut, Levenshulme, Manchester**    
I've been meaning to go here for ages. One day I took a cheeky 45 min detour on the way back to Leeds. This is a little hut in the corner of a car wash. There are many options, and the (aloo) paranthas were quite delicious!  
Rating: 7.8/10

---

The fun bit of this blog post will arrive in October, when I travel to my family home in Punjab, India, with my brother, sister and father. I'm super excited, not only for paranthas, but because it will be the first time we have all visited together since I was 3 years old (26 years ago!). I'll be sure to update this post with my tales of deliciousness!


#### Hindi 

(*I'm learning Hindi, so I thought I'd try translating this post for my homework*)

Main hindi seekh raha hu, ki main translate . 




---
title: "Parallelising freebayes with snakemake"
shorttitle: "Parallelising freebayes with snakemake"
slug: "parallelising-freebayes-with-snakemake"
date: "11/01/2021"
thumbnail: '/thumbnails/snakemake.png'
tag: genetics
canonicalUrl: 'https://sanjaycnagi.com/blog/parallelising-freebayes-with-snakemake/'
---

[`freebayes`](https://github.com/freebayes/freebayes) is a Bayesian haplotype-based variant caller, used widely in genomics. As with many variant callers, it is not readily parallelised, but it can be run in parallel by splitting the genome into smaller chunks, calling variants in each chunk separately, and subsequently combining the results.

A wrapper for freebayes, [`freebayes-parallel`](https://github.com/freebayes/freebayes/blob/master/scripts/freebayes-parallel), does exactly this, making use of `gnu-parallel`. However, this approach has a major limitation:

* When a chunk is completed, that CPU core will not move on to the next region until all cores have completed their respective chunks. This is particularly problematic in regions of variable coverage, and so one can attempt to split the genome into regions of roughly equal coverage. Unfortunately, this still results in many cores being unused for substantial periods of time.

I was implementing a `freebayes` variant calling step in a snakemake RNA-Sequencing pipeline I was writing (more on this later), and wanted to parallelise freebayes, without the above limitation. 

To do so, we can write a snakemake rule (below) which runs an [R script](https://github.com/sanjaynagi/rna-seq-ir/blob/master/workflow/scripts/GenerateFreebayesParams.R) to read in the genome index (.fai) file, and output multiple bed files, breaking the genome into chunks of equal size. By using an extra snakemake wildcard, the index of each genome chunk, we can produce, and supply freebayes with different bed files. Finally, after concatenating the vcfs with `bcftools concat` it is also important to stream the output through `vcfuniq`, to ensure there are no duplicate calls at the region overlaps. 

The benefit of this is that snakemake will automatically run the next job as soon as each chunk is complete, reducing overall computation time compared with `freebayes-parallel`. Importantly, it also allows us to perform joint, multi-sample calling, which is one of the main benefits of using freebayes in the first place. 


```python
import numpy as np

# Read the desired number of genome chunks (n) from config.yaml and build the sequence 1..n.
# np.arange excludes its stop value, hence the + 1.
chunks = np.arange(1, config['chunks'] + 1)

# Note - in this case the script also produces some other files for Freebayes
rule GenerateFreebayesParams:
    input:
        ref = config['ref']['genome'],
        index = config['ref']['genome'] + ".fai",
        bams = expand("resources/alignments/{sample}.bam", sample=samples)
    output:
        bamlist = "resources/bam.list",
        pops = "resources/populations.tsv",
        regions = expand("resources/regions/genome.{chrom}.region.{i}.bed", chrom=config['chroms'], i = chunks) # bed files 
    log:
        "logs/GenerateFreebayesParams.log"
    params:
        metadata = config['samples'],
        chroms = config['chroms'],
        chunks = config['chunks']
    conda:
        "../envs/diffsnps.yaml"
    script:
        "../scripts/GenerateFreebayesParams.R"
```
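
The chunking step itself is done by the R script linked above. Purely as an illustration of the idea, and not a port of that script, here is a small Python sketch that reads contig lengths from a `.fai` index and splits each contig into near-equal BED intervals. The function names and the toy index are assumptions, the file names simply mirror the `regions` pattern in the rule, and the intervals here do not overlap.

```python
import io


def read_fai(handle):
    """Return {contig: length} from a samtools .fai index (tab-separated; name, length, ...)."""
    lengths = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def chunk_contig(contig, length, n_chunks):
    """Split [0, length) into n_chunks BED intervals (0-based, half-open)."""
    edges = [i * length // n_chunks for i in range(n_chunks + 1)]
    return [(contig, edges[i], edges[i + 1]) for i in range(n_chunks)]


# A toy index with two contigs; real .fai files have the same first two columns.
fai = io.StringIO("2L\t1000\t4\t60\t61\n3R\t10\t1025\t60\t61\n")
n_chunks = 4

# Map each BED file name to its single-line content, chunks numbered 1..n.
regions = {}
for contig, length in read_fai(fai).items():
    for i, (chrom, start, end) in enumerate(chunk_contig(contig, length, n_chunks), start=1):
        path = f"resources/regions/genome.{chrom}.region.{i}.bed"
        regions[path] = f"{chrom}\t{start}\t{end}"

for path, bed_line in regions.items():
    print(path, bed_line, sep=": ")
```

Each BED file can then be handed to a separate freebayes job through its `--targets` option, with the chunk index as the extra snakemake wildcard.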

*Update - 09/03/2021 - In the beautiful nature of open source software development, I have now written a parallelisation section in the [freebayes documentation](https://github.com/freebayes/freebayes) :)*

I'll leave you with a song I've been enjoying recently. Happy variant calling.

<iframe width="560" height="315" src="https://www.youtube.com/embed/1fBEEANitDY" frameBorder="0" allow="autoplay; encrypted-media" allowFullScreen></iframe>


