Genotype and single nucleotide polymorphism calling (SNP) is a technique to find bases in next-generation sequencing data that differ from a reference genome. This technique is commonly used in (plant) genetic research. However, most algorithms focus on allowing calling in diploid heterozygous organisms (specifically human) only. Within the realm of plant breeding, many species are of polyploid nature (e.g. potato with 4 copies, wheat with 6 copies and strawberry with eight copies). For genotype and SNP calling in these organisms, only a few algorithms exist, such as freebayes (https://github.com/ekg/freebayes). However, with the increasing amount of next generation sequencing data being generated, we are noticing limits to the scalability of this methodology, both in compute time and memory consumption (>100Gb).
We are looking for a student with a background in computer science, who will perform the following tasks:
- Examine the current implementation of the freebayes algorithm
- Identify bottlenecks in memory consumption and compute performance
- Come up with an improved strategy to reduce memory consumption of the freebayes algorithm
- Come up with an improved strategy to execute this algorithm on a cluster with multiple CPU’s or on GPU/s (using the memory of multiple compute nodes)
- Implement an improved version of freebayes, according to the guidelines established above
- Test the improved algorithm on real datasets of potato.
This is a challenging master thesis project on an important food crop (potato) on a problem which is relevant for both science and industry. As part of the thesis, you will be given the opportunity to present your progress/results to relevant industrial partners for the Dutch breeding industry.
Occasional traveling to Wageningen will be required.