[FILLED] Vacancy for a PhD in reliable exascale computing

The Netherlands eScience Center has a vacant PhD position on
“reliable exascale computing”

The Netherlands eScience Center supports and reinforces multidisciplinary, data- and computation-intensive research through creative and innovative use of digital technologies in all its manifestations. We stimulate enhanced Science (eScience) by focusing on collaboration with universities, research institutions and industry. We aim to change scientific practice by making large-scale big data analysis and computation possible across multiple scientific disciplines in order to achieve scientific breakthroughs. In this way, we create an effective bridge between scientific research and state-of-the-art digital technologies.

We are looking for an enthusiastic:

PhD student

The position:

The candidate will work in the European-funded project PROCESS. The goal of the project is to narrow the gap between user communities and digital infrastructure. The research is driven by five concrete scientific and industrial use cases, ranging from astronomy to medical imaging and Big Data analysis for airlines. The PhD candidate will work on defining and developing reliable and scalable techniques for exascale computing.

The candidate is expected to publish his/her results in peer-reviewed conferences and journals, which will form the basis of a PhD thesis to be defended at the University of Amsterdam (promotor Rob van Nieuwpoort, co-promotor Jason Maassen). The candidate will be based at the Netherlands eScience Center in Amsterdam, but is expected to work at least one day a week at the University of Amsterdam.

We require:

  • Academic level;
  • A completed master's degree in computer science or an equivalent study;
  • Prior expertise in one or more of the following fields: high-performance computing, parallel and distributed programming, Big Data, scientific data processing;
  • Strong programming skills in C, C++, Java, Python;
  • Fluency in oral and written English, as well as good presentation skills;
  • Capable of working in an interdisciplinary team of researchers and industrial partners.

Working conditions:

We offer a temporary position at the eScience Center for a fixed period of 4 years within the collective agreement for Dutch Research Institutes. The salary is based on the salary scales in the cao-OI, starting at € 2.246,- in the first year and growing to € 2.879,- in the fourth year of employment, with a 38-hour working week. Holiday pay amounts to 8% of gross salary and we also offer a 13th month of salary as an end-of-year payment.

Information:

The eScience Center offers an interesting and challenging position with (additional) options for personal development. You will work in an international team with an informal but creative and ambitious working environment. The main location is Amsterdam (Science Park). Due to the international nature of this project, travelling (mostly within Europe) is expected.

The eScience Center has an active diversity policy and would like to hire people with a background that is underrepresented at the eScience Center. We therefore encourage women and minorities to apply.

For more information about this opportunity, you can contact Rob van Nieuwpoort, director of technology of the Netherlands eScience Center, by emailing R.vanNieuwpoort@esciencecenter.nl or by calling +31(0)20 460 4770. Please send your resume and application letter no later than January 5th 2018 to vacancy@esciencecenter.nl. Additional information may also be found at www.esciencecenter.nl.

Master project: Auto-tuning for GPU pipelines and fused kernels

Achieving high performance on many-core accelerators is a complex task, even for experienced programmers. It is made even more challenging by the fact that code optimization alone is not enough: auto-tuning is often necessary, because computational kernels running on many-core accelerators need ad-hoc configurations, depending on kernel, input, and accelerator characteristics, to perform well. However, tuning kernels in isolation may not be the best strategy for all scenarios.
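
To make this concrete, the sketch below shows what auto-tuning a single, deliberately simple CUDA kernel could look like using the open-source Kernel Tuner package; the vector-add kernel, the problem size and the block-size range are illustrative assumptions rather than part of the project.

    import numpy as np
    from kernel_tuner import tune_kernel  # open-source auto-tuner: pip install kernel_tuner

    # A deliberately simple CUDA kernel; real pipeline kernels are far more complex.
    kernel_string = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 10_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    args = [c, a, b, np.int32(size)]

    # The tunable search space: here only the thread block size, but in practice
    # this also covers tiling factors, memory layouts, loop unrolling, and so on.
    tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

    # Benchmarks every configuration on the current GPU and returns the measurements.
    results, env = tune_kernel("vector_add", kernel_string, size, args, tune_params)
    print(min(results, key=lambda r: r["time"]))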

Imagine having a pipeline that is composed of a certain number of computational kernels. You can tune each of these kernels in isolation and find the optimal configuration for each of them. Then you can use these configurations in the pipeline and achieve some level of performance. But these kernels may depend on each other, and may also influence each other. What if the choice of a certain memory layout for one kernel causes performance degradation in another kernel?

One of the existing optimization strategies for dealing with pipelines is to fuse kernels together, to simplify execution patterns and decrease overhead. In this project we aim to measure the performance of accelerated pipelines in three different tuning scenarios: (1) tuning each component in isolation, (2) tuning the pipeline as a whole, and (3) tuning the fused kernel. By measuring the performance of one or more pipelines in these scenarios we hope, on one level, to determine which strategy works best for specific pipelines on different hardware platforms, and, on another level, to better understand which characteristics influence this behavior.
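
To illustrate the difference between scenarios (1) and (2), the toy sketch below tunes a hypothetical two-kernel pipeline over a tiny configuration space; the timings and the memory-layout interaction between the two kernels are simulated numbers invented purely for illustration, not measurements.

    from itertools import product

    # Hypothetical, simulated timings (milliseconds): in a real experiment these
    # would come from benchmarking each kernel/configuration on the GPU.
    def time_kernel_a(layout, block):
        return {("row", 128): 1.0, ("row", 256): 1.2,
                ("col", 128): 1.4, ("col", 256): 1.1}[(layout, block)]

    def time_kernel_b(layout, block):
        # Kernel B strongly prefers the column layout produced by kernel A.
        return {("row", 128): 3.0, ("row", 256): 2.8,
                ("col", 128): 1.5, ("col", 256): 1.6}[(layout, block)]

    layouts, blocks = ["row", "col"], [128, 256]

    # Scenario (1): tune each kernel in isolation, then combine the winners.
    best_a = min(product(layouts, blocks), key=lambda c: time_kernel_a(*c))
    best_b_block = min(blocks, key=lambda b: time_kernel_b(best_a[0], b))
    isolated = time_kernel_a(*best_a) + time_kernel_b(best_a[0], best_b_block)

    # Scenario (2): tune the pipeline as a whole (the layout is shared by both kernels).
    joint = min(
        time_kernel_a(layout, ba) + time_kernel_b(layout, bb)
        for layout, ba, bb in product(layouts, blocks, blocks)
    )

    print(f"isolation: {isolated:.1f} ms, whole pipeline: {joint:.1f} ms")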

Master project: Speeding up next generation sequencing of potatoes

Genotype and single nucleotide polymorphism (SNP) calling is a technique to find bases in next-generation sequencing data that differ from a reference genome. This technique is commonly used in (plant) genetic research. However, most algorithms only support calling in diploid, heterozygous organisms (specifically human). Within the realm of plant breeding, many species are polyploid (e.g. potato with 4 copies, wheat with 6 copies and strawberry with 8 copies). For genotype and SNP calling in these organisms, only a few algorithms exist, such as freebayes (https://github.com/ekg/freebayes). However, with the increasing amount of next-generation sequencing data being generated, we are running into the limits of this methodology's scalability, both in compute time and in memory consumption (>100 GB).


We are looking for a student with a background in computer science, who will perform the following tasks:

  • Examine the current implementation of the freebayes algorithm
  • Identify bottlenecks in memory consumption and compute performance
  • Come up with an improved strategy to reduce memory consumption of the freebayes algorithm
  • Come up with an improved strategy to execute this algorithm on a cluster with multiple CPUs or on GPUs (using the memory of multiple compute nodes), for example along the lines of the sketch below
  • Implement an improved version of freebayes, following the strategies established above
  • Test the improved algorithm on real potato datasets.
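
As one possible starting point for the parallelization task, the sketch below scatters independent freebayes runs over genomic regions using Python's multiprocessing and collects the per-region VCF files; the reference and BAM file names, contig sizes, chunk size and ploidy are assumptions for illustration only, and a production version would still need proper VCF merging and cluster-level scheduling across nodes.

    import subprocess
    from multiprocessing import Pool

    # Illustrative inputs; actual paths, contig lengths and ploidy depend on the dataset.
    REFERENCE = "potato_reference.fa"
    BAM = "potato_sample.bam"
    PLOIDY = 4                       # tetraploid potato
    CONTIGS = {"chr01": 88_663_952, "chr02": 48_614_681}   # example contig sizes
    CHUNK = 5_000_000                # bases per independent freebayes job

    def regions():
        for contig, length in CONTIGS.items():
            for start in range(0, length, CHUNK):
                yield f"{contig}:{start}-{min(start + CHUNK, length)}"

    def call_region(region):
        out = "calls_" + region.replace(":", "_").replace("-", "_") + ".vcf"
        # freebayes can restrict calling to a region (-r) and set the ploidy (-p).
        with open(out, "w") as vcf:
            subprocess.run(
                ["freebayes", "-f", REFERENCE, "-p", str(PLOIDY), "-r", region, BAM],
                stdout=vcf, check=True,
            )
        return out

    if __name__ == "__main__":
        with Pool(processes=8) as pool:
            vcf_files = pool.map(call_region, list(regions()))
        print("per-region VCFs still to be merged:", len(vcf_files))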


This is a challenging master's thesis project on an important food crop (potato), addressing a problem that is relevant for both science and industry. As part of the thesis, you will be given the opportunity to present your progress and results to industrial partners from the Dutch breeding industry.

Occasional traveling to Wageningen will be required.

Master project: Topic Models of large document collections

Humanities researchers and literary scholars commonly struggle to make sense of, and extract information from, patterns in large document collections. Computational techniques for finding similarities between documents in such collections continue to be developed and are invaluable in understanding these collections as single entities.

Topic modelling is a text processing tool widely used to identify structure in document collections. It has been used successfully for a wide range of applications. However, it suffers from several weaknesses which make it difficult to use and may lead to unreliable results (a minimal baseline sketch illustrating some of them follows the list below):

  • It is dependent on preprocessing for the removal of stop-words and other text features: this is a common first step in many NLP applications, and the results produced are highly dependent on the quality of the selection made at this stage.
  • The number of topics typically needs to be determined a priori: ideally the number of topics could be inferred from the data itself.
  • Interpretation of resulting topics is difficult: topics are normally represented by a set of most representative words for the topic, but these words do not relate directly to human experience.
  • Visualization of produced topics is not intuitive: the optimal way of visualizing topics is an active area of research.
  • Incorporating additional dimensions into a model (e.g. the evolution of topics over time) is not straightforward.
  • Topic models scale poorly for large document collections: this is mainly due to the computational complexity of the algorithm.
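
To give a sense of the workflow these weaknesses refer to, the baseline sketch below fits a standard LDA topic model with scikit-learn on a tiny invented corpus; note that the stop-word handling and the number of topics (n_components) must be chosen up front, which corresponds to the first two weaknesses above, and that the output is a ranked word list per topic that still needs human interpretation.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Tiny invented corpus; a real collection would contain thousands of documents.
    docs = [
        "the ship sailed from the harbour at dawn",
        "the captain kept a detailed log of the voyage",
        "parliament debated the new trade tariffs",
        "the minister defended the tariff law in parliament",
    ]

    # Weakness: results depend heavily on this preprocessing / stop-word choice.
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # Weakness: the number of topics has to be fixed a priori.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(X)

    # Weakness: topics come back as ranked word lists that still need interpretation.
    words = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [words[i] for i in weights.argsort()[::-1][:5]]
        print(f"topic {k}: {', '.join(top)}")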

There have been various efforts to overcome such limitations with varying degrees of success.

However, these solutions are not yet common practice, and taking full advantage of them requires additional knowledge and effort.

The aim of this project is to develop topic modelling tools which can help humanities researchers and literary scholars to make sense of large document collections without requiring extensive expertise in tuning model parameters.