[FILLED] Vacancy for a PhD in reliable exascale computing

The Netherlands eScience Center has a vacant PhD position on
“reliable exascale computing”

The Netherlands eScience Center supports and reinforces multidisciplinary, data- and compute-intensive research through the creative and innovative use of digital technologies in all their manifestations. We stimulate enhanced Science (eScience) by focusing on collaboration with universities, research institutions and industry. We aim to change scientific practice by making large-scale Big Data analysis and computation possible across multiple scientific disciplines, in order to achieve scientific breakthroughs. In this way, we create an effective bridge between scientific research and state-of-the-art digital technologies.

We are looking for an enthusiastic:

PhD student

The position:

The candidate will work in the EU-funded PROCESS project. Its goal is to narrow the gap between user communities and digital infrastructure. The research is driven by five concrete scientific and industrial use cases, ranging from astronomy to medical imaging and Big Data analysis for airlines. The PhD candidate will work on defining and developing reliable and scalable techniques for exascale computing.

The candidate is expected to publish his/her results in peer-reviewed conferences and journals, which will form the basis of a PhD thesis to be obtained from the University of Amsterdam (promotor Rob van Nieuwpoort, co-promotor Jason Maassen). The candidate will be based at the Netherlands eScience Center in Amsterdam, but is expected to work at least one day a week at the University of Amsterdam.

We require:

  • An academic working and thinking level;
  • A completed master's degree in computer science or an equivalent programme;
  • Prior expertise in one or more of the following fields: high-performance computing, parallel and distributed programming, Big Data, scientific data processing;
  • Strong programming skills in C, C++, Java and/or Python;
  • Fluency in oral and written English, as well as good presentation skills;
  • The ability to work in an interdisciplinary team of researchers and industrial partners.

Working conditions:

We offer a temporary position at the eScience Center for a fixed period of 4 years within the collective agreement for Dutch Research Institutes. The salary is based on the salary scales in the cao-OI, starting at € 2.246,- in the first year and growing to € 2.879,- in the fourth year of employment, with a 38-hour working week. Holiday pay amounts to 8% of gross salary and we also offer a 13th month of salary as an end-of-year payment.

Information:

The eScience Center offers an interesting and challenging position with (additional) options for personal development. You will work in an international team with an informal but creative and ambitious working environment. The main location is Amsterdam (Science Park). Due to the international nature of this project, travelling (mostly within Europe) is expected.

The eScience Center has an active diversity policy and would like to hire people with backgrounds that are underrepresented at the eScience Center. We therefore encourage women and minorities to apply.

For more information about this opportunity you can contact Rob van Nieuwpoort, director of technology of the Netherlands eScience Center, by emailing R.vanNieuwpoort@esciencecenter.nl or by calling +31 (0)20 460 4770. Please send your resume and application letter no later than January 5th, 2018 to vacancy@esciencecenter.nl. Additional information can be found at www.esciencecenter.nl.

Master project: Auto-tuning for GPU pipelines and fused kernels

Achieving high performance on many-core accelerators is a complex task, even for experienced programmers. It is made even more challenging by the fact that code optimization alone is not enough: computational kernels running on many-core accelerators need ad-hoc configurations, which depend on kernel, input, and accelerator characteristics, so auto-tuning is often necessary. However, tuning kernels in isolation may not be the best strategy for all scenarios.
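To make the single-kernel case concrete, here is a minimal tuning sketch using Kernel Tuner, an open-source auto-tuner (pip install kernel_tuner). The vector-add kernel and the candidate block sizes are illustrative only, not part of this project:

    # Minimal single-kernel tuning sketch with Kernel Tuner. The tuner
    # compiles and benchmarks one kernel instance per configuration and
    # reports the best-performing one.
    import numpy as np
    from kernel_tuner import tune_kernel

    kernel_source = """
    __global__ void vector_add(float *c, const float *a, const float *b, int n) {
        // block_size_x is injected by the tuner as a preprocessor define
        int i = blockIdx.x * block_size_x + threadIdx.x;
        if (i < n) {
            c[i] = a[i] + b[i];
        }
    }
    """

    size = 10_000_000
    a = np.random.randn(size).astype(np.float32)
    b = np.random.randn(size).astype(np.float32)
    c = np.zeros_like(a)
    n = np.int32(size)

    # The only tuning parameter here is the thread-block size.
    tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}
    results, env = tune_kernel("vector_add", kernel_source, size,
                               [c, a, b, n], tune_params)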

Imagine a pipeline composed of a number of computational kernels. You can tune each of these kernels in isolation and find the optimal configuration for each of them. You can then use these configurations in the pipeline and achieve some level of performance. But the kernels may depend on, and influence, each other. What if the choice of a certain memory layout for one kernel causes performance degradation in another kernel?

One of the existing optimization strategies for dealing with pipelines is to fuse kernels together, to simplify execution patterns and decrease overhead. In this project we aim to measure the performance of accelerated pipelines in three different tuning scenarios: (1) tuning each component in isolation, (2) tuning the pipeline as a whole, and (3) tuning the fused kernel. By measuring the performance of one or more pipelines in these scenarios, we hope on one level to determine which strategy is best for specific pipelines on different hardware platforms, and on another level to better understand which characteristics influence this behavior.
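A back-of-the-envelope model illustrates why scenario (2) can beat scenario (1). In the sketch below, the per-kernel runtimes and the layout-conversion cost are made-up numbers, purely for illustration:

    # Hypothetical per-kernel runtimes (ms) as a function of the memory
    # layout of the intermediate data; real numbers come from benchmarks.
    time_k1 = {"row_major": 1.0, "col_major": 1.6}
    time_k2 = {"row_major": 2.2, "col_major": 1.2}
    transpose_cost = 0.8  # converting the intermediate data between layouts

    # (1) Tune each kernel in isolation: each picks its own best layout,
    # but mismatched layouts force a layout conversion in between.
    best_k1 = min(time_k1, key=time_k1.get)  # row_major
    best_k2 = min(time_k2, key=time_k2.get)  # col_major
    isolated = (time_k1[best_k1] + time_k2[best_k2]
                + (transpose_cost if best_k1 != best_k2 else 0.0))

    # (2) Tune the pipeline as a whole: search over one shared layout instead.
    whole = min(time_k1[l] + time_k2[l] for l in time_k1)

    print(f"isolated: {isolated:.1f} ms, whole pipeline: {whole:.1f} ms")
    # isolated: 3.0 ms, whole pipeline: 2.8 ms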

Master project: Auto-Tuning for power efficiency

Auto-tuning is a well-known optimization technique in computer science. It has been used to ease the manual optimization process traditionally performed by programmers, and to maximize performance portability. Auto-tuning works by executing the code to be tuned many times on a small problem set, with different tuning parameters; the best-performing version is then used for the real problems. Tuning can be done with application-specific parameters (different algorithms, granularity, convergence heuristics, etc.) or platform parameters (number of parallel threads used, compiler flags, etc.).
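In miniature, the tuning loop is an exhaustive benchmark over the parameter space. In the sketch below, compile_and_time is a hypothetical stand-in for compiling the code with a given configuration and timing it on a small problem:

    from itertools import product

    tune_params = {
        "threads": [32, 64, 128, 256],  # platform parameter
        "unroll":  [1, 2, 4],           # code/compiler parameter
    }

    def compile_and_time(config):
        # Hypothetical stand-in: really, compile with `config` applied
        # and time the result on a small problem instance.
        return abs(config["threads"] - 128) / 128.0 + 0.05 * config["unroll"]

    # Benchmark every combination and keep the fastest configuration.
    configs = [dict(zip(tune_params, combo))
               for combo in product(*tune_params.values())]
    best = min(configs, key=compile_and_time)
    print("best configuration:", best)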

For this project, we apply auto-tuning on GPUs. We have several GPU applications where absolute performance is not the most important bottleneck in the real world; instead, the power dissipation of the total system is critical. This can be due to the enormous scale of the application, or because the application must run in an embedded device. An example of the former is the Square Kilometre Array, a large radio telescope that is currently under construction. With current technology, it would need more power than all of the Netherlands combined. In embedded systems, power usage can be critical as well. For instance, we have GPU codes that produce images for radar systems in drones, where weight and power limitations are an important bottleneck (batteries are heavy).

In this project, we use power dissipation as the evaluation function of the auto-tuning system. Earlier work by others has investigated this, but only for a single compute-bound application. However, many realistic applications are memory-bound. This matters, because loading a value from the L1 cache can already cost 7-15x more energy than an instruction that only performs a computation (e.g., a multiply).
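As a sketch of what such an evaluation function could look like, GPU power can be sampled through NVIDIA's NVML bindings (the pynvml package). The run argument below is a stand-in for launching the kernel under test; energy is estimated as average power times elapsed time:

    import threading
    import time

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    def measure_energy(run, interval=0.01):
        """Execute run() while sampling GPU power; return estimated joules."""
        samples, done = [], threading.Event()

        def sampler():
            while not done.is_set():
                # nvmlDeviceGetPowerUsage reports milliwatts
                samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
                time.sleep(interval)

        t = threading.Thread(target=sampler)
        t.start()
        start = time.time()
        run()  # stand-in for launching and synchronizing the GPU kernel
        elapsed = time.time() - start
        done.set()
        t.join()
        avg_watts = sum(samples) / max(len(samples), 1)
        return avg_watts * elapsed

An auto-tuner would then minimize this energy estimate instead of the runtime.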

There are also interesting platform parameters that can be changed in this context. For instance, it is possible to change both the core and memory clock frequencies. It will be interesting to see whether we can achieve the optimal balance between these frequencies at runtime.
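NVML exposes these frequencies as well, so the enumeration could look like the sketch below. Setting application clocks requires administrator privileges and a GPU that supports it, and benchmark is a hypothetical placeholder:

    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    # Enumerate every supported (memory clock, core clock) pair and
    # benchmark the kernel at each one.
    for mem_mhz in pynvml.nvmlDeviceGetSupportedMemoryClocks(handle):
        for core_mhz in pynvml.nvmlDeviceGetSupportedGraphicsClocks(handle, mem_mhz):
            pynvml.nvmlDeviceSetApplicationsClocks(handle, mem_mhz, core_mhz)
            # benchmark(kernel, mem_mhz, core_mhz)  # hypothetical

    pynvml.nvmlDeviceResetApplicationsClocks(handle)  # restore the defaults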

We want to perform auto-tuning on a set of GPU benchmark applications that we developed.

Master project: Fast data serialization and networking for Apache Spark

Apache Spark is a system for large-scale data processing, used for Big Data business applications but also in many scientific applications. Spark uses Java (or Scala) object serialization to transfer data over the network. Especially if the data fits in memory, the performance of serialization is the most important bottleneck in Spark applications. Spark currently offers two serialization mechanisms: standard Java object serialization and Kryo serialization.
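For reference, the serializer is a pluggable configuration switch in Spark, which is also the hook an alternative serializer would use. The sketch below uses PySpark for brevity; note that spark.serializer governs serialization of JVM-side objects, and the commented-out class name for a custom serializer is hypothetical:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("serializer-benchmark")
             # Use Kryo instead of standard Java serialization for JVM objects.
             .config("spark.serializer",
                     "org.apache.spark.serializer.KryoSerializer")
             # A custom serializer would be plugged in the same way, e.g.:
             # .config("spark.serializer", "nl.esciencecenter.IbisSerializer")
             .getOrCreate())

    # reduceByKey forces a shuffle, so serialization cost shows up in the timing.
    rdd = spark.sparkContext.parallelize(range(10_000_000))
    counts = rdd.map(lambda x: (x % 100, 1)).reduceByKey(lambda a, b: a + b)
    print(counts.count())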

In the Ibis project (www.cs.vu.nl/ibis), we have developed an alternative serialization mechanism for high-performance computing applications that relies on compile-time code generation and zero-copy networking for increased performance. The performance of JVM serializers can also be compared with existing benchmarks: https://github.com/eishay/jvm-serializers/wiki. We want to evaluate whether we can increase Spark performance at the application level by using our improved object serialization system. In addition, our Ibis implementation can transparently use fast local networks such as InfiniBand. We also want to investigate whether using such specialized networks increases application performance. This project therefore involves extending Spark with our serialization and networking methods (based on existing libraries), and analyzing the performance of several real-world Spark applications.

Master project: Applying and generalizing data locality abstractions for parallel programs

TIDA is a library for high-level programming of parallel applications, focusing on data locality. TIDA has been shown to work well for grid-based operations, like stencils and convolutions. These are an important building block for many simulations, in astrophysics, climate research and water management, for instance. The TIDA paper gives more details on the programming model.
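The core idea, processing a grid tile by tile so each tile's working set stays in fast memory, can be illustrated with a toy NumPy stencil. This only illustrates tiling; it is not TIDA's actual API:

    import numpy as np

    def smooth_tiled(grid, tile=64):
        """5-point averaging stencil over the interior, processed per tile."""
        out = grid.copy()  # boundary cells keep their original values
        n, m = grid.shape
        for ti in range(1, n - 1, tile):
            for tj in range(1, m - 1, tile):
                i1 = min(ti + tile, n - 1)
                j1 = min(tj + tile, m - 1)
                # Each tile touches only a small working set, so it stays
                # resident in cache while its points are updated.
                out[ti:i1, tj:j1] = 0.2 * (grid[ti:i1, tj:j1]
                                           + grid[ti-1:i1-1, tj:j1]
                                           + grid[ti+1:i1+1, tj:j1]
                                           + grid[ti:i1, tj-1:j1-1]
                                           + grid[ti:i1, tj+1:j1+1])
        return out

    result = smooth_tiled(np.random.rand(1024, 1024))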

This project aims to achieve several things and to answer several research questions:

  • TIDA currently works with up to three dimensions only. Many of our applications need higher dimensionalities. Can we generalize the model to N dimensions?
  • The model currently supports only a two-level hierarchy of data locality. However, modern memory systems often have many more levels, both on CPUs and GPUs (e.g., L1, L2 and L3 caches, main memory, memory banks coupled to a different core, etc.). Can we generalize the model to support N-level memory hierarchies? (See the sketch after this list.)
  • The current implementation only works on CPUs; can we generalize it to GPUs as well?
  • Given the above generalizations, can we still implement the model efficiently? How should we perform the mapping from the abstract hierarchical model to a real physical memory system?
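As a starting point for the first two questions, below is a toy sketch of what an N-dimensional, N-level tiling descriptor could look like. The names and structure are illustrative only, not TIDA's actual API:

    from dataclasses import dataclass

    @dataclass
    class Level:
        name: str    # e.g. "NUMA node", "L2 cache", "L1 cache"
        tile: tuple  # tile extent in each of the N dimensions

    # A three-level hierarchy over a four-dimensional domain.
    hierarchy = [
        Level("NUMA node", (512, 512, 512, 8)),
        Level("L2 cache",  (64, 64, 64, 8)),
        Level("L1 cache",  (16, 16, 16, 8)),
    ]

    def subtiles(outer, inner):
        """How many inner tiles fit along each dimension of one outer tile."""
        return tuple(-(-o // i) for o, i in zip(outer.tile, inner.tile))

    for parent, child in zip(hierarchy, hierarchy[1:]):
        print(parent.name, "->", child.name, ":", subtiles(parent, child))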

We want to test the new extended model on a real application. We have examples available in many domains. The student can pick one that is of interest to her/him.