Achieving high performance on many-core accelerators is a complex task, even for experienced programmers. It is made more challenging by the fact that code optimization alone is not enough, and auto-tuning is often necessary: computational kernels running on many-core accelerators need ad-hoc configurations that depend on kernel, input, and accelerator characteristics. However, tuning kernels in isolation may not be the best strategy for all scenarios.
Imagine a pipeline composed of several computational kernels. You can tune each of these kernels in isolation and find the optimal configuration for each of them. You can then use these configurations in the pipeline and achieve some level of performance. But these kernels may depend on each other and influence each other. What if the choice of a certain memory layout for one kernel causes performance degradation in another kernel?
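The question above can be made concrete with a toy model. The Python sketch below is only an illustration under stated assumptions: the two kernels, the cost functions `cost_a` and `cost_b`, the layout names, the block sizes, and all runtimes are hypothetical and invented for this example, not measurements from any real accelerator. It assumes kernel A writes its output in a layout that kernel B must then read, so the layout is a shared parameter.

```python
from itertools import product

# Hypothetical tuning parameters; values are invented for illustration.
LAYOUTS = ("row_major", "col_major")
BLOCK_SIZES = (64, 128, 256)


def cost_a(layout, block):
    """Made-up runtime (ms) of kernel A, which writes its output in `layout`."""
    base = {"row_major": 1.0, "col_major": 1.5}[layout]
    return base + abs(block - 128) / 256


def cost_b(layout, block):
    """Made-up runtime (ms) of kernel B, which reads its input in `layout`."""
    base = {"row_major": 4.0, "col_major": 2.0}[layout]
    return base + abs(block - 256) / 256


def tune_in_isolation():
    # Kernel A is tuned alone; the layout it prefers is then imposed on B,
    # and B can only tune its own block size.
    layout, block_a = min(product(LAYOUTS, BLOCK_SIZES), key=lambda c: cost_a(*c))
    block_b = min(BLOCK_SIZES, key=lambda b: cost_b(layout, b))
    total = cost_a(layout, block_a) + cost_b(layout, block_b)
    return (layout, block_a, block_b), total


def tune_pipeline():
    # The whole pipeline is tuned at once: exhaustively search the joint
    # space and minimize the sum of both kernels' runtimes.
    def total(cfg):
        layout, block_a, block_b = cfg
        return cost_a(layout, block_a) + cost_b(layout, block_b)

    best = min(product(LAYOUTS, BLOCK_SIZES, BLOCK_SIZES), key=total)
    return best, total(best)


if __name__ == "__main__":
    print("isolation:", tune_in_isolation())
    print("pipeline: ", tune_pipeline())
```

In this invented model, tuning A alone selects the row-major layout because it is cheapest for A, even though that layout is the expensive one for B. Searching the joint space instead accepts a slightly slower A in exchange for a much faster B. Whether real pipelines behave this way is exactly what has to be measured.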
One existing optimization strategy for pipelines is to fuse kernels together, which simplifies execution patterns and decreases overhead. In this project we aim to measure the performance of accelerated pipelines in three different tuning scenarios: (1) tuning each component in isolation, (2) tuning the pipeline as a whole, and (3) tuning the fused kernel. By measuring the performance of one or more pipelines in these scenarios, we hope, on one level, to determine which strategy is best for specific pipelines on different hardware platforms and, on another level, to better understand which characteristics influence this behavior.