Auto-tuning is a well-known optimization technique in computer science. It has been used to ease the manual optimization process that is traditionally performed by programmers, and to maximize the performance portability. Auto-tuning works by just executing the code that has to be tuned many times on a small problem set, with different tuning parameters. The best performing version is than subsequently used for the real problems. Tuning can be done with application-specific parameters (different algorithms, granularity, convergence heuristics, etc) or platform parameters (number of parallel threads used, compiler flags, etc).
For this project, we apply auto-tuning on GPUs. We have several GPU applications where the absolute performance is not the most important bottleneck for the application in the real world. Instead the power dissipation of the total system is critical. This can be due to the enormous scale of the application, or because the application must run in an embedded device. An example of the first is the Square Kilometre Array, a large radio telescope that currently is under construction. With current technology, it will need more power than all of the Netherlands combined. In embedded systems, power usage can be critical as well. For instance, we have GPU codes that make images for radar systems in drones. The weight and power limitations are an important bottleneck (batteries are heavy).
In this project, we use power dissipation as the evaluation function for the auto-tuning system. Earlier work by others investigated this, but only for a single compute-bound application. However, many realistic applications are memory-bound. This is a problem, because loading a value from the L1 cache can already take 7-15x more energy than an instruction that only performs a computation (e.g., multiply).
There also are interesting platform parameters than can be changed in this context. It is possible to change both core and memory clock frequencies, for instance. It will be interesting to if we can at runtime, achieve the optimal balance between these frequencies.
We want to perform auto-tuning on a set of GPU benchmark applications that we developed.