Specialized Tools for Performance Tuning

Vladimir Getov
University of Westminster, London


The fast Fourier transform (FFT) is the cornerstone of many supercomputer applications and therefore needs careful performance tuning. In practice, however, the achieved performance of FFT implementations is often far below acceptable levels. Within the framework of the hierarchical tiling approach, we explore several strategies for optimising the performance of the FFT computation, such as enhancing instruction-level parallelism, fusing loops, and reducing the number of memory loads and stores by using a special-purpose source code generator. Our approach is based on the principle of complete unrolling, which we apply to modify the FT kernel of the NAS Parallel Benchmarks. In experiments on two different IBM SP2 platforms, we show performance improvements of between 40% and 53% over the original code. Further, our 3-D FFT mega-step of the whole benchmark is faster than the corresponding FFT library call from the vendor-optimised PESSL numerical library. Finally, our approach to the automatic generation of moderately optimised but specialised codes requires only a modest amount of programming effort.
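
To make the idea of complete unrolling through a source code generator more concrete, the following is a minimal sketch and not the generator described in the paper. It assumes a radix-2 decimation-in-time FFT and uses Python purely for illustration; the names generate_unrolled_fft, compile_fft and naive_dft are hypothetical. The generator emits straight-line code with no loops, folds the twiddle factors into constants, and skips multiplications by 1.

```python
import cmath


def generate_unrolled_fft(n, func_name="fft_unrolled"):
    """Return Python source for a completely unrolled radix-2 FFT of size n.

    Illustrative assumption: n is a power of two and the input is a list of
    complex numbers. The emitted function contains no loops at all.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    lines = [f"def {func_name}(x):"]
    counter = [0]

    def fresh():
        # Produce a new temporary variable name for the generated code.
        counter[0] += 1
        return f"t{counter[0]}"

    def emit(inputs):
        # inputs: names/expressions of the values to transform.
        # Returns the names that hold the DFT of those values.
        m = len(inputs)
        if m == 1:
            return inputs
        even = emit(inputs[0::2])
        odd = emit(inputs[1::2])
        half = m // 2
        out = [None] * m
        for k in range(half):
            if k == 0:
                prod = odd[k]  # twiddle factor is 1: no multiplication emitted
            else:
                w = cmath.exp(-2j * cmath.pi * k / m)  # constant-folded twiddle
                prod = fresh()
                lines.append(
                    f"    {prod} = {odd[k]} * complex({w.real!r}, {w.imag!r})"
                )
            a, b = fresh(), fresh()
            lines.append(f"    {a} = {even[k]} + {prod}")
            lines.append(f"    {b} = {even[k]} - {prod}")
            out[k], out[k + half] = a, b
        return out

    outputs = emit([f"x[{i}]" for i in range(n)])
    lines.append(f"    return [{', '.join(outputs)}]")
    return "\n".join(lines) + "\n"


def compile_fft(n):
    """Generate the unrolled source for size n and compile it to a function."""
    namespace = {}
    exec(generate_unrolled_fft(n), namespace)
    return namespace["fft_unrolled"]


def naive_dft(x):
    """Direct O(n^2) DFT, used only as a reference to check the generated code."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

Calling compile_fft(8) yields a function specialised for 8-point transforms whose result can be compared against naive_dft. Generating the code rather than writing it by hand is what keeps the programming effort modest: the same generator produces a specialised, loop-free kernel for each transform size.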

