Specialized Tools for Performance Tuning
Vladimir Getov
University of Westminster, London
The fast Fourier transform (FFT) is the cornerstone of many supercomputer
applications and therefore needs careful performance tuning. In practice,
however, the performance of FFT implementations is often far below
acceptable levels. Within the framework of the hierarchical tiling approach, we
explore several strategies for optimising the performance of the FFT
computation, such as enhancing instruction-level parallelism, fusing loops, and
reducing memory loads and stores by using a special-purpose source code
generator. Our approach is based on the principle of complete unrolling, which
we apply to modify the FT kernel of the NAS Parallel Benchmarks. In
experiments on two different IBM SP2 platforms, we show performance
improvements between 40% and 53% in comparison with the original
code. Further, our 3-D FFT mega-step of the whole benchmark is faster than the
corresponding FFT library call from the vendor-optimised PESSL numerical
library. Finally, our approach for automatic generation of moderately
optimised but specialised codes requires only a modest amount of programming
effort.
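
To make the idea of complete unrolling by a source code generator concrete, here is a minimal Python sketch. It is not the generator described in this abstract. The function name generate_unrolled_fft, the radix-2 decimation-in-time scheme, and the choice of Python as the target language are all assumptions made for illustration. The sketch emits straight-line code for a fixed-size FFT: there are no loops in the generated function, twiddle factors appear as constants, and the multiplication by the trivial twiddle factor 1 is omitted.

```python
import cmath


def generate_unrolled_fft(n):
    """Return Python source for a loop-free radix-2 FFT of fixed length n.

    Hypothetical illustration only: n must be a power of two, and the
    generated function is named fft<n>.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")

    lines = [f"def fft{n}(x):"]
    counter = [0]

    def fresh():
        # Hand out a new temporary name for the generated code.
        counter[0] += 1
        return f"t{counter[0]}"

    def emit(inputs):
        # inputs: names of the values forming the current sub-sequence.
        # Returns the names holding its DFT, in natural order.
        m = len(inputs)
        if m == 1:
            return inputs
        even = emit(inputs[0::2])
        odd = emit(inputs[1::2])
        out = [None] * m
        for k in range(m // 2):
            if k == 0:
                p = odd[k]  # twiddle factor is 1, so no multiply is emitted
            else:
                w = cmath.exp(-2j * cmath.pi * k / m)
                p = fresh()
                # The twiddle factor is written into the code as a constant.
                lines.append(
                    f"    {p} = complex({w.real!r}, {w.imag!r}) * {odd[k]}"
                )
            lo, hi = fresh(), fresh()
            lines.append(f"    {lo} = {even[k]} + {p}")  # butterfly sum
            lines.append(f"    {hi} = {even[k]} - {p}")  # butterfly difference
            out[k], out[k + m // 2] = lo, hi
        return out

    result = emit([f"x[{i}]" for i in range(n)])
    lines.append("    return [" + ", ".join(result) + "]")
    return "\n".join(lines) + "\n"
```

Calling print(generate_unrolled_fft(8)) shows the generated straight-line code, and passing that source to exec defines a callable fft8. Because every index and twiddle factor is fixed at generation time, a compiler for a lower-level target language would have more freedom to schedule the independent butterflies and keep intermediate values in registers.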