High-performance implementation of the level-3 BLAS (paper)
2008, ACM Transactions on Mathematical Software, cited by 314
Parallel Computing and Optimization Techniques; Interconnection Networks and Systems; Advanced Data Storage Technologies
Abstract
A simple but highly effective approach for transforming high-performance implementations on cache-based architectures of matrix-matrix multiplication into implementations of other commonly used matrix-matrix computations (the level-3 BLAS) is presented. Exceptional performance is demonstrated on various architectures.
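To make the idea of reusing a matrix-matrix multiplication for another level-3 BLAS operation concrete, here is a minimal sketch. It is not code from the paper. It takes the symmetric multiply (SYMM, C = A·B with A symmetric) as the example operation, and it assumes the reuse happens by copying each block of A into a dense buffer with the symmetry filled in, then handing that buffer to an unchanged multiply kernel. The names `gemm_block`, `pack_symmetric_block`, and `symm_via_gemm` are hypothetical, and the pure-Python triple loop only stands in for a tuned GEMM kernel.

```python
def gemm_block(A_blk, B_blk, C, row0):
    """Accumulate A_blk * B_blk into the rows of C starting at row0.

    Stand-in for a tuned GEMM kernel; it knows nothing about symmetry.
    """
    for i, a_row in enumerate(A_blk):
        c_row = C[row0 + i]
        for p, a in enumerate(a_row):
            b_row = B_blk[p]
            for j in range(len(c_row)):
                c_row[j] += a * b_row[j]


def pack_symmetric_block(A_lower, i0, j0, nb, n):
    """Copy the block of a symmetric matrix starting at (i0, j0) into a dense buffer.

    Only entries on or below the diagonal of A_lower are read; entries above
    the diagonal are obtained from their mirror image.
    """
    rows = range(i0, min(i0 + nb, n))
    cols = range(j0, min(j0 + nb, n))
    return [[A_lower[i][j] if i >= j else A_lower[j][i] for j in cols]
            for i in rows]


def symm_via_gemm(A_lower, B, nb=2):
    """Compute C = A * B, where A is symmetric and stored in its lower triangle."""
    n = len(A_lower)
    ncols = len(B[0])
    C = [[0.0] * ncols for _ in range(n)]
    for i0 in range(0, n, nb):          # block row of A (and of C)
        for j0 in range(0, n, nb):      # block column of A (block row of B)
            A_blk = pack_symmetric_block(A_lower, i0, j0, nb, n)
            B_blk = B[j0:j0 + nb]
            gemm_block(A_blk, B_blk, C, i0)
    return C
```

In this sketch the only symmetry-aware code is the packing step, so the multiply kernel is shared unchanged between the general and the symmetric case. The block size `nb` is an illustrative parameter, not a tuned value.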