An effective implementation of Strassen’s algorithm using AVX intrinsics for a multicore architecture

Nwe Zin Oo and Panyayot Chaikan

DOI: 10.14456/sjst-psu.2020.177

pp. 1368–1376

Abstract

This paper proposes an effective implementation of Strassen’s algorithm with AVX intrinsics to accelerate matrix-matrix
multiplication in a multicore system. AVX2 and FMA3 intrinsic functions are used, along with OpenMP, to implement the
multiplication kernel of Strassen’s algorithm. Loop tiling and unrolling are also applied to improve cache
utilization. A systematic method is proposed for determining the best stop condition for the recursion to achieve maximum
performance on specific matrix sizes. In addition, an analysis method makes fine-tuning possible when the algorithm is adapted
to another machine with a different hardware configuration. The algorithm was compared against the latest
versions of two well-known open-source libraries. It is, on average, 1.52 and 1.87 times faster
than the Eigen and OpenBLAS libraries, respectively, and scales efficiently as the matrix size increases.
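
The role of the recursion stop condition can be illustrated with a minimal sketch. It is not the authors' implementation: it is plain Python with no AVX, FMA3, or OpenMP, it assumes square matrices stored as lists of lists, and the names `matmul_naive`, `strassen`, and `cutoff` (with its default of 64) are hypothetical choices for illustration only. Below the cutoff, or when the size is odd, the recursion stops and falls back to a conventional triple loop, which stands in for the vectorized kernel the abstract describes.

```python
def matmul_naive(A, B):
    """Conventional triple-loop product; stands in for an optimized base kernel."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):          # i-k-j order walks rows of B contiguously
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C


def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def split(M):
    """Split a square matrix of even size into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=64):
    """Strassen's algorithm for square matrices with a recursion cutoff.

    `cutoff` is an assumed, untuned value: at or below it (or for odd sizes)
    the recursion stops and the base kernel is used instead.
    """
    n = len(A)
    if n <= cutoff or n % 2:
        return matmul_naive(A, B)

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive products instead of the eight a plain block split needs.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)

    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)

    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Calling `strassen(A, B, cutoff=c)` for several values of `c` and timing each call is one simple way to see why the stop condition matters: a cutoff that is too small spends time on extra additions and temporaries, while one that is too large gives up the saving from the seven-product recursion.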
