Header menu link for other important links
Cache-efficient matrix transposition
Siddhartha Chatterjee,
Published in IEEE, Los Alamitos, CA, United States
Pages: 195 - 205
We investigate the memory system performance of several algorithms for transposing an N × N matrix in-place, where N is large. Specifically, we investigate the relative contributions of the data cache, the translation lookaside buffer, register tiling, and the array layout function to the overall running time of the algorithms. We use various memory models to capture and analyze the effect of various facets of cache memory architecture that guide the choice of a particular algorithm, and attempt to experimentally validate the predictions of the model. Our major conclusions are as follows: limited associativity in the mapping from main memory addresses to cache sets can significantly degrade running time; the limited number of TLB entries can easily lead to thrashing; the fanciest optimal algorithms are not competitive on real machines even at fairly large problem sizes unless cache miss penalties are quite high; low-level performance tuning 'hacks', such as register tiling and array alignment, can significantly distort the effects of improved algorithms; and hierarchical non-linear layouts are inherently superior to the standard canonical layouts (such as row- or column-major) for this problem.
About the journal
Published in IEEE, Los Alamitos, CA, United States
Open Access
Impact factor