MTL 4: Performance on an AMD Opteron 2GHz

The following measurements are performed with the benchmark template library (BTL) from Laurent Plagne.

The first benchmark is the product of a dense matrix with its transpose, i.e. A * trans(A):

This operation favors row-major matrices because both operands are traversed with stride 1. As a consequence, the C and C++ implementations perform well for large matrices compared with the Fortran implementation (where both arguments are traversed with long strides). The MTL4 implementation is even less affected by the matrix size, thanks to its recursive approach.

The implementation uses tiling on block level (typically 64 by 64). For the processor considered, a tiling of 2 by 4 yields the best performance, while processors with more available FP registers (e.g. PowerPC) are faster with 4 by 4 tiling. The metaprogramming tuning in MTL4 lets the user define these parameters in the type definitions of functors; the unrolled implementation is then generated at compile time.

In this measurement, the benchmark was compiled without -DMTL_HAS_BLAS (/DMTL_HAS_BLAS on MSVC). If we had enabled BLAS in MTL4, the two curves would have been identical.

The second example transposes the first argument in the dense matrix product, i.e. trans(A) * A. This operation is correspondingly more suitable for column-major matrices, so the Fortran implementation scales better than the C/C++ codes:

MTL4's performance also decreases relative to the first benchmark, but the effect is limited thanks to the recursive implementation.

Multiplying matrices of the same orientation without transposition, i.e. A * A, scales poorly for both row-major and column-major layouts if no blocking is used:

As in the previous measurements, the nested blocking of GotoBLAS and the recursive blocking of MTL4 cope with the locality problems of large matrices. In this plot, we also compare against recursive matrix formats. The trend is similar to that of the traditional row-major layout, but the performance behaves more stably: while row-major matrices whose strides are large powers of two introduce a fair amount of cache conflicts, the improved locality of the recursive layout minimizes such conflicts.

The following benchmark considers a different operation: x = alpha * y + beta * z, with alpha and beta scalars and x, y, z vectors.

Most modern libraries use expression templates for this calculation so that all operations are performed in one single loop.

Finally, we managed to outperform GotoBLAS in one function, at least for some sizes:

The dot product in this plot internally uses unrolling with block size 8. Please note that the unrolled code is generated at compile time, not written by hand.

Thanks to Laurent Plagne for his support with the BTL and to Chris Cole for running the programs on a cluster node at Indiana University.

**Remarks:**

- Performance measurements labeled MTL represent computations with MTL2. MTL2 was tuned for KCC and achieved excellent performance with this compiler (cf. MTL2 performance). With MTL4 we did not rely on compilers for tiling, loop unrolling and similar transformations, for two reasons: first, compilers behave very differently in this regard; second, many transformations rely on mathematical properties such as commutativity that are not known for user types and/or user-defined operations, so compiler optimization is limited to built-in types and operations. To cope with this, we implemented accelerating transformations by metaprogramming and count on compilers for efficient inlining and reference forwarding. Our metaprogramming optimizations -- meta-tuning for short -- proved highly efficient in multiple measurements (the plots above are only a few examples) and were always as fast as code written directly in unrolled/tiled form.

