Benchmarks with CUDA-MTL4

For measuring the performance, we compared the execution on:

Dot Product

The following plot shows the performance of dot products with float values:

Dot4.png

CRS Matrix Vector Product

CRS matrices have a memory access pattern that is unfavorable for GPU load instructions. Therefore, the performance gain on a GPU is not spectacular. For balanced matrices, i.e. matrices where all rows have the same number of stored entries (including stored zeros), it is easier to arrange coalesced memory access, and the performance could be almost doubled. This is particularly important for earlier architectures, where the loaded line was 128 bytes. On newer architectures, the load line is 32 bytes. The worst case for accessing 4-byte float values is therefore to load 7 irrelevant values out of 8, instead of 31 unneeded values out of 32. The impact of non-coalesced memory access is thus less severe, but it is still an issue for optimal performance. The next plot shows the performance of a float CRS matrix times vector:

CRS.png

Remark: The problem size is given in millions of entries.
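
To make the access pattern concrete, here is a minimal sequential sketch of a CRS matrix vector product in plain C++. It is not the MTL4 or CUDA-MTL4 implementation; the type CrsMatrix and the functions crs_multiply and wasted_values are hypothetical names introduced only for this illustration. The helper wasted_values reproduces the worst-case arithmetic from the text above: one needed value per loaded line, the rest unused.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal CRS container (illustration only, not the MTL4 type).
struct CrsMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_start; // rows + 1 offsets into the arrays below
    std::vector<std::size_t> col_index; // column of each stored entry
    std::vector<float> values;          // stored entries, row after row
};

// Computes y = A * x by walking each row's contiguous slice of entries.
std::vector<float> crs_multiply(const CrsMatrix& a, const std::vector<float>& x)
{
    assert(a.row_start.size() == a.rows + 1);
    assert(a.col_index.size() == a.values.size());
    assert(x.size() == a.cols);

    std::vector<float> y(a.rows, 0.0f);
    for (std::size_t r = 0; r < a.rows; ++r) {
        float sum = 0.0f;
        // Entries of row r occupy [row_start[r], row_start[r + 1]).
        for (std::size_t k = a.row_start[r]; k < a.row_start[r + 1]; ++k)
            sum += a.values[k] * x[a.col_index[k]];
        y[r] = sum;
    }
    return y;
}

// Worst case: only one value of a loaded line is needed, the others are wasted.
constexpr std::size_t wasted_values(std::size_t line_bytes, std::size_t value_bytes)
{
    return line_bytes / value_bytes - 1;
}
```

If each row were handled by its own GPU thread, neighboring threads would start reading at row_start[r] and row_start[r + 1], addresses that lie a whole row apart, which is one plausible reading of why coalescing is hard for CRS. With float values, wasted_values(128, sizeof(float)) yields 31 and wasted_values(32, sizeof(float)) yields 7, matching the figures quoted above.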

Sparse Matrix Vector Product

Matrices in the Ellpack format are known to be more appropriate for data streaming on GPUs. In contrast, CPUs are slightly more efficient with CRS matrices. The following plot compares the performance of different matrix vector products in single precision:

SparseMatrixVectorProduct.png
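
For comparison, the following sketch shows one common way to lay out an Ellpack matrix and multiply it with a vector in plain C++. It is an assumption-laden illustration, not the CUDA-MTL4 data structure: the names EllMatrix, slots and ell_multiply are hypothetical, and the slot-major ordering (all first entries of every row, then all second entries, and so on) is merely one usual choice. Every row is padded with zeros to the same number of stored entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal Ellpack container (illustration only, not the MTL4 type).
struct EllMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::size_t slots = 0;              // stored entries per row, padding included
    std::vector<std::size_t> col_index; // rows * slots indices, slot-major
    std::vector<float> values;          // same layout; padded entries hold 0.0f
};

// Computes y = A * x; entry s of row r is stored at position s * rows + r.
std::vector<float> ell_multiply(const EllMatrix& a, const std::vector<float>& x)
{
    assert(a.values.size() == a.rows * a.slots);
    assert(a.col_index.size() == a.values.size());
    assert(x.size() == a.cols);

    std::vector<float> y(a.rows, 0.0f);
    for (std::size_t s = 0; s < a.slots; ++s) {
        // Within one slot, consecutive rows read consecutive memory positions.
        for (std::size_t r = 0; r < a.rows; ++r) {
            const std::size_t k = s * a.rows + r;
            y[r] += a.values[k] * x[a.col_index[k]];
        }
    }
    return y;
}
```

In this layout, rows r and r + 1 access adjacent elements of values and col_index in every slot, so threads assigned to neighboring rows would read neighboring addresses. The price is the padding: rows with few entries still occupy slots positions each.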

Iterative Linear Solver

The implementation of the Conjugate Gradient (CG) method in MTL4 has been investigated intensively, i.e. the CPU performance is already well-tuned. Since iterative solvers do not work properly with single precision on large systems, we switched to double precision. The previously stated superiority of Ellpack matrices over CRS is also clearly observable when they are used within an iterative solver:

cg.png
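
As a reminder of what is being timed, here is a minimal sketch of an unpreconditioned CG iteration in double precision. It is a textbook formulation written for this page, not the MTL4 solver or its interface; the names Vec, dot and conjugate_gradient are hypothetical, and the operator is passed as a callable apply so that any matrix format (CRS, Ellpack) could stand behind it. The matrix is assumed to be symmetric positive definite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Plain dot product in double precision.
double dot(const Vec& a, const Vec& b)
{
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        s += a[i] * b[i];
    return s;
}

// Unpreconditioned CG for a symmetric positive definite operator.
// apply(v) must return A * v. x holds the initial guess and is updated in place.
// Stops when ||r|| <= tol * ||b|| or after max_iter steps; returns the step count.
template <typename Apply>
std::size_t conjugate_gradient(Apply apply, const Vec& b, Vec& x,
                               double tol, std::size_t max_iter)
{
    assert(x.size() == b.size());
    const std::size_t n = b.size();

    Vec r = apply(x);
    for (std::size_t i = 0; i < n; ++i)
        r[i] = b[i] - r[i];                // initial residual r = b - A x
    Vec p = r;                             // first search direction
    double rr = dot(r, r);
    const double stop = tol * tol * dot(b, b);

    std::size_t it = 0;
    while (it < max_iter && rr > stop) {
        const Vec ap = apply(p);           // the one matrix vector product per step
        const double alpha = rr / dot(p, ap);
        for (std::size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        for (std::size_t i = 0; i < n; ++i)
            p[i] = r[i] + beta * p[i];     // next conjugate direction
        rr = rr_new;
        ++it;
    }
    return it;
}
```

Each step performs exactly one matrix vector product, two dot products and three vector updates, so the matrix format behind apply has a direct influence on the runtime of the whole solver. A caller could, for instance, pass a lambda that wraps a CRS or Ellpack product.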
Remarks:
More benchmarks are planned.



Benchmarks with CUDA-MTL4 -- CUDA-MTL4 -- Peter Gottschling -- Gen. with rev. 9324 on Sun Jun 16 2013 by doxygen 1.7.6.1 -- © 2013 by SimuNova UG.