Benchmarks with CUDA-MTL4
For measuring the performance, we compared the execution on:
The following plot shows the performance of dot products with float values:
CRS matrices have a memory access pattern that is unfavorable for GPU load instructions, so the performance gain on a GPU is not spectacular. For balanced matrices, i.e. matrices whose rows all have the same number of stored entries (including stored zeros), it is easier to arrange coalesced memory access, and the performance could be almost doubled. This is particularly important for earlier architectures, where the loaded line was 128 bytes. On newer architectures, the load line is 32 bytes. The worst case for accessing float values is therefore to load 7 irrelevant values out of 8, instead of 31 unneeded values out of 32. The impact of non-coalesced memory access is thus less severe, but it is still an issue for optimal performance. The next plot shows the performance of a float CRS matrix times vector:
Remark: The problem size is given in millions of entries.
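The load-line arithmetic above and the access pattern of a CRS product can be sketched in plain C++. This is a CPU-side illustration only, not the CUDA-MTL4 implementation: the `crs_matrix` struct and the functions `wasted_values_per_line` and `crs_mult` are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Worst case of a fully non-coalesced access: one value of the loaded line
// is needed, all other values in that line are loaded for nothing.
constexpr std::size_t wasted_values_per_line(std::size_t line_bytes,
                                             std::size_t value_bytes)
{
    return line_bytes / value_bytes - 1;
}

// Hypothetical minimal CRS container (not the MTL4 matrix type).
struct crs_matrix
{
    std::size_t              rows;
    std::vector<std::size_t> row_start; // rows + 1 offsets into col and val
    std::vector<std::size_t> col;       // column index of each stored entry
    std::vector<float>       val;       // value of each stored entry
};

// y = A * x, processing one row after the other.
std::vector<float> crs_mult(const crs_matrix& A, const std::vector<float>& x)
{
    assert(A.row_start.size() == A.rows + 1);
    std::vector<float> y(A.rows, 0.0f);
    for (std::size_t r = 0; r < A.rows; ++r)
        // The entries of row r are contiguous, but row r + 1 starts one
        // row length further on. If each row were handled by its own GPU
        // thread, neighboring threads would read addresses that far apart.
        for (std::size_t k = A.row_start[r]; k < A.row_start[r + 1]; ++k)
            y[r] += A.val[k] * x[A.col[k]];
    return y;
}
```

Calling `wasted_values_per_line(128, sizeof(float))` and `wasted_values_per_line(32, sizeof(float))` reproduces the two worst cases quoted above, assuming the usual 4-byte `float`.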
Matrices in the Ellpack format are known to be more appropriate for data streaming on GPUs. In contrast, CPUs are slightly more efficient with CRS matrices. The following plot compares the performance of different matrix-vector products in single precision:
The Conjugate Gradient (CG) method is an intensively investigated implementation in MTL4, so its CPU performance is already well tuned. Since iterative solvers do not work properly with single precision on large systems, we switched to double precision. The superiority of Ellpack matrices over CRS stated above is also clearly observable when they are used within an iterative solver:
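To make the combination of Ellpack storage and CG concrete, here is a minimal double-precision sketch in plain C++. It is not the MTL4 solver or its interface: `ell_matrix`, `ell_mult`, `dot`, `cg` and `laplace_1d` are hypothetical names for this illustration. The column-by-column layout assumed here is one common Ellpack variant, and the solver is plain unpreconditioned CG.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using vec = std::vector<double>;

// Hypothetical minimal Ellpack container (not the MTL4 matrix type).
// Every row stores exactly `width` entries; shorter rows are padded with
// explicit zeros. Entry k of row r lives at index k * rows + r, so for a
// fixed k the entries of consecutive rows are adjacent in memory.
struct ell_matrix
{
    std::size_t              rows, width;
    std::vector<std::size_t> col; // rows * width column indices
    std::vector<double>      val; // rows * width values, padding is 0.0
};

// y = A * x. On a GPU the inner loop over r would be spread over threads,
// which then read neighboring addresses of col and val.
vec ell_mult(const ell_matrix& A, const vec& x)
{
    assert(A.val.size() == A.rows * A.width && A.col.size() == A.val.size());
    vec y(A.rows, 0.0);
    for (std::size_t k = 0; k < A.width; ++k)
        for (std::size_t r = 0; r < A.rows; ++r)
            y[r] += A.val[k * A.rows + r] * x[A.col[k * A.rows + r]];
    return y;
}

double dot(const vec& a, const vec& b)
{
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Unpreconditioned CG for a symmetric positive definite A, started at x = 0.
// Stops when the residual norm has dropped by the factor tol, or after
// max_iter iterations. Returns the number of iterations performed.
std::size_t cg(const ell_matrix& A, const vec& b, vec& x,
               double tol, std::size_t max_iter)
{
    x.assign(b.size(), 0.0);
    vec r = b, p = b;
    double rr = dot(r, r);
    const double stop = tol * tol * rr; // compare squared norms
    std::size_t it = 0;
    for (; it < max_iter && rr > stop; ++it) {
        const vec    q     = ell_mult(A, p);
        const double alpha = rr / dot(p, q);
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * q[i];
        }
        const double rr_new = dot(r, r);
        const double beta   = rr_new / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return it;
}

// Illustrative test matrix: 1D Laplacian (2 on the diagonal, -1 beside it).
// The first and last row have only two entries; their third slot keeps the
// padding value 0.0 and column index 0.
ell_matrix laplace_1d(std::size_t n)
{
    ell_matrix A{n, 3, std::vector<std::size_t>(3 * n), vec(3 * n, 0.0)};
    for (std::size_t r = 0; r < n; ++r) {
        A.col[r] = r;
        A.val[r] = 2.0;
        if (r > 0)     { A.col[n + r] = r - 1;     A.val[n + r] = -1.0; }
        if (r + 1 < n) { A.col[2 * n + r] = r + 1; A.val[2 * n + r] = -1.0; }
    }
    return A;
}
```

A typical call builds the matrix with `laplace_1d(n)`, fills a right-hand side `b`, and runs `cg(A, b, x, 1e-12, 100)`. Only `ell_mult` depends on the storage format, so exchanging it for a CRS product leaves the solver itself unchanged.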