The development of CUDA-MTL4 is driven by the goal of providing maximal benefit to the scientific community. Therefore, we started with operations on vectors and sparse matrices, which predominate in simulation software, and postponed dense matrices for later, even though they would allow us to demonstrate much higher performance. Sparse operations cannot approach the peak floating-point performance of GPUs; our yardstick is instead the peak memory bandwidth. Several operations, e.g. the dot product, reach 70 percent of peak bandwidth, which is competitive with Nvidia's sparse implementations.
Another important design criterion is that CUDA-MTL4 is fully compatible with the open-source version. This means that every application written for the latter compiles immediately with CUDA-MTL4 (as long as the application does not use C++11 features unsupported by the nvcc version in use). Not all operations in the open-source version are GPU-accelerated, and some may never be (e.g. preconditioners). Internally, CUDA-MTL4 is controlled by meta-programming mechanisms that know which types and operations are supported on the GPU. When no CUDA support is available, the operation is performed on the CPU. Applications with only partial GPU support can even be slower than pure CPU computations because of frequent data transfers between CPU and GPU memory. We explain this in more detail on page Dynamic Memory Handling.
As with the previously released MTL4 editions, CMTL4 aims for a balance of productivity and performance. Users of the existing editions will find it easy to switch to CMTL4, and new users will find an easy entry into CUDA programming. To make development truly productive, we also offer debugger support that displays our containers in a readable form.