Performance on an HPC Cluster
These are only preliminary results; we will perform more benchmarks in the near future.
Comparison with PETSc
The following plot compares the weak scaling of a Conjugate Gradient solver in PMTL4 and PETSc on Juropa (for details see here). The time for one iteration is plotted:
The linear system was generated with the finite element package AMDiS solving a Poisson equation (for the sake of simplicity). In the largest example (i.e. the upper pair of lines), 1,048,576 DOFs per processor were used; in the smaller examples the number of DOFs is halved each time. The plot demonstrates that the performance of the two libraries is similar, with PMTL4 performing better on the largest example.
In the following plot, we used a similar test as in the largest example above. This time the sub-domains are arranged in a 1D scheme for simplicity, i.e. each sub-domain has only two neighboring domains and the inner boundary is always 1,024 grid points long. The linear systems are generated directly in PMTL4 without using a FEM program.
In this simple example the use of domain decomposition with ParMetis is counter-productive: the already optimal boundaries can only be made worse (more communication), and the index order becomes more irregular (lower performance on the cores).
The plot indicates that memory bandwidth is the limiting factor: although the Intel Xeon X5570 has 3 fast memory channels for 4 cores, the performance is higher when only 1 or 2 of the cores are active.
The compute times in the previous plot correspond to the following numbers of floating point operations:
They were performed on TU Dresden's old HPC cluster "Deimos" with dual-core AMD Athlon processors (for details of the cluster see here). This platform will be turned off in the near future (or may already be by the time you read this).
The solvers have also been significantly tuned in the meantime, and the benchmarks will be repeated on a more powerful architecture as soon as possible.
In this benchmark, we are using three matrices from the University of Florida Sparse Matrix Collection:
- Ga41As41H72 (Gallium Arsenide cluster) is a typical matrix from electronic structure calculations (rows = 268,096, nnz = 18,488,476, more here).
- F1 is a stiffness matrix from an engine crankshaft (rows = 343,791, nnz = 26,837,113, more here).
- ldoor is a positive definite test matrix (rows = 952,203, nnz = 42,493,817, more here).
These matrices were used to set up linear systems of corresponding dimension. We solved these linear systems with a Bi-Conjugate Gradient method (BiCG) such that |r| ≤ 10⁻⁸:
We measured the wall clock time for the solution. Although these are among the largest matrices one can find on the internet, they are still too small to scale well beyond 20 cores.
The following benchmark solved the same linear systems with the method Bi-Conjugate Gradient Stabilized version 2 (BiCGStab(2)):
Likewise, the Conjugate Gradient Squared method (CGS) was applied in the following benchmark:
The ill-conditioning of the ldoor matrix prevented a solution with CGS. Similarly, the other systems could not be solved efficiently with CGS when the absolute times are compared to the benchmarks above.
To better represent the behavior of large-scale simulations in real HPC scenarios, we considered the performance of applications that grow proportionally to the number of processes (weak scaling). After all, large parallel computers are used to solve larger problems than those on single-core PCs. In this benchmark, we created a linear system to solve a two-dimensional Laplace equation. The dimension of the linear system was one million times the number of cores (the number of non-zeros in the matrix is about five times that). Two iterative solvers were used:
- Conjugate Gradient (CG) and
The figure below plots the accumulated performance in MegaFLOPS:
In this performance measurement, we simulated the drug diffusion into the cochlea (inner ear), see the figure below.
This simulation was performed with our simulation software AMDiS. Here, we measured the solution time (wall clock) of the linear system for different refinement levels of the grid:
- Refinement 22 had 23,130,338 non-zero entries in the matrix;
- Refinement 24 had 65,368,625 non-zeros; and
- Refinement 26 had 152,345,109 non-zeros.
The calculation was performed twice for each refinement level:
- By distributing the matrix block-wise in the order in which it was read;
- By migrating the linear system according to the partitioning from ParMetis, indicated by ParM.
The entire migration time, consisting of:
- Computing the partitioning with ParMetis;
- Calculating a new block distribution and the corresponding migration schemes; and
- Migrating the matrix and the right-hand-side vector,
was about 5 percent of the overall compute time. On the other hand, the migration saved about half of the solver's run-time, as can be observed in the following plot:
The calculation at refinement level 26 did not run on 2 processes (due to memory exhaustion). Also, the run-time of this example without migration grows up to 8,712 s, which we cut off for better readability of the plot.