Our new sparse matrix multiplication algorithm was presented at SC16
SC (formerly Supercomputing) is the largest annual conference for practitioners of high performance computing (HPC). It gathers leading HPC scientists from all over the world, along with top HPC vendors, ISVs, and their HPC clients.
This year, a paper I co-authored was accepted to the scientific program of the conference. The acceptance rate was as low as 18%: just 82 papers were accepted out of 446 submissions!
Our paper is titled "Enhanced MPSM3 for applications to quantum biological simulations" and is co-authored by
- me, Alexander Pozdneev (IBM Science and Technology Center, Moscow, Russia),
- Valéry Weber (IBM Research – Zurich Laboratory),
- Teodoro Laino (IBM Research – Zurich Laboratory),
- Costas Bekas (IBM Research – Zurich Laboratory),
- and Alessandro Curioni (IBM Research – Zurich Laboratory).
This paper is a follow-up to our previous paper "Semiempirical Molecular Dynamics (SEMD) I: Midpoint-Based Parallel Sparse Matrix–Matrix Multiplication Algorithm for Matrices with Decay". The research targets the development of a novel quantum molecular dynamics code that applies quantum Hamiltonians to large-scale simulations of biological systems.
The core operation of self-consistent field computations in Born-Oppenheimer molecular dynamics is sparse matrix-matrix multiplication. Dense matrix-matrix multiplication on parallel computers is a well-developed area (for example, it is the computational core of the famous LINPACK benchmark used to rank supercomputers for the TOP500 list). Sparse matrix linear algebra, however, is far less developed. There are several general-purpose parallel sparse matrix-matrix multiplication algorithms (for example, SUMMA), but they suffer from scalability issues when both the number of non-zeros and the number of processes increase.
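To make this core operation concrete, here is a minimal serial sketch of sparse matrix-matrix multiplication in CSR format, following Gustavson's classic row-by-row scheme. It only illustrates the kernel that the parallel algorithms distribute; it is not the block-sparse code from our paper, and the data structure names are my own.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR storage: row_ptr has n + 1 entries; col and val hold the non-zeros.
struct Csr {
    std::size_t n;                     // number of rows and columns (square matrix)
    std::vector<std::size_t> row_ptr;
    std::vector<std::size_t> col;
    std::vector<double> val;
};

// C = A * B with Gustavson's row-wise algorithm: row i of C is a sparse linear
// combination of the rows of B selected by the non-zeros of row i of A.
Csr spgemm(const Csr& A, const Csr& B) {
    Csr C{A.n, std::vector<std::size_t>(A.n + 1, 0), {}, {}};
    // marker[j] remembers where column j sits in the current row of C.
    std::vector<std::ptrdiff_t> marker(B.n, -1);
    for (std::size_t i = 0; i < A.n; ++i) {
        const auto row_start = static_cast<std::ptrdiff_t>(C.col.size());
        for (std::size_t ka = A.row_ptr[i]; ka < A.row_ptr[i + 1]; ++ka) {
            const std::size_t k = A.col[ka];
            const double a_ik = A.val[ka];
            for (std::size_t kb = B.row_ptr[k]; kb < B.row_ptr[k + 1]; ++kb) {
                const std::size_t j = B.col[kb];
                if (marker[j] < row_start) {           // first contribution to C(i, j)
                    marker[j] = static_cast<std::ptrdiff_t>(C.col.size());
                    C.col.push_back(j);
                    C.val.push_back(a_ik * B.val[kb]);
                } else {                               // accumulate into existing entry
                    C.val[marker[j]] += a_ik * B.val[kb];
                }
            }
        }
        C.row_ptr[i + 1] = C.col.size();
    }
    return C;
}
```

Parallel algorithms differ mainly in how they partition the non-zeros of A, B, and C across processes and in how much data they have to move to bring matching blocks together; that is where the midpoint idea below comes in.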
Scalability can be improved by exploiting the special structure of the matrices that arise in a specific application. In the previous paper, we presented an algorithm that exploits the spatial structure of the underlying atomic system to lay out over a process grid the matrix blocks that represent interactions between atoms.
The idea is illustrated in the following picture:
Figure (a) shows a sample spatial distribution of a few atoms, represented as numbered dots, over a 2 × 2 grid of processes P₁–P₄. The stars represent the midpoints between pairs of atoms connected by a line. Symbols of the same color are owned by the same process (e.g., red entities are owned by process P₁). Figure (b) is the corresponding matrix view, in which rows and columns correspond to atom indices. Each colored block represents the interaction between a pair of atoms.
Since the process that owns a matrix block is uniquely determined by the midpoint between the corresponding atoms, we call this scheme a midpoint-based parallel distribution of a sparse matrix.
Matrix multiplication produces new non-zero blocks; in other words, we start to take into account interactions between atoms that were initially considered too far from each other. These new blocks are also distributed according to the midpoint rule.
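For the curious, here is how the midpoint rule might look in code. This is a deliberately simplified 2-D sketch with a uniform process grid; the names (Grid2D, owner_of_block, and so on) are my own illustrative assumptions rather than the actual data structures of MPSM3. The same function also decides where a newly created product block has to live.

```cpp
#include <array>
#include <cstddef>

// Simplified 2-D setup: atoms live in a square box of side L that is split into
// a px-by-py grid of equal process domains. The real code is 3-D and more
// general; these types are assumptions made only for this sketch.
struct Grid2D {
    double L;            // box side length
    std::size_t px, py;  // process grid dimensions
};

using Vec2 = std::array<double, 2>;

// Rank of the process whose domain contains point p (row-major rank numbering).
std::size_t owner_of_point(const Grid2D& g, const Vec2& p) {
    auto cell = [](double x, double L, std::size_t n) {
        std::size_t c = static_cast<std::size_t>(x / L * n);
        return c < n ? c : n - 1;    // clamp atoms sitting exactly on the box edge
    };
    return cell(p[1], g.L, g.py) * g.px + cell(p[0], g.L, g.px);
}

// Midpoint rule: the matrix block describing the interaction of atoms i and j
// belongs to the process whose domain contains the midpoint of their positions.
std::size_t owner_of_block(const Grid2D& g, const Vec2& ri, const Vec2& rj) {
    const Vec2 mid = {0.5 * (ri[0] + rj[0]), 0.5 * (ri[1] + rj[1])};
    return owner_of_point(g, mid);
}
```

In the figure above, pairs of atoms whose midpoints fall into the same quadrant of the 2 × 2 grid get the same owner, which is exactly why blocks of one color end up on one process.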
Compared to less sophisticated matrix distribution schemes and general-purpose matrix multiplication algorithms, our approach reduces the interprocess communication volume and thus overcomes the scalability issues mentioned above.
Since the previous paper, we have drastically simplified the logic of the matrix multiplication algorithm, while keeping the matrix distribution and the initial communication step the same as in the original paper.
The new distribution of the computations is illustrated in the following pictures. First, processes I and II send their matrices to their neighbors, and some process III (and possibly other processes) receives matrices from both of them:
Here, squares indicate process boxes; dots represent atoms, and the stars between pairs of atoms connected by a line mark the midpoints of those atoms.
Then, process III is chosen to multiply the local matrices from processes I and II:
Finally, the multiplication is followed by a matrix redistribution stage, in which the process redistributes the blocks of the computed matrix among its neighbors according to the midpoint rule (blue and red arrows):
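To tie the three stages together, below is a toy, purely serial model of the flow: distribute A and B by the midpoint rule, exchange blocks, multiply locally, and redistribute the product blocks to their midpoint owners. Everything here is an illustrative assumption; the real implementation works on block-sparse matrices, communicates via MPI with spatial neighbors only, and chooses the multiplying process geometrically (it may own neither operand), whereas this sketch simply lets the owner of each A block do the work.

```cpp
#include <array>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Vec2 = std::array<double, 2>;
struct Block { std::size_t i, j; double value; };   // stand-in for a dense sub-block

// Owner of a point in the unit box split into a px-by-py process grid (row-major ranks).
std::size_t owner(const Vec2& p, std::size_t px, std::size_t py) {
    auto cell = [](double x, std::size_t n) {
        std::size_t c = static_cast<std::size_t>(x * n);
        return c < n ? c : n - 1;
    };
    return cell(p[1], py) * px + cell(p[0], px);
}

Vec2 midpoint(const Vec2& a, const Vec2& b) {
    return {0.5 * (a[0] + b[0]), 0.5 * (a[1] + b[1])};
}

int main() {
    const std::size_t px = 2, py = 2, nranks = px * py;
    // A few atoms in the unit box (positions are made up for the example).
    const std::vector<Vec2> atom = {{0.1, 0.2}, {0.4, 0.1}, {0.8, 0.3}, {0.7, 0.9}};

    // Midpoint-based distribution of the operands A and B.
    std::vector<std::vector<Block>> A(nranks), B(nranks);
    for (std::size_t i = 0; i < atom.size(); ++i)
        for (std::size_t j = 0; j < atom.size(); ++j) {
            const std::size_t r = owner(midpoint(atom[i], atom[j]), px, py);
            A[r].push_back({i, j, 1.0});
            B[r].push_back({i, j, 1.0});
        }

    // Stage 1: every rank "sends" its B blocks to the ranks that will need them
    // (here: to all ranks; the real algorithm only talks to spatial neighbors).
    std::vector<std::vector<Block>> recvB(nranks);
    for (std::size_t src = 0; src < nranks; ++src)
        for (std::size_t dst = 0; dst < nranks; ++dst)
            recvB[dst].insert(recvB[dst].end(), B[src].begin(), B[src].end());

    // Stage 2: each rank forms the products C_ij += A_ik * B_kj it is responsible for.
    // Stage 3: every product block is sent to the owner of midpoint(i, j).
    std::vector<std::map<std::pair<std::size_t, std::size_t>, double>> C(nranks);
    for (std::size_t r = 0; r < nranks; ++r)
        for (const Block& a : A[r])
            for (const Block& b : recvB[r])
                if (a.j == b.i) {
                    const std::size_t dst = owner(midpoint(atom[a.i], atom[b.j]), px, py);
                    C[dst][{a.i, b.j}] += a.value * b.value;
                }
    // C[r] now holds the product blocks owned by rank r under the midpoint rule.
    return 0;
}
```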
This new approach, presented in the latest paper, allowed us to improve performance by an order of magnitude over our previous results.
In our papers, we report scalability for millions of atoms and tens of thousands of processes, with high parallel efficiency.
If you are eager to learn more, I encourage you to take a look at the slides that accompanied the talk and the references therein:
The posting is my own and does not necessarily represent IBM's positions, strategies or opinions.