Systems Research

Supporting MPI collective communication operations with application bypass

Principal Investigators: D.K. Panda, P. Sadayappan, P. Wyckoff
Duration: 5/6/2003 - 12/31/2003

Description: For large-scale parallel systems supporting MPI, it is desirable that the MPI implementation ensure 'progress' in order to achieve good performance and scalability [1]. Currently, collective communication operations in MPI are implemented with explicit send/recv calls issued by the participating processes. As a result, if a single node is delayed (say, an intermediate node of a broadcast, reduction, or barrier operation), the whole operation is delayed, leading to increased application execution time and limited scalability. Modern interconnects support new communication mechanisms such as remote memory operations (RDMA Read and RDMA Write). Similarly, modern NICs provide programmable interfaces and memory to support collective communication operations with minimal involvement from the host processors [2,3,4]. These advances allow communication operations to be implemented without explicit send/recv calls by the processes (see the sketches after the list below). This leads to the following two open challenges:

  1. Can we implement MPI collective operations with application bypass by taking advantage of RDMA operations and NIC-level support?
  2. How much performance benefit can be delivered to applications with this bypass property?
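To make the limitation concrete, here is a minimal sketch of a conventional binomial-tree broadcast built from explicit MPI_Send/MPI_Recv calls; the function name tree_bcast and the root-at-rank-0 convention are illustrative, not taken from any particular MPI implementation. A process that stalls in the forwarding loop delays every rank in its subtree, which is exactly the progress problem that application bypass aims to avoid.

#include <stdio.h>
#include <mpi.h>

/* Illustrative sketch: a binomial-tree broadcast from rank 0 built on
 * explicit point-to-point calls.  If an intermediate rank is delayed
 * before its sends, its entire subtree waits with it. */
static void tree_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank != 0) {
        /* The parent's rank is ours with the lowest set bit cleared. */
        int parent = rank & (rank - 1);
        MPI_Recv(buf, count, type, parent, 0, comm, MPI_STATUS_IGNORE);
    }

    /* Forward to children: ranks formed by setting one bit below our
     * lowest set bit.  A delay here stalls every descendant rank. */
    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask)
            break;
        if ((rank | mask) < size)
            MPI_Send(buf, count, type, rank | mask, 0, comm);
    }
}

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        value = 42;                       /* root supplies the payload */
    tree_bcast(&value, 1, MPI_INT, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, value);
    MPI_Finalize();
    return 0;
}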
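By contrast, the following sketch uses MPI-2 one-sided communication (MPI_Put over a window) as a rough software analogy for the RDMA Write mechanism mentioned above: rank 0 deposits data directly into each peer's exposed memory, and no peer posts a matching receive. The fence synchronization is still a collective call involving every process, so this is weaker than the NIC-level application bypass the project targets, but it illustrates remote memory operations completing without explicit send/recv pairs.

#include <stdio.h>
#include <mpi.h>

/* Illustrative analogy: MPI-2 one-sided communication as a stand-in
 * for interconnect-level RDMA Write.  Data lands in each target's
 * memory without any target-side receive call. */
int main(int argc, char **argv)
{
    int rank, size, value = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Every process exposes one int of its memory for remote writes. */
    MPI_Win_create(&value, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int data = 42;
        for (int peer = 1; peer < size; peer++)
            MPI_Put(&data, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
        value = data;                     /* root keeps its own copy */
    }
    MPI_Win_fence(0, win);                /* completes all puts */

    printf("rank %d got %d\n", rank, value);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}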