Why use RDMA?
In traditional TCP/IP networking, the typical flow of data is as
follows. First, the user application makes a request to the OS to
send or receive some data via a read() or write() call on a
socket. This causes a switch into kernel mode (a context switch),
where the data is copied from the application buffer to a kernel
buffer. The data is then processed by the usual TCP/IP stack. This
means two things: the CPU has to copy data between user and kernel
buffers, and the CPU must dedicate cycles to network processing.
Once the TCP/IP stack processing is complete, the CPU has to copy
the data from the kernel to the Ethernet NIC. These copies and the
network processing become prohibitively expensive as network speeds
increase. Today's CPUs are able to (barely) keep up with Gigabit
Ethernet, but choke when it comes to processing higher speed
networks. The fact is that network performance is improving faster
than CPU performance.
Basically the problem is twofold: the data copy and the protocol
processing, both of which are done by the CPU. If the CPU is busy
moving data and dealing with network processing, it is unable to do
any real computational work, so the overall productivity of the
system is severely degraded.
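The send and receive path described above can be sketched in Python.
This is a minimal illustration, not a benchmark. It assumes a
Unix-like system and uses socket.socketpair() as a stand-in for a
connection between two hosts, so the user/kernel copies happen but no
TCP/IP protocol processing is exercised. The function name
send_through_kernel is hypothetical.

```python
import socket


def send_through_kernel(payload: bytes) -> bytes:
    """Push payload through a connected socket pair and return what arrives."""
    # Two connected endpoints in one process (assumption: stand-in for two hosts).
    sender, receiver = socket.socketpair()
    try:
        # sendall() makes send system calls: the CPU enters the kernel and
        # copies the bytes from this user-space buffer into a kernel buffer.
        sender.sendall(payload)
        sender.shutdown(socket.SHUT_WR)  # tell the receiver no more data follows

        chunks = []
        while True:
            # recv() is another system call: the kernel copies the data from
            # its buffer into a new user-space bytes object.
            chunk = receiver.recv(65536)
            if not chunk:  # empty result means the sender has shut down
                break
            chunks.append(chunk)
        return b"".join(chunks)
    finally:
        sender.close()
        receiver.close()
```

Because both endpoints live in one thread here, the payload should be
kept small (a few kilobytes) so that it fits in the kernel socket
buffer. Every byte still crosses the user/kernel boundary twice, once
on the way in and once on the way out.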
One solution to this problem is what is known as a TCP Offload
Engine, or TOE card. A TOE card offloads the protocol stack
processing to the NIC so that the CPU does not have to deal with
it. While this is a big step forward,
it is only half the answer. The data is still being moved by the
CPU through the kernel, with multiple copies along the way.
This is where RDMA comes in. With special RDMA-based
interconnects such as InfiniBand, Myrinet, RapidArray, and iWARP, the
CPU is not involved in network data transfers. A user application
makes a request to read or write data on a remote host. Note that the
semantic with RDMA is reading and writing another host's memory,
rather than the sending and receiving semantics of TCP/IP. The RDMA
adapter takes the initiative to move data from the user's application
buffer to on-board memory. All protocol processing is handled
on board the adapter as well. The CPU is not involved at all in
moving data or in the network processing that must occur. On the
other side, data is moved directly from the adapter to the
application buffer. Again the CPU is not involved. Thus RDMA
addresses both sides of the problem.
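The difference in semantics can be shown with a toy, in-process
model. This is not a real RDMA API, and nothing here touches a
network or an adapter. The class ToyAdapter and its methods are
hypothetical names, and the idea that an application first registers
a buffer and hands out a key for it is an assumption of this sketch.
It only illustrates that the initiator names a location in the remote
application's memory, and that the remote application makes no
receive call.

```python
class ToyAdapter:
    """Hypothetical in-process stand-in for an RDMA adapter (not a real API)."""

    def __init__(self):
        self._regions = {}  # key -> memoryview over a registered buffer
        self._next_key = 1

    def register(self, buf):
        """Expose an application buffer for remote access and return its key."""
        key = self._next_key
        self._next_key += 1
        self._regions[key] = memoryview(buf)
        return key

    def write_remote(self, remote, key, offset, data):
        """Place data directly into the remote application's buffer."""
        src = memoryview(data)
        # Slice assignment stores straight into the registered buffer;
        # it raises ValueError if the write would run past the end.
        remote._regions[key][offset:offset + len(src)] = src

    def read_remote(self, remote, key, offset, length):
        """Fetch bytes directly out of the remote application's buffer."""
        return bytes(remote._regions[key][offset:offset + length])


# "Host B" registers a buffer and passes the key to "host A" out of band.
host_a, host_b = ToyAdapter(), ToyAdapter()
app_buffer_b = bytearray(16)
key = host_b.register(app_buffer_b)

# Host A writes into host B's memory; host B's application code does nothing.
host_a.write_remote(host_b, key, 4, b"RDMA")
```

After the write, the bytes are already sitting in app_buffer_b, with
no intermediate buffer owned by the receiving application.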
Rather than go into detail on how RDMA actually works, I will let
the graphs below make the argument for why to use RDMA. The data was
gathered by running the popular networking benchmark tool, NetPIPE.
The test systems are OSC's P4 InfiniBand cluster and the Cray XD1
at OSC Springfield. The Cray XD1 employs an InfiniBand-like
network known as the RapidArray. For more information on the
performance of the XD1, see ORNL's XD1 evaluation.
The RDMA lines represent the MPI version of NetPIPE, which takes
advantage of the RDMA capabilities of the network. The IPoIB and
IPoRA lines represent TCP running over the same high speed network;
in that case all data is moved by the CPU and processed like ordinary
TCP. In effect the 10 Gigabit InfiniBand and RapidArray adapters
become 10 Gigabit Ethernet NICs.
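For readers unfamiliar with how such numbers are obtained, here is a
minimal sketch of a ping-pong measurement in the style NetPIPE uses:
bounce a message of a given size back and forth, time the round
trips, and derive latency and bandwidth. It is not NetPIPE itself and
did not produce the graphs. The function names recv_exact and
ping_pong are hypothetical, and a local socket pair on a Unix-like
system is assumed in place of a real network link.

```python
import socket
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (a single recv may return fewer)."""
    buf = bytearray(nbytes)
    view = memoryview(buf)
    got = 0
    while got < nbytes:
        n = sock.recv_into(view[got:])
        if n == 0:
            raise ConnectionError("peer closed the connection")
        got += n
    return buf


def ping_pong(size, reps=100):
    """Return (one-way latency in seconds, bandwidth in Mbit/s) for one size."""
    # Both endpoints run in this one thread, so keep size small enough
    # to fit in the kernel socket buffer (a few kilobytes is safe).
    near, far = socket.socketpair()
    msg = bytes(size)
    try:
        start = time.perf_counter()
        for _ in range(reps):
            near.sendall(msg)                   # ping
            far.sendall(recv_exact(far, size))  # pong: echo the message back
            recv_exact(near, size)
        elapsed = time.perf_counter() - start
    finally:
        near.close()
        far.close()
    one_way = elapsed / (2 * reps)  # each repetition is one full round trip
    return one_way, (size * 8) / one_way / 1e6


if __name__ == "__main__":
    for size in (1, 64, 1024, 4096):
        latency, mbps = ping_pong(size)
        print(f"{size:5d} bytes  {latency * 1e6:8.2f} us  {mbps:10.2f} Mbit/s")
```

Small messages are dominated by per-message overhead, which shows up
as latency, while large messages show the achievable bandwidth.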

What we see here is that the red and blue RDMA lines have significantly
higher bandwidth than the purple and green TCP lines.

What we see in this graph is that the TCP lines, again green and
purple, have higher latency than the blue and red RDMA lines.

Just like in the previous graph, we see that the IP based
communication has much higher latency. The difference is that this
graph shows the big picture.
Based on this data it
should be clear that RDMA is the right choice.
Questions? Comments?
Email me. dennis AT osc.edu