Software implementation and testing of iWarp protocol
OSC iWarp Team: Dennis Dalessandro, Pete Wyckoff, and Ananth Devulapalli.
The evolving DDP and RDMA protocols, historically known as iWarp, provide for higher performance and lower overhead transfers over existing, widely deployed TCP/IP networks. While hardware implementations of iWarp are starting to emerge, there is a need for software implementations both as a transition mechanism and for testing.
iWarp is likely a made-up term (by one of the co-authors in a bar, if you believe the rumor). It is not to be confused with the late-1980s parallel computer project at CMU/CS that Intel later marketed. We use it here by tradition and to encompass a series of internet drafts and other documents, of which draft-ietf-rddp-rdmap-03.txt and draft-ietf-rddp-ddp-04.txt are the most relevant. DDP stands for "Direct Data Placement" and describes an upper layer protocol that can move received data to its final location without intermediate copies. It relies on a lower-level reliable point-to-point transport; the specification details how to implement DDP on TCP or SCTP. RDMAP stands for "Remote Direct Memory Access Protocol"; it builds on DDP primitives to provide services for reading from or writing to the memory of a remote machine.
We use the term iWarp in this proposal to cover the entire software stack between TCP at the bottom and the user application at the top.
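The direct data placement idea described above can be illustrated with a minimal C sketch. It is not taken from our source distribution: the structure names, field names, and the function `ddp_place` are all hypothetical, chosen only to show how a segment that names its target buffer (by a steering tag) and a byte offset can be written straight to its final location after a bounds check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: a user buffer the receiver has registered and advertised. */
struct ddp_region {
    uint32_t stag;  /* steering tag that names this buffer */
    uint8_t *base;  /* start of the user buffer */
    size_t   len;   /* size of the user buffer in bytes */
};

/* Hypothetical: the fields of one tagged segment that matter for placement. */
struct ddp_tagged_segment {
    uint32_t       stag;         /* buffer the sender is targeting */
    uint64_t       offset;       /* byte offset within that buffer */
    const uint8_t *payload;      /* segment data */
    size_t         payload_len;  /* number of payload bytes */
};

/*
 * Place one segment at its final location in the registered buffer.
 * Returns 0 on success, or -1 if the tag does not match or the write
 * would fall outside the buffer.
 */
int ddp_place(struct ddp_region *region, const struct ddp_tagged_segment *seg)
{
    assert(region != NULL && seg != NULL);

    if (seg->stag != region->stag)
        return -1;
    /* Written this way so the bounds check cannot overflow. */
    if (seg->offset > region->len ||
        seg->payload_len > region->len - seg->offset)
        return -1;

    /* memcpy stands in for whatever mechanism actually moves the bytes. */
    memcpy(region->base + seg->offset, seg->payload, seg->payload_len);
    return 0;
}
```

Because each segment carries its own tag and offset, segments can be placed independently of one another, which is what removes the need for an intermediate reassembly buffer in this simplified model.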
Why implement iWarp in software?
Although a software implementation of iWarp on a client should not be expected to reduce that client's processor utilization or improve its performance, it enables a server with a hardware iWarp implementation to realize those benefits. iWarp hardware devices work by first establishing a traditional TCP/IP connection, then negotiating whether a switch to remote direct memory access is possible and, if so, agreeing on parameters such as transfer credits. After successful negotiation, both the server and client can use their iWarp hardware to do most of the processing, including the TCP/IP stack, and to place incoming data directly into the user space of the communicating process. These features are often called OS bypass and zero-copy, respectively.
In the case that iWarp negotiation fails, neither side can take advantage of these features. All packets in the subsequent connection must transfer into the kernel TCP/IP stack for analysis and demultiplexing to the user application.
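The outcome of that negotiation can be modeled with a small C sketch. This is a deliberately simplified illustration, not code from our stack: the enum labels and the functions `negotiate_path` and `endpoint_offloads` are hypothetical, and the real negotiation also exchanges parameters that are ignored here.

```c
#include <stdbool.h>

/* Hypothetical labels for what an endpoint is able to speak. */
enum endpoint_kind {
    ENDPOINT_PLAIN_TCP,       /* sockets only, no iWarp stack */
    ENDPOINT_SOFTWARE_IWARP,  /* iWarp implemented in software on the host */
    ENDPOINT_HARDWARE_IWARP   /* iWarp-enabled NIC */
};

enum data_path {
    PATH_KERNEL_TCP,  /* negotiation failed: ordinary kernel TCP/IP stack */
    PATH_IWARP        /* negotiation succeeded: both ends speak iWarp */
};

static bool speaks_iwarp(enum endpoint_kind kind)
{
    return kind != ENDPOINT_PLAIN_TCP;
}

/* The connection switches to iWarp only if both ends can speak it. */
enum data_path negotiate_path(enum endpoint_kind a, enum endpoint_kind b)
{
    return (speaks_iwarp(a) && speaks_iwarp(b)) ? PATH_IWARP : PATH_KERNEL_TCP;
}

/* An endpoint benefits from offload only if it has hardware AND the
 * negotiation with its peer succeeded. */
bool endpoint_offloads(enum endpoint_kind self, enum endpoint_kind peer)
{
    return self == ENDPOINT_HARDWARE_IWARP &&
           negotiate_path(self, peer) == PATH_IWARP;
}
```

In this model a hardware-equipped server facing a plain TCP client gets no offload, while the same server facing a software iWarp client does, even though the client itself gains nothing.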
We have written a pure userspace library implementation (as well as a kernel-resident implementation) of an iWarp stack for deployment on clients as a transition mechanism to enable servers to take advantage of their iWarp hardware. The typical path to adoption of new hardware components such as iWarp-enabled NICs is to buy the expensive early cards for use in key machines such as file servers and web servers, then gradually upgrade less important machines as costs decrease. Eventually even old desktop machines will be replaced with new ones that have the new feature built directly onto the motherboard.
If its clients do not use an iWarp implementation, the server must fall back to the computationally more expensive kernel-based TCP/IP stack and multiple memory copies for communications with that client. It cannot take advantage of its iWarp NIC. With the proposed iWarp userspace library, codes that use it will encounter another layer of software and a few more function calls, but they will allow the server to offload its side of the processing entirely into hardware, providing better throughput and scaling as the number of clients grows.
Few iWarp hardware components are available on the market, so little work has been done with them in the high performance computing community (or in any other computing community). The impacts on applications, higher-layer protocols, system-area networks, wide-area networks and many other aspects have yet to be analyzed. By having a userspace implementation of iWarp, we can deploy the software on arbitrary machines at no cost and study such aspects at large scale. This advance planning is crucial both to influence evolving hardware designs and to plan for future adoption of high-speed protocol-offload NICs in our computing environments.
To take an example from a similar technology that is still rapidly evolving, although currently more mature than iWarp: InfiniBand changed and was changed by its user community during its development. The initial IB specification had no provision for a shared receive queue, but the latest specification does, owing to valid concerns voiced by potential users of large IB installations (such as high-performance computing cluster users). No vendor had bothered to design a subnet manager that would work sufficiently quickly to handle thousand-node machines. In the other direction, the rise of IB hardware led to a substantial redesign of the RDMA channel of the CH3 device in the MPICH2 source once the developers fully grasped the limitations on hardware-initiated retry and the time penalties of RDMA read versus RDMA write.
With our open-source implementation of the iWarp specification, testing of all sorts of applications and scenarios can proceed to give developers a head start on moving applications to iWarp devices.
Papers and Presentations
This first paper covers the userspace implementation and gives a good background on design issues and implementation choices, as well as performance numbers to validate the approach.
Design and Implementation of the iWarp Protocol in Software
Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff
Proceedings of PDCS '05, Phoenix, AZ, November 2005
Presentation Materials (PDF)
Abstract: The term iWarp indicates a set of published protocol specifications that provide remote read and write access to user applications, without operating system intervention or intermediate data copies. The iWarp protocol provides for higher bandwidth and lower latency transfers over existing, widely deployed TCP/IP networks. While hardware implementations of iWarp are starting to emerge, there is a need for software implementations to enable offload on servers as a transition mechanism, for protocol testing, and for future protocol research.
The work presented here allows a server with an iWarp network card to utilize it fully by implementing the iWarp protocol in software on the non-accelerated clients. While throughput does not improve, the true benefit of reduced load on the server machine is realized. Experiments show that sender system load is reduced from 35% to 5% and receiver load is reduced from 90% to under 5%. These gains allow a server to scale to handle many more simultaneous client connections.
The second paper covers the issues and performance related to the kernel module.
iWARP Protocol Kernel Space Software Implementation
Dennis Dalessandro, Ananth Devulapalli and Pete Wyckoff
Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium (IPDPS '06), Communication Architectures for Clusters Workshop, Rhodes, Greece
Presentation Materials (PDF)
Abstract: Zero-copy, RDMA, and protocol offload are three very important characteristics of high performance interconnects. Previous networks that made use of these techniques were built upon proprietary, and often expensive, hardware. With the introduction of iWARP, it is now possible to achieve all three over existing low-cost TCP/IP networks.
iWARP is a step in the right direction, but currently requires an expensive RNIC to enable zero-copy, RDMA, and protocol offload. While the hardware is expensive at present, given that iWARP is based on a commodity interconnect, prices will surely fall. In the meantime only the most critical of servers will likely make use of iWARP, but in order to take advantage of the RNIC both sides must be so equipped.
It is for this reason that we have implemented the iWARP protocol in software. This allows a server equipped with an RNIC to exploit its advantages even if the client does not have an RNIC. While throughput and latency do not improve by doing this, the server with the RNIC does experience a dramatic reduction in system load. This means that the server is much more scalable, and can handle many more clients than would otherwise be possible with the usual sockets/TCP/IP protocol stack.
Source Code Distribution
This distribution is code to implement the iWarp communication protocol stack in software. There are three major components, with a subdirectory for each:
iwarp: User-space implementation of RDMAP, DDP, and MPA layers.
kiwarp: Kernel-space implementation of RDMAP, DDP, and MPA layers.
verbs: Application programming interface (API) common to both iwarp and kiwarp.
OpenFabrics support is available for user space only.
Each of iwarp and kiwarp has a test subdirectory with low-level programs to test various features of the implementations. These tests do not use the verbs interface. Some simple applications that do use the verbs interface are in verbs/Benchmarks.
While this code is reasonably well tested, you may still encounter bugs that may, in the case of kernel-space iWarp, crash your machine. Treat with care.
Feel free to use and modify this code as you see fit, as long as such use follows the GPLv2 license terms. We would appreciate it if you sent fixes, changes, and additions back to us for release in later versions of this code so that others may benefit from your improvements.
About OpenFabrics Support
Versions 1.1 and beyond support the OpenFabrics API, but this is not an OpenFabrics stack. We merely provide wrapper functions around our existing verbs API that enable code written for OpenFabrics verbs to compile against our library. Please note that we do not include every OpenFabrics function; rather, we have implemented the most commonly used ones. We have been able to compile applications ranging from simple ping-pong benchmarks to our RDMA module for the Apache web server.
Download the source
Version 1.1 * Support for OpenFabrics API (3/16/07)
Version 1.2 * Support for real disconnection via RDMA CM functions. Supports communication with NetEffect and Chelsio RNICs (9/28/07)
*Note: Tested with user space library only.
Patch 1 Patch to fix some memory leaks by Oliver Roussel
Patch 2 MPA markers patch by Oliver Roussel
NOTE* Please send an email to firstname.lastname@example.org or email@example.com just to let us know who's downloading the code.