Systems Research

Software implementation and testing of iWarp protocol

OSC iWarp Team: Dennis Dalessandro, Pete Wyckoff, and Ananth Devulapalli.

Overview:

iWarp is likely a made-up term (coined by one of the co-authors in a bar, if you believe the rumor). It is not to be confused with the late-1980s parallel computer project at CMU/CS that Intel later marketed. We use it here by tradition and to encompass a series of internet drafts and other documents. In particular, draft-ietf-rddp-rdmap-03.txt and draft-ietf-rddp-ddp-04.txt are the most relevant.

DDP stands for "Direct Data Placement" and describes an upper layer protocol that can move received data to its final location without intermediate copies. It relies on a lower-level reliable point-to-point transport. The specification details how to implement DDP on TCP or SCTP.

RDMAP stands for "Remote Direct Memory Access Protocol" and provides services to read from or write to memory on a remote machine. It uses DDP primitives.

We use the term iWarp in this proposal to cover the entire software stack between TCP at the bottom and the user application at the top.

Why implement iWarp in software?

If iWarp negotiation fails, neither side can take advantage of these features. All packets in the subsequent connection must pass through the kernel TCP/IP stack for analysis and demultiplexing to the user application. We have written a pure userspace library implementation (as well as a kernel-resident implementation) of an iWarp stack for deployment on clients, as a transition mechanism that enables servers to take advantage of their iWarp hardware.

The typical path to adoption of new hardware components such as iWarp-enabled NICs is to buy the expensive early cards for use in key machines such as file servers and web servers, then gradually upgrade less important machines as costs decrease. Eventually even old desktop machines will be replaced with new ones that have the new feature built directly onto the motherboard.

If a client does not use an iWarp implementation, the server must fall back to the computationally more expensive kernel-based TCP/IP stack and multiple memory copies for communications with that client. It cannot take advantage of its iWarp NIC. With the proposed iWarp userspace library, codes that use it will encounter another layer of software and a few more function calls, but they will allow the server to offload its side of the processing entirely into hardware, providing better throughput and scaling as the number of clients grows.

Research vehicle

Consider InfiniBand (IB), a similar technology that is still evolving rapidly, although it is currently more mature than iWarp. During its development, InfiniBand both changed and was changed by its user community. The initial IB specification had no provision for a shared receive queue, but the latest specification does, due to valid concerns voiced by potential users of large IB installations (such as high-performance computing cluster users). No vendor had bothered to design a subnet manager that would work quickly enough to handle thousand-node machines. In the other direction, the rise of IB hardware led to a substantial redesign of the RDMA channel of the CH3 device in the MPICH2 source, once the developers fully grasped the limitations of hardware-initiated retry and the time penalties of RDMA read versus RDMA write.

With our open-source implementation of the iWarp specification, testing of all sorts of applications and scenarios can proceed, giving developers a head start on moving applications to iWarp devices.

Papers and Presentations

Design and Implementation of the iWarp Protocol in Software

Abstract:
The term iWarp indicates a set of published protocol specifications that
provide remote read and write access to user applications, without operating
system intervention or intermediate data copies. The iWarp protocol provides
for higher bandwidth and lower latency transfers over existing, widely deployed
TCP/IP networks. While hardware implementations of iWarp are
starting to emerge, there is a need for software implementations to
enable offload on servers as a transition mechanism, for protocol testing, and
for future protocol research.

The second paper covers the issues and performance related to the kernel module.

iWARP Protocol Kernel Space Software Implementation

Abstract:
Zero-copy, RDMA, and protocol offload are three very important characteristics of high performance interconnects.
Previous networks that made use of these techniques were built upon proprietary, and often expensive, hardware. With
the introduction of iWARP, it is now possible to achieve all three over existing low-cost TCP/IP networks.
Source Code Distribution
Each of iwarp and kiwarp has a test subdirectory with low-level programs to test various features of the implementations. These tests do not use the verbs interface. Some simple applications that do use the verbs interface are in verbs/Benchmarks.

While this code is reasonably well tested, you may still encounter bugs that, in the case of kernel-space iWarp, may crash your machine. Treat it with care.

Feel free to use and modify this code as you see fit, as long as such use follows the GPLv2 license terms. We would appreciate it if you sent fixes, changes, and additions back to us for release in later versions of this code so that others may benefit from your improvements.

About OpenFabrics Support
Download the source

Thanks to Sandia National Labs for supporting the initial work on the software iWarp project.
Dalessandro, Pete Wyckoff, and Ananth Devulapalli.