An Infrastructure for Network Development. Proof of Concept: Fast UDP

Edgar A. Leon, Ph.D. student in Computer Science, University of New Mexico
Michal Ostrowski, IBM T. J. Watson Research

Scientific applications demand tremendous computational capabilities. In the last few years, the computational power provided by clusters of workstations and SMPs has become popular as a cost-effective alternative to supercomputers. Nodes in these systems suffer from a variety of performance and scalability problems that may affect the applications running on them. Two of these problems are: (1) host network processing scales poorly with respect to other parts of the system, namely processor, bus, and link bandwidths (an effect that has become more evident as network speeds continue to increase); and (2) host overhead due to communication processing significantly reduces the processor time available for application work.

In recent years, network vendors have created network interface controllers (NICs) that can be programmed and that provide a significant amount of computational resources. These controllers allow computation and communication to overlap by processing communication tasks on the NIC and computational tasks on the host processor(s). Although in many cases processing on the NIC has been beneficial to application performance, it is not clear what the capabilities of these smart NICs should be and how they should integrate with the rest of the system. The interactions between applications, the operating system (OS), and the network are complex and should be studied further; the way in which these entities interact is a key factor in application scalability and performance.

To investigate these interactions, and to propose and study next-generation NICs, we have created an infrastructure in which network interface controllers can be simulated. Simulated NICs are based on a functional model that can run arbitrary functionality and may interact with the host in novel ways, such as injecting data directly into a processor's cache. As a proof of concept that this infrastructure meets our objectives, we extended the implementation of a simple unreliable communication protocol, the User Datagram Protocol (UDP), by applying three network optimizations. This new implementation of UDP, which we call Fast UDP, provides significant performance advantages over traditional UDP.

Our network infrastructure has been implemented in the "Mambo" architecture simulator. Although this system is not open source, we have created a "shim layer" that allows simulated network devices to be created and dynamically loaded without access to the Mambo source. Thus, our infrastructure can be used by any institution with access to Mambo binaries. An important component of the infrastructure is the definition of the functional abstraction that the NIC provides in hardware/firmware. By explicitly defining this API, NIC-independent code can be developed and potentially run on any NIC that implements this functional abstraction. Our simulated NICs can be accessed by the host's OS and by user processes through memory-mapped registers.
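To make this interface concrete, the sketch below shows how a host-side driver might view such a register bank and post a receive buffer to the device. The register names, the layout, and the nic_post_rx helper are hypothetical illustrations of the memory-mapped style of access, not the actual API of our simulated devices:

    #include <stdint.h>

    /* Hypothetical register bank; 'volatile' forces every access to
     * reach the device (in the simulator, the device model) instead
     * of being cached or optimized away. */
    struct nic_regs {
        volatile uint32_t status;    /* device state / interrupt cause  */
        volatile uint32_t control;   /* enable, reset, interrupt mask   */
        volatile uint64_t rx_desc;   /* physical address of rx buffer   */
        volatile uint32_t doorbell;  /* written to hand work to the NIC */
    };

    /* Post a receive buffer: write its physical address, then ring
     * the doorbell to notify the device. */
    static inline void nic_post_rx(struct nic_regs *regs, uint64_t buf_pa)
    {
        regs->rx_desc  = buf_pa;
        regs->doorbell = 1;
    }

Code written against such an abstraction depends only on the register-level API, which is what allows it to run on any NIC, simulated or real, that implements the same functional abstraction.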
To show the flexibility of our network infrastructure, we created Fast UDP, a high-performance implementation of UDP. Fast UDP is divided into code running on the host and code running on the NIC. The code running on the host is the unmodified Linux UDP/IP stack. The code running on the NIC implements three network optimizations: message matching on the NIC, splintering of data and control information from network packets, and NIC offloading.

In common UDP implementations, when a packet arrives from the network, the NIC copies the message to a kernel buffer and raises an interrupt for the host OS to handle. The OS processes the packet through the UDP/IP stack and finally copies the payload to the user-space buffer where the data is expected to arrive. In our approach, the NIC has been instrumented to partially process UDP packets so that the payload is transferred directly from the NIC to user space, while the header (control information) is copied to a kernel buffer. Thus the kernel remains aware of incoming network packets but does not incur the overhead of processing application data (including an extra copy to user space). This technique is called "splintering".

The message-matching semantics of UDP are based on an IP address and a port. In Fast UDP, message matching is also performed on the NIC. The OS shares information about posted UDP receive buffers with the NIC when a user posts a UDP receive. When a UDP packet arrives from the network, the NIC matches the packet using its destination port; if a user has posted a receive for that port, the payload is delivered directly to the user buffer. The UDP checksum is computed (offloaded) on the NIC to avoid transferring erroneous data to the user.

To compare a traditional UDP implementation with Fast UDP, we created a simple UDP application and measured its runtime. The application consists of two phases: receiving a number of packets, and then performing computation on the data. Fast UDP performed 5% better than UDP when the application spends 80% of its time computing. This performance improvement is expected to grow as the application's communication-to-computation ratio increases. The operating system we used to run our experiments is K42, a high-performance research OS.

In conclusion, we have created an infrastructure to simulate network interfaces that allows us to: (a) better understand recent and future network architectures so as to fully take advantage of their capabilities; (b) make a case for optimizations that improve application performance and scalability, and provide arguments against those that do not; and (c) better understand the interactions between the operating system, applications, and smart NICs to avoid bottlenecks in the data path from the network all the way to the application. As a proof of concept, we applied three network optimization techniques to improve application performance: matching on the NIC, NIC offloading, and splintering of control and data. We obtained significant performance improvements even for a computation-bound application.
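As a closing illustration of the proof of concept, the sketch below shows how the NIC-side receive path of Fast UDP can combine the three optimizations. All names (match_table, nic_rx_udp, verify_udp_checksum, raise_host_interrupt) are hypothetical and byte-order handling is omitted; this is a sketch of the technique, not our firmware as implemented:

    #include <stdint.h>
    #include <string.h>

    #define UDP_HDR_LEN 8
    #define MAX_PORTS   65536

    struct udp_packet {
        uint16_t src_port, dst_port;
        uint16_t length, checksum;
        uint8_t  payload[];            /* length - UDP_HDR_LEN bytes     */
    };

    struct posted_recv {
        uint8_t *user_buf;             /* buffer posted by the OS, or NULL */
        uint32_t capacity;
    };

    /* Match table shared by the OS with the NIC when UDP receives are
     * posted; indexed by destination port. */
    static struct posted_recv match_table[MAX_PORTS];

    /* Provided by the (hypothetical) NIC runtime. */
    int  verify_udp_checksum(const struct udp_packet *pkt);
    void raise_host_interrupt(void);

    /* memcpy stands in for the NIC's DMA engine in this sketch. */
    void nic_rx_udp(struct udp_packet *pkt, uint8_t *kernel_buf)
    {
        struct posted_recv *r = &match_table[pkt->dst_port];
        uint32_t payload_len = pkt->length - UDP_HDR_LEN;

        if (!verify_udp_checksum(pkt))   /* checksum offload */
            return;                      /* drop erroneous data on the NIC */

        if (r->user_buf && payload_len <= r->capacity) {
            /* Splintering: payload goes straight to user space, ...   */
            memcpy(r->user_buf, pkt->payload, payload_len);
            /* ... and only the header reaches the kernel, which stays
             * aware of the packet without touching application data.  */
            memcpy(kernel_buf, pkt, UDP_HDR_LEN);
        } else {
            /* No receive posted for this port: fall back to the
             * traditional path through the host UDP/IP stack.         */
            memcpy(kernel_buf, pkt, pkt->length);
        }
        raise_host_interrupt();
    }

The key property of this path is that the kernel still sees every header, preserving its view of the network, while payloads bypass the kernel entirely whenever a matching receive has been posted.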