An Infrastructure for Network Development. Proof of Concept: Fast UDP

Edgar A. Leon, Ph.D. student in Computer Science, University of New Mexico
Michal Ostrowski, IBM T. J. Watson Research

Scientific applications demand tremendous computational capabilities. In the last few years, the computational power provided by clusters of workstations and SMPs has become popular as a cost-effective alternative to supercomputers. Nodes in these systems suffer from a variety of performance and scalability problems that may affect the applications running on them. Two of these problems are: (1) host network processing scales poorly with respect to other parts of the system, namely processor, bus, and link bandwidths (an effect that has become more evident as network speeds continue to increase); and (2) host overhead due to communication processing significantly reduces the processor time available for application work.

In recent years, network vendors have created network interface controllers (NICs) that can be programmed and that provide a significant amount of computational resources. These controllers allow computation and communication to overlap by processing communication tasks on the NIC and computational tasks on the host processor(s). Although in many cases processing on the NIC has been beneficial to application performance, it is not clear what the capabilities of these smart NICs should be and how they should integrate with the rest of the system. The interactions between applications, the operating system (OS), and the network are complex and should be studied further; the way in which these entities interact is a key factor in application scalability and performance.

To investigate these interactions, and to propose and study next-generation NICs, we have created an infrastructure in which network interface controllers can be simulated. Simulated NICs are based on a functional model that can run arbitrary functionality and may interact with the host in novel ways, such as injecting data directly into a processor's cache. As a proof of concept that this infrastructure meets our objectives, we extended the implementation of a simple unreliable communication protocol, the User Datagram Protocol (UDP), by applying three network optimizations. This new implementation of UDP, which we call Fast UDP, provides significant performance advantages over traditional UDP.

Our network infrastructure has been implemented in the "Mambo" architecture simulator. Although this system is not open source, we have created a "shim layer" that allows simulated network devices to be created and dynamically loaded without access to the Mambo source. Thus, our infrastructure can be used by any institution with access to Mambo binaries. An important component of the infrastructure is the definition of the functional abstraction that the NIC provides in hardware/firmware. By explicitly defining this API, NIC-independent code can be developed and potentially run on any NIC that implements this functional abstraction. Our simulated NICs can be accessed by the host's OS and by user processes through memory-mapped registers.
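To make this interface concrete, the sketch below shows how a host-side driver might view such a register bank and post a receive buffer to the device. The register names, the layout, and the nic_post_rx helper are hypothetical illustrations of the memory-mapped style of access, not the actual API of our simulated devices:

    #include <stdint.h>

    /* Hypothetical register bank; 'volatile' forces every access to
     * reach the device (in the simulator, the device model) instead
     * of being cached or optimized away. */
    struct nic_regs {
        volatile uint32_t status;    /* device state / interrupt cause  */
        volatile uint32_t control;   /* enable, reset, interrupt mask   */
        volatile uint64_t rx_desc;   /* physical address of rx buffer   */
        volatile uint32_t doorbell;  /* written to hand work to the NIC */
    };

    /* Post a receive buffer: write its physical address, then ring
     * the doorbell to notify the device. */
    static inline void nic_post_rx(struct nic_regs *regs, uint64_t buf_pa)
    {
        regs->rx_desc  = buf_pa;
        regs->doorbell = 1;
    }

Code written against such an abstraction depends only on the register-level API, which is what allows it to run on any NIC, simulated or real, that implements the same functional abstraction.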
To show the flexibility of our network infrastructure, we created Fast UDP, a high-performance implementation of UDP. Fast UDP is divided into code running on the host and code running on the NIC. The code running on the host is the unmodified Linux UDP/IP stack. The code running on the NIC implements three network optimizations: message matching on the NIC, splintering of data and control information from network packets, and NIC offloading.

In common UDP implementations, when a packet arrives from the network, the NIC copies the message to a kernel buffer and raises an interrupt for the host OS to handle. The OS processes the packet through the UDP/IP stack and finally copies the payload to the user-space buffer where the data is expected to arrive. In our approach, the NIC has been instrumented to partially process UDP packets so that the payload is transferred directly from the NIC to user space, while the header (control information) is copied to a kernel buffer. Thus the kernel remains aware of incoming network packets but does not incur the overhead of processing application data (including an extra copy to user space). This technique is called "splintering".

The message-matching semantics of UDP are based on an IP address and a port. In Fast UDP, message matching is also performed on the NIC. The OS shares information about posted UDP receive buffers with the NIC when a user posts a UDP receive. When a UDP packet arrives from the network, the NIC matches the packet using its destination port; if a user has posted a receive for that port, the payload is delivered directly to the user buffer. The UDP checksum is computed (offloaded) on the NIC to avoid transferring erroneous data to the user.

To compare a traditional UDP implementation with Fast UDP, we created a simple UDP application and measured its runtime. The application consists of two phases: receiving a number of packets, and then performing computation on the data. Fast UDP performed 5% better than UDP when the application spends 80% of its time computing. This performance improvement is expected to grow as the application's communication-to-computation ratio increases. The operating system we used to run our experiments is K42, a high-performance research OS.

In conclusion, we have created an infrastructure to simulate network interfaces that allows us to: (a) better understand recent and future network architectures so as to fully take advantage of their capabilities; (b) make a case for optimizations that improve application performance and scalability, and provide arguments against those that do not; and (c) better understand the interactions between the operating system, applications, and smart NICs to avoid bottlenecks in the data path from the network all the way to the application. As a proof of concept, we applied three network optimization techniques to improve application performance: matching on the NIC, NIC offloading, and splintering of control and data. We obtained significant performance improvements even for a computation-bound application.
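As a closing illustration of the proof of concept, the sketch below shows how the NIC-side receive path of Fast UDP can combine the three optimizations. All names (match_table, nic_rx_udp, verify_udp_checksum, raise_host_interrupt) are hypothetical and byte-order handling is omitted; this is a sketch of the technique, not our firmware as implemented:

    #include <stdint.h>
    #include <string.h>

    #define UDP_HDR_LEN 8
    #define MAX_PORTS   65536

    struct udp_packet {
        uint16_t src_port, dst_port;
        uint16_t length, checksum;
        uint8_t  payload[];            /* length - UDP_HDR_LEN bytes     */
    };

    struct posted_recv {
        uint8_t *user_buf;             /* buffer posted by the OS, or NULL */
        uint32_t capacity;
    };

    /* Match table shared by the OS with the NIC when UDP receives are
     * posted; indexed by destination port. */
    static struct posted_recv match_table[MAX_PORTS];

    /* Provided by the (hypothetical) NIC runtime. */
    int  verify_udp_checksum(const struct udp_packet *pkt);
    void raise_host_interrupt(void);

    /* memcpy stands in for the NIC's DMA engine in this sketch. */
    void nic_rx_udp(struct udp_packet *pkt, uint8_t *kernel_buf)
    {
        struct posted_recv *r = &match_table[pkt->dst_port];
        uint32_t payload_len = pkt->length - UDP_HDR_LEN;

        if (!verify_udp_checksum(pkt))   /* checksum offload */
            return;                      /* drop erroneous data on the NIC */

        if (r->user_buf && payload_len <= r->capacity) {
            /* Splintering: payload goes straight to user space, ...   */
            memcpy(r->user_buf, pkt->payload, payload_len);
            /* ... and only the header reaches the kernel, which stays
             * aware of the packet without touching application data.  */
            memcpy(kernel_buf, pkt, UDP_HDR_LEN);
        } else {
            /* No receive posted for this port: fall back to the
             * traditional path through the host UDP/IP stack.         */
            memcpy(kernel_buf, pkt, pkt->length);
        }
        raise_host_interrupt();
    }

The key property of this path is that the kernel still sees every header, preserving its view of the network, while payloads bypass the kernel entirely whenever a matching receive has been posted.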