next up previous contents
Next: Key Ideas and Concepts Up: Using Kernel Extensions to Decrease the Previous: Contents

Introduction

 

A statement of the problem and why it should be solved [55].

Explicit message passing and various forms of one-sided communications (puts and gets) are efficient methods to harness the power of a distributed memory machine. Much effort and research is being spent on improving message passing latency and bandwidth for massively parallel systems. While this research is important, it does not address remote execution. Often it is necessary to perform a simple action, such as adding two numbers or incrementing a counter, when a message arrives. Most mechanisms that support remote execution have an overhead of ten to a hundred times the cost to deliver a simple message.

Collective communications, such as broadcast and global sum, need low latencies to send messages between nodes. On each node, however, it is important that the message be processed quickly and then sent on to other nodes. This is difficult for routines that allow user-defined operations. While common operations, such as global sum and global max, are often part of the native message passing system, user specified functions have to be executed in user space and the overhead of a context switch, to run the function, delays the global operation.

Another class of user-level systems that require remote execution of user (or library) specified code, are runtime systems. For example, Cilk's [9] runtime system achieves load balancing through work-stealing. An idle node with no work chooses another node at random and sends it a work-steal request. The ``victim'' node should respond quickly with new work, or a negative acknowledgment. The goals are to minimize the impact on working nodes and to get an idle node working again as quickly as possible.

Another example is Split-C [16], which uses remote put and get operations to transfer data from one node to another. To avoid deadlocks, allow for synchronization, and guarantee atomicity, small handlers that are part of the Split-C runtime system, have to be executed on most message arrivals. Split-C is usually implemented on top of an active message layer. The arrival of an active message triggers the execution of a handler that performs a small amount of work, such as incrementing a counter, and then sends a reply. It is crucial that this handler be invoked as soon as possible after the active message arrives.

In all three examples (collective operations, Cilk, and Split-C), the user application or a third party runtime system specifies a function to be executed when a message arrives. Response time can be reduced if these functions are invoked immediately when a message arrives. Several methods have been devised to address this need.

Active messages [100] transmit the address of a function to be executed on the remote node, along with a small amount of data passed as parameters to the function. Most implementations poll for incoming messages and then jump to the address specified in the message header. Handler invocation can be very fast. Single digit microsecond latencies are reported in the literature. Latencies are much higher if the remote node is busy with a long computation and is not polling at the moment the active message arrives.

Intel's NX message passing system for the Paragon [73, ] provides the functions hrecv() and hsend() to execute user defined handlers on message arrival or completion of a send. The implementation is interrupt driven. After calling hsend(), the application continues to run. When the send completes, that is, the data has been delivered to the remote node, the sending application is interrupted and the handler specified as a parameter to hsend() is run. Similarly, an hrecv() sets up a buffer and matching criteria for an incoming message and specifies a handler. When a message is deposited in the buffer, the receiving application is interrupted and the handler is run. The overhead to context switch to the handler is high compared to the cost of receiving a message (about tex2html_wrap_inline1841s versus tex2html_wrap_inline1843s).

In the Puma operating system [104, , ] a portal event handler can be attached to any Puma portal. The user specified handler is run after the message has been deposited in the portal. As with hsend() and hrecv(), this requires a relatively expensive context switch.

As these examples illustrate, methods that provide the necessary functionality exist. All of them have a performance impact on the receiving node and introduce significant delays in the propagation of messages to other nodes.

Consider the example of a broadcast where a single node sends information to all other nodes in the application. In a distributed memory architecture this can be done using a fanout tree. The originating node sends a message to one of its neighbors. The neighbor then passes the message to one of its other neighbors, while the originating node is copying the message to yet another node. This pattern continues until all nodes have received the message.

Each node has to receive the message and then send it on to the appropriate nodes in the fanout tree. Implementing a broadcast using basic point-to-point message passing operations has several drawbacks. If a node in the middle of the fanout tree has not completed its current task, its participation in the fanout will be delayed and the children of that node will have to wait, even if they are ready to receive the broadcast data. On architectures where the network interface can only be accessed in supervisor mode, the necessary trap into the kernel and back to user level further increases the cost for each message receipt and send.

Using (non-polling) active messages, Intel's hsend() and hrecv(), or Puma portal event handlers, the problem of delaying a broadcast by a busy intermediate node can be avoided. There is a cost associated with this. When a message arrives, a context switch from the currently running application to the the user specified handler and back to the application occurs. These interrupt driven context switches are expensive, especially on modern RISC CPUs which have to save and restore a large amount of internal context. Context switches disrupt cache and TLB contents and impact the currently running application.

The research proposed in this document explores ways to avoid these additional context switches and thereby improve the performance of runtime systems and other communication primitives that require user specified handlers.

More specifically, we will explore methods to execute untrusted user code in supervisor mode, while the kernel is receiving a message. If all desired user specified functions were known a priori, they could be built directly into the operating system kernel. On message arrival the kernel would then simply execute that function and, after that, return to the user level. On systems with a general purpose message processor, such as the Intel Paragon, the coprocessor could execute the handler without interrupting the user code that is running on the main CPU.

Of course, it is impossible to anticipate all possible user handlers. Therefore, an approach to let user applications insert code into the kernel at runtime is needed. Since this code is untrusted and executed in supervisor mode at the time a message arrives, precautions must be taken to ensure the integrity of the kernel and other applications.

In the next chapter we will look in detail at the cost of user level message handlers and the potential savings of running handlers in kernel mode. We will also look at methods to safely execute untrusted code in the kernel of an operating system. In Chapter 3 we discuss work that is related to this proposal. In Chapter 4 we characterize our proposed solution and establish the criteria to measure and compare our solution. Finally, in Chapter 5, we outline a statement of work and a schedule.


next up previous contents
Next: Key Ideas and Concepts Up: Using Kernel Extensions to Decrease the Previous: Contents

Rolf Riesen
Wed Jan 22 22:24:20 MST 1997