next up previous contents
Next: Statement of Work Up: Using Kernel Extensions to Decrease the Previous: Java

Proposed Solution

 

A statement or characterization of what kind of solution is being sought [55].

Among the four possible ways, outlined in Section 2.2, to safely introduce untrusted user level code into the kernel, we believe that a kernel embedded interpreter is the most efficient and best suited approach for event handlers. Therefore, the dissertation will attempt to prove the following thesis:

A kernel embedded interpreter is an effective way to decrease the latency of user-level communication primitives.

It should be possible to write the handlers in C (or any other high-level language). The handler is then compiled into what we call R-codegif. During code insertion, the kernel transforms R-code into direct threaded code. To avoid a parser and assembler in the kernel, we transform R-code into a binary representation before it is inserted into the kernel. The assembly of R-code into its binary representation can be done by the application prior to insertion into the kernel, or off-line when the application is compiled. The former has the advantage that the application can easily make runtime changes, for example to insert constants, such as the logical node number. It has the disadvantage that the R-code assembler has to be resident on the node. The latter method avoids the overhead of last minute assembly, but makes it harder to make modifications at runtime. The types of code and the transformations are shown in Figure 4.1.

  figure546
Figure 4.1: Code Transformations: From C code to threaded code.

R-code has to have several characteristics. Most importantly, it has to be possible to interpret it at speeds that are close to assembly code generated by standard C compilers. R-code and the interpreter for it have to be able to take advantage of certain Puma kernel characteristics, as well as exist within the limitations imposed by the kernel (we will discuss them in the next paragraph). Another objective for the design of R-code is that it should be relatively simple for a C compiler to produce efficient R-code for the handlers we discussed in Chapter 1. We expect R-code to evolve during this research, swinging between easy to compile to and fast to interpret.

The Puma kernel has a few unique characteristics that have to be taken into consideration when designing R-code. There is no demand paging under Puma. Therefore, all memory pages are always present. This gets around the problem of having to bring in a page for a message that is currently being received or sent. It means that the Puma kernel can always stream messages directly into user memory, with no need for buffering and costly memory copies. Puma portals are currently added to the Linux kernel. The assumption that the necessary pages are always in physical memory is no longer true in that environment. Our first implementation requires that memory which is overlayed by a portal, must be wired down. It has to be made unpagable before the Puma portal can be activated.

For the R-code interpreter this implies that it can be very simple and the individual handlers can be executed to completion on each message arrival. No blocking has to occur because of missing virtual memory pages. There will be a time limit for each handler invocation to prevent a handler from consuming all CPU cycles of a node in an infinite loop. While the Puma kernel is executing, all interrupts are disabled. In Puma it is therefore difficult to enforce a running time limit on code that is executing inside the kernel. For interpreted code it is easy however. The interpreter simply counts the number of instructions executed and aborts after a predetermined limit is reached.

The Puma kernel and its data structures are static in size. This means that the room to insert user handlers is of fixed size. The stacks of the virtual machine as well as the data segment of the handler are of fixed size. This should not pose any problems. The handlers and their local data will be small. Large handlers will also run longer and belong into user space.

And last but not least, the Puma kernel is non-blocking. There are no functions in the kernel that have to be suspended because a resource is busy. This means that handlers have to be given a certain (small) amount of time to run and then be aborted if that limit is exceeded. Most handlers will probably never reach that limit. Those that do, might be better implemented as up-calls to a user-level handler. We will determine the cross-over point, where it becomes more efficient to do an up-call rather than using a kernel extension, when we do our experiments.

Handlers should have access to the header of the message that triggered invocation of the handler, as well as the data delivered by the message. Aside from the data stack, there should be a fixed-size data area for each handler that allows storage of data between invocations of the same handler. Global sum operations are an example where this storage becomes important. On the first invocation the handler stores the value. On the second invocation it sums it with the newly arrived value and sends the sum to the parent in the fan-in tree.

Handlers should also have access to the memory owned by the process that installed the handler. This allows handlers to manipulate counters and set flags inside the application. In the C code for the handler these memory references should be marked as external. A good way of linking these references to actual memory locations in the running application has to be found.

Handlers that fault (on division by zero for example) or attempt unsafe operations (illegal instructions or illegal memory accesses) are terminated. A return code associated with the handler should indicate the error. For debugging purposes and performance tuning as much information as possible about the handler run should be made available to the application that inserted the handler. This could be achieved by having two interpreters in the kernel: A high-performance one, and another that logs information such as running time, register usage, external memory accesses, etc.

The interpreter has to be very fast. We currently plan a stack-based virtual machine which introduces the problem of mapping a stack onto a register file. There has been some research in this area and we intend to find a good solution for the R-code interpreter. If necessary, we may consider to change to a register-based virtual machine. At the present time this seems unlikely, though.


next up previous contents
Next: Statement of Work Up: Using Kernel Extensions to Decrease the Previous: Java

Rolf Riesen
Wed Jan 22 22:24:20 MST 1997