Advanced Operating Systems

http://www.cs.unm.edu/~darnold/classes/cs587-f09/
Office Hours: after class

Meta
class notes
reading notes
project
exams [1/3]
question
- difference between grafting and co-location?
- how can secure binding ensure good behavior after binding?
concepts / terms
overview

10	participation
25	homework
25	project
22.5	midterm
22.5	final

class notes

2009-08-25 Tue

same course path as undergraduate OS, just much more detail
all reading in research papers
groups
- all reading
- all work
- one day's lecture
research-lite project
- proposal in ~3rd week of Sept.
- 1-2 month working on implementation
- will produce a research-paper
undergrad lectures serve as good background for the course

2009-08-27 Thu

email server was down, try again or send email to Dorian

concepts

overview

OS

software providing access to hardware (cpu, memory, disk, IO)

policies: what user can do
mechanisms: how policies enforced

permission levels

often controlled by indicator bit

user
kernel

system call

allows access to kernel functions from user mode

syscall is made
parameters stored in registers
switch to kernel mode
execute routine defined in the kernel

virtual memory

virtual address space which can be mapped to actual memory. this allows the process using the memory to be loaded/unloaded/moved etc…

data/virtual-memory.png

if a page of virtual memory is not in physical memory a page fault occurs and the page is loaded into physical memory

working set / footprint

of a process is the parts of its address space currently in use. these are the pages that need to be resident in physical memory to avoid page faults.

design goals

efficiency

often a tradeoff between time and space

robustness

fulfills expectation of users

security

hardware interface

expose features/capabilities of hardware

user interface

present features/capabilities to user

portability

target hardware

economics

development cost, user base

scalability

range of supported hardware/user sizes/numbers

extensibility

ability to support new components

papers

keep in mind the context

users back then were developers
cpu used to be bottleneck now it's memory

increasing gap between CPU speed and the ability of memory and bandwidth to keep up.

bandwidth is proving the limit on the amount of memory which can be used efficiently

observations

hints

2009-09-01 Tue

Stevens, Ritchie and Thompson: Ritchie and Thompson developed unix, Stevens wrote the standard references on unix and TCP/IP

Stevens
- part of the unix team
- wrote the unix bible

2009-09-03 Thu

uni-programming: only one process at a time, typically it would run to completion
batch-programming: still uni-programming but you maintain a queue of processes that are ready to run
multi-programming: allows multiple processes to run "simultaneously" on the machine using preemption, time slicing and by utilizing different hardware components in parallel
time sharing: multi-programming with multiple users creating processes. these days it tends to make the most sense for large batch processes, rather than interactive use

Mechanisms for multi-programming

context switch: switching processes on a CPU
process table: maintained by the OS, this contains an entry including a process control block for each process currently running on the system
process control block: contains the PID, the files the process is using, the program counter, register values, a pointer to the image
image: when you are about to run a program, the loader creates an image of it and brings that image into memory. image+state=process. program -> image -> process

file systems

file: in general to the OS a file is just an uninterpreted raw ordered set of bytes, some specialized OSs do differentiate between file types for optimization.
directory: list of files, most OSs limit access to these files to system calls

mechanisms

filename relates to an index # which points to the index table which relates the index # to an i-node
when mounting a new disk the first couple of bytes on the disk contain the information used by the OS to populate the index table
generally corrupt disks are the results of damage to this meta data section on the front of the disk

links

hard link: a directory entry that points directly at the i-node of the file
soft link: the contents of the file is the name of the file to which it is pointing

deleting a file

the actual data isn't "erased", rather the link counter in the i-node is decremented, and if there are no more links, then the blocks on which the file is written are added to the free list
deleting a soft link doesn't change its target's i-node
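
the link-count behaviour above can be observed from user space. a minimal sketch, assuming a POSIX file system that supports hard and soft links; the helper name link_counts and the file names are made up for illustration.

```python
import os
import tempfile

def link_counts():
    """Return the i-node link count after each step, plus the surviving data."""
    counts = []
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "data.txt")
        hard = os.path.join(d, "hard.txt")
        soft = os.path.join(d, "soft.txt")
        with open(target, "w") as f:
            f.write("hello")
        counts.append(os.stat(target).st_nlink)  # one name for the i-node
        os.link(target, hard)                    # second name, same i-node
        counts.append(os.stat(target).st_nlink)
        os.symlink(target, soft)                 # soft link only stores a path
        counts.append(os.stat(target).st_nlink)
        os.remove(soft)                          # target's i-node is untouched
        counts.append(os.stat(target).st_nlink)
        os.remove(target)                        # decrements the link counter
        counts.append(os.stat(hard).st_nlink)
        with open(hard) as f:                    # blocks are still allocated
            survived = f.read()
    return counts, survived
```

the data stays readable through the remaining hard link because the blocks only go back to the free list once the counter reaches zero.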

2009-09-08 Tue

processes: unit of work user gives to the OS
thread: finer unit of work inside the process

processes

schedulers

batch queue

outside the OS, waiting jobs

short term scheduler

many different scheduling policies

round robin
priority
shortest-first

data/process-context.png

scheduling

scheduling policy

dispatcher

implements the scheduler policy

goals

timing

responsiveness: time to first response
waiting: total time spent waiting
turnaround: time start to finish

resource utilization

users don't really care about this

policies

fifo: first in first out
round robin: move around giving everyone time slices
shortest job first: optimal in theory, but needs job lengths in advance, so not used directly in real life
priority: not a complete policy, combined with fifo or round robin etc…
multi-level queues: semantics added to queues (i.e. system, interactive, batch, IO-bound, etc…)
multi-level feedback queues: jobs can change priority over time, based on things like increasing the priority of long-waiting jobs to avoid starvation
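
to make the timing goals concrete, here is a small sketch comparing fifo and round robin turnaround times. it assumes all jobs arrive at time 0 and that context switches are free; the function names and the example burst lengths are illustrative only.

```python
from collections import deque

def fifo(bursts):
    """Turnaround time of each job when jobs run to completion in order."""
    clock, turnaround = 0, []
    for burst in bursts:
        clock += burst
        turnaround.append(clock)
    return turnaround

def round_robin(bursts, quantum):
    """Turnaround time of each job when jobs take turns in fixed time slices."""
    remaining = list(bursts)
    turnaround = [0] * len(bursts)
    queue = deque(range(len(bursts)))
    clock = 0
    while queue:
        job = queue.popleft()
        run = min(quantum, remaining[job])
        clock += run
        remaining[job] -= run
        if remaining[job] > 0:
            queue.append(job)        # not finished, back of the line
        else:
            turnaround[job] = clock  # finished at the current time
    return turnaround
```

with bursts [5, 2, 1] the short jobs finish much earlier under round robin, while the long job finishes no later than under fifo.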

threads

multiple processes sharing an address space

each thread would have its own

UID
stack
registers
signal handling for things like C-c, segmentation violations, etc…

everything else is shared (text, static data, dynamic data)

threading schema

many-to-one (user-level threads)

don't really speed up execution, but could help to modularize a program

if one thread blocks for IO, all threads are blocked
thread operations (creation, deletion) are performed in user-space which makes them faster than having to do them in kernel space

one-to-one (kernel threads)

real parallelism
less portable
takes longer to create a thread

many-to-many (hybrid)

static: static mapping between user-threads and kernel-threads
dynamic: thread mappings can be changed
pool: you can establish a pool of kernel threads, then do all further thread operations in user-space (faster)
limit: you can limit the number of kernel threads to something reasonable (number of processors), reducing overhead on the OS and the number of user|kernel boundary crossings
complicated: this is the most complex of the threading schemas
fixed: most often you have a fixed number of kernel threads

inter process communication

through sharing or message passing. each of these could be implemented in terms of the other

in both cases

synchronous == blocking - asynchronous == non-blocking

message passing

µ-kernels can be thought of as using message passing
messages typically have to pass through kernels
can cross machine boundaries

shared memory

monolithic kernels can be thought of as using shared memories
typically faster than message passing
requires shared physical media
when two processes try to access a variable at the same time
1. i read
2. you read
3. i process and write
4. you process and write
5. we've both missed the other's changes
atomic operations, locks, semaphores (binary, counting), etc…
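
a sketch of the lost update above and of the lock-based fix. lost_update replays the five steps by hand (no real threads, so it is deterministic); locked_update uses Python's threading.Lock as one example of the listed mechanisms. both function names are made up for this note.

```python
import threading

def lost_update():
    """Replay steps 1-5: both sides read before either one writes."""
    shared = 0
    mine = shared        # 1. i read
    yours = shared       # 2. you read
    shared = mine + 1    # 3. i process and write
    shared = yours + 1   # 4. you process and write
    return shared        # 5. one of the two increments is gone

def locked_update(n_threads=4, n_increments=10_000):
    """Same increment, but the read-modify-write happens under a lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(n_increments):
            with lock:   # only one thread at a time in here
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```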

2009-09-10 Thu

talking about µ-kernels.

address space == process

L4 kernel has hierarchical address space

every process inherits address space from a parent, and the initial address space (sigma-0) maps directly to physical memory
which is like monolithic unix where everything descends from the init process.
manages address spaces

paging vs. contiguous allocation

contiguous allocation

has base and limit registers which are used to map virtual addresses to physical memory

paging

pages of memory map to arbitrary (not necessarily contiguous) frames of physical memory

more complicated translation from virtual to physical addresses
allows you to fill holes in physical memory (finer granularity because physical memory is eaten in page-size chunks rather than address-space chunks)
allows portions of the address space to be loaded individually, as opposed to contiguous allocation where the entire address space must be loaded before any execution can take place

1st vs. 2nd generation µ-kernels

second tailored more to hardware
second built from scratch (rather than adapted from monolithic kernels)

interrupts (µ-kernel is slower)

main reason µ-kernels are slower is because every interaction is translated through IPC which has to go through the table

monolithic kernel: hardware interrupts in a monolithic kernel are directly looked up in a register table
µ-kernel: one thread waiting for each potential interrupt source

top-half / bottom-half interrupts (in linux)

tradeoff between speed of handling interrupts, and need to do significant amount of processing in many cases

top-half

responds quickly and does what needs to be done immediately

for example it will just record that the interrupt occurred
has high priority and can interrupt other interrupt handlers
setup

bottom-half

does the actual bulk of the work

has lower priority
service

system call mechanisms

.so shared object: can be shared across multiple address spaces
.a static library: code is copied (statically linked) into each executable at link time
trampolines: jumps the code execution to somewhere else, then jumps back

scheduling

in L4Linux the normal linux scheduler is used much like a many-to-one thread scheduler. Maps all of the linux "user-threads" to a single kernel thread.

the L4 kernel is scheduled using hard priority round robin

L4 scheduling priority levels:

top-half interrupt handler
bottom-half interrupt handler
kernel (which is the linux server)
user

translation look-aside buffer (TLB)

Fast associative memory that helps in address translation

It maps a virtual address to a physical address. if you hit in the TLB, then you don't have to walk the page table.

tagged TLB: like a TLB plus information as to which process the address belongs to

in a normal TLB you have to flush all entries on a context switch to clear old mappings; in a tagged TLB you don't need to flush. this saves time when quickly switching between processes and back.

dual space mistake

tried to facilitate speedy kernel <-> user IPC through shared memory

space costs (doubling memory usage)
synchronization costs (takes time)

co-location

allows multiple processes to all have access to kernel memory, like threads

2009-09-17 Thu

Project Ideas

System monitoring stuffs
- DynInst
- KernInst
- PIN
- PAPI (Performance API)
massively parallel stuffs
- map-reduce
- hadoop, w/PIG
- MRNet large scale group operations
- RPC
- XML-RPC
- task-farming (programming model, issue tasks to the farm and collect results, example SETI-at-home)
threads
- end-to-end threading model
- deadlock prevention
file systems
- encrypted
process hijacking
project HAIL
FUSE (MAC-FUSE)
Amazon Dynamo

exokernels

secure bindings

pain to write to

the spirit of the exokernel is that you would normally use abstractions exported by the library OS rather than always having to use your own

µ-kernel vs. exokernel vs. monolithic-kernel

Monolithic kernel

+------------------------+
|   S1   S2     S3       |
|                        |   Monolithic kernel
|      S4    S5          |
+------------------------+
           ^              
          App

µ-kernel

+------+
|App   |                 +------+
|      |-\               | App  |
+------+  -\             |      |
      +-------------+    +-/----+
      |             |    /-
      |             |  /-
      |  u-kernel   |--
      +---/-----\---+
         /       -\
       /-          --\
      /          +---------+
  +--/-----+     | S2      |
  |S1      |     |         |
  |        |     +---------+
  +--------+

exokernel

+----+ +----+ +----+
|App | |App | |LOs |
+----+ +----+ +----+
+-------------------+
|    exokernel      |
+-------------------+
|    Hardware       |
+-------------------+

downside (cooperation)

when each application has direct access to the hardware it becomes difficult for applications to cooperate (intelligently share resources), which is routinely done in standard kernels.

2009-09-22 Tue

monitors

(see ../cse451/notes/2007-10-17)

software construct which provides for mutual exclusion around a resource. maintains the invariant that when entering a monitor there is no-one else inside of the monitor.

condition variables

allow communication between processes avoiding spinning (while(!condition);)

signal() alerts other processes
wait() sleep (relinquish cpu) and wait to be signaled

semaphore

semaphores are effectively equivalent to condition variables

sem.p() waits on the semaphore
sem.v() signals those waiting on the semaphore

Synchronization Problems

bounded-buffer
reader-writer-problem
dining-philosophers

Mesa Monitors

Mesa

programs comprised of modules
- clear API boundary between modules
  - public interface
  - private procedures

Mesa Monitors

monitor module
- entry procedures (public interface)
- internal procedures (private procedures)
- external procedures (procedures that require no locking)

Issues

when in a monitor (in function foo), and you call a function bar in another module, then during the execution of that function you are not in the monitor (bar has no access to the structures of the monitor as it is in another module)
- if you don't release the lock when moving into bar
  - you have the risk that something in bar tries to grab the resource protected by the monitor /deadlock/
  - you have to unwind and open locks if say there is a deep exception
- if you do release the monitor while calling bar you need to
  - ensure that you get the monitor back after executing bar
  - potentially do cleanup before/after executing bar
monitor granularity (one monitor per class vs. one per object) is a tradeoff between simplicity (per class) and efficiency (per object); the best option really depends on the use case. monitor per class is sort of a strawman
priority inversion: it is possible for lower-priority processes to run in front of higher-priority processes (here priority p1 < p2 < p3).
1. p1 acquires l1
2. p2 preempts p1
3. p3 preempts p2
4. p3 tries to acquire l1 (but can't because p1 holds the lock)
the problem is that p2 will run in front of p3, because p1 can't run and release the lock until p2 has run to completion

priority inheritance
associate a priority with a resource (lock) and the priority of that lock is set to the highest priority of those processes waiting for/on the lock. the priority of the process inside the lock is set to the lock's priority.
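
a toy sketch of the effect of priority inheritance on the p1/p2/p3 scenario. it only models one scheduling decision for a single lock; pick_next and its arguments are hypothetical names, and larger numbers mean higher priority.

```python
def pick_next(base_priority, lock_holder, lock_waiters, inherit):
    """Return the runnable process with the highest effective priority."""
    effective = dict(base_priority)
    if inherit and lock_waiters:
        # the lock takes the highest priority among its waiters ...
        lock_priority = max(base_priority[p] for p in lock_waiters)
        # ... and the process holding the lock runs at that priority
        effective[lock_holder] = max(effective[lock_holder], lock_priority)
    # processes blocked on the lock cannot be scheduled
    runnable = [p for p in base_priority if p not in lock_waiters]
    return max(runnable, key=lambda p: effective[p])

priorities = {"p1": 1, "p2": 2, "p3": 3}
without = pick_next(priorities, "p1", {"p3"}, inherit=False)
with_inheritance = pick_next(priorities, "p1", {"p3"}, inherit=True)
```

without inheritance the middle process p2 is chosen; with inheritance p1 is chosen, so it can release the lock and let p3 continue.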

Difference between Mesa and Hoare

in Hoare you are guaranteed that immediately upon signaling of a condition variable the waiting process will receive control, however in Mesa monitors the signal is more of a hint and you are not guaranteed to receive control when a signal is sent.
- Hoare
```
if(!cond){condition_v.wait()}
```
- Mesa (must re-check the condition after every wakeup)
```
while(!cond){timed_wait()}
```
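
Python's threading.Condition behaves in the Mesa style (a woken thread has to re-acquire the lock and may find the condition false again), so it can illustrate the while loop. a minimal bounded-buffer sketch; the class BoundedBuffer and the helper demo are invented for this example.

```python
import threading
from collections import deque

class BoundedBuffer:
    def __init__(self, capacity):
        self.items = deque()
        self.capacity = capacity
        self.cond = threading.Condition()  # monitor lock plus a wait queue

    def put(self, item):
        with self.cond:
            while len(self.items) >= self.capacity:  # re-check after wakeup
                self.cond.wait()
            self.items.append(item)
            self.cond.notify_all()  # a hint: waiters must re-check themselves

    def get(self):
        with self.cond:
            while not self.items:                    # re-check after wakeup
                self.cond.wait()
            item = self.items.popleft()
            self.cond.notify_all()
            return item

def demo(n=100):
    """Push n items through a small buffer with one producer, one consumer."""
    buf = BoundedBuffer(capacity=4)
    received = []
    consumer = threading.Thread(
        target=lambda: received.extend(buf.get() for _ in range(n)))
    consumer.start()
    for i in range(n):
        buf.put(i)
    consumer.join()
    return received
```

replacing either while with an if would be correct only under Hoare semantics, where the signaled process is guaranteed to run immediately.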
Mesa
timeout

abort

broadcast vs. signal

naked notify
allows hardware interrupt to signal a condition variable without first acquiring the monitor lock. this is more efficient than forcing a device driver to wait for a lock to be released before accessing a monitor.
- this could lead to a problem where a device signals that a resource is free, but the notification is missed by a process which is just switching from !cond to wait()
- note that this only allows the hardware interrupt to signal the condition variable, not to actually touch the resource

deadlock

requires

circular wait
mutual exclusion
no preemption
hold-and-wait

2009-09-24 Thu

scheduler activations

wherever their system does well they present the numbers in a table. when their system doesn't fare so well they embed the numbers in the prose.

virtual processors to real processors

how do virtual processors map to real processors

SMMP

shared memory multi-processors, many CPUs which all have access to a single big block of shared memory

normal user-level threads

may not really be that much faster than kernel level threads (at least not to the point that this paper claims)

data/normal-user-thread-setup.png

these user-level threads

in this paper when they say user-level threads they mean the following model

data/sched-act-user-thread-setup.png

kernel scheduling

priority levels and equal access for each priority level

lifo

when there are not enough processors to run all threads, then they follow a lifo policy to take advantage of cache locality (if I was running recently then my cache is still around)

critical section

need to be careful not to preempt a thread in a critical section (or at least let it get back quickly)

data/critical-section-address-space.png

when a guy in a critical section is preempted and other people are waiting for his lock (and they've pushed him down the lifo queue) then you could deadlock (as he can't get up the queue until they finish and they can't finish until he runs)

solution: make a copy of each critical section which ends in a jump to an upcall. when the kernel preempts a process the kernel checks if the code is in a critical section, and if so, it jumps to the copy which is guaranteed to jump to an upcall when the section completes

spin lock

burns CPU, but keeps a process on the ready list, good for short wait, or when you have processor to burn

upcalls

used for the kernel to talk to the user process

preemption
adding procs
blocking
unblocking

downcalls

when the user-space communicates to the kernel

more procs
less procs

2009-09-29 Tue

lottery scheduling

a proportional-share-scheduling system where each entry (waiting consumer/process) gets some number of tickets, and whenever a resource is to be consumed a lottery is held and the winner's ticket is taken and the winner is placed in control of the resource.
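
a minimal sketch of a single lottery and of the resulting shares, assuming ticket counts are plain integers with a positive total. hold_lottery and simulate are hypothetical names, and the seeded random.Random is only there to make runs repeatable.

```python
import random
from collections import Counter

def hold_lottery(tickets, rng):
    """Draw one winning ticket and return the client that holds it."""
    winning = rng.randrange(sum(tickets.values()))
    for client, count in tickets.items():
        if winning < count:   # the winning ticket falls in this client's range
            return client
        winning -= count      # skip past this client's tickets

def simulate(tickets, rounds, seed=0):
    """Count how often each client wins over many lotteries."""
    rng = random.Random(seed)
    return Counter(hold_lottery(tickets, rng) for _ in range(rounds))
```

over many rounds a client holding 75 of 100 tickets should win roughly three quarters of the lotteries, which is the proportional-share property.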

specifics

actually hold many lotteries at once forming a queue (rather than a lottery every time quantum)
processes can give their tickets to other processes (i.e. client server model, client could give tickets to server)
compensation tickets: given to processes that release the CPU before their time quantum has expired
more uniform stat distribution with more samples -> smaller quantum leads to more samples
tickets can be used for any resource

memory management
reverse lottery, when a page needs to be evicted from memory a lottery is held to select page to remove

resource containers

aimed at implementing a web server

relevant metrics for web server

client metrics
- response time
- throughput
server metrics
- number simultaneous clients
- quality of service, might want different levels for different clients

resource containers allow the application to define resource principals (containers) and tell the kernel how to assign resources to each one.

mechanism of resource containers

connection comes in and is wrapped in a resource container
thread handling that connection is bound to the resource container
additional resources (i.e. file descriptor) are bound to the resource container

this can be useful for handling malicious requests (i.e. if they're tagged as malicious on the way in they can be given little/no resources)

memory management

handling the speed/capacity tradeoffs of memory while maintaining

performance
protection
correctness

                /\
               /  \                |
speed ^       /    \      capacity v
      |      / reg. \
            /--------\
           /          \
          /  cache     \
         /--------------\
        /                \
       /   main memory    \
      /--------------------\
     /                      \
    /     local disk         \
   /--------------------------\
  /                            \
 /  cloud, remote disk, tape    \
----------------------------------

relocation

addressing schemas (w/static relocation)

source code: symbolic representation of memory addresses
compiled code: relative refs (e.g. module x + offset)
loaded code: absolute addresses

so to change where the code is located in memory you will generally need to reload the code. dynamically relocatable code has its absolute addresses resolved at runtime rather than at load time, so the code can be moved without reloading the code.

allocation

contiguous allocation

simple (base, limit, attr). makes context switches very simple (the kernel only needs to change the base and limit registers)

external fragmentation: may not have enough contiguous free space
sharing: can't share w/o sharing entire address space (no portions)
setting attributes: same as above, can't identify parts of the space
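
a sketch of base/limit translation for contiguous allocation, ignoring the attr field. the names translate and ProtectionFault are made up; real hardware does this check on every memory reference.

```python
class ProtectionFault(Exception):
    """Raised when an address falls outside the process's allocation."""

def translate(virtual_address, base, limit):
    """Map a virtual address to a physical one using base and limit."""
    if not 0 <= virtual_address < limit:
        raise ProtectionFault(hex(virtual_address))
    return base + virtual_address

# a context switch only has to swap this pair of register values
process_a = {"base": 1000, "limit": 64}
process_b = {"base": 5000, "limit": 128}
```

the same virtual address 10 lands at 1010 for process_a and 5010 for process_b, which is why the whole address space has to be shared or protected as a single unit.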

segmentation allocations

divide address space into segments of arbitrary size. segment number -> (base, limit, attr)

external fragmentation: because with variable length sizes there could be many free spaces which aren't big enough to be used

paging

(most popular) fixed size segmentation. this ensures that there is no external fragmentation (if there is any space available then it is page sized and can be used). this is still vulnerable to internal fragmentation

page table

data/memory-management.png

2009-10-01 Thu

2009-10-06 Tue

disco (implementation & performance)

OS modifications

drivers for DISCO specific "hardware"
changes to keep OS from trying to access a small chunk of unmapped memory
(small) allows the guest OS to request a 0'd page (so the guest OS doesn't have to re-0 a page)
(disco) interprets the guest OS going into low power mode as the OS yielding the processor

virtual memory

Multics

since segments are organized/structured as files they actually didn't have a file system. referencing a segment through its symbolic name is like referencing a file
seg.tag | address | opcode | external | addressing-mode

seg. tag
points to the base register of the owning segment

external
whether to use the segment tag (if external) or your own base register
address points to another address. happens when you have multiple levels of paging hierarchy.
- indirect address points to 2 36-bit words, the new segment number and the new word number
reference to external program
- symbolic name -> module name
- symbolic address -> function name or variable name
linking segments are added to each process to hold the lookup information for external segments. after an initial reference the number of the link in the linking segment is used for future references.

VAX

VMS addressing

2-bit seg. | 21-bit page number | 9-bit offset
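
a sketch of how an address in this 2 + 21 + 9 bit layout could be split and translated. the page tables here are plain dicts (segment -> page -> frame), which is a simplification and not how VMS stores them; split_address, translate and PageFault are hypothetical names.

```python
OFFSET_BITS = 9    # 9-bit offset, so pages are 512 bytes
PAGE_BITS = 21     # 21-bit page number

class PageFault(Exception):
    """Raised when the page has no mapping to a physical frame."""

def split_address(va):
    """Split a 32-bit virtual address into (segment, page, offset)."""
    offset = va & ((1 << OFFSET_BITS) - 1)
    page = (va >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = (va >> (OFFSET_BITS + PAGE_BITS)) & 0b11
    return segment, page, offset

def translate(va, page_tables):
    """Look up the frame for the page and re-attach the offset."""
    segment, page, offset = split_address(va)
    try:
        frame = page_tables[segment][page]
    except KeyError:
        raise PageFault(hex(va)) from None
    return (frame << OFFSET_BITS) | offset
```

only the page number is translated; the offset is carried over unchanged, which is what lets pages live in arbitrary frames.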

segments

system space, program region, control region

program region: user data for the program
control region: kernel data for the program

TLB

the TLB is split in two (system/process), less has to be flushed on context switch

2009-10-08 Thu

VM pros and cons

pros
- larger address space
- convenience in segmentation and paging
- code portability
cons
- (time overhead) increased effective memory latency
- (space overhead) maintaining mappings, page tables
- increased complexity

2009-10-20 Tue

disks and file systems (see related 481 slides on Dorian's homepage)

disks

disks: a stack of platters, each divided into concentric circles (tracks) of sectors, along with a movable arm (in all modern systems); there is one arm/head per platter. each platter (aside from top and bottom) has data on both sides.

data/disk-os-stack.png

file system

semantics on top of disks

abstractions

files
directories

handles

permissions
mapping abstractions to disk
enforcing resource quotas

2009-10-22 Thu

going through the midterm

Grade Distribution


	mean	med	max	max possible
p1	23	25	30	30
p2	11	12	14	15
p3	19	20	25	25
p4	8.7	10	15	15
total	61	61	84	85

Review of problems (in general on the exam less is more)

1. 1. exokernel, library OSs are linked into the application space of the application, so the getpid call is just a function call in the user-space which would not have to cross the user/kernel boundary, so this would be faster than the monolithic kernel
  2. protection, multiplexing, IPC

3) 1)

1. 1. it is much more complicated to move a process than to move a block of data. if you have many readers/writers of a block of data it may make more sense to move the users to the data rather than moving/replicating the data.
  2. in message passing structured IPC using copy-on-write can allow pointers to be passed from process->kernel->process rather than the actual block of data

2009-10-27 Tue

will discuss LFS and RAID on Thursday

LFS: wanted to improve performance and ended up improving filesystem reliability
RAID: vice versa

Network File System (NFS)

remote file access

pros
- larger file servers (capacity)
- sharing
- robustness / redundancy
cons
- speed (latency)
- availability
- consistency
- complexity
NFS specific goals
- 80% speed of local disk
- simple crash recovery
  - can repeat operations until success (idempotent). many operations are not naturally idempotent, for example the read operation read(f, out, nbytes) would normally advance an offset stored with the open file; in nfs this offset must be tracked on the client side and passed as a parameter to the server
- no state on the server
- transparent access
- preserve Unix semantics
Deployment Issues
- sharing the root file system
- scalability/performance sharing heavy use files (e.g. binaries required on startup)
  - made these files local to each individual node
- /tmp files (use the process ID, which wouldn't be unique across different nodes)
- /dev entries of this directory have local semantics which make no sense to access on a remote system
- authentication across machines (need a global system of user IDs, "Yellow Pages")
- concurrency: local locks but no global locks, so two users on different nodes could have their writes to a file interleaved.
- performance: (solution is always caching)
  - calls which occur often, but transfer small bits of data, (e.g. getattr which is called by ls, and pretty much every file access, this was initially 90% of the transactions) – so, they just cached attributes, this cache is invalidated every three seconds for files and thirty seconds for directories
  - used UDP (User Datagram Protocol, an unreliable transport), so if a packet in an RPC is lost they'd just redo the RPC
  - really big packets
    - read-ahead to try to get blocks before they're needed – this doesn't help for executables with random access patterns
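
the idempotent read from the crash-recovery goal can be sketched as follows. this is a toy model, not the NFS protocol: StatelessServer, Client and the duplicates parameter are invented, and files are just byte strings in a dict.

```python
class StatelessServer:
    """Holds file contents only; keeps no per-client offsets."""
    def __init__(self, files):
        self.files = files

    def read(self, name, offset, nbytes):
        # the same arguments always give the same answer, so a retry is safe
        return self.files[name][offset:offset + nbytes]

class Client:
    """Tracks the file offset locally and sends it with every request."""
    def __init__(self, server, name):
        self.server = server
        self.name = name
        self.offset = 0

    def read(self, nbytes, duplicates=3):
        data = b""
        for _ in range(duplicates):  # resend, as if earlier replies were lost
            data = self.server.read(self.name, self.offset, nbytes)
        self.offset += len(data)     # advance only once, on the client
        return data
```

because the offset travels with the request, repeating an RPC cannot skip or re-read data, and the server has nothing to rebuild after a crash.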

data/nfs-structure.png

VFS (virtual file system) abstraction on top of the specific file system used. allows file systems to be plugged in sort of like device drivers
XDR is used as a canonical data representation ensuring that when the client and server share objects (ints, arrays, etc…) they serialize their objects into bits in the same way (endianness, float representations, etc…)

2009-10-29 Thu

(if we are ever really interested in a paper we could lead that lecture)

disk failures

updates, 3 parts – related to disk failure

(D) data blocks
(F) free blocks
(M) meta-data blocks

disk failure part way through a write could lead to incoherence in the three above. most FS will perform the above in such a way that any inconsistency is a "functional" inconsistency – while space may be wasted everything will still "work".

some crash cases

(D) -> crash: no real problem, just wasted time writing to a block that's still on the free list
(D) -> (F) -> crash: leaked a data block that will not be recovered
(F) -> (M) -> crash: functional problem, file points to whatever was previously on disk (garbage or someone else's old data)

fsck checks that

all blocks not on free list are in use – referenced by an inode
all blocks referenced by an inode are not in the free list
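
the two checks can be sketched over an in-memory model of the disk. this is a simplification of what fsck does: the free list is a set of block numbers, inodes is a dict from inode name to block list, and check is a made-up function name.

```python
def check(total_blocks, free_list, inodes):
    """Return (leaked, conflicting) block numbers.

    leaked: neither on the free list nor referenced by any inode
    conflicting: on the free list and also referenced by an inode
    """
    referenced = set()
    for blocks in inodes.values():
        referenced.update(blocks)
    all_blocks = set(range(total_blocks))
    leaked = all_blocks - free_list - referenced
    conflicting = free_list & referenced
    return sorted(leaked), sorted(conflicting)
```

a leaked block only wastes space, while a conflicting block is the dangerous case because it could be handed out to a second file.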

journal/log-structured differences

journal – transactions in progress which can be used to recover from crash/failure
log structured FS – actually uses the log as the only structure on disk

RAID / LFS

writes are buffered in main memory until there is a segment's worth of data to write to disk. This allows the entire segment to be written w/o any seeks taking advantage of the disk's full bandwidth.

in RAID there is a slowdown factor of N when writing to N disks.

in LFS the checkpoints become the journal

RAID levels (5 and 1 are the only common levels)

0: striping across disks, no redundancy
1: straight mirrored disks, faster reads as you can read from both disks and whichever returns first wins (best case seek), for a write you have to wait for the write to complete on both disks (worst case seek)
2: Hamming code for ECC
3: Single check disk per group
4: Independent read writes
5: No single check disk (large performance increase over RAID level 4)

note: know the basic read/write operations for each level and be able to discuss the performance implications

2009-11-03 Tue

LFS and RAID

LFS

main point is the caching setup. user <-> cache <-> disk

RAID

don't need to know the names of the specific levels, but should be able to derive the mechanisms for reading/writing, as well as the implications speed/reliability for these mechanisms. RAID can be implemented in hardware or software. Be able to extend these concepts (e.g. RAID 7 is )

0

block-level striping

1

simple mirrored disks

read: could use either disk (faster), for a multi-block read each disk could serve up different blocks
write: will necessarily use both disks

5

block-level striping and distributed parity – parity is spread across all disks

read: will either only touch the specific disk which the block lives on, or will read all disks (including parity) and will reconstruct the data
write: must touch all disks, writes to the disk on which the data will live and to the parity disk and reads from the other disks to calculate the parity
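
a sketch of the parity arithmetic behind these reads and writes, assuming XOR parity over equal-sized blocks. it ignores striping and the rotation of parity across disks; xor_blocks and small_write_parity are hypothetical helper names.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def small_write_parity(old_parity, old_data, new_data):
    """Update parity from the old and new contents of one data block."""
    return xor_blocks(old_parity, old_data, new_data)

d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks(d0, d1, d2)           # full-stripe parity
rebuilt_d1 = xor_blocks(d0, d2, parity)   # recover d1 if its disk fails
```

small_write_parity shows the alternative to reading every other disk on a write: read the old data and old parity, then write the new data and new parity.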

blocks and sectors

block: software construct, typically will be equal in size to either a single sector or multiple sectors
sector: the smallest physically addressable unit of the disk

CODA

call backs are used in asynchronous operations, they alleviate the need for active probing. allows the server to alert the client when a change occurs – used in CODA for cache coherence

2009-11-05 Thu

general consistency

by and large message passing has beat out shared memory when it comes to distributed computing. MPI is the de-facto message passing standard for distributed memory; OpenMP is the shared-memory alternative.

typically there is no global clock

strong consistency: (called sequential consistency in Munin paper) any write is immediately visible to subsequent reads
causal ordering: uses communication between processes to determine a global partial ordering
weak consistency: this is not really ever used. makes no guarantees that writes will be visible to future reads
eventual consistency: write will eventually be seen
release consistency: requires data to be visible only at certain synchronization points (i.e. at release or barrier)

Munin

Munin – shared program variables are annotated with their access pattern which is used by the OS

data/dist-memory-arch.png

barrier: designate a point where you will wait at that point until every other thread gets to that point
split-phase barrier: two checkpoints, everyone can pass the first checkpoint arbitrarily, but no-one passes the second checkpoint until everyone has passed the first

Munin Annotations and Protocol Parameters


annotations	I	R	D	FO	M	S	FI	W
read-only	N	Y						N
migratory	Y	N		N	N		N	Y
write-shared	N	Y	Y	N	Y	N	N	Y
producer-consumer	N	Y	Y	N	N	Y	N	Y
reduction	N	Y	N	Y	N		N	Y
result	N	Y	Y	Y	Y		Y	Y
conventional	Y	Y	N	N	N		N	Y

Meanings of Parameters


I	invalidate or update
R	replicas allowed?
D	delay vs. immediate
FO	fixed owner?
M	multiple writers allowed?
S	stable sharing pattern?
FI	flush changes to owner
W	writable?

Non-functional performance enhancing objects

ability to map an object to a lock
ability to explicitly flush changes to an object

Implementation

maintained a hash table mapping object addresses to their attributes
copyset was a list of where (which processors) an object currently exists
delayed update queue (DUQ) to hold updates which will need to be propagated, generally held until a barrier and then sent to everyone in the object's copyset

question: why only use twins when there are multiple writers?

2009-11-10 Tue

Munin implementation

DHT or Distributed Object Directory
delayed update queue
- page twins: two copies of a page used to find out what the differences are between old/new versions of the page
distributed locks were effectively a queue, person at the front owns the lock and everyone else is further down the line.

page faults used to track updates

write protect pages that process would normally be able to write to
when the page faults, allow the write to go through but make a note of it and maybe update remote copies of the page
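A rough sketch of the twin idea, assuming a page is just a list of words. The helper names make_twin, diff_against_twin and apply_diff are hypothetical, not taken from the Munin paper.

```python
def make_twin(page):
    """Copy the page before the first write so changes can be found later."""
    return list(page)

def diff_against_twin(page, twin):
    """Return {offset: new_value} for every word changed since the twin was made."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}

def apply_diff(remote_page, diff):
    """Merge a diff into a remote copy, leaving all other words untouched."""
    for offset, value in diff.items():
        remote_page[offset] = value
```

Since only the changed words are sent, two writers that touched different words of the same page can both merge their updates into a third copy without overwriting each other.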

Quicksilver

transaction

collection of operations into a single atomic unit of consistency and recovery. techniques include…

locks
mutexes
semaphores
monitors
h/w instructions
interrupt disabling

commit protocols

some things to be considered as goals

atomicity
recovery semantics
minimize overhead
- blocking/sync
- logging overhead
- communication

two phase commit

coordinator and subordinates

transaction_begin
1
2
3
...
transaction_end

the coordinator
1. initiates the transaction
2. prepare message is sent to all subordinates
3. subordinates act and respond
4. send commit
the subordinate
1. upon receipt of prepare message the subordinates reply with either yes or no
2. no -> veto
3. or go to prepared state and update logs and respond yes
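The coordinator/subordinate steps above can be sketched roughly as follows. This is a single-process toy, assuming no message loss or crashes; the class and method names are made up, and the can_commit flag just stands in for whatever makes a subordinate veto.

```python
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # hypothetical stand-in for a local check
        self.state = "idle"
        self.log = []

    def prepare(self):
        """Phase 1: vote no (veto), or log, enter the prepared state, vote yes."""
        if not self.can_commit:
            self.state = "aborted"
            return False
        self.log.append("prepared")
        self.state = "prepared"
        return True

    def commit(self):
        self.log.append("committed")
        self.state = "committed"

    def abort(self):
        self.log.append("aborted")
        self.state = "aborted"


def two_phase_commit(subordinates):
    """Coordinator: collect votes, then commit only if every vote was yes."""
    votes = [s.prepare() for s in subordinates]      # phase 1: prepare
    if all(votes):
        for s in subordinates:                       # phase 2: commit
            s.commit()
        return "committed"
    for s in subordinates:                           # phase 2: abort
        if s.state == "prepared":
            s.abort()
    return "aborted"
```

A single veto is enough to abort every subordinate that had already prepared.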

2009-11-12 Thu

Quicksilver

locks used to make a monolithic unit out of a series of operations

short lock

would only be held for a single operation inside of a transaction

long lock

could be held for an entire transaction


	short locks	long locks
read
write

degree 0 consistency

short write lock and no read lock

cascading abort
dirty reads
non-repeatable reads

degree 1 consistency

long write lock and no read lock

dirty reads
non-repeatable reads

degree 2 consistency

long write lock, and short read lock

non-repeatable reads

degree 3 consistency

long write lock, and long read lock

locks in the context of their DFS (Distributed File System)

directories
- locks for renaming, creating, deleting
- write lock for dir.entries
- no read locks
files
- short read locks and long write locks

highlights (distinguishing features)

distributed OS using transactions for data consistency

wrapped applications in trivial transactions, so bad quit would remove all previous changes

in order to share a transaction with another process you would need to fork that process

Cluster Based Scalable Network Services

advantages

small unit of fault -> robust
scalable
cost effective

BASE

Basically Available
Soft state
Eventual consistency

Condor is another system that finds idle machines and sends them work as work accrues

implementation

components of the system

front end
- http server
- thread pool
workers
- to provide services
- to hold the results of computation
- report failed services to the manager
manager
- calculates load and sends requests to the front-end
- receives failure reports from workers

failure peers vs. failure pairs

failure peers: manager watches front-end and restarts if it crashes and vice versa
failure pairs: more generally called hot backups where each component has a backup which can take over if one fails

2009-12-01 Tue

cover CFS and do Map-Reduce on Thursday, presentations starting next week

final

let's try to do a final-review outside of class
final will sprinkle questions over the first half, but will focus on the second half

project

paper is due at the end of next week 2009-12-11 Fri 22:49
10-12 minutes per group – 8-10 slides

CFS

cfs-reading

lookup: (finger table and successor list) following only the successor list was slow because on average you would have to touch half the servers in the system, so the finger table was added to store the IDs of far-away nodes for quick jumps to distant portions of the circle.
caching & timeout
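A small simulation of the lookup, assuming a 6-bit identifier ring and global knowledge of the node list (a real node only knows its own fingers and successors). All names here are illustrative; the finger rule finger[i] = successor(n + 2^i) follows the usual Chord definition.

```python
M = 6                 # identifier bits (assumed for the example)
RING = 2 ** M

def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b], wrapping past zero if needed."""
    x, a, b = x % RING, a % RING, b % RING
    if a < b:
        return a < x <= b
    return x > a or x <= b

def successor(node_ids, key):
    """First node clockwise from key (node_ids must be sorted)."""
    key %= RING
    for n in node_ids:
        if n >= key:
            return n
    return node_ids[0]

def build_fingers(node_ids, n):
    """finger[i] is the successor of n + 2**i."""
    return [successor(node_ids, n + 2 ** i) for i in range(M)]

def lookup(node_ids, start, key, use_fingers=True):
    """Return (node responsible for key, number of hops taken)."""
    node, hops = start, 0
    while True:
        succ = successor(node_ids, node + 1)
        if in_interval(key, node, succ):
            return succ, hops
        nxt = succ                       # fallback: walk the successor pointer
        if use_fingers:
            # take the farthest finger that still strictly precedes the key
            for f in reversed(build_fingers(node_ids, node)):
                if f != key % RING and in_interval(f, node, key):
                    nxt = f
                    break
        node, hops = nxt, hops + 1
```

With nodes [1, 8, 14, 21, 32, 38, 42, 48, 51, 56], looking up key 54 from node 8 takes 2 hops with fingers (via 42 and 51) and 7 hops when only successor pointers are followed.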

2009-12-03 Thu

map reduce

stream programming: a collection of filters which the data passes through

       +---+
       | F |
       +---+
      /-   -\
     /       -\
   /-          -\
+-+              +---+
|F|              | F |
+-+              +---+
   -\         /--
     -\     /-
       +---+
       | F |
      /+---+\
   /--       ---\
+-+             +---+
|F|             | F |
+-+\           /+---+
    \         /
     \       /
      \     /
       +---+
       | F |
       +---+

data/google-map-reduce.png

consistency: can handle failures in workers (just aborts if the master happens to fail) by repeating the computation for failed workers. this means that the worker tasks can happen multiple times – so they must be idempotent (repeating a task must not change the result, e.g. because it is side-effect free). also the computation would need to be deterministic for the re-doing of failed nodes to have no effect.
backup tasks: only as fast as your slowest worker – so as workers finish the unfinished tasks are duplicated to idle workers in the hopes that someone new will finish the task earlier
combiner function: can be run on the local map worker to compact the data before it is sent off to be reduced
skipping bad records: when some records continually cause workers to fail then they will be skipped
local execution: ideally workers will be selected which are close to the data which they will be analyzing
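A toy, single-process word count to illustrate the combiner, assuming each input chunk is handled by one map worker. The function names and the sent counter (number of pairs that would cross the network) are made up for this sketch.

```python
from collections import Counter, defaultdict

def map_words(chunk):
    """Map: emit (word, 1) for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]

def combine(pairs):
    """Combiner: pre-sum counts on the map worker before they are sent off."""
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return list(totals.items())

def reduce_counts(word, counts):
    """Reduce: sum all partial counts for one word."""
    return word, sum(counts)

def map_reduce(chunks, use_combiner=True):
    shuffled = defaultdict(list)     # word -> partial counts seen by the reducer
    sent = 0                         # pairs shipped from map workers to reducers
    for chunk in chunks:
        pairs = map_words(chunk)
        if use_combiner:
            pairs = combine(pairs)
        sent += len(pairs)
        for word, n in pairs:
            shuffled[word].append(n)
    result = dict(reduce_counts(w, ns) for w, ns in shuffled.items())
    return result, sent
```

The map and reduce functions are deterministic and touch no outside state, so re-running a chunk after a worker failure gives the same pairs.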

reading notes

readings

both are truly distributed operating systems, in contrast to most of today's large distributed systems, which have node-local OSs with a global managing agent.

Network Services

Fox97TACC.pdf

QuickSilver

Schmuck91QuickSilver.pdf

Munin DSM

Carter91Munin.pdf

CODA

Kistler91Coda.pdf

RAID

Patterson88RAID.pdf

NFS network file system

Sandberg85NFS.pdf

log file system

Rosenblum91LFS.pdf

fast file system

McKusick84FFS.pdf

Old FS: (order on disk)

superblock
inode blocks: direct (first 8 blocks) vs. indirect blocks
data blocks: size (initially 512 then up to 1024)

issues with this setup

inodes not located near the data, so many non-contiguous jumps
issues with fragmentation
didn't take advantage of the structure of the disk (too much random access of the file)

New FS:

collocated inode and file data (in the same cylinder group)
replicate the superblock information across all cylinder groups (reliability)
variable block sizes (4k block size has average 2k internal fragmentation)
- split each block into anywhere from 1-8 fragments (powers of two) and managed free space on a fragment (rather than block basis). this can incur bookkeeping and overhead problems (as a file increases in size it may need to be continually copied between fragments and blocks).
exploit h/w characteristics by adjusting the notion of "contiguous" based on the speed with which the disk can move between segments
collocate directories and files
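A back-of-the-envelope sketch of the block/fragment arithmetic, assuming a 4096-byte block split into 4 fragments of 1024 bytes. The function name and return shape are made up; a real allocator also has to find free fragments inside a partially used block.

```python
def ffs_allocation(file_size, block_size=4096, frags_per_block=4):
    """Return (full_blocks, tail_fragments, wasted_bytes) for a file."""
    frag_size = block_size // frags_per_block
    full_blocks, tail = divmod(file_size, block_size)
    tail_frags = -(-tail // frag_size)          # ceiling division
    if tail_frags == frags_per_block:           # tail needs a whole block anyway
        full_blocks, tail_frags = full_blocks + 1, 0
    allocated = full_blocks * block_size + tail_frags * frag_size
    return full_blocks, tail_frags, allocated - file_size
```

For an 11000-byte file this gives 2 full blocks plus 3 fragments with 264 bytes wasted, versus 1288 bytes wasted when the tail must occupy a whole block (frags_per_block=1).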

VM in Multics

Daley68MULTICS.pdf

goals

provide the user with a large virtual memory, hiding the movement of data between storage levels and any machine-dependent details
allow procedures to be called by name w/o any need to plan for the storage of the called procedure
permit sharing of procedures and data among users subject only to permission restraints (vital to efficient operation in a multiplexed system)

process, address space

processes and address space stand in a one-to-one correspondence

address space is composed of variable length segments, each segment is either data or procedure, which affects its access permissions.

segments are addressed using a directory structure similar to files.

addressing

generalized address

consists of a segment number and a word number

address formation

based on values of processor registers, different for procedure/data segments

procedure: segment number from the procedure base register + word number from the program counter
data: if the external flag of the instruction is on, the segment tag selects a base register which supplies the segment number. otherwise the segment number is taken from the procedure base register

indirect addressing

in this case the generalized address is used to fetch two 36-bit words, these are combined to form another generalized address. can be nested

descriptor segment

generalized-address -> main-memory is done using a two-step hardware lookup

paging

of segments allows non-contiguous segments of main memory to be referenced as logically contiguous generalized addresses
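A rough sketch of the two-step lookup with paged segments, modelling the descriptor segment and the page tables as dictionaries. The page size of 1024 words and all names are assumptions for illustration, not the hardware formats from the paper.

```python
PAGE_WORDS = 1024   # assumed page size in words, for illustration only

def translate(descriptor_segment, segment_number, word_number):
    """Generalized address (segment, word) -> main-memory word address."""
    # step 1: the descriptor segment locates the page table of the segment
    page_table = descriptor_segment[segment_number]
    # step 2: the page table locates the frame holding the addressed page
    page, offset = divmod(word_number, PAGE_WORDS)
    frame = page_table.get(page)
    if frame is None:
        raise LookupError("missing-page fault")
    return frame * PAGE_WORDS + offset
```

Consecutive word numbers in a segment can land in unrelated frames: with page 0 in frame 7 and page 1 in frame 2, words 1023 and 1024 of the segment are far apart in main memory.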

intersegment linking and addressing

shared access and building upon others addresses are both important goals of multiplexed machines

requirements

pure procedure segments execution can't change their content
symbolic procedure calls without making prior arrangement for the procedure's use
segments of procedure invariant to recompilation of other segments

implementation

making a segment known: when the segment is called by symbolic name it is added to the caller's descriptor segment and can later be referenced by number
linkage data: a procedure's code must be invariant to recompilation of other segments, so it will always use a segment's name/path to address it. after the segment is known, its number can be used. a linkage segment will hold the information on name/path -> number transformations so that the numbers can be used for known segments w/o changing the contents of the procedure

VM in Vax

Levy82VAX-VMS.pdf

process & virtual address space

page number and offset within the page

address space divided into spaces (not segments)

system space

high-address half is system space and is shared across all processes. This contains OS stuff, executive code and protected data.

process space

low-address half (for the process)

program region (P0): low-address half of process space. contains the user's executable program. first page is reserved to cause errors on 0-address references
control region (P1): high-address half of process space. this region is used to hold process-specific data

each space/region has its own page table

system space page table: in hardware, not swapped on context switch
process tables: in the system-space, are swapped on context switch
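A small sketch of how a 32-bit VAX virtual address splits into region, page number and offset. It assumes the standard VAX layout (512-byte pages, top two address bits selecting the region, with the top quarter of the address space reserved); the function and dictionary names are made up.

```python
PAGE_SIZE = 512   # bytes per VAX page

REGIONS = {
    0b00: "P0",        # program region
    0b01: "P1",        # control region
    0b10: "system",    # system space, shared by all processes
    0b11: "reserved",
}

def decode(vaddr):
    """Split a 32-bit virtual address into (region, virtual page number, offset)."""
    region = REGIONS[(vaddr >> 30) & 0b11]      # bits 31:30
    vpn = (vaddr >> 9) & ((1 << 21) - 1)        # bits 29:9
    offset = vaddr & (PAGE_SIZE - 1)            # bits 8:0
    return region, vpn, offset
```

The region selects which page table to use, and the virtual page number indexes into that table.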

memory management

paging issues

effect of heavy pagers on other processes
high cost of startup/restart (a process faulting its way into main memory)
increased disk workload of paging
processor time searching page lists

pager and swapper

pager: OS procedure resulting from page fault
swapper: separate process which moves pages into/out-of memory

dealing with the above issues

the pager deals with this issue by evicting pages from the process which is requesting the new page, so one process won't push out everyone else's pages. also a limit is placed on the number of pages a process can have in memory.
the above helps with this as well
the VAX clusters the reading and writing of pages to relieve I/O burden on the disk
by not having a reference bit (used to mark recently used pages) the VAX system takes load (scanning page tables and setting these bits) off of the processor

when pages are removed they are placed on the free page list or the modified page list depending on their modified bit, i.e. on whether they need to be written back to disk. these lists serve as physical caches for recently removed pages (it is quick to move a page from one of these lists back into the working set).

by caching the modified pages in the modified page list the following four speedups are gained.

caches pages for quick return to the process
clustered writes (~100 pages on the development system)
arranged on paging file so clustering read is possible
many page writes are avoided entirely

additional structures

demand zero: when processes require new pages they are created and filled with zeros on demand
copy on reference: when multiple processes are using a page, a process gets its own private copy the first time it references the page

program control of memory

for real-time programs that need explicit memory control

expand its P0 or P1
increase its resident set size
lock (or unlock) pages in its resident set
create/map sections into its address space
record its page-fault activity

lottery scheduling

Waldspurger94Lottery.pdf

(not required reading)

resource containers

Banga99ResourceContainers.pdf

(not required reading)

Scheduler activations

Anderson92SchedulerActivations.pdf

introduction

user threads vs. kernel threads

user threads

requires no kernel intervention
fast (on order of procedure call)
flexible
each thread runs on a "virtual processor" which still has to be multiplexed onto a real processor and interleaved with system calls, and kernel stuff leading to a performance hit
can exhibit poor performance or incorrect behavior when threads perform I/O

kernel threads

directly maps each application thread to a physical processor
heavy weight
not as restricted (RE: side effects, I/O)

the goal of this paper is to combine user/kernel threads

common case (no kernel required) perform as user threads
acts as kernel threads when needs to talk to kernel
easily customizable
difficulty is that relevant information is scattered between kernel space and user address space

the approach described in this paper is to give each user-level thread system its own virtualized machine which can have any number of processors.

problems w/user threads over kernel threads

kernel threads must implement anything that any reasonable user-level thread system may need (too much overhead)
when a user-level thread blocks (for I/O, fault, etc…) its kernel thread also blocks
if we create more kernel threads than there are processors then the OS must make scheduling decisions without any information about the priority / current-task / importance of the related user-level threads

design (scheduler activations)

each user-level thread system gets its own virtual multiprocessor

kernel gives processors to user thread systems
user thread system has complete control over use of its virtual multiprocessor
user thread system can tell kernel when it needs more (or fewer) processors
user thread system only talks to kernel when it needs to
looks to the application programmer like they are using kernel threads
scheduler activation: the communication (upcall) from the kernel to the user-level thread system, which may cause it to reconsider its scheduling decisions.
- roles
  - serves as the vessel or context of the user-level thread
  - notifies user-level thread of kernel event
  - stores user-level thread when it's blocked (e.g. for I/O)
- when a thread is stopped
  1. the kernel stuffs it into its activation
  2. creates a new activation to tell the thread system that the thread has been stopped
  3. the thread system removes the thread, and tells the kernel the activation can be re-used
  4. the kernel does another upcall giving the newly released scheduler activation (processor) to the thread system to run a new thread on
  file:data/scheduler-activations-upcalls.pdf
- there are always as many running activations assigned to an address space as there are processors assigned to it
- in the same manner processors are moved from one address space (thread system) to another
how user-level thread systems keep the kernel informed about their amount of parallelism
- inform kernel when more threads than processors
- inform kernel when more processors than threads
when a thread is interrupted while in a critical section
1. the kernel makes an upcall informing the address space that the thread's processor is ready
2. this upcall is intercepted and given to the thread until it is out of its critical section
3. the thread is then put back on the ready queue and the address space is free to respond to the new processor however it sees fit

implementation

implemented by tweaking

Topaz: the native kernel threads for the firefly machine
FastThreads: a user-level thread package

performance

same order of magnitude as plain user-threads
upcall performance is slow, much slower than normal kernel thread operations
- written on top of existing kernel thread library (not from scratch)
- written in higher level language (not carefully tuned assembly)
N-body problem
- speedup with more processors
  - some increase over fast-threads
  - significant increase over kernel threads
- more robust than fast-threads to lower amounts of memory

related ideas

psyche and symunix are both NUMA OSs which provide virtual processors similar to activation contexts.

differences

both psyche and symunix provide for shared address space between kernel and thread systems
neither provides the exact functionality of kernel threads (for I/O etc…)
neither provides efficient system for user-level thread system to notify kernel when it's hungry

summary

combine the performance of user-level threads with the functionality of kernel-level threads. this is done by supplying each user-level threading system with a virtual multiprocessor in which the application knows exactly how many processors it has at any one time (and each processor maps to an actual physical processor)

processor allocation (between applications) is done by the kernel
thread scheduling is done by address space
kernel notifies address space of events affecting it
- new processor
- one fewer processor
- preempted thread
address space notifies the kernel if it needs more/less processors

Monitors (2)

Monitors: An OS structuring concept

Hoare74Monitors.pdf

a monitor is a collection of procedures or functions, together with local administrative data, called by software wishing to acquire a resource

monitorname: monitor
  begin.. declarations of data local to the monitor; 
    procedure procname (... formal parameters...) ; 
      begin... procedure body... end; 
    ... declarations of other procedures local to the monitor; 
    ... initialization of local data of the monitor... 
  end;

a procedure will have to wait when the monitor is in use
when the program is waiting for the monitor, it needs to be sure that after the monitor is released, the very next procedure to execute will belong to itself
there are multiple reasons that a program will need to wait, so the program will have to set a condition variable to indicate that it is waiting for the monitor

example of a monitor (resource:monitor) with condition variable nonbusy

single resource:monitor 
begin busy: Boolean; 
    nonbusy : condition; 
  procedure acquire; 
    begin if busy then nonbusy.wait;
             busy := true
    end; 
  procedure release; 
    begin busy := false; 
          nonbusy.signal 
    end; 
  busy := false; comment initial value;
end single resource

the above example simulates a boolean semaphore with acquire and release procedures.

interpretation

a process inside a monitor may need to signal another process. the signaler must then wait for the signaled process to finish with the monitor. before letting it proceed, the signaler increments an urgentcounter to indicate that it had control of the monitor and should get it back.

then whenever the monitor is released, the urgentcounter should be decremented and the longest waiting process on the counter restarted.

similarly we need to be able to allow process in monitors to wait as well as signal which could be implemented similarly (with a waitcounter)

given the above the monitor can be explicitly passed from one process to another, and only released when there are no more processes in the explicit passing of control

bounded buffer example

two processes running in parallel share a bounded buffer, one is the consumer (eating from the beginning) and one the producer (appending to the end).

the following implements this setup

bounded buffer:monitor
  begin buffer:array 0..N - 1 of portion;
        lastpointer:0..N - 1;
        count:0..N;
        nonempty,nonfull:condition;
    procedure append(x:portion);
      begin if count = N then nonfull.wait;
            note 0 <= count < N;
            buffer[lastpointer] := x;
            lastpointer := (lastpointer + 1) mod N;
            count := count + 1;
            nonempty.signal
      end append;
    procedure remove(result x :portion);
      begin if count = 0 then nonempty.wait;
            note 0 < count <= N;
            x := buffer[(lastpointer - count) mod N];
            count := count - 1;
            nonfull.signal
      end remove;
    count := 0; lastpointer := 0;
  end bounded buffer;
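For comparison, a sketch of the same bounded buffer using Python's threading.Condition. The class name is made up. Python conditions behave like Mesa's (a notify is only a hint and the woken thread must recheck), so the waits sit in while loops rather than Hoare's single if.

```python
import threading

class BoundedBuffer:
    def __init__(self, capacity):
        self.buffer = [None] * capacity
        self.capacity = capacity
        self.lastpointer = 0                  # next slot to append into
        self.count = 0                        # number of filled slots
        self.lock = threading.Lock()          # plays the role of the monitor lock
        self.nonempty = threading.Condition(self.lock)
        self.nonfull = threading.Condition(self.lock)

    def append(self, x):
        with self.lock:
            while self.count == self.capacity:
                self.nonfull.wait()
            self.buffer[self.lastpointer] = x
            self.lastpointer = (self.lastpointer + 1) % self.capacity
            self.count += 1
            self.nonempty.notify()

    def remove(self):
        with self.lock:
            while self.count == 0:
                self.nonempty.wait()
            x = self.buffer[(self.lastpointer - self.count) % self.capacity]
            self.count -= 1
            self.nonfull.notify()
            return x
```

A producer thread calling append and a consumer calling remove see items in FIFO order, with the producer blocking whenever the buffer is full.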

scheduled waits

sometimes rather than just selecting the longest waiting process on a condition variable we would prefer to allow processes to have some priority

real world examples

buffer allocation
disk head scheduling elevator algorithm
readers and writers (only writers need exclusive access)
- to ensure writers can access elements, no readers can start while a writer is waiting
- to ensure readers get access, all readers queued during a write are allowed to read before the next write operation begins
- variables
  - startread
  - endread
  - startwrite
  - endwrite
  - number of waiting readers
  - is someone writing

conclusion

monitors can be an appropriate structure for an OS with parallel users

Experience with Processes and Monitors in Mesa

Lampson80MesaMonitors.pdf

Lampson and his team seem to make everything harder than it should be

issues

programming structure: must fit monitors into Mesa's module based organization
creating processes: need to be able to dynamically create processes after compile time (adds complications)
creating monitors: need to be able to dynamically create monitors after compile time (adds complications)
wait in nested monitor call: is confusing
exceptions: make Mesa's unwind functionality work well with monitors
scheduling: moving from recommendations to implementation proved difficult
input/output: again moving from theory to practice can be hairy

description

(see mesa-monitors)

implementation

equal division between

runtime: implements the heavier rarely used stuff like process creation deletion
compiler: implements the various syntactic constructs and translated into built-in support procedures
hardware: directly implements the more heavily used stuff like scheduling and entry/exit

performance


Construct	Time (ticks)
simple instruction	1
call + return	30
monitor call + return	50
process switch	60
WAIT	15
NOTIFY, no one waiting	4
NOTIFY, process waiting	9
FORK+JOIN	1,100

conclusion

integration of monitors into Mesa was harder than anticipated given the amount of literature on monitors and the high level of Mesa, however, much work was done to implement monitors in such a way that they can be used as the sole concurrency construct for an entire OS/language.

questions

wouldn't it also be a problem if I'm in my protected block, and hardware barges in and takes over the resource (breaks the monitor invariant)

Virtualization

Commodity Operating Systems on Scalable Multiprocessors

comodity-os-on-multiprocessors.pdf

again cites the size and complexity of modern operating systems as limiting factor, this time in effectively utilizing massively multiprocessor machines.

rather than customize the OS this paper inserts a small virtual machine monitor between the OS and the hardware.

Demonstrated on the Stanford FLASH shared memory multiprocessor, an experimental cache coherent non-uniform memory architecture (ccNUMA) machine.

data/virtual-machine-stack.png

problem

hardware development moves very quickly, yet people like to bring all of their existing software (which is OS dependent) to this new hardware.

there is a need for quickly porting existing OSs to new hardware as this is the limiting factor in adoption of new hardware setups

virtual machine monitors

the virtual machine monitor serves as a thin layer between the hardware and existing commodity OSs (like windows NT or *NIX), exporting to each OS a set of virtualized resources which it is able to manage.

while the machine can communicate through standard external interfaces (NFS, TCP/IP), the monitor is able to efficiently assign resources across machines (i.e. one machine may get more memory if needed, etc…)

with small changes the OSs can explicitly take advantage of the shared memory between virtual systems (e.g. a database could put its buffer cache in shared memory supporting multiple query servers)

the VM takes many burdens off of the OS

only the VM need scale to the size of the hardware
the VM can isolate separate OSs protecting from faults
NUMA memory management
in general handling hardware quirks

VM issues

overhead
- additional
  - exception processing
  - instruction execution
  - memory requirements
- large structure duplicated for each OS (file system buffers)
resource management
the VM does not have high level information about the processing taking place, so it can't distinguish processing which is just the OSs busy loop from important calculations.

communication
looks like different OSs on the same hardware rather than each OS on its own hardware, so
- same file can't be open in two different VMs
- same user can't start multiple VMs

DISCO (a virtual machine monitor)

DISCO is designed for the FLASH multiprocessor which consists of a collection of nodes arrayed on a high speed interconnect. each node contains a CPU, memory, and IO devices

Disco Interface

processors

exports a processor of the same type as those used by FLASH. OSs tuned to use disco can directly access some common processor functionality using special load/store instructions.

physical memory

exports contiguous physical memory starting at 0, and handles all the NUMA stuff behind the scenes

I/O devices

provides each OS with the illusion of their own I/O devices. this means disco must intercept all I/O communication. again provides special instructions for disco-aware OSs to bypass this in special cases

DISCO provides a virtual subnetwork which the machines can use to communicate amongst themselves

DISCO implementation

general

as a multi-threaded shared memory program
the small code portion of DISCO is duplicated across processors so page-misses are all local
avoids linked-lists and other structures which perform poorly with caching

virtual CPU

for speed DISCO direct executes most instructions and only tries to intercept dangerous instructions (like TLB modifications)
the guest OS runs in supervisor mode, which sits between kernel mode (used by DISCO itself) and user mode
monitor catches traps and simulates them to the VM

virtual memory

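maintains a physical-to-machine mapping (the "physical" addresses seen by each VM are mapped to real machine addresses)
catches VM attempts to update the TLB and uses them to update the real TLB with the translated machine address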
downsides which decrease performances
- TLB used for OS code/memory
- TLB flushed between CPU switches
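A toy model of the extra translation level, assuming page-granularity dictionaries in place of the real data structures. The class and method names and the allocate-on-first-use policy are invented for illustration, not taken from the DISCO paper.

```python
class Monitor:
    def __init__(self):
        self.pmap = {}        # (vm_id, physical_page) -> machine_page
        self.tlb = {}         # (vm_id, virtual_page) -> machine_page
        self.next_free = 0    # next unused machine page (toy allocator)

    def machine_page(self, vm_id, physical_page):
        """Look up, or allocate on first use, the machine page behind a VM's page."""
        key = (vm_id, physical_page)
        if key not in self.pmap:
            self.pmap[key] = self.next_free
            self.next_free += 1
        return self.pmap[key]

    def guest_tlb_write(self, vm_id, virtual_page, physical_page):
        """Called when a guest's TLB update traps: install virtual -> machine."""
        machine = self.machine_page(vm_id, physical_page)
        self.tlb[(vm_id, virtual_page)] = machine

    def switch_virtual_cpu(self):
        """Flush the TLB when a different virtual CPU is scheduled."""
        self.tlb.clear()
```

Two VMs that both believe they own physical page 0 end up on different machine pages, while each guest only ever sees its own zero-based physical memory.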

memory management

tries to be smart
- copies pages to the nodes where they are most used
- duplicates read-heavy pages between nodes that use them
uses FLASH hardware support for counting cache misses per page and identifying hot pages

I/O devices

intercepts all device accesses
add special DISCO device drivers into the OS
DMA map (translates the physical addresses in a VM's DMA requests to machine addresses)

copy-on-write disks

multiple VMs can share pages in virtual memory
copy-on-write means that this is transparent to the machines
copy-on-write only makes sense for writes which will not be permanent or shared between machines
for user files and persistent disks DISCO only allows one VM to mount the disk at a time (sharing is done using a distributed file system protocol like NFS)

DISCO (commodity OS)

currently supports a version of UNIX (IRIX), most changes to the OS resided in the HAL (hardware abstraction layer)

the special load/store call mentioned earlier to avoid traps are implemented in the HAL

experimentation

all takes place on SimOS a machine simulator

conclusion

the paper addresses the problem of developing system software for shared-memory multiprocessors, and more generally for new hardware.

DISCO shows that many of the performance limitations of VM setups are no longer an issue (sort of).

although software and OSs are growing in complexity the hardware-interface has remained relatively simple. supporting new hardware through a thin VM monitor such as disco is simpler and easier than rewriting the OS.

question

DMA: what is it?

Xen and the Art of Virtualization

zen-and-virtualization.pdf

Exokernel

exokernel

don't hide power!

Allows untrusted user-level applications to have direct access to system hardware. They present ExOS, an operating system implemented entirely in user-space libraries.

does this by securely multiplexing hardware resources between untrusted software

many programs have specialized behavior and their performance is severely hampered by being forced into using general OS abstractions to access hardware

library OS

libraries implementing some part of the OS can be app specific
libraries can trust the application (the exokernel will prevent errors from hurting other applications)
fewer OS-app transitions since much of the OS (the library) is in the application's address space

exokernel requirements

track ownership of resources
performing access control (guarding usage or binding points)
revoking access to resources

revocation

most OSs have invisible revocation of resources, so that application doesn't know when for example physical memory is being allocated or deallocated.

exokernels have visible revocation, so that applications can have some say in their allocation, and know when resources are scarce. even when the processor is taken at the end of a time-slice the application is notified.

this is necessary when the applications are using physical names to refer to resources, they must be notified upon revocation because their names will have to change

sometimes it's nice to allow "good faith" operations to take place before revocation of a resource

other times the exokernel will abort a misbehaving application

implementations

Aegis: exokernel
ExOS: Library OS

Aegis

process environments

store the information needed to deliver events associated with a resource to its owner

exception
interrupt
protected entry
addressing

exceptions

transfers all exceptions to the application except system calls and interrupts

exception handling…

saves three "scratch" registers into an agreed upon place
loads the exception program counter, last non-valid virtual page address, and cause of exception
uses exception cause to jump to pre-specified application program counter where processing resumes

features

very fast
very simple (because does not have to differentiate between TLB exceptions and all others)

address translation (application level virtual memory)

TODO

summary

an exokernel eliminates high level abstractions and focuses purely on securely multiplexing the hardware. a library OS can be built very efficiently upon an exokernel providing many of the standard OS features in a fast and extensible manner.

by allowing applications direct access to hardware it is possible for applications to greatly speed up their performance as compared to a traditional OS.

by implementing the majority of the OS as application libraries it is trivial to extend or tailor major components of the OS.

the only downside seems to be that the application has much more to worry about if it wants to take advantage of the potential speedup.

µ-kernels

performance-of-µ-kernel-based-systems

This paper aims to show that µ-kernel systems

can run modern OS personalities
can perform in the same range as normal monolithic kernels
that extensions to µ-kernel based systems can be implemented efficiently in user space
supports four basic processes; address-spaces, threads, scheduling, and synchronous inter-process communication

intro

a µ-kernel only provides address space, threads, and IPC
many people think that µ-kernels are either

too low
and these people try to add safeguards, or abstractions for helping extensions

too high
and these people try to make µ-kernel interfaces look like hardware interfaces
first generation µ-kernels like Chorus and Mach
- evolved from monolithic kernels
second generation µ-kernels like QNX and L4
- designed from scratch
- more rigorous in pursuit of minimalist design
experiments
- linux adapted to run on L4
  - gives upper performance bound
  - compare L4Linux to a linux adapted to the Mach kernel
  - insight to µ-kernel functions that affect linux performance
- implemented pipes on top of µ-kernel and compared to native unix pipes
- implemented mapping-related OS extensions
- implemented first part of real time user-level memory management system
- moved the L4 to a new processor
- lower-level communication primitive

related work

L4 essentials

based on two basic concepts, threads and address spaces

thread: activity executing inside of an address space
IPC: cross address-space communication is a fundamental µ-kernel mechanism

the initial address space represents physical memory; additional address spaces are constructed by granting, mapping, and unmapping flex-pages of size 2ⁿ. the owner of an address space can grant, map, and unmap its pages to/from other address spaces. user-level pagers handle all address space construction and maintenance.

note: mapping and unmapping pages is like creating and deleting pages. mapped to physical memory or not

when there is a page-fault it is IPC'd by the µ-kernel to the pager associated with the faulting thread. the pager and thread have complete control as to how to handle the fault allowing many options for memory management
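
The grant/map/unmap operations described above can be sketched as a toy model. This is not the L4 API: the class `AddressSpace`, its method names, and the page symbols are all hypothetical. It assumes the usual reading of the three operations: map shares a page, grant hands it over, and unmap revokes it from every space that received it from the unmapper, directly or indirectly.

```ruby
# Toy model of address-space construction (hypothetical names, not the L4 API).
class AddressSpace
  attr_reader :name, :pages

  def initialize(name, pages = [])
    @name = name
    @pages = pages.dup                          # pages visible in this space
    @children = Hash.new { |h, k| h[k] = [] }   # page -> spaces we mapped it into
  end

  # map: share a page with dest; the mapper keeps its own access.
  def map(page, dest)
    raise ArgumentError, "#{name} does not hold #{page}" unless @pages.include?(page)
    dest.receive(page)
    @children[page] << dest
  end

  # grant: hand the page over; the granter loses access and dest
  # takes over any mappings that were derived from the granter.
  def grant(page, dest)
    raise ArgumentError, "#{name} does not hold #{page}" unless @pages.include?(page)
    dest.receive(page, @children.delete(page).to_a)
    @pages.delete(page)
  end

  # unmap: revoke the page from every space that got it from us,
  # directly or indirectly; the unmapper itself keeps the page.
  def unmap(page)
    @children.delete(page).to_a.each do |child|
      child.unmap(page)
      child.drop(page)
    end
  end

  protected

  def receive(page, inherited = [])
    @pages << page unless @pages.include?(page)
    @children[page].concat(inherited)
  end

  def drop(page)
    @pages.delete(page)
  end
end
```

In this model a pager that maps a page into a client can later take it back with a single `unmap`, which is what lets memory-management policy live entirely at user level.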

I/O ports are handled as address spaces, with device interrupts handled as IPC

exceptions and traps are synchronous to the executing thread, they are mirrored up to user-level

linux on L4

as linux now runs on multiple architectures there is a fairly well-defined interface between architecture dependent and independent sections

architecture-dependent section
- interrupt service routine
- low-level device driver support
- user process interaction
- context switching
- copyin/copyout data between kernel and user spaces
- signaling
- mapping/unmapping of address spaces
- system-call mechanism
linux uses a 3-level architecture independent page-table scheme

L4-linux design/implementation

fully binary compliant

µ-kernel tasks are used for user processes and provide linux services via a single linux server in a separate µ-kernel task.

the linux server: linux kernel's address space maps 1-1 to the underlying pager

Observations on the Development of an Operating System

hypotheses
1. Operating Systems can be divided into five kinds according to the style and direction of their development, independent of their structure.
2. OS's take about 5-7 years to develop
focus on life-cycle of OS development, with the running example of the Pilot OS developed at Xerox

summary: No matter how disciplined the team is going in, building a new OS for real clients that represents a major step away from existing OSs will involve delays and bloat. Expect 5-7 years before the system is mature, useful, or able to survive in the wild on its own.

Pilot

kernel: 25,000 to 50,000 lines of Mesa code
system development project: 250,000 lines of Mesa
- kernel
- debugger
- compilers
- librarian tools, etc…
framework for thinking about designing/implementing systems for inter-subsystem and inter-computer communication

focus on 2nd meaning

Problems

size of the system: initially the kernel dominated the system size, but as outside functionality was absorbed and new tasks (development, running for multiple clients, etc…) added the system bloated both in and outside of the kernel

working set sizes: amount of real memory required to handle virtual memory without thrashing. Problems were caused by the lulling effect of virtual memory and the lack of real feedback.

the working set of the kernel was almost constant across releases
at one point using more than double allowable working memory
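
A minimal sketch of how a working set size could be measured from a page-reference trace. It assumes the common sliding-window definition (the number of distinct pages touched in the last `window` references), which the notes themselves do not spell out. The function names and the trace are hypothetical.

```ruby
# Working-set size after each reference: the number of distinct pages
# among the most recent `window` references (sliding-window definition).
def working_set_sizes(references, window)
  references.each_index.map do |i|
    start = [0, i - window + 1].max
    references[start..i].uniq.size
  end
end

# True if the working set ever needs more frames than are available,
# i.e. the point at which thrashing would be expected.
def exceeds_memory?(references, window, frames)
  working_set_sizes(references, window).max.to_i > frames
end
```

For example, `working_set_sizes([1, 2, 1, 3, 4, 4, 4], 3)` tracks how the set grows while new pages are touched and shrinks once the process settles on page 4.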

programmer productivity: impossible to measure

holy wars:

processes and monitors vs. message passing
different file system access systems

virtual memory system: based on assumption that disk access was very slow (this in the end was not the case). would have been almost as efficient to treat the disk as synchronous rather than jump through the many complex hoops built for async disk access

pipes filters and streams: Mesa streams are supposed to be like unix pipes. These streams are rarely used because Mesa is more of a type-safe API based language.

Comparing Pilot and other OSs

5 system types

favorite systems (e.g. unix)
- hugely successful
- develop a large user community outside of their developer base
- begin life as simple unambitious projects
- grow because new outside users find them easy to extend
planned systems
- cut from whole cloth
- generally with organizational backing
- goals/structures are the product of up front negotiations (not organic growth)
- some succeed and some don't
branches of existing systems
- major changes from existing system, but still able to borrow much supporting software
laboratory systems
- make contributions to the "art and science" of OS design
- never gain large user base
worthless systems

Five to Seven year rule

For planned systems of the second kind expect 5-7 years before reaching a viable OS.

time-line

planning design
initial implementation: no OS clients so little to no testing/feedback
initial functionality: some hardy users begin cutting through the forests of bugs and issues
painful refinement, making users happy
client buy in: if reached, this is when the community starts adapting to and adding to the OS

Systems of the second kind almost have to be too ambitious or general for anyone to finance them. Hence the propensity for overrun deadlines or outright failure.

Hints for Computer System Design

Collection of hints gathered from the author's experience building a variety of systems.

Most important hints deal with interfaces which should

be simple
be complete
admit a sufficiently small and fast implementation

Keep it simple

Perfection is reached not when there is no longer anything to add, but when there is no longer anything to take away. (A. Saint-Exupery)

don't try to put too much into an interface
do one thing and do it right
don't try to generalize too much
don't spend time making something fast unless it's really needed
get it right
- don't expose functionality which if used will probably be used poorly
do it fast
- a fast operation (if available/usable) is probably better than a powerful one
- programs spend most of their time doing very simple things (loads, stores, incrementing, etc…)
don't hide power
- if something works well and is useful at a low level, don't build abstractions on top of it
use procedure (functional) arguments
- rather than defining a language of static arguments/options which then result in the procedure. (e.g. map, filter, etc…)
leave it to the client
- relates to simplicity: only encode in the interface what is needed in every case; for the rest, let the client build what she needs
- unix, each command does one thing well, and the client connects them together
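
The "use procedure arguments" hint from the list above can be illustrated with a small contrast. Both functions and the sample record layout (`:name`, `:size`) are hypothetical, chosen only for the illustration.

```ruby
# Option-language style: every new kind of query needs a new option.
def find_static(entries, min_size: nil, suffix: nil)
  entries.select do |e|
    (min_size.nil? || e[:size] >= min_size) &&
      (suffix.nil? || e[:name].end_with?(suffix))
  end
end

# Procedure-argument style: the client passes the test itself as a block,
# so the interface stays small and nothing has to be anticipated.
def find_where(entries)
  entries.select { |e| yield e }
end
```

With the block form a client can write `find_where(entries) { |e| e[:size] > 100 && e[:name].end_with?(".rb") }` without the interface having to know about either condition.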

Continuity

keep basic interfaces stable
keep a place to stand
- by implementing the old interface on top of the new one
- world-swap debuggers, which re-create the memory on disk for stopping, inspecting, and restarting
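
A minimal sketch of "keeping a place to stand" by implementing an old interface on top of a new one. `NewStore` and `OldStore` are hypothetical classes invented for the illustration.

```ruby
# New interface: keyword arguments and an explicit overwrite policy.
class NewStore
  def initialize
    @data = {}
  end

  def write(key:, value:, overwrite: true)
    return false if !overwrite && @data.key?(key)
    @data[key] = value
    true
  end

  def read(key:)
    @data[key]
  end
end

# Old interface, kept alive as a thin layer over the new one so that
# existing clients keep working unchanged.
class OldStore
  def initialize(store = NewStore.new)
    @store = store
  end

  def put(key, value)
    @store.write(key: key, value: value)
  end

  def get(key)
    @store.read(key: key)
  end
end
```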

Making implementations work

plan to throw one away
- if you're doing something novel you will burn through at least one unusable prototype
keep secrets
- assumptions of implementation that clients are not allowed to make
- tension here with not hiding power
An efficient program is an exercise in logical brinkmanship. (E. Dijkstra)
divide and conquer
- recursive or bite-by-bite
use a good idea again
- instead of generalizing it

Handling all the cases

handle normal and worst cases separately
- different requirements
  - normal must be fast
  - worst must be possible

Speed

split resources in a fixed way if in doubt (easier than sharing)
use static analysis when possible
- static analysis is analysis which doesn't require that the code be run
dynamic translation can be helpful.
- translation in incremental steps between convenient readable representations to those that can be easily evaluated
cache answers to expensive computations
use hints: like cached answers, but they may be wrong and this can be checked
when in doubt use brute force: don't be too fancy, don't work around assumptions which may not hold
- special purpose hardware (e.g. FPGA)
compute in background: take advantage of the lulls in activity
batch processing: when you can do it all at once (rather than incrementally) it will probably be easier and more reliable
safety first: strive to avoid disaster before incrementally improving performance
shed load: if demand is outstripping resources, begin dropping clients
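
The difference between a cached answer and a hint, from the list above, can be sketched as follows. Both classes are hypothetical illustrations: a cache entry is trusted as-is, while a hint is checked against the truth before it is used and falls back to the slow path when it is stale.

```ruby
# Cache: remembers answers to an expensive computation and trusts them.
class AnswerCache
  attr_reader :misses

  def initialize(&compute)
    @compute = compute
    @store = {}
    @misses = 0
  end

  def [](key)
    @store.fetch(key) do
      @misses += 1
      @store[key] = @compute.call(key)
    end
  end
end

# Hint: remembers where a key was last seen in a table that may be
# reordered behind our back. The remembered index is verified first.
class HintedLookup
  def initialize(table)   # table: array of [key, value] pairs
    @table = table
    @hints = {}           # key -> last known index (may be wrong)
  end

  def find(key)
    i = @hints[key]
    return @table[i][1] if i && @table[i] && @table[i][0] == key
    i = @table.index { |k, _| k == key }   # slow path: linear search
    return nil unless i
    @hints[key] = i
    @table[i][1]
  end
end
```

A wrong hint costs only the check plus the slow path, so hints never affect correctness, only speed.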

Fault-tolerance

The unavoidable price of reliability is simplicity. (C. Hoare)

end-to-end

Error recovery at the application level is absolutely necessary for a reliable system, and any other error detection or recovery is not logically necessary but is strictly for performance. – Saltzer
- intermediate checks only serve performance
log updates: it's cheap, reliable, and useful (like a transactional database)
make actions atomic or restartable
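
A minimal sketch of logging updates so that recovery is restartable. The class `LoggedStore` is hypothetical, and an in-memory array stands in for an append-only log file. Each record is written before the state changes, and replaying the same record twice gives the same result, so recovery can be repeated safely after a crash.

```ruby
require "json"

# Key/value state rebuilt from an append-only log of updates.
class LoggedStore
  attr_reader :state

  def initialize(log = [])
    @log = log          # stand-in for an append-only file
    @state = {}
    replay
  end

  def set(key, value)
    @log << JSON.generate("key" => key, "value" => value)   # log first
    @state[key] = value                                      # then apply
  end

  # Re-apply every logged update. Each record is idempotent, so running
  # replay again (for example after a crash mid-recovery) is harmless.
  def replay
    @log.each do |line|
      rec = JSON.parse(line)
      @state[rec["key"]] = rec["value"]
    end
  end
end
```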

Conclusion

done

project

TODO paper [2/4]

DEADLINE: 2009-12-10 Thu

[X] go over 3-sched
[X] Con and LKML background
[X] data analysis
[X] look over results

BFS vs. CFS

Con vs. Ingo Molnar

according to Con Kolivas

BFS is simpler – ~9000 fewer lines of code than CFS
more appropriate for the loads of normal interactive desktop users
single runqueue -> much easier to guarantee global fairness
no heuristics which try to guess interactivity from analysis of sleep time
interactive tasks will naturally be scheduled with high priority because:
- if they're just waking up then they haven't used up their CPU time
- they will have earlier effective deadlines
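
The points above can be sketched as a simplified model. This is not the BFS source: the names `BfsTask`, `SingleRunqueue`, `requeue`, and the 6 ms `RR_INTERVAL` are assumptions for illustration, and nice levels are ignored. The model has one global queue, picks the earliest deadline first, and gives a task a later deadline only when it has used up its timeslice.

```ruby
BfsTask = Struct.new(:name, :deadline, :slice_left)

RR_INTERVAL = 6   # ms; assumed timeslice length

# One global queue shared by all CPUs; no per-task interactivity estimate.
class SingleRunqueue
  def initialize
    @tasks = []
  end

  def enqueue(task)
    @tasks << task
  end

  # Scan the whole queue for the earliest deadline (O(n), but one queue).
  def pick_next
    i = @tasks.each_index.min_by { |j| @tasks[j].deadline }
    i && @tasks.delete_at(i)
  end
end

# A task that exhausted its slice gets a fresh slice and a later deadline.
# A task that slept keeps its unused slice and its old, earlier deadline,
# so it is naturally picked ahead of CPU hogs when it wakes.
def requeue(queue, task, now)
  if task.slice_left <= 0
    task.slice_left = RR_INTERVAL
    task.deadline = now + RR_INTERVAL
  end
  queue.enqueue(task)
end
```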

according to Ingo Molnar

people are regularly testing 3D smoothness, and they find CFS good enough and that matches my experience as well (as limited as it may be). In general my impression is that CFS and SD are roughly on par when it comes to 3D smoothness.

there was simply no code in existence before CFS which has proven the code simplicity/design virtues of 'fair scheduling' - SD was more of an argument against it than for it. I think maybe even Con might have been surprised by that simplicity: in his first lkml reaction to CFS he also wrote that he finds the CFS code 'beautiful', and my reply to Con's mail still addresses a good number of points raised in this thread i think.

Linus on choosing CFS over SD

Con can't be trusted to maintain his code

that was where the SD patches fell down. They didn't have a maintainer that I could trust to actually care about any other issues than his own.

as a long-term maintainer, trust me, I know what matters. And a person who can actually be bothered to follow up on problem reports is a hell of a lot more important than one who just argues with reporters

SD (Staircase Deadline) Scheduler

http://kerneltrap.org/SD_scheduler
http://lwn.net/Articles/231973/
- It has bound latency. CFS can't guarantee either as well as SD can. SD allows one to set the exact scheduling priority of everything and it is always respected, as there is no interactive renicing: it is very predictable.

Brain Fuck Scheduler

http://ck.kolivas.org/patches/bfs/bfs-faq.txt
- Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS. That's a load of 1000 on a quad core machine.

timeline

1999 Con gets into linux, and at around 2.4.18 he began preparing his own patches merging desktop-performance patches to the kernel (e.g. O1, preempt, low latency and compressed cache)
ck patchset seems to do great things for interactive kernel use

One thing is for sure, the -ck patches before that one did an increadible job. Still, many years and hardware generations after, the best performing system I ever had (as in user experience, gapless audio playback while copying large and many files, …) was a 300 MHz Pentium II with probably 512 MB RAM running a 2.4 -ck kernel.

My current systems still have gaps in Audio playback even though they are running at 1.8 GHz and more.

I wish back my old system, just for playing audio.
2002 Con is interviewed about ConTest (see here) a benchmarking tool which is heavily used by kernel developers
2004 Con releases the Staircase scheduler (see here) (see this email)
Early 2007 Rotating Staircase Deadline scheduler (see here)
Linus seems amenable to RSDS mainline inclusion

I agree, partly because it's obviously been getting rave reviews so far, but mainly because it looks like you can think about behaviour a lot better, something that was always very hard with the interactivity boosters with process state history.
the Staircase scheduler develops into the SD (Staircase Deadline) scheduler
early 2007 Ingo Molnar releases his own rewrite of Con's SD scheduler to much acclaim (see this node)
Con is not pleased (see this email)
mid 2007 Con stops updating the -ck patchset (see this email)

It is clear that I cannot develop code for the linux kernel intended only to be used out of mainline and not have mainline get involved somewhere along the line. Whether it be the users or even other developers repeatedly asking "when will this be merged". This forever gets me into a cycle of actually trying to merge the stuff and … well you all know what happens at that point (again I had nastier words but decided not to use them.)

So, I've had enough. I'm out of here forever. I want to leave before I get so disgruntled that I end up using windows. I may play occasionally with userspace code but for me the kernel is a black hole that I don't want to enter the event horizon of again.
Ingo responds to Con's release 2009-09-06 (see this email)

I understand that BFS is still early code and that you are not targeting BFS for mainline inclusion - but BFS is an interesting and bold new approach, cutting a lot of code out of kernel/sched*.c, so it raised my curiosity and interest :-)

Alas, as it can be seen in the graphs, i can not see any BFS performance improvements, on this box.

So the testbox i picked fits into the upper portion of what i consider a sane range of systems to tune for - and should still fit into BFS's design bracket as well according to your description: it's a dual quad core system with hyperthreading.
Con responds 2009-09-07 (see this email)

/me sees Ingo run off to find the right combination of hardware and benchmark to prove his point.

[snip lots of bullshit meaningless benchmarks showing how great cfs is and/or how bad bfs is, along with telling people they should use these artificial benchmarks to determine how good it is, demonstrating yet again why benchmarks fail the desktop]

I'm not interested in a long protracted discussion about this since I'm too busy to live linux the way full time developers do, so I'll keep it short, and perhaps you'll understand my intent better if the FAQ wasn't clear enough.

Do you know what a normal desktop PC looks like? No, a more realistic question based on what you chose to benchmark to prove your point would be: Do you know what normal people actually do on them?

Feel free to treat the question as rhetorical.

notes

real tests

function latt-results(base="base"):

    # one row per latt output file: [client count, every "N usec" value in file order]
    results = Dir.entries(File.join(base)).map do |e|
      if e.match(/.*out(\d+).*/)
        [Integer($1)] +
          File.read(File.join(base, e)).lines.map do |l|
            Integer($1) if l.match(/.*?(\d+) *usec.*/)
          end.compact
      end
    end.compact

data.each{ |l| puts "|"+l.join(" | ")+"|" }


1	3847	124	446	39	136903	100383	5966	515
2	20030	955	2430	155	219031	137104	13517	862
3	73647	13612	21009	1173	383096	174236	28855	1611
4	109658	21506	25827	1318	341028	226312	34078	1739
5	148674	27177	31495	1519	395191	281150	37515	1809
6	148416	33376	38127	1751	476689	333882	45291	2080
7	223346	37809	43699	1960	525645	396762	50692	2274
8	251356	43688	53118	2312	654439	454026	53991	2350
9	234711	47452	52388	2218	668008	512374	57961	2454
10	268947	50518	56947	2344	756613	567916	63370	2609

data.each{ |l| puts "|"+l.join(" | ")+"|" }


1	3675	69	325	26	49456	31094	2965	236
2	15451	188	1224	69	54753	31353	3027	171
3	44760	5873	8952	423	82646	40487	12700	601
4	46814	8432	10647	439	87481	47902	13865	572
5	73662	12727	14015	534	136676	56542	17872	680
6	62503	14414	14475	505	154784	65107	17297	603
7	116681	20178	19589	649	175359	76453	24407	809
8	110105	22831	21448	673	195287	81819	23305	731
9	124869	25198	23156	693	165885	89315	25439	761
10	157668	27586	24549	706	164980	96154	27432	789
11	154270	31515	27226	759	189019	106003	29155	813
12	204609	39826	35900	971	233421	106114	34894	943
13	168486	40721	34658	912	219374	120001	34546	909
14	163194	41588	33267	852	248706	128874	35918	919
15	203498	45197	37336	936	308278	141872	39753	997
16	213616	47945	38915	954	245362	147306	41478	1017
17	232214	52437	42495	1031	304720	157500	44672	1083
18	261034	58236	49930	1195	298037	158504	49982	1196
19	250611	58823	46255	1083	303229	172885	47975	1123
20	279880	57325	44428	1019	369985	186997	48912	1122

data.each{ |l| puts "|"+l.join(" | ")+"|" }


1	3675	69	325	26	49456	31094	2965	236
2	3603	81	333	19	50891	33909	3094	175
3	14977	1621	3119	147	63231	45225	4787	226
4	16288	3554	4621	191	78503	57241	5906	244
5	21650	5059	5668	214	101637	69758	7882	298
6	31288	6901	6948	244	115349	81085	8248	290
7	36701	8897	8525	283	132428	93030	10158	337
8	42805	10986	9876	311	151479	104323	11902	375
9	43571	12718	10766	324	168803	116987	13642	410
10	57919	15239	12198	355	184954	128700	14682	427
11	55153	16664	13189	372	206221	141415	17527	495
12	61766	18789	14428	394	230900	148623	15859	433
13	73299	20834	15409	409	244328	163776	19161	509
14	68849	22847	16692	433	258783	175110	19890	516
15	74255	24603	17802	453	267375	184825	21259	541
16	94934	27876	19536	488	307184	198535	25055	626
17	90519	30494	21592	532	319140	210595	28265	696
18	93456	32464	22524	545	341838	218002	29598	716
19	106604	36042	25485	616	367055	239063	40259	974
20	116848	38833	27290	654	389510	257751	44943	1077

test – new kernel

only taking stats from the first run as latt.c already does multiple runs for us and calculates error bars, etc…

# one row per latt output file: [client count, every "N usec" value in file order]
results = Dir.entries(File.join(base)).map do |e|
  if e.match(/.*out(\d+).*/)
    [Integer($1)] +
      File.read(File.join(base, e)).lines.map do |l|
        Integer($1) if l.match(/.*?(\d+) *usec.*/)
      end.compact
  end
end.compact


1	3847	124	446	39	136903	100383	5966	515
2	20030	955	2430	155	219031	137104	13517	862
3	73647	13612	21009	1173	383096	174236	28855	1611
4	109658	21506	25827	1318	341028	226312	34078	1739
5	148674	27177	31495	1519	395191	281150	37515	1809
6	148416	33376	38127	1751	476689	333882	45291	2080
7	223346	37809	43699	1960	525645	396762	50692	2274
8	251356	43688	53118	2312	654439	454026	53991	2350
9	234711	47452	52388	2218	668008	512374	57961	2454
10	268947	50518	56947	2344	756613	567916	63370	2609

work errorbars

data/netbook-cfs-clientyonly.png

frame drops

base = "./project/2.6.31.6_hausmaster-laptop/av/"
results = Dir.entries(File.join(base)).map do |e|
  if e.match(/out(\d+)\.txt/)
    # row: [client count, percentage computed from the last "V:<n>:<m>" line in the file]
    [Integer($1)] +
      File.read(File.join(base, e)).lines.map do |l|
        (l.match(/V\:(\d+)\:(\d+)/)) ? [Float($1), Integer($2)] : nil
      end.compact.map{|l,r| [100-((r / (l+1))*100)] }.last
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }


1	100.0
2	99.8543335761107
3	99.4147768836869
4	98.3618763961281
5	97.6761619190405
6	96.6565349544073
7	94.296875
8	93.7795275590551
9	94.5797329143755
10	92.0255183413078

data/frame-drops.png

actually running some tests

latt.c

base = "./project/bfs"
results = Dir.entries(base).map do |e|
  if e.match(/i(\d+).out/)
    [Integer($1)] +
      File.read(File.join(base, e)).split("\n").map do |l|
      Integer($1) if l.match(/.*?(\d+) *usec.*/)
    end.compact
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }

base = "./project/bfs"
Dir.entries(base).map do |e|
  if e.match(/i(\d+)\.out/)
    [Integer($1)] +
      File.read(File.join(base, e)).lines.map{|l| Integer($1) if l.match(/.*?(\d+) *usec.*/)}.compact
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }

work errorbars

wakeup errorbars

all on one

jeff's results


1	322	11	26	2	40858	33434	2926	234
2	15778	3439	4858	284	93302	57327	7113	416
3	31602	7273	7675	381	122543	81251	10231	508
4	47619	11256	10508	468	146941	107658	12848	572
5	55621	15068	12915	529	179079	126734	17147	703
6	78896	20132	16834	652	224041	153557	21634	838
7	79376	24201	18530	683	261242	175972	24642	909
8	99925	28032	21225	758	300960	205240	32233	1151
9	102309	32829	24116	829	341542	226719	36207	1245
10	122910	38026	27450	931	378846	256072	41322	1401


1	39	11	3	0	63212	45110	3734	303
2	37330	9829	10733	648	123717	75794	15559	940
3	50036	16725	14469	744	166951	103040	20449	1052
4	60771	20001	16194	739	196976	121333	25811	1178
5	99898	26263	20435	860	222647	138132	30665	1290
6	125911	32808	24503	967	276826	159907	38284	1511
7	136318	38887	27918	1040	301758	173934	43273	1612
8	168979	44304	31485	1130	348513	192425	52425	1882
9	193398	49936	34993	1203	376297	206254	54942	1889
10	208970	56251	39117	1304	428826	219508	63024	2101

work errorbars

wakeup errorbars

taylor results


1	48	29	11	5	89117	88696	356	159
2	38317	9162	14364	5078	183274	157032	15264	5397
3	19015	3625	6949	2006	237982	163345	37587	10850
4	40082	7297	11816	2954	297934	218751	45478	11369
5	56967	12006	19883	5134	370527	299750	34033	8787
6	197095	29112	52762	12436	378163	320431	53987	12725
7	62656	18670	17292	3773	438030	400260	23497	5128
8	153654	29389	44716	9128	528557	417398	51049	10420
9	135979	43330	40470	7788	555059	466411	78488	15105
10	242612	42873	68167	12446	628015	475800	87386	15954


1	55	29	15	7	89662	89019	418	187
2	4736	1230	1794	634	149805	144967	3243	1147
3	7269	2684	2687	776	216993	202884	9159	2644
4	17619	5140	5474	1369	274191	256837	13168	3292
5	16372	6463	5778	1492	326104	307699	16855	4352
6	21754	10716	7862	1853	391064	360888	25385	5983
7	22960	10366	7590	1656	463452	427938	30555	6668
8	43914	16872	12730	2598	511854	477101	26558	5421
9	43543	15306	10991	2115	565166	534687	24529	4721
10	32396	12250	10698	2392	641921	602982	38807	8677

work errorbars

wakeup errorbars

results of the initial short run


clients	max	avg	stdev	stdev mean	max	avg	stdev	stdev mean
1	32	26	3	1	45521	40912	9169	4101
2	4618	491	1450	459	46845	45910	1374	434
3	35772	7188	12617	3258	66012	47203	12731	3287
4	47612	11190	13993	3129	92774	62032	20679	4624
5	78830	24899	27926	5585	99364	49770	17919	3584
6	53154	15118	14660	2676	114770	69815	20333	3712
7	55765	12266	18432	3483	121098	59936	20165	3811
8	61666	17244	20994	3711	135540	73728	34408	6083
9	98198	29768	28730	4788	149886	80922	32844	5474
10	119101	18923	30233	4780	164823	68145	41851	6617


clients	max	avg	stdev	stdev mean	max	avg	stdev	stdev mean
1	52	25	14	6	34052	31043	2882	1177
2	121	35	31	9	32951	30427	1652	477
3	7962	1149	2508	648	44959	41041	2498	645
4	12155	2068	3109	695	68560	56283	8685	1942
5	13584	4795	4967	993	75826	66148	8812	1762
6	21760	6601	7203	1315	86005	73206	8718	1592
7	19290	7422	6676	1128	107545	87649	8881	1501
8	42266	10625	10161	1607	110436	92718	14027	2218
9	40445	13647	11335	1690	134468	100833	18010	2685
10	31040	13661	10206	1614	136177	116693	9188	1453
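
The "stdev mean" column in the tables above looks like the standard error of the mean, stdev / sqrt(n). In the first table the ratio stdev / stdev mean is close to √5, √10, and √15 for 1, 2, and 3 clients (9169/4101, 1450/459, 12617/3258), which would fit 5 samples per client. A minimal sketch of the three statistics follows, assuming the sample (n - 1) form of the standard deviation. Whether latt.c uses that form is not confirmed here.

```ruby
def mean(xs)
  xs.sum.to_f / xs.size
end

# Sample standard deviation (n - 1 in the denominator).
def stdev(xs)
  m = mean(xs)
  Math.sqrt(xs.sum { |x| (x - m)**2 } / (xs.size - 1))
end

# Standard error of the mean: the spread expected in the average itself,
# which is what an error bar on a plotted mean usually shows.
def stdev_mean(xs)
  stdev(xs) / Math.sqrt(xs.size)
end
```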

building the kernel

(see these-kernel-building-instructions)

initial build

cd into the kernel directory
copy your local configuration into the kernel config
```
cp /boot/config-`uname -r` ./.config 
```
run the menuconfig
```
make menuconfig
```
select the "load configuration" option, load your .config file, and then exit
now you can try to make the kernel with make

install the build tools, and header files

sudo apt-get install build-essential linux-headers-2-...

still didn't work, then switched to the unstable debian repos (replaced "lenny" with "unstable" in /etc/apt/sources.list)
with unstable I installed libc6-dev and tried again
now missing zlib instead of eventfd.h
installing zlib
```
sudo apt-get install zlib1g-dev
```
make the kernel make menuconfig. This spits out the following error message, but seems to succeed regardless
```
make[1]: *** No rule to make target `just'. Stop.
make: [Just] Error 2
```
now make the Debian kernel package
install the resulting .deb file
```
dpkg -i linux-image......
```
rebooted using the new kernel and it worked

bfs patch

Applied the BFS patch

downloaded from …
applied
```
patch -p1 < bfs-patch...
```

secondary build

make the BFS-patched kernel

fakeroot make-kpkg clean
fakeroot make-kpkg --initrd --append-to-version=-bfs kernel_image kernel_headers

install the resulting kernel
```
sudo dpkg -i linux-image....bfs...deb
```

links

howto-kernel-compilation

file:data/10.1.1.59.6385.pdf

History of the linux kernel

http://www.kernel.org/doc/#History_of_the_Linux_Process_Scheduler

Linux test suite

http://ltp.sourceforge.net/

CFS

http://www.ibm.com/developerworks/linux/library/l-cfs/

Linus on CFS vs SD
- http://kerneltrap.org/node/14008

Completely Fair Scheduler
- http://en.wikipedia.org/wiki/Completely_Fair_Scheduler
- http://kerneltrap.org/node/8059
- http://www.linuxinsight.com/files/sched-design-CFS.txt (CFS design document)

SD Scheduler
- http://kerneltrap.org/SD_scheduler
- http://lwn.net/Articles/231973/

It has bound latency. CFS can't guarantee either as well as SD can. SD allows one to set the exact scheduling priority of everything and it is always respected, as there is no interactive renicing: it is very predictable.

Brain Fuck Scheduler
- http://ck.kolivas.org/patches/bfs/bfs-faq.txt

Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS. That's a load of 1000 on a quad core machine.

Scheduler Benchmarking
- http://kerneltrap.org/mailarchive/linux-kernel/2007/9/17/261647
- http://lkml.org/lkml/2007/9/13/385
- http://devresources.linux-foundation.org/craiger/hackbench/ (Hackbench benchmarking program)
- The 'latt' test app recently written by Jens Axboe is a better place for simpler to understand and useful numbers.
  - http://ck.kolivas.org/patches/bfs/bfs-faq.txt
- 3D Smoothness testing
  - http://kerneltrap.org/node/14023

other schedulers to implement

lottery scheduler

lottery-scheduling

seems nice, nice math/stat background
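
A minimal sketch of the core of lottery scheduling, assuming the usual formulation: each task holds some number of tickets, a winning ticket is drawn uniformly at random, and the task holding it runs next, so CPU share is proportional to ticket count in expectation. The function names and ticket counts are hypothetical.

```ruby
# tasks: hash of task name => ticket count.
# Walk the tasks in order and return the one whose ticket range
# contains the winning ticket number (0-based).
def holder_of(tasks, ticket)
  tasks.each do |name, count|
    return name if ticket < count
    ticket -= count
  end
  nil
end

# Draw one winning ticket uniformly from all outstanding tickets.
def lottery_pick(tasks, rng = Random.new)
  total = tasks.values.sum
  holder_of(tasks, rng.rand(total))
end
```

With `{"a" => 75, "b" => 25}`, task "a" should win roughly three quarters of the draws over a long run, without any per-task history being kept.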

GA scheduler

somehow evolve different scheduling algorithms

testing suites

Scheduler Benchmarking

http://kerneltrap.org/mailarchive/linux-kernel/2007/9/17/261647
http://lkml.org/lkml/2007/9/13/385
http://devresources.linux-foundation.org/craiger/hackbench/
Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS. That's a load of 1000 on a quad core machine.
The 'latt' test app recently written by Jens Axboe is a better place for simpler to understand and useful numbers.
- http://ck.kolivas.org/patches/bfs/bfs-faq.txt
3D Smoothness testing
- http://kerneltrap.org/node/14023
http://www.linuxfordevices.com/files/article027/rh-rtpaper.pdf

project tasks [1/1]

DONE project proposal

DEADLINE: 2009-09-25 Fri

2-page proposal/description
- motivation
  - novel
  - solving problem
  - test (conventional wisdom)
  - measuring
  - comparing
- objective
- background
  - related work
  - literature
- methodology
  - approach
  - hypothesis
  - validation
  - challenges
    - make sure reasonable for time span
    - make sure we have resources
- expected results / impact
1-3 people group
would prefer hardcopy, but a PDF is fine
project need not be completely defined, but should touch on potential sticking points

outline / topic

CPU scheduling http://kerneltrap.org/node/14008

Motivation - learn about kernels, proper testing environments and scheduling policies and mechanisms.

Objective - compare the CFS and SD schedulers from 2007. We hope to identify and quantify the differences between them.

Hypothesis - As indicated in the discussion between Linus Torvalds and Kasper Sandberg, we expect the CFS and SD schedulers to each perform better in certain niches.

Methodology - Use existing methodology and test responsiveness or throughput (read pg. 704)

Challenges - Setting up a valid testing and development environment. Development and testing will most likely be different (VM vs. Physical Machine). Putting together a good test suite to test different types of usage. How to evaluate performance as it's running. How does our choice in hardware affect the outcome of the results (choosing the hardware model that best)?

composition (challenges)

Challenges

Setting up a valid testing and development environment. Development and testing will most likely be different (VM vs. Physical Machine). Putting together a good test suite to test different types of usage. How to evaluate performance as it's running. How does our choice in hardware affect the outcome of the results (choosing the hardware model that best)?

testing and development environment
- most likely different environments for development and for testing
- VM, kernel module, algorithmic simulation
test suite
- define what is meant by "interactive" use
- tailored to the particular aims of our investigation
- popular (so our results can be compared to others)
- how to perform a "live" evaluation of the performance
  - Heisenberg uncertainty principle
impacts of hardware on results
resources
- hardware
- test suite

note

Also, something to note about the history of linux schedulers is that the SD scheduler was never merged into the mainline kernel. The predecessor to CFS was the "O(1) Scheduler." The SD scheduler was more of a contemporary competitor to the CFS that lost out.

final

intro

The release of the Completely Fair Scheduler (CFS) in 2007 sparked significant debate on various Linux kernel mailing lists and forums. Compared to its contemporary competitor, the Staircase Deadline (SD) scheduler, which used run-queues, CFS uses a time-ordered red-black tree. While the CFS design implemented a “radical” shift in data structures, the benefits are not immediately visible. In several instances the SD scheduler was reported to handle 3D gaming better, providing a smoother display to the user. SD was viewed as the reference in the development of CFS, yet it seems the decision to include CFS in the mainline was partially political. As Linus Torvalds was quoted, “[A] person [Ingo] who can actually be bothered to follow up on problem reports is a hell of a lot more important than one who just argues with reporters [Con]”.

Our objective is to analyze the differences between the two methods of scheduling (including patched versions) and to determine the possible benefits of using one system over the other. This implies a wide range of testing procedures in order to provide a balanced perspective on the debate. A secondary goal is to gain first hand experience with kernels, proper testing environments, scheduler policies and mechanisms.

We hypothesize that the performance of early versions of the CFS scheduler does not match that of SD, but that with tweaking and applied patches CFS surpasses SD in performance.

methodology

Testing the schedulers will require modifying the Linux kernel. We will investigate modifying the kernel on two different levels:

The first is to implement schedulers as individual kernel modules. This way is preferred as we would not have to recompile and maintain independent kernels but instead have individual scheduling modules compiled for the same kernel. We could specify which scheduler to use as a boot flag or, ideally, on the fly–if possible.
If using kernel modules is not possible, then we will be required to compile and install independent kernels for each of the schedulers that we want to test. These will be chosen from at boot time.

The CFS scheduler is presently in the mainline kernel (true as of 2.6.23). Implementing the SD scheduler will require applying patches against the mainline kernel. If we desire to separate the schedulers into individual kernel modules, this will require adaptation of the patches.

After our schedulers are implemented and ready for testing, we will concentrate on devising effective tests and benchmarks with which to evaluate them. We will be evaluating the schedulers according to the following criteria:

CPU utilization: how effectively can the scheduler utilize the CPU
Throughput: the rate at which jobs are completed
Turnaround time: the time it takes to finish a job
Waiting time: the total time a job spends waiting on the ready queue
Response time: the time from a job's submission until it produces its first response
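
The criteria above can be computed from a simple per-job trace. The following is a minimal sketch, not part of the proposal's tooling: the `Job` record and `metrics` function are hypothetical names, it assumes a single CPU and jobs that never block on I/O (so waiting time is turnaround minus CPU burst), and it takes response time to mean submission to first run.

```python
from dataclasses import dataclass


@dataclass
class Job:
    arrival: float    # time the job was submitted
    first_run: float  # time the job first got the CPU
    finish: float     # time the job completed
    burst: float      # total CPU time the job needed


def metrics(jobs):
    """Summarize a finished trace (assumes one CPU, no I/O blocking)."""
    n = len(jobs)
    # Length of the observed window, from first submission to last completion.
    span = max(j.finish for j in jobs) - min(j.arrival for j in jobs)
    turnaround = [j.finish - j.arrival for j in jobs]
    # Without I/O, any time not spent running was spent on the ready queue.
    waiting = [t - j.burst for t, j in zip(turnaround, jobs)]
    response = [j.first_run - j.arrival for j in jobs]
    return {
        "utilization": sum(j.burst for j in jobs) / span,
        "throughput": n / span,
        "avg_turnaround": sum(turnaround) / n,
        "avg_waiting": sum(waiting) / n,
        "avg_response": sum(response) / n,
    }


if __name__ == "__main__":
    trace = [Job(0, 0, 3, 3), Job(0, 3, 5, 2)]
    for name, value in metrics(trace).items():
        print(name, value)
```

In a real kernel test the trace would have to come from instrumentation or an existing benchmark; the sketch only shows how the five numbers relate to each other.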

We will research existing benchmarks for testing schedulers and only write our own as a last resort when no other appropriate benchmarks can be found. In addition to artificial benchmarks, we will also perform real world tests, such as listening to music when other processes are hogging the processor and benchmarking games such as Unreal Tournament 2004.

In addition to the above, we are also interested in exploring the following optional paths:

Testing Kolivas's Brain Fuck Scheduler (BFS)–this is a recent (August 2009) successor to the SD scheduler
Implementing control (baseline) schedulers such as round-robin to become more comfortable with writing our own schedulers
Experimenting with possible improvements to the schedulers, such as by tweaking parameters
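
As an illustration of what a round-robin baseline does, here is a user-space simulation. It is only a sketch under simplifying assumptions (one CPU, all jobs arrive at time 0, no context-switch cost), and `round_robin` is a hypothetical helper, not kernel code.

```python
from collections import deque


def round_robin(bursts, quantum):
    """Simulate round-robin on one CPU; all jobs arrive at time 0.

    bursts[i] is the CPU time job i needs. Returns each job's finish time.
    """
    remaining = list(bursts)
    queue = deque(range(len(bursts)))  # ready queue of job indices
    finish = [0] * len(bursts)
    clock = 0
    while queue:
        job = queue.popleft()
        run = min(quantum, remaining[job])  # run for at most one quantum
        clock += run
        remaining[job] -= run
        if remaining[job] > 0:
            queue.append(job)  # preempted: back to the tail of the queue
        else:
            finish[job] = clock
    return finish


if __name__ == "__main__":
    print(round_robin([3, 5, 2], quantum=2))  # [7, 10, 6]
```

Varying `quantum` in this simulation shows the usual tradeoff: a small quantum improves response time for short jobs, while a very large one degenerates into first-come first-served.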

challenges

There will be a number of challenges inherent in carrying out our methodology. The first is establishing appropriate kernel development and testing environments. Each of these environments will have different requirements:

development: A good development environment should allow for a reasonably quick closed testing loop for new code, and should be well protected from the unpredictable and likely harmful side effects of experimental code. Given these restrictions a good development environment will likely be contained inside of a VM, or on an expendable piece of hardware.
testing: A good testing environment should resemble the kernel's actual production environment as closely as possible. For this reason we will probably test directly on a physical machine rather than in a virtual machine. If we want a wider variety of hardware than is available, some sort of "simulated" test environment may be required. Such a simulated scheduling environment would allow more flexibility in varying simulated hardware components and the related performance-determining constants, but may yield less accurate results.

Once we have established an acceptable development and testing framework the next challenge will be the acquisition of a suitable testing suite. Two issues related to the availability of a test suite are the possibly prohibitive cost of high quality "standard" test suites and the potential lack of any widely accepted test suites directed at the particular aims of our study (specifically scheduler performance over different "types" of load including interactive use and batch use).

Some tradeoff will have to be made between the amount of information returned by a test suite, $\Delta P$, and the suite's impact on the system load, $\Delta L$. We expect a situation similar to the Heisenberg uncertainty principle, where increasing the precision of our knowledge of the system at any point decreases our knowledge of the load, such that the two are only knowable up to some hardware constant $\hbar$.

$$ \Delta P \times \Delta L \geq \frac{\hbar}{2} $$

If this tradeoff proves untenable then we may be required to resort to a simulated test environment, or a scheme of partitioning the running system inside of a virtual machine and collecting our metrics from outside of the machine.

implementation

kernel: 2.6.31 (this is what the current BFS patch is against)

exams [1/3]

TODO final exam

DEADLINE: 2009-12-15 Tue 07:30
in classroom

TODO final review

DEADLINE: 2009-12-11 Fri 09:00
in CS141

DONE midterm

DEADLINE: 2009-10-13 Tue

format
- questions like the reading response questions
- essay questions
topics
- kernel design
- memory management
- virtualization
test general OS concepts
care less about specifics, and more about the effects of the mechanisms
- not how did x solve y, rather, how could one solve y

topics

OS structure

standard monolithic

entire OS is in kernel space

pros

faster (less context switching)

cons

complexity, size
less flexible/extensible: can't customize w/o changing kernel-space code
harder to move to new hardware
less secure/stable (more low-level components to keep track of)

µ-kernel

only supports basic structures (in L4: address spaces, threads, scheduling, and IPC) and pushes the rest of the OS out into user-space servers

pros

simpler
easier to move to new hardware
flexible
more secure/reliable because of the simplicity of the low-level interface

cons

slower

exokernel

only does multiplexing of HW resources, rest of OS is in user-space libraries. end-to-end argument: application knows best how to handle its own resources.

pros

direct access to hardware
flexible

cons

no security gains like in µ-kernel
cooperation

virtualization structures

as example of general system management structure

fault containment
porting old OS to new hardware
slower

understand

implication of these structures to the performance of the OS
- micro-benchmarks
- macro-benchmarks (applications)
implication for extensibility of the OS
separation of protection of resources, mechanisms, policy

processes and threads

address spaces

virtual memory of a process

process state

multi/batch/time-sharing programming

multi-programming: multiple tasks, can be single user
time-sharing: multiple tasks, normally multiple users
batch: space sharing rather than time sharing, but the CPU is generally only given to one task at a time. queue of processes

context switch

swap

models of communication

message passing
shared memory (on different architectures NUMA, UMA, etc…)

synchronization

monitors: software constructs surrounding critical sections
semaphores (less): counting or binary primitive locks. a counting semaphore tracks how many threads can be in the critical section simultaneously; the count is decremented as each one enters the critical section and incremented as it leaves
mutex: simple binary semaphore (potentially has additional features protecting against priority-inversion)
critical section: section of code which is run inside of a lock, semaphore etc…
deadlock: (see deadlock)
condition variables, their semantics: used for IPC, avoid spinning.
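
A counting semaphore and a mutex can be seen together in a small sketch using Python's standard `threading` module. The helper `run_workers` and the numbers of workers and slots are made up for illustration; the semaphore admits at most `slots` threads into the critical section, and the mutex protects the shared counters.

```python
import threading
import time


def run_workers(n_workers, slots):
    """Start n_workers threads; return the peak number inside at once."""
    sem = threading.Semaphore(slots)  # counter starts at `slots`
    guard = threading.Lock()          # mutex protecting the two counters
    inside = 0
    peak = 0

    def worker():
        nonlocal inside, peak
        with sem:                     # decrement; block while counter is 0
            with guard:
                inside += 1
                peak = max(peak, inside)
            time.sleep(0.01)          # pretend to work in the critical section
            with guard:
                inside -= 1
        # leaving `with sem` increments the counter again

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    print("peak concurrency:", run_workers(n_workers=8, slots=3))
```

The reported peak never exceeds `slots`; with `slots=1` the semaphore behaves like a binary semaphore.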

Models (see notes above)

kernel threads
user threads
hybridization
scheduler activations paper

process scheduling

metrics to consider

responsiveness (time from submission to first response), submission to completion, wait time (sum of the time spent on the ready queue), throughput (jobs completed in a chunk of time), turnaround (from start to finish)

user-centric: response, wait, turnaround
system-centric: throughput, utilization

preemptive v.s. non-preemptive

non-preemptive: the scheduler can't knock a process off of the CPU until the process yields. preemptive: the scheduler can (e.g. when a time slice expires)

fair scheduling

CPU is equally distributed between users or groups rather than among processes

memory management

working set

set of pages needed while running

thrashing

when the working set doesn't fit in memory, so the OS spends more time paging than executing

allocating memory (contiguous vs. non-contiguous)

contiguous allocation maps the address space directly onto physical memory through a base and offset; non-contiguous allocation (like paging) allows individual pages to be loaded w/o loading the entire address space at once.
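
The two translation schemes can be contrasted in a few lines. This is a sketch only: it assumes the base is a physical-memory address, uses a dictionary as a stand-in for the page table, and picks a 4096-byte page size; all names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def translate_contiguous(vaddr, base, limit):
    """Base-and-limit translation: the whole address space is one block."""
    if not 0 <= vaddr < limit:
        raise MemoryError("address outside the process's region")
    return base + vaddr


def translate_paged(vaddr, page_table):
    """Paged translation: page_table maps page number -> frame number."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table.get(page)
    if frame is None:
        # Only this page is missing; the rest of the space can stay out.
        raise LookupError("page fault: page %d not resident" % page)
    return frame * PAGE_SIZE + offset


if __name__ == "__main__":
    print(translate_contiguous(100, base=5000, limit=4096))  # 5100
    print(translate_paged(4100, {0: 7, 1: 2}))               # 8196
```

With paging, neighbouring virtual pages may land in unrelated frames, which is what lets the OS load them one at a time.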

address space protection

gained through paging or segmentation

segmentation

like in the Multics paper

variable length
semantics (program or data)
permissions like on files
potentially with a directory structure

paging

allocation, selection, levels of caches, replacement

fixed size
less semantics than segments
mapping pages to disk
page faults are resolved as high up the cache hierarchy as possible
LRU, stuff like that
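
As one example of a replacement policy, LRU can be simulated by counting page faults over a reference string. This is a minimal sketch with a hypothetical `lru_faults` helper; real kernels only approximate LRU, since exact tracking of every access is too expensive.

```python
from collections import OrderedDict


def lru_faults(references, frames):
    """Count page faults for a reference string under exact LRU."""
    resident = OrderedDict()  # pages in memory, least recently used first
    faults = 0
    for page in references:
        if page in resident:
            resident.move_to_end(page)        # hit: now most recently used
        else:
            faults += 1
            if len(resident) >= frames:
                resident.popitem(last=False)  # evict least recently used
            resident[page] = True
    return faults


if __name__ == "__main__":
    refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2, 1, 2, 0, 1, 7, 0, 1]
    print(lru_faults(refs, frames=3))  # 12
```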

copy-on-write

p.325 Dinosaur book

memory-mapped IO

map a section of memory to a place on disk, and all you have to do is write to memory. copies part of disk to ram

this requires explicit handling in the user-level application. initial system call to set it up (open/close)
faster, only have to write to memory (and it will later be written to the mapped portion of the disk)
there is an explicit system call to sync to disk
might be asynchronous
slower for changes to propagate to disk
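
The points above can be seen with Python's standard `mmap` module, which wraps the underlying mapping system call. This is a sketch under assumptions: it maps a throwaway temporary file, `mmap_roundtrip` is a hypothetical helper, and the payload must be non-empty because an empty file cannot be mapped.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload):
    """Write payload (non-empty bytes) to a file through a memory mapping."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, bytes(len(payload)))  # size the file before mapping it
        with mmap.mmap(fd, len(payload)) as region:  # set up the mapping
            region[:] = payload  # plain memory write, no write() call
            region.flush()       # explicit sync of the mapped pages to disk
    finally:
        os.close(fd)
    try:
        with open(path, "rb") as f:  # read back through the normal file API
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(mmap_roundtrip(b"hello mmap"))
```

Without the explicit `flush()`, the OS is still free to write the dirty pages back later, which is why changes can be slower to reach the disk.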

miscellaneous

reliability
scalability (clients, processors, resources, etc…)

weak scaling
increase the workload as resources increase (time stays constant)

strong scaling
decrease the time as resources increase (workload stays constant)
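
Both kinds of scaling are usually reported as an efficiency between 0 and 1. The sketch below uses the common textbook definitions; the function names and the timings in the example are made up for illustration.

```python
def strong_scaling_efficiency(t1, tn, n):
    """Fixed total workload: ideal time on n resources is t1 / n."""
    speedup = t1 / tn
    return speedup / n


def weak_scaling_efficiency(t1, tn):
    """Workload grows with n: ideal time on n resources stays at t1."""
    return t1 / tn


if __name__ == "__main__":
    # Same job: 100 s on 1 processor, 50 s on 4 processors.
    print(strong_scaling_efficiency(100.0, 50.0, 4))
    # Four times the work on 4 processors takes 12.5 s instead of 10 s.
    print(weak_scaling_efficiency(10.0, 12.5))
```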

Advanced Operating Systems

Table of Contents

Meta

class notes

2009-08-25 Tue

2009-08-27 Thu

concepts

overview

virtual memory

working set / footprint

design goals

papers

observations

hints

2009-09-01 Tue

2009-09-03 Thu

file systems

2009-09-08 Tue

processes

scheduling

threads

inter process communication

2009-09-10 Thu

address space == process

paging vs. contiguous allocation

1st v.s. 2nd generation µ-kernels

interrupts (µ-kernel is slower)

top-half / bottom-half interrupts (in linux)

system call mechanisms

scheduling

translation look-aside buffer (TLB)

dual space mistake

co-location

2009-09-17 Thu

Project Ideas

exokernels

secure bindings

pain to write to

µ-kernel vs. exokernel vs. monolithic-kernel

downside (cooperation)

2009-09-22 Tue

monitors

Mesa Monitors

deadlock

2009-09-24 Thu

scheduler activations

2009-09-29 Tue

lottery scheduling

resource containers

memory management

relocation

allocation

2009-10-01 Thu

2009-10-06 Tue

disco (implementation & performance)

virtual memory

Multics

VAX

2009-10-08 Thu

VM pros and cons

2009-10-20 Tue

disks

file system

2009-10-22 Thu

going through the midterm

2009-10-27 Tue

Network Files System (NFS)

2009-10-29 Thu

RAID / LFS

2009-11-03 Tue

LFS and RAID

CODA

2009-11-05 Thu

general consistency

Munin

2009-11-10 Tue

Munin implementation

Quicksilver

2009-11-12 Thu

Quicksilver