## Meta

grading: participation 10, homework 25, project 25, midterm 22.5, final 22.5

## class notes

### 2009-08-25 Tue

• same course path as undergraduate OS, just much more detail
• all reading in research papers
• groups
  • all work
  • one day's lecture
• research-lite project
  • proposal in ~3rd week of Sept.
  • 1-2 months working on implementation
  • will produce a research paper
• undergrad lectures serve as good background for the course

### 2009-08-27 Thu

email server was down, try again or send email to Dorian

#### concepts

##### overview
OS
an OS defines policies (what the user can do) and mechanisms (how policies are enforced)
permission levels
often controlled by an indicator bit
• user
• kernel
system call
1. parameters stored in registers
2. switch to kernel mode
3. execute routine defined in the kernel
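
for concreteness, a minimal sketch of those steps on Linux (assumes the `syscall(2)` wrapper is available; the libc `write()` wrapper normally hides all of this):

    /* invoke the write system call directly: syscall() places the
       parameters in registers, executes the trap instruction that
       switches to kernel mode, and the kernel routine runs */
    #include <unistd.h>
    #include <sys/syscall.h>

    int main(void) {
        const char msg[] = "hello from a raw system call\n";
        long n = syscall(SYS_write, 1, msg, sizeof msg - 1);
        return n < 0;   /* non-zero exit if the call failed */
    }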
##### virtual memory

virtual address space which can be mapped to actual memory. this allows the process using the memory to be loaded/unloaded/moved etc…

if a page of virtual memory is not in physical memory a page fault occurs and the page is loaded into physical memory

##### working set / footprint

the working set of a process is the parts of its address space currently in use; these are the pages that need to be resident in physical memory to avoid constant page faults (thrashing).

#### design goals

efficiency
often a tradeoff between time and space
robustness
fulfills the expectations of users
security
hardware interface
expose features/capabilities of hardware
user interface
present features/capabilities to user
portability
target hardware
economics
development cost, user base
scalability
range of supported hardware/user sizes/numbers
extensibility
ability to support new components

#### papers

keep in mind the context

• users back then were developers
• the CPU used to be the bottleneck, now it's memory

increasing gap between CPU speed and the ability of memory and bandwidth to keep up.

• bandwidth is proving to be the limit on the amount of memory which can be used efficiently

### 2009-09-01 Tue

Stevens, Ritchie, and Thompson developed unix and TCP/IP

• Stevens
• part of the unix team
• wrote the unix bible

### 2009-09-03 Thu

uni-programming
only one process at a time, typically it would run to completion
batch-programming
still uni-programming but you maintain a queue of processes that are ready to run
multi-programming
allows multiple processes to run "simultaneously" on the machine using preemption, time slicing and by utilizing different hardware components in parallel
time sharing
multi-programming with multiple users creating processes. these days, tend to make the most sense for large batch processes, rather than interactive use

Mechanisms for multi-programming

context switch
switching processes on a CPU
process table
maintained by the OS, this contains an entry including a process control block for each process currently running on the system
process control block
contains the PID, the files the process is using, the program counter, register values, a pointer to the image
image
when you are about to run a process you load it, which creates an image of the process and brings it into memory. image + state = process. program -> image -> process
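
a hypothetical process control block as a C struct (field names invented for illustration; real kernels differ):

    #include <stdint.h>

    #define MAX_FILES 16

    struct pcb {
        int        pid;                    /* process ID                 */
        int        open_files[MAX_FILES];  /* files the process is using */
        uintptr_t  program_counter;        /* where to resume execution  */
        uintptr_t  registers[32];          /* saved register values      */
        void      *image;                  /* pointer to the image       */
        struct pcb *next;                  /* link in the process table  */
    };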

#### file systems

file
in general to the OS a file is just an uninterpreted raw ordered set of bytes, some specialized OSs do differentiate between file types for optimization.
directory
list of files, most OSs limit access to these files to system calls

mechanisms

• filename relates to an index # which points to the index table which relates the index # to an i-node
• when mounting a new disk the first couple of bytes on the disk contain the information used by the OS to populate the index table
• generally corrupt disks are the result of damage to this metadata section at the front of the disk

hard link
the contents of the file is a pointer to the i-node of the file
soft link
the contents of the file is the name of the file to which it is pointing

deleting a file

• the actual data isn't "erased", rather the link counter in the i-node is decremented, and if there are no more links, then the blocks on which the file is written are added to the free list
• deleting a soft link doesn't change its target's i-node
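
a sketch of the unlink logic described above (types and helper names are hypothetical):

    struct inode { int link_count; /* ... block pointers ... */ };

    void free_blocks(struct inode *ino);  /* hypothetical: returns the inode's
                                             blocks to the free list */

    void unlink_file(struct inode *ino) {
        ino->link_count--;                /* one fewer hard link           */
        if (ino->link_count == 0)
            free_blocks(ino);             /* no links left: reclaim blocks */
    }
    /* deleting a soft link only removes the file that holds the target's
       name; the target inode's link count is untouched */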

### 2009-09-08 Tue

processes
unit of work the user gives to the OS
threads
finer unit of work inside the process

#### processes

schedulers

batch queue
outside the OS, waiting jobs
short term scheduler
many different scheduling policies
• round robin
• priority
• shortest-first

#### scheduling

scheduling policy
decides which process runs next

dispatcher
implements the scheduling policy
goals
• timing
  • responsiveness: time to first response
  • waiting: total time spent waiting
  • turnaround: time start to finish
• resource utilization

policies

fifo
first in first out
round robin
move around giving everyone time slices (see the sketch after this list)
shortest job first
theoretical, not usable in real life (requires knowing run times in advance)
priority
not a complete policy, combined with fifo or round robin etc…
multi-level queues
semantics added to queues (i.e. system, interactive, batch, IO-bound, etc…)
multi-level feedback queues
jobs can change priority over time, based on things like increasing the priority of long-waiting jobs to avoid starvation
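
a minimal round-robin dispatch loop, to make the time-slice idea concrete (all names hypothetical):

    struct proc { int pid; struct proc *next; };  /* circular ready list */

    void run_for_quantum(struct proc *p);  /* hypothetical: dispatch p and
                                              preempt it when the slice ends */

    void round_robin(struct proc *ready) {
        for (struct proc *cur = ready; cur; cur = cur->next)
            run_for_quantum(cur);  /* each process gets one slice per lap;
                                      on a circular list this loops forever */
    }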

threads
multiple processes sharing an address space

each thread would have its own (see the pthreads sketch at the end of this section)

• thread ID
• stack
• registers
• handling of signals like C-c, segmentation violations, etc…

everything else is shared (text, static data, dynamic data)

user-level threads (many-to-one)
don't really speed up execution, but could help to modularize a program
• thread operations (creation, deletion) are performed in user-space which makes them faster than having to do them in kernel space
• no real parallelism (a blocking call blocks the whole process)
kernel-level threads (one-to-one)
• real parallel execution
• less portable
• takes longer to create a thread
many-to-many (hybrid)
static vs. dynamic
the number of kernel threads can be fixed (most often you have a fixed number of kernel threads) or can change over time
pool
you can establish a pool of kernel threads, then do all further thread operations in user-space (faster)
limit
you can limit the number of kernel threads to something reasonable (number of processors), reducing overhead on the OS and trips across the user|kernel boundary
complicated
this is the most complex of the threading schemas
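
a small pthreads example of the per-thread vs. shared state described above (assumes POSIX threads):

    #include <pthread.h>
    #include <stdio.h>

    int shared = 0;                  /* static data: shared by all threads */

    void *worker(void *arg) {
        int local = *(int *)arg;     /* lives on this thread's own stack    */
        shared += local;             /* unsynchronized: racy (see IPC notes) */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        int a = 1, b = 2;
        pthread_create(&t1, NULL, worker, &a);
        pthread_create(&t2, NULL, worker, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("shared = %d\n", shared);
        return 0;
    }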

#### inter process communication

through sharing or message passing. each of these could be implemented in terms of the other

in both cases

• synchronous == blocking; asynchronous == non-blocking

message passing

• µ-kernels can be thought of as using message passing
• messages typically have to pass through kernels
• can cross machine boundaries

shared memory

• monolithic kernels can be thought of as using shared memory
• typically faster than message passing
• requires shared physical media
• when two processes try to access a variable at the same time:
  1. I process and write
  2. you process and write
  3. we've both missed the other's changes
• solutions: atomic operations, locks, semaphores (binary, counting), etc… (see the sketch below)
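
the lost-update race above, and the lock that prevents it (pthreads sketch):

    #include <pthread.h>

    int balance = 0;
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *deposit(void *arg) {
        pthread_mutex_lock(&lock);    /* without the lock, two concurrent  */
        balance = balance + 1;        /* read-modify-writes can interleave */
        pthread_mutex_unlock(&lock);  /* and one update is silently lost   */
        return NULL;
    }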

### 2009-09-10 Thu

L4 kernel has hierarchical address space

• every process inherits address space from a parent, and the initial address space (sigma-0) maps directly to physical memory
• this is like monolithic unix, where everything descends from the init process

#### paging vs. contiguous allocation

contiguous allocation
has base and limit registers which are used to map virtual addresses to physical memory
paging
pages of memory map anywhere (not contiguously) into physical memory
• more complicated translation from virtual to physical addresses
• allows you to fill holes in memory (finer granularity, because physical memory is consumed in page-size chunks rather than whole-address-space chunks)
• allows portions of the address space to be loaded individually, as opposed to contiguous allocation where the entire address space must be loaded before any execution can take place

#### 1st v.s. 2nd generation µ-kernels

• second tailored more to hardware
• second built from scratch (rather than pared back from monolithic kernels)

#### interrupts (µ-kernel is slower)

main reason µ-kernels are slower is because every interaction is translated through IPC which has to go through the table

monolithic kernel
hardware interrupts in a monolithic kernel are directly looked up in a register table
µ-kernel
one thread waiting for each potential interrupt source

#### top-half / bottom-half interrupts (in linux)

tradeoff between speed of handling interrupts, and need to do significant amount of processing in many cases

top-half
responds quickly and does what needs to be done immediately
• for example it will just record that the interrupt occurred
• has high priority and can interrupt other interrupt handlers
• setup
bottom-half
does the actual bulk of the work
• has lower priority
• service

#### system call mechanisms

.so shared object
can be shared across multiple address spaces
.a static library
statically linked at compile time
trampolines
jumps the code execution to somewhere else, then jumps back

#### scheduling

in L4Linux the normal linux scheduler is used much like a many-to-one thread scheduler: it maps all of the linux "user-threads" to a single kernel thread.

the L4 kernel is scheduled using hard priority round robin

L4 scheduling priority levels:

1. top-half interrupt handler
2. bottom-half interrupt handler
3. kernel (which is the linux server)
4. user

#### translation look-aside buffer (TLB)

Fast associative memory that helps in address translation

It maps a virtual address to a physical address; if you hit in the TLB, then you don't have to walk the page table.

tagged TLB
like a TLB plus information as to which process the address belongs to

in a normal TLB you have to flush all entries in the TLB to clear old mappings; in a tagged TLB you don't need to flush the TLB on context switch. this saves time when quickly switching between processes and back.

#### dual space mistake

tried to facilitate speedy kernel <-> user IPC through shared memory

• space costs (doubling memory usage)
• synchronization costs (takes time)

### 2009-09-17 Thu

#### Project Ideas

• System monitoring stuffs
• DynInst
• KernInst
• PIN
• PAPI (Performance API)
• massively parallel stuffs
• map-reduce
• MRNet large scale group operations
• RPC
• XML-RPC
• task-farming (programming model, issue tasks to the farm and collect results, example SETI-at-home)
• file systems
• encrypted
• process hijacking
• project HAIL
• FUSE (MAC-FUSE)
• Amazon Dynamo

#### exokernels

##### pain to write to

the spirit of the exokernel is that you would normally use abstractions exported by the library OS rather than always having to write your own

##### µ-kernel vs. exokernel vs. monolithic-kernel
• Monolithic kernel

      +------------------------+
      |   S1   S2     S3       |
      |                        |   monolithic kernel
      |      S4    S5          |
      +------------------------+
               ^
               |
              App

• µ-kernel

      +------+        +------+
      | App  |        | App  |
      +------+        +------+
             \          /
           +-------------+
           |  u-kernel   |
           +-------------+
             /          \
      +--------+    +---------+
      | S1     |    | S2      |
      +--------+    +---------+

• exokernel
+----+ +----+ +----+
|App | |App | |LOs |
+----+ +----+ +----+
+-------------------+
|    exokernel      |
+-------------------+
|    Hardware       |
+-------------------+

##### downside (cooperation)

when each application has direct access to the hardware it becomes difficult for applications to cooperate (intelligently share resources), which is routinely done in standard kernels.

### 2009-09-22 Tue

#### monitors

software construct which provides for mutual exclusion around a resource. maintains the invariant that when entering a monitor there is no-one else inside of the monitor.

condition variables
allow communication between processes while avoiding spinning (while(!cond);)
• signal() alerts other processes
• wait() sleep (relinquish cpu) and wait to be signaled
semaphore
semaphores are effectively equivalent to condition variables
• sem.p() waits on the semaphore
• sem.v() signals those waiting on the semaphore
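
the wait/signal pattern without spinning, sketched with pthreads (the `while` re-check anticipates the Mesa semantics discussed below):

    #include <pthread.h>

    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    int ready = 0;

    void consumer(void) {
        pthread_mutex_lock(&m);
        while (!ready)                 /* re-check the condition on wakeup */
            pthread_cond_wait(&c, &m); /* sleep, relinquishing the CPU     */
        pthread_mutex_unlock(&m);
    }

    void producer(void) {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&c);       /* alert a waiting process */
        pthread_mutex_unlock(&m);
    }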

Synchronization Problems

##### Mesa Monitors

Mesa

• programs comprised of modules
• clear API boundary between modules
• public interface
• private procedures

Mesa Monitors

• monitor module
• entry procedures (public interface)
• internal procedures (private procedures)
• external procedures (procedures that require no locking)

Issues

• when in a monitor (in function foo), and you call a function bar in another module, then during the execution of that function you are not in the monitor (bar has no access to the structures of the monitor since it is in another module)
• if you don't release the lock when moving into bar
• you have the risk that something in bar tries to grab the resource protected by the monitor (deadlock)
• you have to unwind and release locks if, say, there is a deep exception
• if you do release the monitor while calling bar you need to
• ensure that you get the monitor back after executing bar
• potentially do cleanup before/after executing bar
• this is a tradeoff between simplicity (monitor per class) and efficiency (monitor per object); the best option really depends on the use case. monitor per class is sort of a strawman
• it is possible for lower-priority processes to run in front of higher-priority processes.
1. p1 acquires l1
2. p2 preempts p1
3. p3 preempts p2
4. p3 tries to acquire l1 (but can't, because p1 holds the lock)

the problem is that p2 will run in front of p3, because p1 can't run and release the lock until p2 has run to completion

priority inheritance
associate a priority with a resource (lock) and the priority of that lock is set to the highest priority of those processes waiting for/on the lock. the priority of the process inside the lock is set to the lock's priority.

Difference between Mesa and Hoare

• in Hoare you are guaranteed that immediately upon signaling of a condition variable the waiting process will receive control, however in Mesa monitors the signal is more of a hint and you are not guaranteed to receive control when a signal is sent.
• Hoare
if(!cond){condition_v.wait()}

• Mesa (must re-check after condition becomes true)
while(!cond){timed_wait()}

• Mesa additions
timeout
a wait can return after a time limit instead of blocking forever
abort
a waiting process can be aborted out of its wait
naked notify
allows hardware interrupt to signal a condition variable without first acquiring the monitor lock. this is more efficient than forcing a device driver to wait for a lock to be released before accessing a monitor.
• this could lead to a problem where a device signals that a resource is free, but the notification is missed by a process which is just switching from !cond to wait()
• note that this only allows the hardware interrupt to signal the condition variable, not to actually touch the resource

deadlock requires

1. circular wait
2. mutual exclusion
3. no preemption
4. hold-and-wait

### 2009-09-24 Thu

#### scheduler activations

wherever their system does well they present the numbers in a table. when their system doesn't fare so well they embed the numbers in the prose.

virtual processors to real processors
how do virtual processors map to real processors
SMMP
shared memory multi-processors, many CPUs which all have access to a single big block of shared memory
may not really be that much faster than kernel level threads (at least not to the point that this paper claims)

in this paper when they say user-level threads they mean the following model

kernel scheduling
priority levels and equal access for each priority level
lifo
when there are not enough processors to run all threads, then they follow a lifo policy to take advantage of cache locality (if I was running recently then my cache is still around)
critical section
need to be careful not to preempt a thread in a critical section (or at least let it get back quickly)

when a guy in a critical section is preempted and other people are waiting for his lock (and they've pushed him down the lifo queue) then you could deadlock (as he can't get up the queue until they finish and they can't finish until he runs)

solution
make a copy of each critical section which ends in a jump to an upcall. when the kernel preempts a process the kernel checks if the code is in a critical section, and if so, it jumps to the copy which is guaranteed to jump to an upcall when the section completes
spin lock
burns CPU, but keeps a process on the ready list, good for short wait, or when you have processor to burn
upcalls
used for the kernel to talk to the user process
• preemption
• blocking
• unblocking
downcalls
when the user-space communicates to the kernel
• more procs
• less procs

### 2009-09-29 Tue

#### lottery scheduling

a proportional-share-scheduling system where each entry (waiting consumer/process) gets some number of tickets, and whenever a resource is to be consumed a lottery is held and the winner's ticket is taken and the winner is placed in control of the resource.

specifics

• actually hold many lotteries at once, forming a queue (rather than a lottery every time quantum)
• processes can give their tickets to other processes (i.e. client server model, client could give tickets to server)
• compensation tickets are given to processes that release the CPU before their time quantum has expired
• more uniform stat distribution with more samples -> smaller quantum leads to more samples
• tickets can be used for any resource
memory management
reverse lottery, when a page needs to be evicted from memory a lottery is held to select page to remove
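
one lottery drawing, roughly as the paper describes it: pick a random ticket, then walk the client list until the running ticket total passes the winner (names hypothetical):

    #include <stdlib.h>

    struct client { int tickets; struct client *next; };

    struct client *hold_lottery(struct client *head, int total_tickets) {
        int winner = rand() % total_tickets;  /* the winning ticket number */
        int sum = 0;
        for (struct client *c = head; c; c = c->next) {
            sum += c->tickets;   /* more tickets => wider winning range */
            if (sum > winner)
                return c;        /* this client gets the resource */
        }
        return NULL;             /* unreachable if totals are consistent */
    }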

#### resource containers

aimed at implementing a web server

relevant metrics for web server

• client metrics
• response time
• throughput
• server metrics
• number simultaneous clients
• quality of service, might want different levels for different clients

resource containers allow the application to group its activities into containers and tell the kernel how to assign resources to each container.

mechanism of resource containers

1. connection comes in and is wrapped in a resource container
2. thread handling that connection is bound to the resource container
3. additional resources (i.e. file descriptor) are bound to the resource container

this can be useful for handling malicious requests (i.e. if they're tagged as malicious on the way in they can be given little/no resources)

#### memory management

handling the speed/capacity tradeoffs of memory while maintaining

• performance
• protection
• correctness
the memory hierarchy (speed increases going up, capacity increases going down):

1. registers
2. cache
3. main memory
4. local disk
5. cloud, remote disk, tape

##### relocation

source code is compiled into code containing relative refs (e.g. module x + offset)

so to change where the code is located in memory you will generally need to reload the code. dynamically relocatable code has its absolute addresses resolved at runtime rather than at load time, so the code can be moved without reloading it.

##### allocation
contiguous allocation
simple (base, limit, attr). makes context switches very simple (the kernel only needs to change the base and limit registers)
external fragmentation
may not have enough contiguous free space
sharing
can't share w/o sharing entire address space (no portions)
setting attributes
same as above, can't identify parts of the space
segmentation allocations
divide address space into segments of arbitrary size. segment number -> (base, limit, attr)
external fragmentation
because with variable length sizes there could be many free spaces which aren't big enough to be used
paging
(most popular) fixed size segmentation. this ensures that there is no external fragmentation (if there is any space available then it is page sized and can be used). this is still vulnerable to internal fragmentation

page table
maps virtual page numbers to physical frames
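
a sketch of the translation a page table supports (assumes a flat single-level table and 4 KB pages):

    #include <stdint.h>

    #define PAGE_BITS 12                 /* 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    uintptr_t translate(uintptr_t vaddr, const uintptr_t *page_table) {
        uintptr_t vpn    = vaddr >> PAGE_BITS;       /* virtual page number */
        uintptr_t offset = vaddr & (PAGE_SIZE - 1);  /* offset within page  */
        uintptr_t frame  = page_table[vpn];          /* physical frame #    */
        return (frame << PAGE_BITS) | offset;        /* physical address    */
    }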

### 2009-10-06 Tue

#### disco (implementation & performance)

OS modifications

• drivers for DISCO specific "hardware"
• changes to keep OS from trying to access a small chunk of unmapped memory
• (small) allows the guest OS to request a 0'd page (so the guest OS doesn't have to re-0 a page)
• (disco) interprets the guest OS going into low power mode as the OS yielding the processor

#### virtual memory

##### Multics
• since segments are organized/structured as files they actually didn't have a file system. referencing a segment through its symbolic name is like referencing a file
• seg.tag | address | opcode | external | addressing-mode
seg. tag
selects the base register of the owning segment
external
whether to use the segment tag (if external) or your own base register
• indirect: the address points to another address. happens when you have multiple levels of paging hierarchy.
• indirect address points to 2 36-bit words, the new segment number and the new word number
• reference to external program
• symbolic name -> module name
• symbolic address -> function name or variable name
• linkage segments are added to each process to hold the lookup information for external segments. after an initial reference the number of the link in the linkage segment is used for future references.
##### VAX
2-bit seg. | 21-bit page number | 9-bit offset
segments
system space, program region, control region
program region
user data for the program
control region
kernel data for the program
TLB
the TLB is split in two (system/process), less has to be flushed on context switch
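
decoding that 2 | 21 | 9 layout in C (a sketch; the 9-bit offset implies 512-byte pages):

    #include <stdint.h>

    void split_vax_addr(uint32_t va,
                        uint32_t *space, uint32_t *page, uint32_t *offset) {
        *space  = va >> 30;              /* 2-bit space select           */
        *page   = (va >> 9) & 0x1FFFFF;  /* 21-bit virtual page number   */
        *offset = va & 0x1FF;            /* 9-bit offset within the page */
    }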

### 2009-10-08 Thu

#### VM pros and cons

• pros
• convenience in segmentation and paging
• code portability
• cons
• (time overhead) increased effective memory latency
• (space overhead) maintaining mappings, page tables
• increased complexity

### 2009-10-20 Tue

disks and file systems (see related 481 slides on Dorian's homepage)

#### disks

disks
a stack of platters of concentric circles (or tracks) of sectors, along with a movable arm; in all modern systems there is one arm/head per platter surface. each platter (aside from the top and bottom) has data on both sides.

#### file system

semantics on top of disks

abstractions

• files
• directories

the file system handles

• permissions
• mapping abstractions to disk
• enforcing resource quotas

directory

• just a special file which consists of a list of entries
• directory entry contains: filename, id, inode-#
• certain operations (cd, ls) can only take place on directory files
• organizations (in increasing complexity)
• 1-level directory
• trees (graph with no cycles)
• acyclic graphs (sharing: multiple links to the same content)
soft link: the file just maps to the name of another file (allows dangling pointers)
hard link: actually copies the inode-#; an inode (and the file) is removed when there are no more hard links pointing to the inode. this information is tracked in the inode
• general graphs

filesystem on disk:

• boot control block
• volume control block
• # of blocks
• # free blocks (list)
• directory structure
• starts @ root disk
• filenames, inode-#s
• file table
• maps inode-#s to inodes

when a device is mounted the OS loads the filesystem structures into memory

filesystem in memory:

• mount table
• cache directory structure
• open file table (another cache)
• variations: system wide or per process (know the pros and cons of each of these options)
• caching (pages/contents of the files)

### 2009-10-22 Thu

#### going through the midterm

        mean  med  max  max possible
p1      23    25   30   30
p2      11    12   14   15
p3      19    20   25   25
p4      8.7   10   15   15
total   61    61   84   85

Review of problems (in general on the exam less is more)

1. exokernel: library OSs are linked into the address space of the application, so the getpid call is just a function call in user-space which would not have to cross the user/kernel boundary; this would be faster than the monolithic kernel
2. protection, multiplexing, IPC

3. (two parts)
   1. it is much more complicated to move a process than to move a block of data. if you have many readers/writers of a block of data it may make more sense to move the users to the data rather than moving/replicating the data.
   2. in message-passing structured IPC using copy-on-write can allow pointers to be passed from process -> kernel -> process rather than the actual block of data

### 2009-10-27 Tue

will discuss LFS and RAID on Thursdays

LFS
wanted to improve performance and ended up improving filesystem reliability
RAID
vice versa

#### Network File System (NFS)

remote file access

• pros
• larger file servers (capacity)
• sharing
• robustness / redundancy
• cons
• speed (latency)
• availability
• consistency
• complexity
• NFS specific goals
• 80% speed of local disk
• simple crash recovery
• can repeat operations until success (idempotent). many operations are not naturally idempotent; for example the read operation read(f, out, nbytes) would normally advance a position pointer in the file. in NFS this position must be tracked on the client side and passed as a parameter to the server
• no state on the server
• transparent access
• preserve Unix semantics
• Deployment Issues
• sharing the root file system
• scalability/performance sharing heavy use files (e.g. binaries required on startup)
• made these files local to each individual node
• /tmp files (use the process ID, which wouldn't be unique across different nodes)
• /dev entries of this directory have local semantics which make no sense to access on a remote system
• authentication across machines (need a global system of user IDs, "Yellow Pages")
• concurrency: local locks but no global locks, so two users on different nodes could have their writes to a file interleaved.
• performance: (solution is always caching)
• calls which occur often, but transfer small bits of data, (e.g. getattr which is called by ls, and pretty much every file access, this was initially 90% of the transactions) – so, they just cached attributes, this cache is invalidated every three seconds for files and thirty seconds for directories
• used UDP (User Datagram Protocol, which is unreliable), so if a packet in a RPC is lost they'd just redo the RPC
• really big packets
• read-ahead to try to get blocks before they're needed – this doesn't help for executables with random access patterns

• VFS (virtual file system) abstraction on top of the specific file system used. allows file systems to be plugged in sort of like device drivers
• XDR is used as a canonical data representation, ensuring that when the client and server share objects (ints, arrays, etc…) they serialize their objects into bits in the same way (endianness, float representations, etc…)

### 2009-10-29 Thu

(if we are ever really interested in a paper we could lead that lecture)

disk failures

updates, 3 parts – related to disk failure

• (D) data blocks
• (F) free blocks
• (M) meta-data blocks

disk failure part way through a write could lead to incoherence in the three above. most FS will perform the above in such a way that any inconsistency is a "functional" inconsistency – while space may be wasted everything will still "work".

some crash cases

(D) -> crash
no real problem, just wasted time writing to a block that's still on the free list
(D) -> (F) -> crash
leaked a data block that will not be recovered
(F) -> (M) -> crash
functional problem, file points to whatever was previously on disk (garbage or someone else's old data)

fsck checks that

• all blocks not on free list are in use – referenced by an inode
• all blocks referenced by an inode are not in the free list

journal/log-structured differences

• journal – transactions in progress which can be used to recover from crash/failure
• log structured FS – actually uses the log as the only structure on disk

#### RAID / LFS

writes are buffered in main memory until there is a segment's worth of data to write to disk. this allows the entire segment to be written w/o any seeks, taking advantage of the disk's full bandwidth.

in RAID there is a slowdown factor of N when writing to N disks.

in LFS the checkpoints become the journal

RAID levels (5 and 1 are the only common levels)

• RAID 0: striping across disks
• RAID 1: straight mirrored disks; faster reads as you can read from both disks and whichever returns first wins (best-case seek), but for a write you have to wait for the write to complete on both disks (worst-case seek)
• RAID 2: Hamming code for ECC
• RAID 3/4: single check disk per group
• RAID 5: no single check disk, parity is distributed (large performance increase over RAID level 4)

note: know the basic read/write operations for each level and be able to discuss the performance implications

### 2009-11-03 Tue

#### LFS and RAID

LFS
main point is the caching setup. user <-> cache <-> disk
RAID
don't need to know the names of the specific levels, but should be able to derive the mechanisms for reading/writing, as well as the speed/reliability implications of these mechanisms. RAID can be implemented in hardware or software. be able to extend these concepts (e.g. RAID 7 is …)
RAID 0
block-level striping
RAID 1
simple mirrored disks
• read: could use either disk (faster); for a multi-block read each disk could serve up different blocks
• write: will necessarily use both disks
RAID 5
block-level striping and distributed parity – parity is spread across all disks
• read: will either only touch the specific disk which the block lives on, or will read all disks (including parity) and reconstruct the data
• write: must touch all disks; writes to the disk on which the data will live and to the parity disk, and reads from the other disks to calculate the parity (parity sketch below)
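
the parity arithmetic is just XOR; computing the parity block and rebuilding any single lost block are the same operation (a sketch):

    #include <stddef.h>
    #include <stdint.h>

    /* XOR ndisks data blocks together into the parity block; to rebuild a
       lost block, XOR the surviving blocks together with the parity */
    void xor_parity(uint8_t *parity, uint8_t *const blocks[],
                    size_t ndisks, size_t blocksize) {
        for (size_t i = 0; i < blocksize; i++) {
            uint8_t p = 0;
            for (size_t d = 0; d < ndisks; d++)
                p ^= blocks[d][i];
            parity[i] = p;
        }
    }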

blocks and sectors

block
software construct, typically will be equal in size to either a single sector or multiple sectors
sector
the actual size sections of the physical disk

#### CODA

• callbacks are used in asynchronous operations; they alleviate the need for active probing by allowing the server to alert the client when a change occurs – used in CODA for cache coherence

### 2009-11-05 Thu

#### general consistency

by and large message passing has beaten out shared memory when it comes to distributed computing. MPI is the de-facto message-passing standard; OpenMP is a shared-memory alternative.

typically there is no global clock

strong consistency
(called sequential consistency in Munin paper) any write is immediately visible to subsequent reads
causal ordering
uses communication between processes to determine a global partial ordering
weak consistency
this is not really ever used. makes no guarantees that writes will be visible to future reads
eventual consistency
write will eventually be seen
release consistency
requires data to be visible only at certain synchronization points (i.e. at release or barrier)

#### Munin

Munin – shared program variables are annotated with their access pattern which is used by the OS

barrier
designate a point where you will wait at that point until every other thread gets to that point
split-phase barrier
two checkpoints, everyone can pass the first checkpoint arbitrarily, but no-one passes the second checkpoint until everyone has passed the first

Munin Annotations and Protocol Parameters

annotation          I  R  D  FO  M  S  FI  W
migratory           Y  N  N  N   N  Y
write-shared        N  Y  Y  N   Y  N  N   Y
producer-consumer   N  Y  Y  N   N  Y  N   Y
reduction           N  Y  N  Y   N  N  Y
result              N  Y  Y  Y   Y  Y  Y
conventional        Y  Y  N  N   N  N  Y

Meanings of Parameters

• I: invalidate or update
• R: replicas allowed?
• D: delay vs. immediate
• FO: fixed owner?
• M: multiple writers allowed?
• S: stable sharing pattern?
• FI: flush changes to owner
• W: writable?

Non-functional performance enhancing objects

• ability to map an object to a lock
• ability to explicitly flush changes to an object

Implementation

• maintained a hash table mapping object addresses to their attributes
• copyset was a list of where (which processors) an object currently exists
• delayed update queue (DUQ) to hold updates which will need to be propagated; generally held until a barrier and then sent to everyone in its copyset

question: why only use twins when there are multiple writers?

### 2009-11-10 Tue

#### Munin implementation

• DHT or Distributed Object Directory
• delayed update queue
• page twins: two copies of a page used to find out what the differences are between old/new versions of the page
• distributed locks were effectively a queue, person at the front owns the lock and everyone else is further down the line.

page faults used to track updates

1. write protect pages that process would normally be able to write to
2. when a page faults, allow the write to go through but make a note and maybe update remote copies of the page (sketch below)
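
a POSIX sketch of that write-tracking trick (calling mprotect from a signal handler is not strictly portable; a 4096-byte page size is assumed):

    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define PAGE 4096UL

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        /* round the faulting address down to its page */
        void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE - 1));
        /* ... record that this page is dirty / queue a remote update ... */
        mprotect(page, PAGE, PROT_READ | PROT_WRITE); /* let the write through */
    }

    void track_writes(void *page) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);    /* catch the write fault        */
        mprotect(page, PAGE, PROT_READ);  /* step 1: write-protect page   */
    }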

#### Quicksilver

transaction
collection of operations into a single atomic unit of consistency and recovery. techniques include…
• locks
• mutexes
• semaphores
• monitors
• h/w instructions
• interrupt disabling
commit protocols
some things to be considered as goals
• atomicity
• recovery semantics
• blocking/sync
• communication
two phase commit
coordinator and subordinates; a transaction brackets a series of operations:

    transaction_begin
      1
      2
      3
      ...
    transaction_end

• the coordinator
1. initiates the transaction
2. prepare message is sent to all subordinates
3. subordinates act and respond
4. send commit
• the subordinate
1. upon receipt of prepare message the subordinates reply with either yes or no
2. no -> veto
3. or go to prepared state and update logs and respond yes
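
the coordinator side, condensed into C (message helpers are hypothetical placeholders):

    int send_prepare(int sub);                /* hypothetical messaging */
    int recv_vote(int sub);                   /* 1 = yes, 0 = no        */
    void broadcast_commit(int subs[], int n);
    void broadcast_abort(int subs[], int n);

    int two_phase_commit(int subs[], int n) {
        for (int i = 0; i < n; i++)           /* phase 1: prepare       */
            send_prepare(subs[i]);
        for (int i = 0; i < n; i++)
            if (!recv_vote(subs[i])) {        /* any "no" is a veto     */
                broadcast_abort(subs, n);
                return 0;
            }
        broadcast_commit(subs, n);            /* phase 2: all prepared  */
        return 1;                             /* with logs forced, so commit */
    }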

### 2009-11-12 Thu

#### Quicksilver

locks are used to make an atomic unit out of a series of operations

short lock
would only be held for a single operation inside of a transaction
long lock
could be held for an entire transaction
degrees of consistency (combinations of short vs. long locks for reads and writes):
degree 0 consistency
short write lock and no read lock
degree 1 consistency
long write lock and no read lock
degree 2 consistency
long write lock, and short read lock
degree 3 consistency
long write lock, and long read lock

locks in the context of their DFS (Distributed File System)

• directories
• locks for renaming, creating, deleting
• write lock for dir.entries
• files
• short read locks and long write locks
##### highlights (distinguishing features)

distributed OS using transactions for data consistency

wrapped applications in trivial transactions, so a bad quit would undo all previous changes

in order to share a transaction with another process you would need to fork that process

#### Cluster Based Scalable Network Services

• small unit of fault -> robust
• scalable
• cost effective

BASE

• Basically Available
• Soft state
• Eventual consistency

Condor is another system that finds idle machines and sends them work when work accrues

##### implementation

components of the system

• front end
• http server
• workers
• to provide services
• to hold the results of computation
• report failed services to the manager
• manager
• calculates load and sends requests to the front-end
• receives failure reports from workers

failure peers vs. failure pairs

failure peers
manager watches the front-end and restarts it if it crashes, and vice versa
failure pairs
more generally called hot backups where each component has a backup which can take over if one fails

### 2009-12-01 Tue

cover CFS and do Map-Reduce on Thursday, presentations starting next week

final

• let's try to do a final review outside of class
• final will sprinkle questions over the first half, but will focus on the second half

project

• paper is due at the end of next week 2009-12-11 Fri 22:49
• 10-12 minutes per group – 8-10 slides

#### CFS

lookup
(finger table and successor list) the successor list alone was slow because on average you would have to touch half the servers in the system, so the finger table was added to store the IDs of far-away nodes for quick jumps to distant portions of the circle.
caching & timeout

### 2009-12-03 Thu

#### map reduce

• stream programming: a collection of filters which the data passes through
(diagram: data streams through a DAG of filters F, fanning out to run filters in parallel and then merging back together)


consistency
can handle failures in workers (just aborts if the master happens to fail) by repeating the computation for failed workers. this means that the worker tasks can happen multiple times – so they must be idempotent (i.e. side-effect free). also the computation would need to be deterministic for re-doing of failed nodes to have no effect.
stragglers
only as fast as your slowest worker – so as workers finish, the unfinished tasks are duplicated to idle workers in the hopes that someone new will finish the task earlier
combiner function
can be run on the local map worker to compact the data before it is sent of to be reduced
skipping bad records
when some records continually cause workers to fail they will be skipped
local execution
ideally workers will be selected which are close to the data which they will be analyzing

### Amoeba vs. Sprite

both are truly distributed operating systems, in contrast to most of today's large distributed systems, which have node-local OSs with a global managing agent.

### fast file system

Old FS: (order on disk)

1. superblock
2. inode blocks: direct (first 8 blocks) vs. indirect blocks
3. data blocks: size (initially 512 then up to 1024)

issues with this setup

• inodes not located near the data, so many non-contiguous jumps
• issues with fragmentation
• didn't take advantage of the structure of the disk (too much random access of the file)

New FS:

• collocated inode and file data (in the same cylinder group)
• replicate the superblock information across all cylinder groups (reliability)
• variable block sizes (4k block size has average 2k internal fragmentation)
• split each block into anywhere from 1-8 fragments (powers of two) and managed free space on a fragment (rather than block basis). this can incur bookkeeping and overhead problems (as a file increases in size it may need to be continually copied between fragments and blocks).
• exploit h/w characteristics by trying to adjust the notion of "contiguous" based on the speed with which the disk can move between segments
• collocate directories and files

### VM in Multics

#### goals

1. provide the user with a large virtual memory hiding moving of data between levels, and any machine-dependent stuffs
2. allow procedures to be called by name w/o any need to plan for the storage of the called procedure
3. permit sharing of procedures and data among users subject only to permission restraints (vital to efficient operation in a multiplexed system)

processes and address space stand in a one-to-one correspondence

address space is composed of variable length segments; each segment is either data or procedure, which affects its access permissions.

segments are addressed using a directory structure similar to files.

generalized address
consists of a segment number and a word number; the segment number is based on values of processor registers and is formed differently for procedure/data references
procedure
segment number in the procedure base register + the program counter
data
the segment tag of the instruction selects a base register if the external flag is on; otherwise the segment number is taken from the base register
indirect addressing
the generalized address is used to fetch two 36-bit words, which are combined to form another generalized address. can be nested
descriptor segment
generalized-address -> main-memory translation is done using a two-step hardware lookup
paging
paging of segments allows non-contiguous blocks of main memory to be referenced as logically contiguous generalized addresses

shared access and building upon the work of others are both important goals of multiplexed machines

requirements

• pure procedure segments: execution can't change their content
• symbolic procedure calls without making prior arrangement for the procedure's use
• segments of procedure invariant to recompilation of other segments

implementation

making a segment known
when the segment is called by symbolic name it is added to the caller's descriptor segment and can later be referenced by number
a process's code must be invariant to compilation, so the process will always use a segment's name/path to address it. after the segment is known, its number can be used. a linkage segment holds the information on name/path -> number transformations so that the numbers can be used for known segments w/o changing the contents of the process

### VM in Vax

page number and offset within the page

address space divided into spaces (not segments)

system space
high-address half is system space and is shared across all processes. This contains OS stuff, executive code and protected data.
process space
program region (P0)
low-address half of process space. contains the user's executable program. first page is reserved to cause errors on 0-address references
control region (P1)
high-address half of process space. this region is used to hold process-specific data

each space/region has its own page table

system space page table
in hardware, not swapped on context switch
process tables
in the system-space, are swapped on context switch

#### memory management

paging issues

1. effect of heavy pagers on other processes
2. high cost of startup/restart (by faulting it's way into main memory)
3. increased disk workload of paging
4. processor time searching page lists

pager and swapper

pager
OS procedure resulting from page fault
swapper
separate process which moves pages into/out-of memory

dealing with the above issues

1. the pager deals with this issue by evicting pages from the process which is requesting the new page, so one process won't push out everyone else's pages. also a limit is placed on the number of pages a process can have in memory.
2. the above helps with this as well
3. the VAX clusters the reading and writing of pages to relieve I/O burden on the disk
4. by not having a reference bit (used to mark recently used pages) the VAX system takes load (scanning page tables and setting these bits) off of the processor

when pages are removed they are placed on the free page list or the modified page list, depending on their modified bit (i.e. whether they need to be written back to disk). these lists serve as physical caches for recently removed pages (it is quick to move a page from one of these lists back into the working set).

by caching the modified pages in the modified page list the following speedups are gained:

1. clustered writes (~100 pages on the development system)
2. pages are arranged on the paging file so clustered reads are possible
3. many page writes are avoided entirely

demand zero
when processes require new pages they are created and filled with zeros on demand
copy on reference
when multiple processes share a page, each gets its own private copy the first time it references the page

#### program control of memory

for real-time programs that need explicit memory control

• expand its P0 or P1
• increase its resident set size
• lock (or unlock) pages in its resident set
• create/map sections into its address space
• record its page-fault activity

### Scheduler activations

##### introduction

• user-level threads
  • require no kernel intervention
  • fast (on the order of a procedure call)
  • flexible
  • each thread runs on a "virtual processor" which still has to be multiplexed onto a real processor and interleaved with system calls and kernel stuff, leading to a performance hit
  • sometimes exhibit incorrect performance when I/O is involved
• kernel threads
  • directly map each application thread to a physical processor
  • heavy weight
  • not as restricted (RE: side effects, I/O)

the goal of this paper is to combine user/kernel threads

• common case (no kernel required) perform as user threads
• acts as kernel threads when needs to talk to kernel
• easily customizable
• difficulty is that relevant information is scattered between kernel space and user address space

the approach described in this paper is to give each user-level thread system its own virtualized machine which can have any number of processors.

• kernel threads must implement anything that any reasonable user-level thread system may need (too much overhead)
• when a user-level thread blocks (for I/O, fault, etc…) its kernel thread also blocks
• if we create more kernel threads than there are processors then the OS must make scheduling decisions without any information about the priority / current-task / importance of the related user-level threads
##### design (scheduler activations)

each user-level thread system gets its own virtual multiprocessor

• kernel gives processors to user thread systems
• user thread system has complete control over use of its virtual multiprocessor
• user thread system can tell kernel when it needs more processors
• user thread system only talks to kernel when it needs to
• looks to the application programmer like they are using kernel threads
• communication from the kernel to the user-level thread system may cause it to reconsider its scheduling decisions.
• roles
• serves as the vessel or context of the user-level thread
• notifies user-level thread of kernel event
• stores user-level thread when it's blocked (e.g. for I/O)
• when a thread is stopped
1. the kernel stuffs it into its activation
2. creates a new activation to tell the thread system that the thread has been stopped
3. the thread system removes the thread, and tells the kernel the activation can be re-used
4. the kernel does another upcall giving the newly released scheduler activation (processor) to the thread system to run a new thread on
• there are always as many activations assigned to an address space as there are actual processors
• in the same manner processors are moved from one address space (thread system) to another
• how user-level thread systems keep the kernel informed about their amount of parallelism
• inform kernel when more threads than processors
• inform kernel when more processors than threads
• when a thread is interrupted while in a critical section
  1. the upcall is intercepted and control is given back to the thread until it is out of its critical section
  2. the thread is then put back on the ready queue and the address space is free to respond to the new processor however it sees fit
##### implementation

implemented by tweaking

Topaz
the native kernel threads for the firefly machine
##### performance
• same order of magnitude as plain user-threads
• upcall performance is slow, much slower than normal kernel thread operations
• written on top of existing kernel thread library (not from scratch)
• written in higher level language (not carefully tuned assembly)
• N-body problem
• speedup with more processors
• significant increase over kernel threads
• more robust than fast-threads to lower amounts of memory
##### related ideas

psyche and symunix are both NUMA OSs which provide virtual processors similar to activation contexts.

differences

• both psyche and symunix provide for shared address space between kernel and thread systems
• neither provides the exact functionality of kernel threads (for I/O etc…)
• neither provides efficient system for user-level thread system to notify kernel when it's hungry
##### summary

combine the performance of user-level threads with the functionality of kernel-level threads. this is done by supplying each user-level threading system with a virtual multiprocessor in which the application knows exactly how many processors it has at any one time (and each processor maps to an actual physical processor)

• processor allocation (between applications) is done by the kernel
• kernel notifies address space of events affecting it
• new processor
• less processor
• address space notifies the kernel if it needs more/less processors

### Monitors (2)

#### Monitors: An OS structuring concept

• monitors are procedures or functions called by software wishing to acquire a resource, along with local administrative data

    monitorname: monitor
      begin ... declarations of data local to the monitor;
        procedure procname(... formal parameters ...);
          begin ... procedure body ... end;
        ... declarations of other procedures local to the monitor;
        ... initialization of local data of the monitor ...
      end;

• a procedure will have to wait when the monitor is in use
• when the program is waiting for the monitor, it needs to be sure that after the monitor is released, the very next procedure to execute will belong to itself
• there are multiple reasons that a program will need to wait, so the program will have to set a condition variable to indicate that it is waiting for the monitor

example of a monitor (resource:monitor) with condition variable nonbusy

    single resource: monitor
      begin busy: Boolean;
        nonbusy: condition;
        procedure acquire;
          begin if busy then nonbusy.wait;
            busy := true
          end;
        procedure release;
          begin busy := false;
            nonbusy.signal
          end;
        busy := false; comment initial value;
      end single resource


the above example simulates a boolean semaphore with acquire and release procedures.

##### interpretation

a process inside a monitor may need to signal another process. the signaler must wait for the signaled process to complete; to allow it to proceed afterwards, it increments an urgent counter to indicate that it had control of the monitor and should get it back.

then whenever the monitor is released, the urgentcounter should be decremented and the longest waiting process on the counter restarted.

similarly we need to be able to allow process in monitors to wait as well as signal which could be implemented similarly (with a waitcounter)

given the above, the monitor can be explicitly passed from one process to another, and is only truly released when no processes remain in the chain of explicit passing of control

##### bounded buffer example

two processes running in parallel share a bounded buffer, one is the consumer (eating from the beginning) and one the producer (appending to the end).

the following implements this setup

    bounded buffer: monitor
      begin buffer: array 0..N-1 of portion;
        lastpointer: 0..N-1;
        count: 0..N;
        nonempty, nonfull: condition;
        procedure append(x: portion);
          begin if count = N then nonfull.wait;
            note 0 <= count < N;
            buffer[lastpointer] := x;
            lastpointer := (lastpointer + 1) mod N;
            count := count + 1;
            nonempty.signal
          end append;
        procedure remove(result x: portion);
          begin if count = 0 then nonempty.wait;
            note 0 < count <= N;
            x := buffer[(lastpointer - count + N) mod N];
            count := count - 1;
            nonfull.signal
          end remove;
        count := 0; lastpointer := 0;
      end bounded buffer;

##### scheduled waits

sometimes rather than just selecting the longest-waiting process from a condition variable we would prefer to allow processes to have some priority

##### real world examples
• buffer allocation
• disk head scheduling elevator algorithm
• readers and writers (only writers need exclusive access)
• to ensure writers can access elements, no readers can start while a writer is waiting
• to ensure readers get access, all readers queued during a write are allowed to read before the next write operation begins
• variables
• startread
• endread
• startwrite
• endwrite
• is someone writing
##### conclusion

monitors can be an appropriate structure for an OS with parallel users

#### Experience with Processes and Monitors in Mesa

Lampson and his team seem to make everything harder than it should be

issues

programming structure
must fit monitors into Mesa's module based organization
creating processes
need to be able to dynamically create processes after compile time (adds complications)
creating monitors
need to be able to dynamically create monitors after compile time (adds complications)
wait in nested monitor call
is confusing
exceptions
make Mesa's unwind functionality work well with monitors
scheduling
moving from recommendations to implementation proved difficult
input/output
again moving from theory to practice can be hairy
##### implementation

equal division between

runtime
implements the heavier, rarely used stuff like process creation/deletion
compiler
implements the various syntactic constructs and translated into built-in support procedures
hardware
directly implements the more heavily used stuff like scheduling and entry/exit
##### performance
Construct                   Time (ticks)
simple instruction          1
call + return               30
monitor call + return       50
process switch              60
WAIT                        15
NOTIFY, no one waiting      4
NOTIFY, process waiting     9
FORK+JOIN                   1,100
##### conclusion

integration of monitors into Mesa was harder than anticipated given the amount of literature on monitors and the high level of Mesa, however, much work was done to implement monitors in such a way that they can be used as the sole concurrency construct for an entire OS/language.

##### questions
• wouldn't it also be a problem if I'm in my protected block, and hardware barges in and takes over the resource (breaks the monitor invariant)

### Virtualization

#### Commodity Operating Systems on Scalable Multiprocessors

again cites the size and complexity of modern operating systems as limiting factor, this time in effectively utilizing massively multiprocessor machines.

rather than customize the OS this paper inserts a small virtual machine monitor between the OS and the hardware.

Demonstrated on the Stanford FLASH shared memory multiprocessor, an experimental cache-coherent non-uniform memory architecture (ccNUMA) setup.

##### problem

hardware development moves very quickly, yet people like to bring all of their existing software (which is OS dependent) to this new hardware.

there is a need for quickly porting existing OSs to new hardware as this is the limiting factor in adoption of new hardware setups

##### virtual machine monitors

the virtual machine monitor serves as a thin layer between the hardware and existing commodity OSs (like windows NT or *NIX), exporting to each OS a set of virtualized resources which it is able to manage.

while the machines can communicate through standard external interfaces (NFS, TCP/IP), the monitor is able to efficiently assign resources across machines (i.e. one machine may get more memory if needed, etc…)

with small changes the OSs can explicitly take advantage of the shared memory between virtual systems (e.g. a database could put its buffer cache in shared memory supporting multiple query servers)

the VM takes many burdens off of the OS

• only the VM need scale to the size of the hardware
• the VM can isolate separate OSs protecting from faults
• NUMA memory management
• in general handling hardware quirks
• VM issues
• exception processing
• instruction execution
• memory requirements
• large structure duplicated for each OS (file system buffers)
resource management
the VM does not have high level information about the processing taking place, so it can't distinguish processing which is just the OS's idle loop from important calculations.
communication
looks like different OSs on the same hardware rather than each OS on its own hardware, so
• same file can't be open in two different VMs
• same user can't start multiple VMs
##### DISCO (a virtual machine monitor)

DISCO is designed for the FLASH multiprocessor which consists of a collection of nodes arrayed on a high speed interconnect. each node contains a CPU, memory, and IO devices

Disco Interface

processors
exports a processor of the same type as those used by FLASH. OSs tuned to use disco can directly access some common processor functionality using special load/store instructions.
physical memory
exports contiguous physical memory starting at 0, and handles all the NUMA stuff behind the scenes
I/O devices
provides each OS with the illusion of their own I/O devices. this means disco must intercept all I/O communication. again provides special instructions for disco-aware OSs to bypass this in special cases
• DISCO provides a virtual subnetwork which the machines can use to communicate amongst themselves
##### DISCO implementation

general

• as a multi-threaded shared memory program
• the small code portion of DISCO is duplicated across processors so page-misses are all local
• avoids linked-lists and other structures which perform poorly with caching

virtual CPU

• for speed DISCO direct executes most instructions and only tries to intercept dangerous instructions (like TLB modifications)
• runs in supervisor mode which is between kernel and user mode
• monitor catches traps and simulates them to the VM

virtual memory

• maintains machine-to-physical mapping
• catches VM attempts to update the TLB and uses them to update its own TLB
• downsides which decrease performances
• TLB used for OS code/memory
• TLB flushed between CPU switches

memory management

• tries to be smart
• copies pages to the nodes where they are most used
• duplicates read-heavy pages between nodes that use them
• uses FLASH hardware support for counting cache misses per page and identifying hot pages

I/O devices

• intercepts all devices access
• add special DISCO device drivers into the OS
• DMA map (translates physical to virtual address spaces?)

copy-on-write disks

• multiple VMs can share memory pages
• copy-on-write means that this sharing is transparent to the machines
• copy-on-write only makes sense for writes which will not be permanent or shared between machines
• for user files and persistent disks, DISCO only allows one VM to mount the disk at a time (otherwise sharing goes through a distributed file system protocol like NFS)
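
a toy sketch (not DISCO code) of how copy-on-write keeps the sharing transparent: two VMs reference the same page until the first write, at which point the writer receives a private copy while the other keeps the shared original:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct page { int refcount; char data[64]; };

/* called by the monitor on the first write fault to a shared page */
static struct page *write_fault(struct page *p) {
    if (p->refcount == 1) return p;       /* sole owner: write in place */
    struct page *copy = malloc(sizeof *copy);
    memcpy(copy->data, p->data, sizeof copy->data);
    copy->refcount = 1;
    p->refcount--;                        /* original stays with the others */
    return copy;
}

int main(void) {
    struct page *shared = calloc(1, sizeof *shared);
    shared->refcount = 2;                 /* mapped by vm1 and vm2 */
    strcpy(shared->data, "disk block");
    struct page *vm1 = shared, *vm2 = shared;
    vm1 = write_fault(vm1);               /* vm1 writes: gets a private copy */
    strcpy(vm1->data, "vm1's version");
    printf("vm1: %s / vm2: %s\n", vm1->data, vm2->data);
    free(vm1); free(vm2);
    return 0;
}
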
##### DISCO (commodity OS)

currently supports a version of UNIX (IRIX); most changes to the OS resided in the HAL (hardware abstraction layer)

the special load/store calls mentioned earlier to avoid traps are implemented in the HAL

##### experimentation

all experimentation takes place on SimOS, a machine simulator

##### conclusion

DISCO revives virtual machine monitors as an approach to developing system software for shared-memory multiprocessors, and more generally for new hardware.

DISCO shows that many of the performance limitations of VM setups are no longer an issue (sort of).

although software and OSs are growing in complexity, the hardware interface has remained relatively simple. supporting new hardware through a thin VM monitor such as disco is simpler and easier than rewriting the OS.

DMA
direct memory access: a device transfers data to/from main memory without involving the CPU in each transfer; the CPU sets up the transfer and is interrupted when it completes

### Exokernel

don't hide power!

Allows untrusted user-level applications to have direct access to system hardware. The authors present ExOS, an operating system implemented entirely in user-space libraries.

does this by securely multiplexing hardware resources between untrusted software

many programs have specialized behavior, and their performance is severely hampered by being forced to use general OS abstractions to access hardware

#### library OS

• libraries implementing some part of the OS can be app-specific
• libraries can trust the application (the exokernel will prevent errors from hurting other applications)
• fewer OS-app transitions, since much of the OS (the library) is in the application's address space

#### exokernel requirements

1. track ownership of resources
2. perform access control (guarding usage or binding points)
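
a toy sketch of these two requirements (entirely hypothetical, not Aegis code): an ownership table tracks who owns each resource, and the bind point checks ownership before granting access:

#include <stdio.h>

#define NPAGES 8

static int owner[NPAGES];                   /* duty 1: track ownership (0 = free) */

static int bind_page(int app, int page) {   /* duty 2: guard the bind point */
    if (page < 0 || page >= NPAGES) return -1;
    if (owner[page] == 0) { owner[page] = app; return 0; }  /* free: claim it */
    return owner[page] == app ? 0 : -1;     /* deny all other apps */
}

int main(void) {
    printf("app 1 binds page 3: %d\n", bind_page(1, 3));  /* 0: granted */
    printf("app 2 binds page 3: %d\n", bind_page(2, 3));  /* -1: denied */
    return 0;
}
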

#### revocation

most OSs have invisible revocation of resources, so the application doesn't know when, for example, physical memory is being allocated or deallocated.

exokernels have visible revocation, so that applications can have some say in their allocation, and know when resources are scarce. even when the processor is taken at the end of a time-slice the application is notified.

this is necessary when applications use physical names to refer to resources: they must be notified upon revocation because their names will have to change

sometimes it's nice to allow "good faith" operations to take place before revocation of a resource

other times the exokernel will abort a misbehaving application

#### implementations

Aegis
exokernel
ExOS
Library OS

Aegis

process environments
store the information needed to deliver events associated with a resource to its owner
• exception
• interrupt
• protected entry

#### exceptions

transfers all exceptions to the application except system calls and interrupts

exception handling…

1. saves three "scratch" registers into an agreed upon place
2. loads the exception program counter, the last invalid virtual page address, and the cause of the exception
3. uses exception cause to jump to pre-specified application program counter where processing resumes

features

• very fast
• very simple (because it does not have to differentiate between TLB exceptions and all others)
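
a toy sketch of the dispatch idea in steps 1–3 above (hypothetical names, not Aegis code): the application registers a handler program counter per exception cause, and the kernel-side trampoline jumps through that table:

#include <stdio.h>

enum cause { EXC_TLB_MISS, EXC_DIV_ZERO, EXC_COUNT };

/* per-application table of handler entry points, filled in by the
 * application at startup (the "pre-specified application program
 * counters" mentioned above) */
typedef void (*handler_t)(unsigned long epc, unsigned long badvaddr);
static handler_t handlers[EXC_COUNT];

static void app_tlb_handler(unsigned long epc, unsigned long badvaddr) {
    printf("app handles TLB miss at pc=%#lx vaddr=%#lx\n", epc, badvaddr);
}

/* what the kernel-side trampoline conceptually does: steps 2-3 (loading
 * exception state, jumping to the registered pc); step 1 (saving scratch
 * registers) is elided in this sketch */
static void dispatch_exception(enum cause c, unsigned long epc,
                               unsigned long badvaddr) {
    handlers[c](epc, badvaddr);
}

int main(void) {
    handlers[EXC_TLB_MISS] = app_tlb_handler;
    dispatch_exception(EXC_TLB_MISS, 0x400100, 0x10008000);
    return 0;
}
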

TODO

#### summary

an exokernel eliminates high-level abstractions and focuses purely on securely multiplexing the hardware. a library OS can be built very efficiently upon an exokernel, providing many of the standard OS features in a fast and extensible manner.

by allowing applications direct access to hardware, it is possible for applications to greatly improve their performance compared to running on a traditional OS.

by implementing the majority of the OS as application libraries it is trivial to extend or tailor major components of the OS.

the only downside seems to be that the application has much more to worry about if it wants to take advantage of the potential speedup.

### µ-kernels

This paper aims to show that µ-kernel systems

1. can run modern OS personalities
2. can perform in the same range as normal monolithic kernels
3. that extensions to µ-kernel based systems can be implemented efficiently in user space
4. supports four basic abstractions: address spaces, threads, scheduling, and synchronous inter-process communication

#### intro

• many people think that µ-kernel abstractions are either
too low-level
and these people try to add safeguards or abstractions to help extensions
too high-level
and these people try to make µ-kernel interfaces look like hardware interfaces
• first generation µ-kernels like Chorus and Mach
• evolved from monolithic kernels
• second generation µ-kernels like QNX and L4
• designed from scratch
• more rigorous in pursuit of minimalist design
• experiments
• linux adapted to run on L4
• gives upper performance bound
• compare L4Linux to a linux adapted to the Mach kernel
• insight to µ-kernel functions that affect linux performance
• implemented pipes on top of µ-kernel and compared to native unix pipes
• implemented mapping-related OS extensions
• implemented first part of real time user-level memory management system
• ported L4 to a new processor
• lower-level communication primitive

#### L4 essentials

thread
activity executing inside of an address space
IPC
cross address-space communication is a fundamental µ-kernel mechanism

the initial address space represents physical memory; additional address spaces are constructed by granting, mapping, and unmapping flex-pages of size 2^n. the owner of an address space can grant, map, and unmap its pages to/from other address spaces. user-level pagers handle all address space construction and maintenance.

note
mapping and unmapping pages is like creating and deleting them: a page either is mapped to physical memory or it is not

when there is a page fault, the µ-kernel converts it into an IPC to the pager associated with the faulting thread. the pager and thread have complete control over how to handle the fault, allowing many options for memory management
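
a minimal simulation of that fault-handling loop; the IPC names below are hypothetical stand-ins, not the real L4 API:

#include <stdint.h>
#include <stdio.h>

typedef struct { int thread; uintptr_t fault_addr; } fault_msg;

/* stand-ins for the kernel IPC primitives: one canned fault, then done */
static fault_msg pending[] = { { 7, 0x40001234 } };
static int delivered = 0;

static int ipc_wait(fault_msg *m) {          /* block for the next fault */
    if (delivered >= 1) return 0;
    *m = pending[delivered++];
    return 1;
}

static void ipc_reply_map(int thread, uintptr_t vaddr, uintptr_t frame) {
    printf("map: thread %d, vaddr %#lx -> frame %#lx\n",
           thread, (unsigned long)vaddr, (unsigned long)frame);
}

static uintptr_t alloc_frame(void) {         /* trivial frame allocator */
    static uintptr_t next = 0x100000;
    return next += 0x1000;
}

static void pager_loop(void) {
    fault_msg m;
    while (ipc_wait(&m)) {
        /* the memory-management policy lives here: demand-zero,
         * copy-on-write, swap-in, … this sketch just maps a fresh frame */
        ipc_reply_map(m.thread, m.fault_addr & ~(uintptr_t)0xfff, alloc_frame());
    }
}

int main(void) { pager_loop(); return 0; }
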

I/O ports are handled as address spaces, with device interrupts handled as IPC

exceptions and traps are synchronous to the executing thread, they are mirrored up to user-level

#### linux on L4

as linux now runs on multiple architectures there is a fairly well-defined interface between architecture dependent and independent sections

• architecture-dependent section
• interrupt service routine
• low-level device driver support
• user process interaction
• context switching
• copyin/copyout data between kernel and user spaces
• signaling
• system-call mechanism
• linux uses a 3-level architecture independent page-table scheme
##### L4-linux design/implementation
• fully binary compatible

µ-kernel tasks are used for user processes; linux services are provided via a single linux server running in a separate µ-kernel task.

the linux server
the linux kernel's address space is mapped one-to-one by the underlying pager

### Unix Time Sharing System

• perhaps the most important achievement is demonstration of cheap …

data.each{ |l| puts "|"+l.join(" | ")+"|" }

| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
| 1 | 3847 | 124 | 446 | 39 | 136903 | 100383 | 5966 | 515 |
| 2 | 20030 | 955 | 2430 | 155 | 219031 | 137104 | 13517 | 862 |
| 3 | 73647 | 13612 | 21009 | 1173 | 383096 | 174236 | 28855 | 1611 |
| 4 | 109658 | 21506 | 25827 | 1318 | 341028 | 226312 | 34078 | 1739 |
| 5 | 148674 | 27177 | 31495 | 1519 | 395191 | 281150 | 37515 | 1809 |
| 6 | 148416 | 33376 | 38127 | 1751 | 476689 | 333882 | 45291 | 2080 |
| 7 | 223346 | 37809 | 43699 | 1960 | 525645 | 396762 | 50692 | 2274 |
| 8 | 251356 | 43688 | 53118 | 2312 | 654439 | 454026 | 53991 | 2350 |
| 9 | 234711 | 47452 | 52388 | 2218 | 668008 | 512374 | 57961 | 2454 |
| 10 | 268947 | 50518 | 56947 | 2344 | 756613 | 567916 | 63370 | 2609 |

data.each{ |l| puts "|"+l.join(" | ")+"|" }

| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
| 1 | 3675 | 69 | 325 | 26 | 49456 | 31094 | 2965 | 236 |
| 2 | 15451 | 188 | 1224 | 69 | 54753 | 31353 | 3027 | 171 |
| 3 | 44760 | 5873 | 8952 | 423 | 82646 | 40487 | 12700 | 601 |
| 4 | 46814 | 8432 | 10647 | 439 | 87481 | 47902 | 13865 | 572 |
| 5 | 73662 | 12727 | 14015 | 534 | 136676 | 56542 | 17872 | 680 |
| 6 | 62503 | 14414 | 14475 | 505 | 154784 | 65107 | 17297 | 603 |
| 7 | 116681 | 20178 | 19589 | 649 | 175359 | 76453 | 24407 | 809 |
| 8 | 110105 | 22831 | 21448 | 673 | 195287 | 81819 | 23305 | 731 |
| 9 | 124869 | 25198 | 23156 | 693 | 165885 | 89315 | 25439 | 761 |
| 10 | 157668 | 27586 | 24549 | 706 | 164980 | 96154 | 27432 | 789 |
| 11 | 154270 | 31515 | 27226 | 759 | 189019 | 106003 | 29155 | 813 |
| 12 | 204609 | 39826 | 35900 | 971 | 233421 | 106114 | 34894 | 943 |
| 13 | 168486 | 40721 | 34658 | 912 | 219374 | 120001 | 34546 | 909 |
| 14 | 163194 | 41588 | 33267 | 852 | 248706 | 128874 | 35918 | 919 |
| 15 | 203498 | 45197 | 37336 | 936 | 308278 | 141872 | 39753 | 997 |
| 16 | 213616 | 47945 | 38915 | 954 | 245362 | 147306 | 41478 | 1017 |
| 17 | 232214 | 52437 | 42495 | 1031 | 304720 | 157500 | 44672 | 1083 |
| 18 | 261034 | 58236 | 49930 | 1195 | 298037 | 158504 | 49982 | 1196 |
| 19 | 250611 | 58823 | 46255 | 1083 | 303229 | 172885 | 47975 | 1123 |
| 20 | 279880 | 57325 | 44428 | 1019 | 369985 | 186997 | 48912 | 1122 |

data.each{ |l| puts "|"+l.join(" | ")+"|" }

| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
| 1 | 3675 | 69 | 325 | 26 | 49456 | 31094 | 2965 | 236 |
| 2 | 3603 | 81 | 333 | 19 | 50891 | 33909 | 3094 | 175 |
| 3 | 14977 | 1621 | 3119 | 147 | 63231 | 45225 | 4787 | 226 |
| 4 | 16288 | 3554 | 4621 | 191 | 78503 | 57241 | 5906 | 244 |
| 5 | 21650 | 5059 | 5668 | 214 | 101637 | 69758 | 7882 | 298 |
| 6 | 31288 | 6901 | 6948 | 244 | 115349 | 81085 | 8248 | 290 |
| 7 | 36701 | 8897 | 8525 | 283 | 132428 | 93030 | 10158 | 337 |
| 8 | 42805 | 10986 | 9876 | 311 | 151479 | 104323 | 11902 | 375 |
| 9 | 43571 | 12718 | 10766 | 324 | 168803 | 116987 | 13642 | 410 |
| 10 | 57919 | 15239 | 12198 | 355 | 184954 | 128700 | 14682 | 427 |
| 11 | 55153 | 16664 | 13189 | 372 | 206221 | 141415 | 17527 | 495 |
| 12 | 61766 | 18789 | 14428 | 394 | 230900 | 148623 | 15859 | 433 |
| 13 | 73299 | 20834 | 15409 | 409 | 244328 | 163776 | 19161 | 509 |
| 14 | 68849 | 22847 | 16692 | 433 | 258783 | 175110 | 19890 | 516 |
| 15 | 74255 | 24603 | 17802 | 453 | 267375 | 184825 | 21259 | 541 |
| 16 | 94934 | 27876 | 19536 | 488 | 307184 | 198535 | 25055 | 626 |
| 17 | 90519 | 30494 | 21592 | 532 | 319140 | 210595 | 28265 | 696 |
| 18 | 93456 | 32464 | 22524 | 545 | 341838 | 218002 | 29598 | 716 |
| 19 | 106604 | 36042 | 25485 | 616 | 367055 | 239063 | 40259 | 974 |
| 20 | 116848 | 38833 | 27290 | 654 | 389510 | 257751 | 44943 | 1077 |

#### test – new kernel

only taking stats from the first run, as latt.c already does multiple runs for us and calculates error bars, etc…

results = Dir.entries(File.join(base)).map do |e|
  if e.match(/.*out(\d+).*/)
    [Integer($1)] + File.read(File.join(base, e)).split("\n").map do |l|
      Integer($1) if l.match(/.*?(\d+) *usec.*/)
    end.compact
  end
end.compact

work errorbars

##### frame drops

base = "./project/2.6.31.6_hausmaster-laptop/av/"
results = Dir.entries(File.join(base)).map do |e|
  if e.match(/out(\d+).txt/)
    [Integer($1)] + File.read(File.join(base, e)).split("\n").map do |l|
      (l.match(/V\:(\d+)\:(\d+)/)) ? [Float($1), Integer($2)] : nil
    end.compact.map{|l,r| [100-((r / (l+1))*100)] }.last
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }

| clients | % frames shown |
| 1 | 100 |
| 2 | 99.8543 |
| 3 | 99.4148 |
| 4 | 98.3619 |
| 5 | 97.6762 |
| 6 | 96.6565 |
| 7 | 94.2969 |
| 8 | 93.7795 |
| 9 | 94.5797 |
| 10 | 92.0255 |

#### actually running some tests

base = "./project/bfs"
results = Dir.entries(base).map do |e|
  if e.match(/i(\d+).out/)
    [Integer($1)] + File.read(File.join(base, e)).split("\n").map do |l|
      Integer($1) if l.match(/.*?(\d+) *usec.*/)
    end.compact
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }

base = "./project/bfs"
Dir.entries(base).map do |e|
  if e.match(/i(\d+).out/)
    [Integer($1)] + File.read(File.join(base, e)).map{|l| Integer($1) if l.match(/.*?(\d+) *usec.*/)}.compact
  end
end.compact.each{ |l| puts "|"+l.join(" | ")+"|" }


work errorbars

wakeup errorbars

all on one

##### jeff's results
| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
| 1 | 322 | 11 | 26 | 2 | 40858 | 33434 | 2926 | 234 |
| 2 | 15778 | 3439 | 4858 | 284 | 93302 | 57327 | 7113 | 416 |
| 3 | 31602 | 7273 | 7675 | 381 | 122543 | 81251 | 10231 | 508 |
| 4 | 47619 | 11256 | 10508 | 468 | 146941 | 107658 | 12848 | 572 |
| 5 | 55621 | 15068 | 12915 | 529 | 179079 | 126734 | 17147 | 703 |
| 6 | 78896 | 20132 | 16834 | 652 | 224041 | 153557 | 21634 | 838 |
| 7 | 79376 | 24201 | 18530 | 683 | 261242 | 175972 | 24642 | 909 |
| 8 | 99925 | 28032 | 21225 | 758 | 300960 | 205240 | 32233 | 1151 |
| 9 | 102309 | 32829 | 24116 | 829 | 341542 | 226719 | 36207 | 1245 |
| 10 | 122910 | 38026 | 27450 | 931 | 378846 | 256072 | 41322 | 1401 |

| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
| 1 | 39 | 11 | 3 | 0 | 63212 | 45110 | 3734 | 303 |
| 2 | 37330 | 9829 | 10733 | 648 | 123717 | 75794 | 15559 | 940 |
| 3 | 50036 | 16725 | 14469 | 744 | 166951 | 103040 | 20449 | 1052 |
| 4 | 60771 | 20001 | 16194 | 739 | 196976 | 121333 | 25811 | 1178 |
| 5 | 99898 | 26263 | 20435 | 860 | 222647 | 138132 | 30665 | 1290 |
| 6 | 125911 | 32808 | 24503 | 967 | 276826 | 159907 | 38284 | 1511 |
| 7 | 136318 | 38887 | 27918 | 1040 | 301758 | 173934 | 43273 | 1612 |
| 8 | 168979 | 44304 | 31485 | 1130 | 348513 | 192425 | 52425 | 1882 |
| 9 | 193398 | 49936 | 34993 | 1203 | 376297 | 206254 | 54942 | 1889 |
| 10 | 208970 | 56251 | 39117 | 1304 | 428826 | 219508 | 63024 | 2101 |

work errorbars

wakeup errorbars

##### taylor results
| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
| 1 | 48 | 29 | 11 | 5 | 89117 | 88696 | 356 | 159 |
| 2 | 38317 | 9162 | 14364 | 5078 | 183274 | 157032 | 15264 | 5397 |
| 3 | 19015 | 3625 | 6949 | 2006 | 237982 | 163345 | 37587 | 10850 |
| 4 | 40082 | 7297 | 11816 | 2954 | 297934 | 218751 | 45478 | 11369 |
| 5 | 56967 | 12006 | 19883 | 5134 | 370527 | 299750 | 34033 | 8787 |
| 6 | 197095 | 29112 | 52762 | 12436 | 378163 | 320431 | 53987 | 12725 |
| 7 | 62656 | 18670 | 17292 | 3773 | 438030 | 400260 | 23497 | 5128 |
| 8 | 153654 | 29389 | 44716 | 9128 | 528557 | 417398 | 51049 | 10420 |
| 9 | 135979 | 43330 | 40470 | 7788 | 555059 | 466411 | 78488 | 15105 |
| 10 | 242612 | 42873 | 68167 | 12446 | 628015 | 475800 | 87386 | 15954 |

| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
| 1 | 55 | 29 | 15 | 7 | 89662 | 89019 | 418 | 187 |
| 2 | 4736 | 1230 | 1794 | 634 | 149805 | 144967 | 3243 | 1147 |
| 3 | 7269 | 2684 | 2687 | 776 | 216993 | 202884 | 9159 | 2644 |
| 4 | 17619 | 5140 | 5474 | 1369 | 274191 | 256837 | 13168 | 3292 |
| 5 | 16372 | 6463 | 5778 | 1492 | 326104 | 307699 | 16855 | 4352 |
| 6 | 21754 | 10716 | 7862 | 1853 | 391064 | 360888 | 25385 | 5983 |
| 7 | 22960 | 10366 | 7590 | 1656 | 463452 | 427938 | 30555 | 6668 |
| 8 | 43914 | 16872 | 12730 | 2598 | 511854 | 477101 | 26558 | 5421 |
| 9 | 43543 | 15306 | 10991 | 2115 | 565166 | 534687 | 24529 | 4721 |
| 10 | 32396 | 12250 | 10698 | 2392 | 641921 | 602982 | 38807 | 8677 |

work errorbars

wakeup errorbars

##### results of the initial short run
| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
1322631455214091291694101
24618491145045946845459101374434
33577271881261732586601247203127313287
447612111901399331299277462032206794624
578830248992792655859936449770179193584
6531541511814660267611477069815203333712
7557651226618432348312109859936201653811
8616661724420994371113554073728344086083
9981982976828730478814988680922328445474
101191011892330233478016482368145418516617
| clients | max | avg | stdev | stdev mean | max | avg | stdev | stdev mean |
15225146340523104328821177
21213531932951304271652477
379621149250864844959410412498645
41215520683109695685605628386851942
51358447954967993758266614888121762
621760660172031315860057320687181592
7192907422667611281075458764988811501
8422661062510161160711043692718140272218
94044513647113351690134468100833180102685
10310401366110206161413617711669391881453

#### building the kernel

##### initial build
1. cd into the kernel directory
2. copy your local configuration into the kernel config
cp /boot/config-$(uname -r) ./.config

3. run make menuconfig, select the "load configuration" option, load the .config file, and then exit

4. now you can try to make the kernel with make
5. install the build tools and header files
sudo apt-get install build-essential linux-headers-2-...


6. it still didn't work, so I switched to the unstable debian repos (replacing "lenny" with "unstable" in /etc/apt/sources.list)
7. with unstable I installed libc6-dev and tried again
8. now missing zlib instead of eventfd.h
9. installing zlib
sudo apt-get install zlib1g-dev


10. make the kernel with make. this spits out the following error message, but seems to succeed regardless
make[1]: *** No rule to make target `just'. Stop.
make: [Just] Error 2

11. now make the Debian kernel package (presumably the same fakeroot make-kpkg invocation as used for the bfs build below)

12. install the resulting .deb file
dpkg -i linux-image......


13. rebooted using the new kernel and it worked
##### bfs patch

Applied the BFS patch

2. applied it
patch -p1 < bfs-patch...


##### secondary build
1. make the BFS-patched kernel
fakeroot make-kpkg clean
fakeroot make-kpkg --initrd --append-to-version=-bfs kernel_image kernel_headers


2. install the resulting kernel
sudo dpkg -i linux-image....bfs...deb


History of the linux kernel

Linux test suite

CFS

Linus on CFS vs SD
• http://kerneltrap.org/node/14008

• CFS design document
• SD has bounded latency; CFS can't guarantee latency as well as SD can. SD allows one to set the exact scheduling priority of everything and it is always respected; as there is no interactive renicing, it is very predictable.

Brain Fuck Scheduler

• Testing this scheduler vs CFS with the test app "forks" which forks 1000 tasks that do simple work, shows no difference in time to completion compared to CFS.  That's a load of 1000 on a quad core machine.

Scheduler Benchmarking

• Hackbench benchmarking program

• The 'latt' test app recently written by Jens Axboe is a better source of simple-to-understand and useful numbers.

• 3D Smoothness testing

#### other schedulers to implement

##### lottery scheduler

seems nice; has a nice math/stat background
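
a minimal user-space sketch of the lottery idea (after Waldspurger & Weihl; all names here are made up): each process holds tickets, each quantum a winning ticket is drawn uniformly at random, and over time each process's CPU share converges to its ticket proportion:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct proc { const char *name; int tickets; int quanta; };

static struct proc procs[] = {
    { "interactive", 60, 0 },
    { "batch",       30, 0 },
    { "background",  10, 0 },
};
#define NPROCS (sizeof procs / sizeof procs[0])

static struct proc *draw_winner(void) {
    int total = 0;
    for (size_t i = 0; i < NPROCS; i++) total += procs[i].tickets;
    int winner = rand() % total;          /* the winning ticket number */
    for (size_t i = 0; i < NPROCS; i++) {
        winner -= procs[i].tickets;       /* walk until the ticket falls in range */
        if (winner < 0) return &procs[i];
    }
    return NULL;                          /* unreachable */
}

int main(void) {
    srand(time(NULL));
    for (int q = 0; q < 100000; q++)      /* simulate 100k quanta */
        draw_winner()->quanta++;
    for (size_t i = 0; i < NPROCS; i++)   /* expect roughly 60/30/10% */
        printf("%-12s %d tickets -> %.1f%% of quanta\n", procs[i].name,
               procs[i].tickets, 100.0 * procs[i].quanta / 100000);
    return 0;
}
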

##### GA scheduler

somehow evolve different scheduling algorithms

#### testing suites

Scheduler Benchmarking

#### DONE project proposal

• 2-page proposal/description
• motivation
• novel
• solving problem
• test (conventional wisdom)
• measuring
• comparing
• objective
• background
• related work
• literature
• methodology
• approach
• hypothesis
• validation
• challenges
• make sure reasonable for time span
• make sure we have resources
• expected results / impact
• 1-3 people group
• would prefer hardcopy, but a PDF is fine
• project need not be completely defined, but should touch on potential sticking points

#### outline / topic

CPU scheduling http://kerneltrap.org/node/14008

Motivation - learn about kernels, proper testing environments, and scheduling policies and mechanisms.

Objective - compare the CFS and SD schedulers from 2007, identifying and quantifying their differences.

Hypothesis - As indicated in the discussion between Linus Torvalds and Kasper Sandberg, we expect each of the CFS and SD schedulers to perform better in certain niches.

Methodology - Use existing methodology to test responsiveness and throughput (read pg. 704).

Challenges - Setting up a valid testing and development environment. Development and testing will most likely happen in different environments (VM vs. physical machine). Putting together a good test suite covering different types of usage. How to evaluate performance while the system is running. How does our choice of hardware affect the outcome of the results (choosing the hardware model that best …)?

#### composition (challenges)

Challenges
• testing and development environment
• most likely different environments for development and for testing
• VM, kernel module, algorithmic simulation
• test suite
• define what is meant by "interactive" use
• tailored to the particular aims of our investigation
• popular (so our results can be compared to others)
• how to perform a "live" evaluation of the performance
• Heisenberg uncertainty principle
• impacts of hardware on results
• resources
• hardware
• test suite

#### note

Also, something to note about the history of linux schedulers is that the SD scheduler was never merged into the mainline kernel. The predecessor to CFS was the "O(1) Scheduler." The SD scheduler was more of a contemporary competitor to the CFS that lost out.

#### final

##### intro

The release of the Completely Fair Scheduler (CFS) in 2007 sparked significant debate on various Linux kernel mailing lists and forums. Compared with the SD scheduler, which used run-queues, CFS utilizes a time-ordered red-black tree. While CFS implemented a “radical” shift in data structures, the benefits are not immediately visible. In several instances the SD scheduler was reported to handle 3D gaming better, providing a smoother display to the user. SD was viewed as a reference point in the development of CFS, yet it seems the decision to include CFS in the mainline was partially political. As Linus Torvalds was quoted, “[A] person [Ingo] who can actually be bothered to follow up on problem reports is a hell of a lot more important than one who just argues with reporters [Con]”.
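
A toy illustration of the data-structure shift described above, assuming nothing beyond the public CFS design discussion: each task accumulates weighted virtual runtime, and the scheduler always runs the task with the smallest vruntime (CFS keeps tasks in a red-black tree and picks the leftmost node; a linear scan stands in for the tree here):

#include <stdio.h>

struct task { const char *name; double vruntime; double weight; };

static struct task tasks[] = {
    { "editor",  0.0, 2.0 },   /* higher weight => vruntime grows slower */
    { "compile", 0.0, 1.0 },
    { "daemon",  0.0, 1.0 },
};
#define NTASKS (sizeof tasks / sizeof tasks[0])

static struct task *pick_next(void) {         /* "leftmost node" in CFS */
    struct task *min = &tasks[0];
    for (size_t i = 1; i < NTASKS; i++)
        if (tasks[i].vruntime < min->vruntime) min = &tasks[i];
    return min;
}

int main(void) {
    const double slice = 10.0;                /* ms of real time per pick */
    int runs[NTASKS] = {0};
    for (int tick = 0; tick < 4000; tick++) {
        struct task *t = pick_next();
        t->vruntime += slice / t->weight;     /* charge weighted runtime */
        runs[t - tasks]++;
    }
    for (size_t i = 0; i < NTASKS; i++)       /* editor should get ~2x slices */
        printf("%-8s weight %.0f -> %d slices\n", tasks[i].name,
               tasks[i].weight, runs[i]);
    return 0;
}
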

Our objective is to analyze the differences between the two methods of scheduling (including patched versions) and to determine the possible benefits of using one system over the other. This implies a wide range of testing procedures in order to provide a balanced perspective on the debate. A secondary goal is to gain first hand experience with kernels, proper testing environments, scheduler policies and mechanisms.

We hypothesize that the performance of early versions of the CFS scheduler does not match that of SD, but that through tweaking and applied patches CFS surpasses SD in performance.

##### methodology

Testing the schedulers will require modifying the Linux kernel. We will investigate modifying the kernel on two different levels:

• The first is to implement schedulers as individual kernel modules. This way is preferred, as we would not have to recompile and maintain independent kernels but instead have individual scheduling modules compiled for the same kernel. We could specify which scheduler to use as a boot flag or, ideally, on the fly, if possible.
• If using kernel modules is not possible, then we will be required to compile and install independent kernels for each of the schedulers that we want to test. These will be chosen from at boot time.

The CFS scheduler is presently in the mainline kernel (merged in 2.6.23). Implementing the SD scheduler will require applying patches against the mainline kernel. If we desire to separate the schedulers into individual kernel modules, this will require adaptation of the patches.

After our schedulers are implemented and ready for testing, we will concentrate on devising effective tests and benchmarks with which to evaluate them. We will be evaluating the schedulers according to the following criteria:

CPU utilization
how effectively can the scheduler utilize the CPU
Throughput
the rate at which jobs are completed
Turnaround time
the time it takes to finish a job
Waiting time
the time a job spends in a waiting queue
Response time
the time from a job's submission until it first produces a response

We will research existing benchmarks for testing schedulers and only write our own as a last resort when no other appropriate benchmarks can be found. In addition to artificial benchmarks, we will also perform real world tests, such as listening to music when other processes are hogging the processor and benchmarking games such as Unreal Tournament 2004.

In addition to the above, we are also interested in exploring the following optional paths:

• Testing Kolivas's Brain Fuck Scheduler (BFS), a recent (August 2009) successor to the SD scheduler
• Implementing control group schedulers such as round-robin to become more comfortable with writing our own schedulers
• Experimenting with possible improvements to the schedulers, such as by tweaking parameters
##### challenges

There will be a number of challenges inherent in carrying out our methodology, the first being the establishment of appropriate kernel development and testing environments. Each of these environments has different requirements:

development
A good development environment should allow for a reasonably quick closed testing loop for new code, and should be well protected from the unpredictable and likely harmful side effects of experimental code. Given these restrictions a good development environment will likely be contained inside of a VM, or on an expendable piece of hardware.
testing
A good testing environment should resemble as closely as possible the actual production environment of the kernel. For this reason we will probably test directly on a physical machine rather than in a virtual machine. If a wider variety of hardware is desired than is available, some sort of "simulated" test environment may be required. Such a simulated scheduling environment would allow more flexibility in varying simulated hardware components and the related performance-determining constants, but may yield less faithful results.

Once we have established an acceptable development and testing framework, the next challenge will be the acquisition of a suitable test suite. Two issues related to availability are the possibly prohibitive cost of high-quality "standard" test suites and the potential lack of any widely accepted test suites directed at the particular aims of our study (specifically, scheduler performance over different "types" of load, including interactive use and batch use).

Some tradeoff will have to be made between the amount of information returned by a test suite, $\Delta P$, and the suite's impact on the load on the system, $\Delta L$. A situation similar to the Heisenberg uncertainty principle is expected, where increasing the precision of our knowledge of the system at any point decreases our knowledge of the load, such that the two are only knowable up to some hardware constant $\hbar$:

$$\Delta P \times \Delta L \geq \frac{\hbar}{2}$$

If this tradeoff proves untenable then we may be required to resort to a simulated test environment, or a scheme of partitioning the running system inside of a virtual machine and collecting our metrics from outside of the machine.

### implementation

kernel
2.6.31 (this is what the current BFS patch is against)

## exams [1/3]

in classroom

in CS141

### DONE midterm

• format
• questions like the reading response questions
• essay questions
• topics
• kernel design
• memory management
• virtualization
• test general OS concepts
• care less about specifics, and more about the effects of the mechanisms
• not how did x solve y, rather, how could one solve y

#### topics

##### OS structure
standard monolithic
entire OS is in kernel space
pros
faster (less context switching)
cons
• complexity, size
• less flexible/extensible: can't customize w/o changing kernel-space code
• harder to move to new hardware
• less secure/stable (more low-level components to keep track of)
µ-kernel
only supports basic abstractions (in L4: address spaces, threads, scheduling, and IPC) and pushes the rest of the OS out into user-space servers
pros
• simpler
• easier to move to new hardware
• flexible
• more secure/reliable because of the simplicity of the low-level interface
cons
• slower
exokernel
only does multiplexing of HW resources; the rest of the OS is in user-space libraries. end-to-end argument: the application knows best how to handle its own resources.
pros
• flexible
cons
• no security gains like in µ-kernel
• cooperation
virtualization structures
as example of general system management structure
• fault containment
• porting old OS to new hardware
• slower

understand

• implication of these structures to the performance of the OS
• micro-benchmarks
• macro-benchmarks (applications)
• implication for extensibility of the OS
• separation of protection of resources, mechanisms, policy
virtual memory of a process
process state
multi/batch/time-sharing programming
multi-programming
multiple tasks, can be single user
time-sharing
batch
space sharing rather than time sharing, but the CPU is generally only given to one task at a time. queue of processes
context switch
swap

models of communication

• message passing
• shared memory (on different architectures NUMA, UMA, etc…)

synchronization

monitors
software constructs surrounding critical sections
semaphores (less)
counting or binary; primitive locks. a counting semaphore's value is how many processes can be in the critical section simultaneously; it is decremented as each one enters
mutex
simple binary semaphore (potentially has additional features protecting against priority-inversion)
critical section
section of code which is run inside of a lock, semaphore etc…
condition variables, their semantics
used for IPC; avoid spinning.
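
a minimal pthreads sketch of these semantics: the consumer blocks on a condition variable instead of spinning, and re-checks the condition in a loop because wakeups can be spurious:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
static int item = 0;                /* shared state guarded by lock */

static void *consumer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);      /* enter the critical section */
    while (item == 0)               /* re-check: wakeups may be spurious */
        pthread_cond_wait(&ready, &lock);   /* releases lock while asleep */
    printf("consumed %d\n", item);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    pthread_mutex_lock(&lock);
    item = 42;                      /* produce under the lock */
    pthread_cond_signal(&ready);    /* wake the waiter */
    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    return 0;
}
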

Models (see notes above)

• hybridization
• scheduler activations paper
##### process scheduling
metrics to consider
responsiveness (time from submission to first response), submission to completion, wait time (sum of the time spent on the ready queue), throughput (jobs completed per unit time), turnaround (from start to finish)
user-centric
response, wait, turnaround
system-centric
throughput, utilization
preemptive vs. non-preemptive
non-preemptive scheduling can't knock a process off of the CPU until the process yields; preemptive scheduling can
fair scheduling
CPU is equally distributed between users or groups rather than among processes
##### memory management
working set
set of pages needed while running
thrashing
when the working set doesn't fit in memory; when the OS spends more time paging than executing
allocating memory (contiguous vs. non-contiguous)
contiguous allocation maps the address space onto one contiguous region through a base and offset; non-contiguous (like paging) allows individual pages to be loaded w/o loading the entire address space at once
non-contiguity is gained through paging or segmentation
segmentation
like in the Multics paper
• variable length
• semantics (program or data)
• permissions like on files
• potentially with a directory structure
paging
allocation, selection, levels of caches, replacement
• fixed size
• less semantics than segments
• mapping pages to disk
• page faults are resolved as high up the cache hierarchy as possible
• LRU, stuff like that
copy-on-write
p.325 Dinosaur book
memory-mapped IO
map a section of memory to a region of a file on disk; then all you have to do is write to memory. copies part of disk to ram (see the sketch below)
• this requires explicit handling in the user-level application: an initial system call to set it up (open/close)
• faster: you only have to write to memory (and it will later be written to the mapped portion of the disk)
• there is an explicit system call to sync to disk
• might be asynchronous
• slower for changes to propagate to disk
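
a minimal sketch of the setup/write/sync sequence just described, using the POSIX mmap interface (the filename data.bin is made up; error handling is omitted for brevity):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR | O_CREAT, 0644);   /* setup syscall */
    ftruncate(fd, 4096);                      /* file must cover the mapping */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    strcpy(p, "written through memory");      /* plain stores, no write(2) */
    msync(p, 4096, MS_SYNC);                  /* the explicit sync to disk */
    munmap(p, 4096);
    close(fd);
    return 0;
}
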
##### miscellaneous
• reliability
• scalability (clients, processors, resources, etc…)
weak scaling
increase workload as increase resources (constant time)
strong scaling
decrease time as increase resources (constant workload)

## concepts / terms

### MIPS

originally stood for "Microprocessor without Interlocked Pipeline Stages"

it is a RISC instruction set architecture.

### proportional share scheduling

each entity is given some portion of the resource relative to the other entities' proportions, or relative to the total amount of the resource

each process is assigned some portion relative to the portions of other applications (e.g., with shares 1, 2, and 5, the three processes receive 12.5%, 25%, and 62.5% of the CPU)

### interrupts

p.499 in dinosaur