# Comp 409

Clark Verbrugge • Course Webpage

# Acronyms

AMO At Most One
CAS Compare and Swap
CMP on Chip Multi-Processor
CS Critical Section
CV Condition Variable
IPC Instructions Per Cycle
JLE Jump Less Than
JMP Jump
JNS Jump Not Signed
ME Mutual Exclusion
MP Multi-Processor
NUMA Non-Uniform Memory Access
PC Process Consistency
SC Sequential Consistency
SMP Symmetric Multi-Processor
TS Test and Set
UMA Uniform Memory Access
UP Uni-Processor

# Interfaces

## Semaphore

/**
* Block thread until [count] permits are available before acquiring them
* acquire() = acquire(1)
*/
void acquire(int count)

/**
* Get number of currently available permits
*/
int availablePermits()

/**
* Acquire and return all available permits
*/
int drainPermits()

/**
* Makes [count] more permits available
* release() = release(1)
*/
void release(int count)

/**
* Pseudo: acquire()
*/
void P()

/**
* Pseudo: release()
*/
void V()


/**
* interruption to actually be affected
*/
void interrupt()

/**
* Checks if thread is interrupted
*/
boolean isInterrupted()

/**
* Optionally add long parameter for max number of millis
* to wait
*/
void join()

/**
* Launches thread by making JVM call run()
*/
void start()

/**
* Sleeps current thread for specified time
*/
static void sleep(long millis) throws InterruptedException


## Object

/**
* Wakes up single thread waiting on monitor
* Often times, notifyAll() is preferred
*/
void notify()

/**
*/
void notifyAll()

/**
* Causes current thread to wait until notify()
* or notifyAll() is called from this object
* Optionally specify timeout (long millis)
*/
void wait() throws InterruptedException


# Lecture 1. 2018/01/09

• Parallelism
• Control the ways things happen at the same time
• Concurrency
• Define constraints on execution, which is influenced by other aspects such as the OS
• Things may happen at the same time
• Process — large heavyweight with its own address space and handles
• Efficiently switched — changing threads may just involve changing some pointers, as the rest of the space may be the same
• Asynchronous Execution — threads execute at their own rate (dictated by OS)

• No interaction (independent) — “embarrassingly parallel”
• Share memory (need to consider synchronization, coordination, resource consistency, etc)

• OS Part
• Schedule policy
• Priority
• For this course, wew will assume little or no control (black box)
• User Part
• Register set
• Stack pointer (separate stacks)
• PC
• Shared memory
• Amdahl’s Law
• Total time is based on two pieces: sequential part + parallel part
• Parallel part can be distributed among the $n$ threads: $t = s + p / n$
• Speedup: $\dfrac{1}{(1 — p) + \frac{p}{n}}$
• Hiding latency (cache miss, pipeline stalls)
• Switching context efficiently
• Keeping CPU busy
• Increasing responsiveness
• Eg long running execution + GUI thread for listening for inputs & interrupts
• Appropriate Parallelism
• Web server — requests are naturally parallel
• Downsides
• May be difficult to debug
• Is concurrency fundamentally different?
• We may reason that with Turing Machines, n TMs does not give any more power than 1 TM. However, with parallel execution, there are some subtle differences.
• Example 1 — consider the following conversions & constraints:
a → ab
b → ab
There cannot be any sequence of aa or bb.

If we started with ab, we would not be able to make any conversions. However, if we did both conversions in parallel and ignore the middle transition, we will be able to transition to abab

• Example 2 — consider 5 items spaced evenly in a circle, with equal distance from the center point. Our goal is to rotate them all at the same time. If we were to do it sequentially, there will be a speed for which an item will cross over another. However, if we were to do it all in parallel, which would not be an issue.

# Lecture 2. 2018/01/16

## Hardware

• UP — Basic uniprocessor
• CPU — cache — memory
• Low level parallelism — modern, super-scalar processor, pipelined, multi-issue
• Time-sliced multiprocessor
• Simulated parallelism
• MP — Multiprocessor
• Many [CPU — cache] to shared memory
• UMA — Uniform memory access
• All memory accesses cost the same (modulo cache)
• Note some caches need to reflect the same view on memory
• Caches need to be consistent
• NUMA — Non-UMA
• Many [CPU — cache] to shared memory (slow) & many local memories (fast)
• SMP — Symmetric MP
• Multiprocessor with same CPUs
• Disadvantage — given that lots of programs are single threaded and the CPUs are independent, some of them may be doing wasted or duplicate work
• CMP — on-Chip MP
• CPUs & caches are on a chip
• Faster cache communication
• CMT/FMT — Coarse/fine grained multi-threaded
• CMT — context switch every so many cycles
• Disadv — no real parallelism
• FMT — context switch every cycle
• May be referred to as barrel-processing
• Cray’s TERA was FMT based
• Had no data-cache, meaning data access involves 70 cycles. However, due to the threads, there is no idle
• Not necessarily responsive, but has high throughput
• SMT — Simultaneous MT
• Multiple functional units

Example

T1 T2 T3

Assume each process has 2 x ALU, 2 x FPU, 1 x LSU int/float ops → 1 cycle load → 2 cycles

UP

slot 1 slot 2

SMP/CMP (2 processes)

P0   P1
slot1 slot2 slot1 slot2

### Waste

• Vertical Waste — waste on next cycle
• Horizontal Waste — waste for slots in a cycle
• Low-level instruction-level parallelism is in practice limited

• IPC — instructions per cycle
• Need a good instruction mix
• In practice, IPC ≤ 3

## Atomicity

• Read set — set of variables read from but not written to
• Write set — set of variables written to (and perhaps read from)
• Independence — when each write set is disjoint from all other read & write sets

Example

Consider the case where x = y = z = 0, and where

• Thread 1 executes x = y + z
• Thread 2 executes y = 1, z = 2

• add r3, r1 + r2
• store r3, [x]

Given that no assumptions can be made about the speeds of either CPU, thread 2 may execute its two instructions in between any of the executions of thread 1. So instead of having the desired response of x = 0, thread 1 may product three different outputs.

Usually

• Assignment of constant to machine word sized value is atomic
• Assignments to bigger sizes may be separated

In Java

• Assignment to 32-bit or smaller type is guaranteed atomic
• Includes aspects like x = y + z
• 64-types (long, double, etc) are not necessarily atomic; can be declared as volatile

# Lecture 3. 2018/01/18

For word sized x = y + z, assignment is atomic, but computation of y + z may not be. Note that if no other thread writes to y or z, then the computation will appear to be atomic

An expression has a critical reference if it uses a variable that some other thread changes

Word sized expressions with “at-most-one” critical reference at a time are “effectively” atomic. The constraints are that:

1. An expr with at most one CR is not read by other threads
2. An expr with no CRs may be read by other threads

## Examples

x = y = 0

x = y + 1 y = y + 1

As AMO is satisfied in both cases, there are no unexpected values to be considered. The process will happen with one expression before the other without interleaving

x = y = 0

x = y + 1 y = x + 1
inc r1
str r1, [x]
inc r1
str r1, [y]

Neither satisfy AMO, so there may be interleaving (eg resulting in x = 1, y = 1)

# interleavings need to consider
n threads, each with m atomic instructions $\dfrac{(nm)!}{(m!)^n}$

For n = 2, m = 3 → 20 possibilities

To resolve this, we should only allow 1 thread to execute such changes at one time → critical section

Assume 2 threads with unique ids 0 and 1.
We will go through the stages of init, enter protocol, exit protocol

init()
turn = 0

enter(int id)
while (turn != id)  // spin

exit(int id)
turn = 1 &mdash; id



In this case, given that the turn is initially for thread 0, if thread 1 shows up first and thread 0 has no intention of entering the CS, thread 1 will need to wait unnecessarily.

init()
flag[0] = flag[1] = false // indicates interest for thread[id]

enter(int id)
while (flag[1 &mdash; id])  // spin
flag[id] = true       // indicate self interest

exit(int id)
flag[id] = false


enter is not actually atomic. If both show up at the same time with both flags set to false, both will pass the spin and set their own flags to true

init()
flag[0] = flag[1] = false // indicates interest for thread[id]

enter(int id)
// same as previous but with sequence switched
flag[id] = true       // indicate self interest
while (flag[1 &mdash; id])  // spin

exit(int id)
flag[id] = false


This case will now enforce ME, but there may be an issue when both threads show up, both threads set their flag to true, and both threads spin forever (deadlock).

init()
flag[0] = flag[1] = false // indicates interest for thread[id]

enter(int id)
flag[id] = true       // indicate self interest
while (flag[1 &mdash; id])
flag[id] = false  // give up self interest
randSleep()       // sleep for some random time
flag[id] = true   // show self interest again

exit(int id)
flag[id] = false


For this to work, our delays cannot sync together. Though our sleep uses random durations, it is possible for both threads to wait the same time, set both their flags, then repeat (livelock)

Note that

• small delay → increased chance of lock-step behaviour
• big delay → more likely for long unnecessary wait

## Peterson’s Algorithm


init()
turn = 0
flag[0] = flag[1] = false

enter(int id)
flag[id] = true  // show self interest
turn = id
while (turn == id && flag[1 &mdash; id]) // spin

exit(int id)
flag[id] = false

flag[0] = true flag[1] = true
turn = 0 turn = 1
while (turn == 0 && flag[1]) while (turn == 1 && flag[0])

Turn will be set to either 0 or 1 in all cases. Without loss of generality, we will assume turn ends as 1. In this case, thread 0 has set turn first, and gets to execute first.

What we are looking for

1. ME over the CS
2. Absence of deadlock — if multiple threads try to get into CS, one succeeds
3. No unnecessary delay — if no one is in the CS, we should get in promptly
4. Eventual entry — if we try & get into the CS, we should eventually succeed (starvation free)

POSIX - standard and standalone library - links to apps

Better if integrated into language

### Java

• Core language/API
• Higher level API - java.util.concurrent
runnable interface determining code to execute
start() native code - gets the thread running
run() runs runnnable if not null
• Threads may be created by subclassing thread to run what we want, or by creating a runnable object and passing it to the thread constructor. Typically the latter is used; the former is if you wish to change the behaviour of threads, not just the code it runs.
• Program stops when all threads are finished
• Subtlety - program exists when all non-daemon threads (default) finish
• Daemon threads are intended as services, and terminate automatically when appropriate
• A thread that has been started may not necessarily be running - OS may choose to switch it to a scheduled or de-scheduled state
• Threads may also sleep, which goes to a waiting mode, or be woken up

# Lecture 4. 2018/01/23

• Priority based
• Nominatively priority-preemptive
• Threads at highest level is executed in preference
• For threads of same priority, should run with round robin time slices (but not guaranteed)
sleep(millis) Lower bound idle time
yield() Give up time slice
Thread.currentThread() Get reference to current thread
isAlive() returns true if thread could be scheduled. Always true if called on self (as it wouldn’t be callable otherwise). If called on another thread, returns stale information on live state
join() Wait for another thread to finish before continuing

Asynchronous termination is bad. stop() and destroy() are such methods and are deprecated.

### Basic synchronization

• Every object has a lock, for which only 1 thread can acquire at a time
synchronized(lock) {
// begin lock
...
// end lock
}


Threads that attempt to access an already locked object will wait until it unlocks.

In Java, you can relock locks you own, on the condition that you unlock for the same number of times.

## Volatile Keyword

• Used for variables to indicate that it may change arbitrarily/asynchronously
• Helps avoid accidental optimization (eg when a thread checks for a flag and loops, and the thread itself never changes the flag)

• POSIX library
• pthread.create(&threadHandle, attributes, startRoutine, args)
• Detached (daemon)
• May not be joined
• Act as services
• Joinable (non-daemon)
• Must join with them
Scheduling
SCHED_RR round-robin, time sliced, priority preemptive
SCHED_FIFO run to completion, no time slice
SCHED_OTHER offered by OS

## Mutual Exclusion

• Mutex objects
• acquire() and release()
• Up ot developer to
• Acquire before release
• By default, no recursive acquire allowed

• Filter Lock
• Generalization of the Peterson’s 2-process tie breaker
• At each stage, ensure that at least one thread trying to get in is successful (ensure progress)
• If more than 1 thread is trying to enter, at least one is blocked
init:
stage: id -> stage
int stage[n]
waiting stage -> id
initiation[n]
stage 1 -> not trying to set n?

enter id:
for s in 0 until n:
stage[id] = s
waiting[s] = id
do:
spin = false
for j in 0 until n:
if j == id continue
if stage[j] >= s && waiting[s] == id:
spin = true
break // from for loop
while spin

exit id:
stage[id] = 0


# Lecture 5. 2018/01/25

From last class, part of the proof was by contradiction (at least one thread left behind)

Let $t_A$ be the last thread at level stage to write init Waiting stage = $t_A$

Another thread is in $t_X$ We know that waiting[stage] = $t_X$ → waiting[stage] = $t_A$ We know $t_A$ writes into stage[t_X] = stage before writing init writing.

stage[tx] = tx → waiting[stage] = tx → waiting[stage] = ta

We know $t_A$ checks stage[j] after writing to wait

• filter lock → locking → doorway piece (finite # instructions), spinning piece
• fairness → FCFS
• If $t_A$ finished doorway before thread $t_B$, then $t_A$ gets in CS before $t_B$
• filter lock
• not FCFS
• starvation free

### Ticket Algorithm

• classic locks
• take next #
• wait for # to be called
init:
next = 0
int turn[n] = 0
number = 0

enter id:
turn[id] = next++        // needs to be atomic
while turn[id] != number // spin

exit:
number++


### Bakery Algorithm

• Broken ticket dispenser
init:
turn[id] = 0

enter id:
turn[id] = (0 until n).map { turn[it] }.max() + 1 // needs to be atomic
(0 until n).filter { it != id }.forEach {
while (turn[it] != 0 && turn[id] > turn[it])  // spin
}

• be careful about wrap-arounds, as there is eventual overflow
• hardware primitive - special instructions

Test and set

TS x y:           // all atomic
temp = x
x = y
temp

init:
lock = 0

enter:
while (TS(lock, 1) == 1) // spin

exit:
lock = 0


enter:
while (lock == 1)
while (TS(lock, 1) == 1)
while (lock == 1)

• Not FCFS
• Not necessarily starvation free

FA x c:             // all atomic
temp = x
x += c
temp


Compare and Swap

CAS x a b       // all atomic, return type indicates success/fail
if x != a
false
else
x = b
true


## Queue Locks

MCS

• Works for arbitrary # of threads
• Maintains explicit queue
  class Node {
Node next
boolean locked
}

• Each thread has node and a static tail
enter:
me.next = null
pred: Node = TS(tail, me)
if pred != null
me.locked = true
pred.next = me
while me.locked         // spin

exit:
if me.next == null
if CAS(tail, me, null) // try set tail back to null
return
while me.next == null   // someone has just set the tail
// spin until you see the new node.
me.next.lock = false
me.next = null


# Lecture 6. 2018/01/30

MCS Cont

• (+) FCFS
• (+) Cache friendly
• (+) Space efficient
• Every thread needs one node
• (-) Needs both TS & CAS
• (-) Can spin on exit

CLH Lock

class Node {
boolean locked
}

Node me = new Node()
Node myPred = null
}

enter:
me.locked = true        // signifies to others that they should be locked
myPred = TS(tail, me)   // set the tail to my node
while (myPred.locked)   // spin

exit:
me.locked = false
me = myPred


Property

• Notices that nodes get moved around
• No spin on exit
• Still FCFS
• Cache Friendly
• Space Efficient
• Only requires TS

## Java’s Locks (synchronous)

“Thin” Lock (Bacon lock)

• Optimized lock, contention is rare
1. Lock an unlocked object
2. Recursive lock on lock you own (shallow)
3. Deep recursive locking (over 256)
4. Contention (shallow, deep)
• Java aims to make 1. and 2. fast, at the expense of 3. and 4.
• Use mutex (pthread) for 3. and 4.
• Each object has lock word
• Divided into 8, 8, 15, 1 bits
• First 8 bits is reserved
• Second 8 bits - recursive lock count (-1)
• 15 bits represent thread id of owner
• 0 represents no thread, and consequently no lock
• Last 1 bit = lock shape - 0 = cheap (thin); 1 = expensive (fat) loc
• Lock done using CAS(lock, id << 1, 0)
• true → locked, false → check ownership
• ownership: true → locking again, add 1 to lock count. Make sure there is no overflow for the count
• If fail occurs anywhere, transition into “fat lock”
• Only owner can do that
• If we don’t own it, we spin until we do, or until it becomes a fat lock (known from shape bit)

“Fat” Lock

• Need mutex
• Lock divided into 8, 24, 1
• 24 bit - pointer to mutex
• 1 bit - shape bit (1)

Semaphore & Mutexer

• Spinning uses CPU
• Better if we went to sleep(), and someone else wakes us up
• However, easy to get lost wakeup

Semaphore

• Blocking synch → sleep & wake
• Abstraction
• P(S) “down” (all atomic)
  while (s == 0)
sleep()
s--

• V(s) “up” (all atomic)
  s++
wakeup()    // call some wakeup


Binary Semaphore

• Always 0 or 1
• Starts at 1

Counting Semaphore

• Not just 0 & 1
• Useful for representing resource

Producer/Consumer (bounded buffer)

• Data buffer[n]
• Semaphore spaces = n
• Semaphore filled = 0
produce:
while (true)
Data d = produce()
P(spaces)
buffer[pindex] = d
pindex = (pindex + 1) % n
V(filled)

consumer:
while (true)
Data d;
p(filled)
d = buffer[cindex]
cindex = (cindex + 1) % n
V(spaces)
consume(d)



Other Binary Semaphore

• Starts at 0 → signalling
• Makes threads wait for each other

Drawbacks

• 2 ideas together - mutual exclusion & signalling
• P(), V() → separate, fragile design

# Lecture 7. 2018/02/01

## Monitors

• Dijkstra, Per Brinch Hansen
• Package data
• Operations are mutually-exclusive
class Monitor {
private int d1, d2, ...

synchronized void foo() { ... }

synchronized int bar() { ... }
}

• Condition variable for thread communication
• Always associated with monitor (mutex)
• Can have more than 1 (ie Pthreads)
• In base Java → one, unnamed
• 2 ops

sleep() wait()
signal() notify() (can only invoke inside monitor)
• Calling wait() inside a monitor will give it up & sleep (atomic)
• When another thread calls notify(), sleeping thread may be woken up. Note that a thread that is woken cannot continue on until it has reacquired the lock

How does it work?

• Have a queue (set) of threads trying to get into the monitor (mq)
• Each CV implies another queue → cvq

Atomic ops

enter T:
if no one is in T
enter
else
mq & sleep

exit T:

wait T cvq:
sleep
exit

notify cvq:
take a thread from cvq & put to mq

notifyAll cvq:
move all threads from cvq to mq

• Notices that we wake one thread with notify()
• Upon being woken up, conditions may not hold; hence conditions should be checked again (typically with while loop)
• Spurious wakeups - may be woken up without being notified
• Also solved by waiting in a while loop

## Different semantics for CV

• Signal & Continue
• Notifier retains control of monitor
• Signal & Wait
• Notifier transfers monitor control to the woken thread
• Notifier needs to re-acquire lock to continue
• Signal & Urgent Wait
• Same as signal & wait, except a woken thread will give control back to the notifier once finished
• DB context - only 1 write at a time

R/W Lock

int readers = 0
BinarySemaphore r, rw // init to 1

down r
down rw
up r
down r
up rw
up r

writer:
down rw
write()
up rw


• Note that this gives preference to readers
• A continuous stream of readers may starve waiting writers
• There are variants for writer’s preference & fair solution

## Termination

• Asynchronous termination may easily corrupt the state
• Eg terminating thread when it’s in the middle of a syscall
• Java does not have cancellation, but instead has interrupts
• Many operations will capture it by default and throw an InterruptedException to be handled. Note that some exceptions may be thrown by spurious wakeups. Threads can also use .interrupted() to check its interruption status.

## Priorities

• Nominatively priority pre-emptive
• Java - 10 priorities
• NT - 7 levels

# Lecture 8. 2018/02/06

## Priority Inversion

• Low level thread locks and executes → high priority thread enters and attempts to acquire lock → medium priority thread comes in as well and acquires lock from low priority → high priority thread must wait for medium priority thread to finish before the lock can be acquired

• Mars Pathfinder

• High priority thread - bus manager
• Medium priority thread - communication
• Low priority thread - meteor logical
• Once priority inversion occurs, another watchdog thread will reset everything

Solutions

Priority Inheritance

• Thread holding a lock will temporarily acquire the priority of the highest priority thread waiting for the lock

Priority Ceilings

• Locks are associated with priority
• High priority lock → high priority thread

Barrier

• Want thread to wait for each other

volatile int numThreads = n

while true
// work
else
yield()


Sense Reversing Barrier

boolean phase = true
boolean phases[n] // all false

phase = phases[id]
else
while phases[id] != phase
yield()



TSD/TLS

• Stack data - local to threads
• Global data - shared Sometimes convenient to have thread-specific versions of global data

errno in C

• Global data - could get modified by different threads and be inconsistent
• Less confusion if each thread has own errno

TSD - pthread TLS - Java

// Java TLS

tls.set(2) // t0
tls.set(1) // t1


TSD

• Allocate

Dining Philosopher Problem

5 philosophers at a round table with 5 plates and 5 utensils. Each will think, eat, and repeat. Eating requires getting two utensils adjacent to the philosophers. Goal is to avoid deadlock.

Solutions

Single Global Mutex

think()
P(global)
eat()
V(global)


Works, but not very concurrent

Mutex per Utensil

c = mutex[5]

think()
P(c[i])
P(c[(i + 1) % 5])
eat()
V(c[(i + 1) % 5])
V(c[i])


Doesn’t actually work. If everyone tries to grab left utensil, no one will be able to grap right utensil and complete eat()

Create Global Lock

lock = mutex // global lock
c = mutex[5] // lock per utensil

think()
P(lock)
P(c[i])
P(c[(i + 1) % 5])
V(lock)
eat()
V(c[(i + 1) % 5])
V(c[i])


Works, but still not very concurrent. If philosopher 1 eats, and philosopher 2 attempts to eat, philosopher 2 will hold the global lock and block others who could have eaten from eating.

Order the Locks

think()
j = ((i + 1) % 5)
P(min(i, j))
P(max(i, j))
eat()
V(max(i, j)) // order here isn't as important
V(min(i, j)) // but reverse unlock is good practice


# Lecture 9. 2018/02/08

Solutions

• Order resources

Coffman’s Condition

• 4 necessary conditions
1. Serially reusable resources
2. Incremental acquisition
3. No preemption
4. Dependency Cycle

Livelock

• Threads are not stopped, but are also not making useful progress

Race Condition

• Happening due to lack of ordering guarantees
• Note: different from non-determinism & atomicity

Concurrent Pros

• 2 ordering relations
1. Intra-thread ordering - program order
• A data-race in a multithreaded program occurs when two of more threads access the same memory location, with no ordering constraints and at least on of them being a writer

Consensus

• Lots of synch primitives (TS, FA, CAS)
• We can build locks with any/all of those
caslock

CAS x a b:
bool rc
while TS(caslock, 1) == 1; // spin
if x == a
x = b
rc = true
else
rc = false
caslock = null
return rc

• CAS original - fixed amount of time to execute (wait free)
• Simulate CAS - spins, potentially indefinitely
• Wait free
• Finite # of steps
• Fault-tolerant
• Consensus problem
• n threads, each with different values
• Want to agree on the value
• Requirements
• Consistent - all agree in the end
• Valid - agreed value is one of the starting values
• Wait-free - finite # steps, fault tolerant
• Consensus number of a sync primitive is the max # of threads for which they can solve the consensus problem

• Binary consensus {0, 1}
• Show we cannot solve 2-consensus
• Start in a bivalent state
• System → {0, 1}
• End up in a univalent state
• {0} or {1}

(I got tired after this part)

• T1 → might read or write
• T1 tree also necessarily exists before T0’s action - not univalent
• Therefore both T0 & T1 must both write
• Could write different vars
• T0 & T1 write different vars
• They both write the same var x
• Cannot be univalent → cannot be critical
• R/w primitives have a consensus # of 1

# Lecture 10. 2018/02/15

R/W - consensus #1

FA, TS, consensus #2

int decide (int v):
x = TS(decider, v) // x is old value
if x == 1:
return v
return x


Solves 2 consensus, but not 3

CAS, consensus # ∞

int decide (int v):
CAS(decider, 1, v)
return decider


## Linearization

• Concurrent programs
• In general, not functional
• Output can depend on timing/scheduling
• Correctness
• If whatever happens in our concurrent program has an equivalent sequential program producing the same output

Example - 2 queues p, q - enqueue, dequeue

T0 T1
1: p.enq(x) 4: q.enq(y)
2: q.enq(x) 5: p.enq(y)
3: p.deq() -> returns y 6: q.deq() -> returns x

(numbers are just for future reference and do not refer to runtime order)

is this linearizable?

• Since p returns y, 5 must occur before 1 (FIFO)
• Since q returns x, 2 must occur before 4
• However, given that the calls in within each thread is sequential, we know that this result isn’t possible

Memory models

T0 T1
x = 1 y = 2
b = y a = x

(variables start at 0)

Possibilities: (a, b) = (1, 0), (0, 2), (1, 2)

Note that cases like (0, 0) is not possible; invalid interleaving

Write-buffering

P0 WB0 Mem P1 WB1
x = 1 → x = 1 x = 0 y = 2 → y = 2
- - y = 0 - -
b = y → b = 0 a = 0 - -
- - - a = x → a = 0
- writes x x = 1 - -
- - y = 2 - writes y
- - a = 0 - -
- - b = 0 - -

As a result, with buffering, we can have (0, 0).

&therefore; we cannot just think of interleavings

• Model types
• Sequential Consistency (SC) - operations appear to execute in program order
• Both cases result in a global timeline for operations
• Coherence
• The write to a given variable must be seen in the same order by all threads
• Doesn’t require sequential consistency. Every variable has their own timeline
• Process Consistency (PC)
• T0 T1 T2 T3
x = 1 x = 3 a = x(1) d = x (3)
y = 2 y = 4 b = y (2) e = y (4)
- - c = x (3) f = x (1)
• T2 sees T0 before T1
• T3 sees T1 before T0
• Note that this is not coherent
• Note
• SC implies cohrerence and PC
• PC does not imply coherence
• Coherence does not imply PC

No class

# Lecture 12. 2018/02/22

Intel/AMD

• Memory management implied in the hardware manual
• White paper
• Informal rules
• “litmus” tests
• Some kind of casual consistency
• IRIW - independent read independent write
• Probably don’t want, but not ruled out
P0 P1 P2 P3
x = 1 y = 1 eax = x, ebx = y ecx = y, edx = x

For P2, it sees eax = 1, ebx = 0, meaning P0 happened before P1 For P3, it sees ecx = 1, edx = 0, meaning P1 happened before P0

• Also cases that should not happen, but can be obsered in practice n6

P0 P1
x = 1 y = 2
eax = x x = 2
ebx = y

Could observe that eax = 1, ebx = 0, x = 1

• Could happen with write buffers
• Order for which they are committed to main memory is undefined
• IRIW becae disallowed
• MFENCE - memory fence to monitor read write access
• Same processor writes are ordered and observed in the same order by all others
• Constraints on inter-process ordering
• Any 2 stores are seeing consistent ordering by processes other than those doing the write. Leaves opens another case n5
• n5

P0 P1
x = 1 x = 2
eax = x ebx = x

eax = 2, ebx = 1 not disallowed, but also not observed

• x86-TSO - abstract model by academics
• Allow everything observed
• Respect litmus tests
• Every hardware thread has its own write buffer
• All writes stored in order
• Committed in order
• Has no overwrites
• Global lock that can be acquired by any hardware thread (with constraints)
• MFENCE
• All WB flushes must be done before
• Some instructions
• LOCK: prefix - thread must acquire global lock
• Buffered write can be committed to memory at any time except when another thread holds the lock
• Before releasing lock, WB must be empty
• A thread is blocked if it does not hold the lock & someone else does
• Rp[a] = v : p can read value v from address a if p is not blocked; there are no writes to a in WBp and mem[a] = v
• Rp[a] = v : p can read v from address a if p is not blocked, and p has a (latest) store “a = v” in WBp
• Wp[a] = v : p can write a = v into WBp at any time
• τp : if p is not blocked, it can silently send the oldest write in WBp to memory
• Fp : if WBp is empty, p can issue an MFENCE instruction
• Lp : if the lock is not held by another process, p can acquire it
• Up : if p has the lock & WBp is empty, we can release the lock
• progress condition : all buffered writes are eventually committed
• Example

P: WBp = [0x55] = 0 Q: WBq = [0x55] = 7
Lock : Inc [0x55]
Lp
Rp[0x55] = 0
Wp[0x55] = 1
τp (0x55 = 0)
τp (0x55 = 1)
Up τq
• Without lock, p can begin increment, q can set to 7, and p can increment the new value by 1 → 8
• spinlock
• Address of lock is eax
• Start: Lock : Dec[eax] JNS : enter (jump no sign)
• Spin: CMP[eax], 0 JLE : spin JMP start (jump)
• Exit : MOV[eax], 1

Missed class

# Lecture 14. 2018/03/15

Lock free designs make use of CAS, LL/SL, etc as opposed to locks

tryAdd(Node n, Node prev):
n.next = prev.next
return CAS(prev.next, n.next n)


Sadly these naive methods do not work. Given a list H → xyz → T:

1. If one thread tries to remove x, and another thread tries to add w between x and y, w will be lost.
2. If onne thread tries to remove x and another thread tries to remove y, both may succeed with y remaining in the list.

Various solutions exist

• Valois - “auxiliary nodes”
• Time Harris - lazy solution - mark node to be deleted then delete lazily


next = prev.next
return CAS(prev.next, n.next to false, n to false)

tryRemove(Node n, Node prev):
Node succ = n.next
if CAS(n.next, succ to false, succ to true): // mark first
CAS(prev.next, n to false, succ to false)  // delete is ok to fail
return true
return false

find(int data):
while(true):
pred = H
curr = pred.next
while curr != T: @restart
succ = curr.next
while curr.marked:
if !CAS(pred.next, curr to false, succ to false):
continue@restart
curr = succ
succ = curr.next
if curr.data == data:
return curr
pred = curr
curr = succ
return null



### Elimination Stack

With a lock-free stack, we still have a lot of contention. Stack ops are fundamentally sequential. However, if push() and pop() show up at the same time, it might be better to just short-circuit the whole thing. ie, push() gives the value to pop; push + pop cancel/do nothing.

Lock free exchanger (2 threads exchange data)

• state info (pair of <value, state>)
• need atomic operators to r/w pair atomically
• value: data exchange
• state: EMPTY, WAITING, BUSY
• EMPTY - ready to do the swap
CAS(pair, null to EMPTY, A to EMPTY)
• CAS to set the slow to item & state WAITING, but only if the state is EMPTY
• If successful:
• wait for the first thread to show up
• spin until we see the state as BUSY
• grab item & set state to EMPTY
• If we wait for too long
• Try & give up
• Use CAS to set back to EMPTY
CAS(pair, A to EMPTY, null to EMPTY)
• If successful, we go & do push/pop
• Else complete the exchange
• WAITING - one thread (second thread to show up)
• check the state → not EMPTY
• grab item
• try CAS(pair, A to WAITING, B to BUSY)
• If successful, we have done our part
• If fail, restart to resolve push/pop
• BUSY - two threads (third thread to show up → give up)
• second thread has completed the exchange
• grab value B
• set state to EMPTY

Can associate an exchange for the state

• push(x) is expected to exchange x for null
• pop() is expected to exchange null for x
• if successful, don’t need to actually go through the state
• in fact, we can associate an array of exchangers
• threads above random index to try for an exchange
• try for a while, if a matching push/pop does not show up or a non-matching situation
• push/posh or pop/pop occurs, give up & resort to the lock free stack
• Java’s exchanger is based on an array of exchangers

# Lecture 15. 2018/03/20

Lock Free algorithms are problematic in the way that references from deallocated objects may be resued unknowingly

Universal construction - almost any data structure can be made into a lock free version

• Assume data stracture has some interface. We can make another interface that wraps the original to ensure that the invocations are sequential
• Need threads to agree on the order of operations in the log → consensus
newInvoc(...):
do:
j = consensus i
while i != j
// we are now the next op

s = tail
r = tail.state
do:
r = r.apply s  // without modifying data, check the previous ops to update the state
s = s.next
while s != i

return r


One concern is that our consensus algorithm is one shot and not reusable. However, knowing that we can construct a new consensus object with each invocation, this is not an issue.

## Open MP

Not a language, but rather a set of directives (structured comments: pragma) on top of an existing language that makes parallelism easy

#pragma omp parallel For single statements
#pragma omp for For loops; will be partitioned amongst the thread team
#pragma omp sections Another way of partitioning work
#pragma omp single Part within section that should only be executed by one thread
#pragme omp master Part that must be executed by a specified thread (master)

## Data Model

• shared(x, y, z)
• private(x, y, z) - each thread has its own copy; uninitialized, not persistent
• threadPrivate(x, y, z) - like private, but persistent & presumably initialized
• firstPrivate - var is initialized from the present scope
• lastPrivate - value is given back to the parent
• reduction(opList) - opList can be + (init at 0) or * (init at 1)

• flush - mian maechanism for synchronzing cache & memory
• commits any pending writes
• invalidates cached copies
• rules
• if intersection of 2 flush sets is non-empty, flushes must be sseen in same order by everyone
• if thread redds, writes, modifies in flush set, program order is respected
• atomic ops have implicit flush of the vars involved

# Lecture 16. 2018/03/22

## PGAS

• Various parallel models
• Shared memory
• Message passing
• SIMD/SPSMD
• Data parallel
• Locality of computation is important

## Titanium

• Based on Java: compiler → C → executable
• spmd language - processes execute the same code
• To synchronize, add a barrier
• All threads must reach barrier before any proceed
• Problematic if we create barriers on many threads and none on others. Some threads may end up stuck indefinitely.
• Titanium has an analysis tool
• Single value data - x expr is a constant or function of only single-valued data
• Barrier must (only) be reachable through single-valued computations/tests

XIO

• PGAS
• APGAS - asynchronous

Basic mechanism:

async {
... content
}


No guarantee as to when async is done, if at all

finish {
... content
}


All async’s inside the finish block (including nested async’s) must be done before this continues

Async indicates that a new thread may be created for the code to run. Often times, the operations may be small, and making a new thread would be extremely inefficient. Some optimizations will be done to see when it is worth making new threads.

Java has executor mechanism:

execute(Runnable r)


Given runnables take in nothing and output nothing, they are not always enough. There also exists Callable<V> which returns V and allows exception throwing

ExecutorServie gives different ways of executing. ThreadPoolExecutor allows for specifications for pools of threads

# Lecture 19. 2018/04/03

Last Time

• Process algebra
• Message passing (synchronous)
• π-calculus - g1/g2 - builds on ealier PA called CCS
• motivated by limitation in CCS
• calculus of mobility - mobile communication - structure can change dynamically
• Defined with 2 entities
• Processes or agents
• Channels or names
• Very abstract - no “real” arithmetic/low level data
• Induction definition for processes
• &emptyset; - process that does nothing (STOP, HALT)
•  P & Q are processes, then P Q is a parallel composition
• Choice operator, ie \Σi &in; I> πi Pi
• Send: \overline{x} y, where x is channel, y is what is sent
• Receive: x(y), where x is channel, y is what the data will be bound to
•  Ex (x(y) e1 \overline{x} z e2) reduces to (e1[y → z] e2)
• Other constructs
• !P - replication operator; as many copies of P as we want
• (ν x) p - creates a new (“nu”) fresh channel called x in p
• Reduction rules
• PAR - processes can evolve at independent rates
• RES
• STDOUT
•  Equivalence (structural congruence) - (vx)∅ ≡ ∅, ∅ P ≡ P ∅ ≡ P
•  Associative - (P Q) W ≡ P (Q W)

• λ-calculus, seq progression
• (λ x (λ y M)) can be simplified to (λ xy M)

I was away

# Lecture 21. 2018/04/10

Last time

• Data flow -> static data flow
• Dynamic deletion
• Functions from stream to stream
• Domain of streams - prefix ordered
• Constraints
• Monotonic - if x < y, f(x) < f(y)
• Continuous - eg preserves bounds
• Network of continuous functions is described by a aystem of (recursive) stream flow equations. The solutions describe the possible flow of our network
• Kahn’s principle
• Cons - prepends 0 ot act as identity
• Buriden’s Ass - given a donkey between and equidistant to two bales of hay, will it ever starve by taking too long in making a decision as to which bale it will go to?
• Buridan’s Principle - for any device making a decision among a finte nuumber of possible outcomes based on a continuing possible imputs, there will be inputs for which the device takes abitrarily long to make a decision
• How to fix this?
• Default decision - only pushes the decision; when can we not decide?
• One decision
• Bipartition - evey var is either a gate input & wire output, a wire input & gate output, an I/O wire
• Integrity - each var is the output of at most one gate
• ??? - each wire is input to at most one gate, gate output may connect to multiple wire inputs
• Feedback - no minimal self-cycles
• Arbiter - Pin, Qin, Pout, Qout
• Initially, Pin = Qin = true; Pout = Qout = false
• In any finite history, find state must be such that Pout ≠ Qout
• There exists histories with Pout &wedge; ¬Qout = true and ¬Pout &wedge; Qout = true
• We want it to be
• Delay bounded - all histories are finite
• Stable - in each history, one output remains unchanged
• Glitch-free if delay-bounded & stable

# Lecture 22. 2018/04/12

• Similar to transactional program
• Automatic parallelization
• Takes advantage of multicore machines
• Aims for ok performance
• optimistic/speculative technique
• Parallelize code we don’t know can be parallelizable
• Safety guarantees to ensure correct execution
• Speculative threads execute in isolation - all writes are buffered, do not affefct main memory
• Commits buffer at join time and validates using set of assumptions with current memory
• If mismatch, discard result and continuing executing sequentially
• Buffer/isolation ensures we do not see writes from the future
• Major issue
• Hardware - cheaper but limited capacity
• Software - much more expensive, but with logs of capacity
• Deciding where to speculate
• Loop level speculation
• Method level speculations
• Value prediction
• 0, 1
• Last value
• Last stride (diff in return values)
• How any spec threads can we have, and who can speculate
• “out of order” speculation