Clark Verbrugge • Course Webpage
Abbreviation | Meaning |
---|---|
AMO | At Most One |
CAS | Compare and Swap |
CMP | Chip Multi-Processor |
CMT | Coarse-grained Multi-Threaded |
CS | Critical Section |
CV | Condition Variable |
FA | Fetch and Add |
FMT | Fine-grained Multi-Threaded |
IPC | Instructions Per Cycle |
JLE | Jump if Less than or Equal |
JMP | Jump |
JNS | Jump if Not Sign |
LL | Linked List |
ME | Mutual Exclusion |
MP | Multi-Processor |
NUMA | Non-Uniform Memory Access |
PC | Processor Consistency |
SC | Sequential Consistency |
SMP | Symmetric Multi-Processor |
SMT | Simultaneous Multi-Threaded |
TLS | Thread Local Storage / Thread Level Speculation |
TS | Test and Set |
TSD | Thread Specific Data |
UMA | Uniform Memory Access |
UP | Uni-Processor |
/**
* Block thread until [count] permits are available before acquiring them
* acquire() = acquire(1)
*/
void acquire(int count)
/**
* Get number of currently available permits
*/
int availablePermits()
/**
* Acquire and return all available permits
*/
int drainPermits()
/**
* Makes [count] more permits available
* release() = release(1)
*/
void release(int count)
/**
* Pseudo: acquire()
*/
void P()
/**
* Pseudo: release()
*/
void V()
/**
* Interrupts the thread; thread must abide by
* interruption to actually be affected
*/
void interrupt()
/**
* Checks if thread is interrupted
*/
boolean isInterrupted()
/**
* Makes current thread wait for specified thread to die
* Optionally add long parameter for max number of millis
* to wait
*/
void join()
/**
* Launches thread by making JVM call run()
*/
void start()
/**
* Sleeps current thread for specified time
*/
static void sleep(long millis) throws InterruptedException
/**
* Wakes up single thread waiting on monitor
* Often, notifyAll() is preferred
*/
void notify()
/**
* Wakes up all threads waiting on monitor
*/
void notifyAll()
/**
* Causes current thread to wait until notify()
* or notifyAll() is called from this object
* Optionally specify timeout (long millis)
*/
void wait() throws InterruptedException
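A minimal guarded-block sketch using these monitor methods (the Flag class is a made-up example, not from the course): wait() must be called while holding the monitor, and inside a loop because wakeups may be spurious.

```java
// Guarded block built on the Object monitor methods above.
class Flag {
    private boolean ready = false;

    synchronized void await() throws InterruptedException {
        while (!ready)   // loop: wakeups may be spurious, or another thread may run first
            wait();      // atomically releases the monitor and sleeps
    }

    synchronized void set() {
        ready = true;
        notifyAll();     // waiters must reacquire the monitor before returning from wait()
    }
}
```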
Asynchronous Execution — threads execute at their own rate (dictated by OS)
Example 1 — consider the following conversions & constraints:
a → ab
b → ab
There cannot be any sequence of aa or bb.
If we started with ab, we would not be able to make any conversions. However, if we did both conversions in parallel and ignored the middle transition, we could transition to abab.
Example
T1 | T2 | T3 |
---|---|---|
add | load | fadd |
load | add | fadd |
add | add | fadd |
Assume each processor has 2 × ALU, 2 × FPU, 1 × LSU
int/float ops → 1 cycle; load → 2 cycles
UP
slot 1 | slot 2 |
---|---|
T1.1 add | T1.2 load |
— | — |
T1.3 add | — |
T2.load | — |
— | — |
T2.add | T2.add |
T3.fadd | T3.fadd |
T3.fadd | — |
SMP/CMP (2 processors)
P0 slot1 | P0 slot2 | P1 slot1 | P1 slot2 |
---|---|---|---|
T1.add | T1.load | T2.load | — |
— | — | — | — |
T1.add | — | T2.add | T2.add |
T3.fadd | T3.fadd | — | — |
T3.fadd | — | — | — |
Example
Consider the case where x = y = z = 0, and where thread 1's execution is not atomic; instead, it is split into several instructions.
Given that no assumptions can be made about the speeds of either CPU, thread 2 may execute its two instructions in between any of thread 1's instructions. So instead of the desired result of x = 0, thread 1 may produce three different outputs.
Usually
In Java
For word-sized x = y + z, assignment is atomic, but computation of y + z may not be.
Note that if no other thread writes to y or z, then the computation will appear to be atomic.
An expression has a critical reference if it uses a variable that some other thread changes
Word-sized expressions with an “at-most-one” critical reference are “effectively” atomic. The constraints are that the expression has at most one critical reference, and that a variable assigned to is not read by any other thread:
x = y = 0
Thread 1 | Thread 2 |
---|---|
x = y + 1 | y = y + 1 |
1 CR, x not read | 0 CR, y is read |
As AMO is satisfied in both cases, there are no unexpected values to be considered. The result is as if one expression executed entirely before the other, without interleaving.
x = y = 0
Thread 1 | Thread 2 |
---|---|
x = y + 1 | y = x + 1 |
1 CR, x is read | 1 CR, y is read |
load r1, [y]; inc r1; str r1, [x] | load r1, [x]; inc r1; str r1, [y] |

Neither satisfies AMO, so there may be interleaving (eg resulting in x = 1, y = 1)
Number of interleavings we need to consider:
n threads, each with m atomic instructions
$\dfrac{(nm)!}{(m!)^n}$
For n = 2, m = 3 → 20 possibilities
To resolve this, we should only allow 1 thread to execute such changes at one time → critical section
Assume 2 threads with unique ids 0 and 1.
We will go through the stages of init, enter protocol, and exit protocol.
init()
turn = 0
enter(int id)
while (turn != id) // spin
exit(int id)
turn = 1 - id
In this case, given that the turn is initially for thread 0, if thread 1 shows up first and thread 0 has no intention of entering the CS, thread 1 will need to wait unnecessarily.
init()
flag[0] = flag[1] = false // indicates interest for thread[id]
enter(int id)
while (flag[1 - id]) // spin
flag[id] = true // indicate self interest
exit(int id)
flag[id] = false
enter is not actually atomic. If both threads show up at the same time with both flags set to false, both will pass the spin and set their own flags to true.
init()
flag[0] = flag[1] = false // indicates interest for thread[id]
enter(int id)
// same as previous but with sequence switched
flag[id] = true // indicate self interest
while (flag[1 - id]) // spin
exit(int id)
flag[id] = false
This case will now enforce ME, but there may be an issue when both threads show up, both threads set their flag to true, and both threads spin forever (deadlock).
init()
flag[0] = flag[1] = false // indicates interest for thread[id]
enter(int id)
flag[id] = true // indicate self interest
while (flag[1 - id])
flag[id] = false // give up self interest
randSleep() // sleep for some random time
flag[id] = true // show self interest again
exit(int id)
flag[id] = false
For this to work, our delays cannot sync together. Though our sleep uses random durations, it is possible for both threads to wait for the same amount of time, set both their flags, then repeat (livelock).
Note that the following (Peterson's algorithm) combines the flag and turn approaches:
init()
turn = 0
flag[0] = flag[1] = false
enter(int id)
flag[id] = true // show self interest
turn = id
while (turn == id && flag[1 - id]) // spin
exit(int id)
flag[id] = false
Thread 0 | Thread 1 |
---|---|
flag[0] = true | flag[1] = true |
turn = 0 | turn = 1 |
while (turn == 0 && flag[1]) | while (turn == 1 && flag[0]) |
turn will be set to either 0 or 1 in all cases. Without loss of generality, we will assume turn ends as 1. In this case, thread 0 set turn first, and gets to execute first.
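A Java sketch of the above (the class name is mine): all three fields are volatile, since the Java memory model makes volatile accesses sequentially consistent among themselves, which the spin condition relies on. Plain fields would allow reorderings that break the algorithm.

```java
// Peterson's algorithm for two threads with ids 0 and 1.
class Peterson {
    private volatile boolean flag0 = false, flag1 = false;
    private volatile int turn = 0;

    void enter(int id) {
        if (id == 0) {
            flag0 = true;                            // show self interest
            turn = 0;                                // offer to wait
            while (turn == 0 && flag1) { /* spin */ }
        } else {
            flag1 = true;
            turn = 1;
            while (turn == 1 && flag0) { /* spin */ }
        }
    }

    void exit(int id) {
        if (id == 0) flag0 = false; else flag1 = false;
    }
}
```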
What we are looking for
POSIX - standard and standalone library - links to apps
Better if integrated into language
java.util.concurrent
Thread | |
---|---|
runnable | interface determining code to execute |
start() | native code - gets the thread running |
run() | runs runnable if not null |
Thread API | |
---|---|
sleep(millis) | Lower bound idle time |
yield() | Give up time slice |
Thread.currentThread() | Get reference to current thread |
isAlive() | returns true if thread could be scheduled. Always true if called on self (as it wouldn’t be callable otherwise). If called on another thread, returns stale information on live state |
join() | Wait for another thread to finish before continuing |
Asynchronous termination is bad. stop() and destroy() are such methods and are deprecated.
synchronized(lock) {
// begin lock
...
// end lock
}
Threads that attempt to access an already locked object will wait until it unlocks.
In Java, locks are reentrant: you can relock locks you own, provided you unlock them the same number of times.
pthread_create(&threadHandle, attributes, startRoutine, args)
Scheduling | |
---|---|
SCHED_RR | round-robin, time sliced, priority preemptive |
SCHED_FIFO | run to completion, no time slice |
SCHED_OTHER | offered by OS |
acquire() and release()
init:
    int stage[n]   // stage: id -> stage; stage 0 -> not trying
    int waiting[n] // waiting: stage -> id
enter id:
for s in 1 until n: // stages 1..n-1; stage 0 = not trying
stage[id] = s
waiting[s] = id
do:
spin = false
for j in 0 until n:
if j == id continue
if stage[j] >= s && waiting[s] == id:
spin = true
break // from for loop
while spin
exit id:
stage[id] = 0
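A Java sketch of this n-thread algorithm (the filter lock), assuming thread ids 0..n-1. AtomicIntegerArray is used because plain Java array elements are not volatile; class and field names are mine.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

class FilterLock {
    private final int n;
    private final AtomicIntegerArray stage;   // stage.get(id): highest stage reached; 0 = not trying
    private final AtomicIntegerArray waiting; // waiting.get(s): last thread to arrive at stage s

    FilterLock(int n) {
        this.n = n;
        stage = new AtomicIntegerArray(n);
        waiting = new AtomicIntegerArray(n);
    }

    void enter(int id) {
        for (int s = 1; s < n; s++) {   // stages 1..n-1
            stage.set(id, s);
            waiting.set(s, id);
            boolean spin = true;
            while (spin) {              // spin while we are the latest arrival at s
                spin = false;           // and some other thread is at stage s or higher
                for (int j = 0; j < n; j++) {
                    if (j != id && stage.get(j) >= s && waiting.get(s) == id) {
                        spin = true;
                        break;
                    }
                }
            }
        }
    }

    void exit(int id) {
        stage.set(id, 0);
    }
}
```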
From last class, part of the proof was by contradiction (at least one thread is left behind at each stage).
Let $t_A$ be the last thread at the stage to write waiting[stage], so waiting[stage] = $t_A$.
Suppose another thread $t_X$ is also at the stage. We know $t_X$ wrote waiting[stage] = $t_X$ before $t_A$ wrote waiting[stage] = $t_A$, and that $t_X$ wrote stage[$t_X$] = stage before writing waiting[stage]:
stage[$t_X$] = stage → waiting[stage] = $t_X$ → waiting[stage] = $t_A$
We know $t_A$ checks stage[j] after writing to waiting, so it sees stage[$t_X$] ≥ stage while waiting[stage] = $t_A$, and must keep spinning, a contradiction.
init:
next = 0
int turn[n] = 0
number = 0
enter id:
turn[id] = next++ // needs to be atomic
while turn[id] != number // spin
exit:
number++
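A Java sketch of this ticket lock (names are mine); AtomicInteger.getAndIncrement provides the atomic next++, and only the lock holder writes number, so a plain volatile increment suffices on exit.

```java
import java.util.concurrent.atomic.AtomicInteger;

class TicketLock {
    private final AtomicInteger next = new AtomicInteger(0); // next ticket to hand out
    private volatile int number = 0;                         // ticket currently being served

    void enter() {
        int ticket = next.getAndIncrement();  // atomic fetch-and-add
        while (ticket != number) { /* spin until our ticket is served */ }
    }

    void exit() {
        number = number + 1;  // only the holder writes this, so no race
    }
}
```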
init:
turn[id] = 0
enter id:
turn[id] = (0 until n).map { turn[it] }.max() + 1 // needs to be atomic
(0 until n).filter { it != id }.forEach {
while (turn[it] != 0 && turn[id] > turn[it]) // spin
}
Test and set
TS x y: // all atomic
temp = x
x = y
temp
init:
lock = 0
enter:
while (TS(lock, 1) == 1) // spin
exit:
lock = 0
enter: // test-and-test-and-set
    while (lock == 1) // spin locally first
    while (TS(lock, 1) == 1)
        while (lock == 1) // spin locally until the lock looks free, then retry TS
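A Java sketch of this test-and-test-and-set lock, with AtomicBoolean.getAndSet playing the role of TS; the inner read-only spin avoids generating coherence traffic while the lock is held.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class TTASLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    void enter() {
        while (true) {
            while (locked.get()) { /* spin on a plain read */ }
            if (!locked.getAndSet(true))
                return; // we flipped it from false to true: lock acquired
        }
    }

    void exit() {
        locked.set(false);
    }
}
```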
Fetch and Add
FA x c: // all atomic
temp = x
x += c
temp
Compare and Swap
CAS x a b // all atomic, return type indicates success/fail
if x != a
false
else
x = b
true
MCS
class Node {
Node next
boolean locked
}
enter:
me.next = null
pred: Node = TS(tail, me)
if pred != null
me.locked = true
pred.next = me
while me.locked // spin
exit:
if me.next == null
if CAS(tail, me, null) // try set tail back to null
return
while me.next == null // someone has just set the tail
// spin until you see the new node.
me.next.locked = false
me.next = null
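A Java sketch of the MCS lock above (getAndSet for TS, compareAndSet for CAS), keeping each thread's queue node in a ThreadLocal; names are mine.

```java
import java.util.concurrent.atomic.AtomicReference;

class MCSLock {
    static class Node {
        volatile Node next = null;
        volatile boolean locked = false;
    }

    private final AtomicReference<Node> tail = new AtomicReference<>(null);
    private final ThreadLocal<Node> myNode = ThreadLocal.withInitial(Node::new);

    void enter() {
        Node me = myNode.get();
        me.next = null;
        Node pred = tail.getAndSet(me);       // TS(tail, me)
        if (pred != null) {
            me.locked = true;
            pred.next = me;                   // link in behind our predecessor
            while (me.locked) { /* spin on our own node */ }
        }
    }

    void exit() {
        Node me = myNode.get();
        if (me.next == null) {
            if (tail.compareAndSet(me, null)) // no one behind us: reset tail
                return;
            while (me.next == null) { /* a thread swapped tail; wait for it to link in */ }
        }
        me.next.locked = false;               // hand the lock to our successor
        me.next = null;
    }
}
```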
MCS Cont
CLH Lock
class Node {
boolean locked
}
class Thread {
Node me = new Node()
Node myPred = null
}
enter:
me.locked = true // signifies to others that they should be locked
myPred = TS(tail, me) // set the tail to my node
while (myPred.locked) // spin
exit:
me.locked = false
me = myPred
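A Java sketch of the CLH lock above (names are mine): each thread spins on its predecessor's node, and on release recycles the predecessor's node for its own next acquisition, exactly as in the pseudocode's `me = myPred`.

```java
import java.util.concurrent.atomic.AtomicReference;

class CLHLock {
    static class Node { volatile boolean locked = false; }

    private final AtomicReference<Node> tail = new AtomicReference<>(new Node());
    private final ThreadLocal<Node> me = ThreadLocal.withInitial(Node::new);
    private final ThreadLocal<Node> myPred = new ThreadLocal<>();

    void enter() {
        Node node = me.get();
        node.locked = true;               // signifies to our successor that it should wait
        Node pred = tail.getAndSet(node); // TS(tail, me)
        myPred.set(pred);
        while (pred.locked) { /* spin on predecessor's flag */ }
    }

    void exit() {
        Node node = me.get();
        node.locked = false;              // release our successor
        me.set(myPred.get());             // reuse the predecessor's node next time
    }
}
```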
Property
“Thin” Lock (Bacon lock)
CAS(lock, id << 1, 0)
“Fat” Lock
Semaphore & Mutex
Semaphore
P(s):
    while (s == 0)
        sleep()
    s--
V(s):
    s++
    wakeup() // wake up a sleeping thread
Binary Semaphore
Counting Semaphore
Producer/Consumer (bounded buffer)
producer:
    while (true)
        Data d = produce()
        P(spaces)
        buffer[pindex] = d
        pindex = (pindex + 1) % n
        V(filled)
consumer:
    while (true)
        Data d
        P(filled)
        d = buffer[cindex]
        cindex = (cindex + 1) % n
        V(spaces)
        consume(d)
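The same bounded buffer in Java using java.util.concurrent.Semaphore (acquire() is P, release() is V). The extra mutex is my addition: it guards the indices so multiple producers or consumers can share the buffer; with a single producer and consumer it could be dropped, as in the pseudocode.

```java
import java.util.concurrent.Semaphore;

class BoundedBuffer<T> {
    private final Object[] buffer;
    private int pindex = 0, cindex = 0;
    private final Semaphore spaces;                    // empty slots
    private final Semaphore filled;                    // occupied slots
    private final Semaphore mutex = new Semaphore(1);  // guards the indices

    BoundedBuffer(int n) {
        buffer = new Object[n];
        spaces = new Semaphore(n);
        filled = new Semaphore(0);
    }

    void put(T d) throws InterruptedException {
        spaces.acquire();            // P(spaces)
        mutex.acquire();
        buffer[pindex] = d;
        pindex = (pindex + 1) % buffer.length;
        mutex.release();
        filled.release();            // V(filled)
    }

    @SuppressWarnings("unchecked")
    T take() throws InterruptedException {
        filled.acquire();            // P(filled)
        mutex.acquire();
        T d = (T) buffer[cindex];
        cindex = (cindex + 1) % buffer.length;
        mutex.release();
        spaces.release();            // V(spaces)
        return d;
    }
}
```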
Other Binary Semaphore
Drawbacks
class Monitor {
private int d1, d2, ...
synchronized void foo() { ... }
synchronized int bar() { ... }
}
2 ops
Pthread | Java |
---|---|
sleep() | wait() |
signal() | notify() (can only invoke inside monitor) |
wait() inside a monitor will give it up & sleep (atomic)
On notify(), a sleeping thread may be woken up. Note that a woken thread cannot continue until it has reacquired the lock
How does it work?
Atomic ops
enter T:
if no one is in T
enter
else
add T to mq & sleep
exit T:
wake up thread in mq
wait T cvq:
add T to cvq
sleep
exit
notify cvq:
take a thread from cvq & put to mq
notifyAll cvq:
move all threads from cvq to mq
notify()
R/W Lock
int readers = 0
BinarySemaphore r, rw // init to 1
reader:
down r
readers++
if readers == 1 // acquire lock if first reader
down rw
up r
read()
down r
readers--
if readers == 0 // give up lock if last reader
up rw
up r
writer:
down rw
write()
up rw
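A Java rendering of this reader/writer scheme with binary semaphores (names are mine; note this version can starve writers if readers keep arriving).

```java
import java.util.concurrent.Semaphore;

class ReadWriteLock {
    private int readers = 0;
    private final Semaphore r = new Semaphore(1);  // protects `readers`
    private final Semaphore rw = new Semaphore(1); // held by the writer, or by readers as a group

    void readLock() throws InterruptedException {
        r.acquire();
        if (++readers == 1) rw.acquire(); // first reader locks out writers
        r.release();
    }

    void readUnlock() throws InterruptedException {
        r.acquire();
        if (--readers == 0) rw.release(); // last reader lets writers in
        r.release();
    }

    void writeLock() throws InterruptedException { rw.acquire(); }
    void writeUnlock() { rw.release(); }
}
```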
Calls like wait() and sleep() require InterruptedException to be handled. Note that some wakeups may be spurious. Threads can also use .interrupted() to check their interruption status.
Priority Inversion
Low priority thread locks and executes → high priority thread enters and attempts to acquire the lock → medium priority thread comes in and preempts the low priority thread → high priority thread must wait for the medium priority thread to finish before the lock can be released
Mars Pathfinder
Solutions
Priority Inheritance
Priority Ceilings
Barrier
For more than 2 threads
volatile int numThreads = n
// for each thread
while true
    // work
    if FA(numThreads, -1) == 1 // last to arrive (FA returns the old value)
        numThreads = n // reset for reuse
    else
        while numThreads != 0
            yield()
Resetting numThreads for reuse races with threads still spinning, which motivates the sense-reversing barrier below.
Sense Reversing Barrier
boolean phase = true
boolean phases[n] // all false; per-thread sense
// for each thread
if FA(numThreads, -1) == 1 // last to arrive
    numThreads = n
    phase = phases[id] // flip global phase, releasing the waiters
else
    while phases[id] != phase
        yield()
phases[id] = !phases[id] // flip local sense for the next phase
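A Java sketch of the sense-reversing barrier (class and field names are mine), following the pseudocode above: the last arrival resets the count and flips the global phase; everyone flips their local sense afterwards.

```java
import java.util.concurrent.atomic.AtomicInteger;

class SenseBarrier {
    private final int n;
    private final AtomicInteger numThreads;
    private volatile boolean phase = true;          // global sense
    private final ThreadLocal<Boolean> mySense =
            ThreadLocal.withInitial(() -> false);   // per-thread sense, all false

    SenseBarrier(int n) {
        this.n = n;
        numThreads = new AtomicInteger(n);
    }

    void await() {
        boolean sense = mySense.get();
        if (numThreads.decrementAndGet() == 0) {    // last to arrive
            numThreads.set(n);                      // reset for reuse
            phase = sense;                          // flip global phase, releasing waiters
        } else {
            while (phase != sense) Thread.yield();  // spin until the phase flips
        }
        mySense.set(!sense);                        // flip local sense for the next phase
    }
}
```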
TSD/TLS
errno in C
TSD - pthread
TLS - Java
// Java TLS
static ThreadLocal<Integer> tls = new ThreadLocal<>();
// per thread; independent
tls.set(2); // t0
tls.set(1); // t1
TSD
Dining Philosopher Problem
5 philosophers at a round table with 5 plates and 5 utensils. Each will think, eat, and repeat. Eating requires picking up the two utensils adjacent to the philosopher. Goal is to avoid deadlock.
Solutions
Single Global Mutex
think()
P(global)
eat()
V(global)
Works, but not very concurrent
Mutex per Utensil
c = mutex[5]
think()
P(c[i])
P(c[(i + 1) % 5])
eat()
V(c[(i + 1) % 5])
V(c[i])
Doesn’t actually work. If everyone tries to grab their left utensil, no one will be able to grab their right utensil and complete eat()
Create Global Lock
lock = mutex // global lock
c = mutex[5] // lock per utensil
think()
P(lock)
P(c[i])
P(c[(i + 1) % 5])
V(lock)
eat()
V(c[(i + 1) % 5])
V(c[i])
Works, but still not very concurrent. If philosopher 1 eats, and philosopher 2 attempts to eat, philosopher 2 will hold the global lock and block others who could have eaten from eating.
Order the Locks
think()
j = (i + 1) % 5
P(c[min(i, j)])
P(c[max(i, j)])
eat()
V(c[max(i, j)]) // order here isn't as important
V(c[min(i, j)]) // but reverse unlock is good practice
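A Java sketch of this lock-ordering solution with one binary semaphore per utensil (class name is mine): acquiring the lower-numbered utensil first breaks the circular wait.

```java
import java.util.concurrent.Semaphore;

class Philosophers {
    private final Semaphore[] c = new Semaphore[5]; // one mutex per utensil

    Philosophers() {
        for (int i = 0; i < 5; i++) c[i] = new Semaphore(1);
    }

    void dine(int i) throws InterruptedException {
        int j = (i + 1) % 5;
        Semaphore first = c[Math.min(i, j)];
        Semaphore second = c[Math.max(i, j)];
        first.acquire();   // P(c[min(i, j)])
        second.acquire();  // P(c[max(i, j)])
        // eat()
        second.release();  // reverse-order unlock
        first.release();
    }
}
```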
Solutions
Coffman’s Conditions
Livelock
Race Condition
Concurrent Pros
Consensus
caslock
CAS x a b:
bool rc
while TS(caslock, 1) == 1; // spin
if x == a
x = b
rc = true
else
rc = false
caslock = 0 // unlock
return rc
Consensus number of a sync primitive is the max # of threads for which it can solve the consensus problem
(diagram: a bivalent initial state {0, 1} leads to states that are still bivalent {0, 1} or have become univalent {0} or {1})
(I got tired after this part)
R/W registers - consensus # 1
FA, TS - consensus # 2
int decide(int v):
    x = TS(decider, v) // x is the old value; decider starts at 1, a value never proposed
    if x == 1: // we were first
        return v
    return x
Solves 2 consensus, but not 3
CAS, consensus # ∞
int decide (int v):
CAS(decider, 1, v)
return decider
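The same decide() in Java; BOTTOM is an assumed sentinel that no thread ever proposes (the pseudocode uses 1 for this role). Only the first thread's CAS succeeds, so any number of threads agree on the winner's value, matching the infinite consensus number.

```java
import java.util.concurrent.atomic.AtomicInteger;

class CASConsensus {
    private static final int BOTTOM = Integer.MIN_VALUE; // never a proposed value
    private final AtomicInteger decider = new AtomicInteger(BOTTOM);

    int decide(int v) {
        decider.compareAndSet(BOTTOM, v); // only the first CAS succeeds
        return decider.get();             // everyone returns the winner's value
    }
}
```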
Example - 2 queues p, q - enqueue, dequeue
T0 | T1 |
---|---|
1: p.enq(x) | 4: q.enq(y) |
2: q.enq(x) | 5: p.enq(y) |
3: p.deq() -> returns y | 6: q.deq() -> returns x |
(numbers are just for future reference and do not refer to runtime order)
is this linearizable?
Memory models
T0 | T1 |
---|---|
x = 1 | y = 2 |
b = y | a = x |
(variables start at 0)
Possibilities: (a, b) = (1, 0), (0, 2), (1, 2)
Note that cases like (0, 0) are not possible under any interleaving
Write-buffering
P0 | WB0 | Mem | P1 | WB1 |
---|---|---|---|---|
x = 1 | → x = 1 | x = 0 | y = 2 | → y = 2 |
- | - | y = 0 | - | - |
b = y | → b = 0 | a = 0 | - | - |
- | - | - | a = x | → a = 0 |
- | writes x | x = 1 | - | - |
- | - | y = 2 | - | writes y |
- | - | a = 0 | - | - |
- | - | b = 0 | - | - |
As a result, with buffering, we can have (0, 0).
∴ we cannot just think of interleavings
T0 | T1 | T2 | T3 | |
---|---|---|---|---|
x = 1 | x = 3 | a = x(1) | d = x (3) | |
y = 2 | y = 4 | b = y (2) | e = y (4) | |
- | - | c = x (3) | f = x (1) |
No class
Intel/AMD
P0 | P1 | P2 | P3 |
---|---|---|---|
x = 1 | y = 1 | eax = x, ebx = y | ecx = y, edx = x |
For P2, it sees eax = 1, ebx = 0, meaning P0 happened before P1. For P3, it sees ecx = 1, edx = 0, meaning P1 happened before P0.
There are also cases that should not happen, but can be observed in practice
n6
P0 | P1 |
---|---|
x = 1 | y = 2 |
eax = x | x = 2 |
ebx = y |
Could observe that eax = 1, ebx = 0, x = 1
n5
P0 | P1 |
---|---|
x = 1 | x = 2 |
eax = x | ebx = x |
eax = 2, ebx = 1 not disallowed, but also not observed
Rules for the abstract machine:
- p can read value v from address a if p is not blocked, there are no writes to a in WBp, and mem[a] = v
- p can read v from address a if p is not blocked and p has a (latest) store “a = v” in WBp
- p can write a = v into WBp at any time
- If p is not blocked, it can silently send the oldest write in WBp to memory
- p can issue an MFENCE instruction (flushing WBp)
- If the machine lock is free, p can acquire it
- If p has the lock & WBp is empty, it can release the lock

Example
P: WBp = [0x55] = 0 | Q: WBq = [0x55] = 7 |
---|---|
Lock : Inc [0x55] | |
Lp | |
Rp[0x55] = 0 | |
Wp[0x55] = 1 | |
τp (0x55 = 0) | |
τp (0x55 = 1) | |
Up | τq |
Missed class
Lock free designs make use of CAS, LL/SC, etc as opposed to locks
tryAdd(Node n, Node prev):
    n.next = prev.next
    return CAS(prev.next, n.next, n)
Sadly, these naive methods do not work.
Given a list H → x → y → z → T:
- If one thread removes x, and another thread tries to add w between x and y, w will be lost.
- If one thread removes x and another thread tries to remove y, both may succeed with y remaining in the list.
Various solutions exist
tryAdd(Node n, Node prev):
    n.next = prev.next
    return CAS(prev.next, n.next to false, n to false)
tryRemove(Node n, Node prev):
Node succ = n.next
if CAS(n.next, succ to false, succ to true): // mark first
CAS(prev.next, n to false, succ to false) // delete is ok to fail
return true
return false
find(int data):
while(true):
pred = H
curr = pred.next
while curr != T: @restart
succ = curr.next
while curr.marked:
if !CAS(pred.next, curr to false, succ to false):
continue@restart
curr = succ
succ = curr.next
if curr.data == data:
return curr
pred = curr
curr = succ
return null
With a lock-free stack, we still have a lot of contention. Stack ops are fundamentally sequential. However, if push() and pop() show up at the same time, it might be better to just short-circuit the whole thing: push() gives the value to pop(); the push + pop cancel and do nothing.
Lock free exchanger (2 threads exchange data)
Three states: EMPTY, WAITING, BUSY
EMPTY - no thread is here yet; a thread ready to do the swap CASes its item in and moves the state to WAITING, but only if the state is EMPTY
WAITING - one thread is waiting with its item; the second thread to show up takes that item and CASes in its own item B, moving the state (WAITING, B to BUSY)
BUSY - two threads are mid-swap (a third thread to show up must give up); the waiting thread picks up B and resets the state to EMPTY
Can associate the exchanged item with the state
Lock free algorithms are problematic in that references from deallocated objects may be reused unknowingly
Universal construction - almost any data structure can be made into a lock free version
newInvoc(...):
do:
j = consensus i
while i != j
// we are now the next op
s = tail
r = tail.state
do:
r = r.apply s // without modifying data, check the previous ops to update the state
s = s.next
while s != i
return r
One concern is that our consensus algorithm is one shot and not reusable. However, knowing that we can construct a new consensus object with each invocation, this is not an issue.
Not a language, but rather a set of directives (structured comments: pragma) on top of an existing language that makes parallelism easy
Pragma | Meaning |
---|---|
#pragma omp parallel | For single statements |
#pragma omp for | For loops; will be partitioned amongst the thread team |
#pragma omp sections | Another way of partitioning work |
#pragma omp single | Part within a section that should only be executed by one thread |
#pragma omp master | Part that must be executed by a specified thread (master) |
Threads share static, heap data
shared(x, y, z)
private(x, y, z) - each thread has its own copy; uninitialized, not persistent
threadPrivate(x, y, z) - like private, but persistent & presumably initialized
firstPrivate - var is initialized from the present scope
lastPrivate - value is given back to the parent
reduction(opList) - opList can be + (init at 0) or * (init at 1)…
X10
Basic mechanism:
async {
... content
}
No guarantee as to when async is done, if at all
finish {
... content
}
All async’s inside the finish block (including nested async’s) must be done before this continues
Async indicates that a new thread may be created for the code to run. Oftentimes the operations are small, and making a new thread would be extremely inefficient; optimizations are done to decide when it is worth making new threads.
Java has executor mechanism:
execute(Runnable r)
Given that runnables take in nothing and output nothing, they are not always enough. There also exists Callable<V>, which returns a V and allows exception throwing.
ExecutorService gives different ways of executing. ThreadPoolExecutor allows for specifications for pools of threads.
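A small usage sketch of the executor mechanism (class name is mine): a fixed pool running both a Runnable (no result) and a Callable<Integer> (result retrieved through a Future).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        pool.execute(() -> System.out.println("runnable: no input, no output"));

        Future<Integer> f = pool.submit(() -> 6 * 7); // Callable<Integer>
        System.out.println("callable returned " + f.get());

        pool.shutdown();
    }
}
```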
Last Time
If P & Q are processes, then P | Q is a parallel composition
Ex: $(x(y).e_1 \mid \overline{x}z.e_2)$ reduces to $(e_1[y \to z] \mid e_2)$
Equivalence (structural congruence): $(\nu x)\varnothing \equiv \varnothing$, $\varnothing \mid P \equiv P \mid \varnothing \equiv P$
Associative: $(P \mid Q) \mid W \equiv P \mid (Q \mid W)$
I was away
Last time