Comp 558
Course Website
Lecture 1  2018/09/04
 Applications of Computer Vision
 Image editing
 Object scanning for 3D printing
 Handwriting recognition
 Face detection and recognition
 Image search & retrieval
 People detection and tracking
 Navigation systems
 Medical imaging
 Biological imaging
Lecture 2  2018/09/06
 720p = 1MP; 1080p = 2MP
 #f0f magenta; #ff0 yellow; #0ff cyan
 The human eye has varying sensitivity to RGB at varying wavelengths
 We are more sensitive to green
 Bayer pattern  checkerboard pattern of [G, R; B, G]
 I_{RGB}(x) = ∫ C_{RGB}(λ) S(x, λ) dλ
 Exposure_{RGB}(x) = I_{RGB}(x) * t
 Analog to digital conversion
 gain control  adjust overall sensitivity of responses
 white balance  adjust relative sensitivity of R, G, B
Lecture 3  2018/09/11
 Image noise  random variation of brightness or color information; usually electronic noise
 Image smoothing  averaging the intensities of RGBs to reduce noise
 Local average (1D)  I_{smooth}(x) = ¼ I(x + 1) + ½ I(x) + ¼ I(x − 1)
 Smoothing an edge will decrease the slope of steps, which blends the edge colors.
 Gaussian function: G(x) = (1/(√(2π) σ)) e^(−x²/(2σ²))
 Integrates to 1
 Often called normal distribution
 Can use it for smoothing weights
 The bigger the σ, the more smoothing there is
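As a sketch, the sampled-Gaussian smoothing weights can be implemented in numpy (function names and parameters here are my own, not from the course):

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    # Sample G(x) = exp(-x^2 / (2 sigma^2)) at integer offsets, then
    # normalize so the weights sum to 1 (discrete analogue of the
    # Gaussian integrating to 1).
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

def smooth(signal, kernel):
    # mode="same" keeps the output the same length as the input.
    return np.convolve(signal, kernel, mode="same")

# A noisy step edge: smoothing reduces noise but also softens the step.
rng = np.random.default_rng(0)
step = np.concatenate([np.zeros(20), np.ones(20)])
noisy = step + 0.1 * rng.standard_normal(40)
smoothed = smooth(noisy, gaussian_kernel(sigma=2.0, radius=4))
```

A bigger σ spreads the kernel's weight over more neighbours, giving more smoothing.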
 Local difference: dI(x)/dx ≈ ½ I(x + 1) − ½ I(x − 1)
 Cross correlation: f(x) cross I(x) ≡ Σ_{u} f(u − x) I(u)
 Sliding a template across an image, and taking the inner product
 Shows how well a filter matches the image
 For the local average (1D)
 f(x) = ¼ when x = +1, −1
 f(x) = ½ when x = 0
 f(x) = 0 otherwise
 I is the image, f is the filter or kernel
 Convolution: f(x) * I(x) ≡ Σ_{u} f(x  u) I(u)
 Summing weighted, shifted versions of the function f
 Shows how much image intensity contributes to the filtered output
 Impulse function: δ(x) = 1 when x = 0, 0 otherwise
 δ(x) * I(x) = I(x)
 Convolving a filter with an impulse returns the filter itself, i.e. its impulse response
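A quick numpy check of these definitions (my own illustration, not course code): convolution with the local-average filter, its equivalence to cross correlation for a symmetric filter, and the impulse identity δ * I = I:

```python
import numpy as np

I = np.array([0., 0., 1., 4., 2., 0.])
f = np.array([0.25, 0.5, 0.25])      # the local-average filter from above

# Convolution: sum of weighted, shifted copies of f.
conv = np.convolve(I, f, mode="same")

# Cross-correlation slides the template without flipping it; for a
# symmetric filter like f the two operations give the same result.
corr = np.correlate(I, f, mode="same")

# Impulse function: convolving the signal with it returns the signal.
delta = np.array([0., 1., 0.])
```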
Lecture 4  2018/09/13
 Motivation for edge detection
 Find boundaries of objects
 Match features in images
 Edge detection based on threshold
 |I(x + 1) − I(x − 1)| > τ, where τ is the threshold
 Can also use partial derivatives to account for changes in both x and y
 Prewitt used gradients [−1, 0, 1; −1, 0, 1; −1, 0, 1] and [−1, −1, −1; 0, 0, 0; 1, 1, 1]
 Sobel used gradients [−1, 0, 1; −2, 0, 2; −1, 0, 1] and [−1, −2, −1; 0, 0, 0; 1, 2, 1]
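A minimal numpy sketch of applying the Sobel kernels (the naive filter2d helper is my own, written for clarity rather than speed):

```python
import numpy as np

# Sobel kernels: x-derivative (left-right) and y-derivative (top-bottom).
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter2d(image, kernel):
    # Naive cross-correlation with zero padding; far too slow for
    # real images, but shows the mechanics.
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical step edge: the gradient is horizontal, so |Gx| >> |Gy|.
img = np.zeros((8, 8))
img[:, 4:] = 1.0
gx = filter2d(img, SOBEL_X)
gy = filter2d(img, SOBEL_Y)
magnitude = np.hypot(gx, gy)
```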
Lecture 5  2018/09/18
 Canny edge detection
 Mark pixel as candidate edge if gradient magnitude is greater than some threshold τ_{high}
 Nonmaximum suppression  compare gradient magnitude with neighbours nearest to the direction of that gradient. Eliminate edge pixels if they are not a local maximum
 Hysteresis thresholding  also keep weaker candidate pixels (gradient magnitude above a lower threshold τ_{low}) if they connect to a strong edge pixel
Lecture 6  2018/09/20
 Inliers  points explained by line model
 Outliers  points not explained by line model
 Reduce sensitivity to outliers by reducing penalty for large distances
 E.g. distance rather than distance squared
 E.g. constant penalty after a certain distance
 Hough transform
 Lines can be represented by a distance r ≥ 0 from the origin and an angle θ in [0, 2π): the line satisfies x cos θ + y sin θ = r, i.e. it is perpendicular to the direction θ and crosses it at distance r from the origin
 For Hough transform, for every θ in [0, 2π), compute r for line in direction θ through point (x_{i}, y_{i}). Vote for this (r, θ), and return the (r, θ) with the maximum count
 To detect peaks, we can smooth the voting map to remove noise
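A sketch of the Hough voting scheme in numpy; here θ ranges over [0, π) with signed r, an equivalent parameterization to the r ≥ 0, θ ∈ [0, 2π) form above (function and parameter names are mine):

```python
import numpy as np

def hough_lines(points, n_theta=180, n_r=200):
    # Accumulator over (r, theta) bins; each point votes for every
    # line through it: r = x cos(theta) + y sin(theta).
    pts = np.asarray(points, dtype=float)
    r_max = np.hypot(pts[:, 0], pts[:, 1]).max() + 1.0
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    votes = np.zeros((n_r, n_theta), dtype=int)
    cols = np.arange(n_theta)
    for x, y in pts:
        r = x * np.cos(thetas) + y * np.sin(thetas)   # one r per theta
        r_bin = np.round((r + r_max) / (2.0 * r_max) * (n_r - 1)).astype(int)
        votes[r_bin, cols] += 1
    # The accumulator peak is the dominant line.
    r_idx, t_idx = np.unravel_index(votes.argmax(), votes.shape)
    r_peak = r_idx / (n_r - 1) * 2.0 * r_max - r_max
    return r_peak, thetas[t_idx], votes
```

In practice one would smooth the accumulator before taking the peak, as the notes suggest.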
 RANSAC  RANdom SAmple Consensus  reduces sensitivity to outliers
 Pick two different points, and fit a line
 Good model if many other points lie near the line
 We can find consensus set by counting the number of remaining points that are within τ from the line
 Do this many times and find the best line model (highest consensus size)
 Take best model, find consensus, then refit using least squares
 Can be scaled up to n points, where n is minimum # of points needed for exact model fit
 Why minimum # of points?
 Trade off is that more points lead to greater chance of including an outlier
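The RANSAC steps above, sketched in numpy (tau, iteration count, and function names are my own choices):

```python
import numpy as np

def ransac_line(points, n_iters=200, tau=0.5, rng=None):
    # Repeatedly fit a line to 2 random points (the minimal sample
    # for a line) and keep the model with the largest consensus set.
    rng = np.random.default_rng(rng)
    pts = np.asarray(points, dtype=float)
    best_inliers = None
    for _ in range(n_iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        p, q = pts[i], pts[j]
        d = q - p
        norm = np.hypot(*d)
        if norm == 0:
            continue
        # Unit normal to the line through p and q; point-line distance
        # is the projection onto this normal.
        n = np.array([-d[1], d[0]]) / norm
        dist = np.abs((pts - p) @ n)
        inliers = dist < tau
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit with least squares on the consensus set.
    x, y = pts[best_inliers].T
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept, best_inliers
```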
 Vanishing points
Lecture 7  2018/09/25
 Recall
 Convolution: f(x, y) * I(x, y) = Σ_{u, v} f(x  u, y  v) I(u, v)
 Zero mean Gaussian kernel: G(x, y, σ) = 1/(2πσ^{2}) e^{−(x^{2} + y^{2})/(2σ^{2})}
 Gaussian scale space  family of images smoothed by Gaussians of increasing σ
 Heat equation (isotropic diffusion; the right-hand side is the Laplacian): ∂I(x, y, t)/∂t = ΔI(x, y, t) = ∂^{2}I(x, y, t)/∂x^{2} + ∂^{2}I(x, y, t)/∂y^{2}
 Also equivalent to div(∇I(x, y, t))
 Perona–Malik introduces a coefficient c(x, y, t), which is a monotonically decreasing function of the magnitude of the image gradient
 ∂I(x, y, t)/∂t = div(c(x, y, t)∇I(x, y, t)) = c(x, y, t)ΔI(x, y, t) + ∇c(x, y, t) · ∇I(x, y, t)
 It is more efficient and faster to compute with coarser/smaller versions of an input image
Lecture 8  2018/09/27
 Gaussian pyramid  smoothing + subsampling
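One pyramid level can be sketched as smoothing with a separable Gaussian followed by subsampling (a minimal illustration, assuming zero padding at the borders; names are mine):

```python
import numpy as np

def pyramid_down(image, sigma=1.0):
    # One Gaussian-pyramid step: blur, then keep every second pixel.
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    g /= g.sum()
    # Separable smoothing: filter each row, then each column.
    blurred = np.apply_along_axis(
        lambda r: np.convolve(r, g, mode="same"), 1, image)
    blurred = np.apply_along_axis(
        lambda c: np.convolve(c, g, mode="same"), 0, blurred)
    return blurred[::2, ::2]
```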
 Think of ∇I as 2 x 1 matrix (or a column vector)
 Structure tensor is a 2 x 2 matrix given by ∇I (∇I)^{T}
 The eigenvalues & eigenvectors of matrix reveal invariant geometric properties
 Order the eigenvalues in descending order (λ_{1} ≥ λ_{2} ≥ 0)
 Then e_{1} is the direction aligned with the local gradient
 Special case is when λ_{2} = 0, I is constant along e_{2}
 If both are around 0, then no dominant gradient direction locally
 To find corners, find a way to capture locations where I(x_{0}, y_{0}) is different from that of its neighbours
 Keep in mind
 We always ‘smooth’ a little bit (convolution with a kernel) so that ∇I is well defined
From lecture notes (not in class)
 Corner  region where two edges come together at sharp angle

 Intensities at a point are locally distinctive if the following “error” is large relative to the displacement (Δx, Δy) to a local neighbour
 Σ_{(x, y) ∈ Ngd(x_{0}, y_{0})} (I(x, y) − I(x + Δx, y + Δy))^{2}
 Second moment matrix (M)
 Σ_{(x, y) ∈ N gd(x0, y0)} (∇I)(∇I)^{T} = [Σ(∂I/∂x)^{2}, Σ(∂I/∂x)(∂I/∂y) \\ Σ(∂I/∂x)(∂I/∂y), Σ(∂I/∂y)^{2}]
 Harris corners
 Computing eigenvalues involves solving a quadratic equation, which is slow
 We may note that a matrix with eigenvalues λ_{1} and λ_{2} has a determinant of λ_{1}λ_{2} and a trace of λ_{1} + λ_{2}
 With a small constant k, λ_{1}λ_{2} − k(λ_{1} + λ_{2})^{2}:
 Is negative if one eigenvalue is 0
 Is small if both eigenvalues are small
 Trace and determinant can be used to find edges and corners
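The Harris score det(M) − k·trace(M)² can be sketched in numpy (window size and k are illustrative choices; the naive windowed sum is written for clarity, not speed):

```python
import numpy as np

def harris_response(image, k=0.05, window=3):
    # Image gradients via central differences.
    Iy, Ix = np.gradient(image.astype(float))

    def window_sum(a):
        # Naive windowed sum (cross-correlation with a box filter).
        ph = window // 2
        p = np.pad(a, ph)
        out = np.zeros_like(a)
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                out[i, j] = p[i:i + window, j:j + window].sum()
        return out

    # Entries of the second moment matrix M, summed over a local window.
    Sxx = window_sum(Ix * Ix)
    Syy = window_sum(Iy * Iy)
    Sxy = window_sum(Ix * Iy)
    # Harris score: det(M) - k * trace(M)^2 = l1*l2 - k*(l1 + l2)^2,
    # positive at corners, negative on edges, ~0 in flat regions.
    return Sxx * Syy - Sxy**2 - k * (Sxx + Syy)**2

# A white quadrant: its corner should give the peak response.
img = np.zeros((10, 10))
img[5:, 5:] = 1.0
R = harris_response(img)
```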
Lecture 9  2018/10/02
 Midterm material ends this week
 Normalized correlation  d_{correl}(H_{1}, H_{2}) = Σ_{i} H’_{1}(i) H’_{2}(i) / √(Σ_{i} H’_{1}^{2}(i) · Σ_{i} H’_{2}^{2}(i))
 1  perfect match
 −1  perfect mismatch
 0  no correlation
 Chi-square  d_{chi-square}(H_{1}, H_{2}) = Σ_{i} (H_{1}(i) − H_{2}(i))^{2} / (H_{1}(i) + H_{2}(i))
 0  perfect match
 ∞  mismatch
 Bhattacharyya
 0  perfect match
 1  total mismatch
 Histogram equalization
 Histogram of oriented gradients
 Compute gradients
 Divide into bins by orientation
 Count number of entries in each bin
 Dense computation with chosen cell size and block size
 SIFT  scale invariant feature transform
 Feature detector
 Features are local and sparse; invariant to translation, rotation, scale
 Matches feature vectors using correlation
 Translation is easy to handle; most extraction methods are invariant to translation
 Rotation is harder; needs canonical orientation
 Uses orientation at peak of smoothed histogram
 Sometimes multiple peaks pass the threshold; in that case, generate a keypoint for each orientation
 Scaling is even harder; uses scalespace theory
Lecture 10  2018/10/04
See Stanford’s CS231n for more material
 Convolutional neural networks (CNN)
 Black box
 Lots of nodes computing weighted summation of inputs
 Weighted sums are passed through an activation function, which can create sparsity
 Sigmoid  f(z) = (1 + e^{−z})^{−1}
 Hyperbolic tan  f(z) = tanh(z) = (e^{z} − e^{−z})/(e^{z} + e^{−z})
 ReLU  f(z) = max(0, z)
 Softmax  given vector, transforms it so that each entry is in [0, 1] and all entries sum to 1
 Initial layer convolutional, later ones fully connected
 Propagation
 Forward  computing layer (l + 1)’s activations from layer l’s activations, iterating layer by layer
 Backward  propagating gradients in reverse direction to minimize a loss function
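The activation functions above, sketched in numpy (the max-subtraction in softmax is a standard numerical-stability trick, not course material; tanh is np.tanh directly):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive ones.
    return np.maximum(0.0, z)

def softmax(z):
    # Maps a vector to entries in [0, 1] that sum to 1. Subtracting
    # the max avoids overflow and does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```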
Lecture 11  2018/10/09
Midterm next week until lecture 10
 CNNs
 Convolutional layers (MLPs)
 Activation functions: tanh, sigmoid, ReLU
 Pooling, avg, max, others
 Backward Propagation
 Goal is to estimate W and b to minimize a sum-of-squares error loss function
 Review properties of 2nd moment matrix
Lecture 12  2018/10/11
 RANSAC on midterm
 Image registration
 Given two similar images, for each (x_{0}, y_{0}), find (h_{x}, h_{y}) that minimizes the difference
 Assume that intensity variation is linear
 Windows need to be “appropriate”; smaller windows → finer matches
 Deformation matrix
 [s, 0 \\ 0, s]  scales by s
 [cosθ, −sinθ \\ sinθ, cosθ]  CCW rotation by θ
2018/10/16
Midterm
Lecture 13  2018/10/18
Lecture 14  2018/10/23
 Beginning of 3D computer vision
 Left handed frame (Z goes into canvas vs out of canvas)
 To go from 3D to 2D, we can just drop the Z coordinate (orthographic projection)
 Perspective projection  take depth into account by drawing images such that every ray passes through a single center of projection
 Plane: aX + bY + cZ = d
 f/Z  scale factor in perspective projection
 Multiplying the plane equation by f/Z, we can rewrite it as ax + by + cf = fd/Z
 When Z → ∞, ax + by + cf = 0, the line at infinity
 Ground plane
 Y = h
 y = fY/Z = fh/Z (when Z → ∞, y → 0)
 aka ‘horizon’
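A sketch of perspective projection and the ground-plane behaviour above (y = fh/Z → 0 as Z → ∞); names are mine:

```python
import numpy as np

def project(point, f):
    # Pinhole perspective projection: (X, Y, Z) -> (fX/Z, fY/Z).
    X, Y, Z = point
    return np.array([f * X / Z, f * Y / Z])

# Points on the ground plane Y = h recede toward the horizon y = 0
# as their depth Z grows.
h, f = 2.0, 1.0
ys = [project((0.0, h, Z), f)[1] for Z in (1.0, 10.0, 100.0)]
# -> [2.0, 0.2, 0.02]
```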
Lecture 15  2018/10/25
Lecture 16  2018/10/30
 Majority of notes done on paper. Below reflects some bigger points.
 Rotation: [x(t) \ y(t) \ z(t)] = [cos(Ωt) & 0 & sin(Ωt) \ 0 & 1 & 0 \ −sin(Ωt) & 0 & cos(Ωt)] [x_{0} \ y_{0} \ z_{0}]
 Rotation about y axis
 x(t) = f(X(t))/Z(t)
 y(t) = f(Y(t))/Z(t)
 Differentiate both at t = 0
 V_{x} = fΩ(Z_{0}^{2} + X_{0}^{2})/Z_{0}^{2} = fΩ(1 + (x/f)^{2})
 V_{y} = fΩX_{0}Y_{0}/Z_{0}^{2} = Ωxy/f
 Rotation about x axis
 By symmetry, exchange roles of x and y
 V_{x} = Ωxy/f
 V_{y} = fΩ(1 + (y/f)^{2})
 Rotation about z axis
 V_{x} = −Ωy
 V_{y} = Ωx
 Discrete rotation
 [R] [x_{0} \ y_{0} \ z_{0}] = [x_{r} \ y_{r} \ z_{r}]
 R satisfies:
 det(R) = 1
 RR^{T} = R^{T}R = I_{3x3}
 I.e. an orthogonal matrix: R^{T} = R^{−1}
 Rotation matrix preserves angle between 2 vectors
 Rotation matrix has 3 eigenvalues, 2 of which are complex, while the third is 1
 RP = λP
 If you choose the third eigenvalue and its associated eigenvector, then RP = P; P is the axis of rotation
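A numpy sketch of recovering the rotation axis from the eigenvalue-1 eigenvector, checked against the properties above (the example angle is my own):

```python
import numpy as np

def rotation_axis(R):
    # A 3D rotation matrix has eigenvalues {1, e^{ia}, e^{-ia}};
    # the eigenvector for eigenvalue 1 satisfies R P = P, so it is
    # the axis of rotation.
    vals, vecs = np.linalg.eig(R)
    k = np.argmin(np.abs(vals - 1.0))
    axis = np.real(vecs[:, k])
    return axis / np.linalg.norm(axis)

# Rotation about the y axis by 30 degrees.
t = np.deg2rad(30.0)
R = np.array([[ np.cos(t), 0.0, np.sin(t)],
              [ 0.0,       1.0, 0.0      ],
              [-np.sin(t), 0.0, np.cos(t)]])
```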
 To simplify the cross product, let [a]_{×}b = a × b, where [a]_{×} = [0 & −a_{3} & a_{2} \ a_{3} & 0 & −a_{1} \ −a_{2} & a_{1} & 0]
 To map 3D point to 4D point, [x \ y \ z] → [x \ y \ z \ 1]
 Translation [x + t_{x} \ y + t_{y} \ z + t_{z} \ 1] = [1 & 0 & 0 & t_{x} \ 0 & 1 & 0 & t_{y} \ 0 & 0 & 1 & t_{z} \ 0 & 0 & 0 & 1] [x \ y \ z \ 1]
 Scaling [σ_{x}X \ σ_{y}Y \ σ_{z}Z \ 1] = [σ_{x} & 0 & 0 & 0 \ 0 & σ_{y} & 0 & 0 \ 0 & 0 & σ_{z} & 0 \ 0 & 0 & 0 & 1][x \ y \ z \ 1]
 Rotation appears in the top left 3x3, with all other elements as 0, and cell 4, 4 = 1.
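The homogeneous 4 x 4 translation and scaling matrices above, sketched in numpy (helper names are mine):

```python
import numpy as np

def translation(tx, ty, tz):
    # Identity with the translation vector in the last column.
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

def scaling(sx, sy, sz):
    # Diagonal scale factors, with cell (4, 4) = 1.
    return np.diag([sx, sy, sz, 1.0])

def to_homogeneous(p):
    # Map a 3D point to a 4D point by appending 1.
    return np.append(np.asarray(p, dtype=float), 1.0)

p = to_homogeneous([1.0, 2.0, 3.0])
moved = translation(10, 0, -1) @ p      # -> [11, 2, 2, 1]
scaled = scaling(2, 2, 2) @ p           # -> [2, 4, 6, 1]
```

Because both are 4 x 4, transforms compose by matrix multiplication.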
Lecture 17  2018/11/01
Lecture 18  2018/11/06
Lecture 19  2018/11/08
Lecture 20  2018/11/13
 Homography  invertible 3 x 3 matrix that allows you to map pixels in one camera’s image to pixels in another camera’s image
 Case 1  look at 3D scene plane
 Projection of location on 3D plane
 Case 2  multiple cameras
 Case 3  scene no longer planar, but camera undergoes pure rotation
Case 1 (plane)
 We assume that (s, t) are 2D coordinates on scene plane
 Bottom left hand corner of plane has coordinates (x_{0}, y_{0}, z_{0}) in a world coordinate frame
 Unit vectors a, b in the directions of the s and t axes respectively, in world coordinates
 a = (a_{x}, a_{y}, a_{z})
 b = (b_{x}, b_{y}, b_{z})
 Coordinates of point (s, t) on plane in 3D
 (x, y, z) = (x_{0}, y_{0}, z_{0}) + sa + tb
 Componentwise: x = x_{0} + sa_{x} + tb_{x}; y = y_{0} + sa_{y} + tb_{y}; z = z_{0} + sa_{z} + tb_{z}
 To get pixel coordinates, premultiply by P_{3 x 4}
Lecture 21  2018/11/15
Case 2 (stereo (2) cameras)
 Rectification  transform (warp) both cameras’ images such that their axes are in alignment
 T is written with respect to camera 1’s coordinate frame
 Eg T = (1, 0, 0) implies that it is aligned with camera 1’s x axis
 R_{rect} is a 3 x 3 matrix which, when used to premultiply T, rotates it so that you end up with an axis system for camera 1 where the X axis is aligned with T
 To align camera 2, use R_{rect}R_{1}R_{2}^{−1}
 To warp camera one  K_{1}R_{rect}K_{1}^{−1}
 To warp camera two  K_{1}R_{rect}R_{1}R_{2}^{−1}K_{2}^{−1}
 3D scene point X_{1} = (X_{1}, Y_{1}, Z_{1})^{T} is coordinate w.r.t. camera 1
 Line of sight from one camera’s optical center to the second camera’s center intersects the image planes at the two epipoles, e_{1}, e_{2}
 A unique 3D scene point defines an epipolar plane π
 The epipolar plane π intersects the 2 camera image planes at conjugate epipolar lines
 X_{1} (above), along with T_{1} = (T_{x}, T_{y}, T_{z})^{T} and (X_{1} − T_{1})^{T}, lie in the epipolar plane
Lecture 22  2018/11/20
TODO
Lecture 23  2018/11/22
TODO
Lecture 24  2018/11/27
 Lambertian shading  surface whose reflected image obeys the equation:
 I(X) = N(X) · L
 Classical “shape from shading”
 I(x, y) = N(x, y) · L
 Too simplistic
 Most surfaces do not have Lambertian reflectance
 Cannot solve for single N without further constraints
 Hard to distinguish shading and shadows
 Scene colours depend on the illumination colour, though computer and human vision models typically normalize the image, assuming a global colour bias
 Some non Lambertian surface materials
 Plastic
 Polished fruit
 Metal
 Wet surfaces
 Field of view angle is inversely related to focal length (for sensor width w, FOV = 2 arctan(w/(2f)))