Comp 551
Course Website
Lecture 1 - 2019/01/08
 10% weekly quizzes
 60% projects
 30% midterm
 2-hour midterm: March 26, 6-8 pm
 Projects in groups of three; can’t have same group more than once
 Case studies
 Computer vision - recognizing faces
 NLP - machine translation
 Core principle is representing words as vectors
 Social recommendations
 Important to consider ethical implications
 Game playing - e.g., AlphaZero
 Supervised learning
 Classification - e.g., spam or not spam; predicting a category
 Regression - e.g., output speed for a self-driving car; predicting a continuous variable
 Unsupervised learning
 Generation - e.g., creating synthetic images
 Learning = representation + evaluation + optimization
 Representation - the classifier must be represented in some formal language that the computer can handle
 E.g., SVMs, naive Bayes, logic programs, neural networks
 Evaluation - determines whether a classifier is good or bad
 E.g., accuracy, squared error, cost/utility, margin
 Optimization - search for the classifier with the highest evaluation score
 E.g., greedy search, gradient descent, linear programming
 Goal of ML is to generalize beyond examples in training set
 Cross-validation - randomly splitting the data into training and validation sets; helps avoid overfitting
 A learner that makes no assumptions cannot beat random guessing; fortunately, most data is not completely random, so general assumptions can be made
 Bias - tendency to consistently learn the same wrong thing
 Variance - tendency to learn random things irrespective of the real signal
 A regularization term also helps avoid overfitting
 E.g., penalize classifiers with more structure, since larger structures are more likely to overfit
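As one concrete instance of a regularization term (ridge/L2 is my choice of example here; the lecture only states the general idea), a minimal sketch:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """L2-regularized least squares: argmin_w ||Xw - y||^2 + lam * ||w||^2."""
    m = X.shape[1]
    # Closed form (X^T X + lam*I)^{-1} X^T y: the lam*I term penalizes large
    # weights, and also keeps the matrix invertible even if X^T X is singular.
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
```

Larger lam shrinks the weights harder, trading training error for a simpler (lower-variance) model.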
 Multiple testing inflates spurious correlations; controlling the false discovery rate mitigates this
 Curse of dimensionality - generalizing becomes exponentially harder as dimensionality increases
 Adding new features that are irrelevant may hinder performance, since they increase dimensionality without adding signal
 Blessing of non-uniformity - counteracting the point above, examples in most applications are not spread uniformly throughout the instance space, but are concentrated on a lower-dimensional manifold
 E.g., the set of handwritten digits is much smaller than the set of all pixel combinations in images of the same size
 Simplicity does not imply accuracy
 The common assumption that, given two classifiers with the same training error, the simpler one will have the lower test error does not hold in general
 Not everything representable is learnable
 Correlation does not imply causation
Lecture 2 - 2019/01/10
 Predictions - classification vs regression
 I.I.D. assumption - independent and identically distributed
 An important assumption about the training set
 Empirical risk minimization (ERM)
 Define a notion of error
 Want to find the function that minimizes error over the training set
 Linear hypothesis
 A weight attached to each input variable, plus a bias term
 Least squares solution method
 Large coefficient & small standard error → expect small change in x to have a large effect on y
 Gradient descent
 Incremental approach to minimizing mean squared error (MSE)
 Take steps in negative direction of gradient
 Multiplied by parameter α, which is the learning rate
 Keep descending until error is within some acceptable magnitude
 α_{k} form a Robbins-Monro sequence if
 Σ_{k=0}^{∞} α_{k} = ∞
 Σ_{k=0}^{∞} α_{k}^{2} < ∞
 Ensures convergence of w_{k} to a local minimum of the error function
 Example α_{k}:
 α_{k} = 1/(k + 1), which satisfies both conditions
 α_{k} = 1/2 for k = 1, …, T (a constant step fails the second condition if run indefinitely)
 In non-convex functions, a local minimum is not necessarily the global minimum
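A minimal gradient descent sketch for the mean squared error, using the α_{k} = 1/(k + 1) step size from the example above (the function and variable names are mine, not the course's):

```python
import numpy as np

def gradient_descent_mse(X, y, steps=1000):
    """Minimize MSE(w) = (1/n) * ||Xw - y||^2 by stepping against the gradient."""
    n, m = X.shape
    w = np.zeros(m)
    for k in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)  # gradient of the mean squared error
        alpha = 1.0 / (k + 1)  # Robbins-Monro: sum diverges, sum of squares converges
        w = w - alpha * grad
    return w
```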
Lecture 3 - 2019/01/15
Linear Regression
 Least squares solution: ŵ = (X^{T}X)^{-1}X^{T}y, where y is the vector of targets (see the sketch after this list)
 Two main failures
 X^{T}X is singular, so the inverse is undefined
 Weights may not be uniquely defined - e.g., one feature is a linear function of another
 The number of features/weights (m) exceeds the number of training examples (n)
 Simple linear function is a bad fit
 Solutions for failures
 Pick a better function
 Use more features
 Get more data
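A sketch of the closed-form least squares solution above, with a fallback for the singular X^{T}X failure mode (using the pseudoinverse here is my choice of remedy, not necessarily the lecture's):

```python
import numpy as np

def least_squares(X, y):
    """Closed form w_hat = (X^T X)^{-1} X^T y, guarding against a singular X^T X."""
    try:
        return np.linalg.solve(X.T @ X, X.T @ y)
    except np.linalg.LinAlgError:
        # X^T X is singular (collinear features, or m > n): the weights are not
        # uniquely defined, so return the minimum-norm solution instead.
        return np.linalg.pinv(X) @ y
```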
 Input variables
 Original quantitative variables
 Transformation of variables (e.g., log)
 Basis expansions
 Interaction terms (combining multiple variables)
 Numeric coding of qualitative variables (e.g., 1 for true, 0 for false)
 Coding variables (feature design)
 E.g., for categorical variables: {blue, red, green, yellow}
 Cannot introduce four binary variables alongside an intercept, as they are perfectly correlated: three false values force the remaining one to be true
 Instead, use three binary variables, with the last colour absorbed into the intercept (the baseline); see the sketch below
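A minimal dummy-coding sketch of the four-colour example (names are illustrative; pandas' get_dummies(drop_first=True) achieves the same encoding):

```python
import numpy as np

COLOURS = ["blue", "red", "green", "yellow"]  # "blue" is the baseline absorbed by the intercept

def dummy_code(values):
    """Encode each colour as 3 binary features; the baseline colour maps to all zeros."""
    return np.array([[1.0 if v == c else 0.0 for c in COLOURS[1:]] for v in values])

print(dummy_code(["blue", "green"]))  # [[0. 0. 0.], [0. 1. 0.]]
```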
 Validation set - compare different models during development
 Test set - evaluate the final performance of the model
 The two sets must be kept separate, or we risk “meta”-overfitting
 Hyperparameter - a parameter that is not learned but can impact performance
 K-fold cross-validation - split the data into k training/validation folds rather than a single split
 Leave-one-out cross-validation - k-fold cross-validation with k = n, so each validation set is a single point
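A minimal k-fold cross-validation sketch (numpy only; the fit and error callables, and all names, are assumptions of mine rather than course code):

```python
import numpy as np

def k_fold_cv(X, y, k, fit, error, seed=0):
    """Average validation error over k random train/validation splits.
    fit(X, y) returns a model; error(model, X, y) scores it."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # k disjoint validation folds
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        model = fit(X[train_idx], y[train_idx])
        errs.append(error(model, X[val_idx], y[val_idx]))
    return np.mean(errs)  # k = len(y) gives leave-one-out cross-validation
```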
Lecture 4 - 2019/01/17
 Applications of classification
 Text classification (spam filtering, sentiment analysis)
 Image classification (face detection, object recognition)
 Prediction of cancer recurrence
 Recommendation systems
 Binary classification - probabilistic approach
 Discriminative learning - estimate P(y|x) directly
 Generative learning - model P(x|y) and P(y), and use Bayes' rule to estimate P(y|x)
 Log-odds ratio: a = ln(P(y = 1|x) / P(y = 0|x))
 How much more likely is y = 1 compared to y = 0?
 The log-odds a is modelled with a linear function, as we don't know the actual probabilities in the real world
 Logistic function: σ(a) = 1/(1 + exp(-a))
 Maximizing the likelihood directly is numerically unstable, as the values are extremely small; maximizing the log-likelihood is preferred
 Cross-entropy loss - measures the number of extra bits needed to encode the data because the model's distribution differs from the true one
 Minimizing the cross-entropy loss = maximizing the likelihood
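Putting these pieces together, a minimal logistic regression sketch trained by gradient descent on the cross-entropy loss (the step size and iteration count are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(a):
    """Logistic function: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, alpha=0.1, steps=1000):
    """Gradient descent on the cross-entropy loss -sum(y log p + (1 - y) log(1 - p))."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)          # model's P(y = 1 | x) for each row of X
        w -= alpha * X.T @ (p - y)  # gradient of the cross-entropy loss w.r.t. w
    return w
```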
Lecture 5 - 2019/01/22
 For assignment, can use different stopping condition; error is often used
 Gradient descent on the sum of squared errors requires a much smaller learning rate than on the mean squared error, since the gradient scales with the number of examples
 Generative learning
 More flexible
 E.g., update P(y) if spammers become more active; P(x|y) stays the same
 Works well with smaller datasets
 Linear Discriminant Analysis (LDA)
 Assumes
 Each class is multivariate Gaussian with mean μ_{y} and covariance matrix Σ
 Different classes have the same covariance matrix Σ
 Quadratic Discriminant Analysis (QDA)
 Allows different covariance matrices, Σ_{y} for each y
 Note that the covariance matrix is m × m, requiring O(m^{2}) parameters - much more costly than logistic regression
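A minimal sketch of estimating LDA's parameters under the assumptions above (the pooled-covariance formula is standard; the code organization and names are mine):

```python
import numpy as np

def fit_lda(X, y):
    """Estimate P(y), per-class means, and the single shared covariance Sigma."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # Pooled covariance: deviations of each point from its own class mean.
    centered = np.vstack([X[y == c] - means[c] for c in classes])
    sigma = centered.T @ centered / len(y)  # m x m, hence O(m^2) parameters
    return priors, means, sigma
```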
 Naive Bayes - assume the x_{j} are conditionally independent given y
 P(x_{j}|y) = P(x_{j}|y, x_{k}) for all j ≠ k
 Σ is diagonal
 Laplace smoothing - add one to each count in the numerator, and the number of possible attribute values to the denominator
 Gives probability greater than 0 to attribute values not seen in training
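A minimal binary-feature naive Bayes sketch with Laplace smoothing (the +1/+2 counts assume each attribute takes two values; all names are mine):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate P(y) and P(x_j = 1 | y) for binary features x_j."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Laplace smoothing: add 1 to each count and 2 (the number of values a
        # binary attribute can take) to the denominator, so unseen values never
        # get probability 0.
        params[c] = {"prior": np.mean(y == c),
                     "theta": (Xc.sum(axis=0) + 1.0) / (len(Xc) + 2.0)}
    return params

def log_posterior(params, x):
    """Unnormalized log P(y|x) = log P(y) + sum_j log P(x_j|y), via independence."""
    return {c: np.log(p["prior"])
               + np.sum(x * np.log(p["theta"]) + (1 - x) * np.log(1 - p["theta"]))
            for c, p in params.items()}
```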