Last modified on 23 Jan 2019

Comp 551

Course Website

Lecture 1 - 2019/01/08

  • 10% weekly quizzes
  • 60% projects
  • 30% midterm
  • 2 hour midterm March 26, 6-8 pm
  • Projects in groups of three; can’t have same group more than once
  • Case studies
    • Computer vision - recognizing faces
    • NLP - machine translation
      • Core principle is representing words as vectors
    • Social recommendations
      • Important to consider ethical implications
    • Game playing - eg AlphaZero
  • Supervised learning
    • Classification - eg spam or not spam. Category based
    • Regression - eg output speed for self driving car. Predicting continuous variable
  • Unsupervised learning
    • Generation - eg creating synthetic images

Reading

  • Learning = representation + evaluation + optimization
    • Representation - classification is represented as some formal language that a computer can handle
      • Eg SVM, Naive Bayes, logic programs, neural networks
    • Evaluation determines if a classifier is good or bad
      • Eg accuracy, squared error, cost/utility, margin
    • Optimization - want to find classifier with high evaluation
      • Eg greedy search, gradient descent, linear programming
  • Goal of ML is to generalize beyond examples in training set
  • Cross-validation - randomly divide the training data into subsets, hold each one out in turn for validation while training on the rest; helps avoid overfitting to a single train/test split
  • No learner can beat random guessing over all possible functions; fortunately, real-world functions are not drawn uniformly at random, so general assumptions (eg similar examples tend to have similar labels) allow generalization
  • Bias - tendency to consistently learn the same wrong thing
  • Variance - tendency to learn random things irrespective of the real signal
  • Regularization term also helps avoid overfitting (a small ridge sketch follows this list)
    • Eg penalize classifiers with more structure, since larger structures are more likely to overfit
  • Multiple testing (evaluating many hypotheses on the same data) inflates the chance of spurious correlations; correcting significance tests or controlling the false discovery rate helps
  • Curse of dimensionality - problem increases exponentially as dimensionality increases
    • Adding irrelevant features can hurt performance: in high dimensions all examples start to look alike, and the noise from extra features can swamp the signal
  • Blessing of non-uniformity - to counteract the point above, most applications are not uniform throughout the instance space, but rather concentrated on a lower-dimensional manifold.
    • Ex the set of handwritten digits is typically much smaller than the set of all pixel combinations in images of the same size
  • Simplicity does not imply accuracy
    • The common reading of Occam’s razor (given two classifiers with the same training error, the simpler one has lower test error) does not necessarily hold; prefer simpler hypotheses for their own sake, not because they guarantee accuracy
  • Not everything representable is learnable
  • Correlation does not imply causation
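
A minimal sketch of the regularization idea above, using ridge regression as one standard instance: adding an L2 penalty λ‖w‖² to the squared error shrinks the weights toward zero, which discourages overly flexible fits. The toy data, polynomial features, and value of λ below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data with noise (made up for illustration).
X = rng.uniform(-1, 1, size=(30, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=30)

# Degree-9 polynomial features: flexible enough to overfit 30 points.
Phi = np.hstack([X ** d for d in range(10)])

lam = 1e-2  # regularization strength (a hyperparameter)

# Ordinary least squares vs. ridge: (Phi^T Phi + lam*I)^-1 Phi^T y
w_ols = np.linalg.pinv(Phi) @ y
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

print("OLS weight norm:  ", np.linalg.norm(w_ols))
print("Ridge weight norm:", np.linalg.norm(w_ridge))  # smaller: the penalty shrinks the weights
```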

Lecture 2 - 2019/01/10

  • Predictions - classification vs regression
  • I.I.D. assumption - Independently and identically distributed
    • Important assumption for training set
  • Empirical risk minimization (ERM)
    • Define notion of error
    • Want to find the function that minimizes error over the training set
  • Linear hypothesis
    • Weight and bias term attached to each variable
  • Least square solution method
    • Large coefficient & small standard error → expect small change in x to have a large effect on y
  • Gradient descent
    • Incremental approach to minimizing mean squared error (MSE); a minimal sketch follows this list
    • Take steps in negative direction of gradient
    • Multiplied by parameter α, which is the learning rate
    • Keep descending until error is within some acceptable magnitude
  • The step sizes αk form a Robbins-Monro sequence if
    • Σ_{k=0}^∞ αk = ∞
    • Σ_{k=0}^∞ αk² < ∞
    • Ensures convergence of wk to a local minimum of the error function
    • Example αk
      • αk = 1/(k + 1)
      • αk = 1/2 for k = 1, …, T
  • In non-convex functions, a local minimum is not always a global minimum
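
A minimal sketch of the gradient-descent procedure above, fitting a linear model by minimizing the MSE with the decaying step size αk = 1/(k + 1) from the Robbins-Monro examples. The toy data, iteration cap, and stopping tolerance are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear data: y ≈ 3x + 1 plus noise (made up for illustration).
n = 200
x = rng.uniform(-1, 1, size=n)
X = np.column_stack([np.ones(n), x])      # bias column + feature
y = 1.0 + 3.0 * x + 0.1 * rng.normal(size=n)

w = np.zeros(2)
tol = 1e-8
for k in range(10_000):
    err = X @ w - y
    grad = 2.0 / n * X.T @ err            # gradient of the MSE
    alpha = 1.0 / (k + 1)                 # decaying Robbins-Monro style step size
    w_new = w - alpha * grad              # step in the negative gradient direction
    if np.linalg.norm(w_new - w) < tol:   # stop once updates become tiny
        break
    w = w_new

print("learned weights:", w)              # roughly [1, 3]
```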

Lecture 3 - 2019/01/15

Linear Regression

  • Least-squares solution: ŵ = (XᵀX)⁻¹Xᵀy, where y is the vector of targets (a worked sketch follows this list)
  • Two main failures
    • XᵀX is singular, so its inverse is undefined
      • Weights may not be uniquely defined when one feature is a linear function of another
      • Number of features/weights (m) exceeds the number of training examples (n)
    • Simple linear function is a bad fit
  • Solutions for failures
    • Pick a better function
    • Use more features
    • Get more data
  • Input variables
    • Original quantitative variables
    • Transformation of variables (eg log)
    • Basis expansions
    • Interaction terms (combining multiple variables)
    • Numeric coding of qualitative variables (eg 1, 0 for true, false)
  • Coding variable (feature design)
    • Eg for categorical variables: {blue, red, green, yellow}
      • Should not introduce four binary indicator variables alongside an intercept, as they are perfectly collinear: if three are zero, the fourth must be one
      • Instead use three binary variables, with the remaining colour absorbed into the intercept as the reference category
  • Validation set - compare different models during development
  • Test set - evaluate final performance of model
  • The two sets must be kept separate, or we risk “meta”-overfitting (overfitting to the validation set)
  • Hyperparameter - parameter that is not learned but could impact performance
  • K-fold cross validation - separate data into many training/validation sets vs just one split
  • Leave-one-out cross validation - k-fold cross validation where each validation set is a single training point (k = n)
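
A small sketch combining two points from this list: the closed-form least-squares fit (computed with a pseudo-inverse, which stays defined even when XᵀX is singular) evaluated with K-fold cross-validation. The toy data and the choice of k = 5 are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 100 examples, 3 features, linear target with noise (illustrative).
n, m = 100, 3
X = rng.normal(size=(n, m))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

def fit_least_squares(X, y):
    # pinv gives the least-squares solution and still works if X^T X is singular
    # (eg collinear features, or more features than examples)
    return np.linalg.pinv(X) @ y

def k_fold_mse(X, y, k=5):
    idx = rng.permutation(len(y))         # shuffle, then split into k folds
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                    # held-out validation fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit_least_squares(X[train], y[train])
        errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errors)               # average validation error across folds

print("5-fold validation MSE:", k_fold_mse(X, y, k=5))
```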

Lecture 4 - 2019/01/17

  • Applications of classification
    • Text classification (spam filtering, sentiment analysis)
    • Image classification (face detection, object recognition)
    • Prediction of cancer recurrence
    • Recommendation systems
  • Binary classification - probabilistic approach
    • Discriminative learning - estimate P(y|x)
    • Generative learning - model P(x|y) and P(y), and use Bayes’ rule to estimate P(y|x)
  • Log-odds ratio: a = ln(P(y = 1 | x) / P(y = 0 | x))
    • How much more likely is y = 1 compared to y = 0?
    • The log-odds are modelled with a linear function of x, since the true probabilities are not known in practice
  • Logistic function: σ(a) = 1/(1 + exp(-a))
  • Maximizing likelihood is numerically unstable, as values are extremely small; maximizing log-likelihood is preferred
  • Cross-entropy loss - measures the number of bits needed to encode the true labels under the model’s predicted distribution (see the sketch after this list)
    • Minimizing loss = maximizing likelihood
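
A short sketch of the quantities above: the logistic function maps the linear log-odds a to P(y = 1 | x), and the cross-entropy loss is the average negative log-likelihood that a learner would minimize. The weights and inputs here are fixed by hand for illustration, not learned.

```python
import numpy as np

def sigmoid(a):
    # logistic function: maps the log-odds a to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(y, p, eps=1e-12):
    # average negative log-likelihood of labels y under predicted probabilities p
    p = np.clip(p, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Illustrative weights and inputs (not learned here).
w = np.array([0.5, -1.0])
b = 0.2
X = np.array([[1.0, 2.0], [0.0, -1.0], [2.0, 0.5]])
y = np.array([0, 1, 1])

a = X @ w + b          # log-odds, linear in x
p = sigmoid(a)         # P(y = 1 | x)
print("probabilities:", p)
print("cross-entropy loss:", cross_entropy(y, p))  # minimizing this maximizes the log-likelihood
```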

Lecture 5 - 2019/01/22

  • For assignment, can use different stopping condition; error is often used
  • Gradient descent on the sum of squared errors requires a much smaller learning rate than descent on the mean squared error, since the gradient scales with the number of examples
  • Generative learning
    • More flexible
      • Eg change P(y) if spammers become more active; other features are the same
    • Works well with smaller datasets
  • Linear Discriminant Analysis (LDA)
    • Assumes
      • Each class-conditional P(x|y) is multivariate Gaussian with mean μ and covariance matrix Σ
      • Different classes have same covariance matrix, Σ
  • Quadratic Discriminant Analysis (QDA)
    • Allows different covariance matrices, Σy for each y
    • Note that the covariance matrix is m x m, requiring O(m²) parameters; much more costly than logistic regression
  • Naive Bayes - assume xj are conditionally independent given y
    • P(xj | y) = P(xj | y, xk) for all j ≠ k
    • Σ is diagonal
  • Laplace smoothing - add one to each count in the numerator, and add the number of possible attribute values to the denominator (see the sketch below)
    • Gives nonzero probability to attribute values not seen in training
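
A minimal categorical Naive Bayes sketch illustrating the Laplace-smoothing rule above: each count gets +1 in the numerator, and the number of possible values of the attribute is added to the denominator, so values never seen with a class still get nonzero probability. The tiny spam-style dataset is made up for illustration.

```python
import numpy as np
from collections import defaultdict

# Toy categorical dataset (made up): each row is (feature values, label).
data = [
    (("free", "short"), "spam"),
    (("free", "long"),  "spam"),
    (("work", "short"), "ham"),
    (("work", "long"),  "ham"),
    (("work", "short"), "ham"),
]

# Possible values per feature, needed for the smoothing denominator.
values = [{"free", "work"}, {"short", "long"}]

def train(data):
    class_counts = defaultdict(int)
    feat_counts = defaultdict(int)        # (class, feature index, value) -> count
    for x, y in data:
        class_counts[y] += 1
        for j, v in enumerate(x):
            feat_counts[(y, j, v)] += 1
    return class_counts, feat_counts

def predict(x, class_counts, feat_counts):
    n = sum(class_counts.values())
    scores = {}
    for y, cy in class_counts.items():
        log_p = np.log(cy / n)            # log prior, log P(y)
        for j, v in enumerate(x):
            # Laplace smoothing: +1 in numerator, +|values of feature j| in denominator
            num = feat_counts[(y, j, v)] + 1
            den = cy + len(values[j])
            log_p += np.log(num / den)    # add log P(xj | y) under the naive assumption
        scores[y] = log_p
    return max(scores, key=scores.get)

counts = train(data)
print(predict(("free", "long"), *counts))  # expected: "spam"
```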