
Last modified on 17 Dec 2021

Comp 550

Course Website

Terms & Concepts

  • CL - computational linguistics
  • NLP - natural language processing
  • NLU - natural language understanding
  • NLG - natural language generation
  • Semantics, pragmatics, discourse, syntax, phonology, phonetics, morphology
  • Morpheme - inflection, derivation; free, bound (prefixes, suffixes, infixes, circumfixes)
  • Language - isolating, synthetic
  • Lemmatization
  • Lexicon
  • Morphotactics
  • FSA - finite state automata - (Q, Σ, q0, δ, F)
  • Porter stemmer
  • Morphological parsing - surface (foxes), intermediate (fox^s#), underlying (fox +N +Pl)
  • FST - finite state transducer - FSA with output symbols (Δ)
  • Word - tokens, types
  • Term frequency - TF(w, S)
  • Relative frequency - RF(w, S)
  • Corpus
  • Zipf’s Law, Zipf-Mandelbrot Law
  • Long tail
  • N-grams - unigram, bigram, trigram
  • k-fold cross validation
  • Entropy - H(p)
  • Cross entropy - H(p, q)
    • Estimation
  • Perplexity - Perplexity(p, q)
  • MLE - maximum likelihood estimation
  • Good-Turing smoothing
  • Naïve Bayes
  • Supervised, semi-supervised, unsupervised learning
    • Clustering, grammar induction, word relatedness
  • POS - part of speech
  • Classification, regression
  • N-grams, bag of words
  • Naïve Bayes
  • POS, modals, auxiliary verbs, conjunctions, particles
  • Open classes, neologisms, content words
  • Closed classes, function words
  • Forward algorithm, backward algorithm, forward & backward, Viterbi algorithm
  • Random restarts, biased initialization
  • Chunking (shallow parsing), named-entity recognition
  • IOB tagging
  • Generative, discriminative
  • LC-CRF - Linear-chain conditional random fields
  • Constituency
  • FSA vs CFG
  • CYK
  • PCFG
  • Constituent parsing, dependency parsing
  • Hearst
  • Hypernymy, hyponymy
  • Synonymy, antonymy
  • Polysemy
  • Homonymy, homophone, homograph
  • Metonymy, synecdoche
  • Holonymy, meronymy
  • WordNet, synset
  • Lesk, Yarowsky, bootstrapping
  • PMI - pointwise mutual information
  • SVD - singular value decomposition
  • Word embeddings
  • Co-compositionality
  • Domain of discourse, variables, predicates, functions, logical connectives, quantifiers
  • FOL - first order logic
  • Lambda calculus, beta reduction, type-raising
  • Neo-Davidsonian event semantics
  • Russell’s definite descriptions
  • Cooper storage, store
  • Discourse markers
  • Referent, referring expression
  • Anaphora, antecedents, cataphora, zero anaphora, bridging reference
  • Pleonastic pronouns, clefting
  • Hobbs’ algorithm

Lecture 1 - 2018/09/04

Lecture 2 - 2018/09/06

  • Morpheme - subpart of a word that cannot be broken down any further
    • Free morphemes can occur on their own
    • Bound morphemes must occur with other morphemes as parts of words
      • Prefix (before), suffix (after), infix (between), circumfix (around)
  • Inflectional morphology - expresses grammatical function required by language
  • Derivational morphology - derives a new word, possibly of a different part of speech
  • Isolating languages - low morpheme-to-word ratio
  • Synthetic languages - high morpheme-to-word ratio
  • Computation Tasks
    • Morphological recognition - is this a well formed word?
    • Stemming - cut affixes off to find the stem
    • Morphological analysis
      • Lemmatization - remove inflectional morphology & recover lemma
      • Full morphological analysis - recover full structure
  • FSA - finite state automata

Lecture 3 - 2018/09/11

  • Word - smallest unit that can appear in isolation
  • Convenient assumption is that words are delimited by spaces
  • Word frequency
    • Term frequency: TF(w, S) = #w in corpus S
      • eg TF(cat, the cat sat on the mat) = 1
    • Relative frequency: RF(w, S) = TF(w, S) / |S|
      • eg RF(cat, the cat sat on the mat) = ⅙
  • Corpus - collection of text; used for text to count
  • Zipf’s Law: f ∝ 1/r
    • Frequency of word type is inversely proportional to rank of word type (by frequency)
  • Zipf-Mandelbrot Law - better fit
    • f = P/(r + ρ)^B, or log f = log P - B log(r + ρ)

Lecture 4 - 2018/09/13

  • Entropy - minimum number of bits needed to communicate some message if we know what probability distribution the message is drawn from
  • Cross entropy is for when we don’t know the distribution
    • H(p, q) = -Σ_{i=1}^{k} p_i log₂ q_i
    • Estimate: H(p, q) = -(1/N) log₂ q(w_1…w_N)
    • Perplexity(p, q) = 2^H(p, q)
    • Eg: [A B C B B], M: P(A) = 0.3, P(B) = 0.4, P(C) = 0.3 (worked through in the sketch after this list)
      • q = 0.3 × 0.4³ × 0.3
      • Perp(M) = 2^(-(1/5) log₂ q)
  • Overfitting - model that is tailored to the training data, making it a poor representation of new data
  • OOV - out of vocabulary
  • Explicitly modelling OOV items
    • Replace vocabulary items less frequent than some frequency threshold with <UNK>
    • Treat <UNK> as a regular vocabulary word
    • Use <UNK> for unseen words during testing
  • Smoothing - shift probability mass to cases that we haven’t seen before/are unsure about on unseen data
  • Good-Turing smoothing
    • N = Σ_i f_i × i
    • c* = (c + 1) f_{c+1} / f_c
    • P(w with count c) = c*/N
    • P(UNK) = f_1/N
    • Problem with high c, as f_{c+1} is often 0. Solution is to fit a linear relationship in log space: log f_c = a log c + b

Lecture 5 - 2018/09/18

  • Supervised learning - model has access to some input data and corresponding output data (eg a label)
  • Unsupervised learning - model has only input data
  • Clustering - finding hidden structure in data without labels
  • Learning - creating good characterization of data
  • Semi-supervised learning - outputs for some inputs but not all
  • Grey Area
    • Specific rules
      • eg anything ending in -ed is a verb
      • Variously called semi-supervised, distantly supervised, minimally supervised, or unsupervised
    • Give seed set
      • eg learning a sentiment lexicon; give positive seeds (good, awesome, great) and negative seeds (bad, awful, dreadful)
  • Regression - y is a continuous outcome
  • Classification - y is a discrete outcome
  • Linear Regression
  • N-grams
    • Aka bag-of-words
    • Versions
      • Presence or absence of an N-gram (1 or 0)
      • Count of N-gram
      • Proportion of total document
      • Scaled versions of counts
  • POS tags
    • Crudely captures syntactic patterns
    • Needs preprocessing
  • Naïve Bayes
    • P(y|x) = P(y) ∏_i P(x_i|y) / P(x)

Lecture 6 - 2018/09/20

  • Part of speech
    • Identifies some grammatical properties of words
    • Includes
      • Modal & auxiliary verbs - Did the chicken cross the road?
      • Conjunctions - and, but
      • Particles - look up, turn down
  • Penn Treebank (PTB) Tagset
  • Open class - words tend to be added (neologism)
    • Noun, verb, adjective, etc
  • Closed class - new words tend not to be added
    • Pronoun, determiners, quantifiers, conjunctions, modals & auxiliary, preposition
  • PTB distinguishes between singular and plural nouns, but not transitive and intransitive verbs

Lecture 7 - 2018/09/25

  • P(outcome i) = #(outcome i)/#(all events)
  • π_i = P(Q_1 = i) = #(Q_1 = i)/#(sentences)
  • a_ij = P(Q_{t+1} = j | Q_t = i) = #(i, j)/#(i)
  • b_ik = P(O_t = k | Q_t = i) = #(word k, tag i)/#(i)
  • Forward algorithm
    • Uses DP to avoid unnecessary recalculations
    • Create table of all possible state sequences, annotated with probabilities
    • Trellis - table for possible state sequences
    • Runtime O(N²T)
  • Backward algorithm
    • Trellis with cells β_i(t)
    • Unlike α_i(t), it excludes the current word
  • Forward & backward
    • P(O|θ) = Σ_{i=1}^{N} α_i(t) β_i(t) for any t = 1..T
  • Log sum trick - to avoid underflow (from small numbers), work in log domain
  • Viterbi algorithm - similar to forward algorithm, but replace summation with max
    • Used to find most likely state sequence
    • Backpointers - keep track of max entry; work backwards to recover best label sequence
  • Unsupervised training
    • No state sequences; have to estimate them
    • Initialize params randomly
    • Viterbi EM (‘hard EM’)
      • Prediction and updates using Viterbi algorithm
    • Soft predictions - probabilities of all possible state sequences

Lecture 8 - 2018/09/27

  • Chunking - find syntactic chunks in sentence; not hierarchical
  • Named-entity recognition (NER) - identify elements in text corresponding to high level categories (eg person, organization, location)
  • IOB Tagging - label whether word is inside, outside, or at beginning of span
    • Eg for Org, McGill University would be labelled B I, as both compose a single organization
  • Generative models - learn joint distributions P(X, Y)
    • Can be applied to new tasks, unlike discriminative models
  • Discriminative models - learn conditional distributions P(Y|X)
  • Standard HMM - product of probabilities
    • Forward algorithm - P(X|θ)
    • Viterbi algorithm - argmax_y P(X, y|θ)
  • Linear-chain conditional random fields (CRF) - Sums over features & time-steps
    • Replaces HMM products by numbers that are not probabilities, but linear combinations of weights & feature values
    • Allows word templates
    • Forward algorithm - Z(X)
    • Viterbi algorithm - argmax_Y P(Y|X, θ)
  • To avoid overfitting, make weights close to zero
    • Done by adding an extra term (-θ_k²) for regularization

Lecture 9 - 2018/10/02

  • Linear models
    • Includes logistic regressions, support vector machines, etc
    • Cannot learn complex non-linear functions (eg starts with capital and not at beginning of sentence -> proper noun)
  • Artificial neural network
    • Automatically learns non-linear functions from input to output
  • Feedforward
    • All connections flow forward (no loops)
    • Each layer of hidden units fully connected to next
  • Activation function
    • Linear combination of inputs & weights → non-linearity
    • Popular choices
      • Sigmoid function
      • Tanh function
      • Rectifier/ramp function: g(x) = max(0, x)
    • Need non-linearity, otherwise we are just transforming a linear function into another linear function
  • Loss function
    • Measures how much error is made during predictions
    • y - correct distribution
    • ŷ - predicted distribution
    • L(y, ŷ) - loss function
  • Training
    • Typically by stochastic gradient descent
    • Backpropagation - propagate the error signal backwards from the output to the inputs to compute the weight updates
  • Recurrent neural network
  • Long-term dependencies in language
    • Dependencies between words can be arbitrarily far apart
      • Eg look up [x] can be written as look [x] up
    • Cannot easily model with HMMs or LC-CRFs, but possible with RNNs

Lecture 10 - 2018/10/04

  • Syntax
    • How words can be arranged to form grammatical sentence
    • Invalid sentences are prefixed with *
  • Constituency
    • Group of words that behave as a unit
      • Noun phrases - three people on the bus, Justin Trudeau
      • Adjective phrases - blue, very good
    • Tests for constituency
      • Check if they can appear in similar syntactic environment
        • Eg I saw [it, three people on the bus, *on the]
      • Can be placed in different positions/replaced as a unit
      • Can be used to answer a question
  • Grammatical relations
    • Subject - tend to come before verbs
    • Direct object - the boy kicked the ball
    • Indirect object - she gave him a good beating
    • Some verbs require different number of arguments
      • Eg relax (subj), steal (subj, dobj), give (subj, iobj, dobj)
  • Formal grammar - rules that generate a set of strings that make up a language
    • Format of A → B
  • Constituent tree
    • Trees & sentences generated by previous rules
    • Symbols appearing on the LHS of rules are non-terminals
    • Symbols that only appear on the RHS of rules (the words) are terminals
  • Context-free grammar (CFG)
    • 4-tuple
      • N - set of non-terminal symbols
      • Σ - set of terminal symbols
      • R - set of rules or productions of form A → (Σ ∪ N)*, where A ∈ N
      • S - designated start symbol, S ∈ N

Lecture 11 - 2018/10/09

  • Parsing algorithms
    • Top-down - start at top of tree, expanding downwards with rewrite rules
      • Eg Earley parser
    • Bottom-up - start from input words and build bigger subtrees until a tree spans the whole sentence
      • Eg CYK algorithm, shift-reduce parser
  • CYK Algorithm
    • DP algorithm - partial solutions stored & efficiently reused
    • Steps
      • Convert CFG to appropriate form
      • Set up table of possible constituents
      • Fill in table
      • Read table to recover all possible parses
  • CNF Rules
    • A → B C D becomes A → X1 D and X1 → B C
    • A → s B becomes A → X2 B and X2 → s
    • A → B - replace all B on LHS with A
  • Set up table
    • Let sentence have N words, w[0] ... w[N - 1]
    • Create table where cell i, j represents phrase spanning from w[i:j + 1]
      • Only need half table since i < j
  • Running through table
    • Base case - one word phrase - check all non-terminals
  • PCFG - CFG with probabilities for each rule
    • For each nonterminal A ∈ N
      • Σ_{(α → β) ∈ R s.t. α = A} Pr(α → β) = 1
  • Extend CYK to PCFG by calculating partial probabilities as you fill the table
    • Only best probability is included per cell
  • Train PCFG with a treebank, such as the WSJ portion of the Penn Treebank
    • Simple version: MLE estimate
      • Pr(α → β) = #(α → β) / #α
        • Not great:
          • No context
            • Example: NPs in subject and object positions are not identically distributed
          • Too sparse
            • With the word ‘ate’, the rules for ‘ate quickly’, ‘ate with a fork’, ‘ate a sandwich’, etc are all different. Should be factorized, eg by having an adverbial modifier

Lecture 12 - 2018/10/11

  • Adding context to PCFG
    • Append parent categories (eg NP^S → PRP to NP^VP → PRP)
  • Removing context from long chain rules by splitting them into separate rules
    • Eg Pr(VP → START AdvP VBD NP PP END) to
      • Pr(VP → START AdvP)
      • Pr(VP → AdvP VBD)
      • Pr(VP → VBD NP)
      • Pr(VP → NP PP)
      • Pr(VP → PP END)
  • Markovization
    • Vertical markovization - adding ancestors as context
      • Zeroth order - vanilla PCFGs
      • First order - one parent only (see above)
      • nth order - add n parents
    • Horizontal markovization - breaking RHS into parts
      • Infinite order - vanilla PCFGs
      • First order - pairs (see above)
      • Can choose any order with interpolations
  • Semantics - study of meaning in language
    • Meaning of words - lexical semantics
  • Referent - person/thing to which an expression refers
  • Extensional definition - all things for which a definition applies, as opposed to intension or comprehension
  • Intensional definition - necessary and sufficient conditions
  • Multiple words can refer to the same referent. Used in different contexts → different senses.
    • Example of Venus, the morning star, and the evening star, which all refer to Venus
  • Hyponym/hypernym - ISA (is a) relationship
    • Monkey & mammal, Montreal & city, wine & beverage, etc. ([hyponym] is a [hypernym])
  • Synonym/Antonym - roughly same/different meaning
  • Homonymy
    • Homophone - same sound (son vs sun)
    • Homograph - same written form (lead (noun) vs lead (verb))
  • Polysemy - multiple related meanings
    • Eg newspaper
      • Daily weekly publication
      • Business that publishes newspaper (meaning above)
      • Physical object that is the product of a newspaper publisher
      • Cheap paper made from wood pulp
  • Homonymy is typically with unrelated meaning, whereas polysemy is with related meaning
  • Metonymy - substitution of one entity for another related one
    • Eg
      • Ordering a dish (food, not a physical dish)
      • I work for the local paper (place, not object)
      • Quebec City is cutting our budget again (government, not location)
      • The loonie is at an 11-year low (value, not physical loonie)
    • Synecdoche - specific kind involving whole-part relations
      • All hands on deck
  • Holonymy/meronymy - whole part relationships
    • Holonym → whole, meronym → part
  • Computational approach
    • Heuristic - eg Lesk’s
    • Supervised ML - lots of work
    • Unsupervised - eg Yarowsky’s
  • Lesk’s
    • Given a sentence, construct a bag of words from the definitions of all senses of the context words; compute the overlap between each sense of the target word and that bag; select the sense with the highest overlap
  • Yarowsky’s
    • Gather data set with target word to be disambiguated
    • Automatically label small seed set of examples
    • Repeat for a while
      • Train from seed set
      • Apply model to entire data set
      • Keep highly confident classification outputs to be new seed set
    • Use last model as final model

Lecture 13 - 2018/10/16

  • Hearst patterns - pairs of terms in hyponym-hypernym relationships tend to occur in certain lexico-syntactic patterns
    • Can find relations through key words (cause, induce, stem (from), etc)
    • Can find relations through co-occurring words (such as, and, or)
    • Can find relations through words that tend to co-occur with target words
  • Term-context matrix
    • Each row is a vector representation of word through context
      • Context can refer to nearest ‘x’ neighbouring words
  • Cosine similarity
    • sim(A, B) = (A · B)/(||A|| ||B||)
    • -1 if vectors point in opposite direction, 0 if orthogonal, 1 if same direction
    • Positive (eg count) vectors are scored between 0 and 1
  • Cosine similarity is not enough to determine synonymy
    • High cosine similarity can also indicate antonyms, word senses
  • Similarity - synonymy, hypernymy, hyponymy
  • Relatedness - anything that might be associated (eg good and bad)
  • Cosine similarity is really a measure of relatedness
  • Rescaling vectors
    • Pointwise mutual information (PMI)
      • pmi(w1, w2) = log (P(w1, w2)/(P(w1)P(w2)))
      • Numerator - probability of both words occurring
      • Denominator - probability of each word occurring in general
    • If ratio < 1, PMI is negative
      • Often disregarded → positive PMI (PPMI)
  • Compressing term-context matrix
    • Often very sparse, lots of zeros
    • Singular value decomposition (SVD)
      • X = W Σ Cᵀ
        • m is the rank of matrix X
        • Rows of W are new word vectors
        • Rows of C (columns of Cᵀ) are new context vectors
        • Σ is the diagonal matrix of singular values of X (square roots of the eigenvalues of XᵀX, from highest to lowest)
    • Truncated SVD
      • Throw out some singular values (lowest ones) in Σ
    • Word embeddings
      • Neural network models
    • Skip-gram
  • Validation
    • Word similarity
      • a : b :: c : d - what is d?
        • Eg man : king :: woman : ? (man is to king as woman is to queen)
      • Solved by assuming the vector differences are equal: v_a - v_b = v_c - v_d, ie v_d ≈ v_c + v_b - v_a (take the nearest word vector)
  • Can download pretrained word2vec embeddings from web
    • Advantages - trained, large quantity, cheap, easy to try
    • Disadvantages - doesn’t always work

Lecture 14 - 2018/10/18

  • Compositionality - meaning of phrase depends on meanings of its parts
    • Idioms often violate compositionality
  • Co-compositionality - meanings of words depend also on other words that they are composed with
    • Eg ‘red’ combined with rose, wine, or cheeks does not combine purely compositionally, but co-compositionally (its exact sense depends on the noun it modifies)
  • Semantic inference - making explicit something that is implicit
    • I want to visit the capital of Italy; capital of Italy is Rome; ∴ I want to visit Rome
  • Montagovian semantics - using logical formalism to represent natural language inferences
  • First-order predicate calculus
    • Domain of discourse - set of entities we care about
    • Variables - potential elements of domain
      • Typically lower case
    • Predicates - maps elements of domain to truth values
      • Can be different valences (eg takes multiple elements)
    • Functions - maps elements to other elements
      • Can be different valences
        • Valence 0 function is a constant
    • Logical connectives - ¬, ∧, ∨, →, ↔
    • Quantifiers - ∃, ∀
  • Interpretation of first order logic (FOL)
    • Domain of discourse (D)
    • Mapping for functions to elements of D
    • Mapping for predicates to True or False
    • All students who study and do homework will get an A
      • ∀x.(study(x) ∧ hw(x)) → grade(x) = A
  • Lambda calculus - describes computation using mathematical functions
    • Definitions
      • Variables (eg x)
      • λx.t, t is a lambda term
      • ts, t and s are lambda terms
    • Function application (or beta reduction)
      • Given (λx.t)s, replace all instances of x in t with expression s
      • Left associative - abcd = ((ab)c)d
      • (λx.x + y)2 → 2 + y
      • (λx.xx)(λx.x) = (λx.x)(λx.x) = λx.x
    • Allows for partial computations
  • Syntax-driven semantic composition
    • Augment CFG with lambda expressions (syntactic composition = function application)
    • Semantic attachments
      • Syntactic composition: A → a1 … an
      • Semantic attachment: {f(aj.sem, …, ak.sem)}
    • Proper noun: PN → COMP550 to {COMP550}
    • Type-raised proper noun: PN → COMP550 to {λx.x(COMP550)}
      • We create a function taking in the PN as argument
    • NP rule: NP → PN to {PN.sem}
    • Common noun: N → student to {λx.Student(x)}
      • Type ⟨e, t⟩; takes an entity and returns a truth value
    • Intransitive verbs: V → rules to {λx.∃e.Rules(e) ∧ Ruler(e, x)}
      • Created an event variable e
    • Composition: S → NP VP to {NP.sem(VP.sem)}
  • Neo-Davidsonian event semantics
    • Method 1: multi-place predicate: Rules(x)
    • Method 2: Neo-Davidsonian version with event variable: ∃e.Rules(e) ∧ Ruler(e, x)
      • Reifying event variable makes things more flexible
        • Optional elements such as location and time
        • Can add information to event variable

Lecture 15 - 2018/10/23

  • Neo-Davidsonian semantics cont.
    • Transitive verbs: V → enjoys to {λw.λz.w(λx.∃e.Enjoys(e) ∧ Enjoyer(e, z) ∧ Enjoyee(e, x))}
  • Note that for quantifiers, universal quantifiers use →, while existential ones use &wedge;
  • Russell (1905)’s Definite descriptions
    • How to represent ‘the’ in FOL?
      • In “the student took Comp550”, we must enforce
        • An entity who is a student
        • At most one thing referred as a student
        • Student participates in some predicate (“took Comp550”)
      • ∃x.Student(x) ∧ ∀y.(Student(y) → y = x) ∧ took(x, COMP550)
    • Det → a to {λP.λQ.∃x.P(x) ∧ Q(x)}
    • Det → the to {λP.λQ.∃x.P(x) ∧ ∀y(P(y) → y = x) ∧ Q(x)}
  • Scopal ambiguity
    • Eg. Every student took a course
      • every > a: for all students, there exists a course that was taken by that student
      • a > every: there exists a course for which all students took
  • Underspecified representation - meaning representation that can embody all possible readings without explicitly enumerating all of them
  • Cooper Storage
    • Associate a store with each FOL so that each reading can be recovered
    • Every student took a course
      • ∃e.took(e) ∧ taker(e, s1) ∧ takee(e, s2)
        • (λQ.∀x.Student(x) → Q(x), 1)
        • (λQ.∃y.Course(y) ∧ Q(y), 2)
      • We do not specify the order of each sub expression
    • Compositional rules now modify the inside part of a store, and introduce new index variables (see the sketch after this list)

Lecture 16 - 2018/10/25

  • Coherent - property of discourse that makes sense
    • Contains logical structure/meaning
  • Incoherent - eg two completely unrelated sentences
  • Cohesion - use of linguistic devices to tie together text units
    • Lexical cohesion - related words in passage
    • Discourse markers - cue words for discourse relations
      • Eg also, and, therefore, however
  • Referent - actual entity in the world
  • Referring expressions - phrases that refer to the referent
  • Referring expressions towards the same referent are said to corefer
  • Antecedent - thing that exists before or logically precedes another
  • Anaphor - points to an antecedent that precedes it
  • Cataphor - points to an antecedent that follows it
  • Zero anaphora - omitting pronouns in certain contexts
    • Languages with this are referred to as pro-drop
    • Eg Spanish: No habl-o español (‘(I) don’t speak Spanish’ - the subject pronoun is dropped)
    • Others can be dependent on context
  • Bridging reference - reference to entities that can be inferred from previously mentioned entity
    • I like my office. The windows are large.
  • Non-referential pronouns
    • Pleonastic pronouns
      • It is raining - not sure what ‘it’ is referring to
    • Clefting
      • It is he that is Bob - ‘it’ used to focus on some point
  • Pronominal anaphora resolution
    • Relevant info to identify pronouns include gender & number (eg 3Sg), syntactic information, and recency
  • C-command - reflexives that must be bound by a subject in certain syntactic relationships
    • Eg ‘the students taught themselves’, themselves refers to the students, if it was ‘them’ it would refer to some other group
  • Hobbs’ algorithm
    1. Begin at NP node immediately dominating the pronoun
    2. Go to first NP or S above it. Call this node X and the path to it p
    3. Do left to right breadth first traversal of all branches below X to left of p. Propose as antecedent any NP node encountered that has an NP or S between it and X
    4. If X is highest S in sentence, consider parse trees of previous sentences in recency order, and traverse each in left to right breadth first order. When NP encountered, propose as antecedent. If X not highest S, continue to step 5
    5. From X, go to first NP or S above it. Call this node X and path to it p
    6. If X is NP and p doesn’t pass through Nominal that X immediately dominates, propose X as antecedent
    7. Do left to right breadth first traversal of all branches below X to left of p. Propose any NP encountered as antecedent
    8. If X is S, do left to right breadth first traversal of all branches below X to right of p, but don’t go below any NP or S encountered. Propose any NP encountered as antecedent
    9. Go to step 4

Lecture 17 - 2018/10/30

  • This-anaphora
    • May refer to complex chunks of information
  • Midterm review
    • CYK