Comp 550
Course Website
Terms & Concepts
- CL - computational linguistics
- NLP - natural language processing
- NLU - natural language understanding
- NLG - natural language generation
- Semantics, pragmatics, discourse, syntax, phonology, phonetics, morphology
- Morpheme - inflection, derivation; free, bound (prefixes, suffixes, infixes, circumfixes)
- Language - isolating, synthetic
- Lemmatization
- Lexicon
- Morphotactics
- FSA - finite state automata - (Q, Σ, q0, δ, F)
- Porter stemmer
- Morphological parsing - surface (foxes), intermediate (fox^s#), underlying (fox +N +Pl)
- FST - finite state transducer - FSA with output symbols (Δ)
- Word - tokens, types
- Term frequency - TF(w, S)
- Relative frequency - RF(w, S)
- Corpus
- Zipf’s Law, Zipf-Mandelbrot Law
- Long tail
- N-grams - unigram, bigram, trigram
- k-fold cross validation
- Entropy - H(p)
- Cross entropy - H(p, q)
- Perplexity - Perplexity(p, q)
- MLE - maximum likelihood estimation
- Good-Turing smoothing
- Naïve Bayes
- Supervised, semi-supervised, unsupervised learning
- Clustering, grammar induction, word relatedness
- POS - part of speech
- Classification, regression
- N-grams, bag of words
- Naïve Bayes
- POS, modals, auxiliary verbs, conjunctions, particles
- Open classes, neologisms, content words
- Closed classes, function words
- Forward algorithm, backward algorithm, forward & backward, viterbi algorithm
- Random restarts, biased initialization
- Chunking (shallow parsing), named-entity recognition
- IOB tagging
- Generative, discriminative
- LC-CRF - Linear-chain conditional random fields
- Constituency
- FSA vs CFG
- CYK
- PCFG
- Constituent parsing, dependency parsing
- Hearst
- Hypernymy, hyponymy
- Synonymy, antonymy
- Polysemy
- Homonymy, homophone, homograph
- Metonymy, synecdoche
- Holonymy, meronymy
- WordNet, synset
- Lesk, Yarowsky, bootstrapping
- PMI - pointwise mutual information
- SVD - singular value decomposition
- Word embeddings
- Co-compositionality
- Domain of discourse, variables, predicates, functions, logical connectives, quantifiers
- FOL - first order logic
- Lambda calculus, beta reduction, type-raising
- Neo-Davidsonian event semantics
- Russell’s definite descriptions
- Cooper storage, store
- Discourse markers
- Referent, referring expression
- Anaphora, antecedents, cataphora, zero anaphora, bridging reference
- Pleonastic pronouns, clefting
- Hobbs’ algorithm
Lecture 1 - 2018/09/04
Lecture 2 - 2018/09/06
- Morpheme - subpart of a word that cannot be broken down any further
- Free morphemes can occur on their own
- Bound morphemes must occur with other morphemes as parts of words
- Prefix (before), suffix (after), infix (between), circumfix (around)
- Inflectional morphology - expresses grammatical function required by language
- Derivational morphology - derives a new word, possibly of a different part of speech
- Isolating languages - low morpheme-to-word ratio
- Synthetic languages - high morpheme-to-word ratio
- Computational Tasks
- Morphological recognition - is this a well formed word?
- Stemming - cut affixes off to find the stem
- Morphological analysis
- Lemmatization - remove inflectional morphology & recover lemma
- Full morphological analysis - recover full structure
- FSA - finite state automata
Lecture 3 - 2018/09/11
- Word - smallest unit that can appear in isolation
- Convenient assumption is that words are delimited by spaces
- Word frequency
- Term frequency: TF(w, S) = #w in corpus S
- eg TF(cat, the cat sat on the mat) = 1
- Relative frequency: RF(w, S) = TF(w, S) / |S|
- eg RF(cat, the cat sat on the mat) = ⅙
- Corpus - collection of text; used for text to count
- Zipf’s Law: f ∝ 1/r
- Frequency of word type is inversely proportional to rank of word type (by frequency)
- Zipf-Mandelbrow Law - better fit
- f = P/(r + ρ)^B, or log f = log P - B log(r + ρ)
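- A minimal sketch (toy corpus assumed) that computes term and relative frequencies and checks Zipf’s law by printing rank × frequency, which should stay roughly constant:

```python
from collections import Counter

def zipf_table(text, top_k=10):
    """Count word types, rank them by frequency, and print rank * frequency.
    Under Zipf's law the product should stay roughly constant."""
    tokens = text.lower().split()          # convenient assumption: words delimited by spaces
    counts = Counter(tokens)               # term frequency TF(w, S)
    n = len(tokens)
    for rank, (word, freq) in enumerate(counts.most_common(top_k), start=1):
        rel_freq = freq / n                # relative frequency RF(w, S)
        print(f"{rank:>4} {word:<8} f={freq:<4} RF={rel_freq:.4f} r*f={rank * freq}")

zipf_table("the cat sat on the mat and the dog sat on the log " * 3)
```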
Lecture 4 - 2018/09/13
- Entropy - minimum number of bits needed to communicate some message if we know what probability distribution the message is drawn from
- Cross entropy is for when we don’t know the distribution
- H(p, q) = -Σ_{i=1..k} p_i log2 q_i
- Estimate: H(p, q) = -(1/N) log2 q(w_1 … w_N)
- Perplexity(p, q) = 2^H(p, q)
- Eg: [A B C B B], M: P(A) = 0.3, P(B) = 0.4, P(C) = 0.3
- q = 0.3 × 0.4^3 × 0.3
- Perp(M) = 2^(-(1/5) log2 q)
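- A short sketch reproducing this example under the unigram model M:

```python
import math

def cross_entropy_and_perplexity(sequence, model):
    """Estimate H(p, q) = -(1/N) * log2 q(w1..wN) and perplexity = 2^H."""
    log_q = sum(math.log2(model[w]) for w in sequence)   # log2 of q(w1..wN) under a unigram model
    h = -log_q / len(sequence)
    return h, 2 ** h

model = {"A": 0.3, "B": 0.4, "C": 0.3}
h, perp = cross_entropy_and_perplexity(["A", "B", "C", "B", "B"], model)
print(f"cross-entropy = {h:.3f} bits, perplexity = {perp:.3f}")
```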
- Overfitting - model that is tailored to the training data, making it a poor representation of new data
- OOV - out of vocabulary
- Explicitly modelling OOV items
- Assign vocabulary items less frequent than some frequency threshold to <UNK>
- Treat <UNK> as some vocabulary word
- Use <UNK> for unseen words during testing
- Smoothing - shift probability mass to cases that we haven’t seen before/are unsure about, to improve estimates on unseen data
- Good-Turing smoothing
- N = Σ_i f_i × i
- c* = (c + 1) f_{c+1} / f_c
- P(w_c) = c*/N
- P(UNK) = f_1/N
- Problem with high c, as f_{c+1} is often 0. Solution is to estimate log f_c^LR = a log c + b (LR → linear relationship)
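- A minimal Good-Turing sketch over token counts; it uses the raw f_{c+1}/f_c ratio and simply falls back to c when f_{c+1} = 0, rather than fitting the log-linear relationship mentioned above:

```python
from collections import Counter

def good_turing(counts):
    """Simple Good-Turing estimates from a {item: count} dict.
    c* = (c + 1) * f_{c+1} / f_c,  P(item) = c*/N,  P(unseen) = f_1/N."""
    f = Counter(counts.values())                  # f_c: number of items seen exactly c times
    n = sum(c * fc for c, fc in f.items())        # N = sum_c f_c * c
    def c_star(c):
        return (c + 1) * f[c + 1] / f[c] if f[c + 1] else c   # fall back when f_{c+1} = 0
    probs = {w: c_star(c) / n for w, c in counts.items()}
    p_unseen = f[1] / n                           # mass reserved for unseen items
    return probs, p_unseen

probs, p_unk = good_turing(Counter("the cat sat on the mat and the dog sat on the log".split()))
print(probs, p_unk)
```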
Lecture 5 - 2018/09/18
- Supervised learning - model has access to some input data and corresponding output data (eg a label)
- Unsupervised learning - model has only input data
- Clustering - finding hidden structure in data without labels
- Learning - creating good characterization of data
- Semi-supervised learning - outputs for some inputs but not all
- Grey Area
- Specific rules
- eg anything ending in -ed is a verb
- Variously called semi-supervised, distantly supervised, minimally supervised, or unsupervised
- Give seed set
- eg learning sentiment lexicon; gives positive seeds (good, awesome, great) and negative seeds (bad, awful, dreadful)
- Regression - y is a continuous outcome
- Classification - y is a discrete outcome
- Linear Regression
- N-grams
- Aka bag-of-words
- Versions
- Presence or absence of an N-gram (1 or 0)
- Count of N-gram
- Proportion of total document
- Scaled versions of counts
- POS tags
- Crudely captures syntactic patterns
- Needs preprocessing
- Naïve Bayes
- P(y|x) = P(y) ∏_i P(x_i|y) / P(x)
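- A minimal bag-of-words Naïve Bayes sketch; the add-one smoothing is an extra assumption, not from the notes:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns log P(y) and log P(x_i | y)
    with add-one smoothing (an assumption added for robustness)."""
    label_counts = Counter(y for _, y in docs)
    word_counts = defaultdict(Counter)
    for tokens, y in docs:
        word_counts[y].update(tokens)
    vocab = {w for c in word_counts.values() for w in c}
    log_prior = {y: math.log(n / len(docs)) for y, n in label_counts.items()}
    log_like = {y: {w: math.log((word_counts[y][w] + 1) /
                                (sum(word_counts[y].values()) + len(vocab)))
                    for w in vocab} for y in label_counts}
    return log_prior, log_like

def predict(tokens, log_prior, log_like):
    """argmax_y log P(y) + sum_i log P(x_i | y); P(x) is constant and dropped."""
    scores = {y: log_prior[y] + sum(log_like[y].get(w, 0.0) for w in tokens) for y in log_prior}
    return max(scores, key=scores.get)

docs = [("good great awesome".split(), "+"), ("bad awful dreadful".split(), "-")]
print(predict("awesome movie".split(), *train_nb(docs)))   # '+'
```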
Lecture 6 - 2018/09/20
- Part of speech
- Identifies some grammatical properties of words
- Includes
- Modals & auxiliary verbs - Did the chicken cross the road?
- Conjunctions - and, but
- Particles - look up, turn down
- Penn Treebank (PTB) Tagset
- Open class - words tend to be added (neologism)
- Noun, verb, adjective, etc
- Closed class - new words tend not to be added
- Pronoun, determiners, quantifiers, conjunctions, modals & auxiliary, preposition
- PTB distinguishes between singular and plural nouns, but not transitive and intransitive verbs
Lecture 7 - 2018/09/25
- P(outcome i) = #(outcome i)/#(all events)
- π_i = P(Q_1 = i) = #(Q_1 = i)/#(sentences)
- a_ij = P(Q_{t+1} = j | Q_t = i) = #(i, j)/#(i)
- b_ik = P(O_t = k | Q_t = i) = #(word k, tag i)/#(i)
- Forward algorithm
- Uses DP to avoid unnecessary recalculations
- Create table of all possible state sequences, annotated with probabilities
- Trellis - table for possible state sequences
- Runtime O(N^2 T)
- Backward algorithm
- Trellis with cells β_i(t)
- Unlike α_i(t), it excludes the current word
- Forward & backward
- P(O|θ) = Σ_{i=1..N} α_i(t) β_i(t), for any t = 1 .. T
- Log sum trick - to avoid underflow (from small numbers), work in log domain
- Viterbi algorithm - similar to forward algorithm, but replace summation with max
- Used to find most likely state sequence
- Backpointers - keep track of max entry; work backwards to recover best label sequence
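- A minimal Viterbi sketch (the toy HMM parameters below are invented for illustration); replacing the max with a sum gives the forward algorithm’s α recurrence. It works in the log domain, per the log sum trick above:

```python
import math

def viterbi(obs, states, pi, a, b):
    """Most likely state sequence for an observation list, in the log domain.
    v[t][j] is the best log score of any path ending in state j at time t;
    backpointers record the argmax so the best sequence can be read off backwards."""
    v = [{i: math.log(pi[i]) + math.log(b[i][obs[0]]) for i in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for j in states:
            best = max(states, key=lambda i: v[t - 1][i] + math.log(a[i][j]))
            v[t][j] = v[t - 1][best] + math.log(a[best][j]) + math.log(b[j][obs[t]])
            back[t][j] = best
    last = max(states, key=lambda i: v[-1][i])   # best final state
    path = [last]
    for t in range(len(obs) - 1, 0, -1):         # follow backpointers
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy tagger (parameters are illustrative assumptions, not estimated from data)
states = ["DT", "NN"]
pi = {"DT": 0.8, "NN": 0.2}
a = {"DT": {"DT": 0.1, "NN": 0.9}, "NN": {"DT": 0.4, "NN": 0.6}}
b = {"DT": {"the": 0.9, "dog": 0.1}, "NN": {"the": 0.1, "dog": 0.9}}
print(viterbi(["the", "dog"], states, pi, a, b))   # ['DT', 'NN']
```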
- Unsupervised training
- No state sequences; have to estimate them
- Initialize params randomly
- Viterbi EM (‘Hard EM’)
- Prediction and updates using Viterbi algorithm
- Soft predictions - probabilities of all possible state sequences
Lecture 8 - 2018/09/27
- Chunking - find syntactic chunks in sentence; not hierarchical
- Named-entity recognition (NER) - identify elements in text corresponding to high level categories (eg person, organization, location)
- IOB Tagging - label whether word is inside, outside, or at beginning of span
- Eg for Org, McGill University would be labelled B I, as both compose a single organization
- Generative models - learn joint distributions P(X, Y)
- Can be used for other tasks (eg generation), unlike discriminative models
- Discriminative models - learn conditional distributions P(Y|X)
- Standard HMM - product of probabilities
- Forward algorithm - P(X|θ)
- Viterbi algorithm - argmax_y P(X, y|θ)
- Linear-chain conditional random fields (CRF) - Sums over features & time-steps
- Replaces HMM products by numbers that are not probabilities, but linear combinations of weights & feature values
- Allows word templates
- Forward algorithm - Z(X)
- Viterbi algorithm - argmax_Y P(Y|X, θ)
- To avoid overfitting, make weights close to zero
- Done by adding an extra term (-θ_k/σ^2) for regularization
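- A sketch of the unnormalized LC-CRF score Σ_t Σ_k θ_k f_k(y_{t-1}, y_t, x, t); the feature templates and weights below are hypothetical, and Z(X) would come from running the forward algorithm over these scores:

```python
def crf_score(x, y, weights, features):
    """Sum of weight * feature value over all time steps and feature functions.
    Unlike HMM probabilities, these are arbitrary real-valued scores; exponentiate
    and divide by Z(X) to get P(Y | X, theta)."""
    total = 0.0
    for t in range(len(x)):
        prev = y[t - 1] if t > 0 else "<s>"
        for name, f in features.items():
            total += weights.get(name, 0.0) * f(prev, y[t], x, t)
    return total

# Hypothetical feature templates over words and tag pairs
features = {
    "capitalized->NNP": lambda prev, cur, x, t: float(x[t][0].isupper() and cur == "NNP"),
    "DT->NN": lambda prev, cur, x, t: float(prev == "DT" and cur == "NN"),
}
weights = {"capitalized->NNP": 1.5, "DT->NN": 0.8}
print(crf_score(["The", "dog"], ["DT", "NN"], weights, features))   # 0.8
```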
Lecture 9 - 2018/10/02
- Linear models
- Includes logistic regressions, support vector machines, etc
- Cannot learn complex non-linear functions (eg starts with capital and not at beginning of sentence -> proper noun)
- Artificial neural network
- Automatically learns non-linear functions from input to output
- Feedforward
- All connections flow forward (no loops)
- Each layer of hidden units fully connected to next
- Activation function
- Linear combination of inputs & weights → non-linearity
- Popular choices
- Sigmoid function
- Tanh function
- Rectifier/ramp function: g(x) = max(0, x)
- Need non-linearity, otherwise we are just transforming a linear function into another linear function
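- A tiny numpy sketch of a feedforward pass with a ReLU (weights are arbitrary); removing the non-linearity collapses the two layers into a single linear map:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)            # rectifier/ramp: g(x) = max(0, x)

def feedforward(x, W1, b1, W2, b2):
    """Two fully connected layers; the ReLU between them is what lets the
    network represent non-linear functions of the input."""
    h = relu(W1 @ x + b1)              # hidden layer
    return W2 @ h + b2                 # output layer (scores/logits)

x = np.array([1.0, -2.0])
W1, b1 = np.array([[1.0, 0.5], [-1.0, 2.0]]), np.zeros(2)
W2, b2 = np.array([[0.5, -0.5]]), np.zeros(1)
print(feedforward(x, W1, b1, W2, b2))
```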
- Loss function
- Measures how much error is made during predictions
- y - correct distribution
- ŷ - predicted distribution
- L(y, ŷ) - loss function
- Training
- Typically by stochastic gradient descent
- Back propagation - derive new values from error signal back to inputs
- Recurrent neural network
- Long-term dependencies in language
- Dependencies between words can be arbitrarily far apart
- Eg look up [x] can be written as look [x] up
- Cannot easily model with HMMs or LC-CRFs, but possible with RNNs
Lecture 10 - 2018/10/04
- Syntax
- How words can be arranged to form grammatical sentence
- Invalid sentences are prefixed with *
- Constituency
- Group of words that behave as a unit
- Noun phrases - three people on the bus, Justin Trudeau
- Adjective phrases - blue, very good
- Tests for constituency
- Check if they can appear in similar syntactic environment
- Eg I saw [it, three people on the bus, *on the]
- Can be placed in different positions/replaced as a unit
- Can be used to answer a question
- Grammatical relations
- Subject - tend to come before verbs
- Direct object - the boy kicked the ball
- Indirect object - she gave him a good beating
- Some verbs require different numbers of arguments
- Eg relax (subj), steal (subj, dobj), give (subj, iobj, dobj)
- Formal grammar - rules that generate a set of strings that make up a language
- Constituent tree
- Trees & sentences generated by previous rules
- Entries on the LHS of arrow are non-terminals
- Entries on the RHS of arrows can be terminals or other non-terminals; terminals only ever appear on the RHS
- Context-free grammar (CFG)
- 4-tuple
- N - set of non-terminal symbols
- Σ - set of terminal symbols
- R - set of rules or productions of form A → (Σ ∪ N)*, and A ∈ N
- S - designated start symbol, S ∈ N
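- A minimal sketch representing a toy CFG (an assumption, not the course grammar) as a dict of productions, with a generator that expands non-terminals until only terminals remain:

```python
import random

# Toy context-free grammar: non-terminals (N) map to lists of right-hand sides.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "NN"], ["Justin", "Trudeau"]],
    "VP": [["VB", "NP"], ["VB"]],
    "DT": [["the"]],
    "NN": [["ball"], ["boy"]],
    "VB": [["kicked"], ["relaxed"]],
}

def generate(symbol="S"):
    """Expand a symbol: non-terminals are rewritten by a random production,
    terminals (symbols with no productions) are emitted as-is."""
    if symbol not in GRAMMAR:
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])
    return [word for sym in rhs for word in generate(sym)]

print(" ".join(generate()))
```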
Lecture 11 - 2018/10/09
- Parsing algorithms
- Top-down - start at top of tree, expanding downwards with rewrite rules
- Bottom-up - start from input words and build bigger subtrees until a tree spans the whole sentence
- Eg CYK algorithm, shift-reduce parser
- CYK Algorithm
- DP algorithm - partial solutions stored & efficiently reused
- Steps
- Convert CFG to appropriate form
- Set up table of possible constituents
- Fill in table
- Read table to recover all possible parses
- CNF Rules
- A → B C D becomes A → X1 D and X1 → B C
- A → s B becomes A → X2 B and X2 → s
- A → B - replace all B on LHS with A
- Set up table
- Let sentence have N words, w[0] ... w[N - 1]
- Create table where cell i, j represents the phrase spanning w[i:j + 1]
- Only need half the table since i < j
- Running through table
- Base case - one word phrase - check all non-terminals
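- A minimal CYK recognizer sketch for a grammar already in CNF (toy rules are illustrative); here cell [i][j] of the table holds the non-terminals spanning words i..j inclusive:

```python
def cyk(words, unary, binary):
    """CYK recognition. unary: {terminal: {A, ...}} for rules A -> terminal;
    binary: {(B, C): {A, ...}} for rules A -> B C (grammar assumed in CNF).
    table[i][j] is the set of non-terminals spanning words[i : j + 1]."""
    n = len(words)
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                       # base case: one-word phrases
        table[i][i] = set(unary.get(w, set()))
    for span in range(2, n + 1):                        # longer spans, bottom-up
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                       # split point between the two children
                for B in table[i][k]:
                    for C in table[k + 1][j]:
                        table[i][j] |= binary.get((B, C), set())
    return "S" in table[0][n - 1], table

unary = {"the": {"DT"}, "dog": {"NN"}, "barks": {"VP"}}
binary = {("DT", "NN"): {"NP"}, ("NP", "VP"): {"S"}}
print(cyk("the dog barks".split(), unary, binary)[0])   # True
```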
- PCFG - CFG with probabilities for each rule
- For each nonterminal A ∈ N
- Σ_{(α → β) ∈ R s.t. α = A} Pr(α → β) = 1
- Extend CYK to PCFG by calculating partial probabilities as you fill the table
- Only best probability is included per cell
- Train PCFG with a treebank, such as the WSJ portion of the Penn Treebank
- Simple version: MLE estimate
- Pr(α → β) = #(α → β) / #α
- Not great:
- No context
- Example: NPs in subject and object positions are not identically distributed
- Too sparse
- With the word ‘ate’, the rules for ‘ate quickly’, ‘ate with a fork’, ‘ate a sandwich’, etc are all labelled differently. Should be factorized, eg by having an adverbial modifier
Lecture 12 - 2018/10/11
- Adding context to PCFG
- Append parent categories (eg NP → PRP becomes NP^S → PRP or NP^VP → PRP, depending on the parent)
- Removing context from long chain rules by splitting them into separate rules
- Eg Pr(VP → START AdvP VBD NP PP END) to
- Pr(VP → START AdvP)
- Pr(VP → AdvP VBD)
- Pr(VP → VBD NP)
- Pr(VP → NP PP)
- Pr(VP → PP END)
- Markovization
- Vertical markovization - adding ancestors as context
- Zeroth order - vanilla PCFGs
- First order - one parent only (see above)
- nth order - add n parents
- Horizontal markovization - breaking RHS into parts
- Infinite order - vanilla PCFGs
- First order - pairs (see above)
- Can choose any order with interpolations
- Semantics - study of meaning in language
- Meaning of words - lexical semantics
- Referent - person/thing to which an expression refers
- Extensional definition - all things to which a definition applies, as opposed to intension or comprehension
- Intensional definition - necessary and sufficient conditions
- Multiple words can refer to the same referent. Used in different contexts → different senses.
- Example: ‘Venus’, ‘the morning star’, and ‘the evening star’ all refer to Venus
- Hyponym/hypernym - ISA (is a) relationship
- Monkey & mammal, Montreal & city, wine & beverage, etc. ([hyponym] is a [hypernym])
- Synonym/Antonym - roughly same/different meaning
- Homonymy
- Homophone - same sound (son vs sun)
- Homograph - same written form (lead (noun) vs lead (verb))
- Polysemy - multiple related meanings
- Eg newspaper
- Daily or weekly publication
- Business that publishes newspaper (meaning above)
- Physical object that is the product of a newspaper publisher
- Cheap paper made from wood pulp
- Homonymy is typically with unrelated meaning, whereas polysemy is with related meaning
- Metonymy - substitution of one entity for another related one
- Eg
- Ordering a dish (food, not a physical dish)
- I work for the local paper (place, not object)
- Quebec City is cutting our budget again (government, not location)
- The loonie is at an 11-year low (value, not physical loonie)
- Synecdoche - specific kind involving whole-part relations
- Holonymy/meronymy - whole part relationships
- Holonym → whole, meronym → part
- Computational approach
- Heuristic - eg Lesk’s
- Supervised ML - lots of work
- Unsupervised - eg Yarowsky’s
- Lesk’s
- Given a sentence, construct a bag of words from the definitions of all senses of the context words, compute the overlap between that bag and each sense of the word in question, then select the sense with the highest overlap
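- A sketch of simplified Lesk over WordNet glosses via NLTK (assumes nltk and the wordnet corpus are installed; this variant compares each sense’s gloss with the raw context words rather than with glosses of the context words’ senses, and NLTK also ships its own nltk.wsd.lesk):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def simplified_lesk(word, sentence):
    """Pick the sense of `word` whose dictionary gloss overlaps most with the
    context words of the sentence (a simplified variant of Lesk's algorithm)."""
    context = set(sentence.lower().split()) - {word}
    best, best_overlap = None, -1
    for sense in wn.synsets(word):
        gloss = set(sense.definition().lower().split())
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

sense = simplified_lesk("bank", "I deposited my money at the bank branch")
print(sense, sense.definition())
```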
- Yarowsky’s
- Gather data set with target word to be disambiguated
- Automatically label small seed set of examples
- Repeat for a while
- Train from seed set
- Apply model to entire data set
- Keep highly confident classification outputs to be new seed set
- Use last model as final model
Lecture 13 - 2018/10/16
- Hearst patterns - pairs of terms in hyponym-hypernym relationships tend to occur in certain lexico-syntactic patterns
- Can find relations through key words (cause, induce, stem (from), etc)
- Can find relations through co-occurring words (such as, and, or)
- Can find relations through words that tend to co-occur with target words
- Term-context matrix
- Each row is a vector representation of word through context
- Context can refer to nearest ‘x’ neighbouring words
- Cosine similarity
- sim(A, B) = (A · B)/(||A|| ||B||)
- -1 if vectors point in opposite direction, 0 if orthogonal, 1 if same direction
- Positive (eg count) vectors score between 0 and 1
- Cosine similarity is not enough to determine synonymy
- High cosine similarity can also indicate antonyms, word senses
- Similarity - synonymy, hypernymy, hyponymy
- Relatedness - anything that might be associated (eg good and bad)
- Cosine similarity is really a measure of relatedness
- Rescaling vectors
- Pointwise mutual information (PMI)
- pmi(w1, w2) = log (P(w1, w2)/(P(w1)P(w2)))
- Numerator - probability of both words occurring
- Denominator - probability of each word occurring in general
- If ratio < 1, PMI is negative
- Often disregarded → positive PMI (PPMI)
- Compressing term-context matrix
- Often very sparse, lots of zeros
- Singular value decomposition (SVD)
- X = W × Σ × C^T
- m is rank of matrix X
- Rows of W are new word vectors
- Rows of C (cols of CT) are new context vectors
- Σ is the diagonal matrix of singular values of X (square roots of the eigenvalues of X^T X, from highest to lowest)
- Truncated SVD
- Throw out some singular values (lowest ones) in Σ
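- A numpy sketch tying these steps together on an arbitrary toy term-context matrix: PPMI reweighting, truncated SVD, and cosine similarity between the resulting word vectors:

```python
import numpy as np

def ppmi(counts):
    """Positive PMI: log P(w, c) / (P(w) P(c)), with negative values clipped to 0."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0)

def truncated_svd(X, k):
    """Keep only the k largest singular values; rows of the result are the new word vectors."""
    W, s, Ct = np.linalg.svd(X, full_matrices=False)
    return W[:, :k] * s[:k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy term-context counts: rows = words, columns = context words
counts = np.array([[4., 0., 1.],    # cat
                   [3., 1., 0.],    # dog
                   [0., 5., 2.]])   # car
vecs = truncated_svd(ppmi(counts), k=2)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```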
- Word embeddings
- Skip-gram
- Validation
- Word similarity
- a : b :: c : d - what is d?
- Eg man : king :: woman : ? (man is to king as woman is to queen)
- Solved by assuming the vector differences are equal: v_a - v_b = v_c - v_d, so v_d ≈ v_b - v_a + v_c
- Can download pretrained word2vec embeddings from web
- Advantages - trained, large quantity, cheap, easy to try
- Disadvantages - doesn’t always work
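- For example, with gensim and a downloaded pretrained binary (the file path below is a placeholder), the man : king :: woman : ? analogy can be queried directly:

```python
from gensim.models import KeyedVectors

# Path is a placeholder for wherever the pretrained word2vec binary was saved.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# man : king :: woman : ?  ->  v_king - v_man + v_woman, nearest neighbour by cosine
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```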
Lecture 14 - 2018/10/18
- Compositionality - meaning of phrase depends on meanings of its parts
- Idioms often violate compositionality
- Co-compositionality - meanings of words depend also on other words that they are composed with
- Eg ‘red’ combined with ‘rose’, ‘wine’, or ‘cheeks’ does not combine purely compositionally, but rather co-compositionally
- Semantic inference - making explicit something that is implicit
- I want to visit the capital of Italy; capital of Italy is Rome; ∴ I want to visit Rome
- Montagovian semantics - using logical formalism to represent natural language inferences
- First-order predicate calculus
- Domain of discourse - set of entities we care about
- Variables - potential elements of domain
- Predicates - maps elements of domain to truth values
- Can have different valences (eg take multiple elements)
- Functions - maps elements to other elements
- Can have different valences
- Valence 0 function is a constant
- Logical connectives - ¬, ∧, ∨, →, ↔
- Quantifiers - ∃, ∀
- Interpretation of first order logic (FOL)
- Domain of discourse (D)
- Mapping for functions to elements of D
- Mapping for predicates to True or False
- All students who study and do homework will get an A
- ∀x.(study(x) ∧ hw(x)) → grade(x) = A
- Lambda calculus - describes computation using mathematical functions
- Definitions
- Variables (eg x)
- λx.t, t is a lambda term
- ts, t and s are lambda terms
- Function application (or beta reduction)
- Given (λx.t)s, replace all instances of x in t with expression s
- Left associative - abcd = ((ab)c)d
- (λx.x + y)2 → 2 + y
- (λx.xx)(λx.x) = (λx.x)(λx.x) = λx.x
- Allows for partial computations
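- A rough Python analogy of beta reduction, partial computation, and type-raising (Python evaluates eagerly rather than rewriting terms, so this is only an illustration):

```python
# (λx. x + y) 2  ->  2 + y : with y fixed, application substitutes x
y = 10
f = lambda x: x + y
print(f(2))                      # 12, ie 2 + y

# Left associativity and partial computation: λx.λy. x + y applied one argument at a time
add = lambda x: lambda y: x + y
add2 = add(2)                    # partial computation: still a function
print(add2(3))                   # 5

# Type-raising a constant: λP. P(COMP550) turns an entity into a function over predicates
COMP550 = "COMP550"
type_raised = lambda P: P(COMP550)
print(type_raised(lambda x: f"Course({x})"))
```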
- Syntax-driven semantic composition
- Augment CFG with lambda expressions (syntactic composition = function application)
- Semantic attachments
- Syntactic composition: A → a1 … an
- Semantic attachment: {f(aj.sem, …, ak.sem)}
- Proper noun: PN → COMP550 to {COMP550}
- Type-raised proper noun: PN → COMP550 to {λx.x(COMP550)}
- We create a function taking in the PN as argument
- NP rule: NP → PN to {PN.sem}
- Common noun: N → student to {λx.Student(x)}
- Type ⟨e, t⟩; takes an entity and returns a truth value
- Intransitive verbs: V → rules to {λx.∃e.Rules(e) ∧ Ruler(e, x)}
- Created an event variable e
- Composition: S → NP VP to {NP.sem(VP.sem)}
- Neo-Davidsonian event semantics
- Method 1: multi-place predicate: Rules(x)
- Method 2: Neo-Davidsonian version with event variable: ∃e.Rules(e) ∧ Ruler(e, x)
- Reifying event variable makes things more flexible
- Optional elements such as location and time
- Can add information to event variable
Lecture 15 - 2018/10/23
- Neo-Davidsonian semantics cont.
- Transitive verbs: V → enjoys to {λw.λz.w(λx.∃e.Enjoys(e) ∧ Enjoyer(e, z) ∧ Enjoyee(e, x))}
- Note that for quantifiers, universal quantifiers use →, while existential ones use ∧
- Russell (1905)’s Definite descriptions
- How to represent ‘the’ in FOL?
- In “the student took Comp550”, we must enforce
- An entity who is a student
- At most one thing referred as a student
- Student participates in some predicate (“took Comp550”)
- ∃x.Student(x) ∧ ∀y.(Student(y) → y = x) ∧ took(x, COMP550)
- Det → a to {λP.λQ.∃x. P(x) ∧ Q(x)}
- Det → the to {λP.λQ.∃x.P(x) ∧ ∀y(P(y) → y = x) ∧ Q(x)}
- Scopal ambiguity
- Eg. Every student took a course
- every > a: for all students, there exists a course that was taken by that student
- a > every: there exists a single course that all students took
- Underspecified representation - meaning representation that can embody all possible readings without explicitly enumerating all of them
- Cooper Storage
- Associate a store with each FOL so that each reading can be recovered
- Every student took a course
- ∃e.took(e) ∧ taker(e, s1) ∧ takee(e, s2)
- (λQ.∀x.Student(x) → Q(x), 1)
- (λQ.∃y.Course(y) ∧ Q(y), 2)
- We do not specify the order of each sub expression
- Compositional rules now modify the inside part of a store, and introduce new index variables
Lecture 16 - 2018/10/25
- Coherent - property of discourse that makes sense
- Contains logical structure/meaning
- Incoherent - eg two completely unrelated sentences
- Cohesion - use of linguistic devices to tie together text units
- Lexical cohesion - related words in passage
- Discourse markers - cue words for discourse relations
- Eg also, and, therefore, however
- Referent - actual entity in the world
- Referring expressions - phrases that refer to the referent
- Referring expressions towards the same referent are said to corefer
- Antecedent - thing that exists before or logically precedes another
- Anaphor - points to an antecedent that precedes it
- Cataphor - points to an antecedent that follows it
- Zero anaphora - omitting pronouns in certain contexts
- Languages with this are referred to as pro-drop
- Eg No habl-o español (‘(I) don’t speak Spanish’ - the subject pronoun is dropped)
- Others can be dependent on context
- Bridging reference - reference to entities that can be inferred from previously mentioned entity
- I like my office. The windows are large.
- Non-referential pronouns
- Pleonastic pronouns
- It is raining - ‘it’ does not refer to any entity
- Clefting
- It is he that is Bob - ‘it’ used to focus on some point
- Pronominal anaphora resolution
- Relevant info to identify pronouns include gender & number (eg 3Sg), syntactic information, and recency
- C-command - syntactic relation constraining binding; reflexives must be bound by a subject that c-commands them
- Eg in ‘the students taught themselves’, ‘themselves’ refers to the students; if it were ‘them’, it would refer to some other group
- Hobbs’ algorithm
- Begin at the NP node immediately dominating the pronoun
- Go to the first NP or S above it. Call this node X and the path to it p
- Do a left-to-right breadth-first traversal of all branches below X to the left of p. Propose as antecedent any NP node encountered that has an NP or S between it and X
- If X is the highest S in the sentence, consider the parse trees of previous sentences in recency order, and traverse each in left-to-right breadth-first order. When an NP is encountered, propose it as antecedent. If X is not the highest S, continue to step 5
- From X, go to the first NP or S above it. Call this node X and the path to it p
- If X is an NP and p doesn’t pass through a Nominal that X immediately dominates, propose X as antecedent
- Do a left-to-right breadth-first traversal of all branches below X to the left of p. Propose any NP encountered as antecedent
- If X is an S, do a left-to-right breadth-first traversal of all branches below X to the right of p, but don’t go below any NP or S encountered. Propose any NP encountered as antecedent
- Go to step 4
Lecture 17 - 2018/10/30
- This-anaphora
- May refer to complex chunks of information
- Midterm review