7 - Log Linear and Neural LMs

ucla | CS 162 | 2024-02-06 12:24



Log-Linear Language Models

  • we want to create a conditional distribution $p(y \mid x)$ from a scoring function, which we define as $\text{score}(x, y) = \sum_k \theta_k f_k(x, y) = \vec\theta \cdot \vec f(x, y)$, where $\theta_k$ is the weight of feature $k$; the feature functions can be many representations (e.g., counts, binary indicators, strengths)
  • we turn the scores into a conditional probability distribution, parameterized by the weights, via a softmax: $$p_{\vec\theta}(y \mid x) = \frac{\exp(\vec\theta \cdot \vec f(x, y))}{\sum_{y'} \exp(\vec\theta \cdot \vec f(x, y'))}$$ (see the sketch below)
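A minimal sketch of this model in Python; the feature function `feat_fn`, the candidate set, and the example weights are hypothetical toy choices, not from the notes:

```python
import math
from collections import defaultdict

def score(theta, features):
    """Linear score: sum_k theta_k * f_k(x, y)."""
    return sum(theta[k] * v for k, v in features.items())

def p_y_given_x(theta, x, candidates, feat_fn):
    """Softmax over candidate outputs y for a fixed input x."""
    scores = {y: score(theta, feat_fn(x, y)) for y in candidates}
    m = max(scores.values())                       # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())                         # partition function Z(x)
    return {y: e / z for y, e in exps.items()}

# toy example: a single bigram-indicator feature (hypothetical feat_fn)
def feat_fn(x, y):
    return {f"bigram:{x}_{y}": 1.0}

theta = defaultdict(float, {"bigram:the_cat": 1.5, "bigram:the_dog": 0.5})
print(p_y_given_x(theta, "the", ["cat", "dog"], feat_fn))
```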

    Training

  • given $n$ training instances and feature functions $f_1, f_2, \dots$, we maximize the log probabilities of the training instances under the weights: $$\sum_{i=1}^n \log\space p_{\vec\theta}(y_i \mid x_i)$$
    • the objective is originally the joint probability of all training instances (a product over instances), but it is easier to maximize the sum of logs (see the equation below)
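Written out, since the log is monotone, maximizing the product of conditional probabilities is equivalent to maximizing the sum of their logs:

$$\hat{\vec\theta} = \arg\max_{\vec\theta}\prod_{i=1}^n p_{\vec\theta}(y_i \mid x_i) = \arg\max_{\vec\theta}\sum_{i=1}^n \log\space p_{\vec\theta}(y_i \mid x_i)$$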

      Gradient Descent

  • we improve the weights by taking gradient steps on the log-likelihood so that the model's distribution approaches the empirical distribution of the training data; for a log-linear model, the gradient is the observed feature values minus the expected feature values under the model (see the sketch below)
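A minimal sketch of one gradient-ascent step on this objective, reusing the hypothetical `p_y_given_x` and `feat_fn` helpers from the sketch above:

```python
from collections import defaultdict

def gradient_step(theta, data, candidates, feat_fn, lr=0.1):
    """One batch gradient-ascent step on the log-likelihood of a log-linear model.

    d/d theta_k  log p(y | x)  =  f_k(x, y)  -  E_{y' ~ p_theta(. | x)}[ f_k(x, y') ]
    """
    grad = defaultdict(float)
    for x, y in data:
        # observed feature values for the gold output y
        for k, v in feat_fn(x, y).items():
            grad[k] += v
        # expected feature values under the current model
        for y_prime, p in p_y_given_x(theta, x, candidates, feat_fn).items():
            for k, v in feat_fn(x, y_prime).items():
                grad[k] -= p * v
    for k, g in grad.items():
        theta[k] += lr * g   # ascend the log-likelihood
    return theta
```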

    Cross Entropy

  • minimizing cross-entropy is the same as minimizing the negative log-likelihood of our model (see below)
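Spelled out, the cross-entropy of the model against the training data is the average negative log-likelihood, so minimizing one minimizes the other:

$$H(\text{data}, p_{\vec\theta}) = -\frac{1}{n}\sum_{i=1}^n \log\space p_{\vec\theta}(y_i \mid x_i)$$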

    Generalization and OoD (unseen) samples

  • Neural LM

    NN Review

  • idea for a NN for LMs: a feed-forward neural network (FFNN, e.g., an MLP); the training loop has three steps (see the sketch after this list)
  • forward pass
  • backprop
  • weight update
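A minimal sketch of these three steps for a feed-forward neural LM in PyTorch; the sizes, the one-hidden-layer architecture, and the dummy batch are hypothetical:

```python
import torch
import torch.nn as nn

VOCAB, EMB, HID, CONTEXT = 10_000, 64, 128, 3        # hypothetical sizes

class FFNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)          # word embeddings (learned jointly)
        self.hidden = nn.Linear(CONTEXT * EMB, HID)
        self.out = nn.Linear(HID, VOCAB)             # logits over the vocabulary

    def forward(self, ctx):                          # ctx: (batch, CONTEXT) token ids
        e = self.emb(ctx).view(ctx.size(0), -1)      # concatenate context embeddings
        return self.out(torch.tanh(self.hidden(e)))  # unnormalized scores (logits)

model = FFNNLM()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                      # log-softmax + negative log-likelihood

ctx = torch.randint(0, VOCAB, (32, CONTEXT))         # dummy batch of context windows
target = torch.randint(0, VOCAB, (32,))              # dummy next-token targets

logits = model(ctx)              # forward pass
loss = loss_fn(logits, target)   # cross-entropy (negative log-likelihood)
loss.backward()                  # backprop: gradients of the loss w.r.t. all weights
opt.step()                       # weight update
opt.zero_grad()
```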

Word Embeddings

  • map tokens to dense, low-dimensional vectors that the LM uses to define probability distributions over the vocabulary
  • these vector representations allow similarity comparisons and analogies (see the sketch after this list)
  • we construct LMs so that the model and the representations are learned jointly, i.e., the embeddings are updated along with the other weights
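A small sketch of similarity comparison and analogies via cosine similarity; the tiny vocabulary and the random embedding matrix below are hypothetical stand-ins for trained embeddings:

```python
import torch
import torch.nn.functional as F

# hypothetical stand-ins: a tiny vocabulary and a random "trained" embedding matrix
vocab = ["king", "queen", "man", "woman", "apple"]
emb = torch.randn(len(vocab), 64)
idx = {w: i for i, w in enumerate(vocab)}

def most_similar(vec, k=3):
    """Return the k words whose embeddings have the highest cosine similarity to vec."""
    sims = F.cosine_similarity(vec.unsqueeze(0), emb)      # one score per vocabulary row
    return [(vocab[i], round(sims[i].item(), 3)) for i in sims.topk(k).indices.tolist()]

# similarity comparison: nearest neighbours of "king"
print(most_similar(emb[idx["king"]]))

# analogy: king - man + woman ≈ queen (only meaningful with actually trained embeddings)
print(most_similar(emb[idx["king"]] - emb[idx["man"]] + emb[idx["woman"]]))
```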

    Objective Function

  • the likelihood of the next token is given by a softmax over the vocabulary; we want to maximize this (log-)likelihood
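In symbols (using generic notation not in the notes: $\vec h_t$ is the network's hidden representation of the context and $W$ the output layer), the next-token likelihood and the training objective are:

$$p_\theta(w_t \mid w_{<t}) = \mathrm{softmax}(W \vec h_t)_{w_t}, \qquad \max_\theta \sum_t \log\space p_\theta(w_t \mid w_{<t})$$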