2 - Text Classification

ucla | CS 162 | 2024-01-11 23:26



Classifiers

Rule Based Classifier

  • doesn’t generalize, but works well for specific corpora

    Supervised Classifier

  • takes inputs $x$ from the corpus, each with a class label $y$, and learns from this supervised training data
  • vectorized: $X=[x_1,\dots,x_n]$, $Y=[y_1,\dots,y_n]$
  • a classifier $C: x_i \mapsto y_i$
  • a learner $L: X \times Y \to C$
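A minimal sketch of these two abstractions in Python (the names `Classifier`, `learner`, and the most-frequent-label rule are illustrative, not from the lecture):

```python
from typing import Callable, List

# A classifier C maps a single input x_i to a label y_i.
Classifier = Callable[[str], str]

# A learner L maps a labeled corpus (X, Y) to a classifier C.
def learner(X: List[str], Y: List[str]) -> Classifier:
    # Hypothetical trivial learner: always predict the most frequent training label.
    most_common = max(set(Y), key=Y.count)
    return lambda x: most_common

C = learner(["great movie", "awful plot", "great acting"], ["pos", "neg", "pos"])
print(C("some new review"))  # "pos" — the majority label
```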

    Probabilistic Classifier

  • logits = log probs of $p(y\mid\vec x)$
  • we want our classifier to output $\text{argmax}_y\space p(y\mid\vec x)$
  • i.e. return the label with the highest probability of being correct; the goal is a model that does this correctly for as many inputs as possible (see the sketch below)
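A tiny sketch of the argmax decision rule, assuming some model has already produced $p(y\mid\vec x)$ for one input (the probabilities below are made up):

```python
# Hypothetical posterior p(y | x) for one input, produced by some upstream model.
probs = {"positive": 0.7, "negative": 0.2, "neutral": 0.1}

# Return the label with the highest conditional probability.
prediction = max(probs, key=probs.get)
print(prediction)  # "positive"
```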

    Creating the model

  • the distribution is a multinomial distribution because of the many classes
  • we want to find a direct MLE (maximum likelihood estimation) https://blog.jakuba.net/maximum-likelihood-for-multinomial-distribution/
  • i.e. the direct MLE is just the empirical frequency of the input in the corpus, so it’s a terrible estimator here because the space of possible inputs is exponentially large
  • so we make model assumptions
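A small sketch of why the direct MLE is hopeless: it just counts whole inputs, so anything not seen verbatim gets probability 0 (the tiny corpus here is made up):

```python
from collections import Counter

# Hypothetical labeled corpus of (sentence, label) pairs.
corpus = [("filled with horrific scenes", "neg"),
          ("a delightful film", "pos"),
          ("a delightful film", "pos")]

counts = Counter(corpus)

# Direct MLE: the probability of a (sentence, label) pair is its empirical frequency.
def mle(sentence, label):
    return counts[(sentence, label)] / len(corpus)

print(mle("a delightful film", "pos"))              # 0.666... — seen twice out of three
print(mle("filled with delightful scenes", "pos"))  # 0.0 — never seen verbatim
```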

    Bag of Words (BoW) Assumption

  • assume the order of words does not matter, i.e. $$p(y\mid\text{“Filled with horrific”})=p(y\mid\text{“with”, “horrific”, “filled”})$$
  • this treats the input as a set of words instead of a sequence
  • this simplifies the problem, but it throws away positional context
  • even so, this is still not enough: the input is now a set, but the likelihood of that exact set occurring in the corpus is still incredibly low
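A minimal bag-of-words sketch; tokenization here is just lowercasing and whitespace splitting for illustration:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    # Lowercase and split on whitespace; the Counter discards word order.
    return Counter(sentence.lower().split())

a = bag_of_words("Filled with horrific")
b = bag_of_words("horrific filled with")
print(a)       # Counter({'filled': 1, 'with': 1, 'horrific': 1})
print(a == b)  # True — order no longer matters under the BoW assumption
```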

    Naive Bayes Assumption

  • words are independent conditioned on their class
  • i.e. the conditional independence formula $$p(\text{“Filled”, “with”, “horrific”}\mid y)=\prod_{x_i\in \vec x} p(x_i\mid y)$$ (multiplying by the prior $p(y)$ gives the joint $p(\vec x, y)$)
  • this reduces the parameter space from $O(V^n)$ (whole word sequences) down to per-word probabilities over the $V$ unique words of the vocabulary
  • now $p(y\mid\text{“filled”, “horrific”, “with”})$ can be found with Bayes’ rule, since the above assumption lets us estimate each word’s probability separately
  • both assumptions bring us to

    Final Model - MAP prediction

  • the simplification brings us to the MAP prediction: $$\hat{y}=\text{arg}\max_y\space p(y)\prod_j p(x_j\mid y)$$
  • the prior $p(y)$ is the frequency of the label (from the multinomial distribution over labels)
  • the likelihood can be found by counting the occurrences of $X=x$ when $Y=y$ and dividing by the overall number of times $Y=y$ (see the counting sketch below)
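A sketch of these counting estimates on a made-up labeled corpus; the prior is the label frequency and the likelihood is a per-label word frequency:

```python
from collections import Counter, defaultdict

# Hypothetical labeled corpus.
docs = [("filled with horrific scenes", "neg"),
        ("a delightful film", "pos"),
        ("horrific acting", "neg")]

label_counts = Counter(y for _, y in docs)
word_counts = defaultdict(Counter)   # word_counts[y][w] = count of word w in docs labeled y
for text, y in docs:
    word_counts[y].update(text.lower().split())

prior = {y: c / len(docs) for y, c in label_counts.items()}                 # p(y)
likelihood = {y: {w: c / sum(wc.values()) for w, c in wc.items()}           # p(w | y)
              for y, wc in word_counts.items()}

print(prior["neg"])                   # 0.666... — 2 of 3 documents are labeled "neg"
print(likelihood["neg"]["horrific"])  # 0.333... — 2 of the 6 "neg" word tokens are "horrific"
```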

    Complexity

  • initially we would need $k-1$ parameters for $k$ classes and $V^n-1$ parameters for vocab size $V$; the last class/word probability is determined by the others (they must sum to 1)
  • now we have only $k(V-1)$ likelihood parameters overall (plus the $k-1$ prior parameters)
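  • as an illustrative (made-up) example: with $k=2$ classes, vocabulary size $V=10{,}000$, and inputs of length $n=20$, modeling full sequences directly would need on the order of $V^n=10^{80}$ parameters, while the Naive Bayes factorization needs only about $k(V-1)\approx 20{,}000$ likelihood parameters (plus $k-1=1$ for the prior)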

    Naive Bayes Classifier

  • using all of these assumptions we can make the following classifier
  • assume features (words in sentences) are conditionally independent given the label y
  • to predict we need the prior $p(y)$ and the likelihood $p(x_j\mid y)$ for each $x_j$; then we have the prediction (hypothesis): $$h_{NB}(\vec x)=\text{arg}\max_y\space P(y)\prod_j P(x_j\mid y)$$
  • where each term in the product is the smoothed likelihood estimate $P(x_j\mid y)=\frac{\text{count}(x_j,\,y)+1}{\sum_{w\in V}\text{count}(w,\,y)+|V|}$ (note: we add smoothing to handle 0 counts)

    Learner for NB classifier

    Prediction

  • for prediction we only need the likelihood terms for the tokens/words that actually appear in the test input; the rest of the vocabulary can be dropped
  • we can also apply smoothing (a form of regularization) by adding $+1$ to the numerator of each likelihood and $+|V|$ to the denominator
  • for the likelihood denominators, we count the total number of word tokens that occur across all documents with that label (see the sketch below)
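A compact sketch of the whole Naive Bayes learner and predictor with add-one smoothing (the corpus and whitespace tokenization are made up for illustration):

```python
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label) pairs. Returns priors, per-label word counts, vocab."""
    label_counts = Counter(y for _, y in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, y in docs:
        tokens = text.lower().split()
        word_counts[y].update(tokens)
        vocab.update(tokens)
    priors = {y: c / len(docs) for y, c in label_counts.items()}
    return priors, word_counts, vocab

def predict_nb(text, priors, word_counts, vocab):
    scores = {}
    for y in priors:
        total_y = sum(word_counts[y].values())   # total word tokens seen with label y
        score = priors[y]
        for w in text.lower().split():
            if w not in vocab:                   # ignore words never seen in training
                continue
            # Add-one (Laplace) smoothing: +1 to the count, +|V| to the denominator.
            score *= (word_counts[y][w] + 1) / (total_y + len(vocab))
        scores[y] = score
    return max(scores, key=scores.get)

docs = [("filled with horrific scenes", "neg"),
        ("a delightful film", "pos"),
        ("horrific acting", "neg")]
model = train_nb(docs)
print(predict_nb("horrific film", *model))   # "neg"
```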

Practicalities

Log Probs

  • with a large vocab size, multiplying many small per-word probabilities will cause floating-point underflow
  • instead work with log probabilities (logits), turning the product over words into a sum of log probs
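A tiny sketch of the underflow problem and the log-space fix (the per-word probabilities are made up):

```python
import math

# Many small per-word likelihoods, as in a long document with a large vocabulary.
word_probs = [1e-5] * 400

product = 1.0
for p in word_probs:
    product *= p
print(product)      # 0.0 — the product underflows to zero in floating point

log_score = sum(math.log(p) for p in word_probs)
print(log_score)    # ≈ -4605.2 — still a finite, comparable score
```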

    Generative vs Discriminative Models

  • Generative models learn a joint probability distribution p(x,y)
  • Discriminative models learn a conditional probability distribution $p(y\mid\vec x)$
  • naive bayes learns $p(\vec x\mid y)\propto p(\vec x,y)$, so it’s really generative
  • so to model $p(y\mid\vec x)$ directly instead of going through our current naive bayes model, we can assume a scoring function for a text instance and its label: $\vec w^T\Phi(\vec x,y)$
    • here $\vec w$ is the weight vector
    • the feature function $\Phi(\vec x,y)$ returns a vector of features; the resulting score is not a probability
    • the dot product of the two gives a scalar score
    • we can then exponentiate the score to make its range $(0,\infty)$ while keeping it monotone (nondecreasing)
    • then we can model it as a softmax (i.e. a multiclass logistic regressor)
  • now we can treat learning as probability maximization: pick the weights that make the correct labels most probable (see the sketch below)
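A sketch of this scoring-function view with a hypothetical feature function and hand-set weights; softmax over the per-label scores gives $p(y\mid\vec x)$:

```python
import math

LABELS = ["pos", "neg"]

def phi(x: str, y: str) -> dict:
    # Hypothetical joint feature function Phi(x, y): one indicator feature per (word, label) pair.
    return {(w, y): 1.0 for w in x.lower().split()}

def score(weights: dict, x: str, y: str) -> float:
    # w^T Phi(x, y): dot product of the weight vector with the feature vector.
    return sum(weights.get(f, 0.0) * v for f, v in phi(x, y).items())

def p_y_given_x(weights: dict, x: str) -> dict:
    # Softmax: exponentiate the scores to make them positive, then normalize over labels.
    exp_scores = {y: math.exp(score(weights, x, y)) for y in LABELS}
    z = sum(exp_scores.values())
    return {y: s / z for y, s in exp_scores.items()}

# Hand-set illustrative weights.
w = {("delightful", "pos"): 2.0, ("horrific", "neg"): 2.0}
print(p_y_given_x(w, "a delightful film"))   # puts most of the mass on "pos"
```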

    Feature Engineering

  • now given the feature function we need to do some feature selection for $\Phi$
  • n-grams for sequence information
  • punctuation, text length, etc.
  • then we use MLE to find the best weights
  • this is equivalent to minimizing the negative log likelihood of the training data
  • now we use gradient descent
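A sketch of a hand-built feature function combining a few of these ideas (all feature names are made up):

```python
def features(x: str) -> dict:
    tokens = x.lower().split()
    feats = {}
    # Unigram and bigram (n-gram) features recover some sequence information.
    for w in tokens:
        feats[("unigram", w)] = feats.get(("unigram", w), 0.0) + 1.0
    for a, b in zip(tokens, tokens[1:]):
        feats[("bigram", a, b)] = feats.get(("bigram", a, b), 0.0) + 1.0
    # Hand-crafted extras: punctuation and text length.
    feats[("has_exclamation",)] = float("!" in x)
    feats[("length",)] = float(len(tokens))
    return feats

print(features("What a delightful film!"))
```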

    Log Reg Gradients

  • the gradient of the negative log likelihood is: $$\nabla_{\vec w}\,\text{NLL}=\sum_i\Big(\sum_y \frac{e^{\vec w^T\Phi(\vec x_i,\,y)}}{Z_i}\,\Phi(\vec x_i,\,y)\;-\;\Phi(\vec x_i,\,y_i)\Big)$$
  • where $Z_i=\sum_{y'}e^{\vec w^T\Phi(\vec x_i,\,y')}$ is the sum over the classes, i.e. the normalizer from the 2nd part of the loss
  • and the weight update is: $$\vec w \leftarrow \vec w - \eta\,\nabla_{\vec w}\,\text{NLL}$$
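A sketch of this gradient and update in code, reusing the same hypothetical feature/softmax setup as above (a real implementation would vectorize this and loop over a full training set):

```python
import math

LABELS = ["pos", "neg"]

def phi(x, y):
    # Hypothetical joint feature function: one indicator feature per (word, label) pair.
    return {(w, y): 1.0 for w in x.lower().split()}

def softmax_probs(w, x):
    scores = {y: sum(w.get(f, 0.0) * v for f, v in phi(x, y).items()) for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())          # Z: the sum over the classes
    return {y: math.exp(s) / z for y, s in scores.items()}

def sgd_step(w, x, gold_y, lr=0.1):
    probs = softmax_probs(w, x)
    # Gradient of the NLL for one example: expected features minus observed features.
    grad = {}
    for y in LABELS:
        for f, v in phi(x, y).items():
            grad[f] = grad.get(f, 0.0) + probs[y] * v
    for f, v in phi(x, gold_y).items():
        grad[f] = grad.get(f, 0.0) - v
    # Weight update: step against the gradient to decrease the loss.
    for f, g in grad.items():
        w[f] = w.get(f, 0.0) - lr * g
    return w

w = {}
for _ in range(50):
    w = sgd_step(w, "a delightful film", "pos")
print(softmax_probs(w, "a delightful film"))   # p("pos") is now close to 1
```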

    Differences