04 - Logistic Regression - lec. 5,6

ucla | CS M146 | 2023-04-17T15:31


Table of Contents

Supplemental

  • any event $E\in\mathcal E$ s.t. $0\le P(E)\le 1$
  • sum of probs $1=\sum_{E\in\mathcal E}P(E)$
  • logistic regression is a classification model
  • $\log$ is always base $e$ i.e. $\log\implies\ln$
  • loss function in classification is binary or softmax cross-entropy loss

Lecture

Classification using Probability

  • instead of predicting the class directly, predict the probability that an instance belongs to that class i.e. $P(y\mid\bm x)$
  • binary classification: $y\in\{0,1\}$ as events for an input $\bm x$

Logistic Regression

Logistic (Sigmoid) Regression Model/Func

  • hypothesis function is the probability in $[0,1]$ i.e. $P_{\bm\theta}(y=1\mid\bm x)$

$h_{\bm\theta}(\bm x)=g\big(\bm\theta^T\bm x\big)\quad \text{s.t.}\quad g(z)=\frac{1}{1+e^{-z}}$

$h_{\bm\theta}(\bm x)={1}\bigg /\bigg[{1+e^{-\bm\theta^T\bm x}}\bigg]$
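A minimal sketch of the hypothesis function in numpy (the function names `sigmoid` and `hypothesis` are my own, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x) for 1-D arrays theta and x."""
    return sigmoid(theta @ x)
```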

Interpreting Hypothesis function

  • hypothesis function gives the probability that the label is 1 given the input, i.e. $h_{\bm\theta}(\bm x)=P(y=1\mid\bm x;\bm\theta)$
  • logistic regression assumes the log odds is a linear function of $\bm x$

    $\log\frac{P(y=1\mid\bm x;\bm\theta)}{P(y=0\mid\bm x;\bm\theta)}=\bm\theta^T\bm x$
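This log-odds property follows directly from the sigmoid hypothesis; a short derivation:

$P(y=1\mid\bm x;\bm\theta)=\frac{1}{1+e^{-\bm\theta^T\bm x}},\qquad P(y=0\mid\bm x;\bm\theta)=\frac{e^{-\bm\theta^T\bm x}}{1+e^{-\bm\theta^T\bm x}}$

$\log\frac{P(y=1\mid\bm x;\bm\theta)}{P(y=0\mid\bm x;\bm\theta)}=\log\frac{1}{e^{-\bm\theta^T\bm x}}=\bm\theta^T\bm x$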

Non-Linear Decision Boundary

  • we can apply a basis function expansion to features just like we did for linear regression
  • NOTE: Loss functions don’t need to be averaged because minimization via gradient descent will work the same regardless
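As an illustration, a hypothetical quadratic basis expansion for two features (my own example, not from the lecture), which lets the linear decision boundary in the expanded space be non-linear in the original space:

```python
import numpy as np

def quadratic_features(x1, x2):
    """Map (x1, x2) to a quadratic feature vector phi(x),
    including a constant bias term as the first entry."""
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])
```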

Loss Function

  • loss of a single instance

$\ell(y^{(i)},\bm x^{(i)},\bm\theta)=\begin{cases}-\log \big(h_{\bm\theta}(\bm x^{(i)})\big) & y^{(i)}=1\\-\log \big(1-h_{\bm\theta}(\bm x^{(i)})\big) & y^{(i)}=0\end{cases}$

  • logistic regression loss

$J(\bm\theta)=\sum_{i=1}^n\ell(y^{(i)},\bm x^{(i)},\bm\theta)$

$J(\bm\theta)=-\sum_{i=1}^n\bigg[y^{(i)}\log h_{\bm\theta}(\bm x^{(i)})+\big(1-y^{(i)}\big)\log\big(1-h_{\bm\theta}(\bm x^{(i)})\big)\bigg]$
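The combined cross-entropy loss above can be sketched in numpy as follows (function name `bce_loss` is my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(theta, X, y):
    """Cross-entropy loss J(theta), summed over instances.

    X: (n, d) design matrix, y: (n,) labels in {0, 1}.
    """
    h = sigmoid(X @ theta)  # h_theta(x^(i)) for every row
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```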

Intuition behind loss

  • the loss is non-linear: confidently wrong predictions incur much higher loss than slightly wrong ones

Regularized Loss Function

  • Given the loss function

$J_{\text{reg}}(\bm\theta)=J(\bm\theta)+\frac\lambda2\big\|\bm\theta_{1:d}\big\|_2^2$

  • note the L2 norm is from index 1 to $d$
  • we don’t regularize the bias term $\theta_0$
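A small sketch of the penalty term that skips the bias (function name `l2_penalty` is my own):

```python
import numpy as np

def l2_penalty(theta, lam):
    """(lambda / 2) * ||theta_{1:d}||_2^2, excluding bias theta_0."""
    return 0.5 * lam * np.sum(theta[1:] ** 2)
```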

Gradient Descent

  • weight updates (simultaneous) - same form as for linear regression and the perceptron

$\theta_j\leftarrow\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\bm\theta)$
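One simultaneous update step can be sketched as follows, assuming the unregularized cross-entropy loss, whose gradient is $\frac{\partial J}{\partial\theta_j}=\sum_i\big(h_{\bm\theta}(\bm x^{(i)})-y^{(i)}\big)x_j^{(i)}$ (the function name `gd_step` is my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_step(theta, X, y, alpha):
    """One simultaneous gradient-descent update on all theta_j.

    grad_j = sum_i (h_theta(x^(i)) - y^(i)) * x_j^(i)
    """
    grad = X.T @ (sigmoid(X @ theta) - y)
    return theta - alpha * grad
```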

Multi-Class Classification

Discussion

Resources


📌

**SUMMARY**