04 - Logistic Regression - lec. 5,6

ucla | CS M146 | 2023-04-17T15:31


Supplemental

  • any event $E$ s.t. $0 \le P(E) \le 1$
  • sum of probabilities: $1 = \sum_{E} P(E)$
  • logistic regression is a classification model
  • $\log$ is always base $e$, i.e. $\log \equiv \ln$
  • the loss function in classification is binary or softmax cross-entropy loss

Lecture

Classification using Probability

  • instead of predicting the class, predict the probability that the instance belongs to that class, i.e. $P(y \mid \bm x)$
  • binary classification: $y \in \{0, 1\}$ as events for an input $\bm x$

Logistic Regression

Logistic (Sigmoid) Regression Model/Func

  • hypothesis function is the probability in $[0,1]$ i.e. $P_{\bm\theta}(y=1 \mid \bm x)$

$h_{\bm\theta}(\bm x) = g(\bm\theta^T \bm x) \quad \text{s.t.} \quad g(z) = \frac{1}{1 + e^{-z}}$

$h_{\bm\theta}(\bm x) = \frac{1}{1 + e^{-\bm\theta^T \bm x}}$
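
A minimal NumPy sketch of the hypothesis function under the definitions above (the function names and example values are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}); output always lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x): the modeled probability that y = 1 given x."""
    return sigmoid(theta @ x)

# illustrative values; x[0] = 1 acts as the bias feature
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 0.75])
print(hypothesis(theta, x))  # ~0.62, interpreted as P(y = 1 | x)
```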

Interpreting Hypothesis function

  • the hypothesis function gives the probability that the label is 1 given some input
  • logistic regression assumes the log odds are a linear function of $\bm x$

    $\log\frac{P(y=1 \mid \bm x;\bm\theta)}{P(y=0 \mid \bm x;\bm\theta)}=\bm\theta^T\bm x$
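
A quick check of this claim (an added derivation, not from the original notes): substituting $h_{\bm\theta}(\bm x)$ for $P(y=1 \mid \bm x;\bm\theta)$,

$\frac{P(y=1 \mid \bm x;\bm\theta)}{P(y=0 \mid \bm x;\bm\theta)} = \frac{h_{\bm\theta}(\bm x)}{1 - h_{\bm\theta}(\bm x)} = \frac{1/(1+e^{-\bm\theta^T\bm x})}{e^{-\bm\theta^T\bm x}/(1+e^{-\bm\theta^T\bm x})} = e^{\bm\theta^T\bm x}$

so taking the log of both sides gives exactly the linear form above.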

Non-Linear Decision Boundary

  • we can apply a basis function expansion to the features, just like we did for linear regression (see the sketch after this list)
  • NOTE: loss functions don’t need to be averaged, because dividing by the constant $n$ doesn’t change the minimizer, so minimization via gradient descent works the same (up to rescaling the learning rate)
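
A minimal sketch of a quadratic basis expansion for two features, which lets the linear boundary $\bm\theta^T\phi(\bm x)=0$ look non-linear (e.g. elliptical) in the original feature space; the feature ordering and helper name are my own choices, not from the lecture:

```python
import numpy as np

def quadratic_features(x1, x2):
    """Map (x1, x2) to the quadratic basis phi(x) = [1, x1, x2, x1^2, x2^2, x1*x2]."""
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

# logistic regression is then run on phi(x) instead of x,
# so the decision boundary theta^T phi(x) = 0 can curve in (x1, x2) space
print(quadratic_features(0.5, -1.2))
```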

Loss Function

  • loss of a single instance

$\ell(y^{(i)},\bm x^{(i)},\bm\theta)=\begin{cases}-\log \big(h_{\bm\theta}(\bm x^{(i)})\big) & y^{(i)}=1\\ -\log \big(1-h_{\bm\theta}(\bm x^{(i)})\big) & y^{(i)}=0\end{cases}$

  • logistic regression loss

$J(\bm\theta)=\sum_{i=1}^{n}\ell(y^{(i)},\bm x^{(i)},\bm\theta)$

$J(\bm\theta)=-\sum_{i=1}^{n}\Big[y^{(i)}\log h_{\bm\theta}(\bm x^{(i)})+(1-y^{(i)})\log\big(1-h_{\bm\theta}(\bm x^{(i)})\big)\Big]$
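
A minimal NumPy sketch of this summed cross-entropy loss (the clipping constant is an added numerical-stability assumption, not part of the lecture formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(theta, X, y, eps=1e-12):
    """J(theta) = -sum_i [ y_i * log h(x_i) + (1 - y_i) * log(1 - h(x_i)) ].

    X: (n, d+1) design matrix with a leading column of ones; y: (n,) labels in {0, 1}.
    """
    h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)  # avoid log(0)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))
```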

Intuition behind loss

  • the loss is non-linear in the predicted probability: confidently wrong predictions incur much higher loss than mildly wrong ones, e.g. with $y=1$, predicting $h_{\bm\theta}(\bm x)=0.9$ costs $-\log 0.9 \approx 0.11$, while predicting $0.1$ costs $-\log 0.1 \approx 2.3$

Regularized Loss Function

  • Given the loss function $J(\bm\theta)$ above, add an L2 penalty (sketched in code after this list)

$J_{\text{reg}}(\bm\theta)=J(\bm\theta)+\frac{\lambda}{2}\|\bm\theta_{1:d}\|_2^2$

  • note the L2 norm is from index 1 to d
  • we don’t regularize the bias term $\theta_0$
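
A minimal sketch of the regularized loss, reusing `np` and the `logistic_loss` helper from the earlier sketch and assuming `theta[0]` is the bias:

```python
def regularized_loss(theta, X, y, lam):
    """J_reg(theta) = J(theta) + (lambda / 2) * ||theta_{1:d}||_2^2 (bias excluded)."""
    return logistic_loss(theta, X, y) + 0.5 * lam * np.sum(theta[1:] ** 2)
```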

Gradient Descent

  • weight updates (simultaneous), similar to linear regression and the perceptron; a training-loop sketch follows the update rule

$\theta_j \leftarrow \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\bm\theta)$
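
A minimal training-loop sketch using the standard logistic-regression gradient $\nabla_{\bm\theta}J(\bm\theta)=\sum_i\big(h_{\bm\theta}(\bm x^{(i)})-y^{(i)}\big)\bm x^{(i)}$; the learning rate, iteration count, and optional L2 term are assumptions, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.01, lam=0.0, num_iters=1000):
    """Simultaneous updates theta_j <- theta_j - alpha * dJ/dtheta_j.

    X: (n, d+1) design matrix with a leading column of ones; y: (n,) labels in {0, 1}.
    lam > 0 adds the L2 penalty; the bias theta[0] is not regularized.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)            # predicted probabilities
        grad = X.T @ (h - y)              # gradient of the unregularized J(theta)
        grad[1:] += lam * theta[1:]       # gradient of (lambda/2) * ||theta_{1:d}||^2
        theta = theta - alpha * grad      # update all coordinates simultaneously
    return theta
```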

Multi-Class Classification

Discussion

Resources


📌 **SUMMARY**