01 - Linear Regression - lec. 1,2,3

ucla | CS M146 | 2023-04-07T12:06


Definitions

  • hypothesis function

    $h_\theta(\vec x)\in\mathcal H$

    A function, parametrized by weights $\vec\theta$, that maps input features ($\vec x^{(i)}$) to predicted labels ($y^{(i)}$); it belongs to a “hypothesis class” $\mathcal H$, the set of all functions of the same form

    E.g. Linear regression hypothesis function:

    $h_\theta(\vec x)=\theta_0+\theta_1 x_1 + … + \theta_d x_d=\vec\theta^T\vec x$

  • convex function

    • A real-valued function is convex if the line segment joining any two points on its graph lies on or above the graph

Lecture

  • Recall from lec. 1: an ML model takes features and labels as inputs and returns a hypothesis function.

Linear Regression Properties

Hypothesis function - linear w.r.t. weights $\theta$

  • Parametrized

$h_{\vec{\theta}}(\vec x)=\theta_0+\theta_1 x_1 + … + \theta_d x_d\quad\vec\theta\in\mathbb R^{d+1}$

  • Vectorized ($x_0=1$)

$h_{\vec{\theta}}(\vec{x})=\sum_{j=0}^d \theta_jx_j=\vec{\theta}^T\vec{x}$

  • $\theta_0$ is the bias parameter and $\theta_{1:d}$ are the weights
  • Visualized with a graph for multiple dimensions
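
A tiny NumPy sketch of the vectorized hypothesis (function and variable names are my own; uses the $x_0=1$ bias convention from above):

```python
import numpy as np

def h(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x, with x[0] == 1 as the bias term."""
    return theta @ x

theta = np.array([0.5, 2.0, -1.0])  # [theta_0 (bias), theta_1, theta_2]
x = np.array([1.0, 3.0, 4.0])       # x_0 = 1 prepended to the features
print(h(theta, x))                  # 0.5 + 2*3 - 1*4 = 2.5
```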

Loss Function - Least Squares Loss

  • Intuitively, suppose $d=1,\ \theta_0=0\implies x\in\mathbb R,\ \vec\theta=[0,\theta_1]$, so the loss below reduces to a one-variable function of $\theta_1$

$J(\vec\theta)=\frac1{2n}\sum_{i=1}^n(h_{\vec\theta}(\vec x^{(i)})-y^{(i)})^2$
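
A direct NumPy translation of $J(\vec\theta)$ (a sketch; `X` stacks the $\vec x^{(i)T}$ as rows):

```python
import numpy as np

def least_squares_loss(theta, X, y):
    """J(theta) = 1/(2n) * sum of squared residuals; X is (n, d+1)."""
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * X.shape[0])
```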

Convex Functions - Optimization

  • a function is convex iff for all $t\in[0,1]$ and $x_1,x_2\in X$, the line segment joining any two points on the graph lies on or above the graph itself

    $f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)$

  • strictly convex - the inequality is strict for $x_1\ne x_2$ and $t\in(0,1)$; the function has no flat regions
  • Least squares is convex and differentiable w.r.t. $\vec\theta$ → any local minimum of a convex function is a global minimum (and a strictly convex function has a unique minimizer)
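
As a concrete check, $f(x)=x^2$ satisfies the definition, since

$tf(x_1)+(1-t)f(x_2)-f(tx_1+(1-t)x_2)=tx_1^2+(1-t)x_2^2-(tx_1+(1-t)x_2)^2=t(1-t)(x_1-x_2)^2\ge 0$

with strict inequality for $x_1\ne x_2$ and $t\in(0,1)$, so $x^2$ is strictly convex.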

Optimization Visual

(Contour-plot visual: blue marks the minimum; loss increases outward along the gradient toward red.)

Loss Optimization (Gradient Descent, Normal Equation)

  • Given loss function $J(\vec\theta)$ we want to find $\min_{\vec\theta} J(\vec\theta)$
  • choose initial values of $\vec\theta$ and select new ones until we reach a minimum (for non-convex losses this is chaotic: which minimum we land in depends heavily on the initial parameter values)

  • convex functions have a unique minima
  • normalized loss function (the $\frac1{2n}$ factor keeps the loss, and hence the gradient, on a scale independent of dataset size $n$)

Gradient Descent

  • given initial parameters $\vec\theta$, we iteratively update them by moving against (down) the gradient by a step size called the learning rate $\alpha$:

$\theta_j\leftarrow\theta_j-\alpha\cdot\frac{\partial}{\partial \theta_j}J(\vec\theta)\quad\text{(simultaneous update for all } j\text{)}$

  • PROOF: indexed partial of least squares (see the derivation sketch after this list)

  • convergence - when the gradient is close to 0: $\|\vec\theta_{new}-\vec\theta_{old}\|_2\lt \epsilon$
    • L2 Norm $\|\vec v\|_2$ - calculates distance (magnitude) of vector from origin
    $\|\vec v\|_2=\sqrt{\sum_i v_i^2}=\sqrt{v_1^2+v_2^2+…+v_d^2}$
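
The derivation referenced in the PROOF bullet above, sketched via the chain rule:

$\frac{\partial}{\partial\theta_j}J(\vec\theta)=\frac{\partial}{\partial\theta_j}\frac1{2n}\sum_{i=1}^n\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)^2=\frac1n\sum_{i=1}^n\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)x_j^{(i)}$

since $\frac{\partial}{\partial\theta_j}h_{\vec\theta}(\vec x)=x_j$ and the $\frac12$ cancels the exponent 2.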

Scaling - SGD, GD, MBGD

  • Mean squared error

$J(\vec\theta)=\frac1n\sum_{i=1}^n\ell(\vec x^{(i)},y^{(i)}, \vec\theta)$

  • $\ell(\vec x^{(i)},y^{(i)}, \vec\theta)$ is the squared error of $h_{\vec\theta}(\vec x^{(i)})$ against the label $y^{(i)}$ (the per-example loss)

Full Batch Gradient Descent (GD)

$\vec\theta\leftarrow\vec\theta-\alpha\frac1n\sum_{i=1}^n\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)\vec x^{(i)}$

  • computing average gradient over $n$ training instances (simultaneously) is $O(n)$* time (slow)
  • SEE: quantum computing for finding GD for each example simultaneously
  • *is actually $O(n\cdot d)$ but generally $n\gg d$
  • computes the true gradient of the loss relative to the full population (here, the entire dataset)
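
A minimal NumPy sketch of this full-batch update (names are my own; assumes `X` is the $n\times(d+1)$ design matrix with the bias column $x_0=1$):

```python
import numpy as np

def gd_step(theta, X, y, alpha):
    """One full-batch GD step; averages the gradient over all n examples."""
    n = X.shape[0]
    grad = X.T @ (X @ theta - y) / n  # sum of (h(x_i) - y_i) * x_i, divided by n
    return theta - alpha * grad
```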

Stochastic Gradient Descent (SGD)

  • computes the gradient for a single example at a time and updates $\vec\theta$ immediately, iterating over the examples → a per-example loss instead of the full average

$\vec\theta\leftarrow\vec\theta-\alpha\nabla_{\vec\theta}\ell\big(\vec x^{(i)},y^{(i)},\vec\theta\big)$

$\vec\theta\leftarrow\vec\theta-\alpha\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)\vec x^{(i)}$

  • GD vs SGD comparison (visual)

  • pros: memory efficient, computationally cheap per update, implicit regularization (see further)
  • cons: high noise, does not exploit GPUs to the fullest
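
A matching per-example sketch (same illustrative conventions as the GD snippet above):

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha):
    """One SGD pass: update theta after each individual example."""
    for i in np.random.permutation(X.shape[0]):  # visit examples in random order
        grad = (X[i] @ theta - y[i]) * X[i]      # gradient of the per-example loss
        theta = theta - alpha * grad
    return theta
```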

Mini-batch Gradient Descent (MBGD)

  • sample a batch of $B$ points $D_B$ at random from full dataset $D$ (w/o replacement)

$\vec\theta\leftarrow\vec\theta-\alpha\frac1B\sum_{(\vec x^{(i)},y^{(i)})\in D_B}\nabla_{\vec\theta}\ell(\vec x^{(i)},y^{(i)}, \vec\theta)$

  • $B=1\implies \text{SGD}$
  • $B=n\implies \text{GD}$
  • comparison visual
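
A sketch of one mini-batch step under the same assumptions (`np.random.choice` with `replace=False` draws the batch without replacement):

```python
import numpy as np

def mbgd_step(theta, X, y, alpha, B):
    """One mini-batch step: average gradient over B points drawn w/o replacement."""
    idx = np.random.choice(X.shape[0], size=B, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / B  # B=1 recovers SGD, B=n recovers GD
    return theta - alpha * grad
```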

Vectorization

  • compact equations, faster code - see discussion
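
As a small illustration (my own sketch, not the discussion worksheet): the same average gradient computed with an explicit Python loop and as a single matrix product, which is much faster in practice.

```python
import numpy as np

def grad_loop(theta, X, y):
    """Average gradient, one example at a time (slow in Python)."""
    n, d = X.shape                       # d here counts the bias column too
    grad = np.zeros(d)
    for i in range(n):
        grad += (X[i] @ theta - y[i]) * X[i]
    return grad / n

def grad_vectorized(theta, X, y):
    """Same gradient as one matrix product (fast, delegates to BLAS)."""
    return X.T @ (X @ theta - y) / X.shape[0]
```

Both return the same vector (`np.allclose` confirms); only the vectorized form avoids the Python-level loop.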

Normal Equation

  • finding the optimal $\vec\theta$ (weights/parameters) analytically s.t. $\frac\partial{\partial\vec\theta}J(\vec\theta)=\vec 0$
  • Solving gives the closed-form optimum (the normal equation):

$\vec\theta=(X^TX)^{-1}X^T\vec y$

  • if $X^TX$ is not invertible: use the pseudo-inverse (np.linalg.pinv(X)), remove linearly dependent features, or remove extra features s.t. $d\le n$
  • normalized loss function: the $\frac1{2n}$ factor does not change the minimizer, so it drops out of the normal equation
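
A sketch of the closed-form fit (illustrative; falls back to the pseudo-inverse route mentioned above when $X^TX$ is singular):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^{-1} X^T y, with a pseudo-inverse fallback."""
    try:
        return np.linalg.solve(X.T @ X, X.T @ y)  # solve beats an explicit inverse
    except np.linalg.LinAlgError:
        return np.linalg.pinv(X) @ y              # Moore-Penrose pseudo-inverse
```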

Linear Regression Recipe
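
Putting the sections above together: choose the hypothesis class (linear, $h_{\vec\theta}(\vec x)=\vec\theta^T\vec x$), choose the loss (least squares $J(\vec\theta)$), then optimize $\vec\theta$ either iteratively (GD/SGD/MBGD) or in closed form (normal equation).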

Loss gradient
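
In vectorized form, consistent with the per-coordinate partial derived earlier, the least-squares gradient over the whole dataset is

$\nabla_{\vec\theta}J(\vec\theta)=\frac1n X^T\big(X\vec\theta-\vec y\big)$

where row $i$ of $X$ is $\vec x^{(i)T}$.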

Discussion - Least Squares Gradient



📌 **SUMMARY**