01 - Linear Regression - lec. 1,2,3
ucla | CS M146 | 2023-04-07T12:06
Table of Contents
- Definitions
- Lecture
- Discussion - Least Squares Gradient
- Resources
Definitions
hypothesis function
$h_\theta(\vec x)\in\mathcal H$
A function mapping input features ($\vec x^{(i)}$) to predicted labels, belonging to a “hypothesis class” ($\mathcal H$, a set of similar functions) and parametrized by weights $\vec\theta$ learned from the training labels ($y^{(i)}$)
E.g. Linear regression hypothesis function:
$h_\theta(\vec x)=\theta_0+\theta_1 x_1 + … + \theta_d x_d=\vec\theta^T\vec x$
convex function
- A real-valued function is convex if the line segment joining any two points lies above the function

Lecture
- Recall from lec. 1, an ML model takes features and labels as inputs and returns a hypothesis function.
Linear Regression Properties
Hypothesis function - linear w.r.t. weights $\theta$
- Parametrized
$h_{\vec{\theta}}(\vec x)=\theta_0+\theta_1 x_1 + … + \theta_d x_d\quad\vec\theta\in\mathbb{R}^{d+1}$
- Vectorized ($x_0=1$)
$h_{\vec{\theta}}(\vec{x})=\sum_{j=0}^d \theta_jx_j=\vec{\theta}^T\vec{x}$
- $\theta_0$ is the bias parameter and $\theta_{1:d}$ are the weights
Visualized with a graph for multiple dimensions
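A minimal sketch of the vectorized hypothesis with the $x_0=1$ trick; `theta` and `x` are assumed toy values, not from the lecture:

```python
import numpy as np

def h(theta, x):
    """h_theta(x) = theta^T [1, x]: prepend x_0 = 1 so the bias theta_0
    folds into a single dot product."""
    return theta @ np.concatenate(([1.0], x))

theta = np.array([0.5, 2.0, -1.0])  # [theta_0 (bias), theta_1, theta_2]
x = np.array([3.0, 4.0])
print(h(theta, x))  # 0.5 + 2*3 - 1*4 = 2.5
```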

Loss Function - Least Squares Loss
Intuitively, suppose $d=1,\theta_0=0\implies x\in\mathbb{R},\ \vec\theta=[0,\theta_1]$

$J(\vec\theta)=\frac1{2n}\sum_{i=1}^n(h_{\vec\theta}(\vec x^{(i)})-y^{(i)})^2$
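The loss above can be sketched directly in code; the dataset here is an assumed toy example (exactly $y=2x$), with $X$ carrying an $x_0=1$ column so $\theta_0$ acts as the bias:

```python
import numpy as np

def least_squares_loss(theta, X, y):
    """J(theta) = 1/(2n) * sum of squared residuals."""
    n = X.shape[0]
    r = X @ theta - y          # residuals h_theta(x_i) - y_i
    return (r @ r) / (2 * n)

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # ones column + feature
y = np.array([2.0, 4.0, 6.0])                        # exactly y = 2x
print(least_squares_loss(np.array([0.0, 2.0]), X, y))  # perfect fit -> 0.0
print(least_squares_loss(np.array([0.0, 1.0]), X, y))  # worse fit -> larger
```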
Convex Functions - Optimization
a function $f$ is convex iff for all $t\in[0,1]$ and $x_1,x_2\in X$, the line segment joining any two points on the function lies on or above the function itself:
$f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)$
- strictly convex - the inequality is strict, no flat surface in the function
- Least squares is convex and differentiable w.r.t. $\vec\theta$ → for a convex function every local minimum is a global minimum (and strict convexity gives a unique minimizer)
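A numeric spot-check of the convexity inequality for $f(x)=x^2$ (the shape of the 1-D least-squares loss); the sample points are assumed for illustration:

```python
import numpy as np

# Check f(t*x1 + (1-t)*x2) <= t*f(x1) + (1-t)*f(x2) at several t values.
f = lambda x: x ** 2
x1, x2 = -1.5, 3.0
ok = all(
    f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-12
    for t in np.linspace(0.0, 1.0, 11)
)
print(ok)  # True
```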
Optimization Visual
Blue is a minimum; the gradient increases outward toward red
Loss Optimization (Gradient Descent, Normal Equation)
- Given loss function $J(\vec\theta)$ we want to find $\min_{\vec\theta} J(\vec\theta)$
choose initial values of $\vec\theta$ and select new ones until we reach a minimum (for non-convex losses this is sensitive to initialization: which minimum is reached depends heavily on the starting parameter values)

- convex functions have a unique minima
- normalized loss function
Gradient Descent
given initial parameters $\vec\theta$, iteratively update them by moving against (down) the gradient by some step size called the learning rate $\alpha$:

$\theta_j\leftarrow\theta_j-\alpha\cdot\frac{\partial}{\partial \theta_j}J(\vec\theta)\quad\text{(simultaneously for every }j\text{)}$
PROOF: indexed partial of least squares
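A sketch of the omitted derivation, applying the chain rule to the least-squares loss defined above:

$\frac{\partial}{\partial\theta_j}J(\vec\theta)=\frac{\partial}{\partial\theta_j}\frac1{2n}\sum_{i=1}^n\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)^2$

$=\frac1{2n}\sum_{i=1}^n 2\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)\cdot\frac{\partial}{\partial\theta_j}h_{\vec\theta}(\vec x^{(i)})\quad\text{(chain rule)}$

$=\frac1n\sum_{i=1}^n\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)x_j^{(i)}\quad\text{since }\frac{\partial}{\partial\theta_j}\vec\theta^T\vec x^{(i)}=x_j^{(i)}$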

convergence - when the gradient is close to 0: $\|\vec\theta_{new}-\vec\theta_{old}\|_2\lt \epsilon$
L2 Norm $\|\vec v\|_2$ - the distance (magnitude) of a vector from the origin:
$\|\vec v\|_2=\sqrt{\sum_i v_i^2}=\sqrt{v_1^2+v_2^2+…+v_{|\vec v|}^2}$
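Putting the update rule and the L2-norm stopping condition together; the dataset and learning rate below are assumed toy values, not from the lecture:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, eps=1e-8, max_iters=10000):
    """Batch gradient descent on the least-squares loss."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y) / n             # gradient of J(theta)
        theta_new = theta - alpha * grad
        if np.linalg.norm(theta_new - theta) < eps:  # ||new - old||_2 < eps
            return theta_new
        theta = theta_new
    return theta

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # x_0 = 1 bias column
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x
theta = gradient_descent(X, y)
print(theta)  # close to [1., 2.]
```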
Scaling - SGD, GD, MBGD
- Mean squared error
$J(\vec\theta)=\frac1n\sum_{i=1}^n\ell(\vec x^{(i)},y^{(i)}, \vec\theta)$
- $\ell(\vec x^{(i)},y^{(i)}, \vec\theta)$ is the squared error of $h_{\vec\theta}(\vec x^{(i)})$ on one example’s features and label
Full Batch Gradient Descent (GD)
$\vec\theta\leftarrow\vec\theta-\alpha\frac1n\sum_{i=1}^n\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)\vec x^{(i)}$
- computing average gradient over $n$ training instances (simultaneously) is $O(n)$* time (slow)
- SEE: quantum computing for finding GD for each example simultaneously
- *is actually $O(n\cdot d)$ but generally $n\gg d$
- finding the true loss relative to the population (dataset)
Stochastic Gradient Descent (SGD)
- computes the gradient one example at a time, iterating over the examples → a per-example loss and update
$\vec\theta\leftarrow\vec\theta-\alpha\nabla_{\vec\theta}\ell\big(\vec x^{(i)},y^{(i)},\vec\theta\big)$
$\vec\theta\leftarrow\vec\theta-\alpha\big(h_{\vec\theta}(\vec x^{(i)})-y^{(i)}\big)\vec x^{(i)}$
GD vs SGD comparison (visual)

- pros: memory efficient, computationally cheap, implicit regularization (see further)
- cons: high noise, does not exploit GPUs to fullest
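A minimal SGD sketch using the per-example update above; the dataset (noise-free, exactly $y=1+2x$), seed, and learning rate are assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # x_0 = 1 bias column
y = np.array([1.0, 3.0, 5.0])                        # exactly y = 1 + 2x
theta = np.zeros(2)
alpha = 0.05
for epoch in range(500):
    for i in rng.permutation(len(y)):               # shuffle each epoch
        grad_i = (X[i] @ theta - y[i]) * X[i]       # per-example gradient
        theta -= alpha * grad_i
print(theta)  # close to [1., 2.]
```

Note the update direction is noisy step to step, but cheap: each update touches one example instead of all $n$.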
Mini-batch Gradient Descent (MBGD)
- sample a batch of $B$ points $D_B$ at random from full dataset $D$ (w/o replacement)
$\vec\theta\leftarrow\vec\theta-\alpha\frac1B\sum_{(\vec x^{(i)},y^{(i)})\in D_B}\nabla_{\vec\theta}\ell(\vec x^{(i)},y^{(i)}, \vec\theta)$
- $B=1\implies \text{SGD}$
- $B=n\implies \text{GD}$
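A mini-batch sketch: sample $B$ points without replacement each step and average their gradients; the toy dataset and hyperparameters are assumed values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])                   # exactly y = 1 + 2x
theta = np.zeros(2)
alpha, B = 0.05, 2
for step in range(3000):
    idx = rng.choice(len(y), size=B, replace=False)  # batch w/o replacement
    Xb, yb = X[idx], y[idx]
    grad = Xb.T @ (Xb @ theta - yb) / B              # average batch gradient
    theta -= alpha * grad
print(theta)  # close to [1., 2.]
```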
comparison visual

Vectorization
- compact equations, faster code - see discussion
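The point of vectorization in one comparison: a per-example loop and a single matrix expression compute the same batch gradient (toy values assumed):

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([0.5, 0.5])
n = len(y)

# loop version: sum per-example gradients
grad_loop = sum((X[i] @ theta - y[i]) * X[i] for i in range(n)) / n

# vectorized version: one matrix-vector expression, same math, faster code
grad_vec = X.T @ (X @ theta - y) / n

print(np.allclose(grad_loop, grad_vec))  # True
```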
Normal Equation
- finding the optimal $\vec\theta$ (weights/parameters) analytically by setting $\frac\partial{\partial\vec\theta}J(\vec\theta)=0$
- Optimized Normal Equation
$\vec\theta=(X^TX)^{-1}X^T\vec y$
- if $X$ is not invertible, try pseudo-inverse (np.linalg.pinv(X)), remove linearly dependent features, remove extra features s.t. $d\le n$
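The closed-form solution in numpy, with the `pinv` fallback noted above; the dataset is an assumed toy example (`np.linalg.solve` is used instead of explicitly inverting $X^TX$, which is numerically safer):

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # ones + feature
y = np.array([3.1, 4.9, 7.2, 8.8])

theta = np.linalg.solve(X.T @ X, X.T @ y)  # theta = (X^T X)^{-1} X^T y
theta_pinv = np.linalg.pinv(X) @ y         # fallback when X^T X is singular

print(np.allclose(theta, theta_pinv))  # True
```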
- normalized loss function
Linear Regression Recipe
Loss gradient
Discussion - Least Squares Gradient
Resources
📌
**SUMMARY**