01 - Linear Regression - lec. 1,2,3
ucla | CS M146 | 2023-04-07T12:06
Table of Contents
- Definitions
- Lecture
- Discussion - Least Squares Gradient
- Resources
Definitions
hypothesis function
A function that maps inputs to outputs using input features ($\vec x$) and labels ($y$), and belongs to a "hypothesis class ($\mathcal{H}$)" (a set of similar functions) by using arbitrary weights $\vec\theta$. E.g. linear regression hypothesis function: $h_{\vec\theta}(\vec x) = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = \vec\theta^T \vec x$ (with $x_0 = 1$)
convex function
- A real-valued function is convex if the line segment joining any two points on its graph lies on or above the graph
Lecture
- Recall from lec. 1, an ML model takes features and labels as inputs and returns a hypothesis function.
Linear Regression Properties
Hypothesis function - linear w.r.t. weights $\vec\theta$
- Parametrized by $\vec\theta = (\theta_0, \theta_1, \dots, \theta_d)$
- Vectorized: $h_{\vec\theta}(\vec x) = \vec\theta^T \vec x$ (with $x_0 = 1$)
- $\theta_0$ is the bias parameter and $\theta_1, \dots, \theta_d$ are the weights
- Visualized with a graph (line/plane/hyperplane) for multiple dimensions
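A minimal NumPy sketch of this hypothesis function (the name `hypothesis` and the example values are illustrative, not from the lecture):

```python
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x, where x[0] is assumed to be 1 (bias term)."""
    return theta @ x

# Example: theta = [bias, w1, w2], x = [1, x1, x2]
theta = np.array([0.5, 2.0, -1.0])
x = np.array([1.0, 3.0, 4.0])
print(hypothesis(theta, x))  # 0.5 + 2*3 - 1*4 = 2.5
```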
Loss Function - Least Squares Loss
Intuitively, suppose we measure how far each prediction $h_{\vec\theta}(\vec x^{(n)})$ is from its label $y^{(n)}$ and sum the squared differences: $J(\vec\theta) = \frac{1}{2}\sum_{n=1}^{N}\big(h_{\vec\theta}(\vec x^{(n)}) - y^{(n)}\big)^2$
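A minimal NumPy sketch of this loss, assuming the design-matrix form of the data (the function name and shapes are illustrative):

```python
import numpy as np

def least_squares_loss(theta, X, y):
    """J(theta) = 1/2 * sum of squared residuals; X is (N, d) with a bias column, y is (N,)."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)
```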
Convex Functions - Optimization
a function $f$ is convex iff for all $\vec x_1, \vec x_2$ and $\lambda \in [0, 1]$, $f(\lambda \vec x_1 + (1-\lambda)\vec x_2) \le \lambda f(\vec x_1) + (1-\lambda) f(\vec x_2)$, i.e. the line segment through any 2 points of the function lies on or above the function itself
- strictly convex - the inequality is strict, i.e. no flat regions in the function
- Least squares is convex and differentiable w.r.t. $\vec\theta$
→ for a convex function every local minimum is a global minimum (strictly convex → the minimum is unique)
Optimization Visual
Blue is a minimum; the color gradient runs outward to red (higher loss)
Loss Optimization (Gradient Descent, Normal Equation)
- Given loss function $J(\vec\theta)$, we want to find $\vec\theta^* = \arg\min_{\vec\theta} J(\vec\theta)$: choose initial values of $\vec\theta$ and select new $\vec\theta$ until we reach a minimum (chaotic for non-convex losses - the minimum reached depends heavily on the initial parameter values)
- convex functions: any local minimum is a global minimum, so the starting point does not matter
- normalized loss function: $J(\vec\theta) = \frac{1}{2N}\sum_{n=1}^{N}\big(\vec\theta^T \vec x^{(n)} - y^{(n)}\big)^2$
Gradient Descent
given initial parameters $\vec\theta^{(0)}$, we can iteratively update $\vec\theta$ by moving against (down) the gradient by some step size called the learning rate $\eta$:
$$\vec\theta^{(t+1)} = \vec\theta^{(t)} - \eta\, \nabla_{\vec\theta} J\big(\vec\theta^{(t)}\big)$$
PROOF: indexed partial derivative of the least squares loss: $\frac{\partial J}{\partial \theta_j} = \frac{1}{N}\sum_{n=1}^{N}\big(\vec\theta^T \vec x^{(n)} - y^{(n)}\big)\, x_j^{(n)}$
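A sketch of that derivation under the $\frac{1}{2N}$-normalized loss defined above (chain rule applied term by term):

```latex
\begin{aligned}
\frac{\partial J}{\partial \theta_j}
  &= \frac{\partial}{\partial \theta_j}\,\frac{1}{2N}\sum_{n=1}^{N}\big(\vec\theta^T \vec x^{(n)} - y^{(n)}\big)^2 \\
  &= \frac{1}{2N}\sum_{n=1}^{N} 2\big(\vec\theta^T \vec x^{(n)} - y^{(n)}\big)\cdot
     \frac{\partial}{\partial \theta_j}\big(\vec\theta^T \vec x^{(n)} - y^{(n)}\big) \\
  &= \frac{1}{N}\sum_{n=1}^{N}\big(\vec\theta^T \vec x^{(n)} - y^{(n)}\big)\, x_j^{(n)}
\end{aligned}
```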
convergence - when the gradient is close to 0, detected via $\|\vec\theta_{new}-\vec\theta_{old}\|_2 < \epsilon$
- L2 norm $\|\vec v\|_2$ - calculates the distance (magnitude) of a vector from the origin: $\|\vec v\|_2=\sqrt{\sum_i v_i^2}=\sqrt{v_1^2+v_2^2+\dots+v_{|\vec v|}^2}$
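A minimal full-batch gradient descent sketch for this loss, assuming the update rule and convergence check above; the names `X`, `y`, `lr`, and `eps` are illustrative:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, eps=1e-6, max_iters=10_000):
    """X: (N, d) design matrix with a leading column of 1s; y: (N,) labels."""
    N, d = X.shape
    theta = np.zeros(d)                      # initial parameters theta^(0)
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y) / N     # gradient of the 1/(2N) least squares loss
        theta_new = theta - lr * grad        # move against the gradient by the learning rate
        if np.linalg.norm(theta_new - theta) < eps:  # convergence: update close to 0
            return theta_new
        theta = theta_new
    return theta
```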
Scaling - SGD, GD, MBGD
- Mean squared error $\frac{1}{N}\sum_{n=1}^{N}\big(h_{\vec\theta}(\vec x^{(n)}) - y^{(n)}\big)^2$ is the least squares error of $\vec\theta$ averaged over the $N$ features $\vec x^{(n)}$ and labels $y^{(n)}$
Full Batch Gradient Descent (GD)
- computing the average gradient over all $N$ training instances (simultaneously) is $O(N)$* time per update (slow for large $N$) - SEE: quantum computing for computing the gradient of each example simultaneously
- *is actually $O(N \cdot d)$ for $d$ features, but generally written $O(N)$ - GD finds the true loss/gradient relative to the whole population (dataset)
Stochastic Gradient Descent (SGD)
- computes the gradient for every example individually, relative to that example alone - iterates over the examples one at a time (in shuffled order) → calculating a per-example loss
GD vs SGD comparison (visual)
- pros: memory efficient, computationally cheap, implicit regularization (see further)
- cons: high noise, does not exploit GPUs to the fullest
Mini-batch Gradient Descent (MBGD)
- sample a batch of $B$ points at random from the full dataset (w/o replacement) and average the gradient over the batch (see the sketch after the comparison visual below)
comparison visual
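A minimal mini-batch gradient descent sketch under the same assumptions as the full-batch version above; batch size 1 recovers SGD and batch size $N$ recovers full-batch GD (function and variable names are illustrative):

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.1, epochs=100, seed=0):
    """Mini-batch gradient descent; batch_size=1 is SGD, batch_size=N is full-batch GD."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(N)                        # shuffle: sample batches w/o replacement
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)   # gradient on this batch only
            theta -= lr * grad
    return theta
```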
Vectorization
- compact equations, faster code - see discussion
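To make the vectorization point concrete, here is a hedged comparison: both functions below compute the same least squares gradient, but the loop version iterates element by element in Python while the vectorized version is a single matrix expression (names are illustrative):

```python
import numpy as np

def grad_loop(X, y, theta):
    """Per-element gradient of the 1/(2N) least squares loss (slow in pure Python)."""
    N, d = X.shape
    grad = np.zeros(d)
    for n in range(N):
        err = float(X[n] @ theta - y[n])
        for j in range(d):
            grad[j] += err * X[n, j]
    return grad / N

def grad_vectorized(X, y, theta):
    """Same gradient as a single matrix expression (fast, delegated to BLAS)."""
    return X.T @ (X @ theta - y) / X.shape[0]
```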
Normal Equation
- finding the optimal $\vec\theta$ (weights/parameters) analytically s.t. $\nabla_{\vec\theta} J(\vec\theta) = 0$:
$$\vec\theta^* = (X^T X)^{-1} X^T \vec y$$
- Optimized Normal Equation
  - if $X^T X$ is not invertible, try the pseudo-inverse (np.linalg.pinv(X)), remove linearly dependent features, or remove extra features s.t. $X^T X$ becomes full rank
  - normalized loss function (matrix form): $J(\vec\theta) = \frac{1}{2N}\|X\vec\theta - \vec y\|_2^2$
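A minimal sketch of the closed-form solve; `np.linalg.pinv` covers the non-invertible case noted above, and `np.linalg.lstsq` is a numerically safer alternative (the data here is synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d - 1))])  # leading column of 1s for the bias
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.1 * rng.normal(size=N)                   # noisy labels

theta_normal = np.linalg.pinv(X) @ y                 # pseudo-inverse form of the normal equation
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # equivalent least squares solve
print(theta_normal, theta_lstsq)                     # both approximately recover true_theta
```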
Linear Regression Recipe
Loss gradient
$$\nabla_{\vec\theta} J(\vec\theta) = \frac{1}{N}\sum_{n=1}^{N}\big(\vec\theta^T \vec x^{(n)} - y^{(n)}\big)\,\vec x^{(n)} = \frac{1}{N}\,X^T\big(X\vec\theta - \vec y\big)$$
Discussion - Least Squares Gradient
Resources
📌 **SUMMARY**