01 - Linear Regression - lec. 1,2,3

ucla | CS M146 | 2023-04-07T12:06



Definitions

  • hypothesis function

$h_\theta(x) \in \mathcal{H}$

A function, parametrized by weights $\theta$, that maps input features $x^{(i)}$ to predicted outputs; it is learned from feature/label pairs $(x^{(i)}, y^{(i)})$ and belongs to a “hypothesis class” $\mathcal{H}$ (a set of similar functions)

    E.g. Linear regression hypothesis function:

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d = \theta^T x$

  • convex function

    • A real-valued function is convex if the line segment joining any two points on its graph lies on or above the graph

Lecture

  • Recall from lec. 1, an ML model takes features and labels as inputs and returns a hypothesis function.

Linear Regression Properties

Hypothesis function - linear w.r.t. weights $\theta$

  • Parametrized

$h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_d x_d, \qquad x, \theta \in \mathbb{R}^{d+1}$

  • Vectorized ($x_0 = 1$)

$h_\theta(x) = \sum_{j=0}^{d} \theta_j x_j = \theta^T x$

  • $\theta_0$ is the bias parameter and $\theta_{1:d}$ are the weights
  • Visualized with a graph for multiple dimensions

Loss Function - Least Squares Loss

  • Intuitively, suppose $d = 1$, $\theta_0 = 0$, $x \in \mathbb{R}$, $\theta = [0, \theta_1]$

$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
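A minimal numpy sketch of this loss, assuming X is an $(n, d{+}1)$ design matrix with a leading column of ones (the function name and shapes are my assumptions, not from the lecture):

```python
import numpy as np

def least_squares_loss(theta, X, y):
    """J(theta) = 1/(2n) * sum_i (h_theta(x^(i)) - y^(i))^2.

    X: (n, d+1) design matrix with a leading column of ones (x_0 = 1)
    y: (n,) label vector, theta: (d+1,) parameter vector
    """
    residuals = X @ theta - y              # h_theta(x^(i)) - y^(i) for every i
    return (residuals @ residuals) / (2 * len(y))
```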

Convex Functions - Optimization

  • a function is convex iff for all $t \in [0,1]$ and $x_1, x_2 \in X$, the line segment joining any two points on the graph lies on or above the graph itself:

    $f(t x_1 + (1-t) x_2) \le t f(x_1) + (1-t) f(x_2)$

  • strictly convex - the inequality is strict ($<$), so the function has no flat regions
  • Least squares is convex and differentiable w.r.t. $\theta$ → any local minimum of a convex function is a global minimum (unique when strictly convex)
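A quick check of that claim in the $d = 1$, $\theta_0 = 0$ setting from above (a worked example I added, not from the slides):

$J(\theta_1) = \frac{1}{2n} \sum_{i=1}^{n} (\theta_1 x^{(i)} - y^{(i)})^2 \quad\Rightarrow\quad J''(\theta_1) = \frac{1}{n} \sum_{i=1}^{n} (x^{(i)})^2 \ge 0$

The second derivative is nonnegative everywhere, so $J$ is convex (strictly convex whenever some $x^{(i)} \neq 0$).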

Optimization Visual

(figure: loss surface; blue marks the minimum, values increase outward toward red)

Loss Optimization (Gradient Descent, Normal Equation)

  • Given a loss function $J(\theta)$, we want to find $\min_\theta J(\theta)$
  • choose initial values of $\theta$ and iteratively select new ones until we reach a minimum (for non-convex losses this is chaotic - which minimum we land in depends heavily on the initial parameter values)

  • convex functions have a single global minimum
  • normalized loss function (the $\frac{1}{n}$ factor makes the loss an average, independent of dataset size)

Gradient Descent

  • given initial parameters $\theta$, we iteratively update $\theta$ by moving against (down) the gradient, scaled by a step size called the learning rate $\alpha$:

$\theta_j \leftarrow \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ (simultaneous update for every $j$)

  • PROOF: indexed partial of least squares (sketched after this list)

  • convergence - when the gradient is close to 0: $\|\vec\theta_{new} - \vec\theta_{old}\|_2 < \epsilon$
    • L2 norm $\|\vec v\|_2$ - the distance (magnitude) of a vector from the origin:
    $\|\vec v\|_2 = \sqrt{\sum_i v_i^2} = \sqrt{v_1^2 + v_2^2 + \dots + v_d^2}$
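The “indexed partial of least squares” referenced above, sketched via the chain rule on the squared residual:

$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} 2\left(h_\theta(x^{(i)}) - y^{(i)}\right) \frac{\partial h_\theta(x^{(i)})}{\partial \theta_j} = \frac{1}{n} \sum_{i=1}^{n} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$

since $\partial h_\theta(x) / \partial \theta_j = x_j$ for a linear hypothesis; this is exactly the per-coordinate update above.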

Scaling - SGD, GD, MBGD

  • Mean squared error

$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(x^{(i)}, y^{(i)}, \theta)$

  • $\ell(x^{(i)}, y^{(i)}, \theta)$ is the least-squares error of $h_\theta(x^{(i)})$ for a single feature/label pair

Full Batch Gradient Descent (GD)

$\theta \leftarrow \theta - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

  • computing the average gradient over $n$ training instances (simultaneously) is O(n)* time per update (slow) - see the sketch below
  • SEE: quantum computing for computing the gradient of every example simultaneously
  • *actually O(nd), but generally $n \gg d$
  • computes the true loss gradient relative to the full dataset (population)
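A minimal full-batch GD sketch, assuming X is the $(n, d{+}1)$ design matrix with a leading column of ones (function name and defaults are my assumptions):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, eps=1e-6, max_iters=10_000):
    """Full-batch GD on the least-squares loss; each step costs O(nd)."""
    n, d = X.shape                                 # d here is d+1 in lecture notation
    theta = np.zeros(d)
    for _ in range(max_iters):
        grad = X.T @ (X @ theta - y) / n           # average gradient over all n examples
        theta_new = theta - alpha * grad
        if np.linalg.norm(theta_new - theta) < eps:  # convergence check from above
            return theta_new
        theta = theta_new
    return theta
```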

Stochastic Gradient Descent (SGD)

  • computes the gradient one example at a time, relative to that example - iterating over the examples individually → a per-example loss (see the sketch at the end of this subsection)

$\theta \leftarrow \theta - \alpha \nabla_\theta\, \ell(x^{(i)}, y^{(i)}, \theta)$

$\theta \leftarrow \theta - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$

  • GD vs SGD comparison (visual)

  • pros: memory efficient, computationally cheap, implicit regularization (see later)
  • cons: high noise, does not exploit GPUs to the fullest
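A minimal SGD sketch (function name, epoch count, and shuffling scheme are my assumptions):

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=10, seed=0):
    """SGD: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):              # visit examples in random order
            grad = (X[i] @ theta - y[i]) * X[i]   # gradient on a single example
            theta = theta - alpha * grad
    return theta
```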

Mini-batch Gradient Descent (MBGD)

  • sample a batch $D_B$ of $B$ points at random from the full dataset $D$ (w/o replacement) - see the sketch after this list

$\theta \leftarrow \theta - \alpha \frac{1}{B} \sum_{(x^{(i)}, y^{(i)}) \in D_B} \nabla_\theta\, \ell(x^{(i)}, y^{(i)}, \theta)$

  • $B = 1 \Rightarrow$ SGD
  • $B = n \Rightarrow$ GD
  • comparison visual
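A mini-batch GD sketch under the same assumptions as the GD/SGD sketches above:

```python
import numpy as np

def minibatch_gd(X, y, B=32, alpha=0.01, epochs=10, seed=0):
    """Mini-batch GD: average the gradient over B points sampled w/o replacement."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, B):              # disjoint batches cover the dataset
            batch = order[start:start + B]
            grad = X[batch].T @ (X[batch] @ theta - y[batch]) / len(batch)
            theta = theta - alpha * grad
    return theta
```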

Vectorization

  • compact equations, faster code - see discussion
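A small illustration of the point (a toy comparison I added, not from the discussion): the looped and vectorized predictions match, but the single matrix-vector product is far faster in numpy.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))    # n = 1000 examples, d = 10 features
theta = rng.standard_normal(10)

# Looped: one inner product per example (slow in pure Python)
preds_loop = np.array([sum(theta[j] * X[i, j] for j in range(10)) for i in range(1000)])

# Vectorized: a single matrix-vector product
preds_vec = X @ theta

assert np.allclose(preds_loop, preds_vec)
```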

Normal Equation

  • finding the optimal $\theta$ (weights/parameters) analytically by solving $\nabla_\theta J(\theta) = 0$
  • Solving yields the normal equation:

$\theta = (X^T X)^{-1} X^T y$

  • if $X^T X$ is not invertible, try the pseudo-inverse (np.linalg.pinv(X)), remove linearly dependent features, or remove extra features s.t. $d \le n$ - see the sketch below
  • normalized loss function - the $\frac{1}{2n}$ factor does not change the minimizer
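A minimal sketch combining the closed form and the pseudo-inverse fallback noted above (the function name and rank check are my assumptions):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^{-1} X^T y."""
    XtX = X.T @ X
    if np.linalg.matrix_rank(XtX) == XtX.shape[0]:
        return np.linalg.solve(XtX, X.T @ y)   # invertible case
    return np.linalg.pinv(X) @ y               # singular case: pseudo-inverse fallback
```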

Linear Regression Recipe

Loss gradient

Discussion - Least Squares Gradient

Resources


📌 **SUMMARY**