02 - Generalization - lec. 3,4

ucla | CS M146 | 2023-04-10T14:02


Supplemental

  • dimensionality - whether the hypothesis is a line or a hyperplane depends on the dimensionality of the features
  • linear or extended linear - depends on the feature transformations used (polynomial, root, etc.)

Lecture

  • the term linear is generally reserved for functions that are linear w.r.t. the weights, so we can feed non-linear transformations of the inputs/features into linear regression

Linear Basis

Linear Basis Function Models

$h_\theta(x)=\theta^T\phi(x)=\sum_{j=0}^{k}\theta_j\phi_j(x)$

  • $\phi(x): \mathbb{R}^d \to \mathbb{R}^k$ is a k-dimensional basis w/ params $\theta \in \mathbb{R}^k$
  • $\phi_0(x)=1$ usually, so the first param is still the bias
  • k can be different from the feature dimension d+1, e.g. polynomial reg.: d=1, k=4

Linear Regression: lin. feature transforms

Extended Linear Regression: non-lin. feature transforms
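
A minimal sketch of a polynomial basis fit under the d=1, k=4 example above (assuming NumPy; the toy data and the `poly_basis` helper are made up for illustration):

```python
import numpy as np

def poly_basis(x, degree=3):
    """Map 1-D inputs x (shape (n,)) to phi(x) = [1, x, x^2, ..., x^degree], shape (n, degree+1)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

# toy data, assumed for illustration
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.shape)

Phi = poly_basis(x, degree=3)                    # n x k design matrix; first column is the bias feature
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares fit of theta
y_hat = Phi @ theta                              # h_theta(x) = theta^T phi(x), vectorized over the dataset
```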

Generalization

  • more complex is not always better - a complex hypothesis can overfit (the training data)

generalization - ability of ML model to make good predictions on unseen (test) data

  • the LSE loss we looked at so far is ON TRAINING DATA - empirical risk minimization (ERM)
  • does ERM generalize to unseen x?
  • theoretically
    • depends on hypothesis class, data size, learning algo → learning theory
  • empirically
    • can assess via validation data
  • algorithmically
    • can strengthen via regularization
  • underfitting - hypothesis is not expressive/complex enough for the data
  • overfitting - hypothesis is too complex for the data
  • hypothesis complexity - hard to define in general; for regression, polynomial degree is a proxy
    • a degree-n polynomial can easily reach 0 training loss on a dataset of size n (quick check below)
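
A quick numeric check of that last point (NumPy, with an arbitrary made-up dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = np.linspace(0, 1, n)
y = rng.standard_normal(n)                       # arbitrary targets

# degree-n polynomial: n+1 parameters, only n data points
Phi = np.vstack([x**j for j in range(n + 1)]).T
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

train_loss = np.mean((Phi @ theta - y) ** 2)
print(train_loss)                                # ~0: the polynomial interpolates the training set
```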

Option 1: Cross Validation (validation split)

  • train, validation, test - split

  • if the test data is similar to the validation data (similar distribution), good validation performance indicates generalization

k-fold Cross Validation

  • partition dataset of n instances into k disjoint folds (subsets)
  • choose fold $i \in [1,k]$ as the validation set
  • train on the $k-1$ remaining folds and evaluate accuracy on fold $i$ (a sketch follows below)
  • compute the average over the k folds, or choose the best model from a particular fold
  • “leave-one-out”: k=n
  • visual
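
A sketch of the k-fold loop (assuming NumPy; `fit` and `loss` are placeholder helpers, not course-provided code):

```python
import numpy as np

def k_fold_cv(X, y, k, fit, loss):
    """Average validation loss over k disjoint folds.

    Assumes fit(X_train, y_train) -> model and loss(model, X_val, y_val) -> float.
    """
    n = X.shape[0]
    idx = np.random.permutation(n)
    folds = np.array_split(idx, k)          # partition indices into k disjoint folds
    scores = []
    for i in range(k):
        val = folds[i]                      # fold i is the validation set
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # remaining k-1 folds
        model = fit(X[train], y[train])
        scores.append(loss(model, X[val], y[val]))
    return np.mean(scores)                  # average over the k folds; k = n gives leave-one-out
```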

Option 2: Regularization

  • eliminating features or getting more data also helps - they effectively regularize the dataset
  • loss regularization - a method to prevent overfitting by controlling the complexity of the learned hypothesis
  • penalize large weight magnitudes during optimization → add a penalty term to the loss

Ridge Regression: $\ell_2$-regularized

  • $\lambda \geq 0$ is the regularization hyperparameter → appends the squared L2 norm of the weights onto the loss → when minimizing, we now try to minimize the regularization term too

$\sum^d_{j=1}\theta_j^2=\|\bm\theta_{1:d}\|_2^2=\|\bm\theta_{1:d}-\vec 0\|_2^2$

  • pulls the weights towards the origin (shrinks their magnitudes)
  • vectorized
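
One common vectorized form (assuming a design matrix $\Phi\in\mathbb{R}^{n\times k}$ whose rows are $\phi(\bm x^{(i)})^T$; whether the bias $\theta_0$ is excluded from the penalty follows the convention above):

$J(\bm\theta)=\|\Phi\bm\theta-\bm y\|_2^2+\lambda\|\bm\theta_{1:d}\|_2^2$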

Hyperparameters

  • additional unknowns (other than weights) for improving learning - λ, α
  • model hyperparameters - influence representation

    • hypothesis class H
    • basis function ϕ
  • algorithmic hyperparameters - influence training

    • learning rate α
    • regularization coefficient λ
    • batch size B
  • model selection: the best hyperparams are the ones that help generalize → eval based on validation loss (sketch below)
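
A minimal model-selection sketch (NumPy; the toy data, the split, and the `ridge_fit` helper are assumptions for illustration — here the penalty is applied to all weights, including the bias, for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution (regularizes every weight, incl. bias, for brevity)."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def mse(theta, Phi, y):
    return np.mean((Phi @ theta - y) ** 2)

# toy data and train/validation split, made up for illustration
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(30)
Phi = np.vstack([x**j for j in range(6)]).T      # degree-5 polynomial basis
Phi_tr, y_tr, Phi_val, y_val = Phi[:20], y[:20], Phi[20:], y[20:]

# model selection: keep the lambda with the lowest validation loss
lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lambdas, key=lambda lam: mse(ridge_fit(Phi_tr, y_tr, lam), Phi_val, y_val))
```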

Discussion

Linear Basis Function Models

$\phi(\bm x): \mathbb{R}^d \to \mathbb{R}^k,\quad \bm\theta \in \mathbb{R}^k$

$h_{\bm\theta}(\bm x)=\bm\theta^T\phi(\bm x)=\sum_{j=0}^{k}\theta_j\phi_j(\bm x)$

  • examples: polynomial and Gaussian bases (see below)
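
For reference, one standard parametrization of a Gaussian basis (the centers $\bm\mu_j$ and width $s$ are design choices; this exact form is an assumption, not from the slides):

$\phi_j(\bm x)=\exp\!\left(-\frac{\|\bm x-\bm\mu_j\|_2^2}{2s^2}\right)$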

Regularized Linear (Ridge) Regression

  • Closed form
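
For reference, the standard closed form (assuming design matrix $\Phi$ and the penalty applied to all of $\bm\theta$; if the bias is excluded, the corresponding diagonal entry of $I$ is set to 0):

$\bm\theta^*=(\Phi^T\Phi+\lambda I)^{-1}\Phi^T\bm y$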

Resources


📌

**SUMMARY**