02 - Generalization - lec. 3,4

ucla | CS M146 | 2023-04-10T14:02


Table of Contents

Supplemental

  • dimensionality - line or hyperplane, dependent on dimensionality of features
  • linear or extended linear - dependent on feature transformations (polynomial, root, etc.)

Lecture

  • the term linear is generally reserved for functions linear w.r.t. the weights, so we can feed nonlinear transformations of the inputs/features into linear regression

Linear Basis

Linear Basis Function Models

$h_{\vec\theta}(\vec x)=\vec\theta^T\phi(\vec x)=\sum_{j=0}^k\theta_j\phi_j(\vec x)$

  • $\phi(\vec x):\R^d\to\R^k$ is a k-dimensional basis w/ params $\vec\theta\in\R^k$
  • $\phi_0(\vec x)=1$ usually, so the first param is still the bias
  • $k$ can be different from the feature dimension $d+1$, e.g. polynomial reg: $d=1,k=4$

Linear Regression: lin. feature transforms

Extended Linear Regression: non-lin. feature transforms
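
A minimal sketch of the polynomial case above ($d=1$, $k=4$), assuming NumPy; the function names `poly_basis` and `h` are illustrative, not from the lecture:

```python
import numpy as np

def poly_basis(x, k=4):
    """phi: R -> R^k, here [1, x, x^2, x^3]; phi_0(x) = 1 supplies the bias."""
    return np.array([x ** j for j in range(k)])

def h(theta, x):
    """Hypothesis linear in theta: theta^T phi(x)."""
    return theta @ poly_basis(x, k=len(theta))

theta = np.array([1.0, 0.0, 2.0, 0.0])  # represents h(x) = 1 + 2x^2
print(h(theta, 3.0))  # 1 + 2*9 = 19.0
```

The hypothesis is nonlinear in $x$ but still linear in $\vec\theta$, which is why it remains "linear regression."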

Generalization

  • more complex is not always better - overfitting (on training data)

generalization - ability of ML model to make good predictions on unseen (test) data

  • the LSE loss we looked at so far is ON TRAINING DATA - empirical risk minimization (ERM)
  • does ERM generalize to unseen $x$?
  • theoretically
    • depends on hypothesis class, data size, learning algo → learning theory
  • empirically
    • can assess via validation data
  • algorithmically
    • can strengthen via regularization
  • underfitting - hypothesis is not expressive/complex enough for the data
  • overfitting - hypothesis is too complex for the data
  • hypothesis complexity - hard to define in general; for polynomial regression, the degree is a proxy
    • a degree-$(n-1)$ polynomial can exactly interpolate a size-$n$ dataset, reaching 0 training loss
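
A quick sketch of the interpolation point, assuming NumPy: with 5 points, a degree-4 polynomial has 5 coefficients and drives the training loss to (numerically) zero regardless of the targets.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 5)   # n = 5 training points
y = rng.normal(size=5)      # arbitrary targets

# a degree-(n-1) = 4 polynomial has 5 coefficients: it interpolates all 5 points
coeffs = np.polyfit(x, y, deg=4)
max_err = np.max(np.abs(np.polyval(coeffs, x) - y))
print(max_err)  # ~0 up to floating point
```

Zero training loss here says nothing about predictions between or beyond the 5 points, which is exactly the overfitting concern.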

Option 1: Cross Validation (validation split)

  • train, validation, test - split

  • if test and validation data come from a similar distribution, good validation performance indicates generalization

$k$-fold Cross Validation

  • partition dataset of $n$ instances into $k$ disjoint folds (subsets)
  • choose fold $i\in[1,k]$ as the validation set
  • train on the $k-1$ remaining folds and evaluate accuracy on fold $i$
  • compute the average over all $k$ folds, or choose the model that performs best on its validation fold
  • “leave-one-out”: $k=n$
  • visual
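
The steps above can be sketched as follows, assuming NumPy; `fit` and `score` are user-supplied placeholders, not lecture notation:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Partition n instance indices into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, k, fit, score):
    """Train on k-1 folds, evaluate on the held-out fold i, average over folds."""
    folds = k_fold_indices(len(y), k)
    vals = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        vals.append(score(model, X[folds[i]], y[folds[i]]))
    return float(np.mean(vals))

# toy usage: ordinary least squares on noiseless linear data -> ~0 validation MSE
X = np.c_[np.ones(20), np.linspace(0, 1, 20)]
y = X @ np.array([1.0, 2.0])
fit = lambda Xtr, ytr: np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
mse = lambda th, Xv, yv: float(np.mean((Xv @ th - yv) ** 2))
print(cross_validate(X, y, k=5, fit=fit, score=mse))
```

Setting $k=n$ gives leave-one-out: each instance is its own validation fold.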

Option 2: Regularization

  • eliminating features or getting more data also combats overfitting - these regularize via the dataset
  • loss regularization - method to prevent overfitting by controlling the complexity of the learned hypothesis
  • penalize large weight magnitudes during optimization by adding a term to the loss

Ridge Regression: $\ell_2$-regularized

  • $\lambda\ge 0$ is the regularization hyperparameter → appends the squared L2 norm of the weights onto the loss → when minimizing, we now try to minimize the regularization term too

$\sum_{j=1}^d\theta_j^2=\|\bm\theta_{1:d}\|_2^2=\|\bm\theta_{1:d}-\vec 0\|_2^2$

  • pulls weights towards the origin (minimizes)
  • vectorized
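
A vectorized sketch of the ridge objective and its gradient, assuming NumPy; the bias $\theta_0$ is left unpenalized, and the names and data here are illustrative:

```python
import numpy as np

def ridge_loss(theta, X, y, lam):
    """||X theta - y||_2^2 + lam * ||theta_{1:d}||_2^2 (bias theta_0 not penalized)."""
    resid = X @ theta - y
    return resid @ resid + lam * np.sum(theta[1:] ** 2)

def ridge_grad(theta, X, y, lam):
    """Gradient: 2 X^T (X theta - y), plus 2 lam theta on the non-bias entries."""
    g = 2 * X.T @ (X @ theta - y)
    g[1:] += 2 * lam * theta[1:]
    return g

# larger lambda pulls the learned weights toward the origin
rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = X @ np.array([0.5, 3.0, -2.0]) + 0.1 * rng.normal(size=50)

def fit(lam, steps=3000, lr=5e-4):
    theta = np.zeros(3)
    for _ in range(steps):
        theta = theta - lr * ridge_grad(theta, X, y, lam)
    return theta

print(np.linalg.norm(fit(0.0)[1:]), np.linalg.norm(fit(50.0)[1:]))
```

Running gradient descent with a large $\lambda$ yields a visibly smaller weight norm, which is the "pulls weights towards the origin" effect.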

Hyperparameters

  • additional unknowns (other than weights) for improving learning - $\lambda$, $\alpha$
  • model hyperparameters - influence representation

    • hypothesis class $\mathcal H$
    • basis function $\phi$
  • algorithmic hyperparameters - influence training

    • learning rate $\alpha$
    • regularization coefficient $\lambda$
    • batch size $B$
  • model selection: the best hyperparams are the ones that help generalize → evaluate based on validation loss
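
A small model-selection sketch, assuming NumPy: hold out a validation split and pick the $\lambda$ with the lowest validation loss. The helper `ridge_closed_form` is hypothetical (it uses the standard closed-form ridge solution and, for brevity, penalizes all weights including the bias):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Hypothetical helper: (X^T X + lam I)^{-1} X^T y (penalizes all weights)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = np.c_[np.ones(60), rng.normal(size=(60, 3))]
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + 0.5 * rng.normal(size=60)

# hold out a validation split; pick the lambda with the lowest validation loss
Xtr, ytr, Xval, yval = X[:40], y[:40], X[40:], y[40:]
val_loss = lambda lam: float(np.mean((Xval @ ridge_closed_form(Xtr, ytr, lam) - yval) ** 2))
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=val_loss)
print(best_lam)
```

The same loop works for any hyperparameter ($\alpha$, $B$, choice of $\phi$): train with each candidate, compare validation loss, keep the winner.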

Discussion

Linear Basis Function Models

$\phi(\bm x):\R^d\to\R^k\quad \bm \theta\in\R^k$

$h_{\bm\theta}(\bm x)=\bm\theta^T\phi(\bm x)=\sum_{j=0}^k\theta_j\phi_j(\bm x)$

  • examples: polynomial and gaussian

Regularized Linear (Ridge) Regression

  • Closed form
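
The closed form follows from setting the gradient of the regularized objective to zero (standard result; written here with $\lambda$ penalizing all components, including the bias, for brevity):

$\nabla_{\bm\theta}\left[\|\Phi\bm\theta-\bm y\|_2^2+\lambda\|\bm\theta\|_2^2\right]=2\Phi^T(\Phi\bm\theta-\bm y)+2\lambda\bm\theta=\vec 0$

$\Rightarrow\quad\bm\theta^*=(\Phi^T\Phi+\lambda I)^{-1}\Phi^T\bm y$

where $\Phi\in\R^{n\times k}$ stacks the basis features of the $n$ training points; any $\lambda>0$ makes $\Phi^T\Phi+\lambda I$ invertible, so the solution always exists.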

Resources


📌

**SUMMARY**