02 - Generalization - lec. 3,4
ucla | CS M146 | 2023-04-10T14:02
Table of Contents
- Supplemental
- Lecture
- Linear Basis
- Linear Regression: lin. feature transforms
- Extended Linear Regression: non-lin. feature transforms
- Generalization
- generalization - ability of ML model to make good predictions on unseen (test) data
- Option 1: Cross Validation (validation split)
- $k$-fold Cross Validation
- Option 2: Regularization
- Ridge Regression: $\ell_2$-regularized
- Hyperparameters
- Discussion
- Resources
Supplemental
- dimensionality - the hypothesis is a line or hyperplane, depending on the dimensionality of the features
- linear or extended linear - depending on the feature transformations (polynomial, root, etc.)
Lecture
- the term linear is generally reserved for functions linear w.r.t. the weights, so we can feed nonlinear transformations of the inputs/features into linear regression
Linear Basis
Linear Basis Function Models
$h_{\vec\theta}(\vec x)=\vec\theta^T\phi(\vec x)=\sum_{j=0}^k\theta_j\phi_j(\vec x)$
- $\phi(\vec x):\R^d\to\R^k$ is a $k$-dimensional basis w/ params $\vec\theta\in\R^k$
- $\phi_0(\vec x)=1$ usually, so the first param is still the bias
$k$ can be different from the feature dimension $d+1$, e.g. polynomial reg.: $d=1,k=4$
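As a sketch (not from the lecture), a degree-3 polynomial basis for scalar inputs gives $d=1$, $k=4$, and predictions are just $\vec\theta^T\phi(\vec x)$:

```python
import numpy as np

# Hypothetical illustration: degree-3 polynomial basis phi(x) = [1, x, x^2, x^3]
# for scalar inputs (d = 1, k = 4), so h_theta(x) = theta^T phi(x).
def poly_basis(x, degree=3):
    """Map n scalar inputs to an (n, degree+1) design matrix of basis values."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(degree + 1)]).T

X = poly_basis([0.0, 1.0, 2.0])          # each row is phi(x) for one input
theta = np.array([1.0, 0.0, 2.0, 0.0])   # example weights (hypothetical)
preds = X @ theta                        # vectorized h_theta(x) = theta^T phi(x)
```

Note that the first column of `X` is all ones ($\phi_0(\vec x)=1$), so $\theta_0$ still acts as the bias.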

Linear Regression: lin. feature transforms
Extended Linear Regression: non-lin. feature transforms
Generalization
a more complex hypothesis is not always better - it can overfit (the training data)

generalization - ability of ML model to make good predictions on unseen (test) data
- the LSE loss we looked at so far is ON TRAINING DATA - empirical risk minimization (ERM)
- does ERM generalize to unseen $x$?
- theoretically
- depends on hypothesis class, data size, learning algo → learning theory
- empirically
- can assess via validation data
- algorithmically
- can strengthen via regularization
- underfitting - hypothesis is not expressive/complex enough for the data
- overfitting - hypothesis is too complex for the data
- hypothesis complexity - hard to define in general; for polynomial regression, the polynomial degree
- an $n$-degree polynomial can easily reach zero loss on a dataset of size $n$
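As a small sketch of this (assumed example, not from the lecture): with $n$ points, already a degree-$(n-1)$ polynomial has enough parameters to interpolate them exactly, driving training loss to zero while fitting the noise:

```python
import numpy as np

# Hypothetical sketch: with n = 5 training points, a degree-4 polynomial
# can interpolate them exactly (zero training loss), overfitting the noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 5)                                   # n = 5 inputs
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(5)   # noisy targets

coeffs = np.polyfit(x, y, deg=4)          # degree n-1 = 4 fits n = 5 points
residual = y - np.polyval(coeffs, x)
train_loss = np.sum(residual**2)          # essentially zero on training data
```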
Option 1: Cross Validation (validation split)
train, validation, test - split

if the test data is similar to the validation data (similar distribution), low validation loss indicates generalization
$k$-fold Cross Validation
- partition dataset of $n$ instances into $k$ disjoint folds (subsets)
- choose fold $i\in[1,k]$ as the validation set
- train on the remaining $k-1$ folds and evaluate accuracy on fold $i$
- compute the average over the $k$ folds, or choose the best model from a particular fold
- “leave-one-out”: $k=n$
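The steps above can be sketched as follows (a minimal assumed example where the "model" just predicts the training-fold mean):

```python
import numpy as np

# Minimal k-fold cross-validation sketch. The "model" here is hypothetical:
# it predicts the mean of its training folds.
def k_fold_mse(y, k):
    """Average validation MSE of a constant (mean) predictor over k folds."""
    folds = np.array_split(np.arange(len(y)), k)   # k disjoint index folds
    losses = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = y[train_idx].mean()                          # train on k-1 folds
        losses.append(np.mean((y[val_idx] - pred) ** 2))    # evaluate on fold i
    return np.mean(losses)                                  # average over folds

y = np.array([1.0, 2.0, 3.0, 4.0])
loo = k_fold_mse(y, k=len(y))   # k = n gives leave-one-out
```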
visual



Option 2: Regularization
- eliminating features or getting more data regularizes via the dataset itself
- loss regularization - method to prevent overfitting by controlling complexity of the learned hypothesis
- penalize large (absolute) weights during optimization → added to the loss
Ridge Regression: $\ell_2$-regularized
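The original notes embed the objective as an image; as a sketch in this notation, the standard $\ell_2$-regularized least-squares objective (the $\frac12$ factors are a common convention) is:

$J(\bm\theta)=\frac{1}{2}\sum_{i=1}^n\left(\bm\theta^T\phi(\bm x^{(i)})-y^{(i)}\right)^2+\frac{\lambda}{2}\sum_{j=1}^d\theta_j^2$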
- $\lambda\ge 0$ is the regularization hyperparameter → appends the squared $\ell_2$ norm of the weights onto the loss → when minimizing, we now try to minimize the regularization term too
$\sum_{j=1}^d\theta_j^2=\|\bm\theta_{1:d}\|_2^2=\|\bm\theta_{1:d}-\vec 0\|_2^2$
- pulls the weights towards the origin
- vectorized
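As a hedged sketch (not the lecture's slide), the vectorized closed-form ridge solution $\bm\theta=(\Phi^T\Phi+\lambda I)^{-1}\Phi^T\bm y$ can be implemented directly with NumPy; for simplicity this version regularizes the bias $\theta_0$ too:

```python
import numpy as np

# Hypothetical sketch of the vectorized ridge solution:
#   theta = (Phi^T Phi + lambda I)^(-1) Phi^T y
# (for simplicity the bias theta_0 is regularized here as well).
def ridge_fit(Phi, y, lam):
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

# With lam = 0 this reduces to ordinary least squares.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # bias + one feature
y = np.array([0.0, 1.0, 2.0])                         # exactly y = x
theta = ridge_fit(Phi, y, lam=0.0)
```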
Hyperparameters
- additional unknowns (other than the weights) that affect learning - e.g. $\lambda$, $\alpha$
model hyperparameters - influence representation
- hypothesis class $\mathcal H$
- basis function $\phi$

algorithmic hyperparameters - influence training
- learning rate $\alpha$
- regularization coefficient $\lambda$
- batch size $B$

- model selection: the best hyperparams are the ones that help generalize → evaluate based on validation loss
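A minimal model-selection sketch under these assumptions (hypothetical data; closed-form ridge fit on the train split, then pick the $\lambda$ with the lowest validation loss):

```python
import numpy as np

# Hypothetical sketch: choose lambda by validation loss.
# Ridge closed form: theta = (Phi^T Phi + lam I)^(-1) Phi^T y.
def val_loss(Phi_tr, y_tr, Phi_val, y_val, lam):
    k = Phi_tr.shape[1]
    theta = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(k), Phi_tr.T @ y_tr)
    return np.mean((Phi_val @ theta - y_val) ** 2)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = x + 0.1 * rng.standard_normal(40)          # near-linear synthetic data
Phi = np.vstack([x**j for j in range(6)]).T    # degree-5 polynomial basis
tr, va = slice(0, 30), slice(30, 40)           # train / validation split

lambdas = [0.0, 1e-3, 1e-1, 10.0]
losses = [val_loss(Phi[tr], y[tr], Phi[va], y[va], lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(losses))]     # model selection
```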
Discussion
Linear Basis Function Models
$\phi(\bm x):\R^d\to\R^k\quad \bm \theta\in\R^k$
$h_{\bm\theta}(\bm x)=\bm\theta^T\phi(\bm x)=\sum_{j=0}^k\theta_j\phi_j(\bm x)$
examples: polynomial and Gaussian bases

Regularized Linear (Ridge) Regression
Closed form
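As a sketch, the standard closed-form ridge solution (assuming $\Phi\in\R^{n\times k}$ is the design matrix with rows $\phi(\bm x^{(i)})^T$) is:

$\bm\theta^*=(\Phi^T\Phi+\lambda I)^{-1}\Phi^T\bm y$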

Resources
📌
**SUMMARY**





