02 - Generalization - lec. 3,4

ucla | CS M146 | 2023-04-10T14:02


Table of Contents

Supplemental

  • dimensionality - line or hyperplane, dependent on dimensionality of features
  • linear or extended linear - dependent on feature transformations (polynomial, root, etc.)

Lecture

  • the term linear is generally reserved for functions linear w.r.t. the weights, so we can feed nonlinear transformations of the inputs/features into linear regression

Linear Basis

Linear Basis Function Models

$h_{\vec\theta}(\vec x)=\vec\theta^T\phi(\vec x)=\sum_{j=0}^k\theta_j\phi_j(\vec x)$

  • $\phi(\vec x):\R^d\to\R^k$ is a k-dimensional basis w/ params $\vec\theta\in\R^k$
  • $\phi_0(\vec x)=1$ usually, so the first param is still the bias
  • $k$ can be different from the feature dimension $d+1$, e.g. polynomial reg: $d=1,k=4$

Linear Regression: lin. feature transforms

Extended Linear Regression: non-lin. feature transforms
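
A minimal sketch of the polynomial case above ($d=1$, $k=4$), assuming NumPy; the function names `poly_basis` and `h` are illustrative, not from the lecture:

```python
import numpy as np

def poly_basis(x, k=4):
    """phi: R -> R^k, here [1, x, x^2, x^3]; phi_0(x) = 1 supplies the bias."""
    return np.array([x ** j for j in range(k)])

def h(theta, x):
    """Hypothesis linear in theta: theta^T phi(x)."""
    return theta @ poly_basis(x, k=len(theta))

theta = np.array([1.0, 0.0, 2.0, 0.0])  # represents h(x) = 1 + 2x^2
print(h(theta, 3.0))  # 1 + 2*9 = 19.0
```

The hypothesis is nonlinear in $x$ but still linear in $\vec\theta$, which is why it remains "linear regression."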

Generalization

  • more complex is not always better - overfitting (on training data)

generalization - ability of ML model to make good predictions on unseen (test) data

  • the LSE loss we looked at so far is ON TRAINING DATA - empirical risk minimization (ERM)
  • does ERM generalize to unseen $x$?
  • theoretically
    • depends on hypothesis class, data size, learning algo → learning theory
  • empirically
    • can assess via validation data
  • algorithmically
    • can strengthen via regularization
  • underfitting - hypothesis is not expressive/complex enough for the data
  • overfitting - hypothesis is too complex for the data
  • hypothesis complexity - hard to define in general; for polynomial regression, the degree is a proxy
    • a degree-$(n-1)$ polynomial can exactly interpolate a size-$n$ dataset, reaching 0 training loss
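
A quick sketch of the interpolation point, assuming NumPy: with 5 points, a degree-4 polynomial has 5 coefficients and drives the training loss to (numerically) zero regardless of the targets.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 5)   # n = 5 training points
y = rng.normal(size=5)      # arbitrary targets

# a degree-(n-1) = 4 polynomial has 5 coefficients: it interpolates all 5 points
coeffs = np.polyfit(x, y, deg=4)
max_err = np.max(np.abs(np.polyval(coeffs, x) - y))
print(max_err)  # ~0 up to floating point
```

Zero training loss here says nothing about predictions between or beyond the 5 points, which is exactly the overfitting concern.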

Option 1: Cross Validation (validation split)

  • train, validation, test - split

  • if test and validation data come from a similar distribution, good validation performance indicates generalization

$k$-fold Cross Validation

  • partition dataset of $n$ instances into $k$ disjoint folds (subsets)
  • choose fold $i\in[1,k]$ as the validation set
  • train on the $k-1$ remaining folds and evaluate accuracy on fold $i$
  • compute the average over all $k$ folds, or choose the model that performs best on its validation fold
  • “leave-one-out”: $k=n$
  • visual
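
The steps above can be sketched as follows, assuming NumPy; `fit` and `score` are user-supplied placeholders, not lecture notation:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Partition n instance indices into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, k, fit, score):
    """Train on k-1 folds, evaluate on the held-out fold i, average over folds."""
    folds = k_fold_indices(len(y), k)
    vals = []
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        vals.append(score(model, X[folds[i]], y[folds[i]]))
    return float(np.mean(vals))

# toy usage: ordinary least squares on noiseless linear data -> ~0 validation MSE
X = np.c_[np.ones(20), np.linspace(0, 1, 20)]
y = X @ np.array([1.0, 2.0])
fit = lambda Xtr, ytr: np.linalg.lstsq(Xtr, ytr, rcond=None)[0]
mse = lambda th, Xv, yv: float(np.mean((Xv @ th - yv) ** 2))
print(cross_validate(X, y, k=5, fit=fit, score=mse))
```

Setting $k=n$ gives leave-one-out: each instance is its own validation fold.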

Option 2: Regularization

  • eliminating features or getting more data also combats overfitting - these regularize via the dataset
  • loss regularization - method to prevent overfitting by controlling the complexity of the learned hypothesis
  • penalize large weight magnitudes during optimization by adding a term to the loss

Ridge Regression: $\ell_2$-regularized

  • $\lambda\ge 0$ is the regularization hyperparameter → appends the squared L2 norm of the weights onto the loss → when minimizing, we now try to minimize the regularization term too

$\sum_{j=1}^d\theta_j^2=\|\bm\theta_{1:d}\|_2^2=\|\bm\theta_{1:d}-\vec 0\|_2^2$

  • pulls weights towards the origin (minimizes)
  • vectorized
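
A vectorized sketch of the ridge objective and its gradient, assuming NumPy; the bias $\theta_0$ is left unpenalized, and the names and data here are illustrative:

```python
import numpy as np

def ridge_loss(theta, X, y, lam):
    """||X theta - y||_2^2 + lam * ||theta_{1:d}||_2^2 (bias theta_0 not penalized)."""
    resid = X @ theta - y
    return resid @ resid + lam * np.sum(theta[1:] ** 2)

def ridge_grad(theta, X, y, lam):
    """Gradient: 2 X^T (X theta - y), plus 2 lam theta on the non-bias entries."""
    g = 2 * X.T @ (X @ theta - y)
    g[1:] += 2 * lam * theta[1:]
    return g

# larger lambda pulls the learned weights toward the origin
rng = np.random.default_rng(1)
X = np.c_[np.ones(50), rng.normal(size=(50, 2))]
y = X @ np.array([0.5, 3.0, -2.0]) + 0.1 * rng.normal(size=50)

def fit(lam, steps=3000, lr=5e-4):
    theta = np.zeros(3)
    for _ in range(steps):
        theta = theta - lr * ridge_grad(theta, X, y, lam)
    return theta

print(np.linalg.norm(fit(0.0)[1:]), np.linalg.norm(fit(50.0)[1:]))
```

Running gradient descent with a large $\lambda$ yields a visibly smaller weight norm, which is the "pulls weights towards the origin" effect.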

Hyperparameters

  • additional unknowns (other than weights) for improving learning - $\lambda$, $\alpha$
  • model hyperparameters - influence representation

    • hypothesis class $\mathcal H$
    • basis function $\phi$
  • algorithmic hyperparameters - influence training

    • learning rate $\alpha$
    • regularization coefficient $\lambda$
    • batch size $B$
  • model selection: the best hyperparams are the ones that help generalize → evaluate based on validation loss
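
A small model-selection sketch, assuming NumPy: hold out a validation split and pick the $\lambda$ with the lowest validation loss. The helper `ridge_closed_form` is hypothetical (it uses the standard closed-form ridge solution and, for brevity, penalizes all weights including the bias):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Hypothetical helper: (X^T X + lam I)^{-1} X^T y (penalizes all weights)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(2)
X = np.c_[np.ones(60), rng.normal(size=(60, 3))]
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + 0.5 * rng.normal(size=60)

# hold out a validation split; pick the lambda with the lowest validation loss
Xtr, ytr, Xval, yval = X[:40], y[:40], X[40:], y[40:]
val_loss = lambda lam: float(np.mean((Xval @ ridge_closed_form(Xtr, ytr, lam) - yval) ** 2))
candidates = [0.0, 0.1, 1.0, 10.0]
best_lam = min(candidates, key=val_loss)
print(best_lam)
```

The same loop works for any hyperparameter ($\alpha$, $B$, choice of $\phi$): train with each candidate, compare validation loss, keep the winner.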

Discussion

Linear Basis Function Models

$\phi(\bm x):\R^d\to\R^k\quad \bm \theta\in\R^k$

$h_{\bm\theta}(\bm x)=\bm\theta^T\phi(\bm x)=\sum_{j=0}^k\theta_j\phi_j(\bm x)$

  • examples: polynomial and gaussian

Regularized Linear (Ridge) Regression

  • Closed form
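
The closed form follows from setting the gradient of the regularized objective to zero (standard result; written here with $\lambda$ penalizing all components, including the bias, for brevity):

$\nabla_{\bm\theta}\left[\|\Phi\bm\theta-\bm y\|_2^2+\lambda\|\bm\theta\|_2^2\right]=2\Phi^T(\Phi\bm\theta-\bm y)+2\lambda\bm\theta=\vec 0$

$\Rightarrow\quad\bm\theta^*=(\Phi^T\Phi+\lambda I)^{-1}\Phi^T\bm y$

where $\Phi\in\R^{n\times k}$ stacks the basis features of the $n$ training points; any $\lambda>0$ makes $\Phi^T\Phi+\lambda I$ invertible, so the solution always exists.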

Resources


📌

**SUMMARY**