02 - Generalization - lec. 3,4

ucla | CS M146 | 2023-04-10T14:02


Supplemental

  • dimensionality - whether the hypothesis is a line or a hyperplane depends on the dimensionality of the features
  • linear or extended linear - depends on the feature transformations used (polynomial, root, etc.)

Lecture

  • the term linear is generally reserved for functions that are linear w.r.t. the weights, so we can feed non-linear transformations of the inputs/features into linear regression

Linear Basis

Linear Basis Function Models

$h_\theta(x)=\theta^T\phi(x)=\sum_{j=0}^{k}\theta_j\phi_j(x)$

  • $\phi(x): \mathbb{R}^d \to \mathbb{R}^k$ is a k-dimensional basis w/ params $\theta \in \mathbb{R}^k$
  • $\phi_0(x)=1$ usually, so the first param is still the bias
  • k can be different from the feature dimension d+1, e.g. polynomial reg.: d=1, k=4

Linear Regression: lin. feature transforms

Extended Linear Regression: non-lin. feature transforms
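
A minimal sketch of a polynomial basis fit under the d=1, k=4 example above (assuming NumPy; the toy data and the `poly_basis` helper are made up for illustration):

```python
import numpy as np

def poly_basis(x, degree=3):
    """Map 1-D inputs x (shape (n,)) to phi(x) = [1, x, x^2, ..., x^degree], shape (n, degree+1)."""
    return np.vstack([x**j for j in range(degree + 1)]).T

# toy data, assumed for illustration
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.shape)

Phi = poly_basis(x, degree=3)                    # n x k design matrix; first column is the bias feature
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares fit of theta
y_hat = Phi @ theta                              # h_theta(x) = theta^T phi(x), vectorized over the dataset
```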

Generalization

  • more complex is not always better - a complex hypothesis can overfit (the training data)

generalization - ability of ML model to make good predictions on unseen (test) data

  • the LSE loss we looked at so far is ON TRAINING DATA - empirical risk minimization (ERM)
  • does ERM generalize to unseen x?
  • theoretically
    • depends on hypothesis class, data size, learning algo → learning theory
  • empirically
    • can assess via validation data
  • algorithmically
    • can strengthen via regularization
  • underfitting - hypothesis is not expressive/complex enough for the data
  • overfitting - hypothesis is too complex for the data
  • hypothesis complexity - hard to define in general; for regression, polynomial degree is a proxy
    • a degree-n polynomial can easily reach 0 training loss on a dataset of size n (quick check below)
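
A quick numeric check of that last point (NumPy, with an arbitrary made-up dataset):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x = np.linspace(0, 1, n)
y = rng.standard_normal(n)                       # arbitrary targets

# degree-n polynomial: n+1 parameters, only n data points
Phi = np.vstack([x**j for j in range(n + 1)]).T
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

train_loss = np.mean((Phi @ theta - y) ** 2)
print(train_loss)                                # ~0: the polynomial interpolates the training set
```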

Option 1: Cross Validation (validation split)

  • train, validation, test - split

  • if the test data is similar to the validation data (similar distribution), good validation performance indicates generalization

k-fold Cross Validation

  • partition dataset of n instances into k disjoint folds (subsets)
  • choose fold $i \in [1,k]$ as the validation set
  • train on the $k-1$ remaining folds and evaluate accuracy on fold $i$ (a sketch follows below)
  • compute the average over the k folds, or choose the best model from a particular fold
  • “leave-one-out”: k=n
  • visual
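
A sketch of the k-fold loop (assuming NumPy; `fit` and `loss` are placeholder helpers, not course-provided code):

```python
import numpy as np

def k_fold_cv(X, y, k, fit, loss):
    """Average validation loss over k disjoint folds.

    Assumes fit(X_train, y_train) -> model and loss(model, X_val, y_val) -> float.
    """
    n = X.shape[0]
    idx = np.random.permutation(n)
    folds = np.array_split(idx, k)          # partition indices into k disjoint folds
    scores = []
    for i in range(k):
        val = folds[i]                      # fold i is the validation set
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # remaining k-1 folds
        model = fit(X[train], y[train])
        scores.append(loss(model, X[val], y[val]))
    return np.mean(scores)                  # average over the k folds; k = n gives leave-one-out
```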

Option 2: Regularization

  • eliminating features or getting more data also helps - they effectively regularize the dataset
  • loss regularization - a method to prevent overfitting by controlling the complexity of the learned hypothesis
  • penalize large weight magnitudes during optimization → add a penalty term to the loss

Ridge Regression: $\ell_2$-regularized

  • $\lambda \geq 0$ is the regularization hyperparameter → appends the squared L2 norm of the weights onto the loss → when minimizing, we now try to minimize the regularization term too

$\sum^d_{j=1}\theta_j^2=\|\bm\theta_{1:d}\|_2^2=\|\bm\theta_{1:d}-\vec 0\|_2^2$

  • pulls the weights towards the origin (shrinks their magnitudes)
  • vectorized
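
One common vectorized form (assuming a design matrix $\Phi\in\mathbb{R}^{n\times k}$ whose rows are $\phi(\bm x^{(i)})^T$; whether the bias $\theta_0$ is excluded from the penalty follows the convention above):

$J(\bm\theta)=\|\Phi\bm\theta-\bm y\|_2^2+\lambda\|\bm\theta_{1:d}\|_2^2$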

Hyperparameters

  • additional unknowns (other than weights) for improving learning - λ, α
  • model hyperparameters - influence representation

    • hypothesis class H
    • basis function ϕ
  • algorithmic hyperparameters - influence training

    • learning rate α
    • regularization coefficient λ
    • batch size B
  • model selection: the best hyperparams are the ones that help generalize → eval based on validation loss (sketch below)
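
A minimal model-selection sketch (NumPy; the toy data, the split, and the `ridge_fit` helper are assumptions for illustration — here the penalty is applied to all weights, including the bias, for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution (regularizes every weight, incl. bias, for brevity)."""
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

def mse(theta, Phi, y):
    return np.mean((Phi @ theta - y) ** 2)

# toy data and train/validation split, made up for illustration
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(30)
Phi = np.vstack([x**j for j in range(6)]).T      # degree-5 polynomial basis
Phi_tr, y_tr, Phi_val, y_val = Phi[:20], y[:20], Phi[20:], y[20:]

# model selection: keep the lambda with the lowest validation loss
lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0]
best_lam = min(lambdas, key=lambda lam: mse(ridge_fit(Phi_tr, y_tr, lam), Phi_val, y_val))
```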

Discussion

Linear Basis Function Models

$\phi(\bm x): \mathbb{R}^d \to \mathbb{R}^k,\quad \bm\theta \in \mathbb{R}^k$

$h_{\bm\theta}(\bm x)=\bm\theta^T\phi(\bm x)=\sum_{j=0}^{k}\theta_j\phi_j(\bm x)$

  • examples: polynomial and Gaussian bases (see below)
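
For reference, one standard parametrization of a Gaussian basis (the centers $\bm\mu_j$ and width $s$ are design choices; this exact form is an assumption, not from the slides):

$\phi_j(\bm x)=\exp\!\left(-\frac{\|\bm x-\bm\mu_j\|_2^2}{2s^2}\right)$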

Regularized Linear (Ridge) Regression

  • Closed form
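
For reference, the standard closed form (assuming design matrix $\Phi$ and the penalty applied to all of $\bm\theta$; if the bias is excluded, the corresponding diagonal entry of $I$ is set to 0):

$\bm\theta^*=(\Phi^T\Phi+\lambda I)^{-1}\Phi^T\bm y$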

Resources


📌

**SUMMARY**