07 - Kernels - lec. 9,10

ucla | CS M146 | 2023-05-01T13:59


Table of Contents

Supplemental

  • Linear Model regularization review

Lecture

Feature Maps (Basis Functions)

  • any function $\phi:X\to\hat{X}$

  • e.g. for features $\bm x\in\R^2$, we have a feature map $\phi:\R^2\to\R^6$

$\phi(\bm x)=\phi([x_1,x_2])=[1,x_1,x_2,x_1^2,x_2^2,x_1x_2]$

  • with this, we can still use a linear hypothesis class: $h_{\bm\theta}(\bm x)=\bm\theta^T\phi(\bm x)$
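
  • a minimal sketch of this (using numpy; the helper names `phi` and `h_theta` are illustrative, not from the lecture): the hypothesis stays linear in $\bm\theta$ even though it is nonlinear in $\bm x$

```python
import numpy as np

def phi(x):
    # quadratic feature map for x in R^2: [1, x1, x2, x1^2, x2^2, x1*x2]
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

def h_theta(theta, x):
    # linear hypothesis evaluated in the mapped feature space
    return theta @ phi(x)

theta = np.ones(6)                       # one parameter per feature-map dimension
print(h_theta(theta, np.array([2.0, 3.0])))
```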

Gradient descent w/ feature maps

  • each update requires computing the hypothesis → high-dimensional feature maps require more computation
  • very expressive (high-dimensional) feature maps tend to overfit → need regularization

Regularization w/ feature maps

  • Linear Model regularization review

  • for linear models with possible feature maps, we can express the $\ell_2$-regularized objective as $J(\bm\theta)=\frac1n\sum_{i=1}^nL\big(\bm\theta^T\phi(\bm x^{(i)}),y^{(i)}\big)+\lambda|\bm\theta|_2^2$

Representer Theorem

  • used to efficiently train and evaluate expressive feature maps
  • theorem (special case): for any $\lambda>0$, there exist real-valued coefficients $\beta_i\in\R$ (scalars) such that the minimizer of the regularized loss function can be written as a weighted sum of the training feature vectors
  • i.e. the optimal parameter vector for the regularized loss is

$\widehat{\bm\theta}=\sum_{i=1}^n\beta_i\phi(\bm x^{(i)})$

$\beta_i=-\frac1{2\lambda n}L'\bigg(\widehat{\bm\theta}^T\phi(\bm x^{(i)}),y^{(i)}\bigg)$

  • proof

Reparametrized Risk

  • the representer theorem allows us to reparametrize the $\ell_2$-regularized loss:

$J(\bm\theta)=\frac1n\sum_{i=1}^nL\bigg(\bm\theta^T\phi(\bm x^{(i)}),y^{(i)}\bigg)+\lambda|\bm\theta|_2^2$

  • thus, by the representer theorem, we can substitute the minimizer $\widehat{\bm\theta}=\sum_i\beta_i\phi(\bm x^{(i)})$ into the loss function above to obtain a reparametrized objective (loss) function w.r.t. $\bm\beta$ (see the sketch after this list)

  • the kernel function can be precomputed for all pairs of training instances during training → we only optimize over $\bm\beta$ instead of computing the hypothesis (dot product in feature space) at each iteration
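
  • a sketch of that substitution (following the representer-theorem expression for $\widehat{\bm\theta}$ above): the objective depends on the data only through pairwise dot products of feature vectors

$J(\bm\beta)=\frac1n\sum_{i=1}^nL\bigg(\sum_{j=1}^n\beta_j\,\phi(\bm x^{(j)})^T\phi(\bm x^{(i)}),y^{(i)}\bigg)+\lambda\sum_{i=1}^n\sum_{j=1}^n\beta_i\beta_j\,\phi(\bm x^{(i)})^T\phi(\bm x^{(j)})$

  • writing $k(\bm x^{(i)},\bm x^{(j)})=\phi(\bm x^{(i)})^T\phi(\bm x^{(j)})$ turns this into a function of $\bm\beta$ and kernel values only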

Kernel Trick

  • thus, we can (in some situations) compute the kernel $k(\bm u,\bm v)=\phi(\bm u)^T\phi(\bm v)$ directly, faster than computing the feature map itself and then taking the dot product
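
  • a small numeric sketch of the trick (an assumed example, using the quadratic kernel $k(\bm u,\bm v)=(\bm u^T\bm v)^2$ that also appears under Mercer's theorem below): the kernel is one $O(d)$ dot product, while the explicit feature map has $d^2$ entries

```python
import numpy as np

def phi(x):
    # explicit feature map for k(u, v) = (u^T v)^2: all pairwise products x_i * x_j (d^2 entries)
    return np.outer(x, x).ravel()

def k(u, v):
    # kernel trick: the same value from a single O(d) dot product
    return (u @ v) ** 2

rng = np.random.default_rng(0)
u, v = rng.normal(size=5), rng.normal(size=5)
print(np.isclose(phi(u) @ phi(v), k(u, v)))   # True
```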

Kernelized Linear Model

  • apply the kernel trick: only compute the kernel function instead of the feature map → kernelized linear models
  • thus, given a kernel function on a set of training instances of cardinality $m$, we can define a kernel matrix $K\in\R^{m\times m}$

$K_{ij}=k(\bm x^{(i)},\bm x^{(j)})$
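
  • a minimal sketch of building this matrix (assumed helper name `kernel_matrix`; any valid kernel function can be passed in)

```python
import numpy as np

def kernel_matrix(X, kernel):
    # X: (m, d) array of training instances; kernel: function of two vectors
    m = X.shape[0]
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = kernel(X[i], X[j])
    return K

X = np.random.default_rng(1).normal(size=(4, 2))
K = kernel_matrix(X, lambda u, v: (u @ v) ** 2)   # quadratic kernel from above
```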

Kernelized Ridge Regression
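
  • a sketch, assuming squared loss $L(z,y)=(z-y)^2$, the reparametrized objective above, and an invertible kernel matrix $K$ over the $n$ training instances: setting the gradient w.r.t. $\bm\beta$ to zero gives the closed-form solution

$\widehat{\bm\beta}=(K+n\lambda I)^{-1}\bm y,\qquad h(\bm x)=\sum_{i=1}^n\widehat\beta_i\,k(\bm x^{(i)},\bm x)$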

Valid Kernels

  • given a feature map, the corresponding kernel function is the dot product of the mapped inputs, $k(\bm u,\bm v)=\phi(\bm u)^T\phi(\bm v)$ → defines a notion of similarity between the two vectors (data points)
  • e.g. $K_{ij}=k(\bm x^{(i)},\bm x^{(j)})$ measures the similarity between $\bm x^{(i)}$ and $\bm x^{(j)}$
  • other notions of similarity include: Euclidean distance, cosine similarity, any other valid kernel function
  • but an arbitrary kernel-like function does NOT necessarily correspond to a feature map

Is a Kernel Valid?

Mercer’s Theorem

  • e.g. $k(\bm u,\bm v)=(\bm u^T\bm v)^2$
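
  • a worked check (for 2-dimensional inputs): this kernel is valid because it can be written as the dot product of an explicit feature map

$(\bm u^T\bm v)^2=(u_1v_1+u_2v_2)^2=u_1^2v_1^2+2u_1u_2v_1v_2+u_2^2v_2^2=\phi(\bm u)^T\phi(\bm v),\qquad\phi(\bm u)=[u_1^2,\ \sqrt2\,u_1u_2,\ u_2^2]$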

Kernel Composition Rules

  • algebraic operations on known valid kernels → result in valid kernels
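
  • for example (standard composition rules; this particular list is not from the lecture slides): if $k_1,k_2$ are valid kernels, $c>0$ is a constant, and $f$ is any real-valued function, then the following are also valid kernels

$c\,k_1(\bm u,\bm v),\qquad k_1(\bm u,\bm v)+k_2(\bm u,\bm v),\qquad k_1(\bm u,\bm v)\,k_2(\bm u,\bm v),\qquad f(\bm u)\,k_1(\bm u,\bm v)\,f(\bm v),\qquad \exp\!\big(k_1(\bm u,\bm v)\big)$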

Example: Gaussian Kernel (RBF)

$k(\bm x^{(i)}, \bm x^{(j)})=\exp\bigg(-\frac{|\bm x^{(i)}-\bm x^{(j)}|_2^2}{2\sigma^2}\bigg)$

  • has value 1 when $\bm x^{(i)}=\bm x^{(j)}$
  • value falls off toward 0 with increasing distance
  • note: features need to be scaled before using the RBF kernel
  • $\sigma$ is the bandwidth hyperparameter → determines how quickly the kernel value decays as the L2 distance grows (larger $\sigma$ → slower decay)
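
  • a small sketch (assumed function name `rbf`) showing how the bandwidth controls the decay

```python
import numpy as np

def rbf(x, z, sigma):
    # Gaussian (RBF) kernel: 1 at x == z, decays toward 0 with squared L2 distance
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x, z = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for sigma in (0.5, 1.0, 2.0):
    print(sigma, rbf(x, z, sigma))   # larger sigma -> slower decay, larger kernel value
```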

Discussion

Resources


📌

**SUMMARY**