07 - Kernels - lec. 9,10
ucla | CS M146 | 2023-05-01T13:59
Table of Contents
Supplemental
Linear Model regularization review

Lecture
Feature Maps (Basis Functions)
any function $\phi:X\to\hat{X}$

e.g. for features $\bm x\in\R^2$, we have a feature map $\phi:\R^2\to\R^6$
$\phi(\bm x)=\phi([x_1,x_2])=[1,x_1,x_2,x_1^2,x_2^2,x_1x_2]$
- with this, we can still use a linear hypothesis class: $h_{\bm\theta}(\bm x)=\bm\theta^T\phi(\bm x)$
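As a concrete sketch (the feature map matches the notes; the particular $\bm\theta$ and $\bm x$ values are made up for illustration), lifting $\bm x$ with $\phi$ and then taking a dot product gives a hypothesis that is linear in $\bm\theta$ but nonlinear in $\bm x$:

```python
import numpy as np

def phi(x):
    """Quadratic feature map R^2 -> R^6 from the notes."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# linear hypothesis in the lifted space: h_theta(x) = theta^T phi(x)
theta = np.array([0.5, 1.0, -1.0, 0.0, 0.0, 2.0])  # arbitrary example weights
x = np.array([2.0, 3.0])
h = theta @ phi(x)  # nonlinear in x, linear in theta
```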
Gradient descent w/ feature maps
- each update requires computing the hypothesis → high-dimensional feature maps require more computation per update
- very expressive (high-dimensional) feature maps tend to overfit → need regularization
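A minimal sketch of a batch gradient step on the $\ell_2$-regularized squared loss, assuming the quadratic feature map above (the data, step size, and regularization strength are hypothetical). Note that every step recomputes $\phi(\bm x^{(i)})$ and the dot product for each example, so per-step cost grows with the dimension of the feature map:

```python
import numpy as np

def phi(x):
    # quadratic feature map from the notes (one concrete choice)
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

def gd_step(theta, X, y, lr, lam):
    """One batch gradient step on (1/n)*sum_i (theta^T phi(x_i) - y_i)^2 + lam*||theta||^2."""
    n = len(X)
    grad = 2.0 * lam * theta
    for xi, yi in zip(X, y):
        f = phi(xi)                         # recomputed every step, every example
        grad += (2.0 / n) * (theta @ f - yi) * f
    return theta - lr * grad

# tiny made-up dataset
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
y = np.array([1.0, 2.0, 0.0, 3.0])
theta = np.zeros(6)
for _ in range(200):
    theta = gd_step(theta, X, y, lr=0.05, lam=0.01)
```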

Regularization w/ feature maps
Linear Model regularization review

for linear models with possible feature maps, we can express them as
Representer Theorem
- used to efficiently train and evaluate expressive feature maps
- theorem (special case): for any $\lambda>0$, there exist real-valued coefficients $\beta_i\in\R$ (scalars) such that the minimizer of the regularized loss function is:
$\widehat{\bm\theta}=\sum_{i=1}^n\beta_i\phi(\bm x^{(i)})$
- i.e. the coefficients at the optimum are
$\beta_i=-\frac1{2\lambda n}L'\bigg(\widehat{\bm\theta}^T\phi(\bm x^{(i)}),y^{(i)}\bigg)$
proof
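The theorem can be checked numerically for ridge regression (squared loss, so $L'(a,y)=2(a-y)$): the closed-form ridge minimizer lies exactly in the span of the training features, with the $\beta_i$ given by the formula above. The random data and $\lambda$ are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 20, 5, 0.1
Phi = rng.normal(size=(n, d))   # row i is phi(x^(i))
y = rng.normal(size=n)

# minimizer of (1/n)*sum_i (theta^T phi_i - y_i)^2 + lam*||theta||_2^2
theta_hat = np.linalg.solve(Phi.T @ Phi + n * lam * np.eye(d), Phi.T @ y)

# representer theorem: beta_i = -(1/(2*lam*n)) * L'(theta_hat^T phi_i, y_i),
# and for squared loss L'(a, y) = 2*(a - y)
beta = -(1.0 / (2 * lam * n)) * 2.0 * (Phi @ theta_hat - y)
recon = Phi.T @ beta   # sum_i beta_i * phi(x^(i)) -- should equal theta_hat
```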

Reparametrized Risk
- representer theorem allows us to reparametrize the $\ell_2$-regularized loss:
$J(\bm\theta)=\frac1n\sum_{i=1}^nL\bigg(\bm\theta^T\phi(\bm x^{(i)}),y^{(i)}\bigg)+\lambda\|\bm\theta\|_2^2$
thus, by the representer theorem, we can substitute the minimizer $\widehat{\bm\theta}=\sum_{i=1}^n\beta_i\phi(\bm x^{(i)})$ into the loss function above to obtain a reparametrized objective (loss) function w.r.t. $\bm\beta$


- the kernel function can be precomputed for all training instances before training → we only optimize over $\bm\beta$ instead of recomputing the hypothesis (feature-space dot product) at each iteration
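A sketch of this workflow under squared loss, where the reparametrized objective is $J(\bm\beta)=\frac1n\|K\bm\beta-\bm y\|_2^2+\lambda\bm\beta^TK\bm\beta$ (the quadratic kernel, data, and step size are assumptions for illustration). The kernel matrix $K$ is built once up front; each gradient step then only touches $\bm\beta$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = rng.normal(size=30)

# precompute the kernel matrix once (hypothetical choice: quadratic kernel)
K = (X @ X.T) ** 2
n, lam, lr = len(X), 0.1, 1e-4

def J(b):
    """Reparametrized L2-regularized squared loss in terms of beta."""
    r = K @ b - y
    return r @ r / n + lam * b @ K @ b

beta = np.zeros(n)
for _ in range(500):
    # gradient of J w.r.t. beta; no feature maps computed inside the loop
    grad = (2.0 / n) * K @ (K @ beta - y) + 2.0 * lam * K @ beta
    beta -= lr * grad
```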
Kernel Trick
- thus, we can (in some situations) compute the kernel dot product faster than the feature map itself
Kernelized Linear Model
- apply the kernel trick: only compute the kernel function instead of the feature map → kernelized linear models
- thus, given a kernel function on a set of training instances of cardinality $m$, we can define a kernel matrix $K\in\R^{m\times m}$
$K_{ij}=k(\bm x^{(i)},\bm x^{(j)})$
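Building the kernel (Gram) matrix is a direct translation of the definition above; a small sketch with the quadratic kernel $(\bm u^T\bm v)^2$ as an example choice:

```python
import numpy as np

def kernel_matrix(X, k):
    """Gram matrix K[i, j] = k(x^(i), x^(j)) over m training instances."""
    m = len(X)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            K[i, j] = k(X[i], X[j])
    return K

# e.g. with the quadratic kernel k(u, v) = (u^T v)^2
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
K = kernel_matrix(X, lambda u, v: (u @ v) ** 2)
```

Since $k$ is symmetric in its arguments, $K$ is always a symmetric matrix.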
Kernelized Ridge Regression
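For squared loss the reparametrized objective has a closed-form minimizer; one common scaling (matching the $\frac1n$-averaged loss above) is $\bm\beta=(K+n\lambda I)^{-1}\bm y$, and predictions need only kernel evaluations, never $\phi$ explicitly. A hedged sketch:

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """Closed-form minimizer of the reparametrized ridge objective:
    beta = (K + n*lam*I)^{-1} y (scaling matches the (1/n)-averaged loss)."""
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(beta, X_train, x, k):
    """h(x) = sum_i beta_i * k(x^(i), x) -- no explicit feature map needed."""
    return sum(b * k(xi, x) for b, xi in zip(beta, X_train))
```

For any kernel of the form $k(\bm u,\bm v)=\phi(\bm u)^T\phi(\bm v)$, this produces exactly the same predictions as running ridge regression on the explicit features.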
Valid Kernels
- given a feature map, a kernel function is defined as the dot product of the feature map with itself → defines a notion of similarity between two vectors (data points)
- e.g. $K_{ij}=k(\bm x^{(i)},\bm x^{(j)})$ measures similarity between $\bm x^{(i)}$ and $\bm x^{(j)}$
- other forms of similarity include: Euclidean distance, cosine similarity, any similar kernel function
- but an arbitrary kernel function does NOT necessarily correspond to a feature map
Is a Kernel Valid?
Mercer’s Theorem
e.g. $k(\bm u,\bm v)=(\bm u^T\bm v)^2$
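This kernel is valid because it equals an explicit dot product: for $\bm x\in\R^2$, expanding $(u_1v_1+u_2v_2)^2$ shows $k(\bm u,\bm v)=\phi(\bm u)^T\phi(\bm v)$ with $\phi(\bm x)=[x_1^2,\sqrt2\,x_1x_2,x_2^2]$. A quick numerical check (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    # explicit feature map realizing k(u, v) = (u^T v)^2 for x in R^2
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
lhs = (u @ v) ** 2      # kernel trick: O(d) operations
rhs = phi(u) @ phi(v)   # explicit map: O(d^2) features for a degree-2 kernel
```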

Kernel Composition Rules
- algebraic operations on known valid kernels → result in valid kernels
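Two of the standard rules, sums and products of valid kernels, can be sanity-checked numerically: a valid kernel's Gram matrix is positive semi-definite, and the sum and elementwise (Hadamard) product of PSD Gram matrices stay PSD (the latter by the Schur product theorem). The random data here is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))
K1 = X @ X.T            # linear kernel (valid)
K2 = (X @ X.T) ** 2     # quadratic kernel (valid)

def is_psd(K, tol=1e-8):
    # Mercer: a valid kernel's Gram matrix has no negative eigenvalues
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

sum_ok = is_psd(K1 + K2)    # k1 + k2 is a valid kernel
prod_ok = is_psd(K1 * K2)   # k1 * k2 is a valid kernel (Schur product theorem)
```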
Example: Gaussian Kernel (RBF)
$k(\bm x^{(i)}, \bm x^{(j)})=\exp\bigg(-\frac{\|\bm x^{(i)}-\bm x^{(j)}\|_2^2}{2\sigma^2}\bigg)$
- has value 1 when $\bm x^{(i)}=\bm x^{(j)}$
- value falls off to 0 w/ increasing distance
- note: needs feature scaling before using RBF
- $\sigma$ is the bandwidth hyperparameter → determines how quickly similarity decays as the L2 distance grows
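A small sketch of the RBF kernel and the bandwidth's effect (the example points are made up; $\|\bm u-\bm v\|_2=5$):

```python
import numpy as np

def rbf(u, v, sigma):
    """Gaussian (RBF) kernel with bandwidth sigma."""
    d2 = float(np.sum((u - v) ** 2))
    return np.exp(-d2 / (2.0 * sigma ** 2))

u = np.array([0.0, 0.0])
v = np.array([3.0, 4.0])        # ||u - v||_2 = 5
same = rbf(u, u, sigma=1.0)     # exactly 1.0 at zero distance
wide = rbf(u, v, sigma=10.0)    # large bandwidth: still fairly similar
narrow = rbf(u, v, sigma=1.0)   # small bandwidth: similarity is nearly 0
```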
Discussion
Resources
📌
**SUMMARY**








