07 - Kernels - lec. 9,10

ucla | CS M146 | 2023-05-01T13:59


Table of Contents

Supplemental

  • Linear Model regularization review

Lecture

Feature Maps (Basis Functions)

  • any function $\phi : \mathcal{X} \to \hat{\mathcal{X}}$ (from the input space to a feature space)

  • e.g. for features $\bm{x} \in \R^2$, we have a feature map $\phi : \R^2 \to \R^6$

$$\phi(\bm{x}) = \phi([x_1, x_2]) = [\,1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2\,]$$

  • with this, we can still use a linear hypothesis class: $h_{\bm\theta}(\bm{x}) = \bm\theta^T \phi(\bm{x})$ (sketched below)
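A minimal Python sketch (my own illustration, not from the lecture) of this degree-2 feature map and the linear hypothesis on top of it; the names `phi` and `hypothesis` are made up for the example:

```python
import numpy as np

def phi(x):
    """Degree-2 feature map for x = [x1, x2] in R^2, producing 6 features."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

def hypothesis(theta, x):
    """Linear hypothesis in feature space: h_theta(x) = theta^T phi(x)."""
    return theta @ phi(x)

# usage: theta lives in R^6 even though the raw input is 2-dimensional
theta = np.ones(6)
x = np.array([0.5, -1.0])
print(hypothesis(theta, x))
```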

Gradient descent w/ feature maps

  • each update requires computing the hypothesis → high-dimensional feature maps require more computation
  • very expressive (high-dimensional) feature maps tend to overfit → need regularization

Regularization w/ feature maps

  • Linear Model regularization review

  • for linear models with (optional) feature maps, the $\ell_2$-regularized objective keeps the same form as before, with $\phi(\bm{x}^{(i)})$ in place of $\bm{x}^{(i)}$ (see $J(\bm\theta)$ below)

Representer Theorem

  • used to efficiently train and evaluate expressive feature maps
  • theorem (special case): for any $\lambda > 0$, there exist real-valued coefficients $\beta_i \in \R$ (scalars) such that the minimizer of the regularized loss function is a linear combination of the training feature vectors
  • i.e. the optimal parameters for the regularized loss are

$$\hat{\bm\theta} = \sum_{i=1}^{n} \beta_i\, \phi(\bm{x}^{(i)})$$

$$\beta_i = -\frac{1}{2\lambda n}\, L'\!\left(\bm\theta^T \phi(\bm{x}^{(i)}),\, y^{(i)}\right), \quad \text{where } L' \text{ denotes the derivative of } L \text{ w.r.t. its first argument}$$

  • proof
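A sketch of the argument (assuming $L$ is differentiable in its first argument), using the regularized objective $J(\bm\theta)$ defined in the next section and setting its gradient to zero:

```latex
\nabla_{\bm\theta} J(\bm\theta)
  = \frac{1}{n} \sum_{i=1}^{n} L'\!\left(\bm\theta^T \phi(\bm{x}^{(i)}),\, y^{(i)}\right) \phi(\bm{x}^{(i)})
    + 2\lambda \bm\theta = 0
\;\Longrightarrow\;
\bm\theta = \sum_{i=1}^{n}
  \underbrace{\left( -\frac{1}{2\lambda n}\, L'\!\left(\bm\theta^T \phi(\bm{x}^{(i)}),\, y^{(i)}\right) \right)}_{\beta_i}
  \phi(\bm{x}^{(i)})
```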

Reparametrized Risk

  • the representer theorem allows us to reparametrize the $\ell_2$-regularized loss:

$$J(\bm\theta) = \frac{1}{n} \sum_{i=1}^{n} L\!\left(\bm\theta^T \phi(\bm{x}^{(i)}),\, y^{(i)}\right) + \lambda \|\bm\theta\|_2^2$$

  • thus, by the representer theorem, we can substitute the minimizer $\hat{\bm\theta} = \sum_i \beta_i \phi(\bm{x}^{(i)})$ into the loss function above to obtain a reparametrized objective (loss) function w.r.t. $\bm\beta$

  • the kernel function can be precomputed for all pairs of training instances before optimization → we only optimize over $\bm\beta$ instead of computing the hypothesis (a dot product in feature space) at every iteration (see the sketch below)
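A minimal Python sketch of the reparametrized objective (my own illustration, assuming a squared loss for concreteness): with $\hat{\bm\theta} = \sum_i \beta_i \phi(\bm{x}^{(i)})$, predictions on the training points become $K\bm\beta$ and the regularizer becomes $\bm\beta^T K \bm\beta$:

```python
import numpy as np

def kernel_matrix(X, k):
    """Precompute K[i, j] = k(x_i, x_j) over all pairs of training points."""
    n = len(X)
    return np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])

def reparam_objective(beta, K, y, lam):
    """J(beta) = (1/n) sum_j L((K beta)_j, y_j) + lam * beta^T K beta,
    with a squared loss L(h, y) = (h - y)^2 used as the example."""
    preds = K @ beta                    # theta^T phi(x_j) = sum_i beta_i K_ij
    data_loss = np.mean((preds - y) ** 2)
    reg = lam * beta @ K @ beta         # ||theta||_2^2 = beta^T K beta
    return data_loss + reg
```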

Kernel Trick

  • thus, we can (in some situations) compute the kernel value (a dot product in feature space) faster than computing the feature map itself, as in the example below
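For example (an illustration consistent with the polynomial kernel $k(\bm{u}, \bm{v}) = (\bm{u}^T\bm{v})^2$ discussed under Mercer's theorem below), the kernel value equals the dot product of explicit degree-2 monomial feature maps, but takes $O(d)$ time instead of $O(d^2)$:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit degree-2 monomial feature map: all pairwise products x_i * x_j (d^2 features)."""
    return np.outer(x, x).ravel()

def k_poly2(u, v):
    """Kernel trick: compute (u^T v)^2 directly in O(d) time."""
    return (u @ v) ** 2

u, v = np.random.randn(5), np.random.randn(5)
# the two computations agree, but k_poly2 never forms the d^2-dimensional features
assert np.isclose(phi_quadratic(u) @ phi_quadratic(v), k_poly2(u, v))
```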

Kernelized Linear Model

  • apply the kernel trick: only compute the kernel function instead of the feature map → kernelized linear models
  • thus, given a kernel function and a set of training instances of cardinality $m$, we can define a kernel matrix $K \in \R^{m \times m}$

$$K_{ij} = k(\bm{x}^{(i)}, \bm{x}^{(j)})$$

Kernelized Ridge Regression
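This section isn't filled in above; as a sketch, assuming the squared loss in $J(\bm\theta)$, the reparametrized objective $\frac{1}{n}\|K\bm\beta - \bm{y}\|_2^2 + \lambda\,\bm\beta^T K \bm\beta$ has the closed-form minimizer $\bm\beta = (K + \lambda n I)^{-1}\bm{y}$, and predictions only need kernel evaluations:

```python
import numpy as np

def fit_kernel_ridge(K, y, lam):
    """Closed-form beta for kernelized ridge regression: (K + lam*n*I) beta = y."""
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict(beta, X_train, x_new, k):
    """h(x) = theta^T phi(x) = sum_i beta_i * k(x_i, x); no feature map needed."""
    return sum(b * k(xi, x_new) for b, xi in zip(beta, X_train))
```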

Valid Kernels

  • given a feature map, a kernel function is defined as the dot product of the two inputs' feature maps, $k(\bm{u}, \bm{v}) = \phi(\bm{u})^T \phi(\bm{v})$ → it defines a notion of similarity between two vectors (data points)
  • e.g. $K_{ij} = k(\bm{x}^{(i)}, \bm{x}^{(j)})$ measures the similarity between $\bm{x}^{(i)}$ and $\bm{x}^{(j)}$
  • other forms of similarity include: Euclidean distance, cosine similarity, and other kernel functions
  • but an arbitrary kernel-like function does NOT necessarily correspond to a feature map

Is a Kernel Valid?

Mercer’s Theorem

  • e.g. $k(\bm{u}, \bm{v}) = (\bm{u}^T \bm{v})^2$
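A worked check (not spelled out in the notes) that this example is valid: for $\bm{u}, \bm{v} \in \R^2$, expanding the square exhibits an explicit feature map whose dot product equals the kernel:

```latex
(\bm{u}^T \bm{v})^2
  = (u_1 v_1 + u_2 v_2)^2
  = u_1^2 v_1^2 + 2\, u_1 u_2\, v_1 v_2 + u_2^2 v_2^2
  = \phi(\bm{u})^T \phi(\bm{v}),
\qquad
\phi(\bm{u}) = \left[\, u_1^2,\ \sqrt{2}\, u_1 u_2,\ u_2^2 \,\right]
```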

Kernel Composition Rules

  • algebraic operations on known valid kernels → result in valid kernels
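The specific rules aren't listed in these notes; the standard closure properties (for valid kernels $k_1, k_2$ and a constant $c > 0$) include, for example:

```latex
k(\bm{u}, \bm{v}) = c\, k_1(\bm{u}, \bm{v}), \qquad
k(\bm{u}, \bm{v}) = k_1(\bm{u}, \bm{v}) + k_2(\bm{u}, \bm{v}), \qquad
k(\bm{u}, \bm{v}) = k_1(\bm{u}, \bm{v})\, k_2(\bm{u}, \bm{v})
```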

Example: Gaussian Kernel (RBF)

$$k(\bm{x}^{(i)}, \bm{x}^{(j)}) = \exp\!\left( -\frac{\|\bm{x}^{(i)} - \bm{x}^{(j)}\|_2^2}{2\sigma^2} \right)$$

  • has value 1 when $\bm{x}^{(i)} = \bm{x}^{(j)}$
  • the value falls off to 0 with increasing distance
  • note: features need to be scaled before using RBF
  • $\sigma$ is the bandwidth hyperparameter → determines how quickly the kernel decays with the $L_2$ distance (see the sketch below)
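A minimal Python sketch of the RBF kernel and how the bandwidth $\sigma$ controls the decay (the specific numbers are just for illustration):

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||xi - xj||_2^2 / (2 * sigma^2))."""
    sq_dist = np.sum((xi - xj) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
print(rbf_kernel(x, x))            # 1.0 when the two points coincide
print(rbf_kernel(x, x + 100.0))    # ~0.0 for very distant points
# smaller sigma -> similarity falls off faster with distance
print(rbf_kernel(x, x + 1.0, sigma=0.1), rbf_kernel(x, x + 1.0, sigma=10.0))
```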

Discussion

Resources


📌 **SUMMARY**