14 - Deep Learning - lec. 19

ucla | CS M146 | 2023-06-05T14:16


Table of Contents

Supplemental

  • issue with model complexity

  • DNNs became popular due to 3 advancements: algorithms, compute, data
  • algorithmic advancements (optimization/training and generalization/testing) in 3 spheres: architectures, regularizers, optimizers
  • convolve - slide a filter spatially over the image, computing dot products at each position → produces a smaller convolved image
  • Jacobian - the matrix of first-order partial derivatives (the gradients)
  • Hessian - the matrix of second-order partial derivatives, i.e. the Jacobian of the gradient (see the sketch after this list)
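
A minimal sketch of the two definitions above, assuming PyTorch's `torch.autograd.functional` helpers; the function `f` and its input are made up for illustration.

```python
# Jacobian and Hessian of a toy scalar function (hypothetical example).
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    # toy scalar-valued function of a 3-dim input
    return (x ** 2).sum() + x.prod()

x = torch.tensor([1.0, 2.0, 3.0])

J = jacobian(f, x)   # shape (3,): the gradient, since f is scalar-valued
H = hessian(f, x)    # shape (3, 3): the Jacobian of the gradient

print(J)  # tensor([8., 7., 8.])
print(H)
```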

Lecture

  • DNNs became popular due to 3 advancements: algorithms, compute, data
  • algorithmic advancements (optimization/training and generalization/testing) in 3 spheres: architectures, regularizers, optimizers

Neural Net Architectures

Early Architectures

  • early archs were MLPs (multi-layer perceptrons) - fully connected (dense) layers
  • used sigmoid and tanh non-linear activations
  • new architectures focused on:
    • better connectivity
      • CNNs for translational invariance in object recognition
      • archs for many modalities: RNNs, CNNs, transformers, graph NNs
    • better backprop
      • new activations (ReLU) or normalizing activations to address vanishing/exploding gradients (see the sketch after this list)
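
A minimal sketch of the contrast above, assuming PyTorch; the layer sizes and batch are made up for illustration.

```python
# Early-style MLP (tanh) vs. a modern one (ReLU); dense layers only.
import torch
import torch.nn as nn

early_mlp = nn.Sequential(           # fully connected (dense) layers only
    nn.Linear(784, 128), nn.Tanh(),  # sigmoid/tanh were the early choices
    nn.Linear(128, 10),
)

modern_mlp = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),  # ReLU mitigates vanishing gradients
    nn.Linear(128, 10),
)

x = torch.randn(32, 784)             # batch of 32 flattened 28x28 inputs
print(early_mlp(x).shape, modern_mlp(x).shape)  # both torch.Size([32, 10])
```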

Convolutional Neural Networks

  • Architecture

  • Fully connected layer → outputs

  • Convolutional Layer
    • applies a small weight tensor called a filter, which covers only a small patch of the image's pixels at a time

    • slides the filter (e.g. 3x5x5) across the image (e.g. 3x32x32) to produce an activation map (1x28x28), repeated for some number of filters

    • stack these filters (e.g. 6x3x5x5 + 6x1 bias) to get stacked activation maps (6x1x28x28) → output image (6x28x28)

    • filters are the weights → initialized randomly, then updated with gradient descent

  • Batched convolution

    • e.g. 2x3x32x32 images
  • Non-linear activations between convolutions

    • can use activations between layers

    • MaxPool(ReLU(x)) = ReLU(MaxPool(x)) since both operations are monotone, so the order doesn't matter (see the sketch below)
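
A minimal sketch of the shapes above, assuming PyTorch; the pooling size is made up for illustration, while the conv sizes follow the notes (6 filters of 3x5x5 over a batch of 2 3x32x32 images).

```python
# Convolution, batching, and the ReLU/MaxPool ordering identity.
import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)        # batch of 2 RGB 32x32 images

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # 6 filters of 3x5x5 (+ 6 biases)
relu = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2)

a = conv(x)
print(a.shape)                       # torch.Size([2, 6, 28, 28]): stacked activation maps

# ReLU and MaxPool commute because both are monotone non-decreasing:
print(torch.allclose(pool(relu(a)), relu(pool(a))))  # True
```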

Neural Net Regularizers

Dropout

  • at every training iteration, we drop each hidden node with some probability (dropout is disabled at test time)
  • once a node is dropped for that iteration, backprop does not update the weights into it → BUT at test time the full net (every node) is used to compute predictions → reduces overfitting on a node-by-node basis → acts like ensembling (see the sketch below)
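
A minimal sketch of the train/test behavior above, assuming PyTorch's `nn.Dropout`; the drop probability and tensor size are made up for illustration.

```python
# Dropout is active in train mode and disabled in eval mode.
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()       # training: each element is zeroed with probability 0.5
print(drop(x))     # survivors are scaled by 1/(1-p) = 2

drop.eval()        # testing: dropout disabled, the full net is used
print(drop(x))     # identical to x
```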

Optimizing DNNs

  • the loss of a DNN is highly non-convex w.r.t. the parameters

  • estimating curvature with the Hessian is O(d²) in the number of parameters d (a net with millions of parameters would need a Hessian with trillions of entries), so it is impractical

Momentum

  • use a heuristic to approximate the rate of change of the gradient and use it in first-order optimization
  • consider a noisy cosine function - we can build a smooth estimate with an exponential moving average controlled by a hyperparameter β; this moving average is the momentum

  • we can use this momentum to smooth the gradients in GD or MBGD (see the sketch after this list)

  • find the GIF for optimizer comparison
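
A minimal sketch of the moving-average idea above, assuming NumPy; β, the learning rate, the noise level, and the toy quadratic are made up for illustration.

```python
# Momentum as an exponential moving average of noisy gradients.
import numpy as np

rng = np.random.default_rng(0)
w, v = 5.0, 0.0                            # parameter and momentum buffer
lr, beta = 0.1, 0.9

for t in range(100):
    grad = 2 * w + rng.normal(scale=2.0)   # noisy gradient of f(w) = w^2
    v = beta * v + (1 - beta) * grad       # moving average of the gradients
    w = w - lr * v                         # first-order step on the smoothed gradient

print(w)                                   # ends up close to the minimum at w = 0
```

Setting β = 0 recovers the plain noisy update; a larger β averages over more past gradients and gives a smoother trajectory.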

Discussion

Resources


📌 **SUMMARY**