14 - Deep Learning - lec. 19

ucla | CS M146 | 2023-06-05T14:16


Table of Contents

Supplemental

  • issue with model complexity

  • DNNs became popular due to 3 advancements: algorithms, compute, data
  • algorithmic advancements (optimization/training and generalization/testing) in 3 spheres: architectures, regularizers, optimizers
  • convolve - slide a filter spatially over the image, computing dot products at each position → produces a smaller convolved image
  • Jacobian - the matrix of first-order partial derivatives (the gradients)
  • Hessian - the matrix of second-order partial derivatives, i.e. the Jacobian of the gradient (see the sketch after this list)
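
A minimal sketch of the two definitions above, assuming PyTorch's `torch.autograd.functional` helpers; the function `f` and its input are made up for illustration.

```python
# Jacobian and Hessian of a toy scalar function (hypothetical example).
import torch
from torch.autograd.functional import jacobian, hessian

def f(x):
    # toy scalar-valued function of a 3-dim input
    return (x ** 2).sum() + x.prod()

x = torch.tensor([1.0, 2.0, 3.0])

J = jacobian(f, x)   # shape (3,): the gradient, since f is scalar-valued
H = hessian(f, x)    # shape (3, 3): the Jacobian of the gradient

print(J)  # tensor([8., 7., 8.])
print(H)
```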

Lecture

  • DNNs became popular due to 3 advancements: algorithms, compute, data
  • algorithmic advancements (optimization/training and generalization/testing) in 3 spheres: architectures, regularizers, optimizers

Neural Net Architectures

Early Architectures

  • early archs were MLPs (multi-layer perceptrons) - fully connected (dense) layers
  • used sigmoid and tanh non-linear activations
  • new architectures focused on:
    • better connectivity
      • CNNs for translational invariance in object recognition
      • archs for many modalities: RNNs, CNNs, transformers, graph NNs
    • better backprop
      • new activations (ReLU) or normalizing activations to address vanishing/exploding gradients (see the sketch after this list)
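
A minimal sketch of the contrast above, assuming PyTorch; the layer sizes and batch are made up for illustration.

```python
# Early-style MLP (tanh) vs. a modern one (ReLU); dense layers only.
import torch
import torch.nn as nn

early_mlp = nn.Sequential(           # fully connected (dense) layers only
    nn.Linear(784, 128), nn.Tanh(),  # sigmoid/tanh were the early choices
    nn.Linear(128, 10),
)

modern_mlp = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),  # ReLU mitigates vanishing gradients
    nn.Linear(128, 10),
)

x = torch.randn(32, 784)             # batch of 32 flattened 28x28 inputs
print(early_mlp(x).shape, modern_mlp(x).shape)  # both torch.Size([32, 10])
```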

Convolutional Neural Networks

  • Architecture

  • Fully connected layer → outputs

  • Convolutional Layer
    • applies a small weight tensor called a filter, which covers only a small patch of the image's pixels at a time

    • slides the filter (e.g. 3x5x5) across the image (e.g. 3x32x32) to produce an activation map (1x28x28), repeated for some number of filters

    • stack these filters (e.g. 6x3x5x5 + 6x1 bias) to get stacked activation maps (6x1x28x28) → output image (6x28x28)

    • filters are the weights → initialized randomly, then updated with gradient descent

  • Batched convolution

    • e.g. 2x3x32x32 images
  • Non-linear activations between convolutions

    • can use activations between layers

    • MaxPool(ReLU(x)) = ReLU(MaxPool(x)) since both operations are monotone, so the order doesn't matter (see the sketch below)
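
A minimal sketch of the shapes above, assuming PyTorch; the pooling size is made up for illustration, while the conv sizes follow the notes (6 filters of 3x5x5 over a batch of 2 3x32x32 images).

```python
# Convolution, batching, and the ReLU/MaxPool ordering identity.
import torch
import torch.nn as nn

x = torch.randn(2, 3, 32, 32)        # batch of 2 RGB 32x32 images

conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)  # 6 filters of 3x5x5 (+ 6 biases)
relu = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2)

a = conv(x)
print(a.shape)                       # torch.Size([2, 6, 28, 28]): stacked activation maps

# ReLU and MaxPool commute because both are monotone non-decreasing:
print(torch.allclose(pool(relu(a)), relu(pool(a))))  # True
```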

Neural Net Regularizers

Dropout

  • at every training iteration, we drop each hidden node with some probability (dropout is disabled at test time)
  • once a node is dropped for that iteration, backprop does not update the weights into it → BUT at test time the full net (every node) is used to compute predictions → reduces overfitting on a node-by-node basis → acts like ensembling (see the sketch below)
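
A minimal sketch of the train/test behavior above, assuming PyTorch's `nn.Dropout`; the drop probability and tensor size are made up for illustration.

```python
# Dropout is active in train mode and disabled in eval mode.
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()       # training: each element is zeroed with probability 0.5
print(drop(x))     # survivors are scaled by 1/(1-p) = 2

drop.eval()        # testing: dropout disabled, the full net is used
print(drop(x))     # identical to x
```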

Optimizing DNNs

  • the loss of a DNN is highly non-convex w.r.t. the parameters

  • estimating curvature with the Hessian is O(d²) in the number of parameters d (a net with millions of parameters would need a Hessian with trillions of entries), so it is impractical

Momentum

  • use a heuristic to approximate the rate of change of the gradient and use it in first-order optimization
  • consider a noisy cosine function - we can build a smooth estimate with an exponential moving average controlled by a hyperparameter β; this moving average is the momentum

  • we can use this momentum to smooth the gradients in GD or MBGD (see the sketch after this list)

  • find the GIF for optimizer comparison
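
A minimal sketch of the moving-average idea above, assuming NumPy; β, the learning rate, the noise level, and the toy quadratic are made up for illustration.

```python
# Momentum as an exponential moving average of noisy gradients.
import numpy as np

rng = np.random.default_rng(0)
w, v = 5.0, 0.0                            # parameter and momentum buffer
lr, beta = 0.1, 0.9

for t in range(100):
    grad = 2 * w + rng.normal(scale=2.0)   # noisy gradient of f(w) = w^2
    v = beta * v + (1 - beta) * grad       # moving average of the gradients
    w = w - lr * v                         # first-order step on the smoothed gradient

print(w)                                   # ends up close to the minimum at w = 0
```

Setting β = 0 recovers the plain noisy update; a larger β averages over more past gradients and gives a smoother trajectory.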

Discussion

Resources


📌 **SUMMARY**