14 - Deep Learning - lec. 19
ucla | CS M146 | 2023-06-05T14:16
Table of Contents
Supplemental
issue with model complexity
- DNNs became popular due to 3 advancements: algorithms, compute, data
- algorithmic advancements (optimization/training and generalization/testing) in 3 spheres: architectures, regularizers, optimizers
- convolve - slide a filter over the image spatially, computing a dot product at each location → makes a smaller convolved image (activation map)
- Jacobian - matrix of first-order partial derivatives (the gradients)
- Hessian - matrix of second-order partial derivatives; the Jacobian of the gradient
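Standard definitions for reference (my notation, not from the lecture): for $f:\mathbb{R}^n \to \mathbb{R}^m$ and a scalar loss $L:\mathbb{R}^n \to \mathbb{R}$,

$$J_{ij} = \frac{\partial f_i}{\partial x_j}, \qquad H_{ij} = \frac{\partial^2 L}{\partial x_i \, \partial x_j} = \big(J(\nabla L)\big)_{ij}$$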
Lecture
- DNNs became popular due to 3 advancements: algorithms, compute, data
- algorithmic advancements (optimization/training and generalization/testing) in 3 spheres: architectures, regularizers, optimizers
Neural Net Architectures
Early Architectures
- early archs were MLPs (multi-layer perceptrons) - fully connected (dense) layers
- used sigmoid and tanh non-linear activations
- new architectures focused on:
- better connectivity
- CNNs for translational invariance in object recognition
- architectures for many modalities: RNNs, CNNs, transformers, graph NNs
- better backprop
- new activations (ReLU) or normalization of activations to combat vanishing/exploding gradients (see the sketch after this list)
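A minimal sketch (my own illustration, not from the lecture) of why ReLU helps with vanishing gradients: sigmoid's derivative is at most 0.25, so gradients shrink geometrically through many layers, while ReLU passes a gradient of 1 wherever the unit is active.

```python
import numpy as np

# Compare how gradients scale through many layers of sigmoid vs. ReLU
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = 0.0
sig_grad = sigmoid(x) * (1 - sigmoid(x))     # 0.25 at sigmoid's steepest point
relu_grad = 1.0                              # for any active unit (x > 0)

layers = 20
print("sigmoid chain:", sig_grad ** layers)  # ~9e-13 -> gradient vanishes
print("relu chain:   ", relu_grad ** layers) # 1.0    -> gradient preserved
```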
Convolutional Neural Networks
Architecture
Fully connected layer → outputs
- Convolutional Layer
applies a weight matrix called a filter to a small spatial patch of pixels at a time
slides the filter (e.g. 3x5x5) across the image (e.g. 3x32x32), taking a dot product at each location, to make an activation map (1x28x28), and repeats for some number of filters
stack these filters (e.g. 6 filters → weights 6x3x5x5 + 6x1 bias) to get stacked activation maps (6x1x28x28) → output image (6x28x28)
filters are the weights → assigned randomly, then updated with gradient descent
Batched convolution
- e.g. a batch of 2 images (2x3x32x32) → the same filters run on each image → output 2x6x28x28 (shape check below)
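A minimal shape check in PyTorch (my sketch, assuming the 6 filters of size 3x5x5 and the 32x32 inputs from above; not code from the lecture):

```python
import torch
import torch.nn as nn

# 6 filters, each 3x5x5, stride 1, no padding -> 32 - 5 + 1 = 28
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)

x = torch.randn(2, 3, 32, 32)   # batch of 2 RGB 32x32 images
y = conv(x)

print(conv.weight.shape)  # torch.Size([6, 3, 5, 5])   -> the filters
print(conv.bias.shape)    # torch.Size([6])            -> one bias per filter
print(y.shape)            # torch.Size([2, 6, 28, 28]) -> stacked activation maps per image
```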
Non-linear activations between convolutions
can use activations between layers
MaxPool(ReLU(x)) = ReLU(MaxPool(x)) - the two commute because ReLU is monotonic (quick check below)
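A quick numeric check of the commutation (my sketch, not from the lecture):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 6, 28, 28)

a = F.max_pool2d(F.relu(x), kernel_size=2)   # ReLU then MaxPool
b = F.relu(F.max_pool2d(x, kernel_size=2))   # MaxPool then ReLU

print(torch.allclose(a, b))  # True - ReLU is monotonic, so order doesn't matter
```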
Neural Net Regularizers
Dropout
- at every training iteration, we drop hidden nodes with a certain probability (dropout is disabled at test time)
- when a node is dropped, backprop does not update the weights into that node for that iteration → BUT that node (and the full net) is still used to compute test predictions → reduces overfitting on a node-by-node basis → acts like ensembling many thinned subnetworks (sketch below)
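A minimal PyTorch sketch of the train/test behavior (nn.Dropout uses inverted dropout, scaling survivors by 1/(1-p); my example, not from the lecture):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)   # drop each hidden node with probability 0.5
x = torch.ones(1, 8)

drop.train()               # training: nodes randomly zeroed each iteration
print(drop(x))             # roughly half the entries are 0, survivors scaled to 1/(1-p) = 2

drop.eval()                # testing: dropout disabled, full net used
print(drop(x))             # all ones, unchanged
```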
Optimizing DNNs
loss is highly non-convex in DNNs wrt parameters
trying to find curvature using the Hessian is impractical - for d parameters the Hessian has d² entries, far too expensive to compute/store at DNN scale
so no second-order methods → stick to first-order (gradient) methods plus heuristics
Momentum
- use a heuristic to approximate the rate of change of the gradient and use it in first-order optimization
consider a noisy cosine function - we can make a smooth estimate using a moving average as momentum, controlled by a hyperparameter
we can use this momentum to regularize (smooth) gradients in SGD / mini-batch gradient descent (sketch after this list)
- find the GIF for optimizer comparison
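A minimal sketch of the moving-average view of momentum on a noisy cosine (my illustration; beta is the momentum hyperparameter):

```python
import numpy as np

beta = 0.9                                             # momentum hyperparameter
xs = np.linspace(0, 4 * np.pi, 200)
noisy = np.cos(xs) + 0.3 * np.random.randn(xs.size)    # noisy cosine (stand-in for noisy gradients)

# exponential moving average: v_t = beta * v_{t-1} + (1 - beta) * g_t
v = 0.0
smoothed = []
for g in noisy:
    v = beta * v + (1 - beta) * g
    smoothed.append(v)

print("roughness (noisy):   ", np.std(np.diff(noisy)))     # large step-to-step jitter
print("roughness (smoothed):", np.std(np.diff(smoothed)))  # much smaller -> smoother curve

# SGD with momentum uses the same averaged "velocity" as its update direction:
#   v_t = beta * v_{t-1} + grad_t ;  w_t = w_{t-1} - lr * v_t
```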
Discussion
Resources
📌
**SUMMARY**