9 - Transformers

ucla | CS 162 | 2024-02-09 02:15

Seq2Seq Models
- Encoder-Decoder Structure
- Seq2Seq w/ Attention (Weighted sum)
Transformers

Seq2Seq Models

input one sequence, output another sequence
e.g., audio -> text, text -> text (contextualized)
Encoder-Decoder Structure
idea is to use the output of the last encoder cell as the encoded vector bc (w/ an RNN at least) the last cell contains information and context on the entire sentence so far; the decoder then generates the output sequence from that vector
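A minimal sketch of the "last cell as encoded vector" idea, assuming a vanilla tanh RNN. The function name `rnn_encode` and the weights `W_xh`, `W_hh`, `b_h` are hypothetical and random here, not taken from the lecture.

```python
import numpy as np

def rnn_encode(X, W_xh, W_hh, b_h):
    """Run a vanilla RNN over X (T, d_in) and return the last hidden state."""
    h = np.zeros(W_hh.shape[0])
    for x_t in X:
        # Each step mixes the current token with everything seen so far.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h  # used as the encoded vector for the whole sequence

rng = np.random.default_rng(0)
T, d_in, d_h = 6, 4, 5
X = rng.normal(size=(T, d_in))
W_xh = rng.normal(size=(d_h, d_in))
W_hh = rng.normal(size=(d_h, d_h))
b_h = np.zeros(d_h)
encoded = rnn_encode(X, W_xh, W_hh, b_h)
```

The whole sentence is squeezed into one fixed-size vector no matter how long it is, which is the bottleneck that attention is meant to relieve.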
Seq2Seq w/ Attention (Weighted sum)
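the pathway is encoder -> context vector -> attention vector; we compute attention by comparing the current decoder hidden state ($h_t$) to every hidden state made by the encoder ($\overline{ h_s}$), which gives one attention weight per encoder position (see arrows from sep token in decode to attention weights from the encoder); the context vector is the weighted sum of the encoder hidden states under those weights
we make predictions based on attention vectors (context vector combined w/ $h_t$) instead of the hidden states alone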
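The weighted sum above can be sketched as below, assuming a plain dot-product score and a tanh combination layer. The matrix `W_c` and the function name `attention_vector` are hypothetical, and the lecture may have used a different scoring function.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_vector(h_t, h_s, W_c):
    """h_t: (d,) decoder hidden state; h_s: (S, d) encoder hidden states."""
    scores = h_s @ h_t            # (S,) one score per encoder position
    weights = softmax(scores)     # (S,) attention weights, sum to 1
    context = weights @ h_s       # (d,) weighted sum of encoder states
    # Attention vector: combine context and decoder state (W_c is assumed, shape (d, 2d)).
    a_t = np.tanh(W_c @ np.concatenate([context, h_t]))
    return a_t, weights

rng = np.random.default_rng(0)
S, d = 5, 4
h_s = rng.normal(size=(S, d))
h_t = rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
a_t, weights = attention_vector(h_t, h_s, W_c)
```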

Transformers

relies entirely on attention for the encoder-decoder model instead of recurrent dependency -> all positions can be processed at once (parallelizable)
Self-Attention
attention between every token and every other token in the same sequence
using Q, K, V (queries, keys, values), compute the following circuits:
QK circuit:
- generate raw scores $\hat O=QK^T$, then scale by $\sqrt{d_k}$ and normalize w/ a row-wise softmax to get the attention weights $O = \text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}}\bigg)$
OV circuit:
- generate contextualized (encoded) vectors by multiplying the attention weights w/ the values s.t. $C=OV=\text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}}\bigg)\cdot V$
we can parallelize and introduce new weights by first making the QKV matrices using weights and the inputs: $Q=X\cdot W^Q$, $K=X\cdot W^K$, $V = X\cdot W^V$
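A minimal NumPy sketch of the two circuits, assuming a single head and random (untrained) weight matrices. The names `self_attention`, `W_q`, `W_k`, `W_v` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (n, d_model) input embeddings, one row per token."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # project inputs to queries/keys/values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # QK circuit: (n, n) scaled scores
    O = softmax(scores, axis=-1)          # each row is a distribution over tokens
    C = O @ V                             # OV circuit: (n, d_k) contextualized vectors
    return C, O

rng = np.random.default_rng(0)
n, d_model, d_k = 3, 8, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
C, O = self_attention(X, W_q, W_k, W_v)
```

Every row of `O` is computed by the same matrix products, which is why all tokens can be handled at once.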
Multi-Headed Attention
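we create multiple self-attention heads using multiple QKV weight matrices and generate multiple contextualized versions of the input vector from each head: $\text{head}_i=\text{softmax}\bigg(\frac{Q_iK_i^T}{\sqrt{d_k}}\bigg)\cdot V_i$ w/ $Q_i=X\cdot W_i^Q$, $K_i=X\cdot W_i^K$, $V_i = X\cdot W_i^V$
then we concatenate the outputs and multiply by a new output weight matrix $W^O$ and get a final contextualized matrix (by projection using the new weights) for the inputs: $\text{MultiHead}(X)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)\cdot W^O$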
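A rough sketch of the concatenate-then-project step, assuming `h` heads of size `d_model / h` and random weights. The loop is written for readability, and the names `multi_head_attention`, `heads`, and `W_o` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one tuple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        O = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(O @ V)                  # (n, d_k) per head
    concat = np.concatenate(outputs, axis=-1)  # (n, h * d_k)
    return concat @ W_o                        # project back to (n, d_model)

rng = np.random.default_rng(0)
n, d_model, h = 3, 8, 2
d_k = d_model // h
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, heads, W_o)
```

The output has the same shape as the input, so the layer can be stacked.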
Positional Embedding
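something to be said abt positional relevance (words closer together are likely related in the same context), and attention by itself has no notion of word order, so we also include a positional encoding along with the input embeddings (added to them); w/ the standard sinusoidal encoding: $p_t^{(i)}=\sin(\omega_k t)$ if $i=2k$, $p_t^{(i)}=\cos(\omega_k t)$ if $i=2k+1$, w/ $\omega_k=\frac{1}{10000^{2k/d}}$
t is the word position, i is the dimension index, k tracks the parity of i (i = 2k is even -> sin, i = 2k+1 is odd -> cos), d is the dimensionality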
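A small sketch of a positional encoding, assuming the standard sinusoidal scheme (sin on even dimensions, cos on odd dimensions, base 10000) and an even `d`. The function name `positional_encoding` is illustrative.

```python
import numpy as np

def positional_encoding(num_positions, d, base=10000.0):
    """Return a (num_positions, d) matrix of sinusoidal encodings; assumes d is even."""
    t = np.arange(num_positions)[:, None]   # (T, 1) word positions
    k = np.arange(d // 2)[None, :]          # (1, d/2) index of each sin/cos pair
    angles = t / base ** (2 * k / d)        # (T, d/2)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)            # even dimensions i = 2k
    pe[:, 1::2] = np.cos(angles)            # odd dimensions i = 2k + 1
    return pe

pe = positional_encoding(num_positions=10, d=8)
# The encoding is added to the input embeddings, e.g. X + pe[:X.shape[0]].
```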
Residual Connections and Layer Norm
information from the input vector gets lost multiple steps in + vanishing gradients -> add the sublayer's input back to its output through a residual connection: $x + \text{Sublayer}(x)$
then we normalize the result of adding the original embeddings back to the contextualized vectors: $\text{LayerNorm}(x + \text{Sublayer}(x))$
for each vector, normalize the value in each dimension wrt the mean (and variance) over that vector's dimensions to prevent vectors from becoming too large
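A minimal sketch of the add-and-norm step, assuming layer norm without the learned gain and bias and a stand-in linear map as the sublayer. The names `layer_norm` and `add_and_norm` are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row (one token vector) over its own dimensions.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection: add the sublayer's input back to its output, then normalize.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
W = rng.normal(size=(8, 8))               # stand-in for attention or a feed-forward layer
out = add_and_norm(x, lambda v: v @ W)
```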
Encoder-Decoder Attention
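the decoder first runs self-attention similar to the encoder's, but applies a look-ahead mask after the $\sqrt{d_k}$ scaling (before the softmax) so a token cannot attend to later tokens
then pass that into Encoder-Decoder attention, which is basically multi-headed attention but w/ K, V taken from the contextualized representations from the Encoder and Q taken from the decoder's masked self-attention output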
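A rough single-head sketch of both decoder steps, with the learned Q/K/V projections left out for brevity. The names `attention`, `dec`, and `enc` are hypothetical, and the large negative constant is one common way to implement the mask.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        # Masked-out positions get a very negative score -> ~0 weight after softmax.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, S, d = 4, 5, 8
dec = rng.normal(size=(T, d))   # decoder-side token representations
enc = rng.normal(size=(S, d))   # encoder outputs (contextualized source tokens)

# Step 1: masked self-attention; position t may only look at positions <= t.
look_ahead = np.tril(np.ones((T, T), dtype=bool))
dec_ctx, A_self = attention(dec, dec, dec, mask=look_ahead)

# Step 2: encoder-decoder attention; Q from the decoder, K and V from the encoder.
out, A_cross = attention(dec_ctx, enc, enc)
```

`A_self` is lower triangular, while `A_cross` has one row per target token and one column per source token.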
