9 - Transformers

ucla | CS 162 | 2024-02-09 02:15


Table of Contents

Seq2Seq Models

  • input one sequence, output another sequence
  • e.g., audio -> text (speech recognition), text -> contextualized text (encoding), text -> text (translation)

    Encoder-Decoder Structure

  • idea is to use the output of the last cell as the encoded vector because (w/ RNNs at least) the last cell's hidden state contains information and context on the entire sentence so far

    Seq2Seq w/ Attention (Weighted sum)

  • pathway: encoder -> context -> attention vector; we compute the attention weights by comparing each encoder hidden state ($\overline{h_s}$) against the current decoder hidden state ($h_t$) (see arrows from the sep token in the decoder to the attention weights over the encoder states)
  • we make predictions based on the attention (context) vectors instead of the decoder hidden states alone
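The weighted-sum step above can be sketched directly. A minimal NumPy sketch with made-up sizes (4 source positions, hidden size 3), using simple dot-product scoring:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
h_bar = rng.normal(size=(4, 3))  # encoder hidden states h̄_s, one per source position
h_t = rng.normal(size=3)         # current decoder hidden state h_t

scores = h_bar @ h_t             # dot-product score for each source position
alpha = softmax(scores)          # attention weights (sum to 1)
context = alpha @ h_bar          # weighted sum of the encoder hidden states
```

The context vector is then combined with $h_t$ to make the prediction.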

Transformers

  • relies entirely on attention for encoder-decoder model instead of relying on recurrent dependency -> parallelizable

    Self-Attention

  • attention from every token to every other token
  • using QKV, compute the following circuits:
  • QK circuit:
    • generate the scores $\hat O=QK^T$, then scale by $\sqrt{d_k}$ and softmax to get the attention weights \(Z=\text{attention}=O = \text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}}\bigg)\)
  • OV circuit:
    • generate contextualized (encoded) vectors by multiplying the attention weights against the values s.t. \(C=OV=\text{softmax}\bigg(\frac{QK^T}{\sqrt{d_k}}\bigg)\cdot V\)
  • we can parallelize and introduce learnable weights by first forming the Q, K, V matrices from the inputs: \(Q=X\cdot W^Q\), \(K=X\cdot W^K\), \(V = X\cdot W^V\)
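Putting the QK and OV circuits together, a minimal single-head sketch in NumPy (the sizes are made up: 5 tokens with model dim 8, projected to $d_k = 4$):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the inputs into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # QK circuit: scaled scores, then softmax over the key positions.
    O = softmax(Q @ K.T / np.sqrt(d_k))
    # OV circuit: attention-weighted sum of the values.
    return O @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                       # 5 tokens, model dim 8
Wq, Wk, Wv = [rng.normal(size=(8, 4)) for _ in range(3)]
C = self_attention(X, Wq, Wk, Wv)                 # contextualized vectors, shape (5, 4)
```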

    Multi-Headed Attention

  • we create multiple self-attention heads, each with its own QKV weight matrices, and each head generates its own contextualized version of the input:
  • then we concatenate the heads' outputs and multiply by a new output weight matrix $W^O$, projecting down to a final contextualized matrix for the inputs:
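Concretely, a sketch of the concatenate-and-project step with assumed sizes (2 heads of dim 4 over model dim 8):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, Wo):
    # heads: one (Wq, Wk, Wv) triple per head; Wo: the output projection W^O.
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outs.append(A @ V)                        # this head's contextualized output
    # Concatenate the head outputs, then project with W^O.
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(1)
d_model, d_head, n_heads = 8, 4, 2                # illustrative sizes
X = rng.normal(size=(5, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model))
out = multi_head_attention(X, heads, Wo)          # final contextualized matrix, (5, 8)
```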

    Positional Embedding

  • something to be said abt positional relevance (words closer together are likely related in the same context), so we also include a positional encoding along with the input embeddings; the standard sinusoidal form is \(PE(t, 2i)=\sin\big(t/10000^{2i/d}\big)\), \(PE(t, 2i+1)=\cos\big(t/10000^{2i/d}\big)\)
  • t is the word position, k is the parity (sin for even dimensions, cos for odd), i is the dimension index, d is the dimensionality
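A sketch of the sinusoidal encoding above (assuming an even dimensionality d):

```python
import numpy as np

def positional_encoding(n_pos, d):
    # PE[t, 2i] = sin(t / 10000^(2i/d)); PE[t, 2i+1] = cos(t / 10000^(2i/d))
    t = np.arange(n_pos)[:, None]          # word positions, as a column
    i = np.arange(d // 2)[None, :]         # dimension-pair indices, as a row
    angles = t / np.power(10000.0, 2 * i / d)
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)           # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)           # odd dimensions get cos
    return pe

pe = positional_encoding(10, 8)            # 10 positions, dimensionality 8
```

This matrix is added elementwise to the input embeddings before the first layer.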

    Residual Connections and Layer Norm

  • the input signal degrades several layers in, and deep stacks suffer vanishing gradients -> add each sublayer's input back into its output through a residual connection:
  • i.e., we add the original embeddings back into the contextualized vectors, then normalize the result (layer norm)
  • for each token, normalize each dimension's value with respect to the mean to prevent the vectors from becoming too large
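The add-and-norm step can be sketched as follows (a minimal version without the learned gain/bias that full LayerNorm adds; sizes are made up):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector across its feature dimension.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection: add the sublayer's input back in, then normalize.
    return layer_norm(x + sublayer_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))        # 3 tokens, dim 4
y = add_and_norm(x, 2 * x)         # toy stand-in for a sublayer's output
```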

    Encoder-Decoder Attention

  • the decoder uses masked self-attention: the same self-attention, but with a look-ahead mask applied after the dimensional scaling so each position can only attend to itself and earlier positions
  • then pass that into encoder-decoder attention, which is basically multi-headed attention except that K and V come from the contextualized representations of the encoder (Q comes from the decoder)
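The look-ahead mask can be sketched by setting the scores for future positions to a large negative value before the softmax (single head, made-up sizes; V is the identity here so the output rows show the attention weights directly):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Look-ahead mask: position t may only attend to positions <= t,
    # so zero out (via -1e9 before softmax) the strictly upper triangle.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(3)
Q, K = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
V = np.eye(4)                          # identity values expose the weights
out = masked_attention(Q, K, V)        # row 0 can only attend to position 0
```

In encoder-decoder attention the same computation is run with Q from the decoder and K, V from the encoder outputs (and no look-ahead mask).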