9 - Transformers

ucla | CS 162 | 2024-02-09 02:15



Seq2Seq Models

  • input one sequence output another sequence
  • e.g., audio -> text (speech recognition), text -> text (e.g., translation or summarization)

    Encoder-Decoder Structure

  • idea is to use the output of the last cell as the encoded vector bc (w/ an RNN at least) the last cell's hidden state carries information and context on the entire sentence so far
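
a minimal sketch of that idea (not from the lecture slides): a vanilla RNN encoder whose final hidden state is handed to the decoder as the context vector; all weight names and sizes below are made-up placeholders

```python
# Minimal sketch: a vanilla RNN encoder whose last hidden state serves as the
# encoded "context" vector for the decoder. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, seq_len = 8, 16, 5

W_xh = rng.normal(size=(d_in, d_hid)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden -> hidden

def encode(inputs):
    """Run a vanilla RNN over `inputs` (seq_len, d_in) and return the
    final hidden state, which summarizes the whole sequence."""
    h = np.zeros(d_hid)
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
    return h  # context vector handed to the decoder

context = encode(rng.normal(size=(seq_len, d_in)))
print(context.shape)  # (16,)
```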

    Seq2Seq w/ Attention (Weighted sum)

  • the pathway goes encoder -> context -> attention vector; we compute attention by comparing each hidden state produced by the encoder (hs) against the current decoder hidden state (ht) (see the arrows from the sep token in the decoder to the attention weights over the encoder states)
  • we make predictions based on the attention (context) vectors rather than the encoder's final hidden state alone
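
a minimal sketch of attention as a weighted sum, assuming simple dot-product scoring between ht and each hs (the lecture may use a different scoring function); shapes and variable names are placeholders

```python
# Minimal sketch: attention over encoder hidden states h_s given a decoder
# hidden state h_t, producing a weighted-sum context vector.
import numpy as np

rng = np.random.default_rng(0)
d_hid, src_len = 16, 7

h_s = rng.normal(size=(src_len, d_hid))  # encoder hidden states
h_t = rng.normal(size=(d_hid,))          # current decoder hidden state

scores = h_s @ h_t                       # compare h_t to every h_s (dot product)
weights = np.exp(scores - scores.max())
weights /= weights.sum()                 # softmax -> attention weights
context = weights @ h_s                  # weighted sum of encoder states

print(weights.round(3), context.shape)   # attention distribution, (16,)
```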

Transformers

  • relies entirely on attention for the encoder-decoder model instead of on recurrent dependencies -> parallelizable

    Self-Attention

  • attention between every token and every other token
  • using QKV, compute the following circuits:
  • QK circuit:
    • Generate $\hat{O} = QK^T$, then scale and softmax to get the attention weights: $O = \mathrm{attention} = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)$
  • OV circuit:
    • generate contextualized (encoded) vectors by multiplying the attention weights into the values s.t. $C = OV = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
  • we can parallelize and introduce new weights by first making the QKV matrices from the weights and the inputs: $Q = XW^Q$, $K = XW^K$, $V = XW^V$
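
a minimal sketch of a single self-attention head following the QK / OV circuits above; the weight names and sizes are placeholders

```python
# Minimal sketch of single-head self-attention: QK circuit -> attention matrix,
# OV circuit -> contextualized vectors. Weights are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))       # input embeddings
W_Q = rng.normal(size=(d_model, d_k)) * 0.1
W_K = rng.normal(size=(d_model, d_k)) * 0.1
W_V = rng.normal(size=(d_model, d_k)) * 0.1

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# QK circuit: raw scores, scale by sqrt(d_k), softmax -> attention matrix O
O = softmax(Q @ K.T / np.sqrt(d_k))           # (seq_len, seq_len)

# OV circuit: attention-weighted sum of values -> contextualized vectors C
C = O @ V                                     # (seq_len, d_k)
print(O.shape, C.shape)
```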

    Multi-Headed Attention

  • we create multiple self-attention heads using multiple QKV weight matrices and generate multiple contextualized versions of the input from each head: $\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_iK_i^T}{\sqrt{d_k}}\right)V_i$ with $Q_i = XW_i^Q$, $K_i = XW_i^K$, $V_i = XW_i^V$
  • then we concatenate the head outputs and multiply by a new output weight matrix $W^O$ to get a final contextualized matrix (a projection using the new weights) for the inputs: $\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$
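
a minimal sketch of multi-headed attention: several heads with their own Q/K/V weights, concatenated and projected with W_O; the head count and dimensions are placeholders

```python
# Minimal sketch of multi-headed attention: run several heads with their own
# Q/K/V weights, concatenate the outputs, then project with W_O.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head(X):
    """One self-attention head with its own (random placeholder) weights."""
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(d_head)) @ V   # (seq_len, d_head)

W_O = rng.normal(size=(n_heads * d_head, d_model)) * 0.1
concat = np.concatenate([head(X) for _ in range(n_heads)], axis=-1)
out = concat @ W_O                                  # final contextualized matrix
print(out.shape)  # (seq_len, d_model)
```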

    Positional Embedding

  • something to be said abt positional relevance (words closer together are likely related in the same context), so we also add a positional encoding to the input embeddings: $PE(t, i) = \sin\!\left(\frac{t}{10000^{2k/d}}\right)$ if $i = 2k$ (even), and $\cos\!\left(\frac{t}{10000^{2k/d}}\right)$ if $i = 2k + 1$ (odd)
  • $t$ is the word position, $k$ indexes the sin/cos pair (the parity of $i$), $i$ is the dimension index, $d$ is the dimensionality
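
a minimal sketch of the sinusoidal positional encoding above (sin on even dimensions, cos on odd); max_len and d here are arbitrary

```python
# Minimal sketch of sinusoidal positional encoding: sin on even dimensions,
# cos on odd dimensions, added elementwise to the input embeddings.
import numpy as np

def positional_encoding(max_len, d):
    t = np.arange(max_len)[:, None]            # word positions
    k = np.arange(d // 2)[None, :]             # sin/cos pair index
    angle = t / (10000 ** (2 * k / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)                # even dimensions
    pe[:, 1::2] = np.cos(angle)                # odd dimensions
    return pe

pe = positional_encoding(max_len=6, d=8)
print(pe.shape)                                # (6, 8)
```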

    Residual Connections and Layer Norm

  • information loss of the input vector multiple steps in + vanishing gradients -> add the sublayer's input back into its output through a residual connection: $\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$
  • i.e., we normalize outputs after adding the original embeddings back into the contextualized vectors
  • layer norm: for each vector, normalize each dimension w.r.t. the mean (and variance) to prevent values from becoming too large
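
a minimal sketch of the residual connection + layer norm pattern, LayerNorm(x + Sublayer(x)); the sublayer output here is random data standing in for an attention/FFN output, and layer norm's learned gain/bias are omitted

```python
# Minimal sketch of a residual connection followed by layer norm:
# LayerNorm(x + Sublayer(x)), normalizing each vector over its dimensions.
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)     # learned gain/bias omitted

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # input to the sublayer
sublayer_out = rng.normal(size=(4, 8))         # stand-in for attention/FFN output

out = layer_norm(x + sublayer_out)             # add back the input, then normalize
print(out.mean(axis=-1).round(6), out.shape)   # per-vector means ~0
```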

    Encoder-Decoder Attention

  • the decoder uses similar self-attention but applies a look-ahead (causal) mask after the scaling step, so a position cannot attend to future tokens (see the masking sketch below)
  • then pass that into Encoder-Decoder attention, which is basically multi-headed attention but with $K$ and $V$ taken from the Encoder's contextualized representations ($Q$ comes from the decoder)
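
a minimal sketch of the look-ahead (causal) mask applied inside the decoder's masked self-attention; Q, K, V here are random placeholders rather than real decoder states

```python
# Minimal sketch of masked self-attention: a look-ahead mask is applied to the
# scaled scores before softmax so position i cannot attend to positions j > i.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T / np.sqrt(d_k)                       # scaled dot products
mask = np.triu(np.ones((seq_len, seq_len)), k=1)      # 1s above the diagonal
scores = np.where(mask == 1, -1e9, scores)            # block future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)        # softmax over allowed positions
out = weights @ V
print(weights.round(3))                               # lower-triangular attention
```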