Glossary

Definitions for words used in the code and documentation.

  • example: one dataset item (image, sentence, audio clip, point cloud, graph instance).
  • token: one position in the encoder’s residual stream, represented by a vector of hidden size d_model. Inside the model, always use the word "token", regardless of modality.
  • content token: a token derived from the raw input (image patch, wordpiece, audio window, node, etc.).
  • special token: a token not directly derived from the raw input (class/summary token, [SEP], [MASK], [PAD], register token, etc.).
  • sequence length L: total number of tokens per example (content + special). If L varies between examples, call it “ragged”.
  • layer: an integer index into the encoder’s stack.
  • activation kind (optional but useful): which stream you saved (e.g., resid_pre, resid_post, mlp_out, attn_out, qkv, head_out).
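As a rough illustration of how these terms fit together, the sketch below counts tokens for two made-up examples. The token layouts, the "content"/"special" labels, and the helper names are all hypothetical and exist only to make the definitions concrete; they are not taken from the codebase.

```python
# Hypothetical token layouts for two examples (labels are illustrative only).
examples = [
    # e.g. one class token followed by four image patches
    ["special", "content", "content", "content", "content"],
    # e.g. [CLS], two wordpieces, [SEP]
    ["special", "content", "content", "special"],
]


def sequence_length(tokens):
    """Sequence length L counts every token, content and special alike."""
    return len(tokens)


def count_kind(tokens, kind):
    """Number of tokens of the given kind ("content" or "special")."""
    return sum(1 for t in tokens if t == kind)


lengths = [sequence_length(tokens) for tokens in examples]

# The data is "ragged" when L differs between examples.
is_ragged = len(set(lengths)) > 1

for tokens, length in zip(examples, lengths):
    print(length, count_kind(tokens, "content"), count_kind(tokens, "special"))
print("ragged:", is_ragged)
```

Here the first example has L = 5 and the second has L = 4, so this pair would be described as ragged.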

Modality-specific vocab:

  • patch (vision): a 2D content token. Often laid out on a grid with shape (H_patches, W_patches).
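  • frame/token or tube (video): a content token spanning time × space; often laid out on a grid with shape (T, H, W).
  • wordpiece / subword (text): a content token produced by a tokenizer.
  • window / frame (audio): a content token covering one time–frequency window.
  • node (graph), point (point cloud): a content token for one graph node or one point.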
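To make the link between the model-side word "token" and the vision-side word "patch" concrete, here is a minimal sketch that maps a flat token index to a position on the patch grid. It assumes row-major patch order and that all special tokens come before the patches; neither assumption is stated in this glossary, and the function name and parameters are hypothetical.

```python
def patch_index_to_grid(token_index, h_patches, w_patches, n_prefix_special=1):
    """Map a flat token index to (row, col) on the patch grid.

    Assumes (for illustration only) that patches are stored in row-major
    order and that n_prefix_special special tokens precede them.
    """
    patch_index = token_index - n_prefix_special
    if not 0 <= patch_index < h_patches * w_patches:
        raise ValueError("token_index does not refer to a patch")
    # divmod returns (row, col) for a row-major layout.
    return divmod(patch_index, w_patches)


# A 14 x 14 grid with one class token at position 0, so L = 1 + 196 = 197.
first = patch_index_to_grid(1, 14, 14)
second_row_start = patch_index_to_grid(15, 14, 14)
last = patch_index_to_grid(196, 14, 14)
print(first, second_row_start, last)
```

With a different layout (for example, special tokens appended at the end, or a video grid with shape (T, H, W)), the offset and the index arithmetic would need to change accordingly.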