Glossary¶

Definitions for words used in the code and documentation.

example: one dataset item (image, sentence, audio clip, point cloud, graph instance).
token: one model position in the encoder’s residual stream (the thing with hidden size d_model). Always "token" inside the model.
content token: tokens derived from the raw input (image patches, wordpieces, audio windows, nodes, etc.).
special token: tokens not directly derived from the raw input (class/summary token, [SEP], [MASK], [PAD], register tokens, etc.).
sequence length L: total tokens per example (content + special). If variable, call it “ragged”.
layer: an integer index into the encoder’s stack.
activation kind (optional but useful): which stream you saved (e.g., resid_pre, resid_post, mlp_out, attn_out, qkv, head_out).

Modality-specific vocab:

patch (vision): a 2D content token. Often laid out on a grid with shape (H_patches, W_patches).
frame/token or tube (video): content token in time × space; often (T, H, W).
wordpiece / subword (text): content token from a tokenizer.
window / frame (audio): time–frequency window.
node (graph), point (point cloud).