Storage & Run Manifest Spec (v1)
There are two main locations:
$SAEV_SCRATCH/saev/shards
: where we store transformer activations (referred to as shards_root in the codebase).
$SAEV_NFS/saev/runs
: where we store checkpoints and other computed intermediates such as example images, probe1d results, etc. (referred to as runs_root in the codebase).
Visually, these are:
$SAEV_SCRATCH/saev/
  shards/
    <shard_hash>/
      metadata.json
      shards.json
      acts000000.bin
      acts000001.bin
      ...
      labels.bin
and
$SAEV_NFS/saev/
  runs/
    <run_id>/
      checkpoint/       # output of train.py on <shard_hash>
        sae.pt
        config.json
      links/            # Symlinks
        train-shards    # $SAEV_SCRATCH/saev/shards/<shard_hash>
        train-dataset   # Whatever the original image dataset was
        val-shards      # $SAEV_SCRATCH/saev/shards/<shard_hash>
        val-dataset     # Whatever the original image dataset was
      inference/        # outputs from dump.py
        <shard_hash>/
          config.json
          patch_acts.npz
          visuals/      # output of visuals.py
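As a rough illustration of the run layout above, the sketch below creates the top-level run directories and the four symlinks under links/. The function name init_run and its parameters are hypothetical and not part of saev; how train.py actually creates a run is not specified here.

```python
import pathlib


def init_run(
    run_root: pathlib.Path,
    train_shards: pathlib.Path,
    train_dataset: pathlib.Path,
    val_shards: pathlib.Path,
    val_dataset: pathlib.Path,
) -> None:
    """Create checkpoint/, links/ and inference/ plus the four links (hypothetical helper)."""
    for sub in ("checkpoint", "links", "inference"):
        (run_root / sub).mkdir(parents=True, exist_ok=True)

    # Link names follow the tree above; the targets are supplied by the caller.
    targets = {
        "train-shards": train_shards,
        "train-dataset": train_dataset,
        "val-shards": val_shards,
        "val-dataset": val_dataset,
    }
    for name, target in targets.items():
        link = run_root / "links" / name
        if not link.is_symlink():  # leave existing links untouched
            link.symlink_to(target, target_is_directory=True)
```

Calling the function a second time is a no-op, which keeps it safe to run at the start of every job.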
Each $SAEV_SCRATCH/saev/shards/<shard_hash>/ MUST include:
- metadata.json (UTF-8, canonical spec; see protocol.md)
- shards.json (UTF-8, shard index and sizes; see protocol.md)
- acts*.bin (binary shards; format in protocol.md)
- labels.bin (binary patch labels aligned to shards; format in protocol.md)
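The required-files list above can be checked mechanically. The following is a minimal sketch, assuming only the file names given in this spec; the helper missing_shard_files is hypothetical, and it checks presence only, not the formats described in protocol.md.

```python
import pathlib

# Names taken from the MUST list; acts*.bin is a glob because the shard count varies.
REQUIRED_FILES = ("metadata.json", "shards.json", "labels.bin")
REQUIRED_GLOBS = ("acts*.bin",)


def missing_shard_files(shard_dir: pathlib.Path) -> list[str]:
    """Return the required names or patterns that are absent from shard_dir."""
    missing = [name for name in REQUIRED_FILES if not (shard_dir / name).is_file()]
    missing += [pat for pat in REQUIRED_GLOBS if not any(shard_dir.glob(pat))]
    return missing
```

An empty return value means the directory has every required entry.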
Note
Immutability: Files under saev/shards/<shard_hash>/ MUST be treated as read-only after publication. Any change yields a new shard_hash.
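The spec does not say how read-only status is enforced. One possible convention, shown here purely as an assumption, is to clear the write bits on every file once a shard directory is published. The helper mark_read_only is hypothetical.

```python
import pathlib
import stat


def mark_read_only(shard_dir: pathlib.Path) -> None:
    """Clear the user, group and other write bits on every regular file under shard_dir."""
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    for path in shard_dir.rglob("*"):
        if path.is_file() and not path.is_symlink():
            path.chmod(path.stat().st_mode & ~write_bits)
```

This only guards against accidental writes; a user who owns the files can still change the permissions back.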
All CLI entrypoints should accept a single --run <path> argument.
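A minimal sketch of such an entrypoint follows. It uses argparse from the standard library for illustration only; the argument parser saev actually uses is not stated in this spec, and the function name parse_run is hypothetical.

```python
import argparse
import pathlib
from typing import Optional, Sequence


def parse_run(argv: Optional[Sequence[str]] = None) -> pathlib.Path:
    """Parse the single --run argument and return it as a path."""
    parser = argparse.ArgumentParser(description="Hypothetical saev entrypoint.")
    parser.add_argument(
        "--run",
        type=pathlib.Path,
        required=True,
        help="Path to the run root; all other paths are derived from it.",
    )
    args = parser.parse_args(argv)
    return args.run
```

Passing argv=None makes argparse read sys.argv, so the same function works both from the command line and in tests.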
Every other path MUST be resolved from the run root:
- ViT activations: links/shards → saev/shards/<shard_hash>
- Dataset: links/dataset → dataset root, wherever it is on disk.
- SAE checkpoint: checkpoint/sae.pt
Example resolution:
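import pathlib

# cfg is the parsed CLI config; cfg.run holds the value of --run.
run = pathlib.Path(cfg.run)
shards_root = (run / "links" / "shards").resolve()
dataset_root = (run / "links" / "dataset").resolve()
ckpt = run / "checkpoint" / "sae.pt"
# Patch labels sit next to the activation shards.
labels = shards_root / "labels.bin"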
$SAEV_SCRATCH and $SAEV_NFS should be set for all users/processes running saev tools.
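A tool could fail early with a clear message when either variable is missing. The sketch below assumes only the two variable names and the saev/shards and saev/runs subpaths given in this document; the helper storage_roots is hypothetical.

```python
import os
import pathlib
from typing import Mapping


def storage_roots(
    env: Mapping[str, str] = os.environ,
) -> tuple[pathlib.Path, pathlib.Path]:
    """Return (shards_root, runs_root), raising if a variable is unset or empty."""
    missing = [name for name in ("SAEV_SCRATCH", "SAEV_NFS") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Set {', '.join(missing)} before running saev tools.")
    shards_root = pathlib.Path(env["SAEV_SCRATCH"]) / "saev" / "shards"
    runs_root = pathlib.Path(env["SAEV_NFS"]) / "saev" / "runs"
    return shards_root, runs_root
```

Taking the environment as a parameter keeps the function easy to test without touching the real process environment.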
FAQs
- Where do patch labels live? Next to acts*.bin in $SAEV_SCRATCH/saev/shards/<shard_hash>/labels.bin. Scripts discover them via links/shards/labels.bin.
- Can I put datasets directly in $SAEV_SCRATCH? Sure, but not in $SAEV_SCRATCH/saev/shards.