
Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models

Samuel Stevens, Wei-Lun (Harry) Chao, Tanya Berger-Wolf, Yu Su

The Ohio State University

stevens.994@osu.edu


To understand vision models, we must not only interpret their features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model control that is not itself interpretable. We apply sparse autoencoders (SAEs) to state-of-the-art vision models and reveal key differences in the abstractions each model learns. We also show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior.

Sparse autoencoders (SAEs) trained on pre-trained ViT activations discover a wide spread of features, spanning both visual patterns and semantic structures. We show eight features from an SAE trained on ImageNet-1K activations of a CLIP-trained ViT-B/16.

Demos

There are two web-based demos that replicate the core findings from Section 5 of our preprint; a conceptual sketch of the intervention they rely on follows the list.

  1. Interpreting bird classifications (Section 5.1)
  2. Interpreting semantic segmentation (Section 5.2)
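Conceptually, both demos rest on the same kind of intervention: encode a ViT activation with the trained SAE, edit one feature, decode, and pass the edited activation to the frozen downstream head. The sketch below is hypothetical; every name in it is a placeholder rather than the demos' actual code.

# Hypothetical sketch of the intervention; every name here is a placeholder,
# not the demos' actual code.
import torch


@torch.no_grad()
def suppress_feature(sae, acts: torch.Tensor, feature_idx: int) -> torch.Tensor:
    """Zero out one SAE feature and return the edited ViT activations."""
    f = torch.relu(sae.encoder(acts))  # sparse feature activations
    f[..., feature_idx] = 0.0          # suppress the chosen feature
    return sae.decoder(f)              # edited activations, same shape as acts


# The edited activations are then fed to the frozen downstream head
# (a classifier or segmentation decoder) and its predictions are compared
# against the unedited baseline, e.g.:
#   edited = suppress_feature(sae, vit_acts, feature_idx=1234)
#   preds = downstream_head(edited)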

saev

saev is a package for training sparse autoencoders (SAEs) on vision transformers (ViTs) in PyTorch. It also includes some interactive demos for scientifically rigorous interpretation of ViTs.
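For intuition, here is a minimal, self-contained sketch of the kind of model saev trains: a standard ReLU sparse autoencoder with an L1 sparsity penalty, fit to reconstruct ViT activations. The names, dimensions, and hyperparameters below are illustrative placeholders, not saev's actual API or configuration.

# Illustrative sketch only: a standard ReLU sparse autoencoder with an L1
# sparsity penalty, trained to reconstruct ViT activations. Names, dimensions,
# and hyperparameters are placeholders, not saev's actual API.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


class SparseAutoencoder(nn.Module):
    def __init__(self, d_vit: int = 768, expansion: int = 16):
        super().__init__()
        self.encoder = nn.Linear(d_vit, d_vit * expansion)  # overcomplete dictionary
        self.decoder = nn.Linear(d_vit * expansion, d_vit)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f        # reconstruction and features


# Stand-in for cached ViT patch activations.
acts = torch.randn(4096, 768)
loader = DataLoader(TensorDataset(acts), batch_size=256, shuffle=True)

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 4e-4  # placeholder sparsity weight

for (x,) in loader:
    x_hat, f = sae(x)
    # Reconstruction error plus L1 penalty on the feature activations.
    loss = (x_hat - x).pow(2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()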

API reference docs are available below, as well as the source code on GitHub.

API Docs

We provide auto-generated API docs based on docstrings, along with a guide to getting started. You can also copy-paste the entire package (via llms.txt) into a long-context model and ask it questions.

  1. saev: A package to train SAEs for vision transformers in PyTorch.
  2. contrib: Sub-packages for doing interesting things with the trained SAEs.

References & Citations

Please cite our preprint or our code, depending on which is most valuable to your work.

Code:

@software{stevens2025saev,
    title = {{saev}},
    author = {Samuel Stevens and Wei-Lun Chao and Tanya Berger-Wolf and Yu Su},
    license = {MIT},
    url = {https://github.com/osu-nlp-group/saev}
}

Preprint:

@misc{stevens2025rigorous,
    title={Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models}, 
    author={Samuel Stevens and Wei-Lun Chao and Tanya Berger-Wolf and Yu Su},
    year={2025},
    eprint={2502.06755},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2502.06755}, 
}

Acknowledgements

We would like to thank our colleagues from the Imageomics Institute, the ABC Center, and the OSU NLP group for their insightful and valuable feedback.

This work was supported by the Imageomics Institute, which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under Award #2118240 (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.