Natural Language Autoencoders · follow-up series
21·d
Verbalizations vs top SAE features
The same 1000 activations, read two ways: the NLA's prose explanation next to the SAE's top-10 features. Corrected rerun of 020·d with 021 artifacts.
- substrate: 1000 gold activations, UFW en 100k–100.2k
- verbalizations: 021 baseline decodes (greedy)
- features: gemma-scope-2 L40 16k, encoded at L41 (gap noted)
- experiment: 021_interp_combos
Caveat: site gap. SAE features are encoded at the layers.40 output (hidden_states[41]); the NLA verbalizations are decoded from hidden_states[42], one block later. The features shown therefore describe a site one block earlier than the one the NLA reads. Expect some systematic differences that reflect this gap, not genuine disagreement.
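The index arithmetic behind the caveat can be sketched as follows. This assumes the common Hugging Face `transformers` convention in which `hidden_states[0]` is the embedding output, so the output of block `i` sits at `hidden_states[i + 1]`. The helper and constant names are hypothetical and are not taken from the experiment code.

```python
def hidden_state_index(block_idx: int) -> int:
    """Index into an HF-style `hidden_states` tuple that holds the output of block `block_idx`.

    Assumption: hidden_states[0] is the embedding output, so block i's output is at i + 1.
    """
    if block_idx < 0:
        raise ValueError("block_idx must be non-negative")
    return block_idx + 1


SAE_BLOCK = 40          # SAE features are encoded at the layers.40 output
NLA_HIDDEN_STATE = 42   # NLA verbalizations are decoded from hidden_states[42]

sae_hidden_state = hidden_state_index(SAE_BLOCK)       # hidden_states index of the SAE site
gap_in_blocks = NLA_HIDDEN_STATE - sae_hidden_state    # how many blocks separate the two sites

print(f"SAE site: hidden_states[{sae_hidden_state}], gap: {gap_in_blocks} block(s)")
```

Under that convention the two sites differ by exactly one block, which matches the gap described above.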
§1
Side by side
Context ends at the highlighted extraction token. Feature labels are Neuronpedia autointerp where fetched, logit-lens tokens otherwise; hover a feature line for its max-activating corpus example. Note the systematic differences in kind: the NLA narrates document context and predicts continuation; SAE features mark local token-level properties.
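For readers who want to reproduce the feature column, here is a minimal sketch of picking the top-10 features from one activation's SAE encoding. It assumes the encoding is already available as a dense list of non-negative feature activations. The function name and the toy values are illustrative only and are not part of the 021 artifacts.

```python
import heapq


def top_k_features(feature_acts, k=10):
    """Return up to k (feature_index, activation) pairs, strongest first.

    Features with activation <= 0 are treated as inactive and skipped.
    """
    active = ((i, a) for i, a in enumerate(feature_acts) if a > 0.0)
    return heapq.nlargest(k, active, key=lambda pair: pair[1])


# Toy encoding with 16 features (a real 16k SAE would have far more, mostly zero).
toy_acts = [0.0, 3.2, 0.0, 0.5, 0.0, 0.0, 7.1, 0.0,
            0.0, 1.4, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]

top3 = top_k_features(toy_acts, k=3)
print(top3)
```

When fewer than `k` features are active, the function returns only the active ones, so a sparse encoding can yield a list shorter than ten.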