Natural Language Autoencoders · follow-up series
21·d

Verbalizations vs top SAE features

The same 1000 activations, read two ways: the NLA's prose explanation next to the SAE's top-10 features. Corrected rerun of 020·d with 021 artifacts.

substrate: 1000 gold activations, UFW en 100k–100.2k
verbalizations: 021 baseline decodes (greedy)
features: gemma-scope-2 L40 16k, encoded at L41 (gap noted)
experiment: 021_interp_combos
Caveat: site gap

SAE features are encoded at the layers.40 output (hidden_states[41]); the NLA verbalizations are decoded from hidden_states[42], one block later. The features shown are therefore the SAE's best approximation, taken one block earlier than the site the NLA decodes. Expect some systematic differences that reflect this gap rather than genuine disagreement.
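The index arithmetic behind this caveat can be sketched as follows. The sketch assumes the common convention in which hidden_states[0] is the embedding output and the output of layers.i sits at hidden_states[i + 1]. The helper names are hypothetical and are not taken from the experiment code.

```python
def hidden_state_index(block: int) -> int:
    """Index into `hidden_states` that holds the output of decoder block `block`.

    Assumption: hidden_states[0] is the embedding output, so block i lands at i + 1.
    """
    if block < 0:
        raise ValueError("block index must be non-negative")
    return block + 1


def site_gap(sae_index: int, nla_index: int) -> int:
    """Number of decoder blocks between the SAE site and the NLA site."""
    return nla_index - sae_index


sae_site = hidden_state_index(40)  # output of layers.40, per the caveat
nla_site = 42                      # NLA decode site, as stated in the caveat
print(sae_site, site_gap(sae_site, nla_site))
```

Under that assumption the SAE site resolves to index 41 and the gap to one block, matching the caveat.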
§1

Side by side

Each context ends at the highlighted extraction token. Feature labels are Neuronpedia autointerp where fetched, logit-lens tokens otherwise; hover a feature line for its max-activating corpus example. Note the systematic difference in kind: the NLA narrates document context and predicts the continuation, while SAE features mark local token-level properties.
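A minimal sketch of how a top-10 feature list with this label fallback could be assembled. The function names, the dictionary layout, and the toy values are all hypothetical illustrations; the actual pipeline and the Neuronpedia fetch are not shown here.

```python
from typing import Dict, List, Sequence, Tuple


def top_k_features(acts: Sequence[float], k: int = 10) -> List[Tuple[int, float]]:
    """Return (feature_index, activation) pairs for the k largest activations."""
    ranked = sorted(enumerate(acts), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def feature_label(
    idx: int,
    autointerp: Dict[int, str],
    logit_lens: Dict[int, List[str]],
) -> str:
    """Prefer a fetched autointerp label; otherwise fall back to logit-lens tokens."""
    label = autointerp.get(idx)
    if label:
        return label
    tokens = logit_lens.get(idx, [])
    if tokens:
        return "logit lens: " + ", ".join(tokens)
    return f"feature {idx}"  # nothing available for this feature


# Toy data, invented purely for illustration.
acts = [0.0, 2.5, 0.1, 1.2]
autointerp = {1: "toy label A"}      # label was fetched for feature 1 only
logit_lens = {3: ["foo", "bar"]}     # fallback tokens for feature 3

top = top_k_features(acts, k=2)
labels = [feature_label(i, autointerp, logit_lens) for i, _ in top]
print(top, labels)
```

Keeping the fallback in one function makes it easy to tell from the rendered label which source a given feature description came from.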
