Verbalizations vs top SAE features

The same 1000 activations, read two ways: the NLA's prose explanation next to the SAE's top-10 features. Corrected rerun of 020·d with 021 artifacts.

substrate: 1000 gold activations, UFW en 100k–100.2k
verbalizations: 021 baseline decodes (greedy)
features: gemma-scope-2 L40 16k, encoded at hidden_states[41] (site gap noted below)
experiment: 021_interp_combos

Caveat (site gap): SAE features are encoded at the layers.40 output (hidden_states[41]); the NLA verbalizations are decoded from hidden_states[42], one block later. The features shown are therefore the SAE's reading of a site one block earlier than the one the NLA describes, so expect some systematic differences that reflect the gap rather than genuine disagreement.
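The index arithmetic behind the caveat can be made explicit. This is a minimal sketch, assuming the Hugging Face convention in which hidden_states[0] is the embedding output, so the output of decoder block k sits at hidden_states[k + 1]. The helper name and constants are hypothetical and are not taken from the 021 code.

```python
def hidden_state_index(block_idx: int) -> int:
    """Index into hidden_states for the output of decoder block `block_idx`.

    Assumption: hidden_states[0] is the embedding output, so the output of
    block k is stored at hidden_states[k + 1].
    """
    if block_idx < 0:
        raise ValueError("block_idx must be non-negative")
    return block_idx + 1


# Site where the SAE features are encoded: the layers.40 output.
SAE_SITE = hidden_state_index(40)

# Site the NLA verbalizations are decoded from, as stated in the caveat.
NLA_SITE = 42

# Number of blocks separating the two sites.
SITE_GAP_BLOCKS = NLA_SITE - SAE_SITE

print(f"SAE site: hidden_states[{SAE_SITE}]")
print(f"NLA site: hidden_states[{NLA_SITE}]")
print(f"gap: {SITE_GAP_BLOCKS} block(s)")
```

Under that convention the SAE reads hidden_states[41] and the NLA reads hidden_states[42], which is the one-block gap described above.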

§1

Side by side

no scoring this phase: read only

Context ends at the highlighted extraction token. Feature labels are Neuronpedia autointerp where fetched, logit-lens tokens otherwise; hover a feature line for its max-activating corpus example. Note the systematic differences in kind: the NLA narrates document context and predicts continuation; SAE features mark local token-level properties.
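The feature column described above can be sketched as a top-k selection followed by a label fallback. This is an illustration only: the function names, the dictionary layout, and the toy labels are hypothetical, and the actual 021 pipeline may select and label features differently.

```python
import heapq


def top_features(acts, k=10):
    """Return (feature_id, activation) pairs for the k largest positive activations."""
    active = [(i, a) for i, a in enumerate(acts) if a > 0]
    return heapq.nlargest(k, active, key=lambda pair: pair[1])


def label_for(feature_id, autointerp, logit_lens):
    """Prefer a fetched autointerp label; otherwise fall back to logit-lens tokens."""
    label = autointerp.get(feature_id)
    if label:
        return label
    tokens = logit_lens.get(feature_id, [])
    return "logit lens: " + ", ".join(tokens) if tokens else "(unlabeled)"


# Toy data, invented for illustration (not real feature labels).
acts = [0.0, 2.5, 0.0, 0.7, 4.1]
autointerp = {4: "closing quotation marks"}   # labels that were fetched
logit_lens = {1: ["the", "a"], 3: ["said"]}   # fallback token lists

for fid, act in top_features(acts, k=2):
    print(fid, act, label_for(fid, autointerp, logit_lens))
```

Keeping the fallback in one function makes it easy to see, per feature, whether a label came from autointerp or from the logit lens.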
