← NLA experimentsNatural Language Autoencoders · follow-up series
21·b

Matched SAE features through the AV

GemmaScope-2 (L40, 16k, l0=60) decoder directions nearest each diff-of-means axis, run through the same two cells as the dom vectors.

sae
gemma-scope-2-27b-it · resid_post L40 16k medium
matching
cos(dom, W_dec rows), top-3 per axis
features
23
experiment
021_interp_combos
Caveat — site gapThe SAE is trained at layers.40 output (hidden_states[41]). The NLA reads hidden_states[42], one block downstream. Every cosine here is depressed by this one-block gap — low-cos matches are expected, not a quality issue. The matches are still the SAE's best available handle on each axis; reading their decodes tells you what the dictionary has where the contrast lives.
§1

The matches

Cosine between each dom vector (L41) and the SAE decoder rows (L40 — one block earlier). Low-cos matches are kept on purpose: they are the dictionary's closest approximation, and the decode comparison is still informative.

axisf₁ (cos)f₂ (cos)f₃ (cos)
eval_awarenessf808 (0.512)f428 (0.508)f304 (0.498)
deceptionf7532 (0.350) — just askingartpatriotsmanlyf507 (0.276)f30 (0.257) — the followed by a noun
refusalf55 (0.488) — options and numbersf15736 (0.468) — list items *f1657 (0.458) — word beginnings and endings
sandbaggingf372 (0.342)f345 (0.317) — # followed by Output or command IDf182 (0.314) — feelings and states of being
piratef6425 (0.430) — pirate and nautical speechf372 (0.414)f396 (0.399) — established record Guinness
angerf372 (0.306)f0 (0.293) — lists with numbers and bullet pointsf53 (0.293) — possessive pronouns linking to personal
fearf0 (0.485) — lists with numbers and bullet pointsf396 (0.480) — established record Guinnessf273 (0.478)
joyf372 (0.512)f0 (0.499) — lists with numbers and bullet pointsf484 (0.494) — a or the followed by a word
sadnessf0 (0.447) — lists with numbers and bullet pointsf273 (0.444)f396 (0.444) — established record Guinness
disgustf0 (0.408) — lists with numbers and bullet pointsf273 (0.406)f316 (0.406)
frenchf269 (0.808)f355 (0.808)f220 (0.801)
germanf355 (0.771)f127 (0.763) — Spanish and German word beginningsf220 (0.763)
russianf355 (0.785)f269 (0.778)f220 (0.775)

labels: Neuronpedia autointerp where available; logit-lens tokens shown per feature below. Cosines depressed by one-block site gap.

§2

Decodes

Feature directions get the full Cell B mult sweep; Cell C runs the same scales as the dom vectors. Compare against the same axis on the 21·a page. Degeneration chips on all mult/scale groups; collapsed cells kept and labeled.