Matched SAE features through the AV

GemmaScope-2 (L40, 16k, l0=60) decoder directions nearest each diff-of-means axis, run through the same two cells as the dom vectors.

sae: gemma-scope-2-27b-it · resid_post L40 16k medium
matching: cos(dom, W_dec rows), top-3 per axis
features: 23
experiment: 021_interp_combos

Caveat — site gap —The SAE is trained at layers.40 output (hidden_states[41]). The NLA reads hidden_states[42], one block downstream. Every cosine here is depressed by this one-block gap — low-cos matches are expected, not a quality issue. The matches are still the SAE's best available handle on each axis; reading their decodes tells you what the dictionary has where the contrast lives.

§1

The matches

low cos expected — site gap

Cosine between each dom vector (L41) and the SAE decoder rows (L40 — one block earlier). Low-cos matches are kept on purpose: they are the dictionary's closest approximation, and the decode comparison is still informative.

axis	f₁ (cos)	f₂ (cos)	f₃ (cos)
eval_awareness	f808 (0.512)	f428 (0.508)	f304 (0.498)
deception	f7532 (0.350) — just askingartpatriotsmanly	f507 (0.276)	f30 (0.257) — the followed by a noun
refusal	f55 (0.488) — options and numbers	f15736 (0.468) — list items *	f1657 (0.458) — word beginnings and endings
sandbagging	f372 (0.342)	f345 (0.317) — # followed by Output or command ID	f182 (0.314) — feelings and states of being
pirate	f6425 (0.430) — pirate and nautical speech	f372 (0.414)	f396 (0.399) — established record Guinness
anger	f372 (0.306)	f0 (0.293) — lists with numbers and bullet points	f53 (0.293) — possessive pronouns linking to personal
fear	f0 (0.485) — lists with numbers and bullet points	f396 (0.480) — established record Guinness	f273 (0.478)
joy	f372 (0.512)	f0 (0.499) — lists with numbers and bullet points	f484 (0.494) — a or the followed by a word
sadness	f0 (0.447) — lists with numbers and bullet points	f273 (0.444)	f396 (0.444) — established record Guinness
disgust	f0 (0.408) — lists with numbers and bullet points	f273 (0.406)	f316 (0.406)
french	f269 (0.808)	f355 (0.808)	f220 (0.801)
german	f355 (0.771)	f127 (0.763) — Spanish and German word beginnings	f220 (0.763)
russian	f355 (0.785)	f269 (0.778)	f220 (0.775)

labels: Neuronpedia autointerp where available; logit-lens tokens shown per feature below. Cosines depressed by one-block site gap.

§2

Decodes

same cells as 21·a — compare directly

Feature directions get the full Cell B mult sweep; Cell C runs the same scales as the dom vectors. Compare against the same axis on the 21·a page. Degeneration chips on all mult/scale groups; collapsed cells kept and labeled.