Natural Language Autoencoders · follow-up series
21·b
Matched SAE features through the AV
GemmaScope-2 (L40, 16k, l0=60) decoder directions nearest each diff-of-means (dom) axis, run through the same two cells (Cell B and Cell C) as the dom vectors.
- sae: gemma-scope-2-27b-it · resid_post L40 16k medium
- matching: cos(dom, W_dec rows), top-3 per axis
- features: 23 (distinct features across the 13 × 3 matches)
- experiment: 021_interp_combos
Caveat (site gap): The SAE is trained at the layers.40 output (hidden_states[41]). The NLA reads hidden_states[42], one block downstream. Every cosine here is depressed by this one-block gap, so low-cos matches are expected and do not indicate a quality issue. The matches are still the SAE's best available handle on each axis; reading their decodes shows what the dictionary has where the contrast lives.
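The indexing behind this caveat can be sketched as follows. It assumes the common Hugging Face-style convention in which hidden_states[0] is the embedding output and hidden_states[i + 1] is the output of layers.i, which agrees with the layers.40 / hidden_states[41] pairing stated in the caveat. The helper name is hypothetical and not part of the experiment code.

```python
def hidden_state_index(layer: int) -> int:
    """Index into hidden_states for the output of `layers.<layer>`.

    Assumption: hidden_states[0] is the embedding output, so each block's
    output sits one position later than its layer number.
    """
    if layer < 0:
        raise ValueError("layer must be non-negative")
    return layer + 1


SAE_LAYER = 40        # GemmaScope-2 resid_post site, as stated in the caveat
NLA_READ_INDEX = 42   # hidden_states index the NLA reads, as stated in the caveat

sae_index = hidden_state_index(SAE_LAYER)   # where the SAE was trained
nla_layer = NLA_READ_INDEX - 1              # block whose output the NLA reads
block_gap = NLA_READ_INDEX - sae_index      # number of blocks between the two sites

print(f"SAE site: hidden_states[{sae_index}]")
print(f"NLA site: hidden_states[{NLA_READ_INDEX}] (layers.{nla_layer} output)")
print(f"gap: {block_gap} block(s)")
```

Under that convention the dom vectors live at the layers.41 output while the decoder rows were fit at the layers.40 output, which is the one-block gap the caveat describes.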
§1
The matches
Cosine between each dom vector (L41) and the SAE decoder rows (L40, one block earlier); the top 3 rows per axis are shown. Low-cos matches are kept on purpose: they are the dictionary's closest approximation, and the decode comparison is still informative.
| axis | f₁ (cos) | f₂ (cos) | f₃ (cos) |
|---|---|---|---|
| eval_awareness | f808 (0.512) | f428 (0.508) | f304 (0.498) |
| deception | f7532 (0.350) — just askingartpatriotsmanly | f507 (0.276) | f30 (0.257) — the followed by a noun |
| refusal | f55 (0.488) — options and numbers | f15736 (0.468) — list items * | f1657 (0.458) — word beginnings and endings |
| sandbagging | f372 (0.342) | f345 (0.317) — # followed by Output or command ID | f182 (0.314) — feelings and states of being |
| pirate | f6425 (0.430) — pirate and nautical speech | f372 (0.414) | f396 (0.399) — established record Guinness |
| anger | f372 (0.306) | f0 (0.293) — lists with numbers and bullet points | f53 (0.293) — possessive pronouns linking to personal |
| fear | f0 (0.485) — lists with numbers and bullet points | f396 (0.480) — established record Guinness | f273 (0.478) |
| joy | f372 (0.512) | f0 (0.499) — lists with numbers and bullet points | f484 (0.494) — a or the followed by a word |
| sadness | f0 (0.447) — lists with numbers and bullet points | f273 (0.444) | f396 (0.444) — established record Guinness |
| disgust | f0 (0.408) — lists with numbers and bullet points | f273 (0.406) | f316 (0.406) |
| french | f269 (0.808) | f355 (0.808) | f220 (0.801) |
| german | f355 (0.771) | f127 (0.763) — Spanish and German word beginnings | f220 (0.763) |
| russian | f355 (0.785) | f269 (0.778) | f220 (0.775) |
labels: Neuronpedia autointerp where available; logit-lens tokens shown per feature below. Cosines depressed by one-block site gap.
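The matching step behind the table (cosine between a dom vector and every row of W_dec, keeping the top 3) can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the experiment's own code: the function name, the toy 4×3 decoder matrix, and the toy dom vector are all hypothetical stand-ins for the real 16k-row decoder and the real diff-of-means vectors.

```python
import numpy as np


def top_k_decoder_matches(dom, W_dec, k=3):
    """Return [(feature_index, cosine)] for the k decoder rows closest to `dom`.

    dom   : 1-D array, one diff-of-means direction (d_model,)
    W_dec : 2-D array, one decoder direction per row (n_features, d_model)
    """
    dom = np.asarray(dom, dtype=np.float64)
    W_dec = np.asarray(W_dec, dtype=np.float64)

    # Normalise both sides so the dot product is a cosine similarity.
    dom_unit = dom / np.linalg.norm(dom)
    row_norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    W_unit = W_dec / np.clip(row_norms, 1e-12, None)  # guard against zero rows

    cos = W_unit @ dom_unit
    top = np.argsort(-cos)[:k]  # indices of the k largest cosines, descending
    return [(int(i), float(cos[i])) for i in top]


# Toy stand-ins (hypothetical values, chosen only so the result is easy to check).
W_dec_toy = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, -1.0],
])
dom_toy = np.array([1.0, 0.5, 0.0])

matches = top_k_decoder_matches(dom_toy, W_dec_toy, k=3)
for feature, cosine in matches:
    print(f"f{feature} ({cosine:.3f})")
```

Run once per axis, a routine like this yields the three columns of the table; the union of the returned indices across all axes gives the distinct feature set.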
§2
Decodes
Feature directions get the full Cell B mult sweep; Cell C runs the same scales as the dom vectors. Compare against the same axis on the 21·a page. Degeneration chips on all mult/scale groups; collapsed cells kept and labeled.