Diff-of-means vectors into the AV
13 contrast axes + a norm-matched random control, decoded alone and as steering interventions. Corrected rerun of exp 020 with wider coefficient sweeps and degeneration tags.
- battery: extracted
- cell B: 84 cells
- cell C: 112 cells
- experiment: 021_interp_combos
The battery
Each axis is the mean difference of L41 activations over its contrast pairs: ~200 per axis for the system-prompt contrasts (run over a shared user-turn bank), and 40 reading-flavor text pairs for each of the three languages. The control is the difference of two unrelated document activations, norm-matched to the battery median. Anything it produces below is what “a meaningless direction of the same size” looks like, unblinded by design.
| axis | ‖dom‖ | n pairs | top SAE match (cos) |
|---|---|---|---|
| eval_awareness | 1113 | 200 | f808 (0.512) |
| deception | 2715 | 200 | f7532 (0.350) |
| refusal | 3895 | 200 | f55 (0.488) |
| sandbagging | 3318 | 200 | f372 (0.342) |
| pirate | 4827 | 200 | f6425 (0.430) |
| anger | 3819 | 200 | f372 (0.306) |
| fear | 4132 | 200 | f0 (0.485) |
| joy | 4470 | 200 | f372 (0.512) |
| sadness | 3971 | 200 | f0 (0.447) |
| disgust | 3833 | 200 | f0 (0.408) |
| french | 5600 | 40 | f269 (0.808) |
| german | 5673 | 40 | f355 (0.771) |
| russian | 5693 | 40 | f355 (0.785) |
| control | 3971 | — | — |
Anger axis vs. the validated f2796 decoder direction: cos = 0.224 (the one-block gap between the two sites depresses this). Median gold ‖h‖ = 58385.
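A minimal sketch of how a battery like this could be built, assuming activations are already cached as `(n_pairs, d_model)` arrays. The function names, the toy `d_model`, and the random stand-in activations are illustrative assumptions, not the experiment's code; only the ‖dom‖ values are copied from the table above.

```python
import numpy as np

def diff_of_means(pos, neg):
    """Mean activation difference over contrast pairs; inputs are (n_pairs, d_model)."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def norm_matched_control(doc_a, doc_b, target_norm):
    """Difference of two unrelated activations, rescaled to target_norm."""
    raw = doc_a - doc_b
    return raw * (target_norm / np.linalg.norm(raw))

def cosine(u, v):
    """Cosine similarity, as used for the SAE-match and f2796 comparisons."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Norms of the 13 contrast axes, copied from the table (control excluded).
AXIS_NORMS = {
    "eval_awareness": 1113, "deception": 2715, "refusal": 3895,
    "sandbagging": 3318, "pirate": 4827, "anger": 3819, "fear": 4132,
    "joy": 4470, "sadness": 3971, "disgust": 3833,
    "french": 5600, "german": 5673, "russian": 5693,
}
battery_median = float(np.median(list(AXIS_NORMS.values())))

# Toy stand-ins for cached activations (assumption: the real ones are L41 states).
rng = np.random.default_rng(0)
d_model = 64
pos = rng.normal(size=(200, d_model)) + 1.0
neg = rng.normal(size=(200, d_model))

axis = diff_of_means(pos, neg)
control = norm_matched_control(
    rng.normal(size=d_model), rng.normal(size=d_model), battery_median
)
print(battery_median, np.linalg.norm(control))
```

The median of the 13 tabulated norms is 3971, which matches the control row, so the control's size is consistent with the stated norm-matching.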
Decodes
Cell B injects the bare direction at the AV's vector slot (off-manifold by construction, since the AV never saw isolated directions in training). Cell C steers real document forwards upstream at layers.40, decodes what arrives at L41, and shows the behavioral completion at the same scale. Collapsed cells remain visible and labeled. Degeneration chips (coherent / degrading / collapsed) appear on each mult-group and scale button.
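For a sense of scale in cell C, here is a hedged sketch of additive steering and of how large the perturbation is relative to a typical hidden state. The additive form `h + mult * direction`, the function names, and the multipliers shown are assumptions for illustration; the page does not state the exact intervention formula or the sweep values. The two norms are taken from the battery section above.

```python
import numpy as np

MEDIAN_GOLD_H_NORM = 58385.0  # median gold ‖h‖ reported above
ANGER_NORM = 3819.0           # ‖dom‖ for the anger axis, from the table

def steer(h, direction, mult):
    """Additive steering (assumed form): shift every position by mult * direction."""
    return h + mult * direction

def relative_size(direction_norm, mult, h_norm=MEDIAN_GOLD_H_NORM):
    """Norm of the injected vector as a fraction of a typical hidden-state norm."""
    return abs(mult) * direction_norm / h_norm

# Toy hidden states: 5 positions, 16 dims (stand-ins for real layer-40 states).
rng = np.random.default_rng(0)
h = rng.normal(size=(5, 16))
d = rng.normal(size=16)
steered = steer(h, d, mult=4.0)

# Hypothetical multipliers, not the experiment's actual sweep.
ratios = {m: relative_size(ANGER_NORM, m) for m in (1, 4, 16)}
for m, r in ratios.items():
    print(f"mult={m:>2}: injected norm is {r:.3f} of median ‖h‖")
```

At mult = 1 the anger direction is about 6.5% of the median hidden-state norm, which is one way to read why wider coefficient sweeps are needed before a cell degrades or collapses.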