Natural Language Autoencoders · follow-up series
21·a

Diff-of-means vectors into the AV

13 contrast axes + a norm-matched random control, decoded alone and as steering interventions. Corrected rerun of exp 020 with wider coefficient sweeps and degeneration tags.

battery: extracted
cell B: 84 cells
cell C: 112 cells
experiment: 021_interp_combos
§1

The battery

Each axis is a mean difference of L41 activations over ~200 contrast pairs (system-prompt contrast over a shared user-turn bank; reading-flavor text pairs for the languages). The control is the difference of two unrelated document activations, norm-matched to the battery median — anything it produces below is what “a meaningless direction of the same size” looks like, unblinded by design.
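The construction above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the experiment's code: the function names, the toy `d_model`, and the random activations are all assumptions, and only the recipe (mean difference over contrast pairs, control rescaled to the battery's median norm) comes from the text.

```python
import numpy as np

def diff_of_means(pos, neg):
    """Mean difference over contrast pairs; pos and neg are (n_pairs, d_model)."""
    return pos.mean(axis=0) - neg.mean(axis=0)

def norm_matched_control(doc_a, doc_b, target_norm):
    """Difference of two unrelated activations, rescaled to target_norm."""
    raw = doc_a - doc_b
    return raw * (target_norm / np.linalg.norm(raw))

rng = np.random.default_rng(0)
d_model = 64  # toy width (assumption); the real residual stream is much wider

# Synthetic stand-ins for L41 activations on each side of a contrast.
axes = {}
for name, n_pairs in [("anger", 200), ("refusal", 200), ("french", 40)]:
    neg = rng.normal(size=(n_pairs, d_model))
    pos = neg + rng.normal(loc=0.5, size=(n_pairs, d_model))
    axes[name] = diff_of_means(pos, neg)

# The control is matched to the median norm across the battery.
battery_median = float(np.median([np.linalg.norm(v) for v in axes.values()]))
control = norm_matched_control(
    rng.normal(size=d_model), rng.normal(size=d_model), battery_median
)
```

Rescaling after taking the difference keeps the control's direction arbitrary while fixing its size, so any effect it shows can be attributed to magnitude alone.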

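| axis | ‖dom‖ | n pairs | top SAE match (cos) |
|---|---|---|---|
| eval_awareness | 1113 | 200 | f808 (0.512) |
| deception | 2715 | 200 | f7532 (0.350) |
| refusal | 3895 | 200 | f55 (0.488) |
| sandbagging | 3318 | 200 | f372 (0.342) |
| pirate | 4827 | 200 | f6425 (0.430) |
| anger | 3819 | 200 | f372 (0.306) |
| fear | 4132 | 200 | f0 (0.485) |
| joy | 4470 | 200 | f372 (0.512) |
| sadness | 3971 | 200 | f0 (0.447) |
| disgust | 3833 | 200 | f0 (0.408) |
| french | 5600 | 40 | f269 (0.808) |
| german | 5673 | 40 | f355 (0.771) |
| russian | 5693 | 40 | f355 (0.785) |
| control | 3971 | | |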

anger axis vs validated f2796 decoder direction: cos = 0.224 (one-block site gap depresses this). Median gold ‖h‖ = 58385.
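For reference, a "top SAE match (cos)" entry can be computed roughly as follows. This is a hedged sketch: it assumes the match is the decoder row with the highest signed cosine to the axis, and the decoder matrix, its shape, and the helper name `top_sae_match` are hypothetical stand-ins rather than the experiment's actual tooling.

```python
import numpy as np

def top_sae_match(axis, decoder):
    """Return (feature_index, cosine) for the decoder row best aligned with axis.

    decoder has shape (n_features, d_model); signed cosine is an assumption.
    """
    axis_unit = axis / np.linalg.norm(axis)
    rows_unit = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    cos = rows_unit @ axis_unit
    idx = int(np.argmax(cos))
    return idx, float(cos[idx])

rng = np.random.default_rng(0)
decoder = rng.normal(size=(512, 64))            # toy decoder (assumption)
axis = 3.0 * decoder[7] + rng.normal(size=64)   # mostly feature 7, plus noise

idx, cos = top_sae_match(axis, decoder)
print(f"f{idx} ({cos:.3f})")
```

Cosine ignores magnitude, which is why the table reports ‖dom‖ in a separate column.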

§2

Decodes

Cell B injects the bare direction at the AV's vector slot (off-manifold by construction: the AV never saw isolated directions in training). Cell C steers real document forward passes upstream at layers.40, decodes what arrives at L41, and shows the behavioral completion at the same scale. Collapsed cells remain visible and labeled. Degeneration chips (coherent / degrading / collapsed) appear on each mult-group and scale button.
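The cell C setup (add a scaled direction upstream, read one block later) can be illustrated with a toy residual block. Everything here is an assumption made for illustration: `toy_block` stands in for the real layer between the steering site and L41, the coefficients are arbitrary, and the sketch does not reproduce the degeneration tagging.

```python
import numpy as np

def steer(h, direction, coeff):
    """Add coeff * direction at every token position; h is (tokens, d_model)."""
    return h + coeff * direction

def toy_block(h, w):
    """Stand-in residual block, h + tanh(h @ w). Not the real model layer."""
    return h + np.tanh(h @ w)

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 5
w41 = rng.normal(scale=0.1, size=(d_model, d_model))  # toy weights
h40 = rng.normal(size=(n_tokens, d_model))            # toy upstream activations
direction = rng.normal(size=d_model)                  # toy steering axis

baseline = toy_block(h40, w41)

# Sweep coefficients and record how far the downstream read moves from baseline.
shifts = {}
for coeff in (0.0, 0.5, 1.0, 2.0):
    h41 = toy_block(steer(h40, direction, coeff), w41)
    shifts[coeff] = float(np.linalg.norm(h41 - baseline, axis=1).mean())
```

Reading one block downstream means the decoded vector reflects how the block transforms the intervention, not just the added direction itself.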