Natural Language Autoencoders · follow-up series
21·c

Residual analysis

Primary: drop-comparison encodes h and ĥ at the same L40 site so the site gap cancels — features strong in h but absent in ĥ are what the channel drops. Secondary: raw-residual enrichment is a labeled null (noise overlap 30/30).

substrate: 1000 gold + 021 AR reconstructions
sae: gemma-scope-2 L40 16k l0=60
primary method: drop-comparison (gap-cancelled)
experiment: 021_interp_combos
Caveat: site gap. The SAE is trained at the layers.40 output (hidden_states[41]); the NLA reads hidden_states[42], one block downstream. Experiment 021 does not correct this gap, and it is the honest reason fvu_h is elevated (~0.44 in 020) and the raw-residual methods are unreliable. The drop-comparison method (§2) encodes both h and ĥ at the same L40 site, so the gap affects both encodings equally and the primary result is valid. The raw-residual enrichment (§3) does not cancel the gap and is reported as a labeled null.
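The off-by-one in the caveat follows from an indexing convention that is assumed here: hidden_states[0] holds the embedding output, so the output of block i sits at hidden_states[i + 1]. The helper name below is hypothetical and the sketch only illustrates the bookkeeping.

```python
def hidden_state_index(block: int) -> int:
    """Index into hidden_states for the output of decoder block `block`.

    Assumes hidden_states[0] is the embedding output (an assumption about
    the pipeline, not something the report states explicitly).
    """
    if block < 0:
        raise ValueError("block index must be non-negative")
    return block + 1


SAE_SITE = hidden_state_index(40)      # layers.40 output -> hidden_states[41]
NLA_SITE = 42                          # site the NLA reads, per the caveat
SITE_GAP_BLOCKS = NLA_SITE - SAE_SITE  # number of blocks between the two sites
```

Under that convention the NLA site is exactly one block downstream of the SAE training site.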
§1

Sanity gate

The SAE was trained on full activations at layers.40. We hand it (a) gold activations from one block later (h), (b) AR reconstructions (ĥ), (c) residuals r = h − ĥ, and (d) Gaussian noise norm-matched to those residuals. FVU on r is on the same scale as FVU on noise (1554 vs 1453), which indicates that the raw-residual enrichment is an off-manifold artifact.
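The noise arm (d) can be sketched as follows. The report only says "norm-matched", so this assumes per-vector L2 matching: each Gaussian sample is rescaled to the norm of the corresponding residual. The function name is hypothetical and plain Python lists stand in for arrays.

```python
import math
import random


def norm_matched_noise(residuals, seed=0):
    """Return one Gaussian vector per residual, rescaled to the same L2 norm.

    Assumption: matching is done per vector, not on the dataset average.
    """
    rng = random.Random(seed)
    noise = []
    for r in residuals:
        target = math.sqrt(sum(x * x for x in r))
        g = [rng.gauss(0.0, 1.0) for _ in r]
        g_norm = math.sqrt(sum(x * x for x in g))
        scale = target / g_norm if g_norm > 0 else 0.0
        noise.append([x * scale for x in g])
    return noise
```

The noise vectors then share the residuals' magnitudes but carry no structure, which is what makes them a useful control.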

input | FVU | L0
h | 0.44 | 64 (trained l0 60)
ĥ | 0.39 | 48
r | 1554 | 905
noise | 1453 | 886
Finding: FVU on h (~0.44) is elevated relative to the paper's ~0.08 in-distribution number, as expected from the one-block site gap. FVU on r (1554) and on noise (1453) are on the same scale, so for this SAE the residual is about as far off-manifold as norm-matched noise.
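For reference, a minimal sketch of the two gate metrics. The report does not state its exact normalization, so this uses one common definition of FVU: squared reconstruction error divided by squared deviation from the dataset mean. L0 is the mean number of active features per example. Function names are hypothetical.

```python
def fvu(x, x_hat):
    """Fraction of variance unexplained: SSE / total variance around the mean."""
    n, d = len(x), len(x[0])
    mean = [sum(row[j] for row in x) / n for j in range(d)]
    sse = sum((a - b) ** 2 for row, rec in zip(x, x_hat) for a, b in zip(row, rec))
    sst = sum((a - m) ** 2 for row in x for a, m in zip(row, mean))
    if sst == 0:
        raise ValueError("inputs have zero variance; FVU is undefined")
    return sse / sst


def mean_l0(codes):
    """Average count of strictly positive feature activations per example."""
    return sum(sum(1 for a in row if a > 0) for row in codes) / len(codes)
```

Under this definition a value above 1 means the reconstruction is worse than predicting the dataset mean, which is how values in the thousands can arise for inputs the SAE was never trained on.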
§2

Drop-comparison (primary)

Method: encode h and ĥ separately at the same L40 site (passing each through the SAE directly, not via r = h − ĥ). The site gap applies equally to both encodings, so it cancels. Features present in the h encoding but absent or weak in the ĥ encoding are what the text channel drops during AR reconstruction.

Summary: separate-encode h vs ĥ at the same L40 site (gap cancels); primary residual method for 021.

FVU encode(h): 0.44 (same site as ĥ)
FVU encode(ĥ): 0.39
features in h: 510 (min freq 0.02)
top dropped: 30
feature | label / logit-lens | freq h | freq ĥ | mean act h | mean act ĥ | drop ratio
f3726 | operation finishes | 0.075 | 0.000 | 46.5 | 0.0 | 76.0
f3023 | lists followed by and/or to | 0.039 | 0.000 | 22.8 | 0.0 | 40.0
f2384 | 上記の 上記 उपरोक्त ছাড়াও | 0.053 | 0.001 | 30.6 | 0.5 | 27.0
f8485 | code or special characters | 0.022 | 0.000 | 15.9 | 0.0 | 23.0
f8625 | order status | 0.028 | 0.001 | 17.7 | 0.5 | 14.5
f3819 | again | 0.037 | 0.002 | 15.9 | 0.6 | 12.7
f12505 | complex numbers | 0.023 | 0.002 | 12.3 | 1.4 | 8.0
f1632 | closing punctuation marks | 0.031 | 0.005 | 14.5 | 2.3 | 5.3
f15243 | military alliances or problems | 0.055 | 0.010 | 30.1 | 4.3 | 5.1
f13871 | code snippets and multi-language keywords | 0.029 | 0.005 | 15.4 | 3.0 | 5.0
f2303 | hon'ble legal context | 0.020 | 0.004 | 10.0 | 2.0 | 4.2
f4025 | oxygen | 0.020 | 0.004 | 10.8 | 1.8 | 4.2
f9263 | foreign and multilingual references | 0.056 | 0.013 | 41.3 | 8.7 | 4.1
f4120 | personal and meaningful connection | 0.029 | 0.007 | 13.3 | 3.1 | 3.8
f1248 | yine again Again again | 0.021 | 0.005 | 10.2 | 2.0 | 3.7
f2394 | oretically 并不是 *, , | 0.042 | 0.011 | 24.5 | 6.4 | 3.6
f6169 | lists of things like languages or services | 0.023 | 0.006 | 12.9 | 4.4 | 3.4
f2186 | various forms of likewise | 0.025 | 0.007 | 12.4 | 2.8 | 3.3
f15222 | the daycare | 0.023 | 0.007 | 13.0 | 3.3 | 3.0
f2211 | edge devices and entering hurricanes | 0.045 | 0.015 | 24.1 | 7.5 | 2.9
f13181 | DSP, CAP, acronyms | 0.035 | 0.012 | 24.3 | 8.6 | 2.8
f2714 | React hooks and technologies | 0.032 | 0.011 | 18.7 | 6.6 | 2.8
f4151 | technical terms and diverse languages | 0.021 | 0.007 | 12.1 | 5.5 | 2.8
f3265 | goodness | 0.048 | 0.017 | 24.8 | 8.7 | 2.7
f3360 | start of critical or related phrases | 0.026 | 0.009 | 13.3 | 4.6 | 2.7
f12830 | called or so-called | 0.023 | 0.008 | 14.0 | 4.8 | 2.7
f15301 | month, form, threat | 0.028 | 0.010 | 13.3 | 3.8 | 2.6
f2862 | typical Typical typical Typical | 0.020 | 0.007 | 12.8 | 4.9 | 2.6
f9517 | . \n | 0.022 | 0.008 | 9.1 | 3.6 | 2.6
f6727 | other languages | 0.027 | 0.010 | 15.4 | 5.9 | 2.5

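Features strong in h (freq_h ≥ min_freq_h = 0.02) but absent or weak in ĥ. The stated definition is drop_ratio = mean_act_h / max(mean_act_hhat, ε). The ratios tabulated above, however, are reproduced row by row by the smoothed frequency ratio (freq_h + 0.001) / (freq_ĥ + 0.001), for example 0.076 / 0.001 = 76.0 for f3726, and not by the mean-activation ratio. Either way, a high ratio means the feature is strongly dropped by the channel.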
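The drop-comparison ranking can be sketched as below, following the formula stated in the caption. All names are hypothetical. The value of eps is an assumption, as is the choice to average activations only over examples where the feature fires. Inputs are per-example SAE codes for h and ĥ, encoded separately.

```python
def feature_stats(codes, n_features):
    """Per-feature firing frequency and mean activation over active examples."""
    counts = [0] * n_features
    totals = [0.0] * n_features
    for row in codes:
        for j, a in enumerate(row):
            if a > 0:
                counts[j] += 1
                totals[j] += a
    mean_act = [t / c if c else 0.0 for t, c in zip(totals, counts)]
    freq = [c / len(codes) for c in counts]
    return freq, mean_act


def top_dropped(codes_h, codes_hhat, min_freq_h=0.02, eps=1e-3, k=30):
    """Rank features that fire in h but are weak or absent in h-hat."""
    n_features = len(codes_h[0])
    freq_h, act_h = feature_stats(codes_h, n_features)
    freq_r, act_r = feature_stats(codes_hhat, n_features)
    rows = []
    for j in range(n_features):
        if freq_h[j] >= min_freq_h:
            ratio = act_h[j] / max(act_r[j], eps)  # caption's stated formula
            rows.append((j, freq_h[j], freq_r[j], act_h[j], act_r[j], ratio))
    rows.sort(key=lambda t: t[-1], reverse=True)
    return rows[:k]
```

Because both code matrices come from the same SAE at the same site, any systematic distortion from the site gap enters both sides of the ratio.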

§3

Raw-residual enrichment (labeled null)

This is a negative result (labeled null): SAE-encode the raw residual r = h − ĥ. The SAE encoder is not linear over differences, so this approach is half-baked regardless of site; it is kept for the record. The drop_compare method (separate-encode, §2) is the primary one.
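A toy illustration of why encoding a difference is not the same as differencing two encodings. The encoder here is a plain ReLU layer with made-up weights, an assumption for illustration only; it is not the actual SAE.

```python
def encode(x, W, b):
    """Toy encoder: ReLU(W x + b). Stands in for a real SAE encoder."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]


W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical encoder weights
b = [-1.0, -1.0]               # negative bias acts as a firing threshold
h = [3.0, 0.5]
h_hat = [2.5, 0.5]
r = [a - c for a, c in zip(h, h_hat)]

# Difference of encodings: feature 0 fires on both, with a 0.5 gap.
code_diff = [a - c for a, c in zip(encode(h, W, b), encode(h_hat, W, b))]
# Encoding of the difference: r is too small to cross the threshold at all.
code_of_diff = encode(r, W, b)
```

Here code_diff is [0.5, 0.0] while code_of_diff is [0.0, 0.0]. The threshold makes the two quantities disagree even in two dimensions.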

noise overlap: 30/30 (top-30 'enriched' features shared with noise arm)
active in r: 5192
active in h: 11153
active in noise: 4432
Finding: noise overlap is 30/30. The top-30 “enriched” features all fire on the noise arm too, so the enrichment table below measures off-manifold breakage, not what the channel drops. Use §2 (drop-comparison) for the actual result.
feature | label / logit-lens | enrich | freq r | freq h | freq noise | mean act
f169 | official/original sources | 48 | 1.00 | 0.02 | 1.00 | 986
f39 | numbers and units | 46 | 1.00 | 0.02 | 1.00 | 1881
f507 | camaraderie социа музей disgraced | 46 | 1.00 | 0.02 | 1.00 | 3441
f86 | 't or '2 | 46 | 1.00 | 0.02 | 1.00 | 1507
f826 | wired charging and plots | 45 | 1.00 | 0.02 | 1.00 | 1266
f711 | which and which valamint | 43 | 0.99 | 0.02 | 0.99 | 1558
f527 | code, numbers, ```` | 43 | 0.98 | 0.02 | 0.97 | 904
f403 | وغيرها 等等 इत्यादी 以及 | 42 | 1.00 | 0.02 | 1.00 | 1424
f400 | multitude plethora lot Vielzahl | 42 | 1.00 | 0.02 | 1.00 | 5096
f439 | K H G the | 42 | 1.00 | 0.02 | 1.00 | 1436
f1082 | generate revenue or transmit information | 40 | 1.00 | 0.02 | 1.00 | 1280
f348 | `` čne とはいえ izinsuku | 38 | 1.00 | 0.03 | 1.00 | 3829
f504 | ibid https formerly http | 37 | 1.00 | 0.03 | 1.00 | 1237
f21 | breakdown of why/how/what | 36 | 0.80 | 0.02 | 0.70 | 544
f7513 | code following punctuation | 36 | 0.79 | 0.02 | 0.76 | 353
f1643 | instructions and constraints | 35 | 0.73 | 0.02 | 0.70 | 411
f541 | numbers and ranges | 35 | 1.00 | 0.03 | 1.00 | 2327
f5251 | special, unique, or distinct concepts | 34 | 0.77 | 0.02 | 0.76 | 508
f489 | * <ul> Саша | 33 | 1.00 | 0.03 | 1.00 | 1950
f318 | patitth pabbaj entusiasmo ettha | 33 | 1.00 | 0.03 | 1.00 | 1931
f9517 | . \n | 33 | 0.76 | 0.02 | 0.70 | 333
f5444 | restrictive lung disease, surnames | 32 | 0.68 | 0.02 | 0.72 | 324
f352 | <unused541> abbanti <unused284> <unused291> | 32 | 1.00 | 0.03 | 1.00 | 2936
f322 | gahet sajana pona iha | 32 | 1.00 | 0.03 | 1.00 | 1473
f346 | aka S iaz others | 31 | 1.00 | 0.03 | 0.99 | 1120
f269 | <unused437> saddhim niektor <unused218> | 30 | 1.00 | 0.03 | 1.00 | 6018
f631 | government agreements and announcements | 30 | 0.81 | 0.03 | 0.78 | 354
f30 | the followed by a noun | 29 | 1.00 | 0.03 | 1.00 | 4969
f37 | ending lists or explanations | 28 | 0.90 | 0.03 | 0.86 | 648
f79 | Context, What, Progress | 28 | 0.98 | 0.03 | 0.99 | 607

Artifact table. freq_noise stays between 0.70 and 1.00 across all 30 rows, far above freq_h (0.02 to 0.03), which confirms these are off-manifold threshold activations, not content. Reported for completeness only.
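One way to compute a noise-overlap figure like the 30/30 above is sketched here. The report does not spell out the exact rule, so this assumes one reading: take the top-k features by enrichment score in the r arm and in the noise arm, then count the intersection. Function names are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring features, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def noise_overlap(enrich_r, enrich_noise, k=30):
    """Count top-k enriched features in r that are also top-k in noise."""
    shared = set(top_k(enrich_r, k)) & set(top_k(enrich_noise, k))
    return len(shared), k
```

An overlap near k/k says the ranking is driven by something both arms share (here, being off-manifold) rather than by the content of the residual.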

§4

Per-example drill-down

Raw-residual top features per example (top_r) alongside gold-h top features (top_h). The same caveats as §3 apply to top_r: these are off-manifold activations. top_h comes from gold activations and is meaningful, subject to the site-gap caveat.
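A minimal sketch of how a per-example top-features list such as top_h or top_r could be derived from one row of SAE codes. The function name and the default k are hypothetical.

```python
def top_features(code_row, k=5):
    """Return (feature_index, activation) pairs for the k strongest active features."""
    active = [(j, a) for j, a in enumerate(code_row) if a > 0]
    active.sort(key=lambda t: t[1], reverse=True)
    return active[:k]
```

For example, top_features([0.0, 2.0, 5.0, 0.0, 1.0], k=2) returns [(2, 5.0), (1, 2.0)].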
