Residual analysis
Primary: drop-comparison encodes h and ĥ at the same L40 site so the site gap cancels — features strong in h but absent in ĥ are what the channel drops. Secondary: raw-residual enrichment is a labeled null (noise overlap 30/30).
- substrate: 1000 gold + 021 AR reconstructions
- sae: gemma-scope-2 L40 16k l0=60
- primary method: drop-comparison (gap-cancelled)
- experiment: 021_interp_combos
Sanity gate
The SAE was trained on full activations at layers.40. We hand it (a) gold activations one block later (h), (b) AR reconstructions (ĥ), (c) residuals r = h − ĥ, and (d) Gaussian noise norm-matched to those residuals. The fraction of variance unexplained (FVU) on r is at the same level as on the norm-matched noise, which confirms that the raw-residual enrichment is an off-manifold artifact.
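A minimal sketch of the two quantities the gate relies on. It assumes the common FVU definition (squared reconstruction error divided by squared deviation from the per-dimension mean) and per-example norm matching for the noise; the source states neither formula, and the function names `fvu` and `norm_matched_noise` are hypothetical.

```python
import math
import random

def fvu(xs, xhats):
    """Fraction of variance unexplained: squared reconstruction error
    divided by squared deviation of xs from its per-dimension mean."""
    n, d = len(xs), len(xs[0])
    mean = [sum(x[j] for x in xs) / n for j in range(d)]
    err = sum((x[j] - xh[j]) ** 2 for x, xh in zip(xs, xhats) for j in range(d))
    tot = sum((x[j] - mean[j]) ** 2 for x in xs for j in range(d))
    return err / tot

def norm_matched_noise(residuals, seed=0):
    """One Gaussian vector per residual, rescaled to that residual's L2 norm
    (per-example matching is an assumption, not confirmed by the source)."""
    rng = random.Random(seed)
    out = []
    for r in residuals:
        g = [rng.gauss(0.0, 1.0) for _ in r]
        scale = math.sqrt(sum(v * v for v in r)) / math.sqrt(sum(v * v for v in g))
        out.append([v * scale for v in g])
    return out
```

Under these assumptions the gate amounts to comparing `fvu(r, sae_reconstruction_of_r)` with the same quantity computed on `norm_matched_noise(r)`: if the two are about equal, the SAE treats the residual like noise.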
Drop-comparison (primary)
Method: encode h and ĥ separately at the same L40 site, passing each through the SAE directly rather than encoding r = h − ĥ. The site gap applies equally to both encodings, so it cancels in the comparison. Features present in the h encoding but absent or weak in the ĥ encoding are what the text channel drops during AR reconstruction. This is the primary residual method for 021.
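The separate-encoding step can be sketched as follows. The encoder here is a toy ReLU stand-in, not the gemma-scope-2 SAE, and all names (`relu_encode`, `feature_stats`, `drop_compare`) are hypothetical. Firing frequency is taken as the fraction of examples with a positive activation, and mean activation is averaged over all examples; both conventions are assumptions.

```python
def relu_encode(x, W, b):
    """Toy SAE encoder ReLU(x @ W + b); a stand-in, NOT the real SAE."""
    return [max(0.0, sum(xi * W[i][k] for i, xi in enumerate(x)) + b[k])
            for k in range(len(b))]

def feature_stats(acts):
    """Per-feature firing frequency (fraction of examples with act > 0)
    and mean activation over all examples."""
    n, k = len(acts), len(acts[0])
    freq = [sum(a[j] > 0 for a in acts) / n for j in range(k)]
    mean = [sum(a[j] for a in acts) / n for j in range(k)]
    return freq, mean

def drop_compare(hs, hhats, W, b):
    """Encode h and h-hat separately, then line up the statistics per feature."""
    freq_h, mean_h = feature_stats([relu_encode(h, W, b) for h in hs])
    freq_r, mean_r = feature_stats([relu_encode(x, W, b) for x in hhats])
    return [{"feature": j,
             "freq_h": freq_h[j], "freq_hhat": freq_r[j],
             "mean_act_h": mean_h[j], "mean_act_hhat": mean_r[j]}
            for j in range(len(b))]

# Tiny illustrative inputs (made up): 2 examples, 2 dimensions, 2 features.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, -1.0]
rows = drop_compare([[1.0, 2.0], [0.0, 3.0]], [[1.0, 0.0], [0.0, 0.5]], W, b)
```

In this toy input, feature 1 fires on every h but on no ĥ, which is the pattern the method looks for. The residual r = h − ĥ is never passed through the encoder.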
| feature | label / logit-lens | freq h | freq ĥ | mean act h | mean act ĥ | drop ratio |
|---|---|---|---|---|---|---|
| f3726 | operation finishes | 0.075 | 0.000 | 46.5 | 0.0 | 76.0 |
| f3023 | lists followed by and/or to | 0.039 | 0.000 | 22.8 | 0.0 | 40.0 |
| f2384 | 上記の 上記 उपरोक्त ছাড়াও | 0.053 | 0.001 | 30.6 | 0.5 | 27.0 |
| f8485 | code or special characters | 0.022 | 0.000 | 15.9 | 0.0 | 23.0 |
| f8625 | order status | 0.028 | 0.001 | 17.7 | 0.5 | 14.5 |
| f3819 | again | 0.037 | 0.002 | 15.9 | 0.6 | 12.7 |
| f12505 | complex numbers | 0.023 | 0.002 | 12.3 | 1.4 | 8.0 |
| f1632 | closing punctuation marks | 0.031 | 0.005 | 14.5 | 2.3 | 5.3 |
| f15243 | military alliances or problems | 0.055 | 0.010 | 30.1 | 4.3 | 5.1 |
| f13871 | code snippets and multi-language keywords | 0.029 | 0.005 | 15.4 | 3.0 | 5.0 |
| f2303 | hon'ble legal context | 0.020 | 0.004 | 10.0 | 2.0 | 4.2 |
| f4025 | oxygen | 0.020 | 0.004 | 10.8 | 1.8 | 4.2 |
| f9263 | foreign and multilingual references | 0.056 | 0.013 | 41.3 | 8.7 | 4.1 |
| f4120 | personal and meaningful connection | 0.029 | 0.007 | 13.3 | 3.1 | 3.8 |
| f1248 | yine again Again again | 0.021 | 0.005 | 10.2 | 2.0 | 3.7 |
| f2394 | oretically 并不是 *, , | 0.042 | 0.011 | 24.5 | 6.4 | 3.6 |
| f6169 | lists of things like languages or services | 0.023 | 0.006 | 12.9 | 4.4 | 3.4 |
| f2186 | various forms of likewise | 0.025 | 0.007 | 12.4 | 2.8 | 3.3 |
| f15222 | the daycare | 0.023 | 0.007 | 13.0 | 3.3 | 3.0 |
| f2211 | edge devices and entering hurricanes | 0.045 | 0.015 | 24.1 | 7.5 | 2.9 |
| f13181 | DSP, CAP, acronyms | 0.035 | 0.012 | 24.3 | 8.6 | 2.8 |
| f2714 | React hooks and technologies | 0.032 | 0.011 | 18.7 | 6.6 | 2.8 |
| f4151 | technical terms and diverse languages | 0.021 | 0.007 | 12.1 | 5.5 | 2.8 |
| f3265 | goodness | 0.048 | 0.017 | 24.8 | 8.7 | 2.7 |
| f3360 | start of critical or related phrases | 0.026 | 0.009 | 13.3 | 4.6 | 2.7 |
| f12830 | called or so-called | 0.023 | 0.008 | 14.0 | 4.8 | 2.7 |
| f15301 | month, form, threat | 0.028 | 0.010 | 13.3 | 3.8 | 2.6 |
| f2862 | typical Typical typical Typical | 0.020 | 0.007 | 12.8 | 4.9 | 2.6 |
| f9517 | . \n | 0.022 | 0.008 | 9.1 | 3.6 | 2.6 |
| f6727 | other languages | 0.027 | 0.010 | 15.4 | 5.9 | 2.5 |
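Features strong in h (freq h ≥ min_freq_h) but absent or weak in ĥ. The tabulated drop ratios are consistent with a smoothed frequency ratio, drop ratio = (freq h + 0.001) / (freq ĥ + 0.001), rather than with a ratio of the displayed mean activations: for f2384, (0.053 + 0.001) / (0.001 + 0.001) = 27.0, whereas 30.6 / 0.5 would give 61.2. A high ratio means the feature is strongly dropped by the channel.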
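A short sketch of the ratio arithmetic, checked against a few rows of the table. The smoothing constant `eps=0.001` is inferred from the tabulated values and is an assumption, not something the source states; `drop_ratio` is a hypothetical name.

```python
def drop_ratio(freq_h, freq_hhat, eps=0.001):
    """Smoothed frequency ratio. eps=0.001 is inferred from the table
    (an assumption); it keeps the ratio finite when freq_hhat is 0."""
    return (freq_h + eps) / (freq_hhat + eps)

# (freq h, freq h-hat) copied from the table above.
table_rows = {
    "f3726": (0.075, 0.000),
    "f2384": (0.053, 0.001),
    "f8625": (0.028, 0.001),
    "f6727": (0.027, 0.010),
}
ratios = {name: round(drop_ratio(fh, fr), 1) for name, (fh, fr) in table_rows.items()}
```

With this smoothing, a feature that never fires on ĥ gets a ratio of roughly 1000 × freq h + 1, so the top rows of the table are ordered mostly by how often the feature fires on h.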
Raw-residual enrichment (labeled null)
Negative result, reported as a labeled null: SAE-encoding the raw residual r = h − ĥ. The SAE encoder is not linear over differences, so encoding r is not equivalent to comparing the encodings of h and ĥ, and the result is half-baked regardless of site. It is kept for the record; drop-comparison (separate encoding) is the primary method.
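A two-feature toy example of why encoding a difference is not the difference of encodings. The ReLU encoder, weights, and inputs are made up for illustration and are not the real SAE.

```python
def relu_encode(x, W, b):
    """Toy SAE encoder ReLU(x @ W + b); a stand-in, NOT the real SAE."""
    return [max(0.0, sum(xi * W[i][k] for i, xi in enumerate(x)) + b[k])
            for k in range(len(b))]

W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical encoder weights
b = [-0.5, -0.5]               # hypothetical encoder biases
h = [2.0, 1.0]
hhat = [1.0, 3.0]

r = [a - c for a, c in zip(h, hhat)]           # raw residual h - h-hat
enc_of_diff = relu_encode(r, W, b)             # encode the residual
diff_of_enc = [p - q for p, q in
               zip(relu_encode(h, W, b), relu_encode(hhat, W, b))]
```

Here `enc_of_diff` and `diff_of_enc` disagree on both features: the bias is applied once to the residual but cancels in the difference of encodings, and the ReLU clips the negative component of the residual to zero.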
| feature | label / logit-lens | enrich | freq r | freq h | freq noise | mean act |
|---|---|---|---|---|---|---|
| f169 | official/original sources | 48 | 1.00 | 0.02 | 1.00 | 986 |
| f39 | numbers and units | 46 | 1.00 | 0.02 | 1.00 | 1881 |
| f507 | camaraderie социа музей disgraced | 46 | 1.00 | 0.02 | 1.00 | 3441 |
| f86 | 't or '2 | 46 | 1.00 | 0.02 | 1.00 | 1507 |
| f826 | wired charging and plots | 45 | 1.00 | 0.02 | 1.00 | 1266 |
| f711 | which and which valamint | 43 | 0.99 | 0.02 | 0.99 | 1558 |
| f527 | code, numbers, ```` | 43 | 0.98 | 0.02 | 0.97 | 904 |
| f403 | وغيرها 等等 इत्यादी 以及 | 42 | 1.00 | 0.02 | 1.00 | 1424 |
| f400 | multitude plethora lot Vielzahl | 42 | 1.00 | 0.02 | 1.00 | 5096 |
| f439 | K H G the | 42 | 1.00 | 0.02 | 1.00 | 1436 |
| f1082 | generate revenue or transmit information | 40 | 1.00 | 0.02 | 1.00 | 1280 |
| f348 | `` čne とはいえ izinsuku | 38 | 1.00 | 0.03 | 1.00 | 3829 |
| f504 | ibid https formerly http | 37 | 1.00 | 0.03 | 1.00 | 1237 |
| f21 | breakdown of why/how/what | 36 | 0.80 | 0.02 | 0.70 | 544 |
| f7513 | code following punctuation | 36 | 0.79 | 0.02 | 0.76 | 353 |
| f1643 | instructions and constraints | 35 | 0.73 | 0.02 | 0.70 | 411 |
| f541 | numbers and ranges | 35 | 1.00 | 0.03 | 1.00 | 2327 |
| f5251 | special, unique, or distinct concepts | 34 | 0.77 | 0.02 | 0.76 | 508 |
| f489 | * <ul> Саша | 33 | 1.00 | 0.03 | 1.00 | 1950 |
| f318 | patitth pabbaj entusiasmo ettha | 33 | 1.00 | 0.03 | 1.00 | 1931 |
| f9517 | . \n | 33 | 0.76 | 0.02 | 0.70 | 333 |
| f5444 | restrictive lung disease, surnames | 32 | 0.68 | 0.02 | 0.72 | 324 |
| f352 | <unused541> abbanti <unused284> <unused291> | 32 | 1.00 | 0.03 | 1.00 | 2936 |
| f322 | gahet sajana pona iha | 32 | 1.00 | 0.03 | 1.00 | 1473 |
| f346 | aka S iaz others | 31 | 1.00 | 0.03 | 0.99 | 1120 |
| f269 | <unused437> saddhim niektor <unused218> | 30 | 1.00 | 0.03 | 1.00 | 6018 |
| f631 | government agreements and announcements | 30 | 0.81 | 0.03 | 0.78 | 354 |
| f30 | the followed by a noun | 29 | 1.00 | 0.03 | 1.00 | 4969 |
| f37 | ending lists or explanations | 28 | 0.90 | 0.03 | 0.86 | 648 |
| f79 | Context, What, Progress | 28 | 0.98 | 0.03 | 0.99 | 607 |
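Artifact table. freq noise tracks freq r in every row (within 0.10): it is ≈ 1.0 for 22 of the 30 features and between 0.70 and 0.86 for the rest. This confirms that these are off-manifold threshold activations, not content. Reported for completeness only.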
Per-example drill-down
Raw-residual top features per example (top_r) alongside gold-h top features (top_h). The caveats from the raw-residual enrichment section apply to top_r: these are off-manifold activations. top_h is on-distribution and meaningful.