LLM & Modeller

Akihiko Komada a.k.a 駒田明彦 (@aki1770)
How it Works
Attention Residuals (AttnRes) replace the fixed accumulation used in standard PreNorm residual connections with a dynamic, attention‑based mechanism.
• q_l – learned pseudo‑query vector for layer l
The softmax attention lets each layer selectively aggregate earlier representations based on the current input, rather than treating all prior layers equally.
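The aggregation step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact formulation: the function name `attn_res` and the scaled dot‑product scoring against the pseudo‑query are assumptions; the key point is that the softmax produces a convex combination over earlier layer outputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attn_res(prev_outputs, q):
    """AttnRes-style aggregation (sketch).

    prev_outputs: earlier layer outputs, each of shape (d,)
    q: learned pseudo-query for the current layer, shape (d,)
    Scoring here is scaled dot-product (an assumption); the softmax
    weights form a convex combination over depth.
    """
    H = np.stack(prev_outputs)       # (l, d)
    d = q.shape[0]
    scores = H @ q / np.sqrt(d)      # (l,) similarity of each layer to q
    w = softmax(scores)              # attention weights over prior layers
    return w @ H                     # (d,) aggregated residual stream

rng = np.random.default_rng(0)
d = 8
outs = [rng.standard_normal(d) for _ in range(4)]
q = rng.standard_normal(d)
r = attn_res(outs, q)
```

Because the weights are a convex combination, the aggregated vector's norm can never exceed that of the largest contributing layer output.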
Variants
• Full AttnRes – every layer attends to all previous layer outputs.
Memory: O(Ld) (L = number of layers, d = hidden dimension).
• Block AttnRes – layers are grouped into N blocks (e.g., ~8 blocks).
• Within a block: standard residual accumulation.
• Across blocks: attention is applied only to block‑level summaries plus any partial sum from the current incomplete block.
Memory: O(Nd) (N = number of blocks).
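The Block AttnRes bookkeeping described above can be sketched as follows. The function name `block_summaries` is hypothetical; the point is that only the N block sums plus one partial sum are retained, giving O(Nd) memory instead of O(Ld).

```python
import numpy as np

def block_summaries(layer_outputs, block_size):
    """Block AttnRes bookkeeping (sketch).

    Within each complete block, outputs are accumulated by a plain
    residual sum; only the finished block sums and the partial sum of
    the current incomplete block are kept for cross-block attention.
    """
    summaries, partial = [], None
    for i, h in enumerate(layer_outputs):
        partial = h if partial is None else partial + h
        if (i + 1) % block_size == 0:   # block complete: store its summary
            summaries.append(partial)
            partial = None              # start accumulating the next block
    return summaries, partial

rng = np.random.default_rng(0)
outs = [rng.standard_normal(4) for _ in range(10)]
# 10 layers with block size 4 -> 2 complete blocks + a 2-layer partial sum
sums, part = block_summaries(outs, block_size=4)
```

Cross‑block attention would then attend over `sums` plus `part`, rather than over all 10 per‑layer outputs.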
Both variants are drop‑in replacements that keep the two‑phase transformer computation (attention → MLP) and typically use RMSNorm for stability.
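For reference, the RMSNorm mentioned above normalizes a vector to unit root‑mean‑square before applying a learned per‑channel gain. A minimal numpy version:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    """RMSNorm: rescale x to unit RMS, then apply learned gain g."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return x / rms * g

# RMS of [3, -4, 0, 0] is sqrt((9 + 16) / 4) = 2.5,
# so with unit gain the output is simply x / 2.5.
x = np.array([3.0, -4.0, 0.0, 0.0])
y = rms_norm(x, g=np.ones(4))
```

Unlike LayerNorm, RMSNorm skips mean subtraction, which is part of why it is the common choice for stabilizing deep PreNorm stacks.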
Why It Matters
Uniform residuals in deep PreNorm transformers cause:
• Gradient dilution – earlier layers receive weaker updates.
• Uncontrolled hidden‑state growth – magnitudes explode with depth.
AttnRes introduces learned, input‑dependent depth selection, which:
• Keeps output norms bounded.
• Distributes gradients uniformly across layers.
• Improves training dynamics and scaling efficiency.
Empirical Gains
Gains were reported on a 48 B‑parameter Kimi Linear MoE model (3 B activated, trained on 1.4 T tokens).