LLM & Modeller

Akihiko Komada a.k.a 駒田明彦 (@aki1770)
How it Works
Attention Residuals (AttnRes) replace the fixed accumulation used in standard PreNorm residual connections with a dynamic, attention‑based mechanism.
• q_l – learned pseudo‑query vector for layer l
The softmax attention lets each layer selectively aggregate earlier representations based on the current input, rather than treating all prior layers equally.
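The aggregation step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's exact formulation: the function name `attn_res` and the scaled dot‑product scoring against the pseudo‑query are assumptions; the key point is that the softmax produces a convex combination over earlier layer outputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attn_res(prev_outputs, q):
    """AttnRes-style aggregation (sketch).

    prev_outputs: earlier layer outputs, each of shape (d,)
    q: learned pseudo-query for the current layer, shape (d,)
    Scoring here is scaled dot-product (an assumption); the softmax
    weights form a convex combination over depth.
    """
    H = np.stack(prev_outputs)       # (l, d)
    d = q.shape[0]
    scores = H @ q / np.sqrt(d)      # (l,) similarity of each layer to q
    w = softmax(scores)              # attention weights over prior layers
    return w @ H                     # (d,) aggregated residual stream

rng = np.random.default_rng(0)
d = 8
outs = [rng.standard_normal(d) for _ in range(4)]
q = rng.standard_normal(d)
r = attn_res(outs, q)
```

Because the weights are a convex combination, the aggregated vector's norm can never exceed that of the largest contributing layer output.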
Variants
• Full AttnRes – every layer attends to all previous layer outputs.
Memory: O(Ld) (L = number of layers, d = hidden dimension).
• Block AttnRes – layers are grouped into N blocks (e.g., ~8 blocks).
• Within a block: standard residual accumulation.
• Across blocks: attention is applied only to block‑level summaries plus any partial sum from the current incomplete block.
Memory: O(Nd) (N = number of blocks).
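The Block AttnRes bookkeeping described above can be sketched as follows. The function name `block_summaries` is hypothetical; the point is that only the N block sums plus one partial sum are retained, giving O(Nd) memory instead of O(Ld).

```python
import numpy as np

def block_summaries(layer_outputs, block_size):
    """Block AttnRes bookkeeping (sketch).

    Within each complete block, outputs are accumulated by a plain
    residual sum; only the finished block sums and the partial sum of
    the current incomplete block are kept for cross-block attention.
    """
    summaries, partial = [], None
    for i, h in enumerate(layer_outputs):
        partial = h if partial is None else partial + h
        if (i + 1) % block_size == 0:   # block complete: store its summary
            summaries.append(partial)
            partial = None              # start accumulating the next block
    return summaries, partial

rng = np.random.default_rng(0)
outs = [rng.standard_normal(4) for _ in range(10)]
# 10 layers with block size 4 -> 2 complete blocks + a 2-layer partial sum
sums, part = block_summaries(outs, block_size=4)
```

Cross‑block attention would then attend over `sums` plus `part`, rather than over all 10 per‑layer outputs.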
Both variants are drop‑in replacements that keep the two‑phase transformer computation (attention → MLP) and typically use RMSNorm for stability.
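For reference, the RMSNorm mentioned above normalizes a vector to unit root‑mean‑square before applying a learned per‑channel gain. A minimal numpy version:

```python
import numpy as np

def rms_norm(x, g, eps=1e-6):
    """RMSNorm: rescale x to unit RMS, then apply learned gain g."""
    rms = np.sqrt(np.mean(x * x) + eps)
    return x / rms * g

# RMS of [3, -4, 0, 0] is sqrt((9 + 16) / 4) = 2.5,
# so with unit gain the output is simply x / 2.5.
x = np.array([3.0, -4.0, 0.0, 0.0])
y = rms_norm(x, g=np.ones(4))
```

Unlike LayerNorm, RMSNorm skips mean subtraction, which is part of why it is the common choice for stabilizing deep PreNorm stacks.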
Why It Matters
Uniform residuals in deep PreNorm transformers cause:
• Gradient dilution – earlier layers receive weaker updates.
• Uncontrolled hidden‑state growth – magnitudes explode with depth.
AttnRes introduces learned, input‑dependent depth selection, which:
• Keeps output norms bounded.
• Distributes gradients uniformly across layers.
• Improves training dynamics and scaling efficiency.
Empirical Gains
Gains were reported on a 48 B‑parameter Kimi Linear MoE model (3 B activated, trained on 1.4 T tokens).