Akihiko Komada a.k.a 駒田明彦 (@aki1770)

Akihiko Komada a.k.a 駒田明彦 (@aki1770) is an AI entry listed in the General AI category and associated with General AI topics.

How it Works
Attention Residuals (AttnRes) replace the fixed accumulation used in standard PreNorm residual connections with a dynamic, attention‑based mechanism. Each layer l has a learned pseudo‑query vector, and softmax attention lets the layer selectively aggregate earlier representations based on the current input, rather than treating all prior layers equally.

Variants
• Full AttnRes: every layer attends to all previous layer outputs. Memory: O(Ld) (L = number of layers).
• Block AttnRes: layers are grouped into N blocks (e.g., ~8 blocks). Memory: O(Nd).
  • Within a block: standard residual accumulation.
  • Across blocks: attention is applied only to block‑level summaries plus any partial sum from the current incomplete block.

Both variants are drop‑in replacements that keep the two‑phase transformer computation (attention → MLP) and typically use RMSNorm for stability.

Why It Matters
Uniform residuals in deep PreNorm transformers cause:
• Gradient dilution: earlier layers receive weaker updates.
• Uncontrolled hidden‑state growth: magnitudes explode with depth.

AttnRes introduces learned, input‑dependent depth selection, which:
• Keeps output norms bounded.
• Distributes gradients uniformly across layers.
• Improves training dynamics and scaling efficiency.

Empirical Gains
On a 48 B‑parameter Kimi Linear MoE model (3 B activated, 1.4 T tokens):
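
Returning to the mechanism under "How it Works": the depth-wise aggregation can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The function names (`attn_residual`, `block_sources`), the use of RMS-normalised sources as keys, the plain dot-product scoring, and summing within a block to form its summary are all assumptions made for illustration.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale x to roughly unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attn_residual(sources, pseudo_query):
    """Mix earlier representations with softmax weights over depth.

    sources:      list of d-dimensional vectors (earlier layer outputs,
                  or block summaries in the block variant)
    pseudo_query: d-dimensional learned vector for the current layer
    """
    # Assumption: keys are the RMS-normalised sources, scored by dot product.
    scores = [
        sum(q * k for q, k in zip(pseudo_query, rms_norm(s)))
        for s in sources
    ]
    weights = softmax(scores)
    d = len(pseudo_query)
    mixed = [sum(w * s[j] for w, s in zip(weights, sources)) for j in range(d)]
    return mixed, weights

def block_sources(layer_outputs, block_size):
    """Block variant (assumed form): sum the outputs inside each block.

    The last entry is the partial sum of the current, possibly incomplete block.
    """
    d = len(layer_outputs[0])
    summaries = []
    for start in range(0, len(layer_outputs), block_size):
        chunk = layer_outputs[start:start + block_size]
        summaries.append([sum(v[j] for v in chunk) for j in range(d)])
    return summaries

# Hypothetical toy data: three earlier layer outputs with d = 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full_mix, full_weights = attn_residual(outputs, [5.0, 0.0])
block_mix, block_weights = attn_residual(block_sources(outputs, 2), [5.0, 0.0])
```

In this sketch the weights depend on the input only through the keys, since the pseudo-query is a fixed learned vector per layer. A zero pseudo-query yields equal weights, that is, a uniform average over the sources. The full variant keeps one source per layer and the block variant one per block, which matches the O(Ld) versus O(Nd) memory figures above.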
