Beyond Residual Connections: How Depth-Wise Attention Is Redefining Large Language Model Architecture
The Architectural Assumption Nobody Questioned
Nearly every modern large language model, whether GPT, Llama, DeepSeek, Claude, Gemini, or Kimi, relies on a core architectural mechanism introduced over a decade ago: the residual connection.
Since the publication of "Deep Residual Learning for Image Recognition" in 2015, residual connections have become one of the foundational building blocks of deep neural networks.
The idea is simple:
Each layer adds its output to the accumulated representation from all previous layers.
This design made it possible to train much deeper networks and became a cornerstone of modern Transformer architectures.
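To make the accumulation concrete, here is a minimal NumPy sketch of a residual stack. The `toy_layer` function and the random weights are hypothetical stand-ins for real attention or MLP sublayers. The point is only that unrolling the update gives the input plus a plain sum of every layer's output.

```python
import numpy as np

def toy_layer(h, weight):
    # Hypothetical stand-in for an attention or MLP sublayer.
    return np.tanh(h @ weight)

def residual_forward(x, weights):
    """Run a stack of residual layers and also return each layer's output."""
    h = x
    outputs = []
    for w in weights:
        f = toy_layer(h, w)   # layer output computed from the current stream
        outputs.append(f)
        h = h + f             # residual connection: add with a fixed weight of 1
    return h, outputs

rng = np.random.default_rng(0)
d, num_layers = 8, 4
x = rng.normal(size=d)
weights = [rng.normal(scale=0.3, size=(d, d)) for _ in range(num_layers)]
h_final, outputs = residual_forward(x, weights)

# Unrolling the recurrence: the final state is the input plus the
# unweighted sum of all layer outputs.
unrolled = x + np.sum(outputs, axis=0)
print(np.allclose(h_final, unrolled))
```

Every term in that sum carries the same coefficient, which is the property the rest of this article examines.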
Yet despite the enormous evolution of attention mechanisms, expert routing, context windows, and reasoning systems, the residual connection itself has remained largely unchanged.
A new research paper from Moonshot AI's Kimi Team challenges that assumption and proposes a compelling alternative.
Their framework, called Attention Residuals (AttnRes), introduces attention-based information routing across network depth, potentially redefining how information flows inside large language models.
The Hidden Problem with Residual Connections
Traditional residual connections add every layer's output to the stream with the same fixed weight of one.
As models become deeper, representations from earlier layers are repeatedly added into a growing residual stream.
This creates several challenges.
Information Dilution
Early-layer information gradually becomes overwhelmed by later additions.
Important signals may remain present mathematically but become increasingly difficult to access effectively.
No Selective Retrieval
Every layer receives only a single accumulated sum of all earlier outputs.
A layer cannot explicitly choose which earlier representations are most relevant to its current computation.
Growing Hidden-State Magnitudes
As depth increases, the magnitude of the residual stream tends to grow, because each layer adds its output to an ever-larger sum.
Later layers must generate increasingly stronger outputs to remain influential.
This phenomenon, often called PreNorm dilution, has been observed across many modern Transformer architectures.
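A rough simulation illustrates the effect. It assumes, purely for illustration, that layer outputs are independent random vectors of roughly constant scale. Real layer outputs are correlated, so the numbers are not predictions. The sketch tracks the norm of the stream and the share that each new output contributes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_layers = 64, 32

h = rng.normal(size=d)              # initial embedding
stream_norms = []
relative_share = []
for _ in range(num_layers):
    f = rng.normal(size=d)          # assumed: outputs of roughly constant scale
    h = h + f                       # standard residual accumulation
    stream_norms.append(np.linalg.norm(h))
    # How large the newest output is relative to the whole stream.
    relative_share.append(np.linalg.norm(f) / np.linalg.norm(h))

print(f"stream norm: first layer {stream_norms[0]:.1f}, last layer {stream_norms[-1]:.1f}")
print(f"new-output share: first layer {relative_share[0]:.2f}, last layer {relative_share[-1]:.2f}")
```

Under these assumptions the stream norm grows roughly with the square root of depth, so a late layer has to produce a larger output to shift the stream by the same relative amount.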
A Powerful Insight: Treat Depth Like Sequence
The central insight behind AttnRes is elegant.
Researchers observed a similarity between two problems:
Recurrent Neural Networks
RNNs compress all previous tokens into a single state.
This creates information bottlenecks over time.
Residual Networks
Residual connections compress all previous layers into a single accumulated representation.
This creates information bottlenecks over depth.
Transformers solved the first problem through attention.
Instead of compressing sequence history into a single state, each token can selectively attend to previous tokens.
AttnRes applies the same principle to network depth.
Instead of treating all previous layers equally, each layer learns which earlier layers deserve attention.
What Are Attention Residuals?
In AttnRes, every layer can attend to outputs from previous layers.
Rather than simply summing all prior representations, the model computes attention weights that determine how much influence each earlier layer should have.
This creates:
- Selective information retrieval
- Input-dependent information flow
- Dynamic cross-layer communication
- Better preservation of early representations
The result is a more flexible architecture that allows deeper layers to access exactly the information they need.
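The mechanism can be sketched in a few lines of NumPy. This is an illustrative reading of the description above, not the paper's implementation. The per-layer `query` vector, the `1/sqrt(d)` scaling, and the function names are assumptions made for the example. Keys are RMS-normalized copies of the earlier outputs, and a single softmax over depth produces the mixing weights.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Scale each row to unit root-mean-square.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def depth_attention(prev_outputs, query):
    """Mix earlier layer outputs with softmax weights over depth.

    prev_outputs: shape (num_sources, d), one row per earlier layer
    query: shape (d,), a hypothetical learned query for the current layer
    """
    keys = rms_norm(prev_outputs)                     # normalized keys
    scores = keys @ query / np.sqrt(keys.shape[-1])   # assumed scaling
    weights = softmax(scores)                         # single head over depth
    mixed = weights @ prev_outputs                    # weighted sum replaces the plain sum
    return mixed, weights

rng = np.random.default_rng(0)
prev_outputs = rng.normal(size=(5, 16))   # outputs of five earlier layers
query = rng.normal(size=16)
mixed, weights = depth_attention(prev_outputs, query)
print(weights.round(3), weights.sum())
```

Because the keys are computed from hidden states, the weights change with the input even if the query is a fixed learned vector.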
Why This Matters
Traditional residual connections assume:
- Every previous layer matters equally.
AttnRes assumes:
- Different layers matter differently depending on the task and input.
This subtle shift fundamentally changes how information moves through a model.
Rather than continuously accumulating representations, the network actively retrieves relevant knowledge from its own internal hierarchy.
Making It Scalable: Block Attention Residuals
While full depth-wise attention is theoretically attractive, applying attention across hundreds of layers introduces engineering challenges.
To solve this, the researchers introduced Block Attention Residuals.
The idea is straightforward.
Layers are grouped into blocks.
Instead of attending to every individual layer, the model attends to block-level summaries.
This dramatically reduces:
- Memory requirements
- Communication overhead
- Training complexity
- Inference latency
The researchers report that the approach introduces less than 4% additional training overhead while preserving most of the performance benefits.
This makes the technique practical for large-scale deployments.
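The grouping step can be sketched as follows. The article does not say how a block summary is formed, so this example assumes it is the plain sum of the block's layer outputs. The sizes and the `block_summaries` helper are hypothetical. The sketch shows how the number of attention sources shrinks from one per layer to one per block.

```python
import numpy as np

def block_summaries(layer_outputs, block_size):
    """Sum layer outputs within each consecutive block of `block_size` layers."""
    num_layers, d = layer_outputs.shape
    if num_layers % block_size != 0:
        raise ValueError("this sketch assumes equal-sized blocks")
    blocks = layer_outputs.reshape(num_layers // block_size, block_size, d)
    return blocks.sum(axis=1)   # assumed summary: sum over the layers in a block

rng = np.random.default_rng(0)
num_layers, d, block_size = 48, 16, 8
layer_outputs = rng.normal(size=(num_layers, d))
summaries = block_summaries(layer_outputs, block_size)

print(f"attention sources: {num_layers} layers -> {summaries.shape[0]} blocks")
```

With 48 layers and blocks of 8, a late layer attends over 6 summaries instead of 48 individual outputs. Summing the summaries with equal weights recovers the ordinary residual sum.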
Benchmark Results
The paper evaluates AttnRes across multiple model sizes and tasks.
Results show consistent improvements over traditional residual architectures.
GPQA-Diamond
Performance improved from:
36.9 → 44.4
A significant gain on difficult scientific reasoning tasks.
Mathematical Reasoning
Performance increased from:
53.5 → 57.1
suggesting stronger multi-step reasoning capabilities.
HumanEval
Coding benchmark performance improved from:
59.1 → 62.2
indicating benefits for software engineering tasks.
BBH and MMLU
Smaller but consistent improvements were observed across broader reasoning and knowledge benchmarks.
The pattern is noteworthy.
The largest gains appear in tasks requiring:
- Multi-step reasoning
- Complex problem solving
- Code generation
- Knowledge composition
These are precisely the domains where effective information retrieval across layers is most valuable.
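For reference, the three headline results quoted above can be turned into absolute and relative gains with simple arithmetic. The scores are the ones given in this article, and the dictionary keys are labels chosen for the example.

```python
results = {
    "GPQA-Diamond": (36.9, 44.4),
    "Mathematical reasoning": (53.5, 57.1),
    "HumanEval": (59.1, 62.2),
}

gains = {}
for name, (baseline, attnres) in results.items():
    absolute = round(attnres - baseline, 1)                      # points gained
    relative = round(100 * (attnres - baseline) / baseline, 1)   # percent over baseline
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

This gives gains of 7.5, 3.6, and 3.1 points, or roughly 20.3%, 6.7%, and 5.2% over the respective baselines.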
What the Ablation Studies Reveal
Several findings stand out.
Input-Dependent Attention Matters
Simply adding fixed weighting mechanisms produced little improvement.
Dynamic attention was essential.
Softmax Beats Sigmoid
Competitive normalization encouraged sharper layer selection and better performance.
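The difference between the two normalizations is easy to see on a toy score vector. The scores below are hypothetical. The sketch shows that raising one score under softmax takes weight away from the other layers, while sigmoid gates stay independent of one another.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([2.0, 1.0, 0.0, -1.0])   # hypothetical depth-attention scores
soft_w = softmax(scores)
sig_w = sigmoid(scores)

# Raise the score of the first layer only.
boosted = scores.copy()
boosted[0] += 3.0
soft_boosted = softmax(boosted)
sig_boosted = sigmoid(boosted)

print("softmax:", soft_w.round(3), "->", soft_boosted.round(3))
print("sigmoid:", sig_w.round(3), "->", sig_boosted.round(3))
```

Softmax weights always sum to one, so layers compete for a fixed budget. Sigmoid weights for the untouched layers do not move at all.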
Single-Head Depth Attention Is Enough
Surprisingly, allowing different attention heads to retrieve different layers actually reduced performance.
The network appears to benefit from coherent layer-level retrieval.
RMSNorm Is Critical
Without normalization, larger layer outputs dominated attention scores and reduced effectiveness.
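A small hand-built example shows why scale matters. The vectors and the query are invented for illustration: one earlier output points in the query's direction but is small, the other is poorly aligned but large. Without normalization the large output wins on magnitude alone. With RMS-normalized keys the better-aligned output wins.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

query = np.array([2.0, 0.0, 0.0, 0.0])   # hypothetical query direction
outputs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # well aligned with the query, small magnitude
    [5.0, 5.0, 5.0, 5.0],   # weakly aligned, large magnitude
])

raw_weights = softmax(outputs @ query)               # scores 2 and 10
normed_weights = softmax(rms_norm(outputs) @ query)  # scores about 4 and 2

print("without RMSNorm:", raw_weights.round(3))
print("with RMSNorm:   ", normed_weights.round(3))
```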
These results suggest that depth-wise information routing behaves differently from token-level attention and requires its own design principles.
A New Preference for Depth
One of the most interesting findings is architectural.
Traditional Transformer designs often favor wider networks.
AttnRes appears to favor deeper ones.
Under equivalent compute budgets, the optimal AttnRes configurations consistently shifted toward:
- More layers
- Narrower widths
This suggests that selective depth retrieval makes deeper architectures substantially more useful.
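To get a feel for the trade-off, the sketch below uses a common rule of thumb that is not taken from the paper: a Transformer layer has roughly `12 * d^2` non-embedding parameters, and parameter count serves as a rough proxy for compute per token. The layer counts and widths are hypothetical. It computes how much narrower a model must become when its depth doubles at a fixed budget.

```python
import math

def approx_params(num_layers, width):
    # Rule of thumb: about 4*d^2 for attention plus 8*d^2 for a 4x MLP per layer.
    return 12 * num_layers * width ** 2

def width_for_budget(param_budget, num_layers):
    # Invert approx_params to find the width that fits the budget.
    return math.sqrt(param_budget / (12 * num_layers))

budget = approx_params(num_layers=32, width=4096)
deeper_width = width_for_budget(budget, num_layers=64)

print(f"budget: {budget / 1e9:.2f}B non-embedding parameters")
print(f"32 layers at width 4096 vs 64 layers at width {deeper_width:.0f}")
```

Doubling depth at a fixed budget shrinks the width by a factor of the square root of two, here from 4096 to about 2896.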
If validated at larger scales, this could influence future frontier model design.
Why Enterprise AI Leaders Should Care
Most enterprise teams focus on:
- Model selection
- Fine-tuning
- Retrieval systems
- Agent orchestration
Few pay attention to the underlying architectural innovations happening inside foundation models.
However, improvements at this level eventually influence:
- Inference cost
- Reasoning quality
- Training efficiency
- Model scalability
AttnRes represents the type of architectural innovation that can quietly reshape the next generation of AI systems.
Organizations evaluating future AI platforms should monitor developments like this closely.
The Bigger Picture
The history of AI contains recurring patterns.
Many breakthrough advances occur when researchers replace fixed mechanisms with learned, input-dependent ones.
Examples include:
- RNNs → Transformers
- Fixed retrieval → Retrieval-Augmented Generation
- Static routing → Mixture-of-Experts
Attention Residuals may represent another step in that progression.
Rather than treating depth as a fixed accumulation process, the architecture transforms depth into something that can be dynamically queried and optimized.
Final Thoughts
The residual connection has remained largely unchanged for over a decade.
Attention Residuals challenge the assumption that information should simply accumulate across layers.
Instead, they propose that models should actively retrieve information from their own internal history, much like Transformers retrieve information across tokens.
If the reported results continue to hold at larger scales, AttnRes could become one of the most important refinements to the Transformer architecture in years.
The future of language models may depend not only on how they attend across sequences but also on how they attend across their own depth.
References
Chen, G., Zhang, Y., Su, J., et al. (2026). Attention Residuals. Moonshot AI / Kimi Team. arXiv:2603.15031.
Available at:
https://arxiv.org/abs/2603.15031
Code Repository:
https://github.com/MoonshotAI/Attention-Residuals
Author Note
This article provides an independent analysis of the Attention Residuals (AttnRes) architecture and its implications for large language model design. All benchmark results, technical descriptions, and experimental findings are derived from the original research paper. Commentary and interpretation reflect the author's perspective.