Blog - AgentsArchitects

Beyond Residual Connections: How Depth-Wise Attention Is Redefining Large Language Model Architecture
The Architectural Assumption Nobody Questioned

Nearly every modern Large Language Model—whether GPT, Llama, DeepSeek, Claude, Gemini, or Kimi—relies on a core architectural mechanism introduced over a decade ago: the residual connection.

Since the publication of "Deep Residual Learning for Image Recognition" (He et al.) in 2015, residual connections have become one of the foundational building blocks of deep neural networks.

The idea is simple:

Each layer adds its output to the accumulated representation from all previous layers.

This design made it possible to train much deeper networks and became a cornerstone of modern Transformer architectures.
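To make the mechanism concrete, here is a minimal sketch of a residual stack in plain Python. The toy layers (fixed scaling factors) and the function name residual_forward are illustrative assumptions, not anything taken from the paper. The point is only that the final state equals the input plus every layer output, each added with a weight of one.

```python
def residual_forward(x0, layers):
    """Run a stack of layers with standard residual connections.

    Each layer reads the running sum h and adds its output back into it,
    so the final state is x0 plus every layer output with weight 1.
    """
    h = list(x0)
    outputs = []
    for layer in layers:
        out = layer(h)
        outputs.append(out)
        h = [a + b for a, b in zip(h, out)]
    return h, outputs


# Toy "layers" (hypothetical): each one scales its input by a fixed factor.
layers = [lambda h, s=s: [s * v for v in h] for s in (0.5, 0.1, 0.2)]
final, outputs = residual_forward([1.0, 2.0], layers)
print(final)
```

Unrolling the loop gives final = x0 + out_1 + out_2 + out_3, which is the "accumulated representation" the rest of this article refers to.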

Yet despite the enormous evolution of attention mechanisms, expert routing, context windows, and reasoning systems, the residual connection itself has remained largely unchanged.

A new research paper from Moonshot AI's Kimi Team challenges that assumption and proposes a compelling alternative.

Their framework, called Attention Residuals (AttnRes), introduces attention-based information routing across network depth, potentially redefining how information flows inside large language models.

The Hidden Problem with Residual Connections

Traditional residual connections give every layer's output the same fixed weight: each one is added to the stream with a coefficient of one, regardless of the input.

As models become deeper, representations from earlier layers are repeatedly added into a growing residual stream.

This creates several challenges.

Information Dilution

Early-layer information gradually becomes overwhelmed by later additions.

Important signals may remain present mathematically but become increasingly difficult to access effectively.

No Selective Retrieval

Every layer receives the same accumulated representation.

A layer cannot explicitly choose which earlier representations are most relevant to its current computation.

Growing Hidden-State Magnitudes

As depth increases, the magnitude of the hidden state tends to grow, because each layer adds another term to the running sum.

Later layers must produce increasingly large outputs to remain influential.

This phenomenon, often called PreNorm dilution, has been observed across many modern Transformer architectures.
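A back-of-the-envelope illustration of this effect, under a deliberately idealized assumption: the embedding and every layer output are unit-norm and mutually orthogonal, so squared norms simply add. Real models will not behave this cleanly, and the function name stream_norms is made up for this sketch.

```python
import math


def stream_norms(num_layers):
    """Norm of the residual stream after 0..num_layers layers.

    Idealized assumption: the embedding and every layer output are
    unit-norm and mutually orthogonal, so the squared norm after
    `depth` layers is depth + 1.
    """
    return [math.sqrt(depth + 1) for depth in range(num_layers + 1)]


norms = stream_norms(64)
# Size of one unit-norm layer output relative to the stream at each depth.
relative = [1.0 / n for n in norms]
print(f"stream norm after 64 layers: {norms[64]:.2f}")
print(f"relative size of a unit output at depth 1:  {relative[1]:.2f}")
print(f"relative size of a unit output at depth 64: {relative[64]:.2f}")
```

Even in this simplified setting, a fixed-size contribution shrinks relative to the stream as depth grows, which is the intuition behind the dilution described above.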

A Powerful Insight: Treat Depth Like Sequence

The central insight behind AttnRes is elegant.

Researchers observed a similarity between two problems:

Recurrent Neural Networks

RNNs compress all previous tokens into a single state.

This creates information bottlenecks over time.

Residual Networks

Residual connections compress all previous layers into a single accumulated representation.

This creates information bottlenecks over depth.

Transformers solved the first problem through attention.

Instead of compressing sequence history into a single state, each token can selectively attend to previous tokens.

AttnRes applies the same principle to network depth.

Instead of treating all previous layers equally, each layer learns which earlier layers deserve attention.

What Are Attention Residuals?

In AttnRes, every layer can attend to outputs from previous layers.

Rather than simply summing all prior representations, the model computes attention weights that determine how much influence each earlier layer should have.

This creates:

Selective information retrieval
Input-dependent information flow
Dynamic cross-layer communication
Better preservation of early representations

The result is a more flexible architecture in which deeper layers can draw on the earlier representations most relevant to their current computation.
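The idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation: the single query vector, the use of earlier layer outputs as both keys and values, and the function names are all assumptions of this sketch. The weights are input-dependent here because the keys are computed from the layer outputs themselves.

```python
import math


def rms_normalize(v, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_attention(query, layer_outputs):
    """Mix earlier layer outputs using softmax weights over depth.

    query: hypothetical per-layer query vector (an assumption of this sketch).
    layer_outputs: one vector per earlier layer, used as keys and values.
    """
    keys = [rms_normalize(v) for v in layer_outputs]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(query)
    mixed = [sum(w * v[i] for w, v in zip(weights, layer_outputs))
             for i in range(dim)]
    return mixed, weights


earlier = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy outputs of three layers
mixed, weights = depth_attention([2.0, 0.0], earlier)
print(weights)
```

With this toy query, the first layer output receives the largest weight and the second the smallest, in contrast to the equal weighting of a plain residual sum.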

Why This Matters

Traditional residual connections assume:

Every previous layer matters equally.

AttnRes assumes:

Different layers matter differently depending on the task and input.

This subtle shift fundamentally changes how information moves through a model.

Rather than continuously accumulating representations, the network actively retrieves relevant knowledge from its own internal hierarchy.

Making It Scalable: Block Attention Residuals

While full depth-wise attention is theoretically attractive, applying attention across hundreds of layers introduces engineering challenges.

To solve this, the researchers introduced Block Attention Residuals.

The idea is straightforward.

Layers are grouped into blocks.

Instead of attending to every individual layer, the model attends to block-level summaries.

Compared with full depth-wise attention, this reduces:

Memory requirements
Communication overhead
Training complexity
Inference latency

The researchers report that the approach introduces less than 4% additional training overhead while preserving most of the performance benefits.

This makes the technique practical for large-scale deployments.
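A minimal sketch of the grouping step, assuming that a block summary is simply the elementwise sum of the layer outputs inside the block. That choice, the block size of 4, and the function name block_summaries are assumptions made for illustration; the paper's exact summary may differ.

```python
def block_summaries(layer_outputs, block_size):
    """Collapse per-layer outputs into one vector per block.

    Assumption of this sketch: a block summary is the elementwise sum
    of the layer outputs inside the block.
    """
    summaries = []
    for start in range(0, len(layer_outputs), block_size):
        block = layer_outputs[start:start + block_size]
        summaries.append([sum(column) for column in zip(*block)])
    return summaries


# 12 toy layer outputs of dimension 2; layer i outputs [i, 1].
outputs = [[float(i), 1.0] for i in range(12)]
summaries = block_summaries(outputs, block_size=4)
print(len(outputs), "layer outputs ->", len(summaries), "block summaries")
```

A later layer would then attend over 3 block summaries instead of 12 individual layer outputs, which is where the savings in memory and communication come from.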

Benchmark Results

The paper evaluates AttnRes across multiple model sizes and tasks.

Results show consistent improvements over traditional residual architectures.

GPQA-Diamond

Performance improved from:

36.9 → 44.4

This is a gain of 7.5 points on difficult scientific reasoning tasks.

Mathematical Reasoning

Performance increased from:

53.5 → 57.1

suggesting stronger multi-step reasoning capabilities.

HumanEval

Coding benchmark performance improved from:

59.1 → 62.2

indicating benefits for software engineering tasks.

BBH and MMLU

Smaller but consistent improvements were observed across broader reasoning and knowledge benchmarks.

The pattern is noteworthy.

The largest gains appear in tasks requiring:

Multi-step reasoning
Complex problem solving
Code generation
Knowledge composition

These are precisely the domains where effective information retrieval across layers is most valuable.
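For readers who want the gains side by side, the snippet below turns the three score pairs quoted above into absolute and relative improvements. It uses only the numbers already reported in this article; the relative figures are derived arithmetic, not values stated in the paper.

```python
# (baseline, AttnRes) score pairs as quoted in this article.
reported = {
    "GPQA-Diamond": (36.9, 44.4),
    "Mathematical reasoning": (53.5, 57.1),
    "HumanEval": (59.1, 62.2),
}

gains = {}
for name, (baseline, attnres) in reported.items():
    absolute = round(attnres - baseline, 1)
    relative = round(100 * (attnres - baseline) / baseline, 1)
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains are 7.5, 3.6, and 3.1 points respectively, so the GPQA-Diamond improvement is the largest of the three in both absolute and relative terms.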

What the Ablation Studies Reveal

Several findings stand out.

Input-Dependent Attention Matters

Simply adding fixed weighting mechanisms produced little improvement.

Dynamic attention was essential.

Softmax Beats Sigmoid

Competitive normalization encouraged sharper layer selection and better performance.
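The difference between the two normalizations is easy to see on toy scores. The sketch below is a generic illustration of "competitive" versus independent weighting, not a reproduction of the paper's ablation; the score values are made up.

```python
import math


def softmax(scores):
    """Weights compete: they always sum to 1 across layers."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid_gates(scores):
    """Weights are independent: each layer gets its own gate in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


scores = [3.0, 1.0, 0.5]    # hypothetical scores for three earlier layers
boosted = [3.0, 4.0, 0.5]   # same, but the second layer now scores higher

soft, soft_boosted = softmax(scores), softmax(boosted)
gates, gates_boosted = sigmoid_gates(scores), sigmoid_gates(boosted)

print("softmax weight of layer 1:", soft[0], "->", soft_boosted[0])
print("sigmoid gate of layer 1:  ", gates[0], "->", gates_boosted[0])
```

Raising the second layer's score lowers the first layer's softmax weight, while its sigmoid gate is unchanged. Under softmax, layers have to compete for a fixed budget of attention, which is one plausible reading of why selection becomes sharper.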

Single-Head Depth Attention Is Enough

Surprisingly, allowing different attention heads to retrieve different layers actually reduced performance.

The network appears to benefit from coherent layer-level retrieval.

RMSNorm Is Critical

Without normalization, larger layer outputs dominated attention scores and reduced effectiveness.
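A toy example shows why magnitude matters. Two layer outputs point in the same direction, but one is ten times larger. The query vector and the function names are assumptions of this sketch, and the normalization is applied to the keys only, which may not match the paper's exact placement.

```python
import math


def rms_normalize(v, eps=1e-6):
    """Scale a vector so its root-mean-square is approximately 1."""
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_weights(query, layer_outputs, normalize):
    """Softmax weights over layers, with or without RMS-normalized keys."""
    keys = [rms_normalize(v) if normalize else v for v in layer_outputs]
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


query = [1.0, 1.0]                       # hypothetical query vector
layer_outputs = [[1.0, 0.0], [10.0, 0.0]]  # same direction, 10x magnitude

raw = depth_weights(query, layer_outputs, normalize=False)
normed = depth_weights(query, layer_outputs, normalize=True)
print("without RMSNorm:", raw)
print("with RMSNorm:   ", normed)
```

Without normalization the larger output takes almost all of the weight purely because of its scale. With normalization the two outputs are weighted almost equally, since they carry the same directional information.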

These results suggest that depth-wise information routing behaves differently from token-level attention and requires its own design principles.

A New Preference for Depth

One of the most interesting findings is architectural.

Traditional Transformer designs often favor wider networks.

AttnRes appears to favor deeper ones.

Under equivalent compute budgets, the optimal AttnRes configurations consistently shifted toward:

More layers
Narrower widths

This suggests that selective depth retrieval makes deeper architectures substantially more useful.
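To get a feel for what "more layers, narrower widths" means at a fixed budget, the sketch below uses the common rule of thumb that a standard Transformer block has roughly 12 × d_model² parameters (about 4 × d_model² for attention and 8 × d_model² for an MLP with 4x expansion). The rule ignores embeddings, normalization, biases, and expert layers, and the two configurations are hypothetical; they are not taken from the paper.

```python
def approx_block_params(num_layers, d_model):
    """Rough non-embedding parameter count: 12 * layers * d_model^2.

    Assumes a standard dense Transformer block with a 4x MLP expansion.
    """
    return 12 * num_layers * d_model ** 2


# Two hypothetical configurations with the same approximate budget.
wide = approx_block_params(num_layers=32, d_model=4096)
deep = approx_block_params(num_layers=128, d_model=2048)

print(f"wide (32 layers x 4096):  {wide:,} parameters")
print(f"deep (128 layers x 2048): {deep:,} parameters")
```

Halving the width frees enough budget for four times as many layers, which is why a shift in the preferred depth-to-width ratio can change the shape of a model considerably.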

If validated at larger scales, this could influence future frontier model design.

Why Enterprise AI Leaders Should Care

Most enterprise teams focus on:

Model selection
Fine-tuning
Retrieval systems
Agent orchestration

Few pay attention to the underlying architectural innovations happening inside foundation models.

However, improvements at this level eventually influence:

Inference cost
Reasoning quality
Training efficiency
Model scalability

AttnRes represents the type of architectural innovation that can quietly reshape the next generation of AI systems.

Organizations evaluating future AI platforms should monitor developments like this closely.

The Bigger Picture

The history of AI contains recurring patterns.

Many breakthrough advances occur when researchers replace fixed mechanisms with learned, input-dependent ones.

Examples include:

RNNs → Transformers
Fixed retrieval → Retrieval-Augmented Generation
Static routing → Mixture-of-Experts

Attention Residuals may represent another step in that progression.

Rather than treating depth as a fixed accumulation process, the architecture transforms depth into something that can be dynamically queried and optimized.

Final Thoughts

The residual connection has remained largely unchanged for over a decade.

Attention Residuals challenge the assumption that information should simply accumulate across layers.

Instead, they propose that models should actively retrieve information from their own internal history, much like Transformers retrieve information across tokens.

If the reported results continue to hold at larger scales, AttnRes could become one of the most important architectural developments in the post-Transformer era.

The future of language models may depend not only on how they attend across sequences—but also on how they attend across their own depth.

References

Chen, G., Zhang, Y., Su, J., et al. (2026). Attention Residuals. Moonshot AI / Kimi Team. arXiv:2603.15031.

Available at:
https://arxiv.org/abs/2603.15031

Code Repository:
https://github.com/MoonshotAI/Attention-Residuals

Author Note

This article provides an independent analysis of the Attention Residuals (AttnRes) architecture and its implications for large language model design. All benchmark results, technical descriptions, and experimental findings are derived from the original research paper. Commentary and interpretation reflect the author's perspective.