Hyper-Connections: The First Major Improvement to Residual Connections in a Decade
ByteDance's Hyper-Connections paper proposes the first major improvement to residual connections in a decade. By expanding from a single skip connection to multiple learnable connection pathways with parameterized fusion, it achieves significantly lower loss under the same compute budget. While promising, adoption remains limited due to insufficient validation at larger model scales and potential training stability concerns.
Overview
Residual connections have remained essentially unchanged for roughly a decade, since Kaiming He and colleagues introduced them with ResNet (published at CVPR 2016). Many variants have been tried over the years, but the original formulation has generally held up best in experiments. In September 2024, ByteDance published a paper called Hyper-Connections, proposing a significant change to residual connections that achieves notably better training results under the same compute budget.
ByteDance has published a large number of influential AI papers in recent years, with research capabilities arguably on par with Tencent—and even surpassing them in certain areas. Although this paper's underlying principle appears simple, its potential impact should not be underestimated.
Why Residual Connections Need Improvement
The Small Model Era: One Connection Was Enough
When models were relatively small, a single skip connection was sufficient to solve the vanishing gradient problem and enable deeper model training. This was the core purpose for which residual networks were originally designed, and they accomplished that mission perfectly.
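To make the mechanism concrete, the standard residual update and its layer-to-layer Jacobian can be written as follows. The symbols h_l (hidden state entering layer l), F_l (the layer's transformation), and I (the identity matrix) are notation introduced here for illustration, not taken from the article.

```latex
h_{l+1} = h_l + F_l(h_l),
\qquad
\frac{\partial h_{l+1}}{\partial h_l} = I + \frac{\partial F_l}{\partial h_l}
```

Because of the identity term, the gradient always has a direct path back through the skip connection even when the derivative of F_l is small, which is why one skip connection per layer was enough to train much deeper networks.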
The Large Model Era: Limitations of a Single Connection
However, as models grow larger and deeper, a single residual connection shows its limits. Vanishing gradients can still appear in the deeper layers, and the paper describes a seesaw between gradient vanishing and representation collapse: the usual residual variants reduce one problem at the cost of the other. Either way, the affected layers end up learning very little useful information. This is the core problem that Hyper-Connections aims to solve.
The Core Idea Behind Hyper-Connections
Expanding from One Connection to Multiple Connections
The paper's core idea is highly intuitive: since models have grown larger and one connection is no longer sufficient, expand to multiple connections. Taking two connections as an example, the original input H becomes two branches: H1 and H2.
The specific implementation works as follows:
- Input stage: The original input H0 is simply duplicated to generate two copies as initial values for the two connections
- Fused input: H1 and H2 are fused through a weighted sum before being fed into the network layer
- Output distribution: The network layer's output is merged back into the two connections through learnable parameters
- Connection interaction: The two connections are not completely independent—they undergo weighted fusion at every layer
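The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `hyper_connection_step`, `w_in`, `w_mix`, and `w_out` are hypothetical, the weights are plain arrays rather than trained parameters, and the layer is a toy function.

```python
import numpy as np

def hyper_connection_step(streams, layer_fn, w_in, w_mix, w_out):
    """Wrap one layer in n parallel connection streams (illustrative only).

    streams:  (n, d) array, one row per connection (H1, H2, ...)
    layer_fn: maps a (d,) vector to a (d,) vector
    w_in:     (n,)   weights that fuse the streams into the layer input
    w_mix:    (n, n) weights that let the streams exchange information
    w_out:    (n,)   weights that distribute the layer output to each stream
    """
    layer_input = w_in @ streams          # fused input: weighted sum of streams
    layer_output = layer_fn(layer_input)  # run the actual network layer once
    mixed = w_mix @ streams               # connection interaction
    return mixed + np.outer(w_out, layer_output)  # output distribution

# Input stage: duplicate H0 into two connections.
h0 = np.arange(4, dtype=float)
streams = np.stack([h0, h0])

# Hand-picked weights for the sketch; in a real model these would be learned.
w_in = np.array([0.5, 0.5])
w_mix = np.eye(2)
w_out = np.array([1.0, 1.0])

toy_layer = lambda x: 2.0 * x  # stand-in for an attention or feed-forward layer
new_streams = hyper_connection_step(streams, toy_layer, w_in, w_mix, w_out)
```

Note that the layer itself still runs only once per step; the extra cost is the small set of mixing weights and the weighted sums over the streams.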
Application in Transformers
In the Transformer architecture, the original flow is layer-by-layer stacking: attention layer → feed-forward layer → attention layer. Hyper-Connections expands the connection channels on top of this. Between every layer, the two connections undergo fusion, with learnable weights controlling the contribution ratio of each for both inputs and outputs.
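As a rough sketch of how this might look across a stack of layers, the loop below threads two connection streams through alternating stand-ins for attention and feed-forward layers. Everything here is an assumption made for illustration: the helper names, the toy layer functions, the fixed weights, and the choice to sum the streams into a single hidden state at the end.

```python
import numpy as np

def hc_step(streams, layer_fn, w_in, w_mix, w_out):
    # Fuse streams into one layer input, run the layer, then write the
    # output back into every stream alongside the mixed old streams.
    return w_mix @ streams + np.outer(w_out, layer_fn(w_in @ streams))

def run_stack(h0, layers, params, n=2):
    """Run `layers` in order with n connection streams (illustrative only)."""
    streams = np.tile(h0, (n, 1))  # duplicate the input into n streams
    for layer_fn, (w_in, w_mix, w_out) in zip(layers, params):
        streams = hc_step(streams, layer_fn, w_in, w_mix, w_out)
    return streams.sum(axis=0)     # assumed: collapse streams by summing

# Toy stand-ins; real attention and feed-forward layers would go here.
attention_standin = lambda x: x - x.mean()
feed_forward_standin = lambda x: np.maximum(x, 0.0)

h0 = np.array([1.0, -1.0, 2.0, 0.0])
layers = [attention_standin, feed_forward_standin, attention_standin]
# Each layer gets its own (w_in, w_mix, w_out); fixed values for the sketch.
params = [(np.array([0.5, 0.5]), np.eye(2), np.array([1.0, 1.0])) for _ in layers]

final = run_stack(h0, layers, params)
```

Giving every layer its own weight triple is what lets the contribution ratio differ from layer to layer.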
Key Differences from Original Residual Connections
In original residual connections, the skip connection has no trainable parameters: the input and the layer output are simply added, each with a fixed weight of 1. Hyper-Connections introduces learnable parameters, allowing the model to decide the weight allocation for each connection on its own, which provides much greater flexibility.
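One way to see the relationship is that the standard residual update x + f(x) falls out as a special case when there is a single stream and every weight is fixed at 1. The sketch below checks this numerically; `hc_step` and its weight names are hypothetical helpers written for this illustration, not an API from the paper.

```python
import numpy as np

def residual_block(x, layer_fn):
    # Standard residual connection: fixed, parameter-free addition.
    return x + layer_fn(x)

def hc_step(streams, layer_fn, w_in, w_mix, w_out):
    # Multi-stream update with explicit weights (illustrative only).
    return w_mix @ streams + np.outer(w_out, layer_fn(w_in @ streams))

x = np.array([1.0, -2.0, 3.0])
layer_fn = np.tanh  # toy layer

# One stream, all weights fixed at 1: reduces to the residual update.
as_hyper = hc_step(x[None, :], layer_fn, np.ones(1), np.eye(1), np.ones(1))[0]
plain = residual_block(x, layer_fn)

# Different weights give a different update; in training these would be learned.
reweighted = hc_step(x[None, :], layer_fn,
                     np.array([0.8]), np.eye(1), np.array([1.2]))[0]
```

Here `as_hyper` matches `plain`, while `reweighted` does not, which is the extra freedom the learnable weights provide.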
Experimental Results and Limitations
Impressive Experimental Results
With the same training budget of 500B tokens, Hyper-Connections reaches a lower loss, and it significantly outperforms traditional residual connections on all evaluated datasets. This means:
- Better model performance under the same compute budget
- Less compute required to achieve the same performance
- Improvements come simply by replacing the connection method—no changes to data or training pipeline needed
Why Mainstream Large Models Haven't Widely Adopted It Yet
Despite the promising results, mainstream large model training has not widely adopted this method. The reasons may include:
Insufficient model scale validation: The paper only validates on models ranging from 1B to 7B parameters, which is far too small compared to today's models with tens or hundreds of billions of parameters. What works on small models doesn't necessarily work on large ones.
Insufficient training data: The paper only uses 500B tokens for training, while current mainstream large models train on over ten trillion or even sixty trillion tokens. LLaMA 1 already used 1.4T, and LLaMA 2 used 2T.
Questionable training stability: From the training curves in the paper, traditional residual connections converge very stably with no obvious loss spikes. However, Hyper-Connections' training curves show significant fluctuations and spikes, which could be a fatal issue in large-scale training.
Unknown long-term training effects: The paper reveals an interesting phenomenon—other residual connection improvements outperform the original version early in training but actually underperform it later on. Whether Hyper-Connections suffers from a similar issue requires validation with much larger-scale data.
Future Outlook
DeepSeek V4 reportedly uses an improved version of Hyper-Connections and has achieved good results with it. If so, this suggests the direction is promising.
If Hyper-Connections can be validated on models with tens or even hundreds of billions of parameters, it would represent the most significant breakthrough in residual connections in a decade. The original residual connection may gradually be replaced by this multi-connection approach, becoming a new mainstream architectural component.
Summary
The core contribution of Hyper-Connections can be summarized in one sentence: expanding residual networks from a single skip connection to multiple learnable connection pathways, with parameterized fusion mechanisms enabling information exchange between connections. The method itself is not complex, but its potential impact depends on whether it can be validated at larger scales. This 37-page paper contains extensive mathematical proofs and derivations, but the core idea is indeed concise and elegant.