Hyper-Connections: The First Major Improvement to Residual Connections in a Decade

Overview

Residual connections have remained essentially unchanged for a full decade, ever since Kaiming He and his co-authors introduced them with ResNet (posted in late 2015 and published in 2016). Many variants have been proposed over the years, but in practice the original form has remained the default. In September 2024, ByteDance published a paper called Hyper-Connections, proposing a significant improvement to residual connections that achieves notably better training results under the same computational budget.

ByteDance has published many influential AI papers in recent years, and its research capability is arguably on par with Tencent's, or even ahead in certain areas. Although the principle behind this paper looks simple, its potential impact should not be underestimated.

Why Residual Connections Need Improvement

The Small Model Era: One Connection Was Enough

When models were relatively small, a single skip connection was sufficient to mitigate the vanishing gradient problem and make deeper models trainable. This was the core purpose for which residual networks were originally designed, and they did that job very well.

The Large Model Era: Limitations of a Single Connection

However, as models grow increasingly large and deep, the traditional single residual connection has revealed its shortcomings. In very deep networks the gradient signal can still weaken across many layers, and the deeper layers can end up producing nearly identical representations, so they effectively learn very little that is useful. The paper frames this as a seesaw between gradient vanishing and representation collapse, and it is the core problem that Hyper-Connections aims to solve.

The Core Idea Behind Hyper-Connections

Expanding from One Connection to Multiple Connections

The paper's core idea is highly intuitive: since models have grown larger and one connection is no longer sufficient, expand to multiple connections. Taking two connections as an example, the original input H becomes two branches: H1 and H2.

The specific implementation works as follows:

Input stage: The original input H0 is simply duplicated to generate two copies as initial values for the two connections
Fused input: H1 and H2 are fused through a weighted sum before being fed into the network layer
Output distribution: The network layer's output is merged back into the two connections through learnable parameters
Connection interaction: The two connections are not completely independent—they undergo weighted fusion at every layer
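
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function name hyper_connection_step, the weight names in_w, out_w and mix_w, the toy layer, and the hand-picked weight values are all assumptions made for this example. In a real model the weights would be trainable parameters and the layer would be an attention or feed-forward block.

```python
from typing import Callable, List

Vector = List[float]


def hyper_connection_step(
    streams: List[Vector],
    layer: Callable[[Vector], Vector],
    in_w: List[float],
    out_w: List[float],
    mix_w: List[List[float]],
) -> List[Vector]:
    """Apply one layer to n parallel connection streams (illustrative only).

    in_w[i]     weight of stream i in the fused layer input
    out_w[j]    how much of the layer output is written into stream j
    mix_w[i][j] how much of old stream i flows into new stream j
    """
    n, dim = len(streams), len(streams[0])
    # Fused input: weighted sum of all streams.
    fused = [sum(in_w[i] * streams[i][d] for i in range(n)) for d in range(dim)]
    out = layer(fused)
    # Output distribution (out_w) plus connection interaction (mix_w).
    return [
        [
            out_w[j] * out[d] + sum(mix_w[i][j] * streams[i][d] for i in range(n))
            for d in range(dim)
        ]
        for j in range(n)
    ]


def toy_layer(x: Vector) -> Vector:
    """Stand-in for a real network layer: simply doubles its input."""
    return [2.0 * v for v in x]


h0 = [1.0, -2.0]
streams = [list(h0), list(h0)]  # input stage: duplicate H0 into H1 and H2

new_streams = hyper_connection_step(
    streams,
    toy_layer,
    in_w=[0.5, 0.5],                      # hand-picked; learnable in practice
    out_w=[1.0, 0.5],
    mix_w=[[0.75, 0.25], [0.25, 0.75]],
)
print(new_streams)  # [[3.0, -6.0], [2.0, -4.0]]
```

Because out_w differs between the two streams, H1 and H2 stop being identical after a single layer, which is what gives later layers two distinct views to fuse.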

Application in Transformers

In the Transformer architecture, the original flow is layer-by-layer stacking: attention layer → feed-forward layer → attention layer. Hyper-Connections expands the connection channels on top of this. Around every layer, the two connections are fused, with learnable weights controlling how much each connection contributes to the layer's input and how much of the layer's output each connection receives.
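
As a rough sketch of that stacking, the example below runs an attention → feed-forward → attention stack over two connections and compares it with an ordinary residual stack. Everything here is assumed for illustration: the toy sublayers, the helper names, the averaging readout at the end, and the particular weights. The weights are chosen so that the two-connection stack reproduces the plain residual stack exactly, which shows that the ordinary residual network is one special case of the multi-connection scheme.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def step(streams, layer, in_w, out_w, mix_w):
    """One hyper-connected sublayer: fuse streams, run layer, write back, mix."""
    n, dim = len(streams), len(streams[0])
    fused = [sum(in_w[i] * streams[i][d] for i in range(n)) for d in range(dim)]
    out = layer(fused)
    return [
        [out_w[j] * out[d] + sum(mix_w[i][j] * streams[i][d] for i in range(n))
         for d in range(dim)]
        for j in range(n)
    ]


def toy_attention(x: Vector) -> Vector:
    """Placeholder for an attention sublayer."""
    return [0.5 * v for v in x]


def toy_ffn(x: Vector) -> Vector:
    """Placeholder for a feed-forward sublayer."""
    return [v + 1.0 for v in x]


def run_residual(h0: Vector, layers: List[Layer]) -> Vector:
    """Ordinary residual stack: h <- h + layer(h)."""
    h = list(h0)
    for layer in layers:
        h = [a + b for a, b in zip(h, layer(h))]
    return h


def run_hyper(h0: Vector, layers: List[Layer], weights, n: int = 2) -> Vector:
    """Two-connection stack; the final averaging readout is an assumption."""
    streams = [list(h0) for _ in range(n)]
    for layer, (in_w, out_w, mix_w) in zip(layers, weights):
        streams = step(streams, layer, in_w, out_w, mix_w)
    return [sum(s[d] for s in streams) / n for d in range(len(h0))]


layers = [toy_attention, toy_ffn, toy_attention]

# Residual-like weights: read from one stream, write the output to both,
# and carry each stream forward unchanged (identity mixing).
residual_like = ([1.0, 0.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
weights = [residual_like] * len(layers)

h0 = [2.0, 4.0]
residual_out = run_residual(h0, layers)
hyper_out = run_hyper(h0, layers, weights)
print(residual_out, hyper_out)  # [10.5, 19.5] [10.5, 19.5]
```

Once training moves the weights away from these values, the two connections carry different mixtures of earlier layer outputs, which a single skip connection cannot express.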

Key Differences from Original Residual Connections

In original residual connections, the skip connection weights are fixed: the layer's input and output are simply added together, each with a weight of 1, and there are no trainable parameters on the connection itself. Hyper-Connections introduces learnable parameters, allowing the model to decide the weight allocation for each connection on its own, providing much greater flexibility.
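
The difference can be written out side by side. The notation below is an assumption made for this article, not copied from the paper: T is the network layer, h_1 ... h_n are the n connections, and a_i, b_j and m_ij are the learnable input, output and mixing weights.

```latex
\begin{aligned}
\text{residual:}\quad
  & h' = h + \mathcal{T}(h) \\[4pt]
\text{hyper-connections:}\quad
  & x = \sum_{i=1}^{n} a_i\, h_i, \qquad
    h'_j = b_j\, \mathcal{T}(x) + \sum_{i=1}^{n} m_{ij}\, h_i,
    \quad j = 1, \dots, n
\end{aligned}
```

Setting n = 1 and a_1 = b_1 = m_11 = 1 recovers the residual formula, so in this formulation the original residual connection is the special case in which nothing is learned.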

Experimental Results and Limitations

Impressive Experimental Results

Under the same training budget of 500B tokens, Hyper-Connections reaches a lower loss, and the paper reports better results than traditional residual connections across its evaluation benchmarks. This means:

Better model performance under the same compute budget
Less compute required to achieve the same performance
Improvements come simply by replacing the connection method—no changes to data or training pipeline needed

Why Mainstream Large Models Haven't Widely Adopted It Yet

Despite the promising results, mainstream large model training has not widely adopted this method. The reasons may include:

Insufficient model scale validation: The paper only validates on models ranging from 1B to 7B parameters, which is far too small compared to today's models with tens or hundreds of billions of parameters. What works on small models doesn't necessarily work on large ones.

Insufficient training data: The paper only uses 500B tokens for training, while current mainstream large models train on ten trillion tokens or more. Even LLaMA 1 used up to 1.4T tokens, and LLaMA 2 used 2T.

Questionable training stability: Traditional residual connections are known to train very stably, in part because the skip path has a fixed weight. Hyper-Connections replaces that fixed path with learned mixing weights, and whether its loss curve stays free of large fluctuations and spikes at much greater scale has not been shown. Instability of that kind could be a fatal issue in large-scale training.

Unknown long-term training effects: The paper reveals an interesting phenomenon—other residual connection improvements outperform the original version early in training but actually underperform it later on. Whether Hyper-Connections suffers from a similar issue requires validation with much larger-scale data.

Future Outlook

You may not have noticed, but DeepSeek V4 reportedly uses an improved version of Hyper-Connections and has achieved good results with it. If that is accurate, it suggests the direction is promising.

If Hyper-Connections can be validated on models with tens or even hundreds of billions of parameters, it would represent the most significant breakthrough in residual connections in a decade. The original residual connection may gradually be replaced by this multi-connection approach, becoming a new mainstream architectural component.

Summary

The core contribution of Hyper-Connections can be summarized in one sentence: expanding residual networks from a single skip connection to multiple learnable connection pathways, with parameterized fusion mechanisms enabling information exchange between connections. The method itself is not complex, but its potential impact depends on whether it can be validated at larger scales. This 37-page paper contains extensive mathematical proofs and derivations, but the core idea is indeed concise and elegant.