DiffusionBlocks: A New Method for Block-by-Block Neural Network Training That Reduces Memory by Several Times

Core Idea: Breaking the Memory Bottleneck of End-to-End Training

For over a decade, end-to-end backpropagation (end-to-end backprop) has been considered the only way to train deep networks. However, this approach requires keeping the entire network in memory simultaneously, and the linearly growing memory demands with model depth are becoming a resource bottleneck for AI training.

The core requirement of end-to-end backpropagation is saving intermediate activations at every layer during the forward pass, so that gradients can be computed during backpropagation. For a network with L layers, memory consumption scales linearly with L. Take a GPT-4-scale model as an example — it may have over 120 Transformer blocks, with each layer's activations potentially occupying several GB of memory during large-batch training. While techniques like gradient checkpointing can trade computation for memory, they fundamentally don't change this linear dependency. This bottleneck directly limits the model depth and batch size that researchers can train.

A team from Sakana AI published a paper at ICLR 2026 proposing an entirely new approach — DiffusionBlocks — which splits the network into independent blocks and trains only one block at a time, dramatically reducing the GPU memory required for training while maintaining performance comparable to end-to-end training.

DiffusionBlocks Overview

Technical Principles: Reinterpreting Forward Propagation as a Diffusion Denoising Process

Core Insight: Forward Propagation as Stepwise Denoising

The core innovation of DiffusionBlocks lies in an elegant analogy: reinterpreting the forward propagation process of a neural network as a diffusion model progressively denoising a signal.

The core idea of Diffusion Models originates from non-equilibrium thermodynamics: first, a forward process gradually adds Gaussian noise to data until it becomes pure noise; then a neural network is trained to learn the reverse process — progressively recovering the original data from noise. Each denoising step is a small, learnable transformation that pushes the current state one step closer to the target distribution. DDPM (Denoising Diffusion Probabilistic Models) demonstrated in 2020 that this approach can generate high-quality images. DiffusionBlocks cleverly borrows this mathematical framework: the role of each block in the network is analogous to one denoising step in a diffusion model — each block pushes the intermediate representation one step closer to the final target.

Specifically, the researchers explicitly assign each block a role — advancing the representation one small step toward the target, making it closer to the final goal than the output of the previous block. This is precisely what a diffusion model does at each step.

Block-by-Block Training Mechanism in Detail

In traditional training, all parameters need to be jointly optimized, and gradients must propagate all the way from the end of the network back to the beginning. Under the DiffusionBlocks framework:

The network is divided into multiple independent blocks
Each block only needs to optimize its own local objective function
Training only requires allocating memory for a single block
Blocks can be trained independently without loading the entire network simultaneously

The idea of training networks with local objective functions is not entirely new. As early as 2006, Hinton et al. proposed layer-wise pretraining, which initializes deep networks layer by layer through local unsupervised objectives. Later auxiliary losses methods, such as the intermediate classifiers in GoogLeNet, also attempted to alleviate the vanishing gradient problem. However, these early methods typically underperformed compared to end-to-end training. The key breakthrough of DiffusionBlocks is that it provides a principled theoretical foundation for local objectives through the mathematical framework of the diffusion process, ensuring that each block's local optimization objective remains consistent with the global objective, thereby truly approaching end-to-end training in performance.

This approach fundamentally reduces training memory requirements from scaling linearly with network depth to depending only on the size of a single block.

Experimental Validation: Comprehensive Coverage Across Five Mainstream Architectures

The research team validated DiffusionBlocks' effectiveness across five different architectures:

ViT (Vision Transformer): The mainstream architecture in the vision domain. Proposed by Google in 2020, ViT splits images into fixed-size patch sequences and processes them with a standard Transformer encoder, proving that pure Transformer architectures can match or even surpass CNNs on vision tasks.
DiT (Diffusion Transformer): A cutting-edge model for image generation. DiT applies the Transformer architecture to diffusion-based image generation tasks, proposed by Peebles and Xie in 2023, replacing the traditional U-Net backbone in diffusion models with Transformer blocks, becoming the technical foundation for video generation models like Sora.
Masked Diffusion: A masked diffusion model that applies the diffusion process to discrete token spaces, generating text and other discrete data through progressive masking and unmasking.
Autoregressive Transformer: The core architecture of large language models, generating sequences by predicting the next token one at a time, serving as the foundation of the GPT series.
Looped Transformer: An architecture that iteratively applies the same network, achieving parameter-efficient depth scaling through weight sharing.

Both types of Transformer architectures (ViT and DiT) feature deep stacking characteristics, making them particularly suitable for DiffusionBlocks' block-by-block training strategy. Across all test scenarios, DiffusionBlocks achieved competitive performance with end-to-end training while using only a fraction of the memory.

Special Value for Looped Transformers

DiffusionBlocks' application to Looped Transformers deserves particular attention. These models increase effective depth by iteratively applying the same network, but traditional training requires expensive Backpropagation Through Time (BPTT).

Backpropagation Through Time (BPTT) is the standard method for training recurrent networks, requiring the loop to be unrolled into an equivalent deep feedforward network, then performing backpropagation across the entire unrolled sequence. For Looped Transformers, if the same block is iteratively applied T times, BPTT needs to store all intermediate activations for T steps, with memory consumption of O(T) and computational complexity also growing linearly with T. When T is large (e.g., tens to hundreds of iterations), this makes training extremely expensive.

Through the DiffusionBlocks lens, researchers can replace multi-iteration BPTT with a single forward pass during training, which not only saves memory but also greatly simplifies the training pipeline. DiffusionBlocks treats each iteration as an independent denoising step, requiring only a single forward pass to compute the local loss, completely avoiding the unrolling process of BPTT.

Significance and Outlook: Redefining the Deep Network Training Paradigm

The significance of DiffusionBlocks extends far beyond memory optimization:

Lowering the training barrier: Making it possible to train deeper models on limited hardware resources
Theoretical contribution: Establishing a mathematical connection between deep network training and the diffusion process
Strong practicality: Applicable to multiple current mainstream architectures, including LLMs and vision models
Scalability: Providing a new technical pathway for training even larger-scale models in the future

As model sizes continue to grow, training resource bottlenecks will only become more pronounced. DiffusionBlocks offers a principled solution that has the potential to change the fundamental paradigm of how we train deep neural networks. The work has been open-sourced on GitHub, allowing researchers to directly reproduce and extend this method.

DiffusionBlocks: A New Method for Block-by-Block Neural Network Training That Reduces Memory by Several Times

Core Idea: Breaking the Memory Bottleneck of End-to-End Training

Technical Principles: Reinterpreting Forward Propagation as a Diffusion Denoising Process

Core Insight: Forward Propagation as Stepwise Denoising

Block-by-Block Training Mechanism in Detail

Experimental Validation: Comprehensive Coverage Across Five Mainstream Architectures

Special Value for Looped Transformers

Significance and Outlook: Redefining the Deep Network Training Paradigm

Key Takeaways

Related articles

Sakana AI Launches RSI Lab: The Path to Recursive Self-Improvement Where AI Builds AI

The Clotilda: Underwater Archaeological Discovery of America's Last Slave Ship

Sakana AI in Practice: Reshaping Banking Lending Operations with AI Agents — Technology and Strategy