Making LLMs Faster and Lighter: A Practical Approach to Reshaping Sparsity for GPUs

The Core Problem: Why Does "Doing Less Computation" Actually Run Slower?

The human brain's efficiency lies in activating only the neurons needed for a specific thought. Modern large language models (LLMs) naturally possess this same property—in feed-forward layers, for any given token, over 95% of neurons remain silent. Yet there's a frustrating paradox: making a model do less math often makes it run slower.

The reason is that unstructured sparsity introduces irregular memory access patterns, while GPUs are designed for predictable, dense block operations. GPU compute follows the SIMT (Single Instruction, Multiple Threads) model, whose core advantage is having thousands of threads execute the same instruction at once. When data is stored as dense matrices, adjacent threads read adjacent addresses, so the hardware can coalesce those reads into a few wide transactions and make full use of high-bandwidth memory (HBM). With unstructured sparsity, non-zero elements sit at arbitrary positions in the matrix, so threads jump between non-contiguous addresses, which causes cache misses and wastes memory bandwidth. Worse still, sparse matrices need additional index structures to record where the non-zero elements are, and this metadata itself consumes storage and bandwidth. As a result, even when theoretical computation drops by over 90%, actual execution time may decrease by only 10-20% or less: the irregular data layout keeps the GPU from exploiting its parallelism, and real throughput suffers.
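The metadata cost mentioned above can be made concrete with a little arithmetic. The sketch below assumes 16-bit values and 32-bit indices stored in CSR; these sizes and the function names are illustrative choices, not figures from the paper.

```python
def dense_bytes(rows, cols, value_bytes=2):
    """Bytes needed to store every element of a dense matrix."""
    return rows * cols * value_bytes


def csr_bytes(rows, cols, density, value_bytes=2, index_bytes=4):
    """Bytes needed by CSR: values, one column index per value, row pointers."""
    nnz = int(rows * cols * density)
    values = nnz * value_bytes
    col_indices = nnz * index_bytes        # metadata: twice the size of the values here
    row_pointers = (rows + 1) * index_bytes
    return values + col_indices + row_pointers


rows, cols = 4096, 4096
for density in (0.5, 0.25, 0.05):
    ratio = csr_bytes(rows, cols, density) / dense_bytes(rows, cols)
    print(f"density={density:.2f}  CSR/dense bytes = {ratio:.2f}")
# density=0.50  CSR/dense bytes = 1.50
# density=0.25  CSR/dense bytes = 0.75
# density=0.05  CSR/dense bytes = 0.15
```

Under these assumptions every stored value drags along an index twice its size, so at 50% density the "sparse" matrix is actually larger than the dense one. Storage only wins at high sparsity, and even then the reads are irregular.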

This is precisely the core problem that Sakana AI and NVIDIA's latest ICML 2026 paper—"Sparser, Faster, Lighter Transformer Language Models"—aims to solve.

Paper Overview

Core Idea: Making Sparsity Fit the GPU, Not the Other Way Around

Traditional sparse acceleration approaches typically try to modify GPU execution logic to handle irregular sparse data, but this often yields diminishing returns. This paper's core philosophy is the exact opposite: reshape the sparse data format so it naturally aligns with GPU-optimized execution paths.

The research team proposes a "hybrid" format that reorganizes sparsity into GPU-friendly structures. Specifically, the approach sends the highly sparse tokens, roughly 99% of all tokens, through a fast path while reserving a dense matrix as a safety valve for the rare "heavy" tokens. In actual LLM inference, activation sparsity varies significantly across tokens. Most tokens are extremely sparse in feed-forward layers, with only a few neurons activated, but a small number of "heavy" tokens (such as special beginning-of-sentence markers or high-information-density keywords) may activate a large number of neurons. If every token were forced through the sparse path, the format would have to reserve enough space for the heaviest token, wasting storage and compute on all the others. The hybrid strategy instead routes tokens dynamically against a sparsity threshold: high-sparsity tokens take the fast path and enjoy sparse acceleration, while low-sparsity tokens take the traditional dense path, which handles them efficiently without inflating the sparse format. This design resembles the "fast path/slow path" pattern in computer architecture. It maximizes average performance while staying robust to outliers, capturing the computational savings of sparsity while avoiding the GPU performance drag of irregular memory access.
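The routing idea can be sketched in a few lines. This is a minimal CPU-side illustration, assuming a simple per-token non-zero count as the routing criterion; the function name `route_tokens` and the `max_nonzeros` parameter are hypothetical, and the paper's kernels may decide this differently.

```python
def route_tokens(activations, max_nonzeros):
    """Split token indices into a sparse fast path and a dense fallback.

    activations: one list of hidden activations per token.
    max_nonzeros: hypothetical threshold; tokens above it count as "heavy".
    """
    sparse_ids, dense_ids = [], []
    for token_id, row in enumerate(activations):
        nonzeros = sum(1 for v in row if v != 0.0)
        if nonzeros <= max_nonzeros:
            sparse_ids.append(token_id)   # fits the compact sparse format
        else:
            dense_ids.append(token_id)    # handled by the dense safety valve
    return sparse_ids, dense_ids


activations = [
    [0.0, 0.0, 1.2, 0.0, 0.0, 0.0, 0.0, 0.0],   # typical sparse token
    [0.5, 0.9, 0.1, 0.0, 2.0, 0.3, 0.7, 0.0],   # "heavy" token
    [0.0, 0.4, 0.0, 0.0, 0.0, 0.0, 0.6, 0.0],   # typical sparse token
]
sparse_ids, dense_ids = route_tokens(activations, max_nonzeros=2)
print(sparse_ids, dense_ids)  # [0, 2] [1]
```

Because the heavy token is pulled out, the sparse format only has to reserve two slots per token here instead of six.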

Two Core Technical Contributions

TwELL: A New GPU-Oriented Sparse Packing Format

The paper's first core contribution is TwELL (Tile-wise ELLPACK), a novel sparse data packing format. Unlike traditional sparse formats, TwELL is designed to embed directly into already highly-optimized tiled matrix multiplication (tiled matmul) kernels without disrupting their execution flow.

ELLPACK itself is a classic sparse matrix storage format that takes its name from the ELLPACK numerical computing software package of the 1980s. Its core idea is this: assuming each matrix row has at most K non-zero elements, the entire sparse matrix is stored as an N×K value array plus an N×K column index array, with rows that have fewer than K non-zeros padded out to K. Compared to the more general CSR (Compressed Sparse Row) format, ELLPACK's advantage is its regular data layout. Each row occupies a fixed amount of storage, which makes parallel access patterns predictable and well suited to the GPU's SIMT execution model. The drawback is that K is set by the fullest row, so traditional ELLPACK wastes a great deal of storage on padding when the number of non-zero elements varies significantly across rows.
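A small example makes the layout and its padding cost visible. This is a plain-Python sketch of classic ELLPACK, not the paper's format; using -1 as the padding column index is a convention chosen here for illustration.

```python
def to_ellpack(dense, pad_col=-1):
    """Pack a dense matrix (list of rows) into ELLPACK value and index arrays."""
    k = max(sum(1 for v in row if v != 0) for row in dense)  # widest row sets K
    values, cols = [], []
    for row in dense:
        nonzeros = [(v, j) for j, v in enumerate(row) if v != 0]
        pad = k - len(nonzeros)
        values.append([v for v, _ in nonzeros] + [0.0] * pad)
        cols.append([j for _, j in nonzeros] + [pad_col] * pad)
    return values, cols, k


def ell_matvec(values, cols, x):
    """Multiply an ELLPACK matrix by a vector, skipping padded slots."""
    out = []
    for row_vals, row_cols in zip(values, cols):
        acc = 0.0
        for v, j in zip(row_vals, row_cols):
            if j >= 0:
                acc += v * x[j]
        out.append(acc)
    return out


dense = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 3.0, 4.0],   # the fullest row forces K = 3 for every row
    [0.0, 0.0, 0.0, 5.0],
]
values, cols, k = to_ellpack(dense)
print(cols)                                           # [[1, -1, -1], [0, 2, 3], [3, -1, -1]]
print(ell_matvec(values, cols, [1.0, 1.0, 1.0, 1.0]))  # [2.0, 8.0, 5.0]
```

Every row has the same width, which is what makes access regular, but four of the nine stored slots are padding because one row is much fuller than the others.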

TwELL applies a "tiled" transformation to ELLPACK—it organizes sparse data according to GPU compute unit tile sizes (typically 16×16 or 32×32 sub-matrix blocks), allowing the sparse pattern within each tile to be managed independently. The benefit is that each compute block can be processed at near-dense-matrix efficiency, preserving ELLPACK's regular access advantages while improving storage efficiency. This "hardware-centric" format design is key to achieving true sparse acceleration.
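To see why managing sparsity per tile can reduce padding, consider the toy sketch below. It is only a guess at the general idea, under the assumption that each tile is packed as its own small ELLPACK block with tile-local column indices; the actual TwELL layout is defined in the paper and may differ substantially. The names `tilewise_ellpack` and `stored_slots` are hypothetical.

```python
def tilewise_ellpack(dense, tile_rows, tile_cols):
    """Pack each tile independently; K is chosen per tile, not per matrix."""
    n_rows, n_cols = len(dense), len(dense[0])
    tiles = {}
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            block = [row[c0:c0 + tile_cols] for row in dense[r0:r0 + tile_rows]]
            k = max(sum(1 for v in row if v != 0) for row in block)
            values, cols = [], []
            for row in block:
                nonzeros = [(v, j) for j, v in enumerate(row) if v != 0]
                pad = k - len(nonzeros)
                values.append([v for v, _ in nonzeros] + [0.0] * pad)
                cols.append([j for _, j in nonzeros] + [-1] * pad)  # tile-local indices
            tiles[(r0, c0)] = (values, cols, k)
    return tiles


def stored_slots(tiles):
    """Total value slots stored across all tiles, padding included."""
    return sum(len(values) * k for values, _, k in tiles.values())


dense = [
    [1.0, 2.0, 3.0, 4.0],   # one heavy row
    [0.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 0.0, 0.0],
    [6.0, 0.0, 0.0, 0.0],
]
tiles = tilewise_ellpack(dense, tile_rows=2, tile_cols=2)
global_k = max(sum(1 for v in row if v != 0) for row in dense)
print(len(dense) * global_k, stored_slots(tiles))  # 16 10
```

In this toy case the heavy row only inflates the two tiles it passes through, so the tile-wise packing stores 10 slots where whole-matrix ELLPACK stores 16.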

To understand why TwELL can embed seamlessly into existing computation flows, one needs to understand how tiled matrix multiplication works. Efficient matrix operations on modern GPUs partition large matrices into small sub-blocks (tiles), each sized to fit into GPU shared memory or register files. During computation, each GPU thread block handles one output tile: it repeatedly loads input tiles from global memory into shared memory, performs multiply-accumulate operations on-chip, and finally writes the result back to global memory. NVIDIA's Tensor Cores accelerate this further by executing a small matrix multiply-accumulate as a single hardware operation rather than as many scalar instructions. Libraries like cuBLAS and CUTLASS have optimized tiled matrix multiplication to near the hardware's theoretical peak. The elegance of TwELL's design lies in aligning sparse data organization with these already highly optimized tile computation flows, so that sparse kernels can reuse most of the optimization strategies from dense kernels.
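The loop structure of tiled matrix multiplication is easy to show on the CPU. The sketch below is plain Python for illustration only; on a GPU the outer two loops would map to thread blocks and the tile copies to shared-memory loads, and real kernels use far larger tiles than the default here.

```python
def tiled_matmul(a, b, tile=2):
    """Compute a @ b one output tile at a time (a is m x k, b is k x n)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):            # one "thread block" per output tile
        for j0 in range(0, n, tile):
            acc = [[0.0] * min(tile, n - j0) for _ in range(min(tile, m - i0))]
            for k0 in range(0, k, tile):    # march along the shared dimension
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]   # "load" A tile
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]   # "load" B tile
                for i, a_row in enumerate(a_tile):
                    for kk, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_tile[kk]):
                            acc[i][j] += a_val * b_val
            for i, acc_row in enumerate(acc):   # write the finished tile back once
                c[i0 + i][j0:j0 + len(acc_row)] = acc_row
    return c


a = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
b = [[7.0, 8.0],
     [9.0, 10.0],
     [11.0, 12.0]]
print(tiled_matmul(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

The accumulator for each output tile stays local until the tile is complete. A sparse format that delivers its data in tile-sized pieces can slot into this loop nest without changing its shape.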

Custom CUDA Kernels: Fusion and Compression

The second core contribution is a set of custom CUDA kernels with two key capabilities:

Fusing multiple sparse matrix multiplications: By merging multiple operations into a single kernel call, GPU throughput is maximized while reducing kernel launch overhead and intermediate data memory reads/writes.
Compressing TwELL into a hybrid representation: Further compressing the sparse format into a hybrid representation that minimizes activation storage size, significantly reducing memory footprint.

In GPU programming, each kernel launch carries non-negligible overhead: host-side launch and scheduling latency, plus the memory round-trip cost of writing intermediate results to global memory for the next kernel to read. Kernel fusion is an optimization technique that merges multiple logically separate operations into a single CUDA kernel, so intermediate results can stay in registers or shared memory. This avoids expensive global memory reads and writes and reduces the number of launches. In Transformer models, feed-forward layers typically contain multiple matrix multiplications and activation functions, and fusing these operations can significantly boost throughput. This paper's custom CUDA kernels fuse not only the computational operations but also the sparse format's compression and decompression logic, unifying computation and data management.
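The difference between separate and fused operations can be illustrated with a CPU analogy. The sketch below assumes a simple ReLU feed-forward layer and is not the paper's kernel; the weight layout (one row of `w_up` and one row of `w_down` per hidden neuron) is chosen here for convenience.

```python
def ffn_unfused(x, w_up, w_down):
    """Three separate passes, each materializing a full intermediate list."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in w_up]   # "kernel" 1: up-projection
    hidden = [max(0.0, h) for h in hidden]                            # "kernel" 2: ReLU
    n_out = len(w_down[0])
    return [sum(hidden[i] * w_down[i][j] for i in range(len(hidden)))  # "kernel" 3: down-projection
            for j in range(n_out)]


def ffn_fused(x, w_up, w_down):
    """One pass: each hidden activation is used immediately and never stored."""
    out = [0.0] * len(w_down[0])
    for up_row, down_row in zip(w_up, w_down):
        h = max(0.0, sum(w * xi for w, xi in zip(up_row, x)))
        if h == 0.0:
            continue                      # silent neuron: skip its down-projection work
        for j, w in enumerate(down_row):
            out[j] += h * w
    return out


x = [1.0, -2.0]
w_up = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]      # 3 hidden neurons, 2 inputs
w_down = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3 hidden neurons, 2 outputs
print(ffn_unfused(x, w_up, w_down), ffn_fused(x, w_up, w_down))  # [1.0, 2.0] [1.0, 2.0]
```

Both versions return the same result, but the fused one never writes the hidden vector anywhere, and it skips the two neurons that ReLU zeroed out. On a GPU, that intermediate would otherwise travel to global memory and back between kernels.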

These two techniques work synergistically, delivering substantial improvements in the actual hardware execution efficiency of sparse Transformers.

Experimental Results: Performance Validation at Billion-Parameter Scale

The research team used these kernels to train and benchmark sparse LLMs at billion-parameter scale, with exciting results:

Over 20% improvement in inference and training speed
Significantly reduced peak memory usage
Energy savings even exceeding the speed improvement ratio

In deep learning optimization research, many techniques perform excellently on small-scale models but encounter various issues when scaling to production size: communication overhead, memory fragmentation, numerical stability, etc. This paper's choice to validate at billion-parameter (1B) scale is significant—this scale is large enough to expose real deployment engineering challenges while also being the target scale for many current on-device and edge deployment scenarios (such as LLMs on phones and embedded devices). Over 20% end-to-end acceleration means serving more user requests on the same hardware, or using cheaper hardware under the same latency requirements, which has direct economic value for LLM commercial deployment.

Importantly, these speedups are not theoretical FLOP-count estimates but end-to-end performance improvements measured on actual GPU hardware. This means the theoretical advantages of sparsity are finally beginning to translate into real engineering gains.

Why This Work Deserves Attention

Breaking Sparsity's "Practicality Dilemma"

Model sparsity has long been one of the "holy grails" of deep learning optimization. Researchers have long known that trained neural networks contain highly redundant parameters and activations, but converting this sparsity into actual speed and memory benefits has remained an unsolved challenge. As early as 2018, the "Lottery Ticket Hypothesis" showed that dense networks contain sparse subnetworks that can match the full network's accuracy when trained in isolation, but finding such subnetworks cheaply and executing them efficiently on real hardware has lacked a practical solution. This paper offers a viable path through "format-kernel" co-design.

A Paradigm Shift in Software-Hardware Co-Design

This work embodies an important trend: future AI acceleration requires not just algorithm-level optimization but deep co-design between algorithms and hardware. The Sakana AI and NVIDIA collaboration itself illustrates the point: only by understanding the GPU's underlying execution model can one design truly efficient sparse formats and kernels. The trend has precedents in industry. Google's TPU has an architecture built specifically for matrix operations, and NVIDIA introduced 2:4 structured sparsity support in the Ampere architecture, which requires at least 2 of every 4 consecutive elements to be zero, a fixed 50% sparsity. This paper demonstrates how to achieve similar hardware-friendliness at much higher sparsity levels (over 95%).
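For comparison, the 2:4 pattern is simple enough to express directly. The sketch below keeps the two largest-magnitude values in each group of four; magnitude-based selection is a common pruning heuristic assumed here for illustration, and the function name `prune_2_of_4` is hypothetical.

```python
def prune_2_of_4(row):
    """Zero out the two smallest-magnitude values in every group of four."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # indices of the two largest magnitudes in this group
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.1, -0.9, 0.4, 0.05, 2.0, 0.0, -0.3, 1.5]
print(prune_2_of_4(weights))  # [0.0, -0.9, 0.4, 0.0, 2.0, 0.0, 0.0, 1.5]
```

The pattern is rigid by design: exactly half of each group survives, which makes it easy for hardware to exploit but caps sparsity at 50%, far below the 95%+ activation sparsity discussed in this article.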

Open-Source Ecosystem Contribution

The paper is accompanied by open-source GPU kernel and data format implementations (GitHub repository), providing the community with solid infrastructure for further exploring sparse LLMs. Researchers and engineers can build more efficient sparse models directly on this foundation.

Future Outlook

As LLM scale continues to grow, inference costs and energy consumption become increasingly pressing issues. The "making sparsity fit hardware" approach demonstrated in this paper may become an important technical direction for next-generation efficient LLMs. If sparse formats and kernels can further mature, we may be able to reduce LLM operational costs by an order of magnitude without sacrificing model capability.

Notably, this direction does not conflict with other current efficiency optimization techniques (such as quantization, knowledge distillation, speculative decoding, etc.) and can be stacked together. For example, combining TwELL format with INT4 quantization could theoretically achieve dual acceleration from both sparsity and low precision. As next-generation GPU architectures (such as NVIDIA Blackwell and its successors) may introduce more flexible sparse computation support, software-level sparse format innovation will become even more important.

This work will be officially presented at ICML 2026. Interested readers can explore the technical details through the paper and technical blog.