5 related articles
Hyper-Connections: The First Major Imp…
Deep dive into ByteDance's Hyper-Connections: expanding residual connections from one to multiple learnable pathways, significantly improving training under the same compute budget.
Deep DivesUnderstand Transformer through the lens of word continuation. Breaking down language generation into Embedding, Transformer Block, and Probability output modules for intuitive understanding.
TutorialsHow to build a structured paper workflow with Claude Code: three core Skills for material classification, literature evidence matching, and reviewer simulation, plus six reusable AI-assisted research principles.
Deep DivesDeep analysis of DeepSeek V4's core architecture: Hybrid Compressed Attention, Manifold-Constrained Hyperconnection, and MUON optimizer—how they cut inference costs by 10x and enable million-token context processing.
Deep DivesDeep dive into Transformer architecture covering self-attention QKV mechanics, Encoder-Decoder structure, Flash Attention memory optimization, RoPE positional encoding, and GQA inference acceleration.