3 related articles
Deep DivesAnalyzing the "worse is better" philosophy in large model architecture: why DeepSeek V4 dropped N-gram, why Transformer dominates AI, and three iron laws of simple, efficient model design.
Deep DivesDeep analysis of DeepSeek V4's core architecture: Hybrid Compressed Attention, Manifold-Constrained Hyperconnection, and MUON optimizer—how they cut inference costs by 10x and enable million-token context processing.
Deep DivesDeep dive into Transformer architecture covering self-attention QKV mechanics, Encoder-Decoder structure, Flash Attention memory optimization, RoPE positional encoding, and GQA inference acceleration.