Agent Loops in Practice: Turning Token Output into Productivity, from CUDA Kernels to Automated Research

Overview

This week's SGLang Office Hours featured Ligeng Zhu, a Research Scientist at NVIDIA Research, who explained how Agent Loops turn the tokens generated by large models into real engineering productivity. He introduced Humanize, an Agentic Flow framework that lets AI Agents run autonomously and solve complex engineering and research problems the way human engineers do, and demonstrated its effectiveness through three long-running Agent Loop case studies.

Background: SGLang and Agent Loops

SGLang (Structured Generation Language) is a high-performance framework designed for large language model inference, developed by researchers at UC Berkeley and other institutions. Through innovations like RadixAttention, it achieves efficient KV Cache reuse, delivering several times the throughput of traditional inference frameworks when handling complex multi-turn conversations and Agent Loop scenarios.

An Agent Loop refers to the working mode where an AI system continuously operates in a closed loop of perception-thinking-action-feedback. Unlike single-shot inference, its core distinction lies in introducing an environmental feedback mechanism that allows the model to dynamically adjust strategies based on execution results.
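To make the KV Cache reuse behind RadixAttention more concrete, here is a minimal sketch of prefix matching across the turns of one conversation. It is a toy token trie written for illustration, not SGLang's implementation (which uses a radix tree with eviction); the class name `PrefixCache` and the token IDs are assumptions.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy token-prefix trie; each node stands in for one cached KV entry."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def match_and_insert(self, tokens: List[int]) -> Tuple[int, int]:
        """Return (reused, computed) token counts for one request."""
        node = self.root
        reused = 0
        # Walk the trie while the request shares a prefix with earlier requests.
        while reused < len(tokens) and tokens[reused] in node:
            node = node[tokens[reused]]
            reused += 1
        # Everything past the shared prefix must be computed and is then cached.
        for tok in tokens[reused:]:
            node[tok] = {}
            node = node[tok]
        return reused, len(tokens) - reused


cache = PrefixCache()
turn1 = [1, 2, 3, 4]      # hypothetical system prompt + first user message
turn2 = turn1 + [5, 6]    # same history plus a tool result
turn3 = turn1 + [7]       # a different branch off the same history
stats = [cache.match_and_insert(t) for t in (turn1, turn2, turn3)]
print(stats)  # [(0, 4), (4, 2), (4, 1)]
```

In an Agent Loop every iteration resends the growing history, so the share of tokens that can be served from the cache rises with each turn, which is why prefix reuse matters more here than in single-shot inference.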

The Humanize Framework: Enabling Agents to Iterate Like Human Engineers

Humanize is an Agentic Flow framework whose core design philosophy is to enable AI Agents to run autonomously and approach complex engineering and research problems with the mindset of a human engineer. Unlike traditional single-shot inference, Humanize emphasizes long-running Agent Loops—where the Agent continuously iterates, debugs, and optimizes within a loop until the objective is achieved.

Modern Agentic Flow frameworks typically include the following core components: a Task Planner that decomposes complex goals into executable subtasks; a Tool Use Layer that provides external capability interfaces such as code execution, web search, and file operations; a Memory System divided into short-term working memory and long-term vector database storage; and a Reflection Module for evaluating execution results and triggering retries or strategy adjustments. Humanize's innovation lies in deeply customizing these components for engineering and research scenarios, with targeted optimizations in long-running stability, error recovery mechanisms, and domain knowledge injection, enabling Agents to run continuously for hours without human supervision to complete complex tasks.
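The components listed above can be wired together roughly as follows. This is a minimal sketch under stated assumptions, not Humanize's actual design: the `AgentLoop` class, its field names, and the stub functions are all hypothetical, and the "tools" are plain Python callables rather than real code execution or search.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentLoop:
    plan: Callable[[str], List[str]]        # task planner: goal -> subtasks
    act: Callable[[str, List[str]], str]    # tool layer: (subtask, memory) -> result
    reflect: Callable[[str, str], bool]     # reflection: is the result acceptable?
    max_retries: int = 3
    memory: List[str] = field(default_factory=list)  # short-term working memory

    def run(self, goal: str) -> bool:
        for subtask in self.plan(goal):
            for attempt in range(1, self.max_retries + 1):
                result = self.act(subtask, self.memory)
                self.memory.append(f"{subtask}#{attempt}: {result}")
                if self.reflect(subtask, result):
                    break               # subtask done, move on
            else:
                return False            # retries exhausted for this subtask
        return True


def stub_plan(goal: str) -> List[str]:
    return ["write", "test"]


def stub_act(subtask: str, memory: List[str]) -> str:
    # Fails "test" until an earlier attempt at it is visible in memory.
    if subtask == "test" and not any(m.startswith("test#") for m in memory):
        return "error"
    return "ok"


def stub_reflect(subtask: str, result: str) -> bool:
    return result == "ok"


loop = AgentLoop(stub_plan, stub_act, stub_reflect)
succeeded = loop.run("ship feature")
print(succeeded, loop.memory)
# True ['write#1: ok', 'test#1: error', 'test#2: ok']
```

The retry on "test" is the point of the sketch: the second attempt succeeds only because the failed first attempt was written to memory, which is the feedback path that single-shot inference lacks.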

This design philosophy reflects an important shift in the current AI engineering landscape: moving from "one-shot generation" to "continuous iteration." Tokens alone do not equal productivity—only when tokens are organized into effective workflows and continuously improved through feedback loops can they produce truly valuable engineering outcomes.

Three Case Studies: Validating Agent Loop Productivity

KDA: Automatically Writing High-Performance CUDA Kernels

The first case study is KDA (Kernel Development Agent), which automatically writes high-performance CUDA kernels. In the MLSys FlashInfer Kernel Contest, kernels generated by KDA ranked 1st through 3rd. This means AI Agents can already compete with top human engineers in the highly specialized domain of GPU programming.

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model, which lets developers program the thousands of compute cores on a GPU directly. Writing high-performance CUDA kernels requires mastery of low-level techniques including thread block partitioning, shared memory management, coalesced memory access, and warp-level synchronization, so the learning curve is extremely steep. FlashInfer is a high-performance CUDA operator library designed specifically for LLM inference, and its Kernel Contest asks participants to push core operators such as attention computation to extreme levels of optimization. Participants are typically expert engineers with years of GPU programming experience. KDA's top-three finish suggests that Agent Loops can already match or outperform experienced human engineers in this highly specialized domain.

KDA's success shows that the Agent Loop paradigm has a natural advantage in tasks requiring repeated experimentation and performance tuning—Agents can rapidly try hundreds of optimization strategies within the loop, far exceeding the iteration speed of human engineers.
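The try-measure-keep cycle described above can be sketched as a small tuning loop. The source does not describe how KDA searches, so everything here is an assumption for illustration: the two tuning knobs, the exhaustive grid, and especially `fake_benchmark`, which returns made-up timings in place of compiling and timing a real kernel.

```python
import itertools
from typing import Callable, Dict, Iterable, Optional, Tuple

Config = Tuple[int, int]  # (block_size, unroll), hypothetical tuning knobs


def fake_benchmark(cfg: Config) -> float:
    """Stand-in for compiling and timing a kernel; returns made-up milliseconds."""
    block_size, unroll = cfg
    return abs(block_size - 256) / 64 + abs(unroll - 4) * 0.5 + 1.0


def tune(
    candidates: Iterable[Config],
    benchmark: Callable[[Config], float],
    budget: int,
) -> Tuple[Optional[Config], float, Dict[Config, float]]:
    best_cfg: Optional[Config] = None
    best_ms = float("inf")
    history: Dict[Config, float] = {}
    for cfg in itertools.islice(candidates, budget):
        ms = benchmark(cfg)      # feedback from the environment
        history[cfg] = ms
        if ms < best_ms:         # keep only strict improvements
            best_cfg, best_ms = cfg, ms
    return best_cfg, best_ms, history


grid = itertools.product([64, 128, 256, 512], [1, 2, 4, 8])
best_cfg, best_ms, history = tune(grid, fake_benchmark, budget=16)
print(best_cfg, best_ms, len(history))  # (256, 4) 1.0 16
```

A real agent would replace the fixed grid with candidates proposed from earlier measurements, but the control flow stays the same: every candidate is judged by a measurement, not by how plausible it looked when it was generated.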

Anthropic Virtual Hardware: Optimizing for Hardware That Doesn't Yet Exist

The second case study is even more forward-looking—optimizing computation on hardware that doesn't physically exist yet. This case demonstrates a unique capability of Agent Loops: they can conduct extensive exploration and optimization in virtual environments without being constrained by physical hardware availability.

Hardware-Software Co-design is an important paradigm in modern chip design: the supporting software stack is developed during the chip design phase so that it can take full advantage of the hardware's characteristics. In traditional workflows, software engineers must wait for tape-out, packaging, and testing before they can get real hardware, a cycle that typically takes 12-18 months. Virtual Hardware instead models the behavior of the target hardware in software, using tools such as Verilator for cycle-accurate RTL simulation or gem5 for architecture-level simulation, so that software optimization can start early. Anthropic likely employed a similar approach in this case, allowing Agents to explore optimal operator implementation strategies in simulated hardware environments. This is valuable for early development of software stacks for next-generation AI accelerator chips.
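As a rough illustration of optimizing against a simulator instead of silicon, the sketch below scores two instruction orderings on a toy in-order machine. The machine, its two-op instruction set, and the latencies are invented for this example; it is far simpler than gem5 or Verilator and says nothing about the setup used in the case study.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str, str]   # (op, dst, src1, src2)
LATENCY = {"add": 1, "mul": 3}      # hypothetical per-op latencies in cycles


def simulate(program: List[Instr], regs: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Run a program on a toy in-order machine; return (registers, total cycles)."""
    regs = dict(regs)
    ready: Dict[str, int] = {}      # cycle at which each result becomes available
    cycle = 0
    for op, dst, a, b in program:
        # Stall until both operands are ready; at most one issue per cycle.
        issue = max(cycle, ready.get(a, 0), ready.get(b, 0))
        regs[dst] = regs[a] + regs[b] if op == "add" else regs[a] * regs[b]
        ready[dst] = issue + LATENCY[op]
        cycle = issue + 1
    return regs, max([cycle] + list(ready.values()))


inputs = {"x": 2, "y": 3, "z": 4, "w": 5}
naive = [("mul", "t0", "x", "y"), ("add", "t1", "t0", "z"),
         ("mul", "t2", "w", "w"), ("add", "out", "t1", "t2")]
# Same computation, with the independent multiply hoisted to hide latency.
reordered = [("mul", "t0", "x", "y"), ("mul", "t2", "w", "w"),
             ("add", "t1", "t0", "z"), ("add", "out", "t1", "t2")]

naive_regs, naive_cycles = simulate(naive, inputs)
fast_regs, fast_cycles = simulate(reordered, inputs)
print(naive_regs["out"], naive_cycles, fast_regs["out"], fast_cycles)  # 35 8 35 5
```

The simulator returns both the result and a cycle count, so a loop can reject any candidate that changes the output and rank the remaining ones by cost, all before any physical chip exists.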

Agent Loops can complete extensive optimization work in advance within simulated environments, dramatically shortening time-to-market and bringing an entirely new engineering paradigm to chip design and hardware-software co-optimization.

JetAutoResearch: Cutting Research Workflow Costs by Over 50%

The third case study is JetAutoResearch, which uses Ahead-of-Time (AOT) Compilation techniques to reduce AutoResearch workflow costs by over 50%. This directly addresses a core pain point in current AI-automated research: workflows are powerful but computationally expensive.

In the context of AI Agent workflows, Ahead-of-Time (AOT) compilation takes on a broader meaning: pre-analyzing Agent execution paths, caching intermediate computation results, and pre-planning tool call sequences so that redundant inference is avoided at runtime. AutoResearch workflows typically involve multiple serial or parallel steps, including literature retrieval, experiment design, code generation, and result analysis, and each step consumes substantial tokens. JetAutoResearch achieves its cost reduction of over 50% by statically analyzing the workflow's dependency graph, identifying reusable computation subgraphs to cache, and converting dynamic inference into lookup operations. The approach is a useful reference for the engineering deployment of any complex Agent workflow, and it lowers the economic barrier to scaling Agent Loops.
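The idea of caching reusable subgraphs and turning repeated inference into lookups can be sketched as memoization over a declared dependency graph. This is an illustration under assumptions, not JetAutoResearch's implementation: the `CachedWorkflow` class, the four-node workflow, and the string-returning step functions (stand-ins for expensive model calls) are all hypothetical.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

# Hypothetical workflow description: node -> (dependencies, step function)
Graph = Dict[str, Tuple[List[str], Callable[..., str]]]


class CachedWorkflow:
    def __init__(self, graph: Graph) -> None:
        self.graph = graph
        self.cache: Dict[str, str] = {}
        self.calls = 0  # number of expensive step executions

    def _key(self, node: str, inputs: List[str]) -> str:
        return hashlib.sha256("|".join([node] + inputs).encode()).hexdigest()

    def run(self, node: str, params: Dict[str, str]) -> str:
        deps, fn = self.graph[node]
        inputs = [self.run(d, params) for d in deps]
        if not deps:
            inputs = [params[node]]     # leaf nodes read an external parameter
        key = self._key(node, inputs)
        if key not in self.cache:       # miss: pay for the step once
            self.calls += 1
            self.cache[key] = fn(*inputs)
        return self.cache[key]          # hit: a lookup replaces the step


graph: Graph = {
    "literature": ([], lambda topic: f"papers({topic})"),
    "dataset": ([], lambda name: f"data({name})"),
    "design": (["literature"], lambda lit: f"plan[{lit}]"),
    "analysis": (["design", "dataset"], lambda plan, data: f"report[{plan};{data}]"),
}

wf = CachedWorkflow(graph)
r1 = wf.run("analysis", {"literature": "kv-cache", "dataset": "A"})
first_calls = wf.calls
r2 = wf.run("analysis", {"literature": "kv-cache", "dataset": "B"})
print(first_calls, wf.calls - first_calls, r2)
# 4 2 report[plan[papers(kv-cache)];data(B)]
```

In this toy run the second experiment executes 2 of the 4 steps, because the literature and design steps depend only on unchanged inputs. How much a real workflow saves depends on how much of its graph is shared between runs.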

Future Directions for Agent-Centric Research

Ligeng Zhu also shared the development roadmap for Humanize 2.0 / 3.0, along with his perspective on the future trajectory of Agent-Centric research and engineering. Based on the information disclosed, several key trends are worth noting:

Long-duration autonomous operation: Agents are no longer "one question, one answer" tools, but systems capable of working independently for hours or even days
Cost efficiency optimization: Reducing token consumption in Agent Loops while maintaining effectiveness is the critical bottleneck for engineering deployment
Deep specialization in professional domains: Shifting from general-purpose assistants to reaching or even surpassing human expert levels in specific professional domains (such as CUDA programming and hardware optimization)

These directions collectively point to a core proposition: the value of an Agent lies not in the intelligence level of a single conversation, but in whether it can continuously deliver high-quality outcomes in real engineering scenarios.

Conclusion: From Token Generation to a Productivity Closed Loop

From KDA's top-three finish in a CUDA kernel contest, to optimizing computation for hardware that doesn't yet exist, to cutting research workflow costs by more than half, the Humanize framework demonstrates the enormous potential of the Agent Loop paradigm.

The core insight is this: true productivity doesn't come from generating more tokens—it comes from organizing tokens into effective iterative loops that allow AI to think, experiment, and improve like a human engineer. When an Agent can autonomously complete the full closed loop of "hypothesis-experiment-feedback-optimization" during long-running sessions, tokens are truly transformed into measurable engineering productivity.

Key Takeaways

The Humanize framework enables AI Agents to run autonomously, solving complex engineering and research problems with human-like reasoning
CUDA kernels automatically written by KDA ranked in the top three at the MLSys FlashInfer Kernel Contest
JetAutoResearch uses ahead-of-time compilation to cut AutoResearch workflow costs by over 50%
The core value of Agent Loops lies in organizing tokens into effective iterative cycles, not simple one-shot generation
Agent-Centric research is advancing toward long-duration autonomous operation, cost optimization, and deep domain specialization