LFM2.5 Local Deployment Hands-On: An 8B Parameter Model That Outperforms GPT-o3s in Tool Calling

Introduction: An Agent Model Built for Local Deployment

Liquid AI has open-sourced the LFM2.5 model. This isn't yet another behemoth that can only run on servers — it's a lightweight Agent model specifically designed for local deployment and edge devices. With a total of 8.3B parameters but only 1.5B activated per inference, it can run smoothly on consumer-grade GPUs while delivering surprisingly strong tool-calling capabilities.

This article provides a comprehensive evaluation of LFM2.5 across three dimensions: architecture analysis, local deployment troubleshooting, and hands-on GraphRAG tool-calling benchmarks.

LFM2.5 Architecture Breakdown: Why It's Both Fast and Efficient

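The Hybrid Advantage of MoE + Liquid Convolution

LFM2.5 uses a MoE (Mixture of Experts) architecture, activating only about 18% (~1.5B) of its 8.3B total parameters per token. This directly reduces per-token compute and the volume of weights read from memory, which is where the speed gain comes from. All 8.3B parameters still have to be resident in memory, so the saving is in computation rather than in storage.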

MoE (Mixture of Experts) is a conditional computation architecture first proposed by Jacobs et al. in 1991 and recently revitalized by Google's Switch Transformer and Mistral's Mixtral. The core idea is to distribute model parameters across multiple "expert" sub-networks, with a gating network (Router) selecting only a few experts for each token. This allows the model to hold a large number of parameters (representing greater knowledge capacity) while keeping actual computation cost proportional only to the activated parameters. LFM2.5's ratio of 8.3B total to 1.5B activated parameters (~18%) means it has knowledge capacity approaching that of an 8B model with per-token compute closer to that of a 1.5B model, although the full set of weights must still be loaded.
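To make the routing idea concrete, here is a minimal sketch of top-k expert selection in plain Python. It is a toy illustration, not LFM2.5's actual router: the expert count, the value of k, the scalar "experts", and all function names are assumptions chosen for readability.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the selected experts only."""
    weights = route_top_k(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical toy setup: 8 scalar "experts", 2 active per token.
NUM_EXPERTS, TOP_K = 8, 2
experts = [lambda x, scale=i + 1: scale * x for i in range(NUM_EXPERTS)]
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_top_k(logits, TOP_K)
output = moe_forward(1.0, experts, logits, TOP_K)

# Ratio quoted in the article: 1.5B activated out of 8.3B total.
active_fraction = 1.5 / 8.3
print(f"experts used: {sorted(selected)}, active fraction: {active_fraction:.0%}")
```

Only the selected experts are ever called in `moe_forward`, which is why compute tracks the activated parameters while every expert still has to exist in memory.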

But the more critical innovation lies in its layer design: of its 24 layers, 18 use Liquid AI's proprietary LIV (Linear Input-Varying) convolution, with only 6 layers retaining traditional attention mechanisms.

Liquid convolution is Liquid AI's core technical innovation, inspired by the neural circuit dynamics of the roundworm C. elegans. Unlike the standard Transformer's self-attention mechanism, liquid convolution processes sequential information with dynamically adjustable convolution kernels, so its computational cost scales linearly with sequence length (O(n)) rather than quadratically as self-attention does (O(n²)). The gap widens as the sequence grows: when processing 128K tokens of long text, the computational load of a liquid convolution layer can be on the order of 1/1000th that of an attention layer, with the exact ratio depending on kernel width and implementation. Liquid AI first publicly introduced this technology in LFM1.0, and LFM2.5 is its evolved version for large-scale commercial use.
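The linear-versus-quadratic argument can be checked with a rough operation count. The sketch below only models the sequence-mixing term of each layer type and ignores projections and other shared costs; the hidden size and kernel width are hypothetical values, not LFM2.5's published configuration.

```python
def attention_ops(n, d):
    """Rough multiply-add count for attention scores plus value mixing: ~2 * n^2 * d."""
    return 2 * n * n * d


def conv_ops(n, d, kernel):
    """Rough multiply-add count for a depthwise 1-D convolution: ~n * d * kernel."""
    return n * d * kernel


D_MODEL = 2048  # hypothetical hidden size
KERNEL = 4      # hypothetical kernel width

for n in (1_024, 16_384, 131_072):
    ratio = attention_ops(n, D_MODEL) / conv_ops(n, D_MODEL, KERNEL)
    print(f"n={n:>7}: attention / convolution ~ {ratio:,.0f}x")
```

Under these assumptions the ratio is simply 2n divided by the kernel width, so it doubles every time the context doubles. The absolute figure is sensitive to the constants chosen, which is why any single "N times cheaper" number should be read as an order-of-magnitude estimate.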

Standard Transformer attention mechanisms see computation grow quadratically with context length — a major performance bottleneck for long-sequence inference. Liquid convolution doesn't have this problem. By replacing most layers, the model achieves dramatically faster inference on long sequences with lower memory usage. The remaining 6 GQA attention layers ensure long-range comprehension capability, creating a complementary architecture.

GQA (Grouped Query Attention) is an attention optimization proposed by Google researchers in 2023 and adopted by Meta in LLaMA 2, sitting between standard Multi-Head Attention (MHA) and Multi-Query Attention (MQA). It divides the Query heads into groups, each sharing one Key head and one Value head, which preserves most of MHA's expressiveness while significantly reducing KV Cache memory usage. In LFM2.5, the 6 GQA attention layers handle tasks requiring global dependency modeling (such as long-distance coreference resolution), while the 18 liquid convolution layers handle local feature extraction, forming a hybrid architecture with a clear division of labor.
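As a small illustration of the grouping, the sketch below maps query heads to the KV head they share and compares the resulting KV cache size against full MHA. The head counts are hypothetical and are not taken from LFM2.5's configuration.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Map a query head index to the KV head its group shares."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_ratio(num_query_heads, num_kv_heads):
    """KV cache size relative to full multi-head attention."""
    return num_kv_heads / num_query_heads


NUM_Q = 32  # hypothetical number of query heads

mha = [kv_head_for(h, NUM_Q, 32) for h in range(NUM_Q)]  # one KV head per query head
gqa = [kv_head_for(h, NUM_Q, 8) for h in range(NUM_Q)]   # 4 query heads share a KV head
mqa = [kv_head_for(h, NUM_Q, 1) for h in range(NUM_Q)]   # all query heads share one

print("GQA mapping for the first 8 query heads:", gqa[:8])
print("KV cache vs MHA with 8 KV heads:", kv_cache_ratio(NUM_Q, 8))
```

MHA and MQA fall out as the two extremes of the same function, which is the sense in which GQA sits between them.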

Training Scale and Agent-Specific Optimization

Training data reached 38 trillion tokens — three times that of the previous generation LFM2. More importantly, the model underwent large-scale reinforcement learning (RL) with specific optimization for tool calling and Agent tasks. It supports a 131K token context window, 9 languages (including Chinese), and uses the LFM1.0 open-source license that permits commercial use.

LFM2.5 benchmark scores

Looking at the official benchmarks, the tool-calling scores on BFCL-v3 and BFCL-v4 improved by 19 and 23 points respectively over the previous generation. BFCL (Berkeley Function Calling Leaderboard) is a function-calling benchmark published by UC Berkeley that measures LLM performance in realistic tool-calling scenarios. BFCL-v3 added multi-turn, multi-step function calling on top of the earlier single-turn tests, and BFCL-v4 extends the evaluation to agentic tasks such as web search and memory management. Across versions, the benchmark covers API format parsing, parameter type inference, and the decision of when to call a tool at all, which makes it one of the most widely cited standards for evaluating Agent capabilities. LFM2.5's improvement on v4 suggests a substantial gain in complex tool orchestration.

The telecom test simulating real Agent scenarios improved by 74 points — an extraordinary margin. Instruction following (IFEval) reached 91.8, and math reasoning (Math500) reached 88.8.

LFM2.5 Local Deployment: Full Troubleshooting Log on 16GB VRAM

Environment Configuration Essentials

The test environment used an RTX 5060 Ti (Blackwell architecture, 16GB VRAM) on Ubuntu 22.04 with CUDA 13.0. Deployment used the vLLM framework, which required a Python 3.12 environment.

vLLM is a high-performance LLM inference framework that originated at UC Berkeley. Its core innovation, PagedAttention, manages the KV Cache in fixed-size blocks, much as an operating system pages virtual memory, boosting memory utilization from the typical 50-60% of traditional frameworks to over 90%. For function calling, vLLM can apply guided decoding to constrain output to structured data conforming to a JSON Schema when a specific tool is named or required. In automatic mode it instead parses the model's free-form output with a tool-call parser, so correct tool parameter formatting depends on both the model and the parser.
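The utilization gain from paging can be illustrated with a simplified accounting model. This is only a sketch of the idea, not vLLM's allocator: the request lengths, the maximum length, and the block size are hypothetical.

```python
import math


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every request preallocates max_len tokens."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size):
    """Token slots reserved when each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def utilization(seq_lens, reserved):
    """Fraction of reserved slots that actually hold a cached token."""
    return sum(seq_lens) / reserved


# Hypothetical workload: four requests of very different lengths.
lengths = [120, 900, 45, 2_000]
MAX_LEN, BLOCK = 4_096, 16

contig = utilization(lengths, contiguous_slots(lengths, MAX_LEN))
paged = utilization(lengths, paged_slots(lengths, BLOCK))
print(f"contiguous: {contig:.1%}, paged: {paged:.1%}")
```

With paging, the waste per request is bounded by less than one block, so utilization stays high no matter how uneven the request lengths are.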

Key steps:

Create a clean conda environment (Python 3.12)
When installing vLLM, specify the Extra Index URL (required for CUDA 13.0)
Model download is approximately 17GB

Critical Startup Parameters Explained

FP8 quantization is essential: The BF16 original is 17GB, which won't fit in 16GB VRAM. FP8 quantization allows normal operation with minimal quality loss. After startup, it occupies approximately 14GB VRAM.

FP8 (8-bit floating point) quantization is an inference acceleration technology first supported in hardware by NVIDIA's Hopper architecture (H100) and further optimized in Ada/Blackwell architectures. Compared to BF16 (16-bit brain floating point), FP8 halves the storage space per parameter while leveraging the GPU's FP8 Tensor Cores for nearly lossless computation. FP8 comes in two formats: E4M3 offers higher precision and is suitable for weight storage, while E5M2 has a larger dynamic range and is better suited for activation values. In practice, FP8 quantization typically impacts LLM output quality by less than 1% in benchmark score differences while delivering nearly 2x memory savings.
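The weight-size arithmetic behind the FP8 requirement is easy to reproduce. The sketch below counts weight storage only, using the 8.3B parameter total quoted earlier; KV cache, activations, and framework overhead come on top.

```python
BYTES_PER_PARAM = {"bf16": 2, "fp8": 1}


def weight_gb(num_params, dtype):
    """Weight storage in GB (10^9 bytes), ignoring KV cache and activations."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


TOTAL_PARAMS = 8.3e9  # total parameter count quoted in the article

bf16_gb = weight_gb(TOTAL_PARAMS, "bf16")
fp8_gb = weight_gb(TOTAL_PARAMS, "fp8")
print(f"BF16: {bf16_gb:.1f} GB, FP8: {fp8_gb:.1f} GB")
```

The BF16 figure lines up with the roughly 17GB download. The gap between the FP8 weight size and the roughly 14GB observed after startup is presumably KV cache that vLLM reserves up front plus runtime overhead.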

Max Model Length (--max-model-len) needs to be limited: Primarily to reserve VRAM for the KV Cache, which grows with the context length being served.

Enable Auto Tool Choice + Native Mode: This is the master switch for vLLM tool calling (the --enable-auto-tool-choice flag, paired with a tool-call parser). Without it, the API outright rejects requests that carry a tools parameter. Once it is enabled, vLLM detects tool-calling intent in the model output and parses the result into the OpenAI-compatible tool_calls format for the client.
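Put together, the startup parameters above might be assembled as follows. This is a sketch that only builds and prints the command rather than running it. The model identifier is a placeholder because the article does not give one, the context cap is a hypothetical value, and the flag names assume a vLLM release that accepts `--quantization`, `--max-model-len`, `--enable-auto-tool-choice`, and `--tool-call-parser`.

```python
import shlex

MODEL_ID = "<model-id-or-local-path>"  # placeholder: not given in the article
MAX_MODEL_LEN = 32768                  # hypothetical cap; tune to your VRAM

command = [
    "vllm", "serve", MODEL_ID,
    "--quantization", "fp8",                # FP8 weights so the model fits in 16GB
    "--max-model-len", str(MAX_MODEL_LEN),  # leave headroom for the KV cache
    "--enable-auto-tool-choice",            # accept requests that include tools
    "--tool-call-parser", "pythonic",       # server-side parser; see its caveats
]

print(shlex.join(command))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later without shell-quoting surprises.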

[Figure: LFM2.5 deployment considerations]

Important Limitations with 16GB VRAM

If you want to integrate with Agent frameworks like Hermes or OpenCore, they require at least 64K context — which 16GB VRAM simply cannot handle. The current configuration is suitable for writing your own code to call the API directly. For running full Agent frameworks, 24GB+ VRAM is recommended.

Additionally, the Pythonic tool-call parser setting (--tool-call-parser pythonic) in vLLM 0.22.0 did not actually parse successfully in testing, because LFM2.5 wraps its tool-calling output in start and end marker tokens that the Pythonic parser doesn't recognize. You'll need to pair it with client-side parsing logic.
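Client-side parsing of this kind can be fairly small. The sketch below strips the wrapper markers and reads the Pythonic call list with Python's `ast` module, so nothing from the model is ever executed. The marker strings, the function name `parse_tool_calls`, and the sample output are assumptions for illustration; check the model's chat template for the exact tokens. Only keyword arguments with literal values are handled.

```python
import ast
import re

# Assumed wrapper markers; verify against the model's chat template.
START, END = "<|tool_call_start|>", "<|tool_call_end|>"
WRAPPED = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)


def parse_tool_calls(text):
    """Extract [{'name': ..., 'arguments': {...}}] from wrapped Pythonic calls."""
    calls = []
    for body in WRAPPED.findall(text):
        tree = ast.parse(body.strip(), mode="eval").body
        nodes = tree.elts if isinstance(tree, (ast.List, ast.Tuple)) else [tree]
        for node in nodes:
            if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
                raise ValueError("expected a plain function call")
            # literal_eval accepts AST nodes and rejects anything non-literal.
            args = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
            calls.append({"name": node.func.id, "arguments": args})
    return calls


# Hypothetical model output with one wrapped call.
sample = (
    "<|tool_call_start|>[search_entities(query='Zhang Fei', top_k=3)]"
    "<|tool_call_end|>"
)
calls = parse_tool_calls(sample)
print(calls)
```

The resulting dictionaries can be mapped onto whatever tool registry the client uses, or reshaped into the OpenAI tool_calls structure if downstream code expects it.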

GraphRAG Tool-Calling Benchmark: LFM2.5 vs. GPT-o3s

Test Design

The test used the complete text of Romance of the Three Kingdoms, running through the full GraphRAG data processing pipeline (chunking, entity extraction, relationship extraction, community clustering). The four types of data were vectorized and stored in four separate vector databases, with four retrieval tools defined accordingly.

GraphRAG is a retrieval-augmented generation framework proposed by Microsoft Research in 2024, addressing the shortcomings of traditional RAG when handling complex relational queries. Traditional RAG only performs text chunking + vector retrieval, making it difficult to answer questions like "who has the best relationship with whom" that require a global knowledge graph. GraphRAG adds a knowledge graph construction step before vectorization: it first uses an LLM to extract entities and relationships from text, then applies the Leiden algorithm for community clustering on entities, generating multi-level summary indices. This way, retrieval can select the most appropriate index level based on question type — specific facts go through entity/relationship retrieval, while macro-level summaries go through community retrieval.

Four tool definitions

Text chunk retrieval: Best for semantically ambiguous questions
Entity retrieval: Best for questions about specific people or events
Relationship retrieval: Best for questions about relationships between two parties
Community retrieval: Best for macro-level summary questions
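The four retrieval tools could be declared in the OpenAI-compatible tools format roughly as follows. This is a sketch under assumptions: the tool names and the single `query` parameter are hypothetical, and the descriptions paraphrase the list above rather than reproduce the definitions used in the test.

```python
def make_tool(name, description):
    """Build one OpenAI-style function tool that takes a single query string."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search text"},
                },
                "required": ["query"],
            },
        },
    }


# Tool names are hypothetical; one tool per vector database.
TOOLS = [
    make_tool("search_chunks", "Retrieve text chunks. Best for semantically ambiguous questions."),
    make_tool("search_entities", "Retrieve entities. Best for questions about specific people or events."),
    make_tool("search_relationships", "Retrieve relationships. Best for questions about two parties."),
    make_tool("search_communities", "Retrieve community summaries. Best for macro-level questions."),
]

print([tool["function"]["name"] for tool in TOOLS])
```

The description strings matter more than they look: in automatic mode the model chooses between tools largely on the basis of this text, so each one should state when the tool is the right choice.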

The comparison model was OpenAI's open-source GPT-o3s (20B parameters, MXP using 6 GPUs, deployed locally via Ollama).

Four-Round Test Results Comparison

Test 1: Autonomous Tool Selection

Both models correctly chose the text chunk retrieval tool. LFM2.5 consumed 4,141 tokens in 8.8 seconds; GPT-o3s consumed 3,616 tokens in 8.1 seconds. Performance was comparable.

[Figure: LFM2.5 vs GPT-o3s comparison test results]

Test 2: Relationship Retrieval ("Who does Zhang Fei have the best relationship with?")

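LFM2.5 consumed 3,203 tokens in 13.5 seconds; GPT-o3s consumed 1,630 tokens in 10.2 seconds. In this round GPT-o3s was the more economical of the two, finishing faster on roughly half the tokens.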

Test 3: Community Clustering Retrieval

LFM2.5 consumed 2,262 tokens in 5.2 seconds; GPT-o3s consumed 1,943 tokens in 12.1 seconds. Here, LFM2.5 was noticeably faster.

Test 4: Multi-Tool Joint Calling

LFM2.5 correctly made two tool calls, consuming 4,643 tokens in 13.3 seconds. GPT-o3s, however, ran into problems — it redundantly made four calls, consuming 13,826 tokens in 33.1 seconds. This exposed GPT-o3s's weakness in complex tool orchestration.

Third-Party Travel Planning Task Validation

A more compelling set of third-party data: a travel planning task requiring 7 consecutive tool calls (checking weather for 3 cities, 2 currency conversions, sending an email, and setting a reminder).

Metric	LFM2.5 (8.3B total, 1.5B active)	GPT-o3s (20B total, 3.6B active)
Tool calls completed	7/7	3/7 (missed 4)
Memory usage	4.8 GB	11 GB
Inference speed	266 tokens/s	146 tokens/s
Total time	6.9 s	15 s

GPT-o3s activates more than twice as many parameters as LFM2.5, yet in this task it completed fewer tool calls, ran slower, and used more memory. This is consistent with the view that LFM2.5's large-scale reinforcement learning targeted at Agent tasks is effective: parameter count isn't the sole determinant of Agent capability, and training methodology and architecture design matter just as much.
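The ratios behind that comparison can be recomputed directly from the travel-planning figures. The dictionary keys below are illustrative names; the numbers are the ones reported for the task.

```python
results = {
    "LFM2.5":  {"active_b": 1.5, "done": 7, "mem_gb": 4.8, "tok_s": 266, "secs": 6.9},
    "GPT-o3s": {"active_b": 3.6, "done": 3, "mem_gb": 11.0, "tok_s": 146, "secs": 15.0},
}

lfm, gpt = results["LFM2.5"], results["GPT-o3s"]

ratios = {
    "activated parameters (GPT-o3s / LFM2.5)": gpt["active_b"] / lfm["active_b"],
    "memory usage (GPT-o3s / LFM2.5)": gpt["mem_gb"] / lfm["mem_gb"],
    "total time (GPT-o3s / LFM2.5)": gpt["secs"] / lfm["secs"],
    "throughput (LFM2.5 / GPT-o3s)": lfm["tok_s"] / gpt["tok_s"],
}

for label, value in ratios.items():
    print(f"{label}: {value:.2f}x")
```

Each ratio is a single-run comparison on one task, so it indicates direction and rough size rather than a stable benchmark result.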

Conclusion: Who Is LFM2.5 For?

LFM2.5 leverages 38 trillion tokens of training data plus large-scale reinforcement learning to achieve "fewer activated parameters, stronger Agent capabilities." For developers who need to deploy tool-calling Agents locally, this is currently one of the most cost-effective options available.

Ideal scenarios: Custom API calling, lightweight Agent development, edge device deployment

Not ideal for: Agent frameworks requiring 64K+ long context (16GB VRAM isn't enough), scenarios requiring native vLLM parsing support

If you have 24GB+ VRAM, you can unlock longer context windows and pair it with full Agent frameworks. Users with 16GB VRAM should write their own calling logic and interface directly with the API.