MementoGUI: A Multimodal Memory Management Framework for Solving Long-Horizon GUI Agent Amnesia

Introduction: The Core Bottleneck of Long-Horizon GUI Tasks

GUI (Graphical User Interface) agents are AI systems that perceive screen content visually, understand interface elements, and execute actions such as clicking, typing, and scrolling, much as a human user would. Driven by multimodal large models such as GPT-4V and Qwen-VL, they have advanced rapidly in recent years and now achieve high accuracy on single-step operations like identifying and clicking a button. However, once a task requires navigating multiple interfaces and executing dozens of steps (for example, completing a full booking workflow on a travel website that involves searching, filtering, comparing prices, and filling out forms), agent performance drops dramatically: agents forget previously selected parameters, ignore temporary pop-up notifications, and even click the same button repeatedly in futile loops.

This is the so-called "long-horizon GUI agent amnesia." A research team from the University of Rochester and the MIT-IBM Watson AI Lab proposed the MementoGUI framework, which reframes the issue as a multimodal memory management problem rather than a simple context-length limitation.

Why Extending the Context Window Doesn't Work

Sparse and Unevenly Distributed Information

Previous approaches focused on aggressively extending the input history or on storing memory as text only. The context window is the maximum number of tokens a large language model can process in a single inference pass. Although it has grown from 4K to 128K tokens or more in recent years, a longer window alone does not solve long-task problems. One reason is "attention dilution": when processing ultra-long sequences, the model struggles to locate the few key clues among large amounts of irrelevant information. Another is cost: each GUI screenshot consumes hundreds to thousands of tokens, so even the largest context windows fill up quickly.
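
To make the token-budget argument concrete, here is a minimal back-of-the-envelope sketch. The per-screenshot cost (1,000 tokens) and the text reserve (2,000 tokens) are illustrative assumptions, not figures from the paper, and `frames_that_fit` is a hypothetical helper.

```python
def frames_that_fit(context_tokens: int, tokens_per_frame: int, reserved_tokens: int = 0) -> int:
    """Return how many whole screenshots fit once text tokens are reserved."""
    usable = max(context_tokens - reserved_tokens, 0)
    return usable // tokens_per_frame


# Illustrative assumptions, not measurements from the paper:
# roughly 1,000 visual tokens per screenshot, 2,000 tokens reserved for text.
for window in (4_000, 32_000, 128_000):
    frames = frames_that_fit(window, tokens_per_frame=1_000, reserved_tokens=2_000)
    print(f"{window:>7} token window -> {frames} screenshots")
```

Under these assumptions a 4K window holds only 2 screenshots and a 128K window about 126, so replaying every frame of several long tasks exhausts even a large window, and every retained frame competes for attention with the few that matter.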

The paper highlights a critical fact: useful information in long trajectories is sparse and unevenly distributed. Most steps are routine interface transitions; only a few contain task constraints, completed sub-goals, or visual cues no longer visible on the current screen. Stuffing in excessive redundant information not only wastes the context window but actually degrades the model's decision quality.

From Passive Replay to Active Memory Management

An effective agent shouldn't passively feed all history to the model. Instead, it should actively decide:

When to update memory
What content to retain
How to compress history
When to retrieve past experience

This is the core approach to solving the long-task forgetting problem.

MementoGUI Framework Architecture in Detail

Plugin-Style Design: Zero Fine-Tuning, Plug-and-Play

MementoGUI's main design highlight is its plugin-style architecture: the original GUI action model needs no fine-tuning. A learned memory controller (Memento Core) is simply mounted on top of the frozen action model. The controller itself is built on a shared frozen QwenVL backbone equipped with four task-specific LoRA adapters, one for each of the four memory control operators.

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique proposed by Microsoft researchers in 2021. Its core idea is to keep the pre-trained weights frozen and inject trainable low-rank decomposition matrices into the Transformer's layers, most commonly the attention projections. Whereas full fine-tuning updates billions of parameters, LoRA typically trains only about 0.1%-1% as many, which sharply reduces compute and storage costs. In MementoGUI, four independent LoRA adapters share the same frozen QwenVL vision-language backbone, and each adapter specializes in one memory control function, balancing functional decoupling against parameter efficiency.
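
The low-rank idea can be illustrated with a minimal pure-Python sketch of a single adapted layer. It follows the standard LoRA formulation (output = W x + (alpha / r) · B A x, with W frozen and B zero-initialised); the matrix sizes, the rank, and the helper names are assumptions chosen for illustration and say nothing about MementoGUI's actual configuration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_fraction(d_in, d_out, r):
    """Trainable LoRA parameters relative to the frozen matrix they adapt."""
    return r * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weight, shape d_out=2 x d_in=3
A = [[0.5, -0.5, 0.1]]                   # trainable, shape r=1 x d_in
B = [[0.0], [0.0]]                       # trainable, shape d_out x r, starts at zero
x = [1.0, 2.0, 3.0]

# With B at zero the adapter changes nothing, so training starts from the base model.
print(lora_forward(W, A, B, x, alpha=2.0, r=1))

# Hypothetical 4096 x 4096 projection adapted at rank 8.
print(f"{trainable_fraction(4096, 4096, 8):.2%}")
```

For the hypothetical 4096 × 4096 projection at rank 8, the adapter adds about 0.39% as many parameters as the matrix it adapts, which falls within the 0.1%-1% range mentioned above. Note that this ratio is per adapted matrix; the fraction of the whole model depends on how many layers receive adapters.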

The four memory control operators are:

Step Processor: Determines whether current step information is worth storing, outputting a saliency score, event summary, ROI box, and episodic retrieval tag
Working Memory Compressor: Merges old entries into compact summaries when working memory capacity is exceeded
Episodic Memory Writer: Converts completed task trajectories into reusable memory entries
Episodic Memory Selector: Filters retrieved candidate memories based on the current task state
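
One way to picture the four operators is as named entry points over a shared controller. The sketch below is an assumption-laden illustration: `MemoryControllerSketch`, `StepDecision`, and the stub functions are hypothetical names, plain Python callables stand in for the LoRA adapters, and the "episodic retrieval tag" is modeled as a simple boolean.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

OPERATORS = (
    "step_processor",
    "working_memory_compressor",
    "episodic_memory_writer",
    "episodic_memory_selector",
)


@dataclass(frozen=True)
class StepDecision:
    """The four outputs the text lists for the step processor."""
    saliency: float
    event_summary: str
    roi_box: Optional[Box]
    retrieve_episodic: bool  # assumed boolean form of the retrieval tag


class MemoryControllerSketch:
    """Routes each request to one adapter; callables stand in for LoRA adapters."""

    def __init__(self, adapters: Dict[str, Callable[[dict], object]]):
        missing = [name for name in OPERATORS if name not in adapters]
        if missing:
            raise ValueError(f"missing adapters: {missing}")
        self.adapters = adapters

    def run(self, operator: str, payload: dict) -> object:
        return self.adapters[operator](payload)


def stub_step_processor(payload: dict) -> StepDecision:
    # Hypothetical hand-written rule standing in for the learned model.
    changed = bool(payload.get("screen_changed"))
    return StepDecision(
        saliency=0.9 if changed else 0.1,
        event_summary=payload.get("description", ""),
        roi_box=payload.get("roi_box"),
        retrieve_episodic=payload.get("step_index") == 0,
    )


def not_implemented(payload: dict) -> object:
    raise NotImplementedError("placeholder for a learned operator")


controller = MemoryControllerSketch({
    "step_processor": stub_step_processor,
    "working_memory_compressor": not_implemented,
    "episodic_memory_writer": not_implemented,
    "episodic_memory_selector": not_implemented,
})
decision = controller.run("step_processor", {
    "screen_changed": True,
    "description": "Date picker opened",
    "roi_box": (400, 220, 760, 540),
    "step_index": 3,
})
print(decision)
```

Keeping the operators behind one routing interface mirrors the shared-backbone design: switching operators means switching adapters, not loading a different model.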

Dual Time-Scale Memory System

The framework features two complementary memory tiers, drawing inspiration from classical memory models in cognitive psychology. Working memory corresponds to the human brain's system for temporarily storing and manipulating current task information, with limited capacity (similar to the classic "7±2" rule), requiring constant updating and eviction. Episodic memory corresponds to humans' long-term storage of past experiences, containing rich information about time, place, and context, retrievable when needed. This layered design enables the agent to both efficiently track current task state and draw guidance from historical experience.

Working memory employs an event-gated mechanism: not every frame is recorded, and only interface changes that might affect future decisions are stored. Each memory entry contains an event summary, an ROI box, an ROI crop image, and a visual embedding. An ROI (Region of Interest) box is a rectangular bounding box used in computer vision to mark a key image region. In GUI scenarios, it marks the screen area relevant to the current operation or task state, such as a newly appeared dialog box, a filled form field, or a button whose state has changed. Storing ROI crops rather than full screenshots saves considerable context space: a full 1920×1080 screenshot might consume over a thousand visual tokens, whereas an ROI crop needs only tens to hundreds. When capacity is exceeded, old entries are compressed, and at most K ROI references are passed to the backbone model, which keeps context growth strictly bounded.
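
The gating, compression, and top-K behavior described above can be sketched as a small data structure. This is a simplified illustration under stated assumptions: the class and field names are hypothetical, a fixed saliency threshold stands in for the learned step processor, string concatenation stands in for the learned compressor, and ROI crops and visual embeddings are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class MemoryEntry:
    step: int
    summary: str
    roi_box: Optional[Box]  # None once an entry has been compressed to text
    saliency: float


class WorkingMemorySketch:
    def __init__(self, capacity: int = 3, write_threshold: float = 0.5, max_roi_refs: int = 2):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.write_threshold = write_threshold
        self.max_roi_refs = max_roi_refs  # the "K" in "at most K ROI references"
        self.entries: List[MemoryEntry] = []

    def observe(self, step: int, summary: str, roi_box: Optional[Box], saliency: float) -> bool:
        """Event gate: store the step only if it looks decision-relevant."""
        if saliency < self.write_threshold:
            return False
        self.entries.append(MemoryEntry(step, summary, roi_box, saliency))
        if len(self.entries) > self.capacity:
            self._compress()
        return True

    def _compress(self) -> None:
        """Merge the oldest entries (at least two) into one text-only summary."""
        cut = max(2, len(self.entries) // 2)
        old, recent = self.entries[:cut], self.entries[cut:]
        merged = MemoryEntry(
            step=old[-1].step,
            summary="; ".join(e.summary for e in old),
            roi_box=None,
            saliency=max(e.saliency for e in old),
        )
        self.entries = [merged] + recent

    def roi_references(self) -> List[MemoryEntry]:
        """Pick the K most salient entries that still have an ROI, in step order."""
        with_roi = [e for e in self.entries if e.roi_box is not None]
        top = sorted(with_roi, key=lambda e: e.saliency, reverse=True)[: self.max_roi_refs]
        return sorted(top, key=lambda e: e.step)


wm = WorkingMemorySketch(capacity=3, write_threshold=0.5, max_roi_refs=2)
wm.observe(0, "Selected destination: Tokyo", (100, 80, 500, 120), 0.9)
wm.observe(1, "Page scrolled", None, 0.2)  # below threshold, not stored
wm.observe(2, "Set dates: 3-7 May", (100, 140, 500, 180), 0.8)
wm.observe(3, "Pop-up: price changed", (300, 300, 900, 500), 0.7)
wm.observe(4, "Filter: direct flights", (20, 200, 220, 240), 0.6)
for entry in wm.entries:
    print(entry)
```

After the fifth observation the two oldest entries are merged into one text-only summary, so the memory holds three entries and only the two most recent ones still carry ROI references.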

Episodic memory stores trajectory summaries of completed tasks, key actions, representative ROI crops, and retrieval embeddings. It uses an on-demand retrieval strategy—triggered only at the first step of a task or when the step processor flags a need. Retrieval occurs in two stages: first, coarse recall via embeddings (similar to a search engine's initial filtering), then fine-grained multimodal filtering by the selector (combining visual and textual information for precise relevance judgment).
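
The on-demand, two-stage retrieval can be sketched as follows. This is an illustration only: the toy three-dimensional embeddings, the dictionary layout of the memory bank, and the lambda used as a selector are all assumptions, with the lambda standing in for the learned multimodal selector.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_retrieve(step_index: int, flagged_by_step_processor: bool) -> bool:
    """Retrieve only at the first step or when the step processor asks for it."""
    return step_index == 0 or flagged_by_step_processor


def retrieve(query: Sequence[float], bank: List[Dict], selector: Callable[[Dict], bool],
             top_n: int = 2) -> List[Dict]:
    # Stage 1: coarse recall by embedding similarity.
    ranked = sorted(bank, key=lambda e: cosine(query, e["embedding"]), reverse=True)
    candidates = ranked[:top_n]
    # Stage 2: fine-grained filtering (a learned selector in the framework).
    return [e for e in candidates if selector(e)]


bank = [
    {"task": "Book a train ticket", "embedding": [0.9, 0.1, 0.0], "app": "rail"},
    {"task": "Book a flight", "embedding": [0.8, 0.2, 0.1], "app": "air"},
    {"task": "Send an email", "embedding": [0.0, 0.1, 0.9], "app": "mail"},
]
query = [1.0, 0.0, 0.0]

if should_retrieve(step_index=0, flagged_by_step_processor=False):
    hits = retrieve(query, bank, selector=lambda e: e["app"] == "rail")
    print([e["task"] for e in hits])
```

The cheap embedding pass narrows the bank to a handful of candidates, so the more expensive selector only has to judge those few.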

Training Data: Automated Annotation Pipeline

Automatic Generation of High-Quality Supervision Data

The data for training the memory controller comes from automated generation using PSAI computer-use trajectories, requiring no extensive manual annotation. The data processing pipeline includes:

Frame-level annotation: Comparing adjacent video frames to capture fine-grained interface changes (operation events, input types, keystroke sequences, interface change region ROI boxes)
Sub-goal-level annotation: Segmenting interaction logs into chronologically ordered semantic units, recording coarse-grained task progress
Preference data: Generating preference pairs through rule-based corruption and VLM filtering for DPO optimization
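
As an illustration of the frame-level step, the sketch below derives a change-region ROI box by comparing two adjacent frames pixel by pixel. It is a toy stand-in, not the paper's pipeline: frames are small nested lists of integers, `changed_region` is a hypothetical helper, and real annotation would also need to handle noise, animations, and cursor movement.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right and bottom exclusive


def changed_region(prev: Frame, curr: Frame) -> Optional[Box]:
    """Return the bounding box of all differing pixels, or None if nothing changed."""
    if len(prev) != len(curr) or any(len(a) != len(b) for a, b in zip(prev, curr)):
        raise ValueError("frames must have the same shape")
    xs, ys = [], []
    for y, (row_prev, row_curr) in enumerate(zip(prev, curr)):
        for x, (p, q) in enumerate(zip(row_prev, row_curr)):
            if p != q:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


before = [[0] * 5 for _ in range(4)]  # 4 rows x 5 columns, all background
after = [row[:] for row in before]
after[1][2] = 1                       # two pixels change between the frames
after[2][3] = 1

print(changed_region(before, after))
print(changed_region(before, before))
```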

DPO (Direct Preference Optimization) is an alignment training method proposed by Stanford University in 2023, serving as a simplified alternative to RLHF (Reinforcement Learning from Human Feedback). Traditional RLHF requires first training a reward model and then performing reinforcement learning—a complex and unstable process. DPO directly optimizes the policy model using preference pair data (one good output and one poor output), transforming the alignment problem into a simple classification loss. In MementoGUI, the research team generates negative samples through rule-based corruption (such as deleting key information or introducing incorrect summaries), then uses VLM filtering to ensure quality, thereby automatically constructing preference data to optimize the memory controller's output quality.
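
The DPO objective for a single preference pair can be written in a few lines. The sketch below uses the standard DPO loss, -log σ(β · [(log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected))]); the log-probability values, the β of 0.1, and the `corrupt_summary` helper (a toy example of rule-based corruption) are illustrative assumptions rather than details from the paper.

```python
import math


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one pair, given sequence log-probabilities."""
    z = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def corrupt_summary(summary: str, key_detail: str) -> str:
    """Toy rule-based corruption: delete a key detail to create a worse output."""
    return " ".join(summary.replace(key_detail, "").split())


chosen = "Selected the 14:05 train, window seat"
rejected = corrupt_summary(chosen, ", window seat")
print((chosen, rejected))

# Made-up log-probabilities: the policy already favors the chosen summary
# more strongly than the reference model does, so the loss is below log(2).
print(dpo_loss(policy_chosen=-5.0, policy_rejected=-9.0, ref_chosen=-6.0, ref_rejected=-6.0))
```

When the policy and reference agree exactly, the loss equals log 2 (about 0.693); it falls as the policy widens its preference for the chosen output relative to the reference, which is why no separate reward model is needed.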

Researchers randomly sampled 200 trajectories for manual verification; 197 of them (98.5%) were judged completely correct, indicating that the automated annotation is highly reliable.

Training Strategy

The step processor and compressor first undergo SFT (Supervised Fine-Tuning)—standard supervised training using input-output pairs—followed by DPO preference optimization to further improve memory quality. The writer and selector only require SFT. Overall training costs remain manageable; thanks to LoRA's parameter efficiency, the total trainable parameters of the entire memory controller are far fewer than full fine-tuning would require.

Experimental Results: Comprehensive and Consistent Performance Gains

Multi-Benchmark Validation

The research team tested on three benchmarks: GUI Odyssey, Multimodal Mind2Web, and the newly introduced MementoGUI Bench. GUI Odyssey is an evaluation benchmark focused on cross-application long-horizon mobile GUI tasks; it contains numerous complex tasks that require switching between multiple apps, with average trajectory lengths far exceeding those of traditional benchmarks. Multimodal Mind2Web originates from Ohio State University's Mind2Web project, which covers over 2,000 web interaction tasks collected from 137 real websites; the multimodal version adds webpage screenshots as visual input. These two benchmarks represent long-horizon GUI challenges on mobile and web platforms respectively, enabling evaluation of agents' memory management capabilities across different platforms.

Using UI-TARS 1.5 7B backbone on GUI Odyssey as an example:

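No history: action match 54.58, trajectory success rate 1.29%
Full history replay: action match 66.31, trajectory success rate 2.33%
Text-only memory: action match 62.18, trajectory success rate 2.12%
MementoGUI working memory: action match 67.69, trajectory success rate 2.69%
Working memory + episodic memory: action match 68.32, trajectory success rate 3.57%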

Consistent improvements were observed across different backbone models, such as MAI-UI 8B rising from 0.36% to 2.12%, and Qwen2.5VL 32B from 0.57% to 2.59%.

Key Findings

Advantages become more pronounced with longer trajectories: The working memory + episodic memory combination far outperforms full history replay on long trajectories, validating the fundamental advantage of active memory management over passive information accumulation
Larger episodic memory banks yield better results: More reusable experiences lead to higher long-task completion rates, demonstrating the continuous enhancement effect of "experience accumulation" on agent capabilities
Equally effective for closed-source models: GPT-4.5's memory consistency score rose from 2.86 to 6.57 (+129.72%), and Gemini 2.5 Pro from 2.75 to 7.22 (+162.55%), showing that MementoGUI's memory management strategy has model-agnostic generality
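
The percentage gains quoted for the closed-source models are relative improvements over the baseline score, which a short check reproduces. The function name is a hypothetical helper; the scores are the ones reported above.

```python
def relative_gain_percent(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


print(round(relative_gain_percent(2.86, 6.57), 2))  # GPT-4.5: 129.72
print(round(relative_gain_percent(2.75, 7.22), 2))  # Gemini 2.5 Pro: 162.55
```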

MementoGUI Bench: A Dedicated Evaluation Benchmark

The paper also introduces a benchmark specifically targeting memory-dependent long-horizon GUI decision-making, containing 200 trajectories with 6,903 steps, or roughly 35 steps per trajectory on average. Compared to existing benchmarks, whose trajectories are typically only 5-10 steps long, this benchmark more closely approximates real-world complex task scenarios. The evaluation framework features three VLM-based metrics:

VLM Action Match: Semantic equivalence between predicted and reference actions (not simple string matching, but VLM-judged assessment of whether two actions achieve the same effect)
Task Progress Score: Whether the predicted sequence advances the task without falling into loops or regressions (detecting whether the agent is stuck in ineffective repetitive operations)
Memory Consistency Score: Consistency between memory state evolution and task progress (evaluating whether the memory system accurately reflects the actual task state)

Application Prospects and Future Directions

MementoGUI's practical application value spans multiple scenarios:

Office automation: Handling complex workflows across multiple software applications, remembering intermediate states and special user requirements. For example, extracting information from emails, creating orders in an ERP system, then returning to email for confirmation—throughout this workflow, the agent needs to continuously remember order details
Mobile intelligent assistants: Maintaining user constraints throughout cross-app long tasks without repeated reminders. For instance, when a user says "book me a window seat on tomorrow afternoon's high-speed train," the agent needs to remember the "window seat" constraint across multiple subsequent operation steps
Software testing: Autonomously tracking test progress, adjusting operations based on memory goals when interfaces change, and quickly adapting to new layouts after UI redesigns based on historical experience

Future extensible directions include: fine-grained skill-level memory reuse (abstracting operation sequences into transferable skill modules), adaptive write thresholds (dynamically adjusting memory storage strategies based on task complexity), multi-agent shared memory banks (experience sharing in team collaboration scenarios), and personalized memory systems incorporating user preferences (learning specific users' operational habits and preference settings).

Conclusion

MementoGUI's core contribution lies in transforming long-horizon GUI control from "passive history dependence" to "active memory management." It demonstrates an important insight: the bottleneck of current GUI agents has shifted from single-step perception to cross-step state management. Through a plugin-style multimodal memory control framework that significantly enhances long-task capabilities without modifying the original model, it takes a critical step toward truly general-purpose computer-use agents. This work also provides broader inspiration for AI Agent research: beyond pursuing more powerful foundation models, designing efficient external memory mechanisms may be another key path to improving agent practicality.