---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
tags:
- qwen3
- memory
- memory-extraction
- tool-calling
- reasoning
- agent
base_model:
- Qwen/Qwen3-4B
---

# MemReader-4B-thinking

## Introduction

MemReader-4B-thinking is a 4B language model for long-term agent memory management. Instead of treating memory writing as a one-step structured extraction task, it formulates memory construction as a reasoning-and-action process: the model first evaluates whether incoming information is valuable, complete, and unambiguous, and then selects one of four memory operations:

- `add_memory`: write useful and complete information into long-term memory
- `search_memory`: retrieve historical memory for disambiguation
- `buffer_memory`: temporarily hold incomplete but potentially valuable information
- `ignore_memory`: discard low-value or repetitive content

Built on top of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), MemReader-4B-thinking is further optimized for memory management with supervised fine-tuning and GRPO. It is designed for long-horizon dialogue systems, personalized assistants, and agent frameworks that require low-noise, updatable, and retrievable long-term memory.
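
The four operations map naturally onto a dispatch loop around whatever memory backend you use. Below is a minimal sketch with a toy in-memory store; the `MemoryStore` class and its methods are illustrative, not part of the released model or any MemOS API:

```python
import json


class MemoryStore:
    """Toy backend illustrating the four MemReader operations."""

    def __init__(self):
        self.memories = []   # committed long-term memories
        self.buffer = []     # incomplete info awaiting more context

    def dispatch(self, tool_call: dict) -> str:
        """Execute one model-emitted tool call and return a result string."""
        name, args = tool_call["name"], tool_call["arguments"]
        if name == "add_memory":
            self.memories.extend(args["memory_list"])
            return f"stored {len(args['memory_list'])} memories"
        if name == "search_memory":
            hits = [m for m in self.memories
                    if args["query"].lower() in m["value"].lower()]
            return json.dumps(hits)
        if name == "buffer_memory":
            self.buffer.append(args["reason"])
            return "buffered"
        if name == "ignore_memory":
            return "ignored"
        raise ValueError(f"unknown tool: {name}")


store = MemoryStore()
store.dispatch({"name": "add_memory", "arguments": {
    "memory_list": [{"key": "Rust rewrite", "memory_type": "LongTermMemory",
                     "value": "Michael plans to rewrite core modules in Rust.",
                     "tags": ["project"]}],
    "summary": "project update"}})
print(store.dispatch({"name": "search_memory", "arguments": {"query": "rust"}}))
```

In a real pipeline, the `tool_call` dicts come from the model's output rather than being hand-written, and `search_memory` results are fed back to the model as tool messages.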

## News

- MemReader-4B-thinking is released as an open model for active memory management.
- The model is designed for tool-calling workflows and memory-centric agent systems.
- It is part of the MemReader family introduced in the paper *MemReader: Active Memory Management for Long-Term Agent Memory*.

## Usage

- Model ID: `IAAR-Shanghai/MemReader-4B-thinking`
- Base model: `Qwen/Qwen3-4B`
- Primary use: long-term memory extraction and memory management for agents
- Inference modes: `transformers`, OpenAI-compatible serving, `vLLM`, and SGLang

## Citation

If you use MemReader in your research or product, please cite:

```bibtex
@misc{kang2025memreader,
  title={MemReader: Active Memory Management for Long-Term Agent Memory},
  author={Kang, Jingyi and Li, Chunyu and Chen, Ding and Tang, Bo and Xiong, Feiyu and Li, Zhiyu},
  year={2026},
  note={Manuscript in preparation}
}
```

## Highlights

- Active memory management instead of passive memory extraction
- Explicit reasoning with thinking traces and tool calls
- Strong performance on ambiguity resolution, knowledge update, and temporal reasoning
- Native fit for OpenAI-style tool-calling workflows
- Efficient local deployment with a 4B parameter footprint
- Designed for integration with memory-centric agent systems such as MemOS

## What Makes MemReader Different

Most memory pipelines directly convert the current dialogue into JSON memories. In realistic settings, that approach is often insufficient:

- low-value chatter can pollute memory
- pronouns and missing references may require historical lookup
- some information is useful but not yet complete
- newer facts may need to update or overwrite older memory

MemReader-4B-thinking reframes memory writing as active memory management. Under a ReAct-style workflow, the model reasons before acting, making memory construction closer to how practical agent systems maintain state over time.

## Benchmark Performance

MemReader was evaluated on LOCOMO, LongMemEval, and HaluMem. The 4B GRPO version showed especially strong gains on knowledge update, temporal reasoning, and end-to-end memory usability.

### LOCOMO

| | Model | Single Hop | Multi Hop | Temporal | Open Domain | Overall | F1 | Avg. Token | |
| | --- | --- | --- | --- | --- | --- | --- | --- | |
| | MemOS (4o-mini) | 84.06% | 73.16% | 75.90% | 57.29% | 78.70% | 51.90% | 1854 | |
| | MemReader-0.6B | 84.70% | 76.95% | 76.22% | 53.40% | 79.56% | 52.54% | 1976 | |
| | MemReader-4B-SFT | 81.88% | 76.12% | 71.02% | 62.15% | 77.33% | 47.77% | 784 | |
| | MemReader-4B-GRPO | **85.37%** | **81.44%** | 75.80% | **65.62%** | **81.42%** | 49.45% | 1950 | |

### LongMemEval

| | Model | Avg. Token | SS-User | SS-Asst | SS-Pref | Multi-Session | Knowledge Update | Temporal Reasoning | Overall | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --- | |
| | MemOS | 1400 | 95.71% | 67.86% | **96.67%** | 70.67% | 74.26% | 77.44% | 77.80% | |
| | EverMemOS | 2800 | **97.14%** | **85.71%** | 93.33% | 73.68% | 89.74% | 77.44% | **83.00%** | |
| | MemReader-0.6B | 1166 | 95.71% | 75.00% | 90.00% | **75.18%** | 82.05% | 75.90% | 80.20% | |
| | MemReader-4B-SFT | 963 | 97.10% | 69.64% | 90.00% | 71.42% | 85.80% | 78.19% | 80.00% | |
| | MemReader-4B-GRPO | **922** | 94.29% | 73.21% | 90.00% | 73.68% | **91.03%** | **84.21%** | **83.00%** | |

### HaluMem

The full HaluMem table in the paper is relatively long. Below we report a compact subset of the memory extraction and memory updating results.

| | Model | Extraction Recall | Extraction Weighted Recall | Extraction F1 | Update Correctness | Update Hallucination | Update Omission | |
| | --- | --- | --- | --- | --- | --- | --- | |
| | MemOS | 74.07% | 84.81% | 79.70% | 62.11% | 0.42% | 37.48% | |
| | MemReader-0.6B | 88.40% | 91.38% | 93.76% | 82.69% | 0.77% | 16.51% | |
| | MemReader-4B-SFT | 93.56% | 95.49% | 96.61% | 90.78% | 0.26% | 8.74% | |
| | MemReader-4B-GRPO | **96.57%** | **97.19%** | **98.21%** | **94.55%** | 0.32% | **5.12%** | |

These results show that stronger memory writing quality also translates into better memory updating behavior, especially on correctness and omission.

## Recommended Use Cases

- long-term conversational agents
- personalized assistants
- agent memory extraction pipelines
- memory update and conflict resolution workflows
- retrieval-augmented memory systems

## Intended Use

MemReader-4B-thinking is intended for research and production scenarios where an agent needs to convert conversational context into structured long-term memory. Typical use cases include memory extraction, ambiguity resolution with retrieval, memory update pipelines, and persistent assistant systems.

The model is especially suitable when the application requires explicit control over memory-writing behavior through tool calls such as `search_memory`, `add_memory`, `buffer_memory`, and `ignore_memory`.

## Model Specs

- Base model: `Qwen/Qwen3-4B`
- Parameters: 4B
- Tensor type: BF16
- Architecture: `Qwen3ForCausalLM`
- Context length: 40,960 tokens
- Primary capability: reasoning-driven memory extraction with tool calling

## Quickstart

### OpenAI-Compatible API Example

The following example calls the model through an OpenAI-compatible endpoint with required tool calling.

```python
import requests

url = "https://YOUR_ENDPOINT/v1/chat/completions"

payload = {
    "model": "IAAR-Shanghai/MemReader-4B-thinking",
    # When posting raw JSON, chat template options go directly in the
    # request body; the OpenAI Python client would instead pass them
    # through its `extra_body` parameter.
    "chat_template_kwargs": {
        "enable_thinking": True
    },
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a memory extraction agent. Your job is to analyze "
                "conversations and decide what information is worth storing in "
                "long-term memory.\n\n"
                "Available actions (call exactly one per turn):\n"
                "- search_memory: Search existing memories for context\n"
                "- add_memory: Extract and store valuable facts, preferences, or events\n"
                "- buffer_memory: Accumulate this turn and wait for more context\n"
                "- ignore_memory: Nothing worth storing\n\n"
                "Guidelines:\n"
                "- Store specific, verifiable facts\n"
                "- Do not store generic greetings, chitchat, or vague statements\n"
                "- UserMemory: personal attributes or preferences about the user\n"
                "- LongTermMemory: facts, events, or shared knowledge from the conversation\n"
                "- If unsure whether information already exists, call search_memory first"
            ),
        },
        {
            "role": "user",
            "content": (
                "Please analyze the following conversation and decide what to store:\n\n"
                "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
                "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
                "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
            ),
        },
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_memory",
                "description": "Search historical memories for context.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "add_memory",
                "description": "Extract and store memories.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "memory_list": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "key": {"type": "string"},
                                    "memory_type": {
                                        "type": "string",
                                        "enum": ["LongTermMemory", "UserMemory"],
                                    },
                                    "value": {"type": "string"},
                                    "tags": {
                                        "type": "array",
                                        "items": {"type": "string"},
                                    },
                                },
                                "required": ["key", "memory_type", "value", "tags"],
                            },
                        },
                        "summary": {"type": "string"},
                    },
                    "required": ["memory_list", "summary"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "buffer_memory",
                "description": "Buffer for later processing.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "reason": {"type": "string"}
                    },
                    "required": ["reason"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "ignore_memory",
                "description": "Ignore low-value content.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "reason": {"type": "string"}
                    },
                    "required": ["reason"],
                },
            },
        },
    ],
    "tool_choice": "required",
    "temperature": 0.2,
    "max_tokens": 1024,
}

headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"])
```
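
With `"tool_choice": "required"`, the server returns the chosen operation in the standard OpenAI `tool_calls` field, with `arguments` serialized as a JSON string. The following sketch shows how to decode it; the `resp` dict is a hand-written illustration of the response shape, not actual model output:

```python
import json

# Illustrative response fragment in the OpenAI chat-completions shape.
resp = {
    "choices": [{
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "add_memory",
                    "arguments": json.dumps({
                        "memory_list": [{
                            "key": "Rust benchmark plan",
                            "memory_type": "LongTermMemory",
                            "value": "Michael plans a Python-vs-Rust benchmark report this week.",
                            "tags": ["project", "Rust"],
                        }],
                        "summary": "One project update stored.",
                    }),
                },
            }],
        }
    }]
}

call = resp["choices"][0]["message"]["tool_calls"][0]["function"]
operation = call["name"]                   # which memory operation to run
arguments = json.loads(call["arguments"])  # decoded JSON payload
print(operation, len(arguments["memory_list"]))  # → add_memory 1
```

The decoded `operation` and `arguments` can then be routed to your memory backend.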

### Hugging Face Transformers Usage

You can also load the model directly from Hugging Face and run memory extraction with tool calling.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IAAR-Shanghai/MemReader-4B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a memory extraction agent. Analyze conversations and decide "
            "what information should be stored in long-term memory."
        ),
    },
    {
        "role": "user",
        "content": (
            "Please analyze the following conversation and decide what to store:\n\n"
            "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
            "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
            "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
        ),
    },
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Search historical memories for context.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "add_memory",
            "description": "Extract and store memories.",
            "parameters": {
                "type": "object",
                "properties": {
                    "memory_list": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "key": {"type": "string"},
                                "memory_type": {
                                    "type": "string",
                                    "enum": ["LongTermMemory", "UserMemory"],
                                },
                                "value": {"type": "string"},
                                "tags": {
                                    "type": "array",
                                    "items": {"type": "string"},
                                },
                            },
                            "required": ["key", "memory_type", "value", "tags"],
                        },
                    },
                    "summary": {"type": "string"},
                },
                "required": ["memory_list", "summary"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "buffer_memory",
            "description": "Buffer for later processing.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ignore_memory",
            "description": "Ignore low-value content.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)
```

### vLLM Usage

Start an OpenAI-compatible vLLM server:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model IAAR-Shanghai/MemReader-4B-thinking \
  --served-model-name MemReader-4B-thinking \
  --port 8000 \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Then send a standard chat completion request to `http://localhost:8000/v1/chat/completions`.

### SGLang Usage

MemReader-4B-thinking can also be deployed with SGLang through its OpenAI-compatible serving interface. Please make sure tool calling and thinking mode are enabled in your serving configuration.

## Output Format

MemReader-4B-thinking is trained to produce thinking traces and tool calls. A typical response looks like this:

```xml
<think>
The conversation refers to an already known project and adds a new update:
Michael plans to produce a Python vs Rust benchmark report this week.
This is valuable project-state information and should be added to memory.
</think>

<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael said the recommendation system refactoring project is still in evaluation, and he plans to produce a Python-vs-Rust benchmark report this week for the core modules under consideration for Rust rewriting.", "tags": ["project", "Rust", "benchmark", "refactoring"]}], "summary": "Added one memory about the project update and the planned benchmark report."}}
</tool_call>
```
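
If your serving stack exposes this raw text rather than parsed tool calls, the two sections can be split with a small regex. A sketch assuming the `<think>`/`<tool_call>` layout shown above (the sample string is shortened for illustration):

```python
import json
import re

raw = """<think>
New project-state information; add it to memory.
</think>

<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael plans a Python-vs-Rust benchmark report this week.", "tags": ["project"]}], "summary": "One memory added."}}
</tool_call>"""

# Extract the thinking trace and the tool-call JSON from the raw text.
think_match = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
call_match = re.search(r"<tool_call>(.*?)</tool_call>", raw, re.DOTALL)

thinking = think_match.group(1).strip() if think_match else ""
tool_call = json.loads(call_match.group(1)) if call_match else None

print(tool_call["name"])  # → add_memory
```

Serving layers such as vLLM's reasoning and tool-call parsers can do this splitting for you; the regex is only needed when you work with the undecoded text.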

## Best Practices

- Use `search_memory` first when the conversation contains pronouns, ellipsis, or implicit historical references.
- Use `buffer_memory` only when the information is genuinely incomplete and cannot be resolved from history.
- Keep tool definitions stable between training and inference.
- For production pipelines, execute tool calls externally and feed tool responses back to the model when multi-step reasoning is needed.
- If you want shorter outputs, reduce `max_tokens` and control whether thinking traces are exposed in your serving layer.
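
The external tool-execution step above can be sketched as plain message-list plumbing in the OpenAI format: run the call yourself, append the result as a `tool` message, then request the next completion. The assistant turn and the `run_search` backend below are hand-written stand-ins, not real model output:

```python
import json


def run_search(query: str) -> list[dict]:
    """Stub for your real memory backend; returns matching memories."""
    return [{"key": "Rust rewrite",
             "value": "Michael plans to rewrite core modules in Rust."}]


messages = [
    {"role": "system", "content": "You are a memory extraction agent."},
    {"role": "user", "content": "He said the report is coming this week."},
]

# Step 1: the model (call not shown) answers with a search_memory tool call.
assistant_turn = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "search_memory",
                     "arguments": json.dumps({"query": "report rewrite"})},
    }],
}
messages.append(assistant_turn)

# Step 2: execute the call externally and feed the result back as a tool
# message, then request a second completion; with the retrieved context
# the model can resolve "he" and emit add_memory.
args = json.loads(assistant_turn["tool_calls"][0]["function"]["arguments"])
messages.append({
    "role": "tool",
    "tool_call_id": "call_0",
    "content": json.dumps(run_search(args["query"])),
})

print([m["role"] for m in messages])  # → ['system', 'user', 'assistant', 'tool']
```

The same pattern generalizes to multiple rounds: keep appending assistant tool calls and tool results until the model emits a terminal operation such as `add_memory` or `ignore_memory`.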

## Limitations

- The model is optimized for memory-management scenarios rather than general-purpose chatting.
- Quality depends on the external memory schema, retrieval quality, and tool-execution loop.
- For highly domain-specific memory schemas, additional instruction tuning may still be beneficial.
- As with other LLMs, outputs may still contain mistakes, omissions, or unsupported inferences and should be validated in safety-critical workflows.

## License Notice

This model is released under the Apache-2.0 license. As it is derived from [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), users should also review and comply with the upstream base model license, usage terms, and any applicable third-party requirements before deployment.

## Links

- GitHub: [MemTensor/MemOS](https://github.com/MemTensor/MemOS)
- API Documentation: [docs.openmem.net](https://docs.openmem.net/)
- Model: [IAAR-Shanghai/MemReader-4B-thinking](https://huggingface.co/IAAR-Shanghai/MemReader-4B-thinking)
- Base model: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
|