---
license: apache-2.0
language:
- en
- zh
pipeline_tag: text-generation
tags:
- qwen3
- memory
- memory-extraction
- tool-calling
- reasoning
- agent
base_model:
- Qwen/Qwen3-4B
---

# MemReader-4B-thinking

## Introduction

MemReader-4B-thinking is a 4B language model for long-term agent memory management. Instead of treating memory writing as a one-step structured extraction task, it formulates memory construction as a reasoning-and-action process: the model first evaluates whether incoming information is valuable, complete, and unambiguous, and then selects one of four memory operations:

- `add_memory`: write useful and complete information into long-term memory
- `search_memory`: retrieve historical memory for disambiguation
- `buffer_memory`: temporarily hold incomplete but potentially valuable information
- `ignore_memory`: discard low-value or repetitive content

Built on top of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), MemReader-4B-thinking is further optimized for memory management with supervised fine-tuning and GRPO. It is designed for long-horizon dialogue systems, personalized assistants, and agent frameworks that require low-noise, updatable, and retrievable long-term memory.
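
The four operations map naturally onto a dispatch loop around whatever memory backend you use. Below is a minimal sketch with a toy in-memory store; the `MemoryStore` class and its methods are illustrative, not part of the released model or any MemOS API:

```python
import json


class MemoryStore:
    """Toy backend illustrating the four MemReader operations."""

    def __init__(self):
        self.memories = []   # committed long-term memories
        self.buffer = []     # incomplete info awaiting more context

    def dispatch(self, tool_call: dict) -> str:
        """Execute one model-emitted tool call and return a result string."""
        name, args = tool_call["name"], tool_call["arguments"]
        if name == "add_memory":
            self.memories.extend(args["memory_list"])
            return f"stored {len(args['memory_list'])} memories"
        if name == "search_memory":
            hits = [m for m in self.memories
                    if args["query"].lower() in m["value"].lower()]
            return json.dumps(hits)
        if name == "buffer_memory":
            self.buffer.append(args["reason"])
            return "buffered"
        if name == "ignore_memory":
            return "ignored"
        raise ValueError(f"unknown tool: {name}")


store = MemoryStore()
store.dispatch({"name": "add_memory", "arguments": {
    "memory_list": [{"key": "Rust rewrite", "memory_type": "LongTermMemory",
                     "value": "Michael plans to rewrite core modules in Rust.",
                     "tags": ["project"]}],
    "summary": "project update"}})
print(store.dispatch({"name": "search_memory", "arguments": {"query": "rust"}}))
```

In a real pipeline, the `tool_call` dicts come from the model's output rather than being hand-written, and `search_memory` results are fed back to the model as tool messages.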

## News

- MemReader-4B-thinking is released as an open model for active memory management.
- The model is designed for tool-calling workflows and memory-centric agent systems.
- It is part of the MemReader family introduced in the paper *MemReader: Active Memory Management for Long-Term Agent Memory*.

## Usage

- Model ID: `IAAR-Shanghai/MemReader-4B-thinking`
- Base model: `Qwen/Qwen3-4B`
- Primary use: long-term memory extraction and memory management for agents
- Inference modes: `transformers`, OpenAI-compatible serving, `vLLM`, and SGLang

## Citation

If you use MemReader in your research or product, please cite:

```bibtex
@misc{kang2025memreader,
  title={MemReader: Active Memory Management for Long-Term Agent Memory},
  author={Kang, Jingyi and Li, Chunyu and Chen, Ding and Tang, Bo and Xiong, Feiyu and Li, Zhiyu},
  year={2026},
  note={Manuscript in preparation}
}
```

## Highlights

- Active memory management instead of passive memory extraction
- Explicit reasoning with thinking traces and tool calls
- Strong performance on ambiguity resolution, knowledge update, and temporal reasoning
- Native fit for OpenAI-style tool-calling workflows
- Efficient local deployment with a 4B parameter footprint
- Designed for integration with memory-centric agent systems such as MemOS

## What Makes MemReader Different

Most memory pipelines directly convert the current dialogue into JSON memories. In realistic settings, that approach is often insufficient:

- low-value chatter can pollute memory
- pronouns and missing references may require historical lookup
- some information is useful but not yet complete
- newer facts may need to update or overwrite older memory

MemReader-4B-thinking reframes memory writing as active memory management. Under a ReAct-style workflow, the model reasons before acting, making memory construction closer to how practical agent systems maintain state over time.

## Benchmark Performance

MemReader was evaluated on LOCOMO, LongMemEval, and HaluMem. The 4B GRPO version showed especially strong gains on knowledge update, temporal reasoning, and end-to-end memory usability.

### LOCOMO

| | Model | Single Hop | Multi Hop | Temporal | Open Domain | Overall | F1 | Avg. Token | |
| | --- | --- | --- | --- | --- | --- | --- | --- | |
| | MemOS (4o-mini) | 84.06% | 73.16% | 75.90% | 57.29% | 78.70% | 51.90% | 1854 | |
| | MemReader-0.6B | 84.70% | 76.95% | 76.22% | 53.40% | 79.56% | 52.54% | 1976 | |
| | MemReader-4B-SFT | 81.88% | 76.12% | 71.02% | 62.15% | 77.33% | 47.77% | 784 | |
| | MemReader-4B-GRPO | **85.37%** | **81.44%** | 75.80% | **65.62%** | **81.42%** | 49.45% | 1950 | |

### LongMemEval

| | Model | Avg. Token | SS-User | SS-Asst | SS-Pref | Multi-Session | Knowledge Update | Temporal Reasoning | Overall | |
| | --- | --- | --- | --- | --- | --- | --- | --- | --- | |
| | MemOS | 1400 | 95.71% | 67.86% | **96.67%** | 70.67% | 74.26% | 77.44% | 77.80% | |
| | EverMemOS | 2800 | **97.14%** | **85.71%** | 93.33% | 73.68% | 89.74% | 77.44% | **83.00%** | |
| | MemReader-0.6B | 1166 | 95.71% | 75.00% | 90.00% | **75.18%** | 82.05% | 75.90% | 80.20% | |
| | MemReader-4B-SFT | 963 | 97.10% | 69.64% | 90.00% | 71.42% | 85.80% | 78.19% | 80.00% | |
| | MemReader-4B-GRPO | **922** | 94.29% | 73.21% | 90.00% | 73.68% | **91.03%** | **84.21%** | **83.00%** | |

### HaluMem

The full HaluMem table in the paper is relatively long. Below we report a compact subset of the memory extraction and memory updating results.

| | Model | Extraction Recall | Extraction Weighted Recall | Extraction F1 | Update Correctness | Update Hallucination | Update Omission | |
| | --- | --- | --- | --- | --- | --- | --- | |
| | MemOS | 74.07% | 84.81% | 79.70% | 62.11% | 0.42% | 37.48% | |
| | MemReader-0.6B | 88.40% | 91.38% | 93.76% | 82.69% | 0.77% | 16.51% | |
| | MemReader-4B-SFT | 93.56% | 95.49% | 96.61% | 90.78% | 0.26% | 8.74% | |
| | MemReader-4B-GRPO | **96.57%** | **97.19%** | **98.21%** | **94.55%** | 0.32% | **5.12%** | |

These results show that stronger memory writing quality also translates into better memory updating behavior, especially on correctness and omission.

## Recommended Use Cases

- long-term conversational agents
- personalized assistants
- agent memory extraction pipelines
- memory update and conflict resolution workflows
- retrieval-augmented memory systems

## Intended Use

MemReader-4B-thinking is intended for research and production scenarios where an agent needs to convert conversational context into structured long-term memory. Typical use cases include memory extraction, ambiguity resolution with retrieval, memory update pipelines, and persistent assistant systems.

The model is especially suitable when the application requires explicit control over memory-writing behavior through tool calls such as `search_memory`, `add_memory`, `buffer_memory`, and `ignore_memory`.

## Model Specs

- Base model: `Qwen/Qwen3-4B`
- Parameters: 4B
- Tensor type: BF16
- Architecture: `Qwen3ForCausalLM`
- Context length: 40,960 tokens
- Primary capability: reasoning-driven memory extraction with tool calling

## Quickstart

### OpenAI-Compatible API Example

The following example calls the model through an OpenAI-compatible endpoint with required tool calling.

```python
import requests

url = "https://YOUR_ENDPOINT/v1/chat/completions"

payload = {
    "model": "IAAR-Shanghai/MemReader-4B-thinking",
    # When posting raw JSON, chat template options go directly in the
    # request body; the OpenAI Python client would instead pass them
    # through its `extra_body` parameter.
    "chat_template_kwargs": {
        "enable_thinking": True
    },
    "messages": [
        {
            "role": "system",
            "content": (
                "You are a memory extraction agent. Your job is to analyze "
                "conversations and decide what information is worth storing in "
                "long-term memory.\n\n"
                "Available actions (call exactly one per turn):\n"
                "- search_memory: Search existing memories for context\n"
                "- add_memory: Extract and store valuable facts, preferences, or events\n"
                "- buffer_memory: Accumulate this turn and wait for more context\n"
                "- ignore_memory: Nothing worth storing\n\n"
                "Guidelines:\n"
                "- Store specific, verifiable facts\n"
                "- Do not store generic greetings, chitchat, or vague statements\n"
                "- UserMemory: personal attributes or preferences about the user\n"
                "- LongTermMemory: facts, events, or shared knowledge from the conversation\n"
                "- If unsure whether information already exists, call search_memory first"
            ),
        },
        {
            "role": "user",
            "content": (
                "Please analyze the following conversation and decide what to store:\n\n"
                "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
                "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
                "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
            ),
        },
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "search_memory",
                "description": "Search historical memories for context.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "query": {"type": "string"}
                    },
                    "required": ["query"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "add_memory",
                "description": "Extract and store memories.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "memory_list": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "key": {"type": "string"},
                                    "memory_type": {
                                        "type": "string",
                                        "enum": ["LongTermMemory", "UserMemory"],
                                    },
                                    "value": {"type": "string"},
                                    "tags": {
                                        "type": "array",
                                        "items": {"type": "string"},
                                    },
                                },
                                "required": ["key", "memory_type", "value", "tags"],
                            },
                        },
                        "summary": {"type": "string"},
                    },
                    "required": ["memory_list", "summary"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "buffer_memory",
                "description": "Buffer for later processing.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "reason": {"type": "string"}
                    },
                    "required": ["reason"],
                },
            },
        },
        {
            "type": "function",
            "function": {
                "name": "ignore_memory",
                "description": "Ignore low-value content.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "reason": {"type": "string"}
                    },
                    "required": ["reason"],
                },
            },
        },
    ],
    "tool_choice": "required",
    "temperature": 0.2,
    "max_tokens": 1024,
}

headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json()["choices"][0]["message"])
```
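
With `"tool_choice": "required"`, the server returns the chosen operation in the standard OpenAI `tool_calls` field, with `arguments` serialized as a JSON string. The following sketch shows how to decode it; the `resp` dict is a hand-written illustration of the response shape, not actual model output:

```python
import json

# Illustrative response fragment in the OpenAI chat-completions shape.
resp = {
    "choices": [{
        "message": {
            "role": "assistant",
            "tool_calls": [{
                "type": "function",
                "function": {
                    "name": "add_memory",
                    "arguments": json.dumps({
                        "memory_list": [{
                            "key": "Rust benchmark plan",
                            "memory_type": "LongTermMemory",
                            "value": "Michael plans a Python-vs-Rust benchmark report this week.",
                            "tags": ["project", "Rust"],
                        }],
                        "summary": "One project update stored.",
                    }),
                },
            }],
        }
    }]
}

call = resp["choices"][0]["message"]["tool_calls"][0]["function"]
operation = call["name"]                   # which memory operation to run
arguments = json.loads(call["arguments"])  # decoded JSON payload
print(operation, len(arguments["memory_list"]))  # → add_memory 1
```

The decoded `operation` and `arguments` can then be routed to your memory backend.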

### Hugging Face Transformers Usage

You can also load the model directly from Hugging Face and run memory extraction with tool calling.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "IAAR-Shanghai/MemReader-4B-thinking"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "system",
        "content": (
            "You are a memory extraction agent. Analyze conversations and decide "
            "what information should be stored in long-term memory."
        ),
    },
    {
        "role": "user",
        "content": (
            "Please analyze the following conversation and decide what to store:\n\n"
            "[user]: How is that project at the company going lately? The one he said he wanted to rewrite with a new language.\n"
            "[assistant]: Do you mean the recommendation system refactoring project? Last time we mentioned that Michael planned to rewrite some core modules in Rust, and it was still in the evaluation stage.\n"
            "[user]: Yes, that one. He said he is going to produce a performance comparison report this week, benchmarking Python against Rust."
        ),
    },
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Search historical memories for context.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "add_memory",
            "description": "Extract and store memories.",
            "parameters": {
                "type": "object",
                "properties": {
                    "memory_list": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "key": {"type": "string"},
                                "memory_type": {
                                    "type": "string",
                                    "enum": ["LongTermMemory", "UserMemory"],
                                },
                                "value": {"type": "string"},
                                "tags": {
                                    "type": "array",
                                    "items": {"type": "string"},
                                },
                            },
                            "required": ["key", "memory_type", "value", "tags"],
                        },
                    },
                    "summary": {"type": "string"},
                },
                "required": ["memory_list", "summary"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "buffer_memory",
            "description": "Buffer for later processing.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "ignore_memory",
            "description": "Ignore low-value content.",
            "parameters": {
                "type": "object",
                "properties": {"reason": {"type": "string"}},
                "required": ["reason"],
            },
        },
    },
]

text = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=1024)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
output = tokenizer.decode(output_ids, skip_special_tokens=True)
print(output)
```

### vLLM Usage

Start an OpenAI-compatible vLLM server:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model IAAR-Shanghai/MemReader-4B-thinking \
  --served-model-name MemReader-4B-thinking \
  --port 8000 \
  --tensor-parallel-size 1 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Then send a standard chat completion request to `http://localhost:8000/v1/chat/completions`.

### SGLang Usage

MemReader-4B-thinking can also be deployed with SGLang through its OpenAI-compatible serving interface. Please make sure tool calling and thinking mode are enabled in your serving configuration.

## Output Format

MemReader-4B-thinking is trained to produce thinking traces and tool calls. A typical response looks like this:

```xml
<think>
The conversation refers to an already known project and adds a new update:
Michael plans to produce a Python vs Rust benchmark report this week.
This is valuable project-state information and should be added to memory.
</think>

<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael said the recommendation system refactoring project is still in evaluation, and he plans to produce a Python-vs-Rust benchmark report this week for the core modules under consideration for Rust rewriting.", "tags": ["project", "Rust", "benchmark", "refactoring"]}], "summary": "Added one memory about the project update and the planned benchmark report."}}
</tool_call>
```
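
If your serving stack exposes this raw text rather than parsed tool calls, the two sections can be split with a small regex. A sketch assuming the `<think>`/`<tool_call>` layout shown above (the sample string is shortened for illustration):

```python
import json
import re

raw = """<think>
New project-state information; add it to memory.
</think>

<tool_call>
{"name": "add_memory", "arguments": {"memory_list": [{"key": "Rust benchmark plan", "memory_type": "LongTermMemory", "value": "Michael plans a Python-vs-Rust benchmark report this week.", "tags": ["project"]}], "summary": "One memory added."}}
</tool_call>"""

# Extract the thinking trace and the tool-call JSON from the raw text.
think_match = re.search(r"<think>(.*?)</think>", raw, re.DOTALL)
call_match = re.search(r"<tool_call>(.*?)</tool_call>", raw, re.DOTALL)

thinking = think_match.group(1).strip() if think_match else ""
tool_call = json.loads(call_match.group(1)) if call_match else None

print(tool_call["name"])  # → add_memory
```

Serving layers such as vLLM's reasoning and tool-call parsers can do this splitting for you; the regex is only needed when you work with the undecoded text.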

## Best Practices

- Use `search_memory` first when the conversation contains pronouns, ellipsis, or implicit historical references.
- Use `buffer_memory` only when the information is genuinely incomplete and cannot be resolved from history.
- Keep tool definitions stable between training and inference.
- For production pipelines, execute tool calls externally and feed tool responses back to the model when multi-step reasoning is needed.
- If you want shorter outputs, reduce `max_tokens` and control whether thinking traces are exposed in your serving layer.
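
The external tool-execution step above can be sketched as plain message-list plumbing in the OpenAI format: run the call yourself, append the result as a `tool` message, then request the next completion. The assistant turn and the `run_search` backend below are hand-written stand-ins, not real model output:

```python
import json


def run_search(query: str) -> list[dict]:
    """Stub for your real memory backend; returns matching memories."""
    return [{"key": "Rust rewrite",
             "value": "Michael plans to rewrite core modules in Rust."}]


messages = [
    {"role": "system", "content": "You are a memory extraction agent."},
    {"role": "user", "content": "He said the report is coming this week."},
]

# Step 1: the model (call not shown) answers with a search_memory tool call.
assistant_turn = {
    "role": "assistant",
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "search_memory",
                     "arguments": json.dumps({"query": "report rewrite"})},
    }],
}
messages.append(assistant_turn)

# Step 2: execute the call externally and feed the result back as a tool
# message, then request a second completion; with the retrieved context
# the model can resolve "he" and emit add_memory.
args = json.loads(assistant_turn["tool_calls"][0]["function"]["arguments"])
messages.append({
    "role": "tool",
    "tool_call_id": "call_0",
    "content": json.dumps(run_search(args["query"])),
})

print([m["role"] for m in messages])  # → ['system', 'user', 'assistant', 'tool']
```

The same pattern generalizes to multiple rounds: keep appending assistant tool calls and tool results until the model emits a terminal operation such as `add_memory` or `ignore_memory`.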

## Limitations

- The model is optimized for memory-management scenarios rather than general-purpose chatting.
- Quality depends on the external memory schema, retrieval quality, and tool-execution loop.
- For highly domain-specific memory schemas, additional instruction tuning may still be beneficial.
- As with other LLMs, outputs may still contain mistakes, omissions, or unsupported inferences and should be validated in safety-critical workflows.

## License Notice

This model is released under the Apache-2.0 license. As it is derived from [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), users should also review and comply with the upstream base model license, usage terms, and any applicable third-party requirements before deployment.

## Links

- GitHub: [MemTensor/MemOS](https://github.com/MemTensor/MemOS)
- API Documentation: [docs.openmem.net](https://docs.openmem.net/)
- Model: [IAAR-Shanghai/MemReader-4B-thinking](https://huggingface.co/IAAR-Shanghai/MemReader-4B-thinking)
- Base model: [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)
|