Marco-DeepResearch-8B-FP8
FP8 quantized version of AIDC-AI/Marco-DeepResearch-8B for high-throughput GPU inference with vLLM, SGLang, and other FP8-compatible engines.
About the Model
Marco DeepResearch is an efficient 8B-scale deep research agent developed by Alibaba International Digital Commerce (AIDC-AI), based on Qwen3-8B. It autonomously conducts open-ended investigations by integrating complex information retrieval with multi-step reasoning across diverse web sources. The model uses tools (search, visit) for iterative web research with built-in verification.
Under a maximum budget of 600 tool calls, Marco DeepResearch significantly outperforms other 8B-scale agents and surpasses or approaches several 30B-scale agents on challenging benchmarks.
Quantization Details
| Field | Value |
|---|---|
| Method | fp8 (block-wise weight quantization) |
| Format | E4M3 (4 exponent bits, 3 mantissa bits) |
| Weight block size | [128, 128] |
| Activation scheme | Dynamic (per-token scaling at runtime) |
| Modules kept in original precision | lm_head |
| Source precision | BF16 |
Recommended hardware: GPUs with native FP8 support, such as NVIDIA Hopper (H100/H200), Ada Lovelace (L40/L40S/RTX 4090), Blackwell, or AMD MI300X, for best throughput and memory savings.
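To make the memory savings concrete, here is a rough back-of-the-envelope sketch of weight storage under the settings in the table above. The parameter counts, the float32 scale size, and the `amax / 448` scaling convention are illustrative assumptions, not values read from this checkpoint; 448 is the largest finite value representable in E4M3.

```python
# Rough weight-memory estimate for block-wise FP8 versus BF16.
# All sizes passed in are illustrative assumptions, not measured values.

BLOCK = (128, 128)   # weight block size from the quantization table
E4M3_MAX = 448.0     # largest finite magnitude in FP8 E4M3
BYTES_BF16 = 2
BYTES_FP8 = 1
BYTES_SCALE = 4      # assumption: one float32 scale per weight block


def block_scale(block_abs_max: float) -> float:
    """Common convention: map the block's largest magnitude onto E4M3_MAX."""
    return block_abs_max / E4M3_MAX


def weight_bytes(n_quantized: int, n_bf16: int) -> dict:
    """Estimate storage for FP8 weights, their scales, and BF16 leftovers.

    Blocks are counted over the total element count, which ignores
    per-matrix padding, so the scale overhead is only approximate.
    """
    block_elems = BLOCK[0] * BLOCK[1]
    n_blocks = -(-n_quantized // block_elems)  # ceiling division
    fp8 = n_quantized * BYTES_FP8 + n_blocks * BYTES_SCALE + n_bf16 * BYTES_BF16
    bf16 = (n_quantized + n_bf16) * BYTES_BF16
    return {"fp8_bytes": fp8, "bf16_bytes": bf16, "ratio": fp8 / bf16}


if __name__ == "__main__":
    # Hypothetical split: 7.0B quantized weights, 0.6B kept in BF16 (e.g. lm_head).
    estimate = weight_bytes(7_000_000_000, 600_000_000)
    print(f"FP8 checkpoint is ~{estimate['ratio']:.0%} of the BF16 size")
```

The per-block scales add very little overhead (one scale per 16,384 weights), so the ratio is dominated by how many parameters stay in BF16.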
Usage
vLLM
Offline inference:
from vllm import LLM, SamplingParams

llm = LLM(
    model="AIDC-AI/Marco-DeepResearch-8B-FP8",
    quantization="fp8",
    max_model_len=32768,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=4096,
)

outputs = llm.generate(["<your prompt>"], sampling_params)
print(outputs[0].outputs[0].text)
OpenAI-compatible server:
vllm serve AIDC-AI/Marco-DeepResearch-8B-FP8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --port 8000 \
  --gpu-memory-utilization 0.9
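Once the server is running, it can be queried over the OpenAI-compatible chat completions endpoint. The following is a minimal standard-library sketch; it assumes the server is reachable at `localhost:8000` and that the served model name matches the repository ID. `build_chat_request` and `chat` are hypothetical helper names introduced here for illustration.

```python
import json
import urllib.request

# Assumption: the server from the command above runs on this host and port.
BASE_URL = "http://localhost:8000/v1"
MODEL = "AIDC-AI/Marco-DeepResearch-8B-FP8"


def build_chat_request(prompt: str, system_prompt: str = "") -> urllib.request.Request:
    """Build a POST request for the /chat/completions endpoint."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": prompt})
    payload = {
        "model": MODEL,
        "messages": messages,
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 4096,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, system_prompt: str = "") -> str:
    """Send one chat turn and return the raw assistant text."""
    with urllib.request.urlopen(build_chat_request(prompt, system_prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    print(chat("<your prompt>"))
```

The same client should work against the SGLang server by pointing `BASE_URL` at port 30000, since SGLang also exposes an OpenAI-compatible API.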
SGLang
python -m sglang.launch_server \
  --model-path AIDC-AI/Marco-DeepResearch-8B-FP8 \
  --quantization fp8 \
  --context-length 32768 \
  --port 30000
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-DeepResearch-8B-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

messages = [{"role": "user", "content": "<your prompt>"}]
# return_dict=True returns input_ids and attention_mask together,
# so the result can be unpacked directly into generate().
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
TensorRT-LLM
FP8 weights can be loaded directly by TensorRT-LLM's Qwen3 builder. Refer to the TensorRT-LLM Qwen examples and pass --use_fp8 during engine build.
Prompt Format
This model uses a structured prompt format with <think>, <tool_call>, and <answer> tags.
System Prompt Template
You are an expert web researcher. Your task is to find accurate, complete answers through iterative search, extraction, and verification.
## Core Principles
1) Strategic Planning
- Decompose complex questions into targeted sub-tasks
- Choose the right tool for each step
- Refine your approach based on what you learn
2) Precise Execution
- Define clear objectives before using any tool
- Provide sufficient detail for accurate results
- Avoid vague or overly broad requests
3) Rigorous Verification
- Cross-check important facts across multiple sources
- Resolve conflicts by gathering additional evidence
- Only conclude when evidence is sufficient and consistent
## Output Format
In each turn, you can either call a tool or provide the final answer.
**Call a tool:**
<think>your reasoning process</think>
<tool_call>
{"name": "tool_name", "arguments": {"param1": "value1", "param2": "value2"}}
</tool_call>
**Provide final answer (when you have gathered enough information):**
<think>your reasoning and analysis</think>
<answer>the direct answer to the question</answer>
Note: All reasoning should be in <think>, <answer> should contain only the final answer.
Current date: {current_date}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{tools_json}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
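The template above has two placeholders, `{current_date}` and `{tools_json}`, but it also contains literal JSON braces, so `str.format` would fail on it. A minimal sketch of filling it with plain string replacement follows. The helper name `render_system_prompt`, the ISO date format, and the one-JSON-object-per-line tool serialization are assumptions; the source does not specify them.

```python
import json
from datetime import date
from typing import Optional


def render_system_prompt(
    template: str, tools: list, current_date: Optional[str] = None
) -> str:
    """Fill the two placeholders without touching other braces in the template."""
    if current_date is None:
        # Assumption: ISO format (YYYY-MM-DD); the source does not fix a format.
        current_date = date.today().isoformat()
    # Assumption: one JSON object per line inside <tools></tools>.
    tools_json = "\n".join(json.dumps(tool, ensure_ascii=False) for tool in tools)
    return template.replace("{current_date}", current_date).replace(
        "{tools_json}", tools_json
    )


if __name__ == "__main__":
    demo_template = (
        'Example call: {"name": "tool_name", "arguments": {}}\n'
        "Current date: {current_date}\n"
        "<tools>\n{tools_json}\n</tools>"
    )
    demo_tools = [{"type": "function", "function": {"name": "search"}}]
    print(render_system_prompt(demo_template, demo_tools, "2026-01-01"))
```

Passing the full template text and the tool list from the Tool Definitions section produces the system message for the first turn.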
Tool Definitions
The model expects tools in OpenAI function calling format:
[
  {
    "type": "function",
    "function": {
      "name": "search",
      "description": "Search the web via Google to find relevant information and URLs.",
      "parameters": {
        "type": "object",
        "properties": {
          "querys": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Search queries for finding relevant information."
          }
        },
        "required": ["querys"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "visit",
      "description": "Read webpage content to extract specific information, verify claims, or understand context.",
      "parameters": {
        "type": "object",
        "properties": {
          "urls": {
            "type": "array",
            "items": {"type": "string"},
            "description": "URL(s) to visit."
          },
          "goal": {
            "type": "string",
            "description": "The specific information to retrieve. Be precise, not vague."
          }
        },
        "required": ["urls", "goal"]
      }
    }
  }
]
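On the application side, each tool call emitted by the model has to be checked against these definitions and routed to a real implementation. The sketch below shows one way to do that. `search_stub` and `visit_stub` are hypothetical placeholders, not part of this model or any library; only the tool names and required arguments come from the definitions above.

```python
from typing import Any, Callable, Dict, List

# Required arguments, copied from the tool definitions.
REQUIRED_ARGS: Dict[str, List[str]] = {
    "search": ["querys"],
    "visit": ["urls", "goal"],
}


def search_stub(querys: List[str]) -> str:
    """Hypothetical placeholder; replace with a real search backend."""
    return f"[search results for {len(querys)} queries]"


def visit_stub(urls: List[str], goal: str) -> str:
    """Hypothetical placeholder; replace with a real page fetcher."""
    return f"[content from {len(urls)} URLs relevant to: {goal}]"


HANDLERS: Dict[str, Callable[..., str]] = {
    "search": search_stub,
    "visit": visit_stub,
}


def dispatch(call: Dict[str, Any]) -> str:
    """Validate a parsed tool call and route it to its handler."""
    name = call.get("name")
    args = call.get("arguments")
    if name not in HANDLERS:
        return f"Error: unknown tool {name!r}"
    if not isinstance(args, dict):
        return "Error: 'arguments' must be a JSON object"
    missing = [key for key in REQUIRED_ARGS[name] if key not in args]
    if missing:
        return f"Error: missing required arguments {missing}"
    # Pass only the declared arguments so a stray key cannot raise TypeError.
    allowed = {key: args[key] for key in REQUIRED_ARGS[name]}
    return HANDLERS[name](**allowed)


if __name__ == "__main__":
    print(dispatch({"name": "search", "arguments": {"querys": ["example query"]}}))
```

Returning error strings instead of raising lets the message be fed back to the model as the tool result, so it can correct a malformed call on the next turn.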
Model Output Example
Tool call turn:
<think>
I need to search for information about X to answer the user's question.
</think>
<tool_call>
{"name": "search", "arguments": {"querys": ["search query here"]}}
</tool_call>
Final answer turn:
<think>
Based on the evidence gathered from multiple sources, I can now conclude that...
</think>
<answer>
The direct answer to the question.
</answer>
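Outputs in this shape can be split into their parts with a few regular expressions. The following is a minimal parsing sketch, assuming well-formed tags as in the examples above; `parse_turn` is a hypothetical helper name, and malformed JSON inside a `<tool_call>` block is simply skipped.

```python
import json
import re
from typing import Any, Dict

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_turn(text: str) -> Dict[str, Any]:
    """Extract reasoning, tool calls, and the final answer from one model turn."""
    think = _THINK.search(text)
    answer = _ANSWER.search(text)
    tool_calls = []
    for match in _TOOL_CALL.finditer(text):
        try:
            tool_calls.append(json.loads(match.group(1).strip()))
        except json.JSONDecodeError:
            continue  # skip tool calls whose body is not valid JSON
    return {
        "think": think.group(1).strip() if think else None,
        "tool_calls": tool_calls,
        "answer": answer.group(1).strip() if answer else None,
    }


if __name__ == "__main__":
    turn = (
        "<think>\nI need to search.\n</think>\n"
        "<tool_call>\n"
        '{"name": "search", "arguments": {"querys": ["search query here"]}}\n'
        "</tool_call>"
    )
    print(parse_turn(turn))
```

A turn with a non-empty `tool_calls` list should trigger tool execution and another model call; a turn with `answer` set ends the loop.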
Benchmark Results
Evaluated on a suite of deep search benchmarks under a maximum budget of 600 tool calls (results from the original unquantized model; FP8 quality is near-identical in our internal evaluation).
Original Model
This is a quantized version of AIDC-AI/Marco-DeepResearch-8B. Please refer to the original model card for full details on training methodology, intended use, and limitations.
- Paper: Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
- Code: GitHub
Citation
@article{zhu2026marco,
title={Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design},
author={Bin Zhu and Qianghuai Jia and Tian Lan and Junyang Ren and Feng Gu and Feihu Jiang and Longyue Wang and Zhao Xu and Weihua Luo},
journal={arXiv preprint arXiv:2603.28376},
year={2026}
}
License
This model is released under the Apache 2.0 License.
Evaluation results (self-reported accuracy)
- BrowseComp: 31.4
- BrowseComp-ZH: 47.1
- GAIA: 69.9
- xBench-DeepSearch-2505: 82.0
- WebWalkerQA: 69.6