Marco-DeepResearch-8B-FP8

FP8 quantized version of AIDC-AI/Marco-DeepResearch-8B for high-throughput GPU inference with vLLM, SGLang, and other FP8-compatible engines.

About the Model

Marco DeepResearch is an efficient 8B-scale deep research agent developed by Alibaba International Digital Commerce (AIDC-AI), based on Qwen3-8B. It autonomously conducts open-ended investigations by integrating complex information retrieval with multi-step reasoning across diverse web sources. The model uses tools (search, visit) for iterative web research with built-in verification.

Under a maximum budget of 600 tool calls, Marco DeepResearch significantly outperforms other 8B-scale agents and surpasses or approaches several 30B-scale agents on deep search benchmarks including BrowseComp, BrowseComp-ZH, xBench-DeepSearch-2510, and GAIA (text-only).
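The iterative research loop and its tool-call budget can be pictured with a small driver. This is only a sketch under stated assumptions: `run_agent`, `model_step`, and `run_tool` are hypothetical names, and the way tool output is fed back to the model (a separate `tool` message) is an assumption rather than something this card specifies.

```python
from typing import Callable, Dict, List, Optional

MAX_TOOL_CALLS = 600  # budget quoted in this card


def run_agent(
    question: str,
    model_step: Callable[[List[Dict[str, str]]], Dict],
    run_tool: Callable[[str, Dict], str],
    max_tool_calls: int = MAX_TOOL_CALLS,
) -> Optional[str]:
    """Alternate model turns and tool calls until an answer or the budget is hit.

    model_step (hypothetical) returns either {"answer": str} or
    {"tool_call": {"name": str, "arguments": dict}, "raw": str}.
    """
    messages = [{"role": "user", "content": question}]
    calls_used = 0
    while True:
        turn = model_step(messages)
        if "answer" in turn:
            return turn["answer"]
        if calls_used >= max_tool_calls:
            return None  # budget exhausted without a final answer
        call = turn["tool_call"]
        result = run_tool(call["name"], call["arguments"])
        calls_used += 1
        # Assumption: the tool result goes back to the model as its own message.
        messages.append({"role": "assistant", "content": turn.get("raw", "")})
        messages.append({"role": "tool", "content": result})
```

In practice `model_step` would wrap a call to one of the inference engines listed under Usage, and `run_tool` would dispatch to real search and visit implementations.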

Quantization Details

Field	Value
Method	`fp8` (block-wise weight quantization)
Format	E4M3 (4 exponent bits, 3 mantissa bits)
Weight block size	`[128, 128]`
Activation scheme	Dynamic (per-token scaling at runtime)
Modules kept in original precision	`lm_head`
Source precision	BF16
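To make the table concrete, the sketch below shows how per-block weight scales and per-token activation scales are commonly derived for E4M3, whose largest finite magnitude is 448. It is a pure-Python illustration, not the procedure used to produce this checkpoint; the helper names and the `amax / 448` convention are assumptions.

```python
import math
from typing import List, Sequence, Tuple

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 (e4m3fn) encoding


def scale_grid_shape(
    rows: int, cols: int, block: Tuple[int, int] = (128, 128)
) -> Tuple[int, int]:
    """Number of scales needed for a rows x cols weight matrix."""
    return math.ceil(rows / block[0]), math.ceil(cols / block[1])


def block_scales(
    weight: Sequence[Sequence[float]], block: Tuple[int, int] = (128, 128)
) -> List[List[float]]:
    """One scale per block, chosen so that block / scale fits in [-448, 448]."""
    rows, cols = len(weight), len(weight[0])
    n_r, n_c = scale_grid_shape(rows, cols, block)
    amax = [[0.0] * n_c for _ in range(n_r)]
    for i in range(rows):
        for j in range(cols):
            bi, bj = i // block[0], j // block[1]
            amax[bi][bj] = max(amax[bi][bj], abs(weight[i][j]))
    # All-zero blocks get a neutral scale of 1.0 to avoid dividing by zero.
    return [[a / FP8_E4M3_MAX if a > 0 else 1.0 for a in row] for row in amax]


def per_token_scale(activations: Sequence[float]) -> float:
    """Dynamic scale for one token's activation vector, computed at runtime."""
    a = max(abs(x) for x in activations)
    return a / FP8_E4M3_MAX if a > 0 else 1.0
```

With `[128, 128]` blocks, a 4096 x 12288 weight matrix needs a 32 x 96 grid of scales, so the scale tensors add very little to the checkpoint size.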

Recommended hardware: GPUs with native FP8 tensor cores, such as NVIDIA Hopper (H100/H200), Ada Lovelace (L40/L40S/RTX 4090), Blackwell, or AMD MI300X, for best throughput and memory savings.
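On NVIDIA hardware, one quick way to check for native FP8 support is to compare the CUDA compute capability against 8.9, the value reported by Ada Lovelace GPUs (Hopper reports 9.0). The helper below is a hypothetical sketch that covers NVIDIA only; AMD devices need a different check.

```python
from typing import Tuple

# Ada Lovelace reports compute capability 8.9, Hopper 9.0, Blackwell 10.0 or higher.
MIN_FP8_CAPABILITY = (8, 9)


def has_native_fp8(capability: Tuple[int, int]) -> bool:
    """True if a CUDA compute capability (major, minor) has FP8 tensor cores."""
    return tuple(capability) >= MIN_FP8_CAPABILITY
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple to pass in.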

Usage

vLLM

Offline inference:

from vllm import LLM, SamplingParams

llm = LLM(
    model="AIDC-AI/Marco-DeepResearch-8B-FP8",
    quantization="fp8",
    max_model_len=32768,
    gpu_memory_utilization=0.9,
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=4096,
)

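# llm.chat applies the model's chat template; llm.generate would pass the raw string through unchanged.
messages = [{"role": "user", "content": "<your prompt>"}]
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)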

OpenAI-compatible server:

vllm serve AIDC-AI/Marco-DeepResearch-8B-FP8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --port 8000 \
  --gpu-memory-utilization 0.9
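Once the server is running, it can be queried over the OpenAI-compatible chat completions route. The following is a minimal standard-library sketch; `build_chat_request` and `chat` are hypothetical helper names, and the base URL assumes the server runs locally on the port used above.

```python
import json
import urllib.request

MODEL_ID = "AIDC-AI/Marco-DeepResearch-8B-FP8"


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8000/v1",
    model: str = MODEL_ID,
) -> urllib.request.Request:
    """Build a POST request for the /chat/completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 4096,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `chat("<your prompt>")` requires the server to be up; passing a different `base_url` should work for any other engine that exposes the same route.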

SGLang

python -m sglang.launch_server \
  --model-path AIDC-AI/Marco-DeepResearch-8B-FP8 \
  --quantization fp8 \
  --context-length 32768 \
  --port 30000

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AIDC-AI/Marco-DeepResearch-8B-FP8"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

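messages = [{"role": "user", "content": "<your prompt>"}]
# return_dict=True yields input_ids and attention_mask regardless of the library's default return type.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))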

TensorRT-LLM

FP8 weights can be loaded directly by TensorRT-LLM's Qwen3 builder. Refer to the TensorRT-LLM Qwen examples and pass --use_fp8 during engine build.

Prompt Format

This model uses a structured prompt format: reasoning goes in <think> tags, tool invocations in <tool_call> tags, and the final answer in <answer> tags.

System Prompt Template

You are an expert web researcher. Your task is to find accurate, complete answers through iterative search, extraction, and verification.

## Core Principles

1) Strategic Planning
   - Decompose complex questions into targeted sub-tasks
   - Choose the right tool for each step
   - Refine your approach based on what you learn

2) Precise Execution
   - Define clear objectives before using any tool
   - Provide sufficient detail for accurate results
   - Avoid vague or overly broad requests

3) Rigorous Verification
   - Cross-check important facts across multiple sources
   - Resolve conflicts by gathering additional evidence
   - Only conclude when evidence is sufficient and consistent

## Output Format

In each turn, you can either call a tool or provide the final answer.

**Call a tool:**
<think>your reasoning process</think>
<tool_call>
{"name": "tool_name", "arguments": {"param1": "value1", "param2": "value2"}}
</tool_call>

**Provide final answer (when you have gathered enough information):**
<think>your reasoning and analysis</think>
<answer>the direct answer to the question</answer>

Note: All reasoning should be in <think>, <answer> should contain only the final answer.

Current date: {current_date}

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{tools_json}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
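The template contains two placeholders, {current_date} and {tools_json}, alongside literal JSON braces, so `str.format` would fail on it. A minimal sketch of filling it with plain string replacement follows; `render_system_prompt` is a hypothetical helper, and both the ISO date format and the single-line JSON serialization of the tools are assumptions.

```python
import json
from datetime import date
from typing import Dict, List, Optional


def render_system_prompt(
    template: str, tools: List[Dict], today: Optional[date] = None
) -> str:
    """Fill {current_date} and {tools_json} without touching other braces."""
    today = today or date.today()
    return (
        template
        .replace("{current_date}", today.isoformat())
        .replace("{tools_json}", json.dumps(tools, ensure_ascii=False))
    )
```

The rendered string is then sent as the `system` message, followed by the user's question.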

Tool Definitions

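The model expects tools in OpenAI function-calling format; the JSON array below fills the {tools_json} placeholder in the system prompt. Note that the search parameter is spelled querys in the schema, so tool implementations should read that exact key: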

[
  {
    "type": "function",
    "function": {
      "name": "search",
      "description": "Search the web via Google to find relevant information and URLs.",
      "parameters": {
        "type": "object",
        "properties": {
          "querys": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Search queries for finding relevant information."
          }
        },
        "required": ["querys"]
      }
    }
  },
  {
    "type": "function",
    "function": {
      "name": "visit",
      "description": "Read webpage content to extract specific information, verify claims, or understand context.",
      "parameters": {
        "type": "object",
        "properties": {
          "urls": {
            "type": "array",
            "items": {"type": "string"},
            "description": "URL(s) to visit."
          },
          "goal": {
            "type": "string",
            "description": "The specific information to retrieve. Be precise, not vague."
          }
        },
        "required": ["urls", "goal"]
      }
    }
  }
]
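Before executing a tool call emitted by the model, it can help to check it against these definitions. The sketch below validates only the tool name and argument keys; `validate_tool_call` is a hypothetical helper, and `TOOLS` is an abbreviated copy of the schema above with descriptions omitted.

```python
from typing import Dict, List

# Abbreviated copy of the tool definitions above (descriptions omitted).
TOOLS = [
    {"type": "function", "function": {
        "name": "search",
        "parameters": {
            "type": "object",
            "properties": {"querys": {"type": "array", "items": {"type": "string"}}},
            "required": ["querys"],
        },
    }},
    {"type": "function", "function": {
        "name": "visit",
        "parameters": {
            "type": "object",
            "properties": {
                "urls": {"type": "array", "items": {"type": "string"}},
                "goal": {"type": "string"},
            },
            "required": ["urls", "goal"],
        },
    }},
]


def validate_tool_call(call: Dict, tools: List[Dict] = TOOLS) -> List[str]:
    """Return a list of problems; an empty list means the call looks valid."""
    specs = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    name = call.get("name")
    if name not in specs:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = specs[name]
    problems = [
        f"missing required argument: {key}"
        for key in params.get("required", [])
        if key not in args
    ]
    problems += [
        f"unexpected argument: {key}"
        for key in args
        if key not in params.get("properties", {})
    ]
    return problems
```

Returning the problems to the model as the tool result, instead of raising, gives it a chance to correct the call on its next turn.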

Model Output Example

Tool call turn:

<think>
I need to search for information about X to answer the user's question.
</think>
<tool_call>
{"name": "search", "arguments": {"querys": ["search query here"]}}
</tool_call>

Final answer turn:

<think>
Based on the evidence gathered from multiple sources, I can now conclude that...
</think>
<answer>
The direct answer to the question.
</answer>
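Both kinds of turn can be parsed with a few regular expressions. This is a sketch that assumes well-formed tags as in the examples above; `parse_turn` is a hypothetical helper, and `json.loads` will raise if the model emits malformed JSON inside <tool_call>.

```python
import json
import re
from typing import Dict, Optional

# One non-greedy pattern per tag; DOTALL lets the content span several lines.
_TAGS = {
    name: re.compile(rf"<{name}>(.*?)</{name}>", re.DOTALL)
    for name in ("think", "tool_call", "answer")
}


def parse_turn(text: str) -> Dict[str, Optional[object]]:
    """Split one model turn into its reasoning, tool call, and answer parts."""

    def grab(name: str) -> Optional[str]:
        match = _TAGS[name].search(text)
        return match.group(1).strip() if match else None

    raw_call = grab("tool_call")
    return {
        "think": grab("think"),
        "tool_call": json.loads(raw_call) if raw_call else None,
        "answer": grab("answer"),
    }
```

A turn with a non-null `tool_call` should be executed and its result fed back to the model; a turn with a non-null `answer` ends the episode.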

Benchmark Results

Evaluated on a suite of deep search benchmarks under a maximum budget of 600 tool calls (results from the original unquantized model; FP8 quality is near-identical in our internal evaluation).

[Figure: Marco DeepResearch benchmark performance across BrowseComp, BrowseComp-ZH, xBench-DeepSearch-2510, and GAIA (text-only)]

Original Model

This is a quantized version of AIDC-AI/Marco-DeepResearch-8B. Please refer to the original model card for full details on training methodology, intended use, and limitations.

Paper: Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
Code: GitHub

Citation

@article{zhu2026marco,
  title={Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design},
  author={Bin Zhu and Qianghuai Jia and Tian Lan and Junyang Ren and Feng Gu and Feihu Jiang and Longyue Wang and Zhao Xu and Weihua Luo},
  journal={arXiv preprint arXiv:2603.28376},
  year={2026}
}