
daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently

πŸ“„ Paper   |   πŸ’» GitHub Repository   |   πŸ“š Dataset

Model Card

daVinci-Agency is a long-horizon agentic model finetuned on GLM-4.6. It is designed to master long-horizon tasks by learning from the iterative evolutionary process of real-world software development.

Unlike standard instruction-tuned models, daVinci-Agency is trained on trajectories that explicitly embody key meta-skills: task decomposition, long-term consistency, and iterative refinement. These trajectories, mined from real GitHub Pull Request chains, average 85k tokens and 116 tool invocations, enabling the model to handle complex, multi-step agentic workflows.

Model Details

  • Base Model: GLM-4.6 (Zhipu AI)
  • Parameters: 353B (BF16)
  • Context Window: 200K tokens (inherited from GLM-4.6, optimized for long-horizon history)
  • Training Method: Finetuned on daVinci-Agency synthetic data.
  • Primary Capabilities: Long-horizon planning, complex tool usage, autonomous coding, and self-correction.

Highlights

  • πŸš€ Unlocking Long-Horizon Agency: By training on interaction trajectories that mirror the "software evolution" process, daVinci-Agency breaks through the teacher-model ceiling of traditional data synthesis, learning to maintain consistency over long operational windows.
  • 🧠 Advanced Meta-Skills: The model demonstrates internalized capabilities for decomposing massive tasks into manageable units (PRs) and refining solutions based on feedback, achieving a 47% relative gain on Toolathlon compared to the base model.
  • πŸ—οΈ Built on GLM-4.6: Leveraging the superior coding and reasoning foundation of GLM-4.6, daVinci-Agency inherits a 200K context window, enabling it to process the extensive context required for complex agentic tasks.

πŸ“Š Performance Benchmarks

Trained on only 239 samples, daVinci-Agency consistently outperforms models finetuned on much larger synthetic datasets.

| Training Data | Samples | SWE-bench (SWE-agent) | Toolathlon | $\tau^2$-bench | Overall Avg. |
|---|---|---|---|---|---|
| GLM-4.6 (Base) | - | 0.608 | 0.157 | 0.675 | 0.441 |
| SWE-Smith | 66,000 | 0.404 | 0.093 | 0.586 | 0.373 |
| CC-bench | 260 | 0.618 | 0.000 | 0.697 | 0.436 |
| daVinci-Agency | 239 | 0.632 | 0.231 | 0.707 | 0.475 |
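The 47% relative gain on Toolathlon cited in the Highlights follows directly from the scores in this table; a quick sanity calculation:

```python
# Toolathlon scores taken from the table above
base_score = 0.157     # GLM-4.6 (Base)
davinci_score = 0.231  # daVinci-Agency

# Relative improvement over the base model
relative_gain = (davinci_score - base_score) / base_score
print(f"{relative_gain:.1%}")  # -> 47.1%
```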

Baseline Models Comparison

| Model | SWE-bench (SWE-agent) | Toolathlon | AgencyBench | Overall Avg. |
|---|---|---|---|---|
| DeepSeek-v3.2 | 0.456 | 0.250 | 11.6 | 0.366 |
| Qwen3-235B | 0.504 | 0.046 | 4.6 | 0.309 |
| Kimi-K2-Thinking | 0.318 | 0.213 | 11.8 | 0.404 |
| GLM-4.6 | 0.608 | 0.157 | 11.9 | 0.441 |
| GLM-4.6-daVinci-Agency | 0.632 | 0.231 | 15.9 | 0.475 |

Inference

Since daVinci-Agency is based on GLM-4.6, it utilizes the standard GLM tokenizer and chat template.

Recommended Parameters

Following the GLM-4.6 guidelines, we recommend the following parameters for optimal performance, especially in code-related tasks:

  • Temperature: 1.0 (General)
  • Top_p: 0.95
  • Top_k: 40
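If you serve the model behind an OpenAI-compatible API (for example with vLLM, whose server extends the OpenAI schema with a `top_k` field), the recommended parameters map directly onto the request body. A minimal sketch; serving daVinci-Agency this way, and the example query, are assumptions rather than part of this release:

```python
# Sketch of a chat-completions request body carrying the recommended
# sampling parameters. The deployment setup is an assumption; `top_k`
# is a vLLM schema extension, not part of the base OpenAI API.
payload = {
    "model": "GAIR/daVinci-Agency",
    "messages": [
        {"role": "system", "content": "You are an intelligent software engineering agent."},
        {"role": "user", "content": "List the steps to add OAuth2 support to this service."},
    ],
    "temperature": 1.0,  # recommended general setting
    "top_p": 0.95,
    "top_k": 40,         # vLLM extension field
    "max_tokens": 4096,
}
```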

Quick Start (Transformers)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_path = "GAIR/daVinci-Agency" # Replace with actual path

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

# Example: Complex Long-Horizon Task
query = "Refactor the authentication module in this repository to support OAuth2, ensuring backward compatibility."

messages = [
    {"role": "system", "content": "You are an intelligent software engineering agent capable of long-horizon planning and execution."},
    {"role": "user", "content": query}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
).to(device)

gen_kwargs = {
    "max_new_tokens": 4096,
    "do_sample": True,
    "top_p": 0.95,
    "temperature": 1.0,
    "top_k": 40
}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    response = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(response[0], skip_special_tokens=True))

Citation

If you use daVinci-Agency in your research, please cite our work:

@article{jiang2026davinci,
  title={daVinci-Agency: Unlocking Long-Horizon Agency Data-Efficiently},
  author={Mohan Jiang and Dayuan Fu and Junhao Shi and Ji Zeng and Weiye Si and Keyu Li and Xuefeng Li and Yang Xiao and Wenjie Li and Dequan Wang and Pengfei Liu},
  journal={arXiv preprint arXiv:2602.02619},
  year={2026}
}