
ASTRA-14B-Thinking-v1

Model Description

The ASTRA-14B-Thinking-v1 model is derived from Qwen3-14B and specifically optimized for multi-step, tool-augmented tasks, with enhanced agentic capabilities in complex tool use and structured reasoning.

We also provide a 32B variant ASTRA-32B-Thinking-v1.

Model Performances

ASTRA-14B-Thinking-v1 achieves state-of-the-art performance on the BFCL-V3 multi-turn subset among models of comparable scale.

(Figure: results on the BFCL-V3 multi-turn subset)

Data Curation

The training data is built upon two core pillars of automation:

1. Tool-Grounded SFT Data

  • Key Feature: We constructed an extensive tool pool from 1,585 MCP servers, encompassing 19,036 tools across 41 domains. The data pipeline analyzes schema-level dependencies to generate executable tool-chains, ensuring that the synthesized trajectories are realistic and parameter-satisfiable.
  • Sample Data: ASTRA-SFT-1k
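The schema-dependency idea behind chain construction can be sketched as structural matching: a tool can feed another when its output schema covers the other's required inputs. The tool schemas and the `can_chain` helper below are hypothetical illustrations, not the actual pipeline:

```python
# Sketch: chain two tools when the first tool's output schema can satisfy
# the second tool's required input parameters. All schemas here are made up.

def can_chain(producer: dict, consumer: dict) -> bool:
    """True if every required input of `consumer` appears in `producer`'s outputs."""
    outputs = producer["output_schema"]["properties"].keys()
    required = consumer["input_schema"].get("required", [])
    return all(param in outputs for param in required)

search_flights = {
    "name": "search_flights",
    "input_schema": {"required": ["origin", "destination"]},
    "output_schema": {"properties": {"flight_id": {"type": "string"},
                                     "price": {"type": "number"}}},
}
book_flight = {
    "name": "book_flight",
    "input_schema": {"required": ["flight_id"]},
    "output_schema": {"properties": {"confirmation": {"type": "string"}}},
}

can_chain(search_flights, book_flight)   # True: flight_id flows through
can_chain(book_flight, search_flights)   # False: origin/destination unavailable
```

A chain assembled this way is parameter-satisfiable by construction: every call in the trajectory can be filled from upstream outputs.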

2. Automated Verifiable Environments Synthesis

  • Key Feature: To support robust reinforcement learning, we synthesize fully verifiable environments implemented in Python. These environments are validated via sandboxed execution and provide multi-turn, step-wise verifiable training signals for reinforcement learning.

  • Sample Data: ASTRA-RL-1k
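The core property of such an environment is that its state is explicit Python data, every tool call mutates it deterministically, and a checker can grade each step against a spec. A minimal sketch with an invented task, tools, and reward scheme:

```python
# Sketch of a step-wise verifiable environment: deterministic state,
# deterministic tool effects, exact-match per-step rewards.
# The task, tools, and reward scheme are hypothetical illustrations.

class TodoEnv:
    def __init__(self):
        self.items: dict = {}  # item name -> done flag

    def call(self, tool: str, **args):
        if tool == "add_item":
            self.items[args["name"]] = False
        elif tool == "complete_item":
            self.items[args["name"]] = True
        else:
            raise ValueError(f"unknown tool: {tool}")
        return dict(self.items)

def step_reward(env: TodoEnv, expected_state: dict) -> float:
    """Deterministic per-step reward: 1.0 iff state matches the spec exactly."""
    return 1.0 if env.items == expected_state else 0.0

env = TodoEnv()
env.call("add_item", name="buy milk")
r1 = step_reward(env, {"buy milk": False})   # 1.0
env.call("complete_item", name="buy milk")
r2 = step_reward(env, {"buy milk": True})    # 1.0
```

Because the checker compares concrete state rather than model text, the same environment yields the same reward on every sandboxed run.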

Training Process

The model is trained in two sequential stages to enhance complex agentic decision-making:

  1. Supervised Fine-Tuning (SFT)
    Before RL, we cold-start the model with high-quality, multi-turn tool-use trajectories. This stage establishes strong behavioral priors for tool-calling formats, long-context understanding, and complex task planning, while ensuring tool diversity and improving coverage of real-world scenarios.

  2. Reinforcement Learning (RL)
    We then conduct multi-turn, tool-integrated Reinforcement Learning with Verifiable Rewards (RLVR). Training uses Adaptive Batch Filling to improve optimization stability and data utilization, and adopts batch-level token-loss averaging for more stable and efficient optimization. At each step, actions are executed in a code sandbox and deterministically verified to produce reliable rewards.
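The intuition behind batch-level token-loss averaging: averaging per sequence first lets a short rollout's few tokens carry as much weight as a long rollout's many tokens, while dividing the summed token loss once by the batch's total valid-token count weights every token equally. A minimal sketch with made-up losses (not the training code):

```python
# Sketch contrasting per-sequence vs batch-level token-loss averaging.
# Token losses are invented numbers; masks mark valid (non-padding) tokens.

def per_sequence_mean(losses, masks):
    """Average each sequence's loss first, then average over sequences."""
    seq_means = [sum(l * m for l, m in zip(ls, ms)) / sum(ms)
                 for ls, ms in zip(losses, masks)]
    return sum(seq_means) / len(seq_means)

def batch_level_mean(losses, masks):
    """Sum all valid token losses, divide once by the total valid-token count."""
    total = sum(l * m for ls, ms in zip(losses, masks)
                for l, m in zip(ls, ms))
    tokens = sum(m for ms in masks for m in ms)
    return total / tokens

losses = [[1.0, 1.0, 1.0, 1.0], [4.0, 0.0, 0.0, 0.0]]  # long vs short rollout
masks  = [[1,   1,   1,   1  ], [1,   0,   0,   0  ]]

per_sequence_mean(losses, masks)  # 2.5: the one-token rollout dominates
batch_level_mean(losses, masks)   # 1.6: every token weighted equally
```

With highly variable trajectory lengths, as in multi-turn tool use, the batch-level form avoids gradient spikes from unusually short rollouts.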

Disclaimer

  • Non-endorsement & liability disclaimer: The model is provided for research and educational purposes only. It does not reflect the views, interests, beliefs, or endorsements of any individual or organization, and should not be interpreted as making claims about any group. The project maintainers disclaim responsibility for any direct or indirect harm or damages arising from the use or misuse of the model or related resources.

Citation

@misc{tian2026astraautomatedsynthesisagentic,
      title={ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas}, 
      author={Xiaoyu Tian and Haotian Wang and Shuaiting Chen and Hao Zhou and Kaichi Yu and Yudian Zhang and Jade Ouyang and Junxi Yin and Jiong Chen and Baoyan Guo and Lei Zhang and Junjie Tao and Yuansheng Song and Ming Cui and Chengwei Liu},
      year={2026},
      eprint={2601.21558},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.21558}, 
}

Note: Although the model was trained with bf16 precision, verl saves checkpoints in float32 by default, and we did not change this setting.
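Since bfloat16 keeps float32's sign and exponent bits and drops the low 16 mantissa bits, weights trained in bf16 survive a float32 save unchanged and can be cast back losslessly. A NumPy sketch of the bit-level relationship for normal (non-NaN) values, illustrative only and not verl's conversion path:

```python
import numpy as np

def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Round-to-nearest-even float32 -> bfloat16 bit patterns (normal values)."""
    bits = x.astype(np.float32).view(np.uint32)
    # Add 0x7FFF plus the lsb of the surviving half, then drop the low 16 bits.
    rounded = bits + 0x7FFF + ((bits >> 16) & 1)
    return (rounded >> 16).astype(np.uint16)

def bf16_bits_to_f32(b: np.ndarray) -> np.ndarray:
    """Widen bfloat16 bit patterns back to float32 by zero-filling the mantissa."""
    return (b.astype(np.uint32) << 16).view(np.float32)
```

Values already representable in bf16 (e.g. 1.0, 1.5, -2.0) round-trip exactly, which is why the float32 checkpoints carry no extra precision beyond the bf16 training weights.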
