ASTRA-32B-Thinking-v1
Model Description
The ASTRA-32B-Thinking-v1 model is derived from Qwen3-32B and specifically optimized for multi-step, tool-augmented tasks, with enhanced agentic capabilities in complex tool use and structured reasoning.
We also provide a 14B variant ASTRA-14B-Thinking-v1.
Model Performances
ASTRA-32B-Thinking-v1 achieves state-of-the-art performance on the BFCL-V3 multi-turn subset among models of comparable scale.
Results on the BFCL-V3 multi-turn subset:

Data Curation
The training data is built upon two core pillars of automation:
1. Tool-Grounded SFT Data
- Key Feature: We constructed an extensive tool pool from 1,585 MCP servers, encompassing 19,036 tools across 41 domains. The data pipeline analyzes schema-level dependencies to generate executable tool-chains, ensuring that the synthesized trajectories are realistic and parameter-satisfiable.
- Sample Data: ASTRA-SFT-1k
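A schema-level dependency check of this kind can be sketched as follows. The tool schemas, field names, and the `can_chain` helper are illustrative assumptions, not the actual ASTRA pipeline:

```python
# Sketch: decide whether two tools can be chained by checking that one
# tool's declared outputs (plus user-suppliable values) can satisfy the
# required input parameters of the next tool. Schemas are hypothetical.

def can_chain(producer: dict, consumer: dict) -> bool:
    """True if every required input of `consumer` is either an output
    of `producer` or a value the user can supply directly."""
    produced = set(producer["outputs"])
    required = {
        name for name, spec in consumer["inputs"].items()
        if spec.get("required", False)
    }
    return required <= produced | set(consumer.get("user_supplied", []))

search_flights = {
    "name": "search_flights",
    "inputs": {"origin": {"required": True}, "date": {"required": True}},
    "user_supplied": ["origin", "date"],
    "outputs": ["flight_id", "price"],
}
book_flight = {
    "name": "book_flight",
    "inputs": {"flight_id": {"required": True}},
    "outputs": ["booking_ref"],
}

print(can_chain(search_flights, book_flight))  # → True
```

Walking such a dependency graph yields tool chains whose every call is parameter-satisfiable by construction, which is what makes the synthesized trajectories executable end to end.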
2. Automated Verifiable Environments Synthesis
- Key Feature: To support robust reinforcement learning, we synthesize fully verifiable environments implemented in Python. These environments are validated via sandboxed execution, providing multi-turn, step-wise verifiable training signals for reinforcement learning.
- Sample Data: ASTRA-RL-1k
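One way a Python environment can expose step-wise verifiable signals is to pair each state transition with a deterministic checker. A minimal sketch, with class and method names of our own invention rather than from the released data:

```python
# Sketch of a verifiable environment: each action mutates explicit state,
# and deterministic checks score every step and the final state, so an RL
# trainer can derive reliable rewards. All names here are illustrative.

class InventoryEnv:
    def __init__(self):
        self.stock = {"widget": 3}
        self.log = []

    def step(self, action: str, item: str, qty: int) -> bool:
        """Execute a tool call; return a deterministic success signal."""
        if action == "reserve" and self.stock.get(item, 0) >= qty:
            self.stock[item] -= qty
            self.log.append((action, item, qty))
            return True
        return False  # invalid call: deterministic negative signal

    def verify(self, expected_stock: dict) -> bool:
        """Final-state check: did the trajectory reach the goal state?"""
        return self.stock == expected_stock

env = InventoryEnv()
rewards = [env.step("reserve", "widget", 2), env.step("reserve", "widget", 5)]
print(rewards)                    # → [True, False]
print(env.verify({"widget": 1}))  # → True
```

Because every transition is re-executable and the checks are pure functions of state, the same environment can be replayed in a sandbox to validate it before it is used for training.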
Training Process
The model is trained in two sequential stages to enhance complex agentic decision-making:
Supervised Fine-Tuning (SFT)
Before RL, we cold-start the model with high-quality, multi-turn tool-use trajectories. This stage establishes strong behavioral priors for tool-calling formats, long-context understanding, and complex task planning, while ensuring tool diversity and improving coverage of real-world scenarios.
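The general shape of such a trajectory can be illustrated as below; the role names, tool-call layout, and `get_forecast` tool are assumptions for illustration, not the exact chat template used for ASTRA:

```python
# Illustrative shape of one multi-turn tool-use SFT trajectory.
trajectory = [
    {"role": "user", "content": "What's the weather in Paris tomorrow?"},
    {"role": "assistant", "tool_call": {"name": "get_forecast",
                                        "arguments": {"city": "Paris", "days": 1}}},
    {"role": "tool", "name": "get_forecast",
     "content": '{"high_c": 21, "rain": false}'},
    {"role": "assistant",
     "content": "Tomorrow in Paris: around 21 °C, no rain expected."},
]

# A cold-start SFT step typically computes loss only on assistant turns
# (tool calls and final answers), treating user and tool messages as context.
assistant_turns = [m for m in trajectory if m["role"] == "assistant"]
print(len(assistant_turns))  # → 2
```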
Reinforcement Learning (RL)
We then conduct multi-turn, tool-integrated Reinforcement Learning with Verifiable Rewards (RLVR). Training uses Adaptive Batch Filling to improve data utilization, and adopts batch-level token-loss averaging for more stable and efficient optimization. At each step, actions are executed in a code sandbox and deterministically verified to produce reliable rewards.
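Batch-level token-loss averaging, as opposed to averaging each sequence first, weights every token equally regardless of trajectory length. A small numeric sketch of the difference, using made-up per-token losses:

```python
# Two trajectories of very different length; per-token losses are made up.
seq_losses = [
    [2.0, 2.0],   # short trajectory: 2 tokens
    [1.0] * 8,    # long trajectory: 8 tokens
]

# Per-sequence averaging: each trajectory contributes equally to the loss,
# so tokens in the short one are implicitly over-weighted.
per_seq = sum(sum(s) / len(s) for s in seq_losses) / len(seq_losses)

# Batch-level token averaging: every token contributes equally.
all_tokens = [t for s in seq_losses for t in s]
batch_level = sum(all_tokens) / len(all_tokens)

print(per_seq)      # → 1.5
print(batch_level)  # → 1.2
```

With long multi-turn tool-use trajectories, the batch-level formulation avoids gradient spikes from short sequences, which is one plausible reason it stabilizes optimization here.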
Disclaimer
- Non-endorsement & liability disclaimer: The model is provided for research and educational purposes only. It does not reflect the views, interests, beliefs, or endorsements of any individual or organization, and should not be interpreted as making claims about any group. The project maintainers disclaim responsibility for any direct or indirect harm or damages arising from the use or misuse of the model or related resources.
Citation
@misc{astra2026,
  title={ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas},
  author={Beike Language and Intelligence (BLI)},
  year={2026},
}
Note: Although the model was trained with bf16 precision, verl saves checkpoints in float32 by default, and we did not change this setting.
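A practical consequence of the float32 default is that checkpoints on disk are roughly twice the bf16 size. Back-of-envelope arithmetic for a model of ~32B parameters (the parameter count is approximate):

```python
params = 32e9  # approximate parameter count of the 32B model

fp32_gb = params * 4 / 1e9  # 4 bytes per float32 parameter
bf16_gb = params * 2 / 1e9  # 2 bytes per bfloat16 parameter

print(fp32_gb)  # → 128.0
print(bf16_gb)  # → 64.0
```

Users who want bf16 weights in memory can cast after loading, e.g. `model.to(torch.bfloat16)` in PyTorch.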