
ASTRA-32B-Thinking-v1

Model Description

The ASTRA-32B-Thinking-v1 model is derived from Qwen3-32B and specifically optimized for multi-step, tool-augmented tasks, with enhanced agentic capabilities in complex tool use and structured reasoning.

We also provide a 14B variant ASTRA-14B-Thinking-v1.

Model Performances

ASTRA-32B-Thinking-v1 achieves state-of-the-art performance on the BFCL-V3 multi-turn subset among models of comparable scale.

Results on the BFCL-V3 multi-turn subset: [figure: BFCL-V3 multi-turn subset]

Data Curation

The training data is built upon two core pillars of automation:

1. Tool-Grounded SFT Data

  • Key Feature: We constructed an extensive tool pool from 1,585 MCP servers, encompassing 19,036 tools across 41 domains. The data pipeline analyzes schema-level dependencies to generate executable tool-chains, ensuring that the synthesized trajectories are realistic and parameter-satisfiable.
  • Sample Data: ASTRA-SFT-1k
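The schema-level dependency analysis above can be sketched roughly as follows: two tools are considered chainable when an output field of one satisfies a required input parameter of the next. This is an illustrative assumption about the pipeline; the function, schemas, and field names below are hypothetical, not the actual ASTRA implementation.

```python
def chainable(producer_schema: dict, consumer_schema: dict) -> list:
    """Return the required input parameters of the consumer tool that are
    directly satisfiable (by name and type) from the producer's outputs."""
    outputs = producer_schema.get("outputs", {})
    inputs = consumer_schema.get("inputs", {})
    required = consumer_schema.get("required", [])
    return [p for p in required if p in outputs and outputs[p] == inputs.get(p)]

# Illustrative schemas (not from the real tool pool):
geocode = {"outputs": {"lat": "number", "lon": "number"}}
weather = {"inputs": {"lat": "number", "lon": "number", "units": "string"},
           "required": ["lat", "lon"]}

# A non-empty result means geocode -> weather forms an executable chain
# whose parameters are satisfiable from upstream outputs.
print(chainable(geocode, weather))  # -> ['lat', 'lon']
```

Repeating such checks over the tool pool yields candidate tool-chains whose trajectories are parameter-satisfiable by construction.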

2. Automated Verifiable Environments Synthesis

  • Key Feature: To support robust reinforcement learning, we synthesize fully verifiable environments implemented in Python. These environments are validated via sandboxed execution, providing multi-turn, step-wise verifiable training signals for reinforcement learning.

  • Sample Data: ASTRA-RL-1k
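A minimal sketch of what "step-wise verifiable" can mean here, assuming a toy key-value task: the environment executes tool calls against internal state and a deterministic check compares the state to a goal to produce a reward. The class and tool names are illustrative, not the actual synthesized environments.

```python
class VerifiableEnv:
    """Toy verifiable environment: tool calls mutate state, and a
    deterministic goal check yields the reward signal."""

    def __init__(self, goal_state: dict):
        self.state = {}
        self.goal = goal_state

    def step(self, tool: str, args: dict) -> dict:
        # Execute a tool call against the environment state.
        if tool == "set":
            self.state[args["key"]] = args["value"]
            return {"ok": True}
        return {"ok": False, "error": f"unknown tool: {tool}"}

    def verify(self) -> float:
        # Deterministic check: reward 1.0 iff the goal state is reached.
        return 1.0 if self.state == self.goal else 0.0

env = VerifiableEnv(goal_state={"city": "Paris"})
env.step("set", {"key": "city", "value": "Paris"})
print(env.verify())  # -> 1.0
```

Because the check is deterministic and executed in a sandbox, the same trajectory always receives the same reward, which is what makes the signal usable for RL.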

Training Process

The model is trained in two sequential stages to enhance complex agentic decision-making:

  1. Supervised Fine-Tuning (SFT)
    Before RL, we cold-start the model with high-quality, multi-turn tool-use trajectories. This stage establishes strong behavioral priors for tool-calling formats, long-context understanding, and complex task planning, while ensuring tool diversity and improving coverage of real-world scenarios.

  2. Reinforcement Learning (RL)
    We then conduct multi-turn, tool-integrated Reinforcement Learning with Verifiable Rewards (RLVR). Training uses Adaptive Batch Filling to improve optimization stability and data utilization, and adopts batch-level token-loss averaging for more stable and efficient optimization. At each step, actions are executed in a code sandbox and deterministically verified to produce reliable rewards.
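Batch-level token-loss averaging can be illustrated as follows (this is a common interpretation of the term, assumed here, not a quote of the ASTRA training code): rather than averaging token losses within each sequence and then across sequences, the loss is summed over all tokens in the batch and divided by the total token count, so long multi-turn trajectories are not down-weighted relative to short ones.

```python
def per_sequence_average(token_losses):
    # Average within each sequence, then across sequences:
    # every trajectory contributes equally regardless of length.
    seq_means = [sum(seq) / len(seq) for seq in token_losses]
    return sum(seq_means) / len(seq_means)

def batch_level_average(token_losses):
    # Sum over all tokens in the batch, divide by total token count:
    # every token contributes equally regardless of which trajectory it is in.
    total = sum(sum(seq) for seq in token_losses)
    count = sum(len(seq) for seq in token_losses)
    return total / count

# A short trajectory (2 tokens) and a long one (6 tokens):
batch = [[1.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
print(per_sequence_average(batch))  # -> 1.5
print(batch_level_average(batch))   # -> 1.75  ((2 + 12) / 8)
```

The gap between the two numbers shows why the choice matters for multi-turn agentic data, where trajectory lengths vary widely.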

Disclaimer

  • Non-endorsement & liability disclaimer: The model is provided for research and educational purposes only. It does not reflect the views, interests, beliefs, or endorsements of any individual or organization, and should not be interpreted as making claims about any group. The project maintainers disclaim responsibility for any direct or indirect harm or damages arising from the use or misuse of the model or related resources.

Citation

@misc{astra2026,
  title={ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas},
  author={Beike Language and Intelligence (BLI)},
  year={2026},
}

Note: Although the model was trained in bf16 precision, verl saves checkpoints in float32 by default, and we did not change this setting; the released weights are therefore stored as float32.
