---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: reinforcement-learning
tags:
- agent
- reinforcement-learning
- long-horizon
- embodied-ai
- strategic-exploration
---

# SPARK: Strategic Policy-Aware Exploration via Dynamic Branching

This model is trained using the **SPARK** framework proposed in the paper:

**[SPARK: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning](https://huggingface.co/papers/2601.20209)**

📄 **Paper:** [arXiv:2601.20209](https://arxiv.org/abs/2601.20209)

## Overview

SPARK is a reinforcement learning framework that enables autonomous strategic exploration for long-horizon agentic tasks. Instead of exploring uniformly at every step, SPARK selectively branches at critical decision points using intrinsic `<explore>` signals, achieving superior performance with significantly fewer training samples.
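The branching idea above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the `find_branch_points` and `branch_rollouts` helpers, the trajectory format, and the exact `<explore>` action syntax are all assumptions for the sketch.

```python
# Hypothetical sketch of SPARK-style dynamic branching (not the released code).
# A rollout is a list of {"state", "action"} steps; when the policy tags a step
# with an intrinsic "<explore>" signal, extra rollouts are branched from that
# step's shared prefix instead of exploring uniformly at every step.

def find_branch_points(trajectory):
    """Return indices of steps the policy marked as critical via <explore>."""
    return [i for i, step in enumerate(trajectory) if "<explore>" in step["action"]]

def branch_rollouts(trajectory, n_branches=2):
    """Spawn n_branches continuations from each <explore>-tagged step.

    Each branch reuses the trajectory prefix up to and including the branch
    point, so those prefix tokens are shared rather than regenerated.
    """
    branches = []
    for i in find_branch_points(trajectory):
        prefix = trajectory[: i + 1]
        for _ in range(n_branches):
            branches.append(list(prefix))  # continuation would be sampled here
    return branches

trajectory = [
    {"state": "kitchen", "action": "go to shelf"},
    {"state": "shelf", "action": "<explore> pick up beaker"},  # critical step
    {"state": "holding beaker", "action": "go to sink"},
]
branches = branch_rollouts(trajectory, n_branches=2)
print(len(branches))     # 2 branches spawned from the single critical step
print(len(branches[0]))  # each branch starts from the shared 2-step prefix
```

In a real training loop the sampled continuations would be scored and used for policy updates; the sketch only shows where branching happens.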

## Key Features

- 🎯 **Autonomous Strategic Exploration**: dynamically identifies critical states for branching, without human priors
- ⚡ **Sample Efficient**: reaches 84.4% success with only 20% of the training data (vs. 76.6% for GRPO trained on 100%)
- 💰 **Token Efficient**: cuts token consumption by up to 47% through prefix sharing
- 📈 **Strong Generalization**: maintains 80.5% success on unseen tasks, significantly outperforming GRPO
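The prefix-sharing saving behind the token-efficiency number can be illustrated with simple accounting. The functions and the prefix/suffix lengths below are hypothetical and are not taken from the paper; they only show why sharing a common prompt prefix across branches reduces total tokens.

```python
# Illustrative token accounting for prefix sharing (numbers are made up).
# Without sharing, k branches from the same state each pay for the full prefix;
# with sharing, the prefix is paid for once.

def tokens_without_sharing(prefix_len, suffix_len, k):
    return k * (prefix_len + suffix_len)

def tokens_with_sharing(prefix_len, suffix_len, k):
    return prefix_len + k * suffix_len

prefix_len, suffix_len, k = 800, 200, 4
full = tokens_without_sharing(prefix_len, suffix_len, k)   # 4 * 1000 = 4000
shared = tokens_with_sharing(prefix_len, suffix_len, k)    # 800 + 800 = 1600
print(f"savings: {1 - shared / full:.0%}")                 # savings: 60%
```

The actual saving depends on how long the shared prefix is relative to each branch's continuation; the paper's reported figure is up to 47%.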

## Performance Highlights

| Benchmark | SPARK-1.5B | GPT-5 | Gemini-2.5-Pro |
|-----------|------------|-------|----------------|
| ALFWorld L2 | **80.5%** | 63.3% | 55.5% |
| ScienceWorld L2 | **49.2%** | 33.6% | 30.5% |
| WebShop | **75.8%** | 29.7% | 32.0% |

## Quickstart

The snippet below runs inference with 🤗 Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jinyang23/Spark-1.5B-ScienceWorld"

# Load the model and tokenizer; device_map="auto" places weights on GPU when available.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Calculate the sum of 123 and 456. Provide only the numerical answer."

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated completion is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Citation

If you use this model or the SPARK framework in your research, please cite:

```bibtex
@article{wu2026spark,
  title={SPARK: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning},
  author={Wu, Jinyang and Yang, Shuo and Yang, Changpeng and Shen, Yuhao and Zhang, Shuai and Wen, Zhengqi and Tao, Jianhua},
  journal={arXiv preprint arXiv:2601.20209},
  year={2026}
}
```

## Model Details

- **Base Model:** Qwen/Qwen2.5-1.5B-Instruct
- **Training Method:** SPARK (dynamic-branching RL)
- **Training Dataset:** ScienceWorld

## Links

- 📄 Paper: https://arxiv.org/abs/2601.20209
- 🤗 Paper page: https://huggingface.co/papers/2601.20209