---
license: mit
language:
- en
metrics:
- accuracy
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: reinforcement-learning
tags:
- agent
- reinforcement-learning
- long-horizon
- embodied-ai
- strategic-exploration
---

# SPARK: Strategic Policy-Aware Exploration via Dynamic Branching

This model is trained using the **SPARK** framework proposed in the paper:

**[SPARK: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning](https://huggingface.co/papers/2601.20209)**

📄 **Paper:** [arXiv:2601.20209](https://arxiv.org/abs/2601.20209)

## Overview

SPARK is a reinforcement learning framework that enables autonomous strategic exploration for long-horizon agentic tasks. Instead of exploring uniformly at every step, SPARK selectively branches at critical decision points using intrinsic `<explore>` signals, achieving superior performance with significantly fewer training samples.
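The branching idea above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the `find_branch_points` and `branch_rollouts` helpers, the trajectory format, and the exact `<explore>` action syntax are all assumptions for the sketch.

```python
# Hypothetical sketch of SPARK-style dynamic branching (not the released code).
# A rollout is a list of {"state", "action"} steps; when the policy tags a step
# with an intrinsic "<explore>" signal, extra rollouts are branched from that
# step's shared prefix instead of exploring uniformly at every step.

def find_branch_points(trajectory):
    """Return indices of steps the policy marked as critical via <explore>."""
    return [i for i, step in enumerate(trajectory) if "<explore>" in step["action"]]

def branch_rollouts(trajectory, n_branches=2):
    """Spawn n_branches continuations from each <explore>-tagged step.

    Each branch reuses the trajectory prefix up to and including the branch
    point, so those prefix tokens are shared rather than regenerated.
    """
    branches = []
    for i in find_branch_points(trajectory):
        prefix = trajectory[: i + 1]
        for _ in range(n_branches):
            branches.append(list(prefix))  # continuation would be sampled here
    return branches

trajectory = [
    {"state": "kitchen", "action": "go to shelf"},
    {"state": "shelf", "action": "<explore> pick up beaker"},  # critical step
    {"state": "holding beaker", "action": "go to sink"},
]
branches = branch_rollouts(trajectory, n_branches=2)
print(len(branches))     # 2 branches spawned from the single critical step
print(len(branches[0]))  # each branch starts from the shared 2-step prefix
```

In a real training loop the sampled continuations would be scored and used for policy updates; the sketch only shows where branching happens.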

## Key Features

- 🎯 **Autonomous Strategic Exploration**: dynamically identifies critical states for branching, without human priors
- ⚡ **Sample Efficient**: reaches 84.4% success with only 20% of the training data (vs. 76.6% for GRPO trained on 100%)
- 💰 **Token Efficient**: cuts token consumption by up to 47% through prefix sharing
- 📈 **Strong Generalization**: maintains 80.5% success on unseen tasks, significantly outperforming GRPO
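The prefix-sharing saving behind the token-efficiency number can be illustrated with simple accounting. The functions and the prefix/suffix lengths below are hypothetical and are not taken from the paper; they only show why sharing a common prompt prefix across branches reduces total tokens.

```python
# Illustrative token accounting for prefix sharing (numbers are made up).
# Without sharing, k branches from the same state each pay for the full prefix;
# with sharing, the prefix is paid for once.

def tokens_without_sharing(prefix_len, suffix_len, k):
    return k * (prefix_len + suffix_len)

def tokens_with_sharing(prefix_len, suffix_len, k):
    return prefix_len + k * suffix_len

prefix_len, suffix_len, k = 800, 200, 4
full = tokens_without_sharing(prefix_len, suffix_len, k)   # 4 * 1000 = 4000
shared = tokens_with_sharing(prefix_len, suffix_len, k)    # 800 + 800 = 1600
print(f"savings: {1 - shared / full:.0%}")                 # savings: 60%
```

The actual saving depends on how long the shared prefix is relative to each branch's continuation; the paper's reported figure is up to 47%.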

## Performance Highlights

| Benchmark | SPARK-1.5B | GPT-5 | Gemini-2.5-Pro |
|-----------|------------|-------|----------------|
| ALFWorld L2 | **80.5%** | 63.3% | 55.5% |
| ScienceWorld L2 | **49.2%** | 33.6% | 30.5% |
| WebShop | **75.8%** | 29.7% | 32.0% |

## Quickstart

The snippet below runs inference with 🤗 Transformers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Jinyang23/Spark-1.5B-ScienceWorld"

# Load the model and tokenizer; device_map="auto" places weights on GPU when available.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Calculate the sum of 123 and 456. Provide only the numerical answer."

messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens so only the newly generated completion is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Citation

If you use this model or the SPARK framework in your research, please cite:

```bibtex
@article{wu2026spark,
  title={SPARK: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning},
  author={Wu, Jinyang and Yang, Shuo and Yang, Changpeng and Shen, Yuhao and Zhang, Shuai and Wen, Zhengqi and Tao, Jianhua},
  journal={arXiv preprint arXiv:2601.20209},
  year={2026}
}
```

## Model Details

- **Base Model:** Qwen/Qwen2.5-1.5B-Instruct
- **Training Method:** SPARK (dynamic-branching RL)
- **Training Dataset:** ScienceWorld

## Links

- 📄 Paper: https://arxiv.org/abs/2601.20209
- 🤗 Paper page: https://huggingface.co/papers/2601.20209