Harryis/SCOUT_multitask

Reinforcement Learning

Harryis committed on Jan 31

Commit cb1ba2a · verified · 1 Parent(s): d6bb29c

Create README.md

Files changed (1)

README.md +54 -0

README.md ADDED

	@@ -0,0 +1,54 @@

+---
+license: apache-2.0
+base_model: Qwen/Qwen2.5-3B-Instruct
+tags:
+- reinforcement-learning
+- multi-task
+- scout
+- ppo
+---
+# SCOUT-Multitask Sequential RL Agent
+This repository contains the final checkpoint of the **SCOUT** (Sequential RL) framework. The model is based on **Qwen2.5-3B-Instruct** and has been trained sequentially across multiple environments, ending with the **Sudoku** task.
+## Model Description
+The SCOUT framework enables large language models to acquire new skills sequentially while maintaining performance on previously learned tasks. This checkpoint is the final stage of the training pipeline; in the authors' internal evaluation it reaches an average score of 0.91 across the five task groups listed under Experimental Results.
+- **Framework:** SCOUT (Sequential RL with Exploration & Distillation)
+- **Training Stage:** Final Checkpoint (+PPO on Sudoku)
+- **Base Model:** Qwen2.5-3B-Instruct
+## Experimental Results
+The following results demonstrate the model's performance across all environments after completing the full sequential training curriculum:
+| Task Group | Environment / Setting | Score |
+| :--- | :--- | :--- |
+| **Bandit** | General | **1.0** |
+| **FrozenLake** | Static / Slippery | **0.89 / 0.88** |
+| **Sokoban** | Box1 / Box2 | **0.95 / 0.59** |
+| **Rubik's Cube** | Rotation 1 / 2 / 3 | **1.0 / 1.0 / 0.89** |
+| **Sudoku** | General | **0.98** |
+| **Average** | **Overall Multi-task** | **0.91** |
+> *Data source: Internal evaluation metrics for the SCOUT sequential RL pipeline.*
+## Capability Highlights
+* **Zero Forgetting:** Maintains a perfect **1.0** score on the initial **Bandit** task even after sequential training on four subsequent complex environments.
+* **Logical Reasoning:** Scores highly on tasks with large state spaces, notably **Sudoku (0.98)** and **Rubik's Cube (mean ≈ 0.96 over Rotation 1 / 2 / 3)**.
+* **Robustness:** Stays close to its deterministic score in a stochastic environment: **FrozenLake Slippery (0.88)** versus FrozenLake Static (0.89).
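The averages quoted in the results table and in the highlights can be checked with a few lines of arithmetic. The sketch below copies the scores from the table and assumes the overall figure is an unweighted mean over the nine individual settings. That aggregation is an assumption (the model card does not say how the average was computed), and the variable names are illustrative only.

```python
# Scores copied from the Experimental Results table.
# Assumption: each environment setting counts once in the overall average.
scores = {
    "Bandit": [1.0],
    "FrozenLake": [0.89, 0.88],        # Static, Slippery
    "Sokoban": [0.95, 0.59],           # Box1, Box2
    "Rubik's Cube": [1.0, 1.0, 0.89],  # Rotation 1, 2, 3
    "Sudoku": [0.98],
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Mean over all nine settings (flattening the groups).
all_settings = [s for group in scores.values() for s in group]
overall_per_setting = mean(all_settings)

# Mean within each task group, and the mean of those group means.
group_means = {name: mean(vals) for name, vals in scores.items()}
overall_per_group = mean(list(group_means.values()))

print(f"per-setting mean: {overall_per_setting:.2f}")
print(f"per-group mean:   {overall_per_group:.2f}")
print(f"Rubik's Cube:     {group_means[chr(82) + 'ubik' + chr(39) + 's Cube']:.2f}")
```

Under this assumption the per-setting mean rounds to 0.91, matching the table, and the Rubik's Cube group mean rounds to 0.96, matching the highlight. Averaging the five group means instead gives 0.92, so the reported figure appears to weight every setting equally rather than every task group.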
+## Usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Hub repository ID of this checkpoint
+model_name = "Harryis/SCOUT_multitask"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+# device_map="auto" requires the `accelerate` package to be installed
+model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
+# Example: Prompt the model for a Sudoku move or Sokoban action