Harryis commited on
Commit
cb1ba2a
·
verified ·
1 Parent(s): d6bb29c

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +54 -0
README.md ADDED
@@ -0,0 +1,54 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model: Qwen/Qwen2.5-3B-Instruct
4
+ tags:
5
+ - reinforcement-learning
6
+ - multi-task
7
+ - scout
8
+ - ppo
9
+ ---
10
+
11
+ # SCOUT-Multitask Sequential RL Agent
12
+
13
+ This repository contains the final checkpoint of the **SCOUT** (Sequential RL) framework. The model is based on **Qwen2.5-3B-Instruct** and has been trained sequentially across multiple environments, ending with the **Sudoku** task.
14
+
15
+ ## Model Description
16
+
17
+ The SCOUT framework enables large language models to acquire new skills sequentially while maintaining performance on previously learned tasks. This specific model represents the culmination of the training pipeline, achieving state-of-the-art multi-task performance within the SCOUT benchmark.
18
+
19
+ - **Framework:** SCOUT (Sequential RL with Exploration & Distillation)
20
+ - **Training Stage:** Final Checkpoint (+PPO on Sudoku)
21
+ - **Base Model:** Qwen2.5-3B-It
22
+
23
+ ## Experimental Results
24
+
25
+ The following results demonstrate the model's performance across all environments after completing the full sequential training curriculum:
26
+
27
+ | Task Group | Environment / Setting | Score |
28
+ | :--- | :--- | :--- |
29
+ | **Bandit** | General | **1.0** |
30
+ | **FrozenLake** | Static / Slippery | **0.89 / 0.88** |
31
+ | **Sokoban** | Box1 / Box2 | **0.95 / 0.59** |
32
+ | **Rubiks' Cube** | Rotation 1 / 2 / 3 | **1.0 / 1.0 / 0.89** |
33
+ | **Sudoku** | General | **0.98** |
34
+ | **Average** | **Overall Multi-task** | **0.91** |
35
+
36
+ > *Data source: Internal evaluation metrics for the SCOUT sequential RL pipeline.*
37
+
38
+ ## Capability Highlights
39
+
40
+ * **Zero Forgetting:** Maintains a perfect **1.0** score on the initial **Bandit** task even after sequential training on four subsequent complex environments.
41
+ * **Logical Reasoning:** Shows exceptional proficiency in high-dimensional state spaces, particularly **Sudoku (0.98)** and **Rubiks' Cube (Average ~0.96)**.
42
+ * **Robustness:** Demonstrates strong performance in stochastic environments like **FrozenLake Slippery (0.88)**.
43
+
44
+ ## Usage
45
+
46
+ ```python
47
+ from transformers import AutoModelForCausalLM, AutoTokenizer
48
+
49
+ model_name = "Harryis/SCOUT_multitask"
50
+
51
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
52
+ model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
53
+
54
+ # Example: Prompt the model for a Sudoku move or Sokoban action