---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- 20-questions
- rl
- grpo
- starpo
- multi-turn
- information-seeking
- reinforcement-learning
- credit-assignment
---
# 20 Questions StarPO - Qwen3-1.7B

This model is a reinforcement learning fine-tuned version of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for the **20 Questions** task, trained with **StarPO** (a variant of GRPO adapted for multi-turn settings). Released as part of the paper *"Intrinsic Credit Assignment for Long Horizon Interaction"*.

The model plays the role of a **Questioner** in a game of 20 Questions: it asks yes/no questions to identify a hidden target word.

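To make the multi-turn setting concrete, the interaction can be sketched as a simple loop (an illustration only, not the paper's environment code; the `ask`/`oracle` interfaces and the toy oracle below are hypothetical):

```python
def play_20_questions(ask, oracle, max_turns=20):
    """Illustrative 20 Questions loop: `ask` produces the next question
    given the transcript so far; `oracle` answers "yes"/"no", or
    "correct" when the hidden word is guessed."""
    transcript = []
    for turn in range(max_turns):
        question = ask(transcript)
        answer = oracle(question)
        transcript.append((question, answer))
        if answer == "correct":
            return turn + 1, transcript  # solved after turn+1 questions
    return None, transcript  # not guessed within the turn limit

# Toy oracle for the hidden word "cat" (hypothetical, for demonstration):
def oracle(question):
    if "cat" in question.lower():
        return "correct"
    return "yes" if "animal" in question.lower() else "no"

questions = iter(["Is it an animal?", "Is it a cat?"])
turns, log = play_20_questions(lambda transcript: next(questions), oracle)
```

In training, the questioner policy is the fine-tuned model and the episode reward depends on whether the word is identified within the turn limit.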
## Training

- **Base model:** [Qwen3-1.7B-SFT](https://huggingface.co/Klingspor/Qwen3-1.7B-SFT) (SFT on Qwen3-1.7B)
- **Method:** StarPO (multi-turn GRPO / Group Relative Policy Optimization)
- **Training data:** 1,000 words from the COCA+ RL training set (no overlap with the test set)
- **Test set:** 433 held-out words
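The core of the group-relative update in GRPO-style methods can be sketched as follows (a minimal illustration, not the paper's implementation; the function name and the binary reward scheme are assumptions): for each secret word, a group of games is rolled out, and each trajectory's advantage is its reward normalized by the group's mean and standard deviation.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages (sketch): normalize each trajectory's
    reward by the mean and std of its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four games rolled out for the same secret word, with an
# assumed reward of 1.0 if the word was guessed within 20 questions
# and 0.0 otherwise. Successful games get positive advantage.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

This removes the need for a learned value baseline: trajectories are scored only relative to other rollouts for the same word.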

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Klingspor/StarPO-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```