Klingspor committed on
Commit be60040 · verified · 1 Parent(s): 3fd6902

Update README.md

Files changed (1): README.md +17 -2
README.md CHANGED
@@ -1,3 +1,18 @@
+---
+license: apache-2.0
+language:
+- en
+pipeline_tag: text-generation
+tags:
+- 20-questions
+- rl
+- grpo
+- starpo
+- multi-turn
+- information-seeking
+- reinforcement-learning
+- credit-assignment
+---
 # 20 Questions StarPO - Qwen3-1.7B
 
 This model is a reinforcement learning fine-tuned version of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for the **20 Questions** task, trained with **StarPO** (a variant of GRPO adapted for multi-turn settings). Released as part of the paper *"Intrinsic Credit Assignment for Long Horizon Interaction"*.
@@ -8,7 +23,7 @@ The model plays the role of a **Questioner** in a game of 20 Questions: it asks
 
 ## Training
 
-- **Base model:** [20q-sft-qwen3-1.7b](https://huggingface.co/bethgelab/20q-sft-qwen3-1.7b) (SFT on Qwen3-1.7B)
+- **Base model:** [Qwen3-1.7B-SFT](https://huggingface.co/Klingspor/Qwen3-1.7B-SFT) (SFT on Qwen3-1.7B)
 - **Method:** StarPO (multi-turn GRPO / Group Relative Policy Optimization)
 - **Training data:** 1,000 words from the COCA+ RL training set (no overlap with test set)
 - **Test set:** 433 held-out words
@@ -28,7 +43,7 @@ This model is intended for:
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 
-model_name = "bethgelab/20q-starpo-qwen3-1.7b"
+model_name = "Klingspor/StarPO-1.7B"
 tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModelForCausalLM.from_pretrained(model_name)
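The README describes the method only as "StarPO (multi-turn GRPO / Group Relative Policy Optimization)". As background, the group-relative step that GRPO-style methods build on can be sketched as below. This is an illustrative implementation of standard GRPO advantage normalization under my own assumptions, not code from this repository:

```python
# Sketch of GRPO-style group-relative advantages: for a group of rollouts
# on the same prompt (here, games on the same hidden word), each trajectory's
# advantage is its reward standardized against the group's mean and std.
def group_relative_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # eps guards against a zero std when all rollouts get the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical example: four games on one word; reward 1.0 = word guessed.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Trajectories that beat their group average get positive advantage and are reinforced; StarPO's multi-turn extension then assigns this signal across the turns of each game.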