---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- 20-questions
- rl
- grpo
- starpo
- multi-turn
- information-seeking
- reinforcement-learning
- credit-assignment
---
# 20 Questions StarPO - Qwen3-1.7B

This model is a reinforcement learning fine-tuned version of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) for the **20 Questions** task, trained with **StarPO** (a variant of GRPO adapted for multi-turn settings). Released as part of the paper *"Intrinsic Credit Assignment for Long Horizon Interaction"*.

The model plays the role of a **Questioner** in a game of 20 Questions: it asks yes/no questions to identify a hidden target word.

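To make the multi-turn setting concrete, the interaction can be sketched as a simple loop (an illustration only, not the paper's environment code; the `ask`/`oracle` interfaces and the toy oracle below are hypothetical):

```python
def play_20_questions(ask, oracle, max_turns=20):
    """Illustrative 20 Questions loop: `ask` produces the next question
    given the transcript so far; `oracle` answers "yes"/"no", or
    "correct" when the hidden word is guessed."""
    transcript = []
    for turn in range(max_turns):
        question = ask(transcript)
        answer = oracle(question)
        transcript.append((question, answer))
        if answer == "correct":
            return turn + 1, transcript  # solved after turn+1 questions
    return None, transcript  # not guessed within the turn limit

# Toy oracle for the hidden word "cat" (hypothetical, for demonstration):
def oracle(question):
    if "cat" in question.lower():
        return "correct"
    return "yes" if "animal" in question.lower() else "no"

questions = iter(["Is it an animal?", "Is it a cat?"])
turns, log = play_20_questions(lambda transcript: next(questions), oracle)
```

In training, the questioner policy is the fine-tuned model and the episode reward depends on whether the word is identified within the turn limit.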
## Training

- **Base model:** [Qwen3-1.7B-SFT](https://huggingface.co/Klingspor/Qwen3-1.7B-SFT) (SFT on Qwen3-1.7B)
- **Method:** StarPO (multi-turn GRPO / Group Relative Policy Optimization)
- **Training data:** 1,000 words from the COCA+ RL training set (no overlap with the test set)
- **Test set:** 433 held-out words
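The core of the group-relative update in GRPO-style methods can be sketched as follows (a minimal illustration, not the paper's implementation; the function name and the binary reward scheme are assumptions): for each secret word, a group of games is rolled out, and each trajectory's advantage is its reward normalized by the group's mean and standard deviation.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages (sketch): normalize each trajectory's
    reward by the mean and std of its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four games rolled out for the same secret word, with an
# assumed reward of 1.0 if the word was guessed within 20 questions
# and 0.0 otherwise. Successful games get positive advantage.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

This removes the need for a learned value baseline: trajectories are scored only relative to other rollouts for the same word.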

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Klingspor/StarPO-1.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```