---
library_name: transformers
license: other
base_model: Qwen/Qwen3.5-9B
language:
- en
tags:
- text-generation
- creative-writing
- long-form-generation
- reinforcement-learning
- grpo
- story-generation
- transformers
pipeline_tag: text-generation
datasets:
- rishanthrajendhran/POLARIS
extra_gated_prompt: "You agree not to share the model with others and not to use it for malicious purposes (e.g., attempting to extract the copyrighted human-written stories it was trained on)."
extra_gated_fields:
  Company: text
  Country: country
  Specific date: date_picker
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to use this model for non-commercial use ONLY: checkbox
---
# POLARIS-9B
POLARIS-9B is a 9B-parameter model for long-form English story generation, trained with a
reinforcement-learning recipe on top of Qwen3.5-9B. Despite training only on stories up to 4k
words, it maintains rubric quality on prompts requesting stories up to 12k words — 3× its
training length. In pairwise evaluation it ranks above all tested open-weight models on
EQ-Bench Creative Writing Elo, and a blinded human study finds it preferred to Qwen3.5-9B
(67.5% winrate) and on par with Qwen3.5-27B (51.2% winrate).
The training recipe — **POLARIS** (Policy Optimization with LLM-as-a-judge rewards and
Anchored-Reference Injection for Storywriting) — uses two core components: a frontier LLM
judge with a structured 16-dimension Story Quality rubric as the online reward, and
human-reference injection (HRI), where a teacher-forced human-written story is inserted into
each GRPO group as a high-reward anchor. The full training run costs approximately $500 in
compute and judge calls (4×A100 80GB, ~48 hours).
## Results
**Pairwise Elo (Gemini 3 Flash judge, dual-position)**
| Rank | Model | EQ-Bench Creative Elo |
|------|-------|----------------------|
| 1 | GPT-5.4 | 1911 |
| 2 | Claude Opus 4.6 | 1783 |
| **3** | **POLARIS-9B** | **1661** |
| 4 | Gemini 3.1 Pro | 1627 |
| 5 | Gemini 3 Flash | 1620 |
| 6 | Gemma 4 31B | 1514 |
| 7 | Qwen3.5-27B | 1503 |
| 9 | Qwen3.5-9B (base) | 1352 |
**Story Quality by requested length** (GPT-5.4 judge, 180 held-out prompts)
| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Length ratio (8–12k) |
|-------|-----------|-----------------|-----------------|----------------------|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 0.72 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 0.82 |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 0.88 |
| Gemma 4 31B | 53.9 | 49.7 | 47.1 | 0.36 |
Length ratio is generated / requested word count (1.0 = exact). Gemma 4 31B maintains quality
at long lengths by writing substantially shorter stories than requested; POLARIS-9B is among the few
open-weight models in our comparison that largely avoids quality collapse, length runaway, *and*
severe under-generation at far-transfer lengths.
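The length ratio defined above can be computed with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and counting whitespace-separated tokens is an assumed stand-in for whatever word counter the evaluation actually used.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (an assumed, simple word counter)."""
    return len(text.split())


def length_ratio(generated: str, requested_words: int) -> float:
    """Generated / requested word count; 1.0 means the target was met exactly."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    return word_count(generated) / requested_words


# Toy example: a 720-word "story" written for a 1000-word request.
toy_story = "word " * 720
print(round(length_ratio(toy_story, 1000), 2))  # 0.72
```

A ratio well below 1.0 signals under-generation, and a ratio well above 1.0 signals length runaway.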
**Human evaluation** (60 prompt–generation pairs, blinded, two annotators)
| Comparison | POLARIS-9B winrate | 95% CI |
|-----------|-------------------|--------|
| vs. Qwen3.5-9B | 67.5% | [55.0, 80.0] |
| vs. Qwen3.5-27B | 51.2% | [38.8, 58.8] |
Annotator comments most often highlight stronger atmosphere, voice, and scene realization
relative to the base model.
## Intended Use
- Long-form story generation (short stories, flash fiction, narrative scenes, etc.)
- Creative writing (essays, book reviews, podcast scripts, etc.)
POLARIS-9B is trained on short-story anthology data and transfers well to related narrative
tasks. Within WritingBench, it performs strongest on categories closest to its training
distribution: character design, fan fiction, novel manuscript, and podcast scripting.
## Out-of-Scope Use
- Factual or knowledge-intensive writing where correctness matters
- Legal, medical, or financial content
- Reproducing or recovering the withheld training stories
## Usage
POLARIS-9B uses extended thinking during generation. Enable thinking and provide an adequate
token budget for long stories.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    # Enable thinking (important for quality)
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

# Decode only the newly generated tokens (drop the prompt)
generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
# Strip the thinking trace and keep only the story. This assumes the trace ends with a
# literal "</think>" tag, as in Qwen3-style chat templates; if no tag is found, the
# full decoded text is kept.
story = generated.rpartition("</think>")[2].strip()
print(story)
```
## Recommended Generation Settings
These match the settings used in the paper's main evaluation.
| Setting | Value | Notes |
|---------|-------|-------|
| `temperature` | 0.4–1.0 | Lower temperature (0.4–0.6) is recommended for long-form story writing |
| `top_p` | 0.95 | |
| `top_k` | 20 | |
| `repetition_penalty` | 1.0–1.10 | |
| `presence_penalty` | 0.0–1.5 | Do not set `repetition_penalty` and `presence_penalty` together |
| `max_new_tokens` | 14336 | Minimum recommended for 8–12k target lengths |
| `enable_thinking` | True | Thinking traces are used at generation time |
Thinking tokens count toward `max_new_tokens` but are stripped before evaluation. If the
model produces very short stories, increasing `max_new_tokens` is usually the first thing to
try.
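The ranges in the settings table can be checked before launching a long generation. The sketch below is illustrative only: `check_sampling_settings` is a hypothetical helper, and it assumes that a `repetition_penalty` of 1.0 and a `presence_penalty` of 0.0 are the neutral "not set" values.

```python
def check_sampling_settings(settings: dict) -> list:
    """Return human-readable problems for settings outside the recommended ranges."""
    problems = []

    temperature = settings.get("temperature", 0.6)
    if not 0.4 <= temperature <= 1.0:
        problems.append(f"temperature {temperature} is outside 0.4-1.0")

    # Assumption: 1.0 and 0.0 are the neutral values, i.e. the penalty is inactive.
    rep = settings.get("repetition_penalty", 1.0)
    pres = settings.get("presence_penalty", 0.0)
    if not 1.0 <= rep <= 1.10:
        problems.append(f"repetition_penalty {rep} is outside 1.0-1.10")
    if not 0.0 <= pres <= 1.5:
        problems.append(f"presence_penalty {pres} is outside 0.0-1.5")
    if rep != 1.0 and pres != 0.0:
        problems.append("repetition_penalty and presence_penalty are both active")

    if not settings.get("enable_thinking", True):
        problems.append("enable_thinking is off")
    return problems


recommended = {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
               "repetition_penalty": 1.10, "enable_thinking": True}
risky = {"temperature": 1.3, "repetition_penalty": 1.05, "presence_penalty": 1.0}

print(check_sampling_settings(recommended))  # []
for problem in check_sampling_settings(risky):
    print("warning:", problem)
```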
## Prompting
Include an explicit length request in the prompt. POLARIS-9B was trained with
length-stratified prompts and uses the requested word count to calibrate output length. Example:
```
Write a 3000-word story about [premise].
```
At far-transfer lengths (8–12k words), the model undershoots somewhat (length ratio ≈ 0.72
aggregated across the far-OOD bucket). This is still much closer to the target than larger
open-weight models such as Gemma 4 31B, which writes only 0.36× the requested length in the same
bucket while appearing to maintain quality scores.
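A small helper can keep the length request explicit and indicate which evaluation bucket a request falls into. This is a sketch under assumptions: both function names are hypothetical, and the treatment of the bucket boundaries (4k counted as in-distribution, 8k as near OOD) is a guess, since the results table does not say which side the edges belong to.

```python
def build_prompt(premise: str, target_words: int) -> str:
    """Format a prompt with an explicit word-count request, following the template above."""
    return f"Write a {target_words}-word story about {premise}."


def length_bucket(target_words: int) -> str:
    """Map a requested length to the evaluation buckets (boundary handling is assumed)."""
    if target_words < 1000:
        return "below evaluated range"
    if target_words <= 4000:
        return "ID (1-4k)"
    if target_words <= 8000:
        return "near OOD (4-8k)"
    if target_words <= 12000:
        return "far OOD (8-12k)"
    return "beyond evaluated range"


prompt = build_prompt("a lighthouse keeper who receives letters from the sea", 3000)
print(prompt)
print(length_bucket(3000))
print(length_bucket(10000))
```

For far-OOD requests, expect output somewhat shorter than the stated target.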
## Known Limitations
**Stylistic overloading.** The model can push too hard on specificity, jargon, or figurative
density, making prose feel effortful to read even when individual sentences are well-crafted.
Annotators flagged this as a recurring pattern.
**Local coherence failures.** Contradicting details and confusing transitions may appear across
examples, particularly in longer stories. The narrative usually stays on track, but individual
passages may lose logical consistency.
**Length undershooting at far transfer.** On prompts requesting 8–12k words, the model
generates approximately 72% of the requested length on average. Quality is preserved relative
to other open-weight models, but the full length target is not reliably met.
**Story-writing distribution.** The training data is short-story anthology fiction (literary
realism, horror/gothic, sci-fi, regional/folk writing). Performance on non-narrative writing
categories (biography, essays, book reviews) is noticeably weaker.
**Single-seed training.** The reported checkpoint reflects one training run. Seed-to-seed
variance has not been characterized.
## Training
| Parameter | Value |
|-----------|-------|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps | 160 |
| Batch size | 8 GRPO groups |
| Group size | 6 (5 policy rollouts + 1 injected human reference) |
| Online reward judge | Gemini 3 Flash |
| Evaluation judge | GPT-5.4 |
The human-written stories used in training are derived from commercially purchased anthologies
and are not released. The associated prompt dataset is released separately.
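To illustrate how an injected human reference interacts with GRPO's group-relative advantages, the sketch below appends one reference reward to five policy rewards (matching the group size of 6 above) and normalizes within the group. It assumes the textbook GRPO normalization (subtract the group mean, divide by the group standard deviation); the paper's exact variant may differ. The function name and reward values are made up for illustration.

```python
from statistics import mean, pstdev


def group_advantages(policy_rewards, reference_reward, eps=1e-6):
    """Group-normalized advantages for policy rollouts plus one injected reference.

    Assumes standard GRPO normalization: (reward - group mean) / (group std + eps).
    The reference's advantage is the last element of the returned list.
    """
    rewards = list(policy_rewards) + [reference_reward]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up judge scores: five policy rollouts and one higher-scoring human reference.
policy_rewards = [30.0, 35.0, 40.0, 45.0, 50.0]
reference_reward = 70.0

advantages = group_advantages(policy_rewards, reference_reward)
for i, a in enumerate(advantages[:-1]):
    print(f"rollout {i}: advantage {a:+.2f}")
print(f"reference: advantage {advantages[-1]:+.2f}")
```

In this toy group the reference raises the group mean from 40 to 45, so a rollout needs a score above 45 rather than above 40 to receive a positive advantage.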
## Citation
```bibtex
@misc{rajendhran2026polarisguidingsmallmodels,
  title={POLARIS: Guiding Small Models to Write Long Stories},
  author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
  year={2026},
  eprint={2606.04095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.04095},
}
```