---
library_name: transformers
license: other
base_model: Qwen/Qwen3.5-9B
language:
- en
tags:
- text-generation
- creative-writing
- long-form-generation
- reinforcement-learning
- grpo
- story-generation
- transformers
pipeline_tag: text-generation
datasets:
- rishanthrajendhran/POLARIS
---

# POLARIS-no-HRI-9B

POLARIS-no-HRI-9B is the matched ablation variant of [POLARIS-9B](https://huggingface.co/rishanthrajendhran/POLARIS-9B).
It uses the same GRPO training recipe with the same structured Story Quality reward, identical
hyperparameters, and the same training data, but without human-reference injection (HRI).
Instead of 5 policy rollouts + 1 injected human-written story per group, it was trained with 6 policy
rollouts with no reference anchor.
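
To make the group-composition difference concrete, the sketch below assumes the standard GRPO group-relative advantage (each reward minus the group mean, divided by the group standard deviation). All reward values are made up for illustration, and how the injected human story enters the POLARIS loss is not specified in this card, so this only shows how a strong reference shifts the group baseline.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Population std and the eps value are assumptions for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards, not taken from any POLARIS run.
policy_rewards = [40.0, 45.0, 50.0, 55.0, 60.0]
sixth_rollout_reward = 52.0   # no-HRI: the sixth slot is another policy rollout
human_story_reward = 80.0     # HRI: the sixth slot is a human-written story

no_hri_group = policy_rewards + [sixth_rollout_reward]
hri_group = policy_rewards + [human_story_reward]

adv_no_hri = group_advantages(no_hri_group)
adv_hri = group_advantages(hri_group)

# In this example the strong reference raises the group mean, so the best
# policy rollout (reward 60) receives a smaller advantage under HRI.
print(round(adv_no_hri[4], 3), round(adv_hri[4], 3))
```

With these made-up numbers, the same rollout looks less impressive when a stronger reference sits in the group, which is one way to read the "reference anchor" described above.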

It is a strong creative-writing model in its own right, substantially better than the base
Qwen3.5-9B, but lags POLARIS-9B most noticeably at far-transfer lengths (8–12k words).

## Comparison with POLARIS-9B

The gap between this model and POLARIS-9B is small at in-distribution lengths and grows at
longer requested lengths, consistent with HRI's role in maintaining gradient pressure toward
stronger writing as generation extends beyond the training range.

**Story Quality by requested length** (GPT-5.4 judge, 180 held-out prompts)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate | Slope |
|-------|-----------|-----------------|-----------------|-----------|-------|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 52.1 | −3.0 |
| **POLARIS-no-HRI-9B** | **56.5** | **47.0** | **37.7** | **49.7** | **−3.8** |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 18.5 | −10.8 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 42.8 | −5.9 |

Slope is the linear fit across the six length buckets (points per step). A steeper negative
slope indicates faster quality degradation as requested length increases.
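
As a rough illustration of the slope column, the sketch below fits an ordinary least-squares line to six per-bucket scores, treating the bucket index (0 to 5) as the x-axis. The per-bucket scores are not published in this card, so the values here are hypothetical, and the function name `bucket_slope` is ours.

```python
def bucket_slope(scores):
    """Least-squares slope of score vs. bucket index (points per step)."""
    n = len(scores)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(scores) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Hypothetical scores for six length buckets, shortest to longest.
example_scores = [50.0, 48.0, 46.0, 44.0, 42.0, 40.0]
print(bucket_slope(example_scores))  # -2.0: quality drops 2 points per bucket
```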

**EQ-Bench Longform by requested length** (GPT-5.4 judge, uniform aggregation)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate |
|-------|-----------|-----------------|-----------------|-----------|
| POLARIS-9B | 63.1 | 57.5 | 54.3 | 59.8 |
| **POLARIS-no-HRI-9B** | **62.1** | **55.7** | **51.6** | **58.2** |
| Qwen3.5-9B (base) | 50.2 | 37.2 | 30.3 | 42.6 |

**Length adherence** (generated / requested word count)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | All |
|-------|-----------|-----------------|-----------------|-----|
| POLARIS-9B | 0.99 | 0.87 | 0.72 | 0.90 |
| **POLARIS-no-HRI-9B** | **0.94** | **0.86** | **0.70** | **0.87** |
| Qwen3.5-9B (base) | 1.09 | 0.96 | 0.88 | 1.01 |
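
The ratio above can be checked on your own generations with a few lines of Python. This is a minimal sketch that counts words by whitespace splitting, which is an assumption; the evaluation may have used a different word-count rule.

```python
def length_ratio(generated_text: str, requested_words: int) -> float:
    """Generated word count divided by requested word count."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    # Whitespace tokenization is an assumption of this sketch.
    return len(generated_text.split()) / requested_words


# A ratio below 1.0 means the story undershoots the requested length.
print(length_ratio("one two three four", 5))  # 0.8
```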

**OOD benchmarks**

| Model | WritingBench (D4) | LongBench-Write | EQ-Bench Creative |
|-------|------------------|-----------------|-------------------|
| POLARIS-9B | 7.9 | 81.2 | 70.3 |
| **POLARIS-no-HRI-9B** | **7.8** | **82.1** | **69.7** |
| Qwen3.5-9B (base) | 6.8 | 67.1 | 59.2 |

On OOD benchmarks the two variants are essentially tied; the HRI advantage is concentrated at
long requested lengths on the in-distribution story task, where narrative coherence and arc
completion must be sustained over many thousands of tokens.

## Intended Use

- Long-form story generation (short stories, flash fiction, narrative scenes)
- Creative writing (essays, book reviews, podcast scripts, etc.)

## Out-of-Scope Use

- Factual or knowledge-intensive writing where correctness matters
- Legal, medical, or financial content
- Reproducing or recovering the withheld training stories

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-no-HRI-9B"

# Load the tokenizer and model; dtype and device placement are picked automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# State the target length explicitly in the prompt.
prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)

# Render the chat template as text, with thinking mode enabled.
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample with the recommended settings (see the table below).
outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

# Decode only the newly generated tokens, skipping the prompt.
generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(generated)
```
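
With `enable_thinking=True`, the decoded output may contain a reasoning segment before the story. The helper below is a minimal sketch for separating the two, assuming the reasoning is closed by a literal `</think>` tag that survives decoding; whether that holds depends on the tokenizer configuration, and `split_thinking` is a hypothetical name, not part of `transformers`.

```python
def split_thinking(decoded: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded text into (reasoning, story) at the last closing tag.

    Assumes the tag appears literally in the decoded string. If it is
    absent, the whole text is treated as the story.
    """
    head, sep, tail = decoded.rpartition(close_tag)
    if not sep:
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, story = split_thinking("<think>Outline the arc.</think>\n\nThe archive was quiet.")
print(story)  # The archive was quiet.
```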

## Recommended Generation Settings

Identical to POLARIS-9B.

| Setting | Value | Notes |
|---------|-------|-------|
| `temperature` | 0.4-1.0 | Lower temperatures (0.4-0.6) recommended for long-form story writing |
| `top_p` | 0.95 | |
| `top_k` | 20 | |
| `repetition_penalty` | 1.0-1.10 | |
| `presence_penalty` | 0.0-1.5 | Do not set `repetition_penalty` and `presence_penalty` together |
| `max_new_tokens` | 14336 | Minimum recommended for 8–12k target lengths |
| `enable_thinking` | True | |
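
One way to keep these settings consistent across scripts is a small builder that rejects the combinations the table warns against. This is a sketch only: `build_sampling_kwargs` is a hypothetical helper, and which backend accepts which key (for example, whether your serving stack exposes `presence_penalty`) is an assumption to check against its documentation.

```python
def build_sampling_kwargs(
    temperature=0.6,
    repetition_penalty=None,
    presence_penalty=None,
    max_new_tokens=14336,
):
    """Return a dict of sampling settings within the recommended ranges."""
    if repetition_penalty is not None and presence_penalty is not None:
        raise ValueError("set repetition_penalty or presence_penalty, not both")
    if not 0.4 <= temperature <= 1.0:
        raise ValueError("temperature outside the recommended 0.4-1.0 range")
    kwargs = {
        "temperature": temperature,
        "top_p": 0.95,
        "top_k": 20,
        "max_new_tokens": max_new_tokens,
    }
    if repetition_penalty is not None:
        if not 1.0 <= repetition_penalty <= 1.10:
            raise ValueError("repetition_penalty outside the recommended 1.0-1.10 range")
        kwargs["repetition_penalty"] = repetition_penalty
    if presence_penalty is not None:
        if not 0.0 <= presence_penalty <= 1.5:
            raise ValueError("presence_penalty outside the recommended 0.0-1.5 range")
        kwargs["presence_penalty"] = presence_penalty
    return kwargs


settings = build_sampling_kwargs(temperature=0.5, repetition_penalty=1.05)
print(settings)
```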

## Prompting

Include an explicit length request in the prompt:

```
Write a 3000-word story about [premise].
```

At far-transfer lengths (8–12k), this model undershoots more than POLARIS-9B (length ratio
≈ 0.70 vs 0.72). For generation targets above 6k words, POLARIS-9B is the recommended variant.
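
The prompt template and the 6k-word guidance can be wrapped in two small helpers. This is a sketch: the function names are hypothetical, and the 6,000-word threshold simply restates the recommendation above.

```python
POLARIS_ID = "rishanthrajendhran/POLARIS-9B"
NO_HRI_ID = "rishanthrajendhran/POLARIS-no-HRI-9B"


def build_story_prompt(premise: str, target_words: int) -> str:
    """Format a prompt with an explicit length request."""
    return f"Write a {target_words}-word story about {premise}."


def pick_variant(target_words: int) -> str:
    """Prefer POLARIS-9B for targets above 6k words, per the guidance above."""
    return POLARIS_ID if target_words > 6000 else NO_HRI_ID


print(build_story_prompt("a lighthouse keeper who collects lost letters", 3000))
print(pick_variant(9000))  # rishanthrajendhran/POLARIS-9B
```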

## Known Limitations

The same qualitative failure modes present in POLARIS-9B apply here (stylistic overloading
and local coherence failures), since both models share the same base, training data, and reward.
The key additional limitation of this variant relative to POLARIS-9B:

**Steeper quality degradation at long lengths.** Story Quality slope is −3.8 vs −3.0 for
POLARIS-9B. At 8–12k words, the gap to POLARIS-9B is 6.4 Story Quality points, compared to
~1–2 points at in-distribution lengths. If your use case involves prompts requesting long
stories, POLARIS-9B is the better choice.

## Training

Identical to POLARIS-9B except for the group composition.

| Parameter | Value |
|-----------|-------|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps | 160 |
| Batch size | 8 GRPO groups |
| Group size | 6 policy rollouts (no human reference) |
| HRI | **Disabled** |
| Online reward judge | Gemini 3 Flash |
| Evaluation judge | GPT-5.4 |

## Citation

```bibtex
@misc{rajendhran2026polarisguidingsmallmodels,
      title={POLARIS: Guiding Small Models to Write Long Stories}, 
      author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
      year={2026},
      eprint={2606.04095},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.04095}, 
}
```