---
library_name: transformers
license: other
base_model: Qwen/Qwen3.5-9B
language:
- en
tags:
- text-generation
- creative-writing
- long-form-generation
- reinforcement-learning
- grpo
- story-generation
- transformers
pipeline_tag: text-generation
datasets:
- rishanthrajendhran/POLARIS
---

# POLARIS-no-HRI-9B

POLARIS-no-HRI-9B is the matched ablation variant of [POLARIS-9B](https://huggingface.co/rishanthrajendhran/POLARIS-9B).
It uses the same GRPO training recipe with the same structured Story Quality reward, identical
hyperparameters, and the same training data — but without human-reference injection (HRI).
Instead of 5 policy rollouts + 1 injected human-written story per group, it was trained with 6 policy
rollouts with no reference anchor.
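
To make the difference in group composition concrete, here is a minimal sketch of group-relative advantages in the standard GRPO form (reward minus group mean, divided by group standard deviation). The reward values are invented for illustration, and this card does not specify how the injected human story is scored or weighted in POLARIS-9B, so the HRI branch below is an assumption rather than the actual training code.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standard GRPO-style normalization: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards, not taken from the POLARIS training runs.
policy_rewards = [40.0, 42.0, 45.0, 47.0, 50.0, 52.0]  # no-HRI: 6 policy rollouts
human_reward = 80.0  # assumed score of a human-written reference story

no_hri = group_advantages(policy_rewards)
# HRI (assumed): 5 policy rollouts plus 1 human reference in the same group.
hri = group_advantages(policy_rewards[:5] + [human_reward])

print("no-HRI advantages:", [round(a, 2) for a in no_hri])
print("HRI advantages:   ", [round(a, 2) for a in hri])
```

In this toy setup the no-HRI group only ranks the policy's own samples against each other, while the strong reference in the HRI group raises the group mean that every policy rollout is compared against.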

It is a strong creative-writing model in its own right — substantially better than the base
Qwen3.5-9B — but lags POLARIS-9B most noticeably at far-transfer lengths (8–12k words).

## Comparison with POLARIS-9B

The gap between this model and POLARIS-9B is small at in-distribution lengths and grows at
longer requested lengths, consistent with HRI's role in maintaining gradient pressure toward
stronger writing as generation extends beyond the training range.

**Story Quality by requested length** (GPT-5.4 judge, 180 held-out prompts)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate | Slope |
|-------|-----------|-----------------|-----------------|-----------|-------|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 52.1 | −3.0 |
| **POLARIS-no-HRI-9B** | **56.5** | **47.0** | **37.7** | **49.7** | **−3.8** |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 18.5 | −10.8 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 42.8 | −5.9 |

Slope is the linear fit across the six length buckets (points per step). A steeper negative
slope indicates faster quality degradation as requested length increases.
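
As a rough illustration of how such a slope is obtained, the sketch below fits an ordinary least-squares line against the bucket index. The six per-bucket scores are not listed in this card, so the example uses the three coarse ranges from the table instead; `fit_slope` is a hypothetical helper name, and the resulting numbers are per coarse step and therefore not directly comparable to the reported Slope column.

```python
def fit_slope(scores):
    """Ordinary least-squares slope of scores against bucket index 0, 1, 2, ..."""
    n = len(scores)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(scores) / n
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, scores))
    var = sum((x - x_mean) ** 2 for x in xs)
    return cov / var


# Coarse three-range Story Quality scores from the table above
# (ID, Near OOD, Far OOD), not the six fine-grained buckets.
polaris = [57.4, 48.2, 44.1]
no_hri = [56.5, 47.0, 37.7]

print("POLARIS-9B coarse slope:       ", round(fit_slope(polaris), 2))
print("POLARIS-no-HRI-9B coarse slope:", round(fit_slope(no_hri), 2))
```

On the coarse ranges the no-HRI variant still shows the steeper decline, which matches the ordering of the reported slopes.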

**EQ-Bench Longform by requested length** (GPT-5.4 judge, uniform aggregation)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate |
|-------|-----------|-----------------|-----------------|-----------|
| POLARIS-9B | 63.1 | 57.5 | 54.3 | 59.8 |
| **POLARIS-no-HRI-9B** | **62.1** | **55.7** | **51.6** | **58.2** |
| Qwen3.5-9B (base) | 50.2 | 37.2 | 30.3 | 42.6 |

**Length adherence** (generated / requested word count)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | All |
|-------|-----------|-----------------|-----------------|-----|
| POLARIS-9B | 0.99 | 0.87 | 0.72 | 0.90 |
| **POLARIS-no-HRI-9B** | **0.94** | **0.86** | **0.70** | **0.87** |
| Qwen3.5-9B (base) | 1.09 | 0.96 | 0.88 | 1.01 |
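
The ratio above can be approximated for your own generations with a few lines of Python. This sketch assumes words are counted by whitespace splitting, which may differ from the counting used for the reported numbers; `length_ratio` is a hypothetical helper name.

```python
def length_ratio(generated_text, requested_words):
    """Generated / requested word count, using whitespace tokenization."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    return len(generated_text.split()) / requested_words


# Toy example: a 1,880-word output for a 2,000-word request.
story = "word " * 1880
print(length_ratio(story, 2000))
```

A ratio below 1.0 means the model undershot the requested length; above 1.0 means it overshot.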

**OOD benchmarks**

| Model | WritingBench (D4) | LongBench-Write | EQ-Bench Creative |
|-------|------------------|-----------------|-------------------|
| POLARIS-9B | 7.9 | 81.2 | 70.3 |
| **POLARIS-no-HRI-9B** | **7.8** | **82.1** | **69.7** |
| Qwen3.5-9B (base) | 6.8 | 67.1 | 59.2 |

On OOD benchmarks the two variants are essentially tied; the HRI advantage is concentrated at
the longest requested lengths of the in-domain story evaluation (8–12k words), where narrative
coherence and arc completion must be sustained over many thousands of tokens.

## Intended Use

- Long-form story generation (short stories, flash fiction, narrative scenes)
- Creative writing (essays, book reviews, podcast scripts, etc.)

## Out-of-Scope Use

- Factual or knowledge-intensive writing where correctness matters
- Legal, medical, or financial content
- Reproducing or recovering the withheld training stories

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-no-HRI-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
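# Tokenize the templated prompt and move it to the model's device.
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample with settings from the recommended ranges; raise max_new_tokens
# for longer targets (see "Recommended Generation Settings").
outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

# Decode only the newly generated tokens, skipping the prompt.
generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(generated)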
```
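
With `enable_thinking=True`, the decoded output may contain a reasoning trace before the story. The sketch below separates the two under the assumption that the model emits Qwen-style `<think>` / `</think>` tags and that they survive decoding; this card does not confirm either, so inspect a real generation first. `split_thinking` is a hypothetical helper name.

```python
def split_thinking(decoded, close_tag="</think>"):
    """Split decoded text into (reasoning, story) at the first closing tag.

    Assumes Qwen-style thinking tags; returns an empty reasoning string
    when no closing tag is present.
    """
    head, sep, tail = decoded.partition(close_tag)
    if not sep:
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


reasoning, story = split_thinking("<think>plan the arc</think>\n\nOnce upon a time.")
print(story)
```

Applied to the example above, this would be called as `split_thinking(generated)`.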

## Recommended Generation Settings

Identical to POLARIS-9B.

| Setting | Value | Notes |
|---------|-------|-------|
| `temperature` | 0.4–1.0 | Lower temperatures (0.4–0.6) recommended for long-form story writing |
| `top_p` | 0.95 | |
| `top_k` | 20 | |
| `repetition_penalty` | 1.0–1.10 | |
| `presence_penalty` | 0.0–1.5 | Do not set `repetition_penalty` and `presence_penalty` together |
| `max_new_tokens` | 14336 | Minimum recommended for 8–12k target lengths |
| `enable_thinking` | True | |
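
One way to keep these settings consistent in code is a small builder that validates the ranges and rejects setting both penalties at once. This is a sketch only: `sampling_settings` is a hypothetical helper, and whether a given inference backend accepts `presence_penalty` depends on that backend, so check its documentation before passing the dict through.

```python
def sampling_settings(temperature=0.6, repetition_penalty=None,
                      presence_penalty=None, max_new_tokens=14336):
    """Build a settings dict from the table above, rejecting combined penalties."""
    if not 0.4 <= temperature <= 1.0:
        raise ValueError("temperature outside the recommended 0.4-1.0 range")
    if repetition_penalty is not None and presence_penalty is not None:
        raise ValueError("set repetition_penalty or presence_penalty, not both")

    settings = {
        "temperature": temperature,
        "top_p": 0.95,
        "top_k": 20,
        "max_new_tokens": max_new_tokens,
    }
    if repetition_penalty is not None:
        if not 1.0 <= repetition_penalty <= 1.10:
            raise ValueError("repetition_penalty outside the 1.0-1.10 range")
        settings["repetition_penalty"] = repetition_penalty
    if presence_penalty is not None:
        if not 0.0 <= presence_penalty <= 1.5:
            raise ValueError("presence_penalty outside the 0.0-1.5 range")
        settings["presence_penalty"] = presence_penalty
    return settings


print(sampling_settings(temperature=0.5, repetition_penalty=1.05))
```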

## Prompting

Include an explicit length request in the prompt:

```
Write a 3000-word story about [premise].
```

At far-transfer lengths (8–12k), this model undershoots more than POLARIS-9B (length ratio
≈ 0.70 vs 0.72). For generation targets above 6k words, POLARIS-9B is the recommended variant.
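
The prompting advice above can be wrapped in two small helpers. This is an illustrative sketch: `build_prompt` and `pick_variant` are hypothetical names, the 6,000-word threshold is taken from the recommendation above, and the POLARIS-9B repository id is inferred from the link at the top of this card.

```python
POLARIS = "rishanthrajendhran/POLARIS-9B"
POLARIS_NO_HRI = "rishanthrajendhran/POLARIS-no-HRI-9B"


def build_prompt(premise, words):
    """Format a story request with an explicit target length."""
    return f"Write a {words}-word story about {premise}."


def pick_variant(words):
    """Return POLARIS-9B above 6k requested words, this model otherwise."""
    return POLARIS if words > 6000 else POLARIS_NO_HRI


print(build_prompt("a lighthouse keeper who maps the fog", 3000))
print(pick_variant(9000))
```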

## Known Limitations

The same qualitative failure modes present in POLARIS-9B apply here — stylistic overloading
and local coherence failures — since both models share the same base, training data, and reward.
The key additional limitation of this variant relative to POLARIS-9B:

**Steeper quality degradation at long lengths.** Story Quality slope is −3.8 vs −3.0 for
POLARIS-9B. At 8–12k words, the gap to POLARIS-9B is 6.4 Story Quality points, compared to
0.9 points at in-distribution lengths (1–4k) and 1.2 points at 4–8k. If your use case involves
prompts requesting long stories, POLARIS-9B is the better choice.
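
The per-range gaps can be recomputed directly from the Story Quality table in "Comparison with POLARIS-9B". The snippet below only restates those published numbers; the dictionary layout is an illustrative choice.

```python
# Story Quality scores copied from the comparison table in this card.
story_quality = {
    "POLARIS-9B": {"ID": 57.4, "Near OOD": 48.2, "Far OOD": 44.1},
    "POLARIS-no-HRI-9B": {"ID": 56.5, "Near OOD": 47.0, "Far OOD": 37.7},
}

# Gap = POLARIS-9B score minus POLARIS-no-HRI-9B score, per length range.
gaps = {
    bucket: round(
        story_quality["POLARIS-9B"][bucket] - story_quality["POLARIS-no-HRI-9B"][bucket], 1
    )
    for bucket in story_quality["POLARIS-9B"]
}

for bucket, gap in gaps.items():
    print(f"{bucket}: {gap} points")
```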

## Training

Identical to POLARIS-9B except for the group composition.

| Parameter | Value |
|-----------|-------|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps | 160 |
| Batch size | 8 GRPO groups |
| Group size | 6 policy rollouts (no human reference) |
| HRI | **Disabled** |
| Online reward judge | Gemini 3 Flash |
| Evaluation judge | GPT-5.4 |

## Citation

```bibtex
@misc{rajendhran2026polarisguidingsmallmodels,
      title={POLARIS: Guiding Small Models to Write Long Stories}, 
      author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
      year={2026},
      eprint={2606.04095},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.04095}, 
}
```