---
library_name: transformers
license: other
base_model: Qwen/Qwen3.5-9B
language:
- en
tags:
- text-generation
- creative-writing
- long-form-generation
- reinforcement-learning
- grpo
- story-generation
- transformers
pipeline_tag: text-generation
datasets:
- rishanthrajendhran/POLARIS
---

# POLARIS-no-HRI-9B

POLARIS-no-HRI-9B is the matched ablation variant of [POLARIS-9B](https://huggingface.co/rishanthrajendhran/POLARIS-9B). It uses the same GRPO training recipe, with the same structured Story Quality reward, identical hyperparameters, and the same training data, but without human-reference injection (HRI). Instead of 5 policy rollouts plus 1 injected human-written story per group, it was trained with 6 policy rollouts and no reference anchor.

It is a strong creative-writing model in its own right, substantially better than the base Qwen3.5-9B, but it lags POLARIS-9B most noticeably at far-transfer lengths (8–12k words).

## Comparison with POLARIS-9B

The gap between this model and POLARIS-9B is small at in-distribution lengths and grows at longer requested lengths, consistent with HRI's role in maintaining gradient pressure toward stronger writing as generation extends beyond the training range.

**Story Quality by requested length** (GPT-5.4 judge, 180 held-out prompts)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate | Slope |
|-------|-----------|-----------------|-----------------|-----------|-------|
| POLARIS-9B | 57.4 | 48.2 | 44.1 | 52.1 | −3.0 |
| **POLARIS-no-HRI-9B** | **56.5** | **47.0** | **37.7** | **49.7** | **−3.8** |
| Qwen3.5-9B (base) | 35.1 | 8.7 | −11.8 | 18.5 | −10.8 |
| Qwen3.5-27B | 51.5 | 38.7 | 24.6 | 42.8 | −5.9 |

Slope is the linear fit across the six length buckets (points per step). A steeper negative slope indicates faster quality degradation as requested length increases.

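As a rough illustration of how a slope in "points per step" can be obtained, the sketch below fits an ordinary least-squares line to per-bucket scores against the bucket index. The function name `bucket_slope`, the equal spacing of buckets, and the example scores are assumptions made for illustration; they are not the evaluation code or the per-bucket results behind the table.

```python
def bucket_slope(scores):
    """Least-squares slope of score against bucket index (0, 1, 2, ...).

    Assumes buckets are equally spaced, so the result is in points per bucket step.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two buckets to fit a slope")
    x_mean = (n - 1) / 2
    y_mean = sum(scores) / n
    # Covariance of (index, score) divided by variance of the index.
    cov = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(scores))
    var = sum((x - x_mean) ** 2 for x in range(n))
    return cov / var


# Hypothetical scores for six length buckets (NOT values from the evaluation).
example_scores = [60.0, 57.0, 54.0, 51.0, 48.0, 45.0]
print(bucket_slope(example_scores))  # -3.0
```

A slope of −3.0 in this toy example means the score drops by three points for each step to the next length bucket.
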
**EQ-Bench Longform by requested length** (GPT-5.4 judge, uniform aggregation)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | Aggregate |
|-------|-----------|-----------------|-----------------|-----------|
| POLARIS-9B | 63.1 | 57.5 | 54.3 | 59.8 |
| **POLARIS-no-HRI-9B** | **62.1** | **55.7** | **51.6** | **58.2** |
| Qwen3.5-9B (base) | 50.2 | 37.2 | 30.3 | 42.6 |

**Length adherence** (generated / requested word count)

| Model | ID (1–4k) | Near OOD (4–8k) | Far OOD (8–12k) | All |
|-------|-----------|-----------------|-----------------|-----|
| POLARIS-9B | 0.99 | 0.87 | 0.72 | 0.90 |
| **POLARIS-no-HRI-9B** | **0.94** | **0.86** | **0.70** | **0.87** |
| Qwen3.5-9B (base) | 1.09 | 0.96 | 0.88 | 1.01 |

**OOD benchmarks**

| Model | WritingBench (D4) | LongBench-Write | EQ-Bench Creative |
|-------|------------------|-----------------|-------------------|
| POLARIS-9B | 7.9 | 81.2 | 70.3 |
| **POLARIS-no-HRI-9B** | **7.8** | **82.1** | **69.7** |
| Qwen3.5-9B (base) | 6.8 | 67.1 | 59.2 |

On OOD benchmarks the two variants are essentially tied; the HRI advantage is concentrated on the in-distribution story-generation task at long requested lengths, where narrative coherence and arc completion must hold over many thousands of tokens.

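The length-adherence figures above are ratios of generated to requested word count. The sketch below shows one way such a ratio could be computed for your own generations. Counting words by whitespace splitting and averaging per-sample ratios are assumptions made here; the card does not state how words were counted or how ratios were aggregated, and the helper names are hypothetical.

```python
def length_ratio(generated_text, requested_words):
    """Generated / requested word count for one sample (whitespace word count)."""
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    return len(generated_text.split()) / requested_words


def mean_length_ratio(samples):
    """Average per-sample ratio over (generated_text, requested_words) pairs."""
    ratios = [length_ratio(text, requested) for text, requested in samples]
    return sum(ratios) / len(ratios)


# A four-word output for a five-word request gives a ratio of 0.8.
print(length_ratio("one two three four", 5))  # 0.8
```

A ratio below 1.0 means the output was shorter than requested; a ratio above 1.0 means it overshot.
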
## Intended Use

- Long-form story generation (short stories, flash fiction, narrative scenes)
- Creative writing (essays, book reviews, podcast scripts, etc.)

## Out-of-Scope Use

- Factual or knowledge-intensive writing where correctness matters
- Legal, medical, or financial content
- Reproducing or recovering the withheld training stories

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rishanthrajendhran/POLARIS-no-HRI-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# State the target length explicitly in the prompt.
prompt = (
    "Write a 2000-word story about an archivist who discovers that missing "
    "library books are returning with handwritten notes from the future."
)
messages = [{"role": "user", "content": prompt}]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.10,
)

# Decode only the newly generated tokens, skipping the prompt.
generated = tokenizer.decode(
    outputs[0][inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)
print(generated)
```

## Recommended Generation Settings

Identical to POLARIS-9B.

| Setting | Value | Notes |
|---------|-------|-------|
| `temperature` | 0.4–1.0 | Lower temperatures (0.4–0.6) recommended for long-form story writing |
| `top_p` | 0.95 | |
| `top_k` | 20 | |
| `repetition_penalty` | 1.0–1.10 | |
| `presence_penalty` | 0.0–1.5 | Do not set `repetition_penalty` and `presence_penalty` together |
| `max_new_tokens` | 14336 | Minimum recommended for 8–12k target lengths |
| `enable_thinking` | True | |

## Prompting

Include an explicit length request in the prompt:

```
Write a 3000-word story about [premise].
```

At far-transfer lengths (8–12k words), this model undershoots more than POLARIS-9B (length ratio ≈ 0.70 vs. 0.72).
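The prompt template and the far-transfer length ratios quoted above can be combined into a small helper. This is a minimal sketch: `build_prompt` and `expected_words` are hypothetical names, and applying an aggregate length ratio to a single request gives only a rough estimate of how long one output will be.

```python
def build_prompt(premise, target_words):
    """Format a story request with an explicit length, following the template above."""
    return f"Write a {target_words}-word story about {premise}."


def expected_words(target_words, length_ratio):
    """Rough expected output length, given an average generated/requested ratio."""
    return round(target_words * length_ratio)


print(build_prompt("a lighthouse keeper who collects lost letters", 3000))

# For a 10,000-word request, the far-transfer ratios quoted above suggest:
print(expected_words(10_000, 0.70))  # 7000 (this model)
print(expected_words(10_000, 0.72))  # 7200 (POLARIS-9B)
```

If a generation comes back well short of the requested length, check that `max_new_tokens` was not the limiting factor before attributing the shortfall to the model.
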
For generation targets above 6k words, POLARIS-9B is the recommended variant.

## Known Limitations

The same qualitative failure modes present in POLARIS-9B apply here (stylistic overloading and local coherence failures), since both models share the same base, training data, and reward. The key additional limitation of this variant relative to POLARIS-9B:

**Steeper quality degradation at long lengths.** Story Quality slope is −3.8 vs. −3.0 for POLARIS-9B. At 8–12k words, the gap to POLARIS-9B is 6.4 Story Quality points, compared to ~1–2 points at in-distribution lengths. If your use case involves prompts requesting long stories, POLARIS-9B is the better choice.

## Training

Identical to POLARIS-9B except for the group composition.

| Parameter | Value |
|-----------|-------|
| Base model | Qwen3.5-9B |
| Training algorithm | GRPO |
| Training data | ~1,388 prompt–story pairs from 100 short-story anthologies |
| Max reference length | 4,000 words |
| GPUs | 4× A100 80GB |
| Training time | ~48 hours |
| Compute cost | ~$400 |
| Judge cost | ~$60 (Gemini 3 Flash, flex tier) |
| Training steps | 160 |
| Batch size | 8 GRPO groups |
| Group size | 6 policy rollouts (no human reference) |
| HRI | **Disabled** |
| Online reward judge | Gemini 3 Flash |
| Evaluation judge | GPT-5.4 |

## Citation

```bibtex
@misc{rajendhran2026polarisguidingsmallmodels,
  title={POLARIS: Guiding Small Models to Write Long Stories},
  author={Rishanth Rajendhran and Jenna Russell and Mohit Iyyer and John Frederick Wieting},
  year={2026},
  eprint={2606.04095},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2606.04095},
}
```