---
license: apache-2.0
language: [en]
library_name: transformers
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
tags: [code, reasoning, think, local, npc, laconic]
pipeline_tag: text-generation
---
# NPC Coder 1.5B
A local-first coding agent with visible `<think>` reasoning, a laconic
senior-engineer voice, and an honest-failure character (it flags uncertainty
instead of inventing APIs). Built on Qwen2.5-Coder-1.5B-Instruct. Runs on a
laptop in GGUF.
## What it is
- Visible step-by-step reasoning in `<think>` blocks before answering
- Terse, here's-the-fix answers (no filler)
- Admits uncertainty on hard or obscure problems rather than hallucinating
- Stable NPC identity (does not claim to be Qwen)
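Because the reasoning arrives inline, downstream code usually needs to separate it from the final answer. The following is a minimal sketch, assuming the model emits a single `<think>...</think>` pair; `split_think` is a hypothetical helper name, not something shipped with the model.

```python
import re

# Assumption: one <think>...</think> pair per reply, as described above.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(response: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is "" if no complete block exists."""
    match = THINK_RE.search(response)
    if match is None:
        # No closed <think> block: treat the whole reply as the answer.
        return "", response.strip()
    reasoning = match.group(1).strip()
    # Everything outside the matched block is the user-facing answer.
    answer = (response[:match.start()] + response[match.end():]).strip()
    return reasoning, answer


reply = "<think>Off-by-one: range stops early.</think>\nUse `range(n + 1)`."
reasoning, answer = split_think(reply)
print(reasoning)
print(answer)
```

A UI can show `answer` by default and keep `reasoning` behind a toggle.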
## Honest capability framing
This is a 1.5B model. It handles easy-to-medium coding and debugging
competently and reasons visibly about them. It is NOT an olympiad-level
solver — on genuinely hard algorithmic problems the reasoning can be
incomplete, and the model is trained to SAY so rather than emit
confident-but-wrong solutions. Treat it as a fast local assistant for everyday
coding, not a replacement for a frontier model on hard problems.
It can still be overconfident on obscure *factual trivia* (exact default
arguments, precise version numbers) — the honest-failure training mitigates
but does not eliminate this at 1.5B. Verify specifics against the docs.

**Benchmark:** HumanEval (instruct, pass@1, greedy): **65.9%**, measured with
the `humaneval_instruct` task in `lm-eval-harness`. The personality fine-tune
slightly *improved* the extractable-code rate over the reasoning-only stage,
because terser answers parse more cleanly.
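With greedy decoding there is one completion per problem, so pass@1 reduces to the fraction of problems whose single completion passes its tests. The sketch below illustrates that arithmetic. HumanEval has 164 problems, and the count of 108 solved is an inference from the reported 65.9%, not a figure stated in this card.

```python
def pass_at_1_greedy(passed: list[bool]) -> float:
    """Fraction of problems solved when each problem gets exactly one sample."""
    if not passed:
        raise ValueError("no results to score")
    return sum(passed) / len(passed)


HUMANEVAL_PROBLEMS = 164
solved = 108  # assumption: inferred from the reported percentage

# Illustrative per-problem outcomes; the real ones come from the eval harness.
results = [True] * solved + [False] * (HUMANEVAL_PROBLEMS - solved)
score = pass_at_1_greedy(results)
print(f"{score:.1%}")
```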
## Personality behavior (held-out eval, 200 prompts)
| behavior | result |
|---|---|
| Correct NPC identity when asked | 100% |
| Unprompted identity mention on neutral coding prompts (over-emission, lower is better) | 2.5% |
| Denies being Qwen / wrong maker | 100% |
| Flags uncertainty on unknown/obscure APIs | 100% |
## Training
- **Stage 1 — reasoning:** SFT on `open-r1/codeforces-cots` (decontaminated
Python subsets, fit-filtered to ≤8192 tokens so every `<think>` trace is
complete; the filter biases toward shorter, laconic traces). 15k traces.
- **Stage 2 — voice + identity + honest-failure:** SFT with a 7k-example
personality set (gated identity, a large anti-over-emission cohort, an
honest-failure cohort, and a 1k anti-forgetting buffer of Stage-1 reasoning
data). LoRA, gentle LR, both stages merged.
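The Stage 1 fit-filter can be sketched roughly as follows. This is an illustration under assumptions, not the actual preprocessing script: `whitespace_token_count` is a stand-in for the real tokenizer, and the explicit check for a closing `</think>` tag is an added safeguard that the description above does not mention.

```python
from typing import Callable, Iterable

MAX_TOKENS = 8192  # limit stated in the Stage 1 description


def whitespace_token_count(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated pieces."""
    return len(text.split())


def fit_filter(
    traces: Iterable[str],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TOKENS,
) -> list[str]:
    """Keep traces that fit the token budget and contain a closed <think> block."""
    kept = []
    for trace in traces:
        has_closed_think = "<think>" in trace and "</think>" in trace
        if has_closed_think and count_tokens(trace) <= max_tokens:
            kept.append(trace)
    return kept


samples = [
    "<think>short trace</think> answer",
    "<think>" + "step " * 9000 + "</think> answer",  # over the budget
    "<think>cut off mid-reasoning",                   # never closes
]
kept = fit_filter(samples)
print(len(kept))
```

In practice `count_tokens` would wrap the base model's tokenizer so the budget matches what training actually sees.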

The model is released under Apache 2.0. The reasoning data,
`open-r1/codeforces-cots`, is CC-BY-4.0 / ODC-By and is attributed below.
## Local use
GGUF quants: **q4_k_m (~941 MB, laptop default)**, q5_k_m (~1.1 GB), q8_0
(~1.6 GB), f16 (~3.1 GB). At q4_k_m, ~7 tok/s on CPU. Uses the standard ChatML
(`<|im_start|>` / `<|im_end|>`) template.
If q4_k_m is not coherent enough on the edge cases you care about, q5_k_m is a
cleaner default.
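For runtimes that take a raw prompt string, the ChatML layout can be built by hand. The sketch below is a minimal illustration: `build_chatml_prompt` is a hypothetical helper, and no system prompt is included because this card does not specify one.

```python
def build_chatml_prompt(messages: list[dict[str, str]]) -> str:
    """Render role/content messages as ChatML and open the assistant turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt(
    [{"role": "user", "content": "Why does my loop skip the last item?"}]
)
print(prompt)
```

Many GGUF runtimes apply the chat template stored in the model file automatically, so manual formatting like this is mainly useful for plain completion endpoints. Stopping generation at `<|im_end|>` keeps the reply to a single turn.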
## Attribution & author
Reasoning data: `open-r1/codeforces-cots` (HuggingFace Open-R1), CC-BY-4.0.
Base model: `Qwen/Qwen2.5-Coder-1.5B-Instruct`, Apache 2.0.
Author: Rama Krishna Bachu / Bottensor (Independent Research). ORCID 0009-0000-1298-0681.