---
license: cc-by-nc-4.0
base_model: sapientinc/HRM-Text-1B
library_name: transformers
pipeline_tag: text-generation
language:
- en
tags:
- code
- code-generation
- hrm
- hierarchical-reasoning
- prefix-lm
---

# HRM-Text-1B-code — a code expert (SFT)

Full-parameter SFT of [`sapientinc/HRM-Text-1B`](https://huggingface.co/sapientinc/HRM-Text-1B) for
**Python code generation**, trained in the model's **`synth,cot` (reasoning) condition** lane. It takes
a base that essentially couldn't code (HumanEval 1.2%) and teaches it to code from just **~25k**
instruction→code SFT examples.

Built as the second expert in a **skill-composition experiment** (can an HRM tool expert + code expert
*merge* into one model?). Full writeup + code: **https://github.com/jasoncarreira/hrm-text-agent**.
Companions: [`hrm-text-agent`](https://huggingface.co/jasoncarreira/hrm-text-agent) (tools),
[`hrm-text-agent-v2`](https://huggingface.co/jasoncarreira/hrm-text-agent-v2) (tools, scaled).

## Scores (pass@1)

| Bench | Base | **This model** |
|---|---|---|
| HumanEval | 1.2% (2/164) | **11.0% (18/164)** |
| MBPP | 2.3% (6/257) | **16.7% (43/257)** |
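
The percentages in the table follow directly from the pass counts. As a quick check, a minimal sketch (the helper name `pass_at_1` is hypothetical and not part of the repo harness):

```python
def pass_at_1(passed: int, total: int) -> float:
    """Return pass@1 as a percentage: problems solved on the first sample / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# (passed, total) counts copied from the table above.
counts = {
    ("HumanEval", "base"): (2, 164),
    ("HumanEval", "this model"): (18, 164),
    ("MBPP", "base"): (6, 257),
    ("MBPP", "this model"): (43, 257),
}

for (bench, model), (passed, total) in counts.items():
    print(f"{bench:9s} {model:10s} {pass_at_1(passed, total):.1f}%")
```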

**Honest positioning:** as a standalone code model this is **entry-level**: roughly StarCoderBase-1B
tier (~15% HumanEval) and well below purpose-built small code models (DeepSeek-Coder-1.3B ~35%,
Qwen2.5-Coder-1.5B ~40%+, Phi-1 ~50%). Those models were **pretrained on code corpora of billions to
trillions of tokens**; this one learned code from **~25k SFT examples on a non-code reasoning base**.
The result is therefore about **sample efficiency**, not absolute code SOTA, and plausibly the
recurrent reasoning base helps with code's structured nature. (pass@1 was measured with the repo's
`eval_code.py` instruct harness, which can slightly *under*-measure relative to a model's native eval.)

## Training
- full-parameter SFT (sapientinc `cfg_sft` recipe: lr 3e-5, cosine to 10%, AdamW(0.9, 0.95) wd 0.1,
  3 epochs, `max_len` 2048, bf16)
- **`synth,cot` condition** (`<|quad_end|><|object_ref_end|>`) — deliberately a *different lane* than
  the tool expert's `direct`, for the composition experiment
- **data:** ~25k instruction→code examples from
  [CodeFeedback-Filtered-Instruction](https://huggingface.co/datasets/m-a-p/CodeFeedback-Filtered-Instruction)
  + [CodeAlpaca-20k](https://huggingface.co/datasets/sahil2801/CodeAlpaca-20k), length-filtered to fit 2048

## Usage
HRM-Text is a PrefixLM with a conditioning scheme: generate in the `synth,cot` lane, with
`token_type_ids=1` over the prompt tokens. Use the repo harness rather than a bare `.generate()` call:
```bash
python eval_code.py --bench humaneval --model jasoncarreira/hrm-text-code
```
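
For intuition about what `token_type_ids=1` over the prompt implies, the sketch below builds the type ids and a generic PrefixLM attention mask in plain Python. It is an illustration under assumptions, not the repo harness: it assumes type 1 marks a bidirectional prefix and type 0 marks causally generated tokens, which is the usual PrefixLM convention but is not confirmed here for HRM-Text. Both helper names are hypothetical.

```python
def build_token_type_ids(prompt_len: int, total_len: int) -> list[int]:
    """1 for every prompt (prefix) position, 0 for every generated position."""
    if not 0 <= prompt_len <= total_len:
        raise ValueError("need 0 <= prompt_len <= total_len")
    return [1] * prompt_len + [0] * (total_len - prompt_len)


def prefix_lm_mask(token_type_ids: list[int]) -> list[list[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Assumed convention: every position sees the whole prefix, and positions
    after the prefix additionally see only themselves and earlier positions.
    """
    n = len(token_type_ids)
    return [
        [j <= i or token_type_ids[j] == 1 for j in range(n)]
        for i in range(n)
    ]


ids = build_token_type_ids(prompt_len=3, total_len=5)
for row in prefix_lm_mask(ids):
    print("".join("x" if allowed else "." for allowed in row))
```

In a real run the harness handles this (and the `synth,cot` condition tokens) internally, which is why the bare `.generate()` path is discouraged above.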

## Note on composition
The merge experiment found that this code expert and the tool expert **do not compose** in merged
weights. Every merge coefficient gave a hard tool-XOR-code trade: tool use works only at full tool
weight, where code ability dies, and any reduction in tool weight collapses tool use while code recovers.
So for a multi-skill HRM agent the path is **model-routing** between separate experts, not
weight-merging. Details are in the repo README.
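
As a rough picture of what merging at a coefficient can look like, the sketch below linearly interpolates two sets of weights. This is an assumption for illustration: the exact merge method used in the experiment is documented in the repo, not here, and `merge_weights` and the toy parameter dicts are hypothetical stand-ins for real model state dicts.

```python
def merge_weights(tool: dict[str, list[float]],
                  code: dict[str, list[float]],
                  alpha: float) -> dict[str, list[float]]:
    """Linear interpolation: alpha * tool + (1 - alpha) * code, per parameter.

    alpha = 1.0 reproduces the tool expert, alpha = 0.0 the code expert.
    """
    if tool.keys() != code.keys():
        raise ValueError("experts must share the same parameter names")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return {
        name: [alpha * t + (1.0 - alpha) * c
               for t, c in zip(tool[name], code[name])]
        for name in tool
    }


# Toy weights standing in for two fine-tuned experts of the same base.
tool_expert = {"layer.weight": [1.0, 2.0]}
code_expert = {"layer.weight": [3.0, 4.0]}

for alpha in (0.0, 0.5, 1.0):
    print(alpha, merge_weights(tool_expert, code_expert, alpha))
```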

## License & lineage
Base is Apache-2.0; the training data (CodeAlpaca / CodeFeedback lineage) is best treated as
**non-commercial / research**. Verify source licenses for your use case.

🤖 Built with Claude Code.