MoM-python-slm / README.md
srivarenya's picture
Upload README.md with huggingface_hub
a29c68f verified
|
Raw
History Blame Contribute Delete
2.77 kB
---
license: apache-2.0
base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags: [code, python, qwen2.5-coder, dora, mixture-of-models, code-generation]
language: [en]
---
# MoM-Python-SLM (1.5B)
The **Python code-generation node** of a **Mixture-of-Models (MoM)** mesh — a set of small,
specialized Qwen2.5-Coder SLMs (shared tokenizer) coordinated by a lightweight router, aiming to beat
frontier generalists on coding by *specialization depth* rather than parameter count.
This node is a **single-turn code generator** (not an agent): given a Python task (optionally with an
upstream context packet), it returns reasoning followed by code. It shares the Qwen2.5-Coder
tokenizer with the other generative nodes, which is what makes logit-space fusion across the mesh
valid.
- **Base:** [Qwen/Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct)
- **Method:** DoRA r=64 (≈4.6% trainable), SFT (Phase A 1ep + Phase B 2ep), then merged.
- **Data:** 476K instances (decontaminated vs HumanEval/MBPP, 0 overlap) built from the complete
CPython docs + Flask/Requests source, issues/PRs, CVEs, and execution-verified synthetic problems.
## Benchmarks (greedy pass@1)
| Suite | Metric | base | **this model** |
|---|---|---|---|
| HumanEval | pass@1 | 68.9 | **70.7** |
| MBPP | pass@1 | 66.7 | **69.6** |
| Domain (held-out) | `spec_to_code` exec | 0.632 | **0.714** (+8.2) |
| Domain (held-out) | `api_signature` param-recall | 0.217 | **0.299** (+8.2) |
| Domain (held-out) | `problem_solving` exec | 0.700 | 0.713 (parity) |
The largest gains are on **library/API capability** (writing correct code from a spec, recalling API
signatures) — the dimension HumanEval/MBPP are saturated on and can't measure. The repo's
self-contained domain-eval notebook reproduces these.
## Recipe findings (load-bearing)
- **Low DoRA rank wins:** r=64 specializes without forgetting; r=256 catastrophically regressed
(HumanEval 60.4 < base).
- **Moderate reasoning wins:** the ~25%-reasoning recipe (this model) beat a 98%-reasoning sibling,
whose HumanEval *collapsed* to 47 (always-reason prose fights the signature-completion format).
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("srivarenya/MoM-python-slm")
model = AutoModelForCausalLM.from_pretrained(
"srivarenya/MoM-python-slm", dtype="bfloat16", device_map="auto")
```
Prompt with the training system prompt + a Python task; the model returns reasoning then code.
Next step in the pipeline: **GRPO/RLVR** against an execution-grounded reward to push past the
instruct-tuning ceiling. Code, training recipe, and eval harnesses: project repository.