--- license: apache-2.0 base_model: Qwen/Qwen2.5-Coder-1.5B-Instruct pipeline_tag: text-generation library_name: transformers tags: [code, python, qwen2.5-coder, dora, mixture-of-models, code-generation] language: [en] --- # MoM-Python-SLM (1.5B) The **Python code-generation node** of a **Mixture-of-Models (MoM)** mesh — a set of small, specialized Qwen2.5-Coder SLMs (shared tokenizer) coordinated by a lightweight router, aiming to beat frontier generalists on coding by *specialization depth* rather than parameter count. This node is a **single-turn code generator** (not an agent): given a Python task (optionally with an upstream context packet), it returns reasoning followed by code. It shares the Qwen2.5-Coder tokenizer with the other generative nodes, which is what makes logit-space fusion across the mesh valid. - **Base:** [Qwen/Qwen2.5-Coder-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B-Instruct) - **Method:** DoRA r=64 (≈4.6% trainable), SFT (Phase A 1ep + Phase B 2ep), then merged. - **Data:** 476K instances (decontaminated vs HumanEval/MBPP, 0 overlap) built from the complete CPython docs + Flask/Requests source, issues/PRs, CVEs, and execution-verified synthetic problems. ## Benchmarks (greedy pass@1) | Suite | Metric | base | **this model** | |---|---|---|---| | HumanEval | pass@1 | 68.9 | **70.7** | | MBPP | pass@1 | 66.7 | **69.6** | | Domain (held-out) | `spec_to_code` exec | 0.632 | **0.714** (+8.2) | | Domain (held-out) | `api_signature` param-recall | 0.217 | **0.299** (+8.2) | | Domain (held-out) | `problem_solving` exec | 0.700 | 0.713 (parity) | The largest gains are on **library/API capability** (writing correct code from a spec, recalling API signatures) — the dimension HumanEval/MBPP are saturated on and can't measure. The repo's self-contained domain-eval notebook reproduces these. ## Recipe findings (load-bearing) - **Low DoRA rank wins:** r=64 specializes without forgetting; r=256 catastrophically regressed (HumanEval 60.4 < base). - **Moderate reasoning wins:** the ~25%-reasoning recipe (this model) beat a 98%-reasoning sibling, whose HumanEval *collapsed* to 47 (always-reason prose fights the signature-completion format). ## Usage ```python from transformers import AutoModelForCausalLM, AutoTokenizer tok = AutoTokenizer.from_pretrained("srivarenya/MoM-python-slm") model = AutoModelForCausalLM.from_pretrained( "srivarenya/MoM-python-slm", dtype="bfloat16", device_map="auto") ``` Prompt with the training system prompt + a Python task; the model returns reasoning then code. Next step in the pipeline: **GRPO/RLVR** against an execution-grounded reward to push past the instruct-tuning ceiling. Code, training recipe, and eval harnesses: project repository.