---
license: apache-2.0
base_model: Tesslate/OmniCoder-9B
tags:
- qwen3.5
- lora
- sft
- reasoning
- coding
- gguf
- omnicoder
language:
- en
library_name: transformers
---
![Slopus](slopus_logo.png)

# Slopus 9B

**Experimental fine-tuning pipeline on top of `Tesslate/OmniCoder-9B` (Qwen3.5-9B base + 425K Opus agentic traces).**

Two chained LoRA stages: a light coding-style adaptation first, then a heavier capacity injection of pure Opus 4.6 reasoning traces.

## Pipeline overview

![pipeline](slopus_pipeline.png)

## Table of contents

1. [Summary](#summary)
2. [Pipeline in 5 steps](#pipeline-in-5-steps)
3. [Phase 1: LoRA r=8 on OmniCoder-9B](#phase-1)
4. [Phase 2: LoRA r=128 on phase 1 merge](#phase-2)
5. [Benchmarks](#benchmarks)
6. [Costs](#costs)
7. [Available quants](#available-quants)
8. [Usage with llama-server](#usage-with-llama-server)
9. [Limitations](#limitations)
10. [Datasets & credits](#datasets--credits)

## Summary

| Setting | Phase 1 | Phase 2 |
|---|--------|--------|
| Dataset | `Kukedlc/omnicoder-train` (16K, agentic coding) | `Kukedlc/omnicoder-fase2-reasoning` (24K, Opus 4.6 reasoning, derived from [`Gryphe/Opus-4.6-Reasoning-24k`](https://huggingface.co/datasets/Gryphe/Opus-4.6-Reasoning-24k)) |
| Base | `Tesslate/OmniCoder-9B` | `Tesslate/OmniCoder-9B` + phase 1 LoRA merged (fp16) |
| LoRA rank / alpha | r=8, alpha=16 | r=128, alpha=256 |
| Targets | q,k,v,o,gate,up,down,**out_proj** (GDN) | same modules, much higher rank |
| Epochs | 2 | 2 |
| Total steps | 506 | 758 |
| Effective batch | 64 (batch=8, GA=8) | 64 (batch=16, GA=4) |
| LR | 2e-4 | 1e-4 (lower, base already fine-tuned) |
| Max seq | 2048 | 4096 |
| Training hardware | RunPod H100 80GB SXM Secure | RunPod H100 80GB SXM Secure |
| Approximate time | ~1.5h | ~4.8h |
| Loss start → end | 1.14 → 0.88 (**-23.1%**) | 1.03 → 0.88 (**-14.3%**) |

## Pipeline in 5 steps

1. **Base**: [`Tesslate/OmniCoder-9B`](https://huggingface.co/Tesslate/OmniCoder-9B), a Qwen3.5-9B with hybrid GDN (Gated Delta Networks) architecture, fine-tuned by Tesslate on 425K Opus agentic coding traces.
2. **Phase 1 LoRA r=8 alpha=16** on top of the base, dataset `Kukedlc/omnicoder-train`. 2 epochs, 506 steps. Resulting adapter: [`Kukedlc/omnicoder-9b-lora`](https://huggingface.co/Kukedlc/omnicoder-9b-lora) (checkpoints 100/200/300/400/500/506).
3. **Phase 1 merge**: base + phase 1 adapter → ~18 GB HF fp16 model (not published, used only as the internal base for phase 2).
4. **Phase 2 LoRA r=128 alpha=256** on top of the merged model, dataset `Kukedlc/omnicoder-fase2-reasoning` (re-rendered from Gryphe Opus reasoning with `<think>` inline in the chat template). 2 epochs, 758 steps. Resulting adapter: [`Kukedlc/omnicoder-9b-fase2-lora`](https://huggingface.co/Kukedlc/omnicoder-9b-fase2-lora) (checkpoints 100/200/.../758).
5. **Phase 2 merge + quantize** with `llama.cpp` at tag `b8292` (commit `b54124110`). Newer master and other versions do NOT fully support Qwen3.5's hybrid GDN architecture and fail with `missing tensor 'blk.32.attn_norm.weight'`. Generated quants: Q2_K, Q3_K_M, Q4_K_M, Q6_K, Q8_0.

## Phase 1

![phase 1](slopus_fase1_loss.png)

A conservative LoRA (r=8 alpha=16) on top of OmniCoder-9B to introduce a first style adjustment in the target domain before the heavier injection in phase 2. The idea: don't break the base with a giant LoRA on the first pass.

**Observation**: the loss drops MORE in epoch 2 than in epoch 1 (10% in epoch 1 vs 15% in epoch 2), which suggests the base still had headroom to learn and shows no visible sign of overfitting.

## Phase 2

![phase 2](slopus_fase2_loss.png)

High-capacity LoRA (r=128 alpha=256, 258M trainable params = 2.67% of the 9B) on top of the phase 1 merge, with a pure reasoning dataset from Opus 4.6 (24K chat examples with separate `reasoning_content`, re-rendered as `<think>X</think>content` inline for the Qwen3.5 chat template).

LR lowered to 1e-4 (vs 2e-4 in phase 1) since the model was already fine-tuned. Cosine scheduler with 20 warmup steps.

**Recommended sampling parameters** (from Qwen3.6 model card, valid for Qwen3.5):

| Mode | temp | top_p | pres_pen | top_k | min_p |
|------|------|-------|----------|-------|-------|
| Thinking general | 1.0 | 0.95 | 1.5 | 20 | 0 |
| Thinking coding | 0.6 | 0.95 | 0.0 | 20 | 0 |
| Instruct general | 0.7 | 0.80 | 1.5 | 20 | 0 |

## Benchmarks

All benchmarks were run **locally** on an RTX 3090, against the Q4_K_M GGUF served via `llama-server` at `b8292`.
A custom Python harness (`_custom_gsm8k_bench.py` and `_custom_mc_bench.py` in the dataset repo) sends each question and its choices to the model and extracts the final letter or number with a regex. `lm-evaluation-harness` was not used because of multiple bugs with the Qwen3.5 hybrid architecture + chat template + logprobs format from `llama-server` (documented in the issues).

### GSM8K: Slopus 9B vs base OmniCoder 9B (100 problems, same Q&A set)

| Setting | Slopus 9B Q4 | OmniCoder 9B Q4 | Δ |
|---|---|---|---|
| Greedy + no-thinking | **80.0%** (5.5 min) | **81.0%** (~5 min) | -1.0 pp |
| Thinking ON + Qwen sampling (temp=1.0, top_p=0.95, top_k=20, presence_penalty=1.5) | **92.0%** (5.5 min) | **97.0%** (**65.6 min**) | -5.0 pp |

**Speed**: Slopus is approximately 12x faster than OmniCoder with thinking on (5.5 min vs 65.6 min, same hardware, same `--parallel 2`). Slopus reasoning is more concise (approximately 300 tokens avg vs approximately 1500 tokens avg for OmniCoder).

**Tradeoff**: OmniCoder gains approximately 5 pp of accuracy on math by re-verifying answers multiple times (`Step → Alternative Plan → Verification`); Slopus gains approximately 12x throughput. For batch workloads (1000s of queries), Slopus processes about 12x the volume per hour.

### tinyBenchmarks (100 examples each, custom generative letter pick, thinking ON + Qwen sampling)

| Benchmark | Slopus 9B Q4 | Time |
|---|---|---|
| tinyMMLU (knowledge) | **76.0%** | 6.3 min |
| tinyArc (science Q&A) | **92.0%** | 5.4 min |
| tinyHellaswag (common sense) | **65.0%** | 5.1 min |
| tinyWinogrande (coreference) | **76.0%** | 4.0 min |
| tinyTruthfulQA (truthfulness, mc1) | **61.0%** | 5.2 min |
| **Average** | **74.0%** | total ~26 min |

**OmniCoder was NOT benchmarked on tinyBenchmarks MC** because at OmniCoder's pace (approximately 12x slower) it would have taken 5-6h instead of 26 min. The GSM8K comparison gives enough signal of where each model stands.

Random baseline for 4-choice MC ≈ 25%. All results are well above baseline.
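
The letter/number extraction that the custom harness performs can be sketched as below. This is a minimal illustration, not the code in `_custom_gsm8k_bench.py` or `_custom_mc_bench.py`: the function names and regex patterns are hypothetical, and it assumes that any inline reasoning is wrapped in Qwen-style `<think>...</think>` tags and that the final answer is the last number (or the last standalone A-D letter) in the reply.

```python
import re

# Hypothetical patterns: the real harness scripts are not shown in this card.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)  # assumed reasoning tags
_NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")          # e.g. 42, 1,234, -3.5
_LETTER_RE = re.compile(r"\b([A-D])\b")                   # standalone choice letter


def strip_reasoning(text: str) -> str:
    """Drop any inline <think>...</think> block so only the answer remains."""
    return _THINK_RE.sub("", text).strip()


def extract_number(text: str):
    """Return the last number in the answer text as a float, or None."""
    matches = _NUMBER_RE.findall(strip_reasoning(text))
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def extract_letter(text: str):
    """Return the last standalone A-D letter in the answer text, or None."""
    matches = _LETTER_RE.findall(strip_reasoning(text))
    return matches[-1] if matches else None
```

Taking the last match is a simple heuristic. It can misfire when a reply ends with an unrelated number or a capitalised article such as "A", so a real harness would likely also ask the model for a fixed answer format.
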
ARC at 92% is excellent for a 9B Q4.

## Costs

| Item | Time | Cost |
|------|--------|-------|
| Phase 1 training, H100 SXM Secure ($3.29/h) | ~1.5h | ~$4.95 |
| Phase 2 training, H100 SXM Secure ($3.29/h) | ~4.8h | ~$15.80 |
| Intermediate merges + quantize on local 3090 | ~2h | $0 (local) |
| HF storage (XET turbo) | - | $0 |
| **Total paid compute** | ~6.3h cloud | **~$20.75** |

## Available quants

| Quant | Size | Recommended use |
|-------|--------|------------------|
| `Slopus-9B-Q2_K.gguf` | ~3.6 GB | Only if VRAM <6 GB. Noticeably lower quality. |
| `Slopus-9B-Q3_K_M.gguf` | ~4.5 GB | VRAM ~6 GB. Tight compromise. |
| **`Slopus-9B-Q4_K_M.gguf`** | **~5.5 GB** | **Sweet spot, default recommended**. VRAM 8 GB+ |
| `Slopus-9B-Q6_K.gguf` | ~7.5 GB | Near-perfect quality. VRAM 10 GB+ |
| `Slopus-9B-Q8_0.gguf` | ~9.5 GB | Almost indistinguishable from fp16. VRAM 12 GB+ |

## Usage with llama-server

**IMPORTANT**: you must use `llama.cpp` at tag `b8292` (commit `b54124110`). Current master does NOT load these GGUFs (it fails with a missing tensor error due to a bug in the converter+loader for Qwen3.5's hybrid GDN architecture).

```bash
# Install llama.cpp at b8292
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git checkout b8292
cmake -B build -DGGML_CUDA=ON
cmake --build build --target llama-quantize llama-server -j$(nproc)

# Serve
export LLAMA_CHAT_TEMPLATE_KWARGS='{"enable_thinking":true}'
./build/bin/llama-server \
  --model Slopus-9B-Q4_K_M.gguf \
  -ngl 999 -fa on --no-mmap \
  -c 65536 \
  --parallel 4 \
  --jinja --reasoning-format deepseek \
  --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --presence-penalty 1.5 \
  --port 12345
```

## LoRA adapters (without merge)

Both LoRA adapters are included in this same repo:

- [`adapter_phase1/`](./adapter_phase1): final phase 1 adapter (checkpoint-506, r=8 alpha=16), trained on `Kukedlc/omnicoder-train`. Loads on top of `Tesslate/OmniCoder-9B`.
- [`adapter_phase2/`](./adapter_phase2): final phase 2 adapter (checkpoint-758, r=128 alpha=256), trained on `Kukedlc/omnicoder-fase2-reasoning`. Loads on top of the merged `Tesslate/OmniCoder-9B` + `adapter_phase1` model.

The intermediate merge is NOT published (~18 GB, adds no value vs. the Q4_K_M GGUF directly).

To reproduce Slopus 9B from scratch:

1. Apply `adapter_phase1` over `Tesslate/OmniCoder-9B` → merge to fp16
2. Apply `adapter_phase2` over the merged phase 1 → merge to fp16
3. Convert + quantize with `llama.cpp@b8292`

## Limitations

- **r=128 alpha=256 on 24K examples** is borderline against the rule of thumb that "high rank requires a large dataset". The model showed NO visible signs of overfitting (loss still going down at the end of epoch 2), but without an eval set this can't be fully confirmed. For production use, ideally build an out-of-distribution benchmark first.
- **GGUFs require llama.cpp b8292** specifically. Newer versions don't load them. This model is NOT compatible with stock `Ollama` (which embeds a newer llama.cpp); you'd have to patch Ollama or use `llama-server` directly.
- **Basic arithmetic in Q4_K_M** can be slightly off (the model lists the terms correctly but sometimes sums the final result wrong). Use Q6_K or Q8_0 for tasks requiring numerical precision.
- **No RLHF/DPO**. SFT only in both phases.

## Datasets & credits

- [`Tesslate/OmniCoder-9B`](https://huggingface.co/Tesslate/OmniCoder-9B): base model. Apache 2.0.
- [`Gryphe/Opus-4.6-Reasoning-24k`](https://huggingface.co/datasets/Gryphe/Opus-4.6-Reasoning-24k): source of the reasoning dataset (re-rendered to the Qwen3.5 chat template with inline `<think>`). Apache 2.0.
- llama.cpp b8292: GitHub `ggml-org/llama.cpp`, commit `b54124110`.
- Unsloth: efficient bf16 LoRA training.
- bartowski, for publishing OmniCoder-9B GGUFs at b8292, which served as the reference to identify the correct llama.cpp version.

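
## Sketch: chaining the two merges

The from-scratch steps listed earlier (apply `adapter_phase1`, merge, apply `adapter_phase2`, merge) could look roughly like this with Hugging Face PEFT. It is a sketch under stated assumptions, not the author's script (training used Unsloth): it assumes the model loads through `AutoModelForCausalLM`, that both adapter folders have been downloaded locally, and that fp16 weights fit in memory. The `merged/` output paths and the function names are made up for illustration.

```python
from pathlib import Path


def merge_adapter(base: str, adapter: str, out_dir: str) -> str:
    """Load `base`, apply the LoRA at `adapter`, merge it, and save fp16 weights."""
    # Heavy imports stay local so the plan below can be inspected without a GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the checkpoint is loadable as a causal LM in fp16.
    model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(model, adapter)
    merged = model.merge_and_unload()  # folds the LoRA deltas into the base weights
    merged.save_pretrained(out_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base).save_pretrained(out_dir)
    return out_dir


def merge_plan(work_dir: str = "merged"):
    """Return (base, adapter, output) triples for the two chained merges."""
    phase1_out = str(Path(work_dir) / "phase1")  # hypothetical local path
    phase2_out = str(Path(work_dir) / "phase2")  # hypothetical local path
    return [
        ("Tesslate/OmniCoder-9B", "adapter_phase1", phase1_out),
        (phase1_out, "adapter_phase2", phase2_out),  # phase 2 sits on the phase 1 merge
    ]


def run(work_dir: str = "merged") -> str:
    """Execute both merges in order and return the final output directory."""
    out = ""
    for base, adapter, out_dir in merge_plan(work_dir):
        out = merge_adapter(base, adapter, out_dir)
    return out
```

Calling `run()` needs `torch`, `transformers`, and `peft` installed. The resulting `merged/phase2` directory is what step 3 would then convert and quantize with `llama.cpp@b8292`.
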
## Reproducing the experiment

Scripts in the [`Kukedlc/omnicoder-train`](https://huggingface.co/datasets/Kukedlc/omnicoder-train) dataset repo:

- `train_omnicoder.py`: phase 1
- `train_omnicoder_fase2.py`: phase 2
- `setup_train.sh` / `setup_fase2.sh`: RunPod pod launchers
- `watcher_upload.sh` / `watcher_upload_fase2.sh`: automatic checkpoint upload

## License

Apache 2.0, inherited from the OmniCoder-9B base and the Gryphe dataset.
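
## Appendix: sanity-checking the reported numbers

The figures in the Summary, Benchmarks, and Costs tables can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this card. The dataset sizes are the rounded "16K" and "24K" values, so the computed step counts are expected to land slightly below the reported 506 and 758 rather than match exactly. The helper names are illustrative.

```python
import math

H100_RATE = 3.29  # USD per hour, as quoted in the Costs table


def steps(n_examples: int, batch: int, grad_accum: int, epochs: int) -> int:
    """Optimizer steps for a run: ceil(examples / effective batch) per epoch."""
    effective_batch = batch * grad_accum
    return math.ceil(n_examples / effective_batch) * epochs


def cost(hours: float, rate: float = H100_RATE) -> float:
    """Cloud cost in USD for a run of the given length."""
    return hours * rate


# Step counts from the rounded dataset sizes (card reports 506 and 758).
phase1_steps = steps(16_000, batch=8, grad_accum=8, epochs=2)
phase2_steps = steps(24_000, batch=16, grad_accum=4, epochs=2)

# Paid compute for both phases (card reports ~$20.75).
total_cost = cost(1.5) + cost(4.8)

# Thinking-mode GSM8K wall time, OmniCoder vs Slopus (card reports ~12x).
speedup = 65.6 / 5.5

# Mean of the five tinyBenchmarks scores (card reports 74.0%).
tiny_avg = sum([76.0, 92.0, 65.0, 76.0, 61.0]) / 5

print(phase1_steps, phase2_steps, round(total_cost, 2), round(speedup, 1), tiny_avg)
```

Both phases use 64 sequences per optimizer step (8 × 8 and 16 × 4), which is why the two step counts scale directly with dataset size.
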