---
license: gemma
base_model: unsloth/gemma-4-E4B-it
tags:
- rust
- code-generation
- leptos
- axum
- rig
- gemma
- lora
- fine-tuned
language:
- en
datasets:
- y0sif/Arcwright-v4-Combined
pipeline_tag: text-generation
---

# Arcwright

**Arcwright** is a Gemma 4 E4B-it model fine-tuned for modern Rust web and AI frameworks: [Leptos](https://leptos.dev), [Axum](https://github.com/tokio-rs/axum), and [Rig](https://github.com/0xPlaygrounds/rig).

On `RustWebBench-15`, Arcwright scores **6.87 / 10 overall**, beating not only its base model (Gemma 4 E4B, 5.00) but also:

- Gemma 4 26B-A4B (6.59), the **6.5× larger** model from the same family
- Claude Haiku (6.73)
- Gemini (5.96)
- Qwen3-Coder 30B-A3B (5.75)

## Leaderboard

| Rank | Model | Leptos | Axum | Rig | Overall |
|---|---|---|---|---|---|
| **1** | **Arcwright** | **8.40** | **8.28** | **3.92** | **6.87** |
| 2 | Claude Haiku | 7.16 | 8.04 | 5.00 | 6.73 |
| 3 | Gemma 4 26B-A4B | 7.72 | 8.20 | 3.84 | 6.59 |
| 4 | Gemini | 6.96 | 7.40 | 3.52 | 5.96 |
| 5 | Qwen3-Coder 30B-A3B | 7.36 | 6.20 | 3.68 | 5.75 |
| 6 | Gemma 4 E4B-it (base) | 5.24 | 6.84 | 2.92 | 5.00 |
| 7 | Qwen3 8B | 5.52 | 5.28 | 3.08 | 4.63 |
| 8 | Qwen2.5-Coder 7B | 4.28 | 4.68 | 1.64 | 3.53 |

All models evaluated on the same 15 prompts (5 per crate), judged on 5 dimensions (1-10): correctness, completeness, idiomatic, crate_knowledge, explanation.
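
The Overall column appears to be the unweighted mean of the three per-crate scores. The card does not state the formula, so treat this as an inference from the published numbers rather than the benchmark's documented method. A minimal sketch that recomputes it (the `rows` table and the `overall` helper are illustrative names, not part of the benchmark scripts):

```python
# (Leptos, Axum, Rig, published Overall), copied from the leaderboard above.
rows = {
    "Arcwright": (8.40, 8.28, 3.92, 6.87),
    "Claude Haiku": (7.16, 8.04, 5.00, 6.73),
    "Gemma 4 26B-A4B": (7.72, 8.20, 3.84, 6.59),
    "Gemini": (6.96, 7.40, 3.52, 5.96),
    "Qwen3-Coder 30B-A3B": (7.36, 6.20, 3.68, 5.75),
    "Gemma 4 E4B-it (base)": (5.24, 6.84, 2.92, 5.00),
    "Qwen3 8B": (5.52, 5.28, 3.08, 4.63),
    "Qwen2.5-Coder 7B": (4.28, 4.68, 1.64, 3.53),
}


def overall(leptos, axum, rig):
    """Unweighted mean of the three per-crate scores (assumed formula)."""
    return round((leptos + axum + rig) / 3, 2)


for model, (leptos, axum, rig, published) in rows.items():
    recomputed = overall(leptos, axum, rig)
    print(f"{model}: recomputed {recomputed:.2f}, published {published:.2f}")
```

Under this reading each crate carries equal weight, so a model's Rig score moves its overall score exactly as much as its Leptos or Axum score does.
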
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model; dtype and device placement are chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    "y0sif/Arcwright", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("y0sif/Arcwright")

msgs = [{"role": "user", "content": [
    {"type": "text", "text": "Write a Leptos counter component with increment/decrement buttons."}
]}]

# return_dict=True returns input_ids and attention_mask together,
# independent of the library's default return type.
inputs = tokenizer.apply_chat_template(
    msgs, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

A lightweight LoRA-only version is available at [`y0sif/Arcwright-LoRA`](https://huggingface.co/y0sif/Arcwright-LoRA). Apply it on top of [`unsloth/gemma-4-E4B-it`](https://huggingface.co/unsloth/gemma-4-E4B-it).

## Training details

- **Base**: `unsloth/gemma-4-E4B-it` (4-bit)
- **Method**: QLoRA, conservative settings (see below)
- **Dataset**: [`y0sif/Arcwright-v4-Combined`](https://huggingface.co/datasets/y0sif/Arcwright-v4-Combined), 1,007 train / 109 test
  - Leptos: 334 curated pairs
  - Axum: 317 curated pairs
  - Rig: 158 compile-verified pairs (115 from `examples/` + 43 compile-passing supplements)
  - General Rust: 307 pairs from Strandset-Rust-v1 (replay buffer to prevent catastrophic forgetting)
- **Training data pipeline**: 3-gate quality pipeline: sub-agent generation → LLM judge (threshold 7.0) → `cargo check` compile verification. Only entries that pass all three gates make it into training.
- **Hyperparameters**: r=8, alpha=16, dropout=0, lr=5e-5, 1 epoch, cosine schedule, bf16, effective batch 32. See [the hyperparameter rationale](https://github.com/y0sif/OxideCoder/blob/main/docs/v4-steps/05-train.md).
- **Hardware**: Colab Pro, L4 GPU
- **Training runtime**: ~15 minutes

## Why it works

Three prior training runs (v1-v3) all regressed relative to the base model. v4 fixed the root causes:

1. **Compile-verified data**: every training entry compiles under `cargo check`. No hallucinated APIs.
2. **Conservative hyperparameters**: low rank + low LR + single epoch. This gives just enough drift to inject domain knowledge, but not enough to overwrite base capabilities.
3. **General-Rust replay buffer**: 28% of the training mix is crate-agnostic Rust, which prevents catastrophic forgetting.
4. **Proportional per-crate sizing**: the per-crate ratio matches each crate's learnability relative to the base model.

## Limitations

- **Rig (AI agent framework) scores 3.92**, below Claude Haiku (5.00). Rig has the smallest training share (158 entries) and its API surface is the most niche. The model often uses the correct import paths but invents method signatures.
- Evaluated on only n=5 prompts per crate. Absolute scores have roughly ±0.3 judge variance.
- Knowledge cutoff reflects the crate versions in the training data (Axum 0.8, Leptos 0.7, Rig 0.13 era).
- Trained on full sequences (prompt + response) rather than completion-only, because of an Unsloth/Gemma 4 VLM constraint.

## Benchmark

Training data, eval prompts, judge rubric, and scripts: [y0sif/OxideCoder](https://github.com/y0sif/OxideCoder).

## License

Inherits the Gemma Terms of Use from the base model.
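
As a sanity check on the dataset composition listed under Training details, the per-source shares can be recomputed from the pair counts. This is a minimal sketch: the `mix` dictionary is an illustrative name, and the shares are computed over all pairs (train + test combined), which assumes the train/test split is roughly proportional across sources.

```python
# Pair counts per source, copied from the Training details list.
mix = {"Leptos": 334, "Axum": 317, "Rig": 158, "General Rust": 307}

total = sum(mix.values())  # all pairs across sources (train + test)
shares = {name: count / total for name, count in mix.items()}

for name, share in shares.items():
    print(f"{name}: {mix[name]} pairs, {share:.1%} of the mix")
print("total pairs:", total)
```

The total matches the published 1,007 train + 109 test split, and the General Rust share rounds to the 28% replay-buffer figure quoted under "Why it works".
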