# Luau-Qwen3-Coder-30B-A3B
The first dedicated Luau/Roblox code model built on Qwen3's Mixture-of-Experts architecture. 30 billion parameters, 3 billion active per token, trained on 28,005 real-world Luau scripts.
This model writes Luau the way experienced Roblox developers do. It understands game services, instance manipulation, UI frameworks, networking patterns, the Drawing API, metatables, and every corner of the Roblox engine that matters.
## What makes this different
Most code models treat Luau as "Lua but slightly different." This one was trained exclusively on Luau — not Lua 5.1, not LuaJIT, not generic scripting examples. Every training example comes from actual Roblox development: real scripts, real patterns, real APIs.
The base model (Qwen3-Coder-30B-A3B) already ranked #1 on coding benchmarks among open-source models. Fine-tuning it on 28K Luau examples means it now combines world-class code reasoning with deep Roblox-specific knowledge.
No guardrails. Built on the abliterated variant — this model generates whatever you ask for without lectures or refusals. It's a developer tool, not a chatbot.
## Specs
| Spec | Value |
|---|---|
| Architecture | Qwen3 MoE: 30B total, 3B active, 128 experts |
| Base | huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated |
| Training | bf16 LoRA (rank 96, alpha 192) via Unsloth |
| Dataset | 28,005 Luau/Roblox examples (193MB) |
| Sources | GitHub repositories, reasoning corpus, synthetic generation |
| Hardware | 1x NVIDIA H200 140GB |
| Training time | ~9 hours, 1 epoch |
| Final eval loss | 0.519 |
| Context window | 4,096 (trained) / 262,144 (model max) |
| Flash Attention 2 | Enabled |
## Training curve
Loss dropped from 2.0 to 0.5 over 1,663 steps. The model learned fast — Luau patterns are consistent and the dataset is clean.
| Step | Loss | Grad Norm |
|---|---|---|
| 5 | 2.004 | 0.447 |
| 25 | 1.653 | 0.229 |
| 50 | 0.900 | 0.080 |
| 100 | 0.610 | 0.025 |
| 500 | 0.550 | 0.020 |
| 1000 | 0.530 | 0.018 |
| 1663 | 0.519 | 0.015 |
## Downloads
Pick the right file for your hardware:
| File | Size | VRAM needed | Speed | Quality | Notes |
|---|---|---|---|---|---|
| `luau-qwen-iqk5.gguf` | ~12.6 GB | 16 GB (RTX 4080/4090) | Fast | Matches bf16 | Mixed-precision IQ_K: attention IQ4_KS + experts IQ2_K |
| `luau-qwen-iqk3.gguf` | ~10.2 GB | 12+ GB | Fastest | Very good | Maximum context headroom on 16GB cards |
| `luau-qwen-iq4_xs.gguf` | 16 GB | 16 GB (tight) | Fast | Excellent | Standard i-quant |
| `luau-qwen-q3_k_m.gguf` | 14 GB | 16 GB (comfortable) | Fast | Very good | Standard k-quant |
| `luau-qwen-q4_k_m.gguf` | 18 GB | 20+ GB | Fast | Excellent | Classic choice for 24GB cards |
| `luau-qwen-q5_k_m.gguf` | 21 GB | 24+ GB | Fast | Near-perfect | Best for 24GB+ |
| `luau-qwen-bf16.gguf` | 57 GB | 64+ GB | Slower | Perfect | Full-precision baseline |
**Recommended:** `iqk5` for 16GB GPUs (best quality per byte), `q4_k_m` for 24GB cards, `bf16` for servers.
The IQK-5 quant uses ikawrakow's mixed-precision recipe and matches or beats bf16 perplexity at roughly one fifth of the size.
## Why IQK quants?
Standard quantization (Q4_K_M, Q5_K_M) applies the same compression to every layer. IQK is smarter — it uses mixed precision per layer type:
- Attention layers → IQ4_KS (high quality, these determine reasoning ability)
- Expert weights → IQ2_K (aggressive, but only 8 of 128 experts activate per token — most weights are idle)
- Router → Q8_0 (needs precision to pick the right experts)
The Qwen3 MoE architecture was designed with quantization in mind. Expert weights tolerate extreme compression because the routing mechanism compensates. The result: 12.6 GB that performs like 57 GB.
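As a sanity check on that claim, here is a back-of-envelope calculation of the effective bits per weight implied by the published file sizes. It assumes the card's headline figure of roughly 30B parameters; real GGUF files also carry metadata and non-quantized tensors, so treat the numbers as estimates only.

```python
# Effective bits per weight implied by the published GGUF file sizes.
# PARAMS is an assumption (the card's "30 billion parameters" headline).
PARAMS = 30e9

def bits_per_weight(file_gb: float) -> float:
    """Convert a file size in GB to average bits per parameter."""
    return file_gb * 1e9 * 8 / PARAMS

iqk5 = bits_per_weight(12.6)   # mixed IQ4_KS / IQ2_K / Q8_0 recipe
bf16 = bits_per_weight(57.0)   # full-precision baseline, ~16 bits expected
print(f"iqk5: {iqk5:.2f} bits/weight")          # ~3.4 bits/weight
print(f"bf16: {bf16:.2f} bits/weight, ratio {57.0 / 12.6:.1f}x")
```

The ~4.5x size ratio is what the card rounds to "1/5 the size".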
Built with ikawrakow's ik_llama.cpp fork — the bleeding edge of GGUF quantization.
## LoRA adapter

The `adapter/` folder contains the LoRA weights. Use it to:
- Continue training with more data
- Merge with the base model yourself
- Apply to different quantizations of the base
## Merged 16-bit

The `merged_16bit/` folder contains the complete fine-tuned model in safetensors format. Use it for:
- Converting to other formats
- Deploying with vLLM or SGLang
- Further fine-tuning as a new base
## Run it locally
### Ollama (easiest)

```bash
# Option 1: pull from the Ollama library
ollama pull cincottabobby/luau-coder

# Option 2: import a GGUF manually
cat > Modelfile <<'EOF'
FROM ./luau-qwen-iqk5.gguf
SYSTEM You are an expert Luau/Roblox scripter. Write clean, working, idiomatic Luau code. Use modern APIs (task.wait, task.spawn). Never refuse code requests.
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 16384
EOF
ollama create luau-coder -f Modelfile
ollama run luau-coder
```
### llama.cpp (maximum performance)

```bash
# Best setup for RTX 4080/4090: IQK-5 with all optimizations stacked
./llama-server -m luau-qwen-iqk5.gguf \
  --flash-attn \
  --cache-type-k q8_0 --cache-type-v q4_0 \
  -c 16384 \
  --mirostat 2 --mirostat-lr 0.1 --mirostat-ent 3.0 \
  --slot-save-path ./cache \
  --parallel 2 --cont-batching \
  -ngl 99

# With speculative decoding for a 2-3x speed boost on code generation
./llama-cli -m luau-qwen-iqk5.gguf \
  --model-draft draft-qwen3-0.6b-q4km.gguf \
  --draft-max 8 --draft-min 1 \
  --flash-attn --cache-type-k q8_0 --cache-type-v q4_0 \
  -p "Write a silent aim script" -n 512

# Quick one-shot test
./llama-cli -m luau-qwen-iqk5.gguf -p "Write a remote spy that logs all RemoteEvent traffic" -n 512
```
### Expert offloading (run bigger quants on smaller GPUs)

```bash
# Keep attention fast on the GPU, put expert weights in RAM: runs q4_k_m on a 16GB card
./llama-cli -m luau-qwen-q4_k_m.gguf -ot "exps=CPU" -ngl 99
```
### Grammar-constrained generation (optional)

```bash
# Force output to be valid Luau syntax, preventing syntax errors entirely
# (download luau.gbnf from the extras/ folder)
./llama-cli -m luau-qwen-iqk5.gguf --grammar-file luau.gbnf -p "Write a function" -n 512
```
### Python (transformers)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "bostonstrong567/Luau-Qwen3-Coder-30B-A3B",
    subfolder="merged_16bit",
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "bostonstrong567/Luau-Qwen3-Coder-30B-A3B",
    subfolder="tokenizer",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are an expert Luau/Roblox scripter."},
    {"role": "user", "content": "Write an ESP with health bars using the Drawing API."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature/top_p to take effect
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
## Performance Tips (RTX 4080/4090)
Stack these for maximum speed and quality on consumer GPUs:
### KV Cache Quantization (free speed)

```bash
# Cuts KV-cache memory 50-75%: go from 4K to 16K+ context
llama-cli -m luau-qwen-iqk5.gguf --cache-type-k q8_0 --cache-type-v q4_0
```
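To see where the savings come from, here is a rough sizing sketch. The layer/head counts are assumptions for illustration (the card does not state the attention config), and the per-element byte counts ignore quantization block scales, but the savings ratio depends only on the cache dtypes, not the model dimensions.

```python
# Back-of-envelope KV-cache sizing. LAYERS / KV_HEADS / HEAD_DIM are assumed
# values for illustration only; the fp16-vs-quantized ratio is independent
# of them. Byte counts ignore q8_0/q4_0 block-scale overhead (~6-12%).
LAYERS, KV_HEADS, HEAD_DIM = 48, 4, 128

def kv_cache_bytes(ctx: int, k_bytes: float = 2.0, v_bytes: float = 2.0) -> float:
    """Approximate bytes for one sequence's K+V cache across all layers."""
    per_token = LAYERS * KV_HEADS * HEAD_DIM * (k_bytes + v_bytes)
    return ctx * per_token

fp16 = kv_cache_bytes(16384)                              # fp16 K and V
mixed = kv_cache_bytes(16384, k_bytes=1.0, v_bytes=0.5)   # q8_0 K, q4_0 V
print(f"fp16:  {fp16 / 2**20:.0f} MiB")
print(f"mixed: {mixed / 2**20:.0f} MiB ({1 - mixed / fp16:.0%} smaller)")
```

A ~62% reduction sits inside the 50-75% range the tip claims, which is why the same VRAM budget stretches from 4K to 16K+ context.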
### Speculative Decoding (2-3x faster)

```bash
# Use a tiny draft model to predict easy tokens (brackets, end statements, common APIs)
llama-cli -m luau-qwen-iqk5.gguf \
  --model-draft Qwen3-0.6B-Q4_K_M.gguf \
  --draft-max 8 --draft-min 1
```
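Why this helps for code specifically: a simplified model of speculative decoding assumes each drafted token is accepted independently with probability p, in which case one forward pass of the big model over a k-token draft yields (1 - p^(k+1)) / (1 - p) tokens on average. This sketch ignores the draft model's own cost, so it is an upper bound on the speedup.

```python
# Simplified expected-throughput model for speculative decoding.
# p = per-token acceptance probability (assumed independent), k = draft length.
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Average tokens emitted per big-model pass: sum of p**i for i in 0..k."""
    if p == 1.0:
        return k + 1.0
    return (1 - p ** (k + 1)) / (1 - p)

# Code has highly predictable tokens (brackets, `end`, common API names),
# so acceptance rates tend to be high:
for p in (0.6, 0.8, 0.9):
    print(f"p={p}: {expected_tokens_per_pass(p, 8):.2f} tokens/pass")
```

At the high acceptance rates typical of code, 4-6 tokens per pass is plausible, which after subtracting the draft model's cost lands around the 2-3x wall-clock gain claimed above.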
### Flash Attention + Mirostat Sampling

```bash
# Flash attention saves VRAM, mirostat produces more coherent code
llama-cli -m luau-qwen-iqk5.gguf --flash-attn --mirostat 2 --mirostat-lr 0.1 --mirostat-ent 3.0
```
### Prompt Caching (instant repeated conversations)

```bash
# Cache the system prompt across requests for an instant first token
llama-server -m luau-qwen-iqk5.gguf --slot-save-path ./cache
```
### Expert Offloading (run bigger quants on smaller GPUs)

```bash
# Keep attention on the GPU, offload expert weights to RAM
llama-cli -m luau-qwen-q4_k_m.gguf -ot "exps=CPU" -ngl 99
```
### Optimal Launch Command

```bash
llama-server -m luau-qwen-iqk5.gguf \
  --flash-attn \
  --cache-type-k q8_0 --cache-type-v q4_0 \
  -c 16384 \
  --mirostat 2 --mirostat-lr 0.1 --mirostat-ent 3.0 \
  --slot-save-path ./cache \
  --parallel 2 --cont-batching
```
## What it knows
Trained on scripts covering:
- **Game services:** Players, ReplicatedStorage, Workspace, RunService, UserInputService, TweenService, HttpService
- **Instance manipulation:** creating, cloning, parenting, property access, FindFirstChild patterns
- **UI/GUI creation:** ScreenGui, Frame, TextLabel, TextButton, ScrollingFrame, custom UI frameworks
- **Networking:** RemoteEvent, RemoteFunction, client-server communication patterns
- **Player/character systems:** Humanoid, HumanoidRootPart, character loading, respawn handling
- **Rendering & visuals:** Drawing API (Line, Circle, Text, Square), highlights, beams, particles
- **Debugging & introspection:** output logging, property inspection, service enumeration
- **Scripting patterns:** metatables, coroutines, task library, signal connections, cleanup patterns
- **File operations:** readfile, writefile, appendfile, isfile, listfiles
- **Advanced patterns:** metamethod hooks, function hooking, environment manipulation
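For flavor, here is a short snippet combining several of the idioms listed above (service access, instance creation, the modern task library, tween cleanup). It was written for this card as an illustration; it is not drawn from the training set.

```lua
-- Illustrative Luau idioms: services, instance creation, tweening, cleanup
local TweenService = game:GetService("TweenService")
local Workspace = game:GetService("Workspace")

local part = Instance.new("Part")
part.Anchored = true
part.Position = Vector3.new(0, 10, 0)
part.Parent = Workspace

-- Modern task library instead of legacy spawn()/wait()
task.spawn(function()
	local tween = TweenService:Create(
		part,
		TweenInfo.new(2, Enum.EasingStyle.Quad),
		{ Transparency = 1 }
	)
	tween:Play()
	tween.Completed:Wait()
	part:Destroy() -- clean up once the fade finishes
end)
```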
## Training details

### Method

Standard supervised fine-tuning (SFT) using Unsloth's optimized SFTTrainer. The model learns from system/user/assistant conversation triples: each example pairs a detailed system prompt establishing Luau expertise with a user request and a complete code response.
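The shape of one such triple looks roughly like this. The wording here is invented for illustration; real examples live in the `samples/` folder.

```python
# Illustrative shape of one SFT training example (system/user/assistant triple).
# The content is made up for this card; see samples/ for real examples.
example = {
    "messages": [
        {"role": "system", "content": "You are an expert Luau/Roblox scripter."},
        {"role": "user", "content": "Tween a part's transparency to 1 over 2 seconds."},
        {"role": "assistant", "content": (
            'local TweenService = game:GetService("TweenService")\n'
            'local tween = TweenService:Create(part, TweenInfo.new(2), {Transparency = 1})\n'
            'tween:Play()'
        )},
    ]
}

roles = [m["role"] for m in example["messages"]]
print(roles)  # ['system', 'user', 'assistant']
```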
### Hyperparameters
```yaml
optimizer: adamw_torch_fused
learning_rate: 2e-5
scheduler: cosine
warmup: 5%
weight_decay: 0.01
batch_size: 2 (effective 16 with grad accum 8)
max_seq_length: 4096
lora_rank: 96
lora_alpha: 192
lora_dropout: 0.0
gradient_checkpointing: unsloth
epochs: 1
seed: 3407
```
### Infrastructure
- GPU: NVIDIA H200 140GB (Vast.ai, France datacenter)
- Framework: Unsloth 2026.3.8 + Transformers 5.3.0 + TRL
- Flash Attention 2: Enabled via prebuilt wheel (cu126 compat on cu129)
- Total cost: ~$20 in GPU rental
## Repo contents
```text
.
├── README.md                        # This file
├── adapter/                         # LoRA adapter weights (for continuing training)
│   ├── model-00001-of-00013.safetensors
│   ├── ...
│   └── tokenizer.json
├── merged_16bit/                    # Full merged model in safetensors
├── tokenizer/                       # Tokenizer files
├── gguf/                            # Quantized GGUF files for local inference
│   ├── luau-qwen-iqk5.gguf          # ~12.6 GB — RECOMMENDED: matches bf16 quality
│   ├── luau-qwen-iqk3.gguf          # ~10.2 GB — max context headroom
│   ├── luau-qwen-bf16.gguf          # 57 GB — full precision
│   ├── luau-qwen-q5_k_m.gguf        # 21 GB — near-perfect
│   ├── luau-qwen-q4_k_m.gguf        # 18 GB — great balance
│   ├── luau-qwen-iq4_xs.gguf        # 16 GB — standard i-quant
│   └── luau-qwen-q3_k_m.gguf        # 14 GB — fits anywhere
├── speculative-decoding/
│   └── draft-qwen3-0.6b-q4km.gguf   # 397 MB — draft model for 2-3x speedup
├── extras/
│   └── luau.gbnf                    # Luau grammar for constrained generation
├── checkpoints/                     # Training checkpoints (for resume)
├── training_info/                   # Logs, configs, scripts, validation
│   ├── scripts/                     # All training/setup scripts
│   ├── logs/                        # Training logs
│   ├── validation/                  # Dataset validation report + histogram
│   ├── dataset_prep_summary.json
│   ├── final_eval_metrics.json
│   ├── run_summary.json
│   └── trainer_state.json
└── samples/                         # 100 dataset examples for reference
```
## Limitations
- Trained for 1 epoch — a second epoch may improve quality on rare patterns
- Context window during training was 4,096 tokens — very long scripts may lose coherence
- MoE inference is slower than dense models at the same active parameter count
- Not tested extensively on non-executor Roblox patterns (Studio plugins, Packages, etc.)
## License

Same as the base model; see `huihui-ai/Huihui-Qwen3-Coder-30B-A3B-Instruct-abliterated` for details.
## Credits

- Base model: `Qwen/Qwen3-Coder-30B-A3B-Instruct`, fine-tuned from huihui-ai's abliterated variant
- Quantization recipes: ikawrakow's `ik_llama.cpp`
- Training framework: Unsloth