Qwen3-4B QuickReply LoRA (fused)

LoRA fine-tune of Qwen/Qwen3-4B for generating short, context-aware chat replies, trained on Apple Silicon with mlx-lm. The LoRA adapter is fused into the base weights at half strength (scale = 10.0 instead of the mlx-lm default of 20.0), so the single safetensors set is drop-in usable with mlx-lm or any HF loader that supports Qwen3.

Built for the WID3002 NLP project (University of Malaya, Semester 2 2025/2026) as part of the ChatNow quick-reply suggestion app.

What it's for

Given a short conversation, produce 3 distinct one-liner replies that:

  • Match the language of the most recent message (English / Malay / Chinese).
  • Mirror chat short-forms and abbreviations (e.g. Malay nk mkn p? → reply in the same short-form register, not the spelled-out nak makan apa? form).
  • Preserve particles (lah, lor, leh, ya, eh), code-switching, and the casual rojak mix common in Malaysian chats.
  • Take different conversational moves (direct answer / clarifying question / proposal / opinion / redirect) — three replies, three angles.
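One way to ask for three replies with three different angles is to issue one request per conversational move. The sketch below is hypothetical: the `MOVES` tuple, the `build_requests` helper, and the system-prompt wording are illustrative assumptions, not the prompt the ChatNow app actually sends.

```python
# Hypothetical prompt builder: one chat request per conversational move.
# The move names come from the list above; the prompt wording is an assumption.
MOVES = ("direct answer", "clarifying question", "proposal", "opinion", "redirect")


def build_requests(history, moves=MOVES[:3]):
    """history: list of (speaker, text) pairs; the last pair is the message to answer."""
    transcript = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    requests = []
    for move in moves:
        system = (
            "Reply in 1 sentence, match the language and short-forms "
            f"of the last message. Conversational move: {move}."
        )
        requests.append(
            [
                {"role": "system", "content": system},
                {"role": "user", "content": transcript},
            ]
        )
    return requests


requests = build_requests([("Ali", "kau nk mkn p?")])
```

Each element of `requests` is a message list that can be passed to `apply_chat_template` or sent to a chat-completions endpoint.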

What's different from the base

| Aspect | Base Qwen3-4B | This fine-tune |
|---|---|---|
| Reply length | tends to over-generate (4–5× the reference length) | matches reference within 1.3–2× |
| Malay short-forms | often mis-parses (p read as a noun, not apa) | decoded and mirrored back |
| Code-switching | inconsistent, drifts to English | preserves the thread's language |
| Tone in casual chat | formal / textbook | casual, particle-aware |
| Style mirroring | none | mirrors the replier's prior register |

Performance

100-example held-out chat set, BLEU and ROUGE-L F1, 3 replies per context:

| Language | n | BLEU (base → FT) | ROUGE-L F1 (base → FT) |
|---|---|---|---|
| Overall | 100 | 0.34 → 8.48 (×25) | 0.060 → 0.484 (×8.1) |
| English | 60 | 0.43 → 6.59 | 0.083 → 0.363 |
| Malay | 15 | 0.26 → 8.64 | 0.069 → 0.356 |
| Chinese | 25 | 0.21 → 5.82 | 0.030 → 0.869 |

The hypothesis/reference length ratio also drops sharply on every slice: the fine-tune stops generating long monologues and produces reply-length text instead.
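For readers unfamiliar with the metrics, here is a minimal sketch of ROUGE-L F1 and the length ratio for a single hypothesis/reference pair. It assumes whitespace tokenization, which does not segment Chinese text, and it is not the evaluation script used for the table above; the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hyp, ref):
    """Harmonic mean of LCS-based precision and recall (whitespace tokens)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    lcs = lcs_length(h, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(h), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def length_ratio(hyp, ref):
    """Hypothesis length divided by reference length, in whitespace tokens."""
    return len(hyp.split()) / max(len(ref.split()), 1)


score = rouge_l_f1("nk mkn nasi lemak", "nk mkn nasi")
ratio = length_ratio("nk mkn nasi lemak", "nk mkn nasi")
```

A ratio well above 1 means the model is writing more than the reference reply, which is the over-generation pattern described for the base model.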

Training data

Four datasets, sampled and reformatted to chat turns:

  • daily_dialog — English casual conversation
  • bavard/personachat_truecased — English persona-grounded chat
  • bitext/Bitext-customer-support-llm-chatbot-training-dataset — English customer-support style short replies
  • mesolitica/malaysian-sft — Malay / rojak Malaysian text (Bahasa Malaysia + English code-switching)

No zh-only chat data was added during fine-tuning, so performance on the Chinese slice of the eval set relies on the base model's cross-lingual transfer. This is why zh gains are largely about length and particle handling rather than vocabulary.
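As an illustration of "reformatted to chat turns", the sketch below turns one multi-turn dialogue into several context-and-reply examples. The role-assignment scheme, the `max_context` window, and the `{"messages": [...]}` record shape are assumptions made for the example; the actual preprocessing is not described in this card.

```python
def dialogue_to_examples(utterances, max_context=4):
    """Build one training example per reply in a dialogue.

    Each example holds up to `max_context` previous utterances followed by
    the target reply, which is always given the "assistant" role.
    """
    examples = []
    for i in range(1, len(utterances)):
        context = utterances[max(0, i - max_context):i]
        messages = []
        # Walk backwards from the target so the turn just before it is "user".
        for offset, text in enumerate(reversed(context)):
            role = "user" if offset % 2 == 0 else "assistant"
            messages.append({"role": role, "content": text})
        messages.reverse()
        messages.append({"role": "assistant", "content": utterances[i]})
        examples.append({"messages": messages})
    return examples


examples = dialogue_to_examples(["jom lunch?", "boleh, kat mana?", "mamak je lah"])
```

Writing each element of `examples` as one JSON line gives a file in the chat-style layout that fine-tuning tools commonly accept.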

Training config (mlx-lm LoRA)

model: Qwen/Qwen3-4B
iters: 800
batch_size: 1
lr_schedule: cosine_decay(1e-5 → 1e-6, warmup 100)
lora_rank: 4
lora_alpha: 8
num_layers: 16          # top 16 transformer blocks only
grad_checkpoint: true
max_seq_length: 512

Val loss trajectory: 4.99 → 1.21 → 1.11 → 0.92 → 1.00 → 0.93 → 1.10 → 0.91. Training stopped near iter 700 of the planned 800 because of a Metal compute error, and the checkpoint at iter 600 was used for the fuse.
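The lr_schedule entry in the config can be read as a linear warmup followed by a cosine decay. The sketch below is one interpretation under stated assumptions: warmup rises linearly to the peak over the first 100 steps, and the cosine phase spans the remaining 700 of the 800 configured iterations. The exact shape mlx-lm produces may differ in detail.

```python
import math


def lr_at(step, peak=1e-5, floor=1e-6, warmup=100, total=800):
    """Learning rate at a given step: linear warmup, then cosine decay to `floor`."""
    if step < warmup:
        # Assumed linear ramp that reaches `peak` on the last warmup step.
        return peak * (step + 1) / warmup
    progress = min((step - warmup) / max(total - warmup, 1), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


# Learning rate at the iter-600 checkpoint that was used for the fuse.
lr_at_checkpoint = lr_at(600)
```

Under these assumptions the run was already well into the decay phase when it stopped, so the iter-600 checkpoint was trained mostly at rates below the peak.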

Adapter scale was patched from the mlx-lm default 20.0 down to 10.0 before fusing, halving the LoRA's influence on the base weights. This trades a small amount of style adherence for retaining more of the base model's reasoning, instruction-following, and multilingual coverage.
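To see why halving the scale halves the adapter's influence, here is a toy sketch that assumes the fused weight has the usual LoRA form W + scale · (B A). The 2×2 matrices and rank-1 factors are made-up numbers for illustration, not values from this model.

```python
def matmul(x, y):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def fuse(w, b, a, scale):
    """Return w + scale * (b @ a), the assumed form of a fused LoRA weight."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # toy base weight
B = [[0.01], [0.02]]          # toy LoRA factor, shape (out, rank=1)
A = [[0.5, -0.5]]             # toy LoRA factor, shape (rank=1, in)

full = fuse(W, B, A, 20.0)    # default scale
half = fuse(W, B, A, 10.0)    # scale used for this release
```

The difference `half - W` is exactly half of `full - W`, so every fused weight moves half as far from its base value.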

Usage

mlx-lm (Apple Silicon)

from mlx_lm import load, generate

model, tok = load("ZYLIM/qwen3-4b-quickreply-lora")
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "Reply in 1 sentence, match the user's language."},
        {"role": "user", "content": "kau nk mkn p?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Qwen3 <think>...</think> still works
)
print(generate(model, tok, prompt=prompt, max_tokens=256))
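With enable_thinking=True the generated text may begin with a <think>...</think> block before the reply itself. The helper below is a sketch for removing it before showing the reply to a user; `strip_think` is a hypothetical name, and it assumes the tags appear literally in the decoded text.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
OPEN_THINK = re.compile(r"<think>.*\Z", re.DOTALL)


def strip_think(text):
    """Remove Qwen3-style reasoning blocks and return only the visible reply."""
    visible = THINK_BLOCK.sub("", text)
    # If generation hit max_tokens mid-thought, drop the unterminated block too.
    visible = OPEN_THINK.sub("", visible)
    return visible.strip()


reply = strip_think("<think>\nuser asks about food\n</think>\n\nnk mkn nasi lemak je")
```

An empty result signals that the token budget was spent inside the reasoning block, in which case raising max_tokens or disabling thinking are the obvious things to try.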

Through the ChatNow FastAPI server

QUICKREPLY_HF_MODEL=ZYLIM/qwen3-4b-quickreply-lora ./backend/serve.sh

The server exposes an OpenAI-compatible /v1/chat/completions endpoint at http://127.0.0.1:8000 (streaming and non-streaming). Qwen3 <think> mode is on.
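A minimal standard-library client for that endpoint could look like the sketch below. It assumes the server accepts the standard OpenAI chat-completions request body and returns the standard `choices[0].message.content` shape; whether the `model` field is required or what value it expects is also an assumption.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_request(messages, model="ZYLIM/qwen3-4b-quickreply-lora", stream=False):
    """Build a POST request for the OpenAI-compatible chat-completions route."""
    payload = {"model": model, "messages": messages, "stream": stream}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a non-streaming response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


req = build_request([{"role": "user", "content": "kau nk mkn p?"}])
```

With the server running, `urllib.request.urlopen(req)` sends the request and `first_reply(resp.read())` returns the reply text, which may still contain a <think> block.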

Limitations

  • LoRA targets only the top 16 transformer blocks, so deep semantic reasoning still comes from the base model rather than the fine-tune.
  • Chat short-form coverage is best for Malay and casual English; Mandarin short-forms (e.g. internet slang like xswl, nsdd) are inherited from the base only.
  • The model occasionally still echoes the question; the upstream agent (lib/agent/index.ts in the ChatNow repo) adds an explicit "do not repeat the question verbatim" rule to mitigate this.
  • Trained for chat-reply style only, not for tool use, code, or long document tasks. Use the base for those.
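The echo problem noted above can also be handled after generation. The sketch below is a post-filter, which is a different technique from the prompt rule the ChatNow agent uses; `drop_echoes` and `normalize` are hypothetical helpers that discard replies repeating the question or duplicating an earlier reply.

```python
def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return " ".join(cleaned.split())


def drop_echoes(question, replies):
    """Keep replies that neither echo the question nor repeat an earlier reply."""
    q = normalize(question)
    seen, kept = set(), []
    for reply in replies:
        r = normalize(reply)
        if not r or r == q or r in seen:
            continue
        seen.add(r)
        kept.append(reply)
    return kept


kept = drop_echoes(
    "kau nk mkn p?",
    ["Kau nk mkn p", "nasi lemak lah", "nasi lemak lah!", "ko nak apa?"],
)
```

Exact-match filtering only catches verbatim echoes; paraphrased repeats would need a similarity threshold, which is left out here.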

Project

WID3002 NLP project, Group 10, University of Malaya, Semester 2 2025/2026. Lecturer: Dr. Mohamed N. M. Lubani.

Authors: Tan Hao Wen, Lim Zi Yang (ZYLIM), Tan Shi Han, Tan Jia Le.
