Instructions to use ZYLIM/qwen3-4b-quickreply-lora with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use ZYLIM/qwen3-4b-quickreply-lora with MLX:
```python
# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("ZYLIM/qwen3-4b-quickreply-lora")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use ZYLIM/qwen3-4b-quickreply-lora with Pi:
Start the MLX server
```shell
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"
```
Configure the model in Pi
```shell
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
```
Add to ~/.pi/agent/models.json:
```json
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "ZYLIM/qwen3-4b-quickreply-lora" }
      ]
    }
  }
}
```
Run Pi
```shell
# Start Pi in your project directory:
pi
```
- Hermes Agent
How to use ZYLIM/qwen3-4b-quickreply-lora with Hermes Agent:
Start the MLX server
```shell
# Install MLX LM:
uv tool install mlx-lm

# Start a local OpenAI-compatible server:
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"
```
Configure Hermes
```shell
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ZYLIM/qwen3-4b-quickreply-lora
```
Run Hermes
```shell
hermes
```
- MLX LM
How to use ZYLIM/qwen3-4b-quickreply-lora with MLX LM:
Generate or start a chat session
```shell
# Install MLX LM
uv tool install mlx-lm

# Interactive chat REPL
mlx_lm.chat --model "ZYLIM/qwen3-4b-quickreply-lora"
```
Run an OpenAI-compatible server
```shell
# Install MLX LM
uv tool install mlx-lm

# Start the server (listens on port 8080 unless --port is given)
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"

# Call the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ZYLIM/qwen3-4b-quickreply-lora",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'
```
Qwen3-4B QuickReply LoRA (fused)
A LoRA fine-tune of Qwen/Qwen3-4B for generating short, context-aware chat
replies, trained on Apple Silicon with mlx-lm. The LoRA adapter is fused into
the base weights at half strength (scale = 10.0, versus the mlx-lm default of
20.0), so the single safetensors set is drop-in usable with mlx-lm or any HF
loader that supports Qwen3.
Built for the WID3002 NLP project (University of Malaya, Semester 2 2025/2026) as part of the ChatNow quick-reply suggestion app.
What it's for
Given a short conversation, produce 3 distinct one-liner replies that:
- Match the language of the most recent message (English / Malay / Chinese).
- Mirror chat short-forms and abbreviations (e.g. Malay `nk mkn p?` → reply in the same short-form register, not the spelled-out `nak makan apa?` form).
- Preserve particles (`lah`, `lor`, `leh`, `ya`, `eh`), code-switching, and the casual rojak mix common in Malaysian chats.
- Take different conversational moves (direct answer / clarifying question / proposal / opinion / redirect): three replies, three angles.
What's different from the base
| Aspect | Base Qwen3-4B | This fine-tune |
|---|---|---|
| Reply length | tends to over-generate (4–5× the reference length) | matches reference within 1.3–2× |
| Malay short-forms | often mis-parses (`p` read as a noun, not `apa`) | decoded and mirrored back |
| Code-switching | inconsistent; drifts to English | preserves the thread's language |
| Tone in casual chat | formal / textbook | casual, particle-aware |
| Style mirroring | none | mirrors the replier's prior register |
Performance
100-example held-out chat set, BLEU and ROUGE-L F1, 3 replies per context:
| Language | n | BLEU base → FT | ROUGE-L base → FT |
|---|---|---|---|
| Overall | 100 | 0.34 → 8.48 (×25) | 0.060 → 0.484 (×8.1) |
| English | 60 | 0.43 → 6.59 | 0.083 → 0.363 |
| Malay | 15 | 0.26 → 8.64 | 0.069 → 0.356 |
| Chinese | 25 | 0.21 → 5.82 | 0.030 → 0.869 |
The hypothesis-to-reference length ratio also drops sharply on every slice: the fine-tune stops generating long monologues and produces reply-length text instead.
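To make the two quantities above concrete, here is a minimal sketch of ROUGE-L F1 and the hypothesis-to-reference length ratio for a single reply pair. It assumes whitespace tokenization and an unweighted F1; the evaluation script behind the table is not shown in this card, so the function names and the tokenization are illustrative only, and whitespace splitting would not segment Chinese text properly.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hyp, ref):
    """ROUGE-L F1 between a generated reply and a reference reply."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not hyp_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(hyp_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def length_ratio(hyp, ref):
    """Token count of the hypothesis divided by that of the reference."""
    return len(hyp.split()) / max(len(ref.split()), 1)
```

A reply that contains the reference but pads it with extra words keeps full recall yet loses precision, which is how over-generation pulls ROUGE-L F1 down while pushing the length ratio above 1.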
Training data
Four datasets, sampled and reformatted to chat turns:
- `daily_dialog`: English casual conversation
- `bavard/personachat_truecased`: English persona-grounded chat
- `bitext/Bitext-customer-support-llm-chatbot-training-dataset`: English customer-support style short replies
- `mesolitica/malaysian-sft`: Malay / rojak Malaysian text (Bahasa Malaysia + English code-switching)
The Chinese slice of the eval set is covered only through the base model's cross-lingual transfer: no Chinese-only chat data was added during fine-tuning, which is why the Chinese gains come largely from length and particle handling rather than vocabulary.
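As a rough illustration of what "reformatted to chat turns" can look like, the sketch below converts a list of alternating utterances into one `{"messages": [...]}` record per dialogue, the chat-style JSONL layout that mlx-lm accepts for LoRA training data. The helper names and the rule of treating the final utterance as the assistant reply are assumptions made for this example; the preprocessing actually used for this model is not described in the card.

```python
import json


def to_chat_example(utterances):
    """Turn alternating utterances into one chat record.

    The final utterance becomes the assistant reply to learn from.
    Returns None when there are fewer than two usable turns.
    """
    turns = [u.strip() for u in utterances if u.strip()]
    if len(turns) < 2:
        return None
    last = len(turns) - 1
    # Assign roles backwards from the end so the last turn is "assistant".
    roles = ["assistant" if (last - i) % 2 == 0 else "user" for i in range(len(turns))]
    messages = [{"role": r, "content": t} for r, t in zip(roles, turns)]
    # Drop a leading assistant turn so every record starts with the user.
    if messages[0]["role"] == "assistant":
        messages = messages[1:]
    return {"messages": messages}


def to_jsonl(dialogues):
    """Serialize dialogues as JSONL, skipping ones that are too short."""
    records = (to_chat_example(d) for d in dialogues)
    return "\n".join(
        json.dumps(r, ensure_ascii=False) for r in records if r is not None
    )
```

Passing `ensure_ascii=False` keeps Malay and Chinese text readable in the output file rather than escaping it.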
Training config (mlx-lm LoRA)
```yaml
model: Qwen/Qwen3-4B
iters: 800
batch_size: 1
lr_schedule: cosine_decay(1e-5 → 1e-6, warmup 100)
lora_rank: 4
lora_alpha: 8
num_layers: 16  # top 16 transformer blocks only
grad_checkpoint: true
max_seq_length: 512
```
Val loss trajectory: 4.99 → 1.21 → 1.11 → 0.92 → 1.00 → 0.93 → 1.10 → 0.91.
Training was cut short near iter 700 (of the planned 800) by a Metal compute
error; the checkpoint at iter 600 was used for the fuse.
Adapter scale was patched from the mlx-lm default 20.0 down to 10.0
before fusing, halving the LoRA's influence on the base weights. This
trades a small amount of style adherence for retaining more of the base
model's reasoning, instruction-following, and multilingual coverage.
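The effect of the scale can be seen in a toy version of the fuse step. The sketch below uses the usual LoRA formulation, W_fused = W + scale · (B · A), on plain Python lists. The matrices and helper names are made up for illustration, and mlx-lm's own fuse code works on its own tensor layout rather than through this function.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def fuse(w, lora_b, lora_a, scale):
    """Return W + scale * (B @ A), the fused weight matrix."""
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


# Toy 2x2 base weight with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape: out_features x rank
A = [[0.1, 0.2]]     # shape: rank x in_features

fused_default = fuse(W, B, A, scale=20.0)  # mlx-lm default scale
fused_half = fuse(W, B, A, scale=10.0)     # scale used for this model
```

Every entry of the update at scale 10.0 is exactly half of the corresponding entry at scale 20.0, which is the sense in which the fused weights stay closer to the base model.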
Usage
mlx-lm (Apple Silicon)
```python
from mlx_lm import load, generate

model, tok = load("ZYLIM/qwen3-4b-quickreply-lora")

prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "Reply in 1 sentence, match the user's language."},
        {"role": "user", "content": "kau nk mkn p?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Qwen3 <think>...</think> still works
)

print(generate(model, tok, prompt=prompt, max_tokens=256))
```
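With thinking enabled, the generated text can include a `<think>...</think>` block before the visible reply. The following is a minimal post-processing sketch that removes that block and collects up to three replies. It assumes the model emits one reply per line, optionally with list markers; the card does not specify how the three replies are delimited, so treat `extract_replies` as a hypothetical helper to adapt.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

# Matches a leading list marker such as "1.", "2)", "-" or "*".
LIST_MARKER = re.compile(r"^\s*(?:\d+[.)]|[-*])\s*")


def extract_replies(raw, max_replies=3):
    """Strip the reasoning block and return up to max_replies distinct lines."""
    visible = THINK_BLOCK.sub("", raw)
    replies = []
    for line in visible.splitlines():
        cleaned = LIST_MARKER.sub("", line).strip()
        if cleaned and cleaned not in replies:
            replies.append(cleaned)
    return replies[:max_replies]
```

This only handles a closed `<think>` block. If generation is cut off by `max_tokens` before `</think>` appears, the reasoning text would pass through unchanged, so a caller may want to raise the token limit or check for an unclosed tag.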
Through the ChatNow FastAPI server
```shell
QUICKREPLY_HF_MODEL=ZYLIM/qwen3-4b-quickreply-lora ./backend/serve.sh
```
The server exposes an OpenAI-compatible `/v1/chat/completions` endpoint at
http://127.0.0.1:8000, supporting both streaming and non-streaming responses. Qwen3 `<think>` mode is enabled.
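A client call against this endpoint might look like the sketch below, which uses only the Python standard library. It assumes the server accepts the standard OpenAI chat-completions request body and returns the usual `choices[0].message.content` shape; the `model` value and the helper names are illustrative, since the ChatNow server's exact request handling is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # address of the ChatNow server given above


def build_chat_request(messages, model="ZYLIM/qwen3-4b-quickreply-lora", base_url=BASE_URL):
    """Build a non-streaming POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": messages, "stream": False}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(messages, timeout=60):
    """Send the request and return the assistant text (server must be running)."""
    with urllib.request.urlopen(build_chat_request(messages), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask([{"role": "user", "content": "kau nk mkn p?"}])` would return the raw completion text, which may still contain the `<think>` block.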
Limitations
- LoRA targets only the top 16 transformer blocks, so deep semantic reasoning still relies on the base model's weights rather than on the fine-tune.
- Chat short-form coverage is best for Malay and casual English; Mandarin short-forms (e.g. internet slang like `xswl`, `nsdd`) are inherited from the base only.
- The model occasionally still echoes the question; the upstream agent (`lib/agent/index.ts` in the ChatNow repo) adds an explicit "do not repeat the question verbatim" rule to mitigate this.
- Trained for chat-reply style only, not for tool use, code, or long-document tasks. Use the base model for those.
Project
WID3002 NLP project, Group 10, University of Malaya, Semester 2 2025/2026. Lecturer: Dr. Mohamed N. M. Lubani.
Authors: Tan Hao Wen, Lim Zi Yang (ZYLIM), Tan Shi Han, Tan Jia Le.