Instructions to use ZYLIM/qwen3-4b-quickreply-lora with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use ZYLIM/qwen3-4b-quickreply-lora with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("ZYLIM/qwen3-4b-quickreply-lora")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use ZYLIM/qwen3-4b-quickreply-lora with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ZYLIM/qwen3-4b-quickreply-lora"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ZYLIM/qwen3-4b-quickreply-lora with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ZYLIM/qwen3-4b-quickreply-lora

Run Hermes

hermes

MLX LM

How to use ZYLIM/qwen3-4b-quickreply-lora with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "ZYLIM/qwen3-4b-quickreply-lora"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "ZYLIM/qwen3-4b-quickreply-lora",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
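
The same request can be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the default port 8080 is an assumption taken from the Pi `baseUrl` above, so pass a different `base_url` if your server listens elsewhere.

```python
import json
import urllib.request


def build_chat_request(prompt, model="ZYLIM/qwen3-4b-quickreply-lora",
                       base_url="http://localhost:8080/v1"):
    """Build (but do not send) a POST request for the chat completions route.

    base_url is an assumption; adjust it to wherever the server is listening.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the model's reply. Building and sending are kept separate so the payload can be inspected without a live server.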

ZYLIM committed 19 days ago

Commit

29d2e07

verified ·

1 Parent(s): 89d69da

Add model card: usage, eval results, training config

Files changed (1)

README.md +156 -0

README.md ADDED

	@@ -0,0 +1,156 @@

+---
+license: apache-2.0
+license_link: https://huggingface.co/Qwen/Qwen3-4B/blob/main/LICENSE
+language:
+- en
+- ms
+- zh
+library_name: mlx
+tags:
+- mlx
+- lora
+- qwen3
+- chat
+- quick-reply
+- malay
+- code-switching
+base_model: Qwen/Qwen3-4B
+pipeline_tag: text-generation
+---
+# Qwen3-4B QuickReply LoRA (fused)
+LoRA fine-tune of [`Qwen/Qwen3-4B`](https://huggingface.co/Qwen/Qwen3-4B)
+for generating short, context-aware chat replies. Trained on Apple Silicon
+with `mlx-lm`. The LoRA adapter is fused into the base weights here at
+**50% strength** (`scale = 10.0`, half the mlx-lm default of `20.0`), so the
+single safetensors set is drop-in usable with `mlx-lm` or any HF loader that
+supports Qwen3.
+Built for the WID3002 NLP project (University of Malaya, Semester 2 2025/2026)
+as part of the **ChatNow** quick-reply suggestion app.
+## What it's for
+Given a short conversation, produce 3 distinct one-liner replies that:
+- Match the language of the most recent message (English / Malay / Chinese).
+- Mirror chat **short-forms and abbreviations** (e.g. Malay `nk mkn p?` →
+  reply in the same short-form register, not the spelled-out
+  `nak makan apa?` form).
+- Preserve particles (`lah`, `lor`, `leh`, `ya`, `eh`), code-switching, and
+  the casual rojak mix common in Malaysian chats.
+- Take **different conversational moves** (direct answer / clarifying
+  question / proposal / opinion / redirect) — three replies, three angles.
+## What's different from the base
+| Aspect | Base Qwen3-4B | This fine-tune |
+|---|---|---|
+| Reply length | tends to over-generate (4–5× the reference length) | matches reference within 1.3–2× |
+| Malay short-forms | often mis-parses (`p` read as a noun, not `apa`) | decoded and mirrored back |
+| Code-switching | inconsistent — drifts to English | preserves the thread's language |
+| Tone in casual chat | formal / textbook | casual, particle-aware |
+| Style mirroring | none | mirrors the replier's prior register |
+## Performance
+100-example held-out chat set, BLEU and ROUGE-L F1, 3 replies per context:
+| Language | n | BLEU base → FT | ROUGE-L base → FT |
+|---|---|---|---|
+| **Overall** | 100 | **0.34 → 8.48** (×25) | **0.060 → 0.484** (×8.1) |
+| English | 60 | 0.43 → 6.59 | 0.083 → 0.363 |
+| Malay | 15 | 0.26 → 8.64 | 0.069 → 0.356 |
+| Chinese | 25 | 0.21 → 5.82 | 0.030 → 0.869 |
+The hyp/ref length ratio also drops sharply on every slice — the fine-tune
+stops generating long monologues and starts producing actual reply-shaped
+text.
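
For readers who want to reproduce the two reply-level measures mentioned above, here is a minimal sketch of ROUGE-L F1 and the hyp/ref length ratio. The card does not say which scorer or tokenizer produced the reported numbers, so the whitespace tokenization and the function names here are assumptions. Whitespace splitting in particular is unsuitable for Chinese, which would need character-level or segmenter-based tokens.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hyp, ref):
    """ROUGE-L F1 over whitespace tokens (assumed tokenization)."""
    h, r = hyp.split(), ref.split()
    lcs = lcs_length(h, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(h)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def length_ratio(hyp, ref):
    """Hypothesis length divided by reference length, in tokens."""
    return len(hyp.split()) / max(len(ref.split()), 1)
```

A ratio well above 1 flags the over-generation described for the base model, while ROUGE-L rewards replies that share an in-order token subsequence with the reference.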
+## Training data
+Four datasets, sampled and reformatted to chat turns:
+- `daily_dialog` — English casual conversation
+- `bavard/personachat_truecased` — English persona-grounded chat
+- `bitext/Bitext-customer-support-llm-chatbot-training-dataset` — English
+  customer-support style short replies
+- `mesolitica/malaysian-sft` — Malay / rojak Malaysian text (Bahasa
+  Malaysia + English code-switching)
+Results on the Chinese slice of the eval set rely on the base model's
+cross-lingual transfer; no zh-only chat data was added during fine-tuning,
+which is why zh gains are largely about length and particle handling
+rather than vocabulary.
+## Training config (mlx-lm LoRA)
+```yaml
+model: Qwen/Qwen3-4B
+iters: 800
+batch_size: 1
+lr_schedule: cosine_decay(1e-5 → 1e-6, warmup 100)
+lora_rank: 4
+lora_alpha: 8
+num_layers: 16          # top 16 transformer blocks only
+grad_checkpoint: true
+max_seq_length: 512
+```
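
The `lr_schedule` line in the config is shorthand. As a rough illustration of the curve it describes, the sketch below computes a learning rate per iteration, assuming a linear warmup from 0 over the first 100 iterations followed by cosine decay that reaches the floor at iteration 800. Both assumptions and the function name are mine; the exact warmup start value and decay horizon used by the mlx-lm run may differ.

```python
import math


def learning_rate(step, peak=1e-5, floor=1e-6, warmup=100, total=800):
    """Linear warmup to `peak`, then cosine decay down to `floor`.

    Assumes warmup starts at 0 and decay ends at `total` iterations.
    """
    if step < warmup:
        return peak * step / warmup
    progress = min((step - warmup) / (total - warmup), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


for step in (0, 50, 100, 450, 800):
    print(step, learning_rate(step))
```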
+Val loss trajectory: `4.99 → 1.21 → 1.11 → 0.92 → 1.00 → 0.93 → 1.10 → 0.91`
+(the run halted near iter 700 on a Metal compute error; the checkpoint at
+iter 600 was used for the fuse).
+Adapter scale was patched from the mlx-lm default `20.0` down to `10.0`
+before fusing, halving the LoRA's influence on the base weights. This
+trades a small amount of style adherence for retaining more of the base
+model's reasoning, instruction-following, and multilingual coverage.
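
To see why halving the scale halves the adapter's contribution, here is a toy sketch of fusing under the usual LoRA formulation `W' = W + scale * (A @ B)`. The matrices are made-up 2×2 values at rank 1 and the helper names are hypothetical; this is not the mlx-lm fuse code, whose matrix layout may be transposed relative to this sketch.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def fuse(weight, lora_a, lora_b, scale):
    """Return weight + scale * (lora_a @ lora_b), element by element."""
    delta = matmul(lora_a, lora_b)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy values chosen to be exact in binary floating point.
weight = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5], [0.25]]       # shape (in_features, rank)
lora_b = [[0.125, -0.125]]     # shape (rank, out_features)

fused_default = fuse(weight, lora_a, lora_b, scale=20.0)
fused_half = fuse(weight, lora_a, lora_b, scale=10.0)
print(fused_default)
print(fused_half)
```

Each entry of `fused_half` sits exactly halfway between the base weight and the corresponding entry of `fused_default`.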
+## Usage
+### mlx-lm (Apple Silicon)
+```python
+from mlx_lm import load, generate
+model, tok = load("ZYLIM/qwen3-4b-quickreply-lora")
+prompt = tok.apply_chat_template(
+    [
+        {"role": "system", "content": "Reply in 1 sentence, match the user's language."},
+        {"role": "user", "content": "kau nk mkn p?"},
+    ],
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=True,  # Qwen3 <think>...</think> still works
+)
+print(generate(model, tok, prompt=prompt, max_tokens=256))
+```
+### Through the ChatNow FastAPI server
+```bash
+QUICKREPLY_HF_MODEL=ZYLIM/qwen3-4b-quickreply-lora ./backend/serve.sh
+```
+The server exposes an OpenAI-compatible `/v1/chat/completions` endpoint at
+`http://127.0.0.1:8000` (streaming and non-streaming). Qwen3 `<think>` mode is on.
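
Because `<think>` mode is on, a client that only wants the visible reply may need to drop the reasoning block first. The sketch below assumes the reasoning arrives inline in the reply text, wrapped in `<think>...</think>`. Whether the ChatNow server returns it that way or in a separate field is not stated here, and `strip_think` is a hypothetical helper name.

```python
import re

# Non-greedy match so several think blocks are removed independently.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text):
    """Remove <think>...</think> blocks and return the visible reply."""
    visible = _THINK_BLOCK.sub("", text)
    # An unclosed <think> means generation was cut off mid-reasoning,
    # so nothing after the tag is a usable reply.
    visible = visible.split("<think>", 1)[0]
    return visible.strip()
```

If the token budget runs out before `</think>` is emitted, the helper returns an empty string, which a caller can treat as a signal to retry with a larger `max_tokens`.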
+## Limitations
+- LoRA targets only the **top 16 transformer blocks**; the lower blocks are
+  unchanged, so deep semantic reasoning still comes from the base model,
+  not the fine-tune.
+- Chat short-form coverage is best for Malay and casual English; Mandarin
+  short-forms (e.g. internet slang like `xswl`, `nsdd`) are inherited from
+  the base only.
+- The model occasionally still echoes the question; the upstream agent
+  (`lib/agent/index.ts` in the ChatNow repo) adds an explicit "do not
+  repeat the question verbatim" rule to mitigate.
+- Trained for **chat-reply style only**, not for tool use, code, or long
+  document tasks. Use the base for those.
+## Project
+WID3002 NLP project, Group 10, University of Malaya, Semester 2 2025/2026.
+Lecturer: Dr. Mohamed N. M. Lubani.
+Authors: Tan Hao Wen, Lim Zi Yang (`ZYLIM`), Tan Shi Han, Tan Jia Le.