ZYLIM committed
Commit 29d2e07 · verified · 1 Parent(s): 89d69da

Add model card: usage, eval results, training config

  1. README.md +156 -0
README.md ADDED
@@ -0,0 +1,156 @@
+ ---
+ license: apache-2.0
+ license_link: https://huggingface.co/Qwen/Qwen3-4B/blob/main/LICENSE
+ language:
+ - en
+ - ms
+ - zh
+ library_name: mlx
+ tags:
+ - mlx
+ - lora
+ - qwen3
+ - chat
+ - quick-reply
+ - malay
+ - code-switching
+ base_model: Qwen/Qwen3-4B
+ pipeline_tag: text-generation
+ ---
+
+ # Qwen3-4B QuickReply LoRA (fused)
+
+ LoRA fine-tune of [`Qwen/Qwen3-4B`](https://huggingface.co/Qwen/Qwen3-4B)
+ for generating short, context-aware chat replies. Trained on Apple Silicon
+ with `mlx-lm`. The LoRA adapter is fused into the base weights here at
+ **50% strength** (`scale = 10.0`, half the mlx-lm default of `20.0`), so the
+ single safetensors set is drop-in usable with `mlx-lm` or any HF loader that
+ supports Qwen3.
+
+ Built for the WID3002 NLP project (University of Malaya, Semester 2 2025/2026)
+ as part of the **ChatNow** quick-reply suggestion app.
+
+ ## What it's for
+
+ Given a short conversation, produce 3 distinct one-liner replies that:
+
+ - Match the language of the most recent message (English / Malay / Chinese).
+ - Mirror chat **short-forms and abbreviations** (e.g. Malay `nk mkn p?` →
+   reply in the same short-form register, not the spelled-out
+   `nak makan apa?` form).
+ - Preserve particles (`lah`, `lor`, `leh`, `ya`, `eh`), code-switching, and
+   the casual rojak mix common in Malaysian chats.
+ - Take **different conversational moves** (direct answer / clarifying
+   question / proposal / opinion / redirect): three replies, three angles.
+
+ ## What's different from the base
+
+ | Aspect | Base Qwen3-4B | This fine-tune |
+ |---|---|---|
+ | Reply length | tends to over-generate (4–5× the reference length) | matches reference within 1.3–2× |
+ | Malay short-forms | often mis-parses (`p` read as a noun, not `apa`) | decoded and mirrored back |
+ | Code-switching | inconsistent, drifts to English | preserves the thread's language |
+ | Tone in casual chat | formal / textbook | casual, particle-aware |
+ | Style mirroring | none | mirrors the replier's prior register |
+
+ ## Performance
+
+ 100-example held-out chat set, BLEU and ROUGE-L F1, 3 replies per context:
+
+ | Language | n | BLEU base → FT | ROUGE-L base → FT |
+ |---|---|---|---|
+ | **Overall** | 100 | **0.34 → 8.48** (×25) | **0.060 → 0.484** (×8.1) |
+ | English | 60 | 0.43 → 6.59 | 0.083 → 0.363 |
+ | Malay | 15 | 0.26 → 8.64 | 0.069 → 0.356 |
+ | Chinese | 25 | 0.21 → 5.82 | 0.030 → 0.869 |
+
+ The hyp/ref length ratio also drops sharply on every slice: the fine-tune
+ stops generating long monologues and starts producing actual reply-shaped
+ text.
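For readers who want to sanity-check numbers of this kind, the following is a minimal sketch of ROUGE-L F1 and the hyp/ref length ratio. It assumes whitespace tokenization and uses illustrative function names (`lcs_length`, `rouge_l_f1`, `length_ratio`); the card does not state which scorer or tokenizer produced the reported figures, and Chinese text would need character or word segmentation instead.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hyp, ref):
    """ROUGE-L F1 on whitespace tokens (tokenization is an assumption)."""
    h, r = hyp.split(), ref.split()
    if not h or not r:
        return 0.0
    lcs = lcs_length(h, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(h)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def length_ratio(hyp, ref):
    """Hypothesis length divided by reference length, in tokens."""
    return len(hyp.split()) / max(len(ref.split()), 1)
```

A ratio well above 1 means the hypothesis is longer than the reference, which is the over-generation pattern the table attributes to the base model.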
+
+ ## Training data
+
+ Four datasets, sampled and reformatted to chat turns:
+
+ - `daily_dialog`: English casual conversation
+ - `bavard/personachat_truecased`: English persona-grounded chat
+ - `bitext/Bitext-customer-support-llm-chatbot-training-dataset`: English
+   customer-support style short replies
+ - `mesolitica/malaysian-sft`: Malay / rojak Malaysian text (Bahasa
+   Malaysia + English code-switching)
+
+ The Chinese slice in the eval set is reached via the base model's
+ cross-lingual transfer; no zh-only chat data was added during fine-tuning,
+ which is why zh gains are largely about length and particle handling
+ rather than vocabulary.
+
+ ## Training config (mlx-lm LoRA)
+
+ ```yaml
+ model: Qwen/Qwen3-4B
+ iters: 800
+ batch_size: 1
+ lr_schedule: cosine_decay(1e-5 → 1e-6, warmup 100)
+ lora_rank: 4
+ lora_alpha: 8
+ num_layers: 16 # top 16 transformer blocks only
+ grad_checkpoint: true
+ max_seq_length: 512
+ ```
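The `lr_schedule` line in the config is shorthand, so here is a rough sketch of what such a schedule could look like. It assumes a linear warmup from 0 over the first 100 steps and a cosine decay over the remaining 700; the function name `lr_at` is illustrative, and mlx-lm's own scheduler may differ in details such as the warmup starting value.

```python
import math


def lr_at(step, peak=1e-5, floor=1e-6, warmup=100, total=800):
    """Learning rate at `step`: linear warmup to `peak`, cosine decay to `floor`.

    The warmup shape and decay span are assumptions, not taken from the run.
    """
    if step < warmup:
        return peak * step / warmup
    progress = min((step - warmup) / max(total - warmup, 1), 1.0)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 50, 100, 450, 600, 800):
        print(s, f"{lr_at(s):.2e}")
```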
+
+ Val loss trajectory: `4.99 → 1.21 → 1.11 → 0.92 → 1.00 → 0.93 → 1.10 → 0.91`.
+ Training stopped near iter 700 because of a Metal compute error, short of the
+ configured 800 iterations; the checkpoint at iter 600 was used for the fuse.
+
+ Adapter scale was patched from the mlx-lm default `20.0` down to `10.0`
+ before fusing, halving the LoRA's influence on the base weights. This
+ trades a small amount of style adherence for retaining more of the base
+ model's reasoning, instruction-following, and multilingual coverage.
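To illustrate why halving the scale halves the adapter's contribution, here is a toy sketch. It assumes the fuse adds `scale` times the low-rank product to each adapted weight matrix; the matrices, shapes, and helper names (`matmul`, `fuse`) are made up for illustration and do not reproduce mlx-lm's internal layout.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(w, lora_b, lora_a, scale):
    """Return w + scale * (lora_b @ lora_a); a sketch of a LoRA fuse step."""
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Toy values only: a 2x2 base weight and a rank-1 adapter.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[0.01], [0.02]]  # (out x rank)
lora_a = [[1.0, 2.0]]      # (rank x in)

full = fuse(w, lora_b, lora_a, scale=20.0)  # default scale named in the card
half = fuse(w, lora_b, lora_a, scale=10.0)  # scale used for this fuse
```

Because the update is linear in `scale`, every entry of `half` sits exactly midway between the base weight and the corresponding entry of `full`.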
+
+ ## Usage
+
+ ### mlx-lm (Apple Silicon)
+
+ ```python
+ from mlx_lm import load, generate
+
+ model, tok = load("ZYLIM/qwen3-4b-quickreply-lora")
+ prompt = tok.apply_chat_template(
+     [
+         {"role": "system", "content": "Reply in 1 sentence, match the user's language."},
+         {"role": "user", "content": "kau nk mkn p?"},
+     ],
+     tokenize=False,
+     add_generation_prompt=True,
+     enable_thinking=True,  # Qwen3 <think>...</think> still works
+ )
+ print(generate(model, tok, prompt=prompt, max_tokens=256))
+ ```
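With thinking enabled, the generated text may include a `<think>...</think>` span ahead of the actual reply. A small post-processing helper along these lines could strip it before the reply is shown; `strip_thinking` is a hypothetical name, and the sketch assumes the tags appear verbatim in the decoded output.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text):
    """Remove <think>...</think> spans and return the remaining reply text."""
    text = THINK_BLOCK.sub("", text)
    # If generation was cut off mid-thought, drop the unterminated block too.
    text = text.split("<think>", 1)[0]
    return text.strip()
```

The second step matters when `max_tokens` is reached before the closing tag, in which case nothing after `<think>` is usable as a reply.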
+
+ ### Through the ChatNow FastAPI server
+
+ ```bash
+ QUICKREPLY_HF_MODEL=ZYLIM/qwen3-4b-quickreply-lora ./backend/serve.sh
+ ```
+
+ The server exposes an OpenAI-compatible `/v1/chat/completions` at
+ `http://127.0.0.1:8000` (streaming and non-streaming). Qwen3 `<think>` mode is on.
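A client call against that endpoint might look like the sketch below, which uses only the standard library. The request fields follow the usual OpenAI chat-completions shape; whether the ChatNow server requires the `model` field, and which value it expects, is an assumption, as are the helper names `build_chat_request` and `send`.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_chat_request(messages, model="ZYLIM/qwen3-4b-quickreply-lora", max_tokens=256):
    """Build a non-streaming POST request for /v1/chat/completions."""
    payload = {
        "model": model,  # assumed; the server may ignore or override this
        "messages": messages,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires the server from the command above to be running locally.
    req = build_chat_request([{"role": "user", "content": "kau nk mkn p?"}])
    print(send(req))
```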
+
+ ## Limitations
+
+ - LoRA targets only the **top 16 transformer blocks**, so deep semantic
+   reasoning still comes from the base model rather than the fine-tune.
+ - Chat short-form coverage is best for Malay and casual English; Mandarin
+   short-forms (e.g. internet slang like `xswl`, `nsdd`) are inherited from
+   the base only.
+ - The model occasionally still echoes the question; the upstream agent
+   (`lib/agent/index.ts` in the ChatNow repo) adds an explicit "do not
+   repeat the question verbatim" rule to mitigate this.
+ - Trained for **chat-reply style only**, not for tool use, code, or long
+   document tasks. Use the base for those.
+
+ ## Project
+
+ WID3002 NLP project, Group 10, University of Malaya, Semester 2 2025/2026.
+ Lecturer: Dr. Mohamed N. M. Lubani.
+
+ Authors: Tan Hao Wen, Lim Zi Yang (`ZYLIM`), Tan Shi Han, Tan Jia Le.