Instructions for using ZYLIM/qwen3-4b-quickreply-lora with libraries and local apps.

Libraries

How to use ZYLIM/qwen3-4b-quickreply-lora with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("ZYLIM/qwen3-4b-quickreply-lora")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use ZYLIM/qwen3-4b-quickreply-lora with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "ZYLIM/qwen3-4b-quickreply-lora"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use ZYLIM/qwen3-4b-quickreply-lora with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default ZYLIM/qwen3-4b-quickreply-lora

Run Hermes

hermes

OpenClaw

How to use ZYLIM/qwen3-4b-quickreply-lora with OpenClaw:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "ZYLIM/qwen3-4b-quickreply-lora" \
  --custom-provider-id mlx-lm \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

MLX LM

How to use ZYLIM/qwen3-4b-quickreply-lora with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "ZYLIM/qwen3-4b-quickreply-lora"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "ZYLIM/qwen3-4b-quickreply-lora"
# Call the OpenAI-compatible server with curl (mlx_lm.server listens on port 8080 by default)
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "ZYLIM/qwen3-4b-quickreply-lora",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
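
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server started above is listening on mlx_lm.server's default port 8080; `build_request` and `chat` are illustrative helper names, not part of mlx-lm.

```python
import json
import urllib.request

MODEL_ID = "ZYLIM/qwen3-4b-quickreply-lora"
# Assumption: mlx_lm.server is running locally on its default port (8080).
BASE_URL = "http://localhost:8080/v1"


def build_request(user_text, model=MODEL_ID, base_url=BASE_URL):
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("Hello"))
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.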

Qwen3-4B QuickReply LoRA (fused)

LoRA fine-tune of Qwen/Qwen3-4B for generating short, context-aware chat replies. Trained on Apple Silicon with mlx-lm. The LoRA adapter is fused into the base weights at half of mlx-lm's default adapter scale (10.0 instead of 20.0), so the single safetensors set is drop-in usable with mlx-lm or any HF loader that supports Qwen3.

Built for the WID3002 NLP project (University of Malaya, Semester 2 2025/2026) as part of the ChatNow quick-reply suggestion app.

What it's for

Given a short conversation, produce 3 distinct one-liner replies that:

Match the language of the most recent message (English / Malay / Chinese).
Mirror chat short-forms and abbreviations (e.g. Malay nk mkn p? → reply in the same short-form register, not the spelled-out nak makan apa? form).
Preserve particles (lah, lor, leh, ya, eh), code-switching, and the casual rojak mix common in Malaysian chats.
Take different conversational moves (direct answer / clarifying question / proposal / opinion / redirect) — three replies, three angles.

What's different from the base

Aspect	Base Qwen3-4B	This fine-tune
Reply length	tends to over-generate (4–5× the reference length)	matches reference within 1.3–2×
Malay short-forms	often mis-parses (`p` read as a noun, not `apa`)	decoded and mirrored back
Code-switching	inconsistent — drifts to English	preserves the thread's language
Tone in casual chat	formal / textbook	casual, particle-aware
Style mirroring	none	mirrors the replier's prior register

Performance

Evaluated on a 100-example held-out chat set with BLEU and ROUGE-L F1, 3 replies per context:

Language	n	BLEU base → FT	ROUGE-L base → FT
Overall	100	0.34 → 8.48 (×25)	0.060 → 0.484 (×8.1)
English	60	0.43 → 6.59	0.083 → 0.363
Malay	15	0.26 → 8.64	0.069 → 0.356
Chinese	25	0.21 → 5.82	0.030 → 0.869

The hyp/ref length ratio also drops sharply on every slice — the fine-tune stops generating long monologues and starts producing actual reply-shaped text.
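
For readers who want to reproduce the two length-sensitive measurements, here is a rough sketch of ROUGE-L F1 and the hyp/ref length ratio. It is not the scorer used for the table (the card does not name one). It assumes whitespace tokenization, which would need replacing with character-level or segmenter-based tokens for Chinese, and the example strings are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(hyp, ref):
    """ROUGE-L F1: balanced F-measure over LCS precision and recall."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    lcs = lcs_length(hyp_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def length_ratio(hyp, ref):
    """Hypothesis length divided by reference length, in tokens."""
    return len(hyp.split()) / max(len(ref.split()), 1)


# Illustrative strings only, not drawn from the eval set.
ref = "nk mkn nasi lemak"
long_hyp = "I would like to eat nasi lemak today because it is delicious"
short_hyp = "nk mkn nasi lemak lah"

print(round(rouge_l_f1(long_hyp, ref), 3), length_ratio(long_hyp, ref))    # 0.25 3.0
print(round(rouge_l_f1(short_hyp, ref), 3), length_ratio(short_hyp, ref))  # 0.889 1.25
```

The long hypothesis contains the right words but is penalized on precision, which is how over-generation lowers ROUGE-L F1 even when the content overlaps.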

Training data

Four datasets, sampled and reformatted to chat turns:

daily_dialog — English casual conversation
bavard/personachat_truecased — English persona-grounded chat
bitext/Bitext-customer-support-llm-chatbot-training-dataset — English customer-support style short replies
mesolitica/malaysian-sft — Malay / rojak Malaysian text (Bahasa Malaysia + English code-switching)

The Chinese slice in the eval set relies on the base model's cross-lingual transfer; no zh-only chat data was added during fine-tuning, which is why zh gains are largely about length and particle handling rather than vocabulary.

Training config (mlx-lm LoRA)

model: Qwen/Qwen3-4B
iters: 800
batch_size: 1
lr_schedule: cosine_decay(1e-5 → 1e-6, warmup 100)
lora_rank: 4
lora_alpha: 8
num_layers: 16          # top 16 transformer blocks only
grad_checkpoint: true
max_seq_length: 512
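
To make the `lr_schedule` entry concrete, here is a sketch of the learning rate at each iteration. It assumes the warmup ramps linearly from 0 and the cosine decay spans the remaining 700 iterations; the config line does not spell out either detail, and `learning_rate` is an illustrative name.

```python
import math

PEAK_LR = 1e-5      # from the config above
FINAL_LR = 1e-6     # from the config above
WARMUP_STEPS = 100  # from the config above
TOTAL_ITERS = 800   # from the config above


def learning_rate(step):
    """Linear warmup followed by cosine decay (assumed shape, see lead-in)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup ramps linearly from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    decay_steps = TOTAL_ITERS - WARMUP_STEPS
    progress = min(step - WARMUP_STEPS, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for step in (0, 100, 450, 600, 800):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at 1e-5 at iter 100 and reaches 1e-6 at iter 800.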

Val loss trajectory: 4.99 → 1.21 → 1.11 → 0.92 → 1.00 → 0.93 → 1.10 → 0.91. Training stopped near iter 700 of the planned 800 because of a Metal compute error; the checkpoint at iter 600 was used for the fuse.

Adapter scale was patched from the mlx-lm default 20.0 down to 10.0 before fusing, halving the LoRA's influence on the base weights. This trades a small amount of style adherence for retaining more of the base model's reasoning, instruction-following, and multilingual coverage.
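
The effect of the scale change can be seen on a toy layer. This sketch assumes the usual LoRA merge, fused = W + scale · (B @ A), with the scale applied as a direct multiplier. The matrices are made-up values and `fuse` is an illustrative helper, not the mlx-lm fuse code.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def fuse(weight, lora_b, lora_a, scale):
    """Return weight + scale * (lora_b @ lora_a), the usual LoRA merge."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter (all values are made up for illustration).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.01], [0.02]]   # shape (2, 1)
A = [[0.5, -0.5]]      # shape (1, 2)

fused_20 = fuse(W, B, A, scale=20.0)
fused_10 = fuse(W, B, A, scale=10.0)

# Change each merge makes to the top-left weight.
shift_20 = fused_20[0][0] - W[0][0]
shift_10 = fused_10[0][0] - W[0][0]
print(round(shift_20, 6), round(shift_10, 6))
```

Because the update is linear in the scale, every weight moves exactly half as far from the base at 10.0 as it would at 20.0.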

Usage

mlx-lm (Apple Silicon)

from mlx_lm import load, generate

model, tok = load("ZYLIM/qwen3-4b-quickreply-lora")
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "Reply in 1 sentence, match the user's language."},
        {"role": "user", "content": "kau nk mkn p?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # Qwen3 <think>...</think> still works
)
print(generate(model, tok, prompt=prompt, max_tokens=256))
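
With thinking enabled, the generated text includes the reasoning ahead of the reply. Below is a small sketch for keeping only the reply, assuming the reasoning is wrapped in `<think>...</think>` as in Qwen3's chat format. `strip_thinking` is an illustrative helper, and the sample string is made up.

```python
import re

# Matches a closed <think>...</think> block, or an unclosed one running to the
# end of the text (possible if generation hits max_tokens mid-reasoning).
THINK_BLOCK = re.compile(r"<think>.*?(?:</think>|\Z)", flags=re.DOTALL)


def strip_thinking(text):
    """Remove <think> blocks and surrounding whitespace from generated text."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nUser asks what to eat, in Malay short-form.\n</think>\n\nnasi lemak jom?"
print(strip_thinking(raw))  # nasi lemak jom?
```

An empty result means the token budget ran out before the reply started, in which case raising `max_tokens` or setting `enable_thinking=False` is worth trying.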

Through the ChatNow FastAPI server

QUICKREPLY_HF_MODEL=ZYLIM/qwen3-4b-quickreply-lora ./backend/serve.sh

The server exposes an OpenAI-compatible /v1/chat/completions endpoint at http://127.0.0.1:8000 (streaming and non-streaming). Qwen3 <think> mode is enabled.

Limitations

LoRA targets only the top 16 transformer blocks, so deep semantic reasoning still comes from the base model rather than the fine-tune.
Chat short-form coverage is best for Malay and casual English; Mandarin short-forms (internet slang such as xswl, nsdd) are handled only as well as the base model handles them.
The model occasionally still echoes the question; the upstream agent (lib/agent/index.ts in the ChatNow repo) adds an explicit "do not repeat the question verbatim" rule to mitigate this.
Trained for chat-reply style only, not for tool use, code, or long document tasks. Use the base for those.
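
Callers that do not use the ChatNow agent could guard against echoed questions after generation instead. The sketch below is a hypothetical post-filter, not the prompt rule in lib/agent/index.ts. `normalize` and `drop_echoes` are illustrative names, and the sample replies are made up.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation and extra whitespace for loose comparison."""
    return " ".join(re.findall(r"\w+", text.lower()))


def drop_echoes(question, replies):
    """Keep replies that are not a verbatim (normalized) repeat of the question."""
    q = normalize(question)
    return [r for r in replies if normalize(r) != q]


candidates = ["Kau nk mkn p?", "nasi lemak jom", "tak lapar lagi lah"]
print(drop_echoes("kau nk mkn p?", candidates))
# ['nasi lemak jom', 'tak lapar lagi lah']
```

This only catches exact repeats after normalization; partial echoes would need a similarity threshold.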