How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
# Run inference directly in the terminal:
llama-cli -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
# Run inference directly in the terminal:
llama-cli -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
# Run inference directly in the terminal:
./llama-cli -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
Use Docker
docker model run hf.co/VladHong/Qwen3-4B-Instruct-Lewis:Q5_K_M
Quick Links

Qwen3-4B Instruct Lewis

⚠️ Toy model — not intended for serious or production use. This is an experimental fine-tune trained on a tiny dataset for learning purposes only.

Finetuned from Unsloth/Qwen3-4B-Instruct-2507 using QLoRA + Unsloth on the VladHong/Lewis_Instruct dataset.

Example Conversation

User: What should I do with a talking rabbit?

qwen3-4b-lewis: I don't know, but I think it's time to go.

User: Why?

qwen3-4b-lewis: Because I'm afraid the rabbit will tell the Queen about us!

Training Data

Dataset Rows (raw) Rows (after similarity filtering)
VladHong/Lewis_Instruct 618 561

Similarity filtering used a 0.3 Jaccard threshold. <think> blocks were stripped from all assistant turns before training.

Training Details

Parameter Value
Method QLoRA (4-bit NF4) + Unsloth
LoRA rank 16
LoRA alpha 16
Epochs 1
Steps 71
Batch size 2 per device × 4 gradient accumulation = 8 effective
Learning rate 1e-4 (cosine schedule)
Max seq length 2048
Optimizer AdamW 8-bit
Hardware Tesla T4 (14.56 GB VRAM)
Training time ~39.85 min
Trainable params 33M / 4.05B (0.81%)
Peak VRAM ~4.18 GB

Training used train_on_responses_only — loss computed on assistant completions only.

License Note

Base model is Apache 2.0. Review upstream dataset terms before any use beyond personal experimentation.

Downloads last month
-
GGUF
Model size
4B params
Architecture
qwen3
Hardware compatibility
Log In to add your hardware

5-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train VladHong/Qwen3-4B-Instruct-Lewis