How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="continker/Qwen3.5-4B-metro-v24",
	filename="Qwen3.5-4B-metro-v24-Q4_K_M.gguf",
)
llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Qwen3.5-4B-metro-v24

QLoRA fine-tune of Qwen3.5-4B for the MetroLLM-Bench transit-kiosk task. v24 is the leakage-free retraining used in the MetroLLM-Bench paper (teacher traces drawn only from the 717-case training partition; 238 cases held out). Supersedes continker/Qwen3.5-4B-metro-v23.

Held-out results (n=238, mean of 2 seeds)

Metric 4B base 4B + v24 ฮ”
Tier-1 89.32 91.32 +2.00
Composite 87.25 89.12 +1.87

The efficiency headline of the paper: at 2.6 GB Q4_K_M, the 4B student ties GPT-5.4 full at maximum reasoning effort on held-out tier-1 (91.32 vs 91.37).

Contents

  • adapter/ โ€” LoRA adapter (rank 16, ฮฑ 32; QLoRA 4-bit NF4) + tokenizer + chat template
  • Qwen3.5-4B-metro-v24-Q4_K_M.gguf โ€” merged GGUF (2.6 GB)
  • training_summary.json

GGUF: `llama-server --hf-repo continker/Qwen3.5-4B-metro-v24 --hf-file Qwen3.5-4B-metro-v24-Q4_K_M.gguf`. The LoRA adapter keys use the `.language_model.` prefix; strip it to load onto text-only Qwen3.5-4B. Apache 2.0.

Downloads last month
11
GGUF
Model size
4B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for continker/Qwen3.5-4B-metro-v24

Finetuned
Qwen/Qwen3.5-4B
Adapter
(258)
this model