How to use from the
Use from the
llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="continker/Qwen3.5-2B-metro-v24",
	filename="Qwen3.5-2B-metro-v24-Q4_K_M.gguf",
)
llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Qwen3.5-2B-metro-v24

QLoRA fine-tune of Qwen3.5-2B for the MetroLLM-Bench transit-kiosk task. v24 is the leakage-free retraining used in the MetroLLM-Bench paper (teacher traces drawn only from the 717-case training partition; 238 cases held out). Supersedes continker/Qwen3.5-2B-metro-v23.

Held-out results (n=238, mean of 2 seeds)

Metric 2B base 2B + v24 ฮ”
Tier-1 74.17 79.43 +5.26
Composite 71.90 77.93 +6.03

The largest relative PEFT gain on the size curve. At 1.2 GB Q4_K_M it runs on a fanless laptop (~39 tok/s sustained on an M2 Air).

Contents

  • adapter/ โ€” LoRA adapter (rank 16, ฮฑ 32; QLoRA 4-bit NF4) + tokenizer + chat template
  • Qwen3.5-2B-metro-v24-Q4_K_M.gguf โ€” merged GGUF (1.2 GB)
  • training_summary.json

GGUF: `llama-server --hf-repo continker/Qwen3.5-2B-metro-v24 --hf-file Qwen3.5-2B-metro-v24-Q4_K_M.gguf`. The LoRA adapter keys use the `.language_model.` prefix; strip it to load onto text-only Qwen3.5-2B. Apache 2.0.

Downloads last month
25
GGUF
Model size
2B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for continker/Qwen3.5-2B-metro-v24

Finetuned
Qwen/Qwen3.5-2B
Adapter
(94)
this model