How to use from the llama-cpp-python library
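# !pip install llama-cpp-python huggingface-hub

from llama_cpp import Llama

# Download a GGUF file from the Hugging Face Hub and load it.
llm = Llama.from_pretrained(
    repo_id="thwrhrt/Qwen3.5-0.8B-Base-Quant",
    # Assumption: glob for the recommended Q4_K_M file; the original snippet
    # left this empty. Adjust it to the actual file name in the repository.
    filename="*Q4_K_M.gguf",
)
response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?"
        }
    ]
)
# The reply text is under choices[0].message.content in the returned dict.
print(response["choices"][0]["message"]["content"])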
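
The call above returns a dictionary rather than printing anything. A minimal sketch of pulling the generated text out of it, assuming the OpenAI-style response layout that llama-cpp-python documents for `create_chat_completion`; the helper name `extract_reply` and the sample dictionary are hypothetical, for illustration only:

```python
from typing import Any, Dict


def extract_reply(response: Dict[str, Any]) -> str:
    """Return the assistant text from a chat-completion response dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    # Each choice carries a message dict with "role" and "content" keys.
    return choices[0]["message"]["content"].strip()


# Hypothetical response used only to show the shape; not real model output.
sample_response = {
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": " Paris. "}}
    ]
}
print(extract_reply(sample_response))
```

With a loaded model this would be called as `extract_reply(llm.create_chat_completion(messages=[...]))`. Since this repository holds a base model rather than an instruction-tuned one, chat-style answers may be less reliable than plain text completion.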

Qwen3.5-0.8B GGUF Quantizations

Converted from: Qwen/Qwen3.5-0.8B-Base

Quantizations

  • Q2_K → smallest, fastest
  • Q3_K_M → balanced
  • Q4_K_M → recommended
  • Q5_K_M → highest quality

Recommended: Q4_K_M (best balance of speed and quality)
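
To load one specific quantization, the `filename` argument of `Llama.from_pretrained` accepts a glob pattern as well as an exact file name. The sketch below builds such a pattern and checks it against a list of file names. The names in `example_files` and the helper names are hypothetical, since the actual file names in the repository are not stated here.

```python
from fnmatch import fnmatch
from typing import List

# Quantization types listed in this model card.
QUANT_TYPES = ("Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M")


def quant_pattern(quant: str) -> str:
    """Build a glob pattern that matches GGUF files of one quantization type."""
    if quant not in QUANT_TYPES:
        raise ValueError(f"unknown quantization: {quant}")
    return f"*{quant}.gguf"


def matching_files(files: List[str], quant: str) -> List[str]:
    """Return the file names that match the pattern for `quant`."""
    pattern = quant_pattern(quant)
    return [name for name in files if fnmatch(name, pattern)]


# Hypothetical file names, for illustration only.
example_files = [
    "model-Q2_K.gguf",
    "model-Q4_K_M.gguf",
    "model-Q5_K_M.gguf",
    "README.md",
]
print(matching_files(example_files, "Q4_K_M"))
```

The resulting pattern could then be passed as `filename=quant_pattern("Q4_K_M")`, provided it matches a single file in the repository.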

Approximate file sizes

  • Q2_K → ~200 MB
  • Q4_K_M → ~500 MB
  • Q5_K_M → ~650 MB
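
The size and quality trade-off above can be turned into a simple selection rule: pick the largest listed quantization whose file fits a given size budget. A minimal sketch, using only the approximate sizes stated above (no size is given for Q3_K_M, so it is left out); the function name `pick_quant` and the budget values are hypothetical:

```python
from typing import Optional

# Approximate file sizes in MB, as listed in this model card.
APPROX_SIZE_MB = {
    "Q2_K": 200,
    "Q4_K_M": 500,
    "Q5_K_M": 650,
}


def pick_quant(budget_mb: float) -> Optional[str]:
    """Return the largest listed quantization that fits in `budget_mb`, or None."""
    fitting = [
        (size, name) for name, size in APPROX_SIZE_MB.items() if size <= budget_mb
    ]
    if not fitting:
        return None
    # Larger files correspond to higher-bit, higher-quality quantizations here.
    return max(fitting)[1]


for budget in (150, 300, 600, 1000):
    print(budget, "->", pick_quant(budget))
```

File size is only a lower bound on memory use, since the context buffer needs additional RAM at run time.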

Tested on

  • LM Studio ✔
  • llama.cpp ✔

Notes

  • Converted using llama.cpp
  • No LoRA / base model only
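
The exact conversion commands are not given here. As a rough sketch of how such files are typically produced with llama.cpp, the following builds the two kinds of command line involved: `convert_hf_to_gguf.py` to produce a 16-bit GGUF file, then `llama-quantize` to derive each quantization. All paths and output file names are hypothetical, and the commands are only constructed and printed, not run:

```python
import shlex
from typing import List

# Quantization types listed in this model card.
QUANT_TYPES = ["Q2_K", "Q3_K_M", "Q4_K_M", "Q5_K_M"]


def convert_command(model_dir: str, f16_path: str) -> List[str]:
    """Command that converts a Hugging Face checkpoint to an F16 GGUF file."""
    return [
        "python", "convert_hf_to_gguf.py", model_dir,
        "--outfile", f16_path,
        "--outtype", "f16",
    ]


def quantize_command(f16_path: str, quant: str) -> List[str]:
    """Command that quantizes an F16 GGUF file to the given type."""
    # Derive the output name by swapping the "f16" tag for the quant type.
    out_path = f16_path.replace("f16", quant)
    return ["llama-quantize", f16_path, out_path, quant]


# Hypothetical paths, for illustration only.
commands = [convert_command("./Qwen3.5-0.8B-Base", "model-f16.gguf")]
commands += [quantize_command("model-f16.gguf", q) for q in QUANT_TYPES]
for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as lists keeps arguments unambiguous if they are later passed to `subprocess.run`.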

License: Apache 2.0

GGUF details

  • Model size: 0.8B params
  • Architecture: qwen35