How to use from the llama-cpp-python library
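# Gated model: log in first with a HF token that has gated-access permission (shell):
#   hf auth login
# Install the library if needed (shell):
#   pip install llama-cpp-python

from llama_cpp import Llama

# Download (or reuse the cached copy of) one of the GGUF files in this repo.
llm = Llama.from_pretrained(
	repo_id="oracomputing/Qwen3.5-9B-GGUF",
	filename="Qwen3.5-9B-OQ-Q4_K_M.gguf",  # or "Qwen3.5-9B-OQ-Q3_K_M.gguf"
)

# Run a single chat turn; the result is an OpenAI-style completion dict.
response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
print(response["choices"][0]["message"]["content"])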

EVALUATION-ONLY ACCESS

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

This is a private evaluation version of Qwen3.5-9B-GGUF (OraQuant).

By agreeing, you accept:

  • Internal testing only; no production use
  • No commercial use, redistribution, or reverse-engineering
  • Deletion of all files after evaluation
  • Full terms in LICENSE

Access is granted only to approved licensees.


Qwen3.5-9B-GGUF (OraQuant)

This repository contains GGUF builds of Qwen3.5-9B, quantized by Ora Computing with OraQuant (OQ) - Ora Computing's proprietary calibrated quantization. These are llama.cpp-compatible quantizations of Qwen/Qwen3.5-9B; the underlying weights are unchanged Qwen3.5-9B weights at reduced precision.

Text only. Qwen/Qwen3.5-9B is a multimodal model; these GGUFs contain only the language model (text input -> text output). The vision/video input encoders are not included.


Model Overview

  • Model name: Qwen3.5-9B-GGUF (OraQuant)
  • Base model: Qwen/Qwen3.5-9B (Apache-2.0, Alibaba Cloud); these files are GGUF quantizations of it.
  • Parameters: ~9 billion (unchanged from the base model)
  • Quantization: OraQuant (OQ) mixed-precision K-quant GGUFs produced by Ora Computing, provided in two footprints: OQ-Q4_K_M (higher quality) and OQ-Q3_K_M (smaller/faster).
  • Not fine-tuned, not parameter-reduced: the model architecture and parameter count are identical to the base model; only the weight precision is reduced.
  • Purpose: Evaluation/test use only; optimized for local/offline inference and internal benchmarking.
  • License: See LICENSE (Custom Model License Agreement).


Files in this repo

| File | What it is | Size |
| --- | --- | --- |
| Qwen3.5-9B-OQ-Q4_K_M.gguf | Language model, OraQuant Q4_K_M (higher quality) | ~5.7 GB |
| Qwen3.5-9B-OQ-Q3_K_M.gguf | Language model, OraQuant Q3_K_M (smaller/faster) | ~4.7 GB |
| LICENSE | Custom Model License Agreement | - |
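
Because the two builds differ mainly in footprint, a small helper can pick one by size and fetch it. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and you are already logged in with gated access. `choose_gguf` and `download_gguf` are hypothetical helper names, and the sizes are the approximate figures from the table above (file size only, not total runtime memory).

```python
# Approximate file sizes in GB, copied from the file table in this README.
GGUF_FILES = {
    "Qwen3.5-9B-OQ-Q4_K_M.gguf": 5.7,
    "Qwen3.5-9B-OQ-Q3_K_M.gguf": 4.7,
}


def choose_gguf(budget_gb: float) -> str:
    """Return the largest listed GGUF whose file size fits within budget_gb."""
    fitting = {name: size for name, size in GGUF_FILES.items() if size <= budget_gb}
    if not fitting:
        raise ValueError(f"No listed GGUF file fits within {budget_gb} GB")
    return max(fitting, key=fitting.get)


def download_gguf(filename: str, repo_id: str = "oracomputing/Qwen3.5-9B-GGUF") -> str:
    """Download one file from the repo and return its local cache path."""
    # Imported lazily so that choose_gguf works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)
```

For example, `download_gguf(choose_gguf(5.0))` would fetch the Q3_K_M file, whose path can then be used as `MODEL` in the commands below. Leave extra headroom beyond the file size for the KV cache and runtime buffers.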

Usage

These GGUFs load with stock upstream llama.cpp (no patch required); use a build with Qwen3.5 support.

export MODEL=/path/to/Qwen3.5-9B-OQ-Q4_K_M.gguf   # or the Q3_K_M file

Interactive chat:

./build/bin/llama-cli -m "$MODEL" -ngl 99

Single-shot completion (-st runs one turn then exits):

./build/bin/llama-cli -m "$MODEL" -ngl 99 -st -p "Explain the Chudnovsky algorithm in two sentences."

OpenAI-compatible server (Web UI at http://localhost:8080):

./build/bin/llama-server -m "$MODEL" -ngl 99 \
  --served-model-name qwen3.5-9b --host 0.0.0.0 --port 8080
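
Once the server is running, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library and assumes the host, port, and served model name from the command above; `build_chat_request` and `ask` are hypothetical helper names, not part of llama.cpp.

```python
import json
import urllib.request


def build_chat_request(prompt, model="qwen3.5-9b", base_url="http://localhost:8080"):
    """Build a POST request for the server's OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,  # matches --served-model-name in the command above
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("What is the capital of France?")` requires the server to be up. Note that `--host 0.0.0.0` exposes the server on all network interfaces, so bind to `127.0.0.1` instead if only local access is needed.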

Qwen3.5 is a reasoning model; the chat template and thinking behaviour are carried in the GGUF.
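
If you consume raw output in your own scripts, you may want to separate the reasoning from the final answer. This README does not specify the output format, so the sketch below assumes the reasoning is wrapped in `<think>...</think>` tags, as in earlier Qwen3 releases. `split_thinking` is a hypothetical helper, and some front ends may already return the reasoning in a separate field.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> tags in the raw text.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from raw model output.

    reasoning is the joined content of all <think> blocks (empty if none);
    answer is the text with those blocks removed.
    """
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    answer = THINK_BLOCK.sub("", text).strip()
    return reasoning, answer
```

Text without any `<think>` block passes through unchanged as the answer, so the helper is safe to apply when thinking is disabled.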


Intended Use & Restrictions

Permitted use

  • Internal testing, benchmarking, and evaluation of the model by the named Licensee.
  • Exploration of model behaviours, prompt engineering, and non-production prototypes.

Prohibited use

  • Deployment in a production or commercial service, publicly-facing API, resale, or redistribution.
  • Fine-tuning or creating derivative models for production use without a separate agreement.
  • Reverse-engineering the quantization/calibration used to produce these files.
  • Disclosure or sharing of the model (or its weights) to third parties beyond the named Licensee.

Out-of-scope use

  • Use in regulated or safety-critical contexts (unless separately permitted).
  • Any use that violates the Apache License, Version 2.0 under which the upstream model is distributed.

Quantization

  • Method: OraQuant (OQ), Ora Computing's proprietary calibrated quantization. The released files are mixed-precision K-quant GGUFs.
  • No fine-tuning: the weights are the original Qwen/Qwen3.5-9B weights; no additional training was performed.
  • No parameter-count change: the architecture and ~9B parameter count are unchanged; only weight precision is reduced.
  • Footprints: OQ-Q4_K_M for higher quality, OQ-Q3_K_M for a smaller/faster footprint.
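
To compare the two footprints on a common scale, the file sizes can be converted into a rough average number of bits per weight. This is a back-of-the-envelope sketch only: it assumes 1 GB = 10^9 bytes and uses the approximate ~9B parameter count and file sizes quoted in this README. The result also counts GGUF metadata and any higher-precision tensors, so it is not an exact figure for the quantization format.

```python
def bits_per_weight(file_size_gb: float, params_billion: float) -> float:
    """Rough average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    total_bits = file_size_gb * 1e9 * 8
    total_params = params_billion * 1e9
    return total_bits / total_params


# Approximate sizes from this README; the parameter count is also approximate.
for name, size_gb in [("OQ-Q4_K_M", 5.7), ("OQ-Q3_K_M", 4.7)]:
    print(f"{name}: ~{bits_per_weight(size_gb, 9.0):.1f} bits per weight")
```

Under these assumptions the two files come out at roughly 5.1 and 4.2 bits per weight.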

Limitations & Risks

  • Quantized models may not replicate the full behaviour of the base model under all prompt categories, particularly domain-specific or rare inputs.
  • The model is provided as-is for testing only and is not certified for production use.
  • Users should validate outputs carefully and monitor for bias or unintended behaviours.

Upstream Attribution

This model is derived from the Qwen3.5-9B model released by Alibaba Cloud under the Apache License, Version 2.0.

"Copyright 2025 Alibaba Cloud. Licensed under the Apache License, Version 2.0."

For full terms, see: https://huggingface.co/Qwen/Qwen3.5-9B/blob/main/LICENSE
Apache License, Version 2.0: https://www.apache.org/licenses/LICENSE-2.0


Contact & Support

For licensing inquiries or to request extended evaluation rights, please contact: info@oracomputing.com


Repository and model access are regulated. Do not redistribute or share without explicit written permission from Ora Computing.
