CARDS-Qwen3.5-9B-FP8
FP8-dynamic quantization of C3DS/CARDS-Qwen3.5-9B — the LoRA-merged Qwen3.5-9B fine-tuned on the CARDS taxonomy from Coan et al. (2025) for climate-contrarian-claim classification.
This is the deployment-friendly variant of the BF16 model: weights compressed to 8-bit floating point (fp8_e4m3) via dynamic per-channel quantization. Loads directly with transformers, vLLM (≥ 0.6 with FP8 Marlin kernels), or any FP8-aware inference engine.
- ~9 GB on disk (vs ~18 GB for the BF16 sibling) — fits on a single 16 GB GPU with comfortable KV-cache headroom.
- No accuracy loss expected. The same recipe applied to the 27B sibling matched the BF16 model within rounding on every test-set metric (see C3DS/CARDS-Qwen3.5-27B-FP8). FP8-dynamic does not require calibration data.
BF16 sibling results
The BF16 model this checkpoint was quantized from — evaluated on the held-out CARDS test set (1,436 samples, Level 1, min_support ≥ 3):
| Metric | Qwen3.5-9B (base) | CARDS-Qwen3.5-4B FT | CARDS-Qwen3.5-9B (BF16) | CARDS-Qwen3.5-27B FT | Claude Opus 4.6 |
|---|---|---|---|---|---|
| Samples F1 | 0.721 | 0.838 | 0.872 | 0.884 | 0.893 |
| Macro F1 | 0.629 | 0.632 | 0.663 | 0.766 | 0.751 |
| Micro F1 | 0.775 | 0.828 | 0.862 | 0.877 | 0.881 |
| Precision | 0.866 | 0.840 | 0.875 | 0.879 | 0.863 |
| Recall | 0.701 | 0.816 | 0.849 | 0.874 | 0.900 |
| Parse failures | 247 / 1436 | 1 / 1436 | 0 / 1436 | 0 / 1436 | 0 / 1436 |
FP8 numbers are not separately reported — at 27B the FP8 variant matched BF16 to within ±0.005 samples F1 with no parse-failure regression, and the same pattern is expected here.
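For readers comparing the three F1 rows, here is a minimal sketch of how samples, macro, and micro averaging differ on multi-label predictions. The card does not state how its metrics were computed; this assumes the usual multi-label definitions (the ones scikit-learn's `f1_score` uses for `average="samples"`, `"macro"`, and `"micro"`). The labels `A` to `D` are placeholders, not CARDS codes.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_f1(y_true, y_pred):
    """y_true / y_pred: lists of label sets, one set per sample."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set().union(*y_true, *y_pred))
    # Samples F1: score each sample's label set, then average over samples.
    samples = sum(f1(len(t & p), len(p - t), len(t - p)) for t, p in pairs) / len(pairs)
    # Per-label counts feed both the macro and the micro average.
    counts = []
    for lab in labels:
        tp = sum(lab in t and lab in p for t, p in pairs)
        fp = sum(lab not in t and lab in p for t, p in pairs)
        fn = sum(lab in t and lab not in p for t, p in pairs)
        counts.append((tp, fp, fn))
    # Macro: mean of per-label F1, so rare labels weigh as much as common ones.
    macro = sum(f1(*c) for c in counts) / len(counts)
    # Micro: pool all counts first, so frequent labels dominate.
    micro = f1(*(sum(c[i] for c in counts) for i in range(3)))
    return {"samples": samples, "macro": macro, "micro": micro}


# Placeholder labels, three samples.
y_true = [{"A"}, {"B", "C"}, {"D"}]
y_pred = [{"A"}, {"B"}, {"C"}]
scores = multilabel_f1(y_true, y_pred)
```

Because macro F1 gives every label equal weight, it is the row most sensitive to rare categories, which is consistent with it being the lowest of the three for every model in the table.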
Usage
With vLLM
vllm serve C3DS/CARDS-Qwen3.5-9B-FP8 \
  --port 8000 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name CARDS-Qwen3.5-9B
--kv-cache-dtype fp8 is optional but doubles effective KV cache capacity on Hopper hardware. The FP8 weights are detected automatically — do not pass --dtype bfloat16 (it would override the FP8 weights and undo the savings).
The system prompt and user-message format are identical to the BF16 sibling. We bundle them in this repo as cards_prompts.json for self-contained loading:
import json
from huggingface_hub import hf_hub_download
from openai import OpenAI

prompts = json.load(open(hf_hub_download("C3DS/CARDS-Qwen3.5-9B-FP8", "cards_prompts.json")))
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger = prompts["cot_trigger"]

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def classify(text):
    resp = client.chat.completions.create(
        model="CARDS-Qwen3.5-9B",
        messages=[
            {"role": "system", "content": slim_system_instruction},
            {"role": "user", "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        temperature=0,
        max_tokens=4000,
    )
    return resp.choices[0].message.content
The model produces a reasoning trace inside <think>…</think> followed by a YAML categories: block listing predicted CARDS codes. To parse: take the content after </think> and read the categories: list.
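The parsing step can be sketched with the standard library alone. The exact YAML layout is not shown in this card, so the helper below (a hypothetical `parse_categories`) accepts both a block list (`- code` on separate lines) and an inline list (`[a, b]`); the codes in the usage lines are placeholders, not real CARDS codes.

```python
import re


def parse_categories(content):
    """Return the codes listed under `categories:` after the reasoning trace."""
    # Keep only what follows the last </think>; if the tag is absent
    # (for example, thinking disabled), the whole string is the answer.
    answer = content.rsplit("</think>", 1)[-1]
    match = re.search(r"^\s*categories:[ \t]*(.*)$", answer, flags=re.MULTILINE)
    if match is None:
        return None  # the caller can count this as a parse failure
    inline = match.group(1).strip()
    if inline:
        # Inline form, e.g. `categories: [a, b]`
        items = inline.strip("[]").split(",")
    else:
        # Block form: collect the following lines that start with "- "
        items = []
        for line in answer[match.end():].splitlines():
            stripped = line.strip()
            if stripped.startswith("- "):
                items.append(stripped[2:])
            elif stripped:
                break  # first non-list, non-blank line ends the block
    return [i.strip().strip("'\"") for i in items if i.strip()]


# Placeholder codes for illustration only.
block = parse_categories("<think>reasoning</think>\ncategories:\n  - X_1\n  - Y_2\n")
inline = parse_categories("categories: [X_1, Y_2]")
missing = parse_categories("<think>truncated before the answer")
```

Returning `None` rather than an empty list keeps "no `categories:` block found" distinct from "the model predicted no categories", which matters when counting parse failures as in the table above.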
Multimodal — image + text
The base Qwen3.5/3.6 family supports image inputs via the OpenAI-compatible
image_url content part, and this fine-tune preserves that capability — pass
the system prompt below alongside an image (with or without caption text) and
the model will classify the depicted claim under the CARDS taxonomy.
Serve vLLM with multimodal flags enabled:
vllm serve C3DS/CARDS-Qwen3.5-9B-FP8 \
  --port 8000 \
  --max-model-len 8192 \
  --trust-remote-code \
  --limit-mm-per-prompt image=4 \
  --enable-prefix-caching \
  --served-model-name CARDS-Qwen3.5-9B

import base64, json, mimetypes
from pathlib import Path
from huggingface_hub import hf_hub_download
from openai import OpenAI

prompts = json.load(open(hf_hub_download("C3DS/CARDS-Qwen3.5-9B-FP8", "cards_prompts.json")))
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger = prompts["cot_trigger"]

def image_part(path):
    p = Path(path)
    mime = mimetypes.guess_type(p)[0] or "image/png"
    b64 = base64.b64encode(p.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="CARDS-Qwen3.5-9B",
    messages=[
        {"role": "system", "content": slim_system_instruction},
        {"role": "user", "content": [
            {"type": "text", "text": "Read the image (and any caption below) and classify the climate claim it makes."},
            image_part("screenshot.png"),
            {"type": "text", "text": f"### Caption:\n<optional caption>\n\n{cot_trigger}"},
        ]},
    ],
    temperature=0,
    max_tokens=4000,
)
print(resp.choices[0].message.content)
Training & Quantization
Fine-tuning (inherited from the BF16 sibling)
- Base model: Qwen/Qwen3.5-9B
- Method: LoRA (rank 16, α 16, dropout 0) on q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, then merged into base weights
- Dataset: C3DS/cards_sft_dataset (sft config: RECoT chat messages)
- Framework: Unsloth + TRL SFTTrainer
- Hyperparameters: 3 epochs, per_device_train_batch_size=1, gradient_accumulation_steps=8, lr=2e-4, cosine schedule, 10 warmup steps, max_seq_length=4096, adamw_8bit, bf16
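To illustrate what "merged into base weights" means, here is a toy sketch of folding a LoRA adapter into a weight matrix. It assumes the standard LoRA scaling of α / r (so α = 16, r = 16 gives a factor of 1.0) and is not the Unsloth merge routine; `matmul` and `merge_lora` are hypothetical helpers and the matrices are made-up values.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Fold a LoRA update into a base weight: W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy 2x2 layer with a rank-1 adapter; alpha == r mirrors the card's 16 / 16.
w = [[1.0, 0.0], [0.0, 1.0]]
lora_a = [[0.5, -0.5]]   # shape (r=1, in=2)
lora_b = [[2.0], [4.0]]  # shape (out=2, r=1)
merged = merge_lora(w, lora_a, lora_b, alpha=1, r=1)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the quantization step can operate on a plain BF16 checkpoint.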
FP8 quantization
- Scheme: fp8_e4m3 dynamic per-channel quantization (weights only). Activations stay in BF16; per-channel scales are computed at quantization time, no calibration data needed.
- Targets: linear layers in transformer blocks (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj); lm_head left in BF16.
- Tool: llmcompressor with QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC") applied to the merged BF16 checkpoint.
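The reason no calibration data is needed can be seen in a small sketch: a per-channel weight scale depends only on the weights themselves. The code below is an illustration under stated assumptions, not the llmcompressor implementation. It treats each row as one output channel, uses 448 as the largest finite fp8_e4m3 value, and omits the final rounding onto the FP8 grid; the helper names and weight values are hypothetical.

```python
FP8_E4M3_MAX = 448.0  # largest finite value in fp8_e4m3 (the "fn" variant)


def per_channel_scales(weight):
    """One scale per output channel (row): the row's max |w| maps to 448."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in weight]


def to_fp8_range(weight, scales):
    """Divide each row by its scale so its values fall within [-448, 448].

    Rounding to representable FP8 values is left out of this sketch.
    """
    return [[v / s if s else 0.0 for v in row] for row, s in zip(weight, scales)]


# Two channels with very different magnitudes (made-up values).
weight = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
scales = per_channel_scales(weight)
scaled = to_fp8_range(weight, scales)
```

Giving each channel its own scale lets the small-magnitude second row use the full FP8 range instead of being crushed by the larger first row, which is the main advantage over a single per-tensor scale.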
Hardware notes
- Inference: any GPU that supports FP8 weights via Marlin or vLLM's FP8 kernels. Ampere (A100) uses up-cast FP8→BF16 matmul; Hopper (H100/H200) uses native FP8 tensor cores.
- Memory: ~9 GB for weights — fits on a single 16 GB GPU at long context.
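As a rough back-of-the-envelope check on the memory claim, the sketch below estimates what is left for KV cache and activations once the weights are loaded. It assumes vLLM's default `--gpu-memory-utilization` of 0.9 and the approximate weight sizes quoted in this card; the helper name is hypothetical, and real headroom will be somewhat lower because of runtime overhead.

```python
def kv_cache_budget_gb(gpu_gb, weights_gb, gpu_memory_utilization=0.9):
    """Memory left for KV cache and activations after loading the weights."""
    return gpu_gb * gpu_memory_utilization - weights_gb


fp8_headroom = kv_cache_budget_gb(16, 9)    # FP8 checkpoint from this card
bf16_headroom = kv_cache_budget_gb(16, 18)  # BF16 sibling: negative, does not fit
```

On a 16 GB card this leaves roughly 5.4 GB for the FP8 checkpoint, while the BF16 weights alone exceed the usable budget.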
Limitations
- Thinking tokens. Training used enable_thinking=True. Either parse output after </think>, or disable thinking at inference via chat_template_kwargs={"enable_thinking": false}.
- Quantization is weight-only. Activations are BF16. For more aggressive compression (W4A16 GPTQ/AWQ) see project follow-up work.
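With the OpenAI Python client against a vLLM server, `chat_template_kwargs` is not a standard OpenAI parameter, so it is typically passed through `extra_body`. The sketch below only builds the request arguments (the hypothetical `build_request` helper), which keeps it runnable without a server. Whether `cot_trigger` should still be appended when thinking is disabled is not stated in this card; it is kept here as an assumption.

```python
def build_request(system_prompt, text, cot_trigger, thinking=True):
    """Assemble keyword arguments for client.chat.completions.create()."""
    return {
        "model": "CARDS-Qwen3.5-9B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        "temperature": 0,
        "max_tokens": 4000,
        # vLLM-specific field, forwarded to the server via extra_body.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": thinking}},
    }


# Placeholder prompt strings; in practice load them from cards_prompts.json.
request = build_request("<system prompt>", "Example text.", "<cot trigger>", thinking=False)
# Usage with a live server: client.chat.completions.create(**request)
```

Since training used enable_thinking=True, output quality with thinking disabled may differ from the reported numbers and is worth checking on your own data.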
Citation
@article{coan2025cards,
title = {Large language model reveals an increase in climate contrarian speech in the United States Congress},
author = {Coan, Travis G. and Malla, Ranadheer and Nanko, Mirjam O. and Kattrup, William and Roberts, J. Timmons and Cook, John and Boussalis, Constantine},
journal = {Communications Sustainability},
volume = {1},
pages = {37},
year = {2025},
doi = {10.1038/s44458-025-00029-z}
}
License
Apache 2.0, inherited from Qwen3.5-9B.