CARDS-Qwen3.5-9B-FP8

FP8-dynamic quantization of C3DS/CARDS-Qwen3.5-9B — the LoRA-merged Qwen3.5-9B fine-tuned on the CARDS taxonomy from Coan et al. (2025) for climate-contrarian-claim classification.

This is the deployment-friendly variant of the BF16 model: weights compressed to 8-bit floating point (fp8_e4m3) via dynamic per-channel quantization. Loads directly with transformers, vLLM (≥ 0.6 with FP8 Marlin kernels), or any FP8-aware inference engine.

~9 GB on disk (vs ~18 GB for the BF16 sibling) — fits on a single 16 GB GPU with comfortable KV-cache headroom.
No accuracy loss expected. The same recipe applied to the 27B sibling matched the BF16 model within rounding on every test-set metric (see C3DS/CARDS-Qwen3.5-27B-FP8). FP8-dynamic does not require calibration data.
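
The disk figures above follow from the number of bytes stored per parameter. A rough back-of-the-envelope check, assuming a nominal 9 billion parameters and ignoring quantization scales, the BF16 lm_head, and file metadata (the function name is ours, for illustration only):

```python
def weight_footprint_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate size of the weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# Nominal parameter count implied by the "9B" name (an assumption, not a measured value).
N_PARAMS = 9e9

bf16_gb = weight_footprint_gb(N_PARAMS, 2)  # BF16 stores 2 bytes per parameter
fp8_gb = weight_footprint_gb(N_PARAMS, 1)   # FP8 stores 1 byte per parameter

print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB")
```

The real checkpoint will be slightly larger than this estimate, because the scales and any layers left in BF16 add to the total.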

BF16 sibling results

Results for the BF16 model this checkpoint was quantized from, evaluated on the held-out CARDS test set (1,436 samples, Level 1, min_support ≥ 3):

Metric	Qwen3.5-9B (base)	CARDS-Qwen3.5-4B FT	CARDS-Qwen3.5-9B (BF16)	CARDS-Qwen3.5-27B FT	Claude Opus 4.6
Samples F1	0.721	0.838	0.872	0.884	0.893
Macro F1	0.629	0.632	0.663	0.766	0.751
Micro F1	0.775	0.828	0.862	0.877	0.881
Precision	0.866	0.840	0.875	0.879	0.863
Recall	0.701	0.816	0.849	0.874	0.900
Parse failures	247 / 1436	1 / 1436	0 / 1436	0 / 1436	0 / 1436

FP8 results are not reported separately for this checkpoint. At 27B, the FP8 variant matched BF16 to within ±0.005 samples F1 with no parse-failure regression, and the same pattern is expected here.
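
For readers unfamiliar with the three F1 rows in the table, the sketch below shows one common way to compute them for multi-label predictions, following the usual definitions (per-sample average, per-label average, and pooled counts). It is not the project's evaluation script: the toy labels A to D are placeholders, and scoring an empty-versus-empty comparison as 0 is an assumption.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def samples_f1(y_true, y_pred):
    """Average of the F1 computed separately for each sample."""
    scores = [f1(len(t & p), len(p - t), len(t - p)) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


def micro_f1(y_true, y_pred):
    """F1 over counts pooled across all samples and labels."""
    tp = sum(len(t & p) for t, p in zip(y_true, y_pred))
    fp = sum(len(p - t) for t, p in zip(y_true, y_pred))
    fn = sum(len(t - p) for t, p in zip(y_true, y_pred))
    return f1(tp, fp, fn)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy data with placeholder labels (not real CARDS codes).
y_true = [{"A"}, {"B", "C"}, {"D"}]
y_pred = [{"A"}, {"B"}, {"A"}]
labels = ["A", "B", "C", "D"]

print(round(samples_f1(y_true, y_pred), 3))
print(round(micro_f1(y_true, y_pred), 3))
print(round(macro_f1(y_true, y_pred, labels), 3))
```

Because macro F1 weights every label equally, a few poorly predicted rare classes pull it down more than they affect the sample-level or micro scores.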

Usage

With vLLM

vllm serve C3DS/CARDS-Qwen3.5-9B-FP8 \
  --port 8000 \
  --max-model-len 4096 \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --served-model-name CARDS-Qwen3.5-9B

--kv-cache-dtype fp8 is optional but roughly doubles effective KV-cache capacity on Hopper hardware, since each cached element takes one byte instead of two. The FP8 weights are detected automatically from the checkpoint, so leave --dtype at its default; do not pass --dtype bfloat16.
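
To see where the doubling comes from, the sketch below estimates how many tokens fit in a fixed KV-cache budget at two bytes versus one byte per element. The layer count, KV-head count, head dimension, and budget are hypothetical placeholders rather than the actual Qwen3.5-9B configuration, and the estimate ignores scale storage and vLLM's block-level allocation.

```python
def kv_bytes_per_token(n_attn_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(budget_bytes: int, per_token_bytes: int) -> int:
    """Whole number of tokens that fit in the given cache budget."""
    return budget_bytes // per_token_bytes


# Hypothetical values for illustration only; read the real ones from the model config.
N_ATTN_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BUDGET = 4 * 1024**3  # 4 GiB set aside for the KV cache

bf16_tokens = tokens_that_fit(BUDGET, kv_bytes_per_token(N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, 2))
fp8_tokens = tokens_that_fit(BUDGET, kv_bytes_per_token(N_ATTN_LAYERS, N_KV_HEADS, HEAD_DIM, 1))

print(f"16-bit cache: {bf16_tokens} tokens, FP8 cache: {fp8_tokens} tokens")
```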

The system prompt and user-message format are identical to the BF16 sibling. We bundle them in this repo as cards_prompts.json for self-contained loading:

import json
from huggingface_hub import hf_hub_download
from openai import OpenAI

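# Download the bundled prompt file (cached locally by huggingface_hub) and load it.
with open(hf_hub_download("C3DS/CARDS-Qwen3.5-9B-FP8", "cards_prompts.json"), encoding="utf-8") as f:
    prompts = json.load(f)
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger             = prompts["cot_trigger"]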

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def classify(text):
    resp = client.chat.completions.create(
        model="CARDS-Qwen3.5-9B",
        messages=[
            {"role": "system", "content": slim_system_instruction},
            {"role": "user",   "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        temperature=0,
        max_tokens=4000,
    )
    return resp.choices[0].message.content

The model produces a reasoning trace inside <think>…</think> followed by a YAML categories: block listing predicted CARDS codes. To parse: take the content after </think> and read the categories: list.
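
The parsing step described above could be sketched as follows. This is a minimal, dependency-free parser, assuming the answer uses either a YAML block list (one "- code" per line) or an inline list after categories:. The helper name parse_categories and the sample codes are illustrative, and a full YAML parser such as PyYAML's yaml.safe_load would be more robust where available.

```python
def parse_categories(raw: str) -> list[str]:
    """Return the codes listed under `categories:` after the reasoning trace."""
    # Keep only what follows the last </think>; if the tag is absent, use the whole reply.
    answer = raw.rsplit("</think>", 1)[-1]
    codes: list[str] = []
    in_block = False
    for line in answer.splitlines():
        stripped = line.strip()
        if stripped.startswith("categories:"):
            in_block = True
            inline = stripped[len("categories:"):].strip()
            if inline.startswith("["):
                # Inline (flow) style, e.g. categories: [x, y]
                items = inline.strip("[]").split(",")
                codes = [c.strip().strip("'\"") for c in items if c.strip()]
                break
            continue
        if in_block:
            if stripped.startswith("- "):
                codes.append(stripped[2:].strip().strip("'\""))
            elif stripped and not stripped.startswith("```"):
                break  # any other non-empty line ends the list
    return codes


# Hypothetical model reply; the codes are placeholders, not real predictions.
sample = "<think>\nThe text disputes the temperature record.\n</think>\n\ncategories:\n  - 1_1\n  - 5_2\n"
print(parse_categories(sample))
```

Returning an empty list when no categories: block is found lets the caller count the reply as a parse failure instead of raising.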

Multimodal — image + text

The base Qwen3.5/3.6 family supports image inputs via the OpenAI-compatible image_url content part, and this fine-tune preserves that capability. Pass the same system prompt as in the text-only example alongside an image (with or without caption text), and the model will classify the depicted claim under the CARDS taxonomy.

Serve vLLM with multimodal flags enabled:

vllm serve C3DS/CARDS-Qwen3.5-9B-FP8 \
  --port 8000 \
  --max-model-len 8192 \
  --trust-remote-code \
  --limit-mm-per-prompt image=4 \
  --enable-prefix-caching \
  --served-model-name CARDS-Qwen3.5-9B

import base64, json, mimetypes
from pathlib import Path
from huggingface_hub import hf_hub_download
from openai import OpenAI

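# Download the bundled prompt file (cached locally by huggingface_hub) and load it.
with open(hf_hub_download("C3DS/CARDS-Qwen3.5-9B-FP8", "cards_prompts.json"), encoding="utf-8") as f:
    prompts = json.load(f)
slim_system_instruction = prompts["slim_system_instruction"]
cot_trigger             = prompts["cot_trigger"]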

def image_part(path):
    p = Path(path)
    mime = mimetypes.guess_type(p)[0] or "image/png"
    b64 = base64.b64encode(p.read_bytes()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="CARDS-Qwen3.5-9B",
    messages=[
        {"role": "system", "content": slim_system_instruction},
        {"role": "user", "content": [
            {"type": "text", "text": "Read the image (and any caption below) and classify the climate claim it makes."},
            image_part("screenshot.png"),
            {"type": "text", "text": f"### Caption:\n<optional caption>\n\n{cot_trigger}"},
        ]},
    ],
    temperature=0,
    max_tokens=4000,
)
print(resp.choices[0].message.content)

Training & Quantization

Fine-tuning (inherited from the BF16 sibling)

Base model: Qwen/Qwen3.5-9B
Method: LoRA (rank 16, α 16, dropout 0) on q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, then merged into base weights
Dataset: C3DS/cards_sft_dataset (sft config — RECoT chat messages)
Framework: Unsloth + TRL SFTTrainer
Hyperparameters: 3 epochs, per_device_train_batch_size=1, gradient_accumulation_steps=8, lr=2e-4, cosine schedule, 10 warmup steps, max_seq_length=4096, adamw_8bit, bf16

FP8 quantization

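Scheme: fp8_e4m3 with per-channel weight scales. The scales are computed directly from the weights at quantization time, so no calibration data is needed. llmcompressor's FP8_DYNAMIC preset additionally specifies dynamic per-token FP8 quantization of activations at runtime; on GPUs that fall back to weight-only kernels (see Hardware notes), activations stay in BF16.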
Targets: linear layers in transformer blocks (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj); lm_head left in BF16.
Tool: llmcompressor with QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC") applied to the merged BF16 checkpoint.
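
To illustrate why per-channel scales need no calibration data, here is a minimal pure-Python sketch of how one scale per output channel can be derived from the weights alone. It is not llmcompressor's implementation: the helper names are ours, the toy matrix is made up, and the final rounding of each value to the nearest representable FP8 number is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in fp8_e4m3


def per_channel_scales(weight):
    """One scale per output channel (row): the row's max |w| divided by the FP8 max."""
    scales = []
    for row in weight:
        s = max(abs(w) for w in row) / FP8_E4M3_MAX
        scales.append(s if s > 0 else 1.0)  # avoid dividing by zero for all-zero rows
    return scales


def scale_rows(weight, scales):
    """Divide each row by its scale so every value lands inside the FP8 range."""
    return [[w / s for w in row] for row, s in zip(weight, scales)]


# Toy 2x3 weight matrix (rows = output channels); the values are made up.
weight = [
    [0.02, -0.50, 0.10],
    [3.00, 0.75, -1.50],
]
scales = per_channel_scales(weight)
scaled = scale_rows(weight, scales)

for s, row in zip(scales, scaled):
    print(f"scale={s:.6f} scaled_row={[round(v, 1) for v in row]}")
```

At inference time the stored FP8 values are multiplied back by their channel's scale, so channels with very different magnitudes each make full use of the 8-bit range.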

Hardware notes

Inference: any GPU that supports FP8 weights via Marlin or vLLM's FP8 kernels. Ampere (A100) uses up-cast FP8→BF16 matmul; Hopper (H100/H200) uses native FP8 tensor cores.
Memory: ~9 GB for weights — fits on a single 16 GB GPU at long context.

Limitations

Thinking tokens. Training used enable_thinking=True. Either parse the output after </think>, or disable thinking at inference by sending chat_template_kwargs={"enable_thinking": false} in the request body (with the OpenAI Python client, pass it through extra_body).
Quantization is 8-bit only. Weights are stored in FP8, and activations stay in BF16 wherever weight-only kernels are used. For more aggressive compression (W4A16 GPTQ/AWQ), see the project's follow-up work.
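
As a sketch of the thinking toggle mentioned above, the helper below assembles the arguments for a chat completion request against the vLLM server and adds chat_template_kwargs only when thinking is turned off. The function name build_request is ours, and keeping the same user-message format with thinking disabled is an assumption rather than something the model was evaluated on.

```python
def build_request(text, system_prompt, cot_trigger, thinking=True):
    """Assemble keyword arguments for client.chat.completions.create()."""
    kwargs = {
        "model": "CARDS-Qwen3.5-9B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"### Text:\n{text}\n\n{cot_trigger}"},
        ],
        "temperature": 0,
        "max_tokens": 4000,
    }
    if not thinking:
        # The OpenAI client merges extra_body into the JSON request body,
        # where vLLM reads chat_template_kwargs.
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs


# Placeholder strings stand in for the prompts loaded from cards_prompts.json.
request = build_request("Example text.", "<system prompt>", "<cot trigger>", thinking=False)
print(request["extra_body"])

# With a running server:
#   resp = client.chat.completions.create(**request)
```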

Citation

@article{coan2025cards,
  title   = {Large language model reveals an increase in climate contrarian speech in the United States Congress},
  author  = {Coan, Travis G. and Malla, Ranadheer and Nanko, Mirjam O. and Kattrup, William and Roberts, J. Timmons and Cook, John and Boussalis, Constantine},
  journal = {Communications Sustainability},
  volume  = {1},
  pages   = {37},
  year    = {2025},
  doi     = {10.1038/s44458-025-00029-z}
}