CARD Probe Models

Trained probe heads for CARD (Category-Aware Risk Detection for Vision–Language Models, ESORICS 2026). CARD needs no gradients through the VLM: the backbone stays frozen, and only these lightweight heads are trained, on cached prefill hidden states.

Code: https://github.com/kevin-Abbring/CARD
Evaluation data: https://huggingface.co/datasets/kbl324/CARD_data

Backbones

One folder per backbone (heads are backbone-specific; the public VLM weights are not redistributed here — load them from their original sources):

qwen3vl_4b/ · qwen3vl_8b/ — Qwen3-VL-Instruct
gemma3_12b/ — Gemma-3-12B-it
llava15_7b/ — LLaVA-1.5-7B

Files (per backbone)

file	contents
`kway_mlp.pt`	K-way category head: MLP `256→256→128→6` (BatchNorm/ReLU/Dropout), PyTorch `state_dict`
`pca_scaler.pkl`	pickled dict with keys `pca` and `scaler`: PCA whitening (256 components) + `StandardScaler`, fitted on the curated calibration set
`binary_probes.npz`	per-layer probe directions for the binary `SafetyScore`: `refusal` + 6 category mean-difference vectors
`config.json`	selected read-out layer, depth %, dims, class list, architecture

kway_classes = [crime, hate, misinfo, privacy, sexual, violence].
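A minimal sketch of a head with the `256→256→128→6` shape listed above, for readers who want to see the layer layout. The layer order (Linear, BatchNorm, ReLU, Dropout), the dropout probability, and the helper name `build_kway_mlp` are assumptions made for illustration; the module layout in the code repo is authoritative, and the parameter names here may not match the keys stored in `kway_mlp.pt`.

```python
import torch
from torch import nn

KWAY_CLASSES = ["crime", "hate", "misinfo", "privacy", "sexual", "violence"]

def build_kway_mlp(in_dim=256, hidden=(256, 128), n_classes=6, p_drop=0.1):
    """Hypothetical 256→256→128→6 head; layer order and p_drop are assumptions."""
    layers, prev = [], in_dim
    for width in hidden:
        layers += [nn.Linear(prev, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_drop)]
        prev = width
    layers.append(nn.Linear(prev, n_classes))  # final layer emits one logit per category
    return nn.Sequential(*layers)

# eval() switches BatchNorm to running statistics and disables Dropout,
# so a single-sample batch is fine at inference time.
model = build_kway_mlp().eval()
z = torch.randn(1, 256)  # stand-in for one PCA-whitened, standardised feature vector
with torch.no_grad():
    logits = model(z)
category = KWAY_CLASSES[int(logits.argmax(dim=-1))]
```

Before calling `load_state_dict` with the released weights, compare `sd.keys()` against `model.state_dict().keys()`; if they differ, use the model class from the code repo instead of this sketch.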

How CARD uses them at inference

1. Run one frozen-VLM prefill pass on the (image, text) input and collect the per-layer last-token hidden states.
2. Binary: project the hidden states onto the `binary_probes` directions, aggregate into a `SafetyScore`, and threshold at a target benign FPR → safe / unsafe.
3. K-way (if unsafe): take the `selected_layer` hidden state → PCA-whiten + standardise (`pca_scaler.pkl`) → `kway_mlp` → harm category.
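The binary step can be illustrated on synthetic data. This is a sketch under stated assumptions, not CARD's actual aggregation: it uses one direction per layer, averages the per-layer projections into a score, assumes higher means more unsafe, and picks the threshold as a quantile of benign calibration scores. The function names and array shapes are hypothetical, and the real `binary_probes.npz` holds a `refusal` direction plus six category directions whose combination rule is defined in the code repo.

```python
import numpy as np

def safety_scores(hidden, directions):
    """hidden: (n_samples, n_layers, d); directions: (n_layers, d) -> scores (n_samples,)."""
    unit = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    per_layer = np.einsum("nld,ld->nl", hidden, unit)  # projection onto each layer's direction
    return per_layer.mean(axis=1)                      # assumed aggregation: mean over layers

def threshold_at_fpr(benign_scores, target_fpr):
    """Threshold that flags roughly target_fpr of the benign calibration scores."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))

# Synthetic stand-ins for cached hidden states (not real VLM activations).
rng = np.random.default_rng(0)
n_layers, d = 4, 16
directions = rng.normal(size=(n_layers, d))
unit_dirs = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
benign = rng.normal(size=(1000, n_layers, d))
harmful = rng.normal(size=(1000, n_layers, d)) + 3.0 * unit_dirs  # shifted along the probes

target_fpr = 0.05
tau = threshold_at_fpr(safety_scores(benign, directions), target_fpr)
benign_fpr = float((safety_scores(benign, directions) > tau).mean())
detect_rate = float((safety_scores(harmful, directions) > tau).mean())
```

Fixing the threshold from benign scores alone means the false-positive budget is set independently of how harmful inputs are distributed, which is why the operating point is quoted as a benign FPR.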

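import json, pickle
import numpy as np
import torch

with open("qwen3vl_8b/config.json") as f:
    cfg = json.load(f)                        # selected_layer, kway_classes, dims, ...
with open("qwen3vl_8b/pca_scaler.pkl", "rb") as f:
    pp = pickle.load(f)                       # dict with keys "pca" and "scaler"
sd = torch.load("qwen3vl_8b/kway_mlp.pt", map_location="cpu")    # MLP state_dict

# h: last-token hidden state at cfg["selected_layer"], NumPy array of shape (1, hidden_size)
# z = pp["scaler"].transform(pp["pca"].transform(h))             # PCA-whiten, then standardise
# logits = mlp(torch.from_numpy(z).float())   # mlp: the 256→256→128→6 head in eval mode, sd loaded
# category = cfg["kway_classes"][int(logits.argmax())]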
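For readers unfamiliar with the preprocessing in `pp["scaler"].transform(pp["pca"].transform(h))`, the following NumPy sketch shows what PCA whitening followed by standardisation computes. It assumes the pickled objects behave like a whitening PCA and a per-feature scaler; the helper names, the synthetic data, and the small dimensions are illustrative only.

```python
import numpy as np

def fit_pca_whitener(X, n_components):
    """Fit mean, principal axes, and per-component variance via SVD of the centred data."""
    mean = X.mean(axis=0)
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]                        # (n_components, d)
    explained_var = S[:n_components] ** 2 / (X.shape[0] - 1)
    return mean, components, explained_var

def pca_whiten(X, mean, components, explained_var):
    """Project onto the principal axes and rescale each axis to unit variance."""
    return (X - mean) @ components.T / np.sqrt(explained_var)

# Synthetic "calibration set" with correlated features (not real hidden states).
rng = np.random.default_rng(0)
X_calib = rng.normal(size=(500, 32)) @ rng.normal(size=(32, 32))

mean, comps, var = fit_pca_whitener(X_calib, n_components=8)
Z = pca_whiten(X_calib, mean, comps, var)

# Standardise: subtract the per-feature mean and divide by the per-feature std,
# both estimated on the calibration data and reused unchanged at inference.
mu, sigma = Z.mean(axis=0), Z.std(axis=0)
Z_std = (Z - mu) / sigma
```

At inference the fitted statistics (`mean`, `comps`, `var`, `mu`, `sigma` in this sketch) are applied to a single new vector without refitting, which is what the `transform` calls do.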

See the code repo for the full inference and training pipeline. In-domain 6-way accuracy is 88–90 % across all four backbones (paper Table 2).

Citation

@inproceedings{card2026,
  title     = {CARD: Category-Aware Risk Detection for Vision--Language Models},
  booktitle = {European Symposium on Research in Computer Security (ESORICS)},
  year      = {2026}
}
