---
license: mit
tags:
- safety
- vision-language-models
- content-moderation
- probing
- multimodal-safety
library_name: pytorch
---

# CARD Probe Models

Trained **probe heads** for **CARD** (Category-Aware Risk Detection for Vision–Language Models, ESORICS 2026). CARD is gradient-free: the VLM backbone is **frozen** and only these lightweight heads are trained, on cached prefill hidden states.

- **Code:** https://github.com/kevin-Abbring/CARD
- **Evaluation data:** https://huggingface.co/datasets/kbl324/CARD_data

## Backbones

One folder per backbone. Heads are backbone-specific; the public VLM weights are **not** redistributed here, so load them from their original sources.

- `qwen3vl_4b/` · `qwen3vl_8b/`: Qwen3-VL-Instruct
- `gemma3_12b/`: Gemma-3-12B-it
- `llava15_7b/`: LLaVA-1.5-7B

## Files (per backbone)

| file | contents |
|---|---|
| `kway_mlp.pt` | K-way category head: MLP `256→256→128→6` (BatchNorm/ReLU/Dropout), PyTorch `state_dict` |
| `pca_scaler.pkl` | PCA whitening (256 components) + `StandardScaler`, fitted on the curated calibration set |
| `binary_probes.npz` | per-layer probe directions for the binary `SafetyScore`: `refusal` + 6 category mean-difference vectors |
| `config.json` | selected read-out layer, depth %, dims, class list, architecture |

`kway_classes = [crime, hate, misinfo, privacy, sexual, violence]`.

## How CARD uses them at inference

1. Run **one** frozen-VLM prefill pass on the (image, text) input; collect per-layer last-token hidden states.
2. **Binary**: project hidden states onto the `binary_probes` directions, aggregate into a `SafetyScore`, threshold at a target benign FPR → safe / unsafe.
3. **K-way** (if unsafe): take the `selected_layer` hidden state → PCA-whiten + standardise (`pca_scaler.pkl`) → `kway_mlp` → harm category.

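
Step 2 can be illustrated with a small NumPy sketch. This is not the repository's implementation: the array layout (`hidden` as `(n_layers, d)`, `directions` as `(n_layers, n_dirs, d)`), the plain mean used for aggregation, and the function names `safety_score` and `threshold_at_fpr` are all assumptions made for illustration. The actual keys and shapes in `binary_probes.npz` and the real aggregation rule are defined in the code repo.

```python
import numpy as np


def safety_score(hidden, directions):
    """Mean projection of per-layer hidden states onto unit probe directions.

    hidden:     (n_layers, d) last-token hidden state per layer (assumed layout)
    directions: (n_layers, n_dirs, d) probe vectors per layer (assumed layout)
    """
    unit = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    proj = np.einsum("ld,lkd->lk", hidden, unit)  # one scalar per (layer, direction)
    return float(proj.mean())  # assumption: simple mean as the aggregate


def threshold_at_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most `target_fpr` of benign scores exceed it."""
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    s = np.sort(np.asarray(benign_scores, dtype=float))
    k = int(np.floor(target_fpr * len(s)))  # benign samples allowed above threshold
    return float(s[len(s) - k - 1])


# Synthetic stand-ins for cached hidden states and probe directions.
rng = np.random.default_rng(0)
n_layers, n_dirs, d = 4, 7, 16  # 7 = refusal + 6 category directions
directions = rng.normal(size=(n_layers, n_dirs, d))
benign = rng.normal(size=(500, n_layers, d))

benign_scores = [safety_score(h, directions) for h in benign]
tau = threshold_at_fpr(benign_scores, target_fpr=0.05)
flagged = np.mean(np.array(benign_scores) > tau)
print(f"threshold={tau:.3f}, benign flag rate={flagged:.3f}")
```

Under these assumptions, an input whose score exceeds `tau` would be treated as unsafe and passed on to the K-way head in step 3.
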
```python
import json
import pickle

import torch

with open("qwen3vl_8b/config.json") as f:
    cfg = json.load(f)
with open("qwen3vl_8b/pca_scaler.pkl", "rb") as f:
    pp = pickle.load(f)  # dict with keys "pca" and "scaler"
sd = torch.load("qwen3vl_8b/kway_mlp.pt", map_location="cpu")  # MLP state_dict

# h = last-token hidden state at cfg["selected_layer"], shape (1, hidden_size)
# z = pp["scaler"].transform(pp["pca"].transform(h))
# logits = MLP(z); category = cfg["kway_classes"][logits.argmax()]
```

See the [code repo](https://github.com/kevin-Abbring/CARD) for the full inference and training pipeline. In-domain 6-way accuracy is 88–90% across all four backbones (paper Table 2).

## Citation

```bibtex
@inproceedings{card2026,
  title     = {CARD: Category-Aware Risk Detection for Vision--Language Models},
  booktitle = {European Symposium on Research in Computer Security (ESORICS)},
  year      = {2026}
}
```