---
license: mit
tags:
- safety
- vision-language-models
- content-moderation
- probing
- multimodal-safety
library_name: pytorch
---

# CARD Probe Models

Trained **probe heads** for **CARD** (Category-Aware Risk Detection for Vision–Language
Models, ESORICS 2026). CARD is gradient-free with respect to the VLM: the backbone is
**frozen**, and only these lightweight heads are trained, on cached prefill hidden states.

- **Code:** https://github.com/kevin-Abbring/CARD
- **Evaluation data:** https://huggingface.co/datasets/kbl324/CARD_data

## Backbones

One folder per backbone. The heads are backbone-specific, and the public VLM weights are
**not** redistributed here; load them from their original sources.

- `qwen3vl_4b/` · `qwen3vl_8b/`: Qwen3-VL-Instruct
- `gemma3_12b/`: Gemma-3-12B-it
- `llava15_7b/`: LLaVA-1.5-7B

## Files (per backbone)

| file | contents |
|---|---|
| `kway_mlp.pt` | K-way category head: MLP `256→256→128→6` (BatchNorm/ReLU/Dropout), stored as a PyTorch `state_dict` |
| `pca_scaler.pkl` | PCA whitening (256 components) + `StandardScaler`, fitted on the curated calibration set |
| `binary_probes.npz` | per-layer probe directions for the binary `SafetyScore`: `refusal` + 6 category mean-difference vectors |
| `config.json` | selected read-out layer, depth %, dims, class list, architecture |
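
To make the term "mean-difference vector" in the table concrete, here is a minimal sketch on synthetic data. It assumes the common construction (mean hidden state of one group minus the mean of a contrast group, normalised to unit length) for a single layer. The function name `mean_difference_direction`, the choice of contrast sets, and the normalisation are illustrative assumptions; the actual contents and key names of `binary_probes.npz` may differ.

```python
import numpy as np


def mean_difference_direction(pos, neg):
    """Unit vector pointing from the mean of `neg` to the mean of `pos`.

    pos, neg: arrays of shape (n_samples, hidden_size) for one layer.
    """
    diff = pos.mean(axis=0) - neg.mean(axis=0)
    return diff / np.linalg.norm(diff)


rng = np.random.default_rng(0)
hidden_size = 32
offset = np.zeros(hidden_size)
offset[0] = 2.0  # synthetic class shift along axis 0

benign = rng.normal(size=(400, hidden_size))            # stand-in for benign hidden states
harmful = rng.normal(size=(400, hidden_size)) + offset  # stand-in for one harm category

direction = mean_difference_direction(harmful, benign)
# On average, projections onto the direction separate the two synthetic groups.
gap = (harmful @ direction).mean() - (benign @ direction).mean()
```

In this toy setup the recovered direction is dominated by the axis along which the two groups were shifted.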

`kway_classes = [crime, hate, misinfo, privacy, sexual, violence]`.
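
As a rough illustration of the K-way head, the sketch below builds an MLP with the stated `256→256→128→6` widths in PyTorch. The layer ordering (Linear, BatchNorm, ReLU, Dropout), the dropout rate, and the class name `KWayMLP` are assumptions. The keys in the released `state_dict` may not match this layout, so check `config.json` and the code repo for the actual architecture before calling `load_state_dict`.

```python
import torch
import torch.nn as nn

KWAY_CLASSES = ["crime", "hate", "misinfo", "privacy", "sexual", "violence"]


class KWayMLP(nn.Module):
    """Hypothetical 256 -> 256 -> 128 -> 6 head; layer order and dropout rate are assumed."""

    def __init__(self, in_dim=256, hidden=(256, 128), n_classes=6, p_drop=0.1):
        super().__init__()
        layers, prev = [], in_dim
        for width in hidden:
            layers += [nn.Linear(prev, width), nn.BatchNorm1d(width), nn.ReLU(), nn.Dropout(p_drop)]
            prev = width
        layers.append(nn.Linear(prev, n_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


model = KWayMLP().eval()  # eval() disables dropout and uses BatchNorm running statistics
z = torch.randn(1, 256)   # stand-in for a PCA-whitened, standardised feature vector
with torch.no_grad():
    logits = model(z)
category = KWAY_CLASSES[logits.argmax(dim=-1).item()]
```

Calling `eval()` matters for single-sample inference: in training mode `BatchNorm1d` rejects a batch of size one.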

## How CARD uses them at inference

1. Run **one** frozen-VLM prefill pass on the (image, text) input; collect per-layer
   last-token hidden states.
2. **Binary**: project the hidden states onto the `binary_probes` directions, aggregate into a
   `SafetyScore`, and threshold it at a value calibrated to a target false-positive rate (FPR)
   on benign inputs → safe / unsafe.
3. **K-way** (if unsafe): take the `selected_layer` hidden state → PCA-whiten +
   standardise (`pca_scaler.pkl`) → `kway_mlp` → harm category.
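
The binary stage in step 2 can be sketched on synthetic data as follows. This is not the CARD implementation: it assumes the aggregate is a plain mean over layers of projections onto unit-normalised directions, and that the threshold is the `1 - FPR` quantile of scores on a benign calibration set. The names `safety_score` and `fpr_threshold` and the array shapes are hypothetical; see the code repo for the actual aggregation.

```python
import numpy as np


def safety_score(hidden, directions):
    """Mean over layers of each layer's hidden state projected onto its probe direction.

    hidden, directions: arrays of shape (n_layers, hidden_size).
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return float(np.mean(np.sum(hidden * unit, axis=1)))


def fpr_threshold(benign_scores, target_fpr):
    """Score above which roughly `target_fpr` of the benign calibration samples fall."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))


rng = np.random.default_rng(0)
n_layers, hidden_size = 4, 16
directions = rng.normal(size=(n_layers, hidden_size))  # synthetic stand-in for probe directions

# Synthetic calibration set: benign hidden states are pure noise.
benign = rng.normal(size=(1000, n_layers, hidden_size))
benign_scores = np.array([safety_score(h, directions) for h in benign])
tau = fpr_threshold(benign_scores, target_fpr=0.05)

# A hidden state shifted along the probe directions should land above the threshold.
shifted = rng.normal(size=(n_layers, hidden_size)) + 3.0 * directions
is_unsafe = safety_score(shifted, directions) > tau
```

Calibrating the threshold on benign scores alone fixes the false-positive rate directly, without needing labelled unsafe examples at calibration time.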

```python
import json
import pickle

import numpy as np
import torch

cfg = json.load(open("qwen3vl_8b/config.json"))
with open("qwen3vl_8b/pca_scaler.pkl", "rb") as f:
    pp = pickle.load(f)  # dict with keys "pca" and "scaler"
sd = torch.load("qwen3vl_8b/kway_mlp.pt", map_location="cpu")  # MLP state_dict
# h = last-token hidden state at cfg["selected_layer"], array of shape (1, hidden_size)
# z = pp["scaler"].transform(pp["pca"].transform(h))
# logits = MLP(z); category = cfg["kway_classes"][logits.argmax()]
```
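
The preprocessing step (`pca` followed by `scaler`) can be tried without the released files. The sketch below fits a comparable pair of objects on synthetic data and applies them in the same order as the snippet above. It assumes the pickled objects are scikit-learn's `PCA` (with `whiten=True`) and `StandardScaler`, which this card does not state explicitly, and it uses toy sizes rather than the 256 components of the released scaler.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
hidden_size, n_components = 64, 8            # toy sizes for illustration only
calib = rng.normal(size=(500, hidden_size))  # synthetic stand-in for calibration hidden states

# Fit PCA whitening first, then standardise the whitened features.
pca = PCA(n_components=n_components, whiten=True, random_state=0).fit(calib)
scaler = StandardScaler().fit(pca.transform(calib))

# Round-trip through pickle with the same {"pca", "scaler"} keys as above.
pp = pickle.loads(pickle.dumps({"pca": pca, "scaler": scaler}))

h = rng.normal(size=(1, hidden_size))        # one last-token hidden state
z = pp["scaler"].transform(pp["pca"].transform(h))
```

The resulting `z` has one row and `n_components` columns, which is the shape the K-way head expects as input (256 columns for the released heads).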

See the [code repo](https://github.com/kevin-Abbring/CARD) for the full inference and
training pipeline. In-domain 6-way accuracy is 88–90% across all four backbones
(paper Table 2).

## Citation

```bibtex
@inproceedings{card2026,
  title     = {CARD: Category-Aware Risk Detection for Vision--Language Models},
  booktitle = {European Symposium on Research in Computer Security (ESORICS)},
  year      = {2026}
}
```