---
license: mit
tags:
  - safety
  - vision-language-models
  - content-moderation
  - probing
  - multimodal-safety
library_name: pytorch
---

# CARD Probe Models

Trained **probe heads** for **CARD** (Category-Aware Risk Detection for Vision–Language
Models, ESORICS 2026). CARD needs no gradients through the VLM: the backbone is **frozen**,
and only these lightweight heads are trained, on cached prefill hidden states.

- **Code:** https://github.com/kevin-Abbring/CARD
- **Evaluation data:** https://huggingface.co/datasets/kbl324/CARD_data

## Backbones

One folder per backbone (heads are backbone-specific; the public VLM weights are **not**
redistributed here — load them from their original sources):

- `qwen3vl_4b/` · `qwen3vl_8b/` — Qwen3-VL-Instruct
- `gemma3_12b/` — Gemma-3-12B-it
- `llava15_7b/` — LLaVA-1.5-7B

## Files (per backbone)

| file | contents |
|---|---|
| `kway_mlp.pt` | K-way category head: MLP `256→256→128→6` (BatchNorm/ReLU/Dropout), PyTorch `state_dict` |
| `pca_scaler.pkl` | pickled dict with keys `pca` and `scaler`: PCA whitening (256 components) + `StandardScaler`, fitted on the curated calibration set |
| `binary_probes.npz` | per-layer probe directions for the binary `SafetyScore`: `refusal` + 6 category mean-difference vectors |
| `config.json` | selected read-out layer, depth %, dims, class list, architecture |

`kway_classes = [crime, hate, misinfo, privacy, sexual, violence]`.
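
For orientation, the K-way head described in the table might look roughly like the sketch below. It is an illustration only: the order of BatchNorm/ReLU/Dropout inside each block, the dropout rate, the helper name `build_kway_head`, and the use of `nn.Sequential` are all assumptions, so the parameter names here may not match the keys stored in `kway_mlp.pt`. The weights are random; use the model class from the code repo to load the released `state_dict`.

```python
import torch
from torch import nn

KWAY_CLASSES = ["crime", "hate", "misinfo", "privacy", "sexual", "violence"]


def build_kway_head(in_dim: int = 256, n_classes: int = 6, p_drop: float = 0.3) -> nn.Sequential:
    """Hypothetical 256 -> 256 -> 128 -> 6 head (layer order and p_drop are assumed)."""
    return nn.Sequential(
        nn.Linear(in_dim, 256),
        nn.BatchNorm1d(256),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(256, 128),
        nn.BatchNorm1d(128),
        nn.ReLU(),
        nn.Dropout(p_drop),
        nn.Linear(128, n_classes),
    )


head = build_kway_head()
head.eval()  # BatchNorm uses running statistics and Dropout is disabled

z = torch.randn(1, 256)  # stand-in for one PCA-whitened, standardised feature vector
with torch.no_grad():
    logits = head(z)

# Map the arg-max logit back to a class name.
category = KWAY_CLASSES[logits.argmax(dim=1).item()]
```

Calling `head.eval()` matters for single-sample inference: in training mode `BatchNorm1d` would try to compute batch statistics from one example.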

## How CARD uses them at inference

1. Run **one** frozen-VLM prefill pass on the (image, text) input; collect per-layer
   last-token hidden states.
2. **Binary**: project hidden states onto the `binary_probes` directions, aggregate into a
   `SafetyScore`, threshold at a target benign FPR → safe / unsafe.
3. **K-way** (if unsafe): take the `selected_layer` hidden state → PCA-whiten +
   standardise (`pca_scaler.pkl`) → `kway_mlp` → harm category.

```python
import json
import pickle

import torch

backbone = "qwen3vl_8b"
with open(f"{backbone}/config.json") as f:
    cfg = json.load(f)
with open(f"{backbone}/pca_scaler.pkl", "rb") as f:
    pp = pickle.load(f)                                          # {"pca", "scaler"}
sd = torch.load(f"{backbone}/kway_mlp.pt", map_location="cpu")   # MLP state_dict

# h = last-token hidden state at cfg["selected_layer"], shape (1, hidden_size)
# z = pp["scaler"].transform(pp["pca"].transform(h))
# logits = MLP(z);  category = cfg["kway_classes"][logits.argmax()]
```
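
The snippet above covers only the K-way artifacts. The binary stage (step 2) can be sketched on synthetic data as follows. This is a toy illustration under stated assumptions, not CARD's implementation: the mean-over-layers aggregation, the unit-normalisation of directions, and the helper names `safety_score` and `threshold_at_fpr` are hypothetical, and the arrays are random stand-ins rather than the contents of `binary_probes.npz`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, hidden = 4, 32  # toy sizes, not real backbone dimensions


def safety_score(hidden_states: np.ndarray, directions: np.ndarray) -> float:
    """Project each layer's last-token state onto its probe direction, then average.

    Both arrays have shape (n_layers, hidden). Averaging over layers is an
    assumed aggregation rule.
    """
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    per_layer = np.einsum("ld,ld->l", hidden_states, unit)  # one projection per layer
    return float(per_layer.mean())


def threshold_at_fpr(benign_scores: np.ndarray, target_fpr: float) -> float:
    """Return the score above which roughly `target_fpr` of benign inputs fall."""
    return float(np.quantile(benign_scores, 1.0 - target_fpr))


# Stand-in for one set of per-layer probe directions.
directions = rng.normal(size=(n_layers, hidden))

# Synthetic benign calibration set: pure noise, unrelated to the directions.
benign = rng.normal(size=(1000, n_layers, hidden))
benign_scores = np.array([safety_score(h, directions) for h in benign])
tau = threshold_at_fpr(benign_scores, target_fpr=0.05)

# Synthetic "unsafe" input: noise shifted along the probe directions.
unsafe_state = rng.normal(size=(n_layers, hidden)) + 3.0 * directions
is_unsafe = safety_score(unsafe_state, directions) > tau
```

Fixing the threshold from benign scores alone means the false-positive rate on benign traffic is set directly, independent of how unsafe inputs are distributed.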

See the [code repo](https://github.com/kevin-Abbring/CARD) for the full inference and
training pipeline. In-domain 6-way accuracy is 88–90 % across all four backbones
(paper Table 2).

## Citation

```bibtex
@inproceedings{card2026,
  title     = {CARD: Category-Aware Risk Detection for Vision--Language Models},
  booktitle = {European Symposium on Research in Computer Security (ESORICS)},
  year      = {2026}
}
```