	---
	license: mit
	tags:
	- safety
	- vision-language-models
	- content-moderation
	- probing
	- multimodal-safety
	library_name: pytorch
	---

	# CARD Probe Models

	Trained probe heads for CARD (Category-Aware Risk Detection for Vision–Language
	Models, ESORICS 2026). CARD is gradient-free with respect to the VLM: the backbone
	stays frozen and only these lightweight heads are trained, on cached prefill hidden states.

	- Code: https://github.com/kevin-Abbring/CARD
	- Evaluation data: https://huggingface.co/datasets/kbl324/CARD_data

	## Backbones

	One folder per backbone (heads are backbone-specific; the public VLM weights are not
	redistributed here — load them from their original sources):

	- `qwen3vl_4b/` · `qwen3vl_8b/` — Qwen3-VL-Instruct
	- `gemma3_12b/` — Gemma-3-12B-it
	- `llava15_7b/` — LLaVA-1.5-7B

	## Files (per backbone)

	| file | contents |
	|---|---|
	| `kway_mlp.pt` | K-way category head: MLP `256→256→128→6` (BatchNorm/ReLU/Dropout), PyTorch `state_dict` |
	| `pca_scaler.pkl` | the PCA whitening (256 components) + `StandardScaler` fitted on the curated calibration set |
	| `binary_probes.npz` | per-layer probe directions for the binary `SafetyScore`: `refusal` + 6 category mean-difference vectors |
	| `config.json` | selected read-out layer, depth %, dims, class list, architecture |

	`kway_classes = [crime, hate, misinfo, privacy, sexual, violence]`.
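For orientation, here is a minimal sketch of what a `256→256→128→6` head with BatchNorm/ReLU/Dropout could look like in PyTorch. The class name `KWayHead`, the layer order inside each block (Linear, BatchNorm, ReLU, Dropout), and the dropout rate are assumptions made for illustration; the authoritative definition is in the code repo, and the parameter names in `kway_mlp.pt` may differ from the ones this sketch produces.

```python
import torch
import torch.nn as nn

KWAY_CLASSES = ["crime", "hate", "misinfo", "privacy", "sexual", "violence"]


class KWayHead(nn.Module):
    """Hypothetical 256->256->128->6 head; block order and dropout rate are guesses."""

    def __init__(self, in_dim=256, hidden=(256, 128), n_classes=6, p_drop=0.1):
        super().__init__()
        blocks, prev = [], in_dim
        for width in hidden:
            blocks += [nn.Linear(prev, width), nn.BatchNorm1d(width),
                       nn.ReLU(), nn.Dropout(p_drop)]
            prev = width
        blocks.append(nn.Linear(prev, n_classes))  # final layer outputs 6 logits
        self.net = nn.Sequential(*blocks)

    def forward(self, z):
        return self.net(z)


# eval(): BatchNorm uses its running statistics and Dropout is disabled.
head = KWayHead().eval()
z = torch.randn(1, 256)  # stand-in for one whitened, standardised feature vector
with torch.no_grad():
    logits = head(z)
category = KWAY_CLASSES[int(logits.argmax(dim=-1))]
```

Before calling `head.load_state_dict(...)` on a downloaded `kway_mlp.pt`, it is worth comparing the keys of the loaded `state_dict` with `head.state_dict().keys()`; if they differ, use the module definition from the code repo instead of this sketch.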

	## How CARD uses them at inference

	1. Run one frozen-VLM prefill pass on the (image, text) input; collect per-layer
	last-token hidden states.
	2. Binary: project hidden states onto the `binary_probes` directions, aggregate into a
	`SafetyScore`, threshold at a target benign FPR → safe / unsafe.
	3. K-way (if unsafe): take the `selected_layer` hidden state → PCA-whiten +
	standardise (`pca_scaler.pkl`) → `kway_mlp` → harm category.
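Step 2 can be sketched with NumPy on synthetic data. Everything below is illustrative: the array shapes, the mean aggregation over layers and directions, the convention that a higher score means "more unsafe", and the function name `safety_score` are assumptions, not the aggregation rule CARD actually uses. The sketch only shows how a threshold can be set from benign calibration scores at a target false-positive rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, hidden_size, n_dirs = 4, 32, 7  # 7 = refusal + 6 category directions

# Stand-in unit-norm probe directions; the real ones ship in binary_probes.npz.
directions = rng.normal(size=(n_layers, n_dirs, hidden_size))
directions /= np.linalg.norm(directions, axis=-1, keepdims=True)


def safety_score(states):
    """states: (n, n_layers, hidden_size) -> (n,) scores (mean aggregation is assumed)."""
    proj = np.einsum("nld,lkd->nlk", states, directions)  # dot product with each direction
    return proj.mean(axis=(1, 2))


# Synthetic benign states, and "unsafe" states shifted along the probe directions.
benign = rng.normal(size=(500, n_layers, hidden_size))
unsafe = rng.normal(size=(200, n_layers, hidden_size)) + 8.0 * directions.mean(axis=1)

# Calibrate: choose the threshold so that about 5% of benign inputs are flagged.
target_fpr = 0.05
benign_scores = safety_score(benign)
threshold = np.quantile(benign_scores, 1.0 - target_fpr)

is_unsafe = safety_score(unsafe) > threshold
fpr = float((benign_scores > threshold).mean())
```

Setting the threshold from benign scores alone fixes the false-positive rate independently of how the unsafe examples are distributed, which is what "threshold at a target benign FPR" implies.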

	```python
	import json
	import pickle
	import torch

	with open("qwen3vl_8b/config.json") as f:
	    cfg = json.load(f)
	with open("qwen3vl_8b/pca_scaler.pkl", "rb") as f:
	    pp = pickle.load(f)  # dict with keys "pca" and "scaler"
	sd = torch.load("qwen3vl_8b/kway_mlp.pt", map_location="cpu")  # MLP state_dict

	# h = last-token hidden state at cfg["selected_layer"], shape (1, hidden_size)
	# z = pp["scaler"].transform(pp["pca"].transform(h))
	# logits = MLP(torch.as_tensor(z, dtype=torch.float32))  # MLP: K-way head with sd loaded
	# category = cfg["kway_classes"][int(logits.argmax())]
	```
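The snippet above leaves `h` undefined. As a rough illustration of what "per-layer last-token hidden states from a frozen prefill pass" means, the sketch below uses a toy stack of layers in place of a real VLM. The toy model, the sizes, the helper name `prefill_last_token_states`, and the value of `selected_layer` are all hypothetical. With Hugging Face Transformers models, per-layer states are commonly requested with `output_hidden_states=True`; whether CARD extracts them that way is not stated here, so check the code repo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, n_layers, seq_len = 32, 4, 7

# Toy stand-in for a frozen backbone (NOT a real VLM): a small stack of layers.
layers = nn.ModuleList(
    [nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.GELU()) for _ in range(n_layers)]
)
for p in layers.parameters():
    p.requires_grad_(False)  # frozen: no backbone parameter is trained
layers.eval()


def prefill_last_token_states(x):
    """x: (1, seq_len, hidden_size) -> (n_layers, hidden_size), one row per layer."""
    states = []
    with torch.no_grad():  # a single forward pass, no gradients
        for layer in layers:
            x = layer(x)
            states.append(x[:, -1, :])  # keep only the final token position
    return torch.cat(states, dim=0)


x = torch.randn(1, seq_len, hidden_size)  # stand-in for embedded (image, text) tokens
per_layer = prefill_last_token_states(x)

selected_layer = 2  # hypothetical; the real index is read from cfg["selected_layer"]
h = per_layer[selected_layer].unsqueeze(0).numpy()  # shape (1, hidden_size)
```

The full `per_layer` tensor is what the binary probes would consume, while the single row `h` feeds the K-way path.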

	See the [code repo](https://github.com/kevin-Abbring/CARD) for the full inference and
	training pipeline. In-domain 6-way accuracy is 88–90 % across all four backbones
	(paper Table 2).

	## Citation

	```bibtex
	@inproceedings{card2026,
	title = {CARD: Category-Aware Risk Detection for Vision--Language Models},
	booktitle = {European Symposium on Research in Computer Security (ESORICS)},
	year = {2026}
	}
	```