	---
	license: mit
	language:
	- en
	tags:
	- text-classification
	- prompt-injection
	- deberta-v3
	- lora
	base_model: microsoft/deberta-v3-base
	datasets:
	- deepset/prompt-injections
	- Lakera/gandalf_ignore_instructions
	- Lakera/gandalf_summarization
	- OpenAssistant/oasst1
	- microsoft/llmail-inject-challenge
	- leolee99/NotInject
	---

	# pid-runs-v2.2 — V2.2 canonical H100 checkpoints

	Trained-model checkpoints from the canonical V2.2 evidence run of
	[`brandon-behring/prompt-injection-sdd`](https://github.com/brandon-behring/prompt-injection-sdd).
	This repo ships 14 fragment checkpoints (10 LoRA adapters and 4 fully
	fine-tuned DeBERTa-v3-base models) so a colleague can reproduce the canonical
	V2.2 numbers without re-training on an H100.

	## Provenance

	- source run id: `20260511T181707Z-6a180a3a`
	- source repo commit: see the run's `run_metadata.json` for the canonical SHA
	- result schema: `v2.2-evidence-1`
	- hardware: NVIDIA H100 80 GB HBM3 (RunPod)
	- evidence package: [GitHub Release `v2.2-evidence`](https://github.com/brandon-behring/prompt-injection-sdd/releases/tag/v2.2-evidence)
	- claim-gate status: 10/10 claim gates passed; evidence-package gate passed;
	  stronger-model claim gate failed (V2.2 is a successful evidence-package
	  run, not a promoted stronger-model result).

	## Fragments

	- `full_ft_lr1e-5_seed_42/`
	- `full_ft_v21_seed_42/`
	- `full_ft_v21_seed_43/`
	- `full_ft_v21_seed_44/`
	- `lora_no_notinject_seed_42/`
	- `lora_no_notinject_seed_43/`
	- `lora_no_notinject_seed_44/`
	- `lora_r16_qv_seed_42/`
	- `lora_r16_qv_seed_43/`
	- `lora_r16_qv_seed_44/`
	- `lora_v21_seed_42/`
	- `lora_v21_seed_43/`
	- `lora_v21_seed_44/`
	- `lora_v21_seed_45/`

	Each fragment directory contains the files needed to reload the model for
	inference: tokenizer, config, weights, and the training config that
	produced them. The reference scorers (`frozen_probe`, `lr_tfidf`,
	`protectai_v1`, `protectai_v2`) are inference-only and are not included
	in this repo.
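
As a minimal sketch of reloading a fragment, the helper below reads the fragment kind, configuration, and seed from the directory name and then picks a loader. It rests on assumptions not confirmed by this card: that `lora_*` directories are standard PEFT adapter folders and that `full_ft_*` directories are regular `transformers` checkpoints. The function names `describe_fragment` and `load_fragment` are hypothetical.

```python
import re
from pathlib import Path

# Fragment directories are named "<config>_seed_<seed>", e.g. "lora_r16_qv_seed_43".
FRAGMENT_RE = re.compile(r"^(?P<config>.+)_seed_(?P<seed>\d+)$")


def describe_fragment(name):
    """Split a fragment directory name into (kind, config, seed)."""
    match = FRAGMENT_RE.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"unrecognized fragment name: {name!r}")
    config = match["config"]
    if config.startswith("lora_"):
        kind = "lora"
    elif config.startswith("full_ft_"):
        kind = "full_ft"
    else:
        raise ValueError(f"unknown fragment kind: {name!r}")
    return kind, config, int(match["seed"])


def load_fragment(fragment_dir):
    """Load tokenizer and model from a downloaded fragment directory.

    Assumption: LoRA fragments are PEFT adapter folders; full-FT fragments
    are plain transformers checkpoints. Adjust if the layout differs.
    """
    # Imported lazily so describe_fragment works without these packages.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    kind, _, _ = describe_fragment(Path(fragment_dir).name)
    tokenizer = AutoTokenizer.from_pretrained(fragment_dir)
    if kind == "lora":
        from peft import AutoPeftModelForSequenceClassification

        model = AutoPeftModelForSequenceClassification.from_pretrained(fragment_dir)
    else:
        model = AutoModelForSequenceClassification.from_pretrained(fragment_dir)
    return tokenizer, model.eval()
```

If the adapter folders do not record their base model, loading `microsoft/deberta-v3-base` first and attaching the adapter with `peft.PeftModel.from_pretrained` is the usual alternative.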

	## Usage

	```python
	from huggingface_hub import snapshot_download

	# Download a single fragment:
	local_dir = snapshot_download(
	    repo_id="BBehring/pid-runs-v2.2",
	    allow_patterns=["lora_r16_qv_seed_43/*"],
	)
	```

	Or fetch the entire repo for an end-to-end reanalyze workflow as documented
	in [`docs/DIAGNOSTICS.md`](https://github.com/brandon-behring/prompt-injection-sdd/blob/main/docs/DIAGNOSTICS.md)
	Level 4 path 4a.
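
To pull every seed of one configuration rather than a single fragment, the patterns can be generated from the directory naming scheme. This is a sketch only: `fragment_patterns` and `download_fragments` are hypothetical helpers, not part of the source repo.

```python
REPO_ID = "BBehring/pid-runs-v2.2"


def fragment_patterns(config, seeds):
    """Build allow_patterns entries for '<config>_seed_<seed>/' directories."""
    return [f"{config}_seed_{seed}/*" for seed in seeds]


def download_fragments(config, seeds):
    """Download only the requested fragments; returns the local snapshot path."""
    # Imported lazily so the pattern helper has no third-party dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=fragment_patterns(config, seeds),
    )
```

For example, `download_fragments("lora_v21", [42, 43, 44, 45])` would fetch the four `lora_v21` seeds listed above. Calling `snapshot_download(repo_id=REPO_ID)` with no `allow_patterns` fetches the whole repo.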

	## How to read the V2.2 evidence

	Per the
	[comprehensive evidence report](https://github.com/brandon-behring/prompt-injection-sdd/blob/main/docs/v2-2-comprehensive-evidence-report.md):

	- Eval slices answer different claim questions; do not macro-average them.
	- `older_poc_holdout` is the external-shift anchor; treat `ProtectAI v2`'s
	  0.938 PR-AUC there as leakage-suspected (see
	  `analysis/deep_dive/protectai_leakage_refinement.json` in the source repo).
	- `lakera_within_source_heldout` is saturated (all scorers ≥ 0.989 PR-AUC);
	  use it as a split-hygiene check, not as a robustness claim.
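
The first point above can be illustrated with a small sketch that reports one PR-AUC per slice and never combines them. It makes several assumptions: average precision is used as the PR-AUC estimator (the card does not say which estimator the source repo uses), ties in scores are not handled specially, the slice names are placeholders, and the rows are synthetic toy values rather than results from this run.

```python
from collections import defaultdict


def average_precision(labels, scores):
    """Mean of precision@k taken at the rank k of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("no positive examples in this slice")
    return precision_sum / hits


def per_slice_pr_auc(rows):
    """Group (slice, label, score) rows by slice and score each slice alone."""
    grouped = defaultdict(lambda: ([], []))
    for slice_name, label, score in rows:
        labels, scores = grouped[slice_name]
        labels.append(label)
        scores.append(score)
    return {
        name: average_precision(labels, scores)
        for name, (labels, scores) in grouped.items()
    }


# Synthetic toy rows with placeholder slice names (not V2.2 data).
toy_rows = [
    ("slice_a", 1, 0.9), ("slice_a", 0, 0.8),
    ("slice_a", 1, 0.7), ("slice_a", 0, 0.1),
    ("slice_b", 1, 0.6), ("slice_b", 0, 0.2),
]
report = per_slice_pr_auc(toy_rows)  # one number per slice, no average
```

Keeping the result as a per-slice dictionary makes it harder to collapse the slices into a single headline number by accident.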

	## License

	MIT (matches the source repo).