marklkelly/bert-tiny-injection-detector

@@ -300,7 +300,17 @@ The model was trained on **160,239 examples** from three sources. The `allenai/w
 | [`darkknight25/Prompt_Injection_Benign_Prompt_Dataset`](https://huggingface.co/datasets/darkknight25/Prompt_Injection_Benign_Prompt_Dataset) | 393 | 52 | Benign supplement |
 | **Total** | **160,239** | **20,027** | |
-Dataset construction used exact SHA-256 deduplication, text-length filtering (8–4,000 characters), and stratified splitting. Internal dataset identifier: `pi_mix_v1_injection_only`. Training artifact date: 2026-03-17.
+### Dataset Construction
+The source datasets differ in label format and field names, so labels were normalised to a binary scheme (0 = `SAFE`, 1 = `INJECTION`) during ingestion. The build pipeline is recipe-driven: a YAML file specifies each source, its label mapping, and any per-source filters; `ml/data/build.py` executes the recipe and writes the final train/val splits.
+After loading and normalising, the pipeline applies:
+1. **Text-length filtering** — examples shorter than 8 characters or longer than 4,000 characters are dropped.
+2. **SHA-256 deduplication** — exact-duplicate texts are removed on the combined pool before splitting.
+3. **Stratified splitting** — the deduplicated pool is split into train and validation sets with stratification on the label, preserving class balance across both splits.
+Additional sources (`neuralchemy/Prompt-injection-dataset`, `wambosec/prompt-injections-subtle`) were evaluated in later recipe iterations but are not included in the production model, which was built from the `pi_mix_v1_injection_only` recipe (also the internal dataset identifier). Training artifact date: 2026-03-17.
 ---
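
The three steps listed under "Dataset Construction" can be sketched roughly as follows. This is a minimal illustration, not the contents of `ml/data/build.py`: it assumes examples have already been normalised to `(text, label)` pairs, and the function names, `val_fraction`, and `seed` are hypothetical. Only the 8 to 4,000 character bounds, the SHA-256 hashing, and the ordering of the steps come from the text above.

```python
import hashlib
import random
from collections import defaultdict

MIN_CHARS, MAX_CHARS = 8, 4000  # length bounds stated in the README


def filter_by_length(examples):
    """Keep examples whose text is between 8 and 4,000 characters long."""
    return [(t, y) for t, y in examples if MIN_CHARS <= len(t) <= MAX_CHARS]


def dedupe_exact(examples):
    """Drop exact-duplicate texts (by SHA-256), keeping the first occurrence."""
    seen = set()
    unique = []
    for text, label in examples:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((text, label))
    return unique


def stratified_split(examples, val_fraction=0.1, seed=0):
    """Split each label group separately so both splits keep similar class proportions.

    val_fraction and seed are assumed values, not taken from the recipe.
    """
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


def build(examples, val_fraction=0.1, seed=0):
    """Filter by length, deduplicate the combined pool, then split."""
    pool = dedupe_exact(filter_by_length(examples))
    return stratified_split(pool, val_fraction, seed)
```

Deduplicating the combined pool before splitting, as the README describes, means an identical text cannot end up in both the train and validation sets.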