This model detects distributed prompt injection attacks across multi-turn conversations. By requesting access, you agree to:

  1. Use this model only for defensive security, detection, or academic research
  2. Not reverse-engineer detection patterns to develop evasion techniques for malicious purposes
  3. Cite the associated paper in any published work

Multi-Turn Distributed Prompt Injection Detector

Model Description

A dual-encoder architecture for detecting distributed prompt injection attacks across multi-turn conversations. The system combines a frozen single-turn GRU encoder (2.6M parameters) with a trainable sequence LSTM (27K parameters) that learns temporal attack patterns across conversation turns. The model achieves F1=0.837 on a shared-prefix test set with 4 difficulty tiers and 4 attack strategies, significantly outperforming all voting baselines (p < 0.001 via paired bootstrap).

Model type: lstm-gru-dual-encoder

Architecture

Turn 1 → [Frozen GRU Encoder] → 32-dim ─┐
Turn 2 → [Frozen GRU Encoder] → 32-dim ─┤
Turn 3 → [Frozen GRU Encoder] → 32-dim ─┼→ [Sequence LSTM (64-dim)] → Dense(64→32→1)
  ...                                   │
Turn N → [Frozen GRU Encoder] → 32-dim ─┘
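
The trainable part of the diagram can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the class name `SequenceHead`, the single-layer unidirectional LSTM, the ReLU activation, and the placement of dropout are all assumptions. Under these assumptions the head has 27,201 parameters, which is consistent with the 27K figure quoted for the sequence LSTM.

```python
import torch
import torch.nn as nn


class SequenceHead(nn.Module):
    """Sequence LSTM over per-turn embeddings, followed by Dense(64 -> 32 -> 1)."""

    def __init__(self, turn_dim: int = 32, hidden_dim: int = 64, dropout: float = 0.3):
        super().__init__()
        # Assumption: one unidirectional LSTM layer.
        self.lstm = nn.LSTM(turn_dim, hidden_dim, batch_first=True)
        # Assumption: ReLU and dropout between the two dense layers.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, 32),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(32, 1),
        )

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (batch, n_turns, turn_dim), one row per encoded turn.
        _, (h_n, _) = self.lstm(turn_embeddings)
        # h_n[-1] is the final hidden state of the last layer: (batch, hidden_dim).
        return self.classifier(h_n[-1]).squeeze(-1)  # raw logits, shape (batch,)


def count_parameters(module: nn.Module) -> int:
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())
```

The head returns raw logits; applying a sigmoid gives the attack probability that the decision threshold is compared against.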

Models Included

| File | Description | Trainable Params | F1 (v3 test) |
|---|---|---|---|
| v3_gru_retrain.pt | Frozen GRU turn encoder | 2.6M (frozen) | 0.815 (single-turn) |
| v3_iter5_multiturn.pt | Temporal LSTM sequence classifier | 27K | 0.837 |
| v3_iter6_attention.pt | LSTM + additive attention | 29K | 0.837 |
| v3_distilbert_concat.pt | Concatenated DistilBERT baseline | 66.4M | 0.992 |
| v3_distilbert_hier.pt | Hierarchical DistilBERT baseline | 5.5M | 0.976 |
| vocab.json | Vocabulary (20K tokens) | | |

Ablation Models

| File | Description | F1 |
|---|---|---|
| v3_ablation_shuffled.pt | Shuffled turn order | 0.760 |
| v3_ablation_reversed.pt | Reversed turn order | 0.833 |
| v3_ablation_mean_pool.pt | Mean pooling (no LSTM) | 0.755 |
| v3_ablation_max_pool.pt | Max pooling (no LSTM) | 0.719 |
| v3_ablation_continuation.pt | Post-branch turns only | 0.846 |
| v3_ablation_prefix.pt | Prefix turns only | 0.667 |
| v3_ablation_autoencoder.pt | Autoencoder encoder | 0.845 |

Intended Use

Defensive security systems that monitor multi-turn LLM conversations for distributed prompt injection attacks. The model is designed for deployment as a secondary classifier alongside single-turn detectors in LLM guardrail pipelines. Target users include security researchers, AI safety teams, and organizations deploying LLM-based applications that need to detect attacks distributed across multiple conversation turns.

Out-of-Scope Uses

  • Developing or refining adversarial prompt injection attacks
  • Bypassing AI safety filters or content moderation systems
  • Surveillance of private conversations without consent
  • Any application that violates the responsible-use terms

Training Details

Training Data

The model was trained in two phases using distinct datasets:

  • Phase 1 (Single-Turn): 73,390 prompt injection samples from 8 HuggingFace datasets, cleaned and deduplicated
  • Phase 2 (Multi-Turn): 18,754 multi-turn synthetic conversations from the v3 shared-prefix dataset, containing 4 attack strategies (fragment distribution 45%, gradual escalation 25%, context priming 15%, instruction layering 15%) across 4 difficulty tiers (easy, medium, hard, adversarial)

Dataset: rockCO78/multiturn-injection-detection

Training Procedure

Two-phase training pipeline on an NVIDIA Jetson AGX Orin (64GB unified memory, Ampere GPU):

  • Phase 1: GRU turn encoder trained with full backpropagation on single-turn data (2.6M parameters, 20 epochs, batch size 64)
  • Phase 2: GRU encoder frozen, sequence LSTM trained on multi-turn conversations (27K trainable parameters, 20 epochs, batch size 32, early stopping with patience 5)
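
Freezing the encoder for Phase 2 could look like the sketch below. The helper names `freeze` and `trainable_parameters` are hypothetical, and the two modules are stand-ins with illustrative sizes, not the released `GRUClassifier` and `MultiTurnClassifier`.

```python
import torch.nn as nn


def freeze(module: nn.Module) -> nn.Module:
    """Disable gradients and switch to eval mode so dropout stays off."""
    for param in module.parameters():
        param.requires_grad = False
    return module.eval()


def trainable_parameters(module: nn.Module) -> int:
    """Count only the parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Stand-ins with hypothetical sizes; the real encoder is the GRU trained in Phase 1.
encoder = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
sequence_lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

freeze(encoder)
pipeline = nn.ModuleDict({"encoder": encoder, "sequence": sequence_lstm})
```

Note that calling `pipeline.train()` afterwards would put the encoder back into training mode (gradients stay disabled, but dropout would be active again), so a training loop built this way would need to call `encoder.eval()` again after each `train()` call.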

Energy consumption: 0.08 kWh estimated for the full training pipeline.

Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 0.001 |
| Batch size | 32 (multi-turn) / 64 (single-turn) |
| Epochs | 20 (with early stopping) |
| Optimizer | Adam |
| Hidden dimension | 64 (LSTM) / 128 (GRU encoder) |
| Turn embedding dimension | 32 |
| Dropout | 0.3 |
| Weight decay | 0.0001 |
| Scheduler | ReduceLROnPlateau |
| Patience | 5 |
| Random seed | 42 |
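
As a rough sketch, these settings map onto PyTorch as shown below. The model is a stand-in, the `EarlyStopping` class and `run` function are hypothetical, and several details are assumptions: that the scheduler monitors validation loss, that it uses PyTorch's default factor and scheduler patience, and that the listed patience of 5 refers to early stopping.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)  # stand-in model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Assumption: the scheduler watches validation loss with default factor and patience.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")


class EarlyStopping:
    """Stop once validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run(val_losses, max_epochs: int = 20) -> int:
    """Replay per-epoch validation losses; return the number of epochs executed."""
    stopper = EarlyStopping(patience=5)
    executed = 0
    for val_loss in val_losses[:max_epochs]:
        executed += 1
        # One epoch of training and validation would go here.
        scheduler.step(val_loss)
        if stopper.step(val_loss):
            break
    return executed
```

For example, a loss curve that bottoms out at epoch 3 and never improves again stops at epoch 8, while a curve that keeps improving runs for the full 20 epochs.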

Evaluation

Metrics

Evaluated on the v3 shared-prefix test set (5,130 sequences across 4 difficulty tiers). All confidence intervals are 95% bootstrap CIs from 1,000 resamples.
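
A bootstrap confidence interval for F1 can be computed along the following lines. This sketch assumes the percentile method and resampling at the sequence level; the card does not say which bootstrap variant was used, and the function names are hypothetical.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 with label 1 as the positive (attack) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_f1_ci(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample sequences with replacement, recompute F1."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(f1_score([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    # Take the alpha/2 and 1 - alpha/2 quantiles of the resampled scores.
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Calling `bootstrap_f1_ci(labels, predictions)` with the defaults gives a 95% interval from 1,000 resamples, matching the settings described above.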

| Model | F1 | 95% CI | Trainable Params |
|---|---|---|---|
| Temporal LSTM (iter5) | 0.837 | [0.826, 0.847] | 27K |
| +Attention (iter6) | 0.837 | [0.825, 0.848] | 29K |
| DistilBERT Concatenated | 0.992 | [0.989, 0.994] | 66.4M |
| DistilBERT Hierarchical | 0.976 | [0.971, 0.980] | 5.5M |

Decision threshold: 0.5 (sigmoid output). Threshold-tuned variant at 0.64 achieves F1=0.995 on validation.
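
To make the thresholding step concrete, here is a small sketch of applying a threshold to the sigmoid output and of sweeping candidate thresholds on a validation set. The function names are hypothetical, and two details are assumptions: that a probability equal to the threshold counts as an attack, and that the tuned value was chosen by maximizing validation F1.

```python
import math


def sigmoid(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def classify(logit, threshold=0.5):
    """Return 1 (attack) when the sigmoid probability reaches the threshold."""
    return int(sigmoid(logit) >= threshold)


def f1_score(y_true, y_pred):
    """Binary F1 with label 1 as the positive (attack) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(probs, labels, candidates=None):
    """Pick the candidate threshold with the highest F1 on a validation set."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = [int(p >= t) for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, score
    return best_t, best_f1
```

A threshold selected this way is fitted to the validation set, so its F1 there is not directly comparable to test-set F1.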

Per-Tier Performance (Temporal LSTM)

| Tier | F1 |
|---|---|
| Easy | 0.872 |
| Medium | 0.828 |
| Hard | 0.828 |
| Adversarial | 0.802 |

Paired bootstrap tests confirm statistical significance for all key comparisons (p < 0.001).
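
One common form of a paired bootstrap test is sketched below. It is an illustration under stated assumptions, not the exact procedure behind the reported p-values: the one-sided formulation, the add-one smoothing, and the function names are all assumptions.

```python
import random


def f1_score(y_true, y_pred):
    """Binary F1 with label 1 as the positive (attack) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_p(y_true, pred_a, pred_b, n_resamples=1000, seed=42):
    """One-sided p-value for the claim that model A has higher F1 than model B.

    Both models are scored on the same resampled indices, which is what
    makes the comparison paired.
    """
    rng = random.Random(seed)
    n = len(y_true)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        f1_a = f1_score(truth, [pred_a[i] for i in idx])
        f1_b = f1_score(truth, [pred_b[i] for i in idx])
        if f1_a - f1_b <= 0:
            not_better += 1
    # Add-one smoothing keeps the estimate away from an exact zero.
    return (not_better + 1) / (n_resamples + 1)
```

With 1,000 resamples the smallest value this estimator can return is 1/1001, which is just under 0.001.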

Technical Limitations

  • Trained exclusively on synthetic data from a single LLM (Claude Sonnet 4.6); cross-model generalization is untested.
  • Fixed conversation length of 6-9 user turns.
  • Residual vocabulary confounds in post-branch turns: a bag-of-words classifier achieves F1 > 0.93 on post-branch text.
  • The temporal LSTM operates on 32-dimensional turn embeddings and cannot access raw vocabulary, but the training signal partially correlates with lexical features.

Ethical Considerations

This model is designed exclusively for defensive security research and prompt injection detection. The synthetic training data contains adversarial prompt patterns that could theoretically inform attack development if misused. Access is gated to mitigate dual-use risk. The model should not be used to develop adversarial attacks, bypass safety systems, or enable malicious prompt injection. Researchers should follow responsible disclosure practices when reporting vulnerabilities discovered using this model.

Safety and Risk Assessment

Safety: The model classifies conversations as benign or attack and does not generate text. Risks include false negatives (attacks pass undetected) and false positives (benign conversations flagged).

Bias: The training data reflects attack patterns from published research (crescendo attacks, foot-in-the-door, context manipulation). Attack strategies not represented in the training data may evade detection.

Model Explainability

The attention variant (v3_iter6_attention.pt) provides attention weights over conversation turns, indicating which turns contributed most to a classification decision. In a turn-order sensitivity analysis, shuffling the turns of correctly classified attacks flips the prediction in 55% of cases, which indicates that the model relies on temporal patterns and not only on per-turn lexical features. Gate activation visualizations show distinct forget/update gate patterns for attack versus benign sequences.
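
A flip rate of this kind can be measured with a few lines of code. In the sketch below, `flip_rate` is a hypothetical helper and `predict` is any callable that maps a list of turns to a 0/1 label; it is not an API of the released model. The caller is assumed to pass in only the attack sequences that were classified correctly in their original order.

```python
import random


def flip_rate(sequences, predict, seed=42):
    """Fraction of sequences whose predicted label changes after shuffling turn order."""
    rng = random.Random(seed)
    flips = 0
    for turns in sequences:
        shuffled = list(turns)  # copy so the original order is left untouched
        rng.shuffle(shuffled)
        if predict(shuffled) != predict(turns):
            flips += 1
    return flips / len(sequences) if sequences else 0.0
```

A classifier that ignores turn order, such as a bag-of-words model over the whole conversation, scores 0.0 on this measure by construction.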

Data Preprocessing

Single-turn data preprocessing: lowercased, whitespace-normalized, deduplicated using MD5 hashing, and filtered to remove sequences shorter than 5 tokens or longer than 512 tokens.

Multi-turn data preprocessing: conversations tokenized per-turn using a 20K-token vocabulary. Each turn encoded independently by the frozen GRU to produce 32-dimensional embeddings. Conversations zero-padded to maximum sequence length within each batch.
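
The single-turn cleaning steps can be sketched with the standard library alone. The function names are hypothetical, and two details are assumptions: tokens are counted by splitting on whitespace, and duplicates are detected on the normalized text.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def clean_single_turn(samples, min_tokens=5, max_tokens=512):
    """Normalize, drop out-of-range lengths, and deduplicate by MD5 digest."""
    seen = set()
    kept = []
    for raw in samples:
        text = normalize(raw)
        n_tokens = len(text.split())  # assumption: whitespace tokenization
        if not (min_tokens <= n_tokens <= max_tokens):
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept
```

Hashing the normalized text means that two samples differing only in case or spacing count as duplicates.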

Sensitive Personal Information

No sensitive personal information was used. The model was trained exclusively on synthetic data generated by an LLM. No personal data, user conversations, or personally identifiable information (PII) was used during training, validation, or evaluation.

Environmental Impact

Energy consumption: 0.08 kWh estimated for the full training pipeline on an NVIDIA Jetson AGX Orin (15W-60W TDP). Carbon footprint estimated at 0.03 kg CO2eq assuming US average grid intensity.

Usage

import json

import torch

# Model classes are imported from the project's source tree
from src.models.single_turn import GRUClassifier
from src.models.multi_turn import MultiTurnClassifier

# Load the vocabulary and the frozen GRU turn encoder
with open("vocab.json", encoding="utf-8") as f:
    vocab = json.load(f)
turn_encoder = GRUClassifier(vocab_size=len(vocab), embed_dim=64, hidden_dim=128)
turn_encoder.load_state_dict(torch.load("v3_gru_retrain.pt", map_location="cpu"))
turn_encoder.eval()

# Load the multi-turn sequence classifier on top of the turn encoder
mt_model = MultiTurnClassifier(turn_encoder=turn_encoder, hidden_dim=64)
mt_model.load_state_dict(torch.load("v3_iter5_multiturn.pt", map_location="cpu"))
mt_model.eval()

Citation

@misc{lambros2026multiturn,
  title={Temporal Detection of Distributed Prompt Injection Attacks in Multi-Turn Conversations},
  author={Lambros, Rock},
  year={2026},
  note={University of Denver, COMP 4531}
}
