| --- |
| base_model: Qwen/Qwen3-4B-Instruct-2507 |
| library_name: peft |
| pipeline_tag: text-generation |
| license: apache-2.0 |
| tags: |
| - base_model:adapter:Qwen/Qwen3-4B-Instruct-2507 |
| - lora |
| - transformers |
| - prompt-injection-detection |
| - security |
| --- |
| |
| # AgentWatcher-Qwen3-4B-Instruct-2507 |
|
|
| AgentWatcher is a detection-based defense against indirect prompt injection in LLM agents. This repository contains the trained **monitor LLM**, which is a LoRA adapter (PEFT) fine-tuned on top of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507). |
|
|
| - **Paper:** [AgentWatcher: A Rule-based Prompt Injection Monitor](https://huggingface.co/papers/2604.01194) |
| - **Repository:** [GitHub - wang-yanting/AgentWatcher](https://github.com/wang-yanting/AgentWatcher) |
|
|
| ## Description |
|
|
| Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. AgentWatcher addresses existing limitations in detection by: |
|
|
| 1. **Causal Context Attribution**: It attributes the LLM's output to a small set of causally influential context segments. By focusing on short, relevant text, it remains scalable even with long contexts. |
| 2. **Rule-based Reasoning**: It utilizes explicit rules to define prompt injection. The monitor LLM reasons over these rules based on the attributed text, making detection decisions more explainable and interpretable than black-box methods. |
|
|
| ## How to Get Started with the Model |
|
|
| You can load this adapter using the `peft` and `transformers` libraries: |
|
|
| ```python |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| from peft import PeftModel |
| import torch |
| |
| base_model_id = "Qwen/Qwen3-4B-Instruct-2507" |
| adapter_id = "SecureLLMSys/AgentWatcher-Qwen3-4B-Instruct-2507" |
| |
| # Load base model |
| base_model = AutoModelForCausalLM.from_pretrained( |
| base_model_id, |
| torch_dtype=torch.bfloat16, |
| device_map="auto" |
| ) |
| |
| # Load the AgentWatcher adapter |
| model = PeftModel.from_pretrained(base_model, adapter_id) |
| tokenizer = AutoTokenizer.from_pretrained(base_model_id) |
| |
| # Example: Prepare a prompt for the monitor LLM to evaluate a context segment |
| prompt = "..." |
| inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
| |
| with torch.no_grad(): |
| outputs = model.generate(**inputs, max_new_tokens=256) |
| print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
| ``` |
|
|
| ## Citation |
|
|
| If you use AgentWatcher in your research, please cite the following paper: |
|
|
| ```bibtex |
| @article{wang2026agentwatcher, |
| title={AgentWatcher: A Rule-based Prompt Injection Monitor}, |
| author={Wang, Yanting and others}, |
| journal={arXiv preprint arXiv:2604.01194}, |
| year={2026} |
| } |
| ``` |