--- base_model: Qwen/Qwen3-4B-Instruct-2507 library_name: peft pipeline_tag: text-generation license: apache-2.0 tags: - base_model:adapter:Qwen/Qwen3-4B-Instruct-2507 - lora - transformers - prompt-injection-detection - security --- # AgentWatcher-Qwen3-4B-Instruct-2507 AgentWatcher is a detection-based defense against indirect prompt injection in LLM agents. This repository contains the trained **monitor LLM**, which is a LoRA adapter (PEFT) fine-tuned on top of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507). - **Paper:** [AgentWatcher: A Rule-based Prompt Injection Monitor](https://huggingface.co/papers/2604.01194) - **Repository:** [GitHub - wang-yanting/AgentWatcher](https://github.com/wang-yanting/AgentWatcher) ## Description Large language models (LLMs) and their applications, such as agents, are highly vulnerable to prompt injection attacks. AgentWatcher addresses existing limitations in detection by: 1. **Causal Context Attribution**: It attributes the LLM's output to a small set of causally influential context segments. By focusing on short, relevant text, it remains scalable even with long contexts. 2. **Rule-based Reasoning**: It utilizes explicit rules to define prompt injection. The monitor LLM reasons over these rules based on the attributed text, making detection decisions more explainable and interpretable than black-box methods. ## How to Get Started with the Model You can load this adapter using the `peft` and `transformers` libraries: ```python from transformers import AutoModelForCausalLM, AutoTokenizer from peft import PeftModel import torch base_model_id = "Qwen/Qwen3-4B-Instruct-2507" adapter_id = "SecureLLMSys/AgentWatcher-Qwen3-4B-Instruct-2507" # Load base model base_model = AutoModelForCausalLM.from_pretrained( base_model_id, torch_dtype=torch.bfloat16, device_map="auto" ) # Load the AgentWatcher adapter model = PeftModel.from_pretrained(base_model, adapter_id) tokenizer = AutoTokenizer.from_pretrained(base_model_id) # Example: Prepare a prompt for the monitor LLM to evaluate a context segment prompt = "..." inputs = tokenizer(prompt, return_tensors="pt").to(model.device) with torch.no_grad(): outputs = model.generate(**inputs, max_new_tokens=256) print(tokenizer.decode(outputs[0], skip_special_tokens=True)) ``` ## Citation If you use AgentWatcher in your research, please cite the following paper: ```bibtex @article{wang2026agentwatcher, title={AgentWatcher: A Rule-based Prompt Injection Monitor}, author={Wang, Yanting and others}, journal={arXiv preprint arXiv:2604.01194}, year={2026} } ```