ODILE β€” weight-level defenses against prompt injection in tool-using agents

ODILE is a family of LoRA adapters that defend tool-using LLM agents against indirect prompt injection β€” adversarial instructions smuggled into tool results (emails, web pages, documents) that the agent then reads and obeys. ODILE refuses injected instructions while leaving benign task behavior intact, and runs at 1Γ— inference cost β€” no detector, no extra passes.

πŸ’» Code (training + evaluation): https://github.com/memo-ozdincer/ODILE

One adapter per backbone, all rank-16 / alpha-32 LoRA adapters:

Headline result

On AgentDojo with Llama-3.3-70B, ODILE reduces attack-success rate from 14.04% to 0.01% while retaining benign utility (59.8% vs. 59.9% base), at 1Γ— inference cost. The same recipe transfers across six Llama and Qwen backbones and to the out-of-distribution AgentDyn suites, where ODILE is the only zero-ASR defense to retain usable benign throughput.

Load any adapter

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.3-70B-Instruct", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "memo-ozdincer/ODILE", subfolder="ODILE_Llama-3.3-70B")

Citation

@misc{ozdincer2026odile,
  title  = {Weight-Level Defenses Improve LLM Prompt Injection Robustness},
  author = {Ozdincer, Mehmet and Simko, Samuel and Sch\"olkopf, Bernhard and Jin, Zhijing},
  year   = {2026},
  note   = {Preprint, under review},
}
Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support