Demo idea: threshold calibration and benign security text
#1
by armorerlabs - opened
Nice demo. A practical addition would be a small threshold-calibration panel with examples that intentionally sit near the boundary.
The cases I would include are:
- obvious malicious injection
- indirect injection embedded in a document/tool result
- benign cybersecurity explanation that mentions “ignore previous instructions” as quoted text
- sensitive-data request phrased politely
- tool-use request that is safe in read-only mode but unsafe if write/network tools are enabled
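A minimal sketch of what such a panel could compute, assuming the guardrail exposes a single risk score in [0, 1]. The `Case` type, the `sweep` helper, and all scores below are hypothetical placeholders, not output from this demo's model. The tool-use case is left out here because its label depends on which tools are enabled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    name: str
    score: float       # hypothetical guardrail risk score in [0, 1]
    should_flag: bool  # hand-assigned label for the calibration set


# Placeholder scores chosen to sit near the boundary; not real model output.
CASES = [
    Case("obvious malicious injection", 0.97, True),
    Case("indirect injection in tool result", 0.62, True),
    Case("benign explanation quoting the phrase", 0.58, False),
    Case("polite sensitive-data request", 0.49, True),
]


def sweep(cases, thresholds):
    """For each threshold, list which cases would be misclassified."""
    rows = []
    for t in thresholds:
        false_pos = [c.name for c in cases if c.score >= t and not c.should_flag]
        false_neg = [c.name for c in cases if c.score < t and c.should_flag]
        rows.append((t, false_pos, false_neg))
    return rows


if __name__ == "__main__":
    for t, fp, fn in sweep(CASES, [0.4, 0.5, 0.6, 0.7]):
        print(f"threshold={t:.2f} false_positives={fp} false_negatives={fn}")
```

With these placeholder numbers no single threshold separates the quoted benign text from the polite sensitive-data request, which is the kind of trade-off the panel would make visible.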
For guardrail demos, the most interesting question is often not “can it catch the obvious attack?” but “does it preserve useful security and dev workflows while still flagging a risky request before the agent performs any side effect?”
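One way to express that trade-off, as a sketch only: apply a stricter threshold when side-effecting tools are enabled, so the same request can pass in read-only mode and be flagged otherwise. The capability names (`write`, `network`), the `gate` function, and both threshold values are assumptions for illustration, not part of the demo.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    FLAG = "flag"


# Assumed capability names for tools that can cause side effects.
SIDE_EFFECT_TOOLS = frozenset({"write", "network"})


def gate(score, enabled_tools, threshold=0.5, strict_threshold=0.3):
    """Flag a request before any tool runs; be stricter if side effects are possible."""
    has_side_effects = bool(SIDE_EFFECT_TOOLS & set(enabled_tools))
    limit = strict_threshold if has_side_effects else threshold
    return Decision.FLAG if score >= limit else Decision.ALLOW


if __name__ == "__main__":
    # Same hypothetical score, different tool configurations.
    print(gate(0.4, {"read"}))
    print(gate(0.4, {"read", "write"}))
```

Keeping the gate separate from the scorer means the read-only workflow stays usable while the decision still happens before the agent acts.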