Nomearod/agentbench

Nomearod Claude Opus 4.6 (1M context) committed on Mar 31

Commit

f7bb777

1 Parent(s): 06bc29e

docs: add security architecture section to README and DECISIONS.md

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (2)

DECISIONS.md +32 -0
README.md +76 -14

DECISIONS.md CHANGED

@@ -281,3 +281,35 @@ request on first `complete()` call with tools and checks if the response contain
 `tool_calls`. The result is cached as `self._supports_tool_calling`. Transient failures
 (timeout, 5xx) return `None` and retry on the next call rather than permanently
 downgrading to prompt-based fallback.

+## Why two-tier injection detection, not three
+The original design included a middle tier (embedding similarity against known injection examples). Dropped because the existing embedding model (all-MiniLM-L6-v2) is a general-purpose sentence encoder, not specialized for adversarial detection. Cosine similarity can't distinguish semantic similarity from intent similarity — "how do I ignore a field in Pydantic?" clusters near "ignore previous instructions" in that embedding space. The threshold between "ambiguous" and "suspicious" is an untunable hyperparameter with no ground truth.
+Two tiers are cleaner: heuristic regex is deterministic (matches or doesn't), DeBERTa classifier is probabilistic (confidence score). No ambiguous handoff between two probabilistic layers. Deployments without GPU get heuristic-only — documented, not hidden.
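A minimal sketch of the two-tier handoff described above, assuming a hypothetical `detect_injection` function, illustrative regex rules, and a classifier passed in as a plain callable that returns an injection probability. The real rule set, threshold, and DeBERTa client are not shown in this commit, so every name here is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Illustrative rules only; the project's actual heuristic patterns are not shown.
HEURISTIC_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.I),
    re.compile(r"reveal\s+(your\s+)?system\s+prompt", re.I),
]


@dataclass
class Verdict:
    safe: bool
    tier: str                # last tier consulted: "heuristic" or "classifier"
    score: Optional[float]   # classifier probability, None for regex-only verdicts


def detect_injection(
    text: str,
    classifier: Optional[Callable[[str], float]] = None,
    threshold: float = 0.9,  # assumed cutoff, not taken from the source
) -> Verdict:
    # Tier 1: deterministic regex. A pattern either matches or it does not.
    for pattern in HEURISTIC_PATTERNS:
        if pattern.search(text):
            return Verdict(safe=False, tier="heuristic", score=None)
    # No classifier configured (e.g. no GPU): heuristic-only mode.
    if classifier is None:
        return Verdict(safe=True, tier="heuristic", score=None)
    # Tier 2: probabilistic classifier returning an injection probability.
    score = classifier(text)
    return Verdict(safe=score < threshold, tier="classifier", score=score)
```

The Pydantic question from the paragraph above passes tier 1 untouched, because the regex keys on the literal phrase rather than on embedding proximity.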
+## Why regex + optional spaCy for PII, not a cloud API
+Three reasons: cost (cloud PII APIs charge per call), latency (adds network round-trip to every retrieved chunk), and data residency (PII leaves the system boundary). Regex covers the PII types with actual legal/compliance risk: SSNs, credit cards, emails, phone numbers, IP addresses.
+spaCy NER (PERSON, ORG) is optional because false-positive rates on technical text are unacceptable without domain tuning. "FastAPI" triggers ORG, "Jordan" triggers PERSON. The optional import pattern (`try: import spacy`) degrades gracefully with a logged warning — no crash if someone sets `use_ner: true` without installing spaCy.
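The optional-import pattern and regex pass might look roughly like the sketch below. The patterns are simplified for illustration, and the `redact` function name, the placeholder format, and the `en_core_web_sm` model name are assumptions rather than the project's actual code.

```python
import logging
import re

logger = logging.getLogger(__name__)

try:
    import spacy  # optional dependency
except ImportError:
    spacy = None

# Simplified illustrative patterns; production rules would be stricter.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def redact(text, use_ner=False):
    """Return (redacted_text, counts_per_type)."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    if use_ner:
        if spacy is None:
            # Degrade gracefully: warn and keep the regex-only result.
            logger.warning("use_ner is set but spaCy is not installed; regex only")
        else:
            # Assumed model name; loading per call is for brevity only.
            nlp = spacy.load("en_core_web_sm")
            # Replace from the end so earlier character offsets stay valid.
            for ent in reversed(nlp(text).ents):
                if ent.label_ in {"PERSON", "ORG"}:
                    text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
                    counts[ent.label_] = counts.get(ent.label_, 0) + 1
    return text, counts
```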
+## Why append-only JSONL for audit, not SQLite
+One codepath, one format, no config branching. JSONL is append-only by nature — no schema migrations, no transactions, no connection pooling. Log rotation handles size. `jq` provides immediate queryability without building a custom API.
+The original design included an optional SQLite backend and a query endpoint (`GET /admin/audit`). Both were dropped: SQLite adds a second storage codepath with no consumer, and the query endpoint would require API key authentication — an inconsistency when `/ask` itself has no auth.
+JSONL imports trivially into SQLite/DuckDB if structured queries are needed later. No bridges burned.
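As a rough illustration of the single-codepath claim, an append-only writer and a matching reader fit in a few lines. The function names and the `ts` field are hypothetical; `session_id` is borrowed from the README's `jq` example.

```python
import json
import time
from pathlib import Path


def append_audit_record(path, record):
    """Append one JSON object per line; existing lines are never rewritten."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    line = json.dumps({"ts": time.time(), **record}, separators=(",", ":"))
    with path.open("a", encoding="utf-8") as f:
        f.write(line + "\n")


def read_audit_records(path):
    """Yield records back in write order, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Because each line is a standalone JSON object, the same file can be streamed into `jq`, SQLite, or DuckDB later without a format change.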
+## Why HMAC-SHA256 IP hashing in audit logs
+Client IPs are hashed with HMAC-SHA256 under a server-side secret before they are logged. Plain SHA-256 was considered and rejected: the IPv4 address space (~4.3 billion addresses) is small enough that unkeyed hashes can be reversed by offline enumeration. With HMAC-SHA256, precomputation is infeasible without the key. The key comes from an explicit parameter, the `AUDIT_HMAC_KEY` environment variable, or (with a logged warning) a random per-process fallback.
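A sketch of the key resolution order and the keyed hash, using only the standard library. The function names `resolve_hmac_key` and `hash_ip` are hypothetical; the `AUDIT_HMAC_KEY` variable and the fallback behaviour follow the paragraph above.

```python
import hashlib
import hmac
import logging
import os
import secrets

logger = logging.getLogger(__name__)


def resolve_hmac_key(key=None):
    """Explicit parameter first, then AUDIT_HMAC_KEY, then a random fallback."""
    if key is not None:
        return key
    env_value = os.environ.get("AUDIT_HMAC_KEY")
    if env_value:
        return env_value.encode("utf-8")
    logger.warning("AUDIT_HMAC_KEY not set; using a random per-process key")
    return secrets.token_bytes(32)


def hash_ip(ip, key):
    """Keyed hash of the client IP; enumeration is useless without the key."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

One consequence of the fallback worth keeping in mind: hashes produced under a per-process key cannot be correlated across restarts.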
+## Why three output validators, not four
+The original design included a "length/format sanity check" (reject suspiciously short responses or raw JSON in natural-language context). Dropped because the calculator tool returns short numeric answers and the tech docs domain legitimately contains code blocks and JSON examples. Every false positive erodes trust in the validation layer. The three remaining checks — PII leakage, URL hallucination, blocklist — are deterministic with clear pass/fail semantics.
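To make the "deterministic with clear pass/fail semantics" point concrete, here is one way the URL hallucination check could work: any URL in the answer that does not appear in a retrieved chunk is flagged. The regex and the function name are assumptions, not the project's implementation.

```python
import re

# Deliberately simple URL matcher; stops at whitespace and common brackets.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def _urls(text):
    # Strip trailing sentence punctuation that the regex may have swallowed.
    return [u.rstrip(".,;") for u in URL_RE.findall(text)]


def find_hallucinated_urls(answer, chunks):
    """Return URLs in the answer that appear in none of the retrieved chunks."""
    allowed = set()
    for chunk in chunks:
        allowed.update(_urls(chunk))
    return [u for u in _urls(answer) if u not in allowed]
```

An empty return value is a pass; anything else is a fail, with no score or threshold involved.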
+## Why buffer-then-validate for streaming output
+The `/ask/stream` endpoint buffers all events from the orchestrator before sending to the client, then validates the assembled answer. This means the client waits for the full answer before receiving any content chunks. The orchestrator emits the final synthesis as a single chunk (tool-use iterations are not streamed), so the buffering adds no perceptible latency. The alternative — streaming chunks immediately and appending a safety marker — leaks unsafe content to any client that stops reading after the `done` event.
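The buffering strategy can be sketched as an async generator that drains the orchestrator first and only then decides what to emit. The event shape (`type`, `text`) and the names `buffered_stream` and `validate` are hypothetical; only the `done` event name comes from the text above.

```python
import asyncio
from typing import AsyncIterator, Callable, Dict, List


async def buffered_stream(
    events: AsyncIterator[Dict],
    validate: Callable[[str], bool],
) -> AsyncIterator[Dict]:
    # Drain the orchestrator completely before anything reaches the client.
    buffered: List[Dict] = [event async for event in events]
    answer = "".join(e.get("text", "") for e in buffered if e.get("type") == "chunk")
    if not validate(answer):
        # Nothing unsafe was sent, so a clean refusal is still possible.
        yield {"type": "error", "text": "Response withheld by output validation."}
        yield {"type": "done"}
        return
    for event in buffered:
        yield event
```

A quick check with a fake orchestrator, assuming the same event shape:

```python
async def fake_orchestrator():
    yield {"type": "chunk", "text": "hello"}
    yield {"type": "done"}


async def collect(validate):
    return [e async for e in buffered_stream(fake_orchestrator(), validate)]


print(asyncio.run(collect(lambda answer: True)))
```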

README.md CHANGED

@@ -134,12 +134,74 @@ flowchart LR
     end
 ```
 ## Engineering Scope
 - **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
 - **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
 - **Infrastructure:** Kubernetes (Helm), Terraform (GCP/GKE), self-hosted LLM serving (vLLM on Modal + Docker Compose)
 - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
 - **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 205 deterministic tests with mock providers
 <details><summary>API Reference</summary>
@@ -211,17 +273,17 @@ All tests use MockProvider + MockEmbeddingModel. No API keys. No model downloads
 See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
-### V1 → V2 Evolution
-| Feature | V1 | V2 |
-|---------|----|----|
-| Grounded refusal | 0/5 | Threshold gate |
-| Retrieval P@5 | 0.70 | 0.74 (cross-encoder reranking) |
-| Provider support | OpenAI only | OpenAI + Anthropic + self-hosted vLLM |
-| Provider resilience | None | Retry + backoff |
-| Rate limiting | None | 10 RPM per IP |
-| Streaming | None | SSE (`/ask/stream`) |
-| Conversation memory | Stateless | SQLite sessions |
-| Infrastructure | Local only | Docker, K8s (Helm), Terraform (GKE), Modal |
-| CI/CD | None | GitHub Actions |
-| Tests | 97 | 205 |

     end
 ```
+## Security Architecture
+Defense-in-depth pipeline with four guardrails. Each stage is independently configurable and degrades gracefully.
+```
+User Input
+    │
+    ▼
+┌──────────────────────┐
+│  Injection Detection  │  Tier 1: heuristic regex (local, <1ms)
+│  (pre-retrieval)      │  Tier 2: DeBERTa classifier (Modal GPU)
+└──────────┬───────────┘
+           │ safe
+           ▼
+┌──────────────────────┐
+│  Retrieval            │  FAISS + BM25 + RRF + cross-encoder
+│  (existing pipeline)  │
+└──────────┬───────────┘
+           │
+           ▼
+┌──────────────────────┐
+│  PII Redaction        │  regex (always) + spaCy NER (optional)
+│  (post-retrieval)     │
+└──────────┬───────────┘
+           │
+           ▼
+┌──────────────────────┐
+│  LLM Generation       │  OpenAI / Anthropic / vLLM (Modal)
+│  (existing pipeline)  │
+└──────────┬───────────┘
+           │
+           ▼
+┌──────────────────────┐
+│  Output Validation    │  PII leakage + URL check + blocklist
+│  (post-generation)    │
+└──────────┬───────────┘
+           │
+           ▼
+┌──────────────────────┐
+│  Audit Log            │  JSONL, IP-hashed, rotated
+│  (every request)      │
+└──────────┬───────────┘
+           │
+           ▼
+       Response
+```
+**Injection detection** uses a two-tier architecture: heuristic regex rules catch common patterns (<1ms), and an optional DeBERTa classifier on Modal GPU provides high-confidence classification. Without GPU, the system runs heuristic-only — honest degradation, not silent failure.
+**PII redaction** runs regex patterns for high-risk types (SSN, credit card, email, phone, IP address) on every retrieved chunk before it enters the LLM context window. Optional spaCy NER adds PERSON/ORG detection for deployments that need it.
+**Output validation** catches PII leakage (LLM reconstructing redacted data), URL hallucination (URLs not in retrieved chunks), and blocklisted patterns (system prompt fragments, API keys).
+**Audit logging** writes one structured JSON record per request to an append-only JSONL file with HMAC-SHA256 hashed IPs, injection verdicts, PII redaction counts, and output validation results.
+```bash
+# Query the audit log with jq
+jq 'select(.injection_verdict.safe == false)' logs/audit.jsonl
+jq 'select(.session_id == "abc123")' logs/audit.jsonl
+```
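The stage ordering in the diagram above can be expressed as a small composition function. This is a sketch under stated assumptions: every stage is passed in as a plain callable, and the names `run_pipeline`, `REFUSAL`, `WITHHELD`, and the audit record fields are hypothetical rather than taken from the codebase.

```python
REFUSAL = "Request blocked by input guardrail."
WITHHELD = "Response withheld by output validation."


def run_pipeline(question, *, detect, retrieve, redact, generate, validate, audit):
    """Run the guardrails in diagram order; audit exactly once per request."""
    record = {"question_len": len(question)}
    # 1. Injection detection happens before any retrieval work.
    record["injection_safe"] = detect(question)
    if not record["injection_safe"]:
        audit(record)
        return REFUSAL
    # 2-3. Retrieve, then redact each chunk before it enters the LLM context.
    chunks = [redact(chunk) for chunk in retrieve(question)]
    # 4. Generate from redacted context only.
    answer = generate(question, chunks)
    # 5. Validate the output against the same redacted chunks.
    record["output_valid"] = validate(answer, chunks)
    # 6. Audit every request, whether or not it passed.
    audit(record)
    return answer if record["output_valid"] else WITHHELD
```

Passing stages as callables keeps each guardrail independently swappable, which matches the "independently configurable" framing at the top of the section.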
 ## Engineering Scope
 - **Agent design & evaluation**: Built two independent orchestration approaches (custom tool-calling loop + LangChain AgentExecutor) and evaluated both on identical metrics to quantify framework tradeoffs
 - **Retrieval engineering**: Hybrid FAISS + BM25 with Reciprocal Rank Fusion, cross-encoder reranking, evaluated across 27 questions with P@5, R@5, citation accuracy
 - **Infrastructure:** Kubernetes (Helm), Terraform (GCP/GKE), self-hosted LLM serving (vLLM on Modal + Docker Compose)
 - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
+- **Security engineering**: Prompt injection detection (heuristic + ML classifier), PII redaction, output validation, structured audit logging with HMAC-SHA256 IP pseudonymization
 - **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 205 deterministic tests with mock providers
 <details><summary>API Reference</summary>
 See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF over score normalization, negative evaluation cases, deterministic eval + optional LLM judge, and more.
+### V1 → V2 → V3 Evolution
+| Feature | V1 | V2 | V3 |
+|---------|----|----|-----|
+| Grounded refusal | 0/5 | Threshold gate | Threshold gate |
+| Retrieval P@5 | 0.70 | 0.74 (cross-encoder) | 0.74 |
+| Provider support | OpenAI only | OpenAI + Anthropic + vLLM | Same |
+| Streaming | None | SSE (`/ask/stream`) | SSE |
+| Infrastructure | Local only | Docker, K8s, Terraform, Modal | Same |
+| **Injection detection** | None | None | Two-tier (heuristic + DeBERTa) |
+| **PII redaction** | None | None | Regex + optional NER |
+| **Output validation** | None | None | PII leakage + URL + blocklist |
+| **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
+| Tests | 97 | 205 | 288+ |