Nomearod and Claude Opus 4.7 (1M context) committed 26 days ago

Commit 0e96cb9 (1 parent: fcfd067)

docs(readme): correct test count 444 → 443

Reconcile README test-count claim with actual `pytest --collect-only`
output (443 tests). Updates the four occurrences in the badge line,
production-engineering bullet, Testing section, and the comparison
table footer row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
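
The reconciliation described in the commit message could be automated so the README cannot drift again. The following is a minimal sketch, not part of the repository: `parse_collected` and `stale_counts` are hypothetical helpers, the claim pattern assumes counts are phrased as "N tests" or "N deterministic tests" (a bare table cell such as the comparison table's `Tests` row would need its own pattern), and the "N tests collected" summary line is assumed from recent pytest versions.

```python
import re

# Assumed phrasing of README claims: "443 tests" or "443 deterministic tests".
CLAIM_RE = re.compile(r"\b(\d+)\s+(?:deterministic\s+)?tests\b")

# Assumed summary line printed by `pytest --collect-only -q`.
COLLECTED_RE = re.compile(r"(\d+) tests? collected")


def parse_collected(pytest_output: str):
    """Return the collected-test count from pytest output, or None if absent."""
    match = COLLECTED_RE.search(pytest_output)
    return int(match.group(1)) if match else None


def stale_counts(readme_text: str, actual: int):
    """Return (line_number, claimed_count) for each claim that differs from actual."""
    stale = []
    for lineno, line in enumerate(readme_text.splitlines(), start=1):
        for match in CLAIM_RE.finditer(line):
            claimed = int(match.group(1))
            if claimed != actual:
                stale.append((lineno, claimed))
    return stale
```

In a CI step, the output of `pytest --collect-only -q` could be passed to `parse_collected`, and the job could fail whenever `stale_counts` returns a non-empty list for `README.md`.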

Files changed (1)

README.md +4 -4

@@ -6,7 +6,7 @@
 Agentic knowledge retrieval system with evaluation benchmark. Custom orchestration pipeline + LangChain baseline, evaluated on matched golden datasets across 3 providers (OpenAI, Anthropic, self-hosted vLLM on Modal) and two corpora (FastAPI + Kubernetes). Zero hallucinated citations on all API provider configurations. The separate self-hosted Mistral-7B benchmark is included to show the practical model-size floor where agentic retrieval starts to break down.
-`444 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
+`443 tests` · `3 providers` · `2 corpora` · `LangChain comparison` · `K8s + Terraform` · `CI`
 ## Benchmark Results
@@ -240,7 +240,7 @@ security:
 - **MLOps:** Provider comparison benchmark (API vs self-hosted, real measured data)
 - **Security — detection & redaction**: Two-tier prompt injection detection (heuristic regex + DeBERTa classifier), PII redaction on retrieved context, output validation gate (PII leakage, URL hallucination, blocklist)
 - **Security — audit & compliance**: Append-only JSONL audit trail, HMAC-SHA256 IP hashing (GDPR-aligned), log rotation, config-driven security with Literal-constrained enums
-- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 444 deterministic tests with mock providers
+- **Production engineering**: FastAPI, Docker, CI/CD, structured logging, rate limiting, SSE streaming, conversation sessions, 443 deterministic tests with mock providers
 <details><summary>API Reference</summary>
@@ -302,7 +302,7 @@ The golden dataset contains 27 hand-crafted FastAPI questions (19 retrieval · 3
 ## Testing
 ```bash
-make test    # 444 deterministic tests, no API keys needed
+make test    # 443 deterministic tests, no API keys needed
 make lint    # ruff + mypy
 ```
@@ -325,4 +325,4 @@ See [DECISIONS.md](DECISIONS.md) for rationale on building from primitives, RRF
 | **PII redaction** | None | None | Regex + optional NER |
 | **Output validation** | None | None | PII leakage + URL + blocklist |
 | **Audit logging** | None | None | JSONL, HMAC-hashed IPs |
-| Tests | 97 | 205 | 288 |
+| Tests | 97 | 205 | 443 |