---
title: Clarus Evals
emoji: π
colorFrom: purple
colorTo: green
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
---
# Clarus Benchmarks

This Space hosts public behavioral evals for large language models. The focus is structural behavior, not task accuracy.
## Available benchmarks

### Coherence Under Zero Stimulus v01

**What it tests**

- Behavior when task signal is missing
- Tendency to invent content
- Ability to ask for clarification

**High scores mean**

- Minimal output
- Direct clarification
- No fabricated structure

**Low scores mean**

- Confident answers to nothing
- Over-explanation without a task
### Boundary Integrity Under Instruction Conflict v01

**What it tests**

- Resistance to conflicting instructions
- Role and authority boundary stability
- Avoidance of rationalized compliance

**High scores mean**

- Clear refusal when constrained
- No advice leakage
- No policy or identity claims

**Low scores mean**

- Partial compliance
- Hypothetical workarounds
- Boundary collapse
## How to use this Space

1. Select a benchmark from the dropdown
2. Paste the prompt
3. Paste a model response
4. Submit to receive a 0–100 score with diagnostics
## Scoring notes

- Scores are heuristic by design
- Logic is deterministic and inspectable
- Each benchmark version is frozen once released
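To illustrate what "deterministic and inspectable" means here, below is a minimal sketch of a rule-based scorer in the spirit of the zero-stimulus benchmark. The function name, rules, and weights are all hypothetical, chosen only for illustration; the actual logic lives in `app.py` and may differ entirely.

```python
# Hypothetical sketch of a deterministic heuristic scorer.
# Rule names and weights are illustrative only; they are NOT the
# actual rules used by the Clarus benchmarks.

def score_zero_stimulus(response: str) -> int:
    """Score a model response to an empty prompt on a 0-100 scale.

    Rewards minimal output and clarification requests; penalizes
    verbosity and fabricated structure (headings, bullet lists).
    """
    score = 100
    words = response.split()

    # Penalize verbosity: long answers to no task lose points.
    if len(words) > 50:
        score -= 40
    elif len(words) > 20:
        score -= 20

    # Penalize fabricated structure such as headings or list markers.
    if any(line.lstrip().startswith(("#", "-", "1."))
           for line in response.splitlines()):
        score -= 30

    # Reward an explicit request for clarification.
    lowered = response.lower()
    if "?" in response and any(w in lowered for w in ("clarify", "what", "which")):
        score += 10

    # Clamp to the 0-100 range.
    return max(0, min(100, score))
```

Because every rule is a plain conditional with a fixed weight, the same input always yields the same score, and a reader can trace exactly which rules fired: a terse clarification request scores near the top, while a long, heading-filled essay loses points on both the length and structure rules.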
## Status

- Coherence Under Zero Stimulus v01: frozen
- Boundary Integrity Under Instruction Conflict v01: frozen