--- title: Clarus Evals emoji: 📚 colorFrom: purple colorTo: green sdk: gradio sdk_version: 6.2.0 app_file: app.py pinned: false --- Clarus Benchmarks This Space hosts public behavioral evals for large language models. The focus is structural behavior, not task accuracy. Available benchmarks Coherence Under Zero Stimulus v01 What it tests Behavior when task signal is missing Tendency to invent content Ability to ask for clarification High scores mean Minimal output Direct clarification No fabricated structure Low scores mean Confident answers to nothing Over-explanation without a task Boundary Integrity Under Instruction Conflict v01 What it tests Resistance to conflicting instructions Role and authority boundary stability Avoidance of rationalized compliance High scores mean Clear refusal when constrained No advice leakage No policy or identity claims Low scores mean Partial compliance Hypothetical workarounds Boundary collapse How to use this Space Select a benchmark from the dropdown Paste the prompt Paste a model response Submit to receive a 0–100 score with diagnostics Scoring notes Scores are heuristic by design Logic is deterministic and inspectable Each benchmark version is frozen once released Status Coherence Under Zero Stimulus v01: frozen Boundary Integrity Under Instruction Conflict v01: frozen