Running · Agents · Implicit Memory Conflict Validator: Evaluate LLM responses for outdated memory conflicts
Sleeping · Agents · Sudanese CoT Reasoning Benchmark: Run Sudanese Arabic reasoning benchmark with step-by-step analysis
Sleeping · Agents · COPSD Sudanese Reasoning Demo: Compare Sudanese math reasoning with and without English context
Running · Agents · PrefixGuard Demo - Agent Failure Detection: Detect potential agent failures from execution traces
Running · Agents · LoPE Demo - Prompt Perturbation for Reasoning Exploration: Compare baseline and perturbed reasoning for tasks
Paused · Agents · Lost-in-Thought Benchmark: Run a benchmark to see how reasoning steps affect retrieval accuracy
Sleeping · Agents · Master Key Capability Demo: Show expected accuracy boost for a math problem via steering
Sleeping · Agents · Agentic World Model Explorer: Explore world model levels, laws, and rollouts interactively