| <!DOCTYPE html>
|
| <html lang="en">
|
| <head>
|
| <meta charset="UTF-8">
|
| <meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| <title>RecallTrace OpenEnv</title>
|
| <link rel="preconnect" href="https://fonts.googleapis.com">
|
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
| <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet">
|
| <link rel="stylesheet" href="/static/styles.css?v=4">
|
| </head>
|
| <body>
|
| <div class="page-shell">
|
| <header class="hero">
|
| <div class="hero-copy">
|
| <span class="eyebrow">Safety-Critical OpenEnv Benchmark</span>
|
| <h1>RecallTrace OpenEnv</h1>
|
| <p class="hero-text">
|
| A real-world supply-chain recall benchmark where agents must trace contaminated lots,
|
| follow relabeled inventory lineage, inspect evidence, and quarantine only the unsafe stock.
|
| </p>
|
| <div class="badge-row">
|
| <span class="badge">OpenEnv compliant</span>
|
| <span class="badge">Deterministic grading</span>
|
| <span class="badge">3 escalating tasks</span>
|
| <span class="badge">Precision containment</span>
|
| </div>
|
| </div>
|
| <div class="hero-panel">
|
| <div class="metric-card">
|
| <span class="metric-label">Average baseline</span>
|
| <strong id="metric-average">0.9677</strong>
|
| </div>
|
| <div class="metric-card">
|
| <span class="metric-label">Hard task focus</span>
|
| <strong>Mixed safe/unsafe inventory</strong>
|
| </div>
|
| <div class="metric-card">
|
| <span class="metric-label">Judging edge</span>
|
| <strong>Operational realism over toy mechanics</strong>
|
| </div>
|
| </div>
|
| </header>
|
|
|
| <main class="dashboard-grid">
|
| <section class="panel panel-accent">
|
| <div class="panel-header">
|
| <h2>Task Runner</h2>
|
| <p>Choose a task and run the deterministic baseline to inspect the full trajectory.</p>
|
| </div>
|
| <div class="controls">
|
| <label class="field">
|
| <span>Task level</span>
|
| <select id="task-select"></select>
|
| </label>
|
| <div class="button-row">
|
| <button id="reset-button" type="button" class="button button-secondary">Reset Task</button>
|
| <button id="run-button" type="button" class="button button-primary">Run Episode</button>
|
| <button id="run-all-button" type="button" class="button button-ghost">Run All Tasks</button>
|
| </div>
|
| </div>
|
| <div id="task-summary" class="task-summary"></div>
|
| </section>
|
|
|
| <section class="panel">
|
| <div class="panel-header">
|
| <h2>Scoreboard</h2>
|
| <p>Live summary of the current task and the multi-task baseline run.</p>
|
| </div>
|
| <div class="score-grid">
|
| <div class="score-card">
|
| <span>Current score</span>
|
| <strong id="current-score">-</strong>
|
| </div>
|
| <div class="score-card">
|
| <span>Steps taken</span>
|
| <strong id="current-steps">-</strong>
|
| </div>
|
| <div class="score-card">
|
| <span>Status</span>
|
| <strong id="current-status">Ready</strong>
|
| </div>
|
| <div class="score-card">
|
| <span>Average over all tasks</span>
|
| <strong id="all-score">-</strong>
|
| </div>
|
| </div>
|
| <div id="all-results" class="all-results empty-state">Run all tasks to compare easy, medium, and hard performance.</div>
|
| </section>
|
|
|
| <section class="panel panel-wide">
|
| <div class="panel-header">
|
| <h2>Episode Output</h2>
|
| <p>Reward curve, readable action summaries, and final grading highlights for the baseline trajectory.</p>
|
| </div>
|
| <div class="episode-layout">
|
| <div class="episode-visuals">
|
| <div class="mini-panel">
|
| <h3>Reward Curve</h3>
|
| <div id="reward-chart" class="reward-chart empty-state">Run a task to render the reward trajectory.</div>
|
| </div>
|
| <div class="mini-panel">
|
| <h3>Final Outcome</h3>
|
| <div id="final-summary" class="final-summary empty-state">Readable scoring highlights will appear here.</div>
|
| </div>
|
| </div>
|
| <div id="episode-log" class="episode-log empty-state">Run a task to populate the episode trajectory.</div>
|
| </div>
|
| </section>
|
|
|
| <section class="panel">
|
| <div class="panel-header">
|
| <h2>Judge Lens</h2>
|
| </div>
|
| <div class="highlight-stack">
|
| <div class="highlight-card">
|
| <span class="highlight-title">Real-world utility</span>
|
| <p>Models a safety-critical recall workflow that QA, operations, and supply-chain teams actually perform.</p>
|
| </div>
|
| <div class="highlight-card">
|
| <span class="highlight-title">Frontier challenge</span>
|
| <p>The hard task forces precision containment of mixed safe and unsafe stock under partial observability.</p>
|
| </div>
|
| <div class="highlight-card">
|
| <span class="highlight-title">Benchmark quality</span>
|
| <p>Deterministic graders evaluate precision, coverage, investigation depth, and efficiency with reproducible scores.</p>
|
| </div>
|
| </div>
|
| </section>
|
|
|
| <section class="panel">
|
| <div class="panel-header">
|
| <h2>Project Hub</h2>
|
| </div>
|
| <div class="link-list">
|
| <a href="/health" target="_blank" rel="noreferrer">Health endpoint</a>
|
| <a href="/reset" target="_blank" rel="noreferrer">Reset endpoint</a>
|
| <a href="/tasks" target="_blank" rel="noreferrer">Task catalog JSON</a>
|
| <a href="https://github.com/MS-Shamanth/recalltrace-openenv/tree/sham" target="_blank" rel="noreferrer">GitHub source</a>
|
| <a href="https://huggingface.co/spaces/ms-shamanth/recalltrace-openenv/tree/main" target="_blank" rel="noreferrer">Space files</a>
|
| <a href="https://www.docker.com/" target="_blank" rel="noreferrer">Docker runtime</a>
|
| <a href="https://github.com/openenvai/openenv" target="_blank" rel="noreferrer">OpenEnv ecosystem</a>
|
| </div>
|
| </section>
|
| </main>
|
| </div>
|
| <script src="/static/app.js?v=4"></script>
|
| </body>
|
| </html>
|
|
|