<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>RecallTrace OpenEnv</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet">
<link rel="stylesheet" href="/static/styles.css?v=4">
</head>
<body>
<div class="page-shell">
<header class="hero">
<div class="hero-copy">
<span class="eyebrow">Safety-Critical OpenEnv Benchmark</span>
<h1>RecallTrace OpenEnv</h1>
<p class="hero-text">
A real-world supply-chain recall benchmark where agents must trace contaminated lots,
follow relabeled inventory lineage, inspect evidence, and quarantine only the unsafe stock.
</p>
<div class="badge-row">
<span class="badge">OpenEnv compliant</span>
<span class="badge">Deterministic grading</span>
<span class="badge">3 escalating tasks</span>
<span class="badge">Precision containment</span>
</div>
</div>
<div class="hero-panel">
<div class="metric-card">
<span class="metric-label">Average baseline</span>
<strong id="metric-average">0.9677</strong>
</div>
<div class="metric-card">
<span class="metric-label">Hard task focus</span>
<strong>Mixed safe/unsafe inventory</strong>
</div>
<div class="metric-card">
<span class="metric-label">Judging edge</span>
<strong>Operational realism over toy mechanics</strong>
</div>
</div>
</header>
<main class="dashboard-grid">
<section class="panel panel-accent">
<div class="panel-header">
<h2>Task Runner</h2>
<p>Choose a task and run the deterministic baseline to inspect the full trajectory.</p>
</div>
<div class="controls">
<label class="field">
<span>Task level</span>
<select id="task-select"></select>
</label>
<div class="button-row">
<button id="reset-button" type="button" class="button button-secondary">Reset Task</button>
<button id="run-button" type="button" class="button button-primary">Run Episode</button>
<button id="run-all-button" type="button" class="button button-ghost">Run All Tasks</button>
</div>
</div>
<div id="task-summary" class="task-summary"></div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Scoreboard</h2>
<p>Live summary of the current task and the multi-task baseline run.</p>
</div>
<div class="score-grid">
<div class="score-card">
<span>Current score</span>
<strong id="current-score">-</strong>
</div>
<div class="score-card">
<span>Steps taken</span>
<strong id="current-steps">-</strong>
</div>
<div class="score-card">
<span>Status</span>
<strong id="current-status">Ready</strong>
</div>
<div class="score-card">
<span>Average over all tasks</span>
<strong id="all-score">-</strong>
</div>
</div>
<div id="all-results" class="all-results empty-state">Run all tasks to compare easy, medium, and hard performance.</div>
</section>
<section class="panel panel-wide">
<div class="panel-header">
<h2>Episode Output</h2>
<p>Reward trajectory for the baseline run, readable action summaries, and final grading highlights.</p>
</div>
<div class="episode-layout">
<div class="episode-visuals">
<div class="mini-panel">
<h3>Reward Curve</h3>
<div id="reward-chart" class="reward-chart empty-state">Run a task to render the reward trajectory.</div>
</div>
<div class="mini-panel">
<h3>Final Outcome</h3>
<div id="final-summary" class="final-summary empty-state">Readable scoring highlights will appear here.</div>
</div>
</div>
<div id="episode-log" class="episode-log empty-state">Run a task to populate the episode trajectory.</div>
</div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Judge Lens</h2>
</div>
<div class="highlight-stack">
<div class="highlight-card">
<span class="highlight-title">Real-world utility</span>
<p>Models a safety-critical recall workflow that QA, operations, and supply-chain teams actually perform.</p>
</div>
<div class="highlight-card">
<span class="highlight-title">Frontier challenge</span>
<p>The hard task forces precision containment of mixed safe and unsafe stock under partial observability.</p>
</div>
<div class="highlight-card">
<span class="highlight-title">Benchmark quality</span>
<p>Deterministic graders evaluate precision, coverage, investigation depth, and efficiency with reproducible scores.</p>
</div>
</div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Project Hub</h2>
</div>
<div class="link-list">
<a href="/health" target="_blank" rel="noreferrer">Health endpoint</a>
<a href="/reset" target="_blank" rel="noreferrer">Reset endpoint</a>
<a href="/tasks" target="_blank" rel="noreferrer">Task catalog JSON</a>
<a href="https://github.com/MS-Shamanth/recalltrace-openenv/tree/sham" target="_blank" rel="noreferrer">GitHub source</a>
<a href="https://huggingface.co/spaces/ms-shamanth/recalltrace-openenv/tree/main" target="_blank" rel="noreferrer">Space files</a>
<a href="https://www.docker.com/" target="_blank" rel="noreferrer">Docker runtime</a>
<a href="https://github.com/openenvai/openenv" target="_blank" rel="noreferrer">OpenEnv ecosystem</a>
</div>
</section>
</main>
</div>
<script src="/static/app.js?v=4"></script>
</body>
</html>