<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>RecallTrace OpenEnv</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet">
<link rel="stylesheet" href="/static/styles.css?v=4">
</head>
<body>
<div class="page-shell">
<header class="hero">
<div class="hero-copy">
<span class="eyebrow">Safety-Critical OpenEnv Benchmark</span>
<h1>RecallTrace OpenEnv</h1>
<p class="hero-text">
A real-world supply-chain recall benchmark where agents must trace contaminated lots,
follow relabeled inventory lineage, inspect evidence, and quarantine only the unsafe stock.
</p>
<div class="badge-row">
<span class="badge">OpenEnv compliant</span>
<span class="badge">Deterministic grading</span>
<span class="badge">3 escalating tasks</span>
<span class="badge">Precision containment</span>
</div>
</div>
<div class="hero-panel">
<div class="metric-card">
<span class="metric-label">Average baseline</span>
<strong id="metric-average">0.9677</strong>
</div>
<div class="metric-card">
<span class="metric-label">Hard task focus</span>
<strong>Mixed safe/unsafe inventory</strong>
</div>
<div class="metric-card">
<span class="metric-label">Judging edge</span>
<strong>Operational realism over toy mechanics</strong>
</div>
</div>
</header>
<main class="dashboard-grid">
<section class="panel panel-accent">
<div class="panel-header">
<h2>Task Runner</h2>
<p>Choose a task and run the deterministic baseline to inspect the full trajectory.</p>
</div>
<div class="controls">
<label class="field">
<span>Task level</span>
<select id="task-select"></select>
</label>
<div class="button-row">
<button id="reset-button" class="button button-secondary">Reset Task</button>
<button id="run-button" class="button button-primary">Run Episode</button>
<button id="run-all-button" class="button button-ghost">Run All Tasks</button>
</div>
</div>
<div id="task-summary" class="task-summary"></div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Scoreboard</h2>
<p>Live summary of the current task and the multi-task baseline run.</p>
</div>
<div class="score-grid">
<div class="score-card">
<span>Current score</span>
<strong id="current-score">-</strong>
</div>
<div class="score-card">
<span>Steps taken</span>
<strong id="current-steps">-</strong>
</div>
<div class="score-card">
<span>Status</span>
<strong id="current-status">Ready</strong>
</div>
<div class="score-card">
<span>Average over all tasks</span>
<strong id="all-score">-</strong>
</div>
</div>
<div id="all-results" class="all-results empty-state">Run all tasks to compare easy, medium, and hard performance.</div>
</section>
<section class="panel panel-wide">
<div class="panel-header">
<h2>Episode Output</h2>
<p>Visual baseline trajectory, readable action summaries, and final grading highlights.</p>
</div>
<div class="episode-layout">
<div class="episode-visuals">
<div class="mini-panel">
<h3>Reward Curve</h3>
<div id="reward-chart" class="reward-chart empty-state">Run a task to render the reward trajectory.</div>
</div>
<div class="mini-panel">
<h3>Final Outcome</h3>
<div id="final-summary" class="final-summary empty-state">Readable scoring highlights will appear here.</div>
</div>
</div>
<div id="episode-log" class="episode-log empty-state">Run a task to populate the episode trajectory.</div>
</div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Judge Lens</h2>
</div>
<div class="highlight-stack">
<div class="highlight-card">
<span class="highlight-title">Real-world utility</span>
<p>Models a safety-critical recall workflow that QA, operations, and supply-chain teams actually perform.</p>
</div>
<div class="highlight-card">
<span class="highlight-title">Frontier challenge</span>
<p>The hard task forces precision containment of mixed safe and unsafe stock under partial observability.</p>
</div>
<div class="highlight-card">
<span class="highlight-title">Benchmark quality</span>
<p>Deterministic graders evaluate precision, coverage, investigation depth, and efficiency with reproducible scores.</p>
</div>
</div>
</section>
<section class="panel">
<div class="panel-header">
<h2>Project Hub</h2>
</div>
<div class="link-list">
<a href="/health" target="_blank" rel="noreferrer">Health endpoint</a>
<a href="/reset" target="_blank" rel="noreferrer">Reset endpoint</a>
<a href="/tasks" target="_blank" rel="noreferrer">Task catalog JSON</a>
<a href="https://github.com/MS-Shamanth/recalltrace-openenv/tree/sham" target="_blank" rel="noreferrer">GitHub source</a>
<a href="https://huggingface.co/spaces/ms-shamanth/recalltrace-openenv/tree/main" target="_blank" rel="noreferrer">Space files</a>
<a href="https://www.docker.com/" target="_blank" rel="noreferrer">Docker runtime</a>
<a href="https://github.com/openenvai/openenv" target="_blank" rel="noreferrer">OpenEnv ecosystem</a>
</div>
</section>
</main>
</div>
<script src="/static/app.js?v=4"></script>
</body>
</html>