| <!DOCTYPE html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8"> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0"> | |
| <title>RecallTrace OpenEnv</title> | |
| <link rel="preconnect" href="https://fonts.googleapis.com"> | |
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin> | |
| <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet"> | |
| <link rel="stylesheet" href="/static/styles.css?v=4"> | |
| </head> | |
| <body> | |
| <div class="page-shell"> | |
| <header class="hero"> | |
| <div class="hero-copy"> | |
| <span class="eyebrow">Safety-Critical OpenEnv Benchmark</span> | |
| <h1>RecallTrace OpenEnv</h1> | |
| <p class="hero-text"> | |
| A real-world supply-chain recall benchmark where agents must trace contaminated lots, | |
| follow relabeled inventory lineage, inspect evidence, and quarantine only the unsafe stock. | |
| </p> | |
| <div class="badge-row"> | |
| <span class="badge">OpenEnv compliant</span> | |
| <span class="badge">Deterministic grading</span> | |
| <span class="badge">3 escalating tasks</span> | |
| <span class="badge">Precision containment</span> | |
| </div> | |
| </div> | |
| <div class="hero-panel"> | |
| <div class="metric-card"> | |
| <span class="metric-label">Average baseline</span> | |
| <strong id="metric-average">0.9677</strong> | |
| </div> | |
| <div class="metric-card"> | |
| <span class="metric-label">Hard task focus</span> | |
| <strong>Mixed safe/unsafe inventory</strong> | |
| </div> | |
| <div class="metric-card"> | |
| <span class="metric-label">Judging edge</span> | |
| <strong>Operational realism over toy mechanics</strong> | |
| </div> | |
| </div> | |
| </header> | |
| <main class="dashboard-grid"> | |
| <section class="panel panel-accent"> | |
| <div class="panel-header"> | |
| <h2>Task Runner</h2> | |
| <p>Choose a task and run the deterministic baseline to inspect the full trajectory.</p> | |
| </div> | |
| <div class="controls"> | |
| <label class="field"> | |
| <span>Task level</span> | |
| <select id="task-select"></select> | |
| </label> | |
| <div class="button-row"> | |
| <button id="reset-button" class="button button-secondary">Reset Task</button> | |
| <button id="run-button" class="button button-primary">Run Episode</button> | |
| <button id="run-all-button" class="button button-ghost">Run All Tasks</button> | |
| </div> | |
| </div> | |
| <div id="task-summary" class="task-summary"></div> | |
| </section> | |
| <section class="panel"> | |
| <div class="panel-header"> | |
| <h2>Scoreboard</h2> | |
| <p>Live summary of the current task and the multi-task baseline run.</p> | |
| </div> | |
| <div class="score-grid"> | |
| <div class="score-card"> | |
| <span>Current score</span> | |
| <strong id="current-score">-</strong> | |
| </div> | |
| <div class="score-card"> | |
| <span>Steps taken</span> | |
| <strong id="current-steps">-</strong> | |
| </div> | |
| <div class="score-card"> | |
| <span>Status</span> | |
| <strong id="current-status">Ready</strong> | |
| </div> | |
| <div class="score-card"> | |
| <span>Average over all tasks</span> | |
| <strong id="all-score">-</strong> | |
| </div> | |
| </div> | |
| <div id="all-results" class="all-results empty-state">Run all tasks to compare easy, medium, and hard performance.</div> | |
| </section> | |
| <section class="panel panel-wide"> | |
| <div class="panel-header"> | |
| <h2>Episode Output</h2> | |
| <p>Visual baseline trajectory, readable action summaries, and final grading highlights.</p> | |
| </div> | |
| <div class="episode-layout"> | |
| <div class="episode-visuals"> | |
| <div class="mini-panel"> | |
| <h3>Reward Curve</h3> | |
| <div id="reward-chart" class="reward-chart empty-state">Run a task to render the reward trajectory.</div> | |
| </div> | |
| <div class="mini-panel"> | |
| <h3>Final Outcome</h3> | |
| <div id="final-summary" class="final-summary empty-state">Readable scoring highlights will appear here.</div> | |
| </div> | |
| </div> | |
| <div id="episode-log" class="episode-log empty-state">Run a task to populate the episode trajectory.</div> | |
| </div> | |
| </section> | |
| <section class="panel"> | |
| <div class="panel-header"> | |
| <h2>Judge Lens</h2> | |
| </div> | |
| <div class="highlight-stack"> | |
| <div class="highlight-card"> | |
| <span class="highlight-title">Real-world utility</span> | |
| <p>Models a safety-critical recall workflow that QA, operations, and supply-chain teams actually perform.</p> | |
| </div> | |
| <div class="highlight-card"> | |
| <span class="highlight-title">Frontier challenge</span> | |
| <p>The hard task forces precision containment of mixed safe and unsafe stock under partial observability.</p> | |
| </div> | |
| <div class="highlight-card"> | |
| <span class="highlight-title">Benchmark quality</span> | |
| <p>Deterministic graders evaluate precision, coverage, investigation depth, and efficiency with reproducible scores.</p> | |
| </div> | |
| </div> | |
| </section> | |
| <section class="panel"> | |
| <div class="panel-header"> | |
| <h2>Project Hub</h2> | |
| </div> | |
| <div class="link-list"> | |
| <a href="/health" target="_blank" rel="noreferrer">Health endpoint</a> | |
| <a href="/reset" target="_blank" rel="noreferrer">Reset endpoint</a> | |
| <a href="/tasks" target="_blank" rel="noreferrer">Task catalog JSON</a> | |
| <a href="https://github.com/MS-Shamanth/recalltrace-openenv/tree/sham" target="_blank" rel="noreferrer">GitHub source</a> | |
| <a href="https://huggingface.co/spaces/ms-shamanth/recalltrace-openenv/tree/main" target="_blank" rel="noreferrer">Space files</a> | |
| <a href="https://www.docker.com/" target="_blank" rel="noreferrer">Docker runtime</a> | |
| <a href="https://github.com/openenvai/openenv" target="_blank" rel="noreferrer">OpenEnv ecosystem</a> | |
| </div> | |
| </section> | |
| </main> | |
| </div> | |
| <script src="/static/app.js?v=4"></script> | |
| </body> | |
| </html> | |