<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>RecallTrace OpenEnv</title>
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
  <link href="https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;700&family=IBM+Plex+Mono:wght@400;500&display=swap" rel="stylesheet">
  <link rel="stylesheet" href="/static/styles.css?v=4">
</head>
<body>
  <div class="page-shell">
    <header class="hero">
      <div class="hero-copy">
        <span class="eyebrow">Safety-Critical OpenEnv Benchmark</span>
        <h1>RecallTrace OpenEnv</h1>
        <p class="hero-text">
          A real-world supply-chain recall benchmark where agents must trace contaminated lots,
          follow relabeled inventory lineage, inspect evidence, and quarantine only the unsafe stock.
        </p>
        <div class="badge-row">
          <span class="badge">OpenEnv compliant</span>
          <span class="badge">Deterministic grading</span>
          <span class="badge">3 escalating tasks</span>
          <span class="badge">Precision containment</span>
        </div>
      </div>
      <div class="hero-panel">
        <div class="metric-card">
          <span class="metric-label">Average baseline</span>
          <strong id="metric-average">0.9677</strong>
        </div>
        <div class="metric-card">
          <span class="metric-label">Hard task focus</span>
          <strong>Mixed safe/unsafe inventory</strong>
        </div>
        <div class="metric-card">
          <span class="metric-label">Judging edge</span>
          <strong>Operational realism over toy mechanics</strong>
        </div>
      </div>
    </header>

    <main class="dashboard-grid">
      <section class="panel panel-accent">
        <div class="panel-header">
          <h2>Task Runner</h2>
          <p>Choose a task and run the deterministic baseline to inspect the full trajectory.</p>
        </div>
        <div class="controls">
          <label class="field">
            <span>Task level</span>
            <select id="task-select"></select>
          </label>
          <div class="button-row">
            <button id="reset-button" class="button button-secondary">Reset Task</button>
            <button id="run-button" class="button button-primary">Run Episode</button>
            <button id="run-all-button" class="button button-ghost">Run All Tasks</button>
          </div>
        </div>
        <div id="task-summary" class="task-summary"></div>
      </section>

      <section class="panel">
        <div class="panel-header">
          <h2>Scoreboard</h2>
          <p>Live summary of the current task and the multi-task baseline run.</p>
        </div>
        <div class="score-grid">
          <div class="score-card">
            <span>Current score</span>
            <strong id="current-score">-</strong>
          </div>
          <div class="score-card">
            <span>Steps taken</span>
            <strong id="current-steps">-</strong>
          </div>
          <div class="score-card">
            <span>Status</span>
            <strong id="current-status">Ready</strong>
          </div>
          <div class="score-card">
            <span>Average over all tasks</span>
            <strong id="all-score">-</strong>
          </div>
        </div>
        <div id="all-results" class="all-results empty-state">Run all tasks to compare easy, medium, and hard performance.</div>
      </section>

      <section class="panel panel-wide">
        <div class="panel-header">
          <h2>Episode Output</h2>
          <p>Visual baseline trajectory, readable action summaries, and final grading highlights.</p>
        </div>
        <div class="episode-layout">
          <div class="episode-visuals">
            <div class="mini-panel">
              <h3>Reward Curve</h3>
              <div id="reward-chart" class="reward-chart empty-state">Run a task to render the reward trajectory.</div>
            </div>
            <div class="mini-panel">
              <h3>Final Outcome</h3>
              <div id="final-summary" class="final-summary empty-state">Readable scoring highlights will appear here.</div>
            </div>
          </div>
          <div id="episode-log" class="episode-log empty-state">Run a task to populate the episode trajectory.</div>
        </div>
      </section>

      <section class="panel">
        <div class="panel-header">
          <h2>Judge Lens</h2>
        </div>
        <div class="highlight-stack">
          <div class="highlight-card">
            <span class="highlight-title">Real-world utility</span>
            <p>Models a safety-critical recall workflow that QA, operations, and supply-chain teams actually perform.</p>
          </div>
          <div class="highlight-card">
            <span class="highlight-title">Frontier challenge</span>
            <p>The hard task forces precision containment of mixed safe and unsafe stock under partial observability.</p>
          </div>
          <div class="highlight-card">
            <span class="highlight-title">Benchmark quality</span>
            <p>Deterministic graders evaluate precision, coverage, investigation depth, and efficiency with reproducible scores.</p>
          </div>
        </div>
      </section>

      <section class="panel">
        <div class="panel-header">
          <h2>Project Hub</h2>
        </div>
        <div class="link-list">
          <a href="/health" target="_blank" rel="noreferrer">Health endpoint</a>
          <a href="/reset" target="_blank" rel="noreferrer">Reset endpoint</a>
          <a href="/tasks" target="_blank" rel="noreferrer">Task catalog JSON</a>
          <a href="https://github.com/MS-Shamanth/recalltrace-openenv/tree/sham" target="_blank" rel="noreferrer">GitHub source</a>
          <a href="https://huggingface.co/spaces/ms-shamanth/recalltrace-openenv/tree/main" target="_blank" rel="noreferrer">Space files</a>
          <a href="https://www.docker.com/" target="_blank" rel="noreferrer">Docker runtime</a>
          <a href="https://github.com/openenvai/openenv" target="_blank" rel="noreferrer">OpenEnv ecosystem</a>
        </div>
      </section>
    </main>
  </div>
  <script src="/static/app.js?v=4"></script>
</body>
</html>