<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>AEGIS-Env — Model benchmark</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link
href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap"
rel="stylesheet"
/>
<script src="https://cdn.tailwindcss.com"></script>
<script>
tailwind.config = {
theme: {
extend: {
fontFamily: {
sans: ["Inter", "ui-sans-serif", "system-ui", "sans-serif"],
},
boxShadow: {
glow: "0 20px 60px rgba(99, 102, 241, 0.25)",
},
},
},
};
</script>
<style>
.glass {
background: rgba(255, 255, 255, 0.72);
backdrop-filter: blur(14px);
-webkit-backdrop-filter: blur(14px);
border: 1px solid rgba(255, 255, 255, 0.6);
}
.soft-grid {
background-image: radial-gradient(
rgba(99, 102, 241, 0.12) 1px,
transparent 1px
),
radial-gradient(rgba(236, 72, 153, 0.08) 1px, transparent 1px);
background-position: 0 0, 12px 12px;
background-size: 24px 24px;
}
</style>
</head>
<body class="min-h-screen bg-slate-50 text-slate-900 soft-grid">
<div
class="pointer-events-none fixed inset-x-0 top-0 h-80 bg-gradient-to-b from-indigo-200/60 via-fuchsia-200/30 to-transparent"
></div>
<div class="relative mx-auto max-w-7xl px-4 pb-12 pt-8 sm:px-6 lg:px-8">
<header class="flex flex-col gap-4 sm:flex-row sm:items-end sm:justify-between">
<div>
<p class="text-sm font-medium text-slate-600">
<a href="/web" class="text-indigo-700 hover:underline">← Playground</a>
</p>
<h1 class="mt-2 text-3xl font-semibold tracking-tight sm:text-4xl">
<span
class="text-transparent bg-clip-text bg-gradient-to-r from-indigo-600 via-fuchsia-600 to-sky-600"
>
Model benchmark
</span>
</h1>
<p class="mt-2 max-w-2xl text-sm leading-6 text-slate-600">
List models from an OpenAI-compatible endpoint (e.g.
<span class="font-mono">GET …/v1/models</span>), choose five models and a task
difficulty, then compare runs. Only the chat
<span class="font-semibold">model</span> name changes between episodes; prompts and
environment settings are identical.
</p>
</div>
</header>
<div id="error-banner" class="mt-6 hidden">
<div class="glass rounded-3xl border border-rose-200 bg-rose-50/70 px-4 py-3 text-sm text-rose-800 shadow-sm">
<div class="flex items-start justify-between gap-3">
<pre id="error-text" class="whitespace-pre-wrap text-xs leading-5"></pre>
<button id="error-dismiss" type="button" class="rounded-xl px-2 py-1 text-xs font-semibold text-rose-700 hover:bg-rose-100">
Dismiss
</button>
</div>
</div>
</div>
<section class="mt-8 glass rounded-3xl p-5 shadow-sm">
<h2 class="text-sm font-semibold text-slate-800">Configuration</h2>
<p class="mt-1 text-xs leading-5 text-slate-600">
The default API root matches Ollama’s OpenAI-compatible surface
(<a class="text-indigo-700 underline" href="https://ollama.com/v1/models" target="_blank" rel="noreferrer"
>ollama.com/v1/models</a
>). For a local daemon, use <span class="font-mono">http://127.0.0.1:11434/v1</span>.
</p>
<div class="mt-4 grid gap-4 lg:grid-cols-2">
<div>
<label for="api-root" class="text-xs font-semibold text-slate-700">API root (list + chat)</label>
<input
id="api-root"
type="text"
value="https://ollama.com/v1"
class="mt-1 w-full rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm font-mono shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
<button
id="btn-refresh-models"
type="button"
class="mt-2 inline-flex items-center gap-2 rounded-2xl border border-slate-200 bg-white/80 px-4 py-2 text-xs font-semibold text-slate-800 shadow-sm hover:bg-white"
>
List models
</button>
<p id="models-status" class="mt-2 text-xs text-slate-500"></p>
</div>
<div>
<label for="api-key" class="text-xs font-semibold text-slate-700">Optional API key</label>
<input
id="api-key"
type="password"
autocomplete="off"
placeholder="Leave empty to use server env or “ollama”"
class="mt-1 w-full rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
</div>
</div>
<div class="mt-6">
<div class="text-xs font-semibold text-slate-700">Select five models</div>
<div id="model-slots" class="mt-2 grid gap-2 sm:grid-cols-2 lg:grid-cols-5"></div>
</div>
<div class="mt-6 flex flex-wrap items-end gap-4">
<div>
<label for="bench-task" class="text-xs font-semibold text-slate-700">Task difficulty</label>
<select
id="bench-task"
class="mt-1 block rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
>
<option value="easy">Easy</option>
<option value="medium">Medium</option>
<option value="hard">Hard</option>
</select>
</div>
<div>
<label for="bench-max-steps" class="text-xs font-semibold text-slate-700">Max steps</label>
<input
id="bench-max-steps"
type="number"
min="1"
max="200"
value="10"
class="mt-1 w-24 rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
</div>
<div>
<label for="bench-seed" class="text-xs font-semibold text-slate-700">Seed (optional)</label>
<input
id="bench-seed"
type="number"
min="0"
placeholder="random"
class="mt-1 w-28 rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
</div>
<button
id="btn-run-benchmark"
type="button"
class="inline-flex items-center gap-2 rounded-2xl bg-slate-900 px-5 py-2.5 text-sm font-semibold text-white shadow-sm transition hover:bg-slate-800 disabled:opacity-50"
>
<span class="h-2 w-2 rounded-full bg-emerald-400"></span>
Run benchmark
</button>
</div>
<p id="bench-status" class="mt-3 text-xs font-medium text-indigo-700"></p>
</section>
<section class="mt-8 glass rounded-3xl p-5 shadow-sm">
<h2 class="text-sm font-semibold text-slate-800">Results</h2>
<div class="mt-3 overflow-x-auto">
<table class="w-full min-w-[32rem] text-left text-xs">
<thead>
<tr class="border-b border-slate-200 text-slate-500">
<th class="py-2 pr-3 font-semibold">Model</th>
<th class="py-2 pr-3 font-semibold">Total reward</th>
<th class="py-2 pr-3 font-semibold">Steps</th>
<th class="py-2 font-semibold">Error</th>
</tr>
</thead>
<tbody id="bench-table-body"></tbody>
</table>
</div>
</section>
<section class="mt-8 grid gap-6 lg:grid-cols-2">
<div class="glass rounded-3xl p-5 shadow-sm">
<h3 class="text-sm font-semibold text-slate-800">Total reward by model</h3>
<div class="mt-4 h-72">
<canvas id="chart-total" aria-label="Total reward"></canvas>
</div>
</div>
<div class="glass rounded-3xl p-5 shadow-sm">
<h3 class="text-sm font-semibold text-slate-800">Steps to last transition</h3>
<div class="mt-4 h-72">
<canvas id="chart-steps" aria-label="Step count"></canvas>
</div>
</div>
</section>
<section class="mt-8 glass rounded-3xl p-5 shadow-sm">
<h3 class="text-sm font-semibold text-slate-800">Cumulative reward over steps</h3>
<p class="mt-1 text-xs text-slate-600">Per-episode reward sequence (same task + seed per model).</p>
<div class="mt-4 h-96">
<canvas id="chart-cumulative" aria-label="Cumulative reward"></canvas>
</div>
</section>
<footer class="mt-10 text-center text-xs text-slate-500">
Benchmark uses <span class="font-mono">POST /api/benchmark/run</span> on this server (same prompts as
<span class="font-mono">inference.py</span>).
</footer>
</div>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.1/dist/chart.umd.min.js"></script>
<script src="/web/assets/benchmark.js"></script>
</body>
</html>