<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>AEGIS-Env — Model benchmark</title>
<link rel="preconnect" href="https://fonts.googleapis.com" />
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
<link
href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap"
rel="stylesheet"
/>
<script src="https://cdn.tailwindcss.com"></script>
<script>
tailwind.config = {
theme: {
extend: {
fontFamily: {
sans: ["Inter", "ui-sans-serif", "system-ui", "sans-serif"],
},
boxShadow: {
glow: "0 20px 60px rgba(99, 102, 241, 0.25)",
},
},
},
};
</script>
<style>
.glass {
background: rgba(255, 255, 255, 0.72);
backdrop-filter: blur(14px);
-webkit-backdrop-filter: blur(14px);
border: 1px solid rgba(255, 255, 255, 0.6);
}
.soft-grid {
background-image: radial-gradient(
rgba(99, 102, 241, 0.12) 1px,
transparent 1px
),
radial-gradient(rgba(236, 72, 153, 0.08) 1px, transparent 1px);
background-position: 0 0, 12px 12px;
background-size: 24px 24px;
}
</style>
</head>
<body class="min-h-screen bg-slate-50 text-slate-900 soft-grid">
<div
class="pointer-events-none fixed inset-x-0 top-0 h-80 bg-gradient-to-b from-indigo-200/60 via-fuchsia-200/30 to-transparent"
></div>
<div class="relative mx-auto max-w-7xl px-4 pb-12 pt-8 sm:px-6 lg:px-8">
<header class="flex flex-col gap-4 sm:flex-row sm:items-end sm:justify-between">
<div>
<p class="text-sm font-medium text-slate-600">
<a href="/web" class="text-indigo-700 hover:underline">← Playground</a>
</p>
<h1 class="mt-2 text-3xl font-semibold tracking-tight sm:text-4xl">
<span
class="text-transparent bg-clip-text bg-gradient-to-r from-indigo-600 via-fuchsia-600 to-sky-600"
>
Model benchmark
</span>
</h1>
<p class="mt-2 max-w-2xl text-sm leading-6 text-slate-600">
List models from an OpenAI-compatible endpoint (e.g.
<span class="font-mono">GET …/v1/models</span>), choose five models and a task
difficulty, then compare runs. Only the chat
<span class="font-semibold">model</span> name changes between episodes; prompts and
environment settings are identical.
</p>
</div>
</header>
<div id="error-banner" class="mt-6 hidden">
<div class="glass rounded-3xl border border-rose-200 bg-rose-50/70 px-4 py-3 text-sm text-rose-800 shadow-sm">
<div class="flex items-start justify-between gap-3">
<pre id="error-text" class="whitespace-pre-wrap text-xs leading-5"></pre>
<button id="error-dismiss" type="button" class="rounded-xl px-2 py-1 text-xs font-semibold text-rose-700 hover:bg-rose-100">
Dismiss
</button>
</div>
</div>
</div>
<section class="mt-8 glass rounded-3xl p-5 shadow-sm">
<h2 class="text-sm font-semibold text-slate-800">Configuration</h2>
<p class="mt-1 text-xs leading-5 text-slate-600">
The default API root points at Ollama’s OpenAI-compatible API (
<a class="text-indigo-700 underline" href="https://ollama.com/v1/models" target="_blank" rel="noreferrer"
>ollama.com/v1/models</a
>). For a local daemon use <span class="font-mono">http://127.0.0.1:11434/v1</span>.
</p>
<div class="mt-4 grid gap-4 lg:grid-cols-2">
<div>
<label for="api-root" class="text-xs font-semibold text-slate-700">API root (list + chat)</label>
<input
id="api-root"
type="text"
value="https://ollama.com/v1"
class="mt-1 w-full rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm font-mono shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
<button
id="btn-refresh-models"
type="button"
class="mt-2 inline-flex items-center gap-2 rounded-2xl border border-slate-200 bg-white/80 px-4 py-2 text-xs font-semibold text-slate-800 shadow-sm hover:bg-white"
>
List models
</button>
<p id="models-status" class="mt-2 text-xs text-slate-500"></p>
</div>
<div>
<label for="api-key" class="text-xs font-semibold text-slate-700">Optional API key</label>
<input
id="api-key"
type="password"
autocomplete="off"
placeholder="Leave empty to use server env or “ollama”"
class="mt-1 w-full rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
</div>
</div>
<div class="mt-6">
<div class="text-xs font-semibold text-slate-700">Select five models</div>
<div id="model-slots" class="mt-2 grid gap-2 sm:grid-cols-2 lg:grid-cols-5"></div>
</div>
<div class="mt-6 flex flex-wrap items-end gap-4">
<div>
<label for="bench-task" class="text-xs font-semibold text-slate-700">Task difficulty</label>
<select
id="bench-task"
class="mt-1 block rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
>
<option value="easy">Easy</option>
<option value="medium">Medium</option>
<option value="hard">Hard</option>
</select>
</div>
<div>
<label for="bench-max-steps" class="text-xs font-semibold text-slate-700">Max steps</label>
<input
id="bench-max-steps"
type="number"
min="1"
max="200"
value="10"
class="mt-1 w-24 rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
</div>
<div>
<label for="bench-seed" class="text-xs font-semibold text-slate-700">Seed (optional)</label>
<input
id="bench-seed"
type="number"
min="0"
placeholder="random"
class="mt-1 w-28 rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60"
/>
</div>
<button
id="btn-run-benchmark"
type="button"
class="inline-flex items-center gap-2 rounded-2xl bg-slate-900 px-5 py-2.5 text-sm font-semibold text-white shadow-sm transition hover:bg-slate-800 disabled:opacity-50"
>
<span class="h-2 w-2 rounded-full bg-emerald-400"></span>
Run benchmark
</button>
</div>
<p id="bench-status" class="mt-3 text-xs font-medium text-indigo-700"></p>
</section>
<section class="mt-8 glass rounded-3xl p-5 shadow-sm">
<h2 class="text-sm font-semibold text-slate-800">Results</h2>
<div class="mt-3 overflow-x-auto">
<table class="w-full min-w-[32rem] text-left text-xs">
<thead>
<tr class="border-b border-slate-200 text-slate-500">
<th class="py-2 pr-3 font-semibold">Model</th>
<th class="py-2 pr-3 font-semibold">Total reward</th>
<th class="py-2 pr-3 font-semibold">Steps</th>
<th class="py-2 font-semibold">Error</th>
</tr>
</thead>
<tbody id="bench-table-body"></tbody>
</table>
</div>
</section>
<section class="mt-8 grid gap-6 lg:grid-cols-2">
<div class="glass rounded-3xl p-5 shadow-sm">
<h3 class="text-sm font-semibold text-slate-800">Total reward by model</h3>
<div class="mt-4 h-72">
<canvas id="chart-total" aria-label="Total reward"></canvas>
</div>
</div>
<div class="glass rounded-3xl p-5 shadow-sm">
<h3 class="text-sm font-semibold text-slate-800">Steps to last transition</h3>
<div class="mt-4 h-72">
<canvas id="chart-steps" aria-label="Step count"></canvas>
</div>
</div>
</section>
<section class="mt-8 glass rounded-3xl p-5 shadow-sm">
<h3 class="text-sm font-semibold text-slate-800">Cumulative reward over steps</h3>
<p class="mt-1 text-xs text-slate-600">Running reward total at each step of every model’s episode (same task and seed for all models).</p>
<div class="mt-4 h-96">
<canvas id="chart-cumulative" aria-label="Cumulative reward"></canvas>
</div>
</section>
<footer class="mt-10 text-center text-xs text-slate-500">
Benchmark uses <span class="font-mono">POST /api/benchmark/run</span> on this server (same prompts as
<span class="font-mono">inference.py</span>).
</footer>
</div>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.1/dist/chart.umd.min.js"></script>
<script src="/web/assets/benchmark.js"></script>
</body>
</html>