| <!DOCTYPE html> | |
| <html lang="en"> | |
| <head> | |
| <meta charset="UTF-8" /> | |
| <meta name="viewport" content="width=device-width, initial-scale=1.0" /> | |
| <title>AEGIS-Env — Model benchmark</title> | |
| <link rel="preconnect" href="https://fonts.googleapis.com" /> | |
| <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin /> | |
| <link | |
| href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" | |
| rel="stylesheet" | |
| /> | |
| <script src="https://cdn.tailwindcss.com"></script> | |
| <script> | |
| tailwind.config = { | |
| theme: { | |
| extend: { | |
| fontFamily: { | |
| sans: ["Inter", "ui-sans-serif", "system-ui", "sans-serif"], | |
| }, | |
| boxShadow: { | |
| glow: "0 20px 60px rgba(99, 102, 241, 0.25)", | |
| }, | |
| }, | |
| }, | |
| }; | |
| </script> | |
| <style> | |
| .glass { | |
| background: rgba(255, 255, 255, 0.72); | |
| backdrop-filter: blur(14px); | |
| -webkit-backdrop-filter: blur(14px); | |
| border: 1px solid rgba(255, 255, 255, 0.6); | |
| } | |
| .soft-grid { | |
| background-image: radial-gradient( | |
| rgba(99, 102, 241, 0.12) 1px, | |
| transparent 1px | |
| ), | |
| radial-gradient(rgba(236, 72, 153, 0.08) 1px, transparent 1px); | |
| background-position: 0 0, 12px 12px; | |
| background-size: 24px 24px; | |
| } | |
| </style> | |
| </head> | |
| <body class="min-h-screen bg-slate-50 text-slate-900 soft-grid"> | |
| <div | |
| class="pointer-events-none fixed inset-x-0 top-0 h-80 bg-gradient-to-b from-indigo-200/60 via-fuchsia-200/30 to-transparent" | |
| ></div> | |
| <div class="relative mx-auto max-w-7xl px-4 pb-12 pt-8 sm:px-6 lg:px-8"> | |
| <header class="flex flex-col gap-4 sm:flex-row sm:items-end sm:justify-between"> | |
| <div> | |
| <p class="text-sm font-medium text-slate-600"> | |
| <a href="/web" class="text-indigo-700 hover:underline">← Playground</a> | |
| </p> | |
| <h1 class="mt-2 text-3xl font-semibold tracking-tight sm:text-4xl"> | |
| <span | |
| class="text-transparent bg-clip-text bg-gradient-to-r from-indigo-600 via-fuchsia-600 to-sky-600" | |
| > | |
| Model benchmark | |
| </span> | |
| </h1> | |
| <p class="mt-2 max-w-2xl text-sm leading-6 text-slate-600"> | |
| List models from an OpenAI-compatible endpoint (e.g. | |
| <span class="font-mono">GET …/v1/models</span>), choose five models and a task | |
| difficulty, then compare runs. Only the chat | |
| <span class="font-semibold">model</span> name changes between episodes; prompts and | |
| environment settings are identical. | |
| </p> | |
| </div> | |
| </header> | |
| <div id="error-banner" class="mt-6 hidden"> | |
| <div class="glass rounded-3xl border border-rose-200 bg-rose-50/70 px-4 py-3 text-sm text-rose-800 shadow-sm"> | |
| <div class="flex items-start justify-between gap-3"> | |
| <pre id="error-text" class="whitespace-pre-wrap text-xs leading-5"></pre> | |
| <button id="error-dismiss" type="button" class="rounded-xl px-2 py-1 text-xs font-semibold text-rose-700 hover:bg-rose-100"> | |
| Dismiss | |
| </button> | |
| </div> | |
| </div> | |
| </div> | |
| <section class="mt-8 glass rounded-3xl p-5 shadow-sm"> | |
| <h2 class="text-sm font-semibold text-slate-800">Configuration</h2> | |
| <p class="mt-1 text-xs leading-5 text-slate-600"> | |
| The default API root points to Ollama’s OpenAI-compatible API ( | |
| <a class="text-indigo-700 underline" href="https://ollama.com/v1/models" target="_blank" rel="noreferrer" | |
| >ollama.com/v1/models</a | |
| >). For a local daemon, use <span class="font-mono">http://127.0.0.1:11434/v1</span>. | |
| </p> | |
| <div class="mt-4 grid gap-4 lg:grid-cols-2"> | |
| <div> | |
| <label for="api-root" class="text-xs font-semibold text-slate-700">API root (list + chat)</label> | |
| <input | |
| id="api-root" | |
| type="text" | |
| value="https://ollama.com/v1" | |
| class="mt-1 w-full rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm font-mono shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60" | |
| /> | |
| <button | |
| id="btn-refresh-models" | |
| type="button" | |
| class="mt-2 inline-flex items-center gap-2 rounded-2xl border border-slate-200 bg-white/80 px-4 py-2 text-xs font-semibold text-slate-800 shadow-sm hover:bg-white" | |
| > | |
| List models | |
| </button> | |
| <p id="models-status" class="mt-2 text-xs text-slate-500"></p> | |
| </div> | |
| <div> | |
| <label for="api-key" class="text-xs font-semibold text-slate-700">Optional API key</label> | |
| <input | |
| id="api-key" | |
| type="password" | |
| autocomplete="off" | |
| placeholder="Leave empty to use server env or “ollama”" | |
| class="mt-1 w-full rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60" | |
| /> | |
| </div> | |
| </div> | |
| <div class="mt-6"> | |
| <div class="text-xs font-semibold text-slate-700">Select five models</div> | |
| <div id="model-slots" class="mt-2 grid gap-2 sm:grid-cols-2 lg:grid-cols-5"></div> | |
| </div> | |
| <div class="mt-6 flex flex-wrap items-end gap-4"> | |
| <div> | |
| <label for="bench-task" class="text-xs font-semibold text-slate-700">Task difficulty</label> | |
| <select | |
| id="bench-task" | |
| class="mt-1 block rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60" | |
| > | |
| <option value="easy">Easy</option> | |
| <option value="medium">Medium</option> | |
| <option value="hard">Hard</option> | |
| </select> | |
| </div> | |
| <div> | |
| <label for="bench-max-steps" class="text-xs font-semibold text-slate-700">Max steps</label> | |
| <input | |
| id="bench-max-steps" | |
| type="number" | |
| min="1" | |
| max="200" | |
| value="10" | |
| class="mt-1 w-24 rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60" | |
| /> | |
| </div> | |
| <div> | |
| <label for="bench-seed" class="text-xs font-semibold text-slate-700">Seed (optional)</label> | |
| <input | |
| id="bench-seed" | |
| type="number" | |
| min="0" | |
| placeholder="random" | |
| class="mt-1 w-28 rounded-2xl border border-slate-200 bg-white/80 px-3 py-2.5 text-sm shadow-sm outline-none focus:border-indigo-300 focus:ring-4 focus:ring-indigo-200/60" | |
| /> | |
| </div> | |
| <button | |
| id="btn-run-benchmark" | |
| type="button" | |
| class="inline-flex items-center gap-2 rounded-2xl bg-slate-900 px-5 py-2.5 text-sm font-semibold text-white shadow-sm transition hover:bg-slate-800 disabled:opacity-50" | |
| > | |
| <span class="h-2 w-2 rounded-full bg-emerald-400"></span> | |
| Run benchmark | |
| </button> | |
| </div> | |
| <p id="bench-status" class="mt-3 text-xs font-medium text-indigo-700"></p> | |
| </section> | |
| <section class="mt-8 glass rounded-3xl p-5 shadow-sm"> | |
| <h2 class="text-sm font-semibold text-slate-800">Results</h2> | |
| <div class="mt-3 overflow-x-auto"> | |
| <table class="w-full min-w-[32rem] text-left text-xs"> | |
| <thead> | |
| <tr class="border-b border-slate-200 text-slate-500"> | |
| <th class="py-2 pr-3 font-semibold">Model</th> | |
| <th class="py-2 pr-3 font-semibold">Total reward</th> | |
| <th class="py-2 pr-3 font-semibold">Steps</th> | |
| <th class="py-2 font-semibold">Error</th> | |
| </tr> | |
| </thead> | |
| <tbody id="bench-table-body"></tbody> | |
| </table> | |
| </div> | |
| </section> | |
| <section class="mt-8 grid gap-6 lg:grid-cols-2"> | |
| <div class="glass rounded-3xl p-5 shadow-sm"> | |
| <h3 class="text-sm font-semibold text-slate-800">Total reward by model</h3> | |
| <div class="mt-4 h-72"> | |
| <canvas id="chart-total" role="img" aria-label="Total reward"></canvas> | |
| </div> | |
| </div> | |
| <div class="glass rounded-3xl p-5 shadow-sm"> | |
| <h3 class="text-sm font-semibold text-slate-800">Steps to last transition</h3> | |
| <div class="mt-4 h-72"> | |
| <canvas id="chart-steps" role="img" aria-label="Step count"></canvas> | |
| </div> | |
| </div> | |
| </section> | |
| <section class="mt-8 glass rounded-3xl p-5 shadow-sm"> | |
| <h3 class="text-sm font-semibold text-slate-800">Cumulative reward over steps</h3> | |
| <p class="mt-1 text-xs text-slate-600">Running total of per-step rewards for each episode (same task and seed for every model).</p> | |
| <div class="mt-4 h-96"> | |
| <canvas id="chart-cumulative" role="img" aria-label="Cumulative reward"></canvas> | |
| </div> | |
| </section> | |
| <footer class="mt-10 text-center text-xs text-slate-500"> | |
| The benchmark calls <span class="font-mono">POST /api/benchmark/run</span> on this server and uses the same prompts as | |
| <span class="font-mono">inference.py</span>. | |
| </footer> | |
| </div> | |
| <script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.1/dist/chart.umd.min.js"></script> | |
| <script src="/web/assets/benchmark.js"></script> | |
| </body> | |
| </html> | |