| ---
|
| title: Labrats — Tiny Lab Agents
|
| emoji: 🧪
|
| colorFrom: indigo
|
| colorTo: gray
|
| sdk: gradio
|
| sdk_version: 6.16.0
|
| app_file: app.py
|
| pinned: false
|
| license: apache-2.0
|
| short_description: Small (≤32B) LLM agents doing science in DiscoveryWorld
|
| tags:
|
| - thousand-token-wood
|
| - agent
|
| - best-agent
|
| - best-demo
|
| - small-models
|
| - multi-agent
|
| - discoveryworld
|
| - track:wood
|
| - achievement:llama
|
| ---
|
|
|
| # Labrats — Tiny lab agents in DiscoveryWorld
|
|
|
| A research lab of small (≤32B) LLM agents doing science inside [AllenAI's DiscoveryWorld](https://github.com/allenai/discoveryworld) simulator. Each ReAct avatar has private episodic memory plus a shared lab notebook; N avatars share one world instance and resolve their actions simultaneously.
|
|
|
| ## 🔴 Run it live
|
|
|
| The **"Run live"** tab runs a *real* episode (N ReAct agents on a fresh DiscoveryWorld scenario) entirely on the **free CPU tier**. Both the LLM *and* the memory embeddings run remotely via [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers), so there is no `torch`/`sentence-transformers` install to time out the build. Only one episode can be in flight at a time, and each run is bounded by step and agent caps (`MAX_STEPS_CAP`, `MAX_AGENTS_CAP`).
|
|
|
| A finished live run is written into `runs/` and appears in the **Episode replay** dropdown automatically.
|
|
|
| ## Pinned traces
|
|
|
| Two recorded episodes ship with the Space:
|
|
|
| - **`stage2-gemma4-12b`** — single agent, memory on, *Instrument Measurement* scenario. Measures the spectrum then pivots `PICKUP → PUT` and **completes the task**.
|
| - **`dbg-verbose-gemma4`** — two agents on an *Archaeology Dating* dig, sharing one lab notebook and talking to each other. Scores 7/10 as they coordinate digging, dating, and flagging the oldest artifact.
|
|
|
| The "Compare two runs" tab shows any two runs side by side, so the behavioural delta is visible at a glance.
|
|
|
| ## How it works (Phase C summary)
|
|
|
| - `ChromaMemoryStore` with three collections: `private` (per-author), `notebook` (shared), and `chat` (utterances).
|
| - `HFInferenceEmbedder` (`BAAI/bge-small-en-v1.5`, called via Inference Providers) for retrieval; cosine ANN inside Chroma, re-ranked in Python. No local `torch`.
|
| - Composite retrieval score, following Generative Agents (UIST 2023): `recency * 0.995^age + importance/10 + relevance`.
|
| - Rule-based writer: every successful `USE` of an instrument → `measurement` record in the notebook; every engine-rejected action → `failure` record in the private tier.
|
| - Retrieval (k=3 per tier) is injected into the prompt under `## Lab notebook` / `## Your notes` headers before each ReAct decision.
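The composite score and the prompt injection described above can be sketched as follows. This is an illustrative reading, not the repo's implementation: it assumes `recency` in the formula is a scalar weight, that `age` is measured in whatever unit the store tracks, and that `relevance` is the cosine similarity returned by the vector search. The `Memory` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    age: float         # time since the record was written (unit is an assumption)
    importance: float  # assumed to lie on a 0..10 scale
    relevance: float   # assumed cosine similarity to the current query


def composite_score(m: Memory, recency_weight: float = 1.0,
                    decay: float = 0.995) -> float:
    """recency * 0.995^age + importance/10 + relevance."""
    return recency_weight * decay ** m.age + m.importance / 10 + m.relevance


def top_k(memories: list[Memory], k: int = 3) -> list[Memory]:
    """Re-rank candidates by composite score and keep the best k."""
    return sorted(memories, key=composite_score, reverse=True)[:k]


def render_section(header: str, memories: list[Memory]) -> str:
    """Format one tier as a markdown section for the prompt."""
    return "\n".join([f"## {header}"] + [f"- {m.text}" for m in memories])


def build_context(notebook: list[Memory], private: list[Memory],
                  k: int = 3) -> str:
    """Assemble the retrieval block injected before each ReAct decision."""
    return "\n\n".join([
        render_section("Lab notebook", top_k(notebook, k)),
        render_section("Your notes", top_k(private, k)),
    ])
```

With a decay of 0.995 per unit of age, a record loses roughly 40% of its recency term after 100 units, so fresh measurements tend to outrank older ones of similar relevance.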
|
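The rule-based writer from the list above can be sketched as a pure function from a step outcome to a memory record. The `StepResult` fields, the `target_is_instrument` flag, and the record layout are assumptions made for illustration; only the two rules (successful instrument `USE` → `measurement` in the notebook, rejected action → `failure` in the private tier) come from this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResult:
    author: str
    action: str                 # e.g. "USE", "PICKUP", "PUT"
    target: str
    success: bool               # False when the engine rejected the action
    observation: str
    target_is_instrument: bool = False  # hypothetical flag


def write_memory(step: StepResult) -> Optional[tuple[str, dict]]:
    """Return (tier, record) for this step, or None when no rule fires."""
    if not step.success:
        # Rejected actions stay private to the agent that attempted them.
        return "private", {
            "kind": "failure",
            "author": step.author,
            "text": f"{step.action} {step.target} failed: {step.observation}",
        }
    if step.action == "USE" and step.target_is_instrument:
        # Successful measurements are shared with the whole lab.
        return "notebook", {
            "kind": "measurement",
            "author": step.author,
            "text": f"{step.target}: {step.observation}",
        }
    return None
```

Keeping the writer rule-based means no extra LLM call is spent deciding what to remember, which matters when every token is served remotely.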
|
|
| Both the LLM and the embeddings run remotely, so the Space deploys cleanly on the free CPU tier. To run the agent locally, see the repo's main README.
|
|
|