---
title: Labrats - Tiny Lab Agents
emoji: 🧪
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 6.16.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Small (≤32B) LLM agents doing science in DiscoveryWorld
tags:
  - thousand-token-wood
  - agent
  - best-agent
  - best-demo
  - small-models
  - multi-agent
  - discoveryworld
  - track:wood
  - achievement:llama
---

# Labrats: Tiny lab agents in DiscoveryWorld

A research lab of small (≤32B) LLM agents doing science inside [AllenAI's DiscoveryWorld](https://github.com/allenai/discoveryworld) simulator. Each ReAct avatar has private episodic memory plus a shared lab notebook; N avatars share one world instance and resolve their actions simultaneously.

## 🔴 Run it live

The **"Run live"** tab runs a *real* episode (N ReAct agents on a fresh DiscoveryWorld scenario) entirely on the **free CPU tier**. Both the LLM *and* the memory embeddings run remotely via [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers), so there is no `torch`/`sentence-transformers` install to time out the build. A single in-flight episode is allowed at a time, with step/agent caps (`MAX_STEPS_CAP`, `MAX_AGENTS_CAP`). A finished live run is written into `runs/` and appears in the **Episode replay** dropdown automatically.

## Pinned traces

Two recorded episodes ship with the Space:

- **`stage2-gemma4-12b`**: single agent, memory on, *Instrument Measurement* scenario. Measures the spectrum then pivots `PICKUP → PUT` and **completes the task**.
- **`dbg-verbose-gemma4`**: two agents on an *Archaeology Dating* dig, sharing one lab notebook and talking to each other. Scores 7/10 as they coordinate digging, dating, and flagging the oldest artifact.

The "Compare two runs" tab puts any two side-by-side so the behavioural delta is visible at a glance.

## How it works (Phase C summary)

- `ChromaMemoryStore` with three collections: `private` (per-author), `notebook` (shared), and `chat` (utterances).
- `HFInferenceEmbedder` (`BAAI/bge-small-en-v1.5`, called via Inference Providers) for retrieval; cosine ANN inside Chroma, re-ranked in Python. No local `torch`.
- Composite score per Generative Agents (UIST 2023): `recency * 0.995^age + importance/10 + relevance`.
- Rule-based writer: every successful `USE` of an instrument → `measurement` record in the notebook; every engine-rejected action → `failure` record in the private tier.
- Retrieval (k=3 per tier) is injected into the prompt under `## Lab notebook` / `## Your notes` headers before each ReAct decision.

Both the LLM and the embeddings run remotely, so the Space deploys cleanly on the free CPU tier. To run the agent locally, see the repo's main README.
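
The composite score and the per-tier top-k injection described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repo's actual code: `Record`, `composite_score`, `top_k`, and `render_section` are hypothetical names, `recency` is read here as a scalar weight on the `0.995^age` decay term, `importance` is assumed to lie on a 0 to 10 scale, and `relevance` is assumed to be the cosine similarity returned by the vector search.

```python
from dataclasses import dataclass

DECAY = 0.995  # decay base taken from the formula in the summary above


@dataclass
class Record:
    """Hypothetical memory record; field names are illustrative only."""
    text: str
    age: float         # assumed: steps elapsed since the record was written
    importance: float  # assumed: 0..10 scale
    relevance: float   # assumed: cosine similarity to the current query


def composite_score(rec: Record, recency: float = 1.0) -> float:
    """recency * 0.995^age + importance/10 + relevance (recency read as a weight)."""
    return recency * DECAY ** rec.age + rec.importance / 10 + rec.relevance


def top_k(records: list[Record], k: int = 3) -> list[Record]:
    """Re-rank candidates in Python and keep the k highest-scoring ones."""
    return sorted(records, key=composite_score, reverse=True)[:k]


def render_section(header: str, records: list[Record]) -> str:
    """Format one memory tier as a Markdown section for the prompt."""
    return "\n".join([header] + [f"- {r.text}" for r in records])


candidates = [
    Record("a", age=0, importance=5, relevance=0.5),
    Record("b", age=100, importance=2, relevance=0.1),
    Record("c", age=10, importance=8, relevance=0.9),
    Record("d", age=50, importance=1, relevance=0.2),
]
notebook_block = render_section("## Lab notebook", top_k(candidates))
```

With k=3 per tier, the same two calls would presumably run once for the shared notebook and once for the agent's private notes, and the two rendered sections would be concatenated into the prompt before the ReAct decision.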