labrats / README.md

Smestern

Deploy Labrats live Space (free-tier, remote embeddings)

657d2d2 verified 19 days ago

preview code

Raw

History Blame Contribute Delete

2.82 kB

	---
	title: Labrats — Tiny Lab Agents
	emoji: 🧪
	colorFrom: indigo
	colorTo: gray
	sdk: gradio
	sdk_version: 6.16.0
	app_file: app.py
	pinned: false
	license: apache-2.0
	short_description: Small (≤32B) LLM agents doing science in DiscoveryWorld
	tags:
	- thousand-token-wood
	- agent
	- best-agent
	- best-demo
	- small-models
	- multi-agent
	- discoveryworld
	- track:wood
	- achievement:llama
	---

	# Labrats — Tiny lab agents in DiscoveryWorld

	A research lab of small (≤32B) LLM agents doing science inside [AllenAI's DiscoveryWorld](https://github.com/allenai/discoveryworld) simulator. Each ReAct avatar has private episodic memory plus a shared lab notebook; N avatars share one world instance and resolve their actions simultaneously.
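As a rough illustration of what "resolve their actions simultaneously" can mean, the sketch below uses a two-phase tick: every agent decides against the same frozen snapshot, and only then are the actions applied. The toy world (a `samples` counter), the `TAKE`/`WAIT` actions, and `step_simultaneously` are all hypothetical and are not part of DiscoveryWorld or this Space's code.

```python
from typing import Callable, Dict, List

Observation = Dict[str, int]
Policy = Callable[[Observation], str]


def step_simultaneously(state: Dict[str, int], policies: Dict[str, Policy]) -> List[str]:
    """One tick: all agents decide on the same snapshot, then all actions are applied."""
    snapshot = dict(state)  # frozen view shared by every agent this tick

    # Phase 1: decide. No agent sees another agent's action from this tick.
    actions = {name: policy(snapshot) for name, policy in policies.items()}

    # Phase 2: resolve. A fixed (sorted) order keeps the outcome deterministic.
    log = []
    for name in sorted(actions):
        action = actions[name]
        if action == "TAKE" and state["samples"] > 0:
            state["samples"] -= 1
            log.append(f"{name}: TAKE ok")
        elif action == "TAKE":
            # The toy engine rejects the action: the resource is already gone.
            log.append(f"{name}: TAKE rejected")
        else:
            log.append(f"{name}: {action}")
    return log


def greedy(obs: Observation) -> str:
    """Hypothetical policy: grab a sample whenever the snapshot shows one."""
    return "TAKE" if obs["samples"] > 0 else "WAIT"
```

With one sample and two greedy agents, both choose `TAKE` from the same snapshot, and the resolve phase accepts one and rejects the other. Conflicts of this kind are one reason agents that share a world benefit from a shared notebook and chat.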

	## 🔴 Run it live

	The "Run live" tab runs a real episode (N ReAct agents on a fresh DiscoveryWorld scenario) entirely on the free CPU tier. Both the LLM and the memory embeddings run remotely via [Hugging Face Inference Providers](https://huggingface.co/docs/inference-providers), so there is no `torch`/`sentence-transformers` install to time out the build. Only one episode can be in flight at a time, and the number of steps and agents per run is capped (`MAX_STEPS_CAP`, `MAX_AGENTS_CAP`).
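A minimal sketch of how a single-episode guard with caps could look, assuming a process-wide lock and clamped inputs. Only the names `MAX_STEPS_CAP` and `MAX_AGENTS_CAP` come from this README; their values here, along with `run_live_episode`, `clamp`, and `run_fn`, are hypothetical.

```python
import threading

# Hypothetical values; the Space's real caps are not stated in this README.
MAX_STEPS_CAP = 30
MAX_AGENTS_CAP = 3

_episode_lock = threading.Lock()


def clamp(value: int, cap: int) -> int:
    """Clamp a user-requested count into the range [1, cap]."""
    return max(1, min(int(value), cap))


def run_live_episode(n_agents: int, n_steps: int, run_fn):
    """Run one episode unless another one is already in flight.

    run_fn is a hypothetical callable (n_agents, n_steps) -> result.
    """
    # Non-blocking acquire: a second caller is turned away instead of queued.
    if not _episode_lock.acquire(blocking=False):
        return {"status": "busy"}
    try:
        agents = clamp(n_agents, MAX_AGENTS_CAP)
        steps = clamp(n_steps, MAX_STEPS_CAP)
        return {
            "status": "ok",
            "agents": agents,
            "steps": steps,
            "result": run_fn(agents, steps),
        }
    finally:
        # Always release the lock, even if the episode raises.
        _episode_lock.release()
```

Rejecting a concurrent request rather than queueing it keeps a free-tier worker from piling up long-running episodes. Whether the Space rejects or queues is not stated here.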

	A finished live run is written into `runs/` and appears in the Episode replay dropdown automatically.

	## Pinned traces

	Two recorded episodes ship with the Space:

	- `stage2-gemma4-12b` — single agent, memory on, Instrument Measurement scenario. Measures the spectrum then pivots `PICKUP → PUT` and completes the task.
	- `dbg-verbose-gemma4` — two agents on an Archaeology Dating dig, sharing one lab notebook and talking to each other. Scores 7/10 as they coordinate digging, dating, and flagging the oldest artifact.

	The "Compare two runs" tab puts any two runs side by side so the behavioural delta is visible at a glance.

	## How it works (Phase C summary)

	- `ChromaMemoryStore` with three collections: `private` (per-author), `notebook` (shared), and `chat` (utterances).
	- `HFInferenceEmbedder` (`BAAI/bge-small-en-v1.5`, called via Inference Providers) for retrieval; cosine ANN inside Chroma, re-ranked in Python. No local `torch`.
	- Composite score per Generative Agents (UIST 2023): `recency * 0.995^age + importance/10 + relevance`.
	- Rule-based writer: every successful `USE` of an instrument → `measurement` record in the notebook; every engine-rejected action → `failure` record in the private tier.
	- Retrieval (k=3 per tier) is injected into the prompt under `## Lab notebook` / `## Your notes` headers before each ReAct decision.
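The scoring and prompt-injection steps in the list above can be sketched as follows. This is an illustration under stated assumptions, not the Space's actual code: `Record`, `composite_score`, `rerank`, and `render_memory` are hypothetical names, `recency` in the formula is read as a plain weight (default 1.0), `age` is whatever time unit the store tracks, and `relevance` is taken to be the cosine similarity returned by the ANN stage.

```python
from dataclasses import dataclass

DECAY = 0.995  # decay base from the composite-score formula


@dataclass
class Record:
    text: str
    age: float         # time since the record was written (unit assumed)
    importance: float  # assumed 0..10 scale, hence importance / 10
    relevance: float   # assumed cosine similarity to the current query


def composite_score(rec: Record, recency_weight: float = 1.0) -> float:
    """recency * 0.995^age + importance/10 + relevance."""
    return recency_weight * DECAY ** rec.age + rec.importance / 10.0 + rec.relevance


def rerank(candidates, k=3):
    """Re-rank ANN candidates by composite score and keep the top k."""
    return sorted(candidates, key=composite_score, reverse=True)[:k]


def render_memory(notebook, private, k=3):
    """Build the memory section of the prompt, one header per tier."""
    sections = []
    for header, recs in (("## Lab notebook", notebook), ("## Your notes", private)):
        top = rerank(recs, k)
        if top:  # skip a tier that has nothing to show
            lines = [header] + [f"- {r.text}" for r in top]
            sections.append("\n".join(lines))
    return "\n\n".join(sections)
```

Because the three terms are summed, an old but important record can outrank a fresh, trivial one. At `age=100` the recency term is still about 0.61, so `importance` and `relevance` dominate for long-lived notebook entries.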

	Both the LLM and the embeddings run remotely, so the Space deploys cleanly on the free CPU tier. To run the agent locally, see the repo's main README.