---
title: PreferenceLab
emoji: 🧪
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
- openenv
- rlhf
- preference-learning
license: mit
---
# 🧪 PreferenceLab
### An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline
> **Built for the Meta × Hugging Face OpenEnv Hackathon by Team Nexis**

| **Live Space** | [Dev-CrafterX/preference-lab](https://huggingface.co/spaces/Dev-CrafterX/preference-lab) |
|---|---|
---
## Table of Contents
- [Overview](#overview)
- [Why PreferenceLab?](#why-preferencelab)
- [System Architecture](#system-architecture)
- [File Architecture](#file-architecture)
- [Task Design](#task-design)
  - [Task 1 – Pairwise Ranking](#task-1--pairwise-ranking-easy)
  - [Task 2 – Multi-Axis Likert Scoring](#task-2--multi-axis-likert-scoring-medium)
  - [Task 3 – Transitive Consistency Ranking](#task-3--transitive-consistency-ranking-hard)
- [Reward Functions](#reward-functions)
- [Datasets](#datasets)
- [Quick Start](#quick-start)
- [Environment Variables](#environment-variables)
- [API Reference](#api-reference)
- [Integration Guide](#integration-guide)
- [Baseline Scores](#baseline-scores)
- [Testing](#testing)
- [Deployment](#deployment)
- [License](#license)
---
## Overview
**PreferenceLab** is a production-grade [OpenEnv](https://github.com/meta-pytorch/openenv) environment that teaches AI agents to judge LLM response quality, as human annotators do in RLHF (Reinforcement Learning from Human Feedback) pipelines.
Instead of expensive, slow human annotators, PreferenceLab provides:
| Feature | Details |
|---|---|
| ✅ **Deterministic grading** | Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP) |
| ✅ **Dense reward signals** | Reward at every annotation step, not just episode-end |
| ✅ **Three difficulty levels** | Pairwise → Likert scoring → Transitive 4-way ranking |
| ✅ **Synthetic fallback** | Zero-dependency offline testing with built-in data |
| ✅ **Concurrent sessions** | Up to 64 parallel RL training sessions by default |
| ✅ **Reproducible episodes** | Fully seeded random sampling |
| ✅ **Web playground** | Gradio UI at `/web` for interactive testing |
---
## Why PreferenceLab?
To our knowledge, there are **no existing OpenEnv environments** that simulate the RLHF data collection pipeline, the same kind of pipeline that produces the alignment data used to fine-tune models like Llama 3, Claude, and GPT-4.
| Pain Point | PreferenceLab Solution |
|---|---|
| Human annotators are slow & expensive | AI agent replaces the annotator role |
| Binary end-of-episode rewards → sparse gradients | Every step yields a graded reward signal |
| Single-task environments limit curriculum learning | Three tasks of increasing complexity |
| Hard-to-reproduce evaluations | Seeded episodes are fully deterministic |
| Local dev blocked by API dependencies | Built-in synthetic fallback datasets |
| No visual interface for debugging | Gradio playground at `/web` |
---
## System Architecture
### 🏗️ Component Architecture
```mermaid
flowchart TB
subgraph Clients["Clients and Consumers"]
A1["AI Agent<br/>GRPO / TRL Training"]
A2["Baseline Inference<br/>inference.py"]
A3["Gradio Web UI<br/>/web"]
A4["REST / WebSocket<br/>Direct API"]
end
subgraph Platform["Hugging Face Space - Docker Container"]
subgraph FastAPI["FastAPI Server - server/app.py"]
EP1["/reset POST"]
EP2["/step POST"]
EP3["/state GET"]
EP4["/health GET"]
EP5["/web Gradio"]
end
subgraph EnvCore["PreferenceLabEnvironment - server/environment.py"]
RESET["reset()<br/>seed · task_type · episode_id"]
STEP["step()<br/>grade action → reward → sample next"]
STATE["state @property<br/>returns State object"]
end
subgraph Graders["Deterministic Graders"]
G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"]
G2["Task 2 · Likert<br/>1 - MAE / 4.0"]
G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"]
end
subgraph DataStore["Data Layer - data/"]
D1["pairwise_data.json<br/>HH-RLHF"]
D2["likert_data.json<br/>UltraFeedback"]
D3["consistency_data.json<br/>Stanford SHP"]
D4["Synthetic Fallback<br/>built-in, always available"]
end
end
subgraph Models["Pydantic Models - models.py"]
M1["PairwiseAction / Observation"]
M2["LikertAction / Observation"]
M3["ConsistencyAction / Observation"]
end
LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"]
A1 -- "HTTP / WebSocket" --> FastAPI
A2 -- "Direct import" --> EnvCore
A3 --> EP5
A4 --> FastAPI
EP1 --> RESET
EP2 --> STEP
EP3 --> STATE
EP5 --> RESET
EP5 --> STEP
RESET --> Graders
STEP --> Graders
Graders --> G1
Graders --> G2
Graders --> G3
EnvCore --> DataStore
D1 -.->|fallback| D4
D2 -.->|fallback| D4
D3 -.->|fallback| D4
Models --> Graders
A2 -- "OpenAI client" --> LLM
classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px
classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px
classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
class A1,A2,A3,A4 client
class G1,G2,G3 grader
class D1,D2,D3,D4 data
class M1,M2,M3 model
class LLM external
class EP1,EP2,EP3,EP4,EP5 endpoint
class RESET,STEP,STATE env
```
---
### 🔄 Request Lifecycle: Data Flow
```mermaid
sequenceDiagram
autonumber
actor Agent as AI Agent / TRL Trainer
participant API as FastAPI Server
participant Env as PreferenceLabEnvironment
participant Grader as Deterministic Grader
participant DB as Dataset
Note over Agent,DB: Episode Start
Agent->>API: POST /reset task_type=pairwise seed=42
API->>Env: env.reset(task_type, seed)
Env->>DB: _sample_example(rng)
DB-->>Env: prompt, response_a, response_b, gold_label
Env-->>API: PairwiseObservation reward=0.0 done=false
API-->>Agent: 200 OK Observation JSON
Note over Agent,DB: Step Loop, max 10 steps per episode
loop For each annotation step
Agent->>Agent: call_llm(system_prompt, observation)
Agent->>API: POST /step action: choice=A
API->>Env: env.step(PairwiseAction)
Env->>Grader: grade_pairwise(action, example)
Grader->>Grader: compare choice vs gold_label
Grader-->>Env: reward=0.99 verdict=correct
Env->>DB: _sample_example next example
DB-->>Env: next example
Env-->>API: Observation reward=0.99 done=false step=N
API-->>Agent: 200 OK StepResult JSON
Agent->>Agent: log_step accumulate reward
end
Note over Agent,DB: Episode End
Env-->>API: Observation done=true step_count=10
API-->>Agent: 200 OK Final Observation
Agent->>Agent: log_end score rewards
Agent->>API: POST /reset start new episode
```
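
The lifecycle above can be mimicked offline with a toy stand-in. The sketch below is not the project's `PreferenceLabEnvironment`: `ToyPairwiseEnv`, `Obs`, the two hard-coded examples, and the `MAX_STEPS` constant are assumptions made purely for illustration. It only shows the shape of the loop: a seeded reset, a graded reward at every step, and a `done` flag after a fixed number of steps.

```python
import random
from dataclasses import dataclass

MAX_STEPS = 10  # assumed episode length, taken from the diagram above

# Hypothetical mini-dataset of (prompt, gold_label) pairs.
EXAMPLES = [("Explain recursion.", "A"), ("Summarize this text.", "B")]


@dataclass
class Obs:
    prompt: str
    reward: float
    done: bool
    step_count: int


class ToyPairwiseEnv:
    """Illustrative stand-in, not the real environment."""

    def reset(self, seed: int) -> Obs:
        self._rng = random.Random(seed)  # one RNG per episode keeps it reproducible
        self._steps = 0
        self._current = self._rng.choice(EXAMPLES)
        return Obs(self._current[0], 0.0, False, 0)

    def step(self, choice: str) -> Obs:
        # Grade the action against the current example first ...
        reward = 1.0 if choice == self._current[1] else 0.0
        self._steps += 1
        # ... then sample the next example for the following step.
        self._current = self._rng.choice(EXAMPLES)
        return Obs(self._current[0], reward, self._steps >= MAX_STEPS, self._steps)


def run_episode(seed: int) -> list:
    """Run one episode with a trivial 'always A' policy and collect rewards."""
    env = ToyPairwiseEnv()
    obs = env.reset(seed)
    rewards = []
    while not obs.done:
        obs = env.step("A")
        rewards.append(obs.reward)
    return rewards
```

Calling `run_episode(42)` twice yields the same reward list, which is the property the seeded design is meant to provide.
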
---
### 🧭 User Flow
```mermaid
flowchart TD
START(["Start"])
subgraph Setup["Setup Phase"]
S1["Clone repository<br/>git clone"]
S2["Install dependencies<br/>pip install -r requirements.txt"]
S3{"Need real<br/>datasets?"}
S4["Download datasets<br/>python scripts/prepare_datasets.py"]
S5["Use synthetic fallback<br/>built-in, no download needed"]
S6["Set environment vars<br/>HF_TOKEN MODEL_NAME API_BASE_URL"]
end
subgraph Deploy["Choose Deployment"]
D1{"Mode?"}
D2["Local Dev<br/>uvicorn server.app:app --port 8000"]
D3["Docker<br/>docker build and docker run"]
D4["HF Space<br/>git push to Hugging Face"]
end
subgraph Usage["Choose Usage Mode"]
U1{"How to use?"}
U2["Run Baseline<br/>python inference.py"]
U3["Web Playground<br/>localhost:8000/web"]
U4["REST API Integration<br/>HTTP + WebSocket"]
U5["Run Tests<br/>pytest tests/ -v"]
U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"]
end
subgraph Episode["Episode Loop"]
E1["POST /reset<br/>choose task_type and seed"]
E2{"Task Type?"}
E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"]
E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"]
E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"]
E6["POST /step<br/>submit action"]
E7["Receive Observation<br/>reward and done flag embedded"]
E8{"done == true?"}
E9["Next step<br/>new example sampled automatically"]
E10["Episode complete<br/>log_end avg reward computed"]
end
START --> S1 --> S2 --> S3
S3 -->|Yes| S4 --> S6
S3 -->|No| S5 --> S6
S6 --> D1
D1 -->|Local| D2
D1 -->|Docker| D3
D1 -->|Cloud| D4
D2 & D3 & D4 --> U1
U1 -->|Baseline| U2
U1 -->|Interactive| U3
U1 -->|Custom| U4
U1 -->|Tests| U5
U1 -->|Training| U6
U2 & U3 & U4 & U6 --> E1
E1 --> E2
E2 -->|pairwise| E3
E2 -->|likert| E4
E2 -->|consistency| E5
E3 & E4 & E5 --> E6 --> E7 --> E8
E8 -->|No| E9 --> E6
E8 -->|Yes| E10
E10 -->|New Episode| E1
E10 -->|Done| FINISH(["Complete"])
classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px
classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px
class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step
class S3,D1,U1,E2,E8 decision
class E3 task1
class E4 task2
class E5 task3
class START,FINISH terminal
```
---
### ☁️ Deployment Architecture
```mermaid
flowchart LR
subgraph Dev["Developer Machine"]
CODE["Source Code<br/>preference-lab/"]
GIT["git push"]
CODE --> GIT
end
subgraph Space["Hugging Face Space - Docker SDK"]
SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"]
CONTAINER["Docker Container<br/>python:3.10-slim"]
UVICORN["uvicorn server.app:app<br/>host 0.0.0.0 port 8000"]
WEB["Gradio UI<br/>/web"]
REST["REST API<br/>/reset /step /state"]
HEALTH["Health Check<br/>/health every 30s"]
CONTAINER --> UVICORN
UVICORN --> WEB
UVICORN --> REST
UVICORN --> HEALTH
SECRETS -.->|env vars injected| CONTAINER
end
PUBURL["Public URL<br/>https://username-preflab.hf.space"]
subgraph LLMApi["HF Inference API"]
MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"]
end
subgraph Consumers["Consumers"]
U1["TRL / GRPO<br/>Training Loop"]
U2["Developer<br/>Browser"]
U3["inference.py<br/>Baseline Script"]
U4["MCPToolClient<br/>PreferenceLabEnv"]
end
GIT --> Space
Space --> PUBURL
U1 -- "WebSocket / OpenEnv" --> REST
U2 -- "HTTPS" --> WEB
U3 -- "Direct import" --> UVICORN
U4 -- "HTTP / MCP" --> REST
REST -- "OpenAI client" --> MODEL
classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px
classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px
classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px
class PUBURL puburl
class CONTAINER,UVICORN docker
class U1,U2,U3,U4 consumer
class MODEL llm
class SECRETS secret
class CODE,GIT dev
class WEB,REST,HEALTH hf
```
---
## File Architecture
```
preference-lab/
│
├── README.md                 ← You are here
├── LICENSE
├── .gitignore
├── .dockerignore
│
├── openenv.yaml              ← OpenEnv manifest
│                               runtime: fastapi
│                               app: server.app:app
│                               port: 8000
│                               type: space
│
├── Dockerfile                ← HF Spaces production image
│                               Base: python:3.10-slim
│                               CMD: uvicorn server.app:app
│                               HEALTHCHECK: polls /health every 30s
│
├── requirements.txt          ← Flat pip dependency list
│                               openenv-core, fastapi, uvicorn,
│                               pydantic, openai, datasets,
│                               httpx, websockets, gradio
│
├── pyproject.toml            ← Build config + project metadata
│                               (setuptools, same deps as above)
│
├── __init__.py               ← Package entry point
│                               Exports: PreferenceLabEnv,
│                               PairwiseAction, LikertAction,
│                               ConsistencyAction + all Observations
│
├── models.py                 ← Pydantic v2 data models
│                               Defines the agent ↔ env contract
│
│     ACTIONS                         OBSERVATIONS
│     ──────────────────────────────  ──────────────────────────────────
│     PairwiseAction                  PairwiseObservation
│       .choice: A|B|tie|skip           .prompt, .response_a, .response_b
│       .justification: str?            .reward, .done, .step_count
│                                     ──────────────────────────────────
│     LikertAction                    LikertObservation
│       .helpfulness: 1-5               .prompt, .response
│       .honesty: 1-5                   .rubric, .reward, .done
│       .harmlessness: 1-5            ──────────────────────────────────
│       .instruction_following: 1-5   ConsistencyObservation
│                                       .prompt
│     ConsistencyAction                 .response_a, .response_b
│       .ranking: list[str] (len=4)     .response_c, .response_d
│                                       .reward, .done
│
├── client.py                 ← PreferenceLabEnv client wrapper
│                               Thin sync/async wrapper around
│                               openenv.core.MCPToolClient
│
├── inference.py              ← Baseline LLM inference script
│                               Mandatory stdout format:
│                               [START] task= env= model=
│                               [STEP] step= action= reward= done=
│                               [END] success= steps= score=
│
├── test_api.py               ← Quick smoke-test (direct import)
│                               Tests all 3 tasks in sequence
│
├── server/                   ← Core server package
│   │
│   ├── __init__.py
│   │
│   ├── app.py                ← FastAPI application factory
│   │                           ENABLE_WEB_INTERFACE=true → Gradio
│   │                           MAX_CONCURRENT_ENVS=64
│   │                           Routes: /manifest.json, /.well-known/
│   │
│   └── environment.py        ← Core OpenEnv environment
│         PreferenceLabEnvironment(Environment)
│         SUPPORTS_CONCURRENT_SESSIONS = True
│         ─────────────────────────────────────
│         reset(seed, task_type, **kwargs)
│           → Observation
│         step(action)
│           → Observation [reward & done inline]
│         state @property
│           → State(episode_id, step_count, ...)
│         ─────────────────────────────────────
│         Graders (internal):
│           grade_pairwise()    → +1.0 / 0.3 / 0.1 / 0.0
│           grade_likert()      → 1 - MAE/4.0
│           grade_consistency() → Kendall-τ + transitivity
│
├── data/                     ← Dataset files (git-ignored)
│   ├── pairwise_data.json      HH-RLHF gold labels
│   ├── likert_data.json        UltraFeedback multi-axis scores
│   └── consistency_data.json   Stanford SHP ranking pairs
│                               (All 3 auto-fallback to synthetic
│                               data if files are missing)
│
├── scripts/
│   └── prepare_datasets.py   ← Downloads & formats datasets
│                               from Hugging Face Hub
│                               Usage: python scripts/prepare_datasets.py
│
└── tests/
    └── test_environment.py   ← pytest test suite
                                25 test cases covering:
                                reset / step / state / graders
                                concurrent sessions / reproducibility
```
---
## Task Design
PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows.
### Task 1 – Pairwise Ranking (Easy)
The agent is shown a prompt and two LLM responses (A and B), and must pick the better one.
**Observation fields:** `prompt`, `response_a`, `response_b`
**Action:**
```python
PairwiseAction(
    choice="A",            # "A" | "B" | "tie" | "skip"
    justification="..."    # optional, not used for grading
)
```
**Grading (vs HH-RLHF gold label):**
| Agent choice | Outcome | Reward |
|---|---|---|
| Correct (matches gold) | ✅ | `+1.0` |
| `skip` | ⚠️ Abstain | `+0.3` |
| `tie` (when gold is clear) | ⚠️ Hedging | `+0.1` |
| Wrong choice | ❌ | `+0.0` |
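
The table above maps directly onto a small lookup. The following is a minimal sketch, not the project's `grade_pairwise()`: the `(choice, gold_label)` signature is an assumption chosen for illustration, and the real grader may take the action and example objects instead.

```python
def grade_pairwise(choice: str, gold_label: str) -> float:
    """Map an annotator choice to the reward table (illustrative signature)."""
    if choice == gold_label:
        return 1.0  # correct: matches the gold label
    if choice == "skip":
        return 0.3  # abstaining earns partial credit
    if choice == "tie":
        return 0.1  # hedging when the gold label is clear
    return 0.0      # wrong choice
```

Checking for an exact match first means a `tie` only receives the hedging reward when the gold label is not itself a tie.
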
---
### Task 2 – Multi-Axis Likert Scoring (Medium)
The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes.
**Observation fields:** `prompt`, `response`, `rubric`
**Action:**
```python
LikertAction(
    helpfulness=4,            # 1-5
    honesty=5,                # 1-5
    harmlessness=5,           # 1-5
    instruction_following=4   # 1-5
)
```
**Grading (vs UltraFeedback gold scores):**
```
reward = 1.0 - (MAE / 4.0)
where MAE = mean absolute error across all 4 axes
      4.0 = maximum possible error per axis
Perfect match  → reward = 1.0
Off by 1 each  → reward = 0.75
Off by 2 each  → reward = 0.50
Worst case     → reward = 0.0
```
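
As a worked instance of the formula, here is a minimal sketch. It assumes scores are passed as plain dictionaries keyed by axis name; the real `grade_likert()` presumably works on `LikertAction` objects, so the signature here is hypothetical.

```python
AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")


def grade_likert(agent: dict, gold: dict) -> float:
    """Return 1 - MAE/4, where MAE is averaged over the four axes."""
    mae = sum(abs(agent[axis] - gold[axis]) for axis in AXES) / len(AXES)
    return 1.0 - mae / 4.0


gold = dict.fromkeys(AXES, 5)
print(grade_likert(dict.fromkeys(AXES, 4), gold))  # off by 1 on every axis
```
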
---
### Task 3 – Transitive Consistency Ranking (Hard)
The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity.
**Observation fields:** `prompt`, `response_a`, `response_b`, `response_c`, `response_d`
**Action:**
```python
ConsistencyAction(
    ranking=["B", "A", "D", "C"]   # best → worst, all 4 required
)
```
**Grading (Kendall-τ + transitivity bonus):**
```
reward = α × kendall_tau + β × transitivity_score
kendall_tau:        normalized rank correlation vs gold ranking
                    range [-1.0, +1.0], clipped to [0, 1]
transitivity_score: fraction of (A>B, B>C → A>C) triplets satisfied
                    penalizes logically inconsistent rankings
α = 0.7, β = 0.3    (weighted combination)
```
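
The Kendall-τ term can be sketched as follows. This is an illustration under stated assumptions, not the project's `grade_consistency()`: both rankings are assumed to be strict orderings of the same four IDs, and because the text does not specify how `transitivity_score` is derived from the action, the sketch takes it as an input (defaulting to 1.0) rather than computing it.

```python
from itertools import combinations

ALPHA, BETA = 0.7, 0.3  # weights stated above


def kendall_tau(ranking: list, gold: list) -> float:
    """Kendall rank correlation between two strict rankings of the same items."""
    pos = {item: i for i, item in enumerate(ranking)}
    gold_pos = {item: i for i, item in enumerate(gold)}
    concordant = discordant = 0
    for x, y in combinations(gold, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos[x] - pos[y]) * (gold_pos[x] - gold_pos[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def consistency_reward(ranking: list, gold: list,
                       transitivity_score: float = 1.0) -> float:
    """Weighted sum of clipped Kendall tau and a supplied transitivity score."""
    tau = max(0.0, kendall_tau(ranking, gold))  # clip negatives to 0
    return ALPHA * tau + BETA * transitivity_score
```

With four items there are six pairs, so swapping one adjacent pair of the gold ranking gives τ = (5 − 1) / 6 ≈ 0.67.
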
---
## Reward Functions
| Task | Formula | Range |
|---|---|---|
| Pairwise | Exact match reward table | `{0.0, 0.1, 0.3, 1.0}` |
| Likert | `1 - mean(abs(agent_score - gold_score)) / 4` | `[0.0, 1.0]` |
| Consistency | `0.7 × Kendall-τ + 0.3 × Transitivity` | `[0.0, 1.0]` |

All rewards are bounded to `[0, 1]` and emitted at **every step** (dense signal).
---
## Datasets
| Dataset | Task | Source | Samples |
|---|---|---|---|
| [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Pairwise | Anthropic | ~160K pairs |
| [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) | Likert | OpenBMB | ~64K responses |
| [Stanford SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | Consistency | Stanford | ~385K pairs |
Download all datasets:
```bash
python scripts/prepare_datasets.py
# or with custom sample count:
python scripts/prepare_datasets.py --samples 5000
```
If the JSON files are absent, the environment automatically uses **built-in synthetic data**, so no download is needed for local development.
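
The fallback behaviour can be pictured with the sketch below. It is an assumption about the general pattern, not the environment's actual loader: `load_examples` and `SYNTHETIC_PAIRWISE` are hypothetical names, and the record fields simply mirror those shown in the request lifecycle diagram.

```python
import json
from pathlib import Path

# Hypothetical built-in record; field names follow the lifecycle diagram.
SYNTHETIC_PAIRWISE = [
    {
        "prompt": "Explain backpropagation.",
        "response_a": "A step-by-step explanation ...",
        "response_b": "I don't know.",
        "gold_label": "A",
    }
]


def load_examples(path: str, fallback: list) -> list:
    """Load examples from a JSON file, or return the fallback if it is missing."""
    file = Path(path)
    if file.is_file():
        with file.open(encoding="utf-8") as fh:
            return json.load(fh)
    return fallback


examples = load_examples("data/pairwise_data.json", SYNTHETIC_PAIRWISE)
```
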
---
## Quick Start
### Prerequisites
- Python 3.10+
- `git`
### Local Development
```bash
# 1. Clone
git clone https://github.com/SIBAM890/preferencelab.git
cd preferencelab
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. (Optional) Download real datasets
python scripts/prepare_datasets.py
# 5. Start the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```
Open **http://localhost:8000/web** for the interactive Gradio playground.
### Verify the server is running
```bash
curl http://localhost:8000/health
# → {"status":"healthy"}
curl http://localhost:8000/schema
# → full action / observation JSON schema
```
### Run the baseline inference script
```bash
# Set your API credentials (or use any OpenAI-compatible endpoint)
export HF_TOKEN=hf_your_token_here
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py
```
Expected output format:
```
[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action=choice=A reward=1.00 done=false error=null
[STEP] step=2 action=choice=B reward=0.00 done=false error=null
[STEP] step=3 action=choice=A reward=1.00 done=false error=null
[STEP] step=4 action=choice=A reward=1.00 done=false error=null
[STEP] step=5 action=choice=B reward=0.00 done=true error=null
[END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
```
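
Because the output is line-oriented `key=value` text, it is straightforward to post-process. The sketch below is a hypothetical helper (`parse_step` and `mean_reward` are not part of the repository) that parses `[STEP]` lines of the shape shown above and averages their rewards; in the sample, that average equals the reported `score`.

```python
import re
from statistics import mean
from typing import Optional

STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(\d+)\s+action=(\S+)\s+reward=([0-9.]+)\s+done=(true|false)"
)

SAMPLE = """\
[STEP] step=1 action=choice=A reward=1.00 done=false error=null
[STEP] step=2 action=choice=B reward=0.00 done=false error=null
[STEP] step=3 action=choice=A reward=1.00 done=false error=null
[STEP] step=4 action=choice=A reward=1.00 done=false error=null
[STEP] step=5 action=choice=B reward=0.00 done=true error=null
"""


def parse_step(line: str) -> Optional[dict]:
    """Parse one [STEP] line; return None for any other line."""
    match = STEP_RE.match(line)
    if match is None:
        return None
    return {
        "step": int(match.group(1)),
        "action": match.group(2),
        "reward": float(match.group(3)),
        "done": match.group(4) == "true",
    }


def mean_reward(log_text: str) -> float:
    """Average the rewards of all [STEP] lines in a log."""
    steps = [s for s in map(parse_step, log_text.splitlines()) if s is not None]
    return mean(s["reward"] for s in steps)
```
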
### Run tests
```bash
pytest tests/ -v
# 25 test cases: reset, step, state, graders, concurrency, reproducibility
```
---
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | _(none)_ | Hugging Face API token for LLM inference |
| `API_BASE_URL` | `https://api-inference.huggingface.co/v1` | LLM API endpoint (any OpenAI-compatible URL) |
| `MODEL_NAME` | `meta-llama/Llama-3.1-8B-Instruct` | Model identifier sent to the API |
| `MAX_CONCURRENT_ENVS` | `64` | Maximum parallel WebSocket sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Mount Gradio UI at `/web` |
| `ENV_BASE_URL` | `http://localhost:8000` | PreferenceLab server URL (for remote clients) |
| `ENV_README_PATH` | _(none)_ | Custom path to README for web interface |
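
A minimal sketch of how these variables might be read with the defaults from the table. The `load_settings` helper and its dictionary keys are hypothetical and are not taken from `server/app.py`; `ENV_README_PATH` is omitted for brevity.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping] = None) -> dict:
    """Read configuration from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    return {
        "hf_token": env.get("HF_TOKEN"),  # no default: None when unset
        "api_base_url": env.get(
            "API_BASE_URL", "https://api-inference.huggingface.co/v1"
        ),
        "model_name": env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        "max_concurrent_envs": int(env.get("MAX_CONCURRENT_ENVS", "64")),
        "enable_web_interface": env.get("ENABLE_WEB_INTERFACE", "true").lower()
        == "true",
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:8000"),
    }
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in tests.
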
---
## API Reference
### REST Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Server health check |
| `GET` | `/schema` | Action + Observation JSON schemas |
| `GET` | `/state` | Current episode state |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Submit an action, receive observation |
| `GET` | `/web` | Gradio interactive playground |
| `GET` | `/manifest.json` | PWA web manifest |
### POST /reset
```json
{
  "seed": 42,
  "task_type": "pairwise"
}
```
`task_type` accepts: `"pairwise"` | `"likert"` | `"consistency"` | omit for random.
### POST /step
```json
{
  "action": {
    "choice": "A"
  }
}
```
### Response (all step/reset endpoints)
```json
{
  "observation": {
    "task_id": "abc123_step1",
    "task_type": "pairwise",
    "prompt": "Explain backpropagation.",
    "response_a": "...",
    "response_b": "...",
    "reward": 1.0,
    "done": false,
    "step_count": 1,
    "info": { "verdict": "correct", "gold_label": "A" }
  },
  "reward": 1.0,
  "done": false
}
```
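
For clients that prefer plain HTTP over the OpenEnv client, the request and response shapes above can be handled with the standard library alone. This is a sketch: `post_json` and `summarize` are hypothetical helper names, and `post_json` requires a running server, so it is defined but not called here.

```python
import json
from urllib import request


def post_json(base_url: str, path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON reply (needs a live server)."""
    req = request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize(result: dict) -> tuple:
    """Extract (reward, done, verdict) from a step/reset response."""
    info = result["observation"].get("info", {})
    return result["reward"], result["done"], info.get("verdict", "")


# Example usage against a local server:
#   post_json("http://localhost:8000", "/reset", {"seed": 42, "task_type": "pairwise"})
#   result = post_json("http://localhost:8000", "/step", {"action": {"choice": "A"}})
#   reward, done, verdict = summarize(result)
```
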
### WebSocket
```
ws://localhost:8000/ws
```
The endpoint speaks the OpenEnv WebSocket protocol: send `reset`, `step`, `state`, and `close` messages. It is used by TRL training loops via `MCPToolClient`.
---
## Integration Guide
### Direct Import (Local)
```python
from server.environment import PreferenceLabEnvironment
from models import PairwiseAction, LikertAction, ConsistencyAction
env = PreferenceLabEnvironment()
# Pairwise task
obs = env.reset(seed=42, task_type="pairwise")
print(obs.prompt)
obs = env.step(PairwiseAction(choice="A"))
print(obs.reward, obs.done)
# State (property, not method)
state = env.state
print(state.episode_id, state.step_count)
```
### Using with TRL / GRPO Training
```python
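import asyncio
from openenv.core.env_client import EnvClient
from models import PairwiseAction


def your_policy(obs) -> str:
    """Placeholder policy: replace with your model's prediction ("A" or "B")."""
    return "A"


def train_on(obs, reward: float) -> None:
    """Placeholder update step: replace with your TRL / GRPO training call."""


async def train():
    async with EnvClient("http://localhost:8000") as env:
        obs = await env.reset(task_type="pairwise")
        for step in range(5):
            # Your policy predicts the action
            action = PairwiseAction(choice=your_policy(obs))
            obs = await env.step(action)
            # Reward and done flag are embedded in the observation
            train_on(obs, obs.reward)
            if obs.done:
                break


asyncio.run(train())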
```
### MultiEnv Wrapper (Parallel Sessions)
```python
import asyncio
from openenv.core.env_client import MultiEnvClient


async def main():
    # Spin up 8 parallel sessions on the same server
    async with MultiEnvClient("http://localhost:8000", n=8) as envs:
        observations = await envs.reset_all(task_type="pairwise")
        # envs.step_all(actions) → list of observations


asyncio.run(main())
```
---
## Baseline Scores
Scores produced by `python inference.py` with `meta-llama/Llama-3.1-8B-Instruct`:
| Task | Difficulty | Avg Reward | Notes |
|---|---|---|---|
| Pairwise Ranking | Easy | ~0.60 | Varies by model capability |
| Likert Scoring | Medium | ~0.75 | Continuous signal |
| Consistency Ranking | Hard | ~0.65 | Kendall-tau based |
| **Overall** | โ | **~0.67** | Reproducible with seed=42 |
> Higher scores indicate the model aligns more closely with human preference gold labels.
> Run `python inference.py` to generate fresh scores against your own model.
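
The overall figure is consistent with an unweighted mean of the three per-task averages. The snippet below reproduces that arithmetic, assuming (this is not stated in the table) that no per-task weighting is applied.

```python
from statistics import mean

# Approximate per-task averages from the table above.
task_scores = {"pairwise": 0.60, "likert": 0.75, "consistency": 0.65}

overall = mean(task_scores.values())
print(round(overall, 2))
```
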
---
## Testing
### Run tests
```bash
# Full test suite
pytest tests/ -v
# Specific test classes
pytest tests/test_environment.py::TestPairwiseGrader -v
pytest tests/test_environment.py::TestPreferenceLabEnvironment -v
# Quick smoke test (direct import, no server needed)
python test_api.py
```
**Test coverage (22 test cases across 4 classes):**
- `TestPairwiseGrader`: correct / wrong / skip / tie / range (5 tests)
- `TestLikertGrader`: perfect / worst / partial / random range (4 tests)
- `TestConsistencyGrader`: perfect / reversed / invalid IDs / all perms / no-tie (5 tests)
- `TestPreferenceLabEnvironment`: reset / step / state / seed / episode flow (8 tests)
---
## Deployment
### Docker (Local)
```bash
docker build -t preferencelab .
docker run -p 7860:7860 \
-e HF_TOKEN=hf_your_token \
-e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
-e MAX_CONCURRENT_ENVS=64 \
preferencelab
```
Visit `http://localhost:7860/web`
### Hugging Face Spaces
1. Fork or push this repository to a Hugging Face Space with **Docker SDK**
2. Add the following secrets in Space Settings:
   - `HF_TOKEN`: your Hugging Face token
   - `API_BASE_URL`: inference endpoint (e.g. `https://api-inference.huggingface.co/v1`)
   - `MODEL_NAME`: model to use
The `Dockerfile` handles everything else. Its health check polls `/health` every 30 seconds.
---
## License
This project is licensed under the **MIT License**; see [LICENSE](LICENSE) for details.
---
**Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon**
*Team Nexis*