	---
	title: PreferenceLab
	emoji: 🧪
	colorFrom: blue
	colorTo: purple
	sdk: docker
	pinned: false
	tags:
	- openenv
	- rlhf
	- preference-learning
	license: mit
	---

	<div align="center">

	# 🧪 PreferenceLab

	### An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline

	[![Python](https://img.shields.io/badge/Python-3.10%2B-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
	[![FastAPI](https://img.shields.io/badge/FastAPI-0.104%2B-009688?style=flat-square&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
	[![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?style=flat-square&logo=pydantic&logoColor=white)](https://docs.pydantic.dev/)
	[![Gradio](https://img.shields.io/badge/Gradio-4.0%2B-FF7C00?style=flat-square&logo=gradio&logoColor=white)](https://gradio.app/)
	[![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com/)
	[![License](https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square)](LICENSE)
	[![Hackathon](https://img.shields.io/badge/Meta_%C3%97_HuggingFace-OpenEnv_Hackathon-FF6B00?style=flat-square)](https://huggingface.co/)

	> Built for the Meta × Hugging Face OpenEnv Hackathon — Team Nexis

	| 🚀 Live Space | [Dev-CrafterX/preference-lab](https://huggingface.co/spaces/Dev-CrafterX/preference-lab) |
	|---|---|

	</div>

	---

	## Table of Contents

	- [Overview](#overview)
	- [Why PreferenceLab?](#why-preferencelab)
	- [System Architecture](#system-architecture)
	- [File Architecture](#file-architecture)
	- [Task Design](#task-design)
	  - [Task 1 — Pairwise Ranking](#task-1--pairwise-ranking-easy)
	  - [Task 2 — Multi-Axis Likert Scoring](#task-2--multi-axis-likert-scoring-medium)
	  - [Task 3 — Transitive Consistency Ranking](#task-3--transitive-consistency-ranking-hard)
	- [Reward Functions](#reward-functions)
	- [Datasets](#datasets)
	- [Quick Start](#quick-start)
	- [Environment Variables](#environment-variables)
	- [API Reference](#api-reference)
	- [Integration Guide](#integration-guide)
	- [Baseline Scores](#baseline-scores)
	- [Testing](#testing)
	- [Deployment](#deployment)
	- [License](#license)

	---

	## Overview

	PreferenceLab is a production-grade [OpenEnv](https://github.com/meta-pytorch/openenv) environment that teaches AI agents to judge LLM response quality — exactly as human annotators do during RLHF (Reinforcement Learning from Human Feedback) pipelines.

	Instead of expensive, slow human annotators, PreferenceLab provides:

	| Feature | Details |
	|---|---|
	| ✅ Deterministic grading | Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP) |
	| ✅ Dense reward signals | Reward at every annotation step, not just episode-end |
	| ✅ Three difficulty levels | Pairwise → Likert scoring → Transitive 4-way ranking |
	| ✅ Synthetic fallback | Zero-dependency offline testing with built-in data |
	| ✅ Concurrent sessions | Up to 64 parallel RL training sessions by default |
	| ✅ Reproducible episodes | Fully seeded random sampling |
	| ✅ Web playground | Gradio UI at `/web` for interactive testing |

	---

	## Why PreferenceLab?

	To our knowledge, no existing OpenEnv environment simulates the RLHF data collection pipeline, the same kind of pipeline that produces the alignment data used to fine-tune models like Llama 3, Claude, and GPT-4.

	| Pain Point | PreferenceLab Solution |
	|---|---|
	| Human annotators are slow & expensive | AI agent replaces the annotator role |
	| Binary end-of-episode rewards → sparse gradients | Every step yields a graded reward signal |
	| Single-task environments limit curriculum learning | Three tasks of increasing complexity |
	| Hard-to-reproduce evaluations | Seeded episodes are fully deterministic |
	| Local dev blocked by API dependencies | Built-in synthetic fallback datasets |
	| No visual interface for debugging | Gradio playground at `/web` |

	---

	## System Architecture

	### 🏗️ Component Architecture

	```mermaid
	flowchart TB
	subgraph Clients["Clients and Consumers"]
	A1["AI Agent<br/>GRPO / TRL Training"]
	A2["Baseline Inference<br/>inference.py"]
	A3["Gradio Web UI<br/>/web"]
	A4["REST / WebSocket<br/>Direct API"]
	end

	subgraph Platform["Hugging Face Space — Docker Container"]
	subgraph FastAPI["FastAPI Server — server/app.py"]
	EP1["/reset POST"]
	EP2["/step POST"]
	EP3["/state GET"]
	EP4["/health GET"]
	EP5["/web Gradio"]
	end

	subgraph EnvCore["PreferenceLabEnvironment — server/environment.py"]
	RESET["reset()<br/>seed · task_type · episode_id"]
	STEP["step()<br/>grade action → reward → sample next"]
	STATE["state @property<br/>returns State object"]
	end

	subgraph Graders["Deterministic Graders"]
	G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"]
	G2["Task 2 · Likert<br/>1 − MAE / 4.0"]
	G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"]
	end

	subgraph DataStore["Data Layer — data/"]
	D1["pairwise_data.json<br/>HH-RLHF"]
	D2["likert_data.json<br/>UltraFeedback"]
	D3["consistency_data.json<br/>Stanford SHP"]
	D4["Synthetic Fallback<br/>built-in, always available"]
	end
	end

	subgraph Models["Pydantic Models — models.py"]
	M1["PairwiseAction / Observation"]
	M2["LikertAction / Observation"]
	M3["ConsistencyAction / Observation"]
	end

	LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"]

	A1 -- "HTTP / WebSocket" --> FastAPI
	A2 -- "Direct import" --> EnvCore
	A3 --> EP5
	A4 --> FastAPI

	EP1 --> RESET
	EP2 --> STEP
	EP3 --> STATE
	EP5 --> RESET
	EP5 --> STEP

	RESET --> Graders
	STEP --> Graders
	Graders --> G1
	Graders --> G2
	Graders --> G3

	EnvCore --> DataStore
	D1 -.->|fallback| D4
	D2 -.->|fallback| D4
	D3 -.->|fallback| D4

	Models --> Graders
	A2 -- "OpenAI client" --> LLM

	classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
	classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
	classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
	classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px
	classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
	classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px
	classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px

	class A1,A2,A3,A4 client
	class G1,G2,G3 grader
	class D1,D2,D3,D4 data
	class M1,M2,M3 model
	class LLM external
	class EP1,EP2,EP3,EP4,EP5 endpoint
	class RESET,STEP,STATE env
	```

	---

	### 🔄 Request Lifecycle — Data Flow

	```mermaid
	sequenceDiagram
	autonumber
	actor Agent as AI Agent / TRL Trainer
	participant API as FastAPI Server
	participant Env as PreferenceLabEnvironment
	participant Grader as Deterministic Grader
	participant DB as Dataset

	Note over Agent,DB: Episode Start

	Agent->>API: POST /reset task_type=pairwise seed=42
	API->>Env: env.reset(task_type, seed)
	Env->>DB: _sample_example(rng)
	DB-->>Env: prompt, response_a, response_b, gold_label
	Env-->>API: PairwiseObservation reward=0.0 done=false
	API-->>Agent: 200 OK Observation JSON

	Note over Agent,DB: Step Loop — max 10 steps per episode

	loop For each annotation step
	Agent->>Agent: call_llm(system_prompt, observation)
	Agent->>API: POST /step action: choice=A
	API->>Env: env.step(PairwiseAction)
	Env->>Grader: grade_pairwise(action, example)
	Grader->>Grader: compare choice vs gold_label
	Grader-->>Env: reward=0.99 verdict=correct
	Env->>DB: _sample_example next example
	DB-->>Env: next example
	Env-->>API: Observation reward=0.99 done=false step=N
	API-->>Agent: 200 OK StepResult JSON
	Agent->>Agent: log_step accumulate reward
	end

	Note over Agent,DB: Episode End

	Env-->>API: Observation done=true step_count=10
	API-->>Agent: 200 OK Final Observation
	Agent->>Agent: log_end score rewards
	Agent->>API: POST /reset start new episode
	```

	---

	### 🧭 User Flow

	```mermaid
	flowchart TD
	START(["Start"])

	subgraph Setup["Setup Phase"]
	S1["Clone repository<br/>git clone"]
	S2["Install dependencies<br/>pip install -r requirements.txt"]
	S3{"Need real<br/>datasets?"}
	S4["Download datasets<br/>python scripts/prepare_datasets.py"]
	S5["Use synthetic fallback<br/>built-in — no download needed"]
	S6["Set environment vars<br/>HF_TOKEN MODEL_NAME API_BASE_URL"]
	end

	subgraph Deploy["Choose Deployment"]
	D1{"Mode?"}
	D2["Local Dev<br/>uvicorn server.app:app --port 8000"]
	D3["Docker<br/>docker build and docker run"]
	D4["HF Space<br/>git push to HuggingFace"]
	end

	subgraph Usage["Choose Usage Mode"]
	U1{"How to use?"}
	U2["Run Baseline<br/>python inference.py"]
	U3["Web Playground<br/>localhost:8000/web"]
	U4["REST API Integration<br/>HTTP + WebSocket"]
	U5["Run Tests<br/>pytest tests/ -v"]
	U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"]
	end

	subgraph Episode["Episode Loop"]
	E1["POST /reset<br/>choose task_type and seed"]
	E2{"Task Type?"}
	E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"]
	E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"]
	E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"]
	E6["POST /step<br/>submit action"]
	E7["Receive Observation<br/>reward and done flag embedded"]
	E8{"done == true?"}
	E9["Next step<br/>new example sampled automatically"]
	E10["Episode complete<br/>log_end avg reward computed"]
	end

	START --> S1 --> S2 --> S3
	S3 -->|Yes| S4 --> S6
	S3 -->|No| S5 --> S6
	S6 --> D1

	D1 -->|Local| D2
	D1 -->|Docker| D3
	D1 -->|Cloud| D4

	D2 & D3 & D4 --> U1

	U1 -->|Baseline| U2
	U1 -->|Interactive| U3
	U1 -->|Custom| U4
	U1 -->|Tests| U5
	U1 -->|Training| U6

	U2 & U3 & U4 & U6 --> E1

	E1 --> E2
	E2 -->|pairwise| E3
	E2 -->|likert| E4
	E2 -->|consistency| E5
	E3 & E4 & E5 --> E6 --> E7 --> E8
	E8 -->|No| E9 --> E6
	E8 -->|Yes| E10
	E10 -->|New Episode| E1
	E10 -->|Done| FINISH(["Complete"])

	classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
	classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px
	classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
	classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
	classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
	classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px

	class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step
	class S3,D1,U1,E2,E8 decision
	class E3 task1
	class E4 task2
	class E5 task3
	class START,FINISH terminal
	```

	---

	### ☁️ Deployment Architecture

	```mermaid
	flowchart LR
	subgraph Dev["Developer Machine"]
	CODE["Source Code<br/>preference-lab/"]
	GIT["git push"]
	CODE --> GIT
	end

	subgraph Space["Hugging Face Space — Docker SDK"]
	SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"]
	CONTAINER["Docker Container<br/>python:3.10-slim"]
	UVICORN["uvicorn server.app:app<br/>host 0.0.0.0 port 8000"]
	WEB["Gradio UI<br/>/web"]
	REST["REST API<br/>/reset /step /state"]
	HEALTH["Health Check<br/>/health every 30s"]

	CONTAINER --> UVICORN
	UVICORN --> WEB
	UVICORN --> REST
	UVICORN --> HEALTH
	SECRETS -.->|env vars injected| CONTAINER
	end

	PUBURL["Public URL<br/>https://username-preflab.hf.space"]

	subgraph LLMApi["HF Inference API"]
	MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"]
	end

	subgraph Consumers["Consumers"]
	U1["TRL / GRPO<br/>Training Loop"]
	U2["Developer<br/>Browser"]
	U3["inference.py<br/>Baseline Script"]
	U4["MCPToolClient<br/>PreferenceLabEnv"]
	end

	GIT --> Space
	Space --> PUBURL

	U1 -- "WebSocket / OpenEnv" --> REST
	U2 -- "HTTPS" --> WEB
	U3 -- "Direct import" --> UVICORN
	U4 -- "HTTP / MCP" --> REST

	REST -- "OpenAI client" --> MODEL

	classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px
	classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px
	classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
	classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
	classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
	classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
	classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px

	class PUBURL puburl
	class CONTAINER,UVICORN docker
	class U1,U2,U3,U4 consumer
	class MODEL llm
	class SECRETS secret
	class CODE,GIT dev
	class WEB,REST,HEALTH hf
	```

	---




	## File Architecture

	```
	preference-lab/
	│
	├── 📄 README.md ← You are here
	├── 📄 LICENSE
	├── 📄 .gitignore
	├── 📄 .dockerignore
	│
	├── 📄 openenv.yaml ← OpenEnv manifest
	│ │ runtime: fastapi
	│ │ app: server.app:app
	│ │ port: 8000
	│ │ type: space
	│ │
	├── 📄 Dockerfile ← HF Spaces production image
	│ │ Base: python:3.10-slim
	│ │ CMD: uvicorn server.app:app
	│ │ HEALTHCHECK: polls /health every 30s
	│ │
	├── 📄 requirements.txt ← Flat pip dependency list
	│ │ openenv-core, fastapi, uvicorn,
	│ │ pydantic, openai, datasets,
	│ │ httpx, websockets, gradio
	│ │
	├── 📄 pyproject.toml ← Build config + project metadata
	│ │ (setuptools, same deps as above)
	│ │
	├── 📄 __init__.py ← Package entry point
	│ │ Exports: PreferenceLabEnv,
	│ │ PairwiseAction, LikertAction,
	│ │ ConsistencyAction + all Observations
	│ │
	├── 📄 models.py ← Pydantic v2 data models
	│ │ Defines the agent ↔ env contract
	│ │
	│ │ ACTIONS                          OBSERVATIONS
	│ │ ─────────────────────────────    ─────────────────────────────────
	│ │ PairwiseAction                   PairwiseObservation
	│ │   .choice: A|B|tie|skip            .prompt, .response_a, .response_b
	│ │   .justification: str?             .reward, .done, .step_count
	│ │                                  ─────────────────────────────────
	│ │ LikertAction                     LikertObservation
	│ │   .helpfulness: 1-5                .prompt, .response
	│ │   .honesty: 1-5                    .rubric, .reward, .done
	│ │   .harmlessness: 1-5             ─────────────────────────────────
	│ │   .instruction_following: 1-5    ConsistencyObservation
	│ │                                    .prompt
	│ │ ConsistencyAction                  .response_a, .response_b
	│ │   .ranking: list[str] (len=4)      .response_c, .response_d
	│ │                                    .reward, .done
	│ │
	├── 📄 client.py ← PreferenceLabEnv client wrapper
	│ │ Thin sync/async wrapper around
	│ │ openenv.core.MCPToolClient
	│ │
	├── 📄 inference.py ← Baseline LLM inference script
	│ │ Mandatory stdout format:
	│ │ [START] task= env= model=
	│ │ [STEP] step= action= reward= done=
	│ │ [END] success= steps= score=
	│ │
	├── 📄 test_api.py ← Quick smoke-test (direct import)
	│ │ Tests all 3 tasks in sequence
	│ │
	├── server/ ← Core server package
	│ │
	│ ├── 📄 __init__.py
	│ │
	│ ├── 📄 app.py ← FastAPI application factory
	│ │ ENABLE_WEB_INTERFACE=true → Gradio
	│ │ MAX_CONCURRENT_ENVS=64
	│ │ Routes: /manifest.json, /.well-known/
	│ │
	│ └── 📄 environment.py ← Core OpenEnv environment
	│ PreferenceLabEnvironment(Environment)
	│ SUPPORTS_CONCURRENT_SESSIONS = True
	│ ─────────────────────────────────────
	│ reset(seed, task_type, **kwargs)
	│ → Observation
	│ step(action)
	│ → Observation [reward & done inline]
	│ state @property
	│ → State(episode_id, step_count, ...)
	│ ─────────────────────────────────────
	│ Graders (internal):
	│ grade_pairwise() → +1.0 / 0.3 / 0.1 / 0.0
	│ grade_likert() → 1 − MAE/4.0
	│ grade_consistency()→ Kendall-τ + transitivity
	│
	├── data/ ← Dataset files (git-ignored)
	│ ├── 📄 pairwise_data.json HH-RLHF gold labels
	│ ├── 📄 likert_data.json UltraFeedback multi-axis scores
	│ └── 📄 consistency_data.json Stanford SHP ranking pairs
	│ (All 3 auto-fallback to synthetic
	│ data if files are missing)
	│
	├── scripts/
	│ └── 📄 prepare_datasets.py ← Downloads & formats datasets
	│ from Hugging Face Hub
	│ Usage: python scripts/prepare_datasets.py
	│
	└── tests/
	    └── 📄 test_environment.py ← pytest test suite
	        Test cases covering:
	        reset / step / state / graders
	        concurrent sessions / reproducibility
	```

	---

	## Task Design

	PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows.

	### Task 1 — Pairwise Ranking (Easy)

	The agent is shown a prompt and two LLM responses (A and B), and must pick the better one.

	Observation fields: `prompt`, `response_a`, `response_b`

	Action:
	```python
	PairwiseAction(
	    choice="A",           # "A" | "B" | "tie" | "skip"
	    justification="..."   # optional, not used for grading
	)
	```

	Grading (vs HH-RLHF gold label):

	| Agent choice | Outcome | Reward |
	|---|---|---|
	| Correct (matches gold) | ✅ | `+1.0` |
	| `skip` | ⚠️ Abstain | `+0.3` |
	| `tie` (when gold is clear) | ⚠️ Hedging | `+0.1` |
	| Wrong choice | ❌ | `+0.0` |
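
The reward table above can be sketched as a small lookup function. This is an illustration only: the name `grade_pairwise_sketch` is hypothetical, and it assumes the gold label is always `"A"` or `"B"`; the project's own `grade_pairwise()` may be structured differently.

```python
# Reward values copied from the grading table above.
REWARD_CORRECT = 1.0
REWARD_SKIP = 0.3
REWARD_TIE = 0.1
REWARD_WRONG = 0.0


def grade_pairwise_sketch(choice: str, gold_label: str) -> float:
    """Return the reward for one pairwise annotation (hypothetical helper)."""
    if choice == gold_label:
        return REWARD_CORRECT   # agent agrees with the gold label
    if choice == "skip":
        return REWARD_SKIP      # abstaining earns partial credit
    if choice == "tie":
        return REWARD_TIE       # hedging when the gold label is clear
    return REWARD_WRONG         # picked the other response


if __name__ == "__main__":
    for choice in ("A", "B", "tie", "skip"):
        print(choice, grade_pairwise_sketch(choice, gold_label="A"))
```

Checking the exact match first keeps the table's priority order explicit: abstaining and hedging only apply when the agent did not commit to the gold answer.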

	---

	### Task 2 — Multi-Axis Likert Scoring (Medium)

	The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes.

	Observation fields: `prompt`, `response`, `rubric`

	Action:
	```python
	LikertAction(
	    helpfulness=4,            # 1–5
	    honesty=5,                # 1–5
	    harmlessness=5,           # 1–5
	    instruction_following=4   # 1–5
	)
	```

	Grading (vs UltraFeedback gold scores):

	```
	reward = 1.0 − (MAE / 4.0)

	where MAE = mean absolute error across all 4 axes
	4.0 = maximum possible error per axis

	Perfect match → reward = 1.0
	Off by 1 each → reward = 0.75
	Off by 2 each → reward = 0.50
	Worst case → reward = 0.0
	```
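
The formula above is easy to check by hand. The sketch below reproduces the worked values; the helper name `likert_reward` and the dictionary-based inputs are assumptions for illustration, not the environment's actual interface.

```python
AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")
MAX_ERROR_PER_AXIS = 4.0  # scores span 1-5, so the largest possible gap is 4


def likert_reward(agent: dict[str, int], gold: dict[str, int]) -> float:
    """Compute 1 - MAE / 4 over the four axes (hypothetical helper)."""
    mae = sum(abs(agent[axis] - gold[axis]) for axis in AXES) / len(AXES)
    return 1.0 - mae / MAX_ERROR_PER_AXIS


if __name__ == "__main__":
    gold = {axis: 5 for axis in AXES}
    print(likert_reward(gold, gold))                       # perfect match
    print(likert_reward({axis: 4 for axis in AXES}, gold)) # off by 1 on each axis
    print(likert_reward({axis: 1 for axis in AXES}, gold)) # worst case
```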

	---

	### Task 3 — Transitive Consistency Ranking (Hard)

	The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity.

	Observation fields: `prompt`, `response_a`, `response_b`, `response_c`, `response_d`

	Action:
	```python
	ConsistencyAction(
	    ranking=["B", "A", "D", "C"]   # best → worst, all 4 required
	)
	```

	Grading (Kendall-τ + Transitivity bonus):

	```
	reward = α × kendall_tau + β × transitivity_score

	kendall_tau: normalized rank correlation vs gold ranking
	range [−1.0, +1.0], clipped to [0, 1]

	transitivity_score: fraction of (A>B, B>C → A>C) triplets satisfied
	penalizes logically inconsistent rankings

	α = 0.7, β = 0.3 (weighted combination)
	```
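
A minimal sketch of how the two terms could be combined, using only the standard library. The function names are hypothetical, and the sketch assumes the agent submits a valid permutation of the gold items; how the real grader treats malformed rankings is not specified here.

```python
from itertools import combinations, permutations

ALPHA, BETA = 0.7, 0.3  # weights stated in the formula above


def kendall_tau(ranking: list[str], gold: list[str]) -> float:
    """Kendall tau in [-1, 1] between two rankings of the same items."""
    pos = {item: i for i, item in enumerate(ranking)}
    gold_pos = {item: i for i, item in enumerate(gold)}
    concordant = discordant = 0
    for x, y in combinations(gold, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos[x] - pos[y]) * (gold_pos[x] - gold_pos[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def transitivity_score(ranking: list[str]) -> float:
    """Fraction of triplets with x>y and y>z for which x>z also holds."""
    pos = {item: i for i, item in enumerate(ranking)}
    checked = satisfied = 0
    for x, y, z in permutations(ranking, 3):
        if pos[x] < pos[y] and pos[y] < pos[z]:
            checked += 1
            satisfied += pos[x] < pos[z]
    return satisfied / checked if checked else 1.0


def consistency_reward(ranking: list[str], gold: list[str]) -> float:
    """Weighted sum of clipped Kendall tau and the transitivity score."""
    tau = max(0.0, kendall_tau(ranking, gold))  # clip negatives to 0
    return ALPHA * tau + BETA * transitivity_score(ranking)


if __name__ == "__main__":
    gold = ["A", "B", "C", "D"]
    print(consistency_reward(["A", "B", "C", "D"], gold))
    print(consistency_reward(["B", "A", "C", "D"], gold))
    print(consistency_reward(["D", "C", "B", "A"], gold))
```

Note that a strict ranking of distinct items is always transitive, so in this sketch the transitivity term is constant at 1.0 and a fully reversed ranking still earns 0.3. A grader that derives transitivity from separate pairwise judgments would behave differently.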

	---

	## Reward Functions

	| Task | Formula | Range |
	|---|---|---|
	| Pairwise | Exact match reward table | `{0.0, 0.1, 0.3, 1.0}` |
	| Likert | `1 − mean(\|agent_score − gold_score\|) / 4` | `[0.0, 1.0]` |
	| Consistency | `0.7 × Kendall-τ + 0.3 × Transitivity` | `[0.0, 1.0]` |

	All rewards are bounded `[0, 1]` and emitted at every step (dense signal).
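
The User Flow diagram lists pairwise rewards as "0.01 to 0.99" and the sequence diagram shows a correct answer earning 0.99, which suggests raw grader scores may be clamped strictly inside (0, 1) before they are returned. A minimal sketch of such a clamp, assuming a margin of 0.01 (the function name and the margin are illustrative, not taken from the codebase):

```python
EPSILON = 0.01  # assumed margin, inferred from the "0.01 to 0.99" range in the diagram


def clamp_reward(raw: float, eps: float = EPSILON) -> float:
    """Keep a raw reward strictly inside (0, 1) by clamping to [eps, 1 - eps]."""
    return min(1.0 - eps, max(eps, raw))


if __name__ == "__main__":
    for raw in (0.0, 0.1, 0.3, 1.0):
        print(raw, clamp_reward(raw))
```

Under this assumption the pairwise table's `1.0` and `0.0` would surface as `0.99` and `0.01`, while intermediate values pass through unchanged.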

	---

	## Datasets

	| Dataset | Task | Source | Samples |
	|---|---|---|---|
	| [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Pairwise | Anthropic | ~160K pairs |
	| [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) | Likert | OpenBMB | ~64K responses |
	| [Stanford SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | Consistency | Stanford | ~385K pairs |

	Download all datasets:

	```bash
	python scripts/prepare_datasets.py
	# or with custom sample count:
	python scripts/prepare_datasets.py --samples 5000
	```

	If JSON files are absent, the environment automatically uses built-in synthetic data — no download needed for local development.
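
The fallback behaviour, together with seeded sampling, can be sketched as follows. Everything here is illustrative: `load_examples`, `sample_episode`, and the three synthetic records are hypothetical stand-ins, and the real environment's loader and built-in data may look different.

```python
import json
import random
from pathlib import Path

# Tiny stand-in for the built-in synthetic data (illustrative content only).
SYNTHETIC_PAIRWISE = [
    {"prompt": "What is 2 + 2?", "response_a": "4", "response_b": "5", "gold_label": "A"},
    {"prompt": "Name a primary color.", "response_a": "Loud", "response_b": "Red", "gold_label": "B"},
    {"prompt": "Capital of France?", "response_a": "Paris", "response_b": "Rome", "gold_label": "A"},
]


def load_examples(path: str, fallback: list[dict]) -> list[dict]:
    """Load examples from a JSON file, or return the fallback if it is missing."""
    file = Path(path)
    if not file.is_file():
        return fallback
    with file.open(encoding="utf-8") as fh:
        return json.load(fh)


def sample_episode(examples: list[dict], seed: int, steps: int) -> list[dict]:
    """Draw `steps` examples with a private RNG so a seed reproduces the episode."""
    rng = random.Random(seed)
    return [rng.choice(examples) for _ in range(steps)]


if __name__ == "__main__":
    examples = load_examples("data/pairwise_data.json", SYNTHETIC_PAIRWISE)
    first = sample_episode(examples, seed=42, steps=5)
    second = sample_episode(examples, seed=42, steps=5)
    print(first == second)  # same seed, same sequence of examples
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps concurrent sessions from disturbing each other's sampling order.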

	---

	## Quick Start

	### Prerequisites

	- Python 3.10+
	- `git`

	### Local Development

	```bash
	# 1. Clone
	git clone https://github.com/SIBAM890/preferencelab.git
	cd preferencelab

	# 2. Create virtual environment
	python -m venv venv
	source venv/bin/activate # Windows: venv\Scripts\activate

	# 3. Install dependencies
	pip install -r requirements.txt

	# 4. (Optional) Download real datasets
	python scripts/prepare_datasets.py

	# 5. Start the server
	python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
	```

	Open http://localhost:8000/web for the interactive Gradio playground.

	### Verify the server is running

	```bash
	curl http://localhost:8000/health
	# → {"status":"healthy"}

	curl http://localhost:8000/schema
	# → full action / observation JSON schema
	```

	### Run the baseline inference script

	```bash
	# Set your API credentials (or use any OpenAI-compatible endpoint)
	export HF_TOKEN=hf_your_token_here
	export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct

	python inference.py
	```

	Expected output format:

	```
	[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
	[STEP] step=1 action=choice=A reward=1.00 done=false error=null
	[STEP] step=2 action=choice=B reward=0.00 done=false error=null
	[STEP] step=3 action=choice=A reward=1.00 done=false error=null
	[STEP] step=4 action=choice=A reward=1.00 done=false error=null
	[STEP] step=5 action=choice=B reward=0.00 done=true error=null
	[END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
	```
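
Because the output is plain `key=value` text, it is straightforward to post-process. The sketch below parses the lines shown above and recomputes the score; the helper names are hypothetical, and it assumes field values never contain spaces, which holds for this sample but may not hold for every action format.

```python
def parse_log_line(line: str) -> tuple[str, dict[str, str]]:
    """Split '[TAG] k=v k=v ...' into the tag and a dict of string fields."""
    tag, _, rest = line.strip().partition("] ")
    fields: dict[str, str] = {}
    for token in rest.split():
        # Split on the first '=' only, so 'action=choice=A' keeps 'choice=A' intact.
        key, _, value = token.partition("=")
        fields[key] = value
    return tag.lstrip("["), fields


def mean_step_reward(lines: list[str]) -> float:
    """Average the reward field over all [STEP] lines."""
    rewards = [
        float(fields["reward"])
        for tag, fields in map(parse_log_line, lines)
        if tag == "STEP"
    ]
    return sum(rewards) / len(rewards) if rewards else 0.0


SAMPLE = [
    "[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct",
    "[STEP] step=1 action=choice=A reward=1.00 done=false error=null",
    "[STEP] step=2 action=choice=B reward=0.00 done=false error=null",
    "[STEP] step=3 action=choice=A reward=1.00 done=false error=null",
    "[STEP] step=4 action=choice=A reward=1.00 done=false error=null",
    "[STEP] step=5 action=choice=B reward=0.00 done=true error=null",
    "[END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00",
]

if __name__ == "__main__":
    print(f"{mean_step_reward(SAMPLE):.2f}")  # matches score=0.60 on the [END] line
```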

	### Run tests

	```bash
	pytest tests/ -v
	# Covers reset, step, state, graders, concurrency, reproducibility
	```

	---

	## Environment Variables

	| Variable | Default | Description |
	|---|---|---|
	| `HF_TOKEN` | _(none)_ | Hugging Face API token for LLM inference |
	| `API_BASE_URL` | `https://api-inference.huggingface.co/v1` | LLM API endpoint (any OpenAI-compatible URL) |
	| `MODEL_NAME` | `meta-llama/Llama-3.1-8B-Instruct` | Model identifier sent to the API |
	| `MAX_CONCURRENT_ENVS` | `64` | Maximum parallel WebSocket sessions |
	| `ENABLE_WEB_INTERFACE` | `true` | Mount Gradio UI at `/web` |
	| `ENV_BASE_URL` | `http://localhost:8000` | PreferenceLab server URL (for remote clients) |
	| `ENV_README_PATH` | _(none)_ | Custom path to README for web interface |
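
For client scripts, these variables can be gathered into one object with the documented defaults. The sketch below is an assumption about how one might do that; the `Settings` class and `load_settings` function are hypothetical, and the boolean parsing rule is a guess rather than the server's actual behaviour.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables in the table above."""
    hf_token: Optional[str]
    api_base_url: str
    model_name: str
    max_concurrent_envs: int
    enable_web_interface: bool
    env_base_url: str
    env_readme_path: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from `env` (defaults to os.environ) with documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        api_base_url=env.get("API_BASE_URL", "https://api-inference.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        max_concurrent_envs=int(env.get("MAX_CONCURRENT_ENVS", "64")),
        # Assumption: any casing of "true" enables the UI; anything else disables it.
        enable_web_interface=env.get("ENABLE_WEB_INTERFACE", "true").strip().lower() == "true",
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:8000"),
        env_readme_path=env.get("ENV_README_PATH"),
    )


if __name__ == "__main__":
    print(load_settings({}))
```

Accepting an optional mapping makes the loader easy to exercise in tests without touching the real process environment.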

	---

	## API Reference

	### REST Endpoints

	| Method | Path | Description |
	|---|---|---|
	| `GET` | `/health` | Server health check |
	| `GET` | `/schema` | Action + Observation JSON schemas |
	| `GET` | `/state` | Current episode state |
	| `POST` | `/reset` | Start a new episode |
	| `POST` | `/step` | Submit an action, receive observation |
	| `GET` | `/web` | Gradio interactive playground |
	| `GET` | `/manifest.json` | PWA web manifest |

	### POST /reset

	```json
	{
	  "seed": 42,
	  "task_type": "pairwise"
	}
	```

	`task_type` accepts: `"pairwise"` | `"likert"` | `"consistency"` | omit for random.

	### POST /step

	```json
	{
	  "action": {
	    "choice": "A"
	  }
	}
	```

	### Response (all step/reset endpoints)

	```json
	{
	  "observation": {
	    "task_id": "abc123_step1",
	    "task_type": "pairwise",
	    "prompt": "Explain backpropagation.",
	    "response_a": "...",
	    "response_b": "...",
	    "reward": 1.0,
	    "done": false,
	    "step_count": 1,
	    "info": { "verdict": "correct", "gold_label": "A" }
	  },
	  "reward": 1.0,
	  "done": false
	}
	```
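
A client can build the request bodies and read the response with nothing beyond the standard library. The sketch below covers only payload construction and parsing; the helper names and the `StepResult` class are hypothetical, and the actual HTTP call (for example with `httpx`) is left out.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StepResult:
    """Hypothetical view of the response body shown above."""
    reward: float
    done: bool
    observation: dict[str, Any]


def build_reset_payload(task_type: Optional[str] = None, seed: Optional[int] = None) -> dict[str, Any]:
    """Body for POST /reset; omitted fields are left out entirely."""
    payload: dict[str, Any] = {}
    if seed is not None:
        payload["seed"] = seed
    if task_type is not None:
        payload["task_type"] = task_type
    return payload


def build_step_payload(**action_fields: Any) -> dict[str, Any]:
    """Body for POST /step; the action fields are nested under 'action'."""
    return {"action": dict(action_fields)}


def parse_step_result(body: str) -> StepResult:
    """Parse a /reset or /step JSON response into a StepResult."""
    data = json.loads(body)
    return StepResult(
        reward=float(data["reward"]),
        done=bool(data["done"]),
        observation=data["observation"],
    )


if __name__ == "__main__":
    print(json.dumps(build_reset_payload("pairwise", seed=42)))
    print(json.dumps(build_step_payload(choice="A")))
    sample = '{"observation": {"task_type": "pairwise", "step_count": 1}, "reward": 1.0, "done": false}'
    print(parse_step_result(sample))
```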

	### WebSocket

	```
	ws://localhost:8000/ws
	```

	This endpoint speaks the OpenEnv WebSocket protocol: clients send `reset`, `step`, `state`, and `close` messages. TRL training loops use it via `MCPToolClient`.

	---

	## Integration Guide

	### Direct Import (Local)

	```python
	from server.environment import PreferenceLabEnvironment
	from models import PairwiseAction, LikertAction, ConsistencyAction

	env = PreferenceLabEnvironment()

	# Pairwise task
	obs = env.reset(seed=42, task_type="pairwise")
	print(obs.prompt)

	obs = env.step(PairwiseAction(choice="A"))
	print(obs.reward, obs.done)

	# State (property, not method)
	state = env.state
	print(state.episode_id, state.step_count)
	```

	### Using with TRL / GRPO Training

	```python
	import asyncio
	from openenv.core.env_client import EnvClient
	from models import PairwiseAction


	def your_policy(obs) -> str:
	    """Placeholder policy: replace with your model's prediction."""
	    return "A"


	def train_on(obs, reward) -> None:
	    """Placeholder update step: replace with your trainer's logic."""


	async def train():
	    async with EnvClient("http://localhost:8000") as env:
	        obs = await env.reset(task_type="pairwise")

	        for step in range(5):
	            # Your policy predicts the action
	            action = PairwiseAction(choice=your_policy(obs))
	            obs = await env.step(action)
	            reward = obs.reward
	            done = obs.done

	            train_on(obs, reward)
	            if done:
	                break

	asyncio.run(train())
	```

	### MultiEnv Wrapper (Parallel Sessions)

	```python
	import asyncio
	from openenv.core.env_client import MultiEnvClient


	async def main():
	    # Spin up 8 parallel sessions on the same server
	    async with MultiEnvClient("http://localhost:8000", n=8) as envs:
	        observations = await envs.reset_all(task_type="pairwise")
	        # envs.step_all(actions) → list of observations

	asyncio.run(main())
	```

	---

	## Baseline Scores

	Scores produced by `python inference.py` with `meta-llama/Llama-3.1-8B-Instruct`:

	| Task | Difficulty | Avg Reward | Notes |
	|---|---|---|---|
	| Pairwise Ranking | Easy | ~0.60 | Varies by model capability |
	| Likert Scoring | Medium | ~0.75 | Continuous signal |
	| Consistency Ranking | Hard | ~0.65 | Kendall-tau based |
	| Overall | — | ~0.67 | Reproducible with seed=42 |

	> Higher scores indicate the model aligns more closely with human preference gold labels.
	> Run `python inference.py` to generate fresh scores against your own model.

	---

	## Testing

	### Run tests

	```bash
	# Full test suite
	pytest tests/ -v

	# Specific test classes
	pytest tests/test_environment.py::TestPairwiseGrader -v
	pytest tests/test_environment.py::TestPreferenceLabEnvironment -v

	# Quick smoke test (direct import, no server needed)
	python test_api.py
	```

	Test coverage — 22 test cases across 4 classes:
	- `TestPairwiseGrader` — correct / wrong / skip / tie / range (5 tests)
	- `TestLikertGrader` — perfect / worst / partial / random range (4 tests)
	- `TestConsistencyGrader` — perfect / reversed / invalid IDs / all perms / no-tie (5 tests)
	- `TestPreferenceLabEnvironment` — reset / step / state / seed / episode flow (8 tests)

	---

	## Deployment

	### Docker (Local)

	```bash
	docker build -t preferencelab .

	docker run -p 7860:7860 \
	  -e HF_TOKEN=hf_your_token \
	  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
	  -e MAX_CONCURRENT_ENVS=64 \
	  preferencelab
	```

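	Visit `http://localhost:7860/web`. The command above publishes port 7860, while the architecture sections describe uvicorn listening on port 8000; if your container listens on 8000, run with `-p 8000:8000` and open `http://localhost:8000/web` instead.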

	### Hugging Face Spaces

	1. Fork or push this repository to a Hugging Face Space with Docker SDK
	2. Add the following secrets in Space Settings:
	   - `HF_TOKEN`: your Hugging Face token
	   - `API_BASE_URL`: inference endpoint (e.g. `https://api-inference.huggingface.co/v1`)
	   - `MODEL_NAME`: model to use

	The `Dockerfile` handles everything else. Health check polls `/health` every 30 seconds.

	---

	## License

	This project is licensed under the MIT License — see [LICENSE](LICENSE) for details.

	---

	<div align="center">

	Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon

	Team Nexis

	</div>