| title: PreferenceLab | |
| emoji: 🧪 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| pinned: false | |
| tags: | |
| - openenv | |
| - rlhf | |
| - preference-learning | |
| license: mit | |
| <div align="center"> | |
| # 🧪 PreferenceLab | |
| ### An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline | |
| > **Built for the Meta × Hugging Face OpenEnv Hackathon by Team Nexis** | |
| | **Live Space** | [Dev-CrafterX/preference-lab](https://huggingface.co/spaces/Dev-CrafterX/preference-lab) | | |
| |---|---| | |
| </div> | |
| --- | |
| ## Table of Contents | |
| - [Overview](#overview) | |
| - [Why PreferenceLab?](#why-preferencelab) | |
| - [System Architecture](#system-architecture) | |
| - [File Architecture](#file-architecture) | |
| - [Task Design](#task-design) | |
| - [Task 1 – Pairwise Ranking](#task-1--pairwise-ranking-easy) | |
| - [Task 2 – Multi-Axis Likert Scoring](#task-2--multi-axis-likert-scoring-medium) | |
| - [Task 3 – Transitive Consistency Ranking](#task-3--transitive-consistency-ranking-hard) | |
| - [Reward Functions](#reward-functions) | |
| - [Datasets](#datasets) | |
| - [Quick Start](#quick-start) | |
| - [Environment Variables](#environment-variables) | |
| - [API Reference](#api-reference) | |
| - [Integration Guide](#integration-guide) | |
| - [Baseline Scores](#baseline-scores) | |
| - [Testing](#testing) | |
| - [Deployment](#deployment) | |
| - [License](#license) | |
| --- | |
| ## Overview | |
| **PreferenceLab** is a production-grade [OpenEnv](https://github.com/meta-pytorch/openenv) environment that teaches AI agents to judge LLM response quality the way human annotators do in RLHF (Reinforcement Learning from Human Feedback) pipelines. | |
| In place of slow, expensive human annotation, PreferenceLab provides: | |
| | Feature | Details | | |
| |---|---| | |
| | ✅ **Deterministic grading** | Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP) | | |
| | ✅ **Dense reward signals** | Reward at every annotation step, not just episode-end | | |
| | ✅ **Three difficulty levels** | Pairwise → Likert scoring → Transitive 4-way ranking | | |
| | ✅ **Synthetic fallback** | Zero-dependency offline testing with built-in data | | |
| | ✅ **Concurrent sessions** | Up to 64 parallel RL training sessions by default | | |
| | ✅ **Reproducible episodes** | Fully seeded random sampling | | |
| | ✅ **Web playground** | Gradio UI at `/web` for interactive testing | | |
| --- | |
| ## Why PreferenceLab? | |
| There are **zero existing OpenEnv environments** that simulate the RLHF data collection pipeline, the same pipeline that produces the alignment data used to fine-tune models like Llama 3, Claude, and GPT-4. | |
| | Pain Point | PreferenceLab Solution | | |
| |---|---| | |
| | Human annotators are slow & expensive | AI agent replaces the annotator role | | |
| | Binary end-of-episode rewards → sparse gradients | Every step yields a graded reward signal | | |
| | Single-task environments limit curriculum learning | Three tasks of increasing complexity | | |
| | Hard-to-reproduce evaluations | Seeded episodes are fully deterministic | | |
| | Local dev blocked by API dependencies | Built-in synthetic fallback datasets | | |
| | No visual interface for debugging | Gradio playground at `/web` | | |
| --- | |
| ## System Architecture | |
| ### Component Architecture | |
| ```mermaid | |
| flowchart TB | |
| subgraph Clients["Clients and Consumers"] | |
| A1["AI Agent<br/>GRPO / TRL Training"] | |
| A2["Baseline Inference<br/>inference.py"] | |
| A3["Gradio Web UI<br/>/web"] | |
| A4["REST / WebSocket<br/>Direct API"] | |
| end | |
| subgraph Platform["Hugging Face Space - Docker Container"] | |
| subgraph FastAPI["FastAPI Server - server/app.py"] | |
| EP1["/reset POST"] | |
| EP2["/step POST"] | |
| EP3["/state GET"] | |
| EP4["/health GET"] | |
| EP5["/web Gradio"] | |
| end | |
| subgraph EnvCore["PreferenceLabEnvironment - server/environment.py"] | |
| RESET["reset()<br/>seed · task_type · episode_id"] | |
| STEP["step()<br/>grade action → reward → sample next"] | |
| STATE["state @property<br/>returns State object"] | |
| end | |
| subgraph Graders["Deterministic Graders"] | |
| G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"] | |
| G2["Task 2 · Likert<br/>1 - MAE / 4.0"] | |
| G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"] | |
| end | |
| subgraph DataStore["Data Layer - data/"] | |
| D1["pairwise_data.json<br/>HH-RLHF"] | |
| D2["likert_data.json<br/>UltraFeedback"] | |
| D3["consistency_data.json<br/>Stanford SHP"] | |
| D4["Synthetic Fallback<br/>built-in, always available"] | |
| end | |
| end | |
| subgraph Models["Pydantic Models - models.py"] | |
| M1["PairwiseAction / Observation"] | |
| M2["LikertAction / Observation"] | |
| M3["ConsistencyAction / Observation"] | |
| end | |
| LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"] | |
| A1 -- "HTTP / WebSocket" --> FastAPI | |
| A2 -- "Direct import" --> EnvCore | |
| A3 --> EP5 | |
| A4 --> FastAPI | |
| EP1 --> RESET | |
| EP2 --> STEP | |
| EP3 --> STATE | |
| EP5 --> RESET | |
| EP5 --> STEP | |
| RESET --> Graders | |
| STEP --> Graders | |
| Graders --> G1 | |
| Graders --> G2 | |
| Graders --> G3 | |
| EnvCore --> DataStore | |
| D1 -.->|fallback| D4 | |
| D2 -.->|fallback| D4 | |
| D3 -.->|fallback| D4 | |
| Models --> Graders | |
| A2 -- "OpenAI client" --> LLM | |
| classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px | |
| classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px | |
| classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px | |
| classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px | |
| classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px | |
| classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px | |
| classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px | |
| class A1,A2,A3,A4 client | |
| class G1,G2,G3 grader | |
| class D1,D2,D3,D4 data | |
| class M1,M2,M3 model | |
| class LLM external | |
| class EP1,EP2,EP3,EP4,EP5 endpoint | |
| class RESET,STEP,STATE env | |
| ``` | |
| --- | |
| ### Request Lifecycle: Data Flow | |
| ```mermaid | |
| sequenceDiagram | |
| autonumber | |
| actor Agent as AI Agent / TRL Trainer | |
| participant API as FastAPI Server | |
| participant Env as PreferenceLabEnvironment | |
| participant Grader as Deterministic Grader | |
| participant DB as Dataset | |
| Note over Agent,DB: Episode Start | |
| Agent->>API: POST /reset task_type=pairwise seed=42 | |
| API->>Env: env.reset(task_type, seed) | |
| Env->>DB: _sample_example(rng) | |
| DB-->>Env: prompt, response_a, response_b, gold_label | |
| Env-->>API: PairwiseObservation reward=0.0 done=false | |
| API-->>Agent: 200 OK Observation JSON | |
| Note over Agent,DB: Step Loop - max 10 steps per episode | |
| loop For each annotation step | |
| Agent->>Agent: call_llm(system_prompt, observation) | |
| Agent->>API: POST /step action: choice=A | |
| API->>Env: env.step(PairwiseAction) | |
| Env->>Grader: grade_pairwise(action, example) | |
| Grader->>Grader: compare choice vs gold_label | |
| Grader-->>Env: reward=0.99 verdict=correct | |
| Env->>DB: _sample_example next example | |
| DB-->>Env: next example | |
| Env-->>API: Observation reward=0.99 done=false step=N | |
| API-->>Agent: 200 OK StepResult JSON | |
| Agent->>Agent: log_step accumulate reward | |
| end | |
| Note over Agent,DB: Episode End | |
| Env-->>API: Observation done=true step_count=10 | |
| API-->>Agent: 200 OK Final Observation | |
| Agent->>Agent: log_end score rewards | |
| Agent->>API: POST /reset start new episode | |
| ``` | |
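The reset/step/done flow in the sequence diagram can be mimicked with a small in-memory stand-in. This is only a sketch: `ToyPairwiseEnv`, `Observation`, `EXAMPLES`, and `run_episode` are hypothetical names, the data is made up, and the real `PreferenceLabEnvironment` may differ in its fields and sampling. It assumes a 10-step cap as noted in the diagram and a seeded `random.Random` for repeatable episodes.

```python
import random
from dataclasses import dataclass

# Hypothetical stand-in data; the real environment samples from its datasets.
EXAMPLES = [
    {"prompt": "Explain recursion.", "gold": "A"},
    {"prompt": "What is overfitting?", "gold": "B"},
    {"prompt": "Define entropy.", "gold": "A"},
]


@dataclass
class Observation:
    prompt: str
    reward: float
    done: bool
    step_count: int


class ToyPairwiseEnv:
    """Illustrative only: mirrors the reset/step/done flow, not the real class."""

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps

    def reset(self, seed: int) -> Observation:
        self._rng = random.Random(seed)  # seeded, so episodes are repeatable
        self._steps = 0
        self._current = self._rng.choice(EXAMPLES)
        return Observation(self._current["prompt"], 0.0, False, 0)

    def step(self, choice: str) -> Observation:
        # Grade the action against the gold label of the current example.
        reward = 1.0 if choice == self._current["gold"] else 0.0
        self._steps += 1
        done = self._steps >= self.max_steps
        self._current = self._rng.choice(EXAMPLES)  # sample the next example
        return Observation(self._current["prompt"], reward, done, self._steps)


def run_episode(seed: int) -> list[float]:
    """Run one full episode with a trivial policy that always answers "A"."""
    env = ToyPairwiseEnv()
    obs = env.reset(seed)
    rewards = []
    while not obs.done:
        obs = env.step("A")
        rewards.append(obs.reward)
    return rewards
```

Calling `run_episode(42)` twice returns the same reward list, which is the property the seeded episodes are meant to provide.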
| --- | |
| ### User Flow | |
| ```mermaid | |
| flowchart TD | |
| START(["Start"]) | |
| subgraph Setup["Setup Phase"] | |
| S1["Clone repository<br/>git clone"] | |
| S2["Install dependencies<br/>pip install -r requirements.txt"] | |
| S3{"Need real<br/>datasets?"} | |
| S4["Download datasets<br/>python scripts/prepare_datasets.py"] | |
| S5["Use synthetic fallback<br/>built-in, no download needed"] | |
| S6["Set environment vars<br/>HF_TOKEN MODEL_NAME API_BASE_URL"] | |
| end | |
| subgraph Deploy["Choose Deployment"] | |
| D1{"Mode?"} | |
| D2["Local Dev<br/>uvicorn server.app:app --port 8000"] | |
| D3["Docker<br/>docker build and docker run"] | |
| D4["HF Space<br/>git push to HuggingFace"] | |
| end | |
| subgraph Usage["Choose Usage Mode"] | |
| U1{"How to use?"} | |
| U2["Run Baseline<br/>python inference.py"] | |
| U3["Web Playground<br/>localhost:8000/web"] | |
| U4["REST API Integration<br/>HTTP + WebSocket"] | |
| U5["Run Tests<br/>pytest tests/ -v"] | |
| U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"] | |
| end | |
| subgraph Episode["Episode Loop"] | |
| E1["POST /reset<br/>choose task_type and seed"] | |
| E2{"Task Type?"} | |
| E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"] | |
| E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"] | |
| E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"] | |
| E6["POST /step<br/>submit action"] | |
| E7["Receive Observation<br/>reward and done flag embedded"] | |
| E8{"done == true?"} | |
| E9["Next step<br/>new example sampled automatically"] | |
| E10["Episode complete<br/>log_end avg reward computed"] | |
| end | |
| START --> S1 --> S2 --> S3 | |
| S3 -->|Yes| S4 --> S6 | |
| S3 -->|No| S5 --> S6 | |
| S6 --> D1 | |
| D1 -->|Local| D2 | |
| D1 -->|Docker| D3 | |
| D1 -->|Cloud| D4 | |
| D2 & D3 & D4 --> U1 | |
| U1 -->|Baseline| U2 | |
| U1 -->|Interactive| U3 | |
| U1 -->|Custom| U4 | |
| U1 -->|Tests| U5 | |
| U1 -->|Training| U6 | |
| U2 & U3 & U4 & U6 --> E1 | |
| E1 --> E2 | |
| E2 -->|pairwise| E3 | |
| E2 -->|likert| E4 | |
| E2 -->|consistency| E5 | |
| E3 & E4 & E5 --> E6 --> E7 --> E8 | |
| E8 -->|No| E9 --> E6 | |
| E8 -->|Yes| E10 | |
| E10 -->|New Episode| E1 | |
| E10 -->|Done| FINISH(["Complete"]) | |
| classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px | |
| classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px | |
| classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px | |
| classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px | |
| classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px | |
| classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px | |
| class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step | |
| class S3,D1,U1,E2,E8 decision | |
| class E3 task1 | |
| class E4 task2 | |
| class E5 task3 | |
| class START,FINISH terminal | |
| ``` | |
| --- | |
| ### Deployment Architecture | |
| ```mermaid | |
| flowchart LR | |
| subgraph Dev["Developer Machine"] | |
| CODE["Source Code<br/>preference-lab/"] | |
| GIT["git push"] | |
| CODE --> GIT | |
| end | |
| subgraph Space["Hugging Face Space - Docker SDK"] | |
| SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"] | |
| CONTAINER["Docker Container<br/>python:3.10-slim"] | |
| UVICORN["uvicorn server.app:app<br/>host 0.0.0.0 port 8000"] | |
| WEB["Gradio UI<br/>/web"] | |
| REST["REST API<br/>/reset /step /state"] | |
| HEALTH["Health Check<br/>/health every 30s"] | |
| CONTAINER --> UVICORN | |
| UVICORN --> WEB | |
| UVICORN --> REST | |
| UVICORN --> HEALTH | |
| SECRETS -.->|env vars injected| CONTAINER | |
| end | |
| PUBURL["Public URL<br/>https://username-preflab.hf.space"] | |
| subgraph LLMApi["HF Inference API"] | |
| MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"] | |
| end | |
| subgraph Consumers["Consumers"] | |
| U1["TRL / GRPO<br/>Training Loop"] | |
| U2["Developer<br/>Browser"] | |
| U3["inference.py<br/>Baseline Script"] | |
| U4["MCPToolClient<br/>PreferenceLabEnv"] | |
| end | |
| GIT --> Space | |
| Space --> PUBURL | |
| U1 -- "WebSocket / OpenEnv" --> REST | |
| U2 -- "HTTPS" --> WEB | |
| U3 -- "Direct import" --> UVICORN | |
| U4 -- "HTTP / MCP" --> REST | |
| U3 -- "OpenAI client" --> MODEL | |
| classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px | |
| classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px | |
| classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px | |
| classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px | |
| classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px | |
| classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px | |
| classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px | |
| class PUBURL puburl | |
| class CONTAINER,UVICORN docker | |
| class U1,U2,U3,U4 consumer | |
| class MODEL llm | |
| class SECRETS secret | |
| class CODE,GIT dev | |
| class WEB,REST,HEALTH hf | |
| ``` | |
| --- | |
| ## File Architecture | |
| ``` | |
| preference-lab/ | |
| │ | |
| ├── README.md                  ← You are here | |
| ├── LICENSE | |
| ├── .gitignore | |
| ├── .dockerignore | |
| │ | |
| ├── openenv.yaml               ← OpenEnv manifest | |
| │                                runtime: fastapi | |
| │                                app: server.app:app | |
| │                                port: 8000 | |
| │                                type: space | |
| │ | |
| ├── Dockerfile                 ← HF Spaces production image | |
| │                                Base: python:3.10-slim | |
| │                                CMD: uvicorn server.app:app | |
| │                                HEALTHCHECK: polls /health every 30s | |
| │ | |
| ├── requirements.txt           ← Flat pip dependency list | |
| │                                openenv-core, fastapi, uvicorn, | |
| │                                pydantic, openai, datasets, | |
| │                                httpx, websockets, gradio | |
| │ | |
| ├── pyproject.toml             ← Build config + project metadata | |
| │                                (setuptools, same deps as above) | |
| │ | |
| ├── __init__.py                ← Package entry point | |
| │                                Exports: PreferenceLabEnv, | |
| │                                PairwiseAction, LikertAction, | |
| │                                ConsistencyAction + all Observations | |
| │ | |
| ├── models.py                  ← Pydantic v2 data models | |
| │                                Defines the agent ↔ env contract | |
| │ | |
| │     ACTIONS                          OBSERVATIONS | |
| │     ───────────────────────────      ─────────────────────────────── | |
| │     PairwiseAction                   PairwiseObservation | |
| │       .choice: A|B|tie|skip            .prompt, .response_a, .response_b | |
| │       .justification: str?             .reward, .done, .step_count | |
| │                                      ─────────────────────────────── | |
| │     LikertAction                     LikertObservation | |
| │       .helpfulness: 1-5                .prompt, .response | |
| │       .honesty: 1-5                    .rubric, .reward, .done | |
| │       .harmlessness: 1-5             ─────────────────────────────── | |
| │       .instruction_following: 1-5    ConsistencyObservation | |
| │                                        .prompt | |
| │     ConsistencyAction                  .response_a, .response_b | |
| │       .ranking: list[str] (len=4)      .response_c, .response_d | |
| │                                        .reward, .done | |
| │ | |
| ├── client.py                  ← PreferenceLabEnv client wrapper | |
| │                                Thin sync/async wrapper around | |
| │                                openenv.core.MCPToolClient | |
| │ | |
| ├── inference.py               ← Baseline LLM inference script | |
| │                                Mandatory stdout format: | |
| │                                [START] task= env= model= | |
| │                                [STEP] step= action= reward= done= | |
| │                                [END] success= steps= score= | |
| │ | |
| ├── test_api.py                ← Quick smoke-test (direct import) | |
| │                                Tests all 3 tasks in sequence | |
| │ | |
| ├── server/                    ← Core server package | |
| │   │ | |
| │   ├── __init__.py | |
| │   │ | |
| │   ├── app.py                 ← FastAPI application factory | |
| │   │                            ENABLE_WEB_INTERFACE=true → Gradio | |
| │   │                            MAX_CONCURRENT_ENVS=64 | |
| │   │                            Routes: /manifest.json, /.well-known/ | |
| │   │ | |
| │   └── environment.py         ← Core OpenEnv environment | |
| │                                PreferenceLabEnvironment(Environment) | |
| │                                SUPPORTS_CONCURRENT_SESSIONS = True | |
| │                                ───────────────────────────────────── | |
| │                                reset(seed, task_type, **kwargs) | |
| │                                  → Observation | |
| │                                step(action) | |
| │                                  → Observation [reward & done inline] | |
| │                                state @property | |
| │                                  → State(episode_id, step_count, ...) | |
| │                                ───────────────────────────────────── | |
| │                                Graders (internal): | |
| │                                  grade_pairwise()    → +1.0 / 0.3 / 0.1 / 0.0 | |
| │                                  grade_likert()      → 1 - MAE/4.0 | |
| │                                  grade_consistency() → Kendall-τ + transitivity | |
| │ | |
| ├── data/                      ← Dataset files (git-ignored) | |
| │   ├── pairwise_data.json       HH-RLHF gold labels | |
| │   ├── likert_data.json         UltraFeedback multi-axis scores | |
| │   └── consistency_data.json    Stanford SHP ranking pairs | |
| │                                (All 3 auto-fallback to synthetic | |
| │                                data if files are missing) | |
| │ | |
| ├── scripts/ | |
| │   └── prepare_datasets.py    ← Downloads & formats datasets | |
| │                                from Hugging Face Hub | |
| │                                Usage: python scripts/prepare_datasets.py | |
| │ | |
| └── tests/ | |
|     └── test_environment.py    ← pytest test suite | |
|                                  25 test cases covering: | |
|                                  reset / step / state / graders | |
|                                  concurrent sessions / reproducibility | |
| ``` | |
| --- | |
| ## Task Design | |
| PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows. | |
| ### Task 1 – Pairwise Ranking (Easy) | |
| The agent is shown a prompt and two LLM responses (A and B), and must pick the better one. | |
| **Observation fields:** `prompt`, `response_a`, `response_b` | |
| **Action:** | |
| ```python | |
| PairwiseAction( | |
| choice="A", # "A" | "B" | "tie" | "skip" | |
| justification="..." # optional, not used for grading | |
| ) | |
| ``` | |
| **Grading (vs HH-RLHF gold label):** | |
| | Agent choice | Outcome | Reward | | |
| |---|---|---| | |
| | Correct (matches gold) | ✅ | `+1.0` | | |
| | `skip` | ⚠️ Abstain | `+0.3` | | |
| | `tie` (when gold is clear) | ⚠️ Hedging | `+0.1` | | |
| | Wrong choice | ❌ | `+0.0` | | |
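As a rough illustration of the reward table above, the grading rule can be written as a small pure function. This is a sketch rather than the project's `grade_pairwise()`: the signature is hypothetical, and treating a `tie` that matches a `tie` gold label as correct is an assumption the table does not spell out.

```python
VALID_CHOICES = {"A", "B", "tie", "skip"}


def grade_pairwise(choice: str, gold_label: str) -> float:
    """Map an agent choice to a reward, following the table above."""
    if choice not in VALID_CHOICES:
        raise ValueError(f"unknown choice: {choice!r}")
    if choice == "skip":
        return 0.3  # abstaining earns partial credit
    if choice == gold_label:
        return 1.0  # matches the gold label (assumed to include a gold "tie")
    if choice == "tie":
        return 0.1  # hedging when the gold label is clear
    return 0.0  # picked the wrong response
```

With gold label `"A"`, the four possible choices map to `1.0` (A), `0.0` (B), `0.1` (tie), and `0.3` (skip).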
| --- | |
| ### Task 2 – Multi-Axis Likert Scoring (Medium) | |
| The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes. | |
| **Observation fields:** `prompt`, `response`, `rubric` | |
| **Action:** | |
| ```python | |
| LikertAction( | |
|     helpfulness=4,            # 1-5 | |
|     honesty=5,                # 1-5 | |
|     harmlessness=5,           # 1-5 | |
|     instruction_following=4   # 1-5 | |
| ) | |
| ``` | |
| **Grading (vs UltraFeedback gold scores):** | |
| ``` | |
| reward = 1.0 - (MAE / 4.0) | |
| where MAE = mean absolute error across all 4 axes | |
|       4.0 = maximum possible error per axis | |
| Perfect match  → reward = 1.0 | |
| Off by 1 each  → reward = 0.75 | |
| Off by 2 each  → reward = 0.50 | |
| Worst case     → reward = 0.0 | |
| ``` | |
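The formula above is short enough to check by hand. The sketch below assumes scores are passed as plain dictionaries keyed by axis name; `AXES` and this `grade_likert` signature are illustrative, not the project's actual code.

```python
AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")


def grade_likert(scores: dict[str, int], gold: dict[str, int]) -> float:
    """reward = 1 - MAE / 4, where MAE is averaged over the four axes."""
    mae = sum(abs(scores[axis] - gold[axis]) for axis in AXES) / len(AXES)
    return 1.0 - mae / 4.0  # 4 is the largest possible error on a 1-5 scale


gold = {axis: 5 for axis in AXES}
perfect = grade_likert(gold, gold)
off_by_one = grade_likert({axis: 4 for axis in AXES}, gold)
worst = grade_likert({axis: 1 for axis in AXES}, gold)
```

Here `perfect`, `off_by_one`, and `worst` reproduce the first, second, and last rows of the worked cases listed above.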
| --- | |
| ### Task 3 – Transitive Consistency Ranking (Hard) | |
| The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity. | |
| **Observation fields:** `prompt`, `response_a`, `response_b`, `response_c`, `response_d` | |
| **Action:** | |
| ```python | |
| ConsistencyAction( | |
|     ranking=["B", "A", "D", "C"]  # best → worst, all 4 required | |
| ) | |
| ``` | |
| **Grading (Kendall-τ + Transitivity bonus):** | |
| ``` | |
| reward = α × kendall_tau + β × transitivity_score | |
| kendall_tau:        normalized rank correlation vs gold ranking | |
|                     range [-1.0, +1.0], clipped to [0, 1] | |
| transitivity_score: fraction of (A>B, B>C ⇒ A>C) triplets satisfied | |
|                     penalizes logically inconsistent rankings | |
| α = 0.7, β = 0.3 (weighted combination) | |
| ``` | |
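One way to read the formula above is sketched below. The function names and signatures are hypothetical, and two choices are assumptions rather than documented behavior: an invalid ranking (wrong or repeated IDs) scores 0.0, and transitivity is computed over the pairwise preferences implied by the submitted list. Under that reading a well-formed strict ranking is always transitive, so the transitivity term only drops when preferences are supplied pairwise, as in the cyclic example.

```python
from itertools import combinations


def kendall_tau(ranking: list[str], gold: list[str]) -> float:
    """(concordant - discordant) / total pairs, in [-1, 1]."""
    pos = {item: i for i, item in enumerate(ranking)}
    gold_pos = {item: i for i, item in enumerate(gold)}
    concordant = discordant = 0
    for x, y in combinations(gold, 2):
        same_order = (pos[x] - pos[y]) * (gold_pos[x] - gold_pos[y]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def transitivity_score(prefers: set[tuple[str, str]]) -> float:
    """Fraction of chains a>b, b>c for which a>c is also preferred."""
    chains = [(a, c) for (a, b) in prefers for (b2, c) in prefers
              if b == b2 and a != c]
    if not chains:
        return 1.0
    return sum((a, c) in prefers for a, c in chains) / len(chains)


def grade_consistency(ranking: list[str], gold: list[str],
                      alpha: float = 0.7, beta: float = 0.3) -> float:
    if sorted(ranking) != sorted(gold):
        return 0.0  # assumption: malformed rankings earn nothing
    # Pairwise preferences implied by the submitted best-to-worst order.
    prefers = {(ranking[i], ranking[j])
               for i in range(len(ranking)) for j in range(i + 1, len(ranking))}
    tau = max(0.0, kendall_tau(ranking, gold))  # clip negatives to 0
    return alpha * tau + beta * transitivity_score(prefers)


gold = ["A", "B", "C", "D"]
perfect = grade_consistency(["A", "B", "C", "D"], gold)
reversed_reward = grade_consistency(["D", "C", "B", "A"], gold)
cyclic = transitivity_score({("A", "B"), ("B", "C"), ("C", "A")})
```

In this sketch a fully reversed ranking still collects the transitivity share of the reward; whether the real grader behaves the same way is not stated in the source.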
| --- | |
| ## Reward Functions | |
| | Task | Formula | Range | | |
| |---|---|---| | |
| | Pairwise | Exact match reward table | `{0.0, 0.1, 0.3, 1.0}` | | |
| | Likert | `1 - mean(abs(agent_score - gold_score)) / 4` | `[0.0, 1.0]` | | |
| | Consistency | `0.7 × Kendall-τ + 0.3 × Transitivity` | `[0.0, 1.0]` | | |
| All rewards are bounded `[0, 1]` and emitted at **every step** (dense signal). | |
| --- | |
| ## Datasets | |
| | Dataset | Task | Source | Samples | | |
| |---|---|---|---| | |
| | [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Pairwise | Anthropic | ~160K pairs | | |
| | [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) | Likert | OpenBMB | ~64K prompts | | |
| | [Stanford SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | Consistency | Stanford | ~385K pairs | | |
| Download all datasets: | |
| ```bash | |
| python scripts/prepare_datasets.py | |
| # or with custom sample count: | |
| python scripts/prepare_datasets.py --samples 5000 | |
| ``` | |
| If JSON files are absent, the environment automatically uses **built-in synthetic data**, so no download is needed for local development. | |
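A fallback of this kind can be sketched in a few lines. Everything here is illustrative: `load_examples`, `SYNTHETIC_PAIRWISE`, and the record field names are assumptions, and the real loader in `server/environment.py` may be organised differently.

```python
import json
from pathlib import Path

# Hypothetical built-in records; field names are assumed for illustration.
SYNTHETIC_PAIRWISE = [
    {
        "prompt": "Explain backpropagation.",
        "response_a": "It applies the chain rule layer by layer.",
        "response_b": "It is a kind of database index.",
        "gold_label": "A",
    },
]


def load_examples(path: str, fallback: list[dict]) -> list[dict]:
    """Return records from a JSON file, or the fallback if it is unusable."""
    file = Path(path)
    if not file.is_file():
        return list(fallback)  # file missing: use built-in data
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return list(fallback)  # file corrupt: use built-in data
    return data if data else list(fallback)


examples = load_examples("data/this_file_does_not_exist.json", SYNTHETIC_PAIRWISE)
```

Returning a copy of the fallback list keeps callers from mutating the shared built-in data by accident.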
| --- | |
| ## Quick Start | |
| ### Prerequisites | |
| - Python 3.10+ | |
| - `git` | |
| ### Local Development | |
| ```bash | |
| # 1. Clone | |
| git clone https://github.com/SIBAM890/preferencelab.git | |
| cd preferencelab | |
| # 2. Create virtual environment | |
| python -m venv venv | |
| source venv/bin/activate # Windows: venv\Scripts\activate | |
| # 3. Install dependencies | |
| pip install -r requirements.txt | |
| # 4. (Optional) Download real datasets | |
| python scripts/prepare_datasets.py | |
| # 5. Start the server | |
| python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload | |
| ``` | |
| Open **http://localhost:8000/web** for the interactive Gradio playground. | |
| ### Verify the server is running | |
| ```bash | |
| curl http://localhost:8000/health | |
| # → {"status":"healthy"} | |
| curl http://localhost:8000/schema | |
| # → full action / observation JSON schema | |
| ``` | |
| ### Run the baseline inference script | |
| ```bash | |
| # Set your API credentials (or use any OpenAI-compatible endpoint) | |
| export HF_TOKEN=hf_your_token_here | |
| export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct | |
| python inference.py | |
| ``` | |
| Expected output format: | |
| ``` | |
| [START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct | |
| [STEP] step=1 action=choice=A reward=1.00 done=false error=null | |
| [STEP] step=2 action=choice=B reward=0.00 done=false error=null | |
| [STEP] step=3 action=choice=A reward=1.00 done=false error=null | |
| [STEP] step=4 action=choice=A reward=1.00 done=false error=null | |
| [STEP] step=5 action=choice=B reward=0.00 done=true error=null | |
| [END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00 | |
| ``` | |
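Because the log format is line-oriented, it is easy to post-process. The sketch below parses `[STEP]` lines with a regular expression and recomputes the score as the mean step reward, which matches the sample above (0.60) but is an assumption about how `inference.py` derives `score`. The names `STEP_RE`, `parse_step_rewards`, and `mean_score` are hypothetical.

```python
import re

# Matches lines such as: [STEP] step=1 action=choice=A reward=1.00 done=false
STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\S+) reward=([\d.]+) done=(true|false)"
)

SAMPLE_LOG = """\
[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action=choice=A reward=1.00 done=false error=null
[STEP] step=2 action=choice=B reward=0.00 done=false error=null
[STEP] step=3 action=choice=A reward=1.00 done=false error=null
[STEP] step=4 action=choice=A reward=1.00 done=false error=null
[STEP] step=5 action=choice=B reward=0.00 done=true error=null
[END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
"""


def parse_step_rewards(log_text: str) -> list[float]:
    """Collect the reward from every [STEP] line, in order."""
    rewards = []
    for line in log_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            rewards.append(float(match.group(3)))
    return rewards


def mean_score(rewards: list[float]) -> float:
    """Average reward, or 0.0 for an empty episode."""
    return sum(rewards) / len(rewards) if rewards else 0.0


rewards = parse_step_rewards(SAMPLE_LOG)
score = mean_score(rewards)
```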
| ### Run tests | |
| ```bash | |
| pytest tests/ -v | |
| # 25 test cases: reset, step, state, graders, concurrency, reproducibility | |
| ``` | |
| --- | |
| ## Environment Variables | |
| | Variable | Default | Description | | |
| |---|---|---| | |
| | `HF_TOKEN` | _(none)_ | Hugging Face API token for LLM inference | | |
| | `API_BASE_URL` | `https://api-inference.huggingface.co/v1` | LLM API endpoint (any OpenAI-compatible URL) | | |
| | `MODEL_NAME` | `meta-llama/Llama-3.1-8B-Instruct` | Model identifier sent to the API | | |
| | `MAX_CONCURRENT_ENVS` | `64` | Maximum parallel WebSocket sessions | | |
| | `ENABLE_WEB_INTERFACE` | `true` | Mount Gradio UI at `/web` | | |
| | `ENV_BASE_URL` | `http://localhost:8000` | PreferenceLab server URL (for remote clients) | | |
| | `ENV_README_PATH` | _(none)_ | Custom path to README for web interface | | |
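For client scripts, the variables in the table can be gathered into one typed object. This is a sketch only: `Settings` and `load_settings` are hypothetical names, the defaults are copied from the table, and the set of strings accepted as "true" for `ENABLE_WEB_INTERFACE` is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    api_base_url: str
    model_name: str
    max_concurrent_envs: int
    enable_web_interface: bool
    env_base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from a mapping (os.environ when none is given)."""
    env = os.environ if env is None else env
    web_flag = env.get("ENABLE_WEB_INTERFACE", "true").strip().lower()
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        api_base_url=env.get("API_BASE_URL", "https://api-inference.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        max_concurrent_envs=int(env.get("MAX_CONCURRENT_ENVS", "64")),
        enable_web_interface=web_flag in {"1", "true", "yes"},
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:8000"),
    )


defaults = load_settings({})
custom = load_settings({"MAX_CONCURRENT_ENVS": "8", "ENABLE_WEB_INTERFACE": "false"})
```

Passing an explicit mapping, as in the last two lines, makes the parsing easy to test without touching the process environment.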
| --- | |
| ## API Reference | |
| ### REST Endpoints | |
| | Method | Path | Description | | |
| |---|---|---| | |
| | `GET` | `/health` | Server health check | | |
| | `GET` | `/schema` | Action + Observation JSON schemas | | |
| | `GET` | `/state` | Current episode state | | |
| | `POST` | `/reset` | Start a new episode | | |
| | `POST` | `/step` | Submit an action, receive observation | | |
| | `GET` | `/web` | Gradio interactive playground | | |
| | `GET` | `/manifest.json` | PWA web manifest | | |
| ### POST /reset | |
| ```json | |
| { | |
| "seed": 42, | |
| "task_type": "pairwise" | |
| } | |
| ``` | |
| `task_type` accepts `"pairwise"`, `"likert"`, or `"consistency"`; omit it to sample a task at random. | |
| ### POST /step | |
| ```json | |
| { | |
| "action": { | |
| "choice": "A" | |
| } | |
| } | |
| ``` | |
| ### Response (all step/reset endpoints) | |
| ```json | |
| { | |
| "observation": { | |
| "task_id": "abc123_step1", | |
| "task_type": "pairwise", | |
| "prompt": "Explain backpropagation.", | |
| "response_a": "...", | |
| "response_b": "...", | |
| "reward": 1.0, | |
| "done": false, | |
| "step_count": 1, | |
| "info": { "verdict": "correct", "gold_label": "A" } | |
| }, | |
| "reward": 1.0, | |
| "done": false | |
| } | |
| ``` | |
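The request bodies shown above can be assembled and sent with the standard library alone. The helpers below are a sketch: `build_reset_payload`, `build_step_payload`, and `post_json` are hypothetical names, and the payload shapes simply mirror the JSON examples in this section rather than a verified schema (check `/schema` on a running server).

```python
import json
from typing import Any, Optional
from urllib import request


def build_reset_payload(task_type: Optional[str] = None,
                        seed: Optional[int] = None) -> dict[str, Any]:
    """Body for POST /reset; omitted fields are left out entirely."""
    payload: dict[str, Any] = {}
    if seed is not None:
        payload["seed"] = seed
    if task_type is not None:
        payload["task_type"] = task_type
    return payload


def build_step_payload(**action_fields: Any) -> dict[str, Any]:
    """Body for POST /step: action fields nested under "action"."""
    return {"action": dict(action_fields)}


def post_json(base_url: str, path: str, payload: dict[str, Any],
              timeout: float = 10.0) -> dict[str, Any]:
    """POST a JSON body and decode the JSON response."""
    req = request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


reset_body = build_reset_payload(task_type="pairwise", seed=42)
step_body = build_step_payload(choice="A")
```

With a server running locally, `post_json("http://localhost:8000", "/reset", reset_body)` followed by `post_json("http://localhost:8000", "/step", step_body)` would exercise one step of an episode.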
| ### WebSocket | |
| ``` | |
| ws://localhost:8000/ws | |
| ``` | |
| The endpoint speaks the OpenEnv WebSocket protocol: send `reset`, `step`, `state`, and `close` messages. TRL training loops use it via `MCPToolClient`. | |
| --- | |
| ## Integration Guide | |
| ### Direct Import (Local) | |
| ```python | |
| from server.environment import PreferenceLabEnvironment | |
| from models import PairwiseAction, LikertAction, ConsistencyAction | |
| env = PreferenceLabEnvironment() | |
| # Pairwise task | |
| obs = env.reset(seed=42, task_type="pairwise") | |
| print(obs.prompt) | |
| obs = env.step(PairwiseAction(choice="A")) | |
| print(obs.reward, obs.done) | |
| # State (property, not method) | |
| state = env.state | |
| print(state.episode_id, state.step_count) | |
| ``` | |
| ### Using with TRL / GRPO Training | |
| ```python | |
| import asyncio | |
| from openenv.core.env_client import EnvClient | |
| from models import PairwiseAction | |
| # your_policy() and train_on() are placeholders for your own policy and update step. | |
| async def train(): | |
|     async with EnvClient("http://localhost:8000") as env: | |
|         obs = await env.reset(task_type="pairwise") | |
|         for step in range(5): | |
|             # Your policy predicts the action | |
|             action = PairwiseAction(choice=your_policy(obs)) | |
|             obs = await env.step(action) | |
|             reward = obs.reward | |
|             done = obs.done | |
|             train_on(obs, reward) | |
|             if done: | |
|                 break | |
| asyncio.run(train()) | |
| ``` | |
| ### MultiEnv Wrapper (Parallel Sessions) | |
| ```python | |
| import asyncio | |
| from openenv.core.env_client import MultiEnvClient | |
| async def main(): | |
|     # Spin up 8 parallel sessions on the same server | |
|     async with MultiEnvClient("http://localhost:8000", n=8) as envs: | |
|         observations = await envs.reset_all(task_type="pairwise") | |
|         # envs.step_all(actions) → list of observations | |
| asyncio.run(main()) | |
| ``` | |
| --- | |
| ## Baseline Scores | |
| Scores produced by `python inference.py` with `meta-llama/Llama-3.1-8B-Instruct`: | |
| | Task | Difficulty | Avg Reward | Notes | | |
| |---|---|---|---| | |
| | Pairwise Ranking | Easy | ~0.60 | Varies by model capability | | |
| | Likert Scoring | Medium | ~0.75 | Continuous signal | | |
| | Consistency Ranking | Hard | ~0.65 | Kendall-tau based | | |
| | **Overall** | - | **~0.67** | Reproducible with seed=42 | | |
| > Higher scores indicate the model aligns more closely with human preference gold labels. | |
| > Run `python inference.py` to generate fresh scores against your own model. | |
| --- | |
| ## Testing | |
| ### Run tests | |
| ```bash | |
| # Full test suite | |
| pytest tests/ -v | |
| # Specific test classes | |
| pytest tests/test_environment.py::TestPreferenceLabGraders -v | |
| pytest tests/test_environment.py::TestEpisodeManagement -v | |
| # Quick smoke test (direct import, no server needed) | |
| python test_api.py | |
| ``` | |
| **Test coverage (22 test cases across 4 classes):** | |
| - `TestPairwiseGrader`: correct / wrong / skip / tie / range (5 tests) | |
| - `TestLikertGrader`: perfect / worst / partial / random range (4 tests) | |
| - `TestConsistencyGrader`: perfect / reversed / invalid IDs / all perms / no-tie (5 tests) | |
| - `TestPreferenceLabEnvironment`: reset / step / state / seed / episode flow (8 tests) | |
| --- | |
| ## Deployment | |
| ### Docker (Local) | |
| ```bash | |
| docker build -t preferencelab . | |
| docker run -p 7860:7860 \ | |
| -e HF_TOKEN=hf_your_token \ | |
| -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \ | |
| -e MAX_CONCURRENT_ENVS=64 \ | |
| preferencelab | |
| ``` | |
| Visit `http://localhost:7860/web` | |
| ### Hugging Face Spaces | |
| 1. Fork or push this repository to a Hugging Face Space with **Docker SDK** | |
| 2. Add the following secrets in Space Settings: | |
| - `HF_TOKEN`: your Hugging Face token | |
| - `API_BASE_URL`: inference endpoint (e.g. `https://api-inference.huggingface.co/v1`) | |
| - `MODEL_NAME`: model to use | |
| The `Dockerfile` handles everything else. Health check polls `/health` every 30 seconds. | |
| --- | |
| ## License | |
| This project is licensed under the **MIT License**; see [LICENSE](LICENSE) for details. | |
| --- | |
| <div align="center"> | |
| **Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon** | |
| *Team Nexis* | |
| </div> | |