---
title: PreferenceLab
emoji: 🧪
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - openenv
  - rlhf
  - preference-learning
license: mit
---

# 🧪 PreferenceLab

### An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline

[![Python](https://img.shields.io/badge/Python-3.10%2B-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.104%2B-009688?style=flat-square&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
[![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?style=flat-square&logo=pydantic&logoColor=white)](https://docs.pydantic.dev/)
[![Gradio](https://img.shields.io/badge/Gradio-4.0%2B-FF7C00?style=flat-square&logo=gradio&logoColor=white)](https://gradio.app/)
[![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com/)
[![License](https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square)](LICENSE)
[![Hackathon](https://img.shields.io/badge/Meta_%C3%97_HuggingFace-OpenEnv_Hackathon-FF6B00?style=flat-square)](https://huggingface.co/)

> **Built for the Meta × Hugging Face OpenEnv Hackathon by Team Nexis**

| 🚀 **Live Space** | [Dev-CrafterX/preference-lab](https://huggingface.co/spaces/Dev-CrafterX/preference-lab) |
|---|---|


---

## Table of Contents

- [Overview](#overview)
- [Why PreferenceLab?](#why-preferencelab)
- [System Architecture](#system-architecture)
- [File Architecture](#file-architecture)
- [Task Design](#task-design)
  - [Task 1 – Pairwise Ranking](#task-1--pairwise-ranking-easy)
  - [Task 2 – Multi-Axis Likert Scoring](#task-2--multi-axis-likert-scoring-medium)
  - [Task 3 – Transitive Consistency Ranking](#task-3--transitive-consistency-ranking-hard)
- [Reward Functions](#reward-functions)
- [Datasets](#datasets)
- [Quick Start](#quick-start)
- [Environment Variables](#environment-variables)
- [API Reference](#api-reference)
- [Integration Guide](#integration-guide)
- [Baseline Scores](#baseline-scores)
- [Testing](#testing)
- [Deployment](#deployment)
- [License](#license)

---

## Overview

**PreferenceLab** is a production-grade [OpenEnv](https://github.com/meta-pytorch/openenv) environment that teaches AI agents to judge LLM response quality the way human annotators do in RLHF (Reinforcement Learning from Human Feedback) pipelines.

Instead of relying on slow, expensive human annotators, PreferenceLab provides:

| Feature | Details |
|---|---|
| ✅ **Deterministic grading** | Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP) |
| ✅ **Dense reward signals** | Reward at every annotation step, not just episode-end |
| ✅ **Three difficulty levels** | Pairwise → Likert scoring → Transitive 4-way ranking |
| ✅ **Synthetic fallback** | Zero-dependency offline testing with built-in data |
| ✅ **Concurrent sessions** | Up to 64 parallel RL training sessions by default |
| ✅ **Reproducible episodes** | Fully seeded random sampling |
| ✅ **Web playground** | Gradio UI at `/web` for interactive testing |

---

## Why PreferenceLab?

At the time of writing, we know of no other OpenEnv environment that simulates the RLHF data collection pipeline, the kind of pipeline that produces the alignment data used to fine-tune models like Llama 3, Claude, and GPT-4.

| Pain Point | PreferenceLab Solution |
|---|---|
| Human annotators are slow & expensive | AI agent replaces the annotator role |
| Binary end-of-episode rewards → sparse gradients | Every step yields a graded reward signal |
| Single-task environments limit curriculum learning | Three tasks of increasing complexity |
| Hard-to-reproduce evaluations | Seeded episodes are fully deterministic |
| Local dev blocked by API dependencies | Built-in synthetic fallback datasets |
| No visual interface for debugging | Gradio playground at `/web` |

---

## System Architecture

### 🏗️ Component Architecture

```mermaid
flowchart TB
    subgraph Clients["Clients and Consumers"]
        A1["AI Agent<br/>GRPO / TRL Training"]
        A2["Baseline Inference<br/>inference.py"]
        A3["Gradio Web UI<br/>/web"]
        A4["REST / WebSocket<br/>Direct API"]
    end

    subgraph Platform["Hugging Face Space - Docker Container"]
        subgraph FastAPI["FastAPI Server - server/app.py"]
            EP1["/reset POST"]
            EP2["/step POST"]
            EP3["/state GET"]
            EP4["/health GET"]
            EP5["/web Gradio"]
        end

        subgraph EnvCore["PreferenceLabEnvironment - server/environment.py"]
            RESET["reset()<br/>seed · task_type · episode_id"]
            STEP["step()<br/>grade action → reward → sample next"]
            STATE["state @property<br/>returns State object"]
        end

        subgraph Graders["Deterministic Graders"]
            G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"]
            G2["Task 2 · Likert<br/>1 − MAE / 4.0"]
            G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"]
        end

        subgraph DataStore["Data Layer - data/"]
            D1["pairwise_data.json<br/>HH-RLHF"]
            D2["likert_data.json<br/>UltraFeedback"]
            D3["consistency_data.json<br/>Stanford SHP"]
            D4["Synthetic Fallback<br/>built-in, always available"]
        end
    end

    subgraph Models["Pydantic Models - models.py"]
        M1["PairwiseAction / Observation"]
        M2["LikertAction / Observation"]
        M3["ConsistencyAction / Observation"]
    end

    LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"]

    A1 -- "HTTP / WebSocket" --> FastAPI
    A2 -- "Direct import" --> EnvCore
    A3 --> EP5
    A4 --> FastAPI
    EP1 --> RESET
    EP2 --> STEP
    EP3 --> STATE
    EP5 --> RESET
    EP5 --> STEP
    RESET --> Graders
    STEP --> Graders
    Graders --> G1
    Graders --> G2
    Graders --> G3
    EnvCore --> DataStore
    D1 -.->|fallback| D4
    D2 -.->|fallback| D4
    D3 -.->|fallback| D4
    Models --> Graders
    A2 -- "OpenAI client" --> LLM

    classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px
    classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px
    classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px

    class A1,A2,A3,A4 client
    class G1,G2,G3 grader
    class D1,D2,D3,D4 data
    class M1,M2,M3 model
    class LLM external
    class EP1,EP2,EP3,EP4,EP5 endpoint
    class RESET,STEP,STATE env
```

---

### 🔄 Request Lifecycle: Data Flow

```mermaid
sequenceDiagram
    autonumber
    actor Agent as AI Agent / TRL Trainer
    participant API as FastAPI Server
    participant Env as PreferenceLabEnvironment
    participant Grader as Deterministic Grader
    participant DB as Dataset

    Note over Agent,DB: Episode Start
    Agent->>API: POST /reset task_type=pairwise seed=42
    API->>Env: env.reset(task_type, seed)
    Env->>DB: _sample_example(rng)
    DB-->>Env: prompt, response_a, response_b, gold_label
    Env-->>API: PairwiseObservation reward=0.0 done=false
    API-->>Agent: 200 OK Observation JSON

    Note over Agent,DB: Step Loop - max 10 steps per episode
    loop For each annotation step
        Agent->>Agent: call_llm(system_prompt, observation)
        Agent->>API: POST /step action: choice=A
        API->>Env: env.step(PairwiseAction)
        Env->>Grader: grade_pairwise(action, example)
        Grader->>Grader: compare choice vs gold_label
        Grader-->>Env: reward=0.99 verdict=correct
        Env->>DB: _sample_example next example
        DB-->>Env: next example
        Env-->>API: Observation reward=0.99 done=false step=N
        API-->>Agent: 200 OK StepResult JSON
        Agent->>Agent: log_step accumulate reward
    end

    Note over Agent,DB: Episode End
    Env-->>API: Observation done=true step_count=10
    API-->>Agent: 200 OK Final Observation
    Agent->>Agent: log_end score rewards
    Agent->>API: POST /reset start new episode
```

---

### 🧭 User Flow

```mermaid
flowchart TD
    START(["Start"])

    subgraph Setup["Setup Phase"]
        S1["Clone repository<br/>git clone"]
        S2["Install dependencies<br/>pip install -r requirements.txt"]
        S3{"Need real<br/>datasets?"}
        S4["Download datasets<br/>python scripts/prepare_datasets.py"]
        S5["Use synthetic fallback<br/>built-in, no download needed"]
        S6["Set environment vars<br/>HF_TOKEN MODEL_NAME API_BASE_URL"]
    end

    subgraph Deploy["Choose Deployment"]
        D1{"Mode?"}
        D2["Local Dev<br/>uvicorn server.app:app --port 8000"]
        D3["Docker<br/>docker build and docker run"]
        D4["HF Space<br/>git push to HuggingFace"]
    end

    subgraph Usage["Choose Usage Mode"]
        U1{"How to use?"}
        U2["Run Baseline<br/>python inference.py"]
        U3["Web Playground<br/>localhost:8000/web"]
        U4["REST API Integration<br/>HTTP + WebSocket"]
        U5["Run Tests<br/>pytest tests/ -v"]
        U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"]
    end

    subgraph Episode["Episode Loop"]
        E1["POST /reset<br/>choose task_type and seed"]
        E2{"Task Type?"}
        E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"]
        E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"]
        E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"]
        E6["POST /step<br/>submit action"]
        E7["Receive Observation<br/>reward and done flag embedded"]
        E8{"done == true?"}
        E9["Next step<br/>new example sampled automatically"]
        E10["Episode complete<br/>log_end avg reward computed"]
    end

    START --> S1 --> S2 --> S3
    S3 -->|Yes| S4 --> S6
    S3 -->|No| S5 --> S6
    S6 --> D1
    D1 -->|Local| D2
    D1 -->|Docker| D3
    D1 -->|Cloud| D4
    D2 & D3 & D4 --> U1
    U1 -->|Baseline| U2
    U1 -->|Interactive| U3
    U1 -->|Custom| U4
    U1 -->|Tests| U5
    U1 -->|Training| U6
    U2 & U3 & U4 & U6 --> E1
    E1 --> E2
    E2 -->|pairwise| E3
    E2 -->|likert| E4
    E2 -->|consistency| E5
    E3 & E4 & E5 --> E6 --> E7 --> E8
    E8 -->|No| E9 --> E6
    E8 -->|Yes| E10
    E10 -->|New Episode| E1
    E10 -->|Done| FINISH(["Complete"])

    classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px
    classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px

    class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step
    class S3,D1,U1,E2,E8 decision
    class E3 task1
    class E4 task2
    class E5 task3
    class START,FINISH terminal
```

---

### ☁️ Deployment Architecture

```mermaid
flowchart LR
    subgraph Dev["Developer Machine"]
        CODE["Source Code<br/>preference-lab/"]
        GIT["git push"]
        CODE --> GIT
    end

    subgraph Space["Hugging Face Space - Docker SDK"]
        SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"]
        CONTAINER["Docker Container<br/>python:3.10-slim"]
        UVICORN["uvicorn server.app:app<br/>host 0.0.0.0 port 8000"]
        WEB["Gradio UI<br/>/web"]
        REST["REST API<br/>/reset /step /state"]
        HEALTH["Health Check<br/>/health every 30s"]
        CONTAINER --> UVICORN
        UVICORN --> WEB
        UVICORN --> REST
        UVICORN --> HEALTH
        SECRETS -.->|env vars injected| CONTAINER
    end

    PUBURL["Public URL<br/>https://username-preflab.hf.space"]

    subgraph LLMApi["HF Inference API"]
        MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"]
    end

    subgraph Consumers["Consumers"]
        U1["TRL / GRPO<br/>Training Loop"]
        U2["Developer<br/>Browser"]
        U3["inference.py<br/>Baseline Script"]
        U4["MCPToolClient<br/>PreferenceLabEnv"]
    end

    GIT --> Space
    Space --> PUBURL
    U1 -- "WebSocket / OpenEnv" --> REST
    U2 -- "HTTPS" --> WEB
    U3 -- "Direct import" --> UVICORN
    U4 -- "HTTP / MCP" --> REST
    REST -- "OpenAI client" --> MODEL

    classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px
    classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px
    classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px

    class PUBURL puburl
    class CONTAINER,UVICORN docker
    class U1,U2,U3,U4 consumer
    class MODEL llm
    class SECRETS secret
    class CODE,GIT dev
    class WEB,REST,HEALTH hf
```

---

## File Architecture

```
preference-lab/
│
├── 📄 README.md                  ← You are here
├── 📄 LICENSE
├── 📄 .gitignore
├── 📄 .dockerignore
│
├── 📄 openenv.yaml               ← OpenEnv manifest
│   │                               runtime: fastapi
│   │                               app: server.app:app
│   │                               port: 8000
│   │                               type: space
│   │
├── 📄 Dockerfile                 ← HF Spaces production image
│   │                               Base: python:3.10-slim
│   │                               CMD: uvicorn server.app:app
│   │                               HEALTHCHECK: polls /health every 30s
│   │
├── 📄 requirements.txt           ← Flat pip dependency list
│   │                               openenv-core, fastapi, uvicorn,
│   │                               pydantic, openai, datasets,
│   │                               httpx, websockets, gradio
│   │
├── 📄 pyproject.toml             ← Build config + project metadata
│   │                               (setuptools, same deps as above)
│   │
├── 📄 __init__.py                ← Package entry point
│   │                               Exports: PreferenceLabEnv,
│   │                               PairwiseAction, LikertAction,
│   │                               ConsistencyAction + all Observations
│   │
├── 📄 models.py                  ← Pydantic v2 data models
│   │                               Defines the agent ↔ env contract
│   │
│   │   ACTIONS                         OBSERVATIONS
│   │   ─────────────────────────────   ─────────────────────────────────
│   │   PairwiseAction                  PairwiseObservation
│   │     .choice: A|B|tie|skip           .prompt, .response_a, .response_b
│   │     .justification: str?            .reward, .done, .step_count
│   │                                   ─────────────────────────────────
│   │   LikertAction                    LikertObservation
│   │     .helpfulness: 1-5               .prompt, .response
│   │     .honesty: 1-5                   .rubric, .reward, .done
│   │     .harmlessness: 1-5            ─────────────────────────────────
│   │     .instruction_following: 1-5   ConsistencyObservation
│   │                                     .prompt
│   │   ConsistencyAction                 .response_a, .response_b
│   │     .ranking: list[str] (len=4)     .response_c, .response_d
│   │                                     .reward, .done
│   │
├── 📄 client.py                  ← PreferenceLabEnv client wrapper
│   │                               Thin sync/async wrapper around
│   │                               openenv.core.MCPToolClient
│   │
├── 📄 inference.py               ← Baseline LLM inference script
│   │                               Mandatory stdout format:
│   │                               [START] task= env= model=
│   │                               [STEP]  step= action= reward= done=
│   │                               [END]   success= steps= score=
│   │
├── 📄 test_api.py                ← Quick smoke-test (direct import)
│   │                               Tests all 3 tasks in sequence
│   │
├── server/                       ← Core server package
│   │
│   ├── 📄 __init__.py
│   │
│   ├── 📄 app.py                 ← FastAPI application factory
│   │                               ENABLE_WEB_INTERFACE=true → Gradio
│   │                               MAX_CONCURRENT_ENVS=64
│   │                               Routes: /manifest.json, /.well-known/
│   │
│   └── 📄 environment.py         ← Core OpenEnv environment
│                                   PreferenceLabEnvironment(Environment)
│                                   SUPPORTS_CONCURRENT_SESSIONS = True
│                                   ─────────────────────────────────────
│                                   reset(seed, task_type, **kwargs)
│                                     → Observation
│                                   step(action)
│                                     → Observation [reward & done inline]
│                                   state @property
│                                     → State(episode_id, step_count, ...)
│                                   ─────────────────────────────────────
│                                   Graders (internal):
│                                     grade_pairwise()    → +1.0 / 0.3 / 0.1 / 0.0
│                                     grade_likert()      → 1 − MAE/4.0
│                                     grade_consistency() → Kendall-τ + transitivity
│
├── data/                         ← Dataset files (git-ignored)
│   ├── 📄 pairwise_data.json       HH-RLHF gold labels
│   ├── 📄 likert_data.json         UltraFeedback multi-axis scores
│   └── 📄 consistency_data.json    Stanford SHP ranking pairs
│                                   (All 3 auto-fallback to synthetic
│                                   data if files are missing)
│
├── scripts/
│   └── 📄 prepare_datasets.py    ← Downloads & formats datasets
│                                   from Hugging Face Hub
│                                   Usage: python scripts/prepare_datasets.py
│
└── tests/
    └── 📄 test_environment.py    ← pytest test suite
                                    22 test cases covering:
                                    reset / step / state / graders
                                    concurrent sessions / reproducibility
```

---

## Task Design

PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows.

### Task 1 – Pairwise Ranking (Easy)

The agent is shown a prompt and two LLM responses (A and B), and must pick the better one.

**Observation fields:** `prompt`, `response_a`, `response_b`

**Action:**

```python
PairwiseAction(
    choice="A",            # "A" | "B" | "tie" | "skip"
    justification="..."    # optional, not used for grading
)
```

**Grading (vs HH-RLHF gold label):**

| Agent choice | Outcome | Reward |
|---|---|---|
| Correct (matches gold) | ✅ | `+1.0` |
| `skip` | ⚠️ Abstain | `+0.3` |
| `tie` (when gold is clear) | ⚠️ Hedging | `+0.1` |
| Wrong choice | ❌ | `+0.0` |

---

### Task 2 – Multi-Axis Likert Scoring (Medium)

The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes.

**Observation fields:** `prompt`, `response`, `rubric`

**Action:**

```python
LikertAction(
    helpfulness=4,            # 1–5
    honesty=5,                # 1–5
    harmlessness=5,           # 1–5
    instruction_following=4   # 1–5
)
```

**Grading (vs UltraFeedback gold scores):**

```
reward = 1.0 − (MAE / 4.0)

where MAE = mean absolute error across all 4 axes
      4.0 = maximum possible error per axis

Perfect match  → reward = 1.0
Off by 1 each  → reward = 0.75
Off by 2 each  → reward = 0.50
Worst case     → reward = 0.0
```

---

### Task 3 – Transitive Consistency Ranking (Hard)

The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity.

**Observation fields:** `prompt`, `response_a`, `response_b`, `response_c`, `response_d`

**Action:**

```python
ConsistencyAction(
    ranking=["B", "A", "D", "C"]   # best → worst, all 4 required
)
```

**Grading (Kendall-τ + Transitivity bonus):**

```
reward = α × kendall_tau + β × transitivity_score

kendall_tau:         normalized rank correlation vs gold ranking
                     range [−1.0, +1.0], clipped to [0, 1]

transitivity_score:  fraction of (A>B, B>C → A>C) triplets satisfied
                     penalizes logically inconsistent rankings

α = 0.7, β = 0.3     (weighted combination)
```

---

## Reward Functions

| Task | Formula | Range |
|---|---|---|
| Pairwise | Exact match reward table | `{0.0, 0.1, 0.3, 1.0}` |
| Likert | `1 − mean(abs(agent_score − gold_score)) / 4` | `[0.0, 1.0]` |
| Consistency | `0.7 × Kendall-τ + 0.3 × Transitivity` | `[0.0, 1.0]` |

All rewards are bounded `[0, 1]` and emitted at **every step** (dense signal).

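The three formulas in the table above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the function names mirror the grader names shown in the file tree, but their signatures are hypothetical, `tie` is assumed to always face a clear gold label, and `transitivity_score` is taken as a caller-supplied value because the README does not spell out how it is computed.

```python
from itertools import combinations

AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")


def grade_pairwise(choice: str, gold: str) -> float:
    """Task 1 reward table (hypothetical signature)."""
    if choice == "skip":
        return 0.3
    if choice == "tie":
        return 0.1  # assumption: the gold label is a clear "A" or "B"
    return 1.0 if choice == gold else 0.0


def grade_likert(agent: dict, gold: dict) -> float:
    """Task 2: 1 - MAE / 4 over the four 1-5 axes."""
    mae = sum(abs(agent[a] - gold[a]) for a in AXES) / len(AXES)
    return 1.0 - mae / 4.0


def kendall_tau(ranking: list, gold: list) -> float:
    """(concordant - discordant) / number of pairs, in [-1, 1]."""
    pos = {item: i for i, item in enumerate(ranking)}
    gold_pos = {item: i for i, item in enumerate(gold)}
    pairs = list(combinations(gold, 2))
    score = 0
    for x, y in pairs:
        # A pair is concordant when both rankings order x and y the same way.
        same_order = (pos[x] - pos[y]) * (gold_pos[x] - gold_pos[y]) > 0
        score += 1 if same_order else -1
    return score / len(pairs)


def grade_consistency(ranking: list, gold: list,
                      transitivity_score: float = 1.0) -> float:
    """Task 3: 0.7 * clipped tau + 0.3 * transitivity (assumed input)."""
    tau = max(0.0, kendall_tau(ranking, gold))  # tau never exceeds 1
    return 0.7 * tau + 0.3 * transitivity_score


if __name__ == "__main__":
    gold_rank = ["A", "B", "C", "D"]
    print(grade_pairwise("A", "A"))
    print(grade_likert(dict.fromkeys(AXES, 3), dict.fromkeys(AXES, 4)))
    print(grade_consistency(["B", "A", "C", "D"], gold_rank))
```

The default `transitivity_score=1.0` reflects that a strict list of four distinct items is already a total order; the real grader may score transitivity differently.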

---

## Datasets

| Dataset | Task | Source | Samples |
|---|---|---|---|
| [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Pairwise | Anthropic | ~160K pairs |
| [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) | Likert | OpenBMB | ~64K responses |
| [Stanford SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | Consistency | Stanford | ~385K pairs |

Download all datasets:

```bash
python scripts/prepare_datasets.py
# or with custom sample count:
python scripts/prepare_datasets.py --samples 5000
```

If the JSON files are absent, the environment automatically uses **built-in synthetic data**, so no download is needed for local development.

---

## Quick Start

### Prerequisites

- Python 3.10+
- `git`

### Local Development

```bash
# 1. Clone (into preference-lab/ so the directory matches the layout above)
git clone https://github.com/SIBAM890/preferencelab.git preference-lab
cd preference-lab

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate    # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Download real datasets
python scripts/prepare_datasets.py

# 5. Start the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

Open **http://localhost:8000/web** for the interactive Gradio playground.

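Because step 4 is optional, it can help to check which tasks will read real data and which will fall back to synthetic examples before starting the server. The sketch below is a hypothetical helper, not part of the repository: only the three file names and the `data/` directory come from this README, and the "file is missing means synthetic fallback" rule is the behaviour described in the Datasets section.

```python
from pathlib import Path

# File names as listed in this README; the helper itself is hypothetical.
DATASET_FILES = {
    "pairwise": "pairwise_data.json",
    "likert": "likert_data.json",
    "consistency": "consistency_data.json",
}


def dataset_sources(data_dir: str = "data") -> dict:
    """Map each task to "file" if its JSON exists, else "synthetic"."""
    root = Path(data_dir)
    return {
        task: "file" if (root / name).is_file() else "synthetic"
        for task, name in DATASET_FILES.items()
    }


if __name__ == "__main__":
    for task, source in dataset_sources().items():
        print(f"{task}: {source}")
```

Run from the repository root, this prints one line per task; a task marked `synthetic` still works, it just uses the built-in data.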
### Verify the server is running

```bash
curl http://localhost:8000/health
# → {"status":"healthy"}

curl http://localhost:8000/schema
# → full action / observation JSON schema
```

### Run the baseline inference script

```bash
# Set your API credentials (or use any OpenAI-compatible endpoint)
export HF_TOKEN=hf_your_token_here
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct

python inference.py
```

Expected output format:

```
[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action=choice=A reward=1.00 done=false error=null
[STEP] step=2 action=choice=B reward=0.00 done=false error=null
[STEP] step=3 action=choice=A reward=1.00 done=false error=null
[STEP] step=4 action=choice=A reward=1.00 done=false error=null
[STEP] step=5 action=choice=B reward=0.00 done=true error=null
[END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
```

### Run tests

```bash
pytest tests/ -v
# 22 test cases: reset, step, state, graders, concurrency, reproducibility
```

---

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | _(none)_ | Hugging Face API token for LLM inference |
| `API_BASE_URL` | `https://api-inference.huggingface.co/v1` | LLM API endpoint (any OpenAI-compatible URL) |
| `MODEL_NAME` | `meta-llama/Llama-3.1-8B-Instruct` | Model identifier sent to the API |
| `MAX_CONCURRENT_ENVS` | `64` | Maximum parallel WebSocket sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Mount Gradio UI at `/web` |
| `ENV_BASE_URL` | `http://localhost:8000` | PreferenceLab server URL (for remote clients) |
| `ENV_README_PATH` | _(none)_ | Custom path to README for web interface |

---

## API Reference

### REST Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Server health check |
| `GET` | `/schema` | Action + Observation JSON schemas |
| `GET` | `/state` | Current episode state |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Submit an action, receive observation |
| `GET` | `/web` | Gradio interactive playground |
| `GET` | `/manifest.json` | PWA web manifest |

### POST /reset

```json
{
  "seed": 42,
  "task_type": "pairwise"
}
```

`task_type` accepts `"pairwise"`, `"likert"`, or `"consistency"`; omit it to get a random task.

### POST /step

```json
{
  "action": {
    "choice": "A"
  }
}
```

### Response (all step/reset endpoints)

```json
{
  "observation": {
    "task_id": "abc123_step1",
    "task_type": "pairwise",
    "prompt": "Explain backpropagation.",
    "response_a": "...",
    "response_b": "...",
    "reward": 1.0,
    "done": false,
    "step_count": 1,
    "info": {
      "verdict": "correct",
      "gold_label": "A"
    }
  },
  "reward": 1.0,
  "done": false
}
```

### WebSocket

```
ws://localhost:8000/ws
```

The endpoint speaks the OpenEnv WebSocket protocol: send `reset`, `step`, `state`, and `close` messages. TRL training loops use it via `MCPToolClient`.

---

## Integration Guide

### Direct Import (Local)

```python
from server.environment import PreferenceLabEnvironment
from models import PairwiseAction, LikertAction, ConsistencyAction

env = PreferenceLabEnvironment()

# Pairwise task
obs = env.reset(seed=42, task_type="pairwise")
print(obs.prompt)

obs = env.step(PairwiseAction(choice="A"))
print(obs.reward, obs.done)

# State (property, not method)
state = env.state
print(state.episode_id, state.step_count)
```

### Using with TRL / GRPO Training

```python
import asyncio
from openenv.core.env_client import EnvClient
from models import PairwiseAction

async def train():
    async with EnvClient("http://localhost:8000") as env:
        obs = await env.reset(task_type="pairwise")
        for step in range(5):
            # Your policy predicts the action.
            # your_policy() and train_on() are placeholders for your own code.
            action = PairwiseAction(choice=your_policy(obs))
            obs = await env.step(action)
            reward = obs.reward
            done = obs.done
            train_on(obs, reward)
            if done:
                break

asyncio.run(train())
```

### MultiEnv Wrapper (Parallel Sessions)

```python
import asyncio
from openenv.core.env_client import MultiEnvClient

async def main():
    # Spin up 8 parallel sessions on the same server
    async with MultiEnvClient("http://localhost:8000", n=8) as envs:
        observations = await envs.reset_all(task_type="pairwise")
        # envs.step_all(actions) → list of observations

asyncio.run(main())
```

---

## Baseline Scores

Scores produced by `python inference.py` with `meta-llama/Llama-3.1-8B-Instruct`:

| Task | Difficulty | Avg Reward | Notes |
|---|---|---|---|
| Pairwise Ranking | Easy | ~0.60 | Varies by model capability |
| Likert Scoring | Medium | ~0.75 | Continuous signal |
| Consistency Ranking | Hard | ~0.65 | Kendall-tau based |
| **Overall** | n/a | **~0.67** | Reproducible with seed=42 |

> Higher scores indicate the model aligns more closely with human preference gold labels.
> Run `python inference.py` to generate fresh scores against your own model.

---

## Testing

### Run tests

```bash
# Full test suite
pytest tests/ -v

# Specific test classes
pytest tests/test_environment.py::TestPairwiseGrader -v
pytest tests/test_environment.py::TestPreferenceLabEnvironment -v

# Quick smoke test (direct import, no server needed)
python test_api.py
```

**Test coverage: 22 test cases across 4 classes**

- `TestPairwiseGrader`: correct / wrong / skip / tie / range (5 tests)
- `TestLikertGrader`: perfect / worst / partial / random range (4 tests)
- `TestConsistencyGrader`: perfect / reversed / invalid IDs / all perms / no-tie (5 tests)
- `TestPreferenceLabEnvironment`: reset / step / state / seed / episode flow (8 tests)

---

## Deployment

### Docker (Local)

```bash
docker build -t preferencelab .

docker run -p 7860:7860 \
  -e HF_TOKEN=hf_your_token \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e MAX_CONCURRENT_ENVS=64 \
  preferencelab
```

Visit `http://localhost:7860/web`

### Hugging Face Spaces

1. Fork or push this repository to a Hugging Face Space with **Docker SDK**
2. Add the following secrets in Space Settings:
   - `HF_TOKEN`: your Hugging Face token
   - `API_BASE_URL`: inference endpoint (e.g. `https://api-inference.huggingface.co/v1`)
   - `MODEL_NAME`: model to use

The `Dockerfile` handles everything else. The health check polls `/health` every 30 seconds.

---

## License

This project is licensed under the **MIT License**. See [LICENSE](LICENSE) for details.

---

**Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon**

*Team Nexis*