title: PreferenceLab
emoji: 🧪
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
- openenv
- rlhf
- preference-learning
license: mit
🧪 PreferenceLab
An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline
Built for the Meta × Hugging Face OpenEnv Hackathon (Team Nexis)
| Live Space | Dev-CrafterX/preference-lab |
|---|---|
Table of Contents
- Overview
- Why PreferenceLab?
- System Architecture
- File Architecture
- Task Design
- Reward Functions
- Datasets
- Quick Start
- Environment Variables
- API Reference
- Integration Guide
- Baseline Scores
- Testing
- Deployment
- License
Overview
PreferenceLab is a production-grade OpenEnv environment that teaches AI agents to judge LLM response quality, just as human annotators do in RLHF (Reinforcement Learning from Human Feedback) pipelines.
Instead of expensive, slow human annotators, PreferenceLab provides:
| Feature | Details |
|---|---|
| ✅ Deterministic grading | Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP) |
| ✅ Dense reward signals | Reward at every annotation step, not just episode-end |
| ✅ Three difficulty levels | Pairwise → Likert scoring → Transitive 4-way ranking |
| ✅ Synthetic fallback | Zero-dependency offline testing with built-in data |
| ✅ Concurrent sessions | Up to 64 parallel RL training sessions by default |
| ✅ Reproducible episodes | Fully seeded random sampling |
| ✅ Web playground | Gradio UI at /web for interactive testing |
Why PreferenceLab?
To our knowledge, no existing OpenEnv environment simulates the RLHF data collection pipeline, the same kind of pipeline that produces the alignment data used to fine-tune models like Llama 3, Claude, and GPT-4.
| Pain Point | PreferenceLab Solution |
|---|---|
| Human annotators are slow & expensive | AI agent replaces the annotator role |
| Binary end-of-episode rewards → sparse gradients | Every step yields a graded reward signal |
| Single-task environments limit curriculum learning | Three tasks of increasing complexity |
| Hard-to-reproduce evaluations | Seeded episodes are fully deterministic |
| Local dev blocked by API dependencies | Built-in synthetic fallback datasets |
| No visual interface for debugging | Gradio playground at /web |
System Architecture
Component Architecture
flowchart TB
subgraph Clients["Clients and Consumers"]
A1["AI Agent<br/>GRPO / TRL Training"]
A2["Baseline Inference<br/>inference.py"]
A3["Gradio Web UI<br/>/web"]
A4["REST / WebSocket<br/>Direct API"]
end
subgraph Platform["Hugging Face Space: Docker Container"]
subgraph FastAPI["FastAPI Server: server/app.py"]
EP1["/reset POST"]
EP2["/step POST"]
EP3["/state GET"]
EP4["/health GET"]
EP5["/web Gradio"]
end
subgraph EnvCore["PreferenceLabEnvironment: server/environment.py"]
RESET["reset()<br/>seed · task_type · episode_id"]
STEP["step()<br/>grade action → reward → sample next"]
STATE["state @property<br/>returns State object"]
end
subgraph Graders["Deterministic Graders"]
G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"]
G2["Task 2 · Likert<br/>1 − MAE / 4.0"]
G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"]
end
subgraph DataStore["Data Layer: data/"]
D1["pairwise_data.json<br/>HH-RLHF"]
D2["likert_data.json<br/>UltraFeedback"]
D3["consistency_data.json<br/>Stanford SHP"]
D4["Synthetic Fallback<br/>built-in, always available"]
end
end
subgraph Models["Pydantic Models β models.py"]
M1["PairwiseAction / Observation"]
M2["LikertAction / Observation"]
M3["ConsistencyAction / Observation"]
end
LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"]
A1 -- "HTTP / WebSocket" --> FastAPI
A2 -- "Direct import" --> EnvCore
A3 --> EP5
A4 --> FastAPI
EP1 --> RESET
EP2 --> STEP
EP3 --> STATE
EP5 --> RESET
EP5 --> STEP
RESET --> Graders
STEP --> Graders
Graders --> G1
Graders --> G2
Graders --> G3
EnvCore --> DataStore
D1 -.->|fallback| D4
D2 -.->|fallback| D4
D3 -.->|fallback| D4
Models --> Graders
A2 -- "OpenAI client" --> LLM
classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px
classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px
classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
class A1,A2,A3,A4 client
class G1,G2,G3 grader
class D1,D2,D3,D4 data
class M1,M2,M3 model
class LLM external
class EP1,EP2,EP3,EP4,EP5 endpoint
class RESET,STEP,STATE env
Request Lifecycle: Data Flow
sequenceDiagram
autonumber
actor Agent as AI Agent / TRL Trainer
participant API as FastAPI Server
participant Env as PreferenceLabEnvironment
participant Grader as Deterministic Grader
participant DB as Dataset
Note over Agent,DB: Episode Start
Agent->>API: POST /reset task_type=pairwise seed=42
API->>Env: env.reset(task_type, seed)
Env->>DB: _sample_example(rng)
DB-->>Env: prompt, response_a, response_b, gold_label
Env-->>API: PairwiseObservation reward=0.0 done=false
API-->>Agent: 200 OK Observation JSON
Note over Agent,DB: Step Loop (max 10 steps per episode)
loop For each annotation step
Agent->>Agent: call_llm(system_prompt, observation)
Agent->>API: POST /step action: choice=A
API->>Env: env.step(PairwiseAction)
Env->>Grader: grade_pairwise(action, example)
Grader->>Grader: compare choice vs gold_label
Grader-->>Env: reward=0.99 verdict=correct
Env->>DB: _sample_example next example
DB-->>Env: next example
Env-->>API: Observation reward=0.99 done=false step=N
API-->>Agent: 200 OK StepResult JSON
Agent->>Agent: log_step accumulate reward
end
Note over Agent,DB: Episode End
Env-->>API: Observation done=true step_count=10
API-->>Agent: 200 OK Final Observation
Agent->>Agent: log_end score rewards
Agent->>API: POST /reset start new episode
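The lifecycle above (reset, a loop of graded steps, then `done`) can be mimicked in-process with a toy environment. This is a minimal sketch, not the project's code: `ToyPairwiseEnv`, `TOY_EXAMPLES`, and `run_episode` are hypothetical names, and the three examples are made up for illustration. It only demonstrates that a seeded sampler makes an episode repeatable and that the episode ends after a fixed number of steps.

```python
import random

# Hypothetical examples; the real environment samples from its datasets.
TOY_EXAMPLES = [
    {"prompt": "Explain backpropagation.", "gold_label": "A"},
    {"prompt": "What is overfitting?", "gold_label": "B"},
    {"prompt": "Define a gradient.", "gold_label": "A"},
]


class ToyPairwiseEnv:
    """Illustrative stand-in that follows the reset/step/done contract."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps

    def reset(self, seed=None):
        # A private, seeded RNG makes the sequence of examples reproducible.
        self.rng = random.Random(seed)
        self.step_count = 0
        self.current = self.rng.choice(TOY_EXAMPLES)
        return {"prompt": self.current["prompt"], "reward": 0.0, "done": False}

    def step(self, choice):
        # Grade the action against the current gold label, then sample the next example.
        reward = 1.0 if choice == self.current["gold_label"] else 0.0
        self.step_count += 1
        done = self.step_count >= self.max_steps
        self.current = self.rng.choice(TOY_EXAMPLES)
        return {"prompt": self.current["prompt"], "reward": reward, "done": done}


def run_episode(env, seed, policy):
    """Run one episode and return the per-step rewards."""
    obs = env.reset(seed=seed)
    rewards = []
    while not obs["done"]:
        obs = env.step(policy(obs))
        rewards.append(obs["reward"])
    return rewards
```

Calling `run_episode(ToyPairwiseEnv(), 42, lambda obs: "A")` twice yields the same reward list, which is the property that seeded episodes rely on.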
User Flow
flowchart TD
START(["Start"])
subgraph Setup["Setup Phase"]
S1["Clone repository<br/>git clone"]
S2["Install dependencies<br/>pip install -r requirements.txt"]
S3{"Need real<br/>datasets?"}
S4["Download datasets<br/>python scripts/prepare_datasets.py"]
S5["Use synthetic fallback<br/>built-in, no download needed"]
S6["Set environment vars<br/>HF_TOKEN MODEL_NAME API_BASE_URL"]
end
subgraph Deploy["Choose Deployment"]
D1{"Mode?"}
D2["Local Dev<br/>uvicorn server.app:app --port 8000"]
D3["Docker<br/>docker build and docker run"]
D4["HF Space<br/>git push to HuggingFace"]
end
subgraph Usage["Choose Usage Mode"]
U1{"How to use?"}
U2["Run Baseline<br/>python inference.py"]
U3["Web Playground<br/>localhost:8000/web"]
U4["REST API Integration<br/>HTTP + WebSocket"]
U5["Run Tests<br/>pytest tests/ -v"]
U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"]
end
subgraph Episode["Episode Loop"]
E1["POST /reset<br/>choose task_type and seed"]
E2{"Task Type?"}
E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"]
E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"]
E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"]
E6["POST /step<br/>submit action"]
E7["Receive Observation<br/>reward and done flag embedded"]
E8{"done == true?"}
E9["Next step<br/>new example sampled automatically"]
E10["Episode complete<br/>log_end avg reward computed"]
end
START --> S1 --> S2 --> S3
S3 -->|Yes| S4 --> S6
S3 -->|No| S5 --> S6
S6 --> D1
D1 -->|Local| D2
D1 -->|Docker| D3
D1 -->|Cloud| D4
D2 & D3 & D4 --> U1
U1 -->|Baseline| U2
U1 -->|Interactive| U3
U1 -->|Custom| U4
U1 -->|Tests| U5
U1 -->|Training| U6
U2 & U3 & U4 & U6 --> E1
E1 --> E2
E2 -->|pairwise| E3
E2 -->|likert| E4
E2 -->|consistency| E5
E3 & E4 & E5 --> E6 --> E7 --> E8
E8 -->|No| E9 --> E6
E8 -->|Yes| E10
E10 -->|New Episode| E1
E10 -->|Done| FINISH(["Complete"])
classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px
classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px
class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step
class S3,D1,U1,E2,E8 decision
class E3 task1
class E4 task2
class E5 task3
class START,FINISH terminal
Deployment Architecture
flowchart LR
subgraph Dev["Developer Machine"]
CODE["Source Code<br/>preference-lab/"]
GIT["git push"]
CODE --> GIT
end
subgraph Space["Hugging Face Space: Docker SDK"]
SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"]
CONTAINER["Docker Container<br/>python:3.10-slim"]
UVICORN["uvicorn server.app:app<br/>host 0.0.0.0 port 8000"]
WEB["Gradio UI<br/>/web"]
REST["REST API<br/>/reset /step /state"]
HEALTH["Health Check<br/>/health every 30s"]
CONTAINER --> UVICORN
UVICORN --> WEB
UVICORN --> REST
UVICORN --> HEALTH
SECRETS -.->|env vars injected| CONTAINER
end
PUBURL["Public URL<br/>https://username-preflab.hf.space"]
subgraph LLMApi["HF Inference API"]
MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"]
end
subgraph Consumers["Consumers"]
U1["TRL / GRPO<br/>Training Loop"]
U2["Developer<br/>Browser"]
U3["inference.py<br/>Baseline Script"]
U4["MCPToolClient<br/>PreferenceLabEnv"]
end
GIT --> Space
Space --> PUBURL
U1 -- "WebSocket / OpenEnv" --> REST
U2 -- "HTTPS" --> WEB
U3 -- "Direct import" --> UVICORN
U4 -- "HTTP / MCP" --> REST
REST -- "OpenAI client" --> MODEL
classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px
classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px
classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px
class PUBURL puburl
class CONTAINER,UVICORN docker
class U1,U2,U3,U4 consumer
class MODEL llm
class SECRETS secret
class CODE,GIT dev
class WEB,REST,HEALTH hf
File Architecture
preference-lab/
│
├── README.md                 ← You are here
├── LICENSE
├── .gitignore
├── .dockerignore
│
├── openenv.yaml              ← OpenEnv manifest
│                               runtime: fastapi
│                               app: server.app:app
│                               port: 8000
│                               type: space
│
├── Dockerfile                ← HF Spaces production image
│                               Base: python:3.10-slim
│                               CMD: uvicorn server.app:app
│                               HEALTHCHECK: polls /health every 30s
│
├── requirements.txt          ← Flat pip dependency list
│                               openenv-core, fastapi, uvicorn,
│                               pydantic, openai, datasets,
│                               httpx, websockets, gradio
│
├── pyproject.toml            ← Build config + project metadata
│                               (setuptools, same deps as above)
│
├── __init__.py               ← Package entry point
│                               Exports: PreferenceLabEnv,
│                               PairwiseAction, LikertAction,
│                               ConsistencyAction + all Observations
│
├── models.py                 ← Pydantic v2 data models
│                               Defines the agent ↔ env contract
│
│     ACTIONS                          OBSERVATIONS
│     ───────────────────────────      ───────────────────────────────
│     PairwiseAction                   PairwiseObservation
│       .choice: A|B|tie|skip            .prompt, .response_a, .response_b
│       .justification: str?             .reward, .done, .step_count
│                                      ───────────────────────────────
│     LikertAction                     LikertObservation
│       .helpfulness: 1-5                .prompt, .response
│       .honesty: 1-5                    .rubric, .reward, .done
│       .harmlessness: 1-5             ───────────────────────────────
│       .instruction_following: 1-5    ConsistencyObservation
│                                        .prompt
│     ConsistencyAction                  .response_a, .response_b
│       .ranking: list[str] (len=4)      .response_c, .response_d
│                                        .reward, .done
│
├── client.py                 ← PreferenceLabEnv client wrapper
│                               Thin sync/async wrapper around
│                               openenv.core.MCPToolClient
│
├── inference.py              ← Baseline LLM inference script
│                               Mandatory stdout format:
│                               [START] task= env= model=
│                               [STEP] step= action= reward= done=
│                               [END] success= steps= score=
│
├── test_api.py               ← Quick smoke-test (direct import)
│                               Tests all 3 tasks in sequence
│
├── server/                   ← Core server package
│   │
│   ├── __init__.py
│   │
│   ├── app.py                ← FastAPI application factory
│   │                           ENABLE_WEB_INTERFACE=true → Gradio
│   │                           MAX_CONCURRENT_ENVS=64
│   │                           Routes: /manifest.json, /.well-known/
│   │
│   └── environment.py        ← Core OpenEnv environment
│                               PreferenceLabEnvironment(Environment)
│                               SUPPORTS_CONCURRENT_SESSIONS = True
│                               ─────────────────────────────────────
│                               reset(seed, task_type, **kwargs)
│                                 → Observation
│                               step(action)
│                                 → Observation [reward & done inline]
│                               state @property
│                                 → State(episode_id, step_count, ...)
│                               ─────────────────────────────────────
│                               Graders (internal):
│                                 grade_pairwise() → +1.0 / 0.3 / 0.1 / 0.0
│                                 grade_likert() → 1 − MAE/4.0
│                                 grade_consistency() → Kendall-τ + transitivity
│
├── data/                     ← Dataset files (git-ignored)
│   ├── pairwise_data.json      HH-RLHF gold labels
│   ├── likert_data.json        UltraFeedback multi-axis scores
│   └── consistency_data.json   Stanford SHP ranking pairs
│                               (All 3 auto-fallback to synthetic
│                               data if files are missing)
│
├── scripts/
│   └── prepare_datasets.py   ← Downloads & formats datasets
│                               from Hugging Face Hub
│                               Usage: python scripts/prepare_datasets.py
│
└── tests/
    └── test_environment.py   ← pytest test suite
                                25 test cases covering:
                                reset / step / state / graders
                                concurrent sessions / reproducibility
Task Design
PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows.
Task 1 β Pairwise Ranking (Easy)
The agent is shown a prompt and two LLM responses (A and B), and must pick the better one.
Observation fields: prompt, response_a, response_b
Action:
PairwiseAction(
choice="A", # "A" | "B" | "tie" | "skip"
justification="..." # optional, not used for grading
)
Grading (vs HH-RLHF gold label):
| Agent choice | Outcome | Reward |
|---|---|---|
| Correct (matches gold) | ✅ | +1.0 |
| `skip` | ⚠️ Abstain | +0.3 |
| `tie` (when gold is clear) | ⚠️ Hedging | +0.1 |
| Wrong choice | ❌ | +0.0 |
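The reward table can be read as a small lookup. The function below is a sketch of that reading, not the project's `grade_pairwise()`; the name `pairwise_reward` and the order of the checks are assumptions. In particular, it treats `tie` as correct when the gold label itself is `tie`, which the table implies but does not state.

```python
def pairwise_reward(choice: str, gold_label: str) -> float:
    """Map an agent choice and a gold label to the tabulated reward."""
    if choice == "skip":
        return 0.3  # abstaining earns a small, fixed reward
    if choice == gold_label:
        return 1.0  # exact match with the gold label
    if choice == "tie":
        return 0.1  # hedging when the gold label is a clear A or B
    return 0.0      # picked the wrong response
```

For example, `pairwise_reward("tie", "A")` gives 0.1, while `pairwise_reward("B", "A")` gives 0.0, so hedging is penalized slightly less than a wrong answer.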
Task 2 β Multi-Axis Likert Scoring (Medium)
The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes.
Observation fields: prompt, response, rubric
Action:
LikertAction(
    helpfulness=4,            # 1-5
    honesty=5,                # 1-5
    harmlessness=5,           # 1-5
    instruction_following=4   # 1-5
)
Grading (vs UltraFeedback gold scores):
reward = 1.0 − (MAE / 4.0)
where MAE = mean absolute error across all 4 axes
      4.0 = maximum possible error per axis
Perfect match → reward = 1.0
Off by 1 each → reward = 0.75
Off by 2 each → reward = 0.50
Worst case → reward = 0.0
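The formula is easy to check by hand. The sketch below reproduces the four reference points above; `likert_reward` and the dictionary layout are assumptions made for illustration, and only the axis names and the formula come from this README.

```python
AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")
MAX_ERROR = 4.0  # largest possible gap on a 1-5 scale


def likert_reward(agent: dict, gold: dict) -> float:
    """Return 1 - MAE / 4, where MAE is averaged over the four axes."""
    mae = sum(abs(agent[axis] - gold[axis]) for axis in AXES) / len(AXES)
    return 1.0 - mae / MAX_ERROR


# Gold scores of 5 on every axis, compared with agents that are off by 0, 1, 2, and 4.
gold = {axis: 5 for axis in AXES}
for score in (5, 4, 3, 1):
    agent = {axis: score for axis in AXES}
    print(score, likert_reward(agent, gold))
```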
Task 3 β Transitive Consistency Ranking (Hard)
The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity.
Observation fields: prompt, response_a, response_b, response_c, response_d
Action:
ConsistencyAction(
    ranking=["B", "A", "D", "C"]  # best → worst, all 4 required
)
Grading (Kendall-τ + Transitivity bonus):
reward = α × kendall_tau + β × transitivity_score
kendall_tau: normalized rank correlation vs gold ranking
    range [−1.0, +1.0], clipped to [0, 1]
transitivity_score: fraction of (A>B, B>C → A>C) triplets satisfied
    penalizes logically inconsistent rankings
α = 0.7, β = 0.3 (weighted combination)
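One way to read this formula is sketched below. It is not the project's `grade_consistency()`: the function names are hypothetical, and the transitivity term is computed over the pairwise preferences implied by the submitted ranking, which is an assumption. Under that reading a single strict ranking always scores 1.0 on transitivity, so the term only varies when preferences come from separate pairwise judgments.

```python
from itertools import combinations, permutations


def kendall_tau(ranking, gold):
    """Kendall rank correlation between two orderings of the same items."""
    pos = {item: i for i, item in enumerate(ranking)}
    gold_pos = {item: i for i, item in enumerate(gold)}
    concordant = discordant = 0
    for x, y in combinations(gold, 2):
        # A pair is concordant when both orderings place x and y the same way round.
        if (pos[x] - pos[y]) * (gold_pos[x] - gold_pos[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def transitivity_score(prefers):
    """Fraction of (a>b, b>c) chains for which a>c also holds."""
    items = {item for pair in prefers for item in pair}
    checked = satisfied = 0
    for a, b, c in permutations(items, 3):
        if (a, b) in prefers and (b, c) in prefers:
            checked += 1
            satisfied += (a, c) in prefers
    return satisfied / checked if checked else 1.0


def consistency_reward(ranking, gold, alpha=0.7, beta=0.3):
    """alpha * clipped Kendall tau + beta * transitivity."""
    tau = max(0.0, kendall_tau(ranking, gold))  # clip [-1, 1] to [0, 1]
    prefers = {(ranking[i], ranking[j])
               for i in range(len(ranking)) for j in range(i + 1, len(ranking))}
    return alpha * tau + beta * transitivity_score(prefers)
```

With these definitions a perfect ranking scores about 1.0 and a fully reversed one about 0.3, because the clipped tau contributes nothing and only the transitivity term remains.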
Reward Functions
| Task | Formula | Range |
|---|---|---|
| Pairwise | Exact match reward table | {0.0, 0.1, 0.3, 1.0} |
| Likert | `1 − mean(abs(agent_score − gold_score)) / 4.0` | [0.0, 1.0] |
| Consistency | 0.7 × Kendall-τ + 0.3 × Transitivity | [0.0, 1.0] |
All rewards are bounded [0, 1] and emitted at every step (dense signal).
Datasets
| Dataset | Task | Source | Samples |
|---|---|---|---|
| HH-RLHF | Pairwise | Anthropic | ~160K pairs |
| UltraFeedback | Likert | OpenBMB | ~64K responses |
| Stanford SHP | Consistency | Stanford | ~385K pairs |
Download all datasets:
python scripts/prepare_datasets.py
# or with custom sample count:
python scripts/prepare_datasets.py --samples 5000
If JSON files are absent, the environment automatically uses built-in synthetic data β no download needed for local development.
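The fallback behaviour can be pictured as "load the file if it exists, otherwise use built-in examples". The sketch below assumes each JSON file holds a list of example objects; `load_examples` and `SYNTHETIC_PAIRWISE` are hypothetical names, and the single synthetic example is made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical built-in data; the real synthetic examples are not shown in this README.
SYNTHETIC_PAIRWISE = [
    {
        "prompt": "Explain backpropagation.",
        "response_a": "A clear, step-by-step explanation.",
        "response_b": "An off-topic answer.",
        "gold_label": "A",
    },
]


def load_examples(path, fallback):
    """Return examples from a JSON file, or the fallback if the file is missing or empty."""
    file = Path(path)
    if not file.is_file():
        return list(fallback)
    with file.open(encoding="utf-8") as handle:
        data = json.load(handle)
    return data if data else list(fallback)


examples = load_examples("data/pairwise_data.json", SYNTHETIC_PAIRWISE)
print(len(examples), "examples loaded")
```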
Quick Start
Prerequisites
- Python 3.10+
- git
Local Development
# 1. Clone
git clone https://github.com/SIBAM890/preferencelab.git
cd preferencelab
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt
# 4. (Optional) Download real datasets
python scripts/prepare_datasets.py
# 5. Start the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
Open http://localhost:8000/web for the interactive Gradio playground.
Verify the server is running
curl http://localhost:8000/health
# → {"status":"healthy"}
curl http://localhost:8000/schema
# → full action / observation JSON schema
Run the baseline inference script
# Set your API credentials (or use any OpenAI-compatible endpoint)
export HF_TOKEN=hf_your_token_here
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
python inference.py
Expected output format:
[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP] step=1 action=choice=A reward=1.00 done=false error=null
[STEP] step=2 action=choice=B reward=0.00 done=false error=null
[STEP] step=3 action=choice=A reward=1.00 done=false error=null
[STEP] step=4 action=choice=A reward=1.00 done=false error=null
[STEP] step=5 action=choice=B reward=0.00 done=true error=null
[END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
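Because every line is a tag followed by `key=value` tokens, the log is simple to parse. The helper below is a sketch with a hypothetical name, `parse_log_line`; it assumes values contain no spaces, as in the sample above. It also checks that the sample `score` equals the mean of the listed rewards, which holds for this sample but is not stated as the general rule.

```python
import re


def parse_log_line(line: str) -> tuple[str, dict[str, str]]:
    """Split a '[TAG] key=value ...' line into its tag and a field dictionary."""
    match = re.match(r"\[(START|STEP|END)\]\s*(.*)", line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    tag, rest = match.groups()
    fields = {}
    for token in rest.split():
        # Split on the first '=' only, so 'action=choice=A' keeps 'choice=A' as its value.
        key, _, value = token.partition("=")
        fields[key] = value
    return tag, fields


tag, fields = parse_log_line(
    "[END] success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00"
)
rewards = [float(value) for value in fields["rewards"].split(",")]
mean_reward = sum(rewards) / len(rewards)
print(tag, fields["score"], mean_reward)
```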
Run tests
pytest tests/ -v
# 25 test cases: reset, step, state, graders, concurrency, reproducibility
Environment Variables
| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | (none) | Hugging Face API token for LLM inference |
| `API_BASE_URL` | `https://api-inference.huggingface.co/v1` | LLM API endpoint (any OpenAI-compatible URL) |
| `MODEL_NAME` | `meta-llama/Llama-3.1-8B-Instruct` | Model identifier sent to the API |
| `MAX_CONCURRENT_ENVS` | `64` | Maximum parallel WebSocket sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Mount Gradio UI at /web |
| `ENV_BASE_URL` | `http://localhost:8000` | PreferenceLab server URL (for remote clients) |
| `ENV_README_PATH` | (none) | Custom path to README for web interface |
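A client script might read these variables as follows. This is a sketch only: the `Settings` dataclass and `load_settings` are hypothetical, and the way the server itself parses the values (for example, which strings count as true) is an assumption. Only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    api_base_url: str
    model_name: str
    max_concurrent_envs: int
    enable_web_interface: bool
    env_base_url: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables, falling back to the documented defaults."""
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        api_base_url=env.get("API_BASE_URL", "https://api-inference.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        max_concurrent_envs=int(env.get("MAX_CONCURRENT_ENVS", "64")),
        # Assumption: only the string "true" (case-insensitive) enables the UI.
        enable_web_interface=env.get("ENABLE_WEB_INTERFACE", "true").lower() == "true",
        env_base_url=env.get("ENV_BASE_URL", "http://localhost:8000"),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `load_settings({"MAX_CONCURRENT_ENVS": "8"})`.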
API Reference
REST Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Server health check |
| `GET` | `/schema` | Action + Observation JSON schemas |
| `GET` | `/state` | Current episode state |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Submit an action, receive observation |
| `GET` | `/web` | Gradio interactive playground |
| `GET` | `/manifest.json` | PWA web manifest |
POST /reset
{
"seed": 42,
"task_type": "pairwise"
}
task_type accepts "pairwise", "likert", or "consistency"; omit it to pick a task at random.
POST /step
{
"action": {
"choice": "A"
}
}
Response (all step/reset endpoints)
{
"observation": {
"task_id": "abc123_step1",
"task_type": "pairwise",
"prompt": "Explain backpropagation.",
"response_a": "...",
"response_b": "...",
"reward": 1.0,
"done": false,
"step_count": 1,
"info": { "verdict": "correct", "gold_label": "A" }
},
"reward": 1.0,
"done": false
}
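The request and response shapes above can be exercised from Python with only the standard library. The helper names below (`post_json`, `reset_payload`, `step_payload`, `summarize`) are hypothetical; the payload and response fields follow the JSON shown in this section, and the presence of `info.verdict` on every response is an assumption, so it is read defensively.

```python
import json
import urllib.request


def post_json(base_url, path, payload):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def reset_payload(task_type=None, seed=None):
    """Build the /reset body; omitted keys are left for the server to choose."""
    payload = {}
    if seed is not None:
        payload["seed"] = seed
    if task_type is not None:
        payload["task_type"] = task_type
    return payload


def step_payload(choice):
    """Build the /step body for a pairwise action."""
    return {"action": {"choice": choice}}


def summarize(body):
    """Pull reward, done flag, and verdict (if present) out of a response body."""
    info = body["observation"].get("info") or {}
    return body["reward"], body["done"], info.get("verdict")


# Usage against a running local server (not executed here):
#   post_json("http://localhost:8000", "/reset", reset_payload("pairwise", seed=42))
#   body = post_json("http://localhost:8000", "/step", step_payload("A"))
#   print(summarize(body))
```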
WebSocket
ws://localhost:8000/ws
OpenEnv WebSocket protocol: send reset, step, state, and close messages. Used by TRL training loops via MCPToolClient.
Integration Guide
Direct Import (Local)
from server.environment import PreferenceLabEnvironment
from models import PairwiseAction, LikertAction, ConsistencyAction
env = PreferenceLabEnvironment()
# Pairwise task
obs = env.reset(seed=42, task_type="pairwise")
print(obs.prompt)
obs = env.step(PairwiseAction(choice="A"))
print(obs.reward, obs.done)
# State (property, not method)
state = env.state
print(state.episode_id, state.step_count)
Using with TRL / GRPO Training
import asyncio
from openenv.core.env_client import EnvClient
from models import PairwiseAction

# your_policy and train_on are placeholders for your own policy and update step.
async def train():
    async with EnvClient("http://localhost:8000") as env:
        obs = await env.reset(task_type="pairwise")
        for step in range(5):
            # Your policy predicts the action
            action = PairwiseAction(choice=your_policy(obs))
            obs = await env.step(action)
            reward = obs.reward
            done = obs.done
            train_on(obs, reward)
            if done:
                break

asyncio.run(train())
MultiEnv Wrapper (Parallel Sessions)
import asyncio
from openenv.core.env_client import MultiEnvClient

async def main():
    # Spin up 8 parallel sessions on the same server
    async with MultiEnvClient("http://localhost:8000", n=8) as envs:
        observations = await envs.reset_all(task_type="pairwise")
        # envs.step_all(actions) → list of observations

asyncio.run(main())
Baseline Scores
Scores produced by python inference.py with meta-llama/Llama-3.1-8B-Instruct:
| Task | Difficulty | Avg Reward | Notes |
|---|---|---|---|
| Pairwise Ranking | Easy | ~0.60 | Varies by model capability |
| Likert Scoring | Medium | ~0.75 | Continuous signal |
| Consistency Ranking | Hard | ~0.65 | Kendall-tau based |
| Overall | n/a | ~0.67 | Reproducible with seed=42 |
Higher scores indicate the model aligns more closely with human preference gold labels. Run `python inference.py` to generate fresh scores against your own model.
Testing
Run tests
# Full test suite
pytest tests/ -v
# Specific test classes
pytest tests/test_environment.py::TestPreferenceLabGraders -v
pytest tests/test_environment.py::TestEpisodeManagement -v
# Quick smoke test (direct import, no server needed)
python test_api.py
Test coverage: 22 test cases across 4 classes:
- TestPairwiseGrader: correct / wrong / skip / tie / range (5 tests)
- TestLikertGrader: perfect / worst / partial / random range (4 tests)
- TestConsistencyGrader: perfect / reversed / invalid IDs / all perms / no-tie (5 tests)
- TestPreferenceLabEnvironment: reset / step / state / seed / episode flow (8 tests)
Deployment
Docker (Local)
docker build -t preferencelab .
docker run -p 7860:7860 \
-e HF_TOKEN=hf_your_token \
-e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
-e MAX_CONCURRENT_ENVS=64 \
preferencelab
Visit http://localhost:7860/web
Hugging Face Spaces
- Fork or push this repository to a Hugging Face Space with Docker SDK
- Add the following secrets in Space Settings:
  - HF_TOKEN: your Hugging Face token
  - API_BASE_URL: inference endpoint (e.g. https://api-inference.huggingface.co/v1)
  - MODEL_NAME: model to use
The Dockerfile handles everything else. Health check polls /health every 30 seconds.
License
This project is licensed under the MIT License; see LICENSE for details.
Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon
Team Nexis