
title: PreferenceLab
emoji: 🧪
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - openenv
  - rlhf
  - preference-learning
license: mit

🧪 PreferenceLab

An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline

Built for the Meta × Hugging Face OpenEnv Hackathon — Team Nexis

🚀 Live Space	Dev-CrafterX/preference-lab

Overview
Why PreferenceLab?
System Architecture
File Architecture
Task Design
Reward Functions
Datasets
Quick Start
Environment Variables
API Reference
Integration Guide
Baseline Scores
Testing
Deployment
License

Overview

PreferenceLab is a production-grade OpenEnv environment that teaches AI agents to judge LLM response quality — exactly as human annotators do during RLHF (Reinforcement Learning from Human Feedback) pipelines.

Instead of expensive, slow human annotators, PreferenceLab provides:

Feature	Details
✅ Deterministic grading	Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP)
✅ Dense reward signals	Reward at every annotation step, not just episode-end
✅ Three difficulty levels	Pairwise → Likert scoring → Transitive 4-way ranking
✅ Synthetic fallback	Zero-dependency offline testing with built-in data
✅ Concurrent sessions	Up to 64 parallel RL training sessions by default
✅ Reproducible episodes	Fully seeded random sampling
✅ Web playground	Gradio UI at `/web` for interactive testing

Why PreferenceLab?

To our knowledge, no existing OpenEnv environment simulates the RLHF data collection pipeline, the same pipeline that produces the alignment data used to fine-tune models like Llama 3, Claude, and GPT-4.

Pain Point	PreferenceLab Solution
Human annotators are slow & expensive	AI agent replaces the annotator role
Binary end-of-episode rewards → sparse gradients	Every step yields a graded reward signal
Single-task environments limit curriculum learning	Three tasks of increasing complexity
Hard-to-reproduce evaluations	Seeded episodes are fully deterministic
Local dev blocked by API dependencies	Built-in synthetic fallback datasets
No visual interface for debugging	Gradio playground at `/web`

System Architecture

🏗️ Component Architecture

flowchart TB
    subgraph Clients["Clients and Consumers"]
        A1["AI Agent<br/>GRPO / TRL Training"]
        A2["Baseline Inference<br/>inference.py"]
        A3["Gradio Web UI<br/>/web"]
        A4["REST / WebSocket<br/>Direct API"]
    end

    subgraph Platform["Hugging Face Space — Docker Container"]
        subgraph FastAPI["FastAPI Server — server/app.py"]
            EP1["/reset POST"]
            EP2["/step POST"]
            EP3["/state GET"]
            EP4["/health GET"]
            EP5["/web Gradio"]
        end

        subgraph EnvCore["PreferenceLabEnvironment — server/environment.py"]
            RESET["reset()<br/>seed · task_type · episode_id"]
            STEP["step()<br/>grade action → reward → sample next"]
            STATE["state @property<br/>returns State object"]
        end

        subgraph Graders["Deterministic Graders"]
            G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"]
            G2["Task 2 · Likert<br/>1 − MAE / 4.0"]
            G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"]
        end

        subgraph DataStore["Data Layer — data/"]
            D1["pairwise_data.json<br/>HH-RLHF"]
            D2["likert_data.json<br/>UltraFeedback"]
            D3["consistency_data.json<br/>Stanford SHP"]
            D4["Synthetic Fallback<br/>built-in, always available"]
        end
    end

    subgraph Models["Pydantic Models — models.py"]
        M1["PairwiseAction / Observation"]
        M2["LikertAction / Observation"]
        M3["ConsistencyAction / Observation"]
    end

    LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"]

    A1 -- "HTTP / WebSocket" --> FastAPI
    A2 -- "Direct import" --> EnvCore
    A3 --> EP5
    A4 --> FastAPI

    EP1 --> RESET
    EP2 --> STEP
    EP3 --> STATE
    EP5 --> RESET
    EP5 --> STEP

    RESET --> Graders
    STEP --> Graders
    Graders --> G1
    Graders --> G2
    Graders --> G3

    EnvCore --> DataStore
    D1 -.->|fallback| D4
    D2 -.->|fallback| D4
    D3 -.->|fallback| D4

    Models --> Graders
    A2 -- "OpenAI client" --> LLM

    classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px
    classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px
    classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px

    class A1,A2,A3,A4 client
    class G1,G2,G3 grader
    class D1,D2,D3,D4 data
    class M1,M2,M3 model
    class LLM external
    class EP1,EP2,EP3,EP4,EP5 endpoint
    class RESET,STEP,STATE env

🔄 Request Lifecycle — Data Flow

sequenceDiagram
    autonumber
    actor Agent as AI Agent / TRL Trainer
    participant API as FastAPI Server
    participant Env as PreferenceLabEnvironment
    participant Grader as Deterministic Grader
    participant DB as Dataset

    Note over Agent,DB: Episode Start

    Agent->>API: POST /reset  task_type=pairwise  seed=42
    API->>Env: env.reset(task_type, seed)
    Env->>DB: _sample_example(rng)
    DB-->>Env: prompt, response_a, response_b, gold_label
    Env-->>API: PairwiseObservation  reward=0.0  done=false
    API-->>Agent: 200 OK  Observation JSON

    Note over Agent,DB: Step Loop — max 10 steps per episode

    loop For each annotation step
        Agent->>Agent: call_llm(system_prompt, observation)
        Agent->>API: POST /step  action: choice=A
        API->>Env: env.step(PairwiseAction)
        Env->>Grader: grade_pairwise(action, example)
        Grader->>Grader: compare choice vs gold_label
        Grader-->>Env: reward=0.99  verdict=correct
        Env->>DB: _sample_example  next example
        DB-->>Env: next example
        Env-->>API: Observation  reward=0.99  done=false  step=N
        API-->>Agent: 200 OK  StepResult JSON
        Agent->>Agent: log_step  accumulate reward
    end

    Note over Agent,DB: Episode End

    Env-->>API: Observation  done=true  step_count=10
    API-->>Agent: 200 OK  Final Observation
    Agent->>Agent: log_end  score  rewards
    Agent->>API: POST /reset  start new episode

🧭 User Flow

flowchart TD
    START(["Start"])

    subgraph Setup["Setup Phase"]
        S1["Clone repository<br/>git clone"]
        S2["Install dependencies<br/>pip install -r requirements.txt"]
        S3{"Need real<br/>datasets?"}
        S4["Download datasets<br/>python scripts/prepare_datasets.py"]
        S5["Use synthetic fallback<br/>built-in — no download needed"]
        S6["Set environment vars<br/>HF_TOKEN  MODEL_NAME  API_BASE_URL"]
    end

    subgraph Deploy["Choose Deployment"]
        D1{"Mode?"}
        D2["Local Dev<br/>uvicorn server.app:app --port 8000"]
        D3["Docker<br/>docker build and docker run"]
        D4["HF Space<br/>git push to HuggingFace"]
    end

    subgraph Usage["Choose Usage Mode"]
        U1{"How to use?"}
        U2["Run Baseline<br/>python inference.py"]
        U3["Web Playground<br/>localhost:8000/web"]
        U4["REST API Integration<br/>HTTP + WebSocket"]
        U5["Run Tests<br/>pytest tests/ -v"]
        U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"]
    end

    subgraph Episode["Episode Loop"]
        E1["POST /reset<br/>choose task_type and seed"]
        E2{"Task Type?"}
        E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"]
        E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"]
        E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"]
        E6["POST /step<br/>submit action"]
        E7["Receive Observation<br/>reward and done flag embedded"]
        E8{"done == true?"}
        E9["Next step<br/>new example sampled automatically"]
        E10["Episode complete<br/>log_end  avg reward computed"]
    end

    START --> S1 --> S2 --> S3
    S3 -->|Yes| S4 --> S6
    S3 -->|No| S5 --> S6
    S6 --> D1

    D1 -->|Local| D2
    D1 -->|Docker| D3
    D1 -->|Cloud| D4

    D2 & D3 & D4 --> U1

    U1 -->|Baseline| U2
    U1 -->|Interactive| U3
    U1 -->|Custom| U4
    U1 -->|Tests| U5
    U1 -->|Training| U6

    U2 & U3 & U4 & U6 --> E1

    E1 --> E2
    E2 -->|pairwise| E3
    E2 -->|likert| E4
    E2 -->|consistency| E5
    E3 & E4 & E5 --> E6 --> E7 --> E8
    E8 -->|No| E9 --> E6
    E8 -->|Yes| E10
    E10 -->|New Episode| E1
    E10 -->|Done| FINISH(["Complete"])

    classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px
    classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px

    class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step
    class S3,D1,U1,E2,E8 decision
    class E3 task1
    class E4 task2
    class E5 task3
    class START,FINISH terminal

☁️ Deployment Architecture

flowchart LR
    subgraph Dev["Developer Machine"]
        CODE["Source Code<br/>preference-lab/"]
        GIT["git push"]
        CODE --> GIT
    end

    subgraph Space["Hugging Face Space — Docker SDK"]
        SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"]
        CONTAINER["Docker Container<br/>python:3.10-slim"]
        UVICORN["uvicorn server.app:app<br/>host 0.0.0.0  port 8000"]
        WEB["Gradio UI<br/>/web"]
        REST["REST API<br/>/reset  /step  /state"]
        HEALTH["Health Check<br/>/health  every 30s"]

        CONTAINER --> UVICORN
        UVICORN --> WEB
        UVICORN --> REST
        UVICORN --> HEALTH
        SECRETS -.->|env vars injected| CONTAINER
    end

    PUBURL["Public URL<br/>https://username-preflab.hf.space"]

    subgraph LLMApi["HF Inference API"]
        MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"]
    end

    subgraph Consumers["Consumers"]
        U1["TRL / GRPO<br/>Training Loop"]
        U2["Developer<br/>Browser"]
        U3["inference.py<br/>Baseline Script"]
        U4["MCPToolClient<br/>PreferenceLabEnv"]
    end

    GIT --> Space
    Space --> PUBURL

    U1 -- "WebSocket / OpenEnv" --> REST
    U2 -- "HTTPS" --> WEB
    U3 -- "Direct import" --> UVICORN
    U4 -- "HTTP / MCP" --> REST

    REST -- "OpenAI client" --> MODEL

    classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px
    classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px
    classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px

    class PUBURL puburl
    class CONTAINER,UVICORN docker
    class U1,U2,U3,U4 consumer
    class MODEL llm
    class SECRETS secret
    class CODE,GIT dev
    class WEB,REST,HEALTH hf

File Architecture

preference-lab/
│
├── 📄 README.md                        ← You are here
├── 📄 LICENSE
├── 📄 .gitignore
├── 📄 .dockerignore
│
├── 📄 openenv.yaml                     ← OpenEnv manifest
│   │                                     runtime: fastapi
│   │                                     app: server.app:app
│   │                                     port: 8000
│   │                                     type: space
│   │
├── 📄 Dockerfile                       ← HF Spaces production image
│   │                                     Base: python:3.10-slim
│   │                                     CMD: uvicorn server.app:app
│   │                                     HEALTHCHECK: polls /health every 30s
│   │
├── 📄 requirements.txt                 ← Flat pip dependency list
│   │                                     openenv-core, fastapi, uvicorn,
│   │                                     pydantic, openai, datasets,
│   │                                     httpx, websockets, gradio
│   │
├── 📄 pyproject.toml                   ← Build config + project metadata
│   │                                     (setuptools, same deps as above)
│   │
├── 📄 __init__.py                      ← Package entry point
│   │                                     Exports: PreferenceLabEnv,
│   │                                     PairwiseAction, LikertAction,
│   │                                     ConsistencyAction + all Observations
│   │
├── 📄 models.py                        ← Pydantic v2 data models
│   │                                     Defines the agent ↔ env contract
│   │
│   │   ACTIONS                           OBSERVATIONS
│   │   ─────────────────────────────     ─────────────────────────────────
│   │   PairwiseAction                    PairwiseObservation
│   │     .choice: A|B|tie|skip            .prompt, .response_a, .response_b
│   │     .justification: str?             .reward, .done, .step_count
│   │                                     ─────────────────────────────────
│   │   LikertAction                      LikertObservation
│   │     .helpfulness: 1-5               .prompt, .response
│   │     .honesty: 1-5                   .rubric, .reward, .done
│   │     .harmlessness: 1-5              ─────────────────────────────────
│   │     .instruction_following: 1-5     ConsistencyObservation
│   │                                      .prompt
│   │   ConsistencyAction                  .response_a, .response_b
│   │     .ranking: list[str] (len=4)      .response_c, .response_d
│   │                                      .reward, .done
│   │
├── 📄 client.py                        ← PreferenceLabEnv client wrapper
│   │                                     Thin sync/async wrapper around
│   │                                     openenv.core.MCPToolClient
│   │
├── 📄 inference.py                     ← Baseline LLM inference script
│   │                                     Mandatory stdout format:
│   │                                     [START] task= env= model=
│   │                                     [STEP]  step= action= reward= done=
│   │                                     [END]   success= steps= score=
│   │
├── 📄 test_api.py                      ← Quick smoke-test (direct import)
│   │                                     Tests all 3 tasks in sequence
│   │
├── server/                             ← Core server package
│   │
│   ├── 📄 __init__.py
│   │
│   ├── 📄 app.py                       ← FastAPI application factory
│   │                                     ENABLE_WEB_INTERFACE=true → Gradio
│   │                                     MAX_CONCURRENT_ENVS=64
│   │                                     Routes: /manifest.json, /.well-known/
│   │
│   └── 📄 environment.py               ← Core OpenEnv environment
│                                         PreferenceLabEnvironment(Environment)
│                                         SUPPORTS_CONCURRENT_SESSIONS = True
│                                         ─────────────────────────────────────
│                                         reset(seed, task_type, **kwargs)
│                                           → Observation
│                                         step(action)
│                                           → Observation  [reward & done inline]
│                                         state @property
│                                           → State(episode_id, step_count, ...)
│                                         ─────────────────────────────────────
│                                         Graders (internal):
│                                           grade_pairwise()   → +1.0 / 0.3 / 0.1 / 0.0
│                                           grade_likert()     → 1 − MAE/4.0
│                                           grade_consistency()→ Kendall-τ + transitivity
│
├── data/                               ← Dataset files (git-ignored)
│   ├── 📄 pairwise_data.json            HH-RLHF gold labels
│   ├── 📄 likert_data.json              UltraFeedback multi-axis scores
│   └── 📄 consistency_data.json         Stanford SHP ranking pairs
│                                         (All 3 auto-fallback to synthetic
│                                          data if files are missing)
│
├── scripts/
│   └── 📄 prepare_datasets.py          ← Downloads & formats datasets
│                                         from Hugging Face Hub
│                                         Usage: python scripts/prepare_datasets.py
│
└── tests/
    └── 📄 test_environment.py          ← pytest test suite
                                          25 test cases covering:
                                          reset / step / state / graders
                                          concurrent sessions / reproducibility

Task Design

PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows.

Task 1 — Pairwise Ranking (Easy)

The agent is shown a prompt and two LLM responses (A and B), and must pick the better one.

Observation fields: prompt, response_a, response_b

Action:

PairwiseAction(
    choice="A",           # "A" | "B" | "tie" | "skip"
    justification="..."   # optional, not used for grading
)

Grading (vs HH-RLHF gold label):

Agent choice	Outcome	Reward
Correct (matches gold)	✅	`+1.0`
`skip`	⚠️ Abstain	`+0.3`
`tie` (when gold is clear)	⚠️ Hedging	`+0.1`
Wrong choice	❌	`+0.0`
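The reward table above can be read as a small lookup. The following is a minimal sketch, not the repository's grader: the function name `grade_pairwise_sketch` is hypothetical, and it assumes that an exact match with the gold label always counts as correct (including a `tie` when the gold label is itself `tie`).

```python
def grade_pairwise_sketch(choice: str, gold_label: str) -> float:
    """Map an agent choice to the reward listed in the table above."""
    if choice == gold_label:
        return 1.0  # correct: matches the gold label
    if choice == "skip":
        return 0.3  # abstained
    if choice == "tie":
        return 0.1  # hedged although the gold label is clear
    return 0.0      # wrong choice


if __name__ == "__main__":
    for choice in ("A", "skip", "tie", "B"):
        print(choice, grade_pairwise_sketch(choice, gold_label="A"))
```

Checking for an exact match first keeps the partial-credit cases (`skip`, `tie`) from overriding a correct answer.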

Task 2 — Multi-Axis Likert Scoring (Medium)

The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes.

Observation fields: prompt, response, rubric

Action:

LikertAction(
    helpfulness=4,           # 1–5
    honesty=5,               # 1–5
    harmlessness=5,          # 1–5
    instruction_following=4  # 1–5
)

Grading (vs UltraFeedback gold scores):

reward = 1.0 − (MAE / 4.0)

where MAE = mean absolute error across all 4 axes
      4.0  = maximum possible error per axis

Perfect match  → reward = 1.0
Off by 1 each  → reward = 0.75
Off by 2 each  → reward = 0.50
Worst case     → reward = 0.0
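The formula above can be checked with a few lines of Python. This is an illustrative sketch only: `grade_likert_sketch` and the `AXES` tuple are hypothetical names, and scores are passed as plain dictionaries rather than the repository's `LikertAction` model.

```python
AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")


def grade_likert_sketch(agent: dict, gold: dict) -> float:
    """Return 1 - MAE / 4, where MAE is the mean absolute error over the four axes."""
    mae = sum(abs(agent[axis] - gold[axis]) for axis in AXES) / len(AXES)
    return 1.0 - mae / 4.0  # 4.0 is the largest possible error on a 1-5 scale


if __name__ == "__main__":
    gold = {axis: 5 for axis in AXES}
    off_by_one = {axis: 4 for axis in AXES}
    worst = {axis: 1 for axis in AXES}
    print(grade_likert_sketch(gold, gold))        # perfect match
    print(grade_likert_sketch(off_by_one, gold))  # off by 1 on every axis
    print(grade_likert_sketch(worst, gold))       # maximum error on every axis
```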

Task 3 — Transitive Consistency Ranking (Hard)

The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity.

Observation fields: prompt, response_a, response_b, response_c, response_d

Action:

ConsistencyAction(
    ranking=["B", "A", "D", "C"]  # best → worst, all 4 required
)

Grading (Kendall-τ + Transitivity bonus):

reward = α × kendall_tau + β × transitivity_score

kendall_tau:        normalized rank correlation vs gold ranking
                    range [−1.0, +1.0], clipped to [0, 1]

transitivity_score: fraction of (A>B, B>C → A>C) triplets satisfied
                    penalizes logically inconsistent rankings

α = 0.7,  β = 0.3   (weighted combination)
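As a rough illustration of the weighted combination above, the sketch below computes Kendall-τ for two rankings of the same items, clips it to [0, 1], and mixes it with a transitivity score. The function names are hypothetical. The README does not spell out how `transitivity_score` is derived, so this sketch takes it as a parameter (defaulting to 1.0) rather than guessing at the repository's logic.

```python
from itertools import combinations


def kendall_tau(ranking: list, gold: list) -> float:
    """Kendall rank correlation between two orderings of the same items (no ties)."""
    pos_r = {item: i for i, item in enumerate(ranking)}
    pos_g = {item: i for i, item in enumerate(gold)}
    concordant = discordant = 0
    for x, y in combinations(gold, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_r[x] - pos_r[y]) * (pos_g[x] - pos_g[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def consistency_reward(ranking, gold, transitivity_score=1.0, alpha=0.7, beta=0.3):
    """alpha * clipped Kendall-tau + beta * transitivity_score."""
    tau = max(0.0, kendall_tau(ranking, gold))  # clip negative correlation to 0
    return alpha * tau + beta * transitivity_score


if __name__ == "__main__":
    gold = ["A", "B", "C", "D"]
    print(consistency_reward(["A", "B", "C", "D"], gold))  # identical ranking
    print(consistency_reward(["B", "A", "C", "D"], gold))  # one adjacent swap
    print(consistency_reward(["D", "C", "B", "A"], gold))  # fully reversed
```

With four items there are six pairs, so a single adjacent swap gives τ = (5 − 1) / 6 ≈ 0.67, while a fully reversed ranking gives τ = −1, which is clipped to 0.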

Reward Functions

Task	Formula	Range
Pairwise	Exact match reward table	`{0.0, 0.1, 0.3, 1.0}`
Likert	`1 − mean(abs(agent_score − gold_score)) / 4.0`	`[0.0, 1.0]`
Consistency	`0.7 × Kendall-τ + 0.3 × Transitivity`	`[0.0, 1.0]`

All rewards are bounded [0, 1] and emitted at every step (dense signal).

Datasets

Dataset	Task	Source	Samples
HH-RLHF	Pairwise	Anthropic	~160K pairs
UltraFeedback	Likert	OpenBMB	~64K responses
Stanford SHP	Consistency	Stanford	~385K pairs

Download all datasets:

python scripts/prepare_datasets.py
# or with custom sample count:
python scripts/prepare_datasets.py --samples 5000

If JSON files are absent, the environment automatically uses built-in synthetic data — no download needed for local development.
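The fallback behaviour described above could look roughly like the sketch below. It is an assumption about the pattern, not the repository's loader: `load_examples` and `SYNTHETIC_PAIRWISE` are hypothetical names, and the synthetic record is made up for illustration, using the pairwise fields named elsewhere in this README.

```python
import json
from pathlib import Path

# Made-up stand-in for built-in synthetic data (illustrative only).
SYNTHETIC_PAIRWISE = [
    {
        "prompt": "Explain backpropagation.",
        "response_a": "It applies the chain rule to propagate gradients backwards.",
        "response_b": "It is a kind of database.",
        "gold_label": "A",
    }
]


def load_examples(path, fallback):
    """Return the JSON content at `path` if the file exists, otherwise `fallback`."""
    file = Path(path)
    if file.is_file():
        with file.open(encoding="utf-8") as fh:
            return json.load(fh)
    return fallback


if __name__ == "__main__":
    examples = load_examples("data/pairwise_data.json", SYNTHETIC_PAIRWISE)
    print(f"loaded {len(examples)} example(s)")
```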

Quick Start

Prerequisites

Python 3.10+
git

Local Development

# 1. Clone
git clone https://github.com/SIBAM890/preferencelab.git
cd preferencelab

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Download real datasets
python scripts/prepare_datasets.py

# 5. Start the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

Open http://localhost:8000/web for the interactive Gradio playground.

Verify the server is running

curl http://localhost:8000/health
# → {"status":"healthy"}

curl http://localhost:8000/schema
# → full action / observation JSON schema

Run the baseline inference script

# Set your API credentials (or use any OpenAI-compatible endpoint)
export HF_TOKEN=hf_your_token_here
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct

python inference.py

Expected output format:

[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=choice=A reward=1.00 done=false error=null
[STEP]  step=2 action=choice=B reward=0.00 done=false error=null
[STEP]  step=3 action=choice=A reward=1.00 done=false error=null
[STEP]  step=4 action=choice=A reward=1.00 done=false error=null
[STEP]  step=5 action=choice=B reward=0.00 done=true  error=null
[END]   success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
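For downstream tooling, the `[STEP]` lines can be parsed with a regular expression. The sketch below is an assumption about how one might do that (the names `STEP_RE` and `parse_step_line` are hypothetical); it also shows that the `score` on the `[END]` line in the sample above equals the mean of the step rewards.

```python
import re

# Matches lines such as:
# [STEP]  step=1 action=choice=A reward=1.00 done=false error=null
STEP_RE = re.compile(
    r"\[STEP\]\s+step=(\d+)\s+action=(\S+)\s+reward=([\d.]+)"
    r"\s+done=(true|false)\s+error=(\S+)"
)


def parse_step_line(line: str) -> dict:
    """Parse one [STEP] log line into a dictionary."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    step, action, reward, done, error = match.groups()
    return {
        "step": int(step),
        "action": action,
        "reward": float(reward),
        "done": done == "true",
        "error": None if error == "null" else error,
    }


SAMPLE_LOG = [
    "[STEP]  step=1 action=choice=A reward=1.00 done=false error=null",
    "[STEP]  step=2 action=choice=B reward=0.00 done=false error=null",
    "[STEP]  step=3 action=choice=A reward=1.00 done=false error=null",
    "[STEP]  step=4 action=choice=A reward=1.00 done=false error=null",
    "[STEP]  step=5 action=choice=B reward=0.00 done=true  error=null",
]

if __name__ == "__main__":
    rewards = [parse_step_line(line)["reward"] for line in SAMPLE_LOG]
    print(f"score={sum(rewards) / len(rewards):.2f}")  # score=0.60
```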

Run tests

pytest tests/ -v
# 25 test cases — reset, step, state, graders, concurrency, reproducibility

Environment Variables

Variable	Default	Description
`HF_TOKEN`	(none)	Hugging Face API token for LLM inference
`API_BASE_URL`	`https://api-inference.huggingface.co/v1`	LLM API endpoint (any OpenAI-compatible URL)
`MODEL_NAME`	`meta-llama/Llama-3.1-8B-Instruct`	Model identifier sent to the API
`MAX_CONCURRENT_ENVS`	`64`	Maximum parallel WebSocket sessions
`ENABLE_WEB_INTERFACE`	`true`	Mount Gradio UI at `/web`
`ENV_BASE_URL`	`http://localhost:8000`	PreferenceLab server URL (for remote clients)
`ENV_README_PATH`	(none)	Custom path to README for web interface
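A client or script might read these variables with the defaults from the table, along the lines of the sketch below. The function `load_settings` and the dictionary keys are hypothetical, and the set of strings treated as "true" for `ENABLE_WEB_INTERFACE` is an assumption; the server may parse it differently.

```python
import os


def load_settings(env=None) -> dict:
    """Read PreferenceLab-related variables, applying the defaults from the table."""
    env = os.environ if env is None else env
    return {
        "hf_token": env.get("HF_TOKEN"),  # no default
        "api_base_url": env.get(
            "API_BASE_URL", "https://api-inference.huggingface.co/v1"
        ),
        "model_name": env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        "max_concurrent_envs": int(env.get("MAX_CONCURRENT_ENVS", "64")),
        # Assumption: "1", "true", and "yes" (case-insensitive) enable the UI.
        "enable_web_interface": env.get("ENABLE_WEB_INTERFACE", "true")
        .strip()
        .lower()
        in ("1", "true", "yes"),
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:8000"),
    }


if __name__ == "__main__":
    for key, value in load_settings().items():
        # Avoid printing the token itself.
        print(key, "=", "***" if key == "hf_token" and value else value)
```

Accepting an explicit mapping makes the function easy to exercise without touching the real process environment.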

API Reference

REST Endpoints

Method	Path	Description
`GET`	`/health`	Server health check
`GET`	`/schema`	Action + Observation JSON schemas
`GET`	`/state`	Current episode state
`POST`	`/reset`	Start a new episode
`POST`	`/step`	Submit an action, receive observation
`GET`	`/web`	Gradio interactive playground
`GET`	`/manifest.json`	PWA web manifest

POST /reset

{
  "seed": 42,
  "task_type": "pairwise"
}

task_type accepts: "pairwise" | "likert" | "consistency" | omit for random.

POST /step

{
  "action": {
    "choice": "A"
  }
}

Response (all step/reset endpoints)

{
  "observation": {
    "task_id": "abc123_step1",
    "task_type": "pairwise",
    "prompt": "Explain backpropagation.",
    "response_a": "...",
    "response_b": "...",
    "reward": 1.0,
    "done": false,
    "step_count": 1,
    "info": { "verdict": "correct", "gold_label": "A" }
  },
  "reward": 1.0,
  "done": false
}
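The request bodies above can be built and sent with the Python standard library alone. This is a minimal sketch under the assumption that the server is running at the given base URL and accepts exactly the JSON shapes shown in this section; the helper names are hypothetical, and `demo()` is defined but not called so that importing the snippet makes no network requests.

```python
import json
import urllib.request


def build_reset_payload(task_type=None, seed=None) -> dict:
    """Body for POST /reset; omitting task_type lets the server pick one at random."""
    payload = {}
    if seed is not None:
        payload["seed"] = seed
    if task_type is not None:
        payload["task_type"] = task_type
    return payload


def build_step_payload(**action_fields) -> dict:
    """Body for POST /step, e.g. build_step_payload(choice="A")."""
    return {"action": action_fields}


def post_json(base_url: str, path: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def demo(base_url: str = "http://localhost:8000") -> None:
    """Reset, take one pairwise step, and print the reward (needs a live server)."""
    first = post_json(base_url, "/reset", build_reset_payload("pairwise", seed=42))
    print(first["observation"]["prompt"])
    result = post_json(base_url, "/step", build_step_payload(choice="A"))
    print(result["reward"], result["done"])
```

Call `demo()` once the server from the Quick Start section is up.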

WebSocket

ws://localhost:8000/ws

OpenEnv WebSocket protocol — send reset, step, state, close messages. Used by TRL training loops via MCPToolClient.

Integration Guide

Direct Import (Local)

from server.environment import PreferenceLabEnvironment
from models import PairwiseAction, LikertAction, ConsistencyAction

env = PreferenceLabEnvironment()

# Pairwise task
obs = env.reset(seed=42, task_type="pairwise")
print(obs.prompt)

obs = env.step(PairwiseAction(choice="A"))
print(obs.reward, obs.done)

# State (property, not method)
state = env.state
print(state.episode_id, state.step_count)
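Building on the calls above, a full episode can be wrapped in a small loop. The sketch below assumes only the interface shown in this section (`reset(seed=..., task_type=...)`, `step(action)`, and observations carrying `.reward` and `.done`); `run_episode` is a hypothetical helper, not part of the repository.

```python
def run_episode(env, policy, task_type="pairwise", seed=42, max_steps=10) -> float:
    """Run one episode and return the average per-step reward.

    `policy` is any callable that maps an observation to an action object.
    """
    obs = env.reset(seed=seed, task_type=task_type)
    rewards = []
    for _ in range(max_steps):
        obs = env.step(policy(obs))
        rewards.append(obs.reward)
        if obs.done:
            break
    # Guard against max_steps=0, where no step was taken.
    return sum(rewards) / len(rewards) if rewards else 0.0
```

With the environment created above, this might be called as `run_episode(env, lambda obs: PairwiseAction(choice="A"))` to measure an always-pick-A baseline.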

Using with TRL / GRPO Training

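import asyncio
from openenv.core.env_client import EnvClient
from models import PairwiseAction


def your_policy(obs) -> str:
    """Placeholder policy: replace with your model's choice ("A", "B", "tie", or "skip")."""
    return "A"


def train_on(obs, reward) -> None:
    """Placeholder update hook: replace with your trainer's update step."""


async def train():
    async with EnvClient("http://localhost:8000") as env:
        obs = await env.reset(task_type="pairwise")

        for step in range(5):
            # Your policy predicts the action
            action = PairwiseAction(choice=your_policy(obs))
            obs = await env.step(action)
            reward = obs.reward
            done = obs.done

            # Feed the graded reward back into the trainer
            train_on(obs, reward)
            if done:
                break

asyncio.run(train())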

MultiEnv Wrapper (Parallel Sessions)

import asyncio
from openenv.core.env_client import MultiEnvClient


async def main():
    # Spin up 8 parallel sessions on the same server
    async with MultiEnvClient("http://localhost:8000", n=8) as envs:
        observations = await envs.reset_all(task_type="pairwise")
        # envs.step_all(actions) → list of observations

asyncio.run(main())

Baseline Scores

Scores produced by python inference.py with meta-llama/Llama-3.1-8B-Instruct:

Task	Difficulty	Avg Reward	Notes
Pairwise Ranking	Easy	~0.60	Varies by model capability
Likert Scoring	Medium	~0.75	Continuous signal
Consistency Ranking	Hard	~0.65	Kendall-tau based
Overall	—	~0.67	Reproducible with seed=42

Higher scores indicate the model aligns more closely with human preference gold labels. Run python inference.py to generate fresh scores against your own model.

Testing

Run tests

# Full test suite
pytest tests/ -v

# Specific test classes
pytest tests/test_environment.py::TestPairwiseGrader -v
pytest tests/test_environment.py::TestPreferenceLabEnvironment -v

# Quick smoke test (direct import, no server needed)
python test_api.py

Test coverage — 22 test cases across 4 classes:

TestPairwiseGrader — correct / wrong / skip / tie / range (5 tests)
TestLikertGrader — perfect / worst / partial / random range (4 tests)
TestConsistencyGrader — perfect / reversed / invalid IDs / all perms / no-tie (5 tests)
TestPreferenceLabEnvironment — reset / step / state / seed / episode flow (8 tests)

Deployment

Docker (Local)

docker build -t preferencelab .

docker run -p 7860:7860 \
  -e HF_TOKEN=hf_your_token \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e MAX_CONCURRENT_ENVS=64 \
  preferencelab

Visit http://localhost:7860/web

Hugging Face Spaces

Fork or push this repository to a Hugging Face Space with Docker SDK
Add the following secrets in Space Settings:
- HF_TOKEN — your Hugging Face token
- API_BASE_URL — inference endpoint (e.g. https://api-inference.huggingface.co/v1)
- MODEL_NAME — model to use

The Dockerfile handles everything else. Health check polls /health every 30 seconds.

License

This project is licensed under the MIT License — see LICENSE for details.

Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon

Team Nexis
