title: PreferenceLab
emoji: 🧪
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - openenv
  - rlhf
  - preference-learning
license: mit

🧪 PreferenceLab

An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline


Built for the Meta × Hugging Face OpenEnv Hackathon by Team Nexis

🚀 Live Space: Dev-CrafterX/preference-lab

Overview

PreferenceLab is a production-grade OpenEnv environment that teaches AI agents to judge LLM response quality, exactly as human annotators do in RLHF (Reinforcement Learning from Human Feedback) pipelines.

Instead of expensive, slow human annotators, PreferenceLab provides:

| Feature | Details |
|---|---|
| ✅ Deterministic grading | Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP) |
| ✅ Dense reward signals | Reward at every annotation step, not just episode-end |
| ✅ Three difficulty levels | Pairwise → Likert scoring → Transitive 4-way ranking |
| ✅ Synthetic fallback | Zero-dependency offline testing with built-in data |
| ✅ Concurrent sessions | Up to 64 parallel RL training sessions by default |
| ✅ Reproducible episodes | Fully seeded random sampling |
| ✅ Web playground | Gradio UI at /web for interactive testing |

Why PreferenceLab?

No existing OpenEnv environment simulates the RLHF data collection pipeline, the same pipeline that produces the alignment data used to fine-tune models like Llama 3, Claude, and GPT-4.

| Pain Point | PreferenceLab Solution |
|---|---|
| Human annotators are slow & expensive | AI agent replaces the annotator role |
| Binary end-of-episode rewards → sparse gradients | Every step yields a graded reward signal |
| Single-task environments limit curriculum learning | Three tasks of increasing complexity |
| Hard-to-reproduce evaluations | Seeded episodes are fully deterministic |
| Local dev blocked by API dependencies | Built-in synthetic fallback datasets |
| No visual interface for debugging | Gradio playground at /web |

System Architecture

🏗️ Component Architecture

flowchart TB
    subgraph Clients["Clients and Consumers"]
        A1["AI Agent<br/>GRPO / TRL Training"]
        A2["Baseline Inference<br/>inference.py"]
        A3["Gradio Web UI<br/>/web"]
        A4["REST / WebSocket<br/>Direct API"]
    end

    subgraph Platform["Hugging Face Space: Docker Container"]
        subgraph FastAPI["FastAPI Server: server/app.py"]
            EP1["/reset POST"]
            EP2["/step POST"]
            EP3["/state GET"]
            EP4["/health GET"]
            EP5["/web Gradio"]
        end

        subgraph EnvCore["PreferenceLabEnvironment: server/environment.py"]
            RESET["reset()<br/>seed · task_type · episode_id"]
            STEP["step()<br/>grade action → reward → sample next"]
            STATE["state @property<br/>returns State object"]
        end

        subgraph Graders["Deterministic Graders"]
            G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"]
            G2["Task 2 · Likert<br/>1 − MAE / 4.0"]
            G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"]
        end

        subgraph DataStore["Data Layer: data/"]
            D1["pairwise_data.json<br/>HH-RLHF"]
            D2["likert_data.json<br/>UltraFeedback"]
            D3["consistency_data.json<br/>Stanford SHP"]
            D4["Synthetic Fallback<br/>built-in, always available"]
        end
    end

    subgraph Models["Pydantic Models: models.py"]
        M1["PairwiseAction / Observation"]
        M2["LikertAction / Observation"]
        M3["ConsistencyAction / Observation"]
    end

    LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"]

    A1 -- "HTTP / WebSocket" --> FastAPI
    A2 -- "Direct import" --> EnvCore
    A3 --> EP5
    A4 --> FastAPI

    EP1 --> RESET
    EP2 --> STEP
    EP3 --> STATE
    EP5 --> RESET
    EP5 --> STEP

    RESET --> Graders
    STEP --> Graders
    Graders --> G1
    Graders --> G2
    Graders --> G3

    EnvCore --> DataStore
    D1 -.->|fallback| D4
    D2 -.->|fallback| D4
    D3 -.->|fallback| D4

    Models --> Graders
    A2 -- "OpenAI client" --> LLM

    classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px
    classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px
    classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px

    class A1,A2,A3,A4 client
    class G1,G2,G3 grader
    class D1,D2,D3,D4 data
    class M1,M2,M3 model
    class LLM external
    class EP1,EP2,EP3,EP4,EP5 endpoint
    class RESET,STEP,STATE env

🔄 Request Lifecycle: Data Flow

sequenceDiagram
    autonumber
    actor Agent as AI Agent / TRL Trainer
    participant API as FastAPI Server
    participant Env as PreferenceLabEnvironment
    participant Grader as Deterministic Grader
    participant DB as Dataset

    Note over Agent,DB: Episode Start

    Agent->>API: POST /reset  task_type=pairwise  seed=42
    API->>Env: env.reset(task_type, seed)
    Env->>DB: _sample_example(rng)
    DB-->>Env: prompt, response_a, response_b, gold_label
    Env-->>API: PairwiseObservation  reward=0.0  done=false
    API-->>Agent: 200 OK  Observation JSON

    Note over Agent,DB: Step Loop (max 10 steps per episode)

    loop For each annotation step
        Agent->>Agent: call_llm(system_prompt, observation)
        Agent->>API: POST /step  action: choice=A
        API->>Env: env.step(PairwiseAction)
        Env->>Grader: grade_pairwise(action, example)
        Grader->>Grader: compare choice vs gold_label
        Grader-->>Env: reward=0.99  verdict=correct
        Env->>DB: _sample_example  next example
        DB-->>Env: next example
        Env-->>API: Observation  reward=0.99  done=false  step=N
        API-->>Agent: 200 OK  StepResult JSON
        Agent->>Agent: log_step  accumulate reward
    end

    Note over Agent,DB: Episode End

    Env-->>API: Observation  done=true  step_count=10
    API-->>Agent: 200 OK  Final Observation
    Agent->>Agent: log_end  score  rewards
    Agent->>API: POST /reset  start new episode

🧭 User Flow

flowchart TD
    START(["Start"])

    subgraph Setup["Setup Phase"]
        S1["Clone repository<br/>git clone"]
        S2["Install dependencies<br/>pip install -r requirements.txt"]
        S3{"Need real<br/>datasets?"}
        S4["Download datasets<br/>python scripts/prepare_datasets.py"]
        S5["Use synthetic fallback<br/>built-in, no download needed"]
        S6["Set environment vars<br/>HF_TOKEN  MODEL_NAME  API_BASE_URL"]
    end

    subgraph Deploy["Choose Deployment"]
        D1{"Mode?"}
        D2["Local Dev<br/>uvicorn server.app:app --port 8000"]
        D3["Docker<br/>docker build and docker run"]
        D4["HF Space<br/>git push to HuggingFace"]
    end

    subgraph Usage["Choose Usage Mode"]
        U1{"How to use?"}
        U2["Run Baseline<br/>python inference.py"]
        U3["Web Playground<br/>localhost:8000/web"]
        U4["REST API Integration<br/>HTTP + WebSocket"]
        U5["Run Tests<br/>pytest tests/ -v"]
        U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"]
    end

    subgraph Episode["Episode Loop"]
        E1["POST /reset<br/>choose task_type and seed"]
        E2{"Task Type?"}
        E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"]
        E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"]
        E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"]
        E6["POST /step<br/>submit action"]
        E7["Receive Observation<br/>reward and done flag embedded"]
        E8{"done == true?"}
        E9["Next step<br/>new example sampled automatically"]
        E10["Episode complete<br/>log_end  avg reward computed"]
    end

    START --> S1 --> S2 --> S3
    S3 -->|Yes| S4 --> S6
    S3 -->|No| S5 --> S6
    S6 --> D1

    D1 -->|Local| D2
    D1 -->|Docker| D3
    D1 -->|Cloud| D4

    D2 & D3 & D4 --> U1

    U1 -->|Baseline| U2
    U1 -->|Interactive| U3
    U1 -->|Custom| U4
    U1 -->|Tests| U5
    U1 -->|Training| U6

    U2 & U3 & U4 & U6 --> E1

    E1 --> E2
    E2 -->|pairwise| E3
    E2 -->|likert| E4
    E2 -->|consistency| E5
    E3 & E4 & E5 --> E6 --> E7 --> E8
    E8 -->|No| E9 --> E6
    E8 -->|Yes| E10
    E10 -->|New Episode| E1
    E10 -->|Done| FINISH(["Complete"])

    classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px
    classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px

    class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step
    class S3,D1,U1,E2,E8 decision
    class E3 task1
    class E4 task2
    class E5 task3
    class START,FINISH terminal

☁️ Deployment Architecture

flowchart LR
    subgraph Dev["Developer Machine"]
        CODE["Source Code<br/>preference-lab/"]
        GIT["git push"]
        CODE --> GIT
    end

    subgraph Space["Hugging Face Space: Docker SDK"]
        SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"]
        CONTAINER["Docker Container<br/>python:3.10-slim"]
        UVICORN["uvicorn server.app:app<br/>host 0.0.0.0  port 8000"]
        WEB["Gradio UI<br/>/web"]
        REST["REST API<br/>/reset  /step  /state"]
        HEALTH["Health Check<br/>/health  every 30s"]

        CONTAINER --> UVICORN
        UVICORN --> WEB
        UVICORN --> REST
        UVICORN --> HEALTH
        SECRETS -.->|env vars injected| CONTAINER
    end

    PUBURL["Public URL<br/>https://username-preflab.hf.space"]

    subgraph LLMApi["HF Inference API"]
        MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"]
    end

    subgraph Consumers["Consumers"]
        U1["TRL / GRPO<br/>Training Loop"]
        U2["Developer<br/>Browser"]
        U3["inference.py<br/>Baseline Script"]
        U4["MCPToolClient<br/>PreferenceLabEnv"]
    end

    GIT --> Space
    Space --> PUBURL

    U1 -- "WebSocket / OpenEnv" --> REST
    U2 -- "HTTPS" --> WEB
    U3 -- "Direct import" --> UVICORN
    U4 -- "HTTP / MCP" --> REST

    REST -- "OpenAI client" --> MODEL

    classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px
    classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px
    classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px

    class PUBURL puburl
    class CONTAINER,UVICORN docker
    class U1,U2,U3,U4 consumer
    class MODEL llm
    class SECRETS secret
    class CODE,GIT dev
    class WEB,REST,HEALTH hf

File Architecture

preference-lab/
│
├── 📄 README.md                        ← You are here
├── 📄 LICENSE
├── 📄 .gitignore
├── 📄 .dockerignore
│
├── 📄 openenv.yaml                     ← OpenEnv manifest
│   │                                     runtime: fastapi
│   │                                     app: server.app:app
│   │                                     port: 8000
│   │                                     type: space
│   │
├── 📄 Dockerfile                       ← HF Spaces production image
│   │                                     Base: python:3.10-slim
│   │                                     CMD: uvicorn server.app:app
│   │                                     HEALTHCHECK: polls /health every 30s
│   │
├── 📄 requirements.txt                 ← Flat pip dependency list
│   │                                     openenv-core, fastapi, uvicorn,
│   │                                     pydantic, openai, datasets,
│   │                                     httpx, websockets, gradio
│   │
├── 📄 pyproject.toml                   ← Build config + project metadata
│   │                                     (setuptools, same deps as above)
│   │
├── 📄 __init__.py                      ← Package entry point
│   │                                     Exports: PreferenceLabEnv,
│   │                                     PairwiseAction, LikertAction,
│   │                                     ConsistencyAction + all Observations
│   │
├── 📄 models.py                        ← Pydantic v2 data models
│   │                                     Defines the agent ↔ env contract
│   │
│   │   ACTIONS                           OBSERVATIONS
│   │   ─────────────────────────────     ─────────────────────────────────
│   │   PairwiseAction                    PairwiseObservation
│   │     .choice: A|B|tie|skip            .prompt, .response_a, .response_b
│   │     .justification: str?             .reward, .done, .step_count
│   │                                     ─────────────────────────────────
│   │   LikertAction                      LikertObservation
│   │     .helpfulness: 1-5               .prompt, .response
│   │     .honesty: 1-5                   .rubric, .reward, .done
│   │     .harmlessness: 1-5              ─────────────────────────────────
│   │     .instruction_following: 1-5     ConsistencyObservation
│   │                                      .prompt
│   │   ConsistencyAction                  .response_a, .response_b
│   │     .ranking: list[str] (len=4)      .response_c, .response_d
│   │                                      .reward, .done
│   │
├── 📄 client.py                        ← PreferenceLabEnv client wrapper
│   │                                     Thin sync/async wrapper around
│   │                                     openenv.core.MCPToolClient
│   │
├── 📄 inference.py                     ← Baseline LLM inference script
│   │                                     Mandatory stdout format:
│   │                                     [START] task= env= model=
│   │                                     [STEP]  step= action= reward= done=
│   │                                     [END]   success= steps= score=
│   │
├── 📄 test_api.py                      ← Quick smoke-test (direct import)
│   │                                     Tests all 3 tasks in sequence
│   │
├── server/                             ← Core server package
│   │
│   ├── 📄 __init__.py
│   │
│   ├── 📄 app.py                       ← FastAPI application factory
│   │                                     ENABLE_WEB_INTERFACE=true → Gradio
│   │                                     MAX_CONCURRENT_ENVS=64
│   │                                     Routes: /manifest.json, /.well-known/
│   │
│   └── 📄 environment.py               ← Core OpenEnv environment
│                                         PreferenceLabEnvironment(Environment)
│                                         SUPPORTS_CONCURRENT_SESSIONS = True
│                                         ─────────────────────────────────────
│                                         reset(seed, task_type, **kwargs)
│                                           → Observation
│                                         step(action)
│                                           → Observation  [reward & done inline]
│                                         state @property
│                                           → State(episode_id, step_count, ...)
│                                         ─────────────────────────────────────
│                                         Graders (internal):
│                                           grade_pairwise()   → +1.0 / 0.3 / 0.1 / 0.0
│                                           grade_likert()     → 1 − MAE/4.0
│                                           grade_consistency()→ Kendall-τ + transitivity
│
├── data/                               ← Dataset files (git-ignored)
│   ├── 📄 pairwise_data.json            HH-RLHF gold labels
│   ├── 📄 likert_data.json              UltraFeedback multi-axis scores
│   └── 📄 consistency_data.json         Stanford SHP ranking pairs
│                                         (All 3 auto-fallback to synthetic
│                                          data if files are missing)
│
├── scripts/
│   └── 📄 prepare_datasets.py          ← Downloads & formats datasets
│                                         from Hugging Face Hub
│                                         Usage: python scripts/prepare_datasets.py
│
└── tests/
    └── 📄 test_environment.py          ← pytest test suite
                                          25 test cases covering:
                                          reset / step / state / graders
                                          concurrent sessions / reproducibility

Task Design

PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows.

Task 1: Pairwise Ranking (Easy)

The agent is shown a prompt and two LLM responses (A and B), and must pick the better one.

Observation fields: prompt, response_a, response_b

Action:

PairwiseAction(
    choice="A",           # "A" | "B" | "tie" | "skip"
    justification="..."   # optional, not used for grading
)

Grading (vs HH-RLHF gold label):

| Agent choice | Outcome | Reward |
|---|---|---|
| Correct (matches gold) | ✅ | +1.0 |
| skip | ⚠️ Abstain | +0.3 |
| tie (when gold is clear) | ⚠️ Hedging | +0.1 |
| Wrong choice | ❌ | +0.0 |
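
The reward table above can be sketched as a small pure function. This is a minimal illustration, not the project's actual `grade_pairwise()`: the signature is an assumption, and so is the choice to treat a `tie` as correct when the gold label itself is `tie`.

```python
def grade_pairwise(choice: str, gold_label: str) -> float:
    """Map an agent choice to the reward listed in the table above."""
    valid = {"A", "B", "tie", "skip"}
    if choice not in valid:
        raise ValueError(f"choice must be one of {sorted(valid)}, got {choice!r}")
    if choice == gold_label:
        return 1.0  # correct: matches the gold label
    if choice == "skip":
        return 0.3  # abstained
    if choice == "tie":
        return 0.1  # hedged although the gold label is clear
    return 0.0      # picked the wrong response


if __name__ == "__main__":
    for choice in ("A", "skip", "tie", "B"):
        print(choice, grade_pairwise(choice, gold_label="A"))
```

Abstaining scores higher than hedging, so an agent that is genuinely unsure is nudged toward `skip` rather than `tie`.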

Task 2: Multi-Axis Likert Scoring (Medium)

The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes.

Observation fields: prompt, response, rubric

Action:

LikertAction(
    helpfulness=4,           # 1–5
    honesty=5,               # 1–5
    harmlessness=5,          # 1–5
    instruction_following=4  # 1–5
)

Grading (vs UltraFeedback gold scores):

reward = 1.0 − (MAE / 4.0)

where MAE = mean absolute error across all 4 axes
      4.0  = maximum possible error per axis

Perfect match  → reward = 1.0
Off by 1 each  → reward = 0.75
Off by 2 each  → reward = 0.50
Worst case     → reward = 0.0
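
To make the arithmetic concrete, here is a minimal sketch of the formula above. The function name and the use of plain dicts are assumptions for illustration; the real grader works on `LikertAction` objects.

```python
AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")
MAX_ERROR = 4.0  # largest possible gap on a 1-5 scale (5 - 1)


def grade_likert(agent: dict[str, int], gold: dict[str, int]) -> float:
    """Return 1 - MAE / 4, where MAE is averaged over the four axes."""
    for axis in AXES:
        if not 1 <= agent[axis] <= 5:
            raise ValueError(f"{axis} must be in 1..5, got {agent[axis]}")
    mae = sum(abs(agent[axis] - gold[axis]) for axis in AXES) / len(AXES)
    return 1.0 - mae / MAX_ERROR


if __name__ == "__main__":
    gold = dict.fromkeys(AXES, 5)
    for score in (5, 4, 3, 1):
        print(score, grade_likert(dict.fromkeys(AXES, score), gold))
```

Run against a gold label of all 5s, agent scores of all 5s, 4s, 3s, and 1s reproduce the four example rewards listed above.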

Task 3: Transitive Consistency Ranking (Hard)

The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity.

Observation fields: prompt, response_a, response_b, response_c, response_d

Action:

ConsistencyAction(
    ranking=["B", "A", "D", "C"]  # best → worst, all 4 required
)
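
"All 4 required" suggests that a ranking should contain each of A, B, C, and D exactly once. A minimal sketch of that check follows; `validate_ranking` is a hypothetical helper, and the real server may handle invalid rankings differently (for example by returning a low reward instead of raising).

```python
LABELS = ("A", "B", "C", "D")


def validate_ranking(ranking: list[str]) -> list[str]:
    """Return the ranking unchanged if it is a permutation of A-D, else raise."""
    if len(ranking) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} labels, got {len(ranking)}")
    if sorted(ranking) != sorted(LABELS):
        raise ValueError(f"ranking must use each of {LABELS} exactly once")
    return ranking


if __name__ == "__main__":
    print(validate_ranking(["B", "A", "D", "C"]))
```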

Grading (Kendall-τ + Transitivity bonus):

reward = α × kendall_tau + β × transitivity_score

kendall_tau:        normalized rank correlation vs gold ranking
                    range [−1.0, +1.0], clipped to [0, 1]

transitivity_score: fraction of (A>B, B>C → A>C) triplets satisfied
                    penalizes logically inconsistent rankings

α = 0.7,  β = 0.3   (weighted combination)
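
One way to read this formula is sketched below. It is an interpretation, not the project's `grade_consistency()`: the function names are assumptions, and both rankings are assumed to be valid permutations of the same four labels. Note that a ranking submitted as a single ordered list is transitive by construction, so under this reading the transitivity term is always 1.0 and the Kendall-τ term does all the discriminating.

```python
from itertools import combinations, permutations

ALPHA, BETA = 0.7, 0.3


def kendall_tau(agent: list[str], gold: list[str]) -> float:
    """(concordant - discordant) / total pairs, in [-1, 1]."""
    pos_agent = {label: i for i, label in enumerate(agent)}
    pos_gold = {label: i for i, label in enumerate(gold)}
    concordant = discordant = 0
    for x, y in combinations(gold, 2):
        same_order = (pos_agent[x] - pos_agent[y]) * (pos_gold[x] - pos_gold[y]) > 0
        if same_order:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def transitivity_score(ranking: list[str]) -> float:
    """Fraction of (a>b, b>c) triplets for which a>c also holds."""
    prefers = {(x, y) for i, x in enumerate(ranking) for y in ranking[i + 1:]}
    checked = satisfied = 0
    for a, b, c in permutations(ranking, 3):
        if (a, b) in prefers and (b, c) in prefers:
            checked += 1
            satisfied += (a, c) in prefers
    return satisfied / checked if checked else 0.0


def grade_consistency(agent: list[str], gold: list[str]) -> float:
    tau = max(0.0, kendall_tau(agent, gold))  # clip negative correlation to 0
    return ALPHA * tau + BETA * transitivity_score(agent)


if __name__ == "__main__":
    gold = ["A", "B", "C", "D"]
    print(grade_consistency(["A", "B", "C", "D"], gold))  # identical ranking
    print(grade_consistency(["B", "A", "C", "D"], gold))  # one adjacent swap
    print(grade_consistency(["D", "C", "B", "A"], gold))  # fully reversed
```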

Reward Functions

| Task | Formula | Range |
|---|---|---|
| Pairwise | Exact match reward table | {0.0, 0.1, 0.3, 1.0} |
| Likert | `1 − mean(abs(agent_score − gold_score)) / 4.0` | [0.0, 1.0] |
| Consistency | `0.7 × Kendall-τ + 0.3 × Transitivity` | [0.0, 1.0] |

All rewards are bounded [0, 1] and emitted at every step (dense signal).
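
The sequence diagram and user flow earlier in this README show a pairwise reward of 0.99 and a range of 0.01 to 0.99, which suggests the server clamps raw grader values into an open interval before returning them. A minimal sketch of such a clamp, assuming an epsilon of 0.01 (inferred from those diagrams, not confirmed by the grader code):

```python
def clamp_reward(raw: float, eps: float = 0.01) -> float:
    """Squeeze a reward from [0, 1] into [eps, 1 - eps]."""
    return min(1.0 - eps, max(eps, raw))


if __name__ == "__main__":
    for raw in (0.0, 0.1, 0.3, 1.0):
        print(raw, clamp_reward(raw))
```

Under this assumption only the two extreme table values change: a correct answer is reported just below 1 and a wrong answer just above 0, while 0.1 and 0.3 pass through untouched.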


Datasets

| Dataset | Task | Source | Samples |
|---|---|---|---|
| HH-RLHF | Pairwise | Anthropic | ~160K pairs |
| UltraFeedback | Likert | OpenBMB | ~64K responses |
| Stanford SHP | Consistency | Stanford | ~385K pairs |

Download all datasets:

python scripts/prepare_datasets.py
# or with custom sample count:
python scripts/prepare_datasets.py --samples 5000

If the JSON files are absent, the environment automatically uses built-in synthetic data, so no download is needed for local development.
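
The fallback behaviour can be pictured as a loader that prefers the JSON file and otherwise returns built-in examples. The sketch below is illustrative only: `load_examples` and `SYNTHETIC_PAIRWISE` are hypothetical names, and the single synthetic record is a made-up placeholder, not one of the project's built-in examples.

```python
import json
from pathlib import Path

# Hypothetical stand-in for the built-in synthetic examples.
SYNTHETIC_PAIRWISE = [
    {"prompt": "What is 2 + 2?", "response_a": "4", "response_b": "5", "gold_label": "A"},
]


def load_examples(path: Path, fallback: list[dict]) -> list[dict]:
    """Load examples from a JSON file, or return the fallback if it is missing."""
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            return json.load(fh)
    return fallback


if __name__ == "__main__":
    examples = load_examples(Path("data/pairwise_data.json"), SYNTHETIC_PAIRWISE)
    print(f"{len(examples)} example(s) loaded")
```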


Quick Start

Prerequisites

  • Python 3.10+
  • git

Local Development

# 1. Clone
git clone https://github.com/SIBAM890/preferencelab.git
cd preferencelab

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Download real datasets
python scripts/prepare_datasets.py

# 5. Start the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

Open http://localhost:8000/web for the interactive Gradio playground.

Verify the server is running

curl http://localhost:8000/health
# → {"status":"healthy"}

curl http://localhost:8000/schema
# → full action / observation JSON schema

Run the baseline inference script

# Set your API credentials (or use any OpenAI-compatible endpoint)
export HF_TOKEN=hf_your_token_here
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct

python inference.py

Expected output format:

[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=choice=A reward=1.00 done=false error=null
[STEP]  step=2 action=choice=B reward=0.00 done=false error=null
[STEP]  step=3 action=choice=A reward=1.00 done=false error=null
[STEP]  step=4 action=choice=A reward=1.00 done=false error=null
[STEP]  step=5 action=choice=B reward=0.00 done=true  error=null
[END]   success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
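
Because the log format is line-oriented, it is easy to post-process. The sketch below parses the `[STEP]` lines from the sample above and recomputes the mean reward, which matches the `score=0.60` on the `[END]` line. That `inference.py` always defines the score as the mean reward is an assumption based on this one sample; `parse_steps` and `mean_reward` are hypothetical helpers.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(\d+)\s+action=(\S+)\s+reward=([\d.]+)\s+done=(true|false)"
)

SAMPLE = """\
[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=choice=A reward=1.00 done=false error=null
[STEP]  step=2 action=choice=B reward=0.00 done=false error=null
[STEP]  step=3 action=choice=A reward=1.00 done=false error=null
[STEP]  step=4 action=choice=A reward=1.00 done=false error=null
[STEP]  step=5 action=choice=B reward=0.00 done=true  error=null
[END]   success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
"""


def parse_steps(log: str) -> list[dict]:
    """Extract step number, action, reward and done flag from [STEP] lines."""
    steps = []
    for line in log.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            steps.append({
                "step": int(match[1]),
                "action": match[2],
                "reward": float(match[3]),
                "done": match[4] == "true",
            })
    return steps


def mean_reward(steps: list[dict]) -> float:
    return sum(s["reward"] for s in steps) / len(steps) if steps else 0.0


if __name__ == "__main__":
    steps = parse_steps(SAMPLE)
    print(len(steps), "steps, mean reward", round(mean_reward(steps), 2))
```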

Run tests

pytest tests/ -v
# 25 test cases: reset, step, state, graders, concurrency, reproducibility

Environment Variables

| Variable | Default | Description |
|---|---|---|
| HF_TOKEN | (none) | Hugging Face API token for LLM inference |
| API_BASE_URL | https://api-inference.huggingface.co/v1 | LLM API endpoint (any OpenAI-compatible URL) |
| MODEL_NAME | meta-llama/Llama-3.1-8B-Instruct | Model identifier sent to the API |
| MAX_CONCURRENT_ENVS | 64 | Maximum parallel WebSocket sessions |
| ENABLE_WEB_INTERFACE | true | Mount Gradio UI at /web |
| ENV_BASE_URL | http://localhost:8000 | PreferenceLab server URL (for remote clients) |
| ENV_README_PATH | (none) | Custom path to README for web interface |
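
A script that talks to PreferenceLab might read these variables as follows. This is a sketch using the defaults from the table; the `Settings` class and `load_settings` function are hypothetical and are not part of the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    hf_token: Optional[str]
    api_base_url: str
    model_name: str
    max_concurrent_envs: int
    enable_web_interface: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read configuration from the environment, applying the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        hf_token=env.get("HF_TOKEN"),
        api_base_url=env.get("API_BASE_URL", "https://api-inference.huggingface.co/v1"),
        model_name=env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        max_concurrent_envs=int(env.get("MAX_CONCURRENT_ENVS", "64")),
        enable_web_interface=env.get("ENABLE_WEB_INTERFACE", "true").lower() == "true",
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the real process environment.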

API Reference

REST Endpoints

| Method | Path | Description |
|---|---|---|
| GET | /health | Server health check |
| GET | /schema | Action + Observation JSON schemas |
| GET | /state | Current episode state |
| POST | /reset | Start a new episode |
| POST | /step | Submit an action, receive observation |
| GET | /web | Gradio interactive playground |
| GET | /manifest.json | PWA web manifest |

POST /reset

{
  "seed": 42,
  "task_type": "pairwise"
}

task_type accepts: "pairwise" | "likert" | "consistency" | omit for random.

POST /step

{
  "action": {
    "choice": "A"
  }
}

Response (all step/reset endpoints)

{
  "observation": {
    "task_id": "abc123_step1",
    "task_type": "pairwise",
    "prompt": "Explain backpropagation.",
    "response_a": "...",
    "response_b": "...",
    "reward": 1.0,
    "done": false,
    "step_count": 1,
    "info": { "verdict": "correct", "gold_label": "A" }
  },
  "reward": 1.0,
  "done": false
}
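
The request and response shapes above can be exercised from Python with only the standard library. The sketch below builds the two POST requests and parses a response body without sending anything; `build_post` and `parse_result` are hypothetical helpers, and the base URL assumes the local development server from Quick Start.

```python
import json
from urllib.request import Request


def build_post(base_url: str, path: str, payload: dict) -> Request:
    """Build a JSON POST request for /reset or /step (not sent yet)."""
    return Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_result(body: str) -> tuple[float, bool, dict]:
    """Pull reward, done and the observation out of a response body."""
    result = json.loads(body)
    return result["reward"], result["done"], result["observation"]


BASE_URL = "http://localhost:8000"  # assumed local server
reset_req = build_post(BASE_URL, "/reset", {"seed": 42, "task_type": "pairwise"})
step_req = build_post(BASE_URL, "/step", {"action": {"choice": "A"}})

if __name__ == "__main__":
    print(reset_req.get_method(), reset_req.full_url)
    print(step_req.get_method(), step_req.full_url)
```

With the server running, each request can be sent with `urllib.request.urlopen(req)` and the decoded body passed to `parse_result`.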

WebSocket

ws://localhost:8000/ws

OpenEnv WebSocket protocol: send reset, step, state, and close messages. Used by TRL training loops via MCPToolClient.


Integration Guide

Direct Import (Local)

from server.environment import PreferenceLabEnvironment
from models import PairwiseAction, LikertAction, ConsistencyAction

env = PreferenceLabEnvironment()

# Pairwise task
obs = env.reset(seed=42, task_type="pairwise")
print(obs.prompt)

obs = env.step(PairwiseAction(choice="A"))
print(obs.reward, obs.done)

# State (property, not method)
state = env.state
print(state.episode_id, state.step_count)
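
A full episode follows the same reset/step pattern until `done` is true. The sketch below runs that loop against a tiny stub so it works without the repository installed. `StubEnv` is a hypothetical stand-in for `PreferenceLabEnvironment`: it accepts a bare string where the real environment expects a `PairwiseAction`, and its 5-step episode length is an assumption.

```python
import random
from dataclasses import dataclass


@dataclass
class StubObservation:
    prompt: str
    reward: float
    done: bool


class StubEnv:
    """Hypothetical stand-in for PreferenceLabEnvironment (offline, pairwise only)."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps

    def reset(self, seed: int, task_type: str = "pairwise") -> StubObservation:
        self._rng = random.Random(seed)  # private RNG: same seed, same episode
        self._steps = 0
        self._gold = self._rng.choice("AB")
        return StubObservation(prompt="example 0", reward=0.0, done=False)

    def step(self, choice: str) -> StubObservation:
        reward = 1.0 if choice == self._gold else 0.0
        self._steps += 1
        self._gold = self._rng.choice("AB")  # sample the next example's label
        return StubObservation(f"example {self._steps}", reward, self._steps >= self.max_steps)


def run_episode(env, policy, seed: int) -> list[float]:
    """Reset, then step until done, collecting the per-step rewards."""
    obs = env.reset(seed=seed, task_type="pairwise")
    rewards = []
    while not obs.done:
        obs = env.step(policy(obs))
        rewards.append(obs.reward)
    return rewards


if __name__ == "__main__":
    rewards = run_episode(StubEnv(), policy=lambda obs: "A", seed=42)
    print(rewards, sum(rewards) / len(rewards))
```

Running the same seed twice yields the same reward sequence, which is the property the seeded episodes are meant to provide.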

Using with TRL / GRPO Training

import asyncio
from openenv.core.env_client import EnvClient
from models import PairwiseAction

async def train():
    async with EnvClient("http://localhost:8000") as env:
        obs = await env.reset(task_type="pairwise")

        for step in range(5):
            # Your policy predicts the action
            action = PairwiseAction(choice=your_policy(obs))
            obs = await env.step(action)
            reward = obs.reward
            done = obs.done

            train_on(obs, reward)
            if done:
                break

asyncio.run(train())

MultiEnv Wrapper (Parallel Sessions)

import asyncio
from openenv.core.env_client import MultiEnvClient

async def main():
    # Spin up 8 parallel sessions on the same server
    async with MultiEnvClient("http://localhost:8000", n=8) as envs:
        observations = await envs.reset_all(task_type="pairwise")
        # envs.step_all(actions) → list of observations

# `async with` is only valid inside a coroutine, so run it via asyncio
asyncio.run(main())

Baseline Scores

Scores produced by python inference.py with meta-llama/Llama-3.1-8B-Instruct:

| Task | Difficulty | Avg Reward | Notes |
|---|---|---|---|
| Pairwise Ranking | Easy | ~0.60 | Varies by model capability |
| Likert Scoring | Medium | ~0.75 | Continuous signal |
| Consistency Ranking | Hard | ~0.65 | Kendall-tau based |
| Overall | - | ~0.67 | Reproducible with seed=42 |

Higher scores indicate the model aligns more closely with human preference gold labels. Run python inference.py to generate fresh scores against your own model.


Testing

Run tests

# Full test suite
pytest tests/ -v

# Specific test classes
pytest tests/test_environment.py::TestPreferenceLabGraders -v
pytest tests/test_environment.py::TestEpisodeManagement -v

# Quick smoke test (direct import, no server needed)
python test_api.py

Test coverage (22 test cases across 4 classes):

  • TestPairwiseGrader: correct / wrong / skip / tie / range (5 tests)
  • TestLikertGrader: perfect / worst / partial / random range (4 tests)
  • TestConsistencyGrader: perfect / reversed / invalid IDs / all perms / no-tie (5 tests)
  • TestPreferenceLabEnvironment: reset / step / state / seed / episode flow (8 tests)

Deployment

Docker (Local)

docker build -t preferencelab .

docker run -p 7860:7860 \
  -e HF_TOKEN=hf_your_token \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e MAX_CONCURRENT_ENVS=64 \
  preferencelab

Visit http://localhost:7860/web

Hugging Face Spaces

  1. Fork or push this repository to a Hugging Face Space with Docker SDK
  2. Add the following secrets in Space Settings:
    • HF_TOKEN: your Hugging Face token
    • API_BASE_URL: inference endpoint (e.g. https://api-inference.huggingface.co/v1)
    • MODEL_NAME: model to use

The Dockerfile handles everything else. Health check polls /health every 30 seconds.


License

This project is licensed under the MIT License. See LICENSE for details.


Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon

Team Nexis