---
title: PreferenceLab
emoji: 🧪
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
tags:
  - openenv
  - rlhf
  - preference-learning
license: mit
---

<div align="center">

# 🧪 PreferenceLab

### An OpenEnv Environment Simulating the RLHF Human Preference Data Collection Pipeline

[![Python](https://img.shields.io/badge/Python-3.10%2B-3776AB?style=flat-square&logo=python&logoColor=white)](https://www.python.org/)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.104%2B-009688?style=flat-square&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
[![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?style=flat-square&logo=pydantic&logoColor=white)](https://docs.pydantic.dev/)
[![Gradio](https://img.shields.io/badge/Gradio-4.0%2B-FF7C00?style=flat-square&logo=gradio&logoColor=white)](https://gradio.app/)
[![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=flat-square&logo=docker&logoColor=white)](https://www.docker.com/)
[![License](https://img.shields.io/badge/License-MIT-blue.svg?style=flat-square)](LICENSE)
[![Hackathon](https://img.shields.io/badge/Meta_%C3%97_HuggingFace-OpenEnv_Hackathon-FF6B00?style=flat-square)](https://huggingface.co/)

> **Built for the Meta × Hugging Face OpenEnv Hackathon — Team Nexis**

| 🚀 **Live Space** | [Dev-CrafterX/preference-lab](https://huggingface.co/spaces/Dev-CrafterX/preference-lab) |
|---|---|

</div>

---

## Table of Contents

- [Overview](#overview)
- [Why PreferenceLab?](#why-preferencelab)
- [System Architecture](#system-architecture)
- [File Architecture](#file-architecture)
- [Task Design](#task-design)
  - [Task 1 — Pairwise Ranking](#task-1--pairwise-ranking-easy)
  - [Task 2 — Multi-Axis Likert Scoring](#task-2--multi-axis-likert-scoring-medium)
  - [Task 3 — Transitive Consistency Ranking](#task-3--transitive-consistency-ranking-hard)
- [Reward Functions](#reward-functions)
- [Datasets](#datasets)
- [Quick Start](#quick-start)
- [Environment Variables](#environment-variables)
- [API Reference](#api-reference)
- [Integration Guide](#integration-guide)
- [Baseline Scores](#baseline-scores)
- [Testing](#testing)
- [Deployment](#deployment)
- [License](#license)

---

## Overview

**PreferenceLab** is a production-grade [OpenEnv](https://github.com/meta-pytorch/openenv) environment that trains AI agents to judge LLM response quality the way human annotators do in RLHF (Reinforcement Learning from Human Feedback) pipelines.

Instead of expensive, slow human annotators, PreferenceLab provides:

| Feature | Details |
|---|---|
| ✅ **Deterministic grading** | Gold labels from real preference datasets (HH-RLHF, UltraFeedback, SHP) |
| ✅ **Dense reward signals** | Reward at every annotation step, not just episode-end |
| ✅ **Three difficulty levels** | Pairwise → Likert scoring → Transitive 4-way ranking |
| ✅ **Synthetic fallback** | Zero-dependency offline testing with built-in data |
| ✅ **Concurrent sessions** | Up to 64 parallel RL training sessions by default |
| ✅ **Reproducible episodes** | Fully seeded random sampling |
| ✅ **Web playground** | Gradio UI at `/web` for interactive testing |

---

## Why PreferenceLab?

To our knowledge, **no existing OpenEnv environment** simulates the RLHF data collection pipeline, the process that produces the preference data used to align models such as Llama 3, Claude, and GPT-4.

| Pain Point | PreferenceLab Solution |
|---|---|
| Human annotators are slow & expensive | AI agent replaces the annotator role |
| Binary end-of-episode rewards → sparse gradients | Every step yields a graded reward signal |
| Single-task environments limit curriculum learning | Three tasks of increasing complexity |
| Hard-to-reproduce evaluations | Seeded episodes are fully deterministic |
| Local dev blocked by API dependencies | Built-in synthetic fallback datasets |
| No visual interface for debugging | Gradio playground at `/web` |

---

## System Architecture

### 🏗️ Component Architecture

```mermaid
flowchart TB
    subgraph Clients["Clients and Consumers"]
        A1["AI Agent<br/>GRPO / TRL Training"]
        A2["Baseline Inference<br/>inference.py"]
        A3["Gradio Web UI<br/>/web"]
        A4["REST / WebSocket<br/>Direct API"]
    end

    subgraph Platform["Hugging Face Space — Docker Container"]
        subgraph FastAPI["FastAPI Server — server/app.py"]
            EP1["/reset POST"]
            EP2["/step POST"]
            EP3["/state GET"]
            EP4["/health GET"]
            EP5["/web Gradio"]
        end

        subgraph EnvCore["PreferenceLabEnvironment — server/environment.py"]
            RESET["reset()<br/>seed · task_type · episode_id"]
            STEP["step()<br/>grade action → reward → sample next"]
            STATE["state @property<br/>returns State object"]
        end

        subgraph Graders["Deterministic Graders"]
            G1["Task 1 · Pairwise<br/>+1.0 / 0.3 / 0.1 / 0.0"]
            G2["Task 2 · Likert<br/>1 − MAE / 4.0"]
            G3["Task 3 · Consistency<br/>Kendall-tau + Transitivity"]
        end

        subgraph DataStore["Data Layer — data/"]
            D1["pairwise_data.json<br/>HH-RLHF"]
            D2["likert_data.json<br/>UltraFeedback"]
            D3["consistency_data.json<br/>Stanford SHP"]
            D4["Synthetic Fallback<br/>built-in, always available"]
        end
    end

    subgraph Models["Pydantic Models — models.py"]
        M1["PairwiseAction / Observation"]
        M2["LikertAction / Observation"]
        M3["ConsistencyAction / Observation"]
    end

    LLM["HF Inference API<br/>meta-llama / Llama-3.1-8B"]

    A1 -- "HTTP / WebSocket" --> FastAPI
    A2 -- "Direct import" --> EnvCore
    A3 --> EP5
    A4 --> FastAPI

    EP1 --> RESET
    EP2 --> STEP
    EP3 --> STATE
    EP5 --> RESET
    EP5 --> STEP

    RESET --> Graders
    STEP --> Graders
    Graders --> G1
    Graders --> G2
    Graders --> G3

    EnvCore --> DataStore
    D1 -.->|fallback| D4
    D2 -.->|fallback| D4
    D3 -.->|fallback| D4

    Models --> Graders
    A2 -- "OpenAI client" --> LLM

    classDef client fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef grader fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef data fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef model fill:#1a2a3a,stroke:#2196f3,color:#e3f2fd,stroke-width:2px
    classDef external fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef endpoint fill:#263238,stroke:#607d8b,color:#eceff1,stroke-width:1px
    classDef env fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px

    class A1,A2,A3,A4 client
    class G1,G2,G3 grader
    class D1,D2,D3,D4 data
    class M1,M2,M3 model
    class LLM external
    class EP1,EP2,EP3,EP4,EP5 endpoint
    class RESET,STEP,STATE env
```

---

### 🔄 Request Lifecycle — Data Flow

```mermaid
sequenceDiagram
    autonumber
    actor Agent as AI Agent / TRL Trainer
    participant API as FastAPI Server
    participant Env as PreferenceLabEnvironment
    participant Grader as Deterministic Grader
    participant DB as Dataset

    Note over Agent,DB: Episode Start

    Agent->>API: POST /reset  task_type=pairwise  seed=42
    API->>Env: env.reset(task_type, seed)
    Env->>DB: _sample_example(rng)
    DB-->>Env: prompt, response_a, response_b, gold_label
    Env-->>API: PairwiseObservation  reward=0.0  done=false
    API-->>Agent: 200 OK  Observation JSON

    Note over Agent,DB: Step Loop — max 10 steps per episode

    loop For each annotation step
        Agent->>Agent: call_llm(system_prompt, observation)
        Agent->>API: POST /step  action: choice=A
        API->>Env: env.step(PairwiseAction)
        Env->>Grader: grade_pairwise(action, example)
        Grader->>Grader: compare choice vs gold_label
        Grader-->>Env: reward=0.99  verdict=correct
        Env->>DB: _sample_example  next example
        DB-->>Env: next example
        Env-->>API: Observation  reward=0.99  done=false  step=N
        API-->>Agent: 200 OK  StepResult JSON
        Agent->>Agent: log_step  accumulate reward
    end

    Note over Agent,DB: Episode End

    Env-->>API: Observation  done=true  step_count=10
    API-->>Agent: 200 OK  Final Observation
    Agent->>Agent: log_end  score  rewards
    Agent->>API: POST /reset  start new episode
```

---

### 🧭 User Flow

```mermaid
flowchart TD
    START(["Start"])

    subgraph Setup["Setup Phase"]
        S1["Clone repository<br/>git clone"]
        S2["Install dependencies<br/>pip install -r requirements.txt"]
        S3{"Need real<br/>datasets?"}
        S4["Download datasets<br/>python scripts/prepare_datasets.py"]
        S5["Use synthetic fallback<br/>built-in — no download needed"]
        S6["Set environment vars<br/>HF_TOKEN  MODEL_NAME  API_BASE_URL"]
    end

    subgraph Deploy["Choose Deployment"]
        D1{"Mode?"}
        D2["Local Dev<br/>uvicorn server.app:app --port 8000"]
        D3["Docker<br/>docker build and docker run"]
        D4["HF Space<br/>git push to HuggingFace"]
    end

    subgraph Usage["Choose Usage Mode"]
        U1{"How to use?"}
        U2["Run Baseline<br/>python inference.py"]
        U3["Web Playground<br/>localhost:8000/web"]
        U4["REST API Integration<br/>HTTP + WebSocket"]
        U5["Run Tests<br/>pytest tests/ -v"]
        U6["TRL / GRPO Training<br/>parallel sessions via MCPToolClient"]
    end

    subgraph Episode["Episode Loop"]
        E1["POST /reset<br/>choose task_type and seed"]
        E2{"Task Type?"}
        E3["Pairwise<br/>PairwiseAction: choice A or B<br/>reward 0.01 to 0.99"]
        E4["Likert<br/>LikertAction: score 4 axes 1 to 5<br/>reward = 1 minus MAE/4"]
        E5["Consistency<br/>ConsistencyAction: rank A B C D<br/>reward = tau + transitivity"]
        E6["POST /step<br/>submit action"]
        E7["Receive Observation<br/>reward and done flag embedded"]
        E8{"done == true?"}
        E9["Next step<br/>new example sampled automatically"]
        E10["Episode complete<br/>log_end  avg reward computed"]
    end

    START --> S1 --> S2 --> S3
    S3 -->|Yes| S4 --> S6
    S3 -->|No| S5 --> S6
    S6 --> D1

    D1 -->|Local| D2
    D1 -->|Docker| D3
    D1 -->|Cloud| D4

    D2 & D3 & D4 --> U1

    U1 -->|Baseline| U2
    U1 -->|Interactive| U3
    U1 -->|Custom| U4
    U1 -->|Tests| U5
    U1 -->|Training| U6

    U2 & U3 & U4 & U6 --> E1

    E1 --> E2
    E2 -->|pairwise| E3
    E2 -->|likert| E4
    E2 -->|consistency| E5
    E3 & E4 & E5 --> E6 --> E7 --> E8
    E8 -->|No| E9 --> E6
    E8 -->|Yes| E10
    E10 -->|New Episode| E1
    E10 -->|Done| FINISH(["Complete"])

    classDef step fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef decision fill:#1a1a2e,stroke:#e94560,color:#f5f5f5,stroke-width:2px
    classDef task1 fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef task2 fill:#3a2a1a,stroke:#ff9800,color:#fff3e0,stroke-width:2px
    classDef task3 fill:#2a1a3a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef terminal fill:#263238,stroke:#66c0f4,color:#c6d4df,stroke-width:3px

    class S1,S2,S4,S5,S6,D2,D3,D4,U2,U3,U4,U5,U6,E1,E6,E7,E9,E10 step
    class S3,D1,U1,E2,E8 decision
    class E3 task1
    class E4 task2
    class E5 task3
    class START,FINISH terminal
```

---

### ☁️ Deployment Architecture

```mermaid
flowchart LR
    subgraph Dev["Developer Machine"]
        CODE["Source Code<br/>preference-lab/"]
        GIT["git push"]
        CODE --> GIT
    end

    subgraph Space["Hugging Face Space — Docker SDK"]
        SECRETS["Secrets Injected<br/>HF_TOKEN<br/>API_BASE_URL<br/>MODEL_NAME<br/>MAX_CONCURRENT_ENVS=64"]
        CONTAINER["Docker Container<br/>python:3.10-slim"]
        UVICORN["uvicorn server.app:app<br/>host 0.0.0.0  port 8000"]
        WEB["Gradio UI<br/>/web"]
        REST["REST API<br/>/reset  /step  /state"]
        HEALTH["Health Check<br/>/health  every 30s"]

        CONTAINER --> UVICORN
        UVICORN --> WEB
        UVICORN --> REST
        UVICORN --> HEALTH
        SECRETS -.->|env vars injected| CONTAINER
    end

    PUBURL["Public URL<br/>https://username-preflab.hf.space"]

    subgraph LLMApi["HF Inference API"]
        MODEL["meta-llama<br/>Llama-3.1-8B-Instruct"]
    end

    subgraph Consumers["Consumers"]
        U1["TRL / GRPO<br/>Training Loop"]
        U2["Developer<br/>Browser"]
        U3["inference.py<br/>Baseline Script"]
        U4["MCPToolClient<br/>PreferenceLabEnv"]
    end

    GIT --> Space
    Space --> PUBURL

    U1 -- "WebSocket / OpenEnv" --> REST
    U2 -- "HTTPS" --> WEB
    U3 -- "Direct import" --> UVICORN
    U4 -- "HTTP / MCP" --> REST

    REST -- "OpenAI client" --> MODEL

    classDef hf fill:#ff6b00,stroke:#ff9800,color:#fff,stroke-width:2px
    classDef docker fill:#0db7ed,stroke:#066da5,color:#fff,stroke-width:2px
    classDef consumer fill:#1e3a5f,stroke:#4a90d9,color:#e8f4fd,stroke-width:2px
    classDef llm fill:#4a235a,stroke:#9c27b0,color:#f3e5f5,stroke-width:2px
    classDef secret fill:#3a1a1a,stroke:#f44336,color:#ffebee,stroke-width:2px
    classDef dev fill:#1a3a2a,stroke:#4caf50,color:#e8f5e9,stroke-width:2px
    classDef puburl fill:#0d2137,stroke:#29b6f6,color:#e1f5fe,stroke-width:2px

    class PUBURL puburl
    class CONTAINER,UVICORN docker
    class U1,U2,U3,U4 consumer
    class MODEL llm
    class SECRETS secret
    class CODE,GIT dev
    class WEB,REST,HEALTH hf
```

---


## File Architecture

```
preference-lab/
│
├── 📄 README.md                        ← You are here
├── 📄 LICENSE
├── 📄 .gitignore
├── 📄 .dockerignore
│
├── 📄 openenv.yaml                     ← OpenEnv manifest
│   │                                     runtime: fastapi
│   │                                     app: server.app:app
│   │                                     port: 8000
│   │                                     type: space
│   │
├── 📄 Dockerfile                       ← HF Spaces production image
│   │                                     Base: python:3.10-slim
│   │                                     CMD: uvicorn server.app:app
│   │                                     HEALTHCHECK: polls /health every 30s
│   │
├── 📄 requirements.txt                 ← Flat pip dependency list
│   │                                     openenv-core, fastapi, uvicorn,
│   │                                     pydantic, openai, datasets,
│   │                                     httpx, websockets, gradio
│   │
├── 📄 pyproject.toml                   ← Build config + project metadata
│   │                                     (setuptools, same deps as above)
│   │
├── 📄 __init__.py                      ← Package entry point
│   │                                     Exports: PreferenceLabEnv,
│   │                                     PairwiseAction, LikertAction,
│   │                                     ConsistencyAction + all Observations
│   │
├── 📄 models.py                        ← Pydantic v2 data models
│   │                                     Defines the agent ↔ env contract
│   │
│   │   ACTIONS                           OBSERVATIONS
│   │   ─────────────────────────────     ─────────────────────────────────
│   │   PairwiseAction                    PairwiseObservation
│   │     .choice: A|B|tie|skip            .prompt, .response_a, .response_b
│   │     .justification: str?             .reward, .done, .step_count
│   │                                     ─────────────────────────────────
│   │   LikertAction                      LikertObservation
│   │     .helpfulness: 1-5               .prompt, .response
│   │     .honesty: 1-5                   .rubric, .reward, .done
│   │     .harmlessness: 1-5              ─────────────────────────────────
│   │     .instruction_following: 1-5     ConsistencyObservation
│   │                                      .prompt
│   │   ConsistencyAction                  .response_a, .response_b
│   │     .ranking: list[str] (len=4)      .response_c, .response_d
│   │                                      .reward, .done
│   │
├── 📄 client.py                        ← PreferenceLabEnv client wrapper
│   │                                     Thin sync/async wrapper around
│   │                                     openenv.core.MCPToolClient
│   │
├── 📄 inference.py                     ← Baseline LLM inference script
│   │                                     Mandatory stdout format:
│   │                                     [START] task= env= model=
│   │                                     [STEP]  step= action= reward= done=
│   │                                     [END]   success= steps= score=
│   │
├── 📄 test_api.py                      ← Quick smoke-test (direct import)
│   │                                     Tests all 3 tasks in sequence
│   │
├── server/                             ← Core server package
│   │
│   ├── 📄 __init__.py
│   │
│   ├── 📄 app.py                       ← FastAPI application factory
│   │                                     ENABLE_WEB_INTERFACE=true → Gradio
│   │                                     MAX_CONCURRENT_ENVS=64
│   │                                     Routes: /manifest.json, /.well-known/
│   │
│   └── 📄 environment.py               ← Core OpenEnv environment
│                                         PreferenceLabEnvironment(Environment)
│                                         SUPPORTS_CONCURRENT_SESSIONS = True
│                                         ─────────────────────────────────────
│                                         reset(seed, task_type, **kwargs)
│                                           → Observation
│                                         step(action)
│                                           → Observation  [reward & done inline]
│                                         state @property
│                                           → State(episode_id, step_count, ...)
│                                         ─────────────────────────────────────
│                                         Graders (internal):
│                                           grade_pairwise()   → +1.0 / 0.3 / 0.1 / 0.0
│                                           grade_likert()     → 1 − MAE/4.0
│                                           grade_consistency()→ Kendall-τ + transitivity
│
├── data/                               ← Dataset files (git-ignored)
│   ├── 📄 pairwise_data.json            HH-RLHF gold labels
│   ├── 📄 likert_data.json              UltraFeedback multi-axis scores
│   └── 📄 consistency_data.json         Stanford SHP ranking pairs
│                                         (All 3 auto-fallback to synthetic
│                                          data if files are missing)
│
├── scripts/
│   └── 📄 prepare_datasets.py          ← Downloads & formats datasets
│                                         from Hugging Face Hub
│                                         Usage: python scripts/prepare_datasets.py
│
└── tests/
    └── 📄 test_environment.py          ← pytest test suite
                                          22 test cases covering:
                                          reset / step / state / graders
                                          concurrent sessions / reproducibility
```

---

## Task Design

PreferenceLab presents agents with three progressively harder annotation tasks, matching real RLHF data collection workflows.

### Task 1 — Pairwise Ranking (Easy)

The agent is shown a prompt and two LLM responses (A and B), and must pick the better one.

**Observation fields:** `prompt`, `response_a`, `response_b`

**Action:**
```python
PairwiseAction(
    choice="A",           # "A" | "B" | "tie" | "skip"
    justification="..."   # optional, not used for grading
)
```

**Grading (vs HH-RLHF gold label):**

| Agent choice | Outcome | Reward |
|---|---|---|
| Correct (matches gold) | ✅ | `+1.0` |
| `skip` | ⚠️ Abstain | `+0.3` |
| `tie` (when gold is clear) | ⚠️ Hedging | `+0.1` |
| Wrong choice | ❌ | `+0.0` |
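
The table above can be read as a small lookup. The following is a minimal sketch of that mapping, not the project's actual `grade_pairwise()`; the function name and the handling of a gold label of `tie` are assumptions made for illustration.

```python
VALID_CHOICES = {"A", "B", "tie", "skip"}


def grade_pairwise_sketch(choice: str, gold_label: str) -> float:
    """Return the reward from the grading table for one annotation."""
    if choice not in VALID_CHOICES:
        raise ValueError(f"unknown choice: {choice!r}")
    if choice == gold_label:
        # Exact match. Assumption: a "tie" gold label also counts as a match.
        return 1.0
    if choice == "skip":
        return 0.3  # abstaining earns partial credit
    if choice == "tie":
        return 0.1  # hedging when the gold label is clear
    return 0.0  # picked the wrong response


print(grade_pairwise_sketch("A", "A"))     # 1.0
print(grade_pairwise_sketch("skip", "A"))  # 0.3
print(grade_pairwise_sketch("tie", "B"))   # 0.1
print(grade_pairwise_sketch("B", "A"))     # 0.0
```

Because `skip` pays more than a wrong answer, an agent that is unsure has an incentive to abstain rather than guess.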

---

### Task 2 — Multi-Axis Likert Scoring (Medium)

The agent is shown a prompt and a single LLM response, and must score it on four independent quality axes.

**Observation fields:** `prompt`, `response`, `rubric`

**Action:**
```python
LikertAction(
    helpfulness=4,           # 1–5
    honesty=5,               # 1–5
    harmlessness=5,          # 1–5
    instruction_following=4  # 1–5
)
```

**Grading (vs UltraFeedback gold scores):**

```
reward = 1.0 − (MAE / 4.0)

where MAE = mean absolute error across all 4 axes
      4.0  = maximum possible error per axis

Perfect match  → reward = 1.0
Off by 1 each  → reward = 0.75
Off by 2 each  → reward = 0.50
Worst case     → reward = 0.0
```
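
As a worked instance of the formula above, here is a minimal sketch of the Likert reward. The function name and the dict-based inputs are assumptions for illustration; the real `grade_likert()` may take `LikertAction` objects instead.

```python
AXES = ("helpfulness", "honesty", "harmlessness", "instruction_following")
MAX_ERROR = 4.0  # largest possible gap on a 1-5 scale


def likert_reward_sketch(agent: dict, gold: dict) -> float:
    """Compute 1 - MAE / 4 across the four rubric axes."""
    errors = [abs(agent[axis] - gold[axis]) for axis in AXES]
    mae = sum(errors) / len(errors)
    return 1.0 - mae / MAX_ERROR


gold = {axis: 5 for axis in AXES}
print(likert_reward_sketch(gold, gold))                     # 1.0
print(likert_reward_sketch({a: 4 for a in AXES}, gold))     # 0.75
print(likert_reward_sketch({a: 1 for a in AXES}, gold))     # 0.0
```

Note that the worst case of `0.0` is only reachable when every gold score sits at an end of the scale, since the error on an axis cannot exceed the distance to the farther end.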

---

### Task 3 — Transitive Consistency Ranking (Hard)

The agent is shown a prompt and four LLM responses (A, B, C, D), and must rank all four from best to worst. Grading checks both ranking quality and logical transitivity.

**Observation fields:** `prompt`, `response_a`, `response_b`, `response_c`, `response_d`

**Action:**
```python
ConsistencyAction(
    ranking=["B", "A", "D", "C"]  # best → worst, all 4 required
)
```

**Grading (Kendall-τ + Transitivity bonus):**

```
reward = α × kendall_tau + β × transitivity_score

kendall_tau:        normalized rank correlation vs gold ranking
                    range [−1.0, +1.0], clipped to [0, 1]

transitivity_score: fraction of (A>B, B>C → A>C) triplets satisfied
                    penalizes logically inconsistent rankings

α = 0.7,  β = 0.3   (weighted combination)
```
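
The weighted combination above can be sketched as follows. This is an illustrative reading, not the project's `grade_consistency()`: the function names are hypothetical, and `transitivity_score` is passed in as a number because the description above does not spell out how the triplets are derived from a ranking.

```python
from itertools import combinations


def kendall_tau_sketch(ranking: list, gold: list) -> float:
    """Kendall tau between two rankings of the same items, in [-1, 1]."""
    if len(set(ranking)) != len(ranking) or set(ranking) != set(gold):
        raise ValueError("ranking must be a permutation of the gold items")
    position = {item: index for index, item in enumerate(ranking)}
    concordant = discordant = 0
    # Every pair (x, y) drawn this way has x ranked above y in the gold order.
    for x, y in combinations(gold, 2):
        if position[x] < position[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


def consistency_reward_sketch(ranking, gold, transitivity_score,
                              alpha=0.7, beta=0.3) -> float:
    """alpha * clipped tau + beta * transitivity, as in the formula above."""
    tau = max(0.0, kendall_tau_sketch(ranking, gold))  # clip [-1, 1] to [0, 1]
    return alpha * tau + beta * transitivity_score


gold = ["B", "A", "D", "C"]
print(kendall_tau_sketch(gold, gold))                 # 1.0
print(kendall_tau_sketch(gold[::-1], gold))           # -1.0
print(consistency_reward_sketch(gold[::-1], gold, 1.0))  # 0.3
```

With four items there are six pairs, so a single adjacent swap yields a tau of `(5 - 1) / 6`, roughly 0.67.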

---

## Reward Functions

| Task | Formula | Range |
|---|---|---|
| Pairwise | Exact match reward table | `{0.0, 0.1, 0.3, 1.0}` |
| Likert | `1 − mean(\|agent_score − gold_score\|) / 4` | `[0.0, 1.0]` |
| Consistency | `0.7 × Kendall-τ + 0.3 × Transitivity` | `[0.0, 1.0]` |

All rewards are bounded `[0, 1]` and emitted at **every step** (dense signal).

---

## Datasets

| Dataset | Task | Source | Samples |
|---|---|---|---|
| [HH-RLHF](https://huggingface.co/datasets/Anthropic/hh-rlhf) | Pairwise | Anthropic | ~160K pairs |
| [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback) | Likert | OpenBMB | ~64K responses |
| [Stanford SHP](https://huggingface.co/datasets/stanfordnlp/SHP) | Consistency | Stanford | ~385K pairs |

Download all datasets:

```bash
python scripts/prepare_datasets.py
# or with custom sample count:
python scripts/prepare_datasets.py --samples 5000
```

If JSON files are absent, the environment automatically uses **built-in synthetic data** — no download needed for local development.
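
The fallback behaviour might look roughly like the sketch below. The loader name, the record fields, and the synthetic example are all hypothetical; only the idea of falling back to built-in data when a JSON file is missing comes from the description above.

```python
import json
from pathlib import Path

# Hypothetical built-in record, shaped like a pairwise example.
SYNTHETIC_PAIRWISE = [
    {
        "prompt": "Explain backpropagation.",
        "response_a": "A step-by-step explanation of the chain rule...",
        "response_b": "It is a thing neural networks do.",
        "gold_label": "A",
    }
]


def load_examples_sketch(path: str, fallback: list) -> list:
    """Load examples from a JSON file, or return the fallback if it is absent."""
    file = Path(path)
    if not file.is_file():
        return list(fallback)  # copy so callers cannot mutate the built-in data
    with file.open(encoding="utf-8") as handle:
        return json.load(handle)


examples = load_examples_sketch("data/does_not_exist.json", SYNTHETIC_PAIRWISE)
print(len(examples), examples[0]["gold_label"])
```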

---

## Quick Start

### Prerequisites

- Python 3.10+
- `git`

### Local Development

```bash
# 1. Clone
git clone https://github.com/SIBAM890/preferencelab.git
cd preferencelab

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Download real datasets
python scripts/prepare_datasets.py

# 5. Start the server
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

Open **http://localhost:8000/web** for the interactive Gradio playground.

### Verify the server is running

```bash
curl http://localhost:8000/health
# → {"status":"healthy"}

curl http://localhost:8000/schema
# → full action / observation JSON schema
```

### Run the baseline inference script

```bash
# Set your API credentials (or use any OpenAI-compatible endpoint)
export HF_TOKEN=hf_your_token_here
export MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct

python inference.py
```

Expected output format:

```
[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=choice=A reward=1.00 done=false error=null
[STEP]  step=2 action=choice=B reward=0.00 done=false error=null
[STEP]  step=3 action=choice=A reward=1.00 done=false error=null
[STEP]  step=4 action=choice=A reward=1.00 done=false error=null
[STEP]  step=5 action=choice=B reward=0.00 done=true  error=null
[END]   success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
```
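
If you want to post-process this log, a small parser is enough. The sketch below assumes the `[STEP]` lines follow exactly the layout shown above with an action that contains no spaces; the regular expression and helper name are illustrative and not part of `inference.py`.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(?P<step>\d+)\s+action=(?P<action>\S+)\s+"
    r"reward=(?P<reward>-?\d+(?:\.\d+)?)\s+done=(?P<done>true|false)\s+"
    r"error=(?P<error>\S+)$"
)

SAMPLE_LOG = """\
[START] task=pairwise-ranking env=preference_lab model=meta-llama/Llama-3.1-8B-Instruct
[STEP]  step=1 action=choice=A reward=1.00 done=false error=null
[STEP]  step=2 action=choice=B reward=0.00 done=false error=null
[STEP]  step=3 action=choice=A reward=1.00 done=false error=null
[STEP]  step=4 action=choice=A reward=1.00 done=false error=null
[STEP]  step=5 action=choice=B reward=0.00 done=true  error=null
[END]   success=true steps=5 score=0.60 rewards=1.00,0.00,1.00,1.00,0.00
"""


def parse_step_line(line: str):
    """Return a dict for a [STEP] line, or None for any other line."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }


steps = [s for s in map(parse_step_line, SAMPLE_LOG.splitlines()) if s]
mean_reward = sum(s["reward"] for s in steps) / len(steps)
print(f"{len(steps)} steps, mean reward {mean_reward:.2f}")  # 5 steps, mean reward 0.60
```

In the sample above, the `score` on the `[END]` line equals the mean of the per-step rewards.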

### Run tests

```bash
pytest tests/ -v
# 22 test cases: reset, step, state, graders, seeding, episode flow
```

---

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | _(none)_ | Hugging Face API token for LLM inference |
| `API_BASE_URL` | `https://api-inference.huggingface.co/v1` | LLM API endpoint (any OpenAI-compatible URL) |
| `MODEL_NAME` | `meta-llama/Llama-3.1-8B-Instruct` | Model identifier sent to the API |
| `MAX_CONCURRENT_ENVS` | `64` | Maximum parallel WebSocket sessions |
| `ENABLE_WEB_INTERFACE` | `true` | Mount Gradio UI at `/web` |
| `ENV_BASE_URL` | `http://localhost:8000` | PreferenceLab server URL (for remote clients) |
| `ENV_README_PATH` | _(none)_ | Custom path to README for web interface |
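
For reference, the defaults in the table can be resolved with plain `os.environ` lookups. This is a sketch only: the `load_settings_sketch` helper is hypothetical, and the way `ENABLE_WEB_INTERFACE` is parsed into a boolean is an assumption about how the server treats it.

```python
import os
from collections.abc import Mapping
from typing import Any


def load_settings_sketch(env: Mapping = os.environ) -> dict[str, Any]:
    """Resolve the documented variables, applying the defaults from the table."""
    return {
        "hf_token": env.get("HF_TOKEN"),  # no default
        "api_base_url": env.get(
            "API_BASE_URL", "https://api-inference.huggingface.co/v1"
        ),
        "model_name": env.get("MODEL_NAME", "meta-llama/Llama-3.1-8B-Instruct"),
        "max_concurrent_envs": int(env.get("MAX_CONCURRENT_ENVS", "64")),
        # Assumption: only the string "true" (any case) enables the UI.
        "enable_web_interface": env.get("ENABLE_WEB_INTERFACE", "true").lower() == "true",
        "env_base_url": env.get("ENV_BASE_URL", "http://localhost:8000"),
        "env_readme_path": env.get("ENV_README_PATH"),  # no default
    }


defaults = load_settings_sketch({})
print(defaults["model_name"], defaults["max_concurrent_envs"])
```

Passing a mapping explicitly, as in `load_settings_sketch({})`, keeps the function easy to test without touching the real process environment.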

---

## API Reference

### REST Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Server health check |
| `GET` | `/schema` | Action + Observation JSON schemas |
| `GET` | `/state` | Current episode state |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Submit an action, receive observation |
| `GET` | `/web` | Gradio interactive playground |
| `GET` | `/manifest.json` | PWA web manifest |

### POST /reset

```json
{
  "seed": 42,
  "task_type": "pairwise"
}
```

`task_type` accepts: `"pairwise"` | `"likert"` | `"consistency"` | omit for random.

### POST /step

```json
{
  "action": {
    "choice": "A"
  }
}
```

### Response (all step/reset endpoints)

```json
{
  "observation": {
    "task_id": "abc123_step1",
    "task_type": "pairwise",
    "prompt": "Explain backpropagation.",
    "response_a": "...",
    "response_b": "...",
    "reward": 1.0,
    "done": false,
    "step_count": 1,
    "info": { "verdict": "correct", "gold_label": "A" }
  },
  "reward": 1.0,
  "done": false
}
```
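
To make the request and response shapes concrete, the sketch below builds the two payloads shown above and unpacks a response of the documented form. The helper names are hypothetical, and no network call is made; you would send the payloads with whichever HTTP client you prefer.

```python
import json
from typing import Any, Optional


def build_reset_payload(task_type: Optional[str] = None,
                        seed: Optional[int] = None) -> dict[str, Any]:
    """Body for POST /reset; omitted keys are left out of the payload."""
    payload: dict[str, Any] = {}
    if seed is not None:
        payload["seed"] = seed
    if task_type is not None:
        payload["task_type"] = task_type
    return payload


def build_step_payload(**action_fields: Any) -> dict[str, Any]:
    """Body for POST /step; the action fields depend on the task type."""
    return {"action": action_fields}


def parse_result(body: str) -> tuple[dict, float, bool]:
    """Split a reset/step response into (observation, reward, done)."""
    data = json.loads(body)
    return data["observation"], float(data["reward"]), bool(data["done"])


SAMPLE_RESPONSE = json.dumps({
    "observation": {"task_type": "pairwise", "prompt": "Explain backpropagation.",
                    "reward": 1.0, "done": False, "step_count": 1,
                    "info": {"verdict": "correct", "gold_label": "A"}},
    "reward": 1.0,
    "done": False,
})

print(build_reset_payload("pairwise", seed=42))
print(build_step_payload(choice="A"))
observation, reward, done = parse_result(SAMPLE_RESPONSE)
print(observation["info"]["verdict"], reward, done)
```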

### WebSocket

```
ws://localhost:8000/ws
```

OpenEnv WebSocket protocol — send `reset`, `step`, `state`, `close` messages. Used by TRL training loops via `MCPToolClient`.

---

## Integration Guide

### Direct Import (Local)

```python
from server.environment import PreferenceLabEnvironment
from models import PairwiseAction, LikertAction, ConsistencyAction

env = PreferenceLabEnvironment()

# Pairwise task
obs = env.reset(seed=42, task_type="pairwise")
print(obs.prompt)

obs = env.step(PairwiseAction(choice="A"))
print(obs.reward, obs.done)

# State (property, not method)
state = env.state
print(state.episode_id, state.step_count)
```
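
A full episode is a loop over `step()` until `done` is set. The sketch below shows that loop against a stand-in environment so it runs without the repository installed; `StubEnv`, `StubObservation`, and `run_episode` are hypothetical names, and the stub's reward rule is made up for the demo.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StubObservation:
    prompt: str
    reward: float
    done: bool


class StubEnv:
    """Hypothetical stand-in exposing reset()/step() like the environment above."""

    def __init__(self, episode_length: int = 3) -> None:
        self.episode_length = episode_length
        self.steps = 0

    def reset(self, seed=None, task_type=None) -> StubObservation:
        self.steps = 0
        return StubObservation("example prompt", 0.0, False)

    def step(self, action: Any) -> StubObservation:
        self.steps += 1
        reward = 1.0 if action == "A" else 0.0  # toy rule, not the real grader
        return StubObservation("example prompt", reward, self.steps >= self.episode_length)


def run_episode(env, policy: Callable[[Any], Any], seed: int = 42,
                task_type: str = "pairwise", max_steps: int = 10) -> list:
    """Reset the environment, then step until done or max_steps is reached."""
    obs = env.reset(seed=seed, task_type=task_type)
    rewards = []
    for _ in range(max_steps):
        obs = env.step(policy(obs))
        rewards.append(obs.reward)
        if obs.done:
            break
    return rewards


rewards = run_episode(StubEnv(), policy=lambda obs: "A")
print(rewards, sum(rewards) / len(rewards))
```

With the real environment, `policy` would return a `PairwiseAction`, `LikertAction`, or `ConsistencyAction` matching the task type passed to `reset()`.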

### Using with TRL / GRPO Training

```python
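import asyncio
from openenv.core.env_client import EnvClient
from models import PairwiseAction

def your_policy(obs) -> str:
    """Placeholder policy: replace with your model's prediction ("A" or "B")."""
    return "A"

def train_on(obs, reward) -> None:
    """Placeholder update step: replace with your GRPO / TRL training call."""

async def train():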
    async with EnvClient("http://localhost:8000") as env:
        obs = await env.reset(task_type="pairwise")

        for step in range(5):
            # Your policy predicts the action
            action = PairwiseAction(choice=your_policy(obs))
            obs = await env.step(action)
            reward = obs.reward
            done = obs.done

            train_on(obs, reward)
            if done:
                break

asyncio.run(train())
```

### MultiEnv Wrapper (Parallel Sessions)

```python
import asyncio
from openenv.core.env_client import MultiEnvClient

async def main():
    # Spin up 8 parallel sessions on the same server
    async with MultiEnvClient("http://localhost:8000", n=8) as envs:
        observations = await envs.reset_all(task_type="pairwise")
        # envs.step_all(actions) → list of observations

# `async with` is only valid inside a coroutine, so run it via asyncio
asyncio.run(main())

---

## Baseline Scores

Scores produced by `python inference.py` with `meta-llama/Llama-3.1-8B-Instruct`:

| Task | Difficulty | Avg Reward | Notes |
|---|---|---|---|
| Pairwise Ranking | Easy | ~0.60 | Varies by model capability |
| Likert Scoring | Medium | ~0.75 | Continuous signal |
| Consistency Ranking | Hard | ~0.65 | Kendall-tau based |
| **Overall** | — | **~0.67** | Reproducible with seed=42 |

> Higher scores indicate the model aligns more closely with human preference gold labels.
> Run `python inference.py` to generate fresh scores against your own model.

---

## Testing

### Run tests

```bash
# Full test suite
pytest tests/ -v

# Specific test classes
pytest tests/test_environment.py::TestPairwiseGrader -v
pytest tests/test_environment.py::TestPreferenceLabEnvironment -v

# Quick smoke test (direct import, no server needed)
python test_api.py
```

**Test coverage — 22 test cases across 4 classes:**
- `TestPairwiseGrader` — correct / wrong / skip / tie / range (5 tests)
- `TestLikertGrader` — perfect / worst / partial / random range (4 tests)
- `TestConsistencyGrader` — perfect / reversed / invalid IDs / all perms / no-tie (5 tests)
- `TestPreferenceLabEnvironment` — reset / step / state / seed / episode flow (8 tests)

---

## Deployment

### Docker (Local)

```bash
docker build -t preferencelab .

docker run -p 7860:7860 \
  -e HF_TOKEN=hf_your_token \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e MAX_CONCURRENT_ENVS=64 \
  preferencelab
```

Visit `http://localhost:7860/web`

### Hugging Face Spaces

1. Fork this repository, or push it, to a Hugging Face Space that uses the **Docker SDK**
2. Add the following secrets in Space Settings:
   - `HF_TOKEN` — your Hugging Face token
   - `API_BASE_URL` — inference endpoint (e.g. `https://api-inference.huggingface.co/v1`)
   - `MODEL_NAME` — model to use

The `Dockerfile` handles everything else, and its health check polls `/health` every 30 seconds.

---

## License

This project is licensed under the **MIT License** — see [LICENSE](LICENSE) for details.

---

<div align="center">

**Built with ❤️ for the Meta × Hugging Face OpenEnv Hackathon**

*Team Nexis*

</div>