Multi-Turn Constitutional Tournament Environment

Tournament-style reward system for Constitutional AI training with multi-turn conversation support.

Concept

This environment extends the Constitutional Tournament with multi-turn conversation handling:

Loads ShareGPT format datasets (e.g., anthracite-org/kalo-opus-instruct-22k-no-refusal)
Extracts all conversation turns (excluding system prompts) with configurable max_turns
Pairs off rollouts (e.g., 256 rollouts per example)
Judges pairs using constitutional principles with full conversation context
Winners advance to face other winners
Every win earns reward: responses that satisfy more principles accumulate more wins

Multi-Turn Configuration

Control how many conversation turns to include:

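# Pass a single max_turns value per call (other arguments omitted for brevity)
load_environment(max_turns=-1)  # All turns (default)
load_environment(max_turns=1)   # Single turn (first human message only)
load_environment(max_turns=3)   # Up to 3 human turns with assistant responses between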

The max_turns parameter counts human turns. If set to 2, the prompt will include:

First human message
First assistant response (if present)
Second human message

The model generates the next response in the conversation.

Multi-Turn Judge Prompt Format

The judge sees the full conversation context with XML-separated turns:

<conversation-context>
<turn-1 role="user">
What is the capital of France?
</turn-1>

<turn-2 role="assistant">
Paris is the capital of France.
</turn-2>

<turn-3 role="user">
Tell me more about it.
</turn-3>
</conversation-context>

<response-a>
[Response A]
</response-a>

<response-b>
[Response B]
</response-b>
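
The layout above can be reproduced with a few lines of string formatting. The following is a minimal sketch, not the environment's actual code: the function name `format_judge_context` and the `role`/`content` message keys are assumptions made for illustration.

```python
def format_judge_context(messages, response_a, response_b):
    """Render conversation turns plus two candidate responses as XML-style text.

    `messages` is assumed to be a list of {"role": ..., "content": ...} dicts.
    The function name and message shape are hypothetical.
    """
    # Turns are numbered from 1 and separated by a blank line.
    turns = "\n\n".join(
        f'<turn-{i} role="{m["role"]}">\n{m["content"]}\n</turn-{i}>'
        for i, m in enumerate(messages, start=1)
    )
    return (
        f"<conversation-context>\n{turns}\n</conversation-context>\n\n"
        f"<response-a>\n{response_a}\n</response-a>\n\n"
        f"<response-b>\n{response_b}\n</response-b>"
    )


if __name__ == "__main__":
    context = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris is the capital of France."},
        {"role": "user", "content": "Tell me more about it."},
    ]
    print(format_judge_context(context, "[Response A]", "[Response B]"))
```

Numbering the turn tags keeps each turn unambiguous for the judge even when message bodies contain angle brackets of their own.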

Dataset Format

Expects ShareGPT format with a conversations field:

{
  "conversations": [
    {"from": "system", "value": "..."},  // Skipped (not included)
    {"from": "human", "value": "..."},   // Included as user turn
    {"from": "gpt", "value": "..."},     // Included as assistant turn
    {"from": "human", "value": "..."},   // Included as user turn
    ...
  ]
}

System prompts are always skipped. The last message in the prompt is always a user message (trailing assistant messages are removed so the model generates the response).
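
The extraction rules (skip system prompts, count human turns against max_turns, drop trailing assistant messages) can be sketched as follows. This is an illustrative reading of the rules, not the environment's implementation: `build_prompt` and `ROLE_MAP` are hypothetical names, and silently skipping roles other than `human` and `gpt` is an assumption.

```python
# Hypothetical mapping from ShareGPT "from" values to chat roles.
ROLE_MAP = {"human": "user", "gpt": "assistant"}


def build_prompt(conversations, max_turns=-1):
    """Convert a ShareGPT `conversations` list into a chat prompt.

    max_turns counts human turns; -1 keeps every turn.
    """
    messages = []
    human_turns = 0
    for entry in conversations:
        role = ROLE_MAP.get(entry["from"])
        if role is None:
            # System prompts are skipped (assumption: so is any unknown role).
            continue
        if role == "user":
            if max_turns != -1 and human_turns >= max_turns:
                break  # human-turn budget exhausted
            human_turns += 1
        messages.append({"role": role, "content": entry["value"]})
    # The prompt must end on a user message so the model writes the reply.
    while messages and messages[-1]["role"] == "assistant":
        messages.pop()
    return messages


if __name__ == "__main__":
    example = [
        {"from": "system", "value": "You are helpful."},
        {"from": "human", "value": "What is the capital of France?"},
        {"from": "gpt", "value": "Paris is the capital of France."},
        {"from": "human", "value": "Tell me more about it."},
        {"from": "gpt", "value": "Paris sits on the Seine."},
    ]
    for message in build_prompt(example, max_turns=2):
        print(message["role"], "|", message["content"])
```

With `max_turns=2` the example yields the first human message, the first assistant response, and the second human message, matching the description in the Multi-Turn Configuration section.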

Configuration

load_environment(
    # Dataset - ShareGPT format from HuggingFace
    dataset_name="anthracite-org/kalo-opus-instruct-22k-no-refusal",

    # Constitution
    constitution_path="/tank/mango/mango-verifiers/const.txt",

    # Judge model (required)
    judge_model="openai/gpt-4.1-mini",
    judge_base_url="https://app.firmware.ai/api/v1",
    judge_api_key="your-api-key",
    judge_temperature=0.3,
    judge_timeout=120.0,

    # Concurrency
    max_concurrent_judges=64,
    max_concurrent_tournaments=4,

    # Dataset size
    num_train_examples=10000,
    num_eval_examples=500,

    # Multi-turn configuration
    max_turns=-1,  # -1 for all turns, or specific number
)
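
For intuition about what max_concurrent_judges and judge_timeout govern, here is one common way to bound concurrent judge calls with asyncio. It is a sketch under assumptions, not a description of how this environment does it: `judge_all` and `judge_fn` are hypothetical names, and `judge_fn` stands in for whatever coroutine calls the judge model.

```python
import asyncio


async def judge_all(pairs, judge_fn, max_concurrent_judges=64, judge_timeout=120.0):
    """Judge every pair while capping in-flight calls and per-call duration.

    `judge_fn(a, b)` is a hypothetical coroutine that returns the verdict
    for one pair of responses.
    """
    semaphore = asyncio.Semaphore(max_concurrent_judges)

    async def judge_one(pair):
        # At most `max_concurrent_judges` coroutines pass this point at once.
        async with semaphore:
            # Raises asyncio.TimeoutError if the judge takes too long.
            return await asyncio.wait_for(judge_fn(*pair), timeout=judge_timeout)

    # gather preserves the order of `pairs` in its results.
    return await asyncio.gather(*(judge_one(p) for p in pairs))
```

A caller would drive this with `asyncio.run(judge_all(pairs, judge_fn))`. How a timeout should be scored (retry, forfeit, or coin flip) is left open here because the source does not specify it.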

Usage

# Install
vf-install multiturn_constitutional_tournament

# Run evaluation
vf-eval multiturn_constitutional_tournament \
    -n 5 \
    -m your-model \
    --rollouts-per-example 16

# Training
vf-train multiturn_constitutional_tournament \
    --model your-model \
    --rollouts-per-example 256

Tournament Structure

The bracket is the same as in Constitutional Tournament. For 256 rollouts per example (points are cumulative wins):

Round 1: 256 -> 128 winners (128 get 1 point)
Round 2: 128 -> 64 winners  (64 get 2 points)
Round 3: 64 -> 32 winners   (32 get 3 points)
Round 4: 32 -> 16 winners   (16 get 4 points)
Round 5: 16 -> 8 winners    (8 get 5 points)
Round 6: 8 -> 4 winners     (4 get 6 points)
Round 7: 4 -> 2 winners     (2 get 7 points)
Round 8: 2 -> 1 winner      (1 gets 8 points)

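Final reward = wins / total_rounds (normalized to 0-1). With 256 rollouts total_rounds is 8, so a first-round loser receives 0 and the overall winner receives 1.0.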
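
The bracket and reward rule can be sketched in a few lines. This is a simplified illustration, not the environment's code: `run_tournament` and `judge` are hypothetical names, adjacent rollouts are paired in list order (the real pairing may differ), and the sketch assumes a power-of-two rollout count.

```python
def run_tournament(rollouts, judge):
    """Single-elimination bracket returning one reward in [0, 1] per rollout.

    `judge(a, b)` is a hypothetical callable that returns True when `a`
    beats `b`. Rewards are wins / total_rounds.
    """
    n = len(rollouts)
    if n < 2 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two rollout count >= 2")
    total_rounds = n.bit_length() - 1  # log2(n), e.g. 8 rounds for 256 rollouts
    wins = [0] * n
    alive = list(range(n))  # indices of rollouts still in the bracket
    while len(alive) > 1:
        next_round = []
        for i in range(0, len(alive), 2):
            a, b = alive[i], alive[i + 1]
            winner = a if judge(rollouts[a], rollouts[b]) else b
            wins[winner] += 1
            next_round.append(winner)
        alive = next_round
    return [w / total_rounds for w in wins]


if __name__ == "__main__":
    # Toy judge: the larger number wins, so the maximum wins every round.
    rewards = run_tournament(list(range(8)), judge=lambda a, b: a > b)
    print(rewards)
```

Because half the field is eliminated each round, half of all rollouts end with reward 0, which makes the reward distribution heavily skewed toward the few responses that keep winning.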