# Multi-Turn Constitutional Tournament Environment

Tournament-style reward system for Constitutional AI training with multi-turn conversation support.
## Concept
This environment extends the Constitutional Tournament with multi-turn conversation handling:
- Loads ShareGPT-format datasets (e.g., `anthracite-org/kalo-opus-instruct-22k-no-refusal`)
- Extracts all conversation turns (excluding system prompts), with a configurable `max_turns`
- Pairs off rollouts (e.g., 256 rollouts per example)
- Judges pairs using constitutional principles with full conversation context
- Winners advance to face other winners
- Every win adds to the reward, so responses satisfying more principles accumulate more wins
## Multi-Turn Configuration
Control how many conversation turns to include:
```python
load_environment(max_turns=-1)  # All turns (default)
load_environment(max_turns=1)   # Single turn (first human message only)
load_environment(max_turns=3)   # Up to 3 human turns, with assistant responses between
```
The `max_turns` parameter counts human turns. If set to 2, the prompt will include:
- First human message
- First assistant response (if present)
- Second human message
The model generates the next response in the conversation.
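To make the truncation rule concrete, here is a minimal sketch of how it could be implemented. The helper name and the plain `role`/`content` dicts are assumptions for illustration, not the environment's actual code:

```python
def truncate_to_max_turns(messages, max_turns=-1):
    """Keep up to `max_turns` user turns (plus the assistant replies
    between them), ending on a user message. Illustrative sketch."""
    messages = list(messages)  # don't mutate the caller's list
    if max_turns != -1:
        kept, user_turns = [], 0
        for msg in messages:
            if msg["role"] == "user":
                if user_turns == max_turns:
                    break
                user_turns += 1
            kept.append(msg)
        messages = kept
    # Drop trailing assistant messages so the model generates the next reply.
    while messages and messages[-1]["role"] == "assistant":
        messages.pop()
    return messages
```

With `max_turns=2`, this keeps the first user message, the first assistant reply, and the second user message, exactly as listed above.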
## Multi-Turn Judge Prompt Format
The judge sees the full conversation context with XML-separated turns:
```xml
<conversation-context>
<turn-1 role="user">
What is the capital of France?
</turn-1>
<turn-2 role="assistant">
Paris is the capital of France.
</turn-2>
<turn-3 role="user">
Tell me more about it.
</turn-3>
</conversation-context>
<response-a>
[Response A]
</response-a>
<response-b>
[Response B]
</response-b>
```
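A minimal sketch of how such a prompt could be assembled, again assuming chat-style `role`/`content` dicts; the function name is illustrative, not the environment's actual API:

```python
def format_judge_prompt(context_messages, response_a, response_b):
    """Render the conversation context and the two candidate responses
    in the XML layout shown above. Illustrative sketch."""
    turns = []
    for i, msg in enumerate(context_messages, start=1):
        turns.append(
            f'<turn-{i} role="{msg["role"]}">\n{msg["content"]}\n</turn-{i}>'
        )
    context = "\n".join(turns)
    return (
        f"<conversation-context>\n{context}\n</conversation-context>\n"
        f"<response-a>\n{response_a}\n</response-a>\n"
        f"<response-b>\n{response_b}\n</response-b>"
    )
```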
## Dataset Format

Expects ShareGPT format with a `conversations` field:
```jsonc
{
  "conversations": [
    {"from": "system", "value": "..."},  // Skipped (not included)
    {"from": "human", "value": "..."},   // Included as user turn
    {"from": "gpt", "value": "..."},     // Included as assistant turn
    {"from": "human", "value": "..."},   // Included as user turn
    ...
  ]
}
```
System prompts are always skipped. The last message in the prompt is always a user message (trailing assistant messages are removed so the model generates the response).
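A minimal sketch of the ShareGPT-to-chat conversion described above, assuming the `from`/`value` keys shown in the example; the role mapping and helper name are illustrative:

```python
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def sharegpt_to_messages(example):
    """Convert a ShareGPT `conversations` list to chat messages,
    skipping system prompts. Illustrative sketch."""
    messages = []
    for turn in example["conversations"]:
        if turn["from"] == "system":
            continue  # System prompts are always skipped
        messages.append({"role": ROLE_MAP[turn["from"]], "content": turn["value"]})
    return messages
```

A helper like `truncate_to_max_turns` from the earlier sketch would then apply the `max_turns` limit and trim trailing assistant messages.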
## Configuration
```python
load_environment(
    # Dataset - ShareGPT format from HuggingFace
    dataset_name="anthracite-org/kalo-opus-instruct-22k-no-refusal",

    # Constitution
    constitution_path="/tank/mango/mango-verifiers/const.txt",

    # Judge model (required)
    judge_model="openai/gpt-4.1-mini",
    judge_base_url="https://app.firmware.ai/api/v1",
    judge_api_key="your-api-key",
    judge_temperature=0.3,
    judge_timeout=120.0,

    # Concurrency
    max_concurrent_judges=64,
    max_concurrent_tournaments=4,

    # Dataset size
    num_train_examples=10000,
    num_eval_examples=500,

    # Multi-turn configuration
    max_turns=-1,  # -1 for all turns, or a specific number
)
```
## Usage
```bash
# Install
vf-install multiturn_constitutional_tournament

# Run evaluation
vf-eval multiturn_constitutional_tournament \
    -n 5 \
    -m your-model \
    --rollouts-per-example 16

# Training
vf-train multiturn_constitutional_tournament \
    --model your-model \
    --rollouts-per-example 256
```
## Tournament Structure

Same as the Constitutional Tournament. For 256 rollouts per example:
```
Round 1: 256 -> 128 winners (128 get 1 point)
Round 2: 128 -> 64 winners  (64 get 2 points)
Round 3: 64 -> 32 winners   (32 get 3 points)
Round 4: 32 -> 16 winners   (16 get 4 points)
Round 5: 16 -> 8 winners    (8 get 5 points)
Round 6: 8 -> 4 winners     (4 get 6 points)
Round 7: 4 -> 2 winners     (2 get 7 points)
Round 8: 2 -> 1 winner      (1 gets 8 points)
```

Final reward = `wins / total_rounds` (normalized to 0-1).
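A minimal sketch of the bracket and reward computation described above, assuming a `judge_pair` callable that returns `"a"` or `"b"` for the winning response; all names here are illustrative:

```python
import math

def run_tournament(responses, judge_pair):
    """Single-elimination bracket: winners advance, and each response's
    reward is its win count normalized by the number of rounds.
    Illustrative sketch; assumes len(responses) is a power of two."""
    total_rounds = int(math.log2(len(responses)))
    wins = {i: 0 for i in range(len(responses))}
    bracket = list(range(len(responses)))
    for _ in range(total_rounds):
        next_round = []
        for a, b in zip(bracket[0::2], bracket[1::2]):
            winner = a if judge_pair(responses[a], responses[b]) == "a" else b
            wins[winner] += 1
            next_round.append(winner)
        bracket = next_round
    return [wins[i] / total_rounds for i in range(len(responses))]
```

With 256 rollouts, the champion finishes with 8 wins (reward 1.0) and first-round losers with 0.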