Post-Training Plan: 6 Entities × 1 Hour Parallel
Overview
6 HF A100 Spaces running in parallel. Total wall time: 1 hour. Total cost: $15. Base model: Qwen/Qwen3-8B (no quantization).
GRPO post-training on OpenEnv. Qwen3-8B already knows how to reason; we're aligning it to each entity's policy behavior through the environment reward signal.
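As a rough illustration of that wiring, here is a minimal sketch of a reward function in the shape TRL's GRPOTrainer expects. The `score_completion` helper and the toy dataset are hypothetical stand-ins for the OpenEnv scoring call and the entity replay prompts, not the actual trenches_env implementation.

```python
# Minimal sketch of how an environment reward plugs into TRL's GRPO loop.
# `score_completion` is a hypothetical stand-in for the OpenEnv scoring call;
# the dataset is a toy placeholder for the entity's replay prompts.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer


def score_completion(prompt: str, completion: str) -> float:
    # Hypothetical: in the real setup the environment scores the action
    # proposed in `completion` against the entity's policy behavior.
    return float("decision:" in completion.lower())


def entity_reward(prompts, completions, **kwargs):
    # TRL calls this with all sampled completions for a group and expects
    # one float reward per completion.
    return [score_completion(p, c) for p, c in zip(prompts, completions)]


train_dataset = Dataset.from_dict(
    {"prompt": ["Scenario briefing (placeholder). What does the entity do next?"]}
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-8B",
    reward_funcs=entity_reward,          # environment reward drives alignment
    args=GRPOConfig(output_dir="checkpoints/ENTITY-qwen3-8b"),
    train_dataset=train_dataset,
)
trainer.train()
```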
Cost
| Item | Rate | Quantity | Cost |
|---|---|---|---|
| A100 80GB | $2.50/hr | 6 Spaces × 1h | $15 |
Optimal Hyperparameters
Researched from TRL docs, DeepSeek-R1 paper, Open-R1 recipe, and TRL OpenEnv examples.
```yaml
# Model
model_id: Qwen/Qwen3-8B
# No quantization: full precision on A100 80GB.
# (The QeRL paper notes quantization noise can aid exploration,
#  but we keep full precision here for simpler checkpointing.)

# GRPO Core (from DeepSeek-R1 + Open-R1 recipes)
algorithm: GRPO
loss_type: grpo
beta: 0.001          # KL coefficient (DeepSeek-R1 uses 0.001)
num_generations: 16  # DeepSeek-R1: "sample 16 outputs per prompt";
                     # more generations = better group-relative advantage signal
max_steps: 100       # ~1 hour on A100 with these settings
warmup_steps: 10     # Stabilize early training

# Learning Rate
learning_rate: 5e-6  # Open-R1 + OpenEnv Sudoku example both use 5e-6;
                     # higher than our earlier 5e-7: research shows
                     # post-training converges faster in this range

# Batching
per_device_train_batch_size: 1  # Memory-safe for 8B at full precision
gradient_accumulation_steps: 8  # Effective batch = 8 (from TRL Sudoku OpenEnv example)

# Context
max_prompt_length: 1536
max_completion_length: 256

# Generation Sampling (from TRL OpenEnv Sudoku)
temperature: 0.8  # Balanced exploration vs. exploitation
top_k: 10         # Focused sampling

# Saving
save_strategy: steps
save_steps: 25  # Checkpoint every 25 steps (4 saves per run)

# Inference
generation_backend: transformers  # vllm if CUDA is available
# If vllm: use_vllm=True, vllm_mode="colocate", vllm_gpu_memory_utilization=0.3

# Preview
preview_samples: 3
training_stage: stage_1_dense
```
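Assuming the training CLI sits on top of TRL, the config above maps roughly onto `trl.GRPOConfig` as sketched below. Field names follow recent TRL releases; the commented vLLM lines mirror the note above and are an assumption about how the CLI would wire colocated generation, not a confirmed implementation.

```python
# Sketch: the YAML above expressed as a TRL GRPOConfig.
# Exact wiring inside training_cli.py may differ.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="checkpoints/ENTITY-qwen3-8b",
    # GRPO core
    loss_type="grpo",
    beta=0.001,
    num_generations=16,
    max_steps=100,
    warmup_steps=10,
    learning_rate=5e-6,
    # Batching
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    # Context
    max_prompt_length=1536,
    max_completion_length=256,
    # Rollout sampling
    temperature=0.8,
    top_k=10,
    # Checkpointing
    save_strategy="steps",
    save_steps=25,
    # Optional colocated vLLM generation (only if CUDA is available):
    # use_vllm=True,
    # vllm_mode="colocate",
    # vllm_gpu_memory_utilization=0.3,
)
```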
Why These Settings
| Setting | Source | Reasoning |
|---|---|---|
| `num_generations: 16` | DeepSeek-R1 | More rollouts = better advantage estimation; 16 is the standard group size for GRPO |
| `beta: 0.001` | DeepSeek-R1 | Low KL penalty lets the model explore further from the base policy |
| `learning_rate: 5e-6` | Open-R1 + TRL examples | 10x higher than our earlier setting; post-training on instruct models converges with a higher LR |
| `gradient_accumulation: 8` | TRL OpenEnv Sudoku | Effective batch of 8 stabilizes updates without excessive VRAM |
| `temperature: 0.8` | TRL OpenEnv Sudoku | Encourages diverse completions during rollout |
| No quantization | A100 80GB has enough VRAM for 8B at full precision | Full precision avoids quantization noise and simplifies checkpointing |
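The link between group size and signal quality is easiest to see in how GRPO forms its advantages: each completion's reward is normalized against the mean and standard deviation of its own group, so a larger group gives a steadier baseline. A toy numeric sketch:

```python
# Group-relative advantage as used by GRPO: each completion's reward is
# normalized against the other completions sampled for the same prompt.
from statistics import mean, stdev

group_rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]  # toy rewards for one prompt

mu, sigma = mean(group_rewards), stdev(group_rewards)
advantages = [(r - mu) / (sigma + 1e-4) for r in group_rewards]
# Completions above the group mean get positive advantage and are reinforced.
# With more samples per group (e.g. 16), mu and sigma are better estimates,
# so the advantage signal is less noisy.
print(advantages)
```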
Per-Space Command
Replace ENTITY with: us, israel, iran, hezbollah, gulf, oversight
```bash
python -m trenches_env.training_cli \
  --model-id Qwen/Qwen3-8B \
  --training-agent ENTITY \
  --replay-id ENTITY_synthetic_seed_2025_2026 \
  --output-dir checkpoints/ENTITY-qwen3-8b \
  --generation-backend transformers \
  --training-stage stage_1_dense \
  --max-steps 100 \
  --train-size 256 \
  --num-generations 16 \
  --per-device-train-batch-size 1 \
  --gradient-accumulation-steps 8 \
  --learning-rate 5e-6 \
  --max-prompt-length 1536 \
  --max-completion-length 256 \
  --preview-samples 3
```
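For reference, a small launcher sketch that fills in the ENTITY template. Each HF Space runs one entity in parallel, so this sequential loop is only the local equivalent; the flag names are exactly those above, but the launcher script itself is not part of the CLI.

```python
# Convenience sketch: run the per-entity training command for each entity.
import subprocess

ENTITIES = ["us", "israel", "iran", "hezbollah", "gulf", "oversight"]

for entity in ENTITIES:
    subprocess.run(
        [
            "python", "-m", "trenches_env.training_cli",
            "--model-id", "Qwen/Qwen3-8B",
            "--training-agent", entity,
            "--replay-id", f"{entity}_synthetic_seed_2025_2026",
            "--output-dir", f"checkpoints/{entity}-qwen3-8b",
            "--generation-backend", "transformers",
            "--training-stage", "stage_1_dense",
            "--max-steps", "100",
            "--train-size", "256",
            "--num-generations", "16",
            "--per-device-train-batch-size", "1",
            "--gradient-accumulation-steps", "8",
            "--learning-rate", "5e-6",
            "--max-prompt-length", "1536",
            "--max-completion-length", "256",
            "--preview-samples", "3",
        ],
        check=True,
    )
```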
HuggingFace Hub Output
shlawgathon/trenches-us-qwen3-8b
shlawgathon/trenches-israel-qwen3-8b
shlawgathon/trenches-iran-qwen3-8b
shlawgathon/trenches-hezbollah-qwen3-8b
shlawgathon/trenches-gulf-qwen3-8b
shlawgathon/trenches-oversight-qwen3-8b
Each checkpoint contains: config.json, model.safetensors, tokenizer.json, generation_config.json, training_args.bin
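Any of the pushed checkpoints should load with stock transformers. A minimal inference sketch, assuming the repos are public or a valid HF token is configured:

```python
# Sketch: loading one of the pushed entity checkpoints for inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "shlawgathon/trenches-us-qwen3-8b"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize your current posture."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.8, top_k=10)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```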
Build Steps
- Create 5 replay datasets (israel, iran, hezbollah, gulf, oversight): done (synthetic seed data in synthetic_historical_replays/)
- Add --quantize-4bit to training_cli.py (NF4 via bitsandbytes; see the sketch below): done
- Add beta, warmup_steps, temperature, top_k, save_strategy CLI args: done
- Add bitsandbytes>=0.43.0 to pyproject.toml: done
- Smoke test locally with tiny-gpt2: done (US + Israel pass)
- Smoke test on HF T4 GPU: done (trenches-training-smoke)
- Spin up 6 HF A100 Spaces, 1 hour: done
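For the --quantize-4bit build step, an NF4 load via bitsandbytes typically looks like the sketch below; how training_cli.py actually wires it in is an assumption, and the A100 runs above skip this path entirely.

```python
# Sketch of an NF4 (4-bit) load via bitsandbytes, as referenced by the
# --quantize-4bit flag; the surrounding CLI plumbing is assumed, not shown.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```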