<!doctype html>
<html><head><meta charset='utf-8'><title>Training results</title>
<style>
body { font-family: -apple-system, BlinkMacSystemFont, sans-serif; max-width: 1000px; margin: 40px auto; line-height: 1.45; padding: 0 20px; }
.card { background: #f6f8fa; border-radius: 10px; padding: 18px; margin: 16px 0; }
img { max-width: 100%; border: 1px solid #ddd; border-radius: 8px; }
table { border-collapse: collapse; width: 100%; }
td, th { border-bottom: 1px solid #ddd; padding: 8px; text-align: left; }
</style></head><body>
<h1>Training results</h1>
<div class='card'>
<b>Live backend:</b> real Freeciv Web on H100<br>
<b>Model:</b> Qwen/Qwen3.5-0.8B + Unsloth LoRA + TRL GRPO<br>
<b>Run:</b> 10 steps, 32 live states, batch size 8<br>
<b>Train runtime:</b> not recorded
</div>
<div class='card'>
<b>Observed reward improvement:</b> 0.125 at step 1 → 1.000 at step 10<br>
<b>Best step:</b> step 10, reward 1.000 (reward std 0.000)
</div>
<h2>Reward curve</h2>
<p><img src='reward_curve.png' alt='reward curve'></p>
<h2>Start vs end</h2>
<p><img src='before_after_reward.png' alt='reward at start vs end of training'></p>
<h2>Per-step reward</h2>
<table>
<tr><th>step</th><th>reward</th><th>reward std</th></tr>
<tr><td>1</td><td>0.125</td><td>0.250</td></tr>
<tr><td>2</td><td>0.375</td><td>0.539</td></tr>
<tr><td>3</td><td>0.250</td><td>0.500</td></tr>
<tr><td>4</td><td>0.500</td><td>0.577</td></tr>
<tr><td>5</td><td>0.625</td><td>0.539</td></tr>
<tr><td>6</td><td>0.875</td><td>0.250</td></tr>
<tr><td>7</td><td>0.750</td><td>0.500</td></tr>
<tr><td>8</td><td>0.875</td><td>0.250</td></tr>
<tr><td>9</td><td>0.750</td><td>0.500</td></tr>
<tr><td>10</td><td>1.000</td><td>0.000</td></tr>
</table>
</body></html>