alex17cmbs committed on
Commit
0a2af01
·
verified ·
1 Parent(s): 76a37cc

Update README.md

Files changed (1)
  1. README.md +10 -16
README.md CHANGED
@@ -5,12 +5,9 @@ colorFrom: gray
5
  colorTo: yellow
6
  sdk: static
7
  pinned: false
8
- thumbnail: >-
9
- https://cdn-uploads.huggingface.co/production/uploads/65b9463ac9aa94b1ca85414b/1lQ_nGrGT4hurdTRk_Yv_.jpeg
10
  ---
11
 
12
  <div style="max-width:980px;margin:0 auto;padding:24px 16px;font:16px/1.55 system-ui,-apple-system,Segoe UI,Roboto;">
13
-
14
  <header style="display:flex;gap:16px;align-items:center;margin-bottom:12px;">
15
  <img src="./assets/logo.png" alt="Foaster.ai" width="42" height="42" style="border-radius:8px">
16
  <div>
@@ -19,12 +16,14 @@ thumbnail: >-
19
  </div>
20
  </header>
21
 
22
- <p><strong>Foaster.ai</strong> is a French start-up building agentic systems and running research to understand how LLMs behave in social settings. At <em>Foaster Labs</em>, our first project—<strong>Werewolf Benchmark</strong>—pits leading models against each other in live social-deduction games to measure leadership, bluffing, and resistance to manipulation.</p>
23
 
24
- <p>We ran <strong>210 full games</strong> with <strong>7 top models</strong> to produce a <em>role-conditioned</em> Elo: wolves ≈ manipulation power, villagers ≈ manipulation resistance. GPT-5 currently sits alone at the top—contenders welcome. 🐺</p>
 
 
25
 
26
- <div style="margin:22px 0 8px;font-weight:700;">Role-conditioned Elo (quick view)</div>
27
- <p style="margin:0 0 12px;color:#64748b">Overall Elo plus per-role splits (ELO-W = wolf, ELO-V = villager). For the full interactive, <a href="./index.html">open the Space</a>.</p>
28
 
29
  <div style="border:1px solid #e5e7eb;border-radius:12px;overflow:hidden">
30
  <table style="width:100%;border-collapse:collapse;font-size:14px">
@@ -40,15 +39,10 @@ thumbnail: >-
40
  </tr>
41
  </thead>
42
  <tbody>
43
- <tr><td style="padding:10px">🥇</td><td style="padding:10px">GPT-5 (OpenAI)</td><td style="padding:10px;text-align:center">1492</td><td style="padding:10px;text-align:center">1508</td><td style="padding:10px;text-align:center">1476</td><td style="padding:10px;text-align:center">96.7%</td><td style="padding:10px;text-align:center">60</td></tr>
44
- <tr><td style="padding:10px">🥈</td><td style="padding:10px">Gemini 2.5 Pro (Google)</td><td style="padding:10px;text-align:center">1261</td><td style="padding:10px;text-align:center">1163</td><td style="padding:10px;text-align:center">1360</td><td style="padding:10px;text-align:center">63.3%</td><td style="padding:10px;text-align:center">60</td></tr>
45
- <tr><td style="padding:10px">🥉</td><td style="padding:10px">Gemini 2.5 Flash (Google)</td><td style="padding:10px;text-align:center">1188</td><td style="padding:10px;text-align:center">1103</td><td style="padding:10px;text-align:center">1273</td><td style="padding:10px;text-align:center">51.7%</td><td style="padding:10px;text-align:center">60</td></tr>
46
- <tr><td style="padding:10px">#4</td><td style="padding:10px">Qwen3-235B-Instruct (Alibaba)</td><td style="padding:10px;text-align:center">1176</td><td style="padding:10px;text-align:center">1077</td><td style="padding:10px;text-align:center">1274</td><td style="padding:10px;text-align:center">45.0%</td><td style="padding:10px;text-align:center">60</td></tr>
47
- <tr><td style="padding:10px">#5</td><td style="padding:10px">GPT-5-mini (OpenAI)</td><td style="padding:10px;text-align:center">1173</td><td style="padding:10px;text-align:center">1107</td><td style="padding:10px;text-align:center">1239</td><td style="padding:10px;text-align:center">41.7%</td><td style="padding:10px;text-align:center">60</td></tr>
48
- <tr><td style="padding:10px">#6</td><td style="padding:10px">Kimi-K2-Instruct (Moonshot AI)</td><td style="padding:10px;text-align:center">1130</td><td style="padding:10px;text-align:center">1168</td><td style="padding:10px;text-align:center">1091</td><td style="padding:10px;text-align:center">36.7%</td><td style="padding:10px;text-align:center">60</td></tr>
49
- <tr><td style="padding:10px">#7</td><td style="padding:10px">GPT-OSS-120B (OpenAI)</td><td style="padding:10px;text-align:center">980</td><td style="padding:10px;text-align:center">931</td><td style="padding:10px;text-align:center">1030</td><td style="padding:10px;text-align:center">15.0%</td><td style="padding:10px;text-align:center">60</td></tr>
50
  </tbody>
51
  </table>
52
  </div>
53
-
54
- </div>
 
5
  colorTo: yellow
6
  sdk: static
7
  pinned: false
 
 
8
  ---
9
 
10
  <div style="max-width:980px;margin:0 auto;padding:24px 16px;font:16px/1.55 system-ui,-apple-system,Segoe UI,Roboto;">
 
11
  <header style="display:flex;gap:16px;align-items:center;margin-bottom:12px;">
12
  <img src="./assets/logo.png" alt="Foaster.ai" width="42" height="42" style="border-radius:8px">
13
  <div>
 
16
  </div>
17
  </header>
18
 
19
+ <p><strong>Foaster.ai</strong> is a French start-up focused on the agentic era. At <em>Foaster Labs</em>, our Werewolf Benchmark studies how LLMs behave under social pressure—leadership, bluffing, and resistance to manipulation.</p>
20
 
21
+ <div style="display:flex;align-items:center;gap:10px;margin:14px 0 22px;">
22
+ <a href="https://huggingface.co/spaces/Foaster-ai/werewolf-leaderboard" style="padding:10px 14px;border:1px solid #e5e7eb;border-radius:10px;text-decoration:none;">🔗 Full leaderboard →</a>
23
+ </div>
24
 
25
+ <h3 style="margin:0 0 8px;">Results — Podium (role-conditioned Elo)</h3>
26
+ <p style="margin:0 0 10px;color:#64748b">ELO-W = wolf (manipulation power) · ELO-V = villager (manipulation resistance)</p>
27
 
28
  <div style="border:1px solid #e5e7eb;border-radius:12px;overflow:hidden">
29
  <table style="width:100%;border-collapse:collapse;font-size:14px">
 
39
  </tr>
40
  </thead>
41
  <tbody>
42
+ <tr><td style="padding:10px">🥇 #1</td><td style="padding:10px">GPT-5 (OpenAI)</td><td style="padding:10px;text-align:center">1492</td><td style="padding:10px;text-align:center">1508</td><td style="padding:10px;text-align:center">1476</td><td style="padding:10px;text-align:center">96.7%</td><td style="padding:10px;text-align:center">60</td></tr>
43
+ <tr><td style="padding:10px">🥈 #2</td><td style="padding:10px">Gemini 2.5 Pro (Google)</td><td style="padding:10px;text-align:center">1261</td><td style="padding:10px;text-align:center">1163</td><td style="padding:10px;text-align:center">1360</td><td style="padding:10px;text-align:center">63.3%</td><td style="padding:10px;text-align:center">60</td></tr>
44
+ <tr><td style="padding:10px">🥉 #3</td><td style="padding:10px">Gemini 2.5 Flash (Google)</td><td style="padding:10px;text-align:center">1188</td><td style="padding:10px;text-align:center">1103</td><td style="padding:10px;text-align:center">1273</td><td style="padding:10px;text-align:center">51.7%</td><td style="padding:10px;text-align:center">60</td></tr>
 
 
 
 
45
  </tbody>
46
  </table>
47
  </div>
48
+ </div>