kabalan Claude Opus 4.6 committed on
Commit
20b7748
·
1 Parent(s): 64d69d7

Add OpenEnv server implementation and Python packaging


Refactors the classical optimization loop into an OpenEnv-compatible environment with FastAPI server, Docker support, and standardized action/observation spaces. Adds skill_files_baseline/ as committed minimal starting point. Updates README with server setup, Docker instructions, and API documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

.env.example ADDED
@@ -0,0 +1,30 @@
1
+ # Slide Skill OpenEnv — Environment Variables
2
+ #
3
+ # Copy this file to .env and fill in the values.
4
+ # Never commit .env to version control.
5
+
6
+ # ---------------------------------------------------------------------------
7
+ # Required
8
+ # ---------------------------------------------------------------------------
9
+
10
+ # Google Gemini API key — used by all three LLM roles:
11
+ # Generator: Gemini 3 Flash (writes pptxgenjs JavaScript)
12
+ # Evaluator: Gemini 3.1 Pro (scores the slide with vision)
13
+ # Optimizer: Gemini 3.1 Pro (rewrites DESIGN_RULES.md)
14
+ # Get your key at: https://aistudio.google.com/app/apikey
15
+ GEMINI_API_KEY=your_gemini_api_key_here
16
+
17
+ # ---------------------------------------------------------------------------
18
+ # Optional — override defaults
19
+ # ---------------------------------------------------------------------------
20
+
21
+ # Maximum number of optimization steps per episode (default: 7).
22
+ # Each step takes ~60-120s. At 7 steps, a full episode runs ~7-14 minutes.
23
+ # Reduce for faster iteration during development; increase for deeper optimization.
24
+ # SLIDE_SKILL_MAX_STEPS=7
25
+
26
+ # ---------------------------------------------------------------------------
27
+ # HuggingFace Spaces (set these as Space secrets, not in .env)
28
+ # ---------------------------------------------------------------------------
29
+ # When deploying to HF Spaces, add GEMINI_API_KEY as a repository secret
30
+ # via the Space settings UI. Do not hardcode it in the Dockerfile or source.
.gitignore CHANGED
@@ -1,2 +1,9 @@
1
  node_modules/
2
  .DS_Store
1
  node_modules/
2
  .DS_Store
3
+ .env
4
+ __pycache__/
5
+ *.pyc
6
+ .mypy_cache/
7
+ .ruff_cache/
8
+ dist/
9
+ *.egg-info/
README.MD CHANGED
@@ -26,29 +26,30 @@ A fixed task is used across all rounds so improvements are solely from skill opt
26
 
27
  > Generate a 1-slide timeline PowerPoint about Dutch Hydrogen Strategy (2020-2035) in McKinsey & Company consulting style.
28
 
29
- ## Skill Folder (What Gets Optimized)
30
 
31
- ```
32
- skill_vN/
33
- ├── DESIGN_RULES.md # Colors, fonts, spacing rules
34
- └── EXAMPLES.md # Good/bad patterns (grows over rounds)
35
- ```
36
 
37
- Starts minimal. Grows smarter each round as evaluation feedback accumulates.
38
 
39
- ## Results
 
41
  Ran 5 rounds (v0 through v4). Final score: **89/100**.
42
 
43
- | Dimension | Score (/15) |
44
- |-----------|-------------|
45
- | Background & Layout | 14 |
46
- | Color Palette | 14 |
47
- | Typography | 13 |
48
- | Title Quality | 15 |
49
- | Data Presentation | 12 |
50
- | Structural Elements | 13 |
51
- | Overall Impression | 8 (/10) |
52
 
53
  **Verdict:** A highly professional slide that closely mirrors McKinsey's visual language with an insight-driven title, restrained color palette, and proper structural elements.
54
 
@@ -57,39 +58,151 @@ Ran 5 rounds (v0 through v4). Final score: **89/100**.
57
  ```
58
  Skill-Forge/
59
  ├── README.MD
60
- ├── package.json
61
- ├── pptx/ # PPTX skill (executor instructions)
62
  │ ├── SKILL.md
63
  │ ├── pptxgenjs.md
64
  │ ├── editing.md
65
- │ └── scripts/ # Office utilities (unpack, validate, thumbnail, etc.)
66
- ├── output/
67
- ├── TASK_PROMPT.md # Fixed task used every round
68
- │ ├── reference/ # Gold-standard reference slide images
69
- ── skill_v0/ .. skill_v5/ # Skill versions (evolving instructions)
70
- ├── generate_v0.js .. v4.js # Generated pptxgenjs scripts
71
- ├── slide_v0.pptx .. v4.pptx # Generated slides
72
- │ ├── slide_v0.pdf .. v4.pdf # PDF conversions
73
- │ ├── slide_v0-1.jpg .. v4-1.jpg # Rendered slide images
74
- │ ├── evaluator.py # Evaluation script
75
- ── evaluation_results.json # Score progression
76
- ── estudio_base_...pdf # Reference document for visual style
77
  ```
78
 
79
  ## Prerequisites
80
 
 
81
  - Node.js
82
  - Python 3
83
  - LibreOffice (`soffice`) for PDF conversion
84
  - Poppler (`pdftoppm`) for PDF-to-image conversion
85
 
 
 
 
86
  ## Setup
87
 
88
  ```bash
 
89
  npm install
90
- pip install "markitdown[pptx]" Pillow
91
  ```
92
 
93
  ## License
94
 
95
  ISC
 
26
 
27
  > Generate a 1-slide timeline PowerPoint about Dutch Hydrogen Strategy (2020-2035) in McKinsey & Company consulting style.
28
 
29
+ ## What Gets Optimized
30
 
31
+ There are two distinct layers of "skill files":
32
 
33
+ | Layer | Location | Purpose | Optimized? |
34
+ |-------|----------|---------|------------|
35
+ | Generic pptx tooling | `pptx/` | Teaches Claude how to use pptxgenjs (API reference, shapes, coordinates) | **No** — stable Anthropic skill |
36
+ | Brand style guidelines | `skill_vN/` or `skill_files_baseline/` | McKinsey-specific colors, typography, structural elements | **Yes** — evolves each round |
37
 
38
+ The optimizer rewrites `DESIGN_RULES.md` and `EXAMPLES.md` each round. The `pptx/` skill files are never touched.
39
+
40
+ ## Results (Classical Loop)
41
 
42
  Ran 5 rounds (v0 through v4). Final score: **89/100**.
43
 
44
+ | Dimension | Score |
45
+ |-----------|-------|
46
+ | Background & Layout | 14/15 |
47
+ | Color Palette | 14/15 |
48
+ | Typography | 13/15 |
49
+ | Title Quality | 15/15 |
50
+ | Data Presentation | 12/15 |
51
+ | Structural Elements | 13/15 |
52
+ | Overall Impression | 8/10 |
53
 
54
  **Verdict:** A highly professional slide that closely mirrors McKinsey's visual language with an insight-driven title, restrained color palette, and proper structural elements.
55
 
 
58
  ```
59
  Skill-Forge/
60
  ├── README.MD
61
+ ├── package.json # pptxgenjs ^4.0.1
62
+ ├── pyproject.toml # Python package (OpenEnv server)
63
+ ├── .env.example # Environment variable reference
64
+
65
+ ├── pptx/ # Generic pptx skill (DO NOT MODIFY)
66
  │ ├── SKILL.md
67
  │ ├── pptxgenjs.md
68
  │ ├── editing.md
69
+ │ └── scripts/ # Office utilities (unpack, validate, thumbnail)
70
+
71
+ ├── skill_files_baseline/ # Committed minimal baseline (skill_v0 content)
72
+ │ ├── DESIGN_RULES.md # Starting style rules (teal palette, basic typography)
73
│ └── EXAMPLES.md # Empty (no prior rounds)
74
+
75
+ ├── openenv/ # OpenEnv environment (new)
76
+ │ ├── app.py # FastAPI server (POST /reset, /step, DELETE /sessions)
77
+ │ ├── client.py # Reference client + LLM optimizer loop
78
+ │ ├── models.py # Pydantic models: actions, observation, state
79
│ ├── slide_skill_environment.py # Core environment logic (reset, step, close)
80
+ │ ├── skill_manager.py # Applies EditSection / ReplaceFile actions
81
+ │ ├── slide_generator.py # LLM → JS → Node → LibreOffice → JPG pipeline
82
+ │ ├── evaluator_adapter.py # Gemini 3.1 Pro vision evaluator (reusable class)
83
+ │ ├── openenv.yaml # OpenEnv manifest
84
+ │ └── Dockerfile # Node.js + LibreOffice + poppler + Python
85
+
86
+ └── output/
87
+ ├── TASK_PROMPT.md # Fixed task used every round
88
+ ├── reference/ # Gold-standard McKinsey reference images (JPGs)
89
+ ├── skill_v0/ .. skill_v5/ # Historical skill versions
90
+ ├── generate_v0.js .. v5.js # Historical generated JS scripts
91
+ ├── slide_v0.pptx .. v5.pptx # Historical generated slides
92
+ ├── evaluator.py # Original standalone evaluator script
93
+ └── evaluation_results.json # Score progression
94
  ```
95
 
96
  ## Prerequisites
97
 
98
+ ### Classical loop (manual)
99
  - Node.js
100
  - Python 3
101
  - LibreOffice (`soffice`) for PDF conversion
102
  - Poppler (`pdftoppm`) for PDF-to-image conversion
103
 
104
+ ### OpenEnv server
105
+ All of the above, plus Python 3.12+ and the packages in `pyproject.toml`.
106
+
107
  ## Setup
108
 
109
  ```bash
110
+ # Node dependencies (pptxgenjs)
111
  npm install
112
+
113
+ # Python dependencies
114
+ pip install -e ".[server]"
115
+
116
+ # Environment variables
117
+ cp .env.example .env
118
+ # Edit .env and set GEMINI_API_KEY
119
+ ```
120
+
121
+ ## Running the OpenEnv Server
122
+
123
+ ```bash
124
+ cd openenv
125
+ uvicorn app:app --host 0.0.0.0 --port 8000 --workers 1
126
+ ```
127
+
128
+ Then run the reference client (full optimization loop):
129
+
130
+ ```bash
131
+ python openenv/client.py --server http://localhost:8000 --max-steps 7
132
  ```
133
 
134
+ Or a smoke test (single step):
135
+
136
+ ```bash
137
+ python openenv/client.py --server http://localhost:8000 --smoke-test
138
+ ```
139
+
140
+ ## Docker
141
+
142
+ ```bash
143
+ # Build
144
+ docker build -f openenv/Dockerfile -t slide-skill-openenv .
145
+
146
+ # Run
147
+ docker run -p 8000:8000 -e GEMINI_API_KEY=$GEMINI_API_KEY slide-skill-openenv
148
+ ```
149
+
150
+ > **Note:** The Docker image is ~600-700 MB due to LibreOffice (~500 MB). LibreOffice is required for `.pptx → .pdf` conversion and has no lighter alternative that faithfully renders pptxgenjs output.
151
+
152
+ ## OpenEnv Action Space
153
+
154
+ The agent can submit two types of actions each step:
155
+
156
+ **`replace_file`** — Rewrite an entire skill file (matches how the historical optimizer works):
157
+ ```json
158
+ {
159
+ "action_type": "replace_file",
160
+ "file": "DESIGN_RULES.md",
161
+ "new_content": "# Design Rules\n\n## Color Palette\n- Navy (#0C2340)..."
162
+ }
163
+ ```
164
+
165
+ **`edit_section`** — Surgically update one markdown section:
166
+ ```json
167
+ {
168
+ "action_type": "edit_section",
169
+ "file": "DESIGN_RULES.md",
170
+ "section_heading": "Color Palette",
171
+ "new_body": "- Navy (#0C2340): primary\n- White: background\n"
172
+ }
173
+ ```
174
+
175
+ ## Observation Space
176
+
177
+ Each step returns:
178
+
179
+ | Field | Type | Description |
180
+ |-------|------|-------------|
181
+ | `scores.background_layout` | int 0–15 | White bg, margins, layout |
182
+ | `scores.color_palette` | int 0–15 | Navy/white/grey restraint |
183
+ | `scores.typography` | int 0–15 | Font hierarchy, serif title |
184
+ | `scores.title_quality` | int 0–15 | "So-what" insight title |
185
+ | `scores.data_presentation` | int 0–15 | Structured table format |
186
+ | `scores.structural_elements` | int 0–15 | Divider line, footer, footnotes |
187
+ | `scores.overall_impression` | int 0–10 | Holistic McKinsey feel |
188
+ | `total` | int 0–100 | Sum of all scores |
189
+ | `strengths` | list[str] | What the slide does well |
190
+ | `weaknesses` | list[str] | What to improve |
191
+ | `one_line_verdict` | str | Evaluator summary |
192
+ | `reward` | float –0.3…+0.3 | Capped score delta / 100 |
193
+ | `done` | bool | True when max_steps reached |
194
+ | `design_rules_content` | str | Current DESIGN_RULES.md |
195
+ | `examples_content` | str | Current EXAMPLES.md |
196
+
197
+ ## Environment Variables
198
+
199
+ See `.env.example` for the full reference.
200
+
201
+ | Variable | Required | Default | Description |
202
+ |----------|----------|---------|-------------|
203
+ | `GEMINI_API_KEY` | Yes | — | Gemini API key — generator (Flash), evaluator + optimizer (Pro) |
204
+ | `SLIDE_SKILL_MAX_STEPS` | No | `7` | Steps per episode (~60-120s each) |
205
+
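Reading these variables at startup might look like the following sketch (the variable names come from the table; `load_config` is a hypothetical helper, not the server's actual code):

```python
import os


def load_config() -> tuple[str, int]:
    """Read required and optional environment variables, applying defaults."""
    api_key = os.environ.get("GEMINI_API_KEY", "")
    if not api_key:
        raise RuntimeError("GEMINI_API_KEY is required; see .env.example")
    max_steps = int(os.environ.get("SLIDE_SKILL_MAX_STEPS", "7"))
    return api_key, max_steps
```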
206
  ## License
207
 
208
  ISC
agent_docs/openenv_migration_plan_v2.md ADDED
@@ -0,0 +1,1728 @@
1
+ # OpenEnv Migration Plan v2 — Skill Forge → OpenEnv Environment
2
+
3
+ **Date**: 2026-03-07
4
+ **Status**: Implementation-ready
5
+ **Target**: HuggingFace Spaces (OpenEnv-compatible)
6
+
7
+ ---
8
+
9
+ ## 1. Overview
10
+
11
+ Skill Forge is a self-improving PowerPoint generation loop that, starting from a minimal brand-style baseline, iteratively improves a McKinsey-style slide by evolving two skill files. The loop reached 89/100 in 5 iterations.
12
+
13
+ **What is being optimized**: Two brand/task-specific files — `DESIGN_RULES.md` and `EXAMPLES.md` — that guide an LLM's pptxgenjs code generation. These files encode McKinsey visual design rules (color palette, typography, structural elements) and accumulated example guidance.
14
+
15
+ **What is NOT being optimized**: The generic pptx tooling skill in `pptx/` (SKILL.md, editing.md, pptxgenjs.md). These files define how the agent-as-executor uses pptxgenjs and remain unchanged across all optimization rounds.
16
+
17
+ **What OpenEnv adds**: A standardized environment interface so that any RL/optimization agent can drive the Skill Forge loop without knowing its internals. The environment exposes `reset()`, `step(action)`, and `observe()` via an HTTP server (FastAPI, in this implementation) following the OpenEnv protocol.
18
+
19
+ **Full generation pipeline per step**:
20
+
21
+ ```
22
+ Agent issues action (edit skill files)
23
+
24
+ skill_manager.py applies edit to isolated session directory
25
+
26
+ slide_generator.py: LLM reads DESIGN_RULES.md + EXAMPLES.md + TASK_PROMPT.md
27
+ → writes JavaScript (pptxgenjs)
28
+
29
+ node generate.js → slide.pptx
30
+
31
+ soffice --headless --convert-to pdf slide.pptx
32
+
33
+ pdftoppm -r 150 slide.pdf slide → slide-1.jpg
34
+
35
+ evaluator.py: Claude Opus 4.6 + vision → scores JSON
36
+
37
+ Observation returned to agent
38
+ ```
39
+
40
+ Each step takes approximately 60–120 seconds (two LLM API calls + Node.js + LibreOffice). At `max_steps=10` an episode runs 10–20 minutes. For HuggingFace Spaces with resource constraints, **5–7 steps per episode is more realistic**.
41
+
42
+ ---
43
+
44
+ ## 2. Conceptual Clarification
45
+
46
+ Understanding which files are "the skill" is critical. There are two distinct layers:
47
+
48
+ ### Layer 1 — Generic pptx Agent Tooling (`pptx/`)
49
+
50
+ These files live in `pptx/` and are maintained by Anthropic. They teach the LLM agent *how to use pptxgenjs as a tool* — the API, shape types, coordinate systems, etc. They are analogous to a standard library: stable, versioned independently, and not task-specific.
51
+
52
+ ```
53
+ pptx/
54
+ ├── SKILL.md # pptxgenjs capability overview and agent instructions
55
+ ├── editing.md # Shape editing primitives and patterns
56
+ └── pptxgenjs.md # Full pptxgenjs API reference
57
+ ```
58
+
59
+ **These files are read by the agent-as-executor (the slide generator LLM). They are NEVER the target of optimization.**
60
+
61
+ ### Layer 2 — Evolving Brand Style Files (the "skill" being optimized)
62
+
63
+ These files live in `skill_v{N}/` and encode McKinsey-specific visual design knowledge:
64
+
65
+ ```
66
+ skill_v0/
67
+ ├── DESIGN_RULES.md # Color palette, typography, layout coords, structural elements
68
+ └── EXAMPLES.md # Accumulated guidance from prior optimization rounds
69
+ ```
70
+
71
+ The optimizer LLM reads `DESIGN_RULES.md + EXAMPLES.md + evaluation feedback` and rewrites or edits these files to produce `skill_v{N+1}/`. The agent environment manages this evolution loop.
72
+
73
+ **Key invariant**: `DESIGN_RULES.md` and `EXAMPLES.md` are the only files the optimizer modifies. The pptx/ tooling files are read-only context for the generator.
74
+
75
+ ### The Baseline
76
+
77
+ The baseline is `skill_v0/` — minimal initial style guidelines with an empty EXAMPLES.md. It must be committed to the repo as `skill_files_baseline/` and represents the true starting point, not any evolved version. On environment `reset()`, the session's skill files are restored to this baseline.
78
+
79
+ ---
80
+
81
+ ## 3. Project Structure
82
+
83
+ ```
84
+ pptx-skillforge-hackathon/
85
+ ├── package.json # pptxgenjs ^4.0.1 dependency
86
+ ├── pyproject.toml # Python package definition
87
+
88
+ ├── pptx/ # Generic pptx agent tooling — DO NOT MODIFY
89
+ │ ├── SKILL.md
90
+ │ ├── editing.md
91
+ │ └── pptxgenjs.md
92
+
93
+ ├── skill_files_baseline/ # Committed minimal baseline (skill_v0 content)
94
+ │ ├── DESIGN_RULES.md # Minimal McKinsey rules, no teal/wrong colors
95
+ │ └── EXAMPLES.md # Empty: "(Empty — no prior optimization rounds)"
96
+
97
+ ├── output/
98
+ │ ├── TASK_PROMPT.md # Fixed task (Dutch Hydrogen Strategy)
99
+ │ ├── evaluator.py # Original standalone evaluator (unchanged)
100
+ │ ├── reference/
101
+ │ │ ├── ref-01.jpg # Cover page reference
102
+ │ │ ├── ref-02.jpg # Content page reference
103
+ │ │ ├── ref-03.jpg # Data/chart page reference
104
+ │ │ ├── ref-04.jpg # Data/chart page reference
105
+ │ │ └── ref-05.jpg # Content page reference
106
+ │ ├── skill_v0/ … skill_v5/ # Historical optimization rounds
107
+ │ ├── generate_v0.js … v5.js # Historical generated JS files
108
+ │ └── slide_v0.pptx … v5.pptx + JPGs
109
+
110
+ └── openenv/ # OpenEnv environment package
111
+ ├── app.py # FastAPI server entry point
112
+ ├── client.py # Reference client implementation
113
+ ├── openenv.yaml # OpenEnv manifest
114
+ ├── Dockerfile
115
+ ├── models.py # Pydantic data models
116
+ ├── slide_skill_environment.py # Core environment logic
117
+ ├── skill_manager.py # Skill file I/O + apply actions
118
+ ├── slide_generator.py # Full pipeline: LLM → JS → .pptx → JPG
119
+ └── evaluator_adapter.py # Adapter wrapping output/evaluator.py logic
120
+ ```
121
+
122
+ ---
123
+
124
+ ## 4. Data Models
125
+
126
+ `openenv/models.py`
127
+
128
+ ```python
129
+ """
130
+ Pydantic data models for the Slide Skill OpenEnv environment.
131
+
132
+ Action space:
133
+ SlideSkillAction is a discriminated union of two action types:
134
+ - EditSectionAction: Replace a named section's body in one skill file.
135
+ - ReplaceFileAction: Replace the entire content of one skill file.
136
+
137
+ EditSectionAction is appropriate when the agent wants surgical edits
138
+ (e.g., update only the typography section). ReplaceFileAction is used
139
+ when the optimizer rewrites the file wholesale, which is what the
140
+ historical optimizer LLM actually does.
141
+
142
+ Observation space:
143
+ SlideSkillObservation contains the full evaluator output including all
144
+ seven score dimensions plus qualitative feedback fields.
145
+ """
146
+
147
+ from __future__ import annotations
148
+
149
+ from typing import Annotated, Literal
150
+ from pydantic import BaseModel, Field
151
+
152
+
153
+ # ---------------------------------------------------------------------------
154
+ # Actions
155
+ # ---------------------------------------------------------------------------
156
+
157
+ SkillFile = Literal["DESIGN_RULES.md", "EXAMPLES.md"]
158
+ """The two skill files the optimizer is allowed to modify."""
159
+
160
+
161
+ class EditSectionAction(BaseModel):
162
+ """
163
+ Replace the body of a named markdown section within a skill file.
164
+
165
+ The section is identified by its heading text (without the leading #
166
+ characters). The replacement spans from immediately after the heading
167
+ line to (but not including) the next heading of equal or higher level.
168
+
169
+ Example:
170
+ action = EditSectionAction(
171
+ file="DESIGN_RULES.md",
172
+ section_heading="Color Palette",
173
+ new_body="- Navy (#0C2340): primary\\n- White: background\\n"
174
+ )
175
+ """
176
+
177
+ action_type: Literal["edit_section"] = "edit_section"
178
+ file: SkillFile = Field(..., description="Which skill file to edit.")
179
+ section_heading: str = Field(
180
+ ...,
181
+ description=(
182
+ "Exact heading text (without leading # markers). "
183
+ "Case-sensitive. Must match a heading in the file."
184
+ ),
185
+ )
186
+ new_body: str = Field(
187
+ ...,
188
+ description="New markdown content for the section body (after the heading line).",
189
+ )
190
+
191
+
192
+ class ReplaceFileAction(BaseModel):
193
+ """
194
+ Replace the entire content of a skill file.
195
+
196
+ Use this when the optimizer rewrites the file wholesale rather than
197
+ making targeted section edits. This is the mode used by the historical
198
+ optimizer LLM in the Skill Forge loop.
199
+ """
200
+
201
+ action_type: Literal["replace_file"] = "replace_file"
202
+ file: SkillFile = Field(..., description="Which skill file to replace.")
203
+ new_content: str = Field(
204
+ ...,
205
+ description="Complete new file content (valid markdown).",
206
+ )
207
+
208
+
209
+ # Discriminated union — action_type is the discriminator field.
210
+ SlideSkillAction = Annotated[
211
+ EditSectionAction | ReplaceFileAction,
212
+ Field(discriminator="action_type"),
213
+ ]
214
+
215
+
216
+ # ---------------------------------------------------------------------------
217
+ # Scores
218
+ # ---------------------------------------------------------------------------
219
+
220
+ class SlideScores(BaseModel):
221
+ """Raw scores from the McKinsey evaluator. Each dimension is 0–15 except
222
+ overall_impression which is 0–10. Total is 0–100."""
223
+
224
+ background_layout: int = Field(..., ge=0, le=15)
225
+ color_palette: int = Field(..., ge=0, le=15)
226
+ typography: int = Field(..., ge=0, le=15)
227
+ title_quality: int = Field(..., ge=0, le=15)
228
+ data_presentation: int = Field(..., ge=0, le=15)
229
+ structural_elements: int = Field(..., ge=0, le=15)
230
+ overall_impression: int = Field(..., ge=0, le=10)
231
+
232
+ @property
233
+ def total(self) -> int:
234
+ return (
235
+ self.background_layout
236
+ + self.color_palette
237
+ + self.typography
238
+ + self.title_quality
239
+ + self.data_presentation
240
+ + self.structural_elements
241
+ + self.overall_impression
242
+ )
243
+
244
+
245
+ # ---------------------------------------------------------------------------
246
+ # Observation
247
+ # ---------------------------------------------------------------------------
248
+
249
+ class SlideSkillObservation(BaseModel):
250
+ """
251
+ Observation returned to the agent after each step.
252
+
253
+ Contains the full evaluator output so the optimizer LLM has all the
254
+ information it needs to write the next skill revision: numeric scores,
255
+ qualitative strengths/weaknesses, and the one-line verdict.
256
+ """
257
+
258
+ scores: SlideScores
259
+ total: int = Field(..., description="Sum of all score dimensions (0–100).")
260
+ strengths: list[str] = Field(
261
+ default_factory=list,
262
+ description="List of what the slide does well, from the evaluator.",
263
+ )
264
+ weaknesses: list[str] = Field(
265
+ default_factory=list,
266
+ description="List of what needs improvement, from the evaluator.",
267
+ )
268
+ one_line_verdict: str = Field(
269
+ ..., description="Single-sentence summary from the evaluator."
270
+ )
271
+ reward: float = Field(
272
+ ...,
273
+ description=(
274
+ "Score delta vs. previous step, divided by 100 and "
275
+ "capped to [-0.3, +0.3]. "
276
+ "Capping reduces reward noise from LLM evaluation variance."
277
+ ),
278
+ )
279
+ step: int = Field(..., description="Current step index (0-based).")
280
+ done: bool = Field(..., description="True if max_steps reached.")
281
+ # Paths are strings for JSON serialization
282
+ jpg_path: str = Field(
283
+ ..., description="Absolute path to the generated slide JPG."
284
+ )
285
+ design_rules_content: str = Field(
286
+ ...,
287
+ description="Current DESIGN_RULES.md content (after action was applied).",
288
+ )
289
+ examples_content: str = Field(
290
+ ...,
291
+ description="Current EXAMPLES.md content (after action was applied).",
292
+ )
293
+
294
+
295
+ # ---------------------------------------------------------------------------
296
+ # State (internal, not exposed to client)
297
+ # ---------------------------------------------------------------------------
298
+
299
+ class SlideSkillState(BaseModel):
300
+ """Internal environment state. Not serialized to the client."""
301
+
302
+ session_id: str
303
+ step: int = 0
304
+ prev_total: int = 0 # score from the previous step (for reward calculation)
305
+ session_dir: str = Field(
306
+ ...,
307
+ description=(
308
+ "Absolute path to the isolated session directory under /tmp/. "
309
+ "Contains copies of DESIGN_RULES.md and EXAMPLES.md that this "
310
+ "session is free to modify without affecting other sessions."
311
+ ),
312
+ )
313
+ ```
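The `EditSectionAction` semantics above (replace everything after the heading line, up to but not including the next heading of equal or higher level) could be implemented roughly as follows. This is a sketch of what `skill_manager.py` would do, not its actual code:

```python
import re


def edit_section(markdown: str, heading: str, new_body: str) -> str:
    """Replace the body of the section whose heading text equals `heading`.

    The heading is matched case-sensitively against the text after the
    leading '#' markers. Raises KeyError if no such heading exists.
    """
    lines = markdown.splitlines(keepends=True)
    for i, line in enumerate(lines):
        m = re.match(r"^(#+)\s+(.*?)\s*$", line)
        if m and m.group(2) == heading:
            level = len(m.group(1))
            # Scan forward to the next heading of equal or higher level.
            j = i + 1
            while j < len(lines):
                m2 = re.match(r"^(#+)\s", lines[j])
                if m2 and len(m2.group(1)) <= level:
                    break
                j += 1
            return "".join(lines[: i + 1]) + new_body + "".join(lines[j:])
    raise KeyError(f"Heading not found: {heading!r}")
```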
314
+
315
+ ---
316
+
317
+ ## 5. Environment Logic
318
+
319
+ `openenv/slide_skill_environment.py`
320
+
321
+ ```python
322
+ """
323
+ Slide Skill Environment — OpenEnv-compatible environment for optimizing
324
+ McKinsey-style PowerPoint slide generation.
325
+
326
+ Concurrency model:
327
+ SUPPORTS_CONCURRENT_SESSIONS = True
328
+
329
+ Each session gets an isolated working directory at /tmp/slide_skill_{session_id}/.
330
+ Skill files (DESIGN_RULES.md, EXAMPLES.md) are copied there on reset() and
331
+ modified in place during the session. The shared repo files are never modified.
332
+ This means multiple sessions can run simultaneously without file conflicts.
333
+
334
+ The only shared resource is the Gemini API key, which is rate-limited
335
+ per-account. For HuggingFace Spaces, running 2-3 concurrent sessions is
336
+ realistic before hitting rate limits.
337
+
338
+ Episode timing:
339
+ Each step involves two LLM calls (generator + evaluator) plus Node.js and
340
+ LibreOffice. Expect 60–120 seconds per step. At max_steps=7, a full episode
341
+ runs 7–14 minutes.
342
+
343
+ Reward function:
344
+ reward = clip(total_score - prev_total_score, -30, +30) / 100
345
+ Capping at ±30 points (±0.3 reward) dampens LLM evaluation noise. A score
346
+ can fluctuate ±5–10 points between identical slides due to evaluator variance,
347
+ so capping prevents large undeserved penalties or bonuses.
348
+ """
349
+
350
+ from __future__ import annotations
351
+
352
+ import shutil
353
+ import uuid
354
+ from pathlib import Path
355
+ from typing import ClassVar
356
+
357
+ from models import (
358
+ SlideSkillAction,
359
+ SlideSkillObservation,
360
+ SlideSkillState,
361
+ SlideScores,
362
+ )
363
+ from skill_manager import SkillManager
364
+ from slide_generator import SlideGenerator
365
+ from evaluator_adapter import EvaluatorAdapter
366
+
367
+
368
+ # Paths relative to repo root — adjust if the package moves.
369
REPO_ROOT = Path(__file__).parent.parent
BASELINE_DIR = REPO_ROOT / "skill_files_baseline"
TASK_PROMPT_PATH = REPO_ROOT / "output" / "TASK_PROMPT.md"
REFERENCE_DIR = REPO_ROOT / "output" / "reference"

# Reward capping parameters
REWARD_CLIP_POINTS = 30   # clip score delta to ±30 before normalizing
REWARD_SCALE = 100.0      # divide clipped delta by this to get [-0.3, +0.3]

MAX_STEPS = 7


class SlideSkillEnvironment:
    """OpenEnv environment for the Skill Forge optimization loop."""

    SUPPORTS_CONCURRENT_SESSIONS: ClassVar[bool] = True

    def __init__(self) -> None:
        self._sessions: dict[str, SlideSkillState] = {}
        self._generator = SlideGenerator(
            task_prompt_path=TASK_PROMPT_PATH,
            pptx_skill_dir=REPO_ROOT / "pptx",
            reference_dir=REFERENCE_DIR,
        )
        self._evaluator = EvaluatorAdapter(reference_dir=REFERENCE_DIR)

    # ------------------------------------------------------------------
    # Public OpenEnv interface
    # ------------------------------------------------------------------

    def reset(self, session_id: str | None = None) -> str:
        """
        Initialize or reinitialize a session.

        Creates an isolated working directory under /tmp/ and copies the
        baseline skill files into it. Returns the session_id.
        """
        session_id = session_id or str(uuid.uuid4())

        session_dir = Path(f"/tmp/slide_skill_{session_id}")
        if session_dir.exists():
            shutil.rmtree(session_dir)
        session_dir.mkdir(parents=True)

        # Copy baseline skill files into the session directory.
        for fname in ("DESIGN_RULES.md", "EXAMPLES.md"):
            src = BASELINE_DIR / fname
            if not src.exists():
                raise FileNotFoundError(
                    f"Baseline file missing: {src}. "
                    "Commit skill_files_baseline/ to the repo."
                )
            shutil.copy2(src, session_dir / fname)

        self._sessions[session_id] = SlideSkillState(
            session_id=session_id,
            step=0,
            prev_total=0,
            session_dir=str(session_dir),
        )
        return session_id

    def step(self, session_id: str, action: SlideSkillAction) -> SlideSkillObservation:
        """
        Apply an action, run the generation pipeline, evaluate, and return
        an observation.

        Args:
            session_id: Must be a live session (call reset() first).
            action: Either EditSectionAction or ReplaceFileAction.

        Returns:
            SlideSkillObservation with scores, feedback, reward, and file contents.

        Raises:
            KeyError: If session_id is not found.
            RuntimeError: If the generation or evaluation pipeline fails.
        """
        state = self._sessions[session_id]
        session_dir = Path(state.session_dir)

        # 1. Apply the action to the session's skill files.
        manager = SkillManager(session_dir)
        manager.apply(action)

        # 2. Run the full generation pipeline.
        jpg_path = self._generator.generate(
            session_id=session_id,
            session_dir=session_dir,
        )

        # 3. Evaluate the generated slide.
        eval_result = self._evaluator.evaluate(jpg_path)

        # 4. Compute reward (capped score delta).
        delta = eval_result["total"] - state.prev_total
        clipped_delta = max(-REWARD_CLIP_POINTS, min(REWARD_CLIP_POINTS, delta))
        reward = clipped_delta / REWARD_SCALE

        # 5. Update state.
        state.step += 1
        state.prev_total = eval_result["total"]
        done = state.step >= MAX_STEPS

        # 6. Read back current file contents for the observation.
        design_rules = (session_dir / "DESIGN_RULES.md").read_text()
        examples = (session_dir / "EXAMPLES.md").read_text()

        scores = SlideScores(**eval_result["scores"])

        return SlideSkillObservation(
            scores=scores,
            total=eval_result["total"],
            strengths=eval_result.get("strengths", []),
            weaknesses=eval_result.get("weaknesses", []),
            one_line_verdict=eval_result["one_line_verdict"],
            reward=reward,
            step=state.step,
            done=done,
            jpg_path=str(jpg_path),
            design_rules_content=design_rules,
            examples_content=examples,
        )

    def close(self, session_id: str) -> None:
        """
        Clean up session resources. Deletes the /tmp/ session directory.

        Raises:
            KeyError: If session_id is not found.
        """
        state = self._sessions.pop(session_id)  # raises KeyError if unknown
        session_dir = Path(state.session_dir)
        if session_dir.exists():
            shutil.rmtree(session_dir)
```
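Seen in isolation, the reward shaping in `step()` is just a clip-and-scale of the evaluator's score delta. A minimal standalone sketch, reusing the constants above:

```python
# Standalone sketch of the reward computed in step() above: the per-step
# reward is the evaluator score delta, clipped to ±REWARD_CLIP_POINTS and
# divided by REWARD_SCALE, so it always lands in [-0.3, +0.3].
REWARD_CLIP_POINTS = 30
REWARD_SCALE = 100.0


def compute_reward(total: int, prev_total: int) -> float:
    delta = total - prev_total
    clipped = max(-REWARD_CLIP_POINTS, min(REWARD_CLIP_POINTS, delta))
    return clipped / REWARD_SCALE


print(compute_reward(72, 55))  # modest improvement: 0.17
print(compute_reward(95, 40))  # a large jump is capped at 0.3
print(compute_reward(50, 68))  # regression: -0.18
```

The cap keeps one lucky evaluation from dominating an episode's return, which matters because the vision evaluator's scores carry some run-to-run noise.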

---

## 6. Supporting Modules

### 6a. Skill Manager

`openenv/skill_manager.py`

```python
"""
Skill file manager — applies actions to an isolated session directory.

Operates exclusively on files within session_dir (a /tmp/ path).
Never touches the repo's baseline or any shared files.

Section editing rules:
    A "section" is a markdown heading of any level (# to ######).
    EditSectionAction finds the first heading whose text matches
    section_heading (case-sensitive, stripped), then replaces everything
    from the line after that heading up to (but not including) the next
    heading of equal or higher level (i.e., same or fewer # characters).
    If no next heading is found, the replacement extends to end-of-file.
"""

from __future__ import annotations

import re
from pathlib import Path

from models import EditSectionAction, ReplaceFileAction, SlideSkillAction


class SkillManager:
    """Manages DESIGN_RULES.md and EXAMPLES.md within a session directory."""

    def __init__(self, session_dir: Path) -> None:
        self.session_dir = session_dir

    def apply(self, action: SlideSkillAction) -> None:
        """
        Dispatch to the appropriate handler based on action type.

        Raises:
            ValueError: If action_type is unrecognized or section not found.
            FileNotFoundError: If the target skill file does not exist.
        """
        target = self.session_dir / action.file
        if not target.exists():
            raise FileNotFoundError(f"Skill file not found in session: {target}")

        if action.action_type == "replace_file":
            self._replace_file(target, action)
        elif action.action_type == "edit_section":
            self._edit_section(target, action)
        else:
            raise ValueError(f"Unknown action_type: {action.action_type!r}")

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    @staticmethod
    def _replace_file(target: Path, action: ReplaceFileAction) -> None:
        """Overwrite the entire file with new_content."""
        target.write_text(action.new_content, encoding="utf-8")

    @staticmethod
    def _edit_section(target: Path, action: EditSectionAction) -> None:
        """Replace the body of a named markdown section."""
        text = target.read_text(encoding="utf-8")
        lines = text.splitlines(keepends=True)

        # Find the heading line.
        heading_pattern = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
        heading_idx: int | None = None
        heading_level: int = 0

        for i, line in enumerate(lines):
            m = heading_pattern.match(line.rstrip("\n\r"))
            if m and m.group(2) == action.section_heading:
                heading_idx = i
                heading_level = len(m.group(1))
                break

        if heading_idx is None:
            raise ValueError(
                f"Section heading {action.section_heading!r} not found in {target.name}."
            )

        # Find where the section body ends (next heading of equal or higher level).
        end_idx = len(lines)
        for i in range(heading_idx + 1, len(lines)):
            m = heading_pattern.match(lines[i].rstrip("\n\r"))
            if m and len(m.group(1)) <= heading_level:
                end_idx = i
                break

        # Reconstruct the file.
        new_body = action.new_body
        if new_body and not new_body.endswith("\n"):
            new_body += "\n"

        new_lines = (
            lines[: heading_idx + 1]  # heading itself
            + [new_body]
            + lines[end_idx:]         # rest of file after the section
        )
        target.write_text("".join(new_lines), encoding="utf-8")

    def read_file(self, filename: str) -> str:
        """Read a skill file from the session directory."""
        return (self.session_dir / filename).read_text(encoding="utf-8")
```
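The section-boundary rule is the subtle part of `_edit_section`: a lower-level subheading inside the section is replaced along with the body, while the next heading of equal or higher level is preserved. A self-contained sketch of the same matching logic, without the `models` dependency or file I/O:

```python
import re

# Same heading pattern as SkillManager._edit_section.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def edit_section(text: str, heading: str, new_body: str) -> str:
    """Replace the body under `heading` up to the next equal/higher heading."""
    lines = text.splitlines(keepends=True)
    start, level = None, 0
    for i, line in enumerate(lines):
        m = HEADING.match(line.rstrip("\n\r"))
        if m and m.group(2) == heading:
            start, level = i, len(m.group(1))
            break
    if start is None:
        raise ValueError(f"heading {heading!r} not found")
    end = len(lines)
    for i in range(start + 1, len(lines)):
        m = HEADING.match(lines[i].rstrip("\n\r"))
        if m and len(m.group(1)) <= level:
            end = i
            break
    if new_body and not new_body.endswith("\n"):
        new_body += "\n"
    return "".join(lines[: start + 1] + [new_body] + lines[end:])


doc = "# Colors\nold\n## Accent\nred\n# Fonts\nserif\n"
print(edit_section(doc, "Colors", "navy only"))
```

Note that the `## Accent` subsection is absorbed into the replacement because it sits below the edited `# Colors` heading; the body only stops at `# Fonts`, the next level-1 heading.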

### 6b. Slide Generator

`openenv/slide_generator.py`

```python
"""
Slide Generator — orchestrates the full PPT generation pipeline.

Pipeline (in order):
    1. LLM reads DESIGN_RULES.md + EXAMPLES.md + TASK_PROMPT.md + pptx/ tooling
       → writes pptxgenjs JavaScript to generate.js in the session output dir.
    2. `node generate.js` runs in the session output dir → produces slide.pptx.
    3. `soffice --headless --convert-to pdf slide.pptx` → slide.pdf.
    4. `pdftoppm -r 150 slide.pdf slide` → slide-1.jpg (page 1).
    5. Returns the Path to slide-1.jpg.

The generator LLM receives the pptx/ tooling files as context so it knows
the full pptxgenjs API — but those files are read-only; they are never
written to or returned in the observation.

Session isolation:
    All generated artifacts (generate.js, slide.pptx, slide.pdf, slide-1.jpg)
    are written into a subdirectory of session_dir so that concurrent sessions
    do not share output paths.
"""

from __future__ import annotations

import subprocess
import textwrap
from pathlib import Path

import anthropic


# The generator uses a capable coding model. Claude Sonnet is a good balance
# between quality and speed/cost for code generation.
GENERATOR_MODEL = "claude-sonnet-4-6"
GENERATOR_MAX_TOKENS = 4096


class SlideGenerator:
    """Drives the LLM → Node.js → LibreOffice → pdftoppm pipeline."""

    def __init__(
        self,
        task_prompt_path: Path,
        pptx_skill_dir: Path,
        reference_dir: Path,
    ) -> None:
        self.task_prompt = task_prompt_path.read_text(encoding="utf-8")
        self.pptx_skill_dir = pptx_skill_dir
        self.reference_dir = reference_dir
        self._client = anthropic.Anthropic()

    def generate(self, session_id: str, session_dir: Path) -> Path:
        """
        Run the full pipeline for one optimization step.

        Args:
            session_id: Used only for logging/naming.
            session_dir: Isolated directory containing the session's
                DESIGN_RULES.md and EXAMPLES.md.

        Returns:
            Absolute path to the generated slide JPG (slide-1.jpg).

        Raises:
            RuntimeError: If any pipeline stage (LLM, Node, LibreOffice,
                pdftoppm) fails.
        """
        out_dir = session_dir / "output"
        out_dir.mkdir(exist_ok=True)

        js_path = out_dir / "generate.js"
        pptx_path = out_dir / "slide.pptx"
        jpg_stem = out_dir / "slide"
        jpg_path = out_dir / "slide-1.jpg"

        # Stage 1: LLM generates pptxgenjs JavaScript.
        js_code = self._call_generator_llm(session_dir)
        js_path.write_text(js_code, encoding="utf-8")

        # Stage 2: Node.js executes the JS to produce the .pptx file.
        self._run(
            ["node", str(js_path)],
            cwd=out_dir,
            stage="node generate.js",
        )
        if not pptx_path.exists():
            raise RuntimeError(
                f"node generate.js completed but {pptx_path} was not created."
            )

        # Stage 3: LibreOffice converts .pptx → .pdf.
        self._run(
            [
                "soffice",
                "--headless",
                "--convert-to", "pdf",
                "--outdir", str(out_dir),
                str(pptx_path),
            ],
            cwd=out_dir,
            stage="soffice --convert-to pdf",
        )
        pdf_path = out_dir / "slide.pdf"
        if not pdf_path.exists():
            raise RuntimeError(
                f"LibreOffice completed but {pdf_path} was not created."
            )

        # Stage 4: pdftoppm converts PDF page 1 → JPG at 150 DPI.
        # Output: slide-1.jpg (pdftoppm appends "-{page}" automatically).
        self._run(
            [
                "pdftoppm",
                "-r", "150",
                "-jpeg",
                "-f", "1", "-l", "1",  # only page 1
                str(pdf_path),
                str(jpg_stem),
            ],
            cwd=out_dir,
            stage="pdftoppm",
        )
        if not jpg_path.exists():
            raise RuntimeError(
                f"pdftoppm completed but {jpg_path} was not created."
            )

        return jpg_path

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    def _call_generator_llm(self, session_dir: Path) -> str:
        """
        Call the generator LLM with skill files + task prompt as context.

        Returns the raw JavaScript code string (without markdown fences).
        """
        design_rules = (session_dir / "DESIGN_RULES.md").read_text(encoding="utf-8")
        examples = (session_dir / "EXAMPLES.md").read_text(encoding="utf-8")

        # Load the generic pptx tooling files as executor context.
        pptx_skill = self._read_pptx_skill()

        system_prompt = textwrap.dedent("""\
            You are an expert pptxgenjs developer. You will write a complete,
            runnable Node.js script that generates a PowerPoint slide using
            the pptxgenjs library.

            Rules:
            - Output ONLY the JavaScript code. No markdown fences, no explanation.
            - The script must save the file as "slide.pptx" in the current directory.
            - Follow the DESIGN_RULES.md and EXAMPLES.md exactly.
            - Use the pptxgenjs API reference below for correct method calls.
            """)

        user_message = textwrap.dedent(f"""\
            ## pptxgenjs API Reference
            {pptx_skill}

            ## Brand Style: DESIGN_RULES.md
            {design_rules}

            ## Brand Style: EXAMPLES.md
            {examples}

            ## Task
            {self.task_prompt}

            Write the complete pptxgenjs JavaScript file now.
            """)

        response = self._client.messages.create(
            model=GENERATOR_MODEL,
            max_tokens=GENERATOR_MAX_TOKENS,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}],
        )

        code = response.content[0].text.strip()

        # Strip markdown code fences if the LLM added them despite instructions.
        if code.startswith("```"):
            code = code.split("\n", 1)[1]
        if code.endswith("```"):
            code = code.rsplit("```", 1)[0]
        code = code.strip()

        return code

    def _read_pptx_skill(self) -> str:
        """Concatenate the generic pptx skill files for LLM context."""
        parts = []
        for fname in ("SKILL.md", "editing.md", "pptxgenjs.md"):
            p = self.pptx_skill_dir / fname
            if p.exists():
                parts.append(f"### {fname}\n{p.read_text(encoding='utf-8')}")
        return "\n\n".join(parts)

    @staticmethod
    def _run(cmd: list[str], cwd: Path, stage: str) -> None:
        """Run a subprocess; raise RuntimeError with context if it fails."""
        result = subprocess.run(
            cmd,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=300,  # 5 min hard limit per stage
        )
        if result.returncode != 0:
            raise RuntimeError(
                f"Pipeline stage '{stage}' failed (exit {result.returncode}).\n"
                f"stdout: {result.stdout[-2000:]}\n"
                f"stderr: {result.stderr[-2000:]}"
            )
```
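The fence-stripping fallback at the end of `_call_generator_llm` is easy to get subtly wrong, so it is worth exercising on its own. This standalone sketch repeats the same logic as a plain function:

```python
def strip_fences(code: str) -> str:
    """Remove a leading ```lang line and a trailing ``` if an LLM added them,
    mirroring the fallback in _call_generator_llm."""
    if code.startswith("```"):
        code = code.split("\n", 1)[1]    # drop the opening fence line (with language tag)
    if code.endswith("```"):
        code = code.rsplit("```", 1)[0]  # drop the closing fence
    return code.strip()


fenced = "```javascript\nconst x = 1;\n```"
print(strip_fences(fenced))           # const x = 1;
print(strip_fences("const y = 2;"))   # unfenced input passes through unchanged
```

Treating the fences as optional on both ends means the function is safe to apply unconditionally, whether or not the model obeyed the "no fences" instruction.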

### 6c. Evaluator Adapter

`openenv/evaluator_adapter.py`

```python
"""
Evaluator Adapter — wraps the existing output/evaluator.py logic as a
reusable module with a clean interface.

This module does NOT import output/evaluator.py (which has a __main__ guard
and hardcoded paths). Instead, it re-implements the core evaluate_slide()
logic with:
    - Configurable reference image paths
    - A return type that includes all seven score keys, strengths, weaknesses,
      and one_line_verdict
    - No file I/O side effects (no evaluation_results.json written)

The evaluation prompt is identical to output/evaluator.py so scores are
comparable across the historical runs and the OpenEnv loop.
"""

from __future__ import annotations

import base64
import json
from pathlib import Path

import anthropic


# Must match output/evaluator.py exactly so historical scores are comparable.
EVALUATION_SYSTEM_PROMPT = """You are an expert McKinsey & Company slide design evaluator.

You will be shown:
1. REFERENCE IMAGES: 5 pages from a real McKinsey & Company consulting deck (Chilean Hydrogen Pathway, December 2020). These represent the gold standard for visual style.
2. CANDIDATE SLIDE: A programmatically generated PowerPoint slide about Dutch Hydrogen Strategy, rendered as a JPEG image.

Your job: Score how closely the CANDIDATE SLIDE matches the McKinsey visual style shown in the REFERENCE IMAGES.

## Scoring Rubric (100 points total)

### 1. Background & Base Layout (0-15 points)
- McKinsey content/data slides use WHITE backgrounds (dark navy is ONLY for section dividers/covers)
- Clean margins (~0.5" all sides)
- No unnecessary visual clutter
- 15: White bg, clean margins, professional spacing
- 10: White bg but spacing issues
- 5: Wrong background color or major layout problems
- 0: Completely wrong base (e.g., dark bg for data slide)

### 2. Color Palette Fidelity (0-15 points)
- McKinsey uses a RESTRAINED palette: navy/dark blue (#0C2340-ish), white, light greys
- Accent colors are used SPARINGLY — typically just 1-2 accent colors max
- NO rainbow effects, no bright multi-color schemes
- Crimson/red used only for thin divider lines, not large elements
- 15: Matches McKinsey's restrained navy/white/grey palette perfectly
- 10: Mostly correct but 1-2 color choices off
- 5: Too many colors or wrong color family
- 0: Completely different color scheme

### 3. Typography (0-15 points)
- Title: Large, bold, black or very dark, left-aligned (Georgia or similar serif for titles)
- Body: Clean sans-serif (Calibri-like), smaller, grey or dark grey
- Clear size hierarchy: title >> subtitle >> body >> footnotes
- No decorative fonts
- 15: Perfect type hierarchy matching McKinsey
- 10: Good hierarchy but font choices slightly off
- 5: Weak hierarchy or wrong fonts
- 0: No clear hierarchy

### 4. Title Quality — "So-What" Style (0-15 points)
- McKinsey titles state a CONCLUSION or INSIGHT, not just a topic
- GOOD: "The Netherlands aims to become Europe's green hydrogen hub, scaling from 500 MW to 3-4 GW by 2030"
- BAD: "Dutch Hydrogen Strategy (2020-2035)" or "Roadmap Overview"
- The title should tell you the key takeaway without reading the slide
- 15: Clear insight-driven conclusion title
- 10: Partial insight (has some specifics but reads more like a topic)
- 5: Pure topic label
- 0: Generic or missing title

### 5. Data Presentation (0-15 points)
- McKinsey uses structured TABLES for data (not floating stat callouts)
- Tables have: navy header borders (top + bottom of header row), light grey row dividers, bold left column labels
- Data should be organized, scannable, center-aligned values
- Key columns/years may be subtly highlighted
- 15: Clean structured table matching McKinsey format
- 10: Has data but format doesn't match McKinsey tables
- 5: Data present but poorly structured (floating callouts, inconsistent format)
- 0: No supporting data

### 6. Structural Elements (0-15 points)
- Thin crimson/red divider line below title area (not touching title — separated by whitespace)
- McKinsey footer: thin rule line + source text (left) + "McKinsey & Company" bold (right) + page number
- Numbered footnotes for data disclaimers
- Source attribution line
- 15: All structural elements present and correctly placed
- 10: Most elements present, minor placement issues
- 5: Missing 2+ structural elements
- 0: No McKinsey structural elements

### 7. Overall Visual Impression (0-10 points)
- Does this FEEL like it came from McKinsey?
- Would a consulting professional find this polished and credible?
- Is it clean, restrained, and authoritative — or busy, colorful, and amateur?
- 10: Indistinguishable from real McKinsey output
- 7: Close but a trained eye spots differences
- 4: Clearly generated/templated but has some McKinsey DNA
- 1: Does not resemble McKinsey at all

## Output Format

Return ONLY a JSON object with this exact structure (no markdown, no code fences):
{
  "scores": {
    "background_layout": <0-15>,
    "color_palette": <0-15>,
    "typography": <0-15>,
    "title_quality": <0-15>,
    "data_presentation": <0-15>,
    "structural_elements": <0-15>,
    "overall_impression": <0-10>
  },
  "total": <sum of all scores, 0-100>,
  "strengths": ["<strength 1>", "<strength 2>", ...],
  "weaknesses": ["<weakness 1>", "<weakness 2>", ...],
  "one_line_verdict": "<one sentence summary>"
}
"""

EVALUATOR_MODEL = "claude-opus-4-6"


def _encode_image(path: Path) -> dict:
    """Encode an image file to base64 for the Anthropic messages API."""
    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    suffix = path.suffix.lower()
    media_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": data,
        },
    }


class EvaluatorAdapter:
    """
    Adapter that evaluates a generated slide JPG against McKinsey references.

    Uses the same Claude Opus 4.6 + vision approach as output/evaluator.py,
    but as a reusable class rather than a script with side effects.
    """

    REFERENCE_FILENAMES = [
        "ref-01.jpg",
        "ref-02.jpg",
        "ref-03.jpg",
        "ref-04.jpg",
        "ref-05.jpg",
    ]

    def __init__(self, reference_dir: Path) -> None:
        """
        Args:
            reference_dir: Directory containing ref-01.jpg through ref-05.jpg.
        """
        self.reference_dir = reference_dir
        self._client = anthropic.Anthropic()

        # Validate reference images exist at construction time.
        missing = [
            f for f in self.REFERENCE_FILENAMES
            if not (reference_dir / f).exists()
        ]
        if missing:
            raise FileNotFoundError(
                f"Missing reference images in {reference_dir}: {missing}"
            )

    def evaluate(self, slide_jpg_path: Path) -> dict:
        """
        Evaluate a generated slide against the McKinsey reference images.

        Args:
            slide_jpg_path: Absolute path to the slide JPG to evaluate.

        Returns:
            dict with keys:
                "scores": dict mapping the 7 dimension names to int scores
                "total": int, sum of all scores (0-100)
                "strengths": list[str]
                "weaknesses": list[str]
                "one_line_verdict": str

        Raises:
            FileNotFoundError: If slide_jpg_path does not exist.
            json.JSONDecodeError: If the LLM returns malformed JSON.
            ValueError: If the response is missing required score keys.
            RuntimeError: If the API call fails.
        """
        if not slide_jpg_path.exists():
            raise FileNotFoundError(f"Slide JPG not found: {slide_jpg_path}")

        content: list[dict] = []

        # Reference images first.
        content.append({
            "type": "text",
            "text": (
                "## REFERENCE IMAGES (Real McKinsey deck)\n"
                "The following 5 images are from a real McKinsey & Company consulting "
                "report. Study their visual style carefully."
            ),
        })
        for i, fname in enumerate(self.REFERENCE_FILENAMES, 1):
            ref_path = self.reference_dir / fname
            content.append(_encode_image(ref_path))
            content.append({"type": "text", "text": f"(Reference page {i})"})

        # Candidate slide.
        content.append({
            "type": "text",
            "text": (
                f"\n## CANDIDATE SLIDE TO EVALUATE\n"
                f"This is the generated slide: {slide_jpg_path.name}"
            ),
        })
        content.append(_encode_image(slide_jpg_path))
        content.append({
            "type": "text",
            "text": (
                "\nNow score this candidate slide against the McKinsey reference "
                "using the rubric. Return ONLY the JSON object."
            ),
        })

        response = self._client.messages.create(
            model=EVALUATOR_MODEL,
            max_tokens=1024,
            system=EVALUATION_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": content}],
        )

        text = response.content[0].text.strip()

        # Strip markdown code fences if present (LLMs sometimes add them
        # despite explicit instructions not to).
        if text.startswith("```"):
            text = text.split("\n", 1)[1].rsplit("```", 1)[0].strip()

        result = json.loads(text)

        # Validate required keys are present.
        required_score_keys = {
            "background_layout", "color_palette", "typography",
            "title_quality", "data_presentation", "structural_elements",
            "overall_impression",
        }
        missing_keys = required_score_keys - set(result.get("scores", {}).keys())
        if missing_keys:
            raise ValueError(
                f"Evaluator response missing score keys: {missing_keys}. "
                f"Full response: {text[:500]}"
            )

        return result
```
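The post-processing at the end of `evaluate()` (fence strip, JSON parse, required-key check) can be smoke-tested without any API call by feeding it a canned response. A standalone sketch mirroring that logic:

```python
import json

# Same required-key set as EvaluatorAdapter.evaluate().
REQUIRED = {
    "background_layout", "color_palette", "typography", "title_quality",
    "data_presentation", "structural_elements", "overall_impression",
}


def parse_eval(text: str) -> dict:
    """Mirror the response post-processing in EvaluatorAdapter.evaluate()."""
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    result = json.loads(text)
    missing = REQUIRED - set(result.get("scores", {}))
    if missing:
        raise ValueError(f"missing score keys: {missing}")
    return result


raw = (
    '{"scores": {"background_layout": 12, "color_palette": 10, "typography": 11,'
    ' "title_quality": 13, "data_presentation": 9, "structural_elements": 12,'
    ' "overall_impression": 7}, "total": 74, "strengths": [], "weaknesses": [],'
    ' "one_line_verdict": "Close."}'
)
print(parse_eval(raw)["total"])  # 74
```

Failing fast on missing keys here is what lets the server surface a clean 500 instead of a confusing `KeyError` deep inside reward computation.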

---

## 7. Server Entry Point

`openenv/app.py`

```python
"""
FastAPI server for the Slide Skill OpenEnv environment.

Endpoints follow the OpenEnv HTTP protocol:
    POST   /reset                  → initialize or restart a session
    POST   /step                   → apply an action and return observation
    DELETE /sessions/{session_id}  → clean up a session

The server is stateful: environment instances are kept in memory.
For production deployments with multiple workers, use a single-worker
Uvicorn setup or externalize session state to Redis.
"""

from __future__ import annotations

from contextlib import asynccontextmanager
from typing import Annotated

import uvicorn
from fastapi import Body, FastAPI, HTTPException, Path
from pydantic import BaseModel

from models import SlideSkillAction, SlideSkillObservation
from slide_skill_environment import SlideSkillEnvironment


# Single shared environment instance. Sessions are isolated at the file
# level, so this is safe for concurrent requests.
_env: SlideSkillEnvironment | None = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    global _env
    _env = SlideSkillEnvironment()
    yield
    _env = None


app = FastAPI(
    title="Slide Skill OpenEnv",
    description=(
        "OpenEnv-compatible environment for optimizing McKinsey-style "
        "PowerPoint slides by evolving DESIGN_RULES.md and EXAMPLES.md."
    ),
    lifespan=lifespan,
)


class ResetRequest(BaseModel):
    session_id: str | None = None


class ResetResponse(BaseModel):
    session_id: str
    message: str


class StepRequest(BaseModel):
    session_id: str
    action: SlideSkillAction


@app.post("/reset", response_model=ResetResponse)
async def reset(request: ResetRequest = Body(default=ResetRequest())) -> ResetResponse:
    """Initialize or restart an optimization session."""
    assert _env is not None
    session_id = _env.reset(session_id=request.session_id)
    return ResetResponse(
        session_id=session_id,
        message=f"Session {session_id} initialized with baseline skill files.",
    )


@app.post("/step", response_model=SlideSkillObservation)
async def step(request: StepRequest) -> SlideSkillObservation:
    """Apply an action to the session and return the resulting observation."""
    assert _env is not None
    try:
        observation = _env.step(
            session_id=request.session_id,
            action=request.action,
        )
    except KeyError:
        raise HTTPException(
            status_code=404,
            detail=f"Session {request.session_id!r} not found. Call /reset first.",
        )
    except (RuntimeError, ValueError) as exc:
        raise HTTPException(status_code=500, detail=str(exc))
    return observation


@app.delete("/sessions/{session_id}")
async def close_session(
    session_id: Annotated[str, Path(description="Session ID to clean up.")]
) -> dict:
    """Clean up session resources (deletes /tmp/ working directory)."""
    assert _env is not None
    try:
        _env.close(session_id)
    except KeyError:
        raise HTTPException(
            status_code=404,
            detail=f"Session {session_id!r} not found.",
        )
    return {"message": f"Session {session_id} closed."}


@app.get("/health")
async def health() -> dict:
    return {"status": "ok", "supports_concurrent_sessions": True}


if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, workers=1)
```
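For orientation, here is a sketch of the JSON bodies these endpoints accept. The `action` fields follow `EditSectionAction` as consumed by `SkillManager` (`action_type`, `file`, `section_heading`, `new_body`); the authoritative Pydantic schemas live in `models.py`, which is not shown in this section, so treat the field list as illustrative:

```python
import json

# Wire-format sketch of the OpenEnv HTTP protocol above.
# POST /reset body (session_id may be omitted to get a fresh UUID):
reset_body = {"session_id": None}

# POST /step body, using an edit_section action:
step_body = {
    "session_id": "demo-session",
    "action": {
        "action_type": "edit_section",
        "file": "DESIGN_RULES.md",
        "section_heading": "Color Palette",
        "new_body": "Use navy #0C2340, white, and light grey only.",
    },
}

print(json.dumps(step_body, indent=2))
```

A `replace_file` action swaps `section_heading`/`new_body` for a single `new_content` field carrying the whole file.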

---

## 8. Client

`openenv/client.py`

```python
"""
Reference client for the Slide Skill OpenEnv server.

Demonstrates how an optimizer agent would interact with the environment:
    1. Reset to get a session ID.
    2. Read the initial skill file contents from the first observation.
    3. Call an LLM optimizer to generate an improved DESIGN_RULES.md.
    4. Submit as a ReplaceFileAction.
    5. Repeat until done=True.

This client is also useful for smoke-testing the server without a full agent.
"""

from __future__ import annotations

import json
import textwrap
from pathlib import Path
from typing import Any

import anthropic
import httpx

from models import SlideSkillObservation

SERVER_URL = "http://localhost:8000"
OPTIMIZER_MODEL = "claude-opus-4-6"


class SlideSkillClient:
    """HTTP client for the Slide Skill OpenEnv server."""

    def __init__(self, base_url: str = SERVER_URL) -> None:
        self.base_url = base_url.rstrip("/")
        self._http = httpx.Client(timeout=300.0)  # long timeout for pipeline stages

    def reset(self, session_id: str | None = None) -> str:
        """Start a new session. Returns the session_id."""
        payload: dict[str, Any] = {}
        if session_id:
            payload["session_id"] = session_id
        resp = self._http.post(f"{self.base_url}/reset", json=payload)
        resp.raise_for_status()
        return resp.json()["session_id"]

    def step(self, session_id: str, action: dict) -> SlideSkillObservation:
        """
        Apply an action and return the observation.

        Args:
            session_id: Active session ID.
            action: Dict matching EditSectionAction or ReplaceFileAction schema.
                Must include "action_type" key.
        """
        payload = {"session_id": session_id, "action": action}
        resp = self._http.post(f"{self.base_url}/step", json=payload)
        resp.raise_for_status()
        return SlideSkillObservation.model_validate(resp.json())

    def close(self, session_id: str) -> None:
        """Clean up the session."""
        resp = self._http.delete(f"{self.base_url}/sessions/{session_id}")
        resp.raise_for_status()

    def __enter__(self) -> SlideSkillClient:
        return self

    def __exit__(self, *_: Any) -> None:
        self._http.close()


# ---------------------------------------------------------------------------
# Optimizer agent (reference implementation)
# ---------------------------------------------------------------------------

def call_optimizer_llm(
    obs: SlideSkillObservation,
    anthropic_client: anthropic.Anthropic,
) -> dict:
    """
    Call the optimizer LLM to generate a new DESIGN_RULES.md based on
    the evaluation feedback.

    Returns a dict suitable for the step() action parameter.
    This uses ReplaceFileAction since the historical optimizer rewrites
    the file wholesale.
    """
    prompt = textwrap.dedent(f"""\
        You are a McKinsey slide design optimizer. You are improving a
        PowerPoint generation skill by rewriting its DESIGN_RULES.md file.

        ## Current Score: {obs.total}/100

        ## Score Breakdown
        - background_layout: {obs.scores.background_layout}/15
        - color_palette: {obs.scores.color_palette}/15
        - typography: {obs.scores.typography}/15
        - title_quality: {obs.scores.title_quality}/15
        - data_presentation: {obs.scores.data_presentation}/15
        - structural_elements: {obs.scores.structural_elements}/15
        - overall_impression: {obs.scores.overall_impression}/10

        ## Evaluator Feedback
        Strengths:
        {chr(10).join(f'- {s}' for s in obs.strengths)}

        Weaknesses:
        {chr(10).join(f'- {w}' for w in obs.weaknesses)}

        Verdict: {obs.one_line_verdict}

        ## Current DESIGN_RULES.md
        {obs.design_rules_content}

        ## Current EXAMPLES.md
        {obs.examples_content}

        Your task:
        Write an improved DESIGN_RULES.md that addresses the weaknesses above
        while preserving what works well. Focus on the dimensions with the
        lowest scores. Output ONLY the markdown file content — no explanation,
        no code fences.
        """)

    response = anthropic_client.messages.create(
        model=OPTIMIZER_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )

    new_content = response.content[0].text.strip()

    return {
        "action_type": "replace_file",
        "file": "DESIGN_RULES.md",
        "new_content": new_content,
    }


def run_optimization_loop(server_url: str = SERVER_URL, max_steps: int = 7) -> None:
    """
    Run a full optimization episode using the LLM optimizer.

    This mirrors the historical Skill Forge loop, but is driven through the
    OpenEnv HTTP interface.
    """
    anthropic_client = anthropic.Anthropic()

    with SlideSkillClient(base_url=server_url) as client:
        session_id = client.reset()
        print(f"Session: {session_id}")

        # The first step must use the baseline skill files, so we submit a
        # no-op edit (replace EXAMPLES.md with its current content, which
        # forces the generator to run with the baseline DESIGN_RULES.md).
        # Alternatively, the server could expose a generate-only endpoint.
        print("Step 0: Generating baseline slide...")
        obs = client.step(
            session_id,
1398
+ {
1399
+ "action_type": "replace_file",
1400
+ "file": "EXAMPLES.md",
1401
+ "new_content": "(Empty — no prior optimization rounds)\n",
1402
+ },
1403
+ )
1404
+ print(f" Baseline score: {obs.total}/100 — {obs.one_line_verdict}")
1405
+
1406
+ for step_idx in range(1, max_steps + 1):
1407
+ if obs.done:
1408
+ print("Episode complete.")
1409
+ break
1410
+
1411
+ print(f"\nStep {step_idx}: Calling optimizer LLM...")
1412
+ action = call_optimizer_llm(obs, anthropic_client)
1413
+ obs = client.step(session_id, action)
1414
+
1415
+ print(
1416
+ f" Score: {obs.total}/100 (reward: {obs.reward:+.3f}) "
1417
+ f"— {obs.one_line_verdict}"
1418
+ )
1419
+ print(f" Weaknesses: {'; '.join(obs.weaknesses[:2])}")
1420
+
1421
+ client.close(session_id)
1422
+ print(f"\nFinal score: {obs.total}/100")
1423
+
1424
+
1425
+ if __name__ == "__main__":
1426
+ run_optimization_loop()
1427
+ ```
1428
+
1429
+ ---
1430
+
1431
+ ## 9. OpenEnv Manifest
1432
+
1433
+ `openenv/openenv.yaml`
1434
+
1435
+ ```yaml
1436
+ # OpenEnv environment manifest for Slide Skill
1437
+ # https://openenv.dev/spec
1438
+
1439
+ name: slide-skill
1440
+ version: "1.0.0"
1441
+ description: >
1442
+ Self-improving McKinsey-style PowerPoint slide generation environment.
1443
+ The agent evolves DESIGN_RULES.md and EXAMPLES.md to maximize a visual
1444
+ design score (0-100) evaluated by Claude Opus vision against 5 McKinsey
1445
+ reference images.
1446
+
1447
+ author: Tesserae / Skill Forge Hackathon Team
1448
+
1449
+ supports_concurrent_sessions: true
1450
+ max_steps: 7
1451
+
1452
+ # Approximate time budget per step (seconds).
1453
+ # Each step: generator LLM (~20-40s) + Node.js (<5s) + LibreOffice (~15-30s)
1454
+ # + pdftoppm (<5s) + evaluator LLM (~30-60s)
1455
+ step_timeout_seconds: 180
1456
+
1457
+ action_space:
1458
+ type: union
1459
+ discriminator: action_type
1460
+ variants:
1461
+ - name: edit_section
1462
+ description: Replace the body of a named section in a skill file.
1463
+ fields:
1464
+ file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
1465
+ section_heading: {type: string, description: "Exact heading text without # markers"}
1466
+ new_body: {type: string, description: "New section body content in markdown"}
1467
+
1468
+ - name: replace_file
1469
+ description: Replace the entire content of a skill file.
1470
+ fields:
1471
+ file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
1472
+ new_content: {type: string, description: "Complete new file content"}
1473
+
1474
+ observation_space:
1475
+ scores:
1476
+ background_layout: {type: integer, min: 0, max: 15}
1477
+ color_palette: {type: integer, min: 0, max: 15}
1478
+ typography: {type: integer, min: 0, max: 15}
1479
+ title_quality: {type: integer, min: 0, max: 15}
1480
+ data_presentation: {type: integer, min: 0, max: 15}
1481
+ structural_elements: {type: integer, min: 0, max: 15}
1482
+ overall_impression: {type: integer, min: 0, max: 10}
1483
+ total: {type: integer, min: 0, max: 100}
1484
+ strengths: {type: array, items: string}
1485
+ weaknesses: {type: array, items: string}
1486
+ one_line_verdict: {type: string}
1487
+ reward: {type: float, min: -0.3, max: 0.3}
1488
+ step: {type: integer}
1489
+ done: {type: boolean}
1490
+ jpg_path: {type: string, description: "Absolute path to generated slide JPG"}
1491
+ design_rules_content: {type: string}
1492
+ examples_content: {type: string}
1493
+
1494
+ reward:
1495
+ description: >
1496
+ Normalized score delta vs. previous step, capped to [-0.3, +0.3].
1497
+ Formula: clip(total_score - prev_total_score, -30, +30) / 100
1498
+ range: [-0.3, 0.3]
1499
+
1500
+ baseline:
1501
+ description: >
1502
+ skill_files_baseline/ committed to the repo contains the minimal
1503
+ starting DESIGN_RULES.md (teal palette, basic typography) and an
1504
+ empty EXAMPLES.md. This is skill_v0 content — NOT any evolved version.
1505
+
1506
+ endpoints:
1507
+ reset: POST /reset
1508
+ step: POST /step
1509
+ close: DELETE /sessions/{session_id}
1510
+ health: GET /health
1511
+
1512
+ server:
1513
+ host: 0.0.0.0
1514
+ port: 8000
1515
+ workers: 1 # Do not increase; LibreOffice is not thread-safe
1516
+
1517
+ environment_variables:
1518
+ required:
1519
+ - name: ANTHROPIC_API_KEY
1520
+ description: Anthropic API key for Claude generator and evaluator
1521
+ optional:
1522
+ - name: SLIDE_SKILL_MAX_STEPS
1523
+ description: Override default max_steps (default 7)
1524
+ default: "7"
1525
+ ```
1526
+
1527
+ ---
1528
+
1529
+ ## 10. Dockerfile
1530
+
1531
+ `openenv/Dockerfile`
1532
+
1533
+ ```dockerfile
1534
+ # Slide Skill OpenEnv — Docker image
1535
+ #
1536
+ # Layer sizes (approximate):
1537
+ # python:3.12-slim base: ~130 MB
1538
+ # Node.js 20 + pptxgenjs: ~200 MB
1539
+ # LibreOffice: ~500 MB <-- dominant cost
1540
+ # poppler-utils (pdftoppm): ~30 MB
1541
+ # Python deps: ~80 MB
1542
+ # Total compressed: ~600-700 MB
1543
+ #
1544
+ # LibreOffice is the unavoidable bottleneck. It is required to convert
1545
+ # .pptx → .pdf. There is no lighter alternative that handles pptxgenjs
1546
+ # output faithfully.
1547
+
1548
+ FROM python:3.12-slim
1549
+
1550
+ LABEL description="Slide Skill OpenEnv — McKinsey PPT generation environment"
1551
+
1552
+ # System dependencies
1553
+ RUN apt-get update && apt-get install -y --no-install-recommends \
1554
+ # LibreOffice for .pptx → .pdf conversion
1555
+ libreoffice \
1556
+ # poppler-utils provides pdftoppm (.pdf → .jpg)
1557
+ poppler-utils \
1558
+ # Node.js 20 LTS via NodeSource
1559
+ curl \
1560
+ ca-certificates \
1561
+ gnupg \
1562
+ && curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
1563
+ && apt-get install -y nodejs \
1564
+ && apt-get clean \
1565
+ && rm -rf /var/lib/apt/lists/*
1566
+
1567
+ # Verify tools are available
1568
+ RUN node --version && npm --version && soffice --version && pdftoppm -v 2>&1 | head -1
1569
+
1570
+ WORKDIR /app
1571
+
1572
+ # Install pptxgenjs (Node.js dependency)
1573
+ COPY package.json ./
1574
+ RUN npm install --omit=dev
1575
+
1576
+ # Install Python dependencies. The editable install needs the package
+ # directory to exist at install time, so create a stub; the real code
+ # is copied below.
+ COPY pyproject.toml ./
+ RUN mkdir -p openenv && pip install --no-cache-dir -e ".[server]"
1579
+
1580
+ # Copy application code
1581
+ COPY pptx/ ./pptx/
1582
+ COPY skill_files_baseline/ ./skill_files_baseline/
1583
+ COPY output/TASK_PROMPT.md ./output/TASK_PROMPT.md
1584
+ COPY output/reference/ ./output/reference/
1585
+ COPY openenv/ ./openenv/
1586
+
1587
+ WORKDIR /app/openenv
1588
+
1589
+ # LibreOffice needs a writable user profile directory.
1590
+ # Setting HOME=/tmp gives each process its own profile path and avoids
+ # conflicts over the LibreOffice lock files across concurrent sessions.
1591
+ ENV HOME=/tmp
1592
+ ENV SAL_USE_VCLPLUGIN=svp
1593
+
1594
+ EXPOSE 8000
1595
+
1596
+ # Single worker — LibreOffice is not thread-safe within one process.
1597
+ # Concurrent sessions are handled by per-session /tmp/ directories,
1598
+ # but LibreOffice calls must be serialized (or use process-level locking
1599
+ # if scaling to multiple Gunicorn workers is required in the future).
1600
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
1601
+ ```
1602
+
1603
+ ---
1604
+
1605
+ ## 11. Implementation Task Order
1606
+
1607
+ ### Phase 1 — Foundation (no external dependencies)
1608
+ 1. Commit `skill_files_baseline/` to repo (copy `output/skill_v0/` content, verify EXAMPLES.md is truly minimal).
1609
+ 2. Implement `models.py` — pure Pydantic, no I/O.
1610
+ 3. Implement `skill_manager.py` — file I/O only, no LLM calls. Write unit tests with a tmp directory.
1611
+ 4. Implement `evaluator_adapter.py` — port the `evaluate_slide()` function from `output/evaluator.py`. Test against a known slide JPG and verify JSON matches expected structure.
1612
+
1613
+ ### Phase 2 — Pipeline Integration
1614
+ 5. Implement `slide_generator.py` — integrate LLM call + subprocess chain. Test the four subprocess stages independently before wiring together.
1615
+ 6. Implement `slide_skill_environment.py` — wire `SkillManager` + `SlideGenerator` + `EvaluatorAdapter`. Test `reset()` creates isolated `/tmp/` dirs and `close()` removes them.
1616
+
1617
+ ### Phase 3 — Server & Client
1618
+ 7. Implement `app.py` — FastAPI wrapper. Test `/health`, `/reset`, `/step` sequence with a minimal dummy action.
1619
+ 8. Implement `client.py` — test against the live server. Confirm the optimizer LLM loop produces an observation with improving scores.
1620
+
1621
+ ### Phase 4 — Containerization
1622
+ 9. Write `Dockerfile`. Build and verify all four pipeline stages work inside the container.
1623
+ 10. Write `openenv.yaml`. Validate against the OpenEnv manifest schema.
1624
+ 11. Push to HuggingFace Spaces. Verify a full episode (7 steps) completes within resource limits.
1625
+
1626
+ ### Phase 5 — Hardening
1627
+ 12. Add per-session LibreOffice locking if running >1 Uvicorn worker.
1628
+ 13. Add timeout handling in `_run()` and surface timeouts as proper HTTP 504 responses.
1629
+ 14. Add structured logging (JSON lines) so HuggingFace Spaces logs are parseable.
1630
+
1631
+ **Critical dependency note**: Phase 2 cannot start until Phase 1 is complete. Phase 3 cannot start until Phase 2 is stable. Phase 5 is optional for a hackathon demo but recommended for production.
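The section-edit primitive that `skill_manager.py` needs for `EditSectionAction` (Phase 1, item 3) can be sketched as a pure function over markdown text. This is a hypothetical helper, not the committed implementation; the committed `SkillManager.apply()` may bound sections differently:

```python
import re


def replace_section(markdown: str, heading: str, new_body: str) -> str:
    """Replace the body under `heading` with `new_body`.

    The heading is matched by its exact text at any `#` level; the body
    extends to the next heading of the same or higher level, or to EOF.
    Raises ValueError if the heading is not found.
    """
    pattern = re.compile(
        rf"^(?P<hashes>#+)\s+{re.escape(heading)}\s*$", re.MULTILINE
    )
    match = pattern.search(markdown)
    if match is None:
        raise ValueError(f"Section heading not found: {heading!r}")
    level = len(match.group("hashes"))
    # Find the next heading of the same or higher level after this one.
    next_heading = re.compile(rf"^#{{1,{level}}}\s+\S", re.MULTILINE)
    tail = next_heading.search(markdown, match.end())
    end = tail.start() if tail else len(markdown)
    return (
        markdown[: match.end()] + "\n\n" + new_body.rstrip() + "\n\n" + markdown[end:]
    )
```

A tmp-directory unit test then only needs to read a skill file, call `replace_section`, and write it back.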
1632
+
1633
+ ---
1634
+
1635
+ ## 12. Design Decisions and Rationale
1636
+
1637
+ ### Per-Session Isolation vs. No-Concurrency
1638
+
1639
+ The original plan set `SUPPORTS_CONCURRENT_SESSIONS = False`. This is safe but prevents any parallel evaluation runs, making HuggingFace Spaces single-threaded even though the hardware could handle more.
1640
+
1641
+ The better approach is per-session file isolation: on `reset()`, copy both skill files into `/tmp/slide_skill_{session_id}/`. Each session's `generate.js`, `.pptx`, `.pdf`, and `.jpg` are written there too. Sessions never touch each other's files.
1642
+
1643
+ The one caveat is LibreOffice: `soffice` is not safe to run concurrently under the same OS user, because parallel invocations contend for the same user-profile lock files. Options: (a) serialize LibreOffice calls with an `asyncio.Lock`, or (b) run each session's `soffice` subprocess with `HOME=/tmp/soffice_{session_id}` in its environment so it gets a unique LibreOffice user profile. Option (b) is simpler and is what the Dockerfile's `ENV HOME=/tmp` partially enables.
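Option (b) can be sketched as argv/env construction. This is a hypothetical helper, not the committed `slide_generator.py`; the flags are the standard LibreOffice headless-conversion ones, and the caller would pass both values to `subprocess.run`:

```python
import os
from pathlib import Path


def soffice_invocation(session_id: str, pptx_path: Path, out_dir: Path):
    """Build the argv and environment for a per-session LibreOffice call.

    Giving each session its own HOME yields an isolated LibreOffice user
    profile, so concurrent conversions do not fight over lock files.
    """
    env = dict(os.environ)
    env["HOME"] = f"/tmp/soffice_{session_id}"  # unique profile per session
    env["SAL_USE_VCLPLUGIN"] = "svp"            # headless VCL plugin
    argv = [
        "soffice", "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(pptx_path),
    ]
    return argv, env
```

Usage: `subprocess.run(argv, env=env, timeout=60, check=True)`.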
1644
+
1645
+ ### Dual Action Types
1646
+
1647
+ The historical optimizer LLM rewrites the entire `DESIGN_RULES.md` in each round — it does not do surgical section edits. `ReplaceFileAction` matches this behavior exactly and makes the action space natural for an LLM optimizer.
1648
+
1649
+ `EditSectionAction` is retained because: (a) it is more token-efficient for small targeted changes, (b) it enables gradient-like optimization where an RL agent changes one dimension at a time, and (c) it is a cleaner action space for non-LLM optimizers (e.g., evolutionary algorithms).
1650
+
1651
+ Using a Pydantic discriminated union keeps the API clean: a single `action` field, type-safe dispatch in `SkillManager.apply()`, and automatic OpenAPI schema generation.
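At dispatch time, the discriminated union reduces to tag-based construction. A stdlib-only sketch of that dispatch (the committed `models.py` uses Pydantic's `Field(discriminator="action_type")`; this stand-in only illustrates the mechanism):

```python
from dataclasses import dataclass


@dataclass
class EditSectionAction:
    file: str
    section_heading: str
    new_body: str


@dataclass
class ReplaceFileAction:
    file: str
    new_content: str


# Mirrors the discriminator: the "action_type" key selects the variant.
_ACTION_TYPES = {
    "edit_section": EditSectionAction,
    "replace_file": ReplaceFileAction,
}


def parse_action(payload: dict):
    """Validate the discriminator and build the matching variant."""
    kind = payload.get("action_type")
    cls = _ACTION_TYPES.get(kind)
    if cls is None:
        raise ValueError(f"Unknown action_type: {kind!r}")
    fields = {k: v for k, v in payload.items() if k != "action_type"}
    return cls(**fields)
```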
1652
+
1653
+ ### Why We Don't Evolve the Generic pptx Skill
1654
+
1655
+ The files in `pptx/` (SKILL.md, editing.md, pptxgenjs.md) are the agent's API reference for using pptxgenjs. They are analogous to a standard library — stable, general-purpose, and not brand-specific. Evolving them would be like optimizing stdlib for one application.
1656
+
1657
+ The brand-specific optimization target is `DESIGN_RULES.md` + `EXAMPLES.md`. These encode McKinsey visual grammar: what colors, what typography, where to put structural elements, what titles should say. This separation is what makes the loop generalizable: swap in a different task prompt + reference images + baseline skill files, and the same environment can optimize slides for any brand.
1658
+
1659
+ ### LibreOffice as the Bottleneck
1660
+
1661
+ LibreOffice adds ~500 MB to the Docker image and ~15–30 seconds per step. There is no lighter alternative that faithfully renders pptxgenjs output to PDF. Headless Chrome can render HTML but not .pptx. The pptxgenjs team does not offer a built-in PDF export.
1662
+
1663
+ Accept LibreOffice as a hard dependency. Optimize around it by: (a) keeping the Docker layer cached (don't change its installation order), (b) pre-warming LibreOffice on server startup with a dummy convert, (c) setting a 60-second timeout on the LibreOffice subprocess and surfacing timeout as a step error rather than hanging.
1664
+
1665
+ ### Reward = Score Delta Capped at [-0.3, +0.3]
1666
+
1667
+ The evaluator is an LLM (Claude Opus 4.6 with vision). LLM evaluators have run-to-run noise: the same slide evaluated twice may score 87 one time and 91 the next. If we used the raw score delta as the reward, a noise swing of +4 would look like a meaningful improvement. Normalizing by 100 keeps a ±5-point noise swing down at a ±0.05 reward, and the ±30-point cap stops a single outsized jump from dominating the signal. The cap saturates for genuine large improvements: going from 60→90 in one step (unusual but possible) gives reward = +0.3, the same as going from 60→100. This is intentional: we want to reward improvement, not its magnitude, to keep the learning signal stable.
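The reward formula from the manifest, written out with the cap made explicit:

```python
def compute_reward(total: int, prev_total: int) -> float:
    """Score delta vs. the previous step, clipped to ±30 points and
    normalized to [-0.3, +0.3] (the manifest's reward range)."""
    delta = max(-30, min(30, total - prev_total))
    return delta / 100.0
```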
1668
+
1669
+ ### EXAMPLES.md Grows Over Time
1670
+
1671
+ In the historical loop, `EXAMPLES.md` accumulated guidance across rounds — by v4, it referenced v3 and v4 issues explicitly. On `reset()`, we restore to the true `skill_v0` baseline: empty EXAMPLES.md. This is intentional. The optimizer must re-learn from the evaluator feedback each episode, which is the right behavior for RL. If you want warm-started episodes, implement a separate "curriculum baseline" and pass it as an optional `reset(skill_version="v3")` parameter.
1672
+
1673
+ ---
1674
+
1675
+ ## 13. Dependencies
1676
+
1677
+ `pyproject.toml`
1678
+
1679
+ ```toml
1680
+ [build-system]
1681
+ requires = ["hatchling"]
1682
+ build-backend = "hatchling.build"
1683
+
1684
+ [project]
1685
+ name = "slide-skill-openenv"
1686
+ version = "1.0.0"
1687
+ description = "OpenEnv environment for McKinsey-style PowerPoint slide optimization"
1688
+ requires-python = ">=3.12"
1689
+
1690
+ # Core runtime dependencies (required for the environment to run)
1691
+ dependencies = [
1692
+ "anthropic>=0.40.0", # Claude API client (generator + evaluator)
1693
+ "pydantic>=2.6.0", # Data models with discriminated unions
1694
+ "httpx>=0.27.0", # HTTP client for client.py
1695
+ ]
1696
+
1697
+ [project.optional-dependencies]
1698
+ # Server dependencies (required for app.py)
1699
+ server = [
1700
+ "fastapi>=0.111.0",
1701
+ "uvicorn[standard]>=0.30.0",
1702
+ "python-multipart>=0.0.9", # FastAPI form parsing
1703
+ ]
1704
+
1705
+ # Development and testing
1706
+ dev = [
1707
+ "pytest>=8.0.0",
1708
+ "pytest-asyncio>=0.23.0",
1709
+ "httpx>=0.27.0", # for FastAPI TestClient
1710
+ "ruff>=0.4.0",
1711
+ "mypy>=1.10.0",
1712
+ ]
1713
+
1714
+ [tool.hatch.build.targets.wheel]
1715
+ packages = ["openenv"]
1716
+
1717
+ [tool.ruff]
1718
+ target-version = "py312"
1719
+ line-length = 88
1720
+
1721
+ [tool.ruff.lint]
1722
+ select = ["E", "F", "I", "UP"]
1723
+
1724
+ [tool.mypy]
1725
+ python_version = "3.12"
1726
+ strict = true
1727
+ ignore_missing_imports = true
1728
+ ```
openenv/Dockerfile ADDED
@@ -0,0 +1,70 @@
1
+ # Slide Skill OpenEnv — Docker image
2
+ #
3
+ # Layer sizes (approximate):
4
+ # python:3.12-slim base: ~130 MB
5
+ # Node.js 20 + pptxgenjs: ~200 MB
6
+ # LibreOffice: ~500 MB <-- dominant cost; unavoidable for .pptx → .pdf
7
+ # poppler-utils (pdftoppm): ~30 MB
8
+ # Python deps: ~80 MB
9
+ # Total compressed: ~600-700 MB
10
+ #
11
+ # LibreOffice is the unavoidable bottleneck. It is required to convert
12
+ # .pptx → .pdf. There is no lighter alternative that handles pptxgenjs
13
+ # output faithfully.
14
+
15
+ FROM python:3.12-slim
16
+
17
+ LABEL description="Slide Skill OpenEnv — McKinsey PPT generation environment"
18
+
19
+ # System dependencies — installed in one RUN to minimize layers.
20
+ RUN apt-get update && apt-get install -y --no-install-recommends \
21
+ # LibreOffice for .pptx → .pdf conversion
22
+ libreoffice \
23
+ # poppler-utils provides pdftoppm (.pdf → .jpg)
24
+ poppler-utils \
25
+ # Node.js 20 LTS via NodeSource
26
+ curl \
27
+ ca-certificates \
28
+ gnupg \
29
+ && curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
30
+ && apt-get install -y nodejs \
31
+ && apt-get clean \
32
+ && rm -rf /var/lib/apt/lists/*
33
+
34
+ # Verify all required tools are available at build time.
35
+ RUN node --version && npm --version && soffice --version && pdftoppm -v 2>&1 | head -1
36
+
37
+ WORKDIR /app
38
+
39
+ # Install pptxgenjs (Node.js dependency) — copy package.json first for layer caching.
40
+ COPY package.json package-lock.json* ./
41
+ RUN npm install --omit=dev
42
+
43
+ # Install Python dependencies — copy pyproject.toml first for layer caching.
+ # The editable install needs the package directory to exist at install
+ # time, so create a stub; the real code is copied below.
+ COPY pyproject.toml ./
+ RUN mkdir -p openenv && pip install --no-cache-dir -e ".[server]"
46
+
47
+ # Copy application code and data.
48
+ COPY pptx/ ./pptx/
49
+ COPY skill_files_baseline/ ./skill_files_baseline/
50
+ COPY output/TASK_PROMPT.md ./output/TASK_PROMPT.md
51
+ COPY output/reference/ ./output/reference/
52
+ COPY openenv/ ./openenv/
53
+
54
+ WORKDIR /app/openenv
55
+
56
+ # LibreOffice needs a writable user profile directory.
57
+ # Setting HOME=/tmp gives each process its own profile path and avoids
58
+ # concurrent session conflicts with the LibreOffice lock files.
59
+ ENV HOME=/tmp
60
+ # Use the headless VCL plugin (no display required).
61
+ ENV SAL_USE_VCLPLUGIN=svp
62
+
63
+ EXPOSE 8000
64
+
65
+ HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
66
+ CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
67
+
68
+ # Single worker — LibreOffice subprocess calls must be serialized within one
69
+ # OS process. Concurrent sessions are handled by per-session /tmp/ directories.
70
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
openenv/app.py ADDED
@@ -0,0 +1,131 @@
1
+ """
2
+ FastAPI server for the Slide Skill OpenEnv environment.
3
+
4
+ Endpoints follow the OpenEnv HTTP protocol:
5
+ POST /reset → initialize or restart a session
6
+ POST /step → apply an action and return observation
7
+ DELETE /sessions/{session_id} → clean up a session
8
+ GET /health → liveness check
9
+
10
+ The server is stateful: environment instances are kept in memory.
11
+ Use a single Uvicorn worker (--workers 1) since LibreOffice is not
12
+ thread-safe when called concurrently from the same process.
13
+ """
14
+
15
+ from __future__ import annotations
16
+
17
+ import logging
18
+ import traceback
19
+ from contextlib import asynccontextmanager
20
+ from pathlib import Path
21
+
22
+ from dotenv import load_dotenv
23
+
24
+ logging.basicConfig(level=logging.INFO)
25
+ logger = logging.getLogger(__name__)
26
+
27
+ # Load .env from the repo root (one level up from openenv/)
28
+ load_dotenv(Path(__file__).parent.parent / ".env")
29
+ from typing import Annotated, Any
30
+
31
+ import uvicorn
32
+ from fastapi import Body, FastAPI, HTTPException, Path
33
+ from pydantic import BaseModel
34
+
35
+ from models import SlideSkillAction, SlideSkillObservation
36
+ from slide_skill_environment import SlideSkillEnvironment
37
+
38
+
39
+ # Single shared environment instance. Sessions are isolated at the file
40
+ # level, so this is safe for concurrent requests.
41
+ _env: SlideSkillEnvironment | None = None
42
+
43
+
44
+ @asynccontextmanager
45
+ async def lifespan(app: FastAPI): # type: ignore[type-arg]
46
+ global _env
47
+ _env = SlideSkillEnvironment()
48
+ yield
49
+ _env = None
50
+
51
+
52
+ app = FastAPI(
53
+ title="Slide Skill OpenEnv",
54
+ description=(
55
+ "OpenEnv-compatible environment for optimizing McKinsey-style "
56
+ "PowerPoint slides by evolving DESIGN_RULES.md and EXAMPLES.md."
57
+ ),
58
+ lifespan=lifespan,
59
+ )
60
+
61
+
62
+ class ResetRequest(BaseModel):
63
+ session_id: str | None = None
64
+
65
+
66
+ class ResetResponse(BaseModel):
67
+ session_id: str
68
+ message: str
69
+
70
+
71
+ class StepRequest(BaseModel):
72
+ session_id: str
73
+ action: SlideSkillAction
74
+
75
+
76
+ @app.post("/reset", response_model=ResetResponse)
77
+ async def reset(
78
+ request: ResetRequest = Body(default=ResetRequest()),
79
+ ) -> ResetResponse:
80
+ """Initialize or restart an optimization session."""
81
+ assert _env is not None
82
+ session_id = _env.reset(session_id=request.session_id)
83
+ return ResetResponse(
84
+ session_id=session_id,
85
+ message=f"Session {session_id} initialized with baseline skill files.",
86
+ )
87
+
88
+
89
+ @app.post("/step", response_model=SlideSkillObservation)
90
+ async def step(request: StepRequest) -> SlideSkillObservation:
91
+ """Apply an action to the session and return the resulting observation."""
92
+ assert _env is not None
93
+ try:
94
+ observation = _env.step(
95
+ session_id=request.session_id,
96
+ action=request.action,
97
+ )
98
+ except KeyError:
99
+ raise HTTPException(
100
+ status_code=404,
101
+ detail=f"Session {request.session_id!r} not found. Call /reset first.",
102
+ )
103
+ except (RuntimeError, ValueError) as exc:
104
+ logger.error("Step failed:\n%s", traceback.format_exc())
105
+ raise HTTPException(status_code=500, detail=str(exc))
106
+ return observation
107
+
108
+
109
+ @app.delete("/sessions/{session_id}")
110
+ async def close_session(
111
+ session_id: Annotated[str, Path(description="Session ID to clean up.")],
112
+ ) -> dict[str, Any]:
113
+ """Clean up session resources (deletes /tmp/ working directory)."""
114
+ assert _env is not None
115
+ try:
116
+ _env.close(session_id)
117
+ except KeyError:
118
+ raise HTTPException(
119
+ status_code=404,
120
+ detail=f"Session {session_id!r} not found.",
121
+ )
122
+ return {"message": f"Session {session_id} closed."}
123
+
124
+
125
+ @app.get("/health")
126
+ async def health() -> dict[str, Any]:
127
+ return {"status": "ok", "supports_concurrent_sessions": True}
128
+
129
+
130
+ if __name__ == "__main__":
131
+ uvicorn.run("app:app", host="0.0.0.0", port=8000, workers=1)
openenv/client.py ADDED
@@ -0,0 +1,249 @@
1
+ """
2
+ Reference client for the Slide Skill OpenEnv server.
3
+
4
+ Demonstrates how an optimizer agent would interact with the environment:
5
+ 1. Reset to get a session ID.
6
+ 2. Submit the baseline action (no-op replace to trigger generation).
7
+ 3. Call the LLM optimizer using the observation feedback.
8
+ 4. Submit the improved DESIGN_RULES.md as a ReplaceFileAction.
9
+ 5. Repeat until done=True.
10
+
11
+ This client is also useful for smoke-testing the server without a full agent.
12
+
13
+ Usage:
14
+ # Smoke test (single step, no optimizer LLM):
15
+ python client.py --smoke-test
16
+
17
+ # Full optimization loop:
18
+ python client.py --server http://localhost:8000 --max-steps 7
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import os
25
+ import textwrap
26
+ from pathlib import Path
27
+ from typing import Any
28
+
29
+ from dotenv import load_dotenv
30
+ from google import genai
31
+
32
+ load_dotenv(Path(__file__).parent.parent / ".env")
33
+ from google.genai import types
34
+ import httpx
35
+ from loguru import logger
36
+
37
+ from models import SlideSkillObservation
38
+
39
+ SERVER_URL = "http://localhost:8000"
40
+ OPTIMIZER_MODEL = "gemini-3.1-pro-preview"
41
+
42
+ BASELINE_EXAMPLES_CONTENT = "(Empty — no prior optimization rounds)\n"
43
+
44
+
45
+ class SlideSkillClient:
46
+ """HTTP client for the Slide Skill OpenEnv server."""
47
+
48
+ def __init__(self, base_url: str = SERVER_URL) -> None:
49
+ self.base_url = base_url.rstrip("/")
50
+ self._http = httpx.Client(timeout=300.0) # long timeout for pipeline stages
51
+
52
+ def reset(self, session_id: str | None = None) -> str:
53
+ """Start a new session. Returns the session_id."""
54
+ payload: dict[str, Any] = {}
55
+ if session_id:
56
+ payload["session_id"] = session_id
57
+ resp = self._http.post(f"{self.base_url}/reset", json=payload)
58
+ resp.raise_for_status()
59
+ return resp.json()["session_id"]
60
+
61
+ def step(self, session_id: str, action: dict[str, Any]) -> SlideSkillObservation:
62
+ """
63
+ Apply an action and return the observation.
64
+
65
+ Args:
66
+ session_id: Active session ID.
67
+ action: Dict matching EditSectionAction or ReplaceFileAction schema.
68
+ Must include "action_type" key.
69
+ """
70
+ payload = {"session_id": session_id, "action": action}
71
+ resp = self._http.post(f"{self.base_url}/step", json=payload)
72
+ if not resp.is_success:
73
+ raise RuntimeError(
74
+ f"Step failed ({resp.status_code}): {resp.text}"
75
+ )
76
+ return SlideSkillObservation.model_validate(resp.json())
77
+
78
+ def close(self, session_id: str) -> None:
79
+ """Clean up the session."""
80
+ resp = self._http.delete(f"{self.base_url}/sessions/{session_id}")
81
+ resp.raise_for_status()
82
+
83
+ def __enter__(self) -> SlideSkillClient:
84
+ return self
85
+
86
+ def __exit__(self, *_: Any) -> None:
87
+ self._http.close()
88
+
89
+
90
+ # ---------------------------------------------------------------------------
91
+ # Optimizer agent (reference implementation)
92
+ # ---------------------------------------------------------------------------
93
+
94
+
95
+ def call_optimizer_llm(
96
+ obs: SlideSkillObservation,
97
+ gemini_client: genai.Client,
98
+ ) -> dict[str, Any]:
99
+ """
100
+ Call the optimizer LLM to generate a new DESIGN_RULES.md based on
101
+ the evaluation feedback.
102
+
103
+ Returns a dict suitable for the step() action parameter.
104
+ Uses ReplaceFileAction since the historical optimizer rewrites
105
+ the file wholesale.
106
+ """
107
+ prompt = textwrap.dedent(f"""\
108
+ You are a McKinsey slide design optimizer. You are improving a
109
+ PowerPoint generation skill by rewriting its DESIGN_RULES.md file.
110
+
111
+ ## Current Score: {obs.total}/100
112
+
113
+ ## Score Breakdown
114
+ - background_layout: {obs.scores.background_layout}/15
115
+ - color_palette: {obs.scores.color_palette}/15
116
+ - typography: {obs.scores.typography}/15
117
+ - title_quality: {obs.scores.title_quality}/15
118
+ - data_presentation: {obs.scores.data_presentation}/15
119
+ - structural_elements: {obs.scores.structural_elements}/15
120
+ - overall_impression: {obs.scores.overall_impression}/10
121
+
122
+ ## Evaluator Feedback
123
+ Strengths:
124
+ {chr(10).join(f'- {s}' for s in obs.strengths)}
125
+
126
+ Weaknesses:
127
+ {chr(10).join(f'- {w}' for w in obs.weaknesses)}
128
+
129
+ Verdict: {obs.one_line_verdict}
130
+
131
+ ## Current DESIGN_RULES.md
132
+ {obs.design_rules_content}
133
+
134
+ ## Current EXAMPLES.md
135
+ {obs.examples_content}
136
+
137
+ Your task:
138
+ Write an improved DESIGN_RULES.md that addresses the weaknesses above
139
+ while preserving what works well. Focus on the dimensions with the
140
+ lowest scores. Output ONLY the markdown file content — no explanation,
141
+ no code fences.
142
+ """)
143
+
144
+ response = gemini_client.models.generate_content(
145
+ model=OPTIMIZER_MODEL,
146
+ contents=prompt,
147
+ config=types.GenerateContentConfig(max_output_tokens=4096),
148
+ )
149
+
150
+ new_content = response.text.strip()
151
+
152
+ return {
153
+ "action_type": "replace_file",
154
+ "file": "DESIGN_RULES.md",
155
+ "new_content": new_content,
156
+ }
157
+
158
+
159
+ def run_optimization_loop(server_url: str = SERVER_URL, max_steps: int = 7) -> None:
160
+ """
161
+ Run a full optimization episode using the LLM optimizer.
162
+
163
+ This mirrors the historical Skill Forge loop but driven through the
164
+ OpenEnv HTTP interface.
165
+ """
166
+ gemini_client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
167
+
168
+ with SlideSkillClient(base_url=server_url) as client:
169
+ logger.info(f"Starting optimization loop (max {max_steps} steps) | server={server_url}")
170
+ session_id = client.reset()
171
+ logger.info(f"Session: {session_id}")
172
+
173
+ # Step 0: baseline — generate slide with unmodified skill files.
174
+ logger.info("Step 0/baseline | running pipeline (Flash generate → render → Pro evaluate, ~60-120s)...")
177
+ obs = client.step(
178
+ session_id,
179
+ {
180
+ "action_type": "replace_file",
181
+ "file": "EXAMPLES.md",
182
+ "new_content": BASELINE_EXAMPLES_CONTENT,
183
+ },
184
+ )
185
+ logger.info(f"Step 0/baseline | score={obs.total}/100 — {obs.one_line_verdict}")
186
+
187
+ for step_idx in range(1, max_steps + 1):
188
+ if obs.done:
189
+ logger.info("Episode complete (max_steps reached).")
190
+ break
191
+
192
+ logger.info(f"Step {step_idx}/{max_steps} | optimizing skill files (Pro)...")
193
+ action = call_optimizer_llm(obs, gemini_client)
194
+ logger.info(f"Step {step_idx}/{max_steps} | running pipeline (Flash generate → render → Pro evaluate, ~60-120s)...")
197
+ obs = client.step(session_id, action)
198
+
199
+ delta_str = f"{obs.reward * 100:+.0f} pts"
200
+ logger.info(f"Step {step_idx}/{max_steps} | score={obs.total}/100 ({delta_str}) — {obs.one_line_verdict}")
201
+ if obs.weaknesses:
202
+ logger.info(f"Step {step_idx}/{max_steps} | top weakness: {obs.weaknesses[0]}")
203
+
204
+ client.close(session_id)
205
+ logger.success(f"Done. Final score: {obs.total}/100")
206
+
207
+
208
+ def smoke_test(server_url: str = SERVER_URL) -> None:
209
+ """Run a single reset + step to verify the server is working."""
210
+ with SlideSkillClient(base_url=server_url) as client:
211
+ logger.info("Smoke test: resetting session...")
212
+ session_id = client.reset()
213
+ logger.info(f"Smoke test: session_id={session_id}")
214
+
215
+ logger.info("Smoke test: submitting baseline action (full pipeline)...")
216
+ obs = client.step(
217
+ session_id,
218
+ {
219
+ "action_type": "replace_file",
220
+ "file": "EXAMPLES.md",
221
+ "new_content": BASELINE_EXAMPLES_CONTENT,
222
+ },
223
+ )
224
+ logger.info(f"Smoke test: score={obs.total}/100 reward={obs.reward:+.3f} done={obs.done}")
225
+ logger.info(f"Smoke test: verdict: {obs.one_line_verdict}")
226
+
227
+ client.close(session_id)
228
+ logger.success("Smoke test passed.")
229
+
230
+
231
+ if __name__ == "__main__":
232
+ parser = argparse.ArgumentParser(description="Slide Skill OpenEnv client")
233
+ parser.add_argument(
234
+ "--server", default=SERVER_URL, help="Server base URL"
235
+ )
236
+ parser.add_argument(
237
+ "--max-steps", type=int, default=7, help="Max optimization steps"
238
+ )
239
+ parser.add_argument(
240
+ "--smoke-test",
241
+ action="store_true",
242
+ help="Run a single step smoke test instead of the full loop",
243
+ )
244
+ args = parser.parse_args()
245
+
246
+ if args.smoke_test:
247
+ smoke_test(server_url=args.server)
248
+ else:
249
+ run_optimization_loop(server_url=args.server, max_steps=args.max_steps)
openenv/evaluator_adapter.py ADDED
@@ -0,0 +1,259 @@
+ """
+ Evaluator Adapter — wraps the existing output/evaluator.py logic as a
+ reusable module with a clean interface.
+
+ This module does NOT import output/evaluator.py (which has a __main__ guard
+ and hardcoded paths). Instead, it re-implements the core evaluate_slide()
+ logic with:
+ - Configurable reference image paths
+ - A return type that includes all seven score keys, strengths, weaknesses,
+   and one_line_verdict
+ - No file I/O side effects (no evaluation_results.json written)
+
+ The evaluation prompt is identical to output/evaluator.py so scores are
+ comparable across the historical runs and the OpenEnv loop.
+
+ Note on Gemini vs. Anthropic image handling:
+     Gemini's SDK accepts image bytes directly via types.Part.from_bytes(),
+     so base64 encoding is not needed here (unlike the Anthropic SDK).
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ from pathlib import Path
+
+ from google import genai
+ from google.genai import types
+
+
+ # Must match output/evaluator.py exactly so historical scores are comparable.
+ EVALUATION_SYSTEM_PROMPT = """You are an expert McKinsey & Company slide design evaluator.
+
+ You will be shown:
+ 1. REFERENCE IMAGES: 5 pages from a real McKinsey & Company consulting deck (Chilean Hydrogen Pathway, December 2020). These represent the gold standard for visual style.
+ 2. CANDIDATE SLIDE: A programmatically generated PowerPoint slide about Dutch Hydrogen Strategy, rendered as a JPEG image.
+
+ Your job: Score how closely the CANDIDATE SLIDE matches the McKinsey visual style shown in the REFERENCE IMAGES.
+
+ ## Scoring Rubric (100 points total)
+
+ ### 1. Background & Base Layout (0-15 points)
+ - McKinsey content/data slides use WHITE backgrounds (dark navy is ONLY for section dividers/covers)
+ - Clean margins (~0.5" all sides)
+ - No unnecessary visual clutter
+ - 15: White bg, clean margins, professional spacing
+ - 10: White bg but spacing issues
+ - 5: Wrong background color or major layout problems
+ - 0: Completely wrong base (e.g., dark bg for data slide)
+
+ ### 2. Color Palette Fidelity (0-15 points)
+ - McKinsey uses a RESTRAINED palette: navy/dark blue (#0C2340-ish), white, light greys
+ - Accent colors are used SPARINGLY — typically just 1-2 accent colors max
+ - NO rainbow effects, no bright multi-color schemes
+ - Crimson/red used only for thin divider lines, not large elements
+ - 15: Matches McKinsey's restrained navy/white/grey palette perfectly
+ - 10: Mostly correct but 1-2 color choices off
+ - 5: Too many colors or wrong color family
+ - 0: Completely different color scheme
+
+ ### 3. Typography (0-15 points)
+ - Title: Large, bold, black or very dark, left-aligned (Georgia or similar serif for titles)
+ - Body: Clean sans-serif (Calibri-like), smaller, grey or dark grey
+ - Clear size hierarchy: title >> subtitle >> body >> footnotes
+ - No decorative fonts
+ - 15: Perfect type hierarchy matching McKinsey
+ - 10: Good hierarchy but font choices slightly off
+ - 5: Weak hierarchy or wrong fonts
+ - 0: No clear hierarchy
+
+ ### 4. Title Quality — "So-What" Style (0-15 points)
+ - McKinsey titles state a CONCLUSION or INSIGHT, not just a topic
+ - GOOD: "The Netherlands aims to become Europe's green hydrogen hub, scaling from 500 MW to 3-4 GW by 2030"
+ - BAD: "Dutch Hydrogen Strategy (2020-2035)" or "Roadmap Overview"
+ - The title should tell you the key takeaway without reading the slide
+ - 15: Clear insight-driven conclusion title
+ - 10: Partial insight (has some specifics but reads more like a topic)
+ - 5: Pure topic label
+ - 0: Generic or missing title
+
+ ### 5. Data Presentation (0-15 points)
+ - McKinsey uses structured TABLES for data (not floating stat callouts)
+ - Tables have: navy header borders (top + bottom of header row), light grey row dividers, bold left column labels
+ - Data should be organized, scannable, center-aligned values
+ - Key columns/years may be subtly highlighted
+ - 15: Clean structured table matching McKinsey format
+ - 10: Has data but format doesn't match McKinsey tables
+ - 5: Data present but poorly structured (floating callouts, inconsistent format)
+ - 0: No supporting data
+
+ ### 6. Structural Elements (0-15 points)
+ - Thin crimson/red divider line below title area (not touching title — separated by whitespace)
+ - McKinsey footer: thin rule line + source text (left) + "McKinsey & Company" bold (right) + page number
+ - Numbered footnotes for data disclaimers
+ - Source attribution line
+ - 15: All structural elements present and correctly placed
+ - 10: Most elements present, minor placement issues
+ - 5: Missing 2+ structural elements
+ - 0: No McKinsey structural elements
+
+ ### 7. Overall Visual Impression (0-10 points)
+ - Does this FEEL like it came from McKinsey?
+ - Would a consulting professional find this polished and credible?
+ - Is it clean, restrained, and authoritative — or busy, colorful, and amateur?
+ - 10: Indistinguishable from real McKinsey output
+ - 7: Close but a trained eye spots differences
+ - 4: Clearly generated/templated but has some McKinsey DNA
+ - 1: Does not resemble McKinsey at all
+
+ ## Output Format
+
+ Return ONLY a JSON object with this exact structure (no markdown, no code fences):
+ {
+   "scores": {
+     "background_layout": <0-15>,
+     "color_palette": <0-15>,
+     "typography": <0-15>,
+     "title_quality": <0-15>,
+     "data_presentation": <0-15>,
+     "structural_elements": <0-15>,
+     "overall_impression": <0-10>
+   },
+   "total": <sum of all scores, 0-100>,
+   "strengths": ["<strength 1>", "<strength 2>", ...],
+   "weaknesses": ["<weakness 1>", "<weakness 2>", ...],
+   "one_line_verdict": "<one sentence summary>"
+ }
+ """
+
+ EVALUATOR_MODEL = "gemini-3.1-pro-preview"
+
+
+ def _image_part(path: Path) -> types.Part:
+     """Load an image file as a Gemini Part (bytes + mime type)."""
+     suffix = path.suffix.lower()
+     mime_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
+     return types.Part.from_bytes(data=path.read_bytes(), mime_type=mime_type)
+
+
+ class EvaluatorAdapter:
+     """
+     Adapter that evaluates a generated slide JPG against McKinsey references.
+
+     Uses Gemini 3.1 Pro with vision, replicating the evaluation logic from
+     output/evaluator.py as a reusable class with no file I/O side effects.
+     """
+
+     REFERENCE_FILENAMES = [
+         "ref-01.jpg",
+         "ref-02.jpg",
+         "ref-03.jpg",
+         "ref-04.jpg",
+         "ref-05.jpg",
+     ]
+
+     def __init__(self, reference_dir: Path) -> None:
+         """
+         Args:
+             reference_dir: Directory containing ref-01.jpg through ref-05.jpg.
+         """
+         self.reference_dir = reference_dir
+         self._client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
+
+         # Validate reference images exist at construction time.
+         missing = [
+             f
+             for f in self.REFERENCE_FILENAMES
+             if not (reference_dir / f).exists()
+         ]
+         if missing:
+             raise FileNotFoundError(
+                 f"Missing reference images in {reference_dir}: {missing}"
+             )
+
+     def evaluate(self, slide_jpg_path: Path) -> dict:
+         """
+         Evaluate a generated slide against the McKinsey reference images.
+
+         Args:
+             slide_jpg_path: Absolute path to the slide JPG to evaluate.
+
+         Returns:
+             dict with keys:
+                 "scores": dict mapping the 7 dimension names to int scores
+                 "total": int, sum of all scores (0-100)
+                 "strengths": list[str]
+                 "weaknesses": list[str]
+                 "one_line_verdict": str
+
+         Raises:
+             FileNotFoundError: If slide_jpg_path does not exist.
+             json.JSONDecodeError: If the LLM returns malformed JSON.
+             ValueError: If required score keys are missing from the response.
+             RuntimeError: If the API call fails.
+         """
+         if not slide_jpg_path.exists():
+             raise FileNotFoundError(f"Slide JPG not found: {slide_jpg_path}")
+
+         # Build a flat list of Parts for the Gemini content parameter.
+         # Gemini accepts text strings and Part objects interleaved.
+         contents: list[types.Part | str] = []
+
+         # Reference images first.
+         contents.append(
+             "## REFERENCE IMAGES (Real McKinsey deck)\n"
+             "The following 5 images are from a real McKinsey & Company consulting "
+             "report. Study their visual style carefully."
+         )
+         for i, fname in enumerate(self.REFERENCE_FILENAMES, 1):
+             contents.append(_image_part(self.reference_dir / fname))
+             contents.append(f"(Reference page {i})")
+
+         # Candidate slide.
+         contents.append(
+             f"\n## CANDIDATE SLIDE TO EVALUATE\n"
+             f"This is the generated slide: {slide_jpg_path.name}"
+         )
+         contents.append(_image_part(slide_jpg_path))
+         contents.append(
+             "\nNow score this candidate slide against the McKinsey reference "
+             "using the rubric. Return ONLY the JSON object."
+         )
+
+         response = self._client.models.generate_content(
+             model=EVALUATOR_MODEL,
+             contents=contents,
+             config=types.GenerateContentConfig(
+                 system_instruction=EVALUATION_SYSTEM_PROMPT,
+                 max_output_tokens=2048,
+             ),
+         )
+
+         text = response.text.strip()
+
+         # Extract JSON object robustly (handles markdown fences and surrounding text).
+         json_match = re.search(r"\{.*\}", text, re.DOTALL)
+         if json_match:
+             text = json_match.group(0)
+
+         result = json.loads(text)
+
+         # Validate required keys are present.
+         required_score_keys = {
+             "background_layout",
+             "color_palette",
+             "typography",
+             "title_quality",
+             "data_presentation",
+             "structural_elements",
+             "overall_impression",
+         }
+         missing_keys = required_score_keys - set(result.get("scores", {}).keys())
+         if missing_keys:
+             raise ValueError(
+                 f"Evaluator response missing score keys: {missing_keys}. "
+                 f"Full response: {text[:500]}"
+             )
+
+         return result
openenv/models.py ADDED
@@ -0,0 +1,187 @@
+ """
+ Pydantic data models for the Slide Skill OpenEnv environment.
+
+ Action space:
+     SlideSkillAction is a discriminated union of two action types:
+     - EditSectionAction: Replace a named section's body in one skill file.
+     - ReplaceFileAction: Replace the entire content of one skill file.
+
+     EditSectionAction is appropriate when the agent wants surgical edits
+     (e.g., update only the typography section). ReplaceFileAction is used
+     when the optimizer rewrites the file wholesale, which is what the
+     historical optimizer LLM actually does.
+
+ Observation space:
+     SlideSkillObservation contains the full evaluator output, including all
+     seven score dimensions plus qualitative feedback fields.
+ """
+
+ from __future__ import annotations
+
+ from typing import Annotated, Literal
+
+ from pydantic import BaseModel, Field
+
+
+ # ---------------------------------------------------------------------------
+ # Actions
+ # ---------------------------------------------------------------------------
+
+ SkillFile = Literal["DESIGN_RULES.md", "EXAMPLES.md"]
+ """The two skill files the optimizer is allowed to modify."""
+
+
+ class EditSectionAction(BaseModel):
+     """
+     Replace the body of a named markdown section within a skill file.
+
+     The section is identified by its heading text (without the leading #
+     characters). The replacement spans from immediately after the heading
+     line to (but not including) the next heading of equal or higher level.
+
+     Example:
+         action = EditSectionAction(
+             file="DESIGN_RULES.md",
+             section_heading="Color Palette",
+             new_body="- Navy (#0C2340): primary\\n- White: background\\n"
+         )
+     """
+
+     action_type: Literal["edit_section"] = "edit_section"
+     file: SkillFile = Field(..., description="Which skill file to edit.")
+     section_heading: str = Field(
+         ...,
+         description=(
+             "Exact heading text (without leading # markers). "
+             "Case-sensitive. Must match a heading in the file."
+         ),
+     )
+     new_body: str = Field(
+         ...,
+         description="New markdown content for the section body (after the heading line).",
+     )
+
+
+ class ReplaceFileAction(BaseModel):
+     """
+     Replace the entire content of a skill file.
+
+     Use this when the optimizer rewrites the file wholesale rather than
+     making targeted section edits. This is the mode used by the historical
+     optimizer LLM in the Skill Forge loop.
+     """
+
+     action_type: Literal["replace_file"] = "replace_file"
+     file: SkillFile = Field(..., description="Which skill file to replace.")
+     new_content: str = Field(
+         ...,
+         description="Complete new file content (valid markdown).",
+     )
+
+
+ # Discriminated union — action_type is the discriminator field.
+ SlideSkillAction = Annotated[
+     EditSectionAction | ReplaceFileAction,
+     Field(discriminator="action_type"),
+ ]
+
+
+ # ---------------------------------------------------------------------------
+ # Scores
+ # ---------------------------------------------------------------------------
+
+
+ class SlideScores(BaseModel):
+     """Raw scores from the McKinsey evaluator. Each dimension is 0-15 except
+     overall_impression, which is 0-10. Total is 0-100."""
+
+     background_layout: int = Field(..., ge=0, le=15)
+     color_palette: int = Field(..., ge=0, le=15)
+     typography: int = Field(..., ge=0, le=15)
+     title_quality: int = Field(..., ge=0, le=15)
+     data_presentation: int = Field(..., ge=0, le=15)
+     structural_elements: int = Field(..., ge=0, le=15)
+     overall_impression: int = Field(..., ge=0, le=10)
+
+     @property
+     def total(self) -> int:
+         return (
+             self.background_layout
+             + self.color_palette
+             + self.typography
+             + self.title_quality
+             + self.data_presentation
+             + self.structural_elements
+             + self.overall_impression
+         )
+
+
+ # ---------------------------------------------------------------------------
+ # Observation
+ # ---------------------------------------------------------------------------
+
+
+ class SlideSkillObservation(BaseModel):
+     """
+     Observation returned to the agent after each step.
+
+     Contains the full evaluator output so the optimizer LLM has all the
+     information it needs to write the next skill revision: numeric scores,
+     qualitative strengths/weaknesses, and the one-line verdict.
+     """
+
+     scores: SlideScores
+     total: int = Field(..., description="Sum of all score dimensions (0-100).")
+     strengths: list[str] = Field(
+         default_factory=list,
+         description="List of what the slide does well, from the evaluator.",
+     )
+     weaknesses: list[str] = Field(
+         default_factory=list,
+         description="List of what needs improvement, from the evaluator.",
+     )
+     one_line_verdict: str = Field(
+         ..., description="Single-sentence summary from the evaluator."
+     )
+     reward: float = Field(
+         ...,
+         description=(
+             "Score delta vs. previous step, capped to [-30, +30] and "
+             "normalized to [-0.3, +0.3] by dividing by 100. "
+             "Capping reduces reward noise from LLM evaluation variance."
+         ),
+     )
+     step: int = Field(..., description="Current step index (0-based).")
+     done: bool = Field(..., description="True if max_steps reached.")
+     jpg_path: str = Field(
+         ..., description="Absolute path to the generated slide JPG."
+     )
+     design_rules_content: str = Field(
+         ...,
+         description="Current DESIGN_RULES.md content (after action was applied).",
+     )
+     examples_content: str = Field(
+         ...,
+         description="Current EXAMPLES.md content (after action was applied).",
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # State (internal, not exposed to client)
+ # ---------------------------------------------------------------------------
+
+
+ class SlideSkillState(BaseModel):
+     """Internal environment state. Not serialized to the client."""
+
+     session_id: str
+     step: int = 0
+     prev_total: int = 0  # score from the previous step (for reward calculation)
+     session_dir: str = Field(
+         ...,
+         description=(
+             "Absolute path to the isolated session directory under /tmp/. "
+             "Contains copies of DESIGN_RULES.md and EXAMPLES.md that this "
+             "session is free to modify without affecting other sessions."
+         ),
+     )
openenv/openenv.yaml ADDED
@@ -0,0 +1,92 @@
+ # OpenEnv environment manifest for Slide Skill
+ # https://openenv.dev/spec
+
+ name: slide-skill
+ version: "1.0.0"
+ description: >
+   Self-improving McKinsey-style PowerPoint slide generation environment.
+   The agent evolves DESIGN_RULES.md and EXAMPLES.md to maximize a visual
+   design score (0-100) evaluated by Gemini 3.1 Pro vision against 5 McKinsey
+   reference images.
+
+ author: Tesserae / Skill Forge Hackathon Team
+
+ supports_concurrent_sessions: true
+ max_steps: 7
+
+ # Approximate time budget per step (seconds).
+ # Each step: generator LLM (~20-40s) + Node.js (<5s) + LibreOffice (~15-30s)
+ #   + pdftoppm (<5s) + evaluator LLM (~30-60s)
+ step_timeout_seconds: 180
+
+ action_space:
+   type: union
+   discriminator: action_type
+   variants:
+     - name: edit_section
+       description: Replace the body of a named section in a skill file.
+       fields:
+         file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
+         section_heading: {type: string, description: "Exact heading text without # markers"}
+         new_body: {type: string, description: "New section body content in markdown"}
+
+     - name: replace_file
+       description: Replace the entire content of a skill file.
+       fields:
+         file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
+         new_content: {type: string, description: "Complete new file content"}
+
+ observation_space:
+   scores:
+     background_layout: {type: integer, min: 0, max: 15}
+     color_palette: {type: integer, min: 0, max: 15}
+     typography: {type: integer, min: 0, max: 15}
+     title_quality: {type: integer, min: 0, max: 15}
+     data_presentation: {type: integer, min: 0, max: 15}
+     structural_elements: {type: integer, min: 0, max: 15}
+     overall_impression: {type: integer, min: 0, max: 10}
+   total: {type: integer, min: 0, max: 100}
+   strengths: {type: array, items: string}
+   weaknesses: {type: array, items: string}
+   one_line_verdict: {type: string}
+   reward: {type: float, min: -0.3, max: 0.3}
+   step: {type: integer}
+   done: {type: boolean}
+   jpg_path: {type: string, description: "Absolute path to generated slide JPG"}
+   design_rules_content: {type: string}
+   examples_content: {type: string}
+
+ reward:
+   description: >
+     Normalized score delta vs. previous step, capped to [-0.3, +0.3].
+     Formula: clip(total_score - prev_total_score, -30, +30) / 100
+   range: [-0.3, 0.3]
+
+ baseline:
+   description: >
+     skill_files_baseline/ committed to the repo contains the minimal
+     starting DESIGN_RULES.md (teal palette, basic typography) and an
+     empty EXAMPLES.md. This is skill_v0 content — NOT any evolved version.
+
+ endpoints:
+   reset: POST /reset
+   step: POST /step
+   close: DELETE /sessions/{session_id}
+   health: GET /health
+
+ server:
+   host: 0.0.0.0
+   port: 8000
+   workers: 1  # Do not increase; LibreOffice is not thread-safe within one process
+
+ environment_variables:
+   required:
+     - name: GEMINI_API_KEY
+       description: >
+         Google Gemini API key. Used by all three LLM roles:
+         generator (Gemini 3 Flash), evaluator (Gemini 3.1 Pro),
+         and optimizer (Gemini 3.1 Pro).
+   optional:
+     - name: SLIDE_SKILL_MAX_STEPS
+       description: Override default max_steps per episode
+       default: "7"
openenv/skill_manager.py ADDED
@@ -0,0 +1,103 @@
+ """
+ Skill file manager — applies actions to an isolated session directory.
+
+ Operates exclusively on files within session_dir (a /tmp/ path).
+ Never touches the repo's baseline or any shared files.
+
+ Section editing rules:
+     A "section" is a markdown heading of any level (# to ######).
+     EditSectionAction finds the first heading whose text matches
+     section_heading (case-sensitive, stripped), then replaces everything
+     from the line after that heading up to (but not including) the next
+     heading of equal or higher level (i.e., same or fewer # characters).
+     If no next heading is found, the replacement extends to end-of-file.
+ """
+
+ from __future__ import annotations
+
+ import re
+ from pathlib import Path
+
+ from models import EditSectionAction, ReplaceFileAction, SlideSkillAction
+
+
+ class SkillManager:
+     """Manages DESIGN_RULES.md and EXAMPLES.md within a session directory."""
+
+     def __init__(self, session_dir: Path) -> None:
+         self.session_dir = session_dir
+
+     def apply(self, action: SlideSkillAction) -> None:
+         """
+         Dispatch to the appropriate handler based on action type.
+
+         Raises:
+             ValueError: If action_type is unrecognized or the section is not found.
+             FileNotFoundError: If the target skill file does not exist.
+         """
+         target = self.session_dir / action.file
+         if not target.exists():
+             raise FileNotFoundError(f"Skill file not found in session: {target}")
+
+         if action.action_type == "replace_file":
+             self._replace_file(target, action)  # type: ignore[arg-type]
+         elif action.action_type == "edit_section":
+             self._edit_section(target, action)  # type: ignore[arg-type]
+         else:
+             raise ValueError(f"Unknown action_type: {action.action_type!r}")
+
+     # ------------------------------------------------------------------
+     # Private helpers
+     # ------------------------------------------------------------------
+
+     @staticmethod
+     def _replace_file(target: Path, action: ReplaceFileAction) -> None:
+         """Overwrite the entire file with new_content."""
+         target.write_text(action.new_content, encoding="utf-8")
+
+     @staticmethod
+     def _edit_section(target: Path, action: EditSectionAction) -> None:
+         """Replace the body of a named markdown section."""
+         text = target.read_text(encoding="utf-8")
+         lines = text.splitlines(keepends=True)
+
+         # Find the heading line.
+         heading_pattern = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
+         heading_idx: int | None = None
+         heading_level: int = 0
+
+         for i, line in enumerate(lines):
+             m = heading_pattern.match(line.rstrip("\n\r"))
+             if m and m.group(2) == action.section_heading:
+                 heading_idx = i
+                 heading_level = len(m.group(1))
+                 break
+
+         if heading_idx is None:
+             raise ValueError(
+                 f"Section heading {action.section_heading!r} not found in {target.name}."
+             )
+
+         # Find where the section body ends (next heading of equal or higher level).
+         end_idx = len(lines)
+         for i in range(heading_idx + 1, len(lines)):
+             m = heading_pattern.match(lines[i].rstrip("\n\r"))
+             if m and len(m.group(1)) <= heading_level:
+                 end_idx = i
+                 break
+
+         # Reconstruct the file.
+         new_body = action.new_body
+         if new_body and not new_body.endswith("\n"):
+             new_body += "\n"
+
+         new_lines = (
+             lines[: heading_idx + 1]  # heading itself
+             + [new_body]
+             + lines[end_idx:]  # rest of file after the section
+         )
+         target.write_text("".join(new_lines), encoding="utf-8")
+
+     def read_file(self, filename: str) -> str:
+         """Read a skill file from the session directory."""
+         return (self.session_dir / filename).read_text(encoding="utf-8")
openenv/slide_generator.py ADDED
@@ -0,0 +1,284 @@
+ """
+ Slide Generator — orchestrates the full PPT generation pipeline.
+
+ Pipeline (in order):
+     1. LLM reads DESIGN_RULES.md + EXAMPLES.md + TASK_PROMPT.md + pptx/ tooling
+        → writes pptxgenjs JavaScript to generate.js in the session output dir.
+     2. `node generate.js` runs in the session output dir → produces slide.pptx.
+     3. `soffice --headless --convert-to pdf slide.pptx` → slide.pdf.
+     4. `pdftoppm -r 150 -jpeg -f 1 -l 1 slide.pdf slide` → slide-1.jpg (page 1).
+     5. Returns the Path to slide-1.jpg.
+
+ The generator LLM receives the pptx/ tooling files as context so it knows
+ the full pptxgenjs API — but those files are read-only; they are never
+ written to or returned in the observation.
+
+ Session isolation:
+     All generated artifacts (generate.js, slide.pptx, slide.pdf, slide-1.jpg)
+     are written into a subdirectory of session_dir so that concurrent sessions
+     do not share output paths.
+ """
+
+ from __future__ import annotations
+
+ import os
+ import re
+ import shutil
+ import subprocess
+ import textwrap
+ from pathlib import Path
+
+ from google import genai
+ from google.genai import types
+
+
+ REPO_ROOT = Path(__file__).parent.parent
+
+ # On macOS, LibreOffice installs to a .app bundle not on PATH by default.
+ _SOFFICE_MACOS = "/Applications/LibreOffice.app/Contents/MacOS/soffice"
+ SOFFICE = shutil.which("soffice") or (_SOFFICE_MACOS if Path(_SOFFICE_MACOS).exists() else "soffice")
+
+ # On macOS, poppler (pdftoppm) is installed via Homebrew — check both
+ # Apple Silicon and Intel prefix locations.
+ PDFTOPPM = (
+     shutil.which("pdftoppm")
+     or ("/opt/homebrew/bin/pdftoppm" if Path("/opt/homebrew/bin/pdftoppm").exists() else None)
+     or ("/usr/local/bin/pdftoppm" if Path("/usr/local/bin/pdftoppm").exists() else None)
+     or "pdftoppm"
+ )
+
+ # Gemini Flash: fast and cost-effective for code generation.
+ GENERATOR_MODEL = "gemini-3-flash-preview"
+ GENERATOR_MAX_TOKENS = 4096
+
+
+ class SlideGenerator:
+     """Drives the LLM → Node.js → LibreOffice → pdftoppm pipeline."""
+
+     def __init__(
+         self,
+         task_prompt_path: Path,
+         pptx_skill_dir: Path,
+         reference_dir: Path,
+     ) -> None:
+         self.task_prompt = task_prompt_path.read_text(encoding="utf-8")
+         self.pptx_skill_dir = pptx_skill_dir
+         self.reference_dir = reference_dir
+         self._client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
+
+     def generate(self, session_id: str, session_dir: Path) -> Path:
+         """
+         Run the full pipeline for one optimization step.
+
+         Args:
+             session_id: Used only for logging/naming.
+             session_dir: Isolated directory containing the session's
+                 DESIGN_RULES.md and EXAMPLES.md.
+
+         Returns:
+             Absolute path to the generated slide JPG (slide-1.jpg).
+
+         Raises:
+             RuntimeError: If any pipeline stage (LLM, Node, LibreOffice,
+                 pdftoppm) fails.
+         """
+         out_dir = session_dir / "output"
+         out_dir.mkdir(exist_ok=True)
+
+         js_path = out_dir / "generate.js"
+         pptx_path = out_dir / "slide.pptx"
+         jpg_stem = out_dir / "slide"
+         jpg_path = out_dir / "slide-1.jpg"
+
+         # Stage 1+2: LLM generates JS, Node executes it.
+         # Retry up to 3 times, feeding Node errors back to the LLM.
+         node_error: str | None = None
+         for attempt in range(1, 4):
+             js_code = self._call_generator_llm(session_dir, node_error=node_error)
+             js_path.write_text(js_code, encoding="utf-8")
+             try:
+                 self._run(["node", str(js_path)], cwd=out_dir, stage="node generate.js")
+                 node_error = None
+                 break
+             except RuntimeError as exc:
+                 node_error = str(exc)
+                 if attempt == 3:
+                     raise
+         if not pptx_path.exists():
+             raise RuntimeError(
+                 f"node generate.js completed but {pptx_path} was not created."
+             )
+
+         # Stage 3: LibreOffice converts .pptx → .pdf.
+         self._run(
+             [
+                 SOFFICE,
+                 "--headless",
+                 "--convert-to",
+                 "pdf",
+                 "--outdir",
+                 str(out_dir),
+                 str(pptx_path),
+             ],
+             cwd=out_dir,
+             stage="soffice --convert-to pdf",
+         )
+         pdf_path = out_dir / "slide.pdf"
+         if not pdf_path.exists():
+             raise RuntimeError(
+                 f"LibreOffice completed but {pdf_path} was not created."
+             )
+
+         # Stage 4: pdftoppm converts PDF page 1 → JPG at 150 DPI.
+         # Output: slide-1.jpg (pdftoppm appends "-{page}" automatically).
+         self._run(
+             [
+                 PDFTOPPM,
+                 "-r",
+                 "150",
+                 "-jpeg",
+                 "-f",
+                 "1",
+                 "-l",
+                 "1",  # only page 1
+                 str(pdf_path),
+                 str(jpg_stem),
+             ],
+             cwd=out_dir,
+             stage="pdftoppm",
+         )
+         if not jpg_path.exists():
+             raise RuntimeError(
+                 f"pdftoppm completed but {jpg_path} was not created."
+             )
+
+         return jpg_path
+
+     # ------------------------------------------------------------------
+     # Private helpers
+     # ------------------------------------------------------------------
+
+     def _call_generator_llm(self, session_dir: Path, node_error: str | None = None) -> str:
+         """
+         Call the generator LLM with skill files + task prompt as context.
+
+         Returns the raw JavaScript code string (without markdown fences).
+         """
+         design_rules = (session_dir / "DESIGN_RULES.md").read_text(encoding="utf-8")
+         examples = (session_dir / "EXAMPLES.md").read_text(encoding="utf-8")
+
+         # Load the generic pptx tooling files as executor context.
+         pptx_skill = self._read_pptx_skill()
+
+         system_prompt = textwrap.dedent("""\
+             You are an expert pptxgenjs developer. You will write a complete,
175
+ runnable Node.js script that generates a PowerPoint slide using
176
+ the pptxgenjs library.
177
+
178
+ Rules:
179
+ - Output ONLY the JavaScript code. No markdown fences, no explanation.
180
+ - The script must save the file as "slide.pptx" in the current directory.
181
+ - Follow the DESIGN_RULES.md and EXAMPLES.md exactly.
182
+ - Use the pptxgenjs API reference below for correct method calls.
183
+ """)
184
+
185
+ user_message = textwrap.dedent(f"""\
186
+ ## pptxgenjs API Reference
187
+ {pptx_skill}
188
+
189
+ ## Brand Style: DESIGN_RULES.md
190
+ {design_rules}
191
+
192
+ ## Brand Style: EXAMPLES.md
193
+ {examples}
194
+
195
+ ## Task
196
+ {self.task_prompt}
197
+
198
+ Write the complete pptxgenjs JavaScript file now.
199
+ """)
200
+
201
+ if node_error:
202
+ user_message += textwrap.dedent(f"""
203
+
204
+ ## Previous attempt failed — fix these errors
205
+ Your previous script produced the following Node.js error.
206
+ Rewrite the script and fix the issue:
207
+
208
+ {node_error}
209
+ """)
210
+
211
+ response = self._client.models.generate_content(
212
+ model=GENERATOR_MODEL,
213
+ contents=user_message,
214
+ config=types.GenerateContentConfig(
215
+ system_instruction=system_prompt,
216
+ max_output_tokens=GENERATOR_MAX_TOKENS,
217
+ ),
218
+ )
219
+
220
+ code = response.text.strip()
221
+
222
+ # Extract from markdown code fence if present (LLMs often add them
223
+ # despite instructions). Handles ```javascript, ```js, or plain ```.
224
+ fence_match = re.search(r"```(?:javascript|js)?\n(.*?)```", code, re.DOTALL)
225
+ if fence_match:
226
+ code = fence_match.group(1).strip()
227
+
228
+ # Rewrite all bare require('pkg') calls to absolute paths so the
229
+ # script works when run from any /tmp/ directory. We only rewrite
230
+ # packages that actually exist in node_modules; unknown packages are
231
+ # left untouched (they'd fail at runtime but at least not silently).
232
+ node_modules = REPO_ROOT / "node_modules"
233
+
234
+ def _rewrite_require(m: re.Match) -> str:
235
+ quote = m.group(1)
236
+ pkg = m.group(2)
237
+ pkg_path = node_modules / pkg
238
+ if pkg_path.exists():
239
+ return f"require({quote}{pkg_path}{quote})"
240
+ return m.group(0) # leave unknown packages as-is
241
+
242
+ code = re.sub(r"require\((['\"])([^./][^'\"]*)\1\)", _rewrite_require, code)
243
+
244
+ # LLMs sometimes emit the require line twice. Keep only the first
245
+ # declaration to avoid "Identifier already declared" SyntaxError.
246
+ seen: set[str] = set()
247
+ deduped = []
248
+ for line in code.splitlines():
249
+ m = re.search(r"require\(['\"]([^'\"]+)['\"]\)", line)
250
+ if m and "node_modules" in line:
251
+ pkg = m.group(1)
252
+ if pkg in seen:
253
+ continue
254
+ seen.add(pkg)
255
+ deduped.append(line)
256
+ code = "\n".join(deduped)
257
+
258
+ return code
259
+
260
+ def _read_pptx_skill(self) -> str:
261
+ """Concatenate the generic pptx skill files for LLM context."""
262
+ parts = []
263
+ for fname in ("SKILL.md", "editing.md", "pptxgenjs.md"):
264
+ p = self.pptx_skill_dir / fname
265
+ if p.exists():
266
+ parts.append(f"### {fname}\n{p.read_text(encoding='utf-8')}")
267
+ return "\n\n".join(parts)
268
+
269
+ @staticmethod
270
+ def _run(cmd: list[str], cwd: Path, stage: str) -> None:
271
+ """Run a subprocess; raise RuntimeError with context if it fails."""
272
+ result = subprocess.run(
273
+ cmd,
274
+ cwd=cwd,
275
+ capture_output=True,
276
+ text=True,
277
+ timeout=300, # 5 min hard limit per stage
278
+ )
279
+ if result.returncode != 0:
280
+ raise RuntimeError(
281
+ f"Pipeline stage '{stage}' failed (exit {result.returncode}).\n"
282
+ f"stdout: {result.stdout[-2000:]}\n"
283
+ f"stderr: {result.stderr[-2000:]}"
284
+ )
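The bare-`require` rewriting in `_call_generator_llm` hinges on a single regex with a backreferenced quote character. A standalone sketch of that rewrite — the `node_modules` path and the set of "installed" packages are made up for illustration (the real code checks `pkg_path.exists()`):

```python
import re
from pathlib import Path

# Hypothetical node_modules location; the real code derives this from REPO_ROOT.
node_modules = Path("/srv/app/node_modules")
installed = {"pptxgenjs"}  # pretend only this package exists on disk

def rewrite_require(m: re.Match) -> str:
    quote, pkg = m.group(1), m.group(2)
    if pkg in installed:
        # Rewrite to an absolute path so the script runs from any cwd.
        return f"require({quote}{node_modules / pkg}{quote})"
    return m.group(0)  # unknown package: leave untouched

code = (
    'const pptxgen = require("pptxgenjs");\n'
    'const helper = require("./helper");\n'   # relative path: [^./] rejects it
    'const missing = require("left-pad");'
)
out = re.sub(r"require\((['\"])([^./][^'\"]*)\1\)", rewrite_require, code)
```

Only the bare, installed package gets rewritten; relative requires (`./helper`) never match the pattern, and unknown bare packages (`left-pad`) match but are returned unchanged.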
openenv/slide_skill_environment.py ADDED
@@ -0,0 +1,179 @@
+ """
+ Slide Skill Environment — OpenEnv-compatible environment for optimizing
+ McKinsey-style PowerPoint slide generation.
+
+ Concurrency model:
+     SUPPORTS_CONCURRENT_SESSIONS = True
+
+     Each session gets an isolated working directory at /tmp/slide_skill_{session_id}/.
+     Skill files (DESIGN_RULES.md, EXAMPLES.md) are copied there on reset() and
+     modified in place during the session. The shared repo files are never modified,
+     so multiple sessions can run simultaneously without file conflicts.
+
+     The only shared resource is the Gemini API key, which is rate-limited
+     per account. On HuggingFace Spaces, 2-3 concurrent sessions are
+     realistic before hitting rate limits.
+
+ Episode timing:
+     Each step involves two LLM calls (generator + evaluator) plus Node.js and
+     LibreOffice. Expect 60-120 seconds per step. At max_steps=7, a full episode
+     runs 7-14 minutes.
+
+ Reward function:
+     reward = clip(total_score - prev_total_score, -30, +30) / 100
+
+     Capping at +/-30 points (+/-0.3 reward) dampens LLM evaluation noise: a score
+     can fluctuate +/-5-10 points between identical slides due to evaluator variance,
+     so capping prevents large undeserved penalties or bonuses.
+ """
+
+ from __future__ import annotations
+
+ import os
+ import shutil
+ import uuid
+ from pathlib import Path
+ from typing import ClassVar
+
+ from models import (
+     SlideScores,
+     SlideSkillAction,
+     SlideSkillObservation,
+     SlideSkillState,
+ )
+ from skill_manager import SkillManager
+ from slide_generator import SlideGenerator
+ from evaluator_adapter import EvaluatorAdapter
+
+
+ # Paths relative to repo root — adjust if the package moves.
+ REPO_ROOT = Path(__file__).parent.parent
+ BASELINE_DIR = REPO_ROOT / "skill_files_baseline"
+ TASK_PROMPT_PATH = REPO_ROOT / "output" / "TASK_PROMPT.md"
+ REFERENCE_DIR = REPO_ROOT / "output" / "reference"
+
+ # Reward capping parameters
+ REWARD_CLIP_POINTS = 30  # clip score delta to +/-30 before normalizing
+ REWARD_SCALE = 100.0     # divide clipped delta by this to get [-0.3, +0.3]
+
+ MAX_STEPS = int(os.environ.get("SLIDE_SKILL_MAX_STEPS", "7"))
+
+
+ class SlideSkillEnvironment:
+     """OpenEnv environment for the Skill Forge optimization loop."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: ClassVar[bool] = True
+
+     def __init__(self) -> None:
+         self._sessions: dict[str, SlideSkillState] = {}
+         self._generator = SlideGenerator(
+             task_prompt_path=TASK_PROMPT_PATH,
+             pptx_skill_dir=REPO_ROOT / "pptx",
+             reference_dir=REFERENCE_DIR,
+         )
+         self._evaluator = EvaluatorAdapter(reference_dir=REFERENCE_DIR)
+
+     # ------------------------------------------------------------------
+     # Public OpenEnv interface
+     # ------------------------------------------------------------------
+
+     def reset(self, session_id: str | None = None) -> str:
+         """
+         Initialize or reinitialize a session.
+
+         Creates an isolated working directory under /tmp/ and copies the
+         baseline skill files into it. Returns the session_id.
+         """
+         session_id = session_id or str(uuid.uuid4())
+
+         session_dir = Path(f"/tmp/slide_skill_{session_id}")
+         if session_dir.exists():
+             shutil.rmtree(session_dir)
+         session_dir.mkdir(parents=True)
+
+         # Copy baseline skill files into the session directory.
+         for fname in ("DESIGN_RULES.md", "EXAMPLES.md"):
+             src = BASELINE_DIR / fname
+             if not src.exists():
+                 raise FileNotFoundError(
+                     f"Baseline file missing: {src}. "
+                     "Commit skill_files_baseline/ to the repo."
+                 )
+             shutil.copy2(src, session_dir / fname)
+
+         self._sessions[session_id] = SlideSkillState(
+             session_id=session_id,
+             step=0,
+             prev_total=0,
+             session_dir=str(session_dir),
+         )
+         return session_id
+
+     def step(self, session_id: str, action: SlideSkillAction) -> SlideSkillObservation:
+         """
+         Apply an action, run the generation pipeline, evaluate, and return
+         an observation.
+
+         Args:
+             session_id: Must be a live session (call reset() first).
+             action: Either EditSectionAction or ReplaceFileAction.
+
+         Returns:
+             SlideSkillObservation with scores, feedback, reward, and file contents.
+
+         Raises:
+             KeyError: If session_id is not found.
+             RuntimeError: If the generation or evaluation pipeline fails.
+         """
+         state = self._sessions[session_id]
+         session_dir = Path(state.session_dir)
+
+         # 1. Apply the action to the session's skill files.
+         manager = SkillManager(session_dir)
+         manager.apply(action)
+
+         # 2. Run the full generation pipeline.
+         jpg_path = self._generator.generate(
+             session_id=session_id,
+             session_dir=session_dir,
+         )
+
+         # 3. Evaluate the generated slide.
+         eval_result = self._evaluator.evaluate(jpg_path)
+
+         # 4. Compute reward (capped score delta).
+         delta = eval_result["total"] - state.prev_total
+         clipped_delta = max(-REWARD_CLIP_POINTS, min(REWARD_CLIP_POINTS, delta))
+         reward = clipped_delta / REWARD_SCALE
+
+         # 5. Update state.
+         state.step += 1
+         state.prev_total = eval_result["total"]
+         done = state.step >= MAX_STEPS
+
+         # 6. Read back current file contents for the observation.
+         design_rules = (session_dir / "DESIGN_RULES.md").read_text(encoding="utf-8")
+         examples = (session_dir / "EXAMPLES.md").read_text(encoding="utf-8")
+
+         scores = SlideScores(**eval_result["scores"])
+
+         return SlideSkillObservation(
+             scores=scores,
+             total=eval_result["total"],
+             strengths=eval_result.get("strengths", []),
+             weaknesses=eval_result.get("weaknesses", []),
+             one_line_verdict=eval_result["one_line_verdict"],
+             reward=reward,
+             step=state.step,
+             done=done,
+             jpg_path=str(jpg_path),
+             design_rules_content=design_rules,
+             examples_content=examples,
+         )
+
+     def close(self, session_id: str) -> None:
+         """Clean up session resources. Deletes the /tmp/ session directory."""
+         if session_id in self._sessions:
+             state = self._sessions.pop(session_id)
+             session_dir = Path(state.session_dir)
+             if session_dir.exists():
+                 shutil.rmtree(session_dir)
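The capped-delta reward computed in `step()` can be checked in isolation. A minimal sketch mirroring the `REWARD_CLIP_POINTS` / `REWARD_SCALE` constants:

```python
REWARD_CLIP_POINTS = 30
REWARD_SCALE = 100.0

def reward_from_scores(total: int, prev_total: int) -> float:
    """Clip the score delta to +/-30 points, then normalize to [-0.3, +0.3]."""
    delta = total - prev_total
    clipped = max(-REWARD_CLIP_POINTS, min(REWARD_CLIP_POINTS, delta))
    return clipped / REWARD_SCALE

# A +45-point jump and a +30-point jump earn the same capped reward,
# so one noisy evaluator reading cannot dominate the episode return.
print(reward_from_scores(85, 40))  # 0.3
print(reward_from_scores(70, 40))  # 0.3
print(reward_from_scores(62, 70))  # -0.08
```

Note the cap is symmetric: a regression is penalized at most -0.3 per step, matching the docstring's rationale that identical slides can score +/-5-10 points apart across evaluator runs.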
pyproject.toml ADDED
@@ -0,0 +1,50 @@
+ [build-system]
+ requires = ["hatchling"]
+ build-backend = "hatchling.build"
+
+ [project]
+ name = "slide-skill-openenv"
+ version = "1.0.0"
+ description = "OpenEnv environment for McKinsey-style PowerPoint slide optimization"
+ requires-python = ">=3.12"
+
+ # Core runtime dependencies (required for the environment to run)
+ dependencies = [
+     "google-genai>=1.0.0",  # Gemini API client (generator + evaluator + optimizer)
+     "pydantic>=2.6.0",      # Data models with discriminated unions
+     "httpx>=0.27.0",        # HTTP client for client.py
+     "loguru>=0.7.0",        # Structured logging for the client
+ ]
+
+ [project.optional-dependencies]
+ # Server dependencies (required for app.py)
+ server = [
+     "fastapi>=0.111.0",
+     "uvicorn[standard]>=0.30.0",
+     "python-multipart>=0.0.9",  # FastAPI form parsing
+     "python-dotenv>=1.0.0",     # Load .env file automatically
+ ]
+
+ # Development and testing
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-asyncio>=0.23.0",
+     "httpx>=0.27.0",  # for FastAPI TestClient
+     "ruff>=0.4.0",
+     "mypy>=1.10.0",
+ ]
+
+ [tool.hatch.build.targets.wheel]
+ packages = ["openenv"]
+
+ [tool.ruff]
+ target-version = "py312"
+ line-length = 88
+
+ [tool.ruff.lint]
+ select = ["E", "F", "I", "UP"]
+
+ [tool.mypy]
+ python_version = "3.12"
+ strict = true
+ ignore_missing_imports = true
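The `pydantic` dependency comment mentions discriminated unions for the action models. A sketch of how the two action types taken by `step()` might be modeled — the field names and `kind` tags here are hypothetical; the real definitions live in `openenv/models.py` and may differ:

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter

class EditSectionAction(BaseModel):
    # "kind" acts as the discriminator tag for fast, unambiguous dispatch.
    kind: Literal["edit_section"] = "edit_section"
    file: str
    section: str
    new_content: str

class ReplaceFileAction(BaseModel):
    kind: Literal["replace_file"] = "replace_file"
    file: str
    content: str

# With a discriminator, pydantic picks the right model from the tag alone
# instead of trying each union member in turn.
SlideSkillAction = Annotated[
    Union[EditSectionAction, ReplaceFileAction],
    Field(discriminator="kind"),
]

adapter = TypeAdapter(SlideSkillAction)
action = adapter.validate_python(
    {"kind": "replace_file", "file": "DESIGN_RULES.md", "content": "# New rules"}
)
```

A payload with an unknown `kind` fails validation immediately with an error naming the allowed tags, which makes malformed agent actions easy to debug server-side.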
skill_files_baseline/DESIGN_RULES.md ADDED
@@ -0,0 +1,19 @@
+ # Design Rules (Original pptx skill defaults)
+
+ ## Color Palette
+ Pick from the skill's built-in palettes. For a hydrogen/energy topic, use "Teal Trust":
+ - Primary: `028090` (teal)
+ - Secondary: `00A896` (seafoam)
+ - Accent: `02C39A` (mint)
+ - Commit to dark throughout for a premium feel.
+
+ ## Typography
+ - Title: Georgia, 36-44pt, bold
+ - Body: Calibri, 14-16pt
+ - Captions: 10-12pt, muted
+
+ ## Layout
+ - 0.5" minimum margins
+ - 0.3-0.5" between content blocks
+ - Timeline or process flow for data display
+ - NEVER use accent lines under titles
skill_files_baseline/EXAMPLES.md ADDED
@@ -0,0 +1,2 @@
+ # Examples
+ (Empty — no prior optimization rounds)