kabalan and Claude Opus 4.6 committed on
Commit · 20b7748
Parent(s): 64d69d7
Add OpenEnv server implementation and Python packaging
Refactors the classical optimization loop into an OpenEnv-compatible environment with a FastAPI server, Docker support, and standardized action/observation spaces. Adds skill_files_baseline/ as a committed minimal starting point. Updates README with server setup, Docker instructions, and API documentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- .env.example +30 -0
- .gitignore +7 -0
- README.MD +145 -32
- agent_docs/openenv_migration_plan_v2.md +1728 -0
- openenv/Dockerfile +70 -0
- openenv/app.py +131 -0
- openenv/client.py +249 -0
- openenv/evaluator_adapter.py +259 -0
- openenv/models.py +187 -0
- openenv/openenv.yaml +92 -0
- openenv/skill_manager.py +103 -0
- openenv/slide_generator.py +284 -0
- openenv/slide_skill_environment.py +179 -0
- pyproject.toml +50 -0
- skill_files_baseline/DESIGN_RULES.md +19 -0
- skill_files_baseline/EXAMPLES.md +2 -0
.env.example
ADDED
@@ -0,0 +1,30 @@
+# Slide Skill OpenEnv — Environment Variables
+#
+# Copy this file to .env and fill in the values.
+# Never commit .env to version control.
+
+# ---------------------------------------------------------------------------
+# Required
+# ---------------------------------------------------------------------------
+
+# Google Gemini API key — used by all three LLM roles:
+#   Generator: Gemini 3 Flash (writes pptxgenjs JavaScript)
+#   Evaluator: Gemini 3.1 Pro (scores the slide with vision)
+#   Optimizer: Gemini 3.1 Pro (rewrites DESIGN_RULES.md)
+# Get your key at: https://aistudio.google.com/app/apikey
+GEMINI_API_KEY=your_gemini_api_key_here
+
+# ---------------------------------------------------------------------------
+# Optional — override defaults
+# ---------------------------------------------------------------------------
+
+# Maximum number of optimization steps per episode (default: 7).
+# Each step takes ~60-120s. At 7 steps, a full episode runs ~7-14 minutes.
+# Reduce for faster iteration during development; increase for deeper optimization.
+# SLIDE_SKILL_MAX_STEPS=7
+
+# ---------------------------------------------------------------------------
+# HuggingFace Spaces (set these as Space secrets, not in .env)
+# ---------------------------------------------------------------------------
+# When deploying to HF Spaces, add GEMINI_API_KEY as a repository secret
+# via the Space settings UI. Do not hardcode it in the Dockerfile or source.
.gitignore
CHANGED
@@ -1,2 +1,9 @@
 node_modules/
 .DS_Store
+.env
+__pycache__/
+*.pyc
+.mypy_cache/
+.ruff_cache/
+dist/
+*.egg-info/
README.MD
CHANGED
@@ -26,29 +26,30 @@ A fixed task is used across all rounds so improvements are solely from skill optimization.
 
 > Generate a 1-slide timeline PowerPoint about Dutch Hydrogen Strategy (2020-2035) in McKinsey & Company consulting style.
 
-##
-
-skill_vN/
-├── DESIGN_RULES.md   # Colors, fonts, spacing rules
-└── EXAMPLES.md       # Good/bad patterns (grows over rounds)
-```
-
+## What Gets Optimized
 
+There are two distinct layers of "skill files":
+
+| Layer | Location | Purpose | Optimized? |
+|-------|----------|---------|------------|
+| Generic pptx tooling | `pptx/` | Teaches Claude how to use pptxgenjs (API reference, shapes, coordinates) | **No** — stable Anthropic skill |
+| Brand style guidelines | `skill_vN/` or `skill_files_baseline/` | McKinsey-specific colors, typography, structural elements | **Yes** — evolves each round |
+
+The optimizer rewrites `DESIGN_RULES.md` and `EXAMPLES.md` each round. The `pptx/` skill files are never touched.
+
+## Results (Classical Loop)
 
 Ran 5 rounds (v0 through v4). Final score: **89/100**.
 
-| Dimension | Score
-|-----------|-------
-| Background & Layout | 14 |
-| Color Palette | 14 |
-| Typography | 13 |
-| Title Quality | 15 |
-| Data Presentation | 12 |
-| Structural Elements | 13 |
-| Overall Impression | 8
+| Dimension | Score |
+|-----------|-------|
+| Background & Layout | 14/15 |
+| Color Palette | 14/15 |
+| Typography | 13/15 |
+| Title Quality | 15/15 |
+| Data Presentation | 12/15 |
+| Structural Elements | 13/15 |
+| Overall Impression | 8/10 |
 
 **Verdict:** A highly professional slide that closely mirrors McKinsey's visual language with an insight-driven title, restrained color palette, and proper structural elements.
 
@@ -57,39 +58,151 @@ Ran 5 rounds (v0 through v4). Final score: **89/100**.
 ```
 Skill-Forge/
 ├── README.MD
-├── package.json
-├──
+├── package.json               # pptxgenjs ^4.0.1
+├── pyproject.toml             # Python package (OpenEnv server)
+├── .env.example               # Environment variable reference
+│
+├── pptx/                      # Generic pptx skill (DO NOT MODIFY)
 │   ├── SKILL.md
 │   ├── pptxgenjs.md
 │   ├── editing.md
-│   └── scripts/  # Office utilities (unpack, validate, thumbnail
-│   ├──
-│
-│
-│   ├──
-│   ├──
-│   ├──
-│
+│   └── scripts/               # Office utilities (unpack, validate, thumbnail)
+│
+├── skill_files_baseline/      # Committed minimal baseline (skill_v0 content)
+│   ├── DESIGN_RULES.md        # Starting style rules (teal palette, basic typography)
+│   └── EXAMPLES.md            # Empty — no prior rounds
+│
+├── openenv/                   # OpenEnv environment (new)
+│   ├── app.py                 # FastAPI server (POST /reset, /step, DELETE /sessions)
+│   ├── client.py              # Reference client + LLM optimizer loop
+│   ├── models.py              # Pydantic models: actions, observation, state
+│   ├── slide_skill_environment.py  # Core environment logic (reset, step, close)
+│   ├── skill_manager.py       # Applies EditSection / ReplaceFile actions
+│   ├── slide_generator.py     # LLM → JS → Node → LibreOffice → JPG pipeline
+│   ├── evaluator_adapter.py   # Gemini 3.1 Pro vision evaluator (reusable class)
+│   ├── openenv.yaml           # OpenEnv manifest
+│   └── Dockerfile             # Node.js + LibreOffice + poppler + Python
+│
+└── output/
+    ├── TASK_PROMPT.md         # Fixed task used every round
+    ├── reference/             # Gold-standard McKinsey reference images (JPGs)
+    ├── skill_v0/ .. skill_v5/ # Historical skill versions
+    ├── generate_v0.js .. v5.js  # Historical generated JS scripts
+    ├── slide_v0.pptx .. v5.pptx # Historical generated slides
+    ├── evaluator.py           # Original standalone evaluator script
+    └── evaluation_results.json  # Score progression
 ```
 
 ## Prerequisites
 
+### Classical loop (manual)
 - Node.js
 - Python 3
 - LibreOffice (`soffice`) for PDF conversion
 - Poppler (`pdftoppm`) for PDF-to-image conversion
 
+### OpenEnv server
+All of the above, plus Python 3.12+ and the packages in `pyproject.toml`.
+
 ## Setup
 
 ```bash
+# Node dependencies (pptxgenjs)
 npm install
-
+
+# Python dependencies
+pip install -e ".[server]"
+
+# Environment variables
+cp .env.example .env
+# Edit .env and set GEMINI_API_KEY
+```
+
+## Running the OpenEnv Server
+
+```bash
+cd openenv
+uvicorn app:app --host 0.0.0.0 --port 8000 --workers 1
+```
+
+Then run the reference client (full optimization loop):
+
+```bash
+python openenv/client.py --server http://localhost:8000 --max-steps 7
 ```
 
+Or a smoke test (single step):
+
+```bash
+python openenv/client.py --server http://localhost:8000 --smoke-test
+```
+
+## Docker
+
+```bash
+# Build
+docker build -f openenv/Dockerfile -t slide-skill-openenv .
+
+# Run
+docker run -p 8000:8000 -e GEMINI_API_KEY=$GEMINI_API_KEY slide-skill-openenv
+```
+
+> **Note:** The Docker image is ~600-700 MB due to LibreOffice (~500 MB). LibreOffice is required for `.pptx → .pdf` conversion and has no lighter alternative that faithfully renders pptxgenjs output.
+
+## OpenEnv Action Space
+
+The agent can submit two types of actions each step:
+
+**`replace_file`** — Rewrite an entire skill file (matches how the historical optimizer works):
+```json
+{
+  "action_type": "replace_file",
+  "file": "DESIGN_RULES.md",
+  "new_content": "# Design Rules\n\n## Color Palette\n- Navy (#0C2340)..."
+}
+```
+
+**`edit_section`** — Surgically update one markdown section:
+```json
+{
+  "action_type": "edit_section",
+  "file": "DESIGN_RULES.md",
+  "section_heading": "Color Palette",
+  "new_body": "- Navy (#0C2340): primary\n- White: background\n"
+}
+```
+
+## Observation Space
+
+Each step returns:
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `scores.background_layout` | int 0–15 | White bg, margins, layout |
+| `scores.color_palette` | int 0–15 | Navy/white/grey restraint |
+| `scores.typography` | int 0–15 | Font hierarchy, serif title |
+| `scores.title_quality` | int 0–15 | "So-what" insight title |
+| `scores.data_presentation` | int 0–15 | Structured table format |
+| `scores.structural_elements` | int 0–15 | Divider line, footer, footnotes |
+| `scores.overall_impression` | int 0–10 | Holistic McKinsey feel |
+| `total` | int 0–100 | Sum of all scores |
+| `strengths` | list[str] | What the slide does well |
+| `weaknesses` | list[str] | What to improve |
+| `one_line_verdict` | str | Evaluator summary |
+| `reward` | float –0.3…+0.3 | Capped score delta / 100 |
+| `done` | bool | True when max_steps reached |
+| `design_rules_content` | str | Current DESIGN_RULES.md |
+| `examples_content` | str | Current EXAMPLES.md |
+
+## Environment Variables
+
+See `.env.example` for the full reference.
+
+| Variable | Required | Default | Description |
+|----------|----------|---------|-------------|
+| `GEMINI_API_KEY` | Yes | — | Gemini API key — generator (Flash), evaluator + optimizer (Pro) |
+| `SLIDE_SKILL_MAX_STEPS` | No | `7` | Steps per episode (~60-120s each) |
+
 ## License
 
 ISC
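The `reward` row in the observation table above is the round-over-round score delta, divided by 100 and capped at ±0.3. A minimal sketch of that formula (the function name is assumed, not from the repo):

```python
def step_reward(prev_total: int, new_total: int, cap: float = 0.3) -> float:
    """Score delta scaled by /100 and clamped to [-cap, +cap],
    matching the reward row in the observation table."""
    delta = (new_total - prev_total) / 100.0
    return max(-cap, min(cap, delta))
```

For example, improving from 72 to 89 yields a reward of 0.17, while a jump from 20 to 95 is capped at 0.3.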
agent_docs/openenv_migration_plan_v2.md
ADDED
@@ -0,0 +1,1728 @@
+# OpenEnv Migration Plan v2 — Skill Forge → OpenEnv Environment
+
+**Date**: 2026-03-07
+**Status**: Implementation-ready
+**Target**: HuggingFace Spaces (OpenEnv-compatible)
+
+---
+
+## 1. Overview
+
+Skill Forge is a self-improving PowerPoint generation loop that, starting from a minimal brand-style baseline, iteratively improves a McKinsey-style slide by evolving two skill files. The loop reached 89/100 in 5 iterations.
+
+**What is being optimized**: Two brand/task-specific files — `DESIGN_RULES.md` and `EXAMPLES.md` — that guide an LLM's pptxgenjs code generation. These files encode McKinsey visual design rules (color palette, typography, structural elements) and accumulated example guidance.
+
+**What is NOT being optimized**: The generic pptx tooling skill in `pptx/` (SKILL.md, editing.md, pptxgenjs.md). These files define how the agent-as-executor uses pptxgenjs and remain unchanged across all optimization rounds.
+
+**What OpenEnv adds**: A standardized environment interface so that any RL/optimization agent can drive the Skill Forge loop without knowing its internals. The environment exposes `reset()`, `step(action)`, and `observe()` via a gRPC/HTTP server defined by the OpenEnv protocol.
+
+**Full generation pipeline per step**:
+
+```
+Agent issues action (edit skill files)
+        ↓
+skill_manager.py applies edit to isolated session directory
+        ↓
+slide_generator.py: LLM reads DESIGN_RULES.md + EXAMPLES.md + TASK_PROMPT.md
+        → writes JavaScript (pptxgenjs)
+        ↓
+node generate.js → slide.pptx
+        ↓
+soffice --headless --convert-to pdf slide.pptx
+        ↓
+pdftoppm -r 150 slide.pdf slide → slide-1.jpg
+        ↓
+evaluator.py: Claude Opus 4.6 + vision → scores JSON
+        ↓
+Observation returned to agent
+```
+
+Each step takes approximately 60–120 seconds (two LLM API calls + Node.js + LibreOffice). At `max_steps=10` an episode runs 10–20 minutes. For HuggingFace Spaces with resource constraints, **5–7 steps per episode is more realistic**.
+
+---
+
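The `soffice`/`pdftoppm` stages of the pipeline above can be sketched as follows. This is a simplified sketch: the function names are assumptions, and the `-jpeg` flag is added because `pdftoppm` emits PPM by default while the plan calls for JPG output.

```python
import subprocess
from pathlib import Path

def conversion_commands(pptx: Path) -> list[list[str]]:
    """Build the two conversion commands run after Node emits slide.pptx."""
    pdf = pptx.with_suffix(".pdf")
    return [
        # LibreOffice headless: slide.pptx → slide.pdf (written next to the input)
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(pptx.parent), str(pptx)],
        # Poppler: slide.pdf → slide-1.jpg at 150 DPI
        ["pdftoppm", "-jpeg", "-r", "150", str(pdf), str(pdf.with_suffix(""))],
    ]

def render_slide(pptx: Path) -> None:
    """Run both stages, failing fast if either tool errors."""
    for cmd in conversion_commands(pptx):
        subprocess.run(cmd, check=True)
```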
+## 2. Conceptual Clarification
+
+Understanding which files are "the skill" is critical. There are two distinct layers:
+
+### Layer 1 — Generic pptx Agent Tooling (`pptx/`)
+
+These files live in `pptx/` and are maintained by Anthropic. They teach the LLM agent *how to use pptxgenjs as a tool* — the API, shape types, coordinate systems, etc. They are analogous to a standard library: stable, versioned independently, and not task-specific.
+
+```
+pptx/
+├── SKILL.md      # pptxgenjs capability overview and agent instructions
+├── editing.md    # Shape editing primitives and patterns
+└── pptxgenjs.md  # Full pptxgenjs API reference
+```
+
+**These files are read by the agent-as-executor (the slide generator LLM). They are NEVER the target of optimization.**
+
+### Layer 2 — Evolving Brand Style Files (the "skill" being optimized)
+
+These files live in `skill_v{N}/` and encode McKinsey-specific visual design knowledge:
+
+```
+skill_v0/
+├── DESIGN_RULES.md  # Color palette, typography, layout coords, structural elements
+└── EXAMPLES.md      # Accumulated guidance from prior optimization rounds
+```
+
+The optimizer LLM reads `DESIGN_RULES.md + EXAMPLES.md + evaluation feedback` and rewrites or edits these files to produce `skill_v{N+1}/`. The agent environment manages this evolution loop.
+
+**Key invariant**: `DESIGN_RULES.md` and `EXAMPLES.md` are the only files the optimizer modifies. The pptx/ tooling files are read-only context for the generator.
+
+### The Baseline
+
+The baseline is `skill_v0/` — minimal initial style guidelines with an empty EXAMPLES.md. It must be committed to the repo as `skill_files_baseline/` and represents the true starting point, not any evolved version. On environment `reset()`, the session's skill files are restored to this baseline.
+
+---
+
+## 3. Project Structure
+
+```
+pptx-skillforge-hackathon/
+├── package.json               # pptxgenjs ^4.0.1 dependency
+├── pyproject.toml             # Python package definition
+│
+├── pptx/                      # Generic pptx agent tooling — DO NOT MODIFY
+│   ├── SKILL.md
+│   ├── editing.md
+│   └── pptxgenjs.md
+│
+├── skill_files_baseline/      # Committed minimal baseline (skill_v0 content)
+│   ├── DESIGN_RULES.md        # Minimal McKinsey rules, no teal/wrong colors
+│   └── EXAMPLES.md            # Empty: "(Empty — no prior optimization rounds)"
+│
+├── output/
+│   ├── TASK_PROMPT.md         # Fixed task (Dutch Hydrogen Strategy)
+│   ├── evaluator.py           # Original standalone evaluator (unchanged)
+│   ├── reference/
+│   │   ├── ref-01.jpg         # Cover page reference
+│   │   ├── ref-02.jpg         # Content page reference
+│   │   ├── ref-03.jpg         # Data/chart page reference
+│   │   ├── ref-04.jpg         # Data/chart page reference
+│   │   └── ref-05.jpg         # Content page reference
+│   ├── skill_v0/ … skill_v5/  # Historical optimization rounds
+│   ├── generate_v0.js … v5.js # Historical generated JS files
+│   └── slide_v0.pptx … v5.pptx + JPGs
+│
+└── openenv/                   # OpenEnv environment package
+    ├── app.py                 # FastAPI server entry point
+    ├── client.py              # Reference client implementation
+    ├── openenv.yaml           # OpenEnv manifest
+    ├── Dockerfile
+    ├── models.py              # Pydantic data models
+    ├── slide_skill_environment.py  # Core environment logic
+    ├── skill_manager.py       # Skill file I/O + apply actions
+    ├── slide_generator.py     # Full pipeline: LLM → JS → .pptx → JPG
+    └── evaluator_adapter.py   # Adapter wrapping output/evaluator.py logic
+```
+
+---
+
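skill_manager.py's "apply actions" role includes the section-edit semantics described for EditSectionAction in the data models: replace from just after the matched heading up to (but not including) the next heading of equal or higher level. A simplified, hypothetical sketch of that logic:

```python
import re

def apply_edit_section(markdown: str, heading: str, new_body: str) -> str:
    """Replace the body of the named markdown section. The heading is
    matched by exact text (case-sensitive, without # markers); the old
    body runs until the next heading of equal or higher level."""
    lines = markdown.splitlines()
    out, i, n = [], 0, len(lines)
    while i < n:
        m = re.match(r"(#+)\s+(.*)", lines[i])
        if m and m.group(2).strip() == heading:
            level = len(m.group(1))
            out.append(lines[i])                      # keep the heading line
            out.extend(new_body.rstrip("\n").splitlines())
            i += 1
            # skip the old body until a heading of equal/higher level
            while i < n:
                m2 = re.match(r"(#+)\s", lines[i])
                if m2 and len(m2.group(1)) <= level:
                    break
                i += 1
        else:
            out.append(lines[i])
            i += 1
    return "\n".join(out) + "\n"
```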
+## 4. Data Models
+
+`openenv/models.py`
+
+```python
+"""
+Pydantic data models for the Slide Skill OpenEnv environment.
+
+Action space:
+    SlideSkillAction is a discriminated union of two action types:
+    - EditSectionAction: Replace a named section's body in one skill file.
+    - ReplaceFileAction: Replace the entire content of one skill file.
+
+    EditSectionAction is appropriate when the agent wants surgical edits
+    (e.g., update only the typography section). ReplaceFileAction is used
+    when the optimizer rewrites the file wholesale, which is what the
+    historical optimizer LLM actually does.
+
+Observation space:
+    SlideSkillObservation contains the full evaluator output including all
+    seven score dimensions plus qualitative feedback fields.
+"""
+
+from __future__ import annotations
+
+from typing import Annotated, Literal, Optional
+from pydantic import BaseModel, Field
+
+
+# ---------------------------------------------------------------------------
+# Actions
+# ---------------------------------------------------------------------------
+
+SkillFile = Literal["DESIGN_RULES.md", "EXAMPLES.md"]
+"""The two skill files the optimizer is allowed to modify."""
+
+
+class EditSectionAction(BaseModel):
+    """
+    Replace the body of a named markdown section within a skill file.
+
+    The section is identified by its heading text (without the leading #
+    characters). The replacement spans from immediately after the heading
+    line to (but not including) the next heading of equal or higher level.
+
+    Example:
+        action = EditSectionAction(
+            file="DESIGN_RULES.md",
+            section_heading="Color Palette",
+            new_body="- Navy (#0C2340): primary\\n- White: background\\n"
+        )
+    """
+
+    action_type: Literal["edit_section"] = "edit_section"
+    file: SkillFile = Field(..., description="Which skill file to edit.")
+    section_heading: str = Field(
+        ...,
+        description=(
+            "Exact heading text (without leading # markers). "
+            "Case-sensitive. Must match a heading in the file."
+        ),
+    )
+    new_body: str = Field(
+        ...,
+        description="New markdown content for the section body (after the heading line).",
+    )
+
+
+class ReplaceFileAction(BaseModel):
+    """
+    Replace the entire content of a skill file.
+
+    Use this when the optimizer rewrites the file wholesale rather than
+    making targeted section edits. This is the mode used by the historical
+    optimizer LLM in the Skill Forge loop.
+    """
| 200 |
+
|
| 201 |
+
action_type: Literal["replace_file"] = "replace_file"
|
| 202 |
+
file: SkillFile = Field(..., description="Which skill file to replace.")
|
| 203 |
+
new_content: str = Field(
|
| 204 |
+
...,
|
| 205 |
+
description="Complete new file content (valid markdown).",
|
| 206 |
+
)
|
| 207 |
+
|
| 208 |
+
|
| 209 |
+
# Discriminated union — action_type is the discriminator field.
|
| 210 |
+
SlideSkillAction = Annotated[
|
| 211 |
+
EditSectionAction | ReplaceFileAction,
|
| 212 |
+
Field(discriminator="action_type"),
|
| 213 |
+
]
|
| 214 |
+
|
| 215 |
+
|
| 216 |
+
# ---------------------------------------------------------------------------
|
| 217 |
+
# Scores
|
| 218 |
+
# ---------------------------------------------------------------------------
|
| 219 |
+
|
| 220 |
+
class SlideScores(BaseModel):
|
| 221 |
+
"""Raw scores from the McKinsey evaluator. Each dimension is 0–15 except
|
| 222 |
+
overall_impression which is 0–10. Total is 0–100."""
|
| 223 |
+
|
| 224 |
+
background_layout: int = Field(..., ge=0, le=15)
|
| 225 |
+
color_palette: int = Field(..., ge=0, le=15)
|
| 226 |
+
typography: int = Field(..., ge=0, le=15)
|
| 227 |
+
title_quality: int = Field(..., ge=0, le=15)
|
| 228 |
+
data_presentation: int = Field(..., ge=0, le=15)
|
| 229 |
+
structural_elements: int = Field(..., ge=0, le=15)
|
| 230 |
+
overall_impression: int = Field(..., ge=0, le=10)
|
| 231 |
+
|
| 232 |
+
@property
|
| 233 |
+
def total(self) -> int:
|
| 234 |
+
return (
|
| 235 |
+
self.background_layout
|
| 236 |
+
+ self.color_palette
|
| 237 |
+
+ self.typography
|
| 238 |
+
+ self.title_quality
|
| 239 |
+
+ self.data_presentation
|
| 240 |
+
+ self.structural_elements
|
| 241 |
+
+ self.overall_impression
|
| 242 |
+
)
|
| 243 |
+
|
| 244 |
+
|
| 245 |
+
# ---------------------------------------------------------------------------
|
| 246 |
+
# Observation
|
| 247 |
+
# ---------------------------------------------------------------------------
|
| 248 |
+
|
| 249 |
+
class SlideSkillObservation(BaseModel):
|
| 250 |
+
"""
|
| 251 |
+
Observation returned to the agent after each step.
|
| 252 |
+
|
| 253 |
+
Contains the full evaluator output so the optimizer LLM has all the
|
| 254 |
+
information it needs to write the next skill revision: numeric scores,
|
| 255 |
+
qualitative strengths/weaknesses, and the one-line verdict.
|
| 256 |
+
"""
|
| 257 |
+
|
| 258 |
+
scores: SlideScores
|
| 259 |
+
total: int = Field(..., description="Sum of all score dimensions (0–100).")
|
| 260 |
+
strengths: list[str] = Field(
|
| 261 |
+
default_factory=list,
|
| 262 |
+
description="List of what the slide does well, from the evaluator.",
|
| 263 |
+
)
|
| 264 |
+
weaknesses: list[str] = Field(
|
| 265 |
+
default_factory=list,
|
| 266 |
+
description="List of what needs improvement, from the evaluator.",
|
| 267 |
+
)
|
| 268 |
+
one_line_verdict: str = Field(
|
| 269 |
+
..., description="Single-sentence summary from the evaluator."
|
| 270 |
+
)
|
| 271 |
+
reward: float = Field(
|
| 272 |
+
...,
|
| 273 |
+
description=(
|
| 274 |
+
"Score delta vs. previous step, capped to [-0.3, +0.3] and "
|
| 275 |
+
"normalized to [-1.0, +1.0] by dividing by 100. "
|
| 276 |
+
"Capping reduces reward noise from LLM evaluation variance."
|
| 277 |
+
),
|
| 278 |
+
)
|
| 279 |
+
step: int = Field(..., description="Current step index (0-based).")
|
| 280 |
+
done: bool = Field(..., description="True if max_steps reached.")
|
| 281 |
+
# Paths are strings for JSON serialization
|
| 282 |
+
jpg_path: str = Field(
|
| 283 |
+
..., description="Absolute path to the generated slide JPG."
|
| 284 |
+
)
|
| 285 |
+
design_rules_content: str = Field(
|
| 286 |
+
...,
|
| 287 |
+
description="Current DESIGN_RULES.md content (after action was applied).",
|
| 288 |
+
)
|
| 289 |
+
examples_content: str = Field(
|
| 290 |
+
...,
|
| 291 |
+
description="Current EXAMPLES.md content (after action was applied).",
|
| 292 |
+
)
|
| 293 |
+
|
| 294 |
+
|
| 295 |
+
# ---------------------------------------------------------------------------
|
| 296 |
+
# State (internal, not exposed to client)
|
| 297 |
+
# ---------------------------------------------------------------------------
|
| 298 |
+
|
| 299 |
+
class SlideSkillState(BaseModel):
|
| 300 |
+
"""Internal environment state. Not serialized to the client."""
|
| 301 |
+
|
| 302 |
+
session_id: str
|
| 303 |
+
step: int = 0
|
| 304 |
+
prev_total: int = 0 # score from the previous step (for reward calculation)
|
| 305 |
+
session_dir: str = Field(
|
| 306 |
+
...,
|
| 307 |
+
description=(
|
| 308 |
+
"Absolute path to the isolated session directory under /tmp/. "
|
| 309 |
+
"Contains copies of DESIGN_RULES.md and EXAMPLES.md that this "
|
| 310 |
+
"session is free to modify without affecting other sessions."
|
| 311 |
+
),
|
| 312 |
+
)
|
| 313 |
+
```
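
The discriminated union above lets a raw JSON payload be routed to the right action class by its `action_type` field. Pydantic does this automatically via `Field(discriminator=...)`; purely as an illustration of the mechanism, the same dispatch can be sketched with stdlib dataclasses (`parse_action` and `_ACTION_REGISTRY` are hypothetical names, not part of the plan):

```python
# Stdlib-only sketch of discriminator dispatch: the "action_type" field
# selects which concrete action class gets constructed.
import json
from dataclasses import dataclass


@dataclass
class EditSectionAction:
    file: str
    section_heading: str
    new_body: str


@dataclass
class ReplaceFileAction:
    file: str
    new_content: str


_ACTION_REGISTRY = {
    "edit_section": EditSectionAction,
    "replace_file": ReplaceFileAction,
}


def parse_action(payload: str):
    """Build the action whose class matches the payload's action_type."""
    data = json.loads(payload)
    cls = _ACTION_REGISTRY[data.pop("action_type")]
    return cls(**data)


action = parse_action(
    '{"action_type": "replace_file", "file": "DESIGN_RULES.md", "new_content": "# Rules"}'
)
assert isinstance(action, ReplaceFileAction)
```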

---

## 5. Environment Logic

`openenv/slide_skill_environment.py`

```python
"""
Slide Skill Environment — OpenEnv-compatible environment for optimizing
McKinsey-style PowerPoint slide generation.

Concurrency model:
    SUPPORTS_CONCURRENT_SESSIONS = True

    Each session gets an isolated working directory at /tmp/slide_skill_{session_id}/.
    Skill files (DESIGN_RULES.md, EXAMPLES.md) are copied there on reset() and
    modified in place during the session. The shared repo files are never
    modified, so multiple sessions can run simultaneously without file conflicts.

    The only shared resource is the Anthropic API key, which is rate-limited
    per account. On HuggingFace Spaces, running 2–3 concurrent sessions is
    realistic before hitting rate limits.

Episode timing:
    Each step involves two LLM calls (generator + evaluator) plus Node.js and
    LibreOffice. Expect 60–120 seconds per step. At max_steps=7, a full episode
    runs 7–14 minutes.

Reward function:
    reward = clip(total_score - prev_total_score, -30, +30) / 100

    Capping at ±30 points (±0.3 reward) dampens LLM evaluation noise. A score
    can fluctuate ±5–10 points between identical slides due to evaluator
    variance, so capping prevents large undeserved penalties or bonuses.
"""

from __future__ import annotations

import shutil
import uuid
from pathlib import Path
from typing import ClassVar

from models import (
    SlideSkillAction,
    SlideSkillObservation,
    SlideSkillState,
    SlideScores,
)
from skill_manager import SkillManager
from slide_generator import SlideGenerator
from evaluator_adapter import EvaluatorAdapter


# Paths relative to the repo root — adjust if the package moves.
REPO_ROOT = Path(__file__).parent.parent
BASELINE_DIR = REPO_ROOT / "skill_files_baseline"
TASK_PROMPT_PATH = REPO_ROOT / "output" / "TASK_PROMPT.md"
REFERENCE_DIR = REPO_ROOT / "output" / "reference"

# Reward capping parameters
REWARD_CLIP_POINTS = 30  # clip score delta to ±30 before normalizing
REWARD_SCALE = 100.0     # divide the clipped delta by this to get [-0.3, +0.3]

MAX_STEPS = 7


class SlideSkillEnvironment:
    """OpenEnv environment for the Skill Forge optimization loop."""

    SUPPORTS_CONCURRENT_SESSIONS: ClassVar[bool] = True

    def __init__(self) -> None:
        self._sessions: dict[str, SlideSkillState] = {}
        self._generator = SlideGenerator(
            task_prompt_path=TASK_PROMPT_PATH,
            pptx_skill_dir=REPO_ROOT / "pptx",
            reference_dir=REFERENCE_DIR,
        )
        self._evaluator = EvaluatorAdapter(reference_dir=REFERENCE_DIR)

    # ------------------------------------------------------------------
    # Public OpenEnv interface
    # ------------------------------------------------------------------

    def reset(self, session_id: str | None = None) -> str:
        """
        Initialize or reinitialize a session.

        Creates an isolated working directory under /tmp/ and copies the
        baseline skill files into it. Returns the session_id.
        """
        session_id = session_id or str(uuid.uuid4())

        session_dir = Path(f"/tmp/slide_skill_{session_id}")
        if session_dir.exists():
            shutil.rmtree(session_dir)
        session_dir.mkdir(parents=True)

        # Copy baseline skill files into the session directory.
        for fname in ("DESIGN_RULES.md", "EXAMPLES.md"):
            src = BASELINE_DIR / fname
            if not src.exists():
                raise FileNotFoundError(
                    f"Baseline file missing: {src}. "
                    "Commit skill_files_baseline/ to the repo."
                )
            shutil.copy2(src, session_dir / fname)

        self._sessions[session_id] = SlideSkillState(
            session_id=session_id,
            step=0,
            prev_total=0,
            session_dir=str(session_dir),
        )
        return session_id

    def step(self, session_id: str, action: SlideSkillAction) -> SlideSkillObservation:
        """
        Apply an action, run the generation pipeline, evaluate, and return
        an observation.

        Args:
            session_id: Must be a live session (call reset() first).
            action: Either EditSectionAction or ReplaceFileAction.

        Returns:
            SlideSkillObservation with scores, feedback, reward, and file contents.

        Raises:
            KeyError: If session_id is not found.
            RuntimeError: If the generation or evaluation pipeline fails.
        """
        state = self._sessions[session_id]
        session_dir = Path(state.session_dir)

        # 1. Apply the action to the session's skill files.
        manager = SkillManager(session_dir)
        manager.apply(action)

        # 2. Run the full generation pipeline.
        jpg_path = self._generator.generate(
            session_id=session_id,
            session_dir=session_dir,
        )

        # 3. Evaluate the generated slide.
        eval_result = self._evaluator.evaluate(jpg_path)

        # 4. Compute reward (capped score delta).
        delta = eval_result["total"] - state.prev_total
        clipped_delta = max(-REWARD_CLIP_POINTS, min(REWARD_CLIP_POINTS, delta))
        reward = clipped_delta / REWARD_SCALE

        # 5. Update state.
        state.step += 1
        state.prev_total = eval_result["total"]
        done = state.step >= MAX_STEPS

        # 6. Read back current file contents for the observation.
        design_rules = (session_dir / "DESIGN_RULES.md").read_text()
        examples = (session_dir / "EXAMPLES.md").read_text()

        scores = SlideScores(**eval_result["scores"])

        return SlideSkillObservation(
            scores=scores,
            total=eval_result["total"],
            strengths=eval_result.get("strengths", []),
            weaknesses=eval_result.get("weaknesses", []),
            one_line_verdict=eval_result["one_line_verdict"],
            reward=reward,
            step=state.step,
            done=done,
            jpg_path=str(jpg_path),
            design_rules_content=design_rules,
            examples_content=examples,
        )

    def close(self, session_id: str) -> None:
        """Clean up session resources. Deletes the /tmp/ session directory."""
        if session_id in self._sessions:
            state = self._sessions.pop(session_id)
            session_dir = Path(state.session_dir)
            if session_dir.exists():
                shutil.rmtree(session_dir)
```
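
The reward rule from the module docstring is easy to sanity-check in isolation. A minimal sketch that duplicates the `REWARD_CLIP_POINTS` / `REWARD_SCALE` constants so it runs standalone:

```python
# Standalone sketch of the reward rule:
#   reward = clip(total - prev_total, -30, +30) / 100
REWARD_CLIP_POINTS = 30
REWARD_SCALE = 100.0


def compute_reward(total: int, prev_total: int) -> float:
    """Clipped score delta, normalized into [-0.3, +0.3]."""
    delta = total - prev_total
    clipped = max(-REWARD_CLIP_POINTS, min(REWARD_CLIP_POINTS, delta))
    return clipped / REWARD_SCALE


assert compute_reward(72, 60) == 0.12   # ordinary improvement
assert compute_reward(95, 20) == 0.3    # large jump is capped
assert compute_reward(20, 95) == -0.3   # large drop is capped too
```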

---

## 6. Supporting Modules

### 6a. Skill Manager

`openenv/skill_manager.py`

```python
"""
Skill file manager — applies actions to an isolated session directory.

Operates exclusively on files within session_dir (a /tmp/ path).
Never touches the repo's baseline or any shared files.

Section editing rules:
    A "section" is a markdown heading of any level (# to ######).
    EditSectionAction finds the first heading whose text matches
    section_heading (case-sensitive, stripped), then replaces everything
    from the line after that heading up to (but not including) the next
    heading of equal or higher level (i.e., the same or fewer # characters).
    If no next heading is found, the replacement extends to end-of-file.
"""

from __future__ import annotations

import re
from pathlib import Path

from models import EditSectionAction, ReplaceFileAction, SlideSkillAction


class SkillManager:
    """Manages DESIGN_RULES.md and EXAMPLES.md within a session directory."""

    def __init__(self, session_dir: Path) -> None:
        self.session_dir = session_dir

    def apply(self, action: SlideSkillAction) -> None:
        """
        Dispatch to the appropriate handler based on action type.

        Raises:
            ValueError: If action_type is unrecognized or the section is not found.
            FileNotFoundError: If the target skill file does not exist.
        """
        target = self.session_dir / action.file
        if not target.exists():
            raise FileNotFoundError(f"Skill file not found in session: {target}")

        if action.action_type == "replace_file":
            self._replace_file(target, action)
        elif action.action_type == "edit_section":
            self._edit_section(target, action)
        else:
            raise ValueError(f"Unknown action_type: {action.action_type!r}")

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    @staticmethod
    def _replace_file(target: Path, action: ReplaceFileAction) -> None:
        """Overwrite the entire file with new_content."""
        target.write_text(action.new_content, encoding="utf-8")

    @staticmethod
    def _edit_section(target: Path, action: EditSectionAction) -> None:
        """Replace the body of a named markdown section."""
        text = target.read_text(encoding="utf-8")
        lines = text.splitlines(keepends=True)

        # Find the heading line.
        heading_pattern = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
        heading_idx: int | None = None
        heading_level: int = 0

        for i, line in enumerate(lines):
            m = heading_pattern.match(line.rstrip("\n\r"))
            if m and m.group(2) == action.section_heading:
                heading_idx = i
                heading_level = len(m.group(1))
                break

        if heading_idx is None:
            raise ValueError(
                f"Section heading {action.section_heading!r} not found in {target.name}."
            )

        # Find where the section body ends (next heading of equal or higher level).
        end_idx = len(lines)
        for i in range(heading_idx + 1, len(lines)):
            m = heading_pattern.match(lines[i].rstrip("\n\r"))
            if m and len(m.group(1)) <= heading_level:
                end_idx = i
                break

        # Reconstruct the file.
        new_body = action.new_body
        if new_body and not new_body.endswith("\n"):
            new_body += "\n"

        new_lines = (
            lines[: heading_idx + 1]   # the heading itself
            + [new_body]
            + lines[end_idx:]          # rest of the file after the section
        )
        target.write_text("".join(new_lines), encoding="utf-8")

    def read_file(self, filename: str) -> str:
        """Read a skill file from the session directory."""
        return (self.session_dir / filename).read_text(encoding="utf-8")
```
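
The section-boundary rules above can be exercised on an in-memory string without touching the filesystem. This sketch mirrors the `_edit_section` logic (the `replace_section` helper is illustrative, not part of the module):

```python
# Mirror of the section-edit algorithm: find the heading, then replace
# everything up to the next heading of equal or higher level.
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def replace_section(text: str, heading: str, new_body: str) -> str:
    """Replace a markdown section body in a string."""
    lines = text.splitlines(keepends=True)
    idx, level = None, 0
    for i, line in enumerate(lines):
        m = HEADING.match(line.rstrip("\n\r"))
        if m and m.group(2) == heading:
            idx, level = i, len(m.group(1))
            break
    if idx is None:
        raise ValueError(f"heading {heading!r} not found")
    end = len(lines)
    for i in range(idx + 1, len(lines)):
        m = HEADING.match(lines[i].rstrip("\n\r"))
        if m and len(m.group(1)) <= level:
            end = i
            break
    if new_body and not new_body.endswith("\n"):
        new_body += "\n"
    return "".join(lines[: idx + 1] + [new_body] + lines[end:])


doc = "# Rules\n## Color Palette\nold\n## Typography\nserif titles\n"
out = replace_section(doc, "Color Palette", "- navy only")
assert out == "# Rules\n## Color Palette\n- navy only\n## Typography\nserif titles\n"
```

Note that nested subsections (deeper headings) are treated as part of the section body and are replaced along with it.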

### 6b. Slide Generator

`openenv/slide_generator.py`

```python
"""
Slide Generator — orchestrates the full PPT generation pipeline.

Pipeline (in order):
    1. The LLM reads DESIGN_RULES.md + EXAMPLES.md + TASK_PROMPT.md + pptx/ tooling
       → writes pptxgenjs JavaScript to generate.js in the session output dir.
    2. `node generate.js` runs in the session output dir → produces slide.pptx.
    3. `soffice --headless --convert-to pdf slide.pptx` → slide.pdf.
    4. `pdftoppm -r 150 slide.pdf slide` → slide-1.jpg (page 1).
    5. Returns the Path to slide-1.jpg.

The generator LLM receives the pptx/ tooling files as context so it knows
the full pptxgenjs API — but those files are read-only; they are never
written to or returned in the observation.

Session isolation:
    All generated artifacts (generate.js, slide.pptx, slide.pdf, slide-1.jpg)
    are written into a subdirectory of session_dir so that concurrent sessions
    do not share output paths.
"""

from __future__ import annotations

import subprocess
import textwrap
from pathlib import Path

import anthropic


# The generator uses a capable coding model. Claude Sonnet is a good balance
# between quality and speed/cost for code generation.
GENERATOR_MODEL = "claude-sonnet-4-6"
GENERATOR_MAX_TOKENS = 4096


class SlideGenerator:
    """Drives the LLM → Node.js → LibreOffice → pdftoppm pipeline."""

    def __init__(
        self,
        task_prompt_path: Path,
        pptx_skill_dir: Path,
        reference_dir: Path,
    ) -> None:
        self.task_prompt = task_prompt_path.read_text(encoding="utf-8")
        self.pptx_skill_dir = pptx_skill_dir
        self.reference_dir = reference_dir
        self._client = anthropic.Anthropic()

    def generate(self, session_id: str, session_dir: Path) -> Path:
        """
        Run the full pipeline for one optimization step.

        Args:
            session_id: Used only for logging/naming.
            session_dir: Isolated directory containing the session's
                DESIGN_RULES.md and EXAMPLES.md.

        Returns:
            Absolute path to the generated slide JPG (slide-1.jpg).

        Raises:
            RuntimeError: If any pipeline stage (LLM, Node, LibreOffice,
                pdftoppm) fails.
        """
        out_dir = session_dir / "output"
        out_dir.mkdir(exist_ok=True)

        js_path = out_dir / "generate.js"
        pptx_path = out_dir / "slide.pptx"
        jpg_stem = out_dir / "slide"
        jpg_path = out_dir / "slide-1.jpg"

        # Stage 1: LLM generates pptxgenjs JavaScript.
        js_code = self._call_generator_llm(session_dir)
        js_path.write_text(js_code, encoding="utf-8")

        # Stage 2: Node.js executes the JS to produce the .pptx file.
        self._run(
            ["node", str(js_path)],
            cwd=out_dir,
            stage="node generate.js",
        )
        if not pptx_path.exists():
            raise RuntimeError(
                f"node generate.js completed but {pptx_path} was not created."
            )

        # Stage 3: LibreOffice converts .pptx → .pdf.
        self._run(
            [
                "soffice",
                "--headless",
                "--convert-to", "pdf",
                "--outdir", str(out_dir),
                str(pptx_path),
            ],
            cwd=out_dir,
            stage="soffice --convert-to pdf",
        )
        pdf_path = out_dir / "slide.pdf"
        if not pdf_path.exists():
            raise RuntimeError(
                f"LibreOffice completed but {pdf_path} was not created."
            )

        # Stage 4: pdftoppm converts PDF page 1 → JPG at 150 DPI.
        # Output: slide-1.jpg (pdftoppm appends "-{page}" automatically).
        self._run(
            [
                "pdftoppm",
                "-r", "150",
                "-jpeg",
                "-f", "1", "-l", "1",  # only page 1
                str(pdf_path),
                str(jpg_stem),
            ],
            cwd=out_dir,
            stage="pdftoppm",
        )
        if not jpg_path.exists():
            raise RuntimeError(
                f"pdftoppm completed but {jpg_path} was not created."
            )

        return jpg_path

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    def _call_generator_llm(self, session_dir: Path) -> str:
        """
        Call the generator LLM with the skill files + task prompt as context.

        Returns the raw JavaScript code string (without markdown fences).
        """
        design_rules = (session_dir / "DESIGN_RULES.md").read_text(encoding="utf-8")
        examples = (session_dir / "EXAMPLES.md").read_text(encoding="utf-8")

        # Load the generic pptx tooling files as executor context.
        pptx_skill = self._read_pptx_skill()

        system_prompt = textwrap.dedent("""\
            You are an expert pptxgenjs developer. You will write a complete,
            runnable Node.js script that generates a PowerPoint slide using
            the pptxgenjs library.

            Rules:
            - Output ONLY the JavaScript code. No markdown fences, no explanation.
            - The script must save the file as "slide.pptx" in the current directory.
            - Follow the DESIGN_RULES.md and EXAMPLES.md exactly.
            - Use the pptxgenjs API reference below for correct method calls.
            """)

        user_message = textwrap.dedent(f"""\
            ## pptxgenjs API Reference
            {pptx_skill}

            ## Brand Style: DESIGN_RULES.md
            {design_rules}

            ## Brand Style: EXAMPLES.md
            {examples}

            ## Task
            {self.task_prompt}

            Write the complete pptxgenjs JavaScript file now.
            """)

        response = self._client.messages.create(
            model=GENERATOR_MODEL,
            max_tokens=GENERATOR_MAX_TOKENS,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}],
        )

        code = response.content[0].text.strip()

        # Strip markdown code fences if the LLM added them despite instructions.
        if code.startswith("```"):
            code = code.split("\n", 1)[1]
        if code.endswith("```"):
            code = code.rsplit("```", 1)[0]
        code = code.strip()

        return code

    def _read_pptx_skill(self) -> str:
        """Concatenate the generic pptx skill files for LLM context."""
        parts = []
        for fname in ("SKILL.md", "editing.md", "pptxgenjs.md"):
            p = self.pptx_skill_dir / fname
            if p.exists():
                parts.append(f"### {fname}\n{p.read_text(encoding='utf-8')}")
        return "\n\n".join(parts)

    @staticmethod
    def _run(cmd: list[str], cwd: Path, stage: str) -> None:
        """Run a subprocess; raise RuntimeError with context if it fails."""
        result = subprocess.run(
            cmd,
            cwd=cwd,
            capture_output=True,
            text=True,
            timeout=300,  # 5 min hard limit per stage
        )
        if result.returncode != 0:
            raise RuntimeError(
                f"Pipeline stage '{stage}' failed (exit {result.returncode}).\n"
                f"stdout: {result.stdout[-2000:]}\n"
                f"stderr: {result.stderr[-2000:]}"
            )
```
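
The fence-stripping fallback at the end of `_call_generator_llm` is a natural candidate for a standalone helper, since it can then be unit-tested without an API call. A sketch (the `strip_md_fences` name is hypothetical):

```python
def strip_md_fences(code: str) -> str:
    """Remove markdown code fences an LLM may wrap around its output."""
    code = code.strip()
    if code.startswith("```"):
        # Drop the opening fence line (it may carry a language tag).
        code = code.split("\n", 1)[1]
    if code.endswith("```"):
        code = code.rsplit("```", 1)[0]
    return code.strip()


fenced = "```javascript\nconst x = 1;\nconsole.log(x);\n```"
assert strip_md_fences(fenced) == "const x = 1;\nconsole.log(x);"
assert strip_md_fences("const y = 2;") == "const y = 2;"
```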

### 6c. Evaluator Adapter

`openenv/evaluator_adapter.py`

```python
"""
Evaluator Adapter — wraps the existing output/evaluator.py logic as a
reusable module with a clean interface.

This module does NOT import output/evaluator.py (which has a __main__ guard
and hardcoded paths). Instead, it re-implements the core evaluate_slide()
logic with:
    - Configurable reference image paths
    - A return type that includes all seven score keys, strengths, weaknesses,
      and one_line_verdict
    - No file I/O side effects (no evaluation_results.json written)

The evaluation prompt is identical to output/evaluator.py so scores are
comparable across the historical runs and the OpenEnv loop.
"""

from __future__ import annotations

import base64
import json
from pathlib import Path

import anthropic


# Must match output/evaluator.py exactly so historical scores are comparable.
EVALUATION_SYSTEM_PROMPT = """You are an expert McKinsey & Company slide design evaluator.

You will be shown:
1. REFERENCE IMAGES: 5 pages from a real McKinsey & Company consulting deck (Chilean Hydrogen Pathway, December 2020). These represent the gold standard for visual style.
2. CANDIDATE SLIDE: A programmatically generated PowerPoint slide about Dutch Hydrogen Strategy, rendered as a JPEG image.

Your job: Score how closely the CANDIDATE SLIDE matches the McKinsey visual style shown in the REFERENCE IMAGES.

## Scoring Rubric (100 points total)

### 1. Background & Base Layout (0-15 points)
- McKinsey content/data slides use WHITE backgrounds (dark navy is ONLY for section dividers/covers)
- Clean margins (~0.5" all sides)
- No unnecessary visual clutter
- 15: White bg, clean margins, professional spacing
- 10: White bg but spacing issues
- 5: Wrong background color or major layout problems
- 0: Completely wrong base (e.g., dark bg for data slide)

### 2. Color Palette Fidelity (0-15 points)
- McKinsey uses a RESTRAINED palette: navy/dark blue (#0C2340-ish), white, light greys
- Accent colors are used SPARINGLY — typically just 1-2 accent colors max
- NO rainbow effects, no bright multi-color schemes
- Crimson/red used only for thin divider lines, not large elements
- 15: Matches McKinsey's restrained navy/white/grey palette perfectly
- 10: Mostly correct but 1-2 color choices off
- 5: Too many colors or wrong color family
- 0: Completely different color scheme

### 3. Typography (0-15 points)
- Title: Large, bold, black or very dark, left-aligned (Georgia or similar serif for titles)
- Body: Clean sans-serif (Calibri-like), smaller, grey or dark grey
- Clear size hierarchy: title >> subtitle >> body >> footnotes
- No decorative fonts
- 15: Perfect type hierarchy matching McKinsey
- 10: Good hierarchy but font choices slightly off
- 5: Weak hierarchy or wrong fonts
- 0: No clear hierarchy

### 4. Title Quality — "So-What" Style (0-15 points)
- McKinsey titles state a CONCLUSION or INSIGHT, not just a topic
- GOOD: "The Netherlands aims to become Europe's green hydrogen hub, scaling from 500 MW to 3-4 GW by 2030"
- BAD: "Dutch Hydrogen Strategy (2020-2035)" or "Roadmap Overview"
- The title should tell you the key takeaway without reading the slide
- 15: Clear insight-driven conclusion title
- 10: Partial insight (has some specifics but reads more like a topic)
- 5: Pure topic label
- 0: Generic or missing title

### 5. Data Presentation (0-15 points)
- McKinsey uses structured TABLES for data (not floating stat callouts)
- Tables have: navy header borders (top + bottom of header row), light grey row dividers, bold left column labels
- Data should be organized, scannable, center-aligned values
- Key columns/years may be subtly highlighted
- 15: Clean structured table matching McKinsey format
- 10: Has data but format doesn't match McKinsey tables
- 5: Data present but poorly structured (floating callouts, inconsistent format)
- 0: No supporting data

### 6. Structural Elements (0-15 points)
- Thin crimson/red divider line below title area (not touching title — separated by whitespace)
- McKinsey footer: thin rule line + source text (left) + "McKinsey & Company" bold (right) + page number
- Numbered footnotes for data disclaimers
- Source attribution line
- 15: All structural elements present and correctly placed
- 10: Most elements present, minor placement issues
- 5: Missing 2+ structural elements
- 0: No McKinsey structural elements

### 7. Overall Visual Impression (0-10 points)
- Does this FEEL like it came from McKinsey?
- Would a consulting professional find this polished and credible?
- Is it clean, restrained, and authoritative — or busy, colorful, and amateur?
- 10: Indistinguishable from real McKinsey output
- 7: Close but a trained eye spots differences
- 4: Clearly generated/templated but has some McKinsey DNA
- 1: Does not resemble McKinsey at all

## Output Format

Return ONLY a JSON object with this exact structure (no markdown, no code fences):
{
    "scores": {
|
| 952 |
+
"background_layout": <0-15>,
|
| 953 |
+
"color_palette": <0-15>,
|
| 954 |
+
"typography": <0-15>,
|
| 955 |
+
"title_quality": <0-15>,
|
| 956 |
+
"data_presentation": <0-15>,
|
| 957 |
+
"structural_elements": <0-15>,
|
| 958 |
+
"overall_impression": <0-10>
|
| 959 |
+
},
|
| 960 |
+
"total": <sum of all scores, 0-100>,
|
| 961 |
+
"strengths": ["<strength 1>", "<strength 2>", ...],
|
| 962 |
+
"weaknesses": ["<weakness 1>", "<weakness 2>", ...],
|
| 963 |
+
"one_line_verdict": "<one sentence summary>"
|
| 964 |
+
}
|
| 965 |
+
"""
|
| 966 |
+
|
| 967 |
+
EVALUATOR_MODEL = "claude-opus-4-6"


def _encode_image(path: Path) -> dict:
    """Encode an image file to base64 for the Anthropic messages API."""
    data = base64.standard_b64encode(path.read_bytes()).decode("utf-8")
    suffix = path.suffix.lower()
    media_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": data,
        },
    }


class EvaluatorAdapter:
    """
    Adapter that evaluates a generated slide JPG against McKinsey references.

    Uses the same Claude Opus 4.6 + vision approach as output/evaluator.py,
    but as a reusable class rather than a script with side effects.
    """

    REFERENCE_FILENAMES = [
        "ref-01.jpg",
        "ref-02.jpg",
        "ref-03.jpg",
        "ref-04.jpg",
        "ref-05.jpg",
    ]

    def __init__(self, reference_dir: Path) -> None:
        """
        Args:
            reference_dir: Directory containing ref-01.jpg through ref-05.jpg.
        """
        self.reference_dir = reference_dir
        self._client = anthropic.Anthropic()

        # Validate reference images exist at construction time.
        missing = [
            f for f in self.REFERENCE_FILENAMES
            if not (reference_dir / f).exists()
        ]
        if missing:
            raise FileNotFoundError(
                f"Missing reference images in {reference_dir}: {missing}"
            )

    def evaluate(self, slide_jpg_path: Path) -> dict:
        """
        Evaluate a generated slide against the McKinsey reference images.

        Args:
            slide_jpg_path: Absolute path to the slide JPG to evaluate.

        Returns:
            dict with keys:
                "scores": dict mapping the 7 dimension names to int scores
                "total": int, sum of all scores (0-100)
                "strengths": list[str]
                "weaknesses": list[str]
                "one_line_verdict": str

        Raises:
            FileNotFoundError: If slide_jpg_path does not exist.
            json.JSONDecodeError: If the LLM returns malformed JSON.
            RuntimeError: If the API call fails.
        """
        if not slide_jpg_path.exists():
            raise FileNotFoundError(f"Slide JPG not found: {slide_jpg_path}")

        content: list[dict] = []

        # Reference images first.
        content.append({
            "type": "text",
            "text": (
                "## REFERENCE IMAGES (Real McKinsey deck)\n"
                "The following 5 images are from a real McKinsey & Company consulting "
                "report. Study their visual style carefully."
            ),
        })
        for i, fname in enumerate(self.REFERENCE_FILENAMES, 1):
            ref_path = self.reference_dir / fname
            content.append(_encode_image(ref_path))
            content.append({"type": "text", "text": f"(Reference page {i})"})

        # Candidate slide.
        content.append({
            "type": "text",
            "text": (
                f"\n## CANDIDATE SLIDE TO EVALUATE\n"
                f"This is the generated slide: {slide_jpg_path.name}"
            ),
        })
        content.append(_encode_image(slide_jpg_path))
        content.append({
            "type": "text",
            "text": (
                "\nNow score this candidate slide against the McKinsey reference "
                "using the rubric. Return ONLY the JSON object."
            ),
        })

        response = self._client.messages.create(
            model=EVALUATOR_MODEL,
            max_tokens=1024,
            system=EVALUATION_SYSTEM_PROMPT,
            messages=[{"role": "user", "content": content}],
        )

        text = response.content[0].text.strip()

        # Strip markdown code fences if present (LLMs sometimes add them
        # despite explicit instructions not to).
        if text.startswith("```"):
            text = text.split("\n", 1)[1].rsplit("```", 1)[0].strip()

        result = json.loads(text)

        # Validate required keys are present.
        required_score_keys = {
            "background_layout", "color_palette", "typography",
            "title_quality", "data_presentation", "structural_elements",
            "overall_impression",
        }
        missing_keys = required_score_keys - set(result.get("scores", {}).keys())
        if missing_keys:
            raise ValueError(
                f"Evaluator response missing score keys: {missing_keys}. "
                f"Full response: {text[:500]}"
            )

        return result
```
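The fence-stripping expression in `evaluate()` is compact and easy to get wrong, so here it is isolated as a runnable check. The helper name `strip_code_fences` is illustrative, not part of the module:

```python
import json

def strip_code_fences(text: str) -> str:
    # Mirrors the logic in EvaluatorAdapter.evaluate(): drop the opening
    # fence line (e.g. "```json") and the trailing fence if present.
    text = text.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1].rsplit("```", 1)[0].strip()
    return text

# A fenced response still yields parseable JSON; an unfenced one is untouched.
fenced = "```json\n{\"total\": 87}\n```"
payload = json.loads(strip_code_fences(fenced))
```

Note that this only handles a single outer fence, which matches the failure mode observed in practice.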

---

## 7. Server Entry Point

`openenv/app.py`

```python
"""
FastAPI server for the Slide Skill OpenEnv environment.

Endpoints follow the OpenEnv HTTP protocol:
    POST /reset → initialize or restart a session
    POST /step → apply an action and return observation
    DELETE /sessions/{session_id} → clean up a session

The server is stateful: environment instances are kept in memory.
For production deployments with multiple workers, use a single-worker
Uvicorn setup or externalize session state to Redis.
"""

from __future__ import annotations

from contextlib import asynccontextmanager
from typing import Annotated

import uvicorn
from fastapi import Body, FastAPI, HTTPException, Path
from pydantic import BaseModel

from models import SlideSkillAction, SlideSkillObservation
from slide_skill_environment import SlideSkillEnvironment


# Single shared environment instance. Sessions are isolated at the file
# level, so this is safe for concurrent requests.
_env: SlideSkillEnvironment | None = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    global _env
    _env = SlideSkillEnvironment()
    yield
    _env = None


app = FastAPI(
    title="Slide Skill OpenEnv",
    description=(
        "OpenEnv-compatible environment for optimizing McKinsey-style "
        "PowerPoint slides by evolving DESIGN_RULES.md and EXAMPLES.md."
    ),
    lifespan=lifespan,
)


class ResetRequest(BaseModel):
    session_id: str | None = None


class ResetResponse(BaseModel):
    session_id: str
    message: str


class StepRequest(BaseModel):
    session_id: str
    action: SlideSkillAction


@app.post("/reset", response_model=ResetResponse)
async def reset(request: ResetRequest = Body(default=ResetRequest())) -> ResetResponse:
    """Initialize or restart an optimization session."""
    assert _env is not None
    session_id = _env.reset(session_id=request.session_id)
    return ResetResponse(
        session_id=session_id,
        message=f"Session {session_id} initialized with baseline skill files.",
    )


@app.post("/step", response_model=SlideSkillObservation)
async def step(request: StepRequest) -> SlideSkillObservation:
    """Apply an action to the session and return the resulting observation."""
    assert _env is not None
    try:
        observation = _env.step(
            session_id=request.session_id,
            action=request.action,
        )
    except KeyError:
        raise HTTPException(
            status_code=404,
            detail=f"Session {request.session_id!r} not found. Call /reset first.",
        )
    except (RuntimeError, ValueError) as exc:
        raise HTTPException(status_code=500, detail=str(exc))
    return observation


@app.delete("/sessions/{session_id}")
async def close_session(
    session_id: Annotated[str, Path(description="Session ID to clean up.")]
) -> dict:
    """Clean up session resources (deletes /tmp/ working directory)."""
    assert _env is not None
    try:
        _env.close(session_id)
    except KeyError:
        raise HTTPException(
            status_code=404,
            detail=f"Session {session_id!r} not found.",
        )
    return {"message": f"Session {session_id} closed."}


@app.get("/health")
async def health() -> dict:
    return {"status": "ok", "supports_concurrent_sessions": True}


if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, workers=1)
```

---

## 8. Client

`openenv/client.py`

```python
"""
Reference client for the Slide Skill OpenEnv server.

Demonstrates how an optimizer agent would interact with the environment:
    1. Reset to get a session ID.
    2. Read the initial skill file contents from the first observation.
    3. Call an LLM optimizer to generate an improved DESIGN_RULES.md.
    4. Submit as a ReplaceFileAction.
    5. Repeat until done=True.

This client is also useful for smoke-testing the server without a full agent.
"""

from __future__ import annotations

import json
import textwrap
from pathlib import Path
from typing import Any

import anthropic
import httpx

from models import SlideSkillObservation

SERVER_URL = "http://localhost:8000"
OPTIMIZER_MODEL = "claude-opus-4-6"


class SlideSkillClient:
    """HTTP client for the Slide Skill OpenEnv server."""

    def __init__(self, base_url: str = SERVER_URL) -> None:
        self.base_url = base_url.rstrip("/")
        self._http = httpx.Client(timeout=300.0)  # long timeout for pipeline stages

    def reset(self, session_id: str | None = None) -> str:
        """Start a new session. Returns the session_id."""
        payload: dict[str, Any] = {}
        if session_id:
            payload["session_id"] = session_id
        resp = self._http.post(f"{self.base_url}/reset", json=payload)
        resp.raise_for_status()
        return resp.json()["session_id"]

    def step(self, session_id: str, action: dict) -> SlideSkillObservation:
        """
        Apply an action and return the observation.

        Args:
            session_id: Active session ID.
            action: Dict matching EditSectionAction or ReplaceFileAction schema.
                Must include "action_type" key.
        """
        payload = {"session_id": session_id, "action": action}
        resp = self._http.post(f"{self.base_url}/step", json=payload)
        resp.raise_for_status()
        return SlideSkillObservation.model_validate(resp.json())

    def close(self, session_id: str) -> None:
        """Clean up the session."""
        resp = self._http.delete(f"{self.base_url}/sessions/{session_id}")
        resp.raise_for_status()

    def __enter__(self) -> SlideSkillClient:
        return self

    def __exit__(self, *_: Any) -> None:
        self._http.close()


# ---------------------------------------------------------------------------
# Optimizer agent (reference implementation)
# ---------------------------------------------------------------------------

def call_optimizer_llm(
    obs: SlideSkillObservation,
    anthropic_client: anthropic.Anthropic,
) -> dict:
    """
    Call the optimizer LLM to generate a new DESIGN_RULES.md based on
    the evaluation feedback.

    Returns a dict suitable for the step() action parameter.
    This uses ReplaceFileAction since the historical optimizer rewrites
    the file wholesale.
    """
    prompt = textwrap.dedent(f"""\
        You are a McKinsey slide design optimizer. You are improving a
        PowerPoint generation skill by rewriting its DESIGN_RULES.md file.

        ## Current Score: {obs.total}/100

        ## Score Breakdown
        - background_layout: {obs.scores.background_layout}/15
        - color_palette: {obs.scores.color_palette}/15
        - typography: {obs.scores.typography}/15
        - title_quality: {obs.scores.title_quality}/15
        - data_presentation: {obs.scores.data_presentation}/15
        - structural_elements: {obs.scores.structural_elements}/15
        - overall_impression: {obs.scores.overall_impression}/10

        ## Evaluator Feedback
        Strengths:
        {chr(10).join(f'- {s}' for s in obs.strengths)}

        Weaknesses:
        {chr(10).join(f'- {w}' for w in obs.weaknesses)}

        Verdict: {obs.one_line_verdict}

        ## Current DESIGN_RULES.md
        {obs.design_rules_content}

        ## Current EXAMPLES.md
        {obs.examples_content}

        Your task:
        Write an improved DESIGN_RULES.md that addresses the weaknesses above
        while preserving what works well. Focus on the dimensions with the
        lowest scores. Output ONLY the markdown file content — no explanation,
        no code fences.
        """)

    response = anthropic_client.messages.create(
        model=OPTIMIZER_MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )

    new_content = response.content[0].text.strip()

    return {
        "action_type": "replace_file",
        "file": "DESIGN_RULES.md",
        "new_content": new_content,
    }


def run_optimization_loop(server_url: str = SERVER_URL, max_steps: int = 7) -> None:
    """
    Run a full optimization episode using the LLM optimizer.

    This mirrors the historical Skill Forge loop but driven through the
    OpenEnv HTTP interface.
    """
    anthropic_client = anthropic.Anthropic()

    with SlideSkillClient(base_url=server_url) as client:
        session_id = client.reset()
        print(f"Session: {session_id}")

        # The first step must run against the baseline skill files, so we
        # submit a no-op action: replace EXAMPLES.md with its baseline
        # placeholder content. This triggers a generate + evaluate cycle
        # while leaving the baseline DESIGN_RULES.md untouched.
        # Alternatively, the server could expose a generate-only endpoint.
        print("Step 0: Generating baseline slide...")
        obs = client.step(
            session_id,
            {
                "action_type": "replace_file",
                "file": "EXAMPLES.md",
                "new_content": "(Empty — no prior optimization rounds)\n",
            },
        )
        print(f"  Baseline score: {obs.total}/100 — {obs.one_line_verdict}")

        for step_idx in range(1, max_steps + 1):
            if obs.done:
                print("Episode complete.")
                break

            print(f"\nStep {step_idx}: Calling optimizer LLM...")
            action = call_optimizer_llm(obs, anthropic_client)
            obs = client.step(session_id, action)

            print(
                f"  Score: {obs.total}/100 (reward: {obs.reward:+.3f}) "
                f"— {obs.one_line_verdict}"
            )
            print(f"  Weaknesses: {'; '.join(obs.weaknesses[:2])}")

        client.close(session_id)
        print(f"\nFinal score: {obs.total}/100")


if __name__ == "__main__":
    run_optimization_loop()
```

---

## 9. OpenEnv Manifest

`openenv/openenv.yaml`

```yaml
# OpenEnv environment manifest for Slide Skill
# https://openenv.dev/spec

name: slide-skill
version: "1.0.0"
description: >
  Self-improving McKinsey-style PowerPoint slide generation environment.
  The agent evolves DESIGN_RULES.md and EXAMPLES.md to maximize a visual
  design score (0-100) evaluated by Claude Opus vision against 5 McKinsey
  reference images.

author: Tesserae / Skill Forge Hackathon Team

supports_concurrent_sessions: true
max_steps: 7

# Approximate time budget per step (seconds).
# Each step: generator LLM (~20-40s) + Node.js (<5s) + LibreOffice (~15-30s)
# + pdftoppm (<5s) + evaluator LLM (~30-60s)
step_timeout_seconds: 180

action_space:
  type: union
  discriminator: action_type
  variants:
    - name: edit_section
      description: Replace the body of a named section in a skill file.
      fields:
        file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
        section_heading: {type: string, description: "Exact heading text without # markers"}
        new_body: {type: string, description: "New section body content in markdown"}

    - name: replace_file
      description: Replace the entire content of a skill file.
      fields:
        file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
        new_content: {type: string, description: "Complete new file content"}

observation_space:
  scores:
    background_layout: {type: integer, min: 0, max: 15}
    color_palette: {type: integer, min: 0, max: 15}
    typography: {type: integer, min: 0, max: 15}
    title_quality: {type: integer, min: 0, max: 15}
    data_presentation: {type: integer, min: 0, max: 15}
    structural_elements: {type: integer, min: 0, max: 15}
    overall_impression: {type: integer, min: 0, max: 10}
  total: {type: integer, min: 0, max: 100}
  strengths: {type: array, items: string}
  weaknesses: {type: array, items: string}
  one_line_verdict: {type: string}
  reward: {type: float, min: -0.3, max: 0.3}
  step: {type: integer}
  done: {type: boolean}
  jpg_path: {type: string, description: "Absolute path to generated slide JPG"}
  design_rules_content: {type: string}
  examples_content: {type: string}

reward:
  description: >
    Normalized score delta vs. previous step, capped to [-0.3, +0.3].
    Formula: clip(total_score - prev_total_score, -30, +30) / 100
  range: [-0.3, 0.3]

baseline:
  description: >
    skill_files_baseline/ committed to the repo contains the minimal
    starting DESIGN_RULES.md (teal palette, basic typography) and an
    empty EXAMPLES.md. This is skill_v0 content — NOT any evolved version.

endpoints:
  reset: POST /reset
  step: POST /step
  close: DELETE /sessions/{session_id}
  health: GET /health

server:
  host: 0.0.0.0
  port: 8000
  workers: 1  # Do not increase; LibreOffice is not thread-safe

environment_variables:
  required:
    - name: ANTHROPIC_API_KEY
      description: Anthropic API key for Claude generator and evaluator
  optional:
    - name: SLIDE_SKILL_MAX_STEPS
      description: Override default max_steps (default 7)
      default: "7"
```

---

## 10. Dockerfile

`openenv/Dockerfile`

```dockerfile
# Slide Skill OpenEnv — Docker image
#
# Layer sizes (approximate):
#   python:3.12-slim base:    ~130 MB
#   Node.js 20 + pptxgenjs:   ~200 MB
#   LibreOffice:              ~500 MB  <-- dominant cost
#   poppler-utils (pdftoppm):  ~30 MB
#   Python deps:               ~80 MB
#   Total compressed:         ~600-700 MB
#
# LibreOffice is the unavoidable bottleneck. It is required to convert
# .pptx → .pdf. There is no lighter alternative that handles pptxgenjs
# output faithfully.

FROM python:3.12-slim

LABEL description="Slide Skill OpenEnv — McKinsey PPT generation environment"

# System dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    # LibreOffice for .pptx → .pdf conversion
    libreoffice \
    # poppler-utils provides pdftoppm (.pdf → .jpg)
    poppler-utils \
    # Node.js 20 LTS via NodeSource
    curl \
    ca-certificates \
    gnupg \
    && curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    && apt-get install -y nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Verify tools are available
RUN node --version && npm --version && soffice --version && pdftoppm -v 2>&1 | head -1

WORKDIR /app

# Install pptxgenjs (Node.js dependency)
COPY package.json ./
RUN npm install --production

# Install Python dependencies
COPY pyproject.toml ./
RUN pip install --no-cache-dir -e ".[server]"

# Copy application code
COPY pptx/ ./pptx/
COPY skill_files_baseline/ ./skill_files_baseline/
COPY output/TASK_PROMPT.md ./output/TASK_PROMPT.md
COPY output/reference/ ./output/reference/
COPY openenv/ ./openenv/

WORKDIR /app/openenv

# LibreOffice needs a writable user profile directory; HOME=/tmp provides
# one. Per-session HOME overrides (see Design Decisions) prevent concurrent
# profile conflicts.
ENV HOME=/tmp
ENV SAL_USE_VCLPLUGIN=svp

EXPOSE 8000

# Single worker — LibreOffice is not thread-safe within one process.
# Concurrent sessions are handled by per-session /tmp/ directories,
# but LibreOffice calls must be serialized (or use process-level locking
# if scaling to multiple Gunicorn workers is required in the future).
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
```

---

## 11. Implementation Task Order

### Phase 1 — Foundation (no external dependencies)
1. Commit `skill_files_baseline/` to repo (copy `output/skill_v0/` content, verify EXAMPLES.md is truly minimal).
2. Implement `models.py` — pure Pydantic, no I/O.
3. Implement `skill_manager.py` — file I/O only, no LLM calls. Write unit tests with a tmp directory.
4. Implement `evaluator_adapter.py` — port the `evaluate_slide()` function from `output/evaluator.py`. Test against a known slide JPG and verify JSON matches expected structure.

### Phase 2 — Pipeline Integration
5. Implement `slide_generator.py` — integrate LLM call + subprocess chain. Test the four subprocess stages independently before wiring together.
6. Implement `slide_skill_environment.py` — wire `SkillManager` + `SlideGenerator` + `EvaluatorAdapter`. Test `reset()` creates isolated `/tmp/` dirs and `close()` removes them.

### Phase 3 — Server & Client
7. Implement `app.py` — FastAPI wrapper. Test `/health`, `/reset`, `/step` sequence with a minimal dummy action.
8. Implement `client.py` — test against the live server. Confirm the optimizer LLM loop produces an observation with improving scores.

### Phase 4 — Containerization
9. Write `Dockerfile`. Build and verify all four pipeline stages work inside the container.
10. Write `openenv.yaml`. Validate against the OpenEnv manifest schema.
11. Push to HuggingFace Spaces. Verify a full episode (7 steps) completes within resource limits.

### Phase 5 — Hardening
12. Add per-session LibreOffice locking if running >1 Uvicorn worker.
13. Add timeout handling in `_run()` and surface timeouts as proper HTTP 504 responses.
14. Add structured logging (JSON lines) so HuggingFace Spaces logs are parseable.

**Critical dependency note**: Phase 2 cannot start until Phase 1 is complete. Phase 3 cannot start until Phase 2 is stable. Phase 5 is optional for a hackathon demo but recommended for production.

---

## 12. Design Decisions and Rationale

### Per-Session Isolation vs. No-Concurrency

The original plan set `SUPPORTS_CONCURRENT_SESSIONS = False`. This is safe but prevents any parallel evaluation runs, making HuggingFace Spaces single-threaded even though the hardware could handle more.

The better approach is per-session file isolation: on `reset()`, copy both skill files into `/tmp/slide_skill_{session_id}/`. Each session's `generate.js`, `.pptx`, `.pdf`, and `.jpg` are written there too. Sessions never touch each other's files.

The one caveat is LibreOffice: `soffice` is not thread-safe when called concurrently from the same OS user. Options: (a) serialize LibreOffice calls with an `asyncio.Lock`, or (b) give each session a unique LibreOffice user profile by setting `HOME=/tmp/soffice_{session_id}` in the subprocess environment. Option (b) is simpler and is what the Dockerfile's `ENV HOME=/tmp` partially enables.

### Dual Action Types

The historical optimizer LLM rewrites the entire `DESIGN_RULES.md` in each round — it does not do surgical section edits. `ReplaceFileAction` matches this behavior exactly and makes the action space natural for an LLM optimizer.

`EditSectionAction` is retained because: (a) it is more token-efficient for small targeted changes, (b) it enables gradient-like optimization where an RL agent changes one dimension at a time, and (c) it is a cleaner action space for non-LLM optimizers (e.g., evolutionary algorithms).

Using a Pydantic discriminated union keeps the API clean: a single `action` field, type-safe dispatch in `SkillManager.apply()`, and automatic OpenAPI schema generation.
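The discriminated union can be sketched in a few lines of Pydantic v2. This is an illustrative model with field names taken from the manifest; the real definitions live in `openenv/models.py`:

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class EditSectionAction(BaseModel):
    action_type: Literal["edit_section"] = "edit_section"
    file: Literal["DESIGN_RULES.md", "EXAMPLES.md"]
    section_heading: str
    new_body: str

class ReplaceFileAction(BaseModel):
    action_type: Literal["replace_file"] = "replace_file"
    file: Literal["DESIGN_RULES.md", "EXAMPLES.md"]
    new_content: str

# Pydantic dispatches on the "action_type" field automatically.
SlideSkillAction = Annotated[
    Union[EditSectionAction, ReplaceFileAction],
    Field(discriminator="action_type"),
]

action = TypeAdapter(SlideSkillAction).validate_python(
    {"action_type": "replace_file", "file": "EXAMPLES.md", "new_content": "..."}
)
```

An unknown `action_type` fails validation with a clear error, which is what makes a single `action` field safe at the API boundary.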
|
| 1653 |
+
### Why We Don't Evolve the Generic pptx Skill
|
| 1654 |
+
|
| 1655 |
+
The files in `pptx/` (SKILL.md, editing.md, pptxgenjs.md) are the agent's API reference for using pptxgenjs. They are analogous to a standard library — stable, general-purpose, and not brand-specific. Evolving them would be like optimizing stdlib for one application.
|
| 1656 |
+
|
| 1657 |
+
The brand-specific optimization target is `DESIGN_RULES.md` + `EXAMPLES.md`. These encode McKinsey visual grammar: what colors, what typography, where to put structural elements, what titles should say. This separation is what makes the loop generalizable: swap in a different task prompt + reference images + baseline skill files, and the same environment can optimize slides for any brand.
|
| 1658 |
+
|
| 1659 |
+
### LibreOffice as the Bottleneck
|
| 1660 |
+
|
| 1661 |
+
LibreOffice adds ~500 MB to the Docker image and ~15–30 seconds per step. There is no lighter alternative that faithfully renders pptxgenjs output to PDF. Headless Chrome can render HTML but not .pptx. The pptxgenjs team does not offer a built-in PDF export.
|
| 1662 |
+
|
| 1663 |
+
Accept LibreOffice as a hard dependency. Optimize around it by: (a) keeping the Docker layer cached (don't change its installation order), (b) pre-warming LibreOffice on server startup with a dummy convert, (c) setting a 60-second timeout on the LibreOffice subprocess and surfacing timeout as a step error rather than hanging.
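
Point (c) can be sketched with the standard library alone; the helper name here is an assumption, not code from this repo.

```python
import subprocess

def run_with_timeout(cmd: list[str], timeout_s: float = 60.0) -> tuple[bool, str]:
    """Run a conversion subprocess; a hang becomes a step error, not a stuck server."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"conversion timed out after {timeout_s:.0f}s"
    if proc.returncode != 0:
        return False, proc.stderr.strip() or f"exit code {proc.returncode}"
    return True, ""
```

The `(ok, error_message)` pair maps directly onto a step-level error field in the observation, so a stuck `soffice` costs at most one timeout rather than wedging the worker.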

### Reward = Score Delta Capped at [-0.3, +0.3]

The evaluator is an LLM (Gemini 3.1 Pro with vision). LLM evaluators have shot noise: the same slide evaluated twice may score 87 one time and 91 the next. If we used the raw score delta as reward, a noise swing of +4 would look like a significant improvement. Capping at ±30 points (±0.3 normalized) means noise within ±5 points produces a small reward signal rather than a large one.

The cap is soft for genuine improvements: going from 60→90 in one step (unusual but possible) gives reward = +0.3, the same as going from 60→100. This is intentional — we want to reward improvement, not its magnitude, to keep the learning signal stable.
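
The reward rule reduces to a one-line clamp; a minimal sketch, with the function name assumed for illustration:

```python
def compute_reward(prev_score: int, new_score: int, cap: float = 0.3) -> float:
    """Normalized score delta, clamped to [-cap, +cap] to damp evaluator noise."""
    delta = (new_score - prev_score) / 100.0
    return max(-cap, min(cap, delta))
```

A noise-scale swing of 87→91 yields only 0.04, while any jump of 30+ points saturates at the cap.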

### EXAMPLES.md Grows Over Time

In the historical loop, `EXAMPLES.md` accumulated guidance across rounds — by v4, it referenced v3 and v4 issues explicitly. On `reset()`, we restore to the true `skill_v0` baseline: an empty EXAMPLES.md. This is intentional. The optimizer must re-learn from the evaluator feedback each episode, which is the right behavior for RL. If you want warm-started episodes, implement a separate "curriculum baseline" and pass it as an optional `reset(skill_version="v3")` parameter.
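
One way such a warm-start hook could look, assuming one committed directory per curriculum baseline (the helper name and the directory naming convention below are hypothetical, not part of the current repo):

```python
from pathlib import Path

def pick_baseline_dir(repo_root: Path, skill_version: str = "v0") -> Path:
    """Resolve which committed baseline reset() copies from. "v0" is the true
    empty baseline; other versions would be optional warm-start curriculum
    baselines (hypothetical directory naming)."""
    if skill_version == "v0":
        return repo_root / "skill_files_baseline"
    return repo_root / f"skill_files_baseline_{skill_version}"
```

`reset(skill_version="v3")` would then copy from the v3 snapshot instead of the empty baseline, without changing the rest of the episode logic.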

---

## 13. Dependencies

`pyproject.toml`

```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "slide-skill-openenv"
version = "1.0.0"
description = "OpenEnv environment for McKinsey-style PowerPoint slide optimization"
requires-python = ">=3.12"

# Core runtime dependencies (required for the environment to run)
dependencies = [
    "google-genai>=1.0.0",    # Gemini API client (generator, evaluator, optimizer)
    "pydantic>=2.6.0",        # Data models with discriminated unions
    "httpx>=0.27.0",          # HTTP client for client.py
    "python-dotenv>=1.0.0",   # .env loading in app.py and client.py
    "loguru>=0.7.0",          # Structured logging in client.py
]

[project.optional-dependencies]
# Server dependencies (required for app.py)
server = [
    "fastapi>=0.111.0",
    "uvicorn[standard]>=0.30.0",
    "python-multipart>=0.0.9",  # FastAPI form parsing
]

# Development and testing
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "httpx>=0.27.0",  # for FastAPI TestClient
    "ruff>=0.4.0",
    "mypy>=1.10.0",
]

[tool.hatch.build.targets.wheel]
packages = ["openenv"]

[tool.ruff]
target-version = "py312"
line-length = 88

[tool.ruff.lint]
select = ["E", "F", "I", "UP"]

[tool.mypy]
python_version = "3.12"
strict = true
ignore_missing_imports = true
```
openenv/Dockerfile
ADDED

# Slide Skill OpenEnv — Docker image
#
# Layer sizes (approximate):
#   python:3.12-slim base:    ~130 MB
#   Node.js 20 + pptxgenjs:   ~200 MB
#   LibreOffice:              ~500 MB  <-- dominant cost; unavoidable for .pptx → .pdf
#   poppler-utils (pdftoppm):  ~30 MB
#   Python deps:               ~80 MB
# Total compressed: ~600-700 MB
#
# LibreOffice is the unavoidable bottleneck. It is required to convert
# .pptx → .pdf. There is no lighter alternative that handles pptxgenjs
# output faithfully.

FROM python:3.12-slim

LABEL description="Slide Skill OpenEnv — McKinsey PPT generation environment"

# System dependencies — installed in one RUN to minimize layers.
RUN apt-get update && apt-get install -y --no-install-recommends \
    # LibreOffice for .pptx → .pdf conversion
    libreoffice \
    # poppler-utils provides pdftoppm (.pdf → .jpg)
    poppler-utils \
    # Node.js 20 LTS via NodeSource
    curl \
    ca-certificates \
    gnupg \
    && curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    && apt-get install -y nodejs \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Verify all required tools are available at build time.
RUN node --version && npm --version && soffice --version && pdftoppm -v 2>&1 | head -1

WORKDIR /app

# Install pptxgenjs (Node.js dependency) — copy package.json first for layer caching.
COPY package.json package-lock.json* ./
RUN npm install --production

# Install Python dependencies. The editable install builds the "openenv"
# package, so the package source must be present before pip runs; dependency
# changes still hit the cache via pyproject.toml.
COPY pyproject.toml ./
COPY openenv/ ./openenv/
RUN pip install --no-cache-dir -e ".[server]"

# Copy task data and reference assets.
COPY pptx/ ./pptx/
COPY skill_files_baseline/ ./skill_files_baseline/
COPY output/TASK_PROMPT.md ./output/TASK_PROMPT.md
COPY output/reference/ ./output/reference/

WORKDIR /app/openenv

# LibreOffice needs a writable user profile directory.
# Setting HOME=/tmp gives each process its own profile path and avoids
# concurrent session conflicts with the LibreOffice lock files.
ENV HOME=/tmp
# Use the headless VCL plugin (no display required).
ENV SAL_USE_VCLPLUGIN=svp

EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Single worker — LibreOffice subprocess calls must be serialized within one
# OS process. Concurrent sessions are handled by per-session /tmp/ directories.
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

openenv/app.py
ADDED

"""
FastAPI server for the Slide Skill OpenEnv environment.

Endpoints follow the OpenEnv HTTP protocol:
    POST   /reset                  → initialize or restart a session
    POST   /step                   → apply an action and return observation
    DELETE /sessions/{session_id}  → clean up a session
    GET    /health                 → liveness check

The server is stateful: environment instances are kept in memory.
Use a single Uvicorn worker (--workers 1) since LibreOffice is not
thread-safe when called concurrently from the same process.
"""

from __future__ import annotations

import logging
import traceback
from contextlib import asynccontextmanager
from pathlib import Path
from typing import Annotated, Any

import uvicorn
from dotenv import load_dotenv
from fastapi import Body, FastAPI, HTTPException
from fastapi import Path as PathParam  # aliased so pathlib.Path is not shadowed
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load .env from the repo root (one level up from openenv/) before importing
# modules that may read environment variables at import time.
load_dotenv(Path(__file__).parent.parent / ".env")

from models import SlideSkillAction, SlideSkillObservation  # noqa: E402
from slide_skill_environment import SlideSkillEnvironment  # noqa: E402


# Single shared environment instance. Sessions are isolated at the file
# level, so this is safe for concurrent requests.
_env: SlideSkillEnvironment | None = None


@asynccontextmanager
async def lifespan(app: FastAPI):  # type: ignore[type-arg]
    global _env
    _env = SlideSkillEnvironment()
    yield
    _env = None


app = FastAPI(
    title="Slide Skill OpenEnv",
    description=(
        "OpenEnv-compatible environment for optimizing McKinsey-style "
        "PowerPoint slides by evolving DESIGN_RULES.md and EXAMPLES.md."
    ),
    lifespan=lifespan,
)


class ResetRequest(BaseModel):
    session_id: str | None = None


class ResetResponse(BaseModel):
    session_id: str
    message: str


class StepRequest(BaseModel):
    session_id: str
    action: SlideSkillAction


@app.post("/reset", response_model=ResetResponse)
async def reset(
    request: ResetRequest = Body(default=ResetRequest()),
) -> ResetResponse:
    """Initialize or restart an optimization session."""
    assert _env is not None
    session_id = _env.reset(session_id=request.session_id)
    return ResetResponse(
        session_id=session_id,
        message=f"Session {session_id} initialized with baseline skill files.",
    )


@app.post("/step", response_model=SlideSkillObservation)
async def step(request: StepRequest) -> SlideSkillObservation:
    """Apply an action to the session and return the resulting observation."""
    assert _env is not None
    try:
        observation = _env.step(
            session_id=request.session_id,
            action=request.action,
        )
    except KeyError as exc:
        raise HTTPException(
            status_code=404,
            detail=f"Session {request.session_id!r} not found. Call /reset first.",
        ) from exc
    except (RuntimeError, ValueError) as exc:
        logger.error("Step failed:\n%s", traceback.format_exc())
        raise HTTPException(status_code=500, detail=str(exc)) from exc
    return observation


@app.delete("/sessions/{session_id}")
async def close_session(
    session_id: Annotated[str, PathParam(description="Session ID to clean up.")],
) -> dict[str, Any]:
    """Clean up session resources (deletes /tmp/ working directory)."""
    assert _env is not None
    try:
        _env.close(session_id)
    except KeyError as exc:
        raise HTTPException(
            status_code=404,
            detail=f"Session {session_id!r} not found.",
        ) from exc
    return {"message": f"Session {session_id} closed."}


@app.get("/health")
async def health() -> dict[str, Any]:
    return {"status": "ok", "supports_concurrent_sessions": True}


if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, workers=1)

openenv/client.py
ADDED

"""
Reference client for the Slide Skill OpenEnv server.

Demonstrates how an optimizer agent would interact with the environment:
1. Reset to get a session ID.
2. Submit the baseline action (no-op replace to trigger generation).
3. Call the LLM optimizer using the observation feedback.
4. Submit the improved DESIGN_RULES.md as a ReplaceFileAction.
5. Repeat until done=True.

This client is also useful for smoke-testing the server without a full agent.

Usage:
    # Smoke test (single step, no optimizer LLM):
    python client.py --smoke-test

    # Full optimization loop:
    python client.py --server http://localhost:8000 --max-steps 7
"""

from __future__ import annotations

import argparse
import os
import textwrap
from pathlib import Path
from typing import Any

import httpx
from dotenv import load_dotenv
from google import genai
from google.genai import types
from loguru import logger

from models import SlideSkillObservation

load_dotenv(Path(__file__).parent.parent / ".env")

SERVER_URL = "http://localhost:8000"
OPTIMIZER_MODEL = "gemini-3.1-pro-preview"

BASELINE_EXAMPLES_CONTENT = "(Empty — no prior optimization rounds)\n"


class SlideSkillClient:
    """HTTP client for the Slide Skill OpenEnv server."""

    def __init__(self, base_url: str = SERVER_URL) -> None:
        self.base_url = base_url.rstrip("/")
        self._http = httpx.Client(timeout=300.0)  # long timeout for pipeline stages

    def reset(self, session_id: str | None = None) -> str:
        """Start a new session. Returns the session_id."""
        payload: dict[str, Any] = {}
        if session_id:
            payload["session_id"] = session_id
        resp = self._http.post(f"{self.base_url}/reset", json=payload)
        resp.raise_for_status()
        return resp.json()["session_id"]

    def step(self, session_id: str, action: dict[str, Any]) -> SlideSkillObservation:
        """
        Apply an action and return the observation.

        Args:
            session_id: Active session ID.
            action: Dict matching the EditSectionAction or ReplaceFileAction
                schema. Must include the "action_type" key.
        """
        payload = {"session_id": session_id, "action": action}
        resp = self._http.post(f"{self.base_url}/step", json=payload)
        if not resp.is_success:
            raise RuntimeError(f"Step failed ({resp.status_code}): {resp.text}")
        return SlideSkillObservation.model_validate(resp.json())

    def close(self, session_id: str) -> None:
        """Clean up the session."""
        resp = self._http.delete(f"{self.base_url}/sessions/{session_id}")
        resp.raise_for_status()

    def __enter__(self) -> SlideSkillClient:
        return self

    def __exit__(self, *_: Any) -> None:
        self._http.close()


# ---------------------------------------------------------------------------
# Optimizer agent (reference implementation)
# ---------------------------------------------------------------------------


def call_optimizer_llm(
    obs: SlideSkillObservation,
    gemini_client: genai.Client,
) -> dict[str, Any]:
    """
    Call the optimizer LLM to generate a new DESIGN_RULES.md based on
    the evaluation feedback.

    Returns a dict suitable for the step() action parameter.
    Uses ReplaceFileAction since the historical optimizer rewrites
    the file wholesale.
    """
    prompt = textwrap.dedent(f"""\
        You are a McKinsey slide design optimizer. You are improving a
        PowerPoint generation skill by rewriting its DESIGN_RULES.md file.

        ## Current Score: {obs.total}/100

        ## Score Breakdown
        - background_layout: {obs.scores.background_layout}/15
        - color_palette: {obs.scores.color_palette}/15
        - typography: {obs.scores.typography}/15
        - title_quality: {obs.scores.title_quality}/15
        - data_presentation: {obs.scores.data_presentation}/15
        - structural_elements: {obs.scores.structural_elements}/15
        - overall_impression: {obs.scores.overall_impression}/10

        ## Evaluator Feedback
        Strengths:
        {chr(10).join(f'- {s}' for s in obs.strengths)}

        Weaknesses:
        {chr(10).join(f'- {w}' for w in obs.weaknesses)}

        Verdict: {obs.one_line_verdict}

        ## Current DESIGN_RULES.md
        {obs.design_rules_content}

        ## Current EXAMPLES.md
        {obs.examples_content}

        Your task:
        Write an improved DESIGN_RULES.md that addresses the weaknesses above
        while preserving what works well. Focus on the dimensions with the
        lowest scores. Output ONLY the markdown file content — no explanation,
        no code fences.
        """)

    response = gemini_client.models.generate_content(
        model=OPTIMIZER_MODEL,
        contents=prompt,
        config=types.GenerateContentConfig(max_output_tokens=4096),
    )

    new_content = (response.text or "").strip()

    return {
        "action_type": "replace_file",
        "file": "DESIGN_RULES.md",
        "new_content": new_content,
    }


def run_optimization_loop(server_url: str = SERVER_URL, max_steps: int = 7) -> None:
    """
    Run a full optimization episode using the LLM optimizer.

    This mirrors the historical Skill Forge loop but is driven through the
    OpenEnv HTTP interface.
    """
    gemini_client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    with SlideSkillClient(base_url=server_url) as client:
        logger.info(f"Starting optimization loop (max {max_steps} steps) | server={server_url}")
        session_id = client.reset()
        logger.info(f"Session: {session_id}")

        # Step 0: baseline — a no-op EXAMPLES.md replace triggers the full
        # generate → render → evaluate pipeline with unmodified skill files.
        logger.info("Step 0/baseline | running pipeline (generate → render → evaluate)...")
        obs = client.step(
            session_id,
            {
                "action_type": "replace_file",
                "file": "EXAMPLES.md",
                "new_content": BASELINE_EXAMPLES_CONTENT,
            },
        )
        logger.info(f"Step 0/baseline | score={obs.total}/100 — {obs.one_line_verdict}")

        for step_idx in range(1, max_steps + 1):
            if obs.done:
                logger.info("Episode complete (environment signaled done).")
                break

            logger.info(f"Step {step_idx}/{max_steps} | optimizing skill files (Pro)...")
            action = call_optimizer_llm(obs, gemini_client)
            logger.info(f"Step {step_idx}/{max_steps} | running pipeline (generate → render → evaluate)...")
            obs = client.step(session_id, action)

            delta_str = f"{obs.reward * 100:+.0f} pts"
            logger.info(f"Step {step_idx}/{max_steps} | score={obs.total}/100 ({delta_str}) — {obs.one_line_verdict}")
            if obs.weaknesses:
                logger.info(f"Step {step_idx}/{max_steps} | top weakness: {obs.weaknesses[0]}")

        client.close(session_id)
        logger.success(f"Done. Final score: {obs.total}/100")


def smoke_test(server_url: str = SERVER_URL) -> None:
    """Run a single reset + step to verify the server is working."""
    with SlideSkillClient(base_url=server_url) as client:
        logger.info("Smoke test: resetting session...")
        session_id = client.reset()
        logger.info(f"Smoke test: session_id={session_id}")

        logger.info("Smoke test: submitting baseline action (full pipeline)...")
        obs = client.step(
            session_id,
            {
                "action_type": "replace_file",
                "file": "EXAMPLES.md",
                "new_content": BASELINE_EXAMPLES_CONTENT,
            },
        )
        logger.info(f"Smoke test: score={obs.total}/100 reward={obs.reward:+.3f} done={obs.done}")
        logger.info(f"Smoke test: verdict: {obs.one_line_verdict}")

        client.close(session_id)
        logger.success("Smoke test passed.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Slide Skill OpenEnv client")
    parser.add_argument("--server", default=SERVER_URL, help="Server base URL")
    parser.add_argument("--max-steps", type=int, default=7, help="Max optimization steps")
    parser.add_argument(
        "--smoke-test",
        action="store_true",
        help="Run a single-step smoke test instead of the full loop",
    )
    args = parser.parse_args()

    if args.smoke_test:
        smoke_test(server_url=args.server)
    else:
        run_optimization_loop(server_url=args.server, max_steps=args.max_steps)

openenv/evaluator_adapter.py
ADDED

"""
Evaluator Adapter — wraps the existing output/evaluator.py logic as a
reusable module with a clean interface.

This module does NOT import output/evaluator.py (which has a __main__ guard
and hardcoded paths). Instead, it re-implements the core evaluate_slide()
logic with:
  - Configurable reference image paths
  - A return type that includes all seven score keys, strengths, weaknesses,
    and one_line_verdict
  - No file I/O side effects (no evaluation_results.json written)

The evaluation prompt is identical to output/evaluator.py so scores are
comparable across the historical runs and the OpenEnv loop.

Note on Gemini vs. Anthropic image handling:
    Gemini's SDK accepts image bytes directly via types.Part.from_bytes(),
    so base64 encoding is not needed here (unlike the Anthropic SDK).
"""

from __future__ import annotations

import json
import os
import re
from pathlib import Path

from google import genai
from google.genai import types


# Must match output/evaluator.py exactly so historical scores are comparable.
EVALUATION_SYSTEM_PROMPT = """You are an expert McKinsey & Company slide design evaluator.

You will be shown:
1. REFERENCE IMAGES: 5 pages from a real McKinsey & Company consulting deck (Chilean Hydrogen Pathway, December 2020). These represent the gold standard for visual style.
2. CANDIDATE SLIDE: A programmatically generated PowerPoint slide about Dutch Hydrogen Strategy, rendered as a JPEG image.

Your job: Score how closely the CANDIDATE SLIDE matches the McKinsey visual style shown in the REFERENCE IMAGES.

## Scoring Rubric (100 points total)

### 1. Background & Base Layout (0-15 points)
- McKinsey content/data slides use WHITE backgrounds (dark navy is ONLY for section dividers/covers)
- Clean margins (~0.5" all sides)
- No unnecessary visual clutter
- 15: White bg, clean margins, professional spacing
- 10: White bg but spacing issues
- 5: Wrong background color or major layout problems
- 0: Completely wrong base (e.g., dark bg for data slide)

### 2. Color Palette Fidelity (0-15 points)
- McKinsey uses a RESTRAINED palette: navy/dark blue (#0C2340-ish), white, light greys
- Accent colors are used SPARINGLY — typically just 1-2 accent colors max
- NO rainbow effects, no bright multi-color schemes
- Crimson/red used only for thin divider lines, not large elements
- 15: Matches McKinsey's restrained navy/white/grey palette perfectly
- 10: Mostly correct but 1-2 color choices off
- 5: Too many colors or wrong color family
- 0: Completely different color scheme

### 3. Typography (0-15 points)
- Title: Large, bold, black or very dark, left-aligned (Georgia or similar serif for titles)
- Body: Clean sans-serif (Calibri-like), smaller, grey or dark grey
- Clear size hierarchy: title >> subtitle >> body >> footnotes
- No decorative fonts
- 15: Perfect type hierarchy matching McKinsey
- 10: Good hierarchy but font choices slightly off
- 5: Weak hierarchy or wrong fonts
- 0: No clear hierarchy

### 4. Title Quality — "So-What" Style (0-15 points)
- McKinsey titles state a CONCLUSION or INSIGHT, not just a topic
- GOOD: "The Netherlands aims to become Europe's green hydrogen hub, scaling from 500 MW to 3-4 GW by 2030"
- BAD: "Dutch Hydrogen Strategy (2020-2035)" or "Roadmap Overview"
- The title should tell you the key takeaway without reading the slide
- 15: Clear insight-driven conclusion title
- 10: Partial insight (has some specifics but reads more like a topic)
- 5: Pure topic label
- 0: Generic or missing title

### 5. Data Presentation (0-15 points)
- McKinsey uses structured TABLES for data (not floating stat callouts)
- Tables have: navy header borders (top + bottom of header row), light grey row dividers, bold left column labels
- Data should be organized, scannable, center-aligned values
- Key columns/years may be subtly highlighted
- 15: Clean structured table matching McKinsey format
- 10: Has data but format doesn't match McKinsey tables
- 5: Data present but poorly structured (floating callouts, inconsistent format)
- 0: No supporting data

### 6. Structural Elements (0-15 points)
- Thin crimson/red divider line below title area (not touching title — separated by whitespace)
- McKinsey footer: thin rule line + source text (left) + "McKinsey & Company" bold (right) + page number
- Numbered footnotes for data disclaimers
- Source attribution line
- 15: All structural elements present and correctly placed
- 10: Most elements present, minor placement issues
- 5: Missing 2+ structural elements
- 0: No McKinsey structural elements

### 7. Overall Visual Impression (0-10 points)
- Does this FEEL like it came from McKinsey?
- Would a consulting professional find this polished and credible?
- Is it clean, restrained, and authoritative — or busy, colorful, and amateur?
- 10: Indistinguishable from real McKinsey output
- 7: Close but a trained eye spots differences
- 4: Clearly generated/templated but has some McKinsey DNA
- 1: Does not resemble McKinsey at all

## Output Format

Return ONLY a JSON object with this exact structure (no markdown, no code fences):
{
  "scores": {
    "background_layout": <0-15>,
    "color_palette": <0-15>,
    "typography": <0-15>,
    "title_quality": <0-15>,
|
| 120 |
+
"data_presentation": <0-15>,
|
| 121 |
+
"structural_elements": <0-15>,
|
| 122 |
+
"overall_impression": <0-10>
|
| 123 |
+
},
|
| 124 |
+
"total": <sum of all scores, 0-100>,
|
| 125 |
+
"strengths": ["<strength 1>", "<strength 2>", ...],
|
| 126 |
+
"weaknesses": ["<weakness 1>", "<weakness 2>", ...],
|
| 127 |
+
"one_line_verdict": "<one sentence summary>"
|
| 128 |
+
}
|
| 129 |
+
"""
|
| 130 |
+
|
| 131 |
+
EVALUATOR_MODEL = "gemini-3.1-pro-preview"
|
| 132 |
+
|
| 133 |
+
|
| 134 |
+
def _image_part(path: Path) -> types.Part:
|
| 135 |
+
"""Load an image file as a Gemini Part (bytes + mime type)."""
|
| 136 |
+
suffix = path.suffix.lower()
|
| 137 |
+
mime_type = "image/jpeg" if suffix in (".jpg", ".jpeg") else "image/png"
|
| 138 |
+
return types.Part.from_bytes(data=path.read_bytes(), mime_type=mime_type)
|
| 139 |
+
|
| 140 |
+
|
| 141 |
+
class EvaluatorAdapter:
|
| 142 |
+
"""
|
| 143 |
+
Adapter that evaluates a generated slide JPG against McKinsey references.
|
| 144 |
+
|
| 145 |
+
Uses Gemini 3.1 Pro with vision, replicating the evaluation logic from
|
| 146 |
+
output/evaluator.py as a reusable class with no file I/O side effects.
|
| 147 |
+
"""
|
| 148 |
+
|
| 149 |
+
REFERENCE_FILENAMES = [
|
| 150 |
+
"ref-01.jpg",
|
| 151 |
+
"ref-02.jpg",
|
| 152 |
+
"ref-03.jpg",
|
| 153 |
+
"ref-04.jpg",
|
| 154 |
+
"ref-05.jpg",
|
| 155 |
+
]
|
| 156 |
+
|
| 157 |
+
def __init__(self, reference_dir: Path) -> None:
|
| 158 |
+
"""
|
| 159 |
+
Args:
|
| 160 |
+
reference_dir: Directory containing ref-01.jpg through ref-05.jpg.
|
| 161 |
+
"""
|
| 162 |
+
self.reference_dir = reference_dir
|
| 163 |
+
self._client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
|
| 164 |
+
|
| 165 |
+
# Validate reference images exist at construction time.
|
| 166 |
+
missing = [
|
| 167 |
+
f
|
| 168 |
+
for f in self.REFERENCE_FILENAMES
|
| 169 |
+
if not (reference_dir / f).exists()
|
| 170 |
+
]
|
| 171 |
+
if missing:
|
| 172 |
+
raise FileNotFoundError(
|
| 173 |
+
f"Missing reference images in {reference_dir}: {missing}"
|
| 174 |
+
)
|
| 175 |
+
|
| 176 |
+
def evaluate(self, slide_jpg_path: Path) -> dict:
|
| 177 |
+
"""
|
| 178 |
+
Evaluate a generated slide against the McKinsey reference images.
|
| 179 |
+
|
| 180 |
+
Args:
|
| 181 |
+
slide_jpg_path: Absolute path to the slide JPG to evaluate.
|
| 182 |
+
|
| 183 |
+
Returns:
|
| 184 |
+
dict with keys:
|
| 185 |
+
"scores": dict mapping the 7 dimension names to int scores
|
| 186 |
+
"total": int, sum of all scores (0-100)
|
| 187 |
+
"strengths": list[str]
|
| 188 |
+
"weaknesses": list[str]
|
| 189 |
+
"one_line_verdict": str
|
| 190 |
+
|
| 191 |
+
Raises:
|
| 192 |
+
FileNotFoundError: If slide_jpg_path does not exist.
|
| 193 |
+
json.JSONDecodeError: If the LLM returns malformed JSON.
|
| 194 |
+
RuntimeError: If the API call fails.
|
| 195 |
+
"""
|
| 196 |
+
if not slide_jpg_path.exists():
|
| 197 |
+
raise FileNotFoundError(f"Slide JPG not found: {slide_jpg_path}")
|
| 198 |
+
|
| 199 |
+
# Build a flat list of Parts for the Gemini content parameter.
|
| 200 |
+
# Gemini accepts text strings and Part objects interleaved.
|
| 201 |
+
contents: list[types.Part | str] = []
|
| 202 |
+
|
| 203 |
+
# Reference images first.
|
| 204 |
+
contents.append(
|
| 205 |
+
"## REFERENCE IMAGES (Real McKinsey deck)\n"
|
| 206 |
+
"The following 5 images are from a real McKinsey & Company consulting "
|
| 207 |
+
"report. Study their visual style carefully."
|
| 208 |
+
)
|
| 209 |
+
for i, fname in enumerate(self.REFERENCE_FILENAMES, 1):
|
| 210 |
+
contents.append(_image_part(self.reference_dir / fname))
|
| 211 |
+
contents.append(f"(Reference page {i})")
|
| 212 |
+
|
| 213 |
+
# Candidate slide.
|
| 214 |
+
contents.append(
|
| 215 |
+
f"\n## CANDIDATE SLIDE TO EVALUATE\n"
|
| 216 |
+
f"This is the generated slide: {slide_jpg_path.name}"
|
| 217 |
+
)
|
| 218 |
+
contents.append(_image_part(slide_jpg_path))
|
| 219 |
+
contents.append(
|
| 220 |
+
"\nNow score this candidate slide against the McKinsey reference "
|
| 221 |
+
"using the rubric. Return ONLY the JSON object."
|
| 222 |
+
)
|
| 223 |
+
|
| 224 |
+
response = self._client.models.generate_content(
|
| 225 |
+
model=EVALUATOR_MODEL,
|
| 226 |
+
contents=contents,
|
| 227 |
+
config=types.GenerateContentConfig(
|
| 228 |
+
system_instruction=EVALUATION_SYSTEM_PROMPT,
|
| 229 |
+
max_output_tokens=2048,
|
| 230 |
+
),
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
text = response.text.strip()
|
| 234 |
+
|
| 235 |
+
# Extract JSON object robustly (handles markdown fences and surrounding text).
|
| 236 |
+
json_match = re.search(r"\{.*\}", text, re.DOTALL)
|
| 237 |
+
if json_match:
|
| 238 |
+
text = json_match.group(0)
|
| 239 |
+
|
| 240 |
+
result = json.loads(text)
|
| 241 |
+
|
| 242 |
+
# Validate required keys are present.
|
| 243 |
+
required_score_keys = {
|
| 244 |
+
"background_layout",
|
| 245 |
+
"color_palette",
|
| 246 |
+
"typography",
|
| 247 |
+
"title_quality",
|
| 248 |
+
"data_presentation",
|
| 249 |
+
"structural_elements",
|
| 250 |
+
"overall_impression",
|
| 251 |
+
}
|
| 252 |
+
missing_keys = required_score_keys - set(result.get("scores", {}).keys())
|
| 253 |
+
if missing_keys:
|
| 254 |
+
raise ValueError(
|
| 255 |
+
f"Evaluator response missing score keys: {missing_keys}. "
|
| 256 |
+
f"Full response: {text[:500]}"
|
| 257 |
+
)
|
| 258 |
+
|
| 259 |
+
return result
|
openenv/models.py
ADDED
@@ -0,0 +1,187 @@
"""
Pydantic data models for the Slide Skill OpenEnv environment.

Action space:
    SlideSkillAction is a discriminated union of two action types:
    - EditSectionAction: Replace a named section's body in one skill file.
    - ReplaceFileAction: Replace the entire content of one skill file.

    EditSectionAction is appropriate when the agent wants surgical edits
    (e.g., update only the typography section). ReplaceFileAction is used
    when the optimizer rewrites the file wholesale, which is what the
    historical optimizer LLM actually does.

Observation space:
    SlideSkillObservation contains the full evaluator output including all
    seven score dimensions plus qualitative feedback fields.
"""

from __future__ import annotations

from typing import Annotated, Literal

from pydantic import BaseModel, Field


# ---------------------------------------------------------------------------
# Actions
# ---------------------------------------------------------------------------

SkillFile = Literal["DESIGN_RULES.md", "EXAMPLES.md"]
"""The two skill files the optimizer is allowed to modify."""


class EditSectionAction(BaseModel):
    """
    Replace the body of a named markdown section within a skill file.

    The section is identified by its heading text (without the leading #
    characters). The replacement spans from immediately after the heading
    line to (but not including) the next heading of equal or higher level.

    Example:
        action = EditSectionAction(
            file="DESIGN_RULES.md",
            section_heading="Color Palette",
            new_body="- Navy (#0C2340): primary\\n- White: background\\n"
        )
    """

    action_type: Literal["edit_section"] = "edit_section"
    file: SkillFile = Field(..., description="Which skill file to edit.")
    section_heading: str = Field(
        ...,
        description=(
            "Exact heading text (without leading # markers). "
            "Case-sensitive. Must match a heading in the file."
        ),
    )
    new_body: str = Field(
        ...,
        description="New markdown content for the section body (after the heading line).",
    )


class ReplaceFileAction(BaseModel):
    """
    Replace the entire content of a skill file.

    Use this when the optimizer rewrites the file wholesale rather than
    making targeted section edits. This is the mode used by the historical
    optimizer LLM in the Skill Forge loop.
    """

    action_type: Literal["replace_file"] = "replace_file"
    file: SkillFile = Field(..., description="Which skill file to replace.")
    new_content: str = Field(
        ...,
        description="Complete new file content (valid markdown).",
    )


# Discriminated union — action_type is the discriminator field.
SlideSkillAction = Annotated[
    EditSectionAction | ReplaceFileAction,
    Field(discriminator="action_type"),
]


# ---------------------------------------------------------------------------
# Scores
# ---------------------------------------------------------------------------


class SlideScores(BaseModel):
    """Raw scores from the McKinsey evaluator. Each dimension is 0-15 except
    overall_impression, which is 0-10. Total is 0-100."""

    background_layout: int = Field(..., ge=0, le=15)
    color_palette: int = Field(..., ge=0, le=15)
    typography: int = Field(..., ge=0, le=15)
    title_quality: int = Field(..., ge=0, le=15)
    data_presentation: int = Field(..., ge=0, le=15)
    structural_elements: int = Field(..., ge=0, le=15)
    overall_impression: int = Field(..., ge=0, le=10)

    @property
    def total(self) -> int:
        return (
            self.background_layout
            + self.color_palette
            + self.typography
            + self.title_quality
            + self.data_presentation
            + self.structural_elements
            + self.overall_impression
        )


# ---------------------------------------------------------------------------
# Observation
# ---------------------------------------------------------------------------


class SlideSkillObservation(BaseModel):
    """
    Observation returned to the agent after each step.

    Contains the full evaluator output so the optimizer LLM has all the
    information it needs to write the next skill revision: numeric scores,
    qualitative strengths/weaknesses, and the one-line verdict.
    """

    scores: SlideScores
    total: int = Field(..., description="Sum of all score dimensions (0-100).")
    strengths: list[str] = Field(
        default_factory=list,
        description="List of what the slide does well, from the evaluator.",
    )
    weaknesses: list[str] = Field(
        default_factory=list,
        description="List of what needs improvement, from the evaluator.",
    )
    one_line_verdict: str = Field(
        ..., description="Single-sentence summary from the evaluator."
    )
    reward: float = Field(
        ...,
        description=(
            "Score delta vs. previous step, divided by 100 and capped to "
            "[-0.3, +0.3]. Capping reduces reward noise from LLM "
            "evaluation variance."
        ),
    )
    step: int = Field(..., description="Current step index (0-based).")
    done: bool = Field(..., description="True if max_steps reached.")
    jpg_path: str = Field(
        ..., description="Absolute path to the generated slide JPG."
    )
    design_rules_content: str = Field(
        ...,
        description="Current DESIGN_RULES.md content (after action was applied).",
    )
    examples_content: str = Field(
        ...,
        description="Current EXAMPLES.md content (after action was applied).",
    )


# ---------------------------------------------------------------------------
# State (internal, not exposed to client)
# ---------------------------------------------------------------------------


class SlideSkillState(BaseModel):
    """Internal environment state. Not serialized to the client."""

    session_id: str
    step: int = 0
    prev_total: int = 0  # score from the previous step (for reward calculation)
    session_dir: str = Field(
        ...,
        description=(
            "Absolute path to the isolated session directory under /tmp/. "
            "Contains copies of DESIGN_RULES.md and EXAMPLES.md that this "
            "session is free to modify without affecting other sessions."
        ),
    )
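The discriminated union is resolved by pydantic from the `action_type` field at validation time. The mechanics can be illustrated with plain dicts; this is a stdlib-only sketch of the pattern, and `parse_action` is a hypothetical helper, not part of models.py:

```python
# Sketch of discriminator-based dispatch, mirroring how pydantic picks the
# SlideSkillAction variant by its action_type field. Illustrative only.
VARIANTS = {
    "edit_section": {"file", "section_heading", "new_body"},
    "replace_file": {"file", "new_content"},
}


def parse_action(payload: dict) -> str:
    """Return the variant name, or raise if the payload is malformed."""
    kind = payload.get("action_type")
    if kind not in VARIANTS:
        raise ValueError(f"Unknown action_type: {kind!r}")
    missing = VARIANTS[kind] - payload.keys()
    if missing:
        raise ValueError(f"{kind} missing fields: {sorted(missing)}")
    return kind


print(parse_action({
    "action_type": "replace_file",
    "file": "EXAMPLES.md",
    "new_content": "# Examples\n",
}))  # → replace_file
```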
openenv/openenv.yaml
ADDED
@@ -0,0 +1,92 @@
# OpenEnv environment manifest for Slide Skill
# https://openenv.dev/spec

name: slide-skill
version: "1.0.0"
description: >
  Self-improving McKinsey-style PowerPoint slide generation environment.
  The agent evolves DESIGN_RULES.md and EXAMPLES.md to maximize a visual
  design score (0-100) evaluated by Gemini 3.1 Pro vision against 5 McKinsey
  reference images.

author: Tesserae / Skill Forge Hackathon Team

supports_concurrent_sessions: true
max_steps: 7

# Approximate time budget per step (seconds).
# Each step: generator LLM (~20-40s) + Node.js (<5s) + LibreOffice (~15-30s)
# + pdftoppm (<5s) + evaluator LLM (~30-60s)
step_timeout_seconds: 180

action_space:
  type: union
  discriminator: action_type
  variants:
    - name: edit_section
      description: Replace the body of a named section in a skill file.
      fields:
        file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
        section_heading: {type: string, description: "Exact heading text without # markers"}
        new_body: {type: string, description: "New section body content in markdown"}

    - name: replace_file
      description: Replace the entire content of a skill file.
      fields:
        file: {type: string, enum: ["DESIGN_RULES.md", "EXAMPLES.md"]}
        new_content: {type: string, description: "Complete new file content"}

observation_space:
  scores:
    background_layout: {type: integer, min: 0, max: 15}
    color_palette: {type: integer, min: 0, max: 15}
    typography: {type: integer, min: 0, max: 15}
    title_quality: {type: integer, min: 0, max: 15}
    data_presentation: {type: integer, min: 0, max: 15}
    structural_elements: {type: integer, min: 0, max: 15}
    overall_impression: {type: integer, min: 0, max: 10}
  total: {type: integer, min: 0, max: 100}
  strengths: {type: array, items: string}
  weaknesses: {type: array, items: string}
  one_line_verdict: {type: string}
  reward: {type: float, min: -0.3, max: 0.3}
  step: {type: integer}
  done: {type: boolean}
  jpg_path: {type: string, description: "Absolute path to generated slide JPG"}
  design_rules_content: {type: string}
  examples_content: {type: string}

reward:
  description: >
    Normalized score delta vs. previous step, capped to [-0.3, +0.3].
    Formula: clip(total_score - prev_total_score, -30, +30) / 100
  range: [-0.3, 0.3]

baseline:
  description: >
    skill_files_baseline/ committed to the repo contains the minimal
    starting DESIGN_RULES.md (teal palette, basic typography) and an
    empty EXAMPLES.md. This is skill_v0 content — NOT any evolved version.

endpoints:
  reset: POST /reset
  step: POST /step
  close: DELETE /sessions/{session_id}
  health: GET /health

server:
  host: 0.0.0.0
  port: 8000
  workers: 1  # Do not increase; LibreOffice is not thread-safe within one process

environment_variables:
  required:
    - name: GEMINI_API_KEY
      description: >
        Google Gemini API key. Used by all three LLM roles:
        generator (Gemini 3 Flash), evaluator (Gemini 3.1 Pro),
        and optimizer (Gemini 3.1 Pro).
  optional:
    - name: SLIDE_SKILL_MAX_STEPS
      description: Override default max_steps per episode
      default: "7"
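The reward formula in the manifest can be sketched in a few lines; `compute_reward` here is an illustrative stand-in for whatever the environment actually names this step:

```python
def compute_reward(total: int, prev_total: int) -> float:
    """clip(total - prev_total, -30, +30) / 100, per the manifest."""
    delta = max(-30, min(30, total - prev_total))
    return delta / 100


print(compute_reward(62, 48))  # → 0.14
print(compute_reward(95, 20))  # capped at +0.3
print(compute_reward(10, 90))  # capped at -0.3
```

Capping before dividing means a single noisy evaluation swing of more than 30 points cannot dominate the episode's return.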
openenv/skill_manager.py
ADDED
@@ -0,0 +1,103 @@
"""
Skill file manager — applies actions to an isolated session directory.

Operates exclusively on files within session_dir (a /tmp/ path).
Never touches the repo's baseline or any shared files.

Section editing rules:
    A "section" is a markdown heading of any level (# to ######).
    EditSectionAction finds the first heading whose text matches
    section_heading (case-sensitive, stripped), then replaces everything
    from the line after that heading up to (but not including) the next
    heading of equal or higher level (i.e., same or fewer # characters).
    If no next heading is found, the replacement extends to end-of-file.
"""

from __future__ import annotations

import re
from pathlib import Path

from models import EditSectionAction, ReplaceFileAction, SlideSkillAction


class SkillManager:
    """Manages DESIGN_RULES.md and EXAMPLES.md within a session directory."""

    def __init__(self, session_dir: Path) -> None:
        self.session_dir = session_dir

    def apply(self, action: SlideSkillAction) -> None:
        """
        Dispatch to the appropriate handler based on action type.

        Raises:
            ValueError: If action_type is unrecognized or section not found.
            FileNotFoundError: If the target skill file does not exist.
        """
        target = self.session_dir / action.file
        if not target.exists():
            raise FileNotFoundError(f"Skill file not found in session: {target}")

        if action.action_type == "replace_file":
            self._replace_file(target, action)  # type: ignore[arg-type]
        elif action.action_type == "edit_section":
            self._edit_section(target, action)  # type: ignore[arg-type]
        else:
            raise ValueError(f"Unknown action_type: {action.action_type!r}")

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    @staticmethod
    def _replace_file(target: Path, action: ReplaceFileAction) -> None:
        """Overwrite the entire file with new_content."""
        target.write_text(action.new_content, encoding="utf-8")

    @staticmethod
    def _edit_section(target: Path, action: EditSectionAction) -> None:
        """Replace the body of a named markdown section."""
        text = target.read_text(encoding="utf-8")
        lines = text.splitlines(keepends=True)

        # Find the heading line. Compare the heading text stripped, per the
        # section editing rules in the module docstring.
        heading_pattern = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
        wanted = action.section_heading.strip()
        heading_idx: int | None = None
        heading_level: int = 0

        for i, line in enumerate(lines):
            m = heading_pattern.match(line.rstrip("\n\r"))
            if m and m.group(2) == wanted:
                heading_idx = i
                heading_level = len(m.group(1))
                break

        if heading_idx is None:
            raise ValueError(
                f"Section heading {action.section_heading!r} not found in {target.name}."
            )

        # Find where the section body ends (next heading of equal or higher level).
        end_idx = len(lines)
        for i in range(heading_idx + 1, len(lines)):
            m = heading_pattern.match(lines[i].rstrip("\n\r"))
            if m and len(m.group(1)) <= heading_level:
                end_idx = i
                break

        # Reconstruct the file.
        new_body = action.new_body
        if new_body and not new_body.endswith("\n"):
            new_body += "\n"

        new_lines = (
            lines[: heading_idx + 1]  # heading itself
            + [new_body]
            + lines[end_idx:]  # rest of file after the section
        )
        target.write_text("".join(new_lines), encoding="utf-8")

    def read_file(self, filename: str) -> str:
        """Read a skill file from the session directory."""
        return (self.session_dir / filename).read_text(encoding="utf-8")
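The section-replacement rule is worth seeing end to end on a concrete document. This is a self-contained sketch of the same rule on plain strings (no file I/O); `replace_section` is an illustrative helper, not part of the module:

```python
import re


def replace_section(text: str, heading: str, new_body: str) -> str:
    """Replace a section body: everything from the line after the matched
    heading up to the next heading of equal or higher level (or EOF)."""
    pat = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        m = pat.match(line.rstrip("\n"))
        if m and m.group(2) == heading:
            level = len(m.group(1))
            end = len(lines)
            for j in range(i + 1, len(lines)):
                m2 = pat.match(lines[j].rstrip("\n"))
                if m2 and len(m2.group(1)) <= level:
                    end = j
                    break
            return "".join(lines[: i + 1] + [new_body] + lines[end:])
    raise ValueError(f"heading {heading!r} not found")


doc = "# Rules\n## Colors\n- teal\n## Fonts\n- Arial\n"
print(replace_section(doc, "Colors", "- navy\n"))
```

Note that a deeper subheading (`### ...`) inside the section does not terminate it; only a heading at the same or a shallower level does.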
openenv/slide_generator.py
ADDED
@@ -0,0 +1,284 @@
"""
Slide Generator — orchestrates the full PPT generation pipeline.

Pipeline (in order):
    1. LLM reads DESIGN_RULES.md + EXAMPLES.md + TASK_PROMPT.md + pptx/ tooling
       → writes pptxgenjs JavaScript to generate.js in the session output dir.
    2. `node generate.js` runs in the session output dir → produces slide.pptx.
    3. `soffice --headless --convert-to pdf slide.pptx` → slide.pdf.
    4. `pdftoppm -r 150 -jpeg -f 1 -l 1 slide.pdf slide` → slide-1.jpg (page 1).
    5. Returns the Path to slide-1.jpg.

The generator LLM receives the pptx/ tooling files as context so it knows
the full pptxgenjs API — but those files are read-only; they are never
written to or returned in the observation.

Session isolation:
    All generated artifacts (generate.js, slide.pptx, slide.pdf, slide-1.jpg)
    are written into a subdirectory of session_dir so that concurrent sessions
    do not share output paths.
"""

from __future__ import annotations

import os
import re
import shutil
import subprocess
import textwrap
from pathlib import Path

from google import genai
from google.genai import types


REPO_ROOT = Path(__file__).parent.parent

# On macOS, LibreOffice installs to a .app bundle not on PATH by default.
_SOFFICE_MACOS = "/Applications/LibreOffice.app/Contents/MacOS/soffice"
SOFFICE = shutil.which("soffice") or (_SOFFICE_MACOS if Path(_SOFFICE_MACOS).exists() else "soffice")

# On macOS, poppler (pdftoppm) is installed via Homebrew — check both
# Apple Silicon and Intel prefix locations.
PDFTOPPM = (
    shutil.which("pdftoppm")
    or ("/opt/homebrew/bin/pdftoppm" if Path("/opt/homebrew/bin/pdftoppm").exists() else None)
    or ("/usr/local/bin/pdftoppm" if Path("/usr/local/bin/pdftoppm").exists() else None)
    or "pdftoppm"
)

# Gemini Flash: fast and cost-effective for code generation.
GENERATOR_MODEL = "gemini-3-flash-preview"
GENERATOR_MAX_TOKENS = 4096


class SlideGenerator:
    """Drives the LLM → Node.js → LibreOffice → pdftoppm pipeline."""

    def __init__(
        self,
        task_prompt_path: Path,
        pptx_skill_dir: Path,
        reference_dir: Path,
    ) -> None:
        self.task_prompt = task_prompt_path.read_text(encoding="utf-8")
        self.pptx_skill_dir = pptx_skill_dir
        self.reference_dir = reference_dir
        self._client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    def generate(self, session_id: str, session_dir: Path) -> Path:
        """
        Run the full pipeline for one optimization step.

        Args:
            session_id: Used only for logging/naming.
            session_dir: Isolated directory containing the session's
                DESIGN_RULES.md and EXAMPLES.md.

        Returns:
            Absolute path to the generated slide JPG (slide-1.jpg).

        Raises:
            RuntimeError: If any pipeline stage (LLM, Node, LibreOffice,
                pdftoppm) fails.
        """
        out_dir = session_dir / "output"
        out_dir.mkdir(exist_ok=True)

        js_path = out_dir / "generate.js"
        pptx_path = out_dir / "slide.pptx"
        jpg_stem = out_dir / "slide"
        jpg_path = out_dir / "slide-1.jpg"

        # Stage 1+2: LLM generates JS, Node executes it.
        # Retry up to 3 times feeding Node errors back to the LLM.
        node_error: str | None = None
        for attempt in range(1, 4):
            js_code = self._call_generator_llm(session_dir, node_error=node_error)
            js_path.write_text(js_code, encoding="utf-8")
            try:
                self._run(["node", str(js_path)], cwd=out_dir, stage="node generate.js")
                node_error = None
                break
            except RuntimeError as exc:
                node_error = str(exc)
                if attempt == 3:
                    raise
        if not pptx_path.exists():
            raise RuntimeError(
                f"node generate.js completed but {pptx_path} was not created."
            )

        # Stage 3: LibreOffice converts .pptx → .pdf.
        self._run(
            [
                SOFFICE,
                "--headless",
                "--convert-to",
                "pdf",
                "--outdir",
                str(out_dir),
                str(pptx_path),
            ],
            cwd=out_dir,
            stage="soffice --convert-to pdf",
        )
        pdf_path = out_dir / "slide.pdf"
        if not pdf_path.exists():
            raise RuntimeError(
                f"LibreOffice completed but {pdf_path} was not created."
            )

        # Stage 4: pdftoppm converts PDF page 1 → JPG at 150 DPI.
        # Output: slide-1.jpg (pdftoppm appends "-{page}" automatically).
        self._run(
            [
                PDFTOPPM,
                "-r",
                "150",
                "-jpeg",
                "-f",
                "1",
                "-l",
                "1",  # only page 1
                str(pdf_path),
                str(jpg_stem),
            ],
            cwd=out_dir,
            stage="pdftoppm",
        )
        if not jpg_path.exists():
            raise RuntimeError(
                f"pdftoppm completed but {jpg_path} was not created."
            )

        return jpg_path

    # ------------------------------------------------------------------
    # Private helpers
    # ------------------------------------------------------------------

    def _call_generator_llm(self, session_dir: Path, node_error: str | None = None) -> str:
+
"""
|
| 163 |
+
Call the generator LLM with skill files + task prompt as context.
|
| 164 |
+
|
| 165 |
+
Returns the raw JavaScript code string (without markdown fences).
|
| 166 |
+
"""
|
| 167 |
+
design_rules = (session_dir / "DESIGN_RULES.md").read_text(encoding="utf-8")
|
| 168 |
+
examples = (session_dir / "EXAMPLES.md").read_text(encoding="utf-8")
|
| 169 |
+
|
| 170 |
+
# Load the generic pptx tooling files as executor context.
|
| 171 |
+
pptx_skill = self._read_pptx_skill()
|
| 172 |
+
|
| 173 |
+
system_prompt = textwrap.dedent("""\
|
| 174 |
+
You are an expert pptxgenjs developer. You will write a complete,
|
| 175 |
+
runnable Node.js script that generates a PowerPoint slide using
|
| 176 |
+
the pptxgenjs library.
|
| 177 |
+
|
| 178 |
+
Rules:
|
| 179 |
+
- Output ONLY the JavaScript code. No markdown fences, no explanation.
|
| 180 |
+
- The script must save the file as "slide.pptx" in the current directory.
|
| 181 |
+
- Follow the DESIGN_RULES.md and EXAMPLES.md exactly.
|
| 182 |
+
- Use the pptxgenjs API reference below for correct method calls.
|
| 183 |
+
""")
|
| 184 |
+
|
| 185 |
+
user_message = textwrap.dedent(f"""\
|
| 186 |
+
## pptxgenjs API Reference
|
| 187 |
+
{pptx_skill}
|
| 188 |
+
|
| 189 |
+
## Brand Style: DESIGN_RULES.md
|
| 190 |
+
{design_rules}
|
| 191 |
+
|
| 192 |
+
## Brand Style: EXAMPLES.md
|
| 193 |
+
{examples}
|
| 194 |
+
|
| 195 |
+
## Task
|
| 196 |
+
{self.task_prompt}
|
| 197 |
+
|
| 198 |
+
Write the complete pptxgenjs JavaScript file now.
|
| 199 |
+
""")
|
| 200 |
+
|
| 201 |
+
if node_error:
|
| 202 |
+
user_message += textwrap.dedent(f"""
|
| 203 |
+
|
| 204 |
+
## Previous attempt failed — fix these errors
|
| 205 |
+
Your previous script produced the following Node.js error.
|
| 206 |
+
Rewrite the script and fix the issue:
|
| 207 |
+
|
| 208 |
+
{node_error}
|
| 209 |
+
""")
|
| 210 |
+
|
| 211 |
+
response = self._client.models.generate_content(
|
| 212 |
+
model=GENERATOR_MODEL,
|
| 213 |
+
contents=user_message,
|
| 214 |
+
config=types.GenerateContentConfig(
|
| 215 |
+
system_instruction=system_prompt,
|
| 216 |
+
max_output_tokens=GENERATOR_MAX_TOKENS,
|
| 217 |
+
),
|
| 218 |
+
)
|
| 219 |
+
|
| 220 |
+
code = response.text.strip()
|
| 221 |
+
|
| 222 |
+
# Extract from markdown code fence if present (LLMs often add them
|
| 223 |
+
# despite instructions). Handles ```javascript, ```js, or plain ```.
|
| 224 |
+
fence_match = re.search(r"```(?:javascript|js)?\n(.*?)```", code, re.DOTALL)
|
| 225 |
+
if fence_match:
|
| 226 |
+
code = fence_match.group(1).strip()
|
| 227 |
+
|
| 228 |
+
# Rewrite all bare require('pkg') calls to absolute paths so the
|
| 229 |
+
# script works when run from any /tmp/ directory. We only rewrite
|
| 230 |
+
# packages that actually exist in node_modules; unknown packages are
|
| 231 |
+
# left untouched (they'd fail at runtime but at least not silently).
|
| 232 |
+
node_modules = REPO_ROOT / "node_modules"
|
| 233 |
+
|
| 234 |
+
def _rewrite_require(m: re.Match) -> str:
|
| 235 |
+
quote = m.group(1)
|
| 236 |
+
pkg = m.group(2)
|
| 237 |
+
pkg_path = node_modules / pkg
|
| 238 |
+
if pkg_path.exists():
|
| 239 |
+
return f"require({quote}{pkg_path}{quote})"
|
| 240 |
+
return m.group(0) # leave unknown packages as-is
|
| 241 |
+
|
| 242 |
+
code = re.sub(r"require\((['\"])([^./][^'\"]*)\1\)", _rewrite_require, code)
|
| 243 |
+
|
| 244 |
+
# LLMs sometimes emit the require line twice. Keep only the first
|
| 245 |
+
# declaration to avoid "Identifier already declared" SyntaxError.
|
| 246 |
+
seen: set[str] = set()
|
| 247 |
+
deduped = []
|
| 248 |
+
for line in code.splitlines():
|
| 249 |
+
m = re.search(r"require\(['\"]([^'\"]+)['\"]\)", line)
|
| 250 |
+
if m and "node_modules" in line:
|
| 251 |
+
pkg = m.group(1)
|
| 252 |
+
if pkg in seen:
|
| 253 |
+
continue
|
| 254 |
+
seen.add(pkg)
|
| 255 |
+
deduped.append(line)
|
| 256 |
+
code = "\n".join(deduped)
|
| 257 |
+
|
| 258 |
+
return code
|
| 259 |
+
|
| 260 |
+
def _read_pptx_skill(self) -> str:
|
| 261 |
+
"""Concatenate the generic pptx skill files for LLM context."""
|
| 262 |
+
parts = []
|
| 263 |
+
for fname in ("SKILL.md", "editing.md", "pptxgenjs.md"):
|
| 264 |
+
p = self.pptx_skill_dir / fname
|
| 265 |
+
if p.exists():
|
| 266 |
+
parts.append(f"### {fname}\n{p.read_text(encoding='utf-8')}")
|
| 267 |
+
return "\n\n".join(parts)
|
| 268 |
+
|
| 269 |
+
@staticmethod
|
| 270 |
+
def _run(cmd: list[str], cwd: Path, stage: str) -> None:
|
| 271 |
+
"""Run a subprocess; raise RuntimeError with context if it fails."""
|
| 272 |
+
result = subprocess.run(
|
| 273 |
+
cmd,
|
| 274 |
+
cwd=cwd,
|
| 275 |
+
capture_output=True,
|
| 276 |
+
text=True,
|
| 277 |
+
timeout=300, # 5 min hard limit per stage
|
| 278 |
+
)
|
| 279 |
+
if result.returncode != 0:
|
| 280 |
+
raise RuntimeError(
|
| 281 |
+
f"Pipeline stage '{stage}' failed (exit {result.returncode}).\n"
|
| 282 |
+
f"stdout: {result.stdout[-2000:]}\n"
|
| 283 |
+
f"stderr: {result.stderr[-2000:]}"
|
| 284 |
+
)
|
openenv/slide_skill_environment.py
ADDED
@@ -0,0 +1,179 @@
```python
"""
Slide Skill Environment — OpenEnv-compatible environment for optimizing
McKinsey-style PowerPoint slide generation.

Concurrency model:
    SUPPORTS_CONCURRENT_SESSIONS = True

    Each session gets an isolated working directory at /tmp/slide_skill_{session_id}/.
    Skill files (DESIGN_RULES.md, EXAMPLES.md) are copied there on reset() and
    modified in place during the session. The shared repo files are never modified.
    This means multiple sessions can run simultaneously without file conflicts.

    The only shared resource is the Gemini API key, which is rate-limited
    per-account. For HuggingFace Spaces, running 2-3 concurrent sessions is
    realistic before hitting rate limits.

Episode timing:
    Each step involves two LLM calls (generator + evaluator) plus Node.js and
    LibreOffice. Expect 60-120 seconds per step. At max_steps=7, a full episode
    runs 7-14 minutes.

Reward function:
    reward = clip(total_score - prev_total_score, -30, +30) / 100
    Capping at +/-30 points (+/-0.3 reward) dampens LLM evaluation noise. A score
    can fluctuate +/-5-10 points between identical slides due to evaluator variance,
    so capping prevents large undeserved penalties or bonuses.
"""

from __future__ import annotations

import os
import shutil
import uuid
from pathlib import Path
from typing import ClassVar

from models import (
    SlideScores,
    SlideSkillAction,
    SlideSkillObservation,
    SlideSkillState,
)
from skill_manager import SkillManager
from slide_generator import SlideGenerator
from evaluator_adapter import EvaluatorAdapter


# Paths relative to repo root — adjust if the package moves.
REPO_ROOT = Path(__file__).parent.parent
BASELINE_DIR = REPO_ROOT / "skill_files_baseline"
TASK_PROMPT_PATH = REPO_ROOT / "output" / "TASK_PROMPT.md"
REFERENCE_DIR = REPO_ROOT / "output" / "reference"

# Reward capping parameters
REWARD_CLIP_POINTS = 30   # clip score delta to +/-30 before normalizing
REWARD_SCALE = 100.0      # divide clipped delta by this to get [-0.3, +0.3]

MAX_STEPS = int(os.environ.get("SLIDE_SKILL_MAX_STEPS", "7"))


class SlideSkillEnvironment:
    """OpenEnv environment for the Skill Forge optimization loop."""

    SUPPORTS_CONCURRENT_SESSIONS: ClassVar[bool] = True

    def __init__(self) -> None:
        self._sessions: dict[str, SlideSkillState] = {}
        self._generator = SlideGenerator(
            task_prompt_path=TASK_PROMPT_PATH,
            pptx_skill_dir=REPO_ROOT / "pptx",
            reference_dir=REFERENCE_DIR,
        )
        self._evaluator = EvaluatorAdapter(reference_dir=REFERENCE_DIR)

    # ------------------------------------------------------------------
    # Public OpenEnv interface
    # ------------------------------------------------------------------

    def reset(self, session_id: str | None = None) -> str:
        """
        Initialize or reinitialize a session.

        Creates an isolated working directory under /tmp/ and copies the
        baseline skill files into it. Returns the session_id.
        """
        session_id = session_id or str(uuid.uuid4())

        session_dir = Path(f"/tmp/slide_skill_{session_id}")
        if session_dir.exists():
            shutil.rmtree(session_dir)
        session_dir.mkdir(parents=True)

        # Copy baseline skill files into the session directory.
        for fname in ("DESIGN_RULES.md", "EXAMPLES.md"):
            src = BASELINE_DIR / fname
            if not src.exists():
                raise FileNotFoundError(
                    f"Baseline file missing: {src}. "
                    "Commit skill_files_baseline/ to the repo."
                )
            shutil.copy2(src, session_dir / fname)

        self._sessions[session_id] = SlideSkillState(
            session_id=session_id,
            step=0,
            prev_total=0,
            session_dir=str(session_dir),
        )
        return session_id

    def step(self, session_id: str, action: SlideSkillAction) -> SlideSkillObservation:
        """
        Apply an action, run the generation pipeline, evaluate, and return
        an observation.

        Args:
            session_id: Must be a live session (call reset() first).
            action: Either EditSectionAction or ReplaceFileAction.

        Returns:
            SlideSkillObservation with scores, feedback, reward, and file contents.

        Raises:
            KeyError: If session_id is not found.
            RuntimeError: If the generation or evaluation pipeline fails.
        """
        state = self._sessions[session_id]
        session_dir = Path(state.session_dir)

        # 1. Apply the action to the session's skill files.
        manager = SkillManager(session_dir)
        manager.apply(action)

        # 2. Run the full generation pipeline.
        jpg_path = self._generator.generate(
            session_id=session_id,
            session_dir=session_dir,
        )

        # 3. Evaluate the generated slide.
        eval_result = self._evaluator.evaluate(jpg_path)

        # 4. Compute reward (capped score delta).
        delta = eval_result["total"] - state.prev_total
        clipped_delta = max(-REWARD_CLIP_POINTS, min(REWARD_CLIP_POINTS, delta))
        reward = clipped_delta / REWARD_SCALE

        # 5. Update state.
        state.step += 1
        state.prev_total = eval_result["total"]
        done = state.step >= MAX_STEPS

        # 6. Read back current file contents for the observation.
        design_rules = (session_dir / "DESIGN_RULES.md").read_text(encoding="utf-8")
        examples = (session_dir / "EXAMPLES.md").read_text(encoding="utf-8")

        scores = SlideScores(**eval_result["scores"])

        return SlideSkillObservation(
            scores=scores,
            total=eval_result["total"],
            strengths=eval_result.get("strengths", []),
            weaknesses=eval_result.get("weaknesses", []),
            one_line_verdict=eval_result["one_line_verdict"],
            reward=reward,
            step=state.step,
            done=done,
            jpg_path=str(jpg_path),
            design_rules_content=design_rules,
            examples_content=examples,
        )

    def close(self, session_id: str) -> None:
        """Clean up session resources. Deletes the /tmp/ session directory."""
        if session_id in self._sessions:
            state = self._sessions.pop(session_id)
            session_dir = Path(state.session_dir)
            if session_dir.exists():
                shutil.rmtree(session_dir)
```
pyproject.toml
ADDED
@@ -0,0 +1,50 @@
```toml
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "slide-skill-openenv"
version = "1.0.0"
description = "OpenEnv environment for McKinsey-style PowerPoint slide optimization"
requires-python = ">=3.12"

# Core runtime dependencies (required for the environment to run)
dependencies = [
    "google-genai>=1.0.0",  # Gemini API client (generator + evaluator + optimizer)
    "pydantic>=2.6.0",      # Data models with discriminated unions
    "httpx>=0.27.0",        # HTTP client for client.py
    "loguru>=0.7.0",        # Structured logging for client
]

[project.optional-dependencies]
# Server dependencies (required for app.py)
server = [
    "fastapi>=0.111.0",
    "uvicorn[standard]>=0.30.0",
    "python-multipart>=0.0.9",  # FastAPI form parsing
    "python-dotenv>=1.0.0",     # Load .env file automatically
]

# Development and testing
dev = [
    "pytest>=8.0.0",
    "pytest-asyncio>=0.23.0",
    "httpx>=0.27.0",  # for FastAPI TestClient
    "ruff>=0.4.0",
    "mypy>=1.10.0",
]

[tool.hatch.build.targets.wheel]
packages = ["openenv"]

[tool.ruff]
target-version = "py312"
line-length = 88

[tool.ruff.lint]
select = ["E", "F", "I", "UP"]

[tool.mypy]
python_version = "3.12"
strict = true
ignore_missing_imports = true
```
skill_files_baseline/DESIGN_RULES.md
ADDED
@@ -0,0 +1,19 @@
```markdown
# Design Rules (Original pptx skill defaults)

## Color Palette
Pick from skill's built-in palettes. For hydrogen/energy topic, use "Teal Trust":
- Primary: `028090` (teal)
- Secondary: `00A896` (seafoam)
- Accent: `02C39A` (mint)
- Commit to dark throughout for a premium feel.

## Typography
- Title: Georgia, 36-44pt, bold
- Body: Calibri, 14-16pt
- Captions: 10-12pt, muted

## Layout
- 0.5" minimum margins
- 0.3-0.5" between content blocks
- Timeline or process flow for data display
- NEVER use accent lines under titles
```
skill_files_baseline/EXAMPLES.md
ADDED
@@ -0,0 +1,2 @@
```markdown
# Examples
(Empty — no prior optimization rounds)
```
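These two baseline files are what each `reset()` copies into a fresh per-session directory. That isolation step can be sketched with the standard library alone (`make_session_dir` is a hypothetical standalone helper mirroring the environment's reset logic, not its actual method, and the demo runs in a throwaway temp directory rather than `/tmp/slide_skill_*`):

```python
import shutil
import tempfile
import uuid
from pathlib import Path

def make_session_dir(baseline: Path, root: Path) -> Path:
    """Copy the baseline skill files into a fresh per-session directory."""
    session_dir = root / f"slide_skill_{uuid.uuid4()}"
    session_dir.mkdir(parents=True)
    for fname in ("DESIGN_RULES.md", "EXAMPLES.md"):
        shutil.copy2(baseline / fname, session_dir / fname)
    return session_dir

# Demo with a throwaway baseline so nothing touches the real repo files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    baseline = root / "baseline"
    baseline.mkdir()
    (baseline / "DESIGN_RULES.md").write_text("# rules")
    (baseline / "EXAMPLES.md").write_text("# examples")
    session = make_session_dir(baseline, root)
    names = sorted(p.name for p in session.iterdir())
    print(names)
```

Because every session works on its own copies, concurrent sessions never contend for the committed baseline files.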