---
title: Shader Environment Server
emoji: 🎨
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Shader Environment
## Overview
`shader` is an OpenEnv-compatible environment for generating, repairing, and iteratively refining executable shaders against visual and systems constraints.
Supported interaction modes:
- Reference-conditioned shader recreation via SSIM reward
- Multi-turn shader refinement (not one-shot only)
- GLSL-first execution via headless rendering, with WGSL/browser-native as a later target
- Pluggable reward traces for RL, evaluation, reranking, and distillation
- Hidden evaluation across time, resolution, and seed variations
## Project Structure
```
envs/shader/
├── __init__.py      # Public API: ShaderEnv, ShaderAction, ShaderObservation
├── models.py        # Pydantic Action / Observation types
├── client.py        # OpenEnv client (ShaderEnv)
├── tasks.py         # Task bank loader (shaders21k corpus)
├── reward.py        # SSIM reward computation
├── render.py        # Headless GLSL renderer (ModernGL + EGL)
├── harness.py       # Subprocess-isolated render wrapper
├── download.sh      # Fetches shaders21k dataset
├── openenv.yaml     # OpenEnv space descriptor
├── pyproject.toml
└── server/
    ├── app.py           # FastAPI application (OpenEnv HTTP server)
    ├── environment.py   # Environment implementation (reset / step loop)
    └── Dockerfile
```
## Quickstart
```bash
# Download the shaders21k corpus (~41 MB)
cd envs/shader
./download.sh
```
### Running via uv / uvicorn
```bash
cd envs/shader
PYTHONPATH=../.. uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Running via Docker
The corpus is downloaded at build time automatically:
```bash
cd envs/shader
docker build -f server/Dockerfile -t shader .
docker run -p 8000:8000 shader
```
### Validation
```bash
# Validate local structure
cd envs/shader
openenv validate --verbose
# Validate a running server (6 criteria: openapi, health, metadata, schema, mcp, mode)
openenv validate http://localhost:8000
```
### Interacting via WebSocket
The HTTP endpoints (`/reset`, `/step`) are stateless: each request creates a fresh environment instance. For multi-turn sessions with persistent state, use the WebSocket endpoint:
```python
import asyncio, json, websockets

async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        # Reset: picks a task, renders the reference
        await ws.send(json.dumps({"type": "reset", "data": {}}))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(obs["task"], obs["remaining"])

        # Step: submit GLSL, get back SSIM + render
        await ws.send(json.dumps({
            "type": "step",
            "data": {"code": "void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"}
        }))
        resp = json.loads(await ws.recv())
        r = resp["data"]
        print(f"compiled={r['observation']['compiled']} ssim={r['observation']['ssim']} reward={r['reward']}")

        await ws.send(json.dumps({"type": "close"}))

asyncio.run(main())
```
### Python Client
```python
from shader import ShaderEnv, ShaderAction

with ShaderEnv(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    print(result.observation.task)           # ShaderToy ID
    print(result.observation.reference_png)  # base64 PNG

    result = client.step(ShaderAction(code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"))
    print(result.observation.compiled)  # True/False
    print(result.observation.ssim)      # similarity vs reference
```
### Benchmark
Runs GPT 5.4 against the environment over WebSocket, producing a reproducible baseline:
```bash
# Requires a running server and OPENAI_API_KEY set
python envs/shader/benchmark.py # 3 episodes, default seeds
python envs/shader/benchmark.py --turns 5 # cap turns per episode
python envs/shader/benchmark.py --url ws://localhost:8001/ws # custom server
python envs/shader/benchmark.py --seeds 10 20 30 # custom seeds
```
Seeds control reproducible task selection. Results are saved to `benchmark_output/results.json`.
## Tasks
The environment ships with 3 curated tasks of increasing difficulty. Each task presents a reference image; the agent must write GLSL code to reproduce it. The grader is SSIM (structural similarity), returning a score in [0.0, 1.0].
| Task | Difficulty | Lines | Description | What the agent needs |
|------|-----------|-------|-------------|---------------------|
| `Nd33R4` | Easy | 13 | XOR color pattern on pixel coordinates | `int()` casting, bitwise XOR/AND, float conversion |
| `stlXWH` | Medium | 44 | SDF distance field (square minus circle) with smooth coloring | Signed distance functions, `abs`, `exp`, `smoothstep`, `cos` for distance coloring |
| `ftjSRd` | Hard | 122 | Raymarcher with SDF repetition, polar coordinates, HSV coloring | Ray marching loop, rotation matrices, domain repetition, HSV-to-RGB, polar coords |
All 3 tasks are sourced from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (Shadertoy). They were selected by code complexity (line count, number of concepts) and verified to produce visually interesting output at render time.
To select a specific task, pass its name to `reset()`:
```python
result = env.reset(task="Nd33R4")  # easy: XOR pattern
result = env.reset(task="stlXWH")  # medium: SDF visualization
result = env.reset(task="ftjSRd")  # hard: raymarcher
result = env.reset() # random from full corpus
```
### Grading
Each task uses the same grader: SSIM between the agent's rendered output and the ground-truth reference image. The score is deterministic and reproducible for the same GLSL input.
- **Score range**: 0.0 (no similarity) to 1.0 (pixel-perfect match)
- **Success threshold**: score >= 0.90
- **Compile/render failures**: score = 0.0
### Baseline Scores
Evaluated with GPT 5.4 via `inference.py` (5 steps per task, temperature 0.2):
| Task | Difficulty | Best SSIM | Step-by-step rewards | Multi-turn improvement |
|------|-----------|-----------|---------------------|----------------------|
| `Nd33R4` | Easy | **0.27** | 0.27, 0.13, 0.15, 0.15, 0.16 | No: model generates hash noise instead of XOR pattern |
| `stlXWH` | Medium | **0.94** | 0.85, 0.88, 0.91, 0.90, 0.94 | Yes: steady refinement from 0.85 to 0.94 |
| `ftjSRd` | Hard | **0.40** | 0.31, 0.15, 0.33, 0.40, 0.28 | Partial: oscillates between 0.15 and 0.40 |
Key observations:
- `Nd33R4` (easy by code complexity) is hard for VLMs because bitwise XOR patterns are difficult to reverse-engineer from a rendered image alone. The model defaults to procedural noise rather than integer math.
- `stlXWH` (medium) shows clear multi-turn refinement: the model progressively improves the SDF shape and color mapping across steps.
- `ftjSRd` (hard) challenges frontier models with 122 lines of tightly coupled raymarching, rotation, and HSV coloring. The model attempts structural elements but cannot match the exact parameters.
*Run `inference.py` to reproduce. Scores vary by model and API endpoint.*
## Task Bank (Corpus)
Beyond the 3 curated tasks, the full task bank is loaded from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (NeurIPS 2022). At load time, shaders are filtered to single-pass fragments with no texture/buffer inputs, yielding ~16,800 usable tasks.
Each task is a Shadertoy-dialect GLSL shader with known ground-truth code. The environment renders the ground truth to produce a reference image, then challenges the agent to reproduce it.
| Field | Description |
|-------|-------------|
| `name` | ShaderToy ID (e.g. `MdGcDc`) |
| `code` | GLSL fragment shader source |
| `source` | ShaderToy URL for provenance |
| `resolution` | Render resolution, default 512x288 |
| `time` | iTime uniform value, default 0.0 |
| `difficulty` | `easy`, `medium`, `hard` (curated tasks only) |
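The single-pass filter described above can be sketched as a simple code scan. The file layout, JSON field names, and banned-token list here are assumptions for illustration; the real loader in `tasks.py` may differ.

```python
import json
from pathlib import Path

def is_single_pass(code: str) -> bool:
    """Heuristic filter: reject shaders that sample textures or buffers.

    Hypothetical criteria -- the real loader may use stricter checks.
    """
    banned = ("iChannel", "texture(", "texture2D(", "textureLod(")
    return not any(token in code for token in banned)

def load_task_bank(corpus_dir: str) -> list[dict]:
    """Load Shadertoy-dialect shaders and keep only single-pass ones.

    Assumes one JSON file per shader with "name" and "code" fields.
    """
    tasks = []
    for path in Path(corpus_dir).glob("*.json"):
        entry = json.loads(path.read_text())
        if is_single_pass(entry["code"]):
            tasks.append({
                "name": entry["name"],
                "code": entry["code"],
                "source": f"https://www.shadertoy.com/view/{entry['name']}",
                "resolution": (512, 288),  # default from the table above
                "time": 0.0,
            })
    return tasks
```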
## Motivation
Shader work has properties that make it well-suited as an RL environment:
- Feedback is fast and dense
- Compile success, render success, and performance are easy to gate
- The same shader can be evaluated under multiple controlled render conditions
- An active public corpus exists (Shadertoy, ~1M public shaders) alongside established compiler infrastructure (`shaderc`, `glslang`)
ShaderEval (LLM4Code @ ICSE 2025) benchmarked current LLMs on GLSL function completion and found failure rates above 31% even for top models. GLSL is low-resource in pretraining corpora, leaving room for RL-trained or fine-tuned models to improve on baselines.
### Positioning
There is existing adjacent work on conditioned procedural material generation, RL-based material parameter optimization, interactive evolutionary shader generation, and LLM-driven real-time shader generation. The contribution here is not "LLMs can emit shader code" but rather a reusable environment layer with stable runtime, task packaging, and reward/eval plugins.
## Runtime Stack
The primary target is GLSL via headless OpenGL.
- **GLSL** as the authoring and execution language
  - The entire relevant corpus is GLSL: Shadertoy (~1M shaders), shaders21k (21K shaders), ShaderEval (the only published GLSL benchmark)
  - GLSL has explicit representation in LLM pretraining corpora (The Stack includes a GLSL subset); WGSL has near-zero public training data
  - WGSL prohibits recursion, has no implicit type coercions, and uses incompatible uniform conventions, making Shadertoy-dialect GLSL non-transpilable to WGSL via naga at corpus scale (~15% failure rate)
- **ModernGL** with EGL as the headless rendering backend (`render.py`)
  - Runs on Linux servers without a display via EGL
  - Used by shaders21k for offline rendering
  - Wraps user shader code in a Shadertoy-compatible preamble (standard uniforms, `mainImage` forward declaration)
  - Strips `#version`, `precision`, and `#extension` directives from user code to avoid conflicts
  - Adjusts driver-reported error line numbers to map back to user code
- **Subprocess isolation** (`harness.py`)
  - Each render runs in a separate process to contain driver crashes, infinite loops, and GPU state corruption
  - Configurable per-render timeout (default 10s)
  - Returns a structured `RenderResult` with compile/render status, error messages, and raw RGBA pixel data
- **`shaderc` / `glslang`** for offline validation and portability checks (planned)
- **WGSL / WebGPU** deferred to a later phase for browser-native demos
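The preamble wrapping and directive stripping described above can be sketched as follows. The exact uniforms and entry-point glue are assumptions; the real preamble in `render.py` may declare more Shadertoy uniforms.

```python
import re

# Shadertoy-compatible preamble: standard uniforms plus a forward
# declaration of the user's entry point. A minimal sketch, not the
# actual preamble used by render.py.
PREAMBLE = """#version 330
uniform vec3 iResolution;
uniform float iTime;
out vec4 fragColor;
void mainImage(out vec4 fragColor, in vec2 fragCoord);
void main() { mainImage(fragColor, gl_FragCoord.xy); }
"""

# Matches #version / #extension / precision lines so user copies
# cannot conflict with the preamble's own declarations.
DIRECTIVES = re.compile(
    r"^\s*#\s*(version|extension)\b.*$|^\s*precision\b.*$",
    re.MULTILINE,
)

def wrap_user_shader(user_code: str) -> tuple[str, int]:
    """Strip conflicting directives and prepend the preamble.

    Returns the full source plus the line offset needed to map
    driver-reported error lines back to the user's code.
    """
    cleaned = DIRECTIVES.sub("", user_code)  # leaves blank lines in place
    offset = PREAMBLE.count("\n")
    return PREAMBLE + cleaned, offset
```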
## Episode Schema
The environment follows a standard multi-turn refinement loop:
1. `reset()` picks a task from the bank, renders the ground-truth reference, and returns it as a base64 PNG
2. The agent submits GLSL code via `step(ShaderAction(code=...))`
3. The server compiles and renders the shader, computes SSIM vs reference
4. Returns compile/render status, errors, rendered image, and SSIM reward
5. Episode ends when the turn budget is exhausted (default 10 turns) or SSIM >= 0.99
The server supports up to 4 concurrent environment sessions (`max_concurrent_envs=4`).
This supports RL training, best-of-N search and reranking, trajectory collection for SFT or distillation, and evaluation under a shared runtime.
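The loop above can be sketched generically over any client exposing `reset()`/`step()`. The `StepResult` stand-in and the `propose` callback are illustrative assumptions, not the real client types:

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    """Minimal stand-in for the client's step result (sketch only)."""
    ssim: float
    done: bool

def run_episode(client, propose, max_turns: int = 10) -> float:
    """Drive one refinement episode against a reset()/step() client.

    `propose` maps the previous result to new GLSL source (the agent
    policy). Returns the best SSIM seen over the episode.
    """
    best = 0.0
    result = client.reset()
    for _ in range(max_turns):  # default turn budget from the schema
        result = client.step(propose(result))
        best = max(best, result.ssim)
        if result.done or result.ssim >= 0.99:  # budget or success exit
            break
    return best
```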
## Environment Variants
### `shader` (primary)
Match a target still image or short animated effect within a frame-time budget. Direct, visual, and evaluable without a full DCC toolchain.
### Material Graph
Edit procedural material graphs and parameters toward a target appearance. The search space is more structured than raw shader code, and recent procedural-material work already uses graph/program representations.
### Shader Repair
Start from shaders that are broken, slow, unstable, or portability-problematic and optimize for correctness, robustness, and performance.
## Task Families
- **Reference recreation**: recreate a target still or short effect from a reference render
- **Repair**: fix syntax errors, portability failures, or numerical instability
- **Optimization**: preserve appearance while reducing frame time or instruction count
- **Style transfer**: preserve scene logic while shifting color, texture, motion, or lighting style
- **Critique incorporation**: revise the shader based on iterative feedback
- **Robustness repair**: stabilize a shader across resolutions, aspect ratios, and time ranges
## Evaluation
Evaluation relies on hidden checks rather than visible examples only:
- Compile success
- Render success (no NaNs or fatal runtime failures)
- Perceptual similarity on held-out stills
- Temporal consistency on held-out short clips
- Stability across resolutions, aspect ratios, and parameter seeds
- Frame-time or instruction-budget limits
- Optional portability checks across compiler/validator paths
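A hidden-evaluation sweep over the conditions above can be sketched as a grid of render settings. The concrete times, resolutions, and seeds here are assumptions; the spec only fixes the axes (time, resolution, seed):

```python
from itertools import product

# Hypothetical held-out grid -- real values stay hidden from the agent.
TIMES = [0.0, 1.5, 4.0]
RESOLUTIONS = [(512, 288), (640, 360), (256, 256)]
SEEDS = [0, 7, 42]

def hidden_eval(render_and_score, threshold: float = 0.90) -> dict:
    """Score a shader under every held-out render condition.

    `render_and_score(time, resolution, seed)` is assumed to return an
    SSIM-like score in [0, 1], or None on a render failure.
    """
    scores = []
    for t, res, seed in product(TIMES, RESOLUTIONS, SEEDS):
        s = render_and_score(t, res, seed)
        scores.append(0.0 if s is None else s)  # failures score zero
    return {
        "mean": sum(scores) / len(scores),
        "min": min(scores),
        "pass": min(scores) >= threshold,  # must hold on every condition
    }
```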
### Reward Structure
**Current implementation** (`reward.py`): windowed SSIM (Wang et al. 2004) between agent render and reference, computed per-channel on RGB and averaged. Uses scipy `uniform_filter` for windowed statistics when available, falls back to global-stats SSIM. Compile and render failures receive reward 0.0. Reward range: [0.0, 1.0].
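The global-stats fallback can be sketched as single-window SSIM over the whole image, using the standard Wang et al. (2004) constants. A minimal sketch only; the real `reward.py` prefers windowed statistics via scipy and averages per-channel RGB scores:

```python
import numpy as np

def ssim_global(a: np.ndarray, b: np.ndarray, data_range: float = 255.0) -> float:
    """Single-window SSIM between two same-shape images.

    Wang et al. 2004 stabilization constants: C1 = (0.01*L)^2,
    C2 = (0.03*L)^2 for dynamic range L.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a**2 + mu_b**2 + c1) * (var_a + var_b + c2)
    )
```

Identical inputs score exactly 1.0; structurally unrelated inputs fall toward 0.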
**Planned multi-component reward:**
```text
R = G_compile * G_render * (
        0.35 * appearance_match +
        0.20 * temporal_stability +
        0.20 * performance +
        0.15 * robustness +
        0.10 * code_quality
    ) - step_penalty - regression_penalty
```
Component notes:
- **`appearance_match`**: measured on hidden render conditions using DINOv2 cosine similarity (better than CLIP for texture/color/style fidelity) combined with pixel-level SSIM for structural accuracy. CLIP is appropriate only for text-conditioned task variants.
- **`temporal_stability`**: requires rendering N consecutive frames and computing frame-to-frame SSIM. For v1, this may serve as a held-out evaluation metric rather than a dense training reward, to keep per-episode compute manageable.
- **`performance`**: frame-time as a reward is tractable headlessly. To avoid reward hacking (trivially simple shaders that are fast but visually wrong), this is gated as a hard budget first (penalize frames over a threshold) before adding a continuous score.
- **`robustness`**: captures resolution changes, seed changes, and compiler portability.
- **`G_compile * G_render`**: hard multiplicative gates, standard in code-generation RL. These cause a zero-gradient problem early in training; mitigate with curriculum learning (start from partial or working shader skeletons) and/or soft penalties before hardening.
- **SFT warm-up** is a prerequisite before RL. Adjacent work (RLRF for SVG) shows that RL directly on an instruction-tuned model without a domain SFT stage fails because the base model cannot generate renderable output reliably enough to produce reward variance.
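The planned formula translates directly into code. This sketch assumes every component scorer is already normalized to [0, 1]; the scorers themselves are not implemented here:

```python
def shaped_reward(
    compiled: bool,
    rendered: bool,
    appearance: float,     # DINOv2 + SSIM blend, assumed in [0, 1]
    temporal: float,
    performance: float,
    robustness: float,
    code_quality: float,
    step_penalty: float = 0.0,
    regression_penalty: float = 0.0,
) -> float:
    """Sketch of the planned multi-component reward.

    Weights mirror the formula in this section; a compile or render
    failure zeroes the shaped term via the multiplicative gates.
    """
    gate = float(compiled) * float(rendered)  # G_compile * G_render
    shaped = (
        0.35 * appearance
        + 0.20 * temporal
        + 0.20 * performance
        + 0.15 * robustness
        + 0.10 * code_quality
    )
    return gate * shaped - step_penalty - regression_penalty
```

Note the penalties apply even when the gates zero the shaped term, so a failed step can yield negative reward.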
## Action / Observation Contract
```python
class ShaderAction(Action):
    code: str  # Shadertoy-dialect GLSL fragment shader source

class ShaderObservation(Observation):
    task: str            # ShaderToy ID
    remaining: int       # turns left in episode
    reference_png: str   # base64 PNG (non-empty on reset only)
    compiled: bool
    rendered: bool
    errors: list[str]
    agent_png: str       # base64 PNG of agent's render
    ssim: float          # SSIM vs reference in [0, 1]
    done: bool = False
    reward: float | None = None
```
- `reward` and `done` inside `Observation` follow the OpenEnv spec (RFC 002, Decision 2): rewards are computed inside the environment and returned as part of the observation; the server layer promotes them to the top-level `StepResponse`.
- Detailed artifacts (frame dumps, profiler traces, held-out evaluation results) live behind tools rather than being inlined on every turn.
## Training
### Algorithm
- **GRPO** as the baseline, with multi-turn extensions:
  - **MURPHY** (NeurIPS 2025): for each rollout that does not reach maximum reward, execution feedback is appended and new rollouts are generated from that state. Up to 8% relative gain over single-turn GRPO.
  - **TRLOO** (Dr. Kernel): addresses the biased policy gradient that vanilla GRPO exhibits in multi-turn settings.
- The episode schema (compile, render, reward, next turn) maps directly onto MURPHY's interaction loop.
### Libraries
- **veRL**: supports GRPO, DAPO, GSPO; scales to large models; best throughput for long multi-turn rollouts
- **OpenRLHF-M**: multimodal variant of OpenRLHF; targets VLM policies (code + rendered image inputs)
Both have TRL and OpenEnv integrations.
### Task Bank
The task bank is populated from the shaders21k corpus (~16.8K single-pass shaders after filtering). RL training requires 10K-15K rollouts minimum based on adjacent work (CTRL-S: 14.4K, Reason-SVG: 10K).
## Related Work
| Work | Relevance |
|------|-----------|
| **Shadertoy** | Active public corpus and community; seed source for tasks and reference effects |
| **OpenEnv** (Meta/PyTorch) | Target framework. Client-server RL environment with `Action`/`Observation` base classes, `reset()`/`step()` contract, and RFC 004 Rubric system. TRL, Unsloth, SkyRL, and Oumi integrations. |
| **ShaderEval** (LLM4Code @ ICSE 2025) | Only formal benchmark for GLSL code generation evaluation. 467 functions, pixel-diff evaluation via `shadermatch`. |
| **shaders21k** (NeurIPS 2022) | 21K OpenGL fragment shaders from Shadertoy for visual representation learning. Ready seed corpus for task generation. |
| **AI Co-Artist** (arXiv:2512.08951) | Closest work to multi-turn LLM-driven GLSL refinement. GPT-4 + Picbreeder-style evolution, <3% compile error after retries. No RL. |
| **VLMaterial** (ICLR 2025 Spotlight) | Fine-tunes a VLM for Blender material node graphs from images. Validates rendered-image similarity as a reward signal for code-generating policies. |
| **Dr. Kernel / KernelGYM** (arXiv:2602.05885) | Multi-turn RL for GPU code generation (CUDA/Triton) with compile-correctness-speedup reward chain. Proposes TRLOO for multi-turn credit assignment. |
| **MURPHY** (NeurIPS 2025) | Multi-turn GRPO for code generation. Canonical algorithm for the interaction loop used here. |
| **ProcMatRL** (SIGGRAPH Asia 2024) | RL for procedural material parameter optimization. Validates RL applicability in the visual generation domain. |
| **ShadAR** (arXiv:2602.17481) | LLM-driven real-time shader generation for AR. |
| **Procedural Shader Evolution** (arXiv:2312.17587) | Interactive evolutionary algorithms for shader generation. Multi-turn refinement as iterative optimization. |
## Licensing
Shader code in the task bank is sourced from [ShaderToy](https://www.shadertoy.com/) via the [shaders21k](https://github.com/mbaradad/shaders21k) dataset. ShaderToy's default license is **CC BY-NC-SA 3.0**; authors may choose a different license, but there is no structured per-shader license metadata in the dataset or the ShaderToy API.
- The dataset is not redistributed in this repository; it is downloaded at build time
- Individual shader provenance is tracked via the `source` field on each task (links back to the ShaderToy page)
- For commercial use, per-shader license review is required
## Next Steps
- Plan the SFT warm-up dataset and training run (prerequisite for RL)
- Extend reward beyond SSIM: DINOv2 appearance match, temporal stability, performance, robustness
- Implement the reward plugin API (multi-component, pluggable)
- Establish the hidden evaluation protocol across time, resolution, and seed variations
- Add `shaderc` / `glslang` offline validation
- Add support for multi-pass shaders and texture inputs
## References
### Primary
- Shadertoy: <https://www.shadertoy.com/>
- OpenEnv (Meta/PyTorch): <https://github.com/meta-pytorch/OpenEnv>
- OpenEnv RFC 002: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/002-env-spec.md>
- OpenEnv RFC 004: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/004-rubrics.md>
- TRL OpenEnv integration: <https://huggingface.co/docs/trl/openenv>
- ShaderEval (LLM4Code @ ICSE 2025): <https://conf.researchr.org/details/icse-2025/llm4code-2025-papers/13/Evaluating-Language-Models-for-Computer-Graphics-Code-Completion>
- shadertoys-dataset: <https://github.com/Vipitis/shadertoys-dataset>
- Shadereval-inputs (HuggingFace): <https://huggingface.co/datasets/Vipitis/Shadereval-inputs>
- shaders21k (NeurIPS 2022): <https://arxiv.org/abs/2211.16412>
- shaders21k dataset: <https://github.com/mbaradad/shaders21k>
- AI Co-Artist: <https://arxiv.org/abs/2512.08951>
- VLMaterial (ICLR 2025): <https://arxiv.org/abs/2501.18623>
- VLMaterial code: <https://github.com/mit-gfx/VLMaterial>
- Dr. Kernel / KernelGYM: <https://arxiv.org/abs/2602.05885>
- MURPHY (NeurIPS 2025): <https://arxiv.org/abs/2511.07833>
- RLRF (SVG RL): <https://arxiv.org/abs/2505.20793>
### Tools and Infrastructure
- `shaderc`: <https://github.com/google/shaderc>
- `glslang`: <https://github.com/KhronosGroup/glslang>
- `pygfx/shadertoy`: <https://github.com/pygfx/shadertoy>
- moderngl: <https://github.com/moderngl/moderngl>
- veRL: <https://github.com/volcengine/verl>
- OpenRLHF-M: <https://github.com/OpenRLHF/OpenRLHF-M>
- Adobe ProcMatRL: <https://github.com/adobe-research/ProcMatRL>
### Specifications
- WGSL specification: <https://www.w3.org/TR/WGSL/>
- WebGPU: <https://webgpu.org/>
- Khronos `glslang` reference: <https://www.khronos.org/opengles/sdk/Reference-Compiler/>
### Supplementary
- Procedural Shader Evolution: <https://arxiv.org/abs/2312.17587>
- ShadAR: <https://arxiv.org/abs/2602.17481>
- ProcMatRL paper (SIGGRAPH Asia 2024): <https://doi.org/10.1145/3687979>
- Conditioned Procedural Materials (SIGGRAPH 2023): <https://dl.acm.org/doi/10.1145/3588432.3591520>
- FragCoord.xyz: <https://fragcoord.xyz/>
- naga GLSL front-end failures: <https://github.com/Vipitis/shadertoys-dataset/issues/15>