| --- |
| title: Shader Environment Server |
| emoji: 🎨 |
| colorFrom: purple |
| colorTo: blue |
| sdk: docker |
| pinned: false |
| app_port: 8000 |
| base_path: /web |
| tags: |
| - openenv |
| --- |
| |
| # Shader Environment |
|
|
| ## Overview |
|
|
| `shader` is an OpenEnv-compatible environment for generating, repairing, and iteratively refining executable shaders against visual and systems constraints. |
|
|
| Supported interaction modes: |
|
|
| - Reference-conditioned shader recreation via SSIM reward |
| - Multi-turn shader refinement (not one-shot only) |
| - GLSL-first execution via headless rendering, with WGSL/browser-native as a later target |
| - Pluggable reward traces for RL, evaluation, reranking, and distillation |
| - Hidden evaluation across time, resolution, and seed variations |
|
|
| ## Project Structure |
|
|
| ``` |
| envs/shader/ |
| ├── __init__.py # Public API: ShaderEnv, ShaderAction, ShaderObservation |
| ├── models.py # Pydantic Action / Observation types |
| ├── client.py # OpenEnv client (ShaderEnv) |
| ├── tasks.py # Task bank loader (shaders21k corpus) |
| ├── reward.py # SSIM reward computation |
| ├── render.py # Headless GLSL renderer (ModernGL + EGL) |
| ├── harness.py # Subprocess-isolated render wrapper |
| ├── download.sh # Fetches shaders21k dataset |
| ├── openenv.yaml # OpenEnv space descriptor |
| ├── pyproject.toml |
| └── server/ |
|     ├── app.py # FastAPI application (OpenEnv HTTP server) |
|     ├── environment.py # Environment implementation (reset / step loop) |
|     └── Dockerfile |
| ``` |
|
|
| ## Quickstart |
|
|
| ```bash |
| # Download the shaders21k corpus (~41 MB) |
| cd envs/shader |
| ./download.sh |
| ``` |
|
|
| ### Running via uv / uvicorn |
|
|
| ```bash |
| cd envs/shader |
| PYTHONPATH=../.. uvicorn server.app:app --host 0.0.0.0 --port 8000 |
| ``` |
|
|
| ### Running via Docker |
|
|
| The corpus is downloaded at build time automatically: |
|
|
| ```bash |
| cd envs/shader |
| docker build -f server/Dockerfile -t shader . |
| docker run -p 8000:8000 shader |
| ``` |
|
|
| ### Validation |
|
|
| ```bash |
| # Validate local structure |
| cd envs/shader |
| openenv validate --verbose |
| |
| # Validate a running server (6 criteria: openapi, health, metadata, schema, mcp, mode) |
| openenv validate http://localhost:8000 |
| ``` |
|
|
| ### Interacting via WebSocket |
|
|
| The HTTP endpoints (`/reset`, `/step`) are stateless: each call creates a fresh environment instance. For multi-turn sessions with persistent state, use the WebSocket endpoint: |
|
|
| ```python |
| import asyncio, json, websockets |
| |
| async def main(): |
|     async with websockets.connect("ws://localhost:8000/ws") as ws: |
|         # Reset: picks a task and renders the reference |
|         await ws.send(json.dumps({"type": "reset", "data": {}})) |
|         resp = json.loads(await ws.recv()) |
|         obs = resp["data"]["observation"] |
|         print(obs["task"], obs["remaining"]) |
| |
|         # Step: submit GLSL, get back SSIM and the render |
|         await ws.send(json.dumps({ |
|             "type": "step", |
|             "data": {"code": "void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"} |
|         })) |
|         resp = json.loads(await ws.recv()) |
|         r = resp["data"] |
|         print(f"compiled={r['observation']['compiled']} ssim={r['observation']['ssim']} reward={r['reward']}") |
| |
|         await ws.send(json.dumps({"type": "close"})) |
| |
| asyncio.run(main()) |
| ``` |
|
|
| ### Python Client |
|
|
| ```python |
| from shader import ShaderEnv, ShaderAction |
| |
| with ShaderEnv(base_url="http://localhost:8000").sync() as client: |
|     result = client.reset() |
|     print(result.observation.task) # ShaderToy ID |
|     print(result.observation.reference_png) # base64 PNG |
| |
|     result = client.step(ShaderAction(code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }")) |
|     print(result.observation.compiled) # True/False |
|     print(result.observation.ssim) # similarity vs reference |
| ``` |
|
|
| ### Benchmark |
|
|
| Runs GPT 5.4 against the environment over WebSocket, producing a reproducible baseline: |
|
|
| ```bash |
| # Requires a running server and OPENAI_API_KEY set |
| python envs/shader/benchmark.py # 3 episodes, default seeds |
| python envs/shader/benchmark.py --turns 5 # cap turns per episode |
| python envs/shader/benchmark.py --url ws://localhost:8001/ws # custom server |
| python envs/shader/benchmark.py --seeds 10 20 30 # custom seeds |
| ``` |
|
|
| Seeds control reproducible task selection. Results are saved to `benchmark_output/results.json`. |
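|
To illustrate how a seed can pin task selection, the sketch below draws a task ID from a fixed list using a seeded `random.Random`. The helper name `pick_task` and the use of Python's `random` module are assumptions made for illustration; the selection logic in `benchmark.py` or `tasks.py` may differ.

```python
import random

# The three curated task IDs named in the Tasks section of this README.
CURATED_TASKS = ("Nd33R4", "stlXWH", "ftjSRd")


def pick_task(seed: int, tasks: tuple[str, ...] = CURATED_TASKS) -> str:
    """Hypothetical helper: choose a task deterministically from a seed."""
    rng = random.Random(seed)  # private generator, global state is untouched
    return rng.choice(tasks)


if __name__ == "__main__":
    for seed in (10, 20, 30):
        # The same seed always yields the same task ID.
        print(seed, pick_task(seed))
```

Using a private `random.Random` instance rather than `random.seed()` keeps task selection reproducible even if other code consumes the global generator.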
|
|
| ## Tasks |
|
|
| The environment ships with 3 curated tasks at increasing difficulty. Each task presents a reference image; the agent must write GLSL code to reproduce it. The grader is SSIM (structural similarity), returning a score in [0.0, 1.0]. |
|
|
| | Task | Difficulty | Lines | Description | What the agent needs | |
| |------|-----------|-------|-------------|---------------------| |
| | `Nd33R4` | Easy | 13 | XOR color pattern on pixel coordinates | `int()` casting, bitwise XOR/AND, float conversion | |
| | `stlXWH` | Medium | 44 | SDF distance field (square minus circle) with smooth coloring | Signed distance functions, `abs`, `exp`, `smoothstep`, `cos` for distance coloring | |
| | `ftjSRd` | Hard | 122 | Raymarcher with SDF repetition, polar coordinates, HSV coloring | Ray marching loop, rotation matrices, domain repetition, HSV-to-RGB, polar coords | |
|
|
| All 3 tasks are sourced from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (Shadertoy). They were selected by code complexity (line count, number of concepts) and verified to produce visually interesting output at render time. |
|
|
| To select a specific task, pass its name to `reset()`: |
|
|
| ```python |
| result = env.reset(task="Nd33R4") # easy: XOR pattern |
| result = env.reset(task="stlXWH") # medium: SDF visualization |
| result = env.reset(task="ftjSRd") # hard: raymarcher |
| result = env.reset() # random from full corpus |
| ``` |
|
|
| ### Grading |
|
|
| Each task uses the same grader: SSIM between the agent's rendered output and the ground-truth reference image. The score is deterministic and reproducible for the same GLSL input. |
|
|
| - **Score range**: 0.0 (no similarity) to 1.0 (pixel-perfect match) |
| - **Success threshold**: score >= 0.90 |
| - **Compile/render failures**: score = 0.0 |
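|
The grading rules above can be sketched as a small function. This is a minimal illustration, not the code in `reward.py`; the name `grade` and the clamping step are assumptions.

```python
SUCCESS_THRESHOLD = 0.90  # success threshold stated above


def grade(ssim: float, compiled: bool, rendered: bool) -> tuple[float, bool]:
    """Hypothetical helper: map a render outcome to (score, success)."""
    if not (compiled and rendered):
        return 0.0, False  # compile/render failures score 0.0
    score = min(1.0, max(0.0, ssim))  # assumption: keep the score in [0.0, 1.0]
    return score, score >= SUCCESS_THRESHOLD


if __name__ == "__main__":
    print(grade(0.94, True, True))   # a passing render
    print(grade(0.94, False, True))  # a compile failure
```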
|
|
| ### Baseline Scores |
|
|
| Evaluated with GPT 5.4 via `inference.py` (5 steps per task, temperature 0.2): |
|
|
| | Task | Difficulty | Best SSIM | Step-by-step rewards | Multi-turn improvement | |
| |------|-----------|-----------|---------------------|----------------------| |
| | `Nd33R4` | Easy | **0.27** | 0.27, 0.13, 0.15, 0.15, 0.16 | No: the model generates hash noise instead of the XOR pattern | |
| | `stlXWH` | Medium | **0.94** | 0.85, 0.88, 0.91, 0.90, 0.94 | Yes: near-steady refinement from 0.85 to 0.94 | |
| | `ftjSRd` | Hard | **0.40** | 0.31, 0.15, 0.33, 0.40, 0.28 | Partial: oscillates between 0.15 and 0.40 | |
|
|
| Key observations: |
| - `Nd33R4` (easy by code complexity) is hard for VLMs because bitwise XOR patterns are difficult to reverse-engineer from a rendered image alone. The model defaults to procedural noise rather than integer math. |
| - `stlXWH` (medium) shows clear multi-turn refinement: the model progressively improves the SDF shape and color mapping across steps. |
| - `ftjSRd` (hard) challenges frontier models with 122 lines of tightly coupled raymarching, rotation, and HSV coloring. The model attempts structural elements but cannot match the exact parameters. |
|
|
| *Run `inference.py` to reproduce. Scores vary by model and API endpoint.* |
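|
The step-by-step rewards in the table can be summarized with a few lines of Python. The sketch below only re-derives figures from the numbers already listed above; the `summarize` helper is illustrative and is not part of `inference.py`.

```python
# Step-by-step rewards copied from the baseline table above.
STEP_REWARDS = {
    "Nd33R4": [0.27, 0.13, 0.15, 0.15, 0.16],
    "stlXWH": [0.85, 0.88, 0.91, 0.90, 0.94],
    "ftjSRd": [0.31, 0.15, 0.33, 0.40, 0.28],
}


def summarize(rewards: list[float]) -> dict[str, float]:
    """Return the best score, its 1-based step, and the last-minus-first gain."""
    best = max(rewards)
    return {
        "best": best,
        "best_step": rewards.index(best) + 1,
        "gain": round(rewards[-1] - rewards[0], 2),
    }


if __name__ == "__main__":
    for task, rewards in STEP_REWARDS.items():
        print(task, summarize(rewards))
```

Only `stlXWH` finishes higher than it started (gain of 0.09); the other two tasks peak at steps 1 and 4 respectively and end below their first-step score.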
|
|
| ## Task Bank (Corpus) |
|
|
| Beyond the 3 curated tasks, the full task bank is loaded from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (NeurIPS 2022). At load time, shaders are filtered to single-pass fragments with no texture/buffer inputs, yielding ~16,800 usable tasks. |
|
|
| Each task is a Shadertoy-dialect GLSL shader with known ground-truth code. The environment renders the ground truth to produce a reference image, then challenges the agent to reproduce it. |
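|
One way to approximate the "single-pass, no texture/buffer inputs" filter is a text heuristic over the shader source. The sketch below is an assumption about how such a filter could look, not the logic in `tasks.py`: it treats any `iChannel*` reference as an input dependency and requires a `mainImage` entry point. It does not strip comments, so a commented-out `iChannel0` would also be rejected.

```python
import re

# Assumption: texture/buffer inputs appear as iChannel0..3 references
# (including iChannelResolution / iChannelTime) in Shadertoy-dialect GLSL.
_CHANNEL_RE = re.compile(r"\biChannel\w*")
_ENTRY_RE = re.compile(r"\bvoid\s+mainImage\s*\(")


def is_self_contained(code: str) -> bool:
    """Heuristic: has a mainImage entry point and reads no input channels."""
    return bool(_ENTRY_RE.search(code)) and not _CHANNEL_RE.search(code)


if __name__ == "__main__":
    plain = "void mainImage(out vec4 c, in vec2 f) { c = vec4(1.0); }"
    textured = "void mainImage(out vec4 c, in vec2 f) { c = texture(iChannel0, f); }"
    print(is_self_contained(plain), is_self_contained(textured))
```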
|
|
| | Field | Description | |
| |-------|-------------| |
| | `name` | ShaderToy ID (e.g. `MdGcDc`) | |
| | `code` | GLSL fragment shader source | |
| | `source` | ShaderToy URL for provenance | |
| | `resolution` | Render resolution, default 512x288 | |
| | `time` | iTime uniform value, default 0.0 | |
| | `difficulty` | `easy`, `medium`, `hard` (curated tasks only) | |
|
|
| ## Motivation |
|
|
| Shader work has properties that make it well-suited as an RL environment: |
|
|
| - Feedback is fast and dense |
| - Compile success, render success, and performance are easy to gate |
| - The same shader can be evaluated under multiple controlled render conditions |
| - An active public corpus exists (Shadertoy, ~1M public shaders) alongside established compiler infrastructure (`shaderc`, `glslang`) |
|
|
| ShaderEval (LLM4Code @ ICSE 2025) benchmarked current LLMs on GLSL function completion and found failure rates above 31% even for top models. GLSL is low-resource in pretraining corpora, leaving room for RL-trained or fine-tuned models to improve on baselines. |
|
|
| ### Positioning |
|
|
| There is existing adjacent work on conditioned procedural material generation, RL-based material parameter optimization, interactive evolutionary shader generation, and LLM-driven real-time shader generation. The contribution here is not "LLMs can emit shader code" but rather a reusable environment layer with stable runtime, task packaging, and reward/eval plugins. |
|
|
| ## Runtime Stack |
|
|
| The primary target is GLSL via headless OpenGL. |
|
|
| - **GLSL** as the authoring and execution language |
|   - The entire relevant corpus is GLSL: Shadertoy (~1M shaders), shaders21k (21K shaders), ShaderEval (the only published GLSL benchmark) |
|   - GLSL has explicit representation in LLM pretraining corpora (The Stack includes a GLSL subset); WGSL has near-zero public training data |
|   - WGSL prohibits recursion, has no implicit type coercions, and uses incompatible uniform conventions, making Shadertoy-dialect GLSL non-transpilable to WGSL via naga at corpus scale (~15% failure rate) |
| - **ModernGL** with EGL as the headless rendering backend (`render.py`) |
|   - Runs on Linux servers without a display via EGL |
|   - Used by shaders21k for offline rendering |
|   - Wraps user shader code in a Shadertoy-compatible preamble (standard uniforms, `mainImage` forward declaration) |
|   - Strips `#version`, `precision`, and `#extension` directives from user code to avoid conflicts |
|   - Adjusts error line numbers reported by the driver to map back to user code |
| - **Subprocess isolation** (`harness.py`) |
|   - Each render runs in a separate process to contain driver crashes, infinite loops, and GPU state corruption |
|   - Configurable per-render timeout (default 10s) |
|   - Returns structured `RenderResult` with compile/render status, error messages, and raw RGBA pixel data |
| - **`shaderc` / `glslang`** for offline validation and portability checks (planned) |
| - **WGSL / WebGPU** deferred to a later phase for browser-native demos |
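|
The preamble wrapping, directive stripping, and error-line remapping described for `render.py` can be sketched as follows. The preamble text, the function names, and the GLSL version are assumptions for illustration; the actual renderer may declare more uniforms or handle directives differently.

```python
import re

# Hypothetical preamble; the real one in render.py may declare more uniforms.
PREAMBLE = """#version 330
uniform vec3 iResolution;
uniform float iTime;
out vec4 fragColor;
void mainImage(out vec4 c, in vec2 f);
void main() { mainImage(fragColor, gl_FragCoord.xy); }
"""
PREAMBLE_LINES = PREAMBLE.count("\n")

# Directives in user code that would clash with the preamble.
_DIRECTIVE_RE = re.compile(
    r"^[ \t]*(?:#[ \t]*(?:version|extension)|precision)\b.*$", re.MULTILINE
)


def wrap_shader(user_code: str) -> str:
    """Blank out conflicting directives and prepend the preamble."""
    # Only the directive text is removed; its newline stays, so user line
    # numbers are unchanged after stripping.
    cleaned = _DIRECTIVE_RE.sub("", user_code)
    return PREAMBLE + cleaned


def to_user_line(driver_line: int) -> int:
    """Map a 1-based line in the wrapped source back to the user's source."""
    return driver_line - PREAMBLE_LINES
```

Blanking directives instead of deleting their lines keeps the remapping a constant offset, so an error the driver reports on wrapped line `PREAMBLE_LINES + 3` corresponds to line 3 of the submitted code.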
|
|
| ## Episode Schema |
|
|
| The environment follows a standard multi-turn refinement loop: |
|
|
| 1. `reset()` picks a task from the bank, renders the ground-truth reference, and returns it as a base64 PNG |
| 2. The agent submits GLSL code via `step(ShaderAction(code=...))` |
| 3. The server compiles and renders the shader, computes SSIM vs reference |
| 4. Returns compile/render status, errors, rendered image, and SSIM reward |
| 5. Episode ends when the turn budget is exhausted (default 10 turns) or SSIM >= 0.99 |
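|
The termination rule in step 5 can be illustrated offline by replaying a list of per-turn SSIM scores. This is a simulation under the defaults stated above (10 turns, SSIM >= 0.99); it does not contact the server, and `run_episode` is a hypothetical name.

```python
MAX_TURNS = 10         # default turn budget stated above
DONE_THRESHOLD = 0.99  # early-termination SSIM stated above


def run_episode(ssim_per_turn: list[float]) -> tuple[int, float, bool]:
    """Replay per-turn SSIM scores through the termination rule.

    Returns (turns_used, best_ssim, solved).
    """
    best = 0.0
    turns_used = 0
    for ssim in ssim_per_turn[:MAX_TURNS]:
        turns_used += 1
        best = max(best, ssim)
        if ssim >= DONE_THRESHOLD:
            return turns_used, best, True  # solved before the budget ran out
    return turns_used, best, False  # budget (or supplied scores) exhausted


if __name__ == "__main__":
    print(run_episode([0.50, 0.995, 0.20]))  # stops after the second turn
    print(run_episode([0.10] * 12))          # capped at MAX_TURNS
```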
|
|
| The server supports up to 4 concurrent environment sessions (`max_concurrent_envs=4`). |
|
|
| This supports RL training, best-of-N search and reranking, trajectory collection for SFT or distillation, and evaluation under a shared runtime. |
|
|
| ## Environment Variants |
|
|
| ### `shader` (primary) |
|
|
| Match a target still image or short animated effect within a frame-time budget. Direct, visual, and evaluable without a full DCC toolchain. |
|
|
| ### Material Graph |
|
|
| Edit procedural material graphs and parameters toward a target appearance. The search space is more structured than raw shader code, and recent procedural-material work already uses graph/program representations. |
|
|
| ### Shader Repair |
|
|
| Start from shaders that are broken, slow, unstable, or portability-problematic and optimize for correctness, robustness, and performance. |
|
|
| ## Task Families |
|
|
| - **Reference recreation**: recreate a target still or short effect from a reference render |
| - **Repair**: fix syntax errors, portability failures, or numerical instability |
| - **Optimization**: preserve appearance while reducing frame time or instruction count |
| - **Style transfer**: preserve scene logic while shifting color, texture, motion, or lighting style |
| - **Critique incorporation**: revise the shader based on iterative feedback |
| - **Robustness repair**: stabilize a shader across resolutions, aspect ratios, and time ranges |
|
|
| ## Evaluation |
|
|
| Evaluation relies on hidden checks rather than visible examples only: |
|
|
| - Compile success |
| - Render success (no NaNs or fatal runtime failures) |
| - Perceptual similarity on held-out stills |
| - Temporal consistency on held-out short clips |
| - Stability across resolutions, aspect ratios, and parameter seeds |
| - Frame-time or instruction-budget limits |
| - Optional portability checks across compiler/validator paths |
|
|
| ### Reward Structure |
|
|
| **Current implementation** (`reward.py`): windowed SSIM (Wang et al. 2004) between the agent render and the reference, computed per channel on RGB and averaged. It uses scipy's `uniform_filter` for windowed statistics when available and falls back to global-statistics SSIM otherwise. Compile and render failures receive reward 0.0. Reward range: [0.0, 1.0]. |
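|
For readers unfamiliar with SSIM, the global-statistics variant can be written in a few lines. The sketch below is a dependency-free illustration of the formula from Wang et al. (2004) applied to whole channels; it is not the code in `reward.py`, it omits the windowed path, and the final clamp to [0, 1] is an assumption made to match the stated reward range.

```python
from statistics import fmean

# Stabilizing constants from Wang et al. (2004) for a dynamic range of 1.0.
C1 = (0.01 * 1.0) ** 2
C2 = (0.03 * 1.0) ** 2


def ssim_channel(x: list[float], y: list[float]) -> float:
    """Global-statistics SSIM for one channel of pixel values in [0, 1]."""
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx * mx + my * my + C1) * (vx + vy + C2)
    )


def ssim_rgb(img_a: list[list[float]], img_b: list[list[float]]) -> float:
    """Average per-channel SSIM over R, G, B; clamped to [0, 1] (assumption)."""
    score = fmean(ssim_channel(a, b) for a, b in zip(img_a, img_b))
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    channel = [0.0, 0.5, 1.0, 0.25]
    image = [channel, channel, channel]  # three flattened channels
    print(ssim_rgb(image, image))        # identical images
```

Each image here is three flattened channels of floats. Raw SSIM can be negative for anti-correlated images, which is why the sketch clamps before returning.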
|
|
| **Planned multi-component reward:** |
|
|
| ```text |
| R = G_compile * G_render * ( |
|     0.35 * appearance_match + |
|     0.20 * temporal_stability + |
|     0.20 * performance + |
|     0.15 * robustness + |
|     0.10 * code_quality |
| ) - step_penalty - regression_penalty |
| ``` |
|
|
| Component notes: |
|
|
| - **`appearance_match`**: measured on hidden render conditions using DINOv2 cosine similarity (better than CLIP for texture/color/style fidelity) combined with pixel-level SSIM for structural accuracy. CLIP is appropriate only for text-conditioned task variants. |
| - **`temporal_stability`**: requires rendering N consecutive frames and computing frame-to-frame SSIM. For v1, this may serve as a held-out evaluation metric rather than a dense training reward to keep per-episode compute manageable. |
| - **`performance`**: frame-time as a reward is tractable headlessly. To avoid reward hacking (trivially simple shaders that are fast but visually wrong), this is gated as a hard budget first (penalize frames over a threshold) before adding a continuous score. |
| - **`robustness`**: captures resolution changes, seed changes, and compiler portability. |
| - **`G_compile * G_render`**: hard multiplicative gates, standard in code generation RL. These cause a zero-gradient problem early in training; mitigate with curriculum learning (start from partial or working shader skeletons) and/or soft penalties before hardening. |
| - **SFT warm-up** is a prerequisite before RL. Adjacent work (RLRF for SVG) shows that RL directly on an instruction-tuned model without a domain SFT stage fails because the base model cannot generate renderable output reliably enough to produce reward variance. |
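|
The planned formula translates directly into code. The sketch below evaluates it for given component scores; the function name, the dictionary layout, and the default penalties of zero are assumptions, since the reward plugin API is not implemented yet.

```python
# Weights copied from the planned formula above; they sum to 1.0.
WEIGHTS = {
    "appearance_match": 0.35,
    "temporal_stability": 0.20,
    "performance": 0.20,
    "robustness": 0.15,
    "code_quality": 0.10,
}


def planned_reward(
    compiled: bool,
    rendered: bool,
    components: dict[str, float],
    step_penalty: float = 0.0,
    regression_penalty: float = 0.0,
) -> float:
    """Evaluate R = G_compile * G_render * (weighted sum) - penalties."""
    gate = float(compiled) * float(rendered)  # hard multiplicative gates
    weighted = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    return gate * weighted - step_penalty - regression_penalty


if __name__ == "__main__":
    perfect = {name: 1.0 for name in WEIGHTS}
    print(planned_reward(True, True, perfect))
    print(planned_reward(False, True, perfect, step_penalty=0.01))
```

With either gate closed, every rollout receives the same value (the negated penalties) regardless of its component scores, which is the zero-gradient situation the component notes describe.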
|
|
| ## Action / Observation Contract |
|
|
| ```python |
| class ShaderAction(Action): |
|     code: str # Shadertoy-dialect GLSL fragment shader source |
| |
| class ShaderObservation(Observation): |
|     task: str # ShaderToy ID |
|     remaining: int # turns left in episode |
|     reference_png: str # base64 PNG (non-empty on reset only) |
|     compiled: bool |
|     rendered: bool |
|     errors: list[str] |
|     agent_png: str # base64 PNG of agent's render |
|     ssim: float # SSIM vs reference in [0, 1] |
|     done: bool = False |
|     reward: float | None = None |
| ``` |
|
|
| - Placing `reward` and `done` inside `Observation` follows the OpenEnv spec (RFC 002, Decision 2): rewards are computed inside the environment and returned as part of the observation; the server layer promotes them to the top-level `StepResponse`. |
| - Detailed artifacts (frame dumps, profiler traces, held-out evaluation results) live behind tools rather than being inlined on every turn. |
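|
Because `reference_png` and `agent_png` arrive as base64 strings, a client typically decodes them before saving or viewing. The helper below is a hypothetical client-side convenience, not part of the `shader` package; it treats an empty field as "no image" and checks the standard 8-byte PNG signature.

```python
import base64
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_png(b64_png: str) -> Optional[bytes]:
    """Hypothetical helper: return PNG bytes, or None if the field is empty."""
    if not b64_png:
        return None  # e.g. reference_png on observations after reset
    raw = base64.b64decode(b64_png)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a PNG")
    return raw


if __name__ == "__main__":
    fake = base64.b64encode(PNG_SIGNATURE + b"...").decode()
    print(len(decode_png(fake)), decode_png(""))
```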
|
|
| ## Training |
|
|
| ### Algorithm |
|
|
| - **GRPO** as the baseline, with multi-turn extensions: |
|   - **MURPHY** (NeurIPS 2025): for each rollout that does not reach maximum reward, execution feedback is appended and new rollouts are generated from that state. Up to 8% relative gain over single-turn GRPO. |
|   - **TRLOO** (Dr. Kernel): addresses the biased policy gradient problem that vanilla GRPO has in multi-turn settings. |
| - The episode schema (compile, render, reward, next turn) maps directly onto MURPHY's interaction loop. |
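|
As background for the GRPO baseline, its core step is normalizing each rollout's reward against the other rollouts sampled for the same prompt. The sketch below shows only that single-turn normalization; it does not implement MURPHY or TRLOO, and the `eps` value and the use of population standard deviation are assumptions (implementations vary).

```python
from statistics import fmean, pstdev


def group_relative_advantages(
    rewards: list[float], eps: float = 1e-6
) -> list[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    print(group_relative_advantages([0.85, 0.88, 0.91, 0.90]))
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))  # all failed
```

A group in which every rollout fails to compile gets all-zero advantages, so it contributes no learning signal; this is one reason an SFT warm-up or curriculum is useful before RL.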
|
|
| ### Libraries |
|
|
| - **veRL**: supports GRPO, DAPO, GSPO; scales to large models; best throughput for long multi-turn rollouts |
| - **OpenRLHF-M**: multimodal variant of OpenRLHF; targets VLM policies (code + rendered image inputs) |
|
|
| Both have TRL and OpenEnv integrations. |
|
|
| ### Task Bank |
|
|
| The task bank is populated from the shaders21k corpus (~16.8K single-pass shaders after filtering). RL training requires 10K-15K rollouts minimum based on adjacent work (CTRL-S: 14.4K, Reason-SVG: 10K). |
|
|
| ## Related Work |
|
|
| | Work | Relevance | |
| |------|-----------| |
| | **Shadertoy** | Active public corpus and community; seed source for tasks and reference effects | |
| | **OpenEnv** (Meta/PyTorch) | Target framework. Client-server RL environment with `Action`/`Observation` base classes, `reset()`/`step()` contract, and RFC 004 Rubric system. TRL, Unsloth, SkyRL, and Oumi integrations. | |
| | **ShaderEval** (LLM4Code @ ICSE 2025) | Only formal benchmark for GLSL code generation evaluation. 467 functions, pixel-diff evaluation via `shadermatch`. | |
| | **shaders21k** (NeurIPS 2022) | 21K OpenGL fragment shaders from Shadertoy for visual representation learning. Ready seed corpus for task generation. | |
| | **AI Co-Artist** (arXiv:2512.08951) | Closest work to multi-turn LLM-driven GLSL refinement. GPT-4 + Picbreeder-style evolution, <3% compile error after retries. No RL. | |
| | **VLMaterial** (ICLR 2025 Spotlight) | Fine-tunes a VLM for Blender material node graphs from images. Validates rendered-image similarity as a reward signal for code-generating policies. | |
| | **Dr. Kernel / KernelGYM** (arXiv:2602.05885) | Multi-turn RL for GPU code generation (CUDA/Triton) with compile-correctness-speedup reward chain. Proposes TRLOO for multi-turn credit assignment. | |
| | **MURPHY** (NeurIPS 2025) | Multi-turn GRPO for code generation. Canonical algorithm for the interaction loop used here. | |
| | **ProcMatRL** (SIGGRAPH Asia 2024) | RL for procedural material parameter optimization. Validates RL applicability in the visual generation domain. | |
| | **ShadAR** (arXiv:2602.17481) | LLM-driven real-time shader generation for AR. | |
| | **Procedural Shader Evolution** (arXiv:2312.17587) | Interactive evolutionary algorithms for shader generation. Multi-turn refinement as iterative optimization. | |
|
|
| ## Licensing |
|
|
| Shader code in the task bank is sourced from [ShaderToy](https://www.shadertoy.com/) via the [shaders21k](https://github.com/mbaradad/shaders21k) dataset. ShaderToy's default license is **CC BY-NC-SA 3.0**; authors may choose a different license, but there is no structured per-shader license metadata in the dataset or the ShaderToy API. |
|
|
| - The dataset is not redistributed in this repository; it is downloaded at build time |
| - Individual shader provenance is tracked via the `source` field on each task (links back to the ShaderToy page) |
| - For commercial use, per-shader license review is required |
|
|
| ## Next Steps |
|
|
| - Plan the SFT warm-up dataset and training run (prerequisite for RL) |
| - Extend reward beyond SSIM: DINOv2 appearance match, temporal stability, performance, robustness |
| - Implement the reward plugin API (multi-component, pluggable) |
| - Establish the hidden evaluation protocol across time, resolution, and seed variations |
| - Add `shaderc` / `glslang` offline validation |
| - Add support for multi-pass shaders and texture inputs |
|
|
| ## References |
|
|
| ### Primary |
|
|
| - Shadertoy: <https://www.shadertoy.com/> |
| - OpenEnv (Meta/PyTorch): <https://github.com/meta-pytorch/OpenEnv> |
| - OpenEnv RFC 002: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/002-env-spec.md> |
| - OpenEnv RFC 004: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/004-rubrics.md> |
| - TRL OpenEnv integration: <https://huggingface.co/docs/trl/openenv> |
| - ShaderEval (LLM4Code @ ICSE 2025): <https://conf.researchr.org/details/icse-2025/llm4code-2025-papers/13/Evaluating-Language-Models-for-Computer-Graphics-Code-Completion> |
| - shadertoys-dataset: <https://github.com/Vipitis/shadertoys-dataset> |
| - Shadereval-inputs (HuggingFace): <https://huggingface.co/datasets/Vipitis/Shadereval-inputs> |
| - shaders21k (NeurIPS 2022): <https://arxiv.org/abs/2211.16412> |
| - shaders21k dataset: <https://github.com/mbaradad/shaders21k> |
| - AI Co-Artist: <https://arxiv.org/abs/2512.08951> |
| - VLMaterial (ICLR 2025): <https://arxiv.org/abs/2501.18623> |
| - VLMaterial code: <https://github.com/mit-gfx/VLMaterial> |
| - Dr. Kernel / KernelGYM: <https://arxiv.org/abs/2602.05885> |
| - MURPHY (NeurIPS 2025): <https://arxiv.org/abs/2511.07833> |
| - RLRF (SVG RL): <https://arxiv.org/abs/2505.20793> |
|
|
| ### Tools and Infrastructure |
|
|
| - `shaderc`: <https://github.com/google/shaderc> |
| - `glslang`: <https://github.com/KhronosGroup/glslang> |
| - `pygfx/shadertoy`: <https://github.com/pygfx/shadertoy> |
| - moderngl: <https://github.com/moderngl/moderngl> |
| - veRL: <https://github.com/volcengine/verl> |
| - OpenRLHF-M: <https://github.com/OpenRLHF/OpenRLHF-M> |
| - Adobe ProcMatRL: <https://github.com/adobe-research/ProcMatRL> |
|
|
| ### Specifications |
|
|
| - WGSL specification: <https://www.w3.org/TR/WGSL/> |
| - WebGPU: <https://webgpu.org/> |
| - Khronos `glslang` reference: <https://www.khronos.org/opengles/sdk/Reference-Compiler/> |
|
|
| ### Supplementary |
|
|
| - Procedural Shader Evolution: <https://arxiv.org/abs/2312.17587> |
| - ShadAR: <https://arxiv.org/abs/2602.17481> |
| - ProcMatRL paper (SIGGRAPH Asia 2024): <https://doi.org/10.1145/3687979> |
| - Conditioned Procedural Materials (SIGGRAPH 2023): <https://dl.acm.org/doi/10.1145/3588432.3591520> |
| - FragCoord.xyz: <https://fragcoord.xyz/> |
| - naga GLSL front-end failures: <https://github.com/Vipitis/shadertoys-dataset/issues/15> |
|
|