---
title: Shader Environment Server
emoji: 🎨
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Shader Environment

## Overview

`shader` is an OpenEnv-compatible environment for generating, repairing, and iteratively refining executable shaders against visual and systems constraints.

Supported interaction modes:

- Reference-conditioned shader recreation via SSIM reward
- Multi-turn shader refinement (not one-shot only)
- GLSL-first execution via headless rendering, with WGSL/browser-native as a later target
- Pluggable reward traces for RL, evaluation, reranking, and distillation
- Hidden evaluation across time, resolution, and seed variations

## Project Structure

```
envs/shader/
├── __init__.py       # Public API: ShaderEnv, ShaderAction, ShaderObservation
├── models.py         # Pydantic Action / Observation types
├── client.py         # OpenEnv client (ShaderEnv)
├── tasks.py          # Task bank loader (shaders21k corpus)
├── reward.py         # SSIM reward computation
├── render.py         # Headless GLSL renderer (ModernGL + EGL)
├── harness.py        # Subprocess-isolated render wrapper
├── download.sh       # Fetches shaders21k dataset
├── openenv.yaml      # OpenEnv space descriptor
├── pyproject.toml
└── server/
    ├── app.py            # FastAPI application (OpenEnv HTTP server)
    ├── environment.py    # Environment implementation (reset / step loop)
    └── Dockerfile
```

## Quickstart

```bash
# Download the shaders21k corpus (~41 MB)
cd envs/shader
./download.sh
```

### Running via uv / uvicorn

```bash
cd envs/shader
PYTHONPATH=../.. uvicorn server.app:app --host 0.0.0.0 --port 8000
```

### Running via Docker

The corpus is downloaded at build time automatically:

```bash
cd envs/shader
docker build -f server/Dockerfile -t shader .
docker run -p 8000:8000 shader
```

### Validation

```bash
# Validate local structure
cd envs/shader
openenv validate --verbose

# Validate a running server (6 criteria: openapi, health, metadata, schema, mcp, mode)
openenv validate http://localhost:8000
```

### Interacting via WebSocket

The HTTP endpoints (`/reset`, `/step`) are stateless: each request creates a fresh environment instance. For multi-turn sessions with persistent state, use the WebSocket endpoint:

```python
import asyncio
import json

import websockets


async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        # Reset: picks a task and renders the reference
        await ws.send(json.dumps({"type": "reset", "data": {}}))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(obs["task"], obs["remaining"])

        # Step: submit GLSL, get back SSIM + render
        await ws.send(json.dumps({
            "type": "step",
            "data": {"code": "void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"},
        }))
        resp = json.loads(await ws.recv())
        r = resp["data"]
        print(f"compiled={r['observation']['compiled']} ssim={r['observation']['ssim']} reward={r['reward']}")

        # Close the session explicitly
        await ws.send(json.dumps({"type": "close"}))


asyncio.run(main())
```

### Python Client

```python
from shader import ShaderEnv, ShaderAction

with ShaderEnv(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    print(result.observation.task)           # ShaderToy ID
    print(result.observation.reference_png)  # base64 PNG

    result = client.step(ShaderAction(code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"))
    print(result.observation.compiled)  # True/False
    print(result.observation.ssim)      # similarity vs reference
```

### Benchmark

The benchmark script runs GPT 5.4 against the environment over WebSocket, producing a reproducible baseline:

```bash
# Requires a running server and OPENAI_API_KEY set
python envs/shader/benchmark.py                               # 3 episodes, default seeds
python envs/shader/benchmark.py --turns 5                     # cap turns per episode
python envs/shader/benchmark.py --url ws://localhost:8001/ws  # custom server
python envs/shader/benchmark.py --seeds 10 20 30              # custom seeds
```

Seeds control reproducible task selection. Results are saved to `benchmark_output/results.json`.

## Tasks

The environment ships with 3 curated tasks at increasing difficulty. Each task presents a reference image; the agent must write GLSL code to reproduce it. The grader is SSIM (structural similarity), returning a score in [0.0, 1.0].

| Task | Difficulty | Lines | Description | What the agent needs |
|------|-----------|-------|-------------|---------------------|
| `Nd33R4` | Easy | 13 | XOR color pattern on pixel coordinates | `int()` casting, bitwise XOR/AND, float conversion |
| `stlXWH` | Medium | 44 | SDF distance field (square minus circle) with smooth coloring | Signed distance functions, `abs`, `exp`, `smoothstep`, `cos` for distance coloring |
| `ftjSRd` | Hard | 122 | Raymarcher with SDF repetition, polar coordinates, HSV coloring | Ray marching loop, rotation matrices, domain repetition, HSV-to-RGB, polar coords |

All 3 tasks are sourced from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (Shadertoy). They were selected by code complexity (line count, number of concepts) and verified to produce visually interesting output at render time.

To select a specific task, pass its name to `reset()`:

```python
result = env.reset(task="Nd33R4")  # easy: XOR pattern
result = env.reset(task="stlXWH")  # medium: SDF visualization
result = env.reset(task="ftjSRd")  # hard: raymarcher
result = env.reset()               # random from full corpus
```

### Grading

Each task uses the same grader: SSIM between the agent's rendered output and the ground-truth reference image. The score is deterministic and reproducible for the same GLSL input.

- **Score range**: 0.0 (no similarity) to 1.0 (pixel-perfect match)
- **Success threshold**: score >= 0.90
- **Compile/render failures**: score = 0.0

### Baseline Scores

Evaluated with GPT 5.4 via `inference.py` (5 steps per task, temperature 0.2):

| Task | Difficulty | Best SSIM | Step-by-step rewards | Multi-turn improvement |
|------|-----------|-----------|---------------------|----------------------|
| `Nd33R4` | Easy | **0.27** | 0.27, 0.13, 0.15, 0.15, 0.16 | No: model generates hash noise instead of XOR pattern |
| `stlXWH` | Medium | **0.94** | 0.85, 0.88, 0.91, 0.90, 0.94 | Yes: steady refinement from 0.85 to 0.94 |
| `ftjSRd` | Hard | **0.40** | 0.31, 0.15, 0.33, 0.40, 0.28 | Partial: oscillates between 0.15 and 0.40 |

Key observations:

- `Nd33R4` (easy by code complexity) is hard for VLMs because bitwise XOR patterns are difficult to reverse-engineer from a rendered image alone. The model defaults to procedural noise rather than integer math.
- `stlXWH` (medium) shows clear multi-turn refinement: the model progressively improves the SDF shape and color mapping across steps.
- `ftjSRd` (hard) challenges frontier models with 122 lines of tightly coupled raymarching, rotation, and HSV coloring. The model attempts structural elements but cannot match the exact parameters.

*Run `inference.py` to reproduce. Scores vary by model and API endpoint.*

## Task Bank (Corpus)

Beyond the 3 curated tasks, the full task bank is loaded from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (NeurIPS 2022). At load time, shaders are filtered to single-pass fragments with no texture/buffer inputs, yielding ~16,800 usable tasks.

Each task is a Shadertoy-dialect GLSL shader with known ground-truth code. The environment renders the ground truth to produce a reference image, then challenges the agent to reproduce it.

| Field | Description |
|-------|-------------|
| `name` | ShaderToy ID (e.g. `MdGcDc`) |
| `code` | GLSL fragment shader source |
| `source` | ShaderToy URL for provenance |
| `resolution` | Render resolution, default 512x288 |
| `time` | iTime uniform value, default 0.0 |
| `difficulty` | `easy`, `medium`, `hard` (curated tasks only) |

## Motivation

Shader work has properties that make it well-suited as an RL environment:

- Feedback is fast and dense
- Compile success, render success, and performance are easy to gate
- The same shader can be evaluated under multiple controlled render conditions
- An active public corpus exists (Shadertoy, ~1M public shaders) alongside established compiler infrastructure (`shaderc`, `glslang`)

ShaderEval (LLM4Code @ ICSE 2025) benchmarked current LLMs on GLSL function completion and found failure rates above 31% even for top models. GLSL is low-resource in pretraining corpora, leaving room for RL-trained or fine-tuned models to improve on baselines.

### Positioning

There is existing adjacent work on conditioned procedural material generation, RL-based material parameter optimization, interactive evolutionary shader generation, and LLM-driven real-time shader generation. The contribution here is not "LLMs can emit shader code" but rather a reusable environment layer with stable runtime, task packaging, and reward/eval plugins.

## Runtime Stack

The primary target is GLSL via headless OpenGL.

- **GLSL** as the authoring and execution language
  - The entire relevant corpus is GLSL: Shadertoy (~1M shaders), shaders21k (21K shaders), ShaderEval (the only published GLSL benchmark)
  - GLSL has explicit representation in LLM pretraining corpora (The Stack includes a GLSL subset); WGSL has near-zero public training data
  - WGSL prohibits recursion, has no implicit type coercions, and uses incompatible uniform conventions, making Shadertoy-dialect GLSL non-transpilable to WGSL via naga at corpus scale (~15% failure rate)
- **ModernGL** with EGL as the headless rendering backend (`render.py`)
  - Runs on Linux servers without a display via EGL
  - Used by shaders21k for offline rendering
  - Wraps user shader code in a Shadertoy-compatible preamble (standard uniforms, `mainImage` forward declaration)
  - Strips `#version`, `precision`, and `#extension` directives from user code to avoid conflicts
  - Adjusts error line numbers reported by the driver to map back to user code
- **Subprocess isolation** (`harness.py`)
  - Each render runs in a separate process to contain driver crashes, infinite loops, and GPU state corruption
  - Configurable per-render timeout (default 10s)
  - Returns structured `RenderResult` with compile/render status, error messages, and raw RGBA pixel data
- **`shaderc` / `glslang`** for offline validation and portability checks (planned)
- **WGSL / WebGPU** deferred to a later phase for browser-native demos

## Episode Schema

The environment follows a standard multi-turn refinement loop:

1. `reset()` picks a task from the bank, renders the ground-truth reference, and returns it as a base64 PNG
2. The agent submits GLSL code via `step(ShaderAction(code=...))`
3. The server compiles and renders the shader, then computes SSIM against the reference
4. The server returns compile/render status, errors, the rendered image, and the SSIM reward
5. The episode ends when the turn budget is exhausted (default 10 turns) or SSIM >= 0.99

The server supports up to 4 concurrent environment sessions (`max_concurrent_envs=4`).

The same episode loop supports RL training, best-of-N search and reranking, trajectory collection for SFT or distillation, and evaluation under a shared runtime.

## Environment Variants

### `shader` (primary)

Match a target still image or short animated effect within a frame-time budget. Direct, visual, and evaluable without a full DCC toolchain.

### Material Graph

Edit procedural material graphs and parameters toward a target appearance. The search space is more structured than raw shader code, and recent procedural-material work already uses graph/program representations.

### Shader Repair

Start from shaders that are broken, slow, unstable, or portability-problematic and optimize for correctness, robustness, and performance.

## Task Families

- **Reference recreation**: recreate a target still or short effect from a reference render
- **Repair**: fix syntax errors, portability failures, or numerical instability
- **Optimization**: preserve appearance while reducing frame time or instruction count
- **Style transfer**: preserve scene logic while shifting color, texture, motion, or lighting style
- **Critique incorporation**: revise the shader based on iterative feedback
- **Robustness repair**: stabilize a shader across resolutions, aspect ratios, and time ranges

## Evaluation

Evaluation relies on hidden checks rather than visible examples only:

- Compile success
- Render success (no NaNs or fatal runtime failures)
- Perceptual similarity on held-out stills
- Temporal consistency on held-out short clips
- Stability across resolutions, aspect ratios, and parameter seeds
- Frame-time or instruction-budget limits
- Optional portability checks across compiler/validator paths

### Reward Structure

**Current implementation** (`reward.py`): windowed SSIM (Wang et al. 2004) between the agent render and the reference, computed per channel on RGB and averaged. It uses SciPy's `uniform_filter` for windowed statistics when available and falls back to global-statistics SSIM otherwise. Compile and render failures receive reward 0.0. Reward range: [0.0, 1.0].

**Planned multi-component reward:**

```text
R = G_compile * G_render * (
      0.35 * appearance_match
    + 0.20 * temporal_stability
    + 0.20 * performance
    + 0.15 * robustness
    + 0.10 * code_quality
) - step_penalty - regression_penalty
```

Component notes:

- **`appearance_match`**: measured on hidden render conditions using DINOv2 cosine similarity (better than CLIP for texture/color/style fidelity) combined with pixel-level SSIM for structural accuracy. CLIP is appropriate only for text-conditioned task variants.
- **`temporal_stability`**: requires rendering N consecutive frames and computing frame-to-frame SSIM. For v1, this may serve as a held-out evaluation metric rather than a dense training reward to keep per-episode compute manageable.
- **`performance`**: frame time as a reward is tractable headlessly. To avoid reward hacking (trivially simple shaders that are fast but visually wrong), this is gated as a hard budget first (penalize frames over a threshold) before adding a continuous score.
- **`robustness`**: captures resolution changes, seed changes, and compiler portability.
- **`G_compile * G_render`**: hard multiplicative gates, standard in code generation RL. These cause a zero-gradient problem early in training; mitigate with curriculum learning (start from partial or working shader skeletons) and/or soft penalties before hardening.
- **SFT warm-up** is a prerequisite before RL. Adjacent work (RLRF for SVG) shows that RL directly on an instruction-tuned model without a domain SFT stage fails because the base model cannot generate renderable output reliably enough to produce reward variance.

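
The planned formula can be sketched in a few lines of Python. This is only an illustration and not the code in `reward.py`: the function name `planned_reward`, the `WEIGHTS` dict, and the assumption that every component score is already normalized to [0, 1] are hypothetical, while the weights and the gate structure are taken from the formula.

```python
# Hypothetical sketch of the planned multi-component reward.
# Not the reward.py implementation; names and normalization are assumptions.
WEIGHTS = {
    "appearance_match": 0.35,
    "temporal_stability": 0.20,
    "performance": 0.20,
    "robustness": 0.15,
    "code_quality": 0.10,
}


def planned_reward(compiled, rendered, components,
                   step_penalty=0.0, regression_penalty=0.0):
    """Combine component scores (assumed in [0, 1]) behind hard gates."""
    # G_compile * G_render: the weighted sum only counts if both succeed.
    gate = 1.0 if (compiled and rendered) else 0.0
    # Missing components are treated as 0.0 (an assumption of this sketch).
    weighted = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    return gate * weighted - step_penalty - regression_penalty


if __name__ == "__main__":
    perfect = {name: 1.0 for name in WEIGHTS}
    print(planned_reward(True, True, perfect))    # gates open
    print(planned_reward(False, True, perfect))   # compile gate closed
```

With hard gates, every non-compiling shader gets the same value (only the penalties remain), which is the zero-gradient situation described in the component notes.
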
## Action / Observation Contract

```python
class ShaderAction(Action):
    code: str  # Shadertoy-dialect GLSL fragment shader source


class ShaderObservation(Observation):
    task: str            # ShaderToy ID
    remaining: int       # turns left in episode
    reference_png: str   # base64 PNG (non-empty on reset only)
    compiled: bool
    rendered: bool
    errors: list[str]
    agent_png: str       # base64 PNG of agent's render
    ssim: float          # SSIM vs reference in [0, 1]
    done: bool = False
    reward: float | None = None
```

- Placing `reward` and `done` inside `Observation` follows the OpenEnv spec (RFC 002, Decision 2): rewards are computed inside the environment and returned as part of the observation; the server layer promotes them to the top-level `StepResponse`.
- Detailed artifacts (frame dumps, profiler traces, held-out evaluation results) live behind tools rather than being inlined on every turn.

## Training

### Algorithm

- **GRPO** as the baseline, with multi-turn extensions:
  - **MURPHY** (NeurIPS 2025): for each rollout that does not reach maximum reward, execution feedback is appended and new rollouts are generated from that state. Up to 8% relative gain over single-turn GRPO.
  - **TRLOO** (Dr. Kernel): addresses the biased policy gradient problem that vanilla GRPO has in multi-turn settings.
- The episode schema (compile, render, reward, next turn) maps directly onto MURPHY's interaction loop.

### Libraries

- **veRL**: supports GRPO, DAPO, GSPO; scales to large models; best throughput for long multi-turn rollouts
- **OpenRLHF-M**: multimodal variant of OpenRLHF; targets VLM policies (code + rendered image inputs)

Both have TRL and OpenEnv integrations.

### Task Bank

The task bank is populated from the shaders21k corpus (~16.8K single-pass shaders after filtering). RL training requires 10K-15K rollouts minimum based on adjacent work (CTRL-S: 14.4K, Reason-SVG: 10K).

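
As a rough illustration of the GRPO baseline named in this section, the sketch below turns the rewards of one group of rollouts on the same task into group-relative advantages by subtracting the group mean and dividing by the group standard deviation. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` guard are assumptions made for illustration; the multi-turn extensions (MURPHY, TRLOO) are not shown.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one rollout group (same task, same prompt).

    Assumptions of this sketch: population std and an eps guard against
    division by zero; actual trainers may normalize differently.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Illustrative SSIM rewards from four rollouts on one reference image;
    # the third rollout failed to compile and received 0.0.
    print(group_relative_advantages([0.85, 0.88, 0.0, 0.91]))
    # If no rollout renders, the group has no variance and no signal.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

The second call shows why reward variance matters: when every rollout in a group fails, all advantages are zero and the update carries no learning signal.
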
## Related Work

| Work | Relevance |
|------|-----------|
| **Shadertoy** | Active public corpus and community; seed source for tasks and reference effects |
| **OpenEnv** (Meta/PyTorch) | Target framework. Client-server RL environment with `Action`/`Observation` base classes, `reset()`/`step()` contract, and RFC 004 Rubric system. TRL, Unsloth, SkyRL, and Oumi integrations. |
| **ShaderEval** (LLM4Code @ ICSE 2025) | Only formal benchmark for GLSL code generation evaluation. 467 functions, pixel-diff evaluation via `shadermatch`. |
| **shaders21k** (NeurIPS 2022) | 21K OpenGL fragment shaders from Shadertoy for visual representation learning. Ready seed corpus for task generation. |
| **AI Co-Artist** (arXiv:2512.08951) | Closest work to multi-turn LLM-driven GLSL refinement. GPT-4 + Picbreeder-style evolution, <3% compile error after retries. No RL. |
| **VLMaterial** (ICLR 2025 Spotlight) | Fine-tunes a VLM for Blender material node graphs from images. Validates rendered-image similarity as a reward signal for code-generating policies. |
| **Dr. Kernel / KernelGYM** (arXiv:2602.05885) | Multi-turn RL for GPU code generation (CUDA/Triton) with compile-correctness-speedup reward chain. Proposes TRLOO for multi-turn credit assignment. |
| **MURPHY** (NeurIPS 2025) | Multi-turn GRPO for code generation. Canonical algorithm for the interaction loop used here. |
| **ProcMatRL** (SIGGRAPH Asia 2024) | RL for procedural material parameter optimization. Validates RL applicability in the visual generation domain. |
| **ShadAR** (arXiv:2602.17481) | LLM-driven real-time shader generation for AR. |
| **Procedural Shader Evolution** (arXiv:2312.17587) | Interactive evolutionary algorithms for shader generation. Multi-turn refinement as iterative optimization. |

## Licensing

Shader code in the task bank is sourced from [ShaderToy](https://www.shadertoy.com/) via the [shaders21k](https://github.com/mbaradad/shaders21k) dataset. ShaderToy's default license is **CC BY-NC-SA 3.0**; authors may choose a different license, but there is no structured per-shader license metadata in the dataset or the ShaderToy API.

- The dataset is not redistributed in this repository; it is downloaded at build time
- Individual shader provenance is tracked via the `source` field on each task (links back to the ShaderToy page)
- For commercial use, per-shader license review is required

## Next Steps

- Plan the SFT warm-up dataset and training run (prerequisite for RL)
- Extend reward beyond SSIM: DINOv2 appearance match, temporal stability, performance, robustness
- Implement the reward plugin API (multi-component, pluggable)
- Establish the hidden evaluation protocol across time, resolution, and seed variations
- Add `shaderc` / `glslang` offline validation
- Add support for multi-pass shaders and texture inputs

## References

### Primary

- Shadertoy:
- OpenEnv (Meta/PyTorch):
- OpenEnv RFC 002:
- OpenEnv RFC 004:
- TRL OpenEnv integration:
- ShaderEval (LLM4Code @ ICSE 2025):
- shadertoys-dataset:
- Shadereval-inputs (HuggingFace):
- shaders21k (NeurIPS 2022):
- shaders21k dataset:
- AI Co-Artist:
- VLMaterial (ICLR 2025):
- VLMaterial code:
- Dr. Kernel / KernelGYM:
- MURPHY (NeurIPS 2025):
- RLRF (SVG RL):

### Tools and Infrastructure

- `shaderc`:
- `glslang`:
- `pygfx/shadertoy`:
- moderngl:
- veRL:
- OpenRLHF-M:
- Adobe ProcMatRL:

### Specifications

- WGSL specification:
- WebGPU:
- Khronos `glslang` reference:

### Supplementary

- Procedural Shader Evolution:
- ShadAR:
- ProcMatRL paper (SIGGRAPH Asia 2024):
- Conditioned Procedural Materials (SIGGRAPH 2023):
- FragCoord.xyz:
- naga GLSL front-end failures: