title: Shader Environment Server
emoji: 🎨
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
Shader Environment
Overview
shader is an OpenEnv-compatible environment for generating, repairing, and iteratively refining executable shaders against visual and systems constraints.
Supported interaction modes:
- Reference-conditioned shader recreation via SSIM reward
- Multi-turn shader refinement (not one-shot only)
- GLSL-first execution via headless rendering, with WGSL/browser-native as a later target
- Pluggable reward traces for RL, evaluation, reranking, and distillation
- Hidden evaluation across time, resolution, and seed variations
Project Structure
envs/shader/
├── __init__.py         # Public API: ShaderEnv, ShaderAction, ShaderObservation
├── models.py           # Pydantic Action / Observation types
├── client.py           # OpenEnv client (ShaderEnv)
├── tasks.py            # Task bank loader (shaders21k corpus)
├── reward.py           # SSIM reward computation
├── render.py           # Headless GLSL renderer (ModernGL + EGL)
├── harness.py          # Subprocess-isolated render wrapper
├── download.sh         # Fetches shaders21k dataset
├── openenv.yaml        # OpenEnv space descriptor
├── pyproject.toml
└── server/
    ├── app.py          # FastAPI application (OpenEnv HTTP server)
    ├── environment.py  # Environment implementation (reset / step loop)
    └── Dockerfile
Quickstart
# Download the shaders21k corpus (~41 MB)
cd envs/shader
./download.sh
Running via uv / uvicorn
cd envs/shader
PYTHONPATH=../.. uvicorn server.app:app --host 0.0.0.0 --port 8000
Running via Docker
The corpus is downloaded at build time automatically:
cd envs/shader
docker build -f server/Dockerfile -t shader .
docker run -p 8000:8000 shader
Validation
# Validate local structure
cd envs/shader
openenv validate --verbose
# Validate a running server (6 criteria: openapi, health, metadata, schema, mcp, mode)
openenv validate http://localhost:8000
Interacting via WebSocket
The HTTP endpoints (/reset, /step) are stateless: each request creates a fresh environment instance. For multi-turn sessions with persistent state, use the WebSocket endpoint:
import asyncio
import json

import websockets


async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        # Reset: picks a task and renders the reference
        await ws.send(json.dumps({"type": "reset", "data": {}}))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(obs["task"], obs["remaining"])

        # Step: submit GLSL, get back SSIM + render
        await ws.send(json.dumps({
            "type": "step",
            "data": {"code": "void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"}
        }))
        resp = json.loads(await ws.recv())
        r = resp["data"]
        print(f"compiled={r['observation']['compiled']} ssim={r['observation']['ssim']} reward={r['reward']}")

        # Close the session explicitly
        await ws.send(json.dumps({"type": "close"}))


asyncio.run(main())
Python Client
from shader import ShaderEnv, ShaderAction

with ShaderEnv(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    print(result.observation.task)           # ShaderToy ID
    print(result.observation.reference_png)  # base64 PNG

    result = client.step(ShaderAction(code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"))
    print(result.observation.compiled)       # True/False
    print(result.observation.ssim)           # similarity vs reference
Benchmark
Runs GPT 5.4 against the environment over WebSocket, producing a reproducible baseline:
# Requires a running server and OPENAI_API_KEY set
python envs/shader/benchmark.py # 3 episodes, default seeds
python envs/shader/benchmark.py --turns 5 # cap turns per episode
python envs/shader/benchmark.py --url ws://localhost:8001/ws # custom server
python envs/shader/benchmark.py --seeds 10 20 30 # custom seeds
Seeds control reproducible task selection. Results are saved to benchmark_output/results.json.
Tasks
The environment ships with 3 curated tasks at increasing difficulty. Each task presents a reference image; the agent must write GLSL code to reproduce it. The grader is SSIM (structural similarity), returning a score in [0.0, 1.0].
| Task | Difficulty | Lines | Description | What the agent needs |
|---|---|---|---|---|
| Nd33R4 | Easy | 13 | XOR color pattern on pixel coordinates | int() casting, bitwise XOR/AND, float conversion |
| stlXWH | Medium | 44 | SDF distance field (square minus circle) with smooth coloring | Signed distance functions, abs, exp, smoothstep, cos for distance coloring |
| ftjSRd | Hard | 122 | Raymarcher with SDF repetition, polar coordinates, HSV coloring | Ray marching loop, rotation matrices, domain repetition, HSV-to-RGB, polar coords |
All 3 tasks are sourced from the shaders21k corpus (Shadertoy). They were selected by code complexity (line count, number of concepts) and verified to produce visually interesting output at render time.
To select a specific task, pass its name to reset():
result = env.reset(task="Nd33R4")  # easy: XOR pattern
result = env.reset(task="stlXWH")  # medium: SDF visualization
result = env.reset(task="ftjSRd")  # hard: raymarcher
result = env.reset() # random from full corpus
Grading
Each task uses the same grader: SSIM between the agent's rendered output and the ground-truth reference image. The score is deterministic and reproducible for the same GLSL input.
- Score range: 0.0 (no similarity) to 1.0 (pixel-perfect match)
- Success threshold: score >= 0.90
- Compile/render failures: score = 0.0
Baseline Scores
Evaluated with GPT 5.4 via inference.py (5 steps per task, temperature 0.2):
| Task | Difficulty | Best SSIM | Step-by-step rewards | Multi-turn improvement |
|---|---|---|---|---|
| Nd33R4 | Easy | 0.27 | 0.27, 0.13, 0.15, 0.15, 0.16 | No: model generates hash noise instead of XOR pattern |
| stlXWH | Medium | 0.94 | 0.85, 0.88, 0.91, 0.90, 0.94 | Yes: steady refinement from 0.85 to 0.94 |
| ftjSRd | Hard | 0.40 | 0.31, 0.15, 0.33, 0.40, 0.28 | Partial: oscillates between 0.15 and 0.40 |
Key observations:
- Nd33R4 (easy by code complexity) is hard for VLMs because bitwise XOR patterns are difficult to reverse-engineer from a rendered image alone. The model defaults to procedural noise rather than integer math.
- stlXWH (medium) shows clear multi-turn refinement: the model progressively improves the SDF shape and color mapping across steps.
- ftjSRd (hard) challenges frontier models with 122 lines of tightly coupled raymarching, rotation, and HSV coloring. The model attempts structural elements but cannot match the exact parameters.
Run inference.py to reproduce. Scores vary by model and API endpoint.
Task Bank (Corpus)
Beyond the 3 curated tasks, the full task bank is loaded from the shaders21k corpus (NeurIPS 2022). At load time, shaders are filtered to single-pass fragments with no texture/buffer inputs, yielding ~16,800 usable tasks.
Each task is a Shadertoy-dialect GLSL shader with known ground-truth code. The environment renders the ground truth to produce a reference image, then challenges the agent to reproduce it.
| Field | Description |
|---|---|
| name | ShaderToy ID (e.g. MdGcDc) |
| code | GLSL fragment shader source |
| source | ShaderToy URL for provenance |
| resolution | Render resolution, default 512x288 |
| time | iTime uniform value, default 0.0 |
| difficulty | easy, medium, hard (curated tasks only) |
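As a rough illustration, the fields listed above could be modeled as a small dataclass. The class name ShaderTask, the tuple form of resolution, and the example URL value are assumptions made for this sketch, not the actual types in tasks.py.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ShaderTask:
    """Hypothetical container mirroring the task fields described above."""

    name: str                                  # ShaderToy ID, e.g. "MdGcDc"
    code: str                                  # GLSL fragment shader source
    source: str                                # ShaderToy URL for provenance
    resolution: Tuple[int, int] = (512, 288)   # (width, height) in pixels
    time: float = 0.0                          # iTime uniform value
    difficulty: Optional[str] = None           # set for curated tasks only

    def __post_init__(self) -> None:
        # Reject labels outside the three documented difficulty levels.
        if self.difficulty not in (None, "easy", "medium", "hard"):
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")


task = ShaderTask(
    name="MdGcDc",
    code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }",
    source="https://www.shadertoy.com/view/MdGcDc",  # assumed example value
)
print(task.resolution, task.time, task.difficulty)
```

Making the dataclass frozen keeps a task immutable once loaded, which fits a read-only task bank.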
Motivation
Shader work has properties that make it well-suited as an RL environment:
- Feedback is fast and dense
- Compile success, render success, and performance are easy to gate
- The same shader can be evaluated under multiple controlled render conditions
- An active public corpus exists (Shadertoy, ~1M public shaders) alongside established compiler infrastructure (shaderc, glslang)
ShaderEval (LLM4Code @ ICSE 2025) benchmarked current LLMs on GLSL function completion and found failure rates above 31% even for top models. GLSL is low-resource in pretraining corpora, leaving room for RL-trained or fine-tuned models to improve on baselines.
Positioning
There is existing adjacent work on conditioned procedural material generation, RL-based material parameter optimization, interactive evolutionary shader generation, and LLM-driven real-time shader generation. The contribution here is not "LLMs can emit shader code" but rather a reusable environment layer with stable runtime, task packaging, and reward/eval plugins.
Runtime Stack
The primary target is GLSL via headless OpenGL.
- GLSL as the authoring and execution language
  - The entire relevant corpus is GLSL: Shadertoy (~1M shaders), shaders21k (21K shaders), ShaderEval (the only published GLSL benchmark)
  - GLSL has explicit representation in LLM pretraining corpora (The Stack includes a GLSL subset); WGSL has near-zero public training data
  - WGSL prohibits recursion, has no implicit type coercions, and uses incompatible uniform conventions, making Shadertoy-dialect GLSL non-transpilable to WGSL via naga at corpus scale (~15% failure rate)
- ModernGL with EGL as the headless rendering backend (render.py)
  - Runs on Linux servers without a display via EGL
  - Used by shaders21k for offline rendering
  - Wraps user shader code in a Shadertoy-compatible preamble (standard uniforms, mainImage forward declaration)
  - Strips #version, precision, and #extension directives from user code to avoid conflicts
  - Adjusts error line numbers reported by the driver to map back to user code
- Subprocess isolation (harness.py)
  - Each render runs in a separate process to contain driver crashes, infinite loops, and GPU state corruption
  - Configurable per-render timeout (default 10s)
  - Returns structured RenderResult with compile/render status, error messages, and raw RGBA pixel data
- shaderc/glslang for offline validation and portability checks (planned)
- WGSL / WebGPU deferred to a later phase for browser-native demos
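The preamble wrapping, directive stripping, and line-number adjustment described above might look roughly like the following. This is a minimal sketch, not the code in render.py: the preamble contents, the helper names wrap and to_user_line, and the choice to blank directives instead of deleting their lines are all assumptions.

```python
import re

# Hypothetical preamble; the real one may declare more Shadertoy uniforms.
PREAMBLE = """#version 330
uniform vec3 iResolution;
uniform float iTime;
out vec4 fragColor;
void mainImage(out vec4 fragColor, in vec2 fragCoord);
void main() { mainImage(fragColor, gl_FragCoord.xy); }
"""
PREAMBLE_LINES = PREAMBLE.count("\n")

# Matches whole lines that start with #version, #extension, or precision.
# Simplification: anything else on such a line is dropped as well.
_DIRECTIVE = re.compile(
    r"^[ \t]*(#[ \t]*version|#[ \t]*extension|precision)\b.*$", re.MULTILINE
)


def wrap(user_code: str) -> str:
    """Blank out conflicting directives and prepend the preamble.

    Directives become empty lines rather than being removed, so line
    numbers inside the user code do not shift.
    """
    return PREAMBLE + _DIRECTIVE.sub("", user_code)


def to_user_line(driver_line: int) -> int:
    """Map a 1-based line number in the wrapped source back to user code."""
    return driver_line - PREAMBLE_LINES


src = (
    "#version 300 es\n"
    "precision highp float;\n"
    "void mainImage(out vec4 c, in vec2 f) { c = vec4(1.0); }\n"
)
wrapped = wrap(src)
print(to_user_line(PREAMBLE_LINES + 3))
```

Because stripped directives leave blank lines behind, the mapping back to user code is a constant offset; parsing the line number out of a driver error message is vendor-specific and left out here.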
Episode Schema
The environment follows a standard multi-turn refinement loop:
- reset() picks a task from the bank, renders the ground-truth reference, and returns it as a base64 PNG
- The agent submits GLSL code via step(ShaderAction(code=...))
- The server compiles and renders the shader, computes SSIM vs reference
- Returns compile/render status, errors, rendered image, and SSIM reward
- Episode ends when the turn budget is exhausted (default 10 turns) or SSIM >= 0.99
The server supports up to 4 concurrent environment sessions (max_concurrent_envs=4).
This supports RL training, best-of-N search and reranking, trajectory collection for SFT or distillation, and evaluation under a shared runtime.
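The termination rule of the loop above (turn budget or SSIM threshold) can be sketched without a server. In this illustration propose stands in for the agent and score for the server-side compile, render, and SSIM step; both callables and the toy stand-ins are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


def run_episode(
    propose: Callable[[History], str],
    score: Callable[[str], float],
    max_turns: int = 10,
    done_threshold: float = 0.99,
) -> History:
    """Run one refinement episode and return the (code, ssim) history."""
    history: History = []
    for _ in range(max_turns):
        code = propose(history)          # agent sees all previous attempts
        ssim = score(code)               # stand-in for compile + render + SSIM
        history.append((code, ssim))
        if ssim >= done_threshold:       # early stop on a near-perfect match
            break
    return history


# Toy stand-ins: the "agent" appends one comment marker per past turn and
# the fake scorer rewards each marker with 0.25, capped at 1.0.
def toy_propose(history: History) -> str:
    base = "void mainImage(out vec4 c, in vec2 f) { c = vec4(1); }"
    return base + " //" * len(history)


def toy_score(code: str) -> float:
    return min(1.0, 0.25 + 0.25 * code.count("//"))


trajectory = run_episode(toy_propose, toy_score)
print([s for _, s in trajectory])  # [0.25, 0.5, 0.75, 1.0]
```

The same loop shape applies to best-of-N search or trajectory collection by swapping in a different propose function.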
Environment Variants
shader (primary)
Match a target still image or short animated effect within a frame-time budget. Direct, visual, and evaluable without a full DCC toolchain.
Material Graph
Edit procedural material graphs and parameters toward a target appearance. The search space is more structured than raw shader code, and recent procedural-material work already uses graph/program representations.
Shader Repair
Start from shaders that are broken, slow, unstable, or portability-problematic and optimize for correctness, robustness, and performance.
Task Families
- Reference recreation: recreate a target still or short effect from a reference render
- Repair: fix syntax errors, portability failures, or numerical instability
- Optimization: preserve appearance while reducing frame time or instruction count
- Style transfer: preserve scene logic while shifting color, texture, motion, or lighting style
- Critique incorporation: revise the shader based on iterative feedback
- Robustness repair: stabilize a shader across resolutions, aspect ratios, and time ranges
Evaluation
Evaluation relies on hidden checks rather than visible examples only:
- Compile success
- Render success (no NaNs or fatal runtime failures)
- Perceptual similarity on held-out stills
- Temporal consistency on held-out short clips
- Stability across resolutions, aspect ratios, and parameter seeds
- Frame-time or instruction-budget limits
- Optional portability checks across compiler/validator paths
Reward Structure
Current implementation (reward.py): windowed SSIM (Wang et al. 2004) between the agent render and the reference, computed per channel on RGB and averaged. It uses scipy's uniform_filter for windowed statistics when available and falls back to global-statistics SSIM otherwise. Compile and render failures receive reward 0.0. Reward range: [0.0, 1.0].
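To make the global-statistics fallback concrete, here is a minimal pure-Python sketch of SSIM computed from whole-image means, variances, and covariance, then averaged over RGB. It is not the code in reward.py: the function names, the flat list-of-tuples image layout, and the final clamp to [0, 1] (added here to match the stated reward range) are assumptions.

```python
from statistics import fmean
from typing import List, Sequence, Tuple

# Stabilizing constants from Wang et al. 2004 for 8-bit data (L = 255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2

Pixel = Tuple[int, int, int]


def ssim_global(x: Sequence[float], y: Sequence[float]) -> float:
    """SSIM of two equally sized channels using whole-image statistics."""
    if len(x) != len(y) or not x:
        raise ValueError("channels must be non-empty and the same length")
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx * mx + my * my + C1) * (vx + vy + C2)
    )


def ssim_rgb(a: List[Pixel], b: List[Pixel]) -> float:
    """Average per-channel SSIM over RGB, clamped to [0, 1] (assumption)."""
    per_channel = [
        ssim_global([p[c] for p in a], [q[c] for q in b]) for c in range(3)
    ]
    return max(0.0, min(1.0, fmean(per_channel)))


img = [(0, 128, 255), (255, 64, 0), (10, 20, 30), (200, 100, 50)]
inverted = [(255 - r, 255 - g, 255 - b) for r, g, b in img]
print(ssim_rgb(img, img), ssim_rgb(img, inverted))
```

The windowed variant replaces the whole-image means and variances with local ones (for example via a uniform filter) and averages the resulting SSIM map, which makes the score sensitive to where structure appears rather than only to overall statistics.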
Planned multi-component reward:
R = G_compile * G_render * (
0.35 * appearance_match +
0.20 * temporal_stability +
0.20 * performance +
0.15 * robustness +
0.10 * code_quality
) - step_penalty - regression_penalty
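Read as code, the planned formula is a gated weighted sum minus penalties. The sketch below only transcribes that arithmetic; the function name, the dict-based component interface, and the assumption that every component is already normalized to [0, 1] are hypothetical.

```python
from typing import Dict

# Weights taken from the planned formula above; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "appearance_match": 0.35,
    "temporal_stability": 0.20,
    "performance": 0.20,
    "robustness": 0.15,
    "code_quality": 0.10,
}


def planned_reward(
    compiled: bool,
    rendered: bool,
    components: Dict[str, float],
    step_penalty: float = 0.0,
    regression_penalty: float = 0.0,
) -> float:
    """Gated weighted sum of component scores, minus penalties."""
    gate = float(compiled) * float(rendered)   # G_compile * G_render
    weighted = sum(w * components[name] for name, w in WEIGHTS.items())
    return gate * weighted - step_penalty - regression_penalty


perfect = {name: 1.0 for name in WEIGHTS}
print(planned_reward(True, True, perfect, step_penalty=0.05))
print(planned_reward(False, True, perfect))
```

Note that the penalties sit outside the gate, so a shader that fails to compile can still receive a negative reward once a step penalty is applied.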
Component notes:
- appearance_match: measured on hidden render conditions using DINOv2 cosine similarity (better than CLIP for texture/color/style fidelity) combined with pixel-level SSIM for structural accuracy. CLIP is appropriate only for text-conditioned task variants.
- temporal_stability: requires rendering N consecutive frames and computing frame-to-frame SSIM. For v1, this may serve as a held-out evaluation metric rather than a dense training reward to keep per-episode compute manageable.
- performance: frame-time as a reward is tractable headlessly. To avoid reward hacking (trivially simple shaders that are fast but visually wrong), this is gated as a hard budget first (penalize frames over a threshold) before adding a continuous score.
- robustness: captures resolution changes, seed changes, and compiler portability.
- G_compile * G_render: hard multiplicative gates, standard in code generation RL. These cause a zero-gradient problem early in training; mitigate with curriculum learning (start from partial or working shader skeletons) and/or soft penalties before hardening.
- SFT warm-up is a prerequisite before RL. Adjacent work (RLRF for SVG) shows that RL directly on an instruction-tuned model without a domain SFT stage fails because the base model cannot generate renderable output reliably enough to produce reward variance.
Action / Observation Contract
class ShaderAction(Action):
    code: str                    # Shadertoy-dialect GLSL fragment shader source


class ShaderObservation(Observation):
    task: str                    # ShaderToy ID
    remaining: int               # turns left in episode
    reference_png: str           # base64 PNG (non-empty on reset only)
    compiled: bool
    rendered: bool
    errors: list[str]
    agent_png: str               # base64 PNG of agent's render
    ssim: float                  # SSIM vs reference in [0, 1]
    done: bool = False
    reward: float | None = None
- Placing reward and done inside Observation follows the OpenEnv spec (RFC 002, Decision 2): rewards are computed inside the environment and returned as part of the observation; the server layer promotes them to the top-level StepResponse.
- Detailed artifacts (frame dumps, profiler traces, held-out evaluation results) live behind tools rather than being inlined on every turn.
Training
Algorithm
- GRPO as the baseline, with multi-turn extensions:
  - MURPHY (NeurIPS 2025): for each rollout that does not reach maximum reward, execution feedback is appended and new rollouts are generated from that state. Up to 8% relative gain over single-turn GRPO.
  - TRLOO (Dr. Kernel): addresses the biased policy gradient problem that vanilla GRPO has in multi-turn settings.
- The episode schema (compile, render, reward, next turn) maps directly onto MURPHY's interaction loop.
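For orientation, the core of the GRPO baseline is a group-relative advantage: several rollouts are sampled for the same task and each reward is normalized against the group. The sketch below shows only that normalization, using population standard deviation and a small epsilon as assumptions; it does not implement the MURPHY or TRLOO extensions.

```python
from statistics import fmean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)
    # eps keeps the division defined when every rollout scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one task: two failed to compile (reward 0.0).
print(group_advantages([0.0, 0.0, 0.85, 0.94]))

# A group where nothing compiles has no reward variance and no signal.
print(group_advantages([0.0, 0.0, 0.0]))
```

The second call illustrates the zero-gradient problem of hard compile gates: when every rollout in a group scores 0.0, all advantages are zero and the policy receives no learning signal.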
Libraries
- veRL: supports GRPO, DAPO, GSPO; scales to large models; best throughput for long multi-turn rollouts
- OpenRLHF-M: multimodal variant of OpenRLHF; targets VLM policies (code + rendered image inputs)
Both have TRL and OpenEnv integrations.
Task Bank
The task bank is populated from the shaders21k corpus (~16.8K single-pass shaders after filtering). RL training requires 10K-15K rollouts minimum based on adjacent work (CTRL-S: 14.4K, Reason-SVG: 10K).
Related Work
| Work | Relevance |
|---|---|
| Shadertoy | Active public corpus and community; seed source for tasks and reference effects |
| OpenEnv (Meta/PyTorch) | Target framework. Client-server RL environment with Action/Observation base classes, reset()/step() contract, and RFC 004 Rubric system. TRL, Unsloth, SkyRL, and Oumi integrations. |
| ShaderEval (LLM4Code @ ICSE 2025) | Only formal benchmark for GLSL code generation evaluation. 467 functions, pixel-diff evaluation via shadermatch. |
| shaders21k (NeurIPS 2022) | 21K OpenGL fragment shaders from Shadertoy for visual representation learning. Ready seed corpus for task generation. |
| AI Co-Artist (arXiv:2512.08951) | Closest work to multi-turn LLM-driven GLSL refinement. GPT-4 + Picbreeder-style evolution, <3% compile error after retries. No RL. |
| VLMaterial (ICLR 2025 Spotlight) | Fine-tunes a VLM for Blender material node graphs from images. Validates rendered-image similarity as a reward signal for code-generating policies. |
| Dr. Kernel / KernelGYM (arXiv:2602.05885) | Multi-turn RL for GPU code generation (CUDA/Triton) with compile-correctness-speedup reward chain. Proposes TRLOO for multi-turn credit assignment. |
| MURPHY (NeurIPS 2025) | Multi-turn GRPO for code generation. Canonical algorithm for the interaction loop used here. |
| ProcMatRL (SIGGRAPH Asia 2024) | RL for procedural material parameter optimization. Validates RL applicability in the visual generation domain. |
| ShadAR (arXiv:2602.17481) | LLM-driven real-time shader generation for AR. |
| Procedural Shader Evolution (arXiv:2312.17587) | Interactive evolutionary algorithms for shader generation. Multi-turn refinement as iterative optimization. |
Licensing
Shader code in the task bank is sourced from ShaderToy via the shaders21k dataset. ShaderToy's default license is CC BY-NC-SA 3.0. Authors may choose a different license, but there is no structured per-shader license metadata in the dataset or the ShaderToy API.
- The dataset is not redistributed in this repository; it is downloaded at build time
- Individual shader provenance is tracked via the source field on each task (links back to the ShaderToy page)
- For commercial use, per-shader license review is required
Next Steps
- Plan the SFT warm-up dataset and training run (prerequisite for RL)
- Extend reward beyond SSIM: DINOv2 appearance match, temporal stability, performance, robustness
- Implement the reward plugin API (multi-component, pluggable)
- Establish the hidden evaluation protocol across time, resolution, and seed variations
- Add shaderc/glslang offline validation
- Add support for multi-pass shaders and texture inputs
References
Primary
- Shadertoy: https://www.shadertoy.com/
- OpenEnv (Meta/PyTorch): https://github.com/meta-pytorch/OpenEnv
- OpenEnv RFC 002: https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/002-env-spec.md
- OpenEnv RFC 004: https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/004-rubrics.md
- TRL OpenEnv integration: https://huggingface.co/docs/trl/openenv
- ShaderEval (LLM4Code @ ICSE 2025): https://conf.researchr.org/details/icse-2025/llm4code-2025-papers/13/Evaluating-Language-Models-for-Computer-Graphics-Code-Completion
- shadertoys-dataset: https://github.com/Vipitis/shadertoys-dataset
- Shadereval-inputs (HuggingFace): https://huggingface.co/datasets/Vipitis/Shadereval-inputs
- shaders21k (NeurIPS 2022): https://arxiv.org/abs/2211.16412
- shaders21k dataset: https://github.com/mbaradad/shaders21k
- AI Co-Artist: https://arxiv.org/abs/2512.08951
- VLMaterial (ICLR 2025): https://arxiv.org/abs/2501.18623
- VLMaterial code: https://github.com/mit-gfx/VLMaterial
- Dr. Kernel / KernelGYM: https://arxiv.org/abs/2602.05885
- MURPHY (NeurIPS 2025): https://arxiv.org/abs/2511.07833
- RLRF (SVG RL): https://arxiv.org/abs/2505.20793
Tools and Infrastructure
- shaderc: https://github.com/google/shaderc
- glslang: https://github.com/KhronosGroup/glslang
- pygfx/shadertoy: https://github.com/pygfx/shadertoy
- moderngl: https://github.com/moderngl/moderngl
- veRL: https://github.com/volcengine/verl
- OpenRLHF-M: https://github.com/OpenRLHF/OpenRLHF-M
- Adobe ProcMatRL: https://github.com/adobe-research/ProcMatRL
Specifications
- WGSL specification: https://www.w3.org/TR/WGSL/
- WebGPU: https://webgpu.org/
- Khronos glslang reference: https://www.khronos.org/opengles/sdk/Reference-Compiler/
Supplementary
- Procedural Shader Evolution: https://arxiv.org/abs/2312.17587
- ShadAR: https://arxiv.org/abs/2602.17481
- ProcMatRL paper (SIGGRAPH Asia 2024): https://doi.org/10.1145/3687979
- Conditioned Procedural Materials (SIGGRAPH 2023): https://dl.acm.org/doi/10.1145/3588432.3591520
- FragCoord.xyz: https://fragcoord.xyz/
- naga GLSL front-end failures: https://github.com/Vipitis/shadertoys-dataset/issues/15