---
title: Shader Environment Server
emoji: 🎨
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Shader Environment
## Overview
`shader` is an OpenEnv-compatible environment for generating, repairing, and iteratively refining executable shaders against visual and systems constraints.
Supported interaction modes:
- Reference-conditioned shader recreation via SSIM reward
- Multi-turn shader refinement (not one-shot only)
- GLSL-first execution via headless rendering, with WGSL/browser-native as a later target
- Pluggable reward traces for RL, evaluation, reranking, and distillation
- Hidden evaluation across time, resolution, and seed variations
## Project Structure
```
envs/shader/
├── __init__.py          # Public API: ShaderEnv, ShaderAction, ShaderObservation
├── models.py            # Pydantic Action / Observation types
├── client.py            # OpenEnv client (ShaderEnv)
├── tasks.py             # Task bank loader (shaders21k corpus)
├── reward.py            # SSIM reward computation
├── render.py            # Headless GLSL renderer (ModernGL + EGL)
├── harness.py           # Subprocess-isolated render wrapper
├── download.sh          # Fetches shaders21k dataset
├── openenv.yaml         # OpenEnv space descriptor
├── pyproject.toml
└── server/
    ├── app.py           # FastAPI application (OpenEnv HTTP server)
    ├── environment.py   # Environment implementation (reset / step loop)
    └── Dockerfile
```
## Quickstart
```bash
# Download the shaders21k corpus (~41 MB)
cd envs/shader
./download.sh
```
### Running via uv / uvicorn
```bash
cd envs/shader
PYTHONPATH=../.. uvicorn server.app:app --host 0.0.0.0 --port 8000
```
### Running via Docker
The corpus is downloaded at build time automatically:
```bash
cd envs/shader
docker build -f server/Dockerfile -t shader .
docker run -p 8000:8000 shader
```
### Validation
```bash
# Validate local structure
cd envs/shader
openenv validate --verbose
# Validate a running server (6 criteria: openapi, health, metadata, schema, mcp, mode)
openenv validate http://localhost:8000
```
### Interacting via WebSocket
The HTTP endpoints (`/reset`, `/step`) are stateless: each request creates a fresh environment instance. For multi-turn sessions with persistent state, use the WebSocket endpoint:
```python
import asyncio
import json

import websockets


async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        # Reset: picks a task and renders the reference
        await ws.send(json.dumps({"type": "reset", "data": {}}))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(obs["task"], obs["remaining"])

        # Step: submit GLSL, get back SSIM and the render
        await ws.send(json.dumps({
            "type": "step",
            "data": {"code": "void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"},
        }))
        resp = json.loads(await ws.recv())
        r = resp["data"]
        print(f"compiled={r['observation']['compiled']} ssim={r['observation']['ssim']} reward={r['reward']}")

        await ws.send(json.dumps({"type": "close"}))


asyncio.run(main())
```
### Python Client
```python
from shader import ShaderEnv, ShaderAction

with ShaderEnv(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    print(result.observation.task)           # ShaderToy ID
    print(result.observation.reference_png)  # base64 PNG

    result = client.step(ShaderAction(code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"))
    print(result.observation.compiled)       # True/False
    print(result.observation.ssim)           # similarity vs reference
```
### Benchmark
`benchmark.py` runs GPT 5.4 against the environment over WebSocket to produce a reproducible baseline:
```bash
# Requires a running server and OPENAI_API_KEY set
python envs/shader/benchmark.py # 3 episodes, default seeds
python envs/shader/benchmark.py --turns 5 # cap turns per episode
python envs/shader/benchmark.py --url ws://localhost:8001/ws # custom server
python envs/shader/benchmark.py --seeds 10 20 30 # custom seeds
```
Seeds control reproducible task selection. Results are saved to `benchmark_output/results.json`.
## Tasks
The environment ships with 3 curated tasks at increasing difficulty. Each task presents a reference image; the agent must write GLSL code to reproduce it. The grader is SSIM (structural similarity), returning a score in [0.0, 1.0].
| Task | Difficulty | Lines | Description | What the agent needs |
|------|-----------|-------|-------------|---------------------|
| `Nd33R4` | Easy | 13 | XOR color pattern on pixel coordinates | `int()` casting, bitwise XOR/AND, float conversion |
| `stlXWH` | Medium | 44 | SDF distance field (square minus circle) with smooth coloring | Signed distance functions, `abs`, `exp`, `smoothstep`, `cos` for distance coloring |
| `ftjSRd` | Hard | 122 | Raymarcher with SDF repetition, polar coordinates, HSV coloring | Ray marching loop, rotation matrices, domain repetition, HSV-to-RGB, polar coords |
All 3 tasks are sourced from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (Shadertoy). They were selected by code complexity (line count, number of concepts) and verified to produce visually interesting output at render time.
To select a specific task, pass its name to `reset()`:
```python
result = env.reset(task="Nd33R4")  # easy: XOR pattern
result = env.reset(task="stlXWH")  # medium: SDF visualization
result = env.reset(task="ftjSRd")  # hard: raymarcher
result = env.reset()               # random from full corpus
```
### Grading
Each task uses the same grader: SSIM between the agent's rendered output and the ground-truth reference image. The score is deterministic and reproducible for the same GLSL input.
- **Score range**: 0.0 (no similarity) to 1.0 (pixel-perfect match)
- **Success threshold**: score >= 0.90
- **Compile/render failures**: score = 0.0
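
The grading rules above can be sketched in a few lines. This is an illustrative sketch, not the code in `reward.py`: the function name `grade` and the clamping step are assumptions made for the example.

```python
SUCCESS_THRESHOLD = 0.90  # success threshold from the grading rules above


def grade(ssim: float, compiled: bool, rendered: bool) -> tuple[float, bool]:
    """Return (score, success) for one submission (hypothetical helper)."""
    if not (compiled and rendered):
        return 0.0, False  # compile/render failures score 0.0
    score = min(1.0, max(0.0, ssim))  # assumption: clamp into [0.0, 1.0]
    return score, score >= SUCCESS_THRESHOLD
```

Under these rules a submission can improve its score across turns and still not count as a success until it crosses 0.90.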
### Baseline Scores
Evaluated with GPT 5.4 via `inference.py` (5 steps per task, temperature 0.2):
| Task | Difficulty | Best SSIM | Step-by-step rewards | Multi-turn improvement |
|------|-----------|-----------|---------------------|----------------------|
| `Nd33R4` | Easy | **0.27** | 0.27, 0.13, 0.15, 0.15, 0.16 | No: model generates hash noise instead of XOR pattern |
| `stlXWH` | Medium | **0.94** | 0.85, 0.88, 0.91, 0.90, 0.94 | Yes: steady refinement from 0.85 to 0.94 |
| `ftjSRd` | Hard | **0.40** | 0.31, 0.15, 0.33, 0.40, 0.28 | Partial: oscillates between 0.15 and 0.40 |
Key observations:
- `Nd33R4` (easy by code complexity) is hard for VLMs because bitwise XOR patterns are difficult to reverse-engineer from a rendered image alone. The model defaults to procedural noise rather than integer math.
- `stlXWH` (medium) shows clear multi-turn refinement: the model progressively improves the SDF shape and color mapping across steps.
- `ftjSRd` (hard) challenges frontier models with 122 lines of tightly coupled raymarching, rotation, and HSV coloring. The model attempts structural elements but cannot match the exact parameters.
*Run `inference.py` to reproduce. Scores vary by model and API endpoint.*
## Task Bank (Corpus)
Beyond the 3 curated tasks, the full task bank is loaded from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (NeurIPS 2022). At load time, shaders are filtered to single-pass fragments with no texture/buffer inputs, yielding ~16,800 usable tasks.
Each task is a Shadertoy-dialect GLSL shader with known ground-truth code. The environment renders the ground truth to produce a reference image, then challenges the agent to reproduce it.
| Field | Description |
|-------|-------------|
| `name` | ShaderToy ID (e.g. `MdGcDc`) |
| `code` | GLSL fragment shader source |
| `source` | ShaderToy URL for provenance |
| `resolution` | Render resolution, default 512x288 |
| `time` | iTime uniform value, default 0.0 |
| `difficulty` | `easy`, `medium`, `hard` (curated tasks only) |
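
As a rough illustration of the fields in the table above, a task record could be modelled as a frozen dataclass. The class name `Task`, the field types, and the placeholder shader body are assumptions for this sketch; the actual types live in `tasks.py` and `models.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Task:
    """Hypothetical task record mirroring the field table above."""
    name: str                                 # ShaderToy ID
    code: str                                 # GLSL fragment shader source
    source: str                               # ShaderToy URL for provenance
    resolution: tuple[int, int] = (512, 288)  # default render resolution
    time: float = 0.0                         # default iTime value
    difficulty: Optional[str] = None          # set for curated tasks only


# Placeholder shader body; the real corpus entry for this ID differs.
task = Task(
    name="MdGcDc",
    code="void mainImage(out vec4 c, in vec2 f) { c = vec4(0.0); }",
    source="https://www.shadertoy.com/view/MdGcDc",
)
```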
## Motivation
Shader work has properties that make it well-suited as an RL environment:
- Feedback is fast and dense
- Compile success, render success, and performance are easy to gate
- The same shader can be evaluated under multiple controlled render conditions
- An active public corpus exists (Shadertoy, ~1M public shaders) alongside established compiler infrastructure (`shaderc`, `glslang`)
ShaderEval (LLM4Code @ ICSE 2025) benchmarked current LLMs on GLSL function completion and found failure rates above 31% even for top models. GLSL is low-resource in pretraining corpora, leaving room for RL-trained or fine-tuned models to improve on baselines.
### Positioning
There is existing adjacent work on conditioned procedural material generation, RL-based material parameter optimization, interactive evolutionary shader generation, and LLM-driven real-time shader generation. The contribution here is not "LLMs can emit shader code" but rather a reusable environment layer with stable runtime, task packaging, and reward/eval plugins.
## Runtime Stack
The primary target is GLSL via headless OpenGL.
- **GLSL** as the authoring and execution language
  - The entire relevant corpus is GLSL: Shadertoy (~1M shaders), shaders21k (21K shaders), ShaderEval (the only published GLSL benchmark)
  - GLSL has explicit representation in LLM pretraining corpora (The Stack includes a GLSL subset); WGSL has near-zero public training data
  - WGSL prohibits recursion, has no implicit type coercions, and uses incompatible uniform conventions, making Shadertoy-dialect GLSL non-transpilable to WGSL via naga at corpus scale (~15% failure rate)
- **ModernGL** with EGL as the headless rendering backend (`render.py`)
  - Runs on Linux servers without a display via EGL
  - Used by shaders21k for offline rendering
  - Wraps user shader code in a Shadertoy-compatible preamble (standard uniforms, `mainImage` forward declaration)
  - Strips `#version`, `precision`, and `#extension` directives from user code to avoid conflicts
  - Adjusts error line numbers reported by the driver to map back to user code
- **Subprocess isolation** (`harness.py`)
  - Each render runs in a separate process to contain driver crashes, infinite loops, and GPU state corruption
  - Configurable per-render timeout (default 10s)
  - Returns structured `RenderResult` with compile/render status, error messages, and raw RGBA pixel data
- **`shaderc` / `glslang`** for offline validation and portability checks (planned)
- **WGSL / WebGPU** deferred to a later phase for browser-native demos
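
The preamble wrapping, directive stripping, and line-number adjustment listed for `render.py` above can be sketched as follows. This is a minimal illustration, not the project's code: `PREAMBLE`, `wrap_shadertoy`, `map_error_line`, the `#version 330` header, the `shader_out_color` output name, and the reduced uniform set are all assumptions.

```python
import re

# Hypothetical preamble: a subset of Shadertoy's standard uniforms, a forward
# declaration of mainImage, and a main() that calls it.
PREAMBLE = """#version 330
uniform vec3 iResolution;
uniform float iTime;
out vec4 shader_out_color;
void mainImage(out vec4 fragColor, in vec2 fragCoord);
void main() { mainImage(shader_out_color, gl_FragCoord.xy); }
"""

# Matches whole lines holding #version, #extension, or precision directives.
_DIRECTIVE = re.compile(
    r"^[ \t]*(?:#[ \t]*(?:version|extension)\b"
    r"|precision[ \t]+(?:lowp|mediump|highp)\b).*$",
    re.MULTILINE,
)


def wrap_shadertoy(user_code: str) -> tuple[str, int]:
    """Return (full_source, line_offset) for a Shadertoy-dialect shader."""
    # Blank the directives rather than deleting the lines, so the user's
    # line numbering is preserved.
    cleaned = _DIRECTIVE.sub("", user_code)
    return PREAMBLE + cleaned, PREAMBLE.count("\n")


def map_error_line(driver_line: int, line_offset: int) -> int:
    """Translate a driver-reported line number back to the user's source."""
    return driver_line - line_offset
```

Blanking directive lines instead of removing them keeps the offset constant, so a single subtraction maps any driver line back to the user's code.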
## Episode Schema
The environment follows a standard multi-turn refinement loop:
1. `reset()` picks a task from the bank, renders the ground-truth reference, and returns it as a base64 PNG
2. The agent submits GLSL code via `step(ShaderAction(code=...))`
3. The server compiles and renders the shader, computes SSIM vs reference
4. Returns compile/render status, errors, rendered image, and SSIM reward
5. Episode ends when the turn budget is exhausted (default 10 turns) or SSIM >= 0.99
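
The termination logic in the steps above can be sketched without a server. In this illustration `run_episode`, `policy`, and `score` are hypothetical names; `score` stands in for the compile, render, and SSIM stages that the environment actually performs.

```python
from typing import Callable

MAX_TURNS = 10    # default turn budget from the steps above
SSIM_DONE = 0.99  # early-stop threshold from the steps above


def run_episode(
    policy: Callable[[list[float]], str],
    score: Callable[[str], float],
    max_turns: int = MAX_TURNS,
) -> list[float]:
    """Run one refinement episode and return the per-turn SSIM history."""
    history: list[float] = []
    for _ in range(max_turns):
        code = policy(history)  # agent proposes GLSL given past scores
        ssim = score(code)      # stand-in for compile + render + SSIM
        history.append(ssim)
        if ssim >= SSIM_DONE:   # stop early on a near-exact match
            break
    return history
```

An episode therefore has between one and `max_turns` entries in its history, and the last entry is the only one that can reach the stop threshold.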
The server supports up to 4 concurrent environment sessions (`max_concurrent_envs=4`).
This supports RL training, best-of-N search and reranking, trajectory collection for SFT or distillation, and evaluation under a shared runtime.
## Environment Variants
### `shader` (primary)
Match a target still image or short animated effect within a frame-time budget. Direct, visual, and evaluable without a full DCC toolchain.
### Material Graph
Edit procedural material graphs and parameters toward a target appearance. The search space is more structured than raw shader code, and recent procedural-material work already uses graph/program representations.
### Shader Repair
Start from shaders that are broken, slow, unstable, or portability-problematic and optimize for correctness, robustness, and performance.
## Task Families
- **Reference recreation**: recreate a target still or short effect from a reference render
- **Repair**: fix syntax errors, portability failures, or numerical instability
- **Optimization**: preserve appearance while reducing frame time or instruction count
- **Style transfer**: preserve scene logic while shifting color, texture, motion, or lighting style
- **Critique incorporation**: revise the shader based on iterative feedback
- **Robustness repair**: stabilize a shader across resolutions, aspect ratios, and time ranges
## Evaluation
Evaluation relies on hidden checks rather than visible examples only:
- Compile success
- Render success (no NaNs or fatal runtime failures)
- Perceptual similarity on held-out stills
- Temporal consistency on held-out short clips
- Stability across resolutions, aspect ratios, and parameter seeds
- Frame-time or instruction-budget limits
- Optional portability checks across compiler/validator paths
### Reward Structure
**Current implementation** (`reward.py`): windowed SSIM (Wang et al. 2004) between the agent render and the reference, computed per channel on RGB and averaged. It uses scipy's `uniform_filter` for windowed statistics when scipy is available and falls back to global-statistics SSIM otherwise. Compile and render failures receive reward 0.0. Reward range: [0.0, 1.0].
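
For intuition, the global-statistics fallback can be sketched in pure Python. This is not the code in `reward.py`: the function names, the flat-list image representation, and the final clamp are assumptions. The constants `0.01` and `0.03` are the conventional SSIM stabilizers from Wang et al. 2004.

```python
def ssim_global(x: list[float], y: list[float], data_range: float = 1.0) -> float:
    """Global-statistics SSIM for one channel given as flat pixel lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) * (a - mx) for a in x) / n   # variance of x
    vy = sum((b - my) * (b - my) for b in y) / n   # variance of y
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


def ssim_rgb(a: dict[str, list[float]], b: dict[str, list[float]]) -> float:
    """Average per-channel SSIM over the 'r', 'g', 'b' channels."""
    mean = sum(ssim_global(a[ch], b[ch]) for ch in "rgb") / 3
    return min(1.0, max(0.0, mean))  # assumption: clamp into [0.0, 1.0]
```

The windowed variant computes the same means, variances, and covariance over local windows and averages the resulting SSIM map, which makes it sensitive to where structure appears and not only to global statistics.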
**Planned multi-component reward:**
```text
R = G_compile * G_render * (
0.35 * appearance_match +
0.20 * temporal_stability +
0.20 * performance +
0.15 * robustness +
0.10 * code_quality
) - step_penalty - regression_penalty
```
Component notes:
- **`appearance_match`**: measured on hidden render conditions using DINOv2 cosine similarity (better than CLIP for texture/color/style fidelity) combined with pixel-level SSIM for structural accuracy. CLIP is appropriate only for text-conditioned task variants.
- **`temporal_stability`**: requires rendering N consecutive frames and computing frame-to-frame SSIM. For v1, this may serve as a held-out evaluation metric rather than a dense training reward to keep per-episode compute manageable.
- **`performance`**: frame-time as a reward is tractable headlessly. To avoid reward hacking (trivially simple shaders that are fast but visually wrong), this is gated as a hard budget first (penalize frames over a threshold) before adding a continuous score.
- **`robustness`**: captures resolution changes, seed changes, and compiler portability.
- **`G_compile * G_render`**: hard multiplicative gates, standard in code generation RL. These cause a zero-gradient problem early in training; mitigate with curriculum learning (start from partial or working shader skeletons) and/or soft penalties before hardening.
- **SFT warm-up** is a prerequisite before RL. Adjacent work (RLRF for SVG) shows that RL directly on an instruction-tuned model without a domain SFT stage fails because the base model cannot generate renderable output reliably enough to produce reward variance.
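
The planned formula and its hard gates can be sketched as a small function. The weights come from the formula above; everything else (`planned_reward`, the `components` dictionary, zero default penalties, and components already normalized to [0, 1]) is an assumption, since this reward is not implemented yet.

```python
# Weights copied from the planned formula above.
WEIGHTS = {
    "appearance_match": 0.35,
    "temporal_stability": 0.20,
    "performance": 0.20,
    "robustness": 0.15,
    "code_quality": 0.10,
}


def planned_reward(
    components: dict[str, float],
    compiled: bool,
    rendered: bool,
    step_penalty: float = 0.0,
    regression_penalty: float = 0.0,
) -> float:
    """Hypothetical combination of gated components and penalties."""
    gate = 1.0 if (compiled and rendered) else 0.0  # G_compile * G_render
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return gate * weighted - step_penalty - regression_penalty
```

Because the gate multiplies the whole weighted sum, every non-compiling submission receives the same reward regardless of how close it was, which is the zero-gradient problem noted above.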
## Action / Observation Contract
```python
class ShaderAction(Action):
    code: str  # Shadertoy-dialect GLSL fragment shader source


class ShaderObservation(Observation):
    task: str           # ShaderToy ID
    remaining: int      # turns left in episode
    reference_png: str  # base64 PNG (non-empty on reset only)
    compiled: bool
    rendered: bool
    errors: list[str]
    agent_png: str      # base64 PNG of agent's render
    ssim: float         # SSIM vs reference in [0, 1]
    done: bool = False
    reward: float | None = None
```
- Placing `reward` and `done` inside `Observation` follows the OpenEnv spec (RFC 002, Decision 2): rewards are computed inside the environment and returned as part of the observation; the server layer promotes them to the top-level `StepResponse`.
- Detailed artifacts (frame dumps, profiler traces, held-out evaluation results) live behind tools rather than being inlined on every turn.
## Training
### Algorithm
- **GRPO** as the baseline, with multi-turn extensions:
  - **MURPHY** (NeurIPS 2025): for each rollout that does not reach maximum reward, execution feedback is appended and new rollouts are generated from that state. Up to 8% relative gain over single-turn GRPO.
  - **TRLOO** (Dr. Kernel): addresses the biased policy gradient problem that vanilla GRPO has in multi-turn settings.
- The episode schema (compile, render, reward, next turn) maps directly onto MURPHY's interaction loop.
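
For reference, the group-relative advantage used by single-turn GRPO can be sketched as below. This shows only the commonly described baseline normalization, not MURPHY's or TRLOO's multi-turn variants; the function name, the use of the population standard deviation, and the `eps` value are assumptions.

```python
from statistics import fmean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]
```

If every rollout in a group fails to compile, all rewards are 0.0 and every advantage is zero, so the group contributes no learning signal. That is the situation an SFT warm-up stage is meant to avoid.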
### Libraries
- **veRL**: supports GRPO, DAPO, GSPO; scales to large models; best throughput for long multi-turn rollouts
- **OpenRLHF-M**: multimodal variant of OpenRLHF; targets VLM policies (code + rendered image inputs)
Both have TRL and OpenEnv integrations.
### Task Bank
The task bank is populated from the shaders21k corpus (~16.8K single-pass shaders after filtering). RL training requires 10K-15K rollouts minimum based on adjacent work (CTRL-S: 14.4K, Reason-SVG: 10K).
## Related Work
| Work | Relevance |
|------|-----------|
| **Shadertoy** | Active public corpus and community; seed source for tasks and reference effects |
| **OpenEnv** (Meta/PyTorch) | Target framework. Client-server RL environment with `Action`/`Observation` base classes, `reset()`/`step()` contract, and RFC 004 Rubric system. TRL, Unsloth, SkyRL, and Oumi integrations. |
| **ShaderEval** (LLM4Code @ ICSE 2025) | Only formal benchmark for GLSL code generation evaluation. 467 functions, pixel-diff evaluation via `shadermatch`. |
| **shaders21k** (NeurIPS 2022) | 21K OpenGL fragment shaders from Shadertoy for visual representation learning. Ready seed corpus for task generation. |
| **AI Co-Artist** (arXiv:2512.08951) | Closest work to multi-turn LLM-driven GLSL refinement. GPT-4 + Picbreeder-style evolution, <3% compile error after retries. No RL. |
| **VLMaterial** (ICLR 2025 Spotlight) | Fine-tunes a VLM for Blender material node graphs from images. Validates rendered-image similarity as a reward signal for code-generating policies. |
| **Dr. Kernel / KernelGYM** (arXiv:2602.05885) | Multi-turn RL for GPU code generation (CUDA/Triton) with compile-correctness-speedup reward chain. Proposes TRLOO for multi-turn credit assignment. |
| **MURPHY** (NeurIPS 2025) | Multi-turn GRPO for code generation. Canonical algorithm for the interaction loop used here. |
| **ProcMatRL** (SIGGRAPH Asia 2024) | RL for procedural material parameter optimization. Validates RL applicability in the visual generation domain. |
| **ShadAR** (arXiv:2602.17481) | LLM-driven real-time shader generation for AR. |
| **Procedural Shader Evolution** (arXiv:2312.17587) | Interactive evolutionary algorithms for shader generation. Multi-turn refinement as iterative optimization. |
## Licensing
Shader code in the task bank is sourced from [ShaderToy](https://www.shadertoy.com/) via the [shaders21k](https://github.com/mbaradad/shaders21k) dataset. ShaderToy's default license is **CC BY-NC-SA 3.0**; authors may choose a different license, but there is no structured per-shader license metadata in the dataset or the ShaderToy API.
- The dataset is not redistributed in this repository; it is downloaded at build time
- Individual shader provenance is tracked via the `source` field on each task (links back to the ShaderToy page)
- For commercial use, per-shader license review is required
## Next Steps
- Plan the SFT warm-up dataset and training run (prerequisite for RL)
- Extend reward beyond SSIM: DINOv2 appearance match, temporal stability, performance, robustness
- Implement the reward plugin API (multi-component, pluggable)
- Establish the hidden evaluation protocol across time, resolution, and seed variations
- Add `shaderc` / `glslang` offline validation
- Add support for multi-pass shaders and texture inputs
## References
### Primary
- Shadertoy: <https://www.shadertoy.com/>
- OpenEnv (Meta/PyTorch): <https://github.com/meta-pytorch/OpenEnv>
- OpenEnv RFC 002: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/002-env-spec.md>
- OpenEnv RFC 004: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/004-rubrics.md>
- TRL OpenEnv integration: <https://huggingface.co/docs/trl/openenv>
- ShaderEval (LLM4Code @ ICSE 2025): <https://conf.researchr.org/details/icse-2025/llm4code-2025-papers/13/Evaluating-Language-Models-for-Computer-Graphics-Code-Completion>
- shadertoys-dataset: <https://github.com/Vipitis/shadertoys-dataset>
- Shadereval-inputs (HuggingFace): <https://huggingface.co/datasets/Vipitis/Shadereval-inputs>
- shaders21k (NeurIPS 2022): <https://arxiv.org/abs/2211.16412>
- shaders21k dataset: <https://github.com/mbaradad/shaders21k>
- AI Co-Artist: <https://arxiv.org/abs/2512.08951>
- VLMaterial (ICLR 2025): <https://arxiv.org/abs/2501.18623>
- VLMaterial code: <https://github.com/mit-gfx/VLMaterial>
- Dr. Kernel / KernelGYM: <https://arxiv.org/abs/2602.05885>
- MURPHY (NeurIPS 2025): <https://arxiv.org/abs/2511.07833>
- RLRF (SVG RL): <https://arxiv.org/abs/2505.20793>
### Tools and Infrastructure
- `shaderc`: <https://github.com/google/shaderc>
- `glslang`: <https://github.com/KhronosGroup/glslang>
- `pygfx/shadertoy`: <https://github.com/pygfx/shadertoy>
- moderngl: <https://github.com/moderngl/moderngl>
- veRL: <https://github.com/volcengine/verl>
- OpenRLHF-M: <https://github.com/OpenRLHF/OpenRLHF-M>
- Adobe ProcMatRL: <https://github.com/adobe-research/ProcMatRL>
### Specifications
- WGSL specification: <https://www.w3.org/TR/WGSL/>
- WebGPU: <https://webgpu.org/>
- Khronos `glslang` reference: <https://www.khronos.org/opengles/sdk/Reference-Compiler/>
### Supplementary
- Procedural Shader Evolution: <https://arxiv.org/abs/2312.17587>
- ShadAR: <https://arxiv.org/abs/2602.17481>
- ProcMatRL paper (SIGGRAPH Asia 2024): <https://doi.org/10.1145/3687979>
- Conditioned Procedural Materials (SIGGRAPH 2023): <https://dl.acm.org/doi/10.1145/3588432.3591520>
- FragCoord.xyz: <https://fragcoord.xyz/>
- naga GLSL front-end failures: <https://github.com/Vipitis/shadertoys-dataset/issues/15>