title: Shader Environment Server
emoji: 🎨
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Shader Environment

Overview

shader is an OpenEnv-compatible environment for generating, repairing, and iteratively refining executable shaders against visual and systems constraints.

Supported interaction modes:

  • Reference-conditioned shader recreation via SSIM reward
  • Multi-turn shader refinement (not one-shot only)
  • GLSL-first execution via headless rendering, with WGSL/browser-native as a later target
  • Pluggable reward traces for RL, evaluation, reranking, and distillation
  • Hidden evaluation across time, resolution, and seed variations

Project Structure

envs/shader/
├── __init__.py          # Public API: ShaderEnv, ShaderAction, ShaderObservation
├── models.py            # Pydantic Action / Observation types
├── client.py            # OpenEnv client (ShaderEnv)
├── tasks.py             # Task bank loader (shaders21k corpus)
├── reward.py            # SSIM reward computation
├── render.py            # Headless GLSL renderer (ModernGL + EGL)
├── harness.py           # Subprocess-isolated render wrapper
├── download.sh          # Fetches shaders21k dataset
├── openenv.yaml         # OpenEnv space descriptor
├── pyproject.toml
└── server/
    ├── app.py           # FastAPI application (OpenEnv HTTP server)
    ├── environment.py   # Environment implementation (reset / step loop)
    └── Dockerfile

Quickstart

# Download the shaders21k corpus (~41 MB)
cd envs/shader
./download.sh

Running via uv / uvicorn

cd envs/shader
PYTHONPATH=../.. uvicorn server.app:app --host 0.0.0.0 --port 8000

Running via Docker

The corpus is downloaded at build time automatically:

cd envs/shader
docker build -f server/Dockerfile -t shader .
docker run -p 8000:8000 shader

Validation

# Validate local structure
cd envs/shader
openenv validate --verbose

# Validate a running server (6 criteria: openapi, health, metadata, schema, mcp, mode)
openenv validate http://localhost:8000

Interacting via WebSocket

The HTTP endpoints (/reset, /step) are stateless: each request creates a fresh environment instance. For multi-turn sessions with persistent state, use the WebSocket endpoint:

import asyncio, json, websockets

async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        # Reset: picks a task and renders the reference
        await ws.send(json.dumps({"type": "reset", "data": {}}))
        resp = json.loads(await ws.recv())
        obs = resp["data"]["observation"]
        print(obs["task"], obs["remaining"])

        # Step: submit GLSL, get back SSIM and the render
        await ws.send(json.dumps({
            "type": "step",
            "data": {"code": "void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"}
        }))
        resp = json.loads(await ws.recv())
        r = resp["data"]
        print(f"compiled={r['observation']['compiled']} ssim={r['observation']['ssim']} reward={r['reward']}")

        await ws.send(json.dumps({"type": "close"}))

asyncio.run(main())

Python Client

from shader import ShaderEnv, ShaderAction

with ShaderEnv(base_url="http://localhost:8000").sync() as client:
    result = client.reset()
    print(result.observation.task)         # ShaderToy ID
    print(result.observation.reference_png) # base64 PNG

    result = client.step(ShaderAction(code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"))
    print(result.observation.compiled)     # True/False
    print(result.observation.ssim)         # similarity vs reference

Benchmark

Runs GPT 5.4 against the environment over WebSocket, producing a reproducible baseline:

# Requires a running server and OPENAI_API_KEY set
python envs/shader/benchmark.py                                    # 3 episodes, default seeds
python envs/shader/benchmark.py --turns 5                          # cap turns per episode
python envs/shader/benchmark.py --url ws://localhost:8001/ws       # custom server
python envs/shader/benchmark.py --seeds 10 20 30                   # custom seeds

Seeds control reproducible task selection. Results are saved to benchmark_output/results.json.
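How a seed maps to a task is not spelled out here. As a minimal sketch, assuming one task name is drawn per seed from a fixed list with a dedicated random generator (`pick_task` is a hypothetical helper, not the logic in benchmark.py):

```python
import random


def pick_task(task_names: list[str], seed: int) -> str:
    """Pick one task name reproducibly (hypothetical helper for illustration)."""
    rng = random.Random(seed)               # isolated generator, no global state
    return rng.choice(sorted(task_names))   # sort so the input order does not matter


curated = ["stlXWH", "Nd33R4", "ftjSRd"]
for seed in (10, 20, 30):
    print(seed, pick_task(curated, seed))
```

Sorting before drawing means the same seed yields the same task even if the loader returns names in a different order.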

Tasks

The environment ships with 3 curated tasks at increasing difficulty. Each task presents a reference image; the agent must write GLSL code to reproduce it. The grader is SSIM (structural similarity), returning a score in [0.0, 1.0].

| Task | Difficulty | Lines | Description | What the agent needs |
| --- | --- | --- | --- | --- |
| Nd33R4 | Easy | 13 | XOR color pattern on pixel coordinates | int() casting, bitwise XOR/AND, float conversion |
| stlXWH | Medium | 44 | SDF distance field (square minus circle) with smooth coloring | Signed distance functions, abs, exp, smoothstep, cos for distance coloring |
| ftjSRd | Hard | 122 | Raymarcher with SDF repetition, polar coordinates, HSV coloring | Ray marching loop, rotation matrices, domain repetition, HSV-to-RGB, polar coords |

All 3 tasks are sourced from the shaders21k corpus (Shadertoy). They were selected by code complexity (line count, number of concepts) and verified to produce visually interesting output at render time.

To select a specific task, pass its name to reset():

result = env.reset(task="Nd33R4")   # easy: XOR pattern
result = env.reset(task="stlXWH")   # medium: SDF visualization
result = env.reset(task="ftjSRd")   # hard: raymarcher
result = env.reset()                # random from full corpus

Grading

Each task uses the same grader: SSIM between the agent's rendered output and the ground-truth reference image. The score is deterministic and reproducible for the same GLSL input.

  • Score range: 0.0 (no similarity) to 1.0 (pixel-perfect match)
  • Success threshold: score >= 0.90
  • Compile/render failures: score = 0.0
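The grading rules above can be sketched as a small function. This is an illustration under stated assumptions rather than the code in reward.py: the `Grade` type and `grade` function are hypothetical names, and the clamp to [0, 1] is an assumption.

```python
from dataclasses import dataclass

SUCCESS_THRESHOLD = 0.90  # success threshold stated in the Grading section


@dataclass
class Grade:
    score: float
    success: bool


def grade(ssim: float, compiled: bool, rendered: bool) -> Grade:
    """Turn a raw SSIM value plus status flags into a score (hypothetical helper)."""
    if not (compiled and rendered):
        return Grade(0.0, False)          # compile/render failures score 0.0
    score = min(1.0, max(0.0, ssim))      # assumption: clamp into [0, 1]
    return Grade(score, score >= SUCCESS_THRESHOLD)
```

For example, `grade(0.94, True, True)` counts as a success, while any submission that fails to compile scores 0.0 regardless of the SSIM value passed in.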

Baseline Scores

Evaluated with GPT 5.4 via inference.py (5 steps per task, temperature 0.2):

| Task | Difficulty | Best SSIM | Step-by-step rewards | Multi-turn improvement |
| --- | --- | --- | --- | --- |
| Nd33R4 | Easy | 0.27 | 0.27, 0.13, 0.15, 0.15, 0.16 | No: model generates hash noise instead of XOR pattern |
| stlXWH | Medium | 0.94 | 0.85, 0.88, 0.91, 0.90, 0.94 | Yes: steady refinement from 0.85 to 0.94 |
| ftjSRd | Hard | 0.40 | 0.31, 0.15, 0.33, 0.40, 0.28 | Partial: oscillates between 0.15 and 0.40 |

Key observations:

  • Nd33R4 (easy by code complexity) is hard for VLMs because bitwise XOR patterns are difficult to reverse-engineer from a rendered image alone. The model defaults to procedural noise rather than integer math.
  • stlXWH (medium) shows clear multi-turn refinement: the model progressively improves the SDF shape and color mapping across steps.
  • ftjSRd (hard) challenges frontier models with 122 lines of tightly coupled raymarching, rotation, and HSV coloring. The model attempts structural elements but cannot match the exact parameters.

Run inference.py to reproduce. Scores vary by model and API endpoint.

Task Bank (Corpus)

Beyond the 3 curated tasks, the full task bank is loaded from the shaders21k corpus (NeurIPS 2022). At load time, shaders are filtered to single-pass fragments with no texture/buffer inputs, yielding ~16,800 usable tasks.
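The load-time filter can be sketched as follows. The record layout is an assumption modeled on Shadertoy-style JSON (a `renderpass` list whose entries carry `type` and `inputs`); the actual loader in tasks.py and the on-disk shaders21k format may differ, and `is_usable` is a hypothetical name.

```python
def is_usable(shader: dict) -> bool:
    """Keep single-pass image shaders that read no textures or buffers.

    Assumes a Shadertoy-style record: {"renderpass": [{"type", "inputs", "code"}]}.
    """
    passes = shader.get("renderpass", [])
    if len(passes) != 1:                 # multi-pass shaders are filtered out
        return False
    only = passes[0]
    return only.get("type") == "image" and not only.get("inputs")


corpus = [
    {"renderpass": [{"type": "image", "inputs": [], "code": "..."}]},
    {"renderpass": [{"type": "image", "inputs": [{"channel": 0}], "code": "..."}]},
    {"renderpass": [{"type": "buffer", "inputs": []}, {"type": "image", "inputs": []}]},
]
usable = [s for s in corpus if is_usable(s)]
```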

Each task is a Shadertoy-dialect GLSL shader with known ground-truth code. The environment renders the ground truth to produce a reference image, then challenges the agent to reproduce it.

| Field | Description |
| --- | --- |
| name | ShaderToy ID (e.g. MdGcDc) |
| code | GLSL fragment shader source |
| source | ShaderToy URL for provenance |
| resolution | Render resolution, default 512x288 |
| time | iTime uniform value, default 0.0 |
| difficulty | easy, medium, hard (curated tasks only) |

Motivation

Shader work has properties that make it well-suited as an RL environment:

  • Feedback is fast and dense
  • Compile success, render success, and performance are easy to gate
  • The same shader can be evaluated under multiple controlled render conditions
  • An active public corpus exists (Shadertoy, ~1M public shaders) alongside established compiler infrastructure (shaderc, glslang)

ShaderEval (LLM4Code @ ICSE 2025) benchmarked current LLMs on GLSL function completion and found failure rates above 31% even for top models. GLSL is low-resource in pretraining corpora, leaving room for RL-trained or fine-tuned models to improve on baselines.

Positioning

There is existing adjacent work on conditioned procedural material generation, RL-based material parameter optimization, interactive evolutionary shader generation, and LLM-driven real-time shader generation. The contribution here is not "LLMs can emit shader code" but rather a reusable environment layer with stable runtime, task packaging, and reward/eval plugins.

Runtime Stack

The primary target is GLSL via headless OpenGL.

  • GLSL as the authoring and execution language
    • The entire relevant corpus is GLSL: Shadertoy (~1M shaders), shaders21k (21K shaders), ShaderEval (the only published GLSL benchmark)
    • GLSL has explicit representation in LLM pretraining corpora (The Stack includes a GLSL subset); WGSL has near-zero public training data
    • WGSL prohibits recursion, has no implicit type coercions, and uses incompatible uniform conventions, making Shadertoy-dialect GLSL non-transpilable to WGSL via naga at corpus scale (~15% failure rate)
  • ModernGL with EGL as the headless rendering backend (render.py)
    • Runs on Linux servers without a display via EGL
    • Used by shaders21k for offline rendering
    • Wraps user shader code in a Shadertoy-compatible preamble (standard uniforms, mainImage forward declaration)
    • Strips #version, precision, and #extension directives from user code to avoid conflicts
    • Adjusts error line numbers reported by the driver to map back to user code
  • Subprocess isolation (harness.py)
    • Each render runs in a separate process to contain driver crashes, infinite loops, and GPU state corruption
    • Configurable per-render timeout (default 10s)
    • Returns structured RenderResult with compile/render status, error messages, and raw RGBA pixel data
  • shaderc / glslang for offline validation and portability checks (planned)
  • WGSL / WebGPU deferred to a later phase for browser-native demos
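The preamble wrapping, directive stripping, and line-number adjustment described for render.py can be sketched as below. This is a simplified illustration, not the project's renderer: the preamble text, the function names, and the assumed `0:<line>` error format are all hypothetical, and real drivers report errors in several different formats.

```python
import re

# Illustrative preamble; the real one in render.py declares more uniforms.
PREAMBLE = """#version 330
uniform vec3 iResolution;
uniform float iTime;
out vec4 fragColor;
void mainImage(out vec4 c, in vec2 f);
void main() { mainImage(fragColor, gl_FragCoord.xy); }
"""
PREAMBLE_LINES = PREAMBLE.count("\n")

# Lines that would conflict with the preamble's own directives.
_DIRECTIVE = re.compile(r"^\s*(#\s*version\b|#\s*extension\b|precision\s)")
# Assumed error format "0:<line>"; other drivers use different layouts.
_ERR_LINE = re.compile(r"\b0:(\d+)")


def wrap(user_code: str) -> str:
    """Prepend the preamble and blank out conflicting directives."""
    kept = ["" if _DIRECTIVE.match(line) else line for line in user_code.splitlines()]
    return PREAMBLE + "\n".join(kept) + "\n"


def remap_error(message: str) -> str:
    """Shift a driver-reported line number back into user-code coordinates."""
    def shift(match: re.Match) -> str:
        return f"0:{max(1, int(match.group(1)) - PREAMBLE_LINES)}"
    return _ERR_LINE.sub(shift, message)
```

Blanking stripped lines instead of deleting them keeps user line numbers aligned, so the remap only has to subtract the preamble length.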

Episode Schema

The environment follows a standard multi-turn refinement loop:

  1. reset() picks a task from the bank, renders the ground-truth reference, and returns it as a base64 PNG
  2. The agent submits GLSL code via step(ShaderAction(code=...))
  3. The server compiles and renders the shader, computes SSIM vs reference
  4. Returns compile/render status, errors, rendered image, and SSIM reward
  5. Episode ends when the turn budget is exhausted (default 10 turns) or SSIM >= 0.99
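The termination rule in step 5 can be sketched with a toy episode tracker. It takes an already-computed SSIM value in place of the real compile/render/score pipeline; the `Episode` class is a hypothetical stand-in for the logic in server/environment.py.

```python
from dataclasses import dataclass

MAX_TURNS = 10     # default turn budget from the schema above
DONE_SSIM = 0.99   # early-termination threshold from the schema above


@dataclass
class Episode:
    remaining: int = MAX_TURNS
    done: bool = False

    def step(self, ssim: float) -> tuple[float, bool]:
        """Consume one turn; `ssim` stands in for render + reward computation."""
        if self.done:
            raise RuntimeError("episode finished; call reset() first")
        self.remaining -= 1
        self.done = self.remaining == 0 or ssim >= DONE_SSIM
        return ssim, self.done
```

An episode therefore ends either after the tenth submission or as soon as one submission reaches an SSIM of at least 0.99.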

The server supports up to 4 concurrent environment sessions (max_concurrent_envs=4).

This supports RL training, best-of-N search and reranking, trajectory collection for SFT or distillation, and evaluation under a shared runtime.

Environment Variants

shader (primary)

Match a target still image or short animated effect within a frame-time budget. Direct, visual, and evaluable without a full DCC toolchain.

Material Graph

Edit procedural material graphs and parameters toward a target appearance. The search space is more structured than raw shader code, and recent procedural-material work already uses graph/program representations.

Shader Repair

Start from shaders that are broken, slow, unstable, or portability-problematic and optimize for correctness, robustness, and performance.

Task Families

  • Reference recreation: recreate a target still or short effect from a reference render
  • Repair: fix syntax errors, portability failures, or numerical instability
  • Optimization: preserve appearance while reducing frame time or instruction count
  • Style transfer: preserve scene logic while shifting color, texture, motion, or lighting style
  • Critique incorporation: revise the shader based on iterative feedback
  • Robustness repair: stabilize a shader across resolutions, aspect ratios, and time ranges

Evaluation

Evaluation relies on hidden checks rather than visible examples only:

  • Compile success
  • Render success (no NaNs or fatal runtime failures)
  • Perceptual similarity on held-out stills
  • Temporal consistency on held-out short clips
  • Stability across resolutions, aspect ratios, and parameter seeds
  • Frame-time or instruction-budget limits
  • Optional portability checks across compiler/validator paths

Reward Structure

Current implementation (reward.py): windowed SSIM (Wang et al. 2004) between the agent render and the reference, computed per channel on RGB and averaged. It uses scipy's uniform_filter for windowed statistics when available and falls back to global-statistics SSIM otherwise. Compile and render failures receive reward 0.0. Reward range: [0.0, 1.0].
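As an illustration of the global-statistics fallback only (not the windowed variant, and not the code in reward.py), SSIM can be computed per channel from whole-image means, variances, and covariance. The constants follow Wang et al. (K1 = 0.01, K2 = 0.03, dynamic range 255); the function names, the list-of-tuples image layout, and the final clamp to [0, 1] are assumptions.

```python
from statistics import fmean

C1 = (0.01 * 255) ** 2   # stabilizer for the luminance term
C2 = (0.03 * 255) ** 2   # stabilizer for the contrast/structure term


def ssim_global(x: list[float], y: list[float]) -> float:
    """SSIM of two equally sized channels using whole-image statistics."""
    if len(x) != len(y) or not x:
        raise ValueError("channels must be non-empty and the same size")
    mx, my = fmean(x), fmean(y)
    vx = fmean([(a - mx) ** 2 for a in x])
    vy = fmean([(b - my) ** 2 for b in y])
    cov = fmean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return ((2 * mx * my + C1) * (2 * cov + C2)) / (
        (mx * mx + my * my + C1) * (vx + vy + C2)
    )


def ssim_rgb(img_a: list[tuple], img_b: list[tuple]) -> float:
    """Average per-channel SSIM over RGB pixels given as (r, g, b) in 0..255."""
    per_channel = [
        ssim_global([p[c] for p in img_a], [p[c] for p in img_b]) for c in range(3)
    ]
    # Assumption: clamp, since raw SSIM can be slightly negative.
    return min(1.0, max(0.0, fmean(per_channel)))
```

Global statistics ignore where differences occur in the image, which is why the windowed form is preferred when scipy is available.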

Planned multi-component reward:

R = G_compile * G_render * (
  0.35 * appearance_match +
  0.20 * temporal_stability +
  0.20 * performance +
  0.15 * robustness +
  0.10 * code_quality
) - step_penalty - regression_penalty

Component notes:

  • appearance_match: measured on hidden render conditions using DINOv2 cosine similarity (better than CLIP for texture/color/style fidelity) combined with pixel-level SSIM for structural accuracy. CLIP is appropriate only for text-conditioned task variants.
  • temporal_stability: requires rendering N consecutive frames and computing frame-to-frame SSIM. For v1, this may serve as a held-out evaluation metric rather than a dense training reward to keep per-episode compute manageable.
  • performance: frame-time as a reward is tractable headlessly. To avoid reward hacking (trivially simple shaders that are fast but visually wrong), this is gated as a hard budget first (penalize frames over a threshold) before adding a continuous score.
  • robustness: captures resolution changes, seed changes, and compiler portability.
  • G_compile * G_render: hard multiplicative gates, standard in code generation RL. These cause a zero-gradient problem early in training; mitigate with curriculum learning (start from partial or working shader skeletons) and/or soft penalties before hardening.
  • SFT warm-up is a prerequisite before RL. Adjacent work (RLRF for SVG) shows that RL directly on an instruction-tuned model without a domain SFT stage fails because the base model cannot generate renderable output reliably enough to produce reward variance.
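The planned reward formula above translates directly into code. The sketch below uses the weights from the formula; the function name, the dictionary layout, and the assumption that every component is already normalized to [0, 1] are illustrative, since the reward plugin API is not yet implemented.

```python
WEIGHTS = {
    "appearance_match": 0.35,
    "temporal_stability": 0.20,
    "performance": 0.20,
    "robustness": 0.15,
    "code_quality": 0.10,
}


def planned_reward(
    components: dict[str, float],
    compiled: bool,
    rendered: bool,
    step_penalty: float = 0.0,
    regression_penalty: float = 0.0,
) -> float:
    """R = G_compile * G_render * weighted_sum - step_penalty - regression_penalty."""
    gate = float(compiled) * float(rendered)   # hard multiplicative gates
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return gate * weighted - step_penalty - regression_penalty
```

Because the gates multiply the whole weighted sum, a shader that fails to compile receives only the penalties, which is the source of the zero-gradient problem noted above.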

Action / Observation Contract

class ShaderAction(Action):
    code: str  # Shadertoy-dialect GLSL fragment shader source

class ShaderObservation(Observation):
    task: str               # ShaderToy ID
    remaining: int          # turns left in episode
    reference_png: str      # base64 PNG (non-empty on reset only)
    compiled: bool
    rendered: bool
    errors: list[str]
    agent_png: str          # base64 PNG of agent's render
    ssim: float             # SSIM vs reference in [0, 1]
    done: bool = False
    reward: float | None = None
  • Placing reward and done inside Observation follows the OpenEnv spec (RFC 002, Decision 2): rewards are computed inside the environment and returned as part of the observation; the server layer promotes them to the top-level StepResponse.
  • Detailed artifacts (frame dumps, profiler traces, held-out evaluation results) live behind tools rather than being inlined on every turn.

Training

Algorithm

  • GRPO as the baseline, with multi-turn extensions:
    • MURPHY (NeurIPS 2025) β€” for each rollout that does not reach maximum reward, execution feedback is appended and new rollouts are generated from that state. Up to 8% relative gain over single-turn GRPO.
    • TRLOO (Dr. Kernel) β€” addresses the biased policy gradient problem that vanilla GRPO has in multi-turn settings.
  • The episode schema (compile, render, reward, next turn) maps directly onto MURPHY's interaction loop.
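For readers unfamiliar with GRPO, its core step is normalizing rewards within a group of rollouts for the same task. The sketch below shows that step only; it uses the population standard deviation and a small epsilon as simplifying assumptions, and it does not implement the MURPHY or TRLOO extensions.

```python
from statistics import fmean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)   # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group where every rollout fails to compile carries no learning signal.
all_failed = group_advantages([0.0, 0.0, 0.0, 0.0])
# A group with mixed SSIM rewards separates better from worse rollouts.
mixed = group_advantages([0.0, 0.31, 0.85, 0.94])
```

When every rollout in a group receives the same reward, all advantages are zero, which illustrates why reward variance matters before RL can make progress.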

Libraries

  • veRL: supports GRPO, DAPO, GSPO; scales to large models; best throughput for long multi-turn rollouts
  • OpenRLHF-M: multimodal variant of OpenRLHF; targets VLM policies (code + rendered image inputs)

Both have TRL and OpenEnv integrations.

Task Bank

The task bank is populated from the shaders21k corpus (~16.8K single-pass shaders after filtering). RL training requires 10K-15K rollouts minimum based on adjacent work (CTRL-S: 14.4K, Reason-SVG: 10K).

Related Work

| Work | Relevance |
| --- | --- |
| Shadertoy | Active public corpus and community; seed source for tasks and reference effects |
| OpenEnv (Meta/PyTorch) | Target framework. Client-server RL environment with Action/Observation base classes, reset()/step() contract, and RFC 004 Rubric system. TRL, Unsloth, SkyRL, and Oumi integrations. |
| ShaderEval (LLM4Code @ ICSE 2025) | Only formal benchmark for GLSL code generation evaluation. 467 functions, pixel-diff evaluation via shadermatch. |
| shaders21k (NeurIPS 2022) | 21K OpenGL fragment shaders from Shadertoy for visual representation learning. Ready seed corpus for task generation. |
| AI Co-Artist (arXiv:2512.08951) | Closest work to multi-turn LLM-driven GLSL refinement. GPT-4 + Picbreeder-style evolution, <3% compile error after retries. No RL. |
| VLMaterial (ICLR 2025 Spotlight) | Fine-tunes a VLM for Blender material node graphs from images. Validates rendered-image similarity as a reward signal for code-generating policies. |
| Dr. Kernel / KernelGYM (arXiv:2602.05885) | Multi-turn RL for GPU code generation (CUDA/Triton) with compile-correctness-speedup reward chain. Proposes TRLOO for multi-turn credit assignment. |
| MURPHY (NeurIPS 2025) | Multi-turn GRPO for code generation. Canonical algorithm for the interaction loop used here. |
| ProcMatRL (SIGGRAPH Asia 2024) | RL for procedural material parameter optimization. Validates RL applicability in the visual generation domain. |
| ShadAR (arXiv:2602.17481) | LLM-driven real-time shader generation for AR. |
| Procedural Shader Evolution (arXiv:2312.17587) | Interactive evolutionary algorithms for shader generation. Multi-turn refinement as iterative optimization. |

Licensing

Shader code in the task bank is sourced from ShaderToy via the shaders21k dataset. ShaderToy's default license is CC BY-NC-SA 3.0. Authors may choose a different license, but there is no structured per-shader license metadata in the dataset or the ShaderToy API.

  • The dataset is not redistributed in this repository; it is downloaded at build time
  • Individual shader provenance is tracked via the source field on each task (links back to the ShaderToy page)
  • For commercial use, per-shader license review is required

Next Steps

  • Plan the SFT warm-up dataset and training run (prerequisite for RL)
  • Extend reward beyond SSIM: DINOv2 appearance match, temporal stability, performance, robustness
  • Implement the reward plugin API (multi-component, pluggable)
  • Establish the hidden evaluation protocol across time, resolution, and seed variations
  • Add shaderc / glslang offline validation
  • Add support for multi-pass shaders and texture inputs
