	---
	title: Shader Environment Server
	emoji: 🎨
	colorFrom: purple
	colorTo: blue
	sdk: docker
	pinned: false
	app_port: 8000
	base_path: /web
	tags:
	- openenv
	---

	# Shader Environment

	## Overview

	`shader` is an OpenEnv-compatible environment for generating, repairing, and iteratively refining executable shaders against visual and systems constraints.

	Supported interaction modes:

	- Reference-conditioned shader recreation via SSIM reward
	- Multi-turn shader refinement (not one-shot only)
	- GLSL-first execution via headless rendering, with WGSL/browser-native as a later target
	- Pluggable reward traces for RL, evaluation, reranking, and distillation
	- Hidden evaluation across time, resolution, and seed variations

	## Project Structure

	```
	envs/shader/
	├── __init__.py # Public API: ShaderEnv, ShaderAction, ShaderObservation
	├── models.py # Pydantic Action / Observation types
	├── client.py # OpenEnv client (ShaderEnv)
	├── tasks.py # Task bank loader (shaders21k corpus)
	├── reward.py # SSIM reward computation
	├── render.py # Headless GLSL renderer (ModernGL + EGL)
	├── harness.py # Subprocess-isolated render wrapper
	├── download.sh # Fetches shaders21k dataset
	├── openenv.yaml # OpenEnv space descriptor
	├── pyproject.toml
	└── server/
	    ├── app.py # FastAPI application (OpenEnv HTTP server)
	    ├── environment.py # Environment implementation (reset / step loop)
	    └── Dockerfile
	```

	## Quickstart

	```bash
	# Download the shaders21k corpus (~41 MB)
	cd envs/shader
	./download.sh
	```

	### Running via uv / uvicorn

	```bash
	cd envs/shader
	PYTHONPATH=../.. uvicorn server.app:app --host 0.0.0.0 --port 8000
	```

	### Running via Docker

	The corpus is downloaded at build time automatically:

	```bash
	cd envs/shader
	docker build -f server/Dockerfile -t shader .
	docker run -p 8000:8000 shader
	```

	### Validation

	```bash
	# Validate local structure
	cd envs/shader
	openenv validate --verbose

	# Validate a running server (6 criteria: openapi, health, metadata, schema, mcp, mode)
	openenv validate http://localhost:8000
	```

	### Interacting via WebSocket

	The HTTP endpoints (`/reset`, `/step`) are stateless — each creates a fresh environment instance. For multi-turn sessions with persistent state, use the WebSocket endpoint:

	```python
	import asyncio
	import json

	import websockets

	async def main():
	    async with websockets.connect("ws://localhost:8000/ws") as ws:
	        # Reset: picks a task, renders reference
	        await ws.send(json.dumps({"type": "reset", "data": {}}))
	        resp = json.loads(await ws.recv())
	        obs = resp["data"]["observation"]
	        print(obs["task"], obs["remaining"])

	        # Step: submit GLSL, get back SSIM + render
	        await ws.send(json.dumps({
	            "type": "step",
	            "data": {"code": "void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"}
	        }))
	        resp = json.loads(await ws.recv())
	        r = resp["data"]
	        print(f"compiled={r['observation']['compiled']} ssim={r['observation']['ssim']} reward={r['reward']}")

	        # Tell the server the session is finished
	        await ws.send(json.dumps({"type": "close"}))

	asyncio.run(main())
	```

	### Python Client

	```python
	from shader import ShaderEnv, ShaderAction

	with ShaderEnv(base_url="http://localhost:8000").sync() as client:
	    result = client.reset()
	    print(result.observation.task) # ShaderToy ID
	    print(result.observation.reference_png) # base64 PNG

	    result = client.step(ShaderAction(code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1,0,0,1); }"))
	    print(result.observation.compiled) # True/False
	    print(result.observation.ssim) # similarity vs reference
	```

	### Benchmark

	Runs GPT 5.4 against the environment over WebSocket, producing a reproducible baseline:

	```bash
	# Requires a running server and OPENAI_API_KEY set
	python envs/shader/benchmark.py # 3 episodes, default seeds
	python envs/shader/benchmark.py --turns 5 # cap turns per episode
	python envs/shader/benchmark.py --url ws://localhost:8001/ws # custom server
	python envs/shader/benchmark.py --seeds 10 20 30 # custom seeds
	```

	Seeds control reproducible task selection. Results are saved to `benchmark_output/results.json`.
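
As a rough illustration of how a seed can pin down task selection, the sketch below draws a task name with a dedicated `random.Random` instance. The `pick_task` helper and the three-name list are assumptions made for this example; the selection logic inside the environment may differ.

```python
import random

# Hypothetical task list; the real bank is loaded from the shaders21k corpus.
TASK_NAMES = ["Nd33R4", "stlXWH", "ftjSRd"]

def pick_task(seed: int, names: list[str] = TASK_NAMES) -> str:
    """Return the same task name every time the same seed is supplied."""
    rng = random.Random(seed)  # isolated RNG, leaves global random state alone
    return rng.choice(names)

if __name__ == "__main__":
    for seed in (10, 20, 30):
        print(seed, pick_task(seed))
```

Using a private generator per seed keeps selection independent of any other code that calls the global `random` functions.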

	## Tasks

	The environment ships with 3 curated tasks at increasing difficulty. Each task presents a reference image; the agent must write GLSL code to reproduce it. The grader is SSIM (structural similarity), returning a score in [0.0, 1.0].

	| Task | Difficulty | Lines | Description | What the agent needs |
	|------|-----------|-------|-------------|---------------------|
	| `Nd33R4` | Easy | 13 | XOR color pattern on pixel coordinates | `int()` casting, bitwise XOR/AND, float conversion |
	| `stlXWH` | Medium | 44 | SDF distance field (square minus circle) with smooth coloring | Signed distance functions, `abs`, `exp`, `smoothstep`, `cos` for distance coloring |
	| `ftjSRd` | Hard | 122 | Raymarcher with SDF repetition, polar coordinates, HSV coloring | Ray marching loop, rotation matrices, domain repetition, HSV-to-RGB, polar coords |

	All 3 tasks are sourced from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (Shadertoy). They were selected by code complexity (line count, number of concepts) and verified to produce visually interesting output at render time.

	To select a specific task, pass its name to `reset()`:

	```python
	result = env.reset(task="Nd33R4") # easy — XOR pattern
	result = env.reset(task="stlXWH") # medium — SDF visualization
	result = env.reset(task="ftjSRd") # hard — raymarcher
	result = env.reset() # random from full corpus
	```

	### Grading

	Each task uses the same grader: SSIM between the agent's rendered output and the ground-truth reference image. The score is deterministic and reproducible for the same GLSL input.

	- Score range: 0.0 (no similarity) to 1.0 (pixel-perfect match)
	- Success threshold: score >= 0.90
	- Compile/render failures: score = 0.0

	### Baseline Scores

	Evaluated with GPT 5.4 via `inference.py` (5 steps per task, temperature 0.2):

	| Task | Difficulty | Best SSIM | Step-by-step rewards | Multi-turn improvement |
	|------|-----------|-----------|---------------------|----------------------|
	| `Nd33R4` | Easy | 0.27 | 0.27, 0.13, 0.15, 0.15, 0.16 | No: model generates hash noise instead of XOR pattern |
	| `stlXWH` | Medium | 0.94 | 0.85, 0.88, 0.91, 0.90, 0.94 | Yes: steady refinement from 0.85 to 0.94 |
	| `ftjSRd` | Hard | 0.40 | 0.31, 0.15, 0.33, 0.40, 0.28 | Partial: oscillates between 0.15-0.40 |

	Key observations:
	- `Nd33R4` (easy by code complexity) is hard for VLMs because bitwise XOR patterns are difficult to reverse-engineer from a rendered image alone. The model defaults to procedural noise rather than integer math.
	- `stlXWH` (medium) shows clear multi-turn refinement — the model progressively improves the SDF shape and color mapping across steps.
	- `ftjSRd` (hard) challenges frontier models with 122 lines of tightly coupled raymarching, rotation, and HSV coloring. The model attempts structural elements but cannot match the exact parameters.

	Run `inference.py` to reproduce. Scores vary by model and API endpoint.
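
The per-task figures in the baseline table can be recomputed from the step-by-step rewards. The sketch below is only a reading aid: the `summarize` helper and its output keys are assumptions, while the reward lists and the 0.90 success threshold are copied from this document.

```python
SUCCESS_THRESHOLD = 0.90  # success threshold from the Grading section

# Step-by-step rewards copied from the baseline table.
BASELINE = {
    "Nd33R4": [0.27, 0.13, 0.15, 0.15, 0.16],
    "stlXWH": [0.85, 0.88, 0.91, 0.90, 0.94],
    "ftjSRd": [0.31, 0.15, 0.33, 0.40, 0.28],
}

def summarize(rewards):
    """Return best score, first-to-last change, and the first passing step."""
    best = max(rewards)
    net_change = round(rewards[-1] - rewards[0], 2)
    first_pass = next(
        (i + 1 for i, r in enumerate(rewards) if r >= SUCCESS_THRESHOLD), None
    )
    return {"best": best, "net_change": net_change, "first_pass_step": first_pass}

SUMMARY = {task: summarize(r) for task, r in BASELINE.items()}

if __name__ == "__main__":
    for task, row in SUMMARY.items():
        print(task, row)
```

Only `stlXWH` crosses the success threshold, which it first does at step 3; the other two tasks end below their first-step score.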

	## Task Bank (Corpus)

	Beyond the 3 curated tasks, the full task bank is loaded from the [shaders21k](https://github.com/mbaradad/shaders21k) corpus (NeurIPS 2022). At load time, shaders are filtered to single-pass fragments with no texture/buffer inputs, yielding ~16,800 usable tasks.

	Each task is a Shadertoy-dialect GLSL shader with known ground-truth code. The environment renders the ground truth to produce a reference image, then challenges the agent to reproduce it.

	| Field | Description |
	|-------|-------------|
	| `name` | ShaderToy ID (e.g. `MdGcDc`) |
	| `code` | GLSL fragment shader source |
	| `source` | ShaderToy URL for provenance |
	| `resolution` | Render resolution, default 512x288 |
	| `time` | iTime uniform value, default 0.0 |
	| `difficulty` | `easy`, `medium`, `hard` (curated tasks only) |
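
One way to picture a task record is as a small dataclass mirroring the fields in the table. This is a sketch, not the types defined in `tasks.py`: the `Task` class name, the `looks_self_contained` helper, and its `iChannel` heuristic for spotting texture/buffer inputs are all assumptions, and the shader body is a placeholder rather than the real source of `MdGcDc`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Task:
    name: str                                  # ShaderToy ID, e.g. "MdGcDc"
    code: str                                  # GLSL fragment shader source
    source: str                                # ShaderToy URL for provenance
    resolution: Tuple[int, int] = (512, 288)   # default render resolution
    time: float = 0.0                          # iTime uniform value
    difficulty: Optional[str] = None           # set for curated tasks only

def looks_self_contained(code: str) -> bool:
    """Heuristic (assumption): reject shaders that reference iChannel inputs."""
    return "iChannel" not in code

task = Task(
    name="MdGcDc",
    code="void mainImage(out vec4 c, in vec2 f) { c = vec4(1.0); }",  # placeholder
    source="https://www.shadertoy.com/view/MdGcDc",
)
```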

	## Motivation

	Shader work has properties that make it well-suited as an RL environment:

	- Feedback is fast and dense
	- Compile success, render success, and performance are easy to gate
	- The same shader can be evaluated under multiple controlled render conditions
	- An active public corpus exists (Shadertoy, ~1M public shaders) alongside established compiler infrastructure (`shaderc`, `glslang`)

	ShaderEval (LLM4Code @ ICSE 2025) benchmarked current LLMs on GLSL function completion and found failure rates above 31% even for top models. GLSL is low-resource in pretraining corpora, leaving room for RL-trained or fine-tuned models to improve on baselines.

	### Positioning

	There is existing adjacent work on conditioned procedural material generation, RL-based material parameter optimization, interactive evolutionary shader generation, and LLM-driven real-time shader generation. The contribution here is not "LLMs can emit shader code" but rather a reusable environment layer with stable runtime, task packaging, and reward/eval plugins.

	## Runtime Stack

	The primary target is GLSL via headless OpenGL.

	- GLSL as the authoring and execution language
	  - The entire relevant corpus is GLSL: Shadertoy (~1M shaders), shaders21k (21K shaders), ShaderEval (the only published GLSL benchmark)
	  - GLSL has explicit representation in LLM pretraining corpora (The Stack includes a GLSL subset); WGSL has near-zero public training data
	  - WGSL prohibits recursion, has no implicit type coercions, and uses incompatible uniform conventions, making Shadertoy-dialect GLSL non-transpilable to WGSL via naga at corpus scale (~15% failure rate)
	- ModernGL with EGL as the headless rendering backend (`render.py`)
	  - Runs on Linux servers without a display via EGL
	  - Used by shaders21k for offline rendering
	  - Wraps user shader code in a Shadertoy-compatible preamble (standard uniforms, `mainImage` forward declaration)
	  - Strips `#version`, `precision`, and `#extension` directives from user code to avoid conflicts
	  - Adjusts error line numbers reported by the driver to map back to user code
	- Subprocess isolation (`harness.py`)
	  - Each render runs in a separate process to contain driver crashes, infinite loops, and GPU state corruption
	  - Configurable per-render timeout (default 10s)
	  - Returns structured `RenderResult` with compile/render status, error messages, and raw RGBA pixel data
	- `shaderc` / `glslang` for offline validation and portability checks (planned)
	- WGSL / WebGPU deferred to a later phase for browser-native demos
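
The preamble wrapping, directive stripping, and line-number mapping described in the list above can be sketched in a few lines. This is not the code in `render.py`: the `PREAMBLE` text, the `wrap` and `to_user_line` helpers, and the choice to blank out directives (rather than delete them) are assumptions made for illustration.

```python
import re

# Hypothetical preamble; the real one in render.py may declare more uniforms.
PREAMBLE = """#version 330
uniform vec3 iResolution;
uniform float iTime;
out vec4 fragColor;
void mainImage(out vec4 c, in vec2 f);
void main() { mainImage(fragColor, gl_FragCoord.xy); }
"""
PREAMBLE_LINES = PREAMBLE.count("\n")

# Directives in user code that would clash with the preamble.
_DIRECTIVE = re.compile(r"^\s*(#\s*version|#\s*extension|precision)\b")

def wrap(user_code: str) -> str:
    """Blank out conflicting directives, then prepend the preamble."""
    cleaned = [
        "" if _DIRECTIVE.match(line) else line
        for line in user_code.splitlines()
    ]
    return PREAMBLE + "\n".join(cleaned) + "\n"

def to_user_line(driver_line: int) -> int:
    """Map a 1-based line number in the wrapped source back to user code."""
    return driver_line - PREAMBLE_LINES
```

Replacing a stripped directive with an empty line keeps every remaining user line at its original position, so the mapping back from driver errors reduces to a constant offset.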

	## Episode Schema

	The environment follows a standard multi-turn refinement loop:

	1. `reset()` picks a task from the bank, renders the ground-truth reference, and returns it as a base64 PNG
	2. The agent submits GLSL code via `step(ShaderAction(code=...))`
	3. The server compiles and renders the shader, computes SSIM vs reference
	4. Returns compile/render status, errors, rendered image, and SSIM reward
	5. Episode ends when the turn budget is exhausted (default 10 turns) or SSIM >= 0.99
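
The loop above can be simulated without a server. In this sketch the `run_episode`, `toy_propose`, and `toy_score` names are hypothetical stand-ins for the agent and for the compile/render/SSIM pipeline; only the turn budget (10) and the early-stop threshold (0.99) come from the list.

```python
from typing import Callable, List

MAX_TURNS = 10          # default turn budget
DONE_THRESHOLD = 0.99   # episode ends early once SSIM reaches this value

def run_episode(
    propose: Callable[[List[float]], str],
    score: Callable[[str], float],
    max_turns: int = MAX_TURNS,
) -> List[float]:
    """Submit shaders until the budget runs out or SSIM reaches the threshold."""
    history: List[float] = []
    for _ in range(max_turns):
        code = propose(history)   # agent sees past rewards, returns GLSL text
        ssim = score(code)        # stand-in for compile + render + SSIM
        history.append(ssim)
        if ssim >= DONE_THRESHOLD:
            break
    return history

# Toy stand-ins: each attempt scores 0.25 higher than the previous one.
def toy_propose(history):
    return f"attempt-{len(history)}"

def toy_score(code):
    return min(1.0, 0.25 * (int(code.split("-")[1]) + 1))

rewards = run_episode(toy_propose, toy_score)
```

With these toy functions the episode stops on the fourth turn; a scorer that never reaches the threshold would use all ten turns.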

	The server supports up to 4 concurrent environment sessions (`max_concurrent_envs=4`).

	This supports RL training, best-of-N search and reranking, trajectory collection for SFT or distillation, and evaluation under a shared runtime.

	## Environment Variants

	### `shader` (primary)

	Match a target still image or short animated effect within a frame-time budget. Direct, visual, and evaluable without a full DCC toolchain.

	### Material Graph

	Edit procedural material graphs and parameters toward a target appearance. The search space is more structured than raw shader code, and recent procedural-material work already uses graph/program representations.

	### Shader Repair

	Start from shaders that are broken, slow, unstable, or portability-problematic and optimize for correctness, robustness, and performance.

	## Task Families

	- Reference recreation — recreate a target still or short effect from a reference render
	- Repair — fix syntax errors, portability failures, or numerical instability
	- Optimization — preserve appearance while reducing frame time or instruction count
	- Style transfer — preserve scene logic while shifting color, texture, motion, or lighting style
	- Critique incorporation — revise the shader based on iterative feedback
	- Robustness repair — stabilize a shader across resolutions, aspect ratios, and time ranges

	## Evaluation

	Evaluation relies on hidden checks rather than visible examples only:

	- Compile success
	- Render success (no NaNs or fatal runtime failures)
	- Perceptual similarity on held-out stills
	- Temporal consistency on held-out short clips
	- Stability across resolutions, aspect ratios, and parameter seeds
	- Frame-time or instruction-budget limits
	- Optional portability checks across compiler/validator paths

	### Reward Structure

	Current implementation (`reward.py`): windowed SSIM (Wang et al. 2004) between the agent render and the reference, computed per channel on RGB and averaged. It uses scipy's `uniform_filter` for windowed statistics when scipy is available and falls back to global-statistics SSIM otherwise. Compile and render failures receive reward 0.0. Reward range: [0.0, 1.0].
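
To make the fallback path concrete, here is a minimal pure-Python sketch of global-statistics SSIM averaged over three channels. It is not the code in `reward.py`: the function names, the flat per-channel list layout, and the final clamp to [0, 1] are assumptions, and the windowed variant is deliberately left out. The constants follow the usual choice from Wang et al. for 8-bit data.

```python
from statistics import fmean

C1 = (0.01 * 255) ** 2  # luminance stabilizer for 8-bit data
C2 = (0.03 * 255) ** 2  # contrast/structure stabilizer for 8-bit data

def ssim_channel(x, y):
    """Global-statistics SSIM for two equal-length lists of 0..255 values."""
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    num = (2 * mx * my + C1) * (2 * cov + C2)
    den = (mx * mx + my * my + C1) * (vx + vy + C2)
    return num / den

def reward(agent_rgb, reference_rgb, ok=True):
    """Average SSIM over R, G, B; failed renders score 0.0."""
    if not ok:
        return 0.0
    scores = [ssim_channel(a, r) for a, r in zip(agent_rgb, reference_rgb)]
    # Raw SSIM can be negative; clamping to [0, 1] is an assumption here.
    return max(0.0, min(1.0, fmean(scores)))
```

Each argument is a list of three channels, each a flat list of pixel values; identical inputs score 1.0 up to rounding.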

	Planned multi-component reward:

	```text
	R = G_compile * G_render * (
	    0.35 * appearance_match +
	    0.20 * temporal_stability +
	    0.20 * performance +
	    0.15 * robustness +
	    0.10 * code_quality
	) - step_penalty - regression_penalty
	```

	Component notes:

	- `appearance_match` — measured on hidden render conditions using DINOv2 cosine similarity (better than CLIP for texture/color/style fidelity) combined with pixel-level SSIM for structural accuracy. CLIP is appropriate only for text-conditioned task variants.
	- `temporal_stability` — requires rendering N consecutive frames and computing frame-to-frame SSIM. For v1, this may serve as a held-out evaluation metric rather than a dense training reward to keep per-episode compute manageable.
	- `performance` — frame-time as a reward is tractable headlessly. To avoid reward hacking (trivially simple shaders that are fast but visually wrong), this is gated as a hard budget first (penalize frames over a threshold) before adding a continuous score.
	- `robustness` — captures resolution changes, seed changes, and compiler portability.
	- `G_compile * G_render` — hard multiplicative gates, standard in code generation RL. These cause a zero-gradient problem early in training; mitigate with curriculum learning (start from partial or working shader skeletons) and/or soft penalties before hardening.
	- SFT warm-up is a prerequisite before RL. Adjacent work (RLRF for SVG) shows that RL directly on an instruction-tuned model without a domain SFT stage fails because the base model cannot generate renderable output reliably enough to produce reward variance.
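
The planned formula translates directly into a small function. Since this reward is not implemented yet, the sketch below is only one possible reading: the `planned_reward` name, the dictionary layout, and the assumption that each component is already normalized to [0, 1] are illustrative.

```python
# Weights copied from the planned reward formula.
WEIGHTS = {
    "appearance_match": 0.35,
    "temporal_stability": 0.20,
    "performance": 0.20,
    "robustness": 0.15,
    "code_quality": 0.10,
}

def planned_reward(compiled, rendered, components,
                   step_penalty=0.0, regression_penalty=0.0):
    """Gated weighted sum minus penalties; components are expected in [0, 1]."""
    # Hard multiplicative gates: either failure zeroes the weighted term.
    gate = float(bool(compiled)) * float(bool(rendered))
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return gate * weighted - step_penalty - regression_penalty
```

Because the gates multiply only the weighted sum, a shader that fails to compile still pays the step and regression penalties, so its reward can drop below zero.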

	## Action / Observation Contract

	```python
	class ShaderAction(Action):
	    code: str # Shadertoy-dialect GLSL fragment shader source

	class ShaderObservation(Observation):
	    task: str # ShaderToy ID
	    remaining: int # turns left in episode
	    reference_png: str # base64 PNG (non-empty on reset only)
	    compiled: bool
	    rendered: bool
	    errors: list[str]
	    agent_png: str # base64 PNG of agent's render
	    ssim: float # SSIM vs reference in [0, 1]
	    done: bool = False
	    reward: float | None = None
	```

	- Placing `reward` and `done` inside `Observation` follows the OpenEnv spec (RFC 002, Decision 2): rewards are computed inside the environment and returned as part of the observation; the server layer promotes them to the top-level `StepResponse`.
	- Detailed artifacts (frame dumps, profiler traces, held-out evaluation results) live behind tools rather than being inlined on every turn.
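
Since `reference_png` and `agent_png` arrive as base64 strings, a client usually decodes them before viewing or saving. The helpers below are a sketch: `decode_png` and `save_png` are hypothetical names, and the code assumes the fields hold plain base64 without a `data:` URI prefix.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file

def decode_png(field: str) -> bytes:
    """Decode a base64 image field and check that it looks like a PNG."""
    raw = base64.b64decode(field, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded payload is not a PNG")
    return raw

def save_png(field: str, path: str) -> None:
    """Write the decoded image to disk; empty fields are skipped."""
    if field:  # reference_png is empty on observations returned by step()
        with open(path, "wb") as fh:
            fh.write(decode_png(field))
```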

	## Training

	### Algorithm

	- GRPO as the baseline, with multi-turn extensions:
	  - MURPHY (NeurIPS 2025): for each rollout that does not reach maximum reward, execution feedback is appended and new rollouts are generated from that state. Up to 8% relative gain over single-turn GRPO.
	  - TRLOO (Dr. Kernel): addresses the biased policy gradient problem that vanilla GRPO has in multi-turn settings.
	- The episode schema (compile, render, reward, next turn) maps directly onto MURPHY's interaction loop.
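
For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same prompt. The sketch below shows that group normalization in its commonly described form; it is not taken from this project's training code, the `group_advantages` name is hypothetical, and implementations differ on details such as population versus sample standard deviation.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, libraries vary
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one reference image, scored by SSIM.
advantages = group_advantages([0.85, 0.88, 0.91, 0.90])

# A group where every shader fails to compile carries no learning signal.
dead_group = group_advantages([0.0, 0.0, 0.0, 0.0])
```

The second call returns all zeros, which is why a policy that rarely produces renderable output yields little reward variance for this kind of update.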

	### Libraries

	- veRL: supports GRPO, DAPO, GSPO; scales to large models; best throughput for long multi-turn rollouts
	- OpenRLHF-M: multimodal variant of OpenRLHF; targets VLM policies (code + rendered image inputs)

	Both have TRL and OpenEnv integrations.

	### Task Bank

	The task bank is populated from the shaders21k corpus (~16.8K single-pass shaders after filtering). RL training requires 10K-15K rollouts minimum based on adjacent work (CTRL-S: 14.4K, Reason-SVG: 10K).

	## Related Work

	| Work | Relevance |
	|------|-----------|
	| Shadertoy | Active public corpus and community; seed source for tasks and reference effects |
	| OpenEnv (Meta/PyTorch) | Target framework. Client-server RL environment with `Action`/`Observation` base classes, `reset()`/`step()` contract, and RFC 004 Rubric system. TRL, Unsloth, SkyRL, and Oumi integrations. |
	| ShaderEval (LLM4Code @ ICSE 2025) | Only formal benchmark for GLSL code generation evaluation. 467 functions, pixel-diff evaluation via `shadermatch`. |
	| shaders21k (NeurIPS 2022) | 21K OpenGL fragment shaders from Shadertoy for visual representation learning. Ready seed corpus for task generation. |
	| AI Co-Artist (arXiv:2512.08951) | Closest work to multi-turn LLM-driven GLSL refinement. GPT-4 + Picbreeder-style evolution, <3% compile error after retries. No RL. |
	| VLMaterial (ICLR 2025 Spotlight) | Fine-tunes a VLM for Blender material node graphs from images. Validates rendered-image similarity as a reward signal for code-generating policies. |
	| Dr. Kernel / KernelGYM (arXiv:2602.05885) | Multi-turn RL for GPU code generation (CUDA/Triton) with compile-correctness-speedup reward chain. Proposes TRLOO for multi-turn credit assignment. |
	| MURPHY (NeurIPS 2025) | Multi-turn GRPO for code generation. Canonical algorithm for the interaction loop used here. |
	| ProcMatRL (SIGGRAPH Asia 2024) | RL for procedural material parameter optimization. Validates RL applicability in the visual generation domain. |
	| ShadAR (arXiv:2602.17481) | LLM-driven real-time shader generation for AR. |
	| Procedural Shader Evolution (arXiv:2312.17587) | Interactive evolutionary algorithms for shader generation. Multi-turn refinement as iterative optimization. |

	## Licensing

	Shader code in the task bank is sourced from [ShaderToy](https://www.shadertoy.com/) via the [shaders21k](https://github.com/mbaradad/shaders21k) dataset. ShaderToy's default license is CC BY-NC-SA 3.0 — authors may choose a different license, but there is no structured per-shader license metadata in the dataset or the ShaderToy API.

	- The dataset is not redistributed in this repository; it is downloaded at build time
	- Individual shader provenance is tracked via the `source` field on each task (links back to the ShaderToy page)
	- For commercial use, per-shader license review is required

	## Next Steps

	- Plan the SFT warm-up dataset and training run (prerequisite for RL)
	- Extend reward beyond SSIM: DINOv2 appearance match, temporal stability, performance, robustness
	- Implement the reward plugin API (multi-component, pluggable)
	- Establish the hidden evaluation protocol across time, resolution, and seed variations
	- Add `shaderc` / `glslang` offline validation
	- Add support for multi-pass shaders and texture inputs

	## References

	### Primary

	- Shadertoy: <https://www.shadertoy.com/>
	- OpenEnv (Meta/PyTorch): <https://github.com/meta-pytorch/OpenEnv>
	- OpenEnv RFC 002: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/002-env-spec.md>
	- OpenEnv RFC 004: <https://github.com/meta-pytorch/OpenEnv/blob/main/rfcs/004-rubrics.md>
	- TRL OpenEnv integration: <https://huggingface.co/docs/trl/openenv>
	- ShaderEval (LLM4Code @ ICSE 2025): <https://conf.researchr.org/details/icse-2025/llm4code-2025-papers/13/Evaluating-Language-Models-for-Computer-Graphics-Code-Completion>
	- shadertoys-dataset: <https://github.com/Vipitis/shadertoys-dataset>
	- Shadereval-inputs (HuggingFace): <https://huggingface.co/datasets/Vipitis/Shadereval-inputs>
	- shaders21k (NeurIPS 2022): <https://arxiv.org/abs/2211.16412>
	- shaders21k dataset: <https://github.com/mbaradad/shaders21k>
	- AI Co-Artist: <https://arxiv.org/abs/2512.08951>
	- VLMaterial (ICLR 2025): <https://arxiv.org/abs/2501.18623>
	- VLMaterial code: <https://github.com/mit-gfx/VLMaterial>
	- Dr. Kernel / KernelGYM: <https://arxiv.org/abs/2602.05885>
	- MURPHY (NeurIPS 2025): <https://arxiv.org/abs/2511.07833>
	- RLRF (SVG RL): <https://arxiv.org/abs/2505.20793>

	### Tools and Infrastructure

	- `shaderc`: <https://github.com/google/shaderc>
	- `glslang`: <https://github.com/KhronosGroup/glslang>
	- `pygfx/shadertoy`: <https://github.com/pygfx/shadertoy>
	- moderngl: <https://github.com/moderngl/moderngl>
	- veRL: <https://github.com/volcengine/verl>
	- OpenRLHF-M: <https://github.com/OpenRLHF/OpenRLHF-M>
	- Adobe ProcMatRL: <https://github.com/adobe-research/ProcMatRL>

	### Specifications

	- WGSL specification: <https://www.w3.org/TR/WGSL/>
	- WebGPU: <https://webgpu.org/>
	- Khronos `glslang` reference: <https://www.khronos.org/opengles/sdk/Reference-Compiler/>

	### Supplementary

	- Procedural Shader Evolution: <https://arxiv.org/abs/2312.17587>
	- ShadAR: <https://arxiv.org/abs/2602.17481>
	- ProcMatRL paper (SIGGRAPH Asia 2024): <https://doi.org/10.1145/3687979>
	- Conditioned Procedural Materials (SIGGRAPH 2023): <https://dl.acm.org/doi/10.1145/3588432.3591520>
	- FragCoord.xyz: <https://fragcoord.xyz/>
	- naga GLSL front-end failures: <https://github.com/Vipitis/shadertoys-dataset/issues/15>