| license: mit | |
| language: | |
| - en | |
| pipeline_tag: audio-to-audio | |
| base_model: ACE-Step/Ace-Step1.5 | |
| tags: | |
| - audio | |
| - audio-to-audio | |
| - text-to-audio | |
| - music-generation | |
| - music | |
| - diffusion | |
| - streaming-diffusion | |
| - real-time | |
| - tensorrt | |
| - ace-step | |
| - lora | |
| - genai | |
| - "arxiv:2605.28657" | |
| # DEMON | |
| **Diffusion Engine for Musical Orchestrated Noise** | |
| DEMON is a real-time streaming diffusion engine for music generation and transformation on top of ACE-Step v1.5. | |
| It is **not a standalone base model**. DEMON is an engine/runtime layer that makes diffusion-based music generation interactive: source audio, prompts, LoRAs, denoise strength, guidance, per-frame curves, and other controls can be changed while generation is already running. | |
| Instead of waiting for one full diffusion generation to finish, DEMON keeps several generations in flight at once using a ring buffer. After warmup, finished latents stream out continuously, making diffusion feel playable. | |
| <p align="center"> | |
| <video controls playsinline preload="none" width="100%" poster="https://raw.githubusercontent.com/daydreamlive/DEMON/refs/heads/main/docs/assets/img/poster-hero.jpg"> | |
| <source src="https://pub-d1b39d9529fd4ef8b125f56716c53163.r2.dev/new_dub_hero.mp4" type="video/mp4"> | |
| Your browser does not support the video tag. Watch the demo at https://music.daydream.live | |
| </video> | |
| </p> | |
| ## Links | |
| - **Try the hosted demo:** https://music.daydream.live | |
| - **Paper:** https://arxiv.org/abs/2605.28657 | |
| - **Hugging Face Paper Page:** https://huggingface.co/papers/2605.28657 | |
| - **Project page:** https://daydreamlive.github.io/DEMON/ | |
| - **Code:** https://github.com/daydreamlive/DEMON | |
| - **Base model:** https://huggingface.co/ACE-Step/Ace-Step1.5 | |
| ## What is DEMON? | |
| DEMON turns ACE-Step v1.5 into a live, controllable streaming system. | |
| A standard diffusion pipeline usually works like this: | |
| ```text | |
| submit request → wait for all denoising steps → decode → hear result | |
| ``` | |
| DEMON instead works like this: | |
| ```text | |
| keep multiple generations in flight → advance them every tick → stream finished latents continuously | |
| ``` | |
| That means controls can become musical performance parameters instead of offline generation settings. | |
| ## Why this matters | |
| Diffusion models are powerful, but they are usually slow, one-shot systems. DEMON makes diffusion-based music generation responsive enough for live control. | |
| You can change parameters while the system is running, including prompt blends, LoRA strength, source preservation, denoise behavior, guidance, velocity scaling, channel guidance, and per-frame modulation curves. | |
| ## Highlights | |
| - **Streaming diffusion for ACE-Step v1.5** | |
| - **Ring-buffer scheduling** with multiple in-flight generations | |
| - **TensorRT acceleration** for low-latency ticks | |
| - **Hot-mutable controls** during generation | |
| - **Hot-resizable ring buffer depth** | |
| - **Per-frame modulation curves** at latent-frame resolution | |
| - **Heterogeneous slots**, where each in-flight generation can carry its own seed, denoise schedule, conditioning, CFG mode, curves, and masks | |
| - **Multi-condition compositing** | |
| - **LoRA hot-swapping** without rebuilding the TensorRT decoder engine | |
| - **Windowed VAE decoding** for low-latency audio updates | |
| - **Streaming output is bit-identical to batch output** | |
| ## What is in this Hugging Face repo? | |
| This Hugging Face page is the release/model card for DEMON. | |
| DEMON itself is an engine built around ACE-Step v1.5. The full source code, examples, demos, TensorRT build scripts, and documentation are available here: | |
| ```text | |
| https://github.com/daydreamlive/DEMON | |
| ``` | |
| If this Hugging Face repository is used as a release mirror, source paths mentioned below refer to the GitHub repository. | |
| If you are looking for the underlying ACE-Step v1.5 base model, see: | |
| ```text | |
| https://huggingface.co/ACE-Step/Ace-Step1.5 | |
| ``` | |
| ## Paper and technical notes | |
| - **DEMON paper:** https://arxiv.org/abs/2605.28657 | |
| - **FastOobleckDecoder / VAE distillation:** coming soon | |
| - **Latent Channel Semantics / 64-channel VAE characterization:** coming soon | |
| Links will be updated as companion artifacts are released. | |
| ## Tested hardware | |
| DEMON has been tested on: | |
| - NVIDIA RTX 3090 | |
| - NVIDIA RTX 4090 | |
| - NVIDIA RTX 5090 | |
| The headline performance numbers below are from an RTX 5090. | |
| ## Performance | |
| RTX 5090, ACE-Step v1.5 turbo, all-TensorRT, `depth=4`, `steps=8`, `vae_window=3s`, 60-second source. | |
| | Metric | Value | | |
| |---|---:| | |
| | Tick latency, decoder forward, depth=4 | ~43 ms | | |
| | Windowed VAE decode, 3 s | 4.5 ms | | |
| | Production throughput, depth=4 | 11.3 generations/s | | |
| | Per-frame control resolution | 25 Hz | | |
| | Streaming vs. batch quality | bit-identical output | | |
| The paper reports up to 12.3 decoder completions per second on a single RTX 5090, and 11.3 generations per second at production ring depth 4. | |
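As a rough cross-check on the table, the `depth / steps` throughput relation described later in this card can be combined with the measured tick latency. The sketch below is an idealized estimate under that assumption, not a benchmark. The function name is illustrative, and the estimate ignores VAE decode and scheduling overhead, which may be why the measured 11.3 generations/s sits slightly below it.

```python
def ideal_generations_per_second(depth: int, steps: int, tick_latency_s: float) -> float:
    """Idealized steady-state rate: (depth / steps) completions per tick."""
    completions_per_tick = depth / steps
    ticks_per_second = 1.0 / tick_latency_s
    return completions_per_tick * ticks_per_second


# Values from the table above: depth=4, steps=8, ~43 ms per tick.
estimate = ideal_generations_per_second(depth=4, steps=8, tick_latency_s=0.043)
print(f"{estimate:.1f} generations/s")  # prints "11.6 generations/s"
```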
| ## Requirements | |
| - Python 3.11 | |
| - NVIDIA GPU | |
| - ACE-Step v1.5 checkpoints in `checkpoints/` | |
| - Node.js 20+ if running the bundled web demo | |
| - TensorRT 10.16.x for TensorRT acceleration | |
| ACE-Step checkpoints are auto-downloaded on first run where supported. | |
| LoRAs are not auto-downloaded. Drop `.safetensors` LoRA files into: | |
| ```text | |
| $ACESTEP_MODELS_DIR/loras/ | |
| ``` | |
| By default this resolves to: | |
| ```text | |
| ~/.daydream-scope/models/demon/loras/ | |
| ``` | |
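To check which LoRA files would be visible before launching, the folder can be resolved by hand. This is a minimal sketch that assumes the lookup is simply `$ACESTEP_MODELS_DIR/loras/` with the default shown above; the helper names are illustrative and are not part of DEMON.

```python
import os
from pathlib import Path

# Default models root implied by the default LoRA path above (assumption).
DEFAULT_MODELS_DIR = "~/.daydream-scope/models/demon"


def lora_dir(environ=None) -> Path:
    """Resolve $ACESTEP_MODELS_DIR/loras/, falling back to the default root."""
    env = os.environ if environ is None else environ
    root = env.get("ACESTEP_MODELS_DIR") or DEFAULT_MODELS_DIR
    return Path(root).expanduser() / "loras"


def list_loras(environ=None) -> list[str]:
    """Return sorted .safetensors file names, or [] if the folder is missing."""
    folder = lora_dir(environ)
    if not folder.is_dir():
        return []
    return sorted(p.name for p in folder.glob("*.safetensors"))


print(lora_dir())
print(list_loras())
```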
| ## Setup | |
| Clone the source repository: | |
| ```bash | |
| git clone https://github.com/daydreamlive/DEMON.git | |
| cd DEMON | |
| ``` | |
| Install Python dependencies: | |
| ```bash | |
| uv sync | |
| ``` | |
| Run the main local demo: | |
| ```bash | |
| uv run python -u -m demos.realtime_motion_graph_web.run | |
| ``` | |
| Then open: | |
| ```text | |
| http://localhost:6660 | |
| ``` | |
| The launcher starts: | |
| ```text | |
| backend: http://localhost:1318 | |
| frontend: http://localhost:6660 | |
| ``` | |
| ## Hosted demo | |
| If you do not have a local GPU, or just want to try DEMON first, use the hosted instance: | |
| ```text | |
| https://music.daydream.live | |
| ``` | |
| ## Running with acceleration backends | |
| The DiT decoder and VAE can choose backends independently. | |
| Supported backend values: | |
| ```text | |
| tensorrt | |
| compile | |
| eager | |
| ``` | |
| Use TensorRT for the fastest path: | |
| ```bash | |
| uv run python -u -m demos.realtime_motion_graph_web.run -- --accel tensorrt | |
| ``` | |
| Use TensorRT for the decoder and eager mode for the VAE: | |
| ```bash | |
| uv run python -u -m demos.realtime_motion_graph_web.run -- \ | |
| --accel tensorrt \ | |
| --vae-accel eager | |
| ``` | |
| Use PyTorch compile mode if you do not want to build TensorRT engines yet: | |
| ```bash | |
| uv run python -u -m demos.realtime_motion_graph_web.run -- --accel compile | |
| ``` | |
| Recommended starting point: | |
| ```text | |
| TensorRT windowed VAE decoder + `compile` backend for the DiT decoder | |
| ``` | |
| The windowed VAE decoder is the cheapest TensorRT engine to build, is checkpoint- and duration-agnostic, and unlocks low-latency streaming decode. | |
| ## Building TensorRT engines | |
| DEMON targets TensorRT 10.16.x. | |
| TensorRT plans are version- and GPU-architecture-specific by default, so rebuild after changing TensorRT, CUDA, driver, or the GPU used for inference. | |
| Build the full matrix: | |
| ```bash | |
| uv run python -m acestep.engine.trt.build --all | |
| ``` | |
| Build 60-second engines only: | |
| ```bash | |
| uv run python -m acestep.engine.trt.build --all --duration 60 | |
| ``` | |
| Build only the windowed VAE decoder: | |
| ```bash | |
| uv run python -m acestep.engine.trt.build --vae-only --duration 60 | |
| ``` | |
| Preview what would be built: | |
| ```bash | |
| uv run python -m acestep.engine.trt.build --all --dry-run | |
| ``` | |
| Force rebuild: | |
| ```bash | |
| uv run python -m acestep.engine.trt.build --all --force-rebuild | |
| ``` | |
| Force rebuild and ONNX re-export: | |
| ```bash | |
| uv run python -m acestep.engine.trt.build --all --duration 60 --force-rebuild --force-onnx | |
| ``` | |
| TensorRT engine layout: | |
| ```text | |
| trt_engines/ | |
| _onnx/ | |
| vae_encode/vae_encode.onnx | |
| vae_decode/vae_decode.onnx | |
| decoder/decoder.onnx | |
| decoder_refit/decoder_refit.onnx | |
| decoder_mixed_refit_b8_60s/ | |
| decoder_mixed_refit_b8_60s.engine | |
| vae_decode_fp16_3to30s/ | |
| vae_decode_fp16_3to30s.engine | |
| ``` | |
| Pass engine paths to `Session` directly: | |
| ```python | |
| from acestep.engine.session import Session | |
| session = Session( | |
|     decoder_backend="tensorrt", | |
|     vae_backend="tensorrt", | |
|     vae_window=3.0, | |
|     trt_engines={ | |
|         "decoder": "trt_engines/decoder_mixed_refit_b8_60s/decoder_mixed_refit_b8_60s.engine", | |
|         "vae_encode": "trt_engines/vae_encode_fp16_60s/vae_encode_fp16_60s.engine", | |
|         "vae_decode": "trt_engines/vae_decode_fp16_3to30s/vae_decode_fp16_3to30s.engine", | |
|     }, | |
| ) | |
| ``` | |
| ## Core idea: ring-buffer streaming diffusion | |
| DEMON keeps several generations in flight at different denoising stages. | |
| Each tick advances active slots by one denoising step. After warmup, the system continuously emits completed latents. | |
| Throughput scales with: | |
| ```text | |
| depth / steps | |
| ``` | |
| For example: | |
| ```text | |
| depth = 4 | |
| steps = 8 | |
| ``` | |
| means each tick completes `depth / steps = 0.5` generations on average, so once the ring buffer is warm the engine emits one finished latent roughly every two ticks. | |
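The scheduling idea can be illustrated with a toy simulation. This is a sketch only: `Slot`, `simulate`, and the staggered admission rule are assumptions made for illustration, not DEMON's `StreamPipeline`. It shows that with `depth = 4` and `steps = 8`, a warm ring buffer finishes one generation every two ticks.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    request_id: int
    step: int = 0  # denoising steps completed so far


def simulate(depth: int, steps: int, ticks: int) -> list[tuple[int, int]]:
    """Return (tick, request_id) for every generation that finishes."""
    active: list[Slot] = []
    finished: list[tuple[int, int]] = []
    admitted = 0
    for tick in range(ticks):
        # Stagger admissions so slots sit at different denoising stages.
        if len(active) < depth and tick * depth >= admitted * steps:
            active.append(Slot(request_id=admitted))
            admitted += 1
        # One "batched forward pass": every active slot advances one step.
        for slot in active:
            slot.step += 1
        # Slots that reached the end of their schedule stream out.
        for slot in active:
            if slot.step == steps:
                finished.append((tick, slot.request_id))
        active = [slot for slot in active if slot.step < steps]
    return finished


finished = simulate(depth=4, steps=8, ticks=128)
gaps = [b[0] - a[0] for a, b in zip(finished, finished[1:])]
print(len(finished), set(gaps))  # 61 completions, always 2 ticks apart
```

The first completion arrives only after eight ticks of warmup, which is why the count over 128 ticks is slightly below `128 * depth / steps = 64`.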
| ## Ring buffer depth | |
| `pipeline_depth` controls how many generations are in flight. | |
| Higher depth: | |
| - smoother parameter sweeps | |
| - more slots in different denoising phases | |
| - higher throughput | |
| - more VRAM and per-tick compute | |
| - slower convergence for some submission-time changes | |
| Lower depth: | |
| - snappier control response | |
| - lower VRAM | |
| - lower per-tick compute | |
| - more discrete changes | |
| Depth can be changed while the system is running: | |
| ```python | |
| pipeline.set_depth(n) | |
| ``` | |
| Slots that are already in flight are not cancelled; they drain naturally by finishing their remaining denoising steps. | |
| ## Song duration and TensorRT profiles | |
| TensorRT engines are profile-specific. | |
| A 240-second engine reserves more workspace than a 60-second engine, even if the current workload is only 60 seconds. Build only the durations you need. | |
| Per-engine peak workspace measured in isolation on RTX 5090: | |
| | Component | 60s engine | 240s engine | Difference | | |
| |---|---:|---:|---:| | |
| | Decoder, refit | 13,511 MB | 15,911 MB | +2,400 MB | | |
| | VAE decode | 10,547 MB | 10,814 MB | +267 MB | | |
| | VAE encode | 4,178 MB | 10,614 MB | +6,436 MB | | |
| These are per-engine peaks captured in separate subprocesses, not a live-runtime sum. At inference time, the decoder peak dominates and the VAE workspaces do not peak alongside it, which is why the live demo fits on a 24 GB card. | |
| The comparison is what matters: switching the three engines from 240 seconds to 60 seconds frees about 9 GB. | |
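The "about 9 GB" figure follows directly from the table. A quick arithmetic check, using only the values above:

```python
# Per-engine peak workspace in MB, copied from the table above.
peaks_mb = {
    "decoder_refit": {"60s": 13_511, "240s": 15_911},
    "vae_decode": {"60s": 10_547, "240s": 10_814},
    "vae_encode": {"60s": 4_178, "240s": 10_614},
}

# Workspace saved per engine when moving from a 240 s to a 60 s profile.
savings_mb = {name: p["240s"] - p["60s"] for name, p in peaks_mb.items()}
total_mb = sum(savings_mb.values())

print(savings_mb)
print(f"total: {total_mb} MB (~{total_mb / 1000:.1f} GB)")  # total: 9103 MB (~9.1 GB)
```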
| ## VAE windowing | |
| When `vae_window > 0`, the VAE decodes overlapping time windows instead of the full-length latent. | |
| This is what unlocks low-latency streaming updates: only the requested window is decoded per call rather than the full latent. | |
| Set: | |
| ```text | |
| vae_window = 0 | |
| ``` | |
| to use full-length decode. | |
| Set: | |
| ```text | |
| vae_window = 3.0 | |
| ``` | |
| to decode three-second windows. | |
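To get a feel for how small a window is relative to the full latent, the sketch below maps a time window to a range of latent frames. It assumes 25 latent frames per second (taken from the 25 Hz per-frame control resolution in the performance table) and ignores any overlap or context padding the real decoder adds; `window_to_frames` is an illustrative helper, not a DEMON function.

```python
def window_to_frames(t_start: float, window_s: float, total_frames: int,
                     frame_rate: float = 25.0) -> tuple[int, int]:
    """Map a time window in seconds to a clamped [start, end) latent-frame range."""
    start = min(max(int(t_start * frame_rate), 0), total_frames)
    end = min(start + int(window_s * frame_rate), total_frames)
    return start, end


total = int(60 * 25.0)  # 60-second source at the assumed 25 frames/s -> 1500 frames
print(window_to_frames(0.0, 3.0, total))   # (0, 75)
print(window_to_frames(59.0, 3.0, total))  # (1475, 1500), clamped at the end
```

Under these assumptions a three-second window covers 75 of 1500 frames, or 5% of a 60-second latent.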
| ## Programmatic use: Session API | |
| The Session API is the main programmatic surface. | |
| It loads the model once and exposes the core primitives: | |
| ```text | |
| prepare_source | |
| encode_text | |
| generate | |
| decode | |
| stream | |
| apply_lora | |
| ``` | |
| Minimal skeleton: | |
| ```python | |
| from acestep.engine.session import Session | |
| from acestep.constants import TASK_INSTRUCTIONS | |
| session = Session( | |
|     decoder_backend="compile",  # "tensorrt", "compile", or "eager" | |
|     vae_backend="compile",      # "tensorrt", "compile", or "eager" | |
|     vae_window=3.0,             # 0 = full decode; >0 = windowed decode | |
| ) | |
| # Replace this with your own audio loading. | |
| audio = load_audio("source.wav") | |
| # Load audio, encode it, and extract semantic context. | |
| source = session.prepare_source(audio) | |
| # Encode text conditioning once and reuse it. | |
| cond = session.encode_text( | |
|     tags="deathstep death", | |
|     instruction=TASK_INSTRUCTIONS["cover"], | |
|     refer_latent=source.latent, | |
|     bpm=136, | |
|     duration=60.0, | |
|     key="G# minor", | |
| ) | |
| # Generate multiple variants cheaply after warmup. | |
| for seed in [1528, 9999, 42]: | |
|     latent = session.generate( | |
|         conditioning=cond, | |
|         context_latent=source.context_latent, | |
|         source_latent=source.latent, | |
|         seed=seed, | |
|     ) | |
|     audio_out = session.decode(latent) | |
|     # Replace this with your own audio saving. | |
|     save_audio(audio_out, f"out_{seed}.wav") | |
| ``` | |
| ## Streaming use | |
| Streaming wraps the same primitives in a `StreamHandle`: | |
| ```python | |
| handle = session.stream( | |
|     source=source, | |
|     conditioning=cond, | |
|     pipeline_depth=4, | |
| ) | |
| for tick in range(128): | |
|     latent = handle.tick() | |
|     if latent is not None: | |
|         audio = handle.decode( | |
|             latent, | |
|             t_start=0.0, | |
|         ) | |
| ``` | |
| Shared curve overrides bypass normal ring-buffer drain and take effect on the next tick: | |
| ```python | |
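| import torch | |
| # A scalar applies the same value to every frame. | |
| handle.pipeline.set_shared_curve("velocity_scale", 1.2) | |
| # A `[T]` tensor sets a per-frame curve; `[...]` stands for your own values. | |
| handle.pipeline.set_shared_curve("sde_denoise_curve", torch.tensor([...])) | |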
| ``` | |
| Revert a shared override: | |
| ```python | |
| handle.pipeline.set_shared_curve("velocity_scale", None) | |
| ``` | |
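The per-frame curve passed to `set_shared_curve` can be any shape over time. Below is a minimal, dependency-free sketch of building a linear ramp. It assumes the curve length is the source duration times 25 frames per second (from the 25 Hz per-frame control resolution reported under Performance); `linear_ramp` is an illustrative helper, not part of DEMON.

```python
def linear_ramp(num_frames: int, start: float, end: float) -> list[float]:
    """Return num_frames values moving linearly from start to end."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames == 1:
        return [start]
    step = (end - start) / (num_frames - 1)
    return [start + i * step for i in range(num_frames)]


num_frames = int(60 * 25)  # 60-second source at an assumed 25 frames/s
curve = linear_ramp(num_frames, 1.0, 1.2)
print(len(curve), curve[0], round(curve[-1], 6))  # 1500 1.0 1.2
```

The resulting list could then be wrapped with `torch.tensor(curve)` before being handed to `set_shared_curve`; the required dtype and device are not specified here.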
| ## Typed node graph | |
| DEMON exposes a typed node graph for building higher-level applications. | |
| The node graph contains composable operations for: | |
| - latent operations | |
| - audio operations | |
| - conditioning | |
| - curves | |
| - masks | |
| - solver controls | |
| - config | |
| - DCW correction | |
| - channel guidance | |
| The graph is wired through: | |
| ```text | |
| NodeDefinition | |
| NodePort | |
| NodeParam | |
| ``` | |
| Registration validates keyword arguments so applications can safely build on top of the same engine primitives. | |
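The value of validating keyword arguments at registration time can be shown with a toy registry. Everything in this sketch is hypothetical: `ToyParam`, `ToyNodeDefinition`, `register`, and their fields are stand-ins invented for illustration and do not mirror the real `NodeDefinition`, `NodePort`, or `NodeParam` signatures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyParam:
    name: str
    kind: type
    default: object = None


@dataclass(frozen=True)
class ToyNodeDefinition:
    name: str
    params: tuple = ()

    def bind(self, **kwargs) -> dict:
        """Reject unknown or mistyped keyword arguments, then fill defaults."""
        known = {p.name: p for p in self.params}
        unknown = sorted(set(kwargs) - set(known))
        if unknown:
            raise TypeError(f"{self.name}: unknown parameter(s) {unknown}")
        resolved = {p.name: p.default for p in self.params}
        for key, value in kwargs.items():
            if not isinstance(value, known[key].kind):
                raise TypeError(f"{self.name}.{key}: expected {known[key].kind.__name__}")
            resolved[key] = value
        return resolved


REGISTRY: dict = {}


def register(definition: ToyNodeDefinition) -> ToyNodeDefinition:
    """Add a definition to the registry, refusing duplicate names."""
    if definition.name in REGISTRY:
        raise ValueError(f"duplicate node name: {definition.name}")
    REGISTRY[definition.name] = definition
    return definition


scale = register(ToyNodeDefinition("ScaleCurve", (ToyParam("factor", float, 1.0),)))
print(scale.bind(factor=2.0))  # {'factor': 2.0}
```

With this kind of check, a typo such as `scale.bind(gain=2.0)` fails immediately with a `TypeError` instead of being silently ignored by a front end.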
| This means a CLI, notebook, VST, web demo, MCP tool, or custom protocol can drive the same underlying system. | |
| ## Engine features | |
| ### Streaming diffusion | |
| `StreamPipeline` maintains a ring buffer of in-flight generations. Each tick runs a batched decoder forward pass that advances active slots by one denoising step. | |
| When CFG is active, the engine runs positive and negative branches as needed. | |
| The decoder dispatches to TensorRT or PyTorch through the same code path. | |
| ### Heterogeneous slots | |
| Every in-flight slot carries its own `SlotRequest`. | |
| A slot can have its own: | |
| - seed | |
| - denoise strength | |
| - cached timestep schedule | |
| - source latent | |
| - per-frame curves | |
| - conditioning | |
| - CFG mode | |
| - x0 target | |
| - latent-noise mask | |
| A single ring buffer can mix different request types at the same time. | |
| For example, the same active buffer can contain: | |
| ```text | |
| denoise = 1.0 regeneration | |
| denoise = 0.5 style transfer | |
| RCFG-self request | |
| ``` | |
| and still batch them together in one forward pass. | |
| ### Scalar-or-curve modulation | |
| Many controls accept either a scalar or a `[T]` tensor. | |
| Supported scalar-or-curve controls include: | |
| - velocity scale | |
| - SDE re-noise | |
| - ODE noise injection | |
| - guidance scale | |
| - x0 target strength | |
| - x0 target curve | |
| - initial noise mix | |
| - APG momentum | |
| - CFG rescale | |
| - DCW scalers | |
| - condition temporal weights | |
| Values are canonicalized at the boundary so kernels see one consistent shape. | |
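A minimal sketch of what canonicalizing at the boundary can look like, using plain Python lists in place of tensors. The function name and error behavior are assumptions for illustration; DEMON's actual implementation is not shown in this card.

```python
from numbers import Real
from typing import Sequence, Union


def canonicalize_curve(value: Union[Real, Sequence[float]], num_frames: int) -> list[float]:
    """Expand a scalar to a [T] curve, or check that a curve already has length T."""
    if isinstance(value, Real):
        return [float(value)] * num_frames
    curve = [float(v) for v in value]
    if len(curve) != num_frames:
        raise ValueError(f"expected {num_frames} frames, got {len(curve)}")
    return curve


print(canonicalize_curve(1.2, 4))        # [1.2, 1.2, 1.2, 1.2]
print(canonicalize_curve([0, 1, 2], 3))  # [0.0, 1.0, 2.0]
```

Downstream code can then treat every control as a `[T]` curve without branching on scalar versus curve inputs.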
| ### Channel guidance | |
| Channel guidance applies a `[1, T, 64]` per-channel gain to `xt` before each forward pass. | |
| This has its own surface: | |
| ```python | |
| pipeline.set_channel_gain_tensor(...) | |
| ``` | |
| It is separate from normal `[T]` curve controls because it is both per-channel and per-frame. | |
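To make the shapes concrete, the sketch below applies a `[1, T, C]` gain to a `[B, T, C]` latent using nested lists, with `C = 2` instead of 64 for readability. It assumes the gain is a plain elementwise multiply broadcast over the batch; `apply_channel_gain` is an illustrative helper, not the DEMON kernel.

```python
def apply_channel_gain(xt, gain):
    """xt: [B][T][C] nested lists; gain: [1][T][C], broadcast over the batch."""
    per_frame = gain[0]
    return [
        [
            [value * per_frame[t][c] for c, value in enumerate(frame)]
            for t, frame in enumerate(sample)
        ]
        for sample in xt
    ]


xt = [[[1.0, 2.0], [3.0, 4.0]]]    # B=1, T=2, C=2
gain = [[[1.0, 0.5], [2.0, 0.0]]]  # boost frame 1 / channel 0, mute frame 1 / channel 1
print(apply_channel_gain(xt, gain))  # [[[1.0, 1.0], [6.0, 0.0]]]
```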
| ### Shared mutable curves | |
| Shared mutable curves override selected curve-shaped fields on every in-flight slot at once. | |
| Supported shared curve names include: | |
| ```text | |
| velocity_scale | |
| sde_denoise_curve | |
| ode_noise_curve | |
| apg_momentum | |
| x0_target_strength | |
| cfg_rescale_curve | |
| ``` | |
| Set a shared curve: | |
| ```python | |
| pipeline.set_shared_curve("velocity_scale", 1.2) | |
| ``` | |
| Revert to per-slot behavior: | |
| ```python | |
| pipeline.set_shared_curve("velocity_scale", None) | |
| ``` | |
| Shared mutable curves take effect on the next tick instead of waiting for new submissions to drain through the ring buffer. | |
| ### Multi-condition compositing | |
| Within a single slot, the decoder can run once per active condition and blend velocities per frame using `temporal_weight`. | |
| Conditions can be gated by step range. | |
| Typed entry points include: | |
| ```text | |
| ConditioningBlend | |
| ConditioningCombine | |
| ``` | |
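Per-frame blending with `temporal_weight` can be sketched as a weighted average of each condition's velocity. This is an illustration under assumptions: velocities are reduced to one scalar per frame, and the weights are normalized per frame, which the card does not state. `blend_velocities` is a hypothetical helper.

```python
def blend_velocities(velocities, temporal_weights):
    """Blend per-condition [T] velocities using matching [T] weights.

    Weights are normalized per frame (an assumption); frames whose weights
    sum to zero produce 0.0.
    """
    num_frames = len(velocities[0])
    blended = []
    for t in range(num_frames):
        total = sum(w[t] for w in temporal_weights)
        if total == 0:
            blended.append(0.0)
            continue
        mixed = sum(v[t] * w[t] for v, w in zip(velocities, temporal_weights))
        blended.append(mixed / total)
    return blended


v_a, v_b = [1.0, 1.0, 1.0], [3.0, 3.0, 3.0]
w_a, w_b = [1.0, 0.5, 0.0], [0.0, 0.5, 1.0]  # crossfade from condition A to B
print(blend_velocities([v_a, v_b], [w_a, w_b]))  # [1.0, 2.0, 3.0]
```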
| ### CFG modes | |
| DEMON supports three CFG modes: | |
| 1. **Standard CFG** | |
| Runs an unconditional forward pass every step. | |
| 2. **RCFG-initialize** | |
| Runs one unconditional forward pass per slot, then caches it for the rest of the schedule. | |
| 3. **RCFG-self** | |
| Runs zero unconditional forwards. The slot's initial noise stands in as the virtual unconditional velocity. | |
| All three modes support APG momentum and optional per-frame CFG rescale curves. | |
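The cost difference between the modes comes down to how many unconditional forward passes each slot needs. The sketch below counts them for a given schedule length; the mode strings are illustrative labels, not DEMON's configuration values.

```python
def unconditional_forwards(mode: str, steps: int) -> int:
    """Unconditional decoder forwards per slot, following the list above."""
    counts = {
        "standard": steps,     # one unconditional pass every step
        "rcfg_initialize": 1,  # one pass, cached for the rest of the schedule
        "rcfg_self": 0,        # initial noise stands in as the unconditional velocity
    }
    return counts[mode]


def total_forwards(mode: str, steps: int) -> int:
    """Conditional passes (one per step) plus unconditional passes."""
    return steps + unconditional_forwards(mode, steps)


for mode in ("standard", "rcfg_initialize", "rcfg_self"):
    print(mode, total_forwards(mode, steps=8))
# standard 16
# rcfg_initialize 9
# rcfg_self 8
```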
| ### Latent-noise-mask inpainting | |
| DEMON supports latent-noise-mask inpainting with two-sided x0 blending: | |
| - pre-blend on `xt` | |
| - post-blend on predicted `x0` | |
| This matches ComfyUI-style semantics and lets the decoder see correctly noised context in preserved regions. | |
| A per-step strength function can be used for progressive masking. | |
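The two blends can be sketched on flat lists. This is an illustration under stated assumptions, not DEMON's kernel: a mask value of 1 means "regenerate" and 0 means "preserve", and the source is noised with a linear flow-matching interpolation `(1 - t) * x0 + t * noise`. All function names are hypothetical.

```python
def noised(source, noise, t):
    """Assumed linear interpolation between clean source (t=0) and noise (t=1)."""
    return [(1.0 - t) * s + t * n for s, n in zip(source, noise)]


def pre_blend(xt, source, noise, mask, t):
    """Before the forward pass: preserved regions see correctly noised source."""
    context = noised(source, noise, t)
    return [m * x + (1.0 - m) * c for x, c, m in zip(xt, context, mask)]


def post_blend(x0_pred, source, mask):
    """After the forward pass: preserved regions snap back to the clean source."""
    return [m * p + (1.0 - m) * s for p, s, m in zip(x0_pred, source, mask)]


source, noise = [1.0, 1.0], [0.0, 2.0]
mask = [1.0, 0.0]  # regenerate frame 0, preserve frame 1
print(pre_blend([9.0, 9.0], source, noise, mask, t=0.5))  # [9.0, 1.5]
print(post_blend([7.0, 7.0], source, mask))               # [7.0, 1.0]
```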
| ### DCW post-step correction | |
| DEMON includes wavelet-domain sampler-side correction from Yu et al., ported from upstream ACE-Step v0.1.7. | |
| Supported modes: | |
| ```text | |
| low | |
| high | |
| double | |
| pix | |
| ``` | |
| Advanced controls include: | |
| ```text | |
| mult_blend | |
| mag_phase | |
| soft_thresh | |
| ``` | |
| With the advanced controls set to zero, the output is byte-identical to the upstream reference. | |
| Update DCW live: | |
| ```python | |
| pipeline.set_dcw(...) | |
| ``` | |
| ### Hot LoRA | |
| Register a LoRA directory once, then enable, set strength, or remove LoRAs without rebuilding the system. | |
| When the decoder runs in TensorRT mode, LoRA updates are applied through a refitter against the live engine. | |
| Supported LoRA lifecycle operations include: | |
| ```text | |
| register | |
| enable | |
| set_strength | |
| remove | |
| ``` | |
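What `set_strength` means for the weights can be sketched with the standard LoRA formulation, `W' = W + strength * (alpha / rank) * B @ A`. This is the textbook formula, assumed here for illustration; how DEMON's refitter actually applies it to the live engine is not shown in this card, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merged_weight(base, lora_b, lora_a, strength, alpha, rank):
    """Return base + strength * (alpha / rank) * (lora_b @ lora_a)."""
    scale = strength * alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(base_row, delta_row)]
        for base_row, delta_row in zip(base, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]  # shape [out, rank], rank = 1
lora_a = [[3.0, 4.0]]    # shape [rank, in]
print(merged_weight(base, lora_b, lora_a, strength=0.5, alpha=1.0, rank=1))
# [[2.5, 2.0], [3.0, 5.0]]
```

At `strength=0.0` the merged weight equals the base weight, so in this formulation disabling a LoRA restores the original weights.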
| ### TensorRT acceleration | |
| DEMON can accelerate the DiT decoder, VAE encode, and VAE decode independently. | |
| | Component | Backend | Notes | | |
| |---|---|---| | |
| | Decoder | `tensorrt` | Fastest path. Requires a built decoder engine for the target duration and checkpoint. Refit-enabled engines support LoRA swaps. | | |
| | Decoder | `compile` | Uses `torch.compile`. Long warmup, no TensorRT engine required. | | |
| | Decoder | `eager` | Plain PyTorch. Useful for debugging. | | |
| | VAE encode/decode | `tensorrt` | Fastest VAE path. Windowed decode engine is reused across durations. | | |
| | VAE encode/decode | `compile` | Uses `torch.compile`. | | |
| | VAE encode/decode | `eager` | Plain PyTorch. Useful for debugging. | | |
| ## Demo applications | |
| DEMON ships a flagship reference application plus focused examples. | |
| ### Realtime motion graph web demo | |
| Run: | |
| ```bash | |
| uv run python -u -m demos.realtime_motion_graph_web.run | |
| ``` | |
| Then open: | |
| ```text | |
| http://localhost:6660 | |
| ``` | |
| The web demo lets you: | |
| - feed source audio | |
| - write prompts | |
| - blend two prompts live | |
| - change denoise strength | |
| - hot-swap LoRAs | |
| - blend timbre and structure references | |
| - draw automation curves | |
| - map MIDI controls | |
| - record output | |
| - drive the system through the onboard MCP server | |
| Forward backend flags after `--`: | |
| ```bash | |
| uv run python -u -m demos.realtime_motion_graph_web.run -- --accel tensorrt | |
| uv run python -u -m demos.realtime_motion_graph_web.run -- --checkpoint xl | |
| ``` | |
| ### Other examples | |
| ```text | |
| examples/session_demo.py | |
| examples/realtime_cover.py | |
| examples/covers/ | |
| demos/test_stream_cover_graph.py | |
| ``` | |
| Feature examples include: | |
| | Script | Feature | | |
| |---|---| | |
| | `cover_basic.py` | Standard cover pipeline | | |
| | `prompt_blend.py` | Two prompts blended with a temporal curve | | |
| | `sde_denoise_curve.py` | Per-frame SDE re-noise modulation | | |
| | `velocity_scaling.py` | Per-frame transformation rate control | | |
| | `lora_generation.py` | LoRA-conditioned generation | | |
| | `x0_target_blend.py` | Two-pass morphing toward a target latent | | |
| | `conditioning_average.py` | Fuse two conditionings | | |
| | `guidance_curve.py` | Per-frame CFG scale | | |
| | `latent_noise_mask.py` | Latent-space inpainting | | |
| | `initial_noise_curve.py` | Per-frame noise/source init mix | | |
| | `ode_noise_injection.py` | Stochastic ODE step | | |
| | `cover_semantic_blend.py` | Blend semantic hints from two sources | | |
| | `x0_target_from_reference.py` | Pre-generate a target latent, then morph toward it | | |
| ## Tests | |
| Run: | |
| ```bash | |
| uv run pytest tests/ -v | |
| ``` | |
| ## Limitations | |
| - DEMON requires an NVIDIA GPU for practical real-time use. | |
| - TensorRT engines are GPU-, CUDA-, driver-, and TensorRT-version specific. | |
| - TensorRT engines may need to be rebuilt locally. | |
| - Real-time performance depends on GPU, song duration, ring-buffer depth, denoising steps, VAE window size, and selected backend. | |
| - DEMON is an engine around ACE-Step v1.5, not a replacement for the ACE-Step base model. | |
| - Users should review the ACE-Step model card, license, and usage notes before deploying systems built on DEMON. | |
| - Generated audio quality depends on the source audio, prompts, LoRAs, schedule settings, and backend configuration. | |
| ## Responsible use | |
| DEMON is intended for creative music generation, live performance, research, prototyping, and audio experimentation. | |
| Users are responsible for ensuring they have the rights to any source audio, prompts, LoRAs, datasets, or generated outputs they use, especially for commercial use. | |
| Do not use DEMON to imitate or misuse the identity, voice, likeness, or creative work of others without appropriate rights or consent. | |
| ## Citation | |
| If you use DEMON in your work, please cite the DEMON paper: | |
| ```bibtex | |
| @misc{fosdick2026demon, | |
| title={DEMON: Diffusion Engine for Musical Orchestrated Noise}, | |
| author={Fosdick, Ryan}, | |
| year={2026}, | |
| eprint={2605.28657}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.SD}, | |
| url={https://arxiv.org/abs/2605.28657} | |
| } | |
| ``` | |
| Please also cite ACE-Step when appropriate, since DEMON is built on top of ACE-Step v1.5. | |
| ## Acknowledgments | |
| DEMON is built on top of ACE-Step. | |
| The base diffusion model, VAE, text encoder, and 5 Hz language model are ACE-Step's work. Without the ACE-Step team releasing the v1.5 weights and code under MIT, DEMON would not exist. | |
| Thank you to the ACE-Step team for making this work possible. | |
| ## Authors | |
| DEMON was originally created by Ryan Fosdick. | |
| Maintained by Daydream Live and contributors. | |
| - Ryan Fosdick: https://ryanontheinside.com | |
| - Daydream Live: https://daydream.live |