---
license: other
license_name: circlestone-labs-non-commercial-license
base_model:
- circlestone-labs/Anima
pipeline_tag: text-to-image
library_name: diffusers
tags:
- diffusers
- safetensors
- sdnq
- anima
- cosmos
- text-to-image
- uint4
---
# Anima Preview 3 SDNQ UINT4 Diffusers Checkpoint
4-bit uint4 static SDNQ quantization of the Anima Preview 3 diffusion transformer, packaged as a full Diffusers pipeline. This is the smallest checkpoint in the comparison below, with the lowest VRAM footprint; the companion checkpoints are listed in the benchmark table.
This repository is a separate full Diffusers checkpoint for `circlestone-labs/Anima` Preview 3. The pipeline code and non-transformer components are based on the public Diffusers conversion `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers`. The `transformer/` component is the WaveCut SDNQ-quantized diffusion transformer converted from `WaveCut/Anima-Preview-3-SDNQ-uint4`.
## Components
- `transformer/`: SDNQ `uint4` quantized `CosmosTransformer3DModel` (loadable on its own; see the sketch after this list).
- `llm_adapter/`: Anima LLM adapter required by the native Anima architecture.
- `text_encoder/`: Qwen3 0.6B text encoder from the Diffusers conversion.
- `tokenizer/` and `t5_tokenizer/`: Qwen and T5 tokenizers used by the adapter pathway.
- `vae/`: Qwen Image / Wan-style VAE used by Anima.
- `scheduler/`: `FlowMatchEulerDiscreteScheduler` with shift 3.0.
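If you only need the quantized transformer, for example to wire it into your own pipeline, loading the subfolder directly should work along the lines below. This is a minimal sketch, assuming the SDNQ quantizer registered by `import sdnq` handles deserialization of the pre-quantized weights:
```python
import torch
import sdnq  # registers SDNQ support; must be imported before loading quantized weights

from diffusers import CosmosTransformer3DModel

# Load only the uint4-quantized transformer from this repo's transformer/ subfolder.
transformer = CosmosTransformer3DModel.from_pretrained(
    "WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers",
    subfolder="transformer",
    torch_dtype=torch.bfloat16,
)
```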
## Usage
Install recent `diffusers` and `transformers` releases plus the `sdnq` package, then load the pipeline:
```python
import torch
import sdnq  # registers SDNQ quantized layers with Diffusers

from diffusers import DiffusionPipeline

# custom_pipeline="pipeline" plus trust_remote_code=True loads pipeline.py from this repo.
pipe = DiffusionPipeline.from_pretrained(
    "WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers",
    custom_pipeline="pipeline",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

prompt = "masterpiece, best quality, score_7, safe, 1girl, fern (sousou no frieren), purple hair, purple eyes, black robe, white dress, butterfly on hand, simple background, looking at viewer"
negative_prompt = "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, artist name"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=30,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(424242),
).images[0]
image.save("anima_uint4.png")
```
Because the Anima pipeline is custom code, pass `custom_pipeline="pipeline"`; `trust_remote_code=True` allows Diffusers to load `pipeline.py` from this repo.
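If the default `.to("cuda")` placement is still too heavy for your GPU, the standard Diffusers offloading hook should apply here as with any `DiffusionPipeline` subclass. This is a hedged sketch and has not been benchmarked for this repo:
```python
# Instead of pipe.to("cuda"): keep components in system RAM and move each one
# to the GPU only while it runs, trading generation speed for a lower VRAM floor.
pipe.enable_model_cpu_offload()
```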
## Prompting
Anima was trained on Danbooru-style tags, natural-language captions, and mixtures of both. The upstream Anima Preview 3 card recommends generating at roughly 1 MP, for example `1024x1024`, `896x1152`, or `1152x896`, with 30-50 steps and CFG 4-5.
Recommended positive prefix:
```text
masterpiece, best quality, score_7, safe,
```
Recommended negative prompt:
```text
worst quality, low quality, score_1, score_2, score_3, artist name
```
Use lowercase tags with spaces instead of underscores, except score tags such as `score_7`. For artist tags, prefix the artist with `@`.
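Putting these rules together, a complete positive prompt might look like the line below; `@artist_name` is a placeholder for a real Danbooru artist tag, not a literal tag:
```text
masterpiece, best quality, score_7, safe, @artist_name, 1girl, long hair, school uniform, cherry blossoms, looking at viewer
```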
## 1024x1024 Comparison Grid
Five prompt/seed pairs were generated with the original BF16 Diffusers checkpoint, this UINT4 checkpoint, and the companion INT8 checkpoint. The source JPEG is `3572x5576`; every generated cell is exactly `1024x1024` and pasted 1:1 with no resizing.
![Anima Original BF16 vs SDNQ UINT4 and INT8 1024x1024 grid](images/anima_original_uint4_int8_grid_5x3_1024x1024_1to1.jpg)
Prompt IDs and seeds are printed in the left column of the grid. Raw benchmark data is available in [`benchmarks/benchmark_results_1024.json`](benchmarks/benchmark_results_1024.json).
## Benchmark
Measured on an RTX 5090 32GB with `torch 2.8.0+cu128`, `diffusers 0.38.0`, `transformers 5.8.1`, `sdnq 0.1.8`, `torch.bfloat16`, 24 steps, CFG 4.0, and 1024x1024 output. Network download is excluded. Each model was loaded in a separate process; one 1024x1024 warm-up image was discarded, then five prompt/seed pairs were measured. VRAM was sampled with `nvidia-smi` every 50 ms.
| Model | Repo | Size | Load time | Mean generation | Speed vs original | VRAM after load | Peak VRAM while generating |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Original BF16 | `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers` | 5.3 GiB | 10.04s | 6.37s/img | 1.00x | 6005 MiB | 10759 MiB |
| SDNQ UINT4 | `WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers` | 2.7 GiB (-49.1%) | 11.96s | 6.13s/img | 1.04x (+3.9%) | 3285 MiB (-45.3%) | 8157 MiB (-24.2%) |
| SDNQ INT8 | `WaveCut/Anima-Preview-3-SDNQ-int8-diffusers` | 3.5 GiB (-34.1%) | 22.41s | 4.60s/img | 1.38x (+38.4%) | 4111 MiB (-31.5%) | 8961 MiB (-16.7%) |
Quant-to-quant tradeoff in this run: UINT4 is 22.7% smaller than INT8, uses 826 MiB less VRAM after load, and peaks 804 MiB lower while generating; INT8 is 1.33x faster than UINT4 on this RTX 5090 setup.
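For reference, the 50 ms `nvidia-smi` sampling described above can be reproduced with a small polling loop along these lines. This is an illustrative sketch, not the exact harness used to produce the numbers in the table:
```python
import subprocess
import threading
import time

def sample_vram(samples, stop, interval=0.05, gpu=0):
    """Poll nvidia-smi for used VRAM (MiB) on one GPU until stop is set."""
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", f"--id={gpu}",
             "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        )
        samples.append(int(out.stdout.strip()))
        time.sleep(interval)

samples, stop = [], threading.Event()
sampler = threading.Thread(target=sample_vram, args=(samples, stop), daemon=True)
sampler.start()
# ... run pipe(...) here ...
stop.set()
sampler.join()
print(f"peak VRAM while generating: {max(samples)} MiB")
```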
## Notes
The original Anima split checkpoint is a ComfyUI-native model with a Qwen3 text encoder and a learned LLM adapter. Earlier transformer-only exports that load the checkpoint directly as `CosmosTransformer3DModel` ignore the `llm_adapter.*` weights; this repo keeps the adapter and full pipeline structure so generation follows the Anima architecture.
License follows the upstream Anima/CircleStone non-commercial license and the NVIDIA Cosmos derivative terms referenced by the upstream model card.