---
license: other
license_name: circlestone-labs-non-commercial-license
base_model:
- circlestone-labs/Anima
pipeline_tag: text-to-image
library_name: diffusers
tags:
- diffusers
- safetensors
- sdnq
- anima
- cosmos
- text-to-image
- uint4
---
# Anima Preview 3 SDNQ UINT4 Diffusers Checkpoint
4-bit `uint4` static SDNQ quantization of the Anima Preview 3 diffusion transformer, packaged as a full Diffusers pipeline. Of the checkpoints compared below, this is the smallest and has the lowest VRAM footprint; the companion checkpoints are listed in the benchmark table.
This repository is a separate full Diffusers checkpoint for `circlestone-labs/Anima` Preview 3. The pipeline code and non-transformer components are based on the public Diffusers conversion `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers`. The `transformer/` component is the WaveCut SDNQ-quantized diffusion transformer converted from `WaveCut/Anima-Preview-3-SDNQ-uint4`.
## Components
- `transformer/`: SDNQ `uint4` quantized `CosmosTransformer3DModel`.
- `llm_adapter/`: Anima LLM adapter required by the native Anima architecture.
- `text_encoder/`: Qwen3 0.6B text encoder from the Diffusers conversion.
- `tokenizer/` and `t5_tokenizer/`: Qwen and T5 tokenizers used by the adapter pathway.
- `vae/`: Qwen Image / Wan-style VAE used by Anima.
- `scheduler/`: `FlowMatchEulerDiscreteScheduler` with shift 3.0.
## Usage
Install current Diffusers/Transformers plus SDNQ support, then load the pipeline:
```python
import torch
import sdnq
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers",
    custom_pipeline="pipeline",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")
prompt = "masterpiece, best quality, score_7, safe, 1girl, fern (sousou no frieren), purple hair, purple eyes, black robe, white dress, butterfly on hand, simple background, looking at viewer"
negative_prompt = "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, artist name"
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=30,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(424242),
).images[0]
```
Because the Anima pipeline is custom code, pass `custom_pipeline="pipeline"`; `trust_remote_code=True` allows Diffusers to load `pipeline.py` from this repo.
## Prompting
Anima was trained on Danbooru-style tags, natural language captions, and mixtures of both. The upstream Anima Preview 3 card recommends generating at roughly 1 megapixel, for example `1024x1024`, `896x1152`, or `1152x896`, with about 30-50 steps and CFG 4-5.
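All of the recommended resolutions sit close to 1 megapixel. A quick sanity check (the divisibility-by-64 constraint is an assumption common to latent-diffusion models, not stated on the upstream card):

```python
# Check that the recommended Anima resolutions are close to 1 megapixel
# (1024 * 1024 = 1,048,576 pixels) and that each dimension is divisible
# by 64, a constraint typical of latent-diffusion VAEs (assumed here).
ONE_MEGAPIXEL = 1024 * 1024

resolutions = [(1024, 1024), (896, 1152), (1152, 896)]

for w, h in resolutions:
    pixels = w * h
    assert w % 64 == 0 and h % 64 == 0
    # Each option is within ~2% of 1 MP.
    assert abs(pixels - ONE_MEGAPIXEL) / ONE_MEGAPIXEL < 0.02
    print(f"{w}x{h}: {pixels / 1e6:.2f} MP")
```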
Recommended positive prefix:
```text
masterpiece, best quality, score_7, safe,
```
Recommended negative prompt:
```text
worst quality, low quality, score_1, score_2, score_3, artist name
```
Use lowercase tags with spaces instead of underscores, except for score tags such as `score_7`, which keep their underscores. For artist tags, prefix the artist name with `@`.
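These tag conventions can be encoded in a small helper. `normalize_tag` below is a hypothetical utility, not part of this repo; it only sketches the rules just described:

```python
import re

def normalize_tag(tag: str) -> str:
    """Apply the Anima prompting conventions to one Danbooru-style tag.

    Hypothetical helper (not shipped with this repo):
    - score tags such as ``score_7`` keep their underscores
    - artist tags prefixed with ``@`` pass through unchanged
    - everything else is lowercased with underscores turned into spaces
    """
    tag = tag.strip()
    if tag.startswith("@") or re.fullmatch(r"score_\d+", tag):
        return tag
    return tag.lower().replace("_", " ")

tags = ["Sousou_no_Frieren", "score_7", "@some_artist", "purple_hair"]
print(", ".join(normalize_tag(t) for t in tags))
# → sousou no frieren, score_7, @some_artist, purple hair
```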
## 1024x1024 Comparison Grid
Five prompt/seed pairs were generated with the original BF16 Diffusers checkpoint, this UINT4 checkpoint, and the companion INT8 checkpoint. The source JPEG is `3572x5576`; every generated cell is exactly `1024x1024` and pasted 1:1 with no resizing.

Prompt IDs and seeds are printed in the left column of the grid. Raw benchmark data is available in [`benchmarks/benchmark_results_1024.json`](benchmarks/benchmark_results_1024.json).
## Benchmark
Measured on an RTX 5090 32GB with `torch 2.8.0+cu128`, `diffusers 0.38.0`, `transformers 5.8.1`, `sdnq 0.1.8`, `torch.bfloat16`, 24 steps, CFG 4.0, and 1024x1024 output. Network download is excluded. Each model was loaded in a separate process; one 1024x1024 warm-up image was discarded, then five prompt/seed pairs were measured. VRAM was sampled with `nvidia-smi` every 50 ms.
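The VRAM sampling described above can be approximated with a small background poller. This is a sketch of the approach, not the benchmark harness actually used; it assumes `nvidia-smi` is on `PATH` and samples GPU 0:

```python
import subprocess
import threading
import time

def parse_vram_mib(output: str) -> int:
    """Parse the first line of nvidia-smi's noheader/nounits CSV output."""
    return int(output.strip().splitlines()[0])

def query_vram_mib() -> int:
    """Query used VRAM of GPU 0 in MiB via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "--id=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_mib(out)

class VramSampler:
    """Poll used VRAM every `interval` seconds on a background thread."""

    def __init__(self, interval: float = 0.05):
        self.interval = interval
        self.samples: list[int] = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.samples.append(query_vram_mib())
            time.sleep(self.interval)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()

# Usage (requires a GPU):
#   with VramSampler() as s:
#       image = pipe(prompt=prompt).images[0]
#   peak_mib = max(s.samples)
```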
| Model | Repo | Size | Load time | Mean generation | Speed vs original | VRAM after load | Peak VRAM while generating |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Original BF16 | `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers` | 5.3 GiB | 10.04s | 6.37s/img | 1.00x | 6005 MiB | 10759 MiB |
| SDNQ UINT4 | `WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers` | 2.7 GiB (-49.1%) | 11.96s | 6.13s/img | 1.04x (+3.9%) | 3285 MiB (-45.3%) | 8157 MiB (-24.2%) |
| SDNQ INT8 | `WaveCut/Anima-Preview-3-SDNQ-int8-diffusers` | 3.5 GiB (-34.1%) | 22.41s | 4.60s/img | 1.38x (+38.4%) | 4111 MiB (-31.5%) | 8961 MiB (-16.7%) |
Quant-to-quant tradeoff in this run: UINT4 is 22.7% smaller than INT8 and uses 826 MiB less VRAM after load plus 804 MiB less peak generation VRAM. INT8 is 1.33x faster than UINT4 on this RTX 5090 setup.
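The deltas in that comparison can be reproduced from the table values. Note the size percentage below uses the rounded GiB figures, so it lands near, rather than exactly on, the reported 22.7%:

```python
# Reproduce the quant-to-quant deltas from the benchmark table above.
uint4 = {"size_gib": 2.7, "gen_s": 6.13, "load_mib": 3285, "peak_mib": 8157}
int8 = {"size_gib": 3.5, "gen_s": 4.60, "load_mib": 4111, "peak_mib": 8961}

size_saving = (int8["size_gib"] - uint4["size_gib"]) / int8["size_gib"]
print(f"UINT4 is {size_saving:.1%} smaller than INT8")  # ≈22.9% from rounded sizes
print(f"{int8['load_mib'] - uint4['load_mib']} MiB less VRAM after load")  # 826
print(f"{int8['peak_mib'] - uint4['peak_mib']} MiB less peak VRAM")  # 804
print(f"INT8 is {uint4['gen_s'] / int8['gen_s']:.2f}x faster per image")  # 1.33
```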
## Notes
The original Anima split checkpoint is a ComfyUI-native model with a Qwen3 text encoder and a learned LLM adapter. Earlier transformer-only exports that load the checkpoint directly as `CosmosTransformer3DModel` ignore the `llm_adapter.*` weights; this repo keeps the adapter and full pipeline structure so generation follows the Anima architecture.
License follows the upstream Anima/CircleStone non-commercial license and the NVIDIA Cosmos derivative terms referenced by the upstream model card.