---
license: other
license_name: circlestone-labs-non-commercial-license
base_model:
- circlestone-labs/Anima
pipeline_tag: text-to-image
library_name: diffusers
tags:
- diffusers
- safetensors
- sdnq
- anima
- cosmos
- text-to-image
- uint4
---

# Anima Preview 3 SDNQ UINT4 Diffusers Checkpoint

4-bit uint4 static SDNQ quantization of the Anima Preview 3 diffusion transformer, packaged as a full Diffusers pipeline. This is the smallest checkpoint, with the lowest VRAM footprint, in this comparison; the companion checkpoints are listed in the benchmark table below.

This repository is a separate full Diffusers checkpoint for `circlestone-labs/Anima` Preview 3. The pipeline code and non-transformer components are based on the public Diffusers conversion `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers`. The `transformer/` component is the SDNQ-quantized diffusion transformer converted from `WaveCut/Anima-Preview-3-SDNQ-uint4`.

## Components

- `transformer/`: SDNQ `uint4` quantized `CosmosTransformer3DModel`.
- `llm_adapter/`: Anima LLM adapter required by the native Anima architecture.
- `text_encoder/`: Qwen3 0.6B text encoder from the Diffusers conversion.
- `tokenizer/` and `t5_tokenizer/`: Qwen and T5 tokenizers used by the adapter pathway.
- `vae/`: Qwen Image / Wan-style VAE used by Anima.
- `scheduler/`: `FlowMatchEulerDiscreteScheduler` with shift 3.0.
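For illustration, a local snapshot of this repo can be sanity-checked against the component list above. `missing_components` is a hypothetical helper written for this card, not part of the checkpoint:

```python
from pathlib import Path

# Component subfolders this checkpoint ships, per the list above.
EXPECTED_COMPONENTS = {
    "transformer", "llm_adapter", "text_encoder",
    "tokenizer", "t5_tokenizer", "vae", "scheduler",
}

def missing_components(snapshot_dir: str) -> set[str]:
    """Return expected component folders absent from a local snapshot."""
    present = {p.name for p in Path(snapshot_dir).iterdir() if p.is_dir()}
    return EXPECTED_COMPONENTS - present
```

An empty result means all component folders are present; a non-empty result usually indicates an incomplete download.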
## Usage

Install current Diffusers/Transformers plus SDNQ support, then load the pipeline:

```python
import torch
import sdnq  # registers the SDNQ quantization backend with Diffusers
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers",
    custom_pipeline="pipeline",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).to("cuda")

prompt = "masterpiece, best quality, score_7, safe, 1girl, fern (sousou no frieren), purple hair, purple eyes, black robe, white dress, butterfly on hand, simple background, looking at viewer"
negative_prompt = "worst quality, low quality, score_1, score_2, score_3, blurry, jpeg artifacts, artist name"

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=1024,
    height=1024,
    num_inference_steps=30,
    guidance_scale=4.0,
    generator=torch.Generator(device="cuda").manual_seed(424242),
).images[0]
```

Because the Anima pipeline is custom code, pass `custom_pipeline="pipeline"`; `trust_remote_code=True` allows Diffusers to load `pipeline.py` from this repo.

## Prompting

Anima was trained on Danbooru-style tags, natural-language captions, and mixtures of both. The upstream Anima Preview 3 card recommends generating at about 1 MP, for example `1024x1024`, `896x1152`, or `1152x896`, with roughly 30-50 steps and CFG 4-5.

Recommended positive prefix:

```text
masterpiece, best quality, score_7, safe,
```

Recommended negative prompt:

```text
worst quality, low quality, score_1, score_2, score_3, artist name
```

Use lowercase tags with spaces instead of underscores, except for score tags such as `score_7`. For artist tags, prefix the artist name with `@`.

## 1024x1024 Comparison Grid

Five prompt/seed pairs were generated with the original BF16 Diffusers checkpoint, this UINT4 checkpoint, and the companion INT8 checkpoint. The source JPEG is `3572x5576`; every generated cell is exactly `1024x1024` and pasted 1:1 with no resizing.
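As a small illustration of the tag conventions above (lowercase, spaces instead of underscores, score tags kept verbatim, `@`-prefixed artists), a hypothetical `normalize_tag`/`build_prompt` helper pair might look like this; neither function is part of the pipeline:

```python
def normalize_tag(tag: str) -> str:
    """Lowercase a Danbooru-style tag and replace underscores with spaces.

    Score tags (e.g. "score_7") and @-prefixed artist tags are assumed
    to be kept verbatim, per the conventions above.
    """
    tag = tag.strip().lower()
    if tag.startswith("score_") or tag.startswith("@"):
        return tag
    return tag.replace("_", " ")

def build_prompt(tags: list[str]) -> str:
    """Prepend the recommended positive prefix and join normalized tags."""
    prefix = "masterpiece, best quality, score_7, safe"
    return ", ".join([prefix] + [normalize_tag(t) for t in tags])
```

For example, `build_prompt(["1girl", "Purple_Hair"])` yields `"masterpiece, best quality, score_7, safe, 1girl, purple hair"`.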
![Anima Original BF16 vs SDNQ UINT4 and INT8 1024x1024 grid](images/anima_original_uint4_int8_grid_5x3_1024x1024_1to1.jpg)

Prompt IDs and seeds are printed in the left column of the grid. Raw benchmark data is available in [`benchmarks/benchmark_results_1024.json`](benchmarks/benchmark_results_1024.json).

## Benchmark

Measured on an RTX 5090 32GB with `torch 2.8.0+cu128`, `diffusers 0.38.0`, `transformers 5.8.1`, `sdnq 0.1.8`, `torch.bfloat16`, 24 steps, CFG 4.0, and 1024x1024 output. Network download time is excluded. Each model was loaded in a separate process; one 1024x1024 warm-up image was discarded, then five prompt/seed pairs were measured. VRAM was sampled with `nvidia-smi` every 50 ms.

| Model | Repo | Size | Load time | Mean generation | Speed vs original | VRAM after load | Peak VRAM while generating |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| Original BF16 | `CalamitousFelicitousness/Anima-Preview-3-sdnext-diffusers` | 5.3 GiB | 10.04 s | 6.37 s/img | 1.00x | 6005 MiB | 10759 MiB |
| SDNQ UINT4 | `WaveCut/Anima-Preview-3-SDNQ-uint4-diffusers` | 2.7 GiB (-49.1%) | 11.96 s | 6.13 s/img | 1.04x (+3.9%) | 3285 MiB (-45.3%) | 8157 MiB (-24.2%) |
| SDNQ INT8 | `WaveCut/Anima-Preview-3-SDNQ-int8-diffusers` | 3.5 GiB (-34.1%) | 22.41 s | 4.60 s/img | 1.38x (+38.4%) | 4111 MiB (-31.5%) | 8961 MiB (-16.7%) |

Quant-to-quant tradeoff in this run: UINT4 is 22.7% smaller than INT8 and uses 826 MiB less VRAM after load plus 804 MiB less peak generation VRAM; INT8 generates 1.33x faster than UINT4 on this RTX 5090 setup.

## Notes

The original Anima split checkpoint is a ComfyUI-native model with a Qwen3 text encoder and a learned LLM adapter. Earlier transformer-only exports that load the checkpoint directly as `CosmosTransformer3DModel` ignore the `llm_adapter.*` weights; this repo keeps the adapter and the full pipeline structure so generation follows the native Anima architecture.
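The relative numbers in the benchmark table can be re-derived from the absolute measurements. A quick check, with values copied from the table above and a small `pct_smaller` helper written for this card:

```python
# Absolute measurements copied from the benchmark table above.
sizes_gib = {"bf16": 5.3, "uint4": 2.7, "int8": 3.5}
mean_gen_s = {"bf16": 6.37, "uint4": 6.13, "int8": 4.60}
vram_after_load_mib = {"bf16": 6005, "uint4": 3285, "int8": 4111}

def pct_smaller(a: float, b: float) -> float:
    """Percentage by which a is smaller than b."""
    return (1 - a / b) * 100

print(f"UINT4 vs BF16 size: -{pct_smaller(sizes_gib['uint4'], sizes_gib['bf16']):.1f}%")
print(f"INT8 speedup over UINT4: {mean_gen_s['uint4'] / mean_gen_s['int8']:.2f}x")
print(f"UINT4 VRAM saved vs INT8 after load: {vram_after_load_mib['int8'] - vram_after_load_mib['uint4']} MiB")
```

This reproduces the -49.1% size reduction, the 1.33x INT8-over-UINT4 speedup, and the 826 MiB post-load VRAM gap; small discrepancies against other table percentages come from the GiB figures being rounded to one decimal.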
## License

License follows the upstream Anima/CircleStone non-commercial license and the NVIDIA Cosmos derivative terms referenced by the upstream model card.