Spaces: ideogram-ai/ideogram4
multimodalart (HF Staff) committed 3 days ago

Commit 332e802 (verified) · 1 parent: a1f24dc

Drop dequantize (AOTI off -> run nf4 directly)

Files changed (1): app.py

@@ -55,14 +55,12 @@ MODES = {
     "Quality · 48 steps": dict(num_inference_steps=48, guidance_schedule=(7.0,) * 45 + (3.0,) * 3, mu=0.0, std=1.5),
 }
-# --- Pipeline: dequantize both transformers nf4 -> bf16 in the parent (CPU) so every ZeroGPU fork inherits
-# bf16 and AOTI can bind its weight-less graph to real weights. ---
+# --- Pipeline (nf4). No dequantize: that was only to give AOTI bf16 weights to bind; with AOTI off we run
+# the nf4 transformers directly (less VRAM, faster startup). Re-add dequantize when re-enabling AOTI. ---
 t = time.perf_counter()
 pipe = Ideogram4Pipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
-pipe.transformer.dequantize()
-pipe.unconditional_transformer.dequantize()
 pipe.to("cuda")
-print(f"[timing] pipeline load + dequant: {time.perf_counter() - t:.1f}s", flush=True)
+print(f"[timing] pipeline load: {time.perf_counter() - t:.1f}s", flush=True)
 # The local prompt-enhancer LM head is grafted lazily by `pipe.upsample_prompt` on first use (onto the worker's
 # GPU), so no explicit load is needed here. Local is only the fallback; Ideogram's remote API is the default.