HiDream-O1-Image-Dev — MLX port for Apple Silicon
Ported by Mrbizarro · MIT licensed · published to mlx-community
🎛️ Run it with one click in Phosphene
Phosphene is a free local generative-video panel for Apple Silicon Macs (M1 and newer). It ships with HiDream-O1 wired into its Image Studio — pick "HiDream-O1-Image-Dev BF16" from the engine dropdown and you get native edit + multi-reference support out of the box. No conda, no Python tinkering, no separate venv setup. Install Pinokio, then install Phosphene from within Pinokio.
A native MLX port of HiDream-ai/HiDream-O1-Image-Dev for fast local image generation on Apple Silicon Macs. No PyTorch, no CUDA, no flash-attn required at inference time.
Capabilities (all native to HiDream-O1, all working in this port):
- Text-to-image at 1024×1024 / 2048×2048 / non-square trained dims
- Instruction-based image edit with 1 reference image (e.g. "change the chef's white jacket to red" — preserves scene, pose, identity)
- Multi-reference subject personalization with 2-3 reference images (compose multiple subjects in a new scene)
HiDream-O1 is an 8B Qwen3-VL-based unified pixel-patch transformer — it predicts raw 32×32 RGB patches directly through the same backbone that handles text, with no separate VAE. The Dev variant is a 28-step distillation of the 50-step Full model, released under the MIT license.
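To make the pixel-patch formulation concrete, here is a minimal sketch of the unpatchify direction — folding a sequence of flat 32×32 RGB patch predictions back into an image. The helper name and shapes are illustrative, not the port's actual routine (its real patch handling lives in `pipeline_helpers.py`):

```python
import mlx.core as mx

def unpatchify(patches: mx.array, height: int, width: int, p: int = 32) -> mx.array:
    # patches: (num_patches, p*p*3) flat RGB patch predictions, laid out
    # row-major over a (height//p, width//p) patch grid. Sketch only.
    gh, gw = height // p, width // p
    x = patches.reshape(gh, gw, p, p, 3)   # (grid_h, grid_w, p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)         # interleave patch rows with pixel rows
    return x.reshape(gh * p, gw * p, 3)    # (H, W, C)
```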
This port:
- Reuses mlx-vlm's Qwen3-VL backbone (vision tower, decoder layers, mrope-3D)
- Adds the three diffusion-side custom heads (`t_embedder1`, `x_embedder`, `final_layer2`)
- Ports the `FlashFlowMatchEulerDiscreteScheduler` and the unified-token-sequence builder
- Ships BF16 weights (no quantization — see "Why BF16" below)
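For orientation, the scheduler's core update is a plain Euler step on the flow-matching ODE. A minimal sketch, with names that are ours rather than the port's actual signatures:

```python
def euler_flow_step(x, v, sigma, sigma_next):
    # The transformer predicts a velocity field v at noise level sigma;
    # the sample moves linearly toward the next level. Over the Dev
    # model's 28 steps, sigma runs from 1.0 (pure noise) down to 0.0.
    return x + (sigma_next - sigma) * v
```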
Hero samples
All samples were generated by the included generator script on a 64 GB Mac Studio. Click any image to open it at full resolution.
More: sample_outputs/hero/.
Variants
| Variant | Repo | Backbone size | RAM @ 1024×1024 | Quality |
|---|---|---|---|---|
| BF16 (this repo) | mlx-community/HiDream-O1-Image-Dev-mlx-bf16 | 17.5 GB | 16 GB | ✅ Clean across all trained dims |
| Q8 | mlx-community/HiDream-O1-Image-Dev-mlx-q8 | 10 GB | 11.5 GB | ⚠ Clean at square dims, grid at non-square |
| Q6 | mlx-community/HiDream-O1-Image-Dev-mlx-q6 | 8 GB | 8.5 GB | ⚠ Clean at square dims, grid at non-square |
Q4 was tested and rejected — brightness collapses, every image ships dark.
Why BF16 is the safe default
Per-group dequantization rounding (Q6/Q8) compounds across the 36 decoder layers and shows up as a visible 32-pixel grid in flat regions (skies, walls, water), specifically at non-square trained dimensions like 1440×2560 or 3104×1312. BF16 matches the upstream's torch_dtype=torch.float32 + autocast(bfloat16) precision and is the only variant that stays clean across all trained dimensions.
If your workflow is square-only (1024×1024, 2048×2048) and you're RAM-constrained, Q6 is half the size and 2× faster — no quality loss at those dims. Use Q6 on a 16 GB Mac, BF16 on 32 GB+.
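The grid artifact is straightforward to reason about with a round-trip experiment. A minimal sketch using MLX's group quantization — the group size is an assumption here, and 6-bit support depends on your MLX version:

```python
import mlx.core as mx

# Quantize one weight matrix and measure the rounding error that comes
# back after dequantization. In the full model this per-group error
# compounds across all 36 decoder layers.
w = mx.random.normal((4096, 4096))

for bits in (8, 6, 4):
    wq, scales, biases = mx.quantize(w, group_size=64, bits=bits)
    w_hat = mx.dequantize(wq, scales, biases, group_size=64, bits=bits)
    err = mx.abs(w_hat - w).mean()
    print(f"Q{bits}: mean |round-trip error| = {err.item():.2e}")
```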
Install
Requires macOS on Apple Silicon (M1 or newer). Tested on macOS 14+ with a 64 GB Mac Studio.
Quick start (download pre-converted weights — recommended)
# Download the repo contents (code, docs, samples, weights)
hf download mlx-community/HiDream-O1-Image-Dev-mlx-bf16 --local-dir hidream-o1-mlx
cd hidream-o1-mlx
# Set up the venv
uv venv --python 3.11
uv pip install -r requirements.txt
# Generate (model files are at the repo root — pass --model-path .)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
--model-path . \
--prompt "your prompt here" \
--output out.png
Or convert from upstream weights yourself
git clone https://huggingface.co/mlx-community/HiDream-O1-Image-Dev-mlx-bf16
cd HiDream-O1-Image-Dev-mlx-bf16
uv venv --python 3.11
uv pip install -r requirements.txt
# Convert the upstream HF weights to MLX BF16 (~5 minutes, requires ~50 GB free disk)
.venv/bin/python scripts/hidream_o1/convert_hidream_o1_to_mlx.py \
--hf-source HiDream-ai/HiDream-O1-Image-Dev \
--out-dir mlx_models/hidream-o1-dev-bf16 \
--bits 16
Usage
# Single image, default 1024×1024 BF16
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
--model-path mlx_models/hidream-o1-dev-bf16 \
--prompt "your prompt here" \
--output sample_outputs/whatever.png \
--seed 42
# Higher resolution (2048×2048 = upstream default)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
--model-path mlx_models/hidream-o1-dev-bf16 \
--prompt "..." \
--width 2048 --height 2048 \
--output sample_outputs/big.png
# Vertical / cinema (auto-snaps to nearest trained ratio)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
--model-path mlx_models/hidream-o1-dev-bf16 \
--prompt "..." \
--width 1440 --height 2560 \
--output sample_outputs/portrait.png
# Instruction-based edit (one ref image)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
--model-path mlx_models/hidream-o1-dev-bf16 \
--prompt "change the chef's white jacket to a bright red chef jacket, same kitchen, same pose, photorealistic" \
--output sample_outputs/edit_red_jacket.png \
--ref-images /path/to/chef.jpg \
--seed 42
# Multi-reference subject personalization (2-3 refs)
.venv/bin/python scripts/hidream_o1/generate_hidream_o1_mlx.py \
--model-path mlx_models/hidream-o1-dev-bf16 \
--prompt "the person from reference 1 standing in the location from reference 2, golden hour, photorealistic" \
--output sample_outputs/multi_ref.png \
--ref-images /path/to/person.jpg /path/to/place.jpg \
--seed 42
Trained resolutions
HiDream-O1 was trained on a fixed list of resolutions, and the generator auto-snaps requests to the closest one (see the sketch after the list). Off-spec dims produce visible patch artifacts. The trained list:
2048×2048, 2304×1728, 1728×2304, 2560×1440, 1440×2560,
2496×1664, 1664×2496, 3104×1312, 1312×3104, 2304×1792, 1792×2304
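A hypothetical sketch of the snapping logic — the generator's actual selection rule may differ (it might also weigh total area), but aspect-ratio distance captures the idea:

```python
# Trained (width, height) pairs from the list above.
TRAINED = [
    (2048, 2048), (2304, 1728), (1728, 2304), (2560, 1440), (1440, 2560),
    (2496, 1664), (1664, 2496), (3104, 1312), (1312, 3104), (2304, 1792),
    (1792, 2304),
]

def snap_resolution(width: int, height: int) -> tuple[int, int]:
    # Pick the trained resolution whose aspect ratio is closest to the request.
    target = width / height
    return min(TRAINED, key=lambda wh: abs(wh[0] / wh[1] - target))

print(snap_resolution(1080, 1920))  # -> (1440, 2560)
```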
Prompt tips for realism
HiDream is responsive to camera/film terminology. To avoid the AI-glossy look:
- Lead with `masterpiece, best quality` (a community-found responder phrase)
- Order as Subject + Actions → Setting → Style → Details
- Specify equipment: `Leica M6 with Kodak Tri-X 400`, `Pentax K1000 + Cinestill 800T`, `Hasselblad H6D medium format`
- Reference real photographers: Sebastião Salgado, Saul Leiter, Wim Wenders, Annie Leibovitz, Anders Petersen
- Spell out skin imperfections: "natural pores", "faint laugh lines", "weathered hands", "no retouching"
- Avoid "stunning", "perfect", "beautiful" — they push toward AI-glamour aesthetics
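Putting the tips together, an illustrative prompt (our own example, not from the upstream docs):

```python
# Subject + actions -> setting -> style -> details, per the ordering above.
prompt = (
    "masterpiece, best quality, an elderly fisherman mending nets on a stone "
    "pier, small coastal village at dawn, shot on a Leica M6 with Kodak "
    "Tri-X 400, in the style of Sebastião Salgado, natural pores, weathered "
    "hands, no retouching"
)
```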
The Dev model uses guidance_scale=0.0 so negative prompts have no effect — push positive prompts harder instead.
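Why negative prompts are inert: classifier-free guidance blends a conditional pass with an unconditional (negative-prompt) pass, and guidance distillation removes that blend. A hypothetical sketch of the standard mechanism — not the port's actual code path:

```python
def denoise(model, x, t, pos_tokens, neg_tokens=None, guidance_scale=0.0):
    if guidance_scale > 0.0 and neg_tokens is not None:
        # Classic CFG: two forward passes blended by the guidance scale.
        pred_pos = model(x, t, pos_tokens)
        pred_neg = model(x, t, neg_tokens)
        return pred_neg + guidance_scale * (pred_pos - pred_neg)
    # Guidance-distilled path (the Dev model): one conditional pass.
    # neg_tokens is never evaluated, so negative prompts cannot steer anything.
    return model(x, t, pos_tokens)
```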
What's in this repo
hidream-o1-mlx/
├── README.md (this file)
├── LICENSE (MIT)
├── requirements.txt (mlx-vlm 0.5.0, transformers 5.8+, deps)
├── scripts/hidream_o1/
│ ├── convert_hidream_o1_to_mlx.py (HF → MLX, BF16 / Q4 / Q6 / Q8)
│ ├── generate_hidream_o1_mlx.py (T2I generator + experimental edit/multi-ref)
│ ├── hidream_model.py (custom heads + forward_generation)
│ ├── pipeline_helpers.py (T2I sample, mrope, mask, patchify)
│ └── flow_match.py (FlashFlowMatchScheduler in MLX)
├── docs/
│ ├── EVALUATION.md (perf + quality findings, A/B vs mflux)
│ ├── HIDREAM_O1_MLX_PORT_REPORT.md (architecture + weight conversion details)
│ └── PHOSPHENE_INTEGRATION_PLAN.md (how it slots into a host app)
├── sample_outputs/ (gallery)
└── mlx_models/ (where converted weights land)
Performance
| Resolution | Per step | Total (28 steps) | Peak RAM |
|---|---|---|---|
| 1024×1024 | 2.4 s | 67 s | 16 GB |
| 1440×2560 | 4.5 s | 127 s | 16 GB |
| 2048×2048 | 6.7 s | 187 s | 16 GB |
| 3104×1312 | 7.6 s | 213 s | 16 GB |
mx.compile gives 0% speedup — the inference loop is bandwidth-bound on the 36-layer BF16 decoder. To go faster you'd need a smaller distillation (none public) or text-cache reuse across denoising steps.
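If you want to check this on your own hardware, a minimal timing harness — `denoise_step` and `latents` are stand-ins for the port's actual loop:

```python
import time
import mlx.core as mx

def bench(step_fn, x, n=5):
    mx.eval(step_fn(x))                 # warm-up (build/compile the graph)
    t0 = time.perf_counter()
    for _ in range(n):
        x = step_fn(x)
        mx.eval(x)                      # MLX is lazy; force evaluation
    return (time.perf_counter() - t0) / n

# plain = bench(denoise_step, latents)
# fused = bench(mx.compile(denoise_step), latents)
# On this model the two match: each step is dominated by streaming
# ~17.5 GB of BF16 weights, not by kernel-launch or graph overhead.
```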
Status
- ✅ Text-to-image: production-quality, BF16 default, ~67 s / 1024×1024 on a 64 GB Mac
- ✅ Instruction edit (K=1 ref): working at BF16. Verified: same chef, same kitchen, same pose, only the jacket colour changed.
- ✅ Multi-reference subject personalization (K=2-3 refs): supported by the upstream architecture and this port; pass multiple paths to the same `--ref-images` flag
- ✅ Native MLX — no PyTorch, no CUDA, no flash-attn at inference time
- ⚠ Edit requires BF16. Q6/Q8 quantization breaks the attention against ref features (degenerate output). The text-to-image path is fine at all quants.
Acknowledgements
- HiDream-ai for the original HiDream-O1-Image model + MIT license
- Blaizzy/mlx-vlm for the Qwen3-VL MLX backbone (this port reuses their vision tower + decoder layers + mrope-3D wholesale)
- Apple ml-explore/mlx for the MLX framework
- The Civitai community's HiDream prompt-engineering guide
Citation
If you use this in research, cite the upstream model:
@misc{hidream-o1-image,
author = {HiDream-ai},
title = {HiDream-O1-Image: Pixel-Level Unified Transformer},
year = {2026},
url = {https://github.com/HiDream-ai/HiDream-O1-Image}
}
License
MIT — see LICENSE.