---
library_name: diffusers
pipeline_tag: text-to-image
---

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/logo.svg" width="40%"/>
</div>
<p align="center">
<br>
📖 Check out GLM-Image's <a href="https://z.ai/blog/glm-image" target="_blank">Technical Blog</a>
<br>
📍 Use GLM-Image's <a href="https://docs.z.ai/guides/image/glm-image" target="_blank">API</a>
</p>

<p align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/show_case.jpeg" alt="show_case" width="100%" />
</p>

## Introduction

GLM-Image is an image generation model that adopts a hybrid autoregressive + diffusion decoder architecture. In overall image generation quality, GLM-Image is on par with mainstream latent diffusion approaches, but it shows significant advantages in text rendering and knowledge-intensive generation scenarios. It performs especially well in tasks requiring precise semantic understanding and complex information expression, while maintaining strong capabilities in high-fidelity, fine-grained detail generation. In addition to text-to-image generation, GLM-Image also supports a rich set of image-to-image tasks, including image editing, style transfer, identity-preserving generation, and multi-subject consistency.

Model architecture: a hybrid autoregressive + diffusion decoder design.

<p align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/architecture_1.jpeg" alt="architecture_1" width="100%" />
</p>

+ Autoregressive generator: a 9B-parameter model initialized from [GLM-4-9B-0414](https://huggingface.co/zai-org/GLM-4-9B-0414), with an expanded vocabulary to incorporate visual tokens. The model first generates a compact encoding of approximately 256 tokens, then expands it to 1K–4K tokens, corresponding to 1K–2K high-resolution image outputs.
+ Diffusion Decoder: a 7B-parameter decoder based on a single-stream DiT architecture for latent-space image decoding. It is equipped with a Glyph Encoder text module, significantly improving accurate text rendering within images.
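The token counts above can be read as a patch grid over the output image. A back-of-envelope sketch, assuming a hypothetical effective patch size of 32 pixels (the actual tokenizer granularity is not stated in this card); under that assumption, the ~256-token compact stage would correspond to a 512×512 grid:

```python
# Illustrative arithmetic only, NOT the model's real tokenizer.
# Assumption: one visual token per 32x32-pixel patch (hypothetical).
def visual_tokens(height: int, width: int, patch: int = 32) -> int:
    """Number of visual tokens for an image at the given resolution."""
    return (height // patch) * (width // patch)

# 1024x1024 -> 1K tokens; 2048x2048 -> 4K tokens, matching the stated 1K-4K range.
print(visual_tokens(1024, 1024))  # 1024
print(visual_tokens(2048, 2048))  # 4096
```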

<p align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/architecture_2.jpeg" alt="architecture_2" width="70%" />
</p>

Post-training with decoupled reinforcement learning: the model introduces a fine-grained, modular feedback strategy using the GRPO algorithm, substantially enhancing both semantic understanding and visual detail quality.

+ Autoregressive module: provides low-frequency feedback signals focused on aesthetics and semantic alignment, improving instruction following and artistic expressiveness.
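At its core, GRPO replaces a learned value baseline with a group-relative one: several outputs are sampled per prompt, each is scored by a reward model, and rewards are normalized within the group. A minimal sketch of that group-relative advantage (function name and epsilon are illustrative, not from this card):

```python
# Sketch of GRPO's group-relative advantage, not the model's training code.
def grpo_advantages(rewards: list[float]) -> list[float]:
    """Normalize a group of sampled rewards to zero mean and unit std."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # Epsilon keeps the division stable when all rewards in a group are equal.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Outputs scoring above the group mean get positive advantage, below get negative.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```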

GLM-Image supports both text-to-image and image-to-image generation within a single model.

+ Text-to-image: generates high-detail images from textual descriptions, with particularly strong performance in information-dense scenarios.
+ Image-to-image: supports a wide range of tasks, including image editing, style transfer, multi-subject consistency, and identity-preserving generation for people and objects.

## Showcase

### T2I with dense text and knowledge

<p align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/show_case_t2i.jpeg" alt="show_case_t2i" width="100%" />
</p>

### I2I

<p align="center">
<img src="https://raw.githubusercontent.com/zai-org/GLM-Image/refs/heads/main/resources/show_case_i2i.jpeg" alt="show_case_i2i" width="100%" />
</p>

## Quick Start

### transformers + diffusers Pipeline