---
library_name: diffusers
language:
- km
pipeline_tag: text-to-image
---

# Stable Diffusion for Khmer Text Generation (KHM-53)

This repository hosts a fine-tuned Stable Diffusion model customized for **Khmer text image generation**. The model aims to generate high-quality synthetic data, particularly for applications such as **Khmer OCR**, **document layout analysis**, and **AI-based Khmer text systems**.

---

## 📌 Problem

Despite rapid advances in AI and generative models, **Khmer** remains a low-resource language that lacks high-quality datasets and models for tasks like text-to-image generation, OCR, and scene text analysis. Compared with languages such as Thai or Vietnamese, Khmer has little publicly available data, especially in image form, which makes it difficult to develop robust AI systems.

---

## 🎯 Objective

The primary objectives of this project were to:

- Develop a **text-to-image generation pipeline** capable of generating **synthetic Khmer word images** from text prompts.
- Support the **development of Khmer OCR** and other Khmer language-related AI tools by providing **training-grade synthetic data**.
- Evaluate and compare generative models, including **DCGAN**, **UNet2D**, and **Stable Diffusion**, to determine the best fit for Khmer.

---

## 🧭 Goal and Scope

**Goal**: To build a complete, scalable, and publicly accessible pipeline that transforms Khmer text into realistic images for downstream use in OCR and machine learning.

**Scope**:

- Covers Khmer character-level and word-level image generation.
- Implements and compares several generative architectures (GAN, Diffusion).
- Includes the full pipeline: **data preprocessing**, **model training**, **fine-tuning**, **evaluation**, and **deployment**.
- Does not yet include a production-ready API, though integration with a **Telegram chatbot** and public hosting on the Hugging Face Hub are provided.
- Future extensions are planned for longer texts, handwritten data, and a web GUI.

---

## 🚀 Project Summary

This model was developed during a 4-month internship at Factory.io under the Cambodia Academy of Digital Technology (CADT), with the main objective of generating synthetic images of **Khmer script** from text prompts. The final output is an end-to-end **text-to-image generation pipeline**, fine-tuned on Khmer word images using the Stable Diffusion architecture.

## 🛠️ Process Overview

### 1. Literature Review & Experimentation

- Compared **DCGAN**, **UNet2D**, and **Stable Diffusion**.
- Stable Diffusion with a **RoBERTa text encoder** and **VAE decoder** showed the best qualitative and quantitative results.
- Used a **UNet2DConditionModel** conditioned on **text embeddings** and trained with **diffusion denoising**.

### 2. Dataset

- Source: [Khmer text recognition dataset on Kaggle](https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset)
- 136K+ images in 10 fonts, filtered and converted to grayscale.
- 96.57% of images were retained after keeping only those smaller than 128×64 px.

### 3. Preprocessing

- Filtered images by size.
- Converted images from RGB to grayscale to fit a limited GPU (12 GB VRAM).
- Applied normalization, resizing, and rescaling.

### 4. Model Architecture

- Text encoder: **RoBERTa**
- Latent generator: **UNet2DConditionModel**
- Decoder: **Variational Autoencoder (VAE)**
- The pipeline operates in **latent space** for memory efficiency and sharp image generation.

### 5. Training & Fine-Tuning

- Trained with the **AdamW optimizer** and **MSE loss**.
- Fine-tuned the text encoder, UNet2DConditionModel, and VAE simultaneously.
- Evaluated on both unconditional and conditional generation tasks.

### 6. Deployment

- Final models are hosted here on Hugging Face for:
  - Community sharing
  - Version control
  - Future fine-tuning
  - Public reproducibility

Hugging Face Collection: 🔗 https://huggingface.co/collections/channudam/textimagegeneration-khm-35-67d916c2505635db1ba8fc3c

## 📈 Results

| Model Type           | Output Quality | Parameters   | Image Size, Channels |
|----------------------|----------------|--------------|----------------------|
| DCGAN                | Low            | 239K         | 64×64, 1 channel     |
| UNet2D               | Good           | 106M         | 64×64, 1 channel     |
| UNet2DConditionModel | Very Good      | 147M         | 64×32, 1 channel     |
| Stable Diffusion     | Excellent      | 881M (total) | 128×64, RGB          |

✔️ **Stable Diffusion outperformed all other methods**, producing sharper, more accurate Khmer text images.

✔️ It runs within a 12 GB VRAM budget by operating on compressed latent representations.

## 🧠 Key Challenges

- Limited public Khmer datasets.
- Khmer script complexity (stacked diacritics, varied fonts).
- Hardware constraints (12 GB VRAM).
- Evaluation had to rely on **manual visual inspection**.

## 🔮 Future Work

- Expand the dataset to include **longer texts** and **handwritten Khmer**.
- Integrate speech and handwriting modules for multimodal Khmer AI.
- Develop a web-based GUI for **real-time Khmer text-to-image generation**.
- Optimize for edge devices (mobile, low-power GPUs).
- Explore larger transformer-based encoders for better text understanding.

## Fine-Tuning

This is a base model intended to be fine-tuned for specific tasks or datasets. It was trained on **128×64** images with **RGB** color channels, but the resolution can be adjusted during fine-tuning to match your desired output size.
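
To make the fine-tuning objective more concrete, the sketch below works through the noise-prediction loss that latent diffusion models are commonly trained with, in plain Python. It assumes that the MSE loss used for this model compares predicted and true noise and that a linear beta schedule with 1,000 steps is used; neither detail is confirmed by this repository. The function names (`linear_betas`, `alpha_bars`, `add_noise`, `mse`) are illustrative only, and a short list of floats stands in for a flattened VAE latent.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed schedule, not taken from this repo)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative product of (1 - beta_t) for every timestep."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(x0, noise, alpha_bar_t):
    """Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise."""
    signal = math.sqrt(alpha_bar_t)
    sigma = math.sqrt(1.0 - alpha_bar_t)
    return [signal * x + sigma * e for x, e in zip(x0, noise)]


def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


rng = random.Random(0)
latent = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # stand-in for a VAE latent
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]      # the regression target

a_bars = alpha_bars(linear_betas(1000))
t = 500                                               # an arbitrary timestep
noisy = add_noise(latent, noise, a_bars[t])

# Placeholder for a model output: a real run would predict the noise
# from the noisy latent, the timestep, and the text embedding.
placeholder_pred = [0.0] * len(noise)
loss_placeholder = mse(placeholder_pred, noise)
loss_perfect = mse(noise, noise)                      # a perfect prediction gives zero loss

print(f"placeholder loss: {loss_placeholder:.4f}, perfect loss: {loss_perfect:.4f}")
```

In an actual fine-tuning run, the placeholder prediction would come from the `UNet2DConditionModel` given the noisy latent, the timestep, and the text-encoder embedding, and AdamW would minimize this loss.
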

For best results, fine-tune all three main components rather than only the core UNet model:

- **Text Encoder**: [`RobertaModel`]
- **Variational Autoencoder**: [`AutoencoderKL`]
- **Image Generation Model**: [`UNet2DConditionModel`]

## Usage (with GPU)

```python
import matplotlib.pyplot as plt
import torch
from diffusers import StableDiffusionPipeline

# Load the fine-tuned pipeline in half precision and move it to the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "channudam/stable-diffusion-khm-53",
    torch_dtype=torch.float16,
).to("cuda")

# Generate one image for the Khmer prompt "បាត់ដំបង" (Battambang).
image = pipe("បាត់ដំបង", guidance_scale=2).images[0]

# Display the resulting PIL image without axis ticks.
plt.imshow(image)
plt.axis("off")
plt.show()
```

![Generated Khmer Text Image](https://huggingface.co/channudam/stable-diffusion-khm-53/resolve/main/output_128x64_3.png)

## 📚 References

- [Khmer Text Recognition Dataset - Kaggle](https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset)
- [Stable Diffusion Course - Hugging Face](https://huggingface.co/learn/diffusion-course/en/unit3/1)
- [High-Resolution Image Synthesis with Latent Diffusion Models - arXiv](https://arxiv.org/pdf/2112.10752.pdf)
- [DCGAN Tutorial - TensorFlow](https://www.tensorflow.org/tutorials/generative/dcgan)

---

> Made with ❤️ by [Channudam Ray](https://huggingface.co/channudam) | [Factory.io](https://robotxacademy.site/en/about-us) & CADT, 2025