channudam/stable-diffusion-khm-53

@@ -4,14 +4,79 @@ language:
 - km
 pipeline_tag: text-to-image
 ---
-## Model Description
-This project explores Khmer text-to-image generation, inspired by the architecture of Stable Diffusion. It builds upon the base model [`channudam/unet2dcon-khm-35`](https://huggingface.co/channudam/unet2dcon-khm-35) by integrating key components from the Stable Diffusion framework. This setup enhances image quality, provides better control, and offers more flexibility for downstream tasks.
-- **Developed by:** Mr. Channudam Ray
-- **Funded by:** Factory.io
-- **Model Type:** Stable Diffusion-based
-- **Language:** Khmer (Central dialect)
 ## Fine-Tuning
@@ -41,3 +106,14 @@ plt.show()
 ```
 ![Generated Khmer Text Image](https://huggingface.co/channudam/stable-diffusion-khm-53/resolve/main/output_128x64_3.png)

 - km
 pipeline_tag: text-to-image
 ---
+# Stable Diffusion for Khmer Text Generation (KHM-53)
+This repository contains a fine-tuned Stable Diffusion pipeline for **Khmer Text Image Generation**, designed to create high-quality synthetic datasets for applications such as **Khmer OCR**, **document analysis**, and **language modeling**.
+## 🚀 Project Summary
+This model was developed as part of a 4-month internship at Factory.io under the Cambodia Academy of Digital Technology (CADT), with the main objective of generating synthetic images of **Khmer script** from text prompts. The final output is an end-to-end **text-to-image generation pipeline**, fine-tuned on Khmer word images using the Stable Diffusion architecture.
+## 🛠️ Process Overview
+### 1. Literature Review & Experimentation
+- Compared **DCGAN**, **UNet2D**, and **Stable Diffusion**.
+- Stable Diffusion with a **RoBERTa text encoder** and **VAE decoder** showed the best qualitative and quantitative results.
+- Used **UNet2DConditionModel**, **text embeddings**, and **diffusion denoising**.
+### 2. Dataset
+- Source: [Khmer text recognition dataset on Kaggle](https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset)
+- 136K+ images with 10 fonts, filtered and converted to grayscale.
+- 96.57% of images were retained after keeping only those smaller than 128×64 px.
+### 3. Preprocessing
+- Filtering based on image size.
+- Conversion from RGB to grayscale to reduce memory use on a limited GPU (12 GB VRAM).
+- Applied normalization, resizing, and rescaling.
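The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual pipeline: the function names are hypothetical, the size check uses `<=` as an assumption, the `[-1, 1]` range is the convention commonly used for diffusion models rather than something stated here, and centered padding on a white canvas stands in for the resizing step.

```python
import numpy as np

TARGET_W, TARGET_H = 128, 64  # target resolution mentioned in the dataset section


def fits_target(img: np.ndarray) -> bool:
    """Keep only images that fit inside the target canvas (assumed rule)."""
    h, w = img.shape[:2]
    return w <= TARGET_W and h <= TARGET_H


def to_grayscale(img: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 RGB image to (H, W) float32 grayscale."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)  # BT.601 luma weights
    return img.astype(np.float32) @ weights


def pad_to_target(gray: np.ndarray, fill: float = 255.0) -> np.ndarray:
    """Center the image on a white TARGET_H x TARGET_W canvas."""
    h, w = gray.shape
    canvas = np.full((TARGET_H, TARGET_W), fill, dtype=np.float32)
    top, left = (TARGET_H - h) // 2, (TARGET_W - w) // 2
    canvas[top:top + h, left:left + w] = gray
    return canvas


def normalize(gray: np.ndarray) -> np.ndarray:
    """Rescale pixel values from [0, 255] to [-1, 1]."""
    return gray / 127.5 - 1.0


def preprocess(images):
    """Filter, convert, pad, and normalize; also return the retained fraction."""
    kept = [img for img in images if fits_target(img)]
    batch = np.stack([normalize(pad_to_target(to_grayscale(img))) for img in kept])
    return batch, len(kept) / len(images)
```

Returning the retained fraction alongside the batch makes it easy to report a figure like the 96.57% retention rate quoted in the dataset section.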
+### 4. Model Architecture
+- Text encoder: **RoBERTa**
+- Latent generator: **UNet2DConditional**
+- Decoder: **Variational Autoencoder (VAE)**
+- Pipeline operates in **latent space** for memory efficiency and sharp image generation.
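To give a feel for why working in latent space saves memory, the sketch below computes the latent tensor shape and the resulting reduction in element count for a 128×64 RGB image. The downsampling factor of 8 and the 4 latent channels are the values used by the original Stable Diffusion VAE; whether this model's VAE uses the same settings is an assumption, and the function names are hypothetical.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the latent produced by a VAE encoder.

    downsample=8 and latent_channels=4 are assumed defaults, taken from
    the original Stable Diffusion VAE, not from this model's config.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, channels=3, **vae_kwargs):
    """How many times fewer elements the latent has than the pixel image."""
    c, h, w = latent_shape(height, width, **vae_kwargs)
    return (channels * height * width) / (c * h * w)


shape = latent_shape(64, 128)       # 64 px high, 128 px wide
ratio = compression_ratio(64, 128)
print(shape, ratio)                 # (4, 8, 16) 48.0
```

Under these assumptions the denoising network processes 512 values per image instead of 24,576, which is where most of the memory saving comes from.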
+### 5. Training & Fine-Tuning
+- Trained with **AdamW optimizer** and **MSE Loss**.
+- Fine-tuned text encoder, UNet2DConditional, and VAE simultaneously.
+- Evaluated on both unconditional and conditional generation tasks.
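The MSE loss mentioned above is typically applied to the noise prediction, as in standard DDPM-style training. The sketch below illustrates one such loss computation in NumPy. It assumes a linear beta schedule with 1000 steps and the usual noise-prediction objective, neither of which is confirmed for this model; `predict_noise` is a hypothetical stand-in for the conditional U-Net, and the AdamW update itself is omitted.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(x0, noise, t, alpha_bar):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a = alpha_bar[t].reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast per sample
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise


def mse_loss(pred, target):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((pred - target) ** 2))


def training_loss(predict_noise, x0, cond, alpha_bar, rng):
    """Loss for one batch: sample timesteps and noise, then score the prediction."""
    t = rng.integers(0, len(alpha_bar), size=x0.shape[0])
    noise = rng.standard_normal(x0.shape)
    x_t = add_noise(x0, noise, t, alpha_bar)
    return mse_loss(predict_noise(x_t, t, cond), noise)


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
latents = rng.standard_normal((2, 4, 8, 16))  # dummy batch of latents


def zero_model(x_t, t, cond):
    """Placeholder model that always predicts zero noise."""
    return np.zeros_like(x_t)


loss = training_loss(zero_model, latents, None, alpha_bar, rng)
```

In a real training loop the loss would be backpropagated through the network and the weights updated with AdamW; the placeholder model here only demonstrates the data flow.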
+### 6. Deployment
+- Final models are hosted here on Hugging Face for:
+  - Community sharing
+  - Version control
+  - Future fine-tuning
+  - Public reproducibility
+Hugging Face Collection:
+🔗 https://huggingface.co/collections/channudam/textimagegeneration-khm-35-67d916c2505635db1ba8fc3c
+## 📈 Results
+| Model Type               | Output Quality | Params        | Image Size      |
+|--------------------------|----------------|---------------|------------------|
+| DCGAN                    | Low            | 239K          | 64x64, 1-chan    |
+| UNet2D                   | Good           | 106M          | 64x64, 1-chan    |
+| UNet2DConditional        | Very Good      | 147M          | 64x32, 1-chan    |
+| Stable Diffusion         | Excellent      | 881M (Total)  | 128x64, RGB      |
+✔️ **Stable Diffusion outperformed all other methods**, producing sharper, more accurate Khmer text images.
+✔️ Runs within a 12 GB GPU memory budget by operating on a compressed latent representation.
+## 🧠 Key Challenges
+- Limited public Khmer datasets.
+- Khmer script complexity (stacked diacritics, varied fonts).
+- Hardware constraints (12GB VRAM).
+- Evaluation had to rely on **manual visual inspection**.
+## 🔮 Future Work
+- Expand dataset to include **longer texts** and **handwritten Khmer**.
+- Integrate speech and handwriting modules for multimodal Khmer AI.
+- Develop a web-based GUI for **real-time Khmer text-to-image generation**.
+- Optimize for edge devices (mobile, low-power GPUs).
+- Explore larger transformer-based encoders for better text understanding.
 ## Fine-Tuning
 ```
 ![Generated Khmer Text Image](https://huggingface.co/channudam/stable-diffusion-khm-53/resolve/main/output_128x64_3.png)
+## 📚 References
+- [Khmer Text Recognition Dataset - Kaggle](https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset)
+- [Stable Diffusion Course - Hugging Face](https://huggingface.co/learn/diffusion-course/en/unit3/1)
+- [High-Resolution Image Synthesis with Latent Diffusion Models - arXiv](https://arxiv.org/pdf/2112.10752.pdf)
+- [DCGAN Tutorial - TensorFlow](https://www.tensorflow.org/tutorials/generative/dcgan)
+---
+> Made with ❤️ by [Channudam Ray](https://huggingface.co/channudam) | [Factory.io](https://robotxacademy.site/en/about-us) & CADT, 2025