channudam committed on
Commit
a660b22
·
verified ·
1 Parent(s): 40befcd

Update README.md

Files changed (1)
  1. README.md +82 -6
README.md CHANGED
@@ -4,14 +4,79 @@ language:
4
  - km
5
  pipeline_tag: text-to-image
6
  ---
7
- ## Model Description
8
 
9
- This project explores Khmer text-to-image generation, inspired by the architecture of Stable Diffusion. It builds upon the base model [`channudam/unet2dcon-khm-35`](https://huggingface.co/channudam/unet2dcon-khm-35) by integrating key components from the Stable Diffusion framework. This setup enhances image quality, provides better control, and offers more flexibility for downstream tasks.
10
 
11
- - **Developed by:** Mr. Channudam Ray
12
- - **Funded by:** Factory.io
13
- - **Model Type:** Stable Diffusion-based
14
- - **Language:** Khmer (Central dialect)
15
 
16
  ## Fine-Tuning
17
 
@@ -41,3 +106,14 @@ plt.show()
41
  ```
42
  ![Generated Khmer Text Image](https://huggingface.co/channudam/stable-diffusion-khm-53/resolve/main/output_128x64_3.png)
43
 
4
  - km
5
  pipeline_tag: text-to-image
6
  ---
7
+ # Stable Diffusion for Khmer Text Generation (KHM-53)
8
 
9
+ This repository contains a fine-tuned Stable Diffusion pipeline for **Khmer Text Image Generation**, designed to create high-quality synthetic datasets for applications such as **Khmer OCR**, **document analysis**, and **language modeling**.
10
+
11
+ ## 🚀 Project Summary
12
+
13
+ This model was developed as part of a 4-month internship at Factory.io under the Cambodia Academy of Digital Technology (CADT), with the main objective of generating synthetic images of **Khmer script** from text prompts. The final output is an end-to-end **text-to-image generation pipeline**, fine-tuned on Khmer word images using the Stable Diffusion architecture.
14
+
15
+ ## 🛠️ Process Overview
16
+
17
+ ### 1. Literature Review & Experimentation
18
+ - Compared **DCGAN**, **UNet2D**, and **Stable Diffusion**.
19
+ - Stable Diffusion with a **RoBERTa text encoder** and **VAE decoder** showed the best qualitative and quantitative results.
20
+ - Used components and techniques such as **UNet2DConditionModel**, **text embedding**, and **diffusion denoising**.
21
+
22
+ ### 2. Dataset
23
+ - Source: [Khmer text recognition dataset on Kaggle](https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset)
24
+ - 136K+ images with 10 fonts, filtered and converted to grayscale.
25
+ - 96.57% of images were retained after filtering for resolutions under 128×64 px.
26
+
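The size filter described in this section can be sketched in a few lines. This is a minimal illustration, not the project's actual script: the helpers `fits_target` and `filter_by_size`, the sample sizes, and the reading of "<128×64" as "width at most 128 and height at most 64" are all assumptions.

```python
# Hypothetical sketch: keep only images that fit inside the 128x64 target canvas.
# The threshold interpretation (<= on both sides) is an assumption.
MAX_WIDTH, MAX_HEIGHT = 128, 64


def fits_target(size, max_width=MAX_WIDTH, max_height=MAX_HEIGHT):
    """Return True if a (width, height) pair fits inside the target canvas."""
    width, height = size
    return width <= max_width and height <= max_height


def filter_by_size(sizes):
    """Return the kept (width, height) pairs and the retention rate."""
    kept = [s for s in sizes if fits_target(s)]
    retention = len(kept) / len(sizes) if sizes else 0.0
    return kept, retention


# Made-up sizes for illustration only; real sizes would come from the dataset.
sample_sizes = [(96, 48), (128, 64), (200, 64), (120, 80), (64, 32)]
kept, retention = filter_by_size(sample_sizes)
print(f"kept {len(kept)} of {len(sample_sizes)} images ({retention:.2%})")
```

Running the same function over the real image sizes would give a retention rate comparable to the 96.57% reported above, provided the threshold is interpreted the same way.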
27
+ ### 3. Preprocessing
28
+ - Filtered images by size.
29
+ - Converted RGB images to grayscale to reduce memory use on a limited GPU (12 GB VRAM).
30
+ - Applied normalization, resizing, and rescaling.
31
+
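The grayscale and rescaling steps can be illustrated on a toy image. This is a sketch under stated assumptions: the luma weights are the ITU-R BT.601 values (the ones Pillow uses for mode `"L"`), the `[-1, 1]` target range is a common choice for diffusion models but is not confirmed by the source, and resizing is left out. The function names are hypothetical.

```python
# Hypothetical sketch of the grayscale and rescaling steps on a tiny RGB image.
# Luma weights follow ITU-R BT.601; the [-1, 1] target range is an assumption.


def rgb_to_gray(pixel):
    """Convert one (R, G, B) pixel with 0-255 channels to a 0-255 gray value."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def normalize(value):
    """Rescale a 0-255 intensity to the [-1, 1] range."""
    return value / 127.5 - 1.0


def preprocess(image):
    """Map a nested list of RGB pixels to normalized single-channel rows."""
    return [[normalize(rgb_to_gray(px)) for px in row] for row in image]


# 1x3 toy image: black, white, and pure red pixels.
toy_image = [[(0, 0, 0), (255, 255, 255), (255, 0, 0)]]
gray = preprocess(toy_image)
print([round(v, 3) for v in gray[0]])
```

Dropping from three channels to one cuts the per-image input size by a factor of three, which is the memory saving the bullet above refers to.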
32
+ ### 4. Model Architecture
33
+ - Text encoder: **RoBERTa**
34
+ - Latent generator: **UNet2DConditional**
35
+ - Decoder: **Variational Autoencoder (VAE)**
36
+ - The pipeline operates in **latent space** for memory efficiency and sharp image generation.
37
+
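To make the memory argument for latent space concrete, the arithmetic below compares the number of values in a 128×64 RGB image with the number in its latent. It is only a sketch: the downsampling factor of 8 and the 4 latent channels are the defaults of the original Stable Diffusion VAE and are assumed here, since the source does not state this model's VAE configuration. The function names are hypothetical.

```python
# Hypothetical sketch: how much smaller a latent tensor is than the pixel image.
# downsample=8 and latent_channels=4 are assumed from the original
# Stable Diffusion VAE; this model's VAE may differ.


def latent_shape(width, height, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size."""
    if width % downsample or height % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return latent_channels, height // downsample, width // downsample


def compression_ratio(width, height, image_channels=3, **kwargs):
    """Ratio of pixel-space values to latent-space values."""
    c, h, w = latent_shape(width, height, **kwargs)
    return (image_channels * width * height) / (c * h * w)


shape = latent_shape(128, 64)
ratio = compression_ratio(128, 64)
print(f"latent shape: {shape}, compression: {ratio:.0f}x")
```

Under these assumptions the denoising network works on 512 values per image instead of 24,576.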
38
+ ### 5. Training & Fine-Tuning
39
+ - Trained with the **AdamW optimizer** and **MSE loss**.
40
+ - Fine-tuned the text encoder, UNet2DConditional, and VAE simultaneously.
41
+ - Evaluated on both unconditional and conditional generation tasks.
42
+
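The MSE objective mentioned above is typically applied to noise prediction in diffusion training. The sketch below shows that objective in plain Python. It assumes the standard DDPM formulation (noise the clean sample in closed form, then regress the noise), which the source does not confirm. The optimizer step and the networks are omitted, and all names are hypothetical.

```python
import math
import random

# Hypothetical sketch of a noise-prediction objective with an MSE loss.
# The closed-form noising step follows the standard DDPM formulation, which is
# an assumption here; the source only states that MSE loss was used.


def add_noise(x0, noise, alpha_bar):
    """Forward diffusion: mix clean values with Gaussian noise at one timestep."""
    signal, sigma = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [signal * x + sigma * n for x, n in zip(x0, noise)]


def mse_loss(predicted, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for a clean latent
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]  # the regression target
x_t = add_noise(x0, noise, alpha_bar=0.5)        # what the denoiser would see

# A real denoiser would predict the noise from x_t, the timestep, and the text
# embedding. Two dummy predictions bracket the loss instead.
perfect_loss = mse_loss(noise, noise)
zero_guess_loss = mse_loss([0.0] * len(noise), noise)
print(perfect_loss, zero_guess_loss)
```

In a full training loop, the loss would be backpropagated through the denoiser and an optimizer such as AdamW would update the weights.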
43
+ ### 6. Deployment
44
+ - Final models are hosted here on Hugging Face for:
45
+   - Community sharing
46
+   - Version control
47
+   - Future fine-tuning
48
+   - Public reproducibility
49
+
50
+ Hugging Face Collection:
51
+ 🔗 https://huggingface.co/collections/channudam/textimagegeneration-khm-35-67d916c2505635db1ba8fc3c
52
+
53
+ ## 📈 Results
54
+
55
+ | Model Type | Output Quality | Params | Image Size |
56
+ |--------------------------|----------------|---------------|------------------|
57
+ | DCGAN | Low | 239K | 64x64, 1-chan |
58
+ | UNet2D | Good | 106M | 64x64, 1-chan |
59
+ | UNet2DConditional | Very Good | 147M | 64x32, 1-chan |
60
+ | Stable Diffusion | Excellent | 881M (Total) | 128x64, RGB |
61
+
62
+ ✔️ **Stable Diffusion outperformed all other methods**, producing sharper, more accurate Khmer text images.
63
+ ✔️ Runs within a 12 GB GPU memory budget by working on a compressed latent representation.
64
+
65
+ ## 🧠 Key Challenges
66
+
67
+ - Limited public Khmer datasets.
68
+ - Khmer script complexity (stacked diacritics, varied fonts).
69
+ - Hardware constraints (12GB VRAM).
70
+ - Evaluation had to rely on **manual visual inspection**.
71
+
72
+ ## 🔮 Future Work
73
+
74
+ - Expand dataset to include **longer texts** and **handwritten Khmer**.
75
+ - Integrate speech and handwriting modules for multimodal Khmer AI.
76
+ - Develop a web-based GUI for **real-time Khmer text-to-image generation**.
77
+ - Optimize for edge devices (mobile, low-power GPUs).
78
+ - Explore larger transformer-based encoders for better text understanding.
79
 
80
 
81
  ## Fine-Tuning
82
 
 
106
  ```
107
  ![Generated Khmer Text Image](https://huggingface.co/channudam/stable-diffusion-khm-53/resolve/main/output_128x64_3.png)
108
 
109
+ ## 📚 References
110
+
111
+ - [Khmer Text Recognition Dataset - Kaggle](https://www.kaggle.com/datasets/emhengly/khmer-text-recognition-dataset)
112
+ - [Stable Diffusion Course - Hugging Face](https://huggingface.co/learn/diffusion-course/en/unit3/1)
113
+ - [High-Resolution Image Synthesis with Latent Diffusion Models - arXiv](https://arxiv.org/pdf/2112.10752.pdf)
114
+ - [DCGAN Tutorial - TensorFlow](https://www.tensorflow.org/tutorials/generative/dcgan)
115
+
116
+ ---
117
+
118
+ > Made with ❤️ by [Channudam Ray](https://huggingface.co/channudam) | [Factory.io](https://robotxacademy.site/en/about-us) & CADT, 2025
119
+