---
license: apache-2.0
datasets:
- dzungpham/font-diffusion-generated-data
language:
- en
library_name: diffusers
tags:
- font-diffusion
- image-to-image
- contrastive-learning
- diffusers
- font-generation
- character-synthesis
- style-transfer
- dpm-solver
---
# Model Card for FontDiffuser

## Model Details

### Model Type
- **Architecture**: Diffusion-based Font Generation Model
- **Framework**: PyTorch + Hugging Face Diffusers
- **Scheduler**: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
- **Guidance**: Classifier-free guidance
- **Base Model**: FontDiffuser with Content and Style Encoders

### Model Components
1. **UNet**: Main diffusion model for image generation
2. **Content Encoder**: Extracts character structure information
3. **Style Encoder**: Extracts font style features
4. **DDPM/DPM Scheduler**: Noise scheduling for the diffusion process

### Training Configuration
- **Resolution**: 96×96 pixels
- **Batch Size**: 4-8 (configurable)
- **Inference Steps**: 15 (default, configurable)
- **Guidance Scale**: 7.5 (default, configurable)
- **Precision**: FP32, with optional FP16
- **Device**: CUDA/GPU recommended
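
The guidance scale listed above controls classifier-free guidance. A minimal sketch of the standard combination rule follows, assuming the model produces one unconditional and one conditional noise prediction per step. The function name `apply_cfg` and the toy values are illustrative and are not taken from the FontDiffuser code.

```python
def apply_cfg(eps_uncond, eps_cond, guidance_scale=7.5):
    """Combine noise predictions with classifier-free guidance.

    eps_uncond and eps_cond are flat lists of floats standing in for the
    unconditional and conditional noise predictions (hypothetical inputs).
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy values; a scale of 1.0 returns the conditional prediction unchanged.
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]
guided = apply_cfg(uncond, cond, guidance_scale=7.5)
same_as_cond = apply_cfg(uncond, cond, guidance_scale=1.0)
```

Larger scales push the result further from the unconditional prediction in the direction of the conditional one.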

## Model Usage

### Installation
```bash
pip install diffusers torch torchvision safetensors
pip install lpips scikit-image pytorch-fid  # Optional: for evaluation
```

### Basic Generation
```python
from argparse import Namespace

from sample_batch import (
    FontManager,
    batch_generate_images,
    load_fontdiffuser_pipeline,
)

# Initialize font manager
font_manager = FontManager("path/to/font.ttf")

# Load pipeline
args = Namespace(
    ckpt_dir="path/to/checkpoints",
    device="cuda",
    num_inference_steps=15,
    guidance_scale=7.5,
    batch_size=4,
    # ... other args
)
pipe = load_fontdiffuser_pipeline(args)

# Generate images
characters = ['A', 'B', 'C', '中', '国']
style_paths = ['style1.png', 'style2.png']

# No evaluator is built in this snippet. Passing None is an assumption that
# metric computation is optional; supply an evaluator object to enable it.
evaluator = None

results = batch_generate_images(
    pipe, characters, style_paths,
    output_dir="output",
    args=args,
    evaluator=evaluator,
    font_manager=font_manager,
)
```

### Batch Generation with Checkpointing
```bash
python sample_batch.py \
    --characters "characters.txt" \
    --start_line 1 \
    --end_line 100 \
    --style_images "styles/" \
    --ttf_path "fonts/myfont.ttf" \
    --ckpt_dir "checkpoints/" \
    --output_dir "my_dataset/train_original" \
    --batch_size 4 \
    --num_inference_steps 15 \
    --guidance_scale 7.5 \
    --save_interval 10 \
    --device cuda
```
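
The `--start_line` and `--end_line` flags select a slice of the characters file. The sketch below shows one plausible reading of them, assuming 1-based inclusive line numbers and one character per line. The helper `read_character_range` is hypothetical, and the actual behavior of `sample_batch.py` may differ.

```python
import os
import tempfile


def read_character_range(path, start_line=1, end_line=None):
    """Return characters from lines start_line..end_line (1-based, inclusive).

    Assumption: one character per line; blank lines are skipped.
    """
    selected = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number < start_line:
                continue
            if end_line is not None and line_number > end_line:
                break
            text = line.strip()
            if text:
                selected.append(text)
    return selected


# Demo with a throwaway file holding four characters.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "characters.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write("A\nB\nC\nD\n")
    subset = read_character_range(demo_path, start_line=2, end_line=3)
```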

### Resume from Checkpoint
```bash
python sample_batch.py \
    --characters "characters.txt" \
    --style_images "styles/" \
    --ttf_path "fonts/myfont.ttf" \
    --ckpt_dir "checkpoints/" \
    --output_dir "my_dataset/train_original" \
    --resume_from "my_dataset/train_original/results_checkpoint.json"
```
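
Conceptually, resuming means skipping the character/style pairs that a checkpoint already records. The sketch below assumes the checkpoint file carries the same `generations` entries (with `char_index` and `style_index`) as the results metadata described in this card. The helper `remaining_pairs` is hypothetical and does not reproduce the script's own resume logic.

```python
import json


def remaining_pairs(checkpoint_text, num_chars, num_styles):
    """List (char_index, style_index) pairs not yet present in a checkpoint."""
    checkpoint = json.loads(checkpoint_text)
    done = {
        (entry["char_index"], entry["style_index"])
        for entry in checkpoint.get("generations", [])
    }
    return [
        (char_index, style_index)
        for style_index in range(num_styles)
        for char_index in range(num_chars)
        if (char_index, style_index) not in done
    ]


# Toy checkpoint: two of the four possible pairs are already generated.
checkpoint_text = json.dumps(
    {
        "generations": [
            {"char_index": 0, "style_index": 0},
            {"char_index": 1, "style_index": 0},
        ]
    }
)
todo = remaining_pairs(checkpoint_text, num_chars=2, num_styles=2)
```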

## Model Performance

### Supported Tasks
- ✅ Single-character font generation
- ✅ Multi-character batch generation
- ✅ Multi-font support
- ✅ Multi-style transfer
- ✅ Index-based tracking for large-scale generation
- ✅ Checkpoint and resume support

### Output Format
```
output_dir/
├── ContentImage/               # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/                # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
├── results.json                # Comprehensive generation metadata
├── results_checkpoint.json     # Intermediate checkpoint (if save_interval > 0)
└── results_interrupted.json    # Emergency checkpoint (if interrupted)
```
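
The naming scheme in this tree can be expressed as two small path builders. This is a sketch derived only from the file names shown above; the function names are illustrative and are not part of `sample_batch.py`.

```python
from pathlib import PurePosixPath


def content_image_path(char_index):
    """Relative path of the content image for one character index."""
    return PurePosixPath("ContentImage") / f"char{char_index}.png"


def target_image_path(style_index, char_index):
    """Relative path of the generated image for one style/character pair."""
    style_name = f"style{style_index}"
    file_name = f"{style_name}+char{char_index}.png"
    return PurePosixPath("TargetImage") / style_name / file_name


content_path = str(content_image_path(1))
target_path = str(target_image_path(0, 1))
```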

### Results Metadata Structure
```json
{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
```
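
A results file with this shape can be sanity-checked after a run. The sketch below works on a trimmed copy of the example above and assumes that `total_possible_pairs` is the product of character and style counts and that `time_per_image` is `total_time / num_images`, as the example values suggest. The helper `check_results` is hypothetical.

```python
import json
import math

# Trimmed copy of the example metadata, inlined so the sketch is self-contained.
sample = json.loads("""
{
  "generations": [
    {"character": "A", "char_index": 0, "style": "style0", "style_index": 0}
  ],
  "metrics": {
    "inference_times": [
      {"style": "style0", "total_time": 2.45, "num_images": 100,
       "time_per_image": 0.0245}
    ]
  },
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
""")


def check_results(results):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    if results["total_chars"] != len(results["characters"]):
        problems.append("total_chars does not match characters")
    if results["total_styles"] != len(results["styles"]):
        problems.append("total_styles does not match styles")
    expected_pairs = results["total_chars"] * results["total_styles"]
    if results["total_possible_pairs"] != expected_pairs:
        problems.append("total_possible_pairs is not chars x styles")
    for timing in results["metrics"]["inference_times"]:
        expected = timing["total_time"] / timing["num_images"]
        if not math.isclose(timing["time_per_image"], expected, rel_tol=1e-6):
            problems.append(f"time_per_image mismatch for {timing['style']}")
    return problems


problems = check_results(sample)
# Fraction of possible character/style pairs that were actually generated.
coverage = len(sample["generations"]) / sample["total_possible_pairs"]
```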

## Evaluation Metrics

### Supported Metrics
- **LPIPS**: Learned perceptual image patch similarity (lower is better)
- **SSIM**: Structural similarity index (higher is better)
- **FID**: Fréchet Inception Distance (lower is better)
- **Inference Time**: Per-image generation time
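
The results metadata reports LPIPS and SSIM as `mean`, `std`, `min`, and `max`. A minimal sketch of that aggregation over per-image scores follows. It assumes a population standard deviation, which may not match what the evaluator actually computes, and the scores are made-up values.

```python
import statistics


def summarize(scores):
    """Aggregate per-image scores into mean/std/min/max.

    Assumption: population standard deviation; the real evaluator may use
    the sample standard deviation instead.
    """
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


ssim_scores = [0.5, 0.75, 1.0]  # made-up values for illustration
summary = summarize(ssim_scores)
```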

### Generate with Evaluation
```bash
python sample_batch.py \
    --characters "characters.txt" \
    --style_images "styles/" \
    --ttf_path "fonts/myfont.ttf" \
    --ckpt_dir "checkpoints/" \
    --output_dir "my_dataset/train_original" \
    --evaluate \
    --ground_truth_dir "ground_truth/" \
    --compute_fid
```

## Dataset

### Dataset Source
- **Name**: font-diffusion-generated-data
- **Link**: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data
- **Format**: ContentImage + TargetImage per style
- **Supports**: Multi-font, multi-character, multi-style generation

### Dataset Structure
```
FontDiffusion Dataset/
├── train_original/
│   ├── ContentImage/           # Character structure images
│   ├── TargetImage/            # Style-specific font renderings
│   └── results.json
├── val_original/
└── test_original/
```
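
Before training on a split, it can help to confirm that the expected entries are present. The sketch below checks a split directory against the three entries shown for `train_original`. Whether `val_original` and `test_original` mirror that layout is an assumption, and `missing_entries` is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Entries listed under train_original in the tree above.
EXPECTED_ENTRIES = ("ContentImage", "TargetImage", "results.json")


def missing_entries(split_dir):
    """Return the expected entries that do not exist inside split_dir."""
    split_dir = Path(split_dir)
    return [name for name in EXPECTED_ENTRIES if not (split_dir / name).exists()]


# Demo: a temporary split that lacks results.json.
with tempfile.TemporaryDirectory() as tmp:
    split = Path(tmp) / "train_original"
    (split / "ContentImage").mkdir(parents=True)
    (split / "TargetImage").mkdir()
    missing = missing_entries(split)
```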

## Training & Fine-tuning

### Fine-tuning from Checkpoint
```bash
python my_train.py \
    --ckpt_dir "checkpoints/" \
    --data_dir "my_dataset/train_original" \
    --output_dir "finetuned_ckpt/" \
    --num_epochs 5 \
    --learning_rate 1e-4 \
    --batch_size 4
```

### Convert & Upload Fine-tuned Models
```bash
python finetune_and_upload.py \
    --ckpt_dir "finetuned_ckpt/" \
    --hf_token "hf_xxxxx" \
    --hf_repo_id "username/font-diffusion-finetuned" \
    --num_epochs 5
```

## Technical Features

### Optimizations
- ✅ **Batch Processing**: Process multiple characters per style
- ✅ **Memory Efficiency**: Attention slicing (optional)
- ✅ **FP16 Support**: Reduced precision for faster inference
- ✅ **Torch Compile**: Optional model compilation
- ✅ **Channels Last Format**: Memory-optimized tensor layout
- ✅ **XFormers Support**: Fast attention implementation

### Robustness
- ✅ **Checkpoint & Resume**: Resume from interruptions
- ✅ **Index-based Tracking**: Handle large character sets (100K+)
- ✅ **Multi-font Support**: Process characters across multiple fonts
- ✅ **Error Recovery**: Graceful handling of missing fonts
- ✅ **Automatic Indexing**: Consistent char_index and style_index

### Monitoring
- ✅ **Weights & Biases Integration**: Real-time tracking
- ✅ **Progress Bars**: Detailed generation progress
- ✅ **Checkpoint Saving**: Periodic intermediate saves
- ✅ **Quality Metrics**: LPIPS, SSIM, FID computation

## Known Limitations

- A CUDA-capable GPU is required for practical generation speeds
- Each character must exist in at least one loaded font
- Style images should be normalized to 96×96 pixels (or be resizable to that size)
- Very large character sets (>100K) may require memory optimization
- FID computation requires a representative ground-truth dataset
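
The font-coverage limitation implies a per-character lookup across the loaded fonts. The sketch below illustrates that idea with in-memory sets of code points standing in for real font character maps. The font names, their coverage, and the helper `pick_font` are invented for illustration and do not describe how `FontManager` works.

```python
def pick_font(character, font_coverage):
    """Return the first font whose coverage includes the character, else None.

    font_coverage maps a font name to a set of supported code points
    (a hypothetical stand-in for a real font's character map).
    """
    code_point = ord(character)
    for font_name, code_points in font_coverage.items():
        if code_point in code_points:
            return font_name
    return None


# Invented coverage: printable ASCII, plus one CJK character in the second font.
font_coverage = {
    "LatinOnly": set(range(0x20, 0x7F)),
    "WithCJK": set(range(0x20, 0x7F)) | {ord("中")},
}
assignments = {ch: pick_font(ch, font_coverage) for ch in ["A", "中", "é"]}
# Characters that no loaded font covers would have to be skipped.
skipped = [ch for ch, font in assignments.items() if font is None]
```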

## Citation

```bibtex
@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Zhenhua Yang and Dezhi Peng and Yuxin Kong and Yuyi Zhang and Cong Yao and Lianwen Jin},
  year={2023}
}
```

## License

This model is licensed under the Apache License 2.0. See the LICENSE file for details.

## Contact & Support

For issues, questions, or contributions:
- GitHub: [FontDiffusion Repository]
- Hugging Face: [Model Card]
- Dataset: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data

---