---
license: apache-2.0
datasets:
- dzungpham/font-diffusion-generated-data
language:
- en
library_name: diffusers
tags:
- font-diffusion
- image-to-image
- contrastive-learning
- diffusers
- font-generation
- character-synthesis
- style-transfer
- dpm-solver
---

# Model Card for FontDiffuser

## Model Details

### Model Type

- **Architecture**: Diffusion-based font generation model
- **Framework**: PyTorch + Hugging Face Diffusers
- **Scheduler**: DPM-Solver++ (configurable: `dpmsolver++` / `dpmsolver`)
- **Guidance**: Classifier-free guidance
- **Base Model**: FontDiffuser with content and style encoders

### Model Components

1. **UNet**: Main diffusion model for image generation
2. **Content Encoder**: Extracts character structure information
3. **Style Encoder**: Extracts font style features
4. **DDPM/DPM Scheduler**: Noise scheduling for the diffusion process

### Training and Inference Configuration

- **Resolution**: 96×96 pixels
- **Batch Size**: 4-8 (configurable)
- **Inference Steps**: 15 (default, configurable)
- **Guidance Scale**: 7.5 (default, configurable)
- **Precision**: FP32, or FP16 (optional)
- **Device**: CUDA-capable GPU recommended

## Model Usage

### Installation

```bash
pip install diffusers torch torchvision safetensors
pip install lpips scikit-image pytorch-fid  # Optional: for evaluation
```

### Basic Generation

```python
from argparse import Namespace

from sample_batch import (
    FontManager,
    batch_generate_images,
    load_fontdiffuser_pipeline,
)

# Initialize font manager
font_manager = FontManager("path/to/font.ttf")

# Load pipeline
args = Namespace(
    ckpt_dir="path/to/checkpoints",
    device="cuda",
    num_inference_steps=15,
    guidance_scale=7.5,
    batch_size=4,
    # ... other args
)
pipe = load_fontdiffuser_pipeline(args)

# Optional metrics evaluator; assumed here that None disables evaluation
evaluator = None

# Generate images (CJK characters need a loaded font that contains them)
characters = ['A', 'B', 'C', '中', '国']
style_paths = ['style1.png', 'style2.png']
results = batch_generate_images(
    pipe,
    characters,
    style_paths,
    output_dir="output",
    args=args,
    evaluator=evaluator,
    font_manager=font_manager,
)
```

### Batch Generation with Checkpointing

```bash
python sample_batch.py \
    --characters "characters.txt" \
    --start_line 1 \
    --end_line 100 \
    --style_images "styles/" \
    --ttf_path "fonts/myfont.ttf" \
    --ckpt_dir "checkpoints/" \
    --output_dir "my_dataset/train_original" \
    --batch_size 4 \
    --num_inference_steps 15 \
    --guidance_scale 7.5 \
    --save_interval 10 \
    --device cuda
```

### Resume from Checkpoint

```bash
python sample_batch.py \
    --characters "characters.txt" \
    --style_images "styles/" \
    --ttf_path "fonts/myfont.ttf" \
    --ckpt_dir "checkpoints/" \
    --output_dir "my_dataset/train_original" \
    --resume_from "my_dataset/train_original/results_checkpoint.json"
```

## Model Capabilities

### Supported Tasks

- ✅ Single-character font generation
- ✅ Multi-character batch generation
- ✅ Multi-font support
- ✅ Multi-style transfer
- ✅ Index-based tracking for large-scale generation
- ✅ Checkpoint and resume support

### Output Format

```
output_dir/
├── ContentImage/                # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/                 # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
├── results.json                 # Comprehensive generation metadata
├── results_checkpoint.json      # Intermediate checkpoint (if save_interval > 0)
└── results_interrupted.json     # Emergency checkpoint (if interrupted)
```

### Results Metadata Structure

Example (values are illustrative):

```json
{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
```

## Evaluation Metrics

### Supported Metrics

- **LPIPS**: Learned Perceptual Image Patch Similarity (lower is better)
- **SSIM**: Structural Similarity Index Measure (higher is better)
- **FID**: Fréchet Inception Distance (lower is better)
- **Inference Time**: Per-image generation time

### Generate with Evaluation

```bash
python sample_batch.py \
    --characters "characters.txt" \
    --style_images "styles/" \
    --ttf_path "fonts/myfont.ttf" \
    --ckpt_dir "checkpoints/" \
    --output_dir "my_dataset/train_original" \
    --evaluate \
    --ground_truth_dir "ground_truth/" \
    --compute_fid
```

## Dataset

### Dataset Source

- **Name**: font-diffusion-generated-data
- **Link**: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data
- **Format**: `ContentImage/` plus `TargetImage/` organized per style
- **Supports**: Multi-font, multi-character, multi-style generation

### Dataset Structure

```
FontDiffusion Dataset/
├── train_original/
│   ├── ContentImage/    # Character structure images
│   ├── TargetImage/     # Style-specific font renderings
│   └── results.json
├── val_original/
└── test_original/
```

## Training & Fine-tuning

### Fine-tuning from Checkpoint

```bash
python my_train.py \
    --ckpt_dir "checkpoints/" \
    --data_dir "my_dataset/train_original" \
    --output_dir "finetuned_ckpt/" \
    --num_epochs 5 \
    --learning_rate 1e-4 \
    --batch_size 4
```

### Convert & Upload Fine-tuned Models

```bash
python finetune_and_upload.py \
    --ckpt_dir "finetuned_ckpt/" \
    --hf_token "hf_xxxxx" \
    --hf_repo_id "username/font-diffusion-finetuned" \
    --num_epochs 5
```

## Technical Features

### Optimizations

- ✅ **Batch Processing**: Processes multiple characters per style in each batch
- ✅ **Memory Efficiency**: Attention slicing (optional)
- ✅ **FP16 Support**: Reduced precision for faster inference
- ✅ **Torch Compile**: Optional model compilation
- ✅ **Channels Last Format**: Memory-optimized tensor layout
- ✅ **xFormers Support**: Fast attention implementation

### Robustness

- ✅ **Checkpoint & Resume**: Resume after interruptions
- ✅ **Index-based Tracking**: Handles large character sets (100K+)
- ✅ **Multi-font Support**: Processes characters across multiple fonts
- ✅ **Error Recovery**: Graceful handling of missing fonts
- ✅ **Automatic Indexing**: Consistent `char_index` and `style_index`

### Monitoring

- ✅ **Weights & Biases Integration**: Real-time tracking
- ✅ **Progress Bars**: Detailed generation progress
- ✅ **Checkpoint Saving**: Periodic intermediate saves
- ✅ **Quality Metrics**: LPIPS, SSIM, and FID computation

## Known Limitations

- Requires a CUDA-capable GPU for practical generation speeds
- Characters must exist in at least one loaded font
- Style images should be 96×96 or resizable to 96×96
- Very large character sets (>100K) may require memory optimization
- FID computation requires a representative ground-truth dataset

## Citation

```bibtex
@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Zhenhua Yang and Dezhi Peng and Yuxin Kong and Yuyi Zhang and Cong Yao and Lianwen Jin},
  year={2023}
}
```

## License

This model is licensed under the Apache License 2.0. See the LICENSE file for details.

## Contact & Support

For issues, questions, or contributions:

- GitHub: [FontDiffusion Repository]
- Hugging Face: [Model Card]
- Dataset: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data

---
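To sanity-check a finished (or resumed) run, the `results.json` file described under Results Metadata Structure can be summarized with the Python standard library alone. This is a minimal sketch that assumes the field names shown in that example (`generations`, `style`, `total_chars`, `total_styles`, `total_possible_pairs`); the helpers `summarize_results` and `load_results` are hypothetical and are not part of `sample_batch.py`.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_results(data: dict) -> dict:
    """Summarize a parsed results.json dict (hypothetical helper).

    Assumes the layout shown under "Results Metadata Structure": a
    "generations" list whose entries carry a "style" key, plus the
    "total_chars", "total_styles" and "total_possible_pairs" counters.
    """
    generations = data.get("generations", [])
    # Count generated images per style name
    per_style = Counter(entry["style"] for entry in generations)
    # Every (style, character) pair is a possible output image
    expected_pairs = data.get("total_chars", 0) * data.get("total_styles", 0)
    return {
        "images_per_style": dict(per_style),
        "generated": len(generations),
        "expected_pairs": expected_pairs,
        # Does the stored total agree with chars x styles?
        "pairs_consistent": expected_pairs == data.get("total_possible_pairs"),
        # Fraction of possible pairs that actually produced an image
        "coverage": len(generations) / expected_pairs if expected_pairs else 0.0,
    }


def load_results(path: str) -> dict:
    """Read results.json (or results_checkpoint.json) from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Inline example; for a real run use load_results("output/results.json")
    example = {
        "generations": [
            {"character": "A", "char_index": 0, "style": "style0", "style_index": 0},
            {"character": "B", "char_index": 1, "style": "style0", "style_index": 0},
            {"character": "A", "char_index": 0, "style": "style1", "style_index": 1},
        ],
        "total_chars": 3,
        "total_styles": 2,
        "total_possible_pairs": 6,
    }
    print(summarize_results(example))
```

A coverage below 1.0 on a completed run would suggest that some (style, character) pairs were skipped, for example characters that are missing from every loaded font.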