---
license: apache-2.0
datasets:
- dzungpham/font-diffusion-generated-data
language:
- en
library_name: diffusers
tags:
- font-diffusion
- image-to-image
- contrastive-learning
- diffusers
- font-generation
- character-synthesis
- style-transfer
- dpm-solver
---
# Model Card for FontDiffuser

## Model Details

### Model Type
- **Architecture**: Diffusion-based Font Generation Model
- **Framework**: PyTorch + Hugging Face Diffusers
- **Scheduler**: DPM-Solver++ (configurable: dpmsolver++ / dpmsolver)
- **Guidance**: Classifier-free guidance
- **Base Model**: FontDiffuser with Content and Style Encoders

### Model Components
1. **UNet**: Main diffusion model for image generation
2. **Content Encoder**: Extracts character structure information
3. **Style Encoder**: Extracts font style features
4. **DDPM/DPM Scheduler**: Noise scheduling for the diffusion process

### Training and Inference Configuration
- **Resolution**: 96×96 pixels
- **Batch Size**: 4-8 (configurable)
- **Inference Steps**: 15 (default, configurable)
- **Guidance Scale**: 7.5 (default, configurable)
- **Precision**: FP32, with optional FP16
- **Device**: CUDA/GPU recommended
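
Classifier-free guidance combines a conditional and an unconditional noise prediction at each denoising step, weighted by the guidance scale. The sketch below shows the standard combination rule on plain Python lists. The function name `apply_cfg` and the toy values are illustrative assumptions, not code from this repository, and the actual pipeline operates on tensors.

```python
from typing import List, Sequence


def apply_cfg(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float = 7.5,
) -> List[float]:
    """Standard classifier-free guidance: eps = eps_u + s * (eps_c - eps_u)."""
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values, for illustration only).
eps_u = [0.0, 1.0, -2.0]
eps_c = [1.0, 1.0, 0.0]

guided = apply_cfg(eps_u, eps_c, guidance_scale=2.0)
```

With `guidance_scale=1.0` the result equals the conditional prediction; larger values move the output further in the direction of the conditional prediction.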

## Model Usage

### Installation
```bash
pip install diffusers torch torchvision safetensors
pip install lpips scikit-image pytorch-fid  # Optional: for evaluation
```

### Basic Generation
```python
from argparse import Namespace

from sample_batch import (
    FontManager,
    batch_generate_images,
    load_fontdiffuser_pipeline,
)

# Initialize font manager
font_manager = FontManager("path/to/font.ttf")

# Load pipeline
args = Namespace(
    ckpt_dir="path/to/checkpoints",
    device="cuda",
    num_inference_steps=15,
    guidance_scale=7.5,
    batch_size=4,
    # ... other args
)
pipe = load_fontdiffuser_pipeline(args)

# Generate images
characters = ['A', 'B', 'C', '中', '国']
style_paths = ['style1.png', 'style2.png']

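# `evaluator` must be defined before the call. Passing None is an assumption
# (evaluation skipped); check sample_batch.py for the expected object.
evaluator = None

results = batch_generate_images(
    pipe, characters, style_paths,
    output_dir="output",
    args=args,
    evaluator=evaluator,
    font_manager=font_manager
)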
```

### Batch Generation with Checkpointing
```bash
python sample_batch.py \
  --characters "characters.txt" \
  --start_line 1 \
  --end_line 100 \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --batch_size 4 \
  --num_inference_steps 15 \
  --guidance_scale 7.5 \
  --save_interval 10 \
  --device cuda
```
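
The `--start_line` and `--end_line` flags select a slice of the character file. The sketch below illustrates one plausible selection rule, assuming one character per line and a 1-indexed, inclusive range. Both are assumptions, so check `sample_batch.py` for the actual semantics. The helper name `select_characters` is hypothetical.

```python
from typing import List, Optional


def select_characters(
    lines: List[str], start_line: int = 1, end_line: Optional[int] = None
) -> List[str]:
    """Return non-empty entries from lines start_line..end_line (1-indexed, inclusive)."""
    if start_line < 1:
        raise ValueError("start_line must be >= 1")
    selected = lines[start_line - 1 : end_line]  # end_line=None keeps the rest
    return [line.strip() for line in selected if line.strip()]


# Contents of a hypothetical characters.txt, one character per line.
file_lines = ["A\n", "B\n", "\n", "中\n", "国\n"]

subset = select_characters(file_lines, start_line=2, end_line=4)
```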

### Resume from Checkpoint
```bash
python sample_batch.py \
  --characters "characters.txt" \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --resume_from "my_dataset/train_original/results_checkpoint.json"
```
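
Resuming amounts to skipping the (character, style) pairs already recorded in the checkpoint. The sketch below illustrates this idea using the `char_index` and `style_index` fields from the results metadata. The helper `pending_pairs` is hypothetical, and the real script may track progress differently.

```python
import json
from typing import Dict, List, Set, Tuple


def pending_pairs(
    checkpoint: Dict, num_chars: int, num_styles: int
) -> List[Tuple[int, int]]:
    """List (char_index, style_index) pairs not yet present in the checkpoint."""
    done: Set[Tuple[int, int]] = {
        (g["char_index"], g["style_index"])
        for g in checkpoint.get("generations", [])
    }
    return [
        (c, s)
        for s in range(num_styles)
        for c in range(num_chars)
        if (c, s) not in done
    ]


# A tiny in-memory checkpoint; in practice this would be read with
# json.load() from results_checkpoint.json.
checkpoint = json.loads(
    '{"generations": [{"char_index": 0, "style_index": 0},'
    ' {"char_index": 1, "style_index": 0}]}'
)

todo = pending_pairs(checkpoint, num_chars=2, num_styles=2)
```

Here the first style is complete, so only the two pairs for the second style remain.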

## Model Performance

### Supported Tasks
- ✅ Single-character font generation
- ✅ Multi-character batch generation
- ✅ Multi-font support
- ✅ Multi-style transfer
- ✅ Index-based tracking for large-scale generation
- ✅ Checkpoint and resume support

### Output Format
```
output_dir/
├── ContentImage/              # Single set of content (character) images
│   ├── char0.png
│   ├── char1.png
│   └── ...
├── TargetImage/               # Generated font images organized by style
│   ├── style0/
│   │   ├── style0+char0.png
│   │   ├── style0+char1.png
│   │   └── ...
│   ├── style1/
│   │   └── ...
│   └── ...
├── results.json               # Comprehensive generation metadata
├── results_checkpoint.json    # Intermediate checkpoint (if save_interval > 0)
└── results_interrupted.json   # Emergency checkpoint (if interrupted)
```
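
The layout above implies a deterministic mapping from indices to file paths. A small sketch of that mapping follows. It is inferred from the directory tree shown here, and the helper names are hypothetical.

```python
from pathlib import PurePosixPath


def content_path(char_index: int) -> PurePosixPath:
    """Path of a content image, relative to output_dir."""
    return PurePosixPath("ContentImage") / f"char{char_index}.png"


def target_path(style_index: int, char_index: int) -> PurePosixPath:
    """Path of a generated image, relative to output_dir."""
    style = f"style{style_index}"
    return PurePosixPath("TargetImage") / style / f"{style}+char{char_index}.png"


example = str(target_path(style_index=0, char_index=1))
```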

### Results Metadata Structure
```json
{
  "generations": [
    {
      "character": "A",
      "char_index": 0,
      "style": "style0",
      "style_index": 0,
      "font": "Arial",
      "style_path": "path/to/style0.png",
      "output_path": "TargetImage/style0/style0+char0.png"
    }
  ],
  "metrics": {
    "lpips": {"mean": 0.25, "std": 0.08, "min": 0.1, "max": 0.5},
    "ssim": {"mean": 0.82, "std": 0.05, "min": 0.7, "max": 0.95},
    "fid": {"mean": 15.3, "std": 2.1},
    "inference_times": [
      {
        "style": "style0",
        "style_index": 0,
        "font": "Arial",
        "total_time": 2.45,
        "num_images": 100,
        "time_per_image": 0.0245
      }
    ]
  },
  "fonts": ["Arial", "Times New Roman"],
  "characters": ["A", "B", "C"],
  "styles": ["style0", "style1"],
  "total_chars": 3,
  "total_styles": 2,
  "total_possible_pairs": 6
}
```
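
Several fields in this structure can be derived from others, which allows a quick sanity check. The sketch below verifies two relationships visible in the example: `total_possible_pairs` equals `total_chars` times `total_styles`, and `time_per_image` equals `total_time` divided by `num_images`. The function `check_results` is hypothetical, and the relationships are inferred from the example values rather than documented guarantees.

```python
import math
from typing import Dict, List


def check_results(results: Dict) -> List[str]:
    """Return human-readable inconsistencies (empty list if none are found)."""
    problems = []
    expected_pairs = results["total_chars"] * results["total_styles"]
    if results["total_possible_pairs"] != expected_pairs:
        problems.append(f"total_possible_pairs should be {expected_pairs}")
    for entry in results.get("metrics", {}).get("inference_times", []):
        per_image = entry["total_time"] / entry["num_images"]
        if not math.isclose(per_image, entry["time_per_image"], rel_tol=1e-6):
            problems.append(f"time_per_image mismatch for {entry['style']}")
    return problems


# Values copied from the example metadata above.
results = {
    "total_chars": 3,
    "total_styles": 2,
    "total_possible_pairs": 6,
    "metrics": {
        "inference_times": [
            {"style": "style0", "total_time": 2.45, "num_images": 100,
             "time_per_image": 0.0245}
        ]
    },
}

problems = check_results(results)
```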

## Evaluation Metrics

### Supported Metrics
- **LPIPS**: Learned perceptual image patch similarity (lower is better)
- **SSIM**: Structural similarity index (higher is better)
- **FID**: Fréchet Inception Distance (lower is better)
- **Inference Time**: Per-image generation time
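
The `metrics` entries in `results.json` report each score as a mean, standard deviation, minimum, and maximum. A minimal sketch of such a summary over per-image scores follows. The helper `summarize` is hypothetical, and using the population standard deviation is an assumption (the script may use the sample version instead).

```python
import statistics
from typing import Dict, Sequence


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-image scores into mean/std/min/max."""
    if not scores:
        raise ValueError("need at least one score")
    return {
        "mean": statistics.fmean(scores),
        "std": statistics.pstdev(scores),  # population std (assumption)
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical per-image SSIM scores.
ssim_scores = [0.5, 0.75, 1.0]
summary = summarize(ssim_scores)
```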

### Generate with Evaluation
```bash
python sample_batch.py \
  --characters "characters.txt" \
  --style_images "styles/" \
  --ttf_path "fonts/myfont.ttf" \
  --ckpt_dir "checkpoints/" \
  --output_dir "my_dataset/train_original" \
  --evaluate \
  --ground_truth_dir "ground_truth/" \
  --compute_fid
```

## Dataset

### Dataset Source
- **Name**: font-diffusion-generated-data
- **Link**: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data
- **Format**: ContentImage + TargetImage per style
- **Supports**: Multi-font, multi-character, multi-style generation

### Dataset Structure
```
FontDiffusion Dataset/
├── train_original/
│   ├── ContentImage/          # Character structure images
│   ├── TargetImage/           # Style-specific font renderings
│   └── results.json
├── val_original/
└── test_original/
```

## Training & Fine-tuning

### Fine-tuning from Checkpoint
```bash
python my_train.py \
  --ckpt_dir "checkpoints/" \
  --data_dir "my_dataset/train_original" \
  --output_dir "finetuned_ckpt/" \
  --num_epochs 5 \
  --learning_rate 1e-4 \
  --batch_size 4
```

### Convert & Upload Fine-tuned Models
```bash
python finetune_and_upload.py \
  --ckpt_dir "finetuned_ckpt/" \
  --hf_token "hf_xxxxx" \
  --hf_repo_id "username/font-diffusion-finetuned" \
  --num_epochs 5
```

## Technical Features

### Optimizations
- ✅ **Batch Processing**: Process multiple characters per style
- ✅ **Memory Efficiency**: Attention slicing (optional)
- ✅ **FP16 Support**: Reduced precision for faster inference
- ✅ **Torch Compile**: Optional model compilation
- ✅ **Channels Last Format**: Memory-optimized tensor layout
- ✅ **XFormers Support**: Fast attention implementation

### Robustness
- ✅ **Checkpoint & Resume**: Resume from interruptions
- ✅ **Index-based Tracking**: Handle large character sets (100K+)
- ✅ **Multi-font Support**: Process characters across multiple fonts
- ✅ **Error Recovery**: Graceful handling of missing fonts
- ✅ **Automatic Indexing**: Consistent char_index and style_index
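
Multi-font support with graceful handling of missing glyphs can be pictured as a first-match lookup over the loaded fonts. The sketch below works on plain sets of characters. The `pick_font` helper and the coverage table are hypothetical stand-ins, and `FontManager` may resolve fonts differently.

```python
from typing import Dict, Optional, Set


def pick_font(char: str, coverage: Dict[str, Set[str]]) -> Optional[str]:
    """Return the first font whose glyph set contains char, or None."""
    for font_name, glyphs in coverage.items():  # dicts preserve insertion order
        if char in glyphs:
            return font_name
    return None


# Hypothetical glyph coverage for two loaded fonts.
coverage = {
    "LatinFont": {"A", "B", "C"},
    "CJKFont": {"A", "中", "国"},
}

assignments = {ch: pick_font(ch, coverage) for ch in ["A", "中", "€"]}
```

A character that maps to `None` has no usable font and would need to be skipped or reported instead of aborting the run.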

### Monitoring
- ✅ **Weights & Biases Integration**: Real-time tracking
- ✅ **Progress Bars**: Detailed generation progress
- ✅ **Checkpoint Saving**: Periodic intermediate saves
- ✅ **Quality Metrics**: LPIPS, SSIM, FID computation

## Known Limitations

- Requires a CUDA-capable GPU for practical generation speeds
- Characters must exist in at least one loaded font
- Style images should be 96×96 pixels or resizable to that resolution
- Very large character sets (>100K) may require memory optimization
- FID computation requires a representative ground-truth dataset

## Citation

```bibtex
@article{fontdiffuser2023,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Yang, Zhenhua and Peng, Dezhi and Kong, Yuxin and Zhang, Yuyi and Yao, Cong and Jin, Lianwen},
  year={2023}
}
```

## License

This model is licensed under the Apache License 2.0. See the LICENSE file for details.

## Contact & Support

For issues, questions, or contributions:
- GitHub: [FontDiffusion Repository]
- Hugging Face: [Model Card]
- Dataset: https://huggingface.co/datasets/dzungpham/font-diffusion-generated-data

---