| # Image to Cartoonify - Selfie to Cartoon Generator |
|
|
| ## Model Description |
|
|
| This is a conditional diffusion model that generates cartoon-style images conditioned on facial attributes. It is trained on cartoon images with paired attribute annotations; at inference time, the attributes are extracted from a real selfie. A custom U-Net with attribute conditioning maps those attributes to a cartoon rendering. |
|
|
| ## Architecture |
|
|
| - **Model Type**: Conditional Diffusion Model (Custom U-Net) |
| - **Base Architecture**: Custom OptimizedConditionedUNet |
| - **Input Resolution**: 256x256 RGB images |
| - **Conditioning**: 18-dimensional facial attribute vector |
| - **Parameters**: ~50M parameters |
| - **Diffusion Timesteps**: 1000 (training noise schedule) |
|
|
| ## Key Features |
|
|
| - **Facial Feature Extraction**: Uses MediaPipe for facial landmark detection |
| - **Attribute Conditioning**: 18 facial attributes, covering: |
|   - Eye angle, lashes, lid shape |
|   - Eyebrow shape, thickness, width |
|   - Face shape, chin length |
|   - Hair style and color |
|   - Facial hair presence |
|   - Glasses detection |
|   - Skin tone analysis |
| - **Fast Inference**: Generation in 15-50 denoising steps |
| - **Training Data**: Trained on 10,000 cartoon images with paired attributes |
|
|
| ## Training Details |
|
|
| ### Dataset |
| - **Source**: CartoonSet10k dataset |
| - **Size**: 10,000 cartoon images with CSV attribute annotations |
| - **Split**: 85% training, 15% validation |
| - **Augmentation**: Random flips, color jittering, rotation |
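
The card does not show how the CSV annotations become an 18-dimensional conditioning vector. The sketch below is one plausible way to do it, not the author's loader. It assumes each CSV row has the form `"name", variant_index, num_variants`, and that each attribute is normalized to [0, 1] by dividing the variant index by `num_variants - 1`. The names `ATTRIBUTES` and `parse_attributes` are hypothetical.

```python
import csv
import io

# Attribute order taken from the attribute map shown later in this card.
ATTRIBUTES = [
    "eye_angle", "eye_lashes", "eye_lid", "chin_length", "eyebrow_weight",
    "eyebrow_shape", "eyebrow_thickness", "face_shape", "facial_hair", "hair",
    "eye_color", "face_color", "hair_color", "glasses", "glasses_color",
    "eye_slant", "eyebrow_width", "eye_eyebrow_distance",
]


def parse_attributes(csv_text):
    """Convert 'name, variant, num_variants' rows into an 18-dim list in [0, 1]."""
    values = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        name, variant, total = row[0].strip(), int(row[1]), int(row[2])
        # Assumed normalization: variant index scaled by the number of variants.
        values[name] = variant / (total - 1) if total > 1 else 0.0

    missing = [name for name in ATTRIBUTES if name not in values]
    if missing:
        raise ValueError(f"Missing attributes: {missing}")
    return [values[name] for name in ATTRIBUTES]


# Synthetic annotation text for illustration only (not real dataset values).
sample_csv = "\n".join(f'"{name}", 1, 3' for name in ATTRIBUTES)
vector = parse_attributes(sample_csv)
print(len(vector), vector[:3])
```

Fixing the attribute order in one place keeps the training vector and the inference-time vector aligned index by index.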
|
|
| ### Training Configuration |
| - **Epochs**: 110 |
| - **Batch Size**: 16 |
| - **Learning Rate**: 2e-4 with cosine annealing |
| - **Optimization**: AdamW with gradient clipping |
| - **Mixed Precision**: FP16 for efficiency |
| - **Hardware**: NVIDIA T4 GPU |
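
To make the learning-rate schedule concrete, here is a minimal sketch of cosine annealing from the stated base rate of 2e-4. It assumes the rate is annealed once per epoch over all 110 epochs with a minimum of 0 and no warmup; the card does not state these details, and `cosine_annealed_lr` is a hypothetical helper.

```python
import math

BASE_LR = 2e-4   # from the training configuration above
EPOCHS = 110     # from the training configuration above


def cosine_annealed_lr(epoch, base_lr=BASE_LR, total_epochs=EPOCHS, min_lr=0.0):
    """Cosine annealing: base_lr at epoch 0, min_lr at total_epochs."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 55, 110):
    print(epoch, f"{cosine_annealed_lr(epoch):.2e}")
```

Under these assumptions the rate is halved at the midpoint of training and decays smoothly to the minimum at the end.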
|
|
| ### Loss Function and Noise Schedule |
| - **Primary**: MSE loss on predicted noise |
| - **Scheduler**: DDPM with scaled linear beta schedule |
| - **Beta Range**: 0.00085 to 0.012 |
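
The sketch below illustrates the schedule and loss in plain Python. A scaled linear schedule interpolates linearly between the square roots of the beta endpoints and then squares the result. The forward noising step and the MSE on noise follow the standard DDPM formulation. The helper names are hypothetical, and the toy 8-element "image" stands in for a real tensor.

```python
import math
import random


def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_steps=1000):
    """Interpolate linearly in sqrt(beta), then square."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_steps - 1)) ** 2 for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) for s <= t."""
    result, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        result.append(running)
    return result


def add_noise(x0, noise, alpha_bar_t):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]


def mse(prediction, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)


betas = scaled_linear_betas()
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
x0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # toy "image" in [-1, 1]
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]   # the target the model must predict
x_t = add_noise(x0, noise, alpha_bars[500])

perfect_loss = mse(noise, noise)          # an ideal predictor has zero loss
zero_guess_loss = mse([0.0] * 8, noise)   # always predicting zero does not
print(perfect_loss, zero_guess_loss > 0.0)
```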
|
|
| ## Usage |
|
|
| ### Installation |
| ```bash |
| pip install torch torchvision |
| pip install diffusers |
| pip install mediapipe |
| pip install opencv-python |
| pip install Pillow numpy |
| ``` |
|
|
| ### Basic Usage |
| ```python |
| import torch |
| from PIL import Image |
| import numpy as np |
| from your_model import OptimizedConditionedUNet, OptimizedMediaPipeExtractor |
| from diffusers import DDPMScheduler |
| |
| # Load model |
| device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
| model = OptimizedConditionedUNet( |
|     in_channels=3, |
|     out_channels=3, |
|     attr_dim=18, |
|     base_channels=64 |
| ).to(device) |
| |
| # Load checkpoint |
| checkpoint = torch.load('best_model.pt', map_location=device) |
| model.load_state_dict(checkpoint['model_state_dict']) |
| model.eval() |
| |
| # Initialize components |
| noise_scheduler = DDPMScheduler( |
|     num_train_timesteps=1000, |
|     beta_start=0.00085, |
|     beta_end=0.012, |
|     beta_schedule="scaled_linear", |
|     prediction_type="epsilon" |
| ) |
| |
| mp_extractor = OptimizedMediaPipeExtractor() |
| |
| # Generate cartoon from selfie |
| def generate_cartoon(selfie_path, output_path): |
|     """Generate a cartoon from a selfie and save it to output_path.""" |
|     # Extract the 18-dimensional attribute vector and add a batch dimension |
|     features = mp_extractor.extract_features(selfie_path) |
|     features = features.unsqueeze(0).to(device) |
| |
|     with torch.no_grad(): |
|         # Start from pure Gaussian noise |
|         image = torch.randn(1, 3, 256, 256, device=device) |
| |
|         # Iteratively denoise over 50 inference steps |
|         noise_scheduler.set_timesteps(50) |
|         for t in noise_scheduler.timesteps: |
|             # Pass the current timestep to the model as a batch of integers |
|             timesteps = torch.full((1,), int(t), device=device, dtype=torch.long) |
|             noise_pred = model(image, timesteps, features) |
|             image = noise_scheduler.step(noise_pred, t, image).prev_sample |
| |
|     # Map from [-1, 1] to [0, 255] and convert to an HWC uint8 array |
|     image = (image / 2 + 0.5).clamp(0, 1) |
|     image = image.cpu().squeeze(0).permute(1, 2, 0).numpy() |
|     image = (image * 255).round().astype(np.uint8) |
| |
|     result = Image.fromarray(image) |
|     result.save(output_path) |
|     return result |
| |
| # Usage |
| cartoon = generate_cartoon('selfie.jpg', 'cartoon.png') |
| ``` |
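
The `set_timesteps(50)` call above picks 50 of the 1000 training timesteps to visit during sampling. The sketch below shows which ones, assuming the scheduler uses "leading" timestep spacing with no offset (believed to be the `DDPMScheduler` default, but worth checking against your installed `diffusers` version). `leading_timesteps` is a hypothetical stand-in, not a library function.

```python
def leading_timesteps(num_train_timesteps=1000, num_inference_steps=50):
    """Evenly strided timesteps in descending order ('leading' spacing, no offset)."""
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("num_inference_steps must be in [1, num_train_timesteps]")
    step_ratio = num_train_timesteps // num_inference_steps
    return [i * step_ratio for i in reversed(range(num_inference_steps))]


timesteps = leading_timesteps()
print(timesteps[:3], timesteps[-1], len(timesteps))  # [980, 960, 940] 0 50
```

Fewer inference steps means a larger stride between visited timesteps, which trades sample quality for speed.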
|
|
| ### Advanced Usage |
| ```python |
| # Custom attribute manipulation |
| def generate_with_custom_attributes(base_features, modifications): |
|     """ |
|     Generate a cartoon with modified attributes. |
| |
|     Args: |
|         base_features: Feature tensor of shape (1, 18) extracted from a selfie |
|         modifications: Dict mapping attribute names to new values, |
|             e.g., {'hair_color': 0.8, 'glasses': 0.9} |
|     """ |
|     modified_features = base_features.clone() |
| |
|     # Index of each attribute in the 18-dimensional conditioning vector |
|     attribute_map = { |
|         'eye_angle': 0, 'eye_lashes': 1, 'eye_lid': 2, |
|         'chin_length': 3, 'eyebrow_weight': 4, 'eyebrow_shape': 5, |
|         'eyebrow_thickness': 6, 'face_shape': 7, 'facial_hair': 8, |
|         'hair': 9, 'eye_color': 10, 'face_color': 11, |
|         'hair_color': 12, 'glasses': 13, 'glasses_color': 14, |
|         'eye_slant': 15, 'eyebrow_width': 16, 'eye_eyebrow_distance': 17 |
|     } |
| |
|     # Attribute names not present in the map are ignored |
|     for attr_name, value in modifications.items(): |
|         if attr_name in attribute_map: |
|             modified_features[0, attribute_map[attr_name]] = value |
| |
|     # generate_from_features is not defined in this card; it is assumed to run |
|     # the denoising loop from generate_cartoon on a ready-made feature tensor |
|     return generate_from_features(modified_features) |
| ``` |
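
The function above silently ignores attribute names that are not in its map, so a typo such as `'hair_colour'` would go unnoticed. A stricter variant could look like the sketch below. It works on a plain Python list rather than a tensor, and it assumes attribute values are normalized to [0, 1], which the example values 0.8 and 0.9 suggest but the card does not state. `apply_modifications` is a hypothetical helper.

```python
ATTRIBUTE_INDEX = {
    name: index
    for index, name in enumerate([
        "eye_angle", "eye_lashes", "eye_lid", "chin_length", "eyebrow_weight",
        "eyebrow_shape", "eyebrow_thickness", "face_shape", "facial_hair", "hair",
        "eye_color", "face_color", "hair_color", "glasses", "glasses_color",
        "eye_slant", "eyebrow_width", "eye_eyebrow_distance",
    ])
}


def apply_modifications(features, modifications):
    """Return a modified copy of an 18-element feature list, validating the input."""
    if len(features) != len(ATTRIBUTE_INDEX):
        raise ValueError(f"Expected {len(ATTRIBUTE_INDEX)} features, got {len(features)}")

    unknown = sorted(set(modifications) - set(ATTRIBUTE_INDEX))
    if unknown:
        raise KeyError(f"Unknown attributes: {unknown}")

    updated = list(features)  # copy so the caller's list is left untouched
    for name, value in modifications.items():
        # Assumed value range; adjust if the extractor uses a different scale.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
        updated[ATTRIBUTE_INDEX[name]] = float(value)
    return updated


base = [0.5] * 18
modified = apply_modifications(base, {"hair_color": 0.8, "glasses": 0.9})
print(modified[12], modified[13])
```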
|
|
| ## Model Performance |
|
|
| ### Metrics |
| - **Training Loss**: 0.0234 (final) |
| - **Validation Loss**: 0.0251 (best) |
| - **Inference Time**: ~2-3 seconds (50 steps, GPU) |
| - **Memory Usage**: ~4GB GPU memory |
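
As a rough guide to latency at other step counts, the reported 2-3 seconds for 50 steps works out to about 40-60 ms per denoising step. The sketch below extrapolates from that figure, assuming time scales linearly with the number of steps and ignoring fixed costs such as feature extraction and model loading. The estimates are arithmetic, not measurements.

```python
def estimate_seconds(steps, low=2.0, high=3.0, reference_steps=50):
    """Scale the reported 50-step timing range linearly to another step count."""
    return (steps * low / reference_steps, steps * high / reference_steps)


for steps in (15, 25, 50):
    fastest, slowest = estimate_seconds(steps)
    print(f"{steps} steps: ~{fastest:.1f}-{slowest:.1f} s")
```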
|
|
| ### Evaluation (Qualitative) |
| - **Facial Feature Preservation**: High fidelity in maintaining key facial characteristics |
| - **Style Consistency**: Consistent cartoon art style across generations |
| - **Attribute Control**: Precise control over 18 facial attributes |
| - **Robustness**: Handles various lighting conditions and face angles |
|
|
| ## Limitations |
|
|
| 1. **Face Detection Dependency**: Requires clear facial landmarks for optimal results |
| 2. **Resolution**: Fixed 256x256 output resolution |
| 3. **Style Scope**: Limited to cartoon style present in training data |
| 4. **Attribute Granularity**: 18 attributes may not capture all facial variations |
| 5. **Background**: Focuses on face region, may not handle complex backgrounds well |
|
|
| ## Ethical Considerations |
|
|
| - **Consent**: Ensure proper consent when processing personal photos |
| - **Bias**: Model may reflect biases present in training data |
| - **Privacy**: Consider privacy implications when processing facial data |
| - **Misuse**: Potential for creating misleading or fake content |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{image_to_cartoonify_2024, |
| title={Image to Cartoonify: Selfie to Cartoon Generator}, |
| author={wizcodes12}, |
| year={2024}, |
| howpublished={\url{https://huggingface.co/wizcodes12/image_to_cartoonify}}, |
| } |
| ``` |
|
|
| ## License |
|
|
| This model is released under the MIT License. See LICENSE file for details. |
|
|
| ## Acknowledgments |
|
|
| - CartoonSet10k dataset creators |
| - MediaPipe team for facial landmark detection |
| - Diffusers library by Hugging Face |
| - PyTorch team for the deep learning framework |
|
|
| ## Updates |
|
|
| - **v1.0**: Initial release with 110 epochs of training |
| - **v1.1**: Improved feature extraction and normalization |
| - **v1.2**: Enhanced attribute conditioning and inference speed |
|
|
| ## Contact |
|
|
| For questions, issues, or collaborations, please open an issue on the repository or contact wizcodes12@example.com. |
|
|
| --- |
|
|
| *Generated with ❤️ using PyTorch and Diffusers by wizcodes12* |