# Image to Cartoonify - Selfie to Cartoon Generator

## Model Description

This is a conditional diffusion model trained to generate cartoon-style images from facial features extracted from real selfies. The model uses a custom U-Net architecture with attribute conditioning to transform realistic facial features into cartoon representations.

## Architecture

- **Model Type**: Conditional Diffusion Model (Custom U-Net)
- **Base Architecture**: Custom OptimizedConditionedUNet
- **Input Resolution**: 256x256 RGB images
- **Conditioning**: 18-dimensional facial attribute vector
- **Parameters**: ~50M
- **Training Steps**: 1,000 diffusion timesteps
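The released OptimizedConditionedUNet is custom and its internals are not published here, but the general pattern of attribute conditioning in diffusion U-Nets can be sketched: a sinusoidal timestep embedding and a projection of the 18-dimensional attribute vector are combined into one conditioning vector that shifts the features of each convolutional block. The class and layer names below are illustrative assumptions, not the real architecture:

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Standard sinusoidal embedding of diffusion timesteps."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

class ConditionedBlock(nn.Module):
    """Conv block whose features are shifted by a timestep+attribute embedding."""
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.emb_proj = nn.Linear(emb_dim, channels)

    def forward(self, x, emb):
        h = self.conv(x)
        # Broadcast the conditioning vector over the spatial dimensions
        return torch.relu(h + self.emb_proj(emb)[:, :, None, None])

class TinyConditionedUNet(nn.Module):
    """Illustrative stand-in for OptimizedConditionedUNet (not the real model)."""
    def __init__(self, in_channels=3, out_channels=3, attr_dim=18, base_channels=64):
        super().__init__()
        emb_dim = base_channels * 4
        self.attr_mlp = nn.Sequential(nn.Linear(attr_dim, emb_dim), nn.SiLU())
        self.time_mlp = nn.Sequential(nn.Linear(base_channels, emb_dim), nn.SiLU())
        self.stem = nn.Conv2d(in_channels, base_channels, 3, padding=1)
        self.block = ConditionedBlock(base_channels, emb_dim)
        self.head = nn.Conv2d(base_channels, out_channels, 3, padding=1)
        self.base_channels = base_channels

    def forward(self, x, timesteps, attrs):
        # Sum the two embeddings into a single conditioning signal
        emb = self.time_mlp(timestep_embedding(timesteps, self.base_channels))
        emb = emb + self.attr_mlp(attrs)
        return self.head(self.block(self.stem(x), emb))
```

The real model would add down/up-sampling stages and skip connections; this sketch only shows how the 18-dim vector reaches the convolutional features.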
## Key Features

- **Facial Feature Extraction**: Uses MediaPipe for robust facial landmark detection
- **Attribute Conditioning**: 18 facial attributes, including:
  - Eye angle, lashes, and lid shape
  - Eyebrow shape, thickness, and width
  - Face shape and chin length
  - Hair style and color
  - Facial hair presence
  - Glasses detection
  - Skin tone analysis
- **Fast Generation**: Optimized for quick inference (15-50 denoising steps)
- **High Quality**: Trained on 10,000+ cartoon images with paired attribute annotations
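MediaPipe's FaceMesh produces 468 normalized landmarks per face, and the extractor reduces that geometry to scalar attributes. As an illustration of the idea only, the eye-angle attribute could be derived from the tilt of the line between the two outer eye corners; the landmark indices and the [0, 1] scaling below are assumptions, not the extractor's actual mapping:

```python
import numpy as np

# Canonical MediaPipe FaceMesh indices for the outer eye corners
# (33 = right eye outer, 263 = left eye outer).
RIGHT_EYE_OUTER, LEFT_EYE_OUTER = 33, 263

def eye_angle_attribute(landmarks: np.ndarray) -> float:
    """Map the tilt of the eye line to a value in [0, 1].

    landmarks: (468, 2) array of normalized (x, y) FaceMesh coordinates.
    0.5 corresponds to perfectly level eyes; the +/-45 degree clip and the
    [0, 1] scaling are illustrative choices, not the model's.
    """
    dx, dy = landmarks[LEFT_EYE_OUTER] - landmarks[RIGHT_EYE_OUTER]
    angle = np.degrees(np.arctan2(dy, dx))  # 0 degrees = level eyes
    return float(np.clip(angle, -45, 45) / 90 + 0.5)
```

The other 17 attributes would follow the same pattern: a small geometric or color measurement, normalized into a fixed range before being fed to the U-Net.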
## Training Details

### Dataset

- **Source**: CartoonSet10k dataset
- **Size**: 10,000 cartoon images with CSV attribute annotations
- **Split**: 85% training / 15% validation
- **Augmentation**: Random flips, color jittering, rotation
### Training Configuration

- **Epochs**: 110
- **Batch Size**: 16
- **Learning Rate**: 2e-4 with cosine annealing
- **Optimizer**: AdamW with gradient clipping
- **Mixed Precision**: FP16 for efficiency
- **Hardware**: NVIDIA T4 GPU
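Put together, the configuration above corresponds to a training step like the following sketch. The weight-decay and clip-norm values are illustrative defaults, not published hyperparameters:

```python
import torch
import torch.nn as nn

def make_training_step(model, steps_per_epoch, epochs=110, lr=2e-4):
    """AdamW + cosine annealing + FP16 AMP step, per the configuration above.

    Weight decay (0.01) and clip norm (1.0) are assumed values.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs * steps_per_epoch)
    use_amp = torch.cuda.is_available()  # FP16 autocast only on GPU
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

    def step(loss_fn):
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast(enabled=use_amp):
            loss = loss_fn()
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)  # so clipping sees the true gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()
        return loss.item()

    return step
```

Note the order: gradients are unscaled before clipping, otherwise the clip threshold would apply to the loss-scaled gradients.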
### Loss Function

- **Primary**: MSE loss on the predicted noise (epsilon prediction)
- **Scheduler**: DDPM with a scaled-linear beta schedule
- **Beta Range**: 0.00085 to 0.012
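Concretely, the "scaled_linear" schedule used by diffusers interpolates linearly in sqrt(beta) space and then squares, and the MSE objective is computed against the standard DDPM forward (noising) process. A few lines reproduce both, matching the beta range above:

```python
import torch

T = 1000
beta_start, beta_end = 0.00085, 0.012

# "scaled_linear": linear in sqrt(beta), then squared (as in diffusers).
betas = torch.linspace(beta_start ** 0.5, beta_end ** 0.5, T) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, noise, t):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

def diffusion_loss(model, x0, attrs):
    """Training objective: MSE between predicted and true noise."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    x_t = add_noise(x0, noise, t)
    return torch.nn.functional.mse_loss(model(x_t, t, attrs), noise)
```

Here `model(x_t, t, attrs)` follows the conditioned-U-Net call signature used in the usage examples below.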
## Usage

### Installation

```bash
pip install torch torchvision diffusers mediapipe opencv-python Pillow numpy
```
### Basic Usage

```python
import torch
import numpy as np
from PIL import Image
from diffusers import DDPMScheduler

from your_model import OptimizedConditionedUNet, OptimizedMediaPipeExtractor

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = OptimizedConditionedUNet(
    in_channels=3,
    out_channels=3,
    attr_dim=18,
    base_channels=64
).to(device)

# Load checkpoint
checkpoint = torch.load('best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Initialize the noise scheduler and the feature extractor
noise_scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    prediction_type="epsilon"
)
mp_extractor = OptimizedMediaPipeExtractor()

def generate_cartoon(selfie_path, output_path):
    """Generate a cartoon from a selfie and save it to output_path."""
    # Extract the 18-dim facial attribute vector from the selfie
    features = mp_extractor.extract_features(selfie_path)
    features = features.unsqueeze(0).to(device)

    with torch.no_grad():
        # Start from pure Gaussian noise
        image = torch.randn(1, 3, 256, 256, device=device)

        # Iterative denoising (50 steps)
        noise_scheduler.set_timesteps(50)
        for t in noise_scheduler.timesteps:
            timesteps = torch.full((1,), int(t), device=device, dtype=torch.long)
            noise_pred = model(image, timesteps, features)
            image = noise_scheduler.step(noise_pred, t, image).prev_sample

    # Map from [-1, 1] back to an 8-bit RGB image and save
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().squeeze(0).permute(1, 2, 0).numpy()
    image = (image * 255).astype(np.uint8)
    result = Image.fromarray(image)
    result.save(output_path)
    return result

# Usage
cartoon = generate_cartoon('selfie.jpg', 'cartoon.png')
```
### Advanced Usage

```python
# Custom attribute manipulation
def generate_with_custom_attributes(base_features, modifications):
    """Generate a cartoon with modified attributes.

    Args:
        base_features: Original facial features extracted from the selfie,
            shape (1, 18).
        modifications: Dict mapping attribute names to new values,
            e.g. {'hair_color': 0.8, 'glasses': 0.9}.
    """
    modified_features = base_features.clone()
    attribute_map = {
        'eye_angle': 0, 'eye_lashes': 1, 'eye_lid': 2,
        'chin_length': 3, 'eyebrow_weight': 4, 'eyebrow_shape': 5,
        'eyebrow_thickness': 6, 'face_shape': 7, 'facial_hair': 8,
        'hair': 9, 'eye_color': 10, 'face_color': 11,
        'hair_color': 12, 'glasses': 13, 'glasses_color': 14,
        'eye_slant': 15, 'eyebrow_width': 16, 'eye_eyebrow_distance': 17
    }
    for attr_name, value in modifications.items():
        if attr_name in attribute_map:
            modified_features[0, attribute_map[attr_name]] = value

    # generate_from_features runs the same denoising loop as generate_cartoon,
    # but starting from an explicit attribute vector.
    return generate_from_features(modified_features)
```
## Model Performance

### Metrics

- **Training Loss**: 0.0234 (final)
- **Validation Loss**: 0.0251 (best)
- **Inference Time**: ~2-3 seconds (50 steps on GPU)
- **Memory Usage**: ~4 GB GPU memory

### Evaluation

- **Facial Feature Preservation**: High fidelity in maintaining key facial characteristics
- **Style Consistency**: Consistent cartoon art style across generations
- **Attribute Control**: Precise control over all 18 facial attributes
- **Robustness**: Handles varied lighting conditions and face angles

## Limitations

1. **Face Detection Dependency**: Requires clear facial landmarks for optimal results
2. **Resolution**: Fixed 256x256 output resolution
3. **Style Scope**: Limited to the cartoon style present in the training data
4. **Attribute Granularity**: 18 attributes may not capture all facial variation
5. **Background**: Focuses on the face region and may not handle complex backgrounds well
## Ethical Considerations

- **Consent**: Ensure proper consent when processing personal photos
- **Bias**: The model may reflect biases present in the training data
- **Privacy**: Consider privacy implications when processing facial data
- **Misuse**: Potential exists for creating misleading or fake content

## Citation

```bibtex
@misc{image_to_cartoonify_2024,
  title={Image to Cartoonify: Selfie to Cartoon Generator},
  author={wizcodes12},
  year={2024},
  howpublished={\url{https://huggingface.co/wizcodes12/image_to_cartoonify}},
}
```
## License

This model is released under the MIT License. See the LICENSE file for details.

## Acknowledgments

- CartoonSet10k dataset creators
- MediaPipe team for facial landmark detection
- Diffusers library by Hugging Face
- PyTorch team for the deep learning framework

## Updates

- **v1.0**: Initial release with 110 epochs of training
- **v1.1**: Improved feature extraction and normalization
- **v1.2**: Enhanced attribute conditioning and inference speed

## Contact

For questions, issues, or collaborations, please open an issue on the repository or contact wizcodes12@example.com.

---

*Generated with ❤️ using PyTorch and Diffusers by wizcodes12*