
Image to Cartoonify - Selfie to Cartoon Generator

Model Description

This is a conditional diffusion model trained to generate cartoon-style images from facial features extracted from real selfies. The model uses a custom U-Net architecture with attribute conditioning to transform realistic facial features into cartoon representations.

Architecture

  • Model Type: Conditional Diffusion Model (Custom U-Net)
  • Base Architecture: Custom OptimizedConditionedUNet
  • Input Resolution: 256x256 RGB images
  • Conditioning: 18-dimensional facial attribute vector
  • Parameters: ~50M parameters
  • Training Steps: 1000 diffusion timesteps

Key Features

  • Facial Feature Extraction: Uses MediaPipe for robust facial landmark detection
  • Attribute Conditioning: 18 facial attributes including:
    • Eye angle, lashes, lid shape
    • Eyebrow shape, thickness, width
    • Face shape, chin length
    • Hair style and color
    • Facial hair presence
    • Glasses detection
    • Skin tone analysis
  • Real-time Generation: Optimized for fast inference (15-50 steps)
  • High Quality: Trained on 10k+ cartoon images with paired attributes

Training Details

Dataset

  • Source: CartoonSet10k dataset
  • Size: 10,000 cartoon images with CSV attribute annotations
  • Split: 85% training, 15% validation
  • Augmentation: Random flips, color jittering, rotation

Training Configuration

  • Epochs: 110
  • Batch Size: 16
  • Learning Rate: 2e-4 with cosine annealing
  • Optimization: AdamW with gradient clipping
  • Mixed Precision: FP16 for efficiency
  • Hardware: NVIDIA T4 GPU
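The optimization setup above can be sketched as follows; the tiny linear stand-in model, the clipping norm of 1.0, and `T_max=110` (one scheduler cycle over the 110 epochs) are placeholders rather than values from the repository.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for OptimizedConditionedUNet
device = "cuda" if torch.cuda.is_available() else "cpu"

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=110)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # FP16 on GPU only

def train_step(batch, target):
    optimizer.zero_grad()
    # Autocast runs the forward pass in FP16 when CUDA is available
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                               # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

`lr_scheduler.step()` would be called once per epoch so the learning rate anneals to near zero by epoch 110.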

Loss Function

  • Primary: MSE loss on predicted noise
  • Scheduler: DDPM with scaled linear beta schedule
  • Beta Range: 0.00085 to 0.012

Usage

Installation

pip install torch torchvision
pip install diffusers
pip install mediapipe
pip install opencv-python
pip install Pillow numpy

Basic Usage

import torch
from PIL import Image
import numpy as np
from your_model import OptimizedConditionedUNet, OptimizedMediaPipeExtractor
from diffusers import DDPMScheduler

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = OptimizedConditionedUNet(
    in_channels=3,
    out_channels=3,
    attr_dim=18,
    base_channels=64
).to(device)

# Load checkpoint
checkpoint = torch.load('best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Initialize components
noise_scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    prediction_type="epsilon"
)

mp_extractor = OptimizedMediaPipeExtractor()

# Generate cartoon from selfie
def generate_cartoon(selfie_path, output_path):
    # Extract facial features
    features = mp_extractor.extract_features(selfie_path)
    features = features.unsqueeze(0).to(device)
    
    # Generate cartoon
    with torch.no_grad():
        # Start with noise
        image = torch.randn(1, 3, 256, 256).to(device)
        
        # Denoising process
        noise_scheduler.set_timesteps(50)
        for t in noise_scheduler.timesteps:
            timesteps = torch.full((1,), t, device=device).long()
            noise_pred = model(image, timesteps, features)
            image = noise_scheduler.step(noise_pred, t, image).prev_sample
        
        # Save result
        image = (image / 2 + 0.5).clamp(0, 1)
        image = image.cpu().squeeze(0).permute(1, 2, 0).numpy()
        image = (image * 255).astype(np.uint8)
        
        result = Image.fromarray(image)
        result.save(output_path)
        return result

# Usage
cartoon = generate_cartoon('selfie.jpg', 'cartoon.png')

Advanced Usage

# Custom attribute manipulation
def generate_with_custom_attributes(base_features, modifications):
    """
    Generate cartoon with modified attributes
    
    Args:
        base_features: Original facial features from selfie
        modifications: Dict of attribute modifications
                      e.g., {'hair_color': 0.8, 'glasses': 0.9}
    """
    modified_features = base_features.clone()
    
    attribute_map = {
        'eye_angle': 0, 'eye_lashes': 1, 'eye_lid': 2,
        'chin_length': 3, 'eyebrow_weight': 4, 'eyebrow_shape': 5,
        'eyebrow_thickness': 6, 'face_shape': 7, 'facial_hair': 8,
        'hair': 9, 'eye_color': 10, 'face_color': 11,
        'hair_color': 12, 'glasses': 13, 'glasses_color': 14,
        'eye_slant': 15, 'eyebrow_width': 16, 'eye_eyebrow_distance': 17
    }
    
    for attr_name, value in modifications.items():
        if attr_name in attribute_map:
            modified_features[0, attribute_map[attr_name]] = value
    
    return generate_from_features(modified_features)

Model Performance

Metrics

  • Training Loss: 0.0234 (final)
  • Validation Loss: 0.0251 (best)
  • Inference Time: ~2-3 seconds (50 steps, GPU)
  • Memory Usage: ~4GB GPU memory

Evaluation

  • Facial Feature Preservation: High fidelity in maintaining key facial characteristics
  • Style Consistency: Consistent cartoon art style across generations
  • Attribute Control: Precise control over 18 facial attributes
  • Robustness: Handles various lighting conditions and face angles

Limitations

  1. Face Detection Dependency: Requires clear facial landmarks for optimal results
  2. Resolution: Fixed 256x256 output resolution
  3. Style Scope: Limited to cartoon style present in training data
  4. Attribute Granularity: 18 attributes may not capture all facial variations
  5. Background: Focuses on the face region and may not handle complex backgrounds well

Ethical Considerations

  • Consent: Ensure proper consent when processing personal photos
  • Bias: Model may reflect biases present in training data
  • Privacy: Consider privacy implications when processing facial data
  • Misuse: Potential for creating misleading or fake content

Citation

@misc{image_to_cartoonify_2024,
  title={Image to Cartoonify: Selfie to Cartoon Generator},
  author={wizcodes12},
  year={2024},
  howpublished={\url{https://huggingface.co/wizcodes12/image_to_cartoonify}},
}

License

This model is released under the MIT License. See LICENSE file for details.

Acknowledgments

  • CartoonSet10k dataset creators
  • MediaPipe team for facial landmark detection
  • Diffusers library by Hugging Face
  • PyTorch team for the deep learning framework

Updates

  • v1.0: Initial release with 110 epochs of training
  • v1.1: Improved feature extraction and normalization
  • v1.2: Enhanced attribute conditioning and inference speed

Contact

For questions, issues, or collaborations, please open an issue on the repository or contact wizcodes12@example.com.


Generated with ❤️ using PyTorch and Diffusers by wizcodes12