
Image to Cartoonify - Selfie to Cartoon Generator

Model Description

This is a conditional diffusion model trained to generate cartoon-style images from facial features extracted from real selfies. The model uses a custom U-Net architecture with attribute conditioning to transform realistic facial features into cartoon representations.

Architecture

  • Model Type: Conditional Diffusion Model (Custom U-Net)
  • Base Architecture: Custom OptimizedConditionedUNet
  • Input Resolution: 256x256 RGB images
  • Conditioning: 18-dimensional facial attribute vector
  • Parameters: ~50M parameters
  • Training Steps: 1000 diffusion timesteps

Key Features

  • Facial Feature Extraction: Uses MediaPipe for robust facial landmark detection
  • Attribute Conditioning: 18 facial attributes including:
    • Eye angle, lashes, lid shape
    • Eyebrow shape, thickness, width
    • Face shape, chin length
    • Hair style and color
    • Facial hair presence
    • Glasses detection
    • Skin tone analysis
  • Real-time Generation: Optimized for fast inference (15-50 steps)
  • High Quality: Trained on 10k+ cartoon images with paired attributes

Training Details

Dataset

  • Source: CartoonSet10k dataset
  • Size: 10,000 cartoon images with CSV attribute annotations
  • Split: 85% training, 15% validation
  • Augmentation: Random flips, color jittering, rotation

Training Configuration

  • Epochs: 110
  • Batch Size: 16
  • Learning Rate: 2e-4 with cosine annealing
  • Optimization: AdamW with gradient clipping
  • Mixed Precision: FP16 for efficiency
  • Hardware: NVIDIA T4 GPU
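The optimization setup above can be sketched as follows; the tiny linear stand-in model, the clipping norm of 1.0, and `T_max=110` (one scheduler cycle over the 110 epochs) are placeholders rather than values from the repository.

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for OptimizedConditionedUNet
device = "cuda" if torch.cuda.is_available() else "cpu"

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=110)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # FP16 on GPU only

def train_step(batch, target):
    optimizer.zero_grad()
    # Autocast runs the forward pass in FP16 when CUDA is available
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                               # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

`lr_scheduler.step()` would be called once per epoch so the learning rate anneals to near zero by epoch 110.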

Loss Function

  • Primary: MSE loss on predicted noise
  • Scheduler: DDPM with scaled linear beta schedule
  • Beta Range: 0.00085 to 0.012

Usage

Installation

pip install torch torchvision
pip install diffusers
pip install mediapipe
pip install opencv-python
pip install Pillow numpy

Basic Usage

import torch
from PIL import Image
import numpy as np
from your_model import OptimizedConditionedUNet, OptimizedMediaPipeExtractor
from diffusers import DDPMScheduler

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = OptimizedConditionedUNet(
    in_channels=3,
    out_channels=3,
    attr_dim=18,
    base_channels=64
).to(device)

# Load checkpoint
checkpoint = torch.load('best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Initialize components
noise_scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    prediction_type="epsilon"
)

mp_extractor = OptimizedMediaPipeExtractor()

# Generate cartoon from selfie
def generate_cartoon(selfie_path, output_path):
    # Extract facial features
    features = mp_extractor.extract_features(selfie_path)
    features = features.unsqueeze(0).to(device)
    
    # Generate cartoon
    with torch.no_grad():
        # Start with noise
        image = torch.randn(1, 3, 256, 256).to(device)
        
        # Denoising process
        noise_scheduler.set_timesteps(50)
        for t in noise_scheduler.timesteps:
            timesteps = torch.full((1,), t, device=device).long()
            noise_pred = model(image, timesteps, features)
            image = noise_scheduler.step(noise_pred, t, image).prev_sample
        
        # Save result
        image = (image / 2 + 0.5).clamp(0, 1)
        image = image.cpu().squeeze(0).permute(1, 2, 0).numpy()
        image = (image * 255).astype(np.uint8)
        
        result = Image.fromarray(image)
        result.save(output_path)
        return result

# Usage
cartoon = generate_cartoon('selfie.jpg', 'cartoon.png')

Advanced Usage

# Custom attribute manipulation
def generate_with_custom_attributes(base_features, modifications):
    """
    Generate cartoon with modified attributes
    
    Args:
        base_features: Original facial features from selfie
        modifications: Dict of attribute modifications
                      e.g., {'hair_color': 0.8, 'glasses': 0.9}
    """
    modified_features = base_features.clone()
    
    attribute_map = {
        'eye_angle': 0, 'eye_lashes': 1, 'eye_lid': 2,
        'chin_length': 3, 'eyebrow_weight': 4, 'eyebrow_shape': 5,
        'eyebrow_thickness': 6, 'face_shape': 7, 'facial_hair': 8,
        'hair': 9, 'eye_color': 10, 'face_color': 11,
        'hair_color': 12, 'glasses': 13, 'glasses_color': 14,
        'eye_slant': 15, 'eyebrow_width': 16, 'eye_eyebrow_distance': 17
    }
    
    for attr_name, value in modifications.items():
        if attr_name in attribute_map:
            modified_features[0, attribute_map[attr_name]] = value
    
    return generate_from_features(modified_features)

Model Performance

Metrics

  • Training Loss: 0.0234 (final)
  • Validation Loss: 0.0251 (best)
  • Inference Time: ~2-3 seconds (50 steps, GPU)
  • Memory Usage: ~4GB GPU memory

Evaluation

  • Facial Feature Preservation: High fidelity in maintaining key facial characteristics
  • Style Consistency: Consistent cartoon art style across generations
  • Attribute Control: Precise control over 18 facial attributes
  • Robustness: Handles various lighting conditions and face angles

Limitations

  1. Face Detection Dependency: Requires clear facial landmarks for optimal results
  2. Resolution: Fixed 256x256 output resolution
  3. Style Scope: Limited to cartoon style present in training data
  4. Attribute Granularity: 18 attributes may not capture all facial variations
  5. Background: Focuses on the face region and may not handle complex backgrounds well

Ethical Considerations

  • Consent: Ensure proper consent when processing personal photos
  • Bias: Model may reflect biases present in training data
  • Privacy: Consider privacy implications when processing facial data
  • Misuse: Potential for creating misleading or fake content

Citation

@misc{image_to_cartoonify_2024,
  title={Image to Cartoonify: Selfie to Cartoon Generator},
  author={wizcodes12},
  year={2024},
  howpublished={\url{https://huggingface.co/wizcodes12/image_to_cartoonify}},
}

License

This model is released under the MIT License. See LICENSE file for details.

Acknowledgments

  • CartoonSet10k dataset creators
  • MediaPipe team for facial landmark detection
  • Diffusers library by Hugging Face
  • PyTorch team for the deep learning framework

Updates

  • v1.0: Initial release with 110 epochs of training
  • v1.1: Improved feature extraction and normalization
  • v1.2: Enhanced attribute conditioning and inference speed

Contact

For questions, issues, or collaborations, please open an issue on the repository or contact wizcodes12@example.com.


Generated with ❤️ using PyTorch and Diffusers by wizcodes12