# Image to Cartoonify - Selfie to Cartoon Generator

## Model Description

This is a conditional diffusion model that generates cartoon-style faces from facial attributes extracted from real selfies. It uses a custom U-Net architecture conditioned on an 18-dimensional attribute vector. The model is trained on cartoon images paired with attribute annotations (CartoonSet10k). At inference time, the same attributes are estimated from a selfie and used to generate a matching cartoon.

## Architecture

- Model Type: Conditional Diffusion Model (Custom U-Net)
- Base Architecture: Custom OptimizedConditionedUNet
- Input Resolution: 256x256 RGB images
- Conditioning: 18-dimensional facial attribute vector
- Parameters: ~50M
- Diffusion Timesteps: 1000 (training schedule)
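
The card does not describe how OptimizedConditionedUNet consumes the timestep and the attributes. Conditional diffusion U-Nets often use the following design:

- encode the timestep with sinusoidal features;
- project the attribute vector to the same width;
- add the two into a single conditioning embedding that every residual block receives.

The NumPy sketch below shows only that embedding step, under those assumptions. The function names, the 256-wide embedding and the random matrix standing in for a learned projection are all hypothetical.

```python
import numpy as np

def timestep_embedding(t, dim=256, max_period=10000.0):
    """Sinusoidal embedding of integer timesteps: shape (batch,) -> (batch, dim)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=1)

def conditioning_embedding(t, attrs, proj):
    """Hypothetical: timestep embedding plus a linear projection of the
    18-dim attribute vector. `proj` stands in for a learned (18, dim) weight."""
    return timestep_embedding(t, proj.shape[1]) + attrs @ proj

rng = np.random.default_rng(0)
proj = rng.normal(scale=0.02, size=(18, 256))  # placeholder for learned weights
attrs = rng.uniform(size=(2, 18))              # batch of two attribute vectors
emb = conditioning_embedding(np.array([0, 999]), attrs, proj)
print(emb.shape)  # (2, 256)
```

Adding the embeddings (rather than concatenating them) keeps the width fixed, so each residual block needs only one projection of the combined condition.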

## Key Features

- Facial Feature Extraction: Uses MediaPipe for facial landmark detection
- Attribute Conditioning: 18 facial attributes, including:
  - Eye angle, lashes, lid shape
  - Eyebrow shape, thickness, width
  - Face shape, chin length
  - Hair style and color
  - Facial hair presence
  - Glasses detection
  - Skin tone analysis
- Fast Inference: 15-50 denoising steps per image (~2-3 seconds at 50 steps on GPU)
- Training Data: 10,000 cartoon images with paired attribute annotations
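
This card does not document how OptimizedMediaPipeExtractor maps landmarks to attribute values. The sketch below is a hypothetical illustration with two steps:

- It derives one crude face-shape proxy: the bounding-box height over width of normalized (x, y) landmarks, the kind of coordinates MediaPipe Face Mesh returns.
- It packs named values into an 18-dim vector, using the index order of the attribute map in Advanced Usage.

The proxy, the zero default for unset attributes and the helper names are assumptions. The real extractor's value ranges may differ.

```python
import numpy as np

# Attribute order as in the Advanced Usage attribute map (indices 0..17)
ATTRIBUTE_NAMES = [
    "eye_angle", "eye_lashes", "eye_lid", "chin_length", "eyebrow_weight",
    "eyebrow_shape", "eyebrow_thickness", "face_shape", "facial_hair", "hair",
    "eye_color", "face_color", "hair_color", "glasses", "glasses_color",
    "eye_slant", "eyebrow_width", "eye_eyebrow_distance",
]

def face_aspect_ratio(landmarks):
    """Hypothetical face-shape proxy: height / width of the landmark bounding box.
    `landmarks` is an (N, 2) array of normalized (x, y) coordinates."""
    pts = np.asarray(landmarks, dtype=np.float64)
    width = pts[:, 0].max() - pts[:, 0].min()
    height = pts[:, 1].max() - pts[:, 1].min()
    return height / width

def build_attribute_vector(values, default=0.0):
    """Pack a {name: value} dict into an 18-dim vector in the order above."""
    unknown = set(values) - set(ATTRIBUTE_NAMES)
    if unknown:
        raise KeyError(f"Unknown attributes: {sorted(unknown)}")
    return np.array([values.get(n, default) for n in ATTRIBUTE_NAMES], dtype=np.float32)

# Toy landmarks spanning 0.4 (width) by 0.6 (height) in normalized coordinates
toy = np.array([[0.3, 0.2], [0.7, 0.2], [0.5, 0.8]])
vec = build_attribute_vector({"face_shape": face_aspect_ratio(toy)})
print(vec.shape, round(float(vec[7]), 3))  # (18,) 1.5
```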

## Training Details

### Dataset
- Source: CartoonSet10k dataset
- Size: 10,000 cartoon images with CSV attribute annotations
- Split: 85% training, 15% validation
- Augmentation: Random flips, color jittering, rotation
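
The card gives only the 85/15 ratio. The sketch below makes a reproducible index split over the 10,000 images. It assumes a random shuffle with a fixed seed; the seed value and the function name are illustrative.

```python
import numpy as np

def train_val_split(n_items, val_fraction=0.15, seed=42):
    """Shuffle indices 0..n_items-1 and split them into train/validation sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_items)
    n_val = int(round(n_items * val_fraction))
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = train_val_split(10_000)
print(len(train_idx), len(val_idx))  # 8500 1500
```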

### Training Configuration
- Epochs: 110
- Batch Size: 16
- Learning Rate: 2e-4 with cosine annealing
- Optimization: AdamW with gradient clipping
- Mixed Precision: FP16 for efficiency
- Hardware: NVIDIA T4 GPU
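
The training set has 8,500 images (85% of 10,000). At a batch size of 16, one epoch is 532 optimizer steps if the last partial batch is kept, so 110 epochs come to 58,520 steps. The sketch below computes those numbers and the closed-form cosine-annealing learning rate, the same formula as PyTorch's CosineAnnealingLR. The card does not specify three details, so these are assumptions:

- the rate is annealed per optimizer step rather than per epoch;
- the minimum learning rate is 0;
- the partial batch is kept.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-4, min_lr=0.0):
    """Cosine annealing from base_lr down to min_lr over total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

train_images, batch_size, epochs = 8500, 16, 110
steps_per_epoch = math.ceil(train_images / batch_size)  # 532 (last batch is partial)
total_steps = steps_per_epoch * epochs                  # 58,520

# Learning rate at the start, midpoint and end of training
print(cosine_lr(0, total_steps), cosine_lr(total_steps // 2, total_steps), cosine_lr(total_steps, total_steps))
```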

### Loss Function
- Primary: MSE loss on predicted noise
- Scheduler: DDPM with scaled linear beta schedule
- Beta Range: 0.00085 to 0.012
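
diffusers' DDPMScheduler defines the `scaled_linear` schedule as linear in sqrt(beta), then squared. The NumPy sketch below takes three steps for these parameters:

- it builds that schedule;
- it applies the closed-form forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps;
- it computes the MSE between predicted and true noise.

A dummy predictor that always returns zeros stands in for the U-Net, and the attribute conditioning is omitted. The helper names are hypothetical.

```python
import numpy as np

def scaled_linear_betas(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """'scaled_linear' schedule: linear in sqrt(beta), then squared."""
    return np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_steps) ** 2

def noisy_sample_and_loss(x0, t, eps, eps_pred_fn, alphas_cumprod):
    """Sample x_t from q(x_t | x_0) and return it with the epsilon-prediction MSE."""
    a_bar = alphas_cumprod[t]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    eps_pred = eps_pred_fn(x_t, t)
    return x_t, np.mean((eps_pred - eps) ** 2)

betas = scaled_linear_betas()
alphas_cumprod = np.cumprod(1.0 - betas)  # alpha_bar_t for t = 0..999

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(3, 256, 256))  # an image scaled to [-1, 1]
eps = rng.standard_normal(x0.shape)
# Dummy "model" that predicts zero noise, standing in for the U-Net
x_t, loss = noisy_sample_and_loss(x0, 500, eps, lambda x, t: np.zeros_like(x), alphas_cumprod)
print(betas[0], betas[-1], loss)
```

With a zero predictor the loss equals the mean squared noise, close to 1. A trained model drives it far lower (the metrics report ~0.02).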

## Usage

### Installation
```bash
pip install torch torchvision
pip install diffusers
pip install mediapipe
pip install opencv-python
pip install Pillow numpy
```

### Basic Usage
```python
import numpy as np
import torch
from diffusers import DDPMScheduler
from PIL import Image

# `your_model` stands for the module that defines the model and extractor classes
from your_model import OptimizedConditionedUNet, OptimizedMediaPipeExtractor

# Load model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = OptimizedConditionedUNet(
    in_channels=3,
    out_channels=3,
    attr_dim=18,
    base_channels=64,
).to(device)

# Load checkpoint
checkpoint = torch.load('best_model.pt', map_location=device)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Initialize components (same noise schedule as in training)
noise_scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    prediction_type="epsilon",
)

mp_extractor = OptimizedMediaPipeExtractor()


# Generate cartoon from selfie
def generate_cartoon(selfie_path, output_path, num_inference_steps=50):
    # Extract the 18-dim attribute vector and add a batch dimension -> (1, 18)
    features = mp_extractor.extract_features(selfie_path)
    features = features.unsqueeze(0).to(device)

    with torch.no_grad():
        # Start from pure Gaussian noise in the model's [-1, 1] image space
        image = torch.randn(1, 3, 256, 256, device=device)

        # Denoising process: repeatedly predict and remove noise
        noise_scheduler.set_timesteps(num_inference_steps)
        for t in noise_scheduler.timesteps:
            timesteps = torch.full((1,), int(t), device=device, dtype=torch.long)
            noise_pred = model(image, timesteps, features)
            image = noise_scheduler.step(noise_pred, t, image).prev_sample

    # Map [-1, 1] -> [0, 255] and save the result
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.cpu().squeeze(0).permute(1, 2, 0).numpy()
    image = (image * 255).astype(np.uint8)

    result = Image.fromarray(image)
    result.save(output_path)
    return result


# Usage
cartoon = generate_cartoon('selfie.jpg', 'cartoon.png')
```
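
The last lines of generate_cartoon undo the [-1, 1] scaling that the model works in. The NumPy sketch below shows both directions. It assumes training images were scaled to [-1, 1] the same way; that forward mapping is inferred from the decoding step, and the helper names are hypothetical. Unlike the snippet above, to_uint8 rounds rather than truncates, so a round trip reproduces the original pixels exactly.

```python
import numpy as np

def to_model_range(image_uint8):
    """HWC uint8 in [0, 255] -> CHW float32 in [-1, 1]."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x * 2.0 - 1.0).transpose(2, 0, 1)

def to_uint8(sample):
    """CHW float in [-1, 1] -> HWC uint8 in [0, 255] (same affine map as `image / 2 + 0.5`)."""
    x = np.clip(sample / 2.0 + 0.5, 0.0, 1.0)
    return (x.transpose(1, 2, 0) * 255).round().astype(np.uint8)
```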

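### Advanced Usage
```python
# Custom attribute manipulation.
# Reuses `model`, `noise_scheduler`, `device`, `np`, `torch`, and `Image` from Basic Usage.

# Index of each attribute in the 18-dim conditioning vector
ATTRIBUTE_MAP = {
    'eye_angle': 0, 'eye_lashes': 1, 'eye_lid': 2,
    'chin_length': 3, 'eyebrow_weight': 4, 'eyebrow_shape': 5,
    'eyebrow_thickness': 6, 'face_shape': 7, 'facial_hair': 8,
    'hair': 9, 'eye_color': 10, 'face_color': 11,
    'hair_color': 12, 'glasses': 13, 'glasses_color': 14,
    'eye_slant': 15, 'eyebrow_width': 16, 'eye_eyebrow_distance': 17,
}


@torch.no_grad()
def generate_from_features(features, num_inference_steps=50):
    """Run the Basic Usage denoising loop on a (1, 18) feature tensor."""
    features = features.to(device)
    image = torch.randn(1, 3, 256, 256, device=device)
    noise_scheduler.set_timesteps(num_inference_steps)
    for t in noise_scheduler.timesteps:
        timesteps = torch.full((1,), int(t), device=device, dtype=torch.long)
        noise_pred = model(image, timesteps, features)
        image = noise_scheduler.step(noise_pred, t, image).prev_sample
    image = (image / 2 + 0.5).clamp(0, 1)
    array = (image.cpu().squeeze(0).permute(1, 2, 0).numpy() * 255).astype(np.uint8)
    return Image.fromarray(array)


def generate_with_custom_attributes(base_features, modifications):
    """
    Generate a cartoon with modified attributes.

    Args:
        base_features: Original facial features from a selfie, shape (18,) or (1, 18)
        modifications: Dict of attribute modifications,
            e.g., {'hair_color': 0.8, 'glasses': 0.9}
    """
    modified_features = base_features.clone()
    if modified_features.dim() == 1:
        modified_features = modified_features.unsqueeze(0)  # add batch dimension

    for attr_name, value in modifications.items():
        # Fail loudly on typos instead of silently ignoring the attribute
        if attr_name not in ATTRIBUTE_MAP:
            raise KeyError(f"Unknown attribute: {attr_name!r}")
        modified_features[0, ATTRIBUTE_MAP[attr_name]] = value

    return generate_from_features(modified_features)
```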
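
To compare several values of one attribute side by side, you can build a batch of copies of the base vector with only that attribute changed. The NumPy sketch below does this. attribute_sweep is a hypothetical helper, and only two indices from the attribute map are included. This card does not document whether the model responds smoothly to intermediate values.

```python
import numpy as np

# Subset of the attribute map above (name -> index in the 18-dim vector)
ATTRIBUTE_INDEX = {"hair_color": 12, "glasses": 13}

def attribute_sweep(base_features, attr_name, values):
    """Return a (len(values), 18) batch: copies of base_features with one attribute varied."""
    idx = ATTRIBUTE_INDEX[attr_name]
    base = np.asarray(base_features, dtype=np.float32).reshape(-1)  # accept (18,) or (1, 18)
    batch = np.tile(base, (len(values), 1))
    batch[:, idx] = values
    return batch

base = np.full(18, 0.5, dtype=np.float32)
sweep = attribute_sweep(base, "hair_color", [0.0, 0.25, 0.5, 0.75, 1.0])
print(sweep.shape)  # (5, 18)
```

Each row could then be passed to generate_from_features as `torch.from_numpy(row).unsqueeze(0)`.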

## Model Performance

### Metrics
- Training Loss (noise-prediction MSE): 0.0234 (final)
- Validation Loss (noise-prediction MSE): 0.0251 (best)
- Inference Time: ~2-3 seconds (50 steps, GPU)
- Memory Usage: ~4 GB GPU memory
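
The inference-time figure works out to roughly 40-60 ms per denoising step (2-3 s over 50 steps). To reproduce a number like this on your own hardware, a minimal timing helper might look like the sketch below; time_inference is a hypothetical name.

CUDA kernels run asynchronously, so the timed callable must synchronize before returning. generate_cartoon already does this because it copies the result to the CPU. Peak GPU memory can be read separately with torch.cuda.max_memory_allocated().

```python
import statistics
import time

def time_inference(fn, warmup=1, repeats=5):
    """Median wall-clock seconds per call of fn(), measured after warm-up calls."""
    for _ in range(warmup):
        fn()  # lets one-time setup (CUDA context, kernel selection) happen untimed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

# Stand-in workload; with the model loaded, use e.g.
# time_inference(lambda: generate_cartoon('selfie.jpg', 'cartoon.png'))
print(time_inference(lambda: time.sleep(0.01), repeats=3))
```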

### Evaluation
- Facial Feature Preservation: Key facial characteristics of the selfie are carried over to the cartoon
- Style Consistency: Generations share a consistent cartoon art style
- Attribute Control: Each of the 18 facial attributes can be adjusted through the conditioning vector
- Robustness: Handles varied lighting conditions and face angles, provided facial landmarks can be detected

## Limitations

1. Face Detection Dependency: Requires clearly visible facial landmarks for optimal results
2. Resolution: Fixed 256x256 output resolution
3. Style Scope: Limited to the cartoon style present in the training data (CartoonSet10k)
4. Attribute Granularity: 18 attributes may not capture all facial variations
5. Background: Focuses on the face region and may not handle complex backgrounds well

## Ethical Considerations

- Consent: Ensure proper consent when processing personal photos
- Bias: Model may reflect biases present in training data
- Privacy: Consider privacy implications when processing facial data
- Misuse: Potential for creating misleading or fake content

## Citation

```bibtex
@misc{image_to_cartoonify_2024,
  title={Image to Cartoonify: Selfie to Cartoon Generator},
  author={wizcodes12},
  year={2024},
  howpublished={\url{https://huggingface.co/wizcodes12/image_to_cartoonify}},
}
```

## License

This model is released under the MIT License. See the LICENSE file for details.

## Acknowledgments

- CartoonSet10k dataset creators
- MediaPipe team for facial landmark detection
- Diffusers library by Hugging Face
- PyTorch team for the deep learning framework

## Updates

- v1.0: Initial release with 110 epochs of training
- v1.1: Improved feature extraction and normalization
- v1.2: Enhanced attribute conditioning and inference speed

## Contact

For questions, issues, or collaborations, please open an issue on the repository or contact wizcodes12@example.com.

---

Generated with ❤️ using PyTorch and Diffusers by wizcodes12