| --- |
| license: mit |
| library_name: pytorch |
| tags: |
| - computer-vision |
| - image-segmentation |
| - edge-detection |
| - line-art |
| - anime |
| datasets: |
| - custom |
| metrics: |
| - dice |
| - iou |
| pipeline_tag: image-segmentation |
| --- |
| |
| # Anime Line Art Extraction Segmentation Model |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/6kCBB668giXjJoCLXAzfy.png" width="100%"> |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/Q5pwvCCAfhl5ctqgsVEPa.png" width="100%"> |
|
|
| ## Model Description |
|
|
| ### Overview |
| This model performs automatic line art extraction from anime images using a deep learning segmentation approach. The goal of the model is to identify edge structures that form the visual outlines of characters and objects in anime frames. |
|
|
| Extracting clean line art typically requires manual tracing by artists or complex rule-based algorithms. This project explores whether a deep learning segmentation model can learn pixel-level edge structures directly from images. |
|
|
| The model takes an RGB anime frame as input and produces a binary edge mask representing the predicted line art structure. |
|
|
| ### Problem & Context |
| Problem: |
| Extracting clean line art from images normally requires manual tracing by hand or specialized algorithms. |
|
|
| Why this matters: |
| - Speed up animation production pipelines |
| - Assist manga and illustration workflows |
| - Help beginners learn drawing by tracing outlines |
| - Improve visual quality by upscaling blurry line art |
| - Generate datasets for generative AI models |
|
|
| How computer vision helps: |
| Deep learning segmentation models can learn pixel-level edge structures directly from images. |
|
|
| ### Training Approach |
The model was trained as a semantic segmentation model using PyTorch together with the segmentation_models_pytorch library.
|
|
| Frameworks used: |
| - PyTorch |
| - segmentation_models_pytorch |
|
|
| Since no pretrained model exists specifically for anime line extraction, the model was trained using a custom dataset and automatically generated edge masks. |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/h06ej-ODkw5tDAx3X6KfL.png" width="100%"> |
|
|
| ### Intended Use Cases |
| Potential applications include: |
|
|
| - Animation pipelines – converting frames into base line structures |
| - Digital art tools – assisting artists by generating sketch outlines |
| - Image upscaling workflows – improving visual quality of blurry lines |
| - Dataset generation – automatically creating line art datasets for training generative models |
|
|
| Example research question explored in this project: |
|
|
| Can a segmentation model trained on edge masks produce usable line art for artistic workflows? |
|
|
|
|
| -------------------------------------------------- |
|
|
| # Training Data |
|
|
| ## Dataset Source |
| Images were collected manually from screenshots of anime episodes. |
|
|
| The dataset was assembled specifically for this project to capture common line art structures present in anime animation. |
|
|
| Dataset characteristics: |
|
|
- Total images: 480
- Original resolution: 1920×1080
- Training resolution: 256×256
- Task type: Binary segmentation
|
|
| ## Classes |
| Although this is a binary segmentation task, the detected edges represent multiple visual structures: |
|
|
| - Character outlines |
| - Hair edges |
| - Facial outlines |
| - Clothing folds |
| - Background structures |
|
|
| Pixel labels: |
|
|
- 0 = background
- 1 = line / edge
|
|
| ## Dataset Split |
|
|
- Train: 384 images
- Validation: 96 images
- Test: Not used
|
|
| A separate test set was not included due to the relatively small dataset size. The validation set was used to monitor training performance and evaluate model results. |
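This 80/20 split can be reproduced with a simple shuffle; the filenames below are hypothetical placeholders, not the actual dataset paths:

```python
import random

# Hypothetical filenames standing in for the 480 collected frames
paths = [f"frame_{i:04d}.png" for i in range(480)]

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(paths)

split = int(0.8 * len(paths))            # 80% train, 20% validation
train_paths, val_paths = paths[:split], paths[split:]
```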
|
|
| ## Data Collection Methodology |
| Images were collected manually from anime episode screenshots. Frames were chosen to capture a variety of characters, poses, lighting conditions, and scene compositions. |
|
|
| All images were resized to 256×256 resolution to standardize input dimensions for training. |
|
|
| ## Annotation Process |
|
|
Manually labeling line art masks for hundreds of images would be extremely time-consuming. Instead, an automated annotation pipeline was used to approximate line structures.
|
|
| Annotation pipeline: Anime Image → Grayscale conversion → Canny edge detection → Binary edge mask |
|
|
| Tools used: Python, Google Colab (Jupyter Notebook), OpenCV, PyTorch |
|
|
| Work performed for the dataset: |
|
|
| - Collected ~480 anime images |
| - Generated masks automatically using Canny edge detection |
- Visually inspected the quality of the generated masks
|
|
| This approach allowed rapid dataset creation while still ensuring that the generated masks captured meaningful line structures. |
|
|
| ## Data Augmentation |
|
|
| No data augmentation techniques were applied. |
|
|
| Images were only resized and normalized during preprocessing. |
|
|
| ## Known Dataset Biases |
|
|
| Several limitations exist in the dataset: |
|
|
| - Images are exclusively anime style, creating stylistic bias |
| - Edge masks generated automatically contain noise |
| - Some thin edges may be missing due to limitations of Canny detection |
| - Dataset size is relatively small for deep learning segmentation |
|
|
|
|
| -------------------------------------------------- |
|
|
| # Training Procedure |
|
|
| ## Training Framework |
|
|
| The model was implemented using: |
|
|
- PyTorch
- segmentation_models_pytorch
|
|
| This library provides segmentation architectures suitable for pixel-level prediction tasks. |
|
|
| ## Model Architecture |
|
|
| Architecture: |
|
|
- Encoder: ResNet18
- Decoder: U-Net
- Input: RGB image
- Output: binary edge mask
|
|
| U-Net was selected because it performs well for segmentation tasks and works effectively with relatively small datasets. |
|
|
| ## Training Hardware |
|
|
| Training was conducted using Google Colab. |
|
|
| Typical environment: |
|
|
- GPU: NVIDIA T4
- VRAM: ~16 GB
- Training time: approximately 1–2 hours
|
|
| ## Hyperparameters |
|
|
- Epochs: 30
- Batch size: 8
- Optimizer: Adam
- Learning rate: 0.0001
- Loss function: Binary Cross Entropy + Dice Loss
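The combined loss can be sketched as follows; the equal weighting of the two terms and the smoothing constant are assumptions:

```python
import torch
import torch.nn as nn

def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor,
                  smooth: float = 1.0) -> torch.Tensor:
    """Binary cross-entropy plus Dice loss (equal weighting assumed)."""
    # BCE on raw logits for numerical stability
    bce = nn.functional.binary_cross_entropy_with_logits(logits, target)
    # Soft Dice on the sigmoid probabilities
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    dice = (2 * intersection + smooth) / (probs.sum() + target.sum() + smooth)
    return bce + (1 - dice)

# One training-sized batch: 8 images, 1 mask channel, 256×256
logits = torch.randn(8, 1, 256, 256)
target = torch.randint(0, 2, (8, 1, 256, 256)).float()
loss = bce_dice_loss(logits, target)
```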
|
|
| ## Preprocessing Steps |
|
|
| Images were preprocessed using: |
|
|
- Resize to 256×256
- Normalize using ImageNet statistics:
  - mean = [0.485, 0.456, 0.406]
  - std = [0.229, 0.224, 0.225]
|
|
|
|
| -------------------------------------------------- |
|
|
| # Evaluation Results |
|
|
| ## Metrics |
|
|
| Because this project uses semantic segmentation rather than object detection, evaluation metrics are calculated at the pixel level. |
|
|
| Metrics used: |
|
|
- Dice coefficient – measures overlap between predicted masks and ground truth masks
- Intersection over Union (IoU) – measures the intersection divided by the union of predicted and ground truth masks
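Both metrics can be computed directly from binary masks:

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """2·|A∩B| / (|A| + |B|) for binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    return float((2 * intersection + eps) / (pred.sum() + gt.sum() + eps))

def iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """|A∩B| / |A∪B| for binary masks."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float((intersection + eps) / (union + eps))

# Two partially overlapping masks: Dice = 0.5, IoU = 1/3
pred = np.zeros((4, 4), dtype=np.uint8); pred[:, :2] = 1   # columns 0–1
gt   = np.zeros((4, 4), dtype=np.uint8); gt[:, 1:3] = 1    # columns 1–2
```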
|
|
| ## Validation Performance |
|
|
- Dice coefficient: ~0.35
- IoU: ~0.21
|
|
| These metrics indicate that the model is able to detect meaningful edge structures but struggles with extremely thin line details. |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/zvCczs-TB241YW4FuVIOF.png" width="50%"> |
|
|
| ## Key Observations |
|
|
| What worked well: |
|
|
| - Learned major character outlines |
| - Captured hair boundaries |
| - Detected facial structures |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/dlkCHCrPtBJPvy7sGSc8j.png" width="100%"> |
|
|
| Failure cases: |
|
|
| - Small thin lines |
| - Dark scenes |
| - Shading lines interpreted as edges |
| - Excessive background detail |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/Hi0LQhIQZvWlAd_44o88H.png" width="100%"> |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/KnIMiKkePB9aDNausGNDp.png" width="100%"> |
|
|
| These results show that the model learned meaningful edge structures despite the noisy annotations generated from Canny edge detection. |
|
|
| ## Visual Examples |
|
|
| Typical evaluation visualizations include: |
|
|
| - Input anime frame |
| - Ground truth edge mask |
| - Model predicted mask |
|
|
| These comparisons help visually evaluate whether predicted edges align with important structures in the image. |
|
|
| ## Performance Analysis |
|
|
| The model demonstrates that segmentation networks can learn edge patterns from anime images even when trained with automatically generated masks. |
|
|
| However, the task presents several challenges: |
|
|
| 1. Thin line structures are difficult for segmentation models |
| 2. Automatic annotations introduce noise |
| 3. Low contrast scenes reduce edge detectability |
|
|
| Because the model was only trained for 30 epochs, additional training may improve performance. However, improving annotation quality or training at higher resolution would likely have a larger impact. |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/PL9-L1MHMEhqmNxY4WQkm.png" width="100%"> |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/ru-gguNfSDzxbCXT6kmeS.png" width="100%"> |
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/wT3J4LSINPNHVVjLUaqcR.png" width="100%"> |
| -------------------------------------------------- |
|
|
| # Limitations and Biases |
|
|
| ## Known Failure Cases |
|
|
| The model struggles with: |
|
|
| - Extremely thin lines |
| - Low contrast scenes |
| - Dark shading regions |
| - Highly detailed backgrounds |
|
|
| These cases often produce incomplete or noisy edge predictions. |
|
|
| ## Annotation Noise |
|
|
| Ground truth labels were generated automatically using Canny edge detection. This introduces issues such as: |
|
|
| - Missing edges |
| - False edges from shading |
| - Broken line segments |
|
|
| Because the model learns from these masks, the maximum achievable accuracy is limited by the quality of the annotations. |
|
|
| ## Dataset Bias |
|
|
| The dataset contains only anime frames, introducing strong stylistic bias. |
|
|
| The model may perform poorly on: |
|
|
| - Photographs |
| - Western illustration styles |
| - Non-anime artwork |
|
|
| ## Resolution Limitations |
|
|
| Images were resized from 1920×1080 to 256×256 for training. |
|
|
| This downscaling removes fine details and makes thin lines harder to detect. |
|
|
| ## Sample Size Limitations |
|
|
| The dataset contains only 480 images, which is relatively small for training deep neural networks. A larger dataset would likely improve generalization. |
|
|
| ## Inappropriate Use Cases |
|
|
| This model should not be used for: |
|
|
| - Photographic edge detection |
| - Medical image segmentation |
| - Object detection tasks |
|
|
| The model is specifically designed for anime-style line structure extraction. |
|
|
|
|
| -------------------------------------------------- |
|
|
| # Future Work |
|
|
| Possible improvements include: |
|
|
| - Expanding the dataset to thousands of images |
| - Training at higher resolution (512×512 or higher) |
| - Improving annotation quality with manual corrections |
| - Exploring diffusion-based line reconstruction models |
|
|
| Additional research directions include: |
|
|
| - Object detection models for automatic removal of occlusions |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/NKCNnMBSAzzhAPjaZiX9y.png" width="100%"> |
|
|
| - Line art upscaling techniques |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/9cisCYIkU_y45UJtJRNcE.png" width="100%"> |
|
|
| - Using detected edges for stitching animation panning shots |
|
|
| <img src="https://cdn-uploads.huggingface.co/production/uploads/6972a2622ef5ed3b50628995/ZDIrGENzx4oy-Vj_jyQMa.gif" width="100%"> |
|
|