---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
license: mit
---

# Comic Panel Encoder v1 (Stage 3)

This model is a multimodal encoder designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).

By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single **512-dimensional vector** per panel. These embeddings are designed for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.

## Model Architecture

The `comic-panel-encoder-v1` uses an **Adaptive Multi-Modal Fusion** architecture:

1. **Visual Branch (Dual Backbone):**
   - **SigLIP** (`google/siglip-base-patch16-224`): captures high-level semantic and stylistic features.
   - **ResNet50**: captures fine-grained, low-level texture and structural details.
   - *Fusion:* an attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
   - **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
   - A multi-layer perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
   - A learned gating mechanism combines the vision, text, and composition features, dynamically weighting them based on the presence and quality of each modality (e.g., it handles panels with no text gracefully).
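
For intuition, a mask-aware fusion gate of this kind can be sketched as below. This is a simplified illustration, not the actual implementation: the class name, layer sizes, and gating formula are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Illustrative sketch: score each modality with a learned gate,
    zero out absent modalities via the mask, and project the weighted sum."""
    def __init__(self, dim=512, n_modalities=3):
        super().__init__()
        # Gate scores are computed from the concatenated modality features
        self.gate = nn.Linear(dim * n_modalities, n_modalities)
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, modality_mask):
        # feats: (B, 3, dim); modality_mask: (B, 3) with 1 = modality present
        scores = self.gate(feats.flatten(1))                # (B, 3)
        scores = scores.masked_fill(modality_mask == 0, float('-inf'))
        weights = torch.softmax(scores, dim=-1)             # absent modality -> weight 0
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)  # (B, dim)
        return self.proj(fused)

gate = AdaptiveFusionGate()
feats = torch.randn(2, 3, 512)
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]])  # second sample has no text
out = gate(feats, mask)
print(out.shape)  # torch.Size([2, 512])
```

Masking before the softmax (rather than after) keeps the remaining modality weights normalized to 1, which is one common way to degrade gracefully when a panel has no text.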

## Training Data & Methodology

The model was trained on a dataset of approximately **1 million comic pages**, filtered for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).

### Objectives
The encoder was trained from scratch (with the base backbones kept frozen) using three simultaneous objectives:
1. **InfoNCE Contrastive Loss (Global Context):** maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** predicts the embedding of a masked panel given the context of the surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** aligns the visual embedding space with the text embedding space for a given panel using a contrastive cross-entropy loss.
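
For intuition, the InfoNCE objective can be written as a cross-entropy over similarity logits. The sketch below is a generic, simplified formulation, not the training code; the batch construction and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    """Simplified InfoNCE: each anchor's positive is the same-index row of
    `positives`; all other rows in the batch act as negatives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature     # (B, B) scaled cosine similarities
    targets = torch.arange(a.size(0))  # positive pairs lie on the diagonal
    return F.cross_entropy(logits, targets)

# E.g., embeddings of two panels sampled from the same page form a positive pair
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```

In the page-level setting described above, panels drawn from the same page would populate the positive pairs, while panels from other pages in the batch supply the negatives.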

## Usage

You can use this model to extract 512-d embeddings from comic panels. The code required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.

### Example: Extracting Features
```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer
# Requires cloning the GitHub repo for the framework class
from stage3_panel_features_framework import PanelFeatureExtractor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device)  # (B=1, N=1, C, H, W)

# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Composition: 7 geometry features (aspect ratio, normalized bounding box
# coordinates, relative area); zeros here are just a placeholder
comp_feats = torch.zeros(1, 1, 7).to(device)

# Modality Mask [Vision, Text, Comp]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)

batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}

# 3. Generate Embedding
with torch.no_grad():
    panel_embedding = model(batch)

print(f"Embedding shape: {panel_embedding.shape}")  # Output: torch.Size([1, 512])
```

## Intended Use & Limitations
- **Sequence Modeling:** these embeddings are intended to be fed into a temporal sequence model (such as a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** the embeddings can be used to find visually or semantically similar panels across a large database via cosine similarity.
- **Limitation:** the visual backbones were frozen during training, so the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.
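
The retrieval use case reduces to a nearest-neighbour search over the 512-d embeddings. A minimal sketch, using random stand-in vectors in place of real panel embeddings:

```python
import torch
import torch.nn.functional as F

# Stand-in database of 512-d panel embeddings plus one query embedding
db = F.normalize(torch.randn(1000, 512), dim=-1)
query = F.normalize(torch.randn(1, 512), dim=-1)

# On unit-normalized vectors, cosine similarity is just a dot product
sims = (query @ db.T).squeeze(0)  # (1000,) similarity scores
topk = torch.topk(sims, k=5)
print(topk.indices)  # indices of the 5 most similar panels
```

For large collections, the same dot-product search can be delegated to an approximate nearest-neighbour index (e.g., FAISS) without changing the embeddings themselves.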

## Citation
If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).