---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
license: mit
---
# Comic Panel Encoder v1 (Stage 3)
This model is a multimodal encoder specifically designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).
By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single **512-dimensional vector** per panel. These embeddings are highly optimized for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.
## Model Architecture
The `comic-panel-encoder-v1` utilizes an **Adaptive Multi-Modal Fusion** architecture:
1. **Visual Branch (Dual Backbone):**
- **SigLIP** (`google/siglip-base-patch16-224`): Captures high-level semantic and stylistic features.
- **ResNet50**: Captures fine-grained, low-level texture and structural details.
- *Fusion:* An attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
- **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): Encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
- A Multi-Layer Perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
- A learned gating mechanism combines the Vision, Text, and Composition features, dynamically weighting them based on the presence/quality of the modalities (e.g., handles panels with no text gracefully).
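The gating idea above can be sketched in a few lines of PyTorch. This is a minimal illustration of a learned, mask-aware fusion gate, not the exact layer from the repository: `AdaptiveFusionGate`, its layer sizes, and the softmax weighting scheme are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Illustrative sketch: predict softmax weights over modalities,
    masking out absent ones so they contribute zero to the fused vector."""
    def __init__(self, dim: int = 512, n_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats: torch.Tensor, modality_mask: torch.Tensor) -> torch.Tensor:
        # feats: (B, n_modalities, dim); modality_mask: (B, n_modalities), 1 = present
        logits = self.gate(feats.flatten(1))                    # (B, n_modalities)
        logits = logits.masked_fill(modality_mask == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)                 # absent modality -> weight 0
        return (weights.unsqueeze(-1) * feats).sum(dim=1)       # (B, dim)

# A panel with no text (mask = [1, 0, 1]) still yields a valid embedding
gate = AdaptiveFusionGate(dim=512)
feats = torch.randn(2, 3, 512)
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
fused = gate(feats, mask)
print(fused.shape)  # torch.Size([2, 512])
```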
## Training Data & Methodology
The model was trained on a dataset of approximately **1 million comic pages**, filtered specifically for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).
### Objectives
The encoder was trained from scratch (with frozen base backbones) using three simultaneous objectives:
1. **InfoNCE Contrastive Loss (Global Context):** Maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** Predicts the embedding of a masked panel given the context of surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** Aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.
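The page-level InfoNCE objective (objective 1) can be sketched as follows. This is a toy illustration, assuming L2-normalized panel embeddings, page IDs labeling the positives, and a temperature hyperparameter; the repository's actual loss implementation may differ.

```python
import torch
import torch.nn.functional as F

def page_infonce(embeddings: torch.Tensor, page_ids: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Panels sharing a page ID are positives; all other panels in the
    batch are negatives (illustrative sketch)."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature                                  # (N, N) cosine sims
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (page_ids.unsqueeze(0) == page_ids.unsqueeze(1)) & ~eye
    sim = sim.masked_fill(eye, float("-inf"))                    # exclude self-similarity
    log_prob = sim.log_softmax(dim=-1)
    return -log_prob[pos].mean()                                 # average over positive pairs

# Toy batch: 4 panels drawn from 2 pages
emb = torch.randn(4, 512)
ids = torch.tensor([0, 0, 1, 1])
loss = page_infonce(emb, ids)
print(loss.item())
```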
## Usage
You can use this model to extract 512-d embeddings from comic panels. The codebase required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.
### Example: Extracting Features
```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer
# Requires cloning the GitHub repo for the framework class
from stage3_panel_features_framework import PanelFeatureExtractor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 1. Initialize Model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)
# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()
# 2. Prepare Inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device) # (B=1, N=1, C, H, W)
# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)
# Composition: 7 geometry features (e.g., aspect ratio, normalized bbox coordinates, relative area)
comp_feats = torch.zeros(1, 1, 7).to(device)
# Modality Mask [Vision, Text, Comp]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)
batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}
# 3. Generate Embedding
with torch.no_grad():
    panel_embedding = model(batch)
print(f"Embedding shape: {panel_embedding.shape}") # Output: torch.Size([1, 512])
```
## Intended Use & Limitations
- **Sequence Modeling:** These embeddings are intended to be fed into a temporal sequence model (like a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** Can be used to find visually or semantically similar panels across a large database via cosine similarity.
- **Limitation:** The visual backbones were frozen during training, meaning the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.
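For the retrieval use case, cosine-similarity search over the embeddings is straightforward. A minimal sketch with a toy random database standing in for real panel embeddings (in practice these would be the 512-d vectors produced above):

```python
import torch
import torch.nn.functional as F

# Toy database of 1000 panel embeddings (placeholder for real extracted vectors)
db = F.normalize(torch.randn(1000, 512), dim=-1)
query = F.normalize(torch.randn(512), dim=-1)

# After L2 normalization, cosine similarity reduces to a dot product
scores = db @ query                    # (1000,) similarity to the query panel
top_scores, top_idx = scores.topk(5)   # 5 most similar panels, sorted descending
print(top_idx.tolist())
```

For large databases, the same dot-product search scales well with an approximate nearest-neighbor index (e.g., FAISS) over the normalized vectors.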
## Citation
If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).