---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- feature-extraction
license: mit
---
# Comic Panel Encoder v1 (Stage 3)
This model is a multimodal encoder specifically designed to generate rich, dense feature representations (embeddings) of individual comic book panels. It serves as "Stage 3" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis).
By combining visual details, extracted text (dialogue/narration), and compositional metadata (bounding box coordinates), it generates a single **512-dimensional vector** per panel. These embeddings are highly optimized for downstream sequential narrative modeling (Stage 4) and comic retrieval tasks.
## Model Architecture
The `comic-panel-encoder-v1` utilizes an **Adaptive Multi-Modal Fusion** architecture:
1. **Visual Branch (Dual Backbone):**
- **SigLIP** (`google/siglip-base-patch16-224`): Captures high-level semantic and stylistic features.
- **ResNet50**: Captures fine-grained, low-level texture and structural details.
- *Fusion:* An attention mechanism fuses the domain-adapted outputs of both backbones.
2. **Text Branch:**
- **MiniLM** (`sentence-transformers/all-MiniLM-L6-v2`): Encodes transcribed dialogue, narration, and VLM-generated descriptions.
3. **Compositional Branch:**
- A Multi-Layer Perceptron (MLP) encodes panel geometry (aspect ratio, normalized bounding box coordinates, relative area).
4. **Adaptive Fusion Gate:**
- A learned gating mechanism combines the Vision, Text, and Composition features, dynamically weighting them based on the presence/quality of the modalities (e.g., handles panels with no text gracefully).
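The gating idea above can be sketched in a few lines of PyTorch. This is a minimal illustration of a learned, mask-aware fusion gate, not the exact layer from the repository: `AdaptiveFusionGate`, its layer sizes, and the softmax weighting scheme are assumptions for demonstration.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Illustrative sketch: predict softmax weights over modalities,
    masking out absent ones so they contribute zero to the fused vector."""
    def __init__(self, dim: int = 512, n_modalities: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats: torch.Tensor, modality_mask: torch.Tensor) -> torch.Tensor:
        # feats: (B, n_modalities, dim); modality_mask: (B, n_modalities), 1 = present
        logits = self.gate(feats.flatten(1))                    # (B, n_modalities)
        logits = logits.masked_fill(modality_mask == 0, float("-inf"))
        weights = torch.softmax(logits, dim=-1)                 # absent modality -> weight 0
        return (weights.unsqueeze(-1) * feats).sum(dim=1)       # (B, dim)

# A panel with no text (mask = [1, 0, 1]) still yields a valid embedding
gate = AdaptiveFusionGate(dim=512)
feats = torch.randn(2, 3, 512)
mask = torch.tensor([[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
fused = gate(feats, mask)
print(fused.shape)  # torch.Size([2, 512])
```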
## Training Data & Methodology
The model was trained on a dataset of approximately **1 million comic pages**, filtered specifically for narrative/story content using [CoSMo (Comic Stream Modeling)](https://github.com/mserra0/CoSMo-ComicsPSS).
### Objectives
The encoder was trained from scratch (with frozen base backbones) using three simultaneous objectives:
1. **InfoNCE Contrastive Loss (Global Context):** Maximizes similarity between panels on the *same page* while minimizing similarity to panels on *different pages*. This forces the model to learn distinct page-level stylistic and narrative contexts.
2. **Masked Panel Reconstruction (Local Detail):** Predicts the embedding of a masked panel given the context of surrounding panels on the same page. This prevents mode collapse and ensures individual panels retain their unique sequential features.
3. **Modality Alignment:** Aligns the visual embedding space with the text embedding space for a given panel using contrastive cross-entropy.
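The page-level InfoNCE objective (objective 1) can be sketched as follows. This is a toy illustration, assuming L2-normalized panel embeddings, page IDs labeling the positives, and a temperature hyperparameter; the repository's actual loss implementation may differ.

```python
import torch
import torch.nn.functional as F

def page_infonce(embeddings: torch.Tensor, page_ids: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """Panels sharing a page ID are positives; all other panels in the
    batch are negatives (illustrative sketch)."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T / temperature                                  # (N, N) cosine sims
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool)
    pos = (page_ids.unsqueeze(0) == page_ids.unsqueeze(1)) & ~eye
    sim = sim.masked_fill(eye, float("-inf"))                    # exclude self-similarity
    log_prob = sim.log_softmax(dim=-1)
    return -log_prob[pos].mean()                                 # average over positive pairs

# Toy batch: 4 panels drawn from 2 pages
emb = torch.randn(4, 512)
ids = torch.tensor([0, 0, 1, 1])
loss = page_infonce(emb, ids)
print(loss.item())
```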
## Usage
You can use this model to extract 512-d embeddings from comic panels. The codebase required to run this model is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/version2/stage3_panel_features_framework.py`.
### Example: Extracting Features
```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer
# Requires cloning the GitHub repo for the framework class
from stage3_panel_features_framework import PanelFeatureExtractor
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# 1. Initialize Model
model = PanelFeatureExtractor(
    visual_backbone='both',
    visual_fusion='attention',
    feature_dim=512
).to(device)
# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/comic-panel-encoder-v1/resolve/main/best_model.pt",
    map_location=device
)
model.load_state_dict(state_dict)
model.eval()
# 2. Prepare Inputs
# Image
image = Image.open('sample_panel.jpg').convert('RGB')
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
img_tensor = transform(image).unsqueeze(0).unsqueeze(0).to(device) # (B=1, N=1, C, H, W)
# Text
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
text_enc = tokenizer(["Batman punches the Joker"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attn_mask = text_enc['attention_mask'].unsqueeze(0).to(device)
# Composition: 7 geometry features (e.g., aspect ratio, normalized bbox coordinates, relative area)
comp_feats = torch.zeros(1, 1, 7).to(device)
# Modality Mask [Vision, Text, Comp]
modality_mask = torch.tensor([[[1.0, 1.0, 1.0]]]).to(device)
batch = {
    'images': img_tensor,
    'input_ids': input_ids,
    'attention_mask': attn_mask,
    'comp_feats': comp_feats,
    'modality_mask': modality_mask
}
# 3. Generate Embedding
with torch.no_grad():
    panel_embedding = model(batch)
print(f"Embedding shape: {panel_embedding.shape}") # Output: torch.Size([1, 512])
```
## Intended Use & Limitations
- **Sequence Modeling:** These embeddings are intended to be fed into a temporal sequence model (like a Transformer encoder) to predict narrative flow, reading order, and character coherence (Stage 4 of the framework).
- **Retrieval:** Can be used to find visually or semantically similar panels across a large database via cosine similarity.
- **Limitation:** The visual backbones were frozen during training, meaning the model relies on the pre-trained priors of SigLIP and ResNet50, combined via the newly trained adapter and fusion layers.
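For the retrieval use case, cosine-similarity search over the embeddings is straightforward. A minimal sketch with a toy random database standing in for real panel embeddings (in practice these would be the 512-d vectors produced above):

```python
import torch
import torch.nn.functional as F

# Toy database of 1000 panel embeddings (placeholder for real extracted vectors)
db = F.normalize(torch.randn(1000, 512), dim=-1)
query = F.normalize(torch.randn(512), dim=-1)

# After L2 normalization, cosine similarity reduces to a dot product
scores = db @ query                    # (1000,) similarity to the query panel
top_scores, top_idx = scores.topk(5)   # 5 most similar panels, sorted descending
print(top_idx.tolist())
```

For large databases, the same dot-product search scales well with an approximate nearest-neighbor index (e.g., FAISS) over the normalized vectors.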
## Citation
If you use this model or the associated framework, please link back to the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).