RichardScottOZ committed
Commit f97f985 · verified · 1 Parent(s): 1c2768d

Update model card

Add some model card detail.

Files changed (1): README.md (+115 -1)
README.md CHANGED
@@ -18,4 +18,118 @@ The model code and documentation repository is at https://github.com/RichardScot
 
 Using transformers multimodal fusion of image and text to make embeddings to query comics for similarity or text.
 
- More more detail the repo above.
+ More detail is in the repo above.
+ 
+ ---
+ language: en
+ tags:
+ - vision
+ - text
+ - multimodal
+ - comics
+ - contrastive-learning
+ - vit
+ - roberta
+ license: mit
+ ---
+ 
+ # ClosureLiteSimple (Version 1 - Comic Panel Encoder)
+ 
+ ClosureLiteSimple is the Version 1 precursor to the Stage 3 panel encoder within the [Comic Analysis Framework](https://github.com/RichardScottOZ/Comic-Analysis).
+ 
+ It is a multimodal neural network designed to fuse image crops, textual dialogue, and compositional metadata into a unified **384-dimensional** embedding per comic panel; it can also aggregate these panel embeddings into a single page-level embedding using an attention mechanism.
+ 
+ *(Note: this model is considered deprecated in favor of the newer `comic-panel-encoder-v1`, which utilizes SigLIP, ResNet, and an improved Adaptive Fusion Gate.)*
+ 
+ ## Model Architecture
+ 
+ The `ClosureLiteSimple` model consists of the `PanelAtomizerLite` and a `SimpleAttention` mechanism:
+ 
+ 1. **Vision Encoder (`google/vit-base-patch16-224`):**
+    - Extracts features from $224 \times 224$ panel image crops.
+    - Outputs projected to $384$-d.
+ 2. **Text Encoder (`roberta-base`):**
+    - Encodes panel dialogue, narration, or OCR text.
+    - Outputs projected to $384$-d.
+ 3. **Compositional Encoder (MLP):**
+    - Takes a 7-dimensional vector representing the bounding-box geometry (e.g., aspect ratio, relative area, normalized center coordinates).
+    - Projects through hidden layers to $384$-d.
+ 4. **Gated Fusion (`GatedFusion`):**
+    - Concatenates the three modality outputs and computes a learned softmax gate (see the sketch after this list).
+    - Outputs a weighted sum of the Vision, Text, and Composition features, resulting in the final $384$-d **Panel Embedding**.
+ 5. **Page Aggregation (`SimpleAttention`):**
+    - Uses multi-head attention to pool the variable number of Panel Embeddings on a single page into a unified $384$-d **Page Embedding**.
+ 
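+ The exact fusion code lives in the GitHub repository; the sketch below illustrates the gating idea only (the class name and layer shapes are assumptions, not the repository's exact implementation):
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class GatedFusionSketch(nn.Module):
+     """Illustrative sketch: weight three 384-d modality vectors
+     with a learned softmax gate (assumed structure)."""
+     def __init__(self, d: int = 384):
+         super().__init__()
+         self.gate = nn.Linear(3 * d, 3)  # one logit per modality
+ 
+     def forward(self, vision, text, comp):
+         # Gate logits from the concatenated modalities -> softmax weights
+         logits = self.gate(torch.cat([vision, text, comp], dim=-1))
+         weights = torch.softmax(logits, dim=-1)               # (..., 3)
+         stacked = torch.stack([vision, text, comp], dim=-2)   # (..., 3, d)
+         return (weights.unsqueeze(-1) * stacked).sum(dim=-2)  # (..., d)
+ ```
+ 
+ Because the softmax always assigns every modality a nonzero weight, a missing or uninformative modality still leaks into the fused embedding; this underlies the modality-dominance limitation noted below.
+ 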
+ ## Usage
+ 
+ The codebase for this model resides in the `src/version1/` directory of the repository.
+ 
+ ### Example: Loading and Inference
+ 
+ ```python
+ import torch
+ from PIL import Image
+ import torchvision.transforms as T
+ from transformers import AutoTokenizer
+ 
+ # Requires cloning the GitHub repo
+ from closure_lite_simple_framework import ClosureLiteSimple
+ 
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ 
+ # 1. Initialize Model
+ model = ClosureLiteSimple(d=384, num_heads=4, temperature=0.1).to(device)
+ 
+ # Load weights from Hugging Face
+ state_dict = torch.hub.load_state_dict_from_url(
+     "https://huggingface.co/RichardScottOZ/closure-lite-simple/resolve/main/best_model.pt",
+     map_location=device
+ )
+ if 'model_state_dict' in state_dict:
+     state_dict = state_dict['model_state_dict']
+ model.load_state_dict(state_dict)
+ model.eval()
+ 
+ # 2. Prepare Inputs (Example: A page with 2 panels)
+ transform = T.Compose([
+     T.Resize((224, 224)),
+     T.ToTensor(),
+     T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+ ])
+ 
+ # Dummy Image Crops (B=1 page, N=2 panels, C=3, H=224, W=224)
+ images = torch.stack([
+     transform(Image.new('RGB', (224, 224))),
+     transform(Image.new('RGB', (224, 224)))
+ ]).unsqueeze(0).to(device)
+ 
+ # Dummy Text
+ tokenizer = AutoTokenizer.from_pretrained("roberta-base")
+ text_enc = tokenizer(["Panel 1 text", "Panel 2 text"], return_tensors='pt', padding=True)
+ input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
+ attention_mask = text_enc['attention_mask'].unsqueeze(0).to(device)
+ 
+ # Dummy Composition (B=1, N=2, F=7)
+ comp_feats = torch.zeros((1, 2, 7)).to(device)
+ 
+ # Valid Panel Mask (B=1, N=2)
+ panel_mask = torch.tensor([[True, True]]).to(device)
+ 
+ # 3. Generate Embeddings
+ with torch.no_grad():
+     panel_embeddings, page_embedding = model(
+         images, input_ids, attention_mask, comp_feats, panel_mask
+     )
+ 
+ print(f"Panel Embeddings Shape: {panel_embeddings.shape}")  # (1, 2, 384)
+ print(f"Page Embedding Shape: {page_embedding.shape}")  # (1, 384)
+ ```
+ 
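+ Once generated, the embeddings can be compared with cosine similarity to query comics, per the retrieval use case above. A minimal sketch continuing from the example (the ranking logic here is illustrative, not a repository API):
+ 
+ ```python
+ import torch.nn.functional as F
+ 
+ # L2-normalize so dot products become cosine similarities.
+ panels = F.normalize(panel_embeddings[0], dim=-1)  # (2, 384)
+ page = F.normalize(page_embedding, dim=-1)         # (1, 384)
+ 
+ # Score each panel against the page embedding and take the best match;
+ # any other 384-d embedding could serve as the query instead.
+ sims = panels @ page.T                             # (2, 1)
+ best = sims.squeeze(-1).argmax().item()
+ print(f"Most similar panel index: {best}")
+ ```
+ 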
+ ## Intended Use & Limitations
+ 
+ - **Intended Use:** Originally designed for exploring multimodal embedding spaces and building basic visual/textual retrieval prototypes (like CoMiX v1).
+ - **Limitations:**
+   - **Modality Dominance:** Analysis of this model revealed that if one modality (e.g., text) was missing or uninformative during inference, the `GatedFusion` mechanism struggled to fall back gracefully to the visual features, often resulting in collapsed or non-discriminative embeddings for single-modality queries.
+   - **Deprecated:** This architecture has been superseded by Stage 3 (`comic-panel-encoder-v1`), which utilizes independent modality projection and a masked Adaptive Fusion gate to address the dominance issue (sketched below).
+ 
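+ To illustrate the masked-gate idea in general terms (an illustration of the masking technique only, not the Stage 3 implementation): gate logits of missing modalities are set to negative infinity before the softmax, so their weight becomes exactly zero and the remaining modalities renormalize among themselves.
+ 
+ ```python
+ import torch
+ 
+ def masked_gate(logits: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
+     """logits: (..., 3) gate logits for {vision, text, composition};
+     valid: (..., 3) bool mask of modalities actually present."""
+     logits = logits.masked_fill(~valid, float('-inf'))
+     return torch.softmax(logits, dim=-1)  # missing modalities get weight 0
+ 
+ # Text missing: all weight is redistributed to vision and composition.
+ print(masked_gate(torch.zeros(3), torch.tensor([True, False, True])))
+ # tensor([0.5000, 0.0000, 0.5000])
+ ```
+ 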
+ ## Citation
+ Please reference the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) when utilizing this architecture.