RichardScottOZ committed
Commit f97f985 · verified · 1 Parent(s): 1c2768d

Update model card

Add some model card detail.

Files changed (1): README.md (+115 -1)
README.md CHANGED
@@ -18,4 +18,118 @@ The model code and documentation repository is at https://github.com/RichardScot
 
 Using transformers multimodal fusion of image and text to make embeddings to query comics for similarity or text.
 
- More more detail the repo above.
+ More detail is in the repo above.
+ 
+ ---
+ language: en
+ tags:
+ - vision
+ - text
+ - multimodal
+ - comics
+ - contrastive-learning
+ - vit
+ - roberta
+ license: mit
+ ---
+ 
+ # ClosureLiteSimple (Version 1 - Comic Panel Encoder)
+ 
+ ClosureLiteSimple is the Version 1 precursor to the Stage 3 panel encoder within the [Comic Analysis Framework](https://github.com/RichardScottOZ/Comic-Analysis).
+ 
+ It is a multimodal neural network designed to fuse image crops, textual dialogue, and compositional metadata into a unified **384-dimensional** embedding per comic panel; it can also aggregate these panel embeddings into a single page-level embedding using an attention mechanism.
+ 
+ *(Note: this model is considered deprecated in favor of the newer `comic-panel-encoder-v1`, which utilizes SigLIP, ResNet, and an improved Adaptive Fusion Gate.)*
+ 
+ ## Model Architecture
+ 
+ The `ClosureLiteSimple` model consists of the `PanelAtomizerLite` and a `SimpleAttention` mechanism:
+ 
+ 1. **Vision Encoder (`google/vit-base-patch16-224`):**
+    - Extracts features from $224 \times 224$ panel image crops.
+    - Outputs projected to $384$-d.
+ 2. **Text Encoder (`roberta-base`):**
+    - Encodes panel dialogue, narration, or OCR text.
+    - Outputs projected to $384$-d.
+ 3. **Compositional Encoder (MLP):**
+    - Takes a 7-dimensional vector representing the bounding-box geometry (e.g., aspect ratio, relative area, normalized center coordinates).
+    - Projects through hidden layers to $384$-d.
+ 4. **Gated Fusion (`GatedFusion`):**
+    - Concatenates the three modality outputs and computes a learned softmax gate (see the sketch after this list).
+    - Outputs a weighted sum of the Vision, Text, and Composition features, resulting in the final $384$-d **Panel Embedding**.
+ 5. **Page Aggregation (`SimpleAttention`):**
+    - Uses multi-head attention to pool the variable number of Panel Embeddings on a single page into a unified $384$-d **Page Embedding**.
+ 
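+ The exact fusion code lives in the GitHub repository; the sketch below illustrates the gating idea only (the class name and layer shapes are assumptions, not the repository's exact implementation):
+ 
+ ```python
+ import torch
+ import torch.nn as nn
+ 
+ class GatedFusionSketch(nn.Module):
+     """Illustrative sketch: weight three 384-d modality vectors
+     with a learned softmax gate (assumed structure)."""
+     def __init__(self, d: int = 384):
+         super().__init__()
+         self.gate = nn.Linear(3 * d, 3)  # one logit per modality
+ 
+     def forward(self, vision, text, comp):
+         # Gate logits from the concatenated modalities -> softmax weights
+         logits = self.gate(torch.cat([vision, text, comp], dim=-1))
+         weights = torch.softmax(logits, dim=-1)               # (..., 3)
+         stacked = torch.stack([vision, text, comp], dim=-2)   # (..., 3, d)
+         return (weights.unsqueeze(-1) * stacked).sum(dim=-2)  # (..., d)
+ ```
+ 
+ Because the softmax always assigns every modality a nonzero weight, a missing or uninformative modality still leaks into the fused embedding; this underlies the modality-dominance limitation noted below.
+ 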
+ ## Usage
+ 
+ The codebase for this model resides in the `src/version1/` directory of the repository.
+ 
+ ### Example: Loading and Inference
+ 
+ ```python
+ import torch
+ from PIL import Image
+ import torchvision.transforms as T
+ from transformers import AutoTokenizer
+ 
+ # Requires cloning the GitHub repo
+ from closure_lite_simple_framework import ClosureLiteSimple
+ 
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ 
+ # 1. Initialize Model
+ model = ClosureLiteSimple(d=384, num_heads=4, temperature=0.1).to(device)
+ 
+ # Load weights from Hugging Face
+ state_dict = torch.hub.load_state_dict_from_url(
+     "https://huggingface.co/RichardScottOZ/closure-lite-simple/resolve/main/best_model.pt",
+     map_location=device
+ )
+ if 'model_state_dict' in state_dict:
+     state_dict = state_dict['model_state_dict']
+ model.load_state_dict(state_dict)
+ model.eval()
+ 
+ # 2. Prepare Inputs (Example: A page with 2 panels)
+ transform = T.Compose([
+     T.Resize((224, 224)),
+     T.ToTensor(),
+     T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+ ])
+ 
+ # Dummy Image Crops (B=1 page, N=2 panels, C=3, H=224, W=224)
+ images = torch.stack([
+     transform(Image.new('RGB', (224, 224))),
+     transform(Image.new('RGB', (224, 224)))
+ ]).unsqueeze(0).to(device)
+ 
+ # Dummy Text
+ tokenizer = AutoTokenizer.from_pretrained("roberta-base")
+ text_enc = tokenizer(["Panel 1 text", "Panel 2 text"], return_tensors='pt', padding=True)
+ input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
+ attention_mask = text_enc['attention_mask'].unsqueeze(0).to(device)
+ 
+ # Dummy Composition (B=1, N=2, F=7)
+ comp_feats = torch.zeros((1, 2, 7)).to(device)
+ 
+ # Valid Panel Mask (B=1, N=2)
+ panel_mask = torch.tensor([[True, True]]).to(device)
+ 
+ # 3. Generate Embeddings
+ with torch.no_grad():
+     panel_embeddings, page_embedding = model(
+         images, input_ids, attention_mask, comp_feats, panel_mask
+     )
+ 
+ print(f"Panel Embeddings Shape: {panel_embeddings.shape}")  # (1, 2, 384)
+ print(f"Page Embedding Shape: {page_embedding.shape}")  # (1, 384)
+ ```
+ 
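+ Once generated, the embeddings can be compared with cosine similarity to query comics, per the retrieval use case above. A minimal sketch continuing from the example (the ranking logic here is illustrative, not a repository API):
+ 
+ ```python
+ import torch.nn.functional as F
+ 
+ # L2-normalize so dot products become cosine similarities.
+ panels = F.normalize(panel_embeddings[0], dim=-1)  # (2, 384)
+ page = F.normalize(page_embedding, dim=-1)         # (1, 384)
+ 
+ # Score each panel against the page embedding and take the best match;
+ # any other 384-d embedding could serve as the query instead.
+ sims = panels @ page.T                             # (2, 1)
+ best = sims.squeeze(-1).argmax().item()
+ print(f"Most similar panel index: {best}")
+ ```
+ 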
+ ## Intended Use & Limitations
+ 
+ - **Intended Use:** Originally designed for exploring multimodal embedding spaces and building basic visual/textual retrieval prototypes (like CoMiX v1).
+ - **Limitations:**
+   - **Modality Dominance:** Analysis of this model revealed that if one modality (e.g., text) was missing or uninformative during inference, the `GatedFusion` mechanism struggled to fall back gracefully to the visual features, often resulting in collapsed or non-discriminative embeddings for single-modality queries.
+   - **Deprecated:** This architecture has been superseded by Stage 3 (`comic-panel-encoder-v1`), which utilizes independent modality projection and a masked Adaptive Fusion gate to address the dominance issue (sketched below).
+ 
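+ To illustrate the masked-gate idea in general terms (an illustration of the masking technique only, not the Stage 3 implementation): gate logits of missing modalities are set to negative infinity before the softmax, so their weight becomes exactly zero and the remaining modalities renormalize among themselves.
+ 
+ ```python
+ import torch
+ 
+ def masked_gate(logits: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
+     """logits: (..., 3) gate logits for {vision, text, composition};
+     valid: (..., 3) bool mask of modalities actually present."""
+     logits = logits.masked_fill(~valid, float('-inf'))
+     return torch.softmax(logits, dim=-1)  # missing modalities get weight 0
+ 
+ # Text missing: all weight is redistributed to vision and composition.
+ print(masked_gate(torch.zeros(3), torch.tensor([True, False, True])))
+ # tensor([0.5000, 0.0000, 0.5000])
+ ```
+ 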
+ ## Citation
+ Please reference the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) when utilizing this architecture.