RichardScottOZ committed · verified
Commit c047b44 · Parent(s): 0de2bbd

update model card

Add basic model card.

Files changed (1): README.md (+147, −0)
---
language: en
tags:
- vision
- text
- multimodal
- comics
- page-classification
- bert
license: mit
---

# CoSMo v4 (Comic Stream Modeling - Page Classifier)

CoSMo v4 is a specialized multimodal classifier that categorizes the pages of a comic book archive into distinct structural classes (e.g., *story, cover, advertisement, credits*).

It represents "Stage 2" of the [Comic Analysis Framework v2.0](https://github.com/RichardScottOZ/Comic-Analysis), acting as the gatekeeper that filters raw comic archives down to pure narrative content for downstream sequence modeling.

This v4 iteration introduces the **BookBERTMultimodal2** architecture, which replaces standard convolutional feature extractors with modern vision-language models, achieving state-of-the-art accuracy on unstructured comic data.

## Model Architecture

CoSMo v4 is based on the `BookBERTMultimodal2` class. It treats a comic book as a "sequence" of pages and uses a Transformer encoder to understand the context of a page based on its position in the book.

1. **Visual Features (`1152-dim`):** Extracted with **SigLIP** (`google/siglip-so400m-patch14-384`).
2. **Text Features (`1024-dim`):** Extracted from OCR text with **Qwen-Embedding** (`Qwen/Qwen3-Embedding-0.6B`).
3. **Projections:** Deep MLP projection layers (visual: `1152 -> 3840 -> 1920 -> 768`; text: `1024 -> 3584 -> 1792 -> 768`) align both modalities into a common `768-dim` space.
4. **Contextual Encoding:** A 4-layer, 4-head **BERT encoder** (`transformers.BertModel`) processes the combined features across the entire length of the comic book, allowing the model to learn, for example, that advertisements usually follow story pages and credits appear at the end.
5. **Classification Head:** A deep sequential classifier maps each contextualized `768-dim` token to one of **9 distinct classes**.

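The projection widths follow directly from the input and BERT dimensions: each path starts at `(input_dim + bert_dim) * 2`, halves, then lands on `bert_dim`. A quick sanity check of that arithmetic (plain Python, no model weights needed):

```python
visual_dim, textual_dim, bert_dim = 1152, 1024, 768

# Visual path: (1152 + 768) * 2 = 3840, halved to 1920, then 768.
sz_v = (visual_dim + bert_dim) * 2
print(sz_v, sz_v // 2, bert_dim)  # 3840 1920 768

# Text path follows the same rule from a 1024-dim input.
sz_t = (textual_dim + bert_dim) * 2
print(sz_t, sz_t // 2, bert_dim)  # 3584 1792 768
```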

## Output Classes

The model predicts one of 9 labels for every page:

1. `advertisement`
2. `cover`
3. `story` (the primary narrative content)
4. `textstory`
5. `first-page`
6. `credits`
7. `art` (splash pages, pin-ups)
8. `text` (editorial text)
9. `back_cover`

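For convenience, the same labels as lookup tables, in the index order used by the inference snippet's `class_names` (list position 1-9 maps to output indices 0-8):

```python
CLASS_NAMES = [
    "advertisement", "cover", "story", "textstory", "first-page",
    "credits", "art", "text", "back_cover",
]

# Bidirectional mappings between output indices and label names.
id2label = dict(enumerate(CLASS_NAMES))
label2id = {name: idx for idx, name in id2label.items()}
print(id2label[2], label2id["back_cover"])  # story 8
```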

## Usage

Because CoSMo v4 requires pre-computed SigLIP and Qwen embeddings, inference is typically a two-step process: generate the per-page embeddings, then run the classifier over the whole book. The complete codebase for embedding generation and Zarr-based inference is available in the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis) under `src/cosmo/`.

### Quick Start Inference Snippet

If you already have visual (`1152-d`) and text (`1024-d`) embeddings for a sequence of pages, you can run inference like this:

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertModel

# 1. Define the architecture (must match the checkpoint exactly)
class BookBERT(nn.Module):
    def __init__(self, bert_input=768, num_classes=9, hidden_dim=512, dropout_p=0.0):
        super().__init__()
        config = BertConfig(
            hidden_size=bert_input, num_hidden_layers=4, num_attention_heads=4,
            intermediate_size=bert_input * 4, max_position_embeddings=1024
        )
        self.bert_encoder = BertModel(config)
        self.classifier = nn.Sequential(
            nn.Linear(bert_input, hidden_dim),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.LayerNorm(hidden_dim // 2),
            nn.GELU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim // 2, hidden_dim // 4),
            nn.LayerNorm(hidden_dim // 4),
            nn.GELU(),
            nn.Dropout(dropout_p),
            nn.Linear(hidden_dim // 4, num_classes)
        )

class BookBERTMultimodal2(BookBERT):
    def __init__(self, textual_dim=1024, visual_dim=1152, bert_dim=768, classes=9):
        super().__init__(bert_input=bert_dim, num_classes=classes, hidden_dim=512, dropout_p=0.0)

        # Visual projection: 1152 -> 3840 -> 1920 -> 768
        sz1_v = (visual_dim + bert_dim) * 2
        self.visual_projection = nn.Sequential(
            nn.Linear(visual_dim, sz1_v), nn.LayerNorm(sz1_v), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_v, sz1_v // 2), nn.LayerNorm(sz1_v // 2), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_v // 2, bert_dim)
        )

        # Textual projection: 1024 -> 3584 -> 1792 -> 768
        sz1_t = (textual_dim + bert_dim) * 2
        self.textual_projection = nn.Sequential(
            nn.Linear(textual_dim, sz1_t), nn.LayerNorm(sz1_t), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_t, sz1_t // 2), nn.LayerNorm(sz1_t // 2), nn.GELU(), nn.Dropout(0.0),
            nn.Linear(sz1_t // 2, bert_dim)
        )
        self.norm = nn.LayerNorm(bert_dim)

    def forward(self, textual_features, visual_features):
        batch_size, seq_len, _ = textual_features.shape
        mask = torch.ones((batch_size, seq_len), device=textual_features.device)

        t_norm = self.norm(self.textual_projection(textual_features))
        v_norm = self.norm(self.visual_projection(visual_features))

        # Interleave text/visual tokens per page: (B, seq_len * 2, 768)
        combined = torch.stack([t_norm, v_norm], dim=2).view(batch_size, seq_len * 2, -1)
        exp_mask = mask.unsqueeze(2).expand(-1, -1, 2).reshape(batch_size, seq_len * 2)

        bert_out = self.bert_encoder(inputs_embeds=combined, attention_mask=exp_mask)
        # Regroup token pairs and classify from the second (visual) token of each page
        reshaped = bert_out.last_hidden_state.view(batch_size, seq_len, 2, -1)
        return self.classifier(reshaped[:, :, -1, :])

# 2. Load the model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = BookBERTMultimodal2().to(device)

state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/cosmo-v4/resolve/main/best_Multimodal_MultiToken_v4.pt",
    map_location=device
)
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict, strict=True)
model.eval()

# 3. Inference (example: one comic book containing 24 pages)
# visual_embeddings shape: (1, 24, 1152) -> from SigLIP
# text_embeddings shape:   (1, 24, 1024) -> from Qwen
visual_embs = torch.randn(1, 24, 1152).to(device)
text_embs = torch.randn(1, 24, 1024).to(device)

with torch.inference_mode():
    logits = model(text_embs, visual_embs)
    predictions = torch.argmax(logits, dim=-1).squeeze(0)

class_names = ["advertisement", "cover", "story", "textstory", "first-page", "credits", "art", "text", "back_cover"]
for page_num, pred_idx in enumerate(predictions):
    print(f"Page {page_num}: {class_names[pred_idx]}")
```
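Beyond the argmax, the logits can be softmaxed into per-class probabilities to flag low-confidence pages for manual review. A standalone sketch (random logits stand in for real model output, and the `0.5` threshold is an illustrative assumption):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(1, 24, 9)  # stand-in for the model's (batch, pages, classes) output

probs = torch.softmax(logits, dim=-1)
confidence, predicted = probs.max(dim=-1)

# Flag pages whose top-class probability falls below a hypothetical threshold.
needs_review = (confidence.squeeze(0) < 0.5).nonzero(as_tuple=True)[0]
print(f"{len(needs_review)} of 24 pages flagged for review")
```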

## Intended Use

This model is designed to process an entire comic book/issue as a single sequence. Because of the positional embeddings in the BERT encoder, feeding it pages completely out of order, or a single page at a time without context, will degrade performance.

*Note: The model has a hard limit of `1024` tokens, equating to `512` pages per forward pass (each page consumes one text token and one visual token). Massive omnibuses must be chunked.*
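A minimal chunking sketch for books longer than 512 pages. The window and overlap sizes here are illustrative assumptions, not values from the training setup:

```python
def page_chunks(num_pages, max_pages=512, overlap=32):
    """Yield (start, end) page windows that each fit in one forward pass."""
    start = 0
    while start < num_pages:
        end = min(start + max_pages, num_pages)
        yield start, end
        if end == num_pages:
            break
        start = end - overlap  # overlap keeps some context across boundaries

# A 1100-page omnibus splits into three overlapping windows.
print(list(page_chunks(1100)))  # [(0, 512), (480, 992), (960, 1100)]
```

Predictions in the overlapping region can be taken from whichever window places the page further from a boundary, since mid-window pages see the most context.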

## Citation

If you use this model or the framework, please reference the [Comic Analysis GitHub Repository](https://github.com/RichardScottOZ/Comic-Analysis).

## Example of Cosmov4 predictions

```python