d3p4rt committed on
Commit b3ff445 · verified · 1 Parent(s): e64fc69

Add honest model card documenting question-blindness

Files changed (1)
  1. README.md +158 -0
README.md ADDED
@@ -0,0 +1,158 @@
+ ---
+ license: mit
+ language:
+ - en
+ pipeline_tag: image-text-to-text
+ tags:
+ - florence-2
+ - document-understanding
+ - ocr
+ - fine-tuned
+ - vision-language
+ base_model: microsoft/Florence-2-large
+ datasets:
+ - HuggingFaceM4/DocumentVQA
+ - nvidia/Nemotron-VLM-Dataset-v1
+ - HuggingFaceM4/FineVision
+ ---
+
+ # Newtype Cognition
+
+ ## Florence-2 Document OCR Captioner (4-Phase Fine-tuned)
+
+ A 4-phase fine-tuned variant of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)
+ trained on document images (DocumentVQA, Nemotron, FineVision, plus ChartGen in Phase 1). The model performs
+ **document text extraction and document-flavored captioning** but does **not** function
+ as a true Visual Question Answering (VQA) model. See "Limitations" below.
+
+ ## What this model actually does
+
+ Given a document image and a Florence-2 task token (`<OCR_WITH_REGION>`, `<CAPTION>`,
+ `<MORE_DETAILED_CAPTION>`, etc.), the model produces:
+
+ - The **dominant visible text** of the document (e.g., title, largest number, masthead)
+ - A **descriptive caption** of the document layout
+ - **Extracted text regions** (OCR-style)
+
+ It is best thought of as a *document-aware OCR captioner*: useful for indexing,
+ thumbnail descriptions, or as a starting checkpoint for further fine-tuning, but **not**
+ as a question-answering system.
+
+ ## Limitations (read this before using)
+
+ **The model is question-blind.** During Phase 1–4 training, the data collator fed only
+ the Florence-2 task token (e.g., `<OCR_WITH_REGION>`) to the model and dropped the user's
+ question. The model therefore learned a fixed `image → text` mapping that is independent of
+ what the user asks. Concrete behavior on the same image with different questions:
+
+ | Question | Predicted answer |
+ |---|---|
+ | "What is the name of the university?" | `'2:10:48'` |
+ | "Where is the university located?" | `'2:10:48'` |
+ | "To whom is the document sent?" | `'Willow 155-8056'` |
+
+ A Phase-4-only retrain with a patched (question-aware) collator did **not** fix the
+ behavior: Phases 1–3 had already entrenched the question-blind mapping at higher
+ learning rates, and the Phase 4 learning rate (1e-6) is too low to undo it. A full
+ Phase 1→4 retrain with the corrected collator would be required.
+
+ The collator fix lives in [`train_phase1.py`](https://github.com/Praxisyn/newtype_cognition/blob/main/train_phase1.py)
+ on the `main` branch; this checkpoint was trained before that fix took effect end-to-end.
+
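One way to surface this failure early is to ask several different questions about the same image and measure how often the answer is identical. The sketch below is a minimal illustration, not code from this repository: `predict` is a hypothetical callable wrapping whatever inference path is in use, and `question_blindness_score`, `blind_model`, and `aware_model` are illustrative names.

```python
from typing import Any, Callable, Sequence


def question_blindness_score(
    predict: Callable[[Any, str], str],
    image: Any,
    questions: Sequence[str],
) -> float:
    """Return the share of questions that received the most common answer.

    1.0 means every question produced the same text (question-blind on this
    image); lower values mean the answer varies with the question.
    `predict` is a hypothetical callable: (image, question) -> answer string.
    """
    if not questions:
        raise ValueError("need at least one question")
    answers = [predict(image, q).strip() for q in questions]
    most_common_count = max(answers.count(a) for a in set(answers))
    return most_common_count / len(answers)


# Stub predictors standing in for a real model (illustration only).
def blind_model(image: Any, question: str) -> str:
    return "2:10:48"  # ignores the question entirely


def aware_model(image: Any, question: str) -> str:
    return f"answer to: {question}"


QUESTIONS = [
    "What is the name of the university?",
    "Where is the university located?",
    "To whom is the document sent?",
]

print(question_blindness_score(blind_model, None, QUESTIONS))
print(question_blindness_score(aware_model, None, QUESTIONS))
```

A single image can legitimately yield the same answer to two questions, so in practice the score is more informative when averaged over many images with questions that have different ground-truth answers.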
+ ## Evaluation
+
+ Evaluated on a 50-sample slice of the `HuggingFaceM4/DocumentVQA` validation split:
+
+ | Metric | Value |
+ |---|---|
+ | Exact match | 10.00% |
+ | Token F1 (avg) | 13.67% |
+ | Answer-substring hits | 12.00% |
+
+ Most of the 10% exact-match score (5 of 50 samples) comes from samples where the expected
+ answer is the dominant visible text on the document (e.g., the company title is the answer
+ to "What is the company name?"). It is **not** evidence of question understanding.
+
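The card does not say how the three metrics were computed. The sketch below shows one plausible reading, assuming SQuAD-style normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace); the function names and the sample pairs are illustrative and not taken from the evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_hit(pred: str, gold: str) -> float:
    """1.0 if the normalized gold answer appears inside the normalized prediction."""
    gold_norm = normalize(gold)
    return float(bool(gold_norm) and gold_norm in normalize(pred))


def evaluate(pairs):
    """Average each metric over (prediction, gold) pairs."""
    n = len(pairs)
    return {
        "exact_match": sum(exact_match(p, g) for p, g in pairs) / n,
        "token_f1": sum(token_f1(p, g) for p, g in pairs) / n,
        "substring_hits": sum(substring_hit(p, g) for p, g in pairs) / n,
    }


# Made-up (prediction, gold) pairs for illustration only.
PAIRS = [
    ("University of California", "university of california"),
    ("2:10:48", "San Diego"),
    ("Dear Dr. Smith, hello", "Dr. Smith"),
]
print(evaluate(PAIRS))
```

Under this reading, a prediction that merely contains the gold answer counts as a substring hit but not an exact match, which would explain why the substring figure sits slightly above exact match.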
+ ## Recommended use
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoProcessor
+ from PIL import Image
+ import torch
+
+ # Florence-2 ships custom modeling code, hence trust_remote_code=True.
+ model = AutoModelForCausalLM.from_pretrained(
+     "d3p4rt/newtype-cognition",
+     trust_remote_code=True,
+     torch_dtype=torch.float16,
+     attn_implementation="eager",
+ ).cuda().eval()
+ processor = AutoProcessor.from_pretrained(
+     "d3p4rt/newtype-cognition",
+     trust_remote_code=True,
+ )
+
+ image = Image.open("document.jpg").convert("RGB")
+
+ # Use Florence-2 task tokens; do NOT pass arbitrary questions
+ inputs = processor(text="<OCR_WITH_REGION>", images=image, return_tensors="pt")
+ # Move every tensor to the GPU; cast float32 tensors (pixel values) to float16
+ # to match the model weights and leave integer token ids untouched.
+ inputs = {
+     k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
+     for k, v in inputs.items()
+ }
+
+ with torch.no_grad():
+     out = model.generate(
+         **inputs,
+         max_new_tokens=128,
+         num_beams=3,
+         do_sample=False,
+         early_stopping=True,
+     )
+
+ print(processor.batch_decode(out, skip_special_tokens=True)[0])
+ ```
+
+ ## Training Details
+
+ - **Base model**: `microsoft/Florence-2-large`
+ - **Hardware**: NVIDIA RTX 4090 (24 GB) on Vast.ai
+ - **Precision**: bfloat16
+ - **Optimizer**: `paged_adamw_8bit`
+ - **Memory tricks**: gradient checkpointing, `expandable_segments` CUDA allocator setting
+ - **Phase 1 (warm-up)**: 5 epochs, full fine-tune, lr=2e-5
+ - **Phase 2 (specialization)**: 3 LoRA adapters (DocVQA, Nemotron, FineVision)
+ - **Phase 3 (merge)**: weighted merge biased toward DocVQA (weight 0.90)
+ - **Phase 4 (polish)**: 2 epochs, full fine-tune, lr=1e-6
+
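The settings listed above could be captured in a small config like the sketch below. It assumes the Hugging Face `Trainer` was used (the card does not say so explicitly, although `paged_adamw_8bit` is the name of one of its `optim` options); the keys are `TrainingArguments` parameter names, while `COMMON` and `PHASE_ARGS` are illustrative names. Phases 2 and 3 are left out because the card gives no hyperparameters for them.

```python
import os

# PyTorch reads this variable when the CUDA caching allocator initializes,
# so it has to be set before the first CUDA call in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# Settings shared by both full fine-tune phases (values taken from the list above).
COMMON = {
    "bf16": True,                    # bfloat16 precision
    "optim": "paged_adamw_8bit",     # needs bitsandbytes installed
    "gradient_checkpointing": True,  # trade compute for activation memory
}

# Per-phase overrides; only the values stated in the card are filled in.
PHASE_ARGS = {
    "phase1": {**COMMON, "num_train_epochs": 5, "learning_rate": 2e-5},
    "phase4": {**COMMON, "num_train_epochs": 2, "learning_rate": 1e-6},
}

for name, kwargs in PHASE_ARGS.items():
    print(name, kwargs)
```

With `transformers` installed, each dict could be passed along the lines of `TrainingArguments(output_dir="out/phase1", **PHASE_ARGS["phase1"])`, where the `output_dir` value is a placeholder.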
+ ## Datasets
+
+ - `HuggingFaceM4/DocumentVQA` (Phases 1 and 2)
+ - `nvidia/Nemotron-VLM-Dataset-v1` (Phases 1 and 2)
+ - `HuggingFaceM4/FineVision` (Phases 1 and 2)
+ - `ChartGen` (Phase 1)
+
+ All datasets were streamed; no full local copies were retained.
+
+ ## Lessons learned (for future fine-tuners)
+
+ - **Always include the conditioning input (question) in your data collator from the
+   first epoch**, especially when using a custom collator that builds the model input
+   text from multiple fields.
+ - Florence-2's processor enforces that special task tokens (`<OCR_WITH_REGION>`,
+   `<CAPTION>`, etc.) are *the only content* in the input text. To inject extra text,
+   manually expand the token to its English prompt (e.g., `<OCR_WITH_REGION>` →
+   `"What is the text in the image, with regions?"`) before concatenating user text.
+ - A late-stage low-learning-rate "polish" phase **cannot** fix a behavioral bug introduced
+   in earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.
+
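The token-expansion step from the second bullet can be sketched as a small text builder. This is an illustration under stated assumptions, not the code in `train_phase1.py`: `TASK_PROMPTS`, `build_prompt`, and `collate_texts` are hypothetical names, the example schema (a dict with a `question` key) is assumed, and only the expansion string quoted above is included. Other task tokens would need their expansions checked against the processor source shipped with the checkpoint.

```python
# Expansion table. Only the string quoted in this card is listed; add others
# after verifying them against the processor code (assumption: one English
# prompt per task token).
TASK_PROMPTS = {
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
}


def build_prompt(task_token: str, question: str = "") -> str:
    """Expand a task token to plain text, then append the user's question.

    The returned string contains no task token, so it does not trip the
    processor's "task token must be the only content" check.
    """
    try:
        base = TASK_PROMPTS[task_token]
    except KeyError:
        raise ValueError(f"no expansion registered for {task_token!r}") from None
    question = question.strip()
    return f"{base} {question}" if question else base


def collate_texts(examples, task_token="<OCR_WITH_REGION>"):
    """Build one input text per example (assumed schema: dict with 'question')."""
    return [build_prompt(task_token, ex.get("question", "")) for ex in examples]


batch = [
    {"question": "What is the name of the university?"},
    {"question": "Where is the university located?"},
]
for text in collate_texts(batch):
    print(text)
```

The point of the design is that two examples sharing an image but differing in their question now produce different input texts, so the model has a signal to condition on from the first epoch.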
+ ## License
+
+ MIT (inherited from base model).
+
+ ## Citation
+
+ ```bibtex
+ @misc{newtype-cognition,
+   author = {d3p4rt},
+   title = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
+   year = {2026},
+   howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
+ }
+ ```