+---
+license: mit
+language:
+- en
+pipeline_tag: image-text-to-text
+tags:
+- florence-2
+- document-understanding
+- ocr
+- fine-tuned
+- vision-language
+base_model: microsoft/Florence-2-large
+datasets:
+- HuggingFaceM4/DocumentVQA
+- nvidia/Nemotron-VLM-Dataset-v1
+- HuggingFaceM4/FineVision
+---
+# Newtype Cognition
+## Florence-2 Document OCR Captioner (4-Phase Fine-tuned)
+A 4-phase fine-tuned variant of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)
+trained on document images (DocumentVQA, Nemotron, FineVision, plus ChartGen in Phase 1).
+The model performs **document text extraction and document-flavored captioning** but does
+**not** function as a true Visual Question Answering (VQA) model. See "Limitations" below.
+## What this model actually does
+Given a document image and a Florence-2 task token (`<OCR_WITH_REGION>`, `<CAPTION>`,
+`<MORE_DETAILED_CAPTION>`, etc.), the model produces:
+- The **dominant visible text** of the document (e.g., title, biggest number, masthead)
+- A **descriptive caption** of the document layout
+- **Extracted text regions** (OCR-style)
+It is best thought of as a *document-aware OCR captioner* — useful for indexing,
+thumbnail descriptions, or as a starting checkpoint for further fine-tuning, **not**
+as a question-answering system.
+## Limitations (read this before using)
+**The model is question-blind.** During Phase 1–4 training, the data collator fed only
+the Florence-2 task token (e.g., `<OCR_WITH_REGION>`) to the model and dropped the user's
+question. The model therefore learned a fixed `image → text` mapping, independent of
+what the user asks. Concrete behavior on the same image with different questions:
+| Question | Predicted answer |
+|---|---|
+| "What is the name of the university?" | `'2:10:48'` |
+| "Where is the university located?" | `'2:10:48'` |
+| "To whom is the document sent?" | `'Willow 155-8056'` |
+A Phase-4-only retrain with a patched (question-aware) collator did **not** fix the
+behavior, because Phase 1–3 had already saturated the question-blind mapping at higher
+learning rates. A full Phase 1→4 retrain with the corrected collator would be required.
+The collator fix lives in [`train_phase1.py`](https://github.com/Praxisyn/newtype_cognition/blob/main/train_phase1.py)
+on the `main` branch; this checkpoint was trained before that fix took effect end-to-end.
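One way to catch this failure mode early is to ask several unrelated questions about a single image and check whether the output ever changes. The following is a minimal sketch, not code from the repository: `predict` is a hypothetical callable `(image, question) -> str` wrapping whatever inference code you use, and the two toy models exist only to make the example runnable.

```python
from typing import Any, Callable, Sequence, Set


def distinct_answers(
    predict: Callable[[Any, str], str],
    image: Any,
    questions: Sequence[str],
) -> Set[str]:
    """Run every question against the same image and collect the unique outputs."""
    return {predict(image, question).strip() for question in questions}


def looks_question_blind(
    predict: Callable[[Any, str], str],
    image: Any,
    questions: Sequence[str],
) -> bool:
    """Return True when all questions yield the same output for one image."""
    if len(questions) < 2:
        raise ValueError("need at least two different questions")
    return len(distinct_answers(predict, image, questions)) == 1


# Toy stand-ins for a real model (hypothetical, for illustration only).
def blind_model(image: Any, question: str) -> str:
    return "2:10:48"  # ignores the question entirely


def aware_model(image: Any, question: str) -> str:
    return f"answer to: {question}"  # output depends on the question


questions = [
    "What is the name of the university?",
    "Where is the university located?",
]
print(looks_question_blind(blind_model, None, questions))  # True
print(looks_question_blind(aware_model, None, questions))  # False
```

The check is only meaningful when the questions have different correct answers for the chosen image; otherwise a question-aware model would also return a single distinct output.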
+## Evaluation
+Evaluated on a 50-sample slice of `HuggingFaceM4/DocumentVQA` validation:
+| Metric | Value |
+|---|---|
+| Exact match | 10.00% (5/50) |
+| Token F1 (avg) | 13.67% |
+| Answer-substring hits | 12.00% (6/50) |
+Most of the 10% exact match comes from samples where the expected answer is the dominant
+visible text on the document (e.g., the company title is the answer to "What is the
+company name?"). It is **not** evidence of question understanding.
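The card does not spell out how the three metrics were computed. The sketch below shows one common way to define them, assuming SQuAD-style normalization (lowercasing and whitespace collapsing). The helper names `normalize`, `exact_match`, `token_f1`, and `substring_hit` are hypothetical, and the actual evaluation script may differ.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_hit(pred: str, gold: str) -> bool:
    """True when the non-empty reference appears anywhere in the prediction."""
    gold_norm = normalize(gold)
    return bool(gold_norm) and gold_norm in normalize(pred)
```

Under these definitions every exact match on a non-empty reference is also a substring hit, which is consistent with the substring rate being at least as high as the exact-match rate in the table above.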
+## Recommended use
+```python
+from transformers import AutoModelForCausalLM, AutoProcessor
+from PIL import Image
+import torch
+model = AutoModelForCausalLM.from_pretrained(
+    "d3p4rt/newtype-cognition",
+    trust_remote_code=True,
+    torch_dtype=torch.float16,
+    attn_implementation="eager",
+).cuda().eval()
+processor = AutoProcessor.from_pretrained(
+    "d3p4rt/newtype-cognition",
+    trust_remote_code=True,
+)
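+image = Image.open("document.jpg").convert("RGB")
+# Use Florence-2 task tokens only; do NOT pass arbitrary questions
+inputs = processor(text="<OCR_WITH_REGION>", images=image, return_tensors="pt")
+# Move all tensors to the GPU; cast only float32 tensors (the pixel values) to
+# float16 to match the model, leaving integer token IDs untouched
+inputs = {k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
+          for k, v in inputs.items()}
+# Deterministic beam search (no sampling)
+with torch.no_grad():
+    out = model.generate(
+        **inputs,
+        max_new_tokens=128,
+        num_beams=3,
+        do_sample=False,
+        early_stopping=True,
+    )
+# Decode the first (only) sequence to plain text, dropping special tokens
+print(processor.batch_decode(out, skip_special_tokens=True)[0])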
+```
+## Training Details
+- **Base model**: `microsoft/Florence-2-large`
+- **Hardware**: NVIDIA RTX 4090 (24 GB) on Vast.ai
+- **Precision**: bfloat16
+- **Optimizer**: `paged_adamw_8bit`
+- **Memory tricks**: gradient checkpointing, `expandable_segments` allocator
+- **Phase 1 (warm-up)**: 5 epochs, full fine-tune, lr=2e-5
+- **Phase 2 (specialization)**: 3 LoRA adapters (DocVQA, Nemotron, FineVision)
+- **Phase 3 (merge)**: weighted merge biased toward DocVQA (0.90)
+- **Phase 4 (polish)**: 2 epochs full fine-tune, lr=1e-6
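The Phase 3 merge can be pictured as a linear combination of the three adapters' weight deltas. The sketch below is an illustration only: the card gives the DocVQA coefficient (0.90) but not the other two, so the 0.05 values are assumptions, as are the function name `weighted_merge` and the toy parameter values. It works on flattened weight deltas in plain Python lists and does not reproduce the actual merge script.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened delta values


def weighted_merge(adapters: Dict[str, Weights], coeffs: Dict[str, float]) -> Weights:
    """Combine adapter weight deltas parameter by parameter as a weighted sum."""
    if set(adapters) != set(coeffs):
        raise ValueError("adapters and coeffs must have the same keys")
    if abs(sum(coeffs.values()) - 1.0) > 1e-6:
        raise ValueError("coefficients should sum to 1")
    names = list(adapters)
    merged: Weights = {}
    for param in adapters[names[0]]:
        # Walk the same position of this parameter across all adapters.
        columns = zip(*(adapters[name][param] for name in names))
        merged[param] = [
            sum(coeffs[name] * value for name, value in zip(names, column))
            for column in columns
        ]
    return merged


# Toy deltas; real adapters would hold tensors rather than two-element lists.
adapters = {
    "docvqa": {"w": [1.0, 0.0]},
    "nemotron": {"w": [0.0, 1.0]},
    "finevision": {"w": [0.0, 0.0]},
}
# 0.90 comes from the card; the two 0.05 values are assumed for illustration.
coeffs = {"docvqa": 0.90, "nemotron": 0.05, "finevision": 0.05}
merged = weighted_merge(adapters, coeffs)
```

With a coefficient this close to 1, the merged weights stay near the DocVQA adapter, so any behavior that adapter learned dominates the result.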
+## Datasets
+- `HuggingFaceM4/DocumentVQA` (Phase 1, 2)
+- `nvidia/Nemotron-VLM-Dataset-v1` (Phase 1, 2)
+- `HuggingFaceM4/FineVision` (Phase 1, 2)
+- `ChartGen` (Phase 1)
+All datasets streamed; no full local copies retained.
+## Lessons learned (for future fine-tuners)
+- **Always include the conditioning input (question) in your data collator from the
+  first epoch**, especially when using a custom collator that builds the model input
+  text from multiple fields.
+- Florence-2's processor enforces that special task tokens (`<OCR_WITH_REGION>`,
+  `<CAPTION>`, etc.) are *the only content* in the input text. To inject extra text,
+  manually expand the token to its English prompt (e.g., `<OCR_WITH_REGION>` →
+  `"What is the text in the image, with regions?"`) before concatenating user text.
+- A late-stage low-lr "polish" phase **cannot** fix a behavioral bug introduced in
+  earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.
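The first two points can be sketched as a small prompt builder for a question-aware collator. This is an illustration, not the code in `train_phase1.py`: `TASK_PROMPTS` and `build_prompt` are hypothetical names, only the `<OCR_WITH_REGION>` expansion is taken from the text above, and joining prompt and question with a single space is an assumption.

```python
from typing import Optional

# Task token -> English prompt. Only this one expansion appears in the card;
# add others after checking them against the Florence-2 processor.
TASK_PROMPTS = {
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
}


def build_prompt(task_token: str, question: Optional[str] = None) -> str:
    """Return the text to hand to the processor for one training example."""
    if question is None or not question.strip():
        # No extra text: the bare task token is valid input on its own.
        return task_token
    if task_token not in TASK_PROMPTS:
        raise KeyError(f"no English expansion known for {task_token!r}")
    # The task token must not be mixed with other text, so expand it first
    # and only then append the user's question.
    return f"{TASK_PROMPTS[task_token]} {question.strip()}"


print(build_prompt("<OCR_WITH_REGION>"))
# <OCR_WITH_REGION>
print(build_prompt("<OCR_WITH_REGION>", "To whom is the document sent?"))
# What is the text in the image, with regions? To whom is the document sent?
```

Printing a few built prompts from the first training batch is a cheap way to confirm that the question actually reaches the model before committing to a multi-phase run.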
+## License
+MIT (inherited from base model).
+## Citation
+```bibtex
+@misc{newtype-cognition,
+  author       = {d3p4rt},
+  title        = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
+  year         = {2026},
+  howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
+}
+```