---
license: mit
language:
- en
pipeline_tag: image-text-to-text
tags:
- florence-2
- document-understanding
- ocr
- fine-tuned
- vision-language
base_model: microsoft/Florence-2-large
datasets:
- HuggingFaceM4/DocumentVQA
- nvidia/Nemotron-VLM-Dataset-v1
- HuggingFaceM4/FineVision
---

# Newtype Cognition


## Florence-2 Document OCR Captioner (4-Phase Fine-tuned)

A 4-phase fine-tuned variant of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) trained on document images (DocumentVQA, Nemotron, FineVision). The model performs **document text extraction and document-flavored captioning** but does **not** function as a true Visual Question Answering (VQA) model. See "Limitations" below.

## What this model actually does

Given a document image and a Florence-2 task token (``, ``, ``, etc.), the model produces:

- The **dominant visible text** of the document (e.g., title, biggest number, masthead)
- A **descriptive caption** of the document layout
- **Extracted text regions** (OCR-style)

It is best thought of as a *document-aware OCR captioner*: useful for indexing, thumbnail descriptions, or as a starting checkpoint for further fine-tuning, **not** as a question-answering system.

## Limitations (read this before using)

**The model is question-blind.** During Phase 1–4 training, the data collator fed only the Florence-2 task token (e.g., ``) to the model and dropped the user's question. The model therefore learned a fixed `image → text` mapping, independent of what the user asks.

Concrete behavior on the same image with different questions:

| Question | Predicted answer |
|---|---|
| "What is the name of the university?" | `'2:10:48'` |
| "Where is the university located?" | `'2:10:48'` |
| "To whom is the document sent?" | `'Willow 155-8056'` |

A Phase-4-only retrain with a patched (question-aware) collator did **not** fix the behavior, because Phase 1–3 had already saturated the question-blind mapping at higher learning rates. A full Phase 1→4 retrain with the corrected collator would be required. The collator fix lives in [`train_phase1.py`](https://github.com/Praxisyn/newtype_cognition/blob/main/train_phase1.py) on the `main` branch; this checkpoint was trained before that fix took effect end-to-end.

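To make the failure mode concrete, here is a minimal sketch contrasting a collator text builder that keeps only the task token with one that keeps the question. It is not the code from `train_phase1.py`: the `question` field name, both helper names, and the "prompt, then question" layout are assumptions for illustration. The only detail taken from this card is the English expansion of the region-OCR task token, quoted under "Lessons learned"; the token spelling `<OCR_WITH_REGION>` follows the standard Florence-2 task list.

```python
# Hypothetical sketch of the input-text step of a data collator.
# Helper names and the "question" field are illustrative assumptions.

TASK_PROMPTS = {
    # English expansion of the task token, as quoted under "Lessons learned".
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
}


def question_blind_text(example: dict, task_token: str) -> str:
    """Mimic the bug: only the task token is kept, the question is dropped."""
    return task_token


def question_aware_text(example: dict, task_token: str) -> str:
    """Expand the task token to plain English, then append the question."""
    prompt = TASK_PROMPTS[task_token]
    question = example.get("question", "").strip()
    return f"{prompt} {question}".strip()


examples = [
    {"question": "What is the name of the university?"},
    {"question": "Where is the university located?"},
]

blind = [question_blind_text(ex, "<OCR_WITH_REGION>") for ex in examples]
aware = [question_aware_text(ex, "<OCR_WITH_REGION>") for ex in examples]

# With the blind builder, both questions collapse to one identical input,
# so the model has nothing to condition on besides the image.
print(len(set(blind)))  # 1
print(len(set(aware)))  # 2
```

A check like `len(set(...))` over a handful of questions for the same image is a cheap way to confirm, before training starts, that the question actually reaches the model input.
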
**`<OCR_WITH_REGION>` does not produce region coordinates.** Despite its name, the token outputs plain text identical to `<OCR>`; bounding box coordinates are not generated. The fine-tuning process appears to have collapsed the two tokens to the same behavior, likely because the region format (`` tokens) was not well represented in the training data or was discarded by the collator.

**OCR extracts dominant text only.** Both `<OCR>` and `<OCR_WITH_REGION>` return the most visually prominent text in the document (e.g., the largest number or title), not a full transcription of all text regions. On template documents with placeholder data, this may produce a single value such as `$30` or `$0.00`.

## Evaluation

Evaluated on a 50-sample slice of `HuggingFaceM4/DocumentVQA` validation:

| Metric | Value |
|---|---|
| Exact match | 10.00% |
| Token F1 (avg) | 13.67% |
| Answer-substring hits | 12.00% |

Most of the 10% exact match comes from samples where the expected answer is the dominant visible text on the document (e.g., the company title is the answer to "What is the company name?"). It is **not** evidence of question understanding.

## Token behavior and recommendations

Based on empirical testing, the three most useful task tokens behave as follows:

| Token | Behavior | Recommended for |
|---|---|---|
| `<OCR>` | Extracts the single most visually dominant text | Quick salience extraction, document indexing |
| `<OCR_WITH_REGION>` | Identical output to `<OCR>`; region coordinates not generated | Avoid; use `<OCR>` directly |
| `` | Produces a structured natural-language description of the document layout | Thumbnail descriptions, alt-text generation, layout understanding |

**`` is the most informative token** for document understanding tasks. On a standard service invoice it correctly identified the document type, described the column structure, and extracted footer text, all without being asked any question. Output is capped by `max_new_tokens`; increase it to 256 or higher to avoid truncation on dense documents.

Example output on a service invoice template (``):

```
The image is a service invoice template with hourly rate. It has a white background and black text. The template is divided into two columns, with the left column containing the company name, address, phone number, and email address. The right column contains the total amount of the service invoice, which includes the total cost, subtotal, discount, and other details. At the bottom of the template, there is a note that reads "Thank you" and "Please make check payable to your company name."
```

## Use Cases

This model is best suited for tasks that do **not** require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:

**Document indexing and search**
Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.

**Alt-text and thumbnail description generation**
Automatically generating descriptions of document images for accessibility purposes or content management system previews.

**Visual salience detection**
Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.

**Hybrid OCR pipelines**
Using the model as a first stage to extract text regions, then passing those regions to a separate reasoning model downstream.

**Fine-tuning checkpoint**
Starting a domain-specific fine-tune from this checkpoint rather than from vanilla `microsoft/Florence-2-large`, particularly for document-heavy domains.

## When Not to Use This Model

- **Document Question Answering (DocQA):** The model is question-blind and will ignore any natural language question you provide.
  Do not use it in any pipeline where the output must depend on what the user asks.
- **Conversational document assistants:** Chatbots, legal assistants, medical record reviewers, or any interactive system where a user expects answers grounded in a specific question.
- **Multi-document reasoning:** The model processes a single image and has no cross-document or contextual reasoning capability.
- **Production-critical extraction:** With 10% exact match on DocumentVQA, accuracy is not sufficient for any use case where extraction errors have significant consequences.

## Recommended use

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model = AutoModelForCausalLM.from_pretrained(
    "d3p4rt/newtype-cognition",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="eager",
).cuda().eval()

processor = AutoProcessor.from_pretrained(
    "d3p4rt/newtype-cognition",
    trust_remote_code=True,
)

image = Image.open("document.jpg").convert("RGB")

# Use a Florence-2 task token as the prompt; do NOT pass arbitrary questions.
# "<OCR>" is a standard Florence-2 task token, shown here as an example.
task = "<OCR>"
inputs = processor(text=task, images=image, return_tensors="pt")

# Move tensors to the GPU; cast float32 tensors (pixel values) to float16
# to match the model dtype, and leave integer tensors (token ids) unchanged.
inputs = {
    k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
    for k, v in inputs.items()
}

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=3,
        do_sample=False,
        early_stopping=True,
    )

print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Training Details

- **Base model**: `microsoft/Florence-2-large`
- **Hardware**: NVIDIA RTX 4090 (24 GB) on Vast.ai
- **Precision**: bfloat16
- **Optimizer**: `paged_adamw_8bit`
- **Memory tricks**: gradient checkpointing, `expandable_segments` allocator
- **Phase 1 (warm-up)**: 5 epochs, full fine-tune, lr=2e-5
- **Phase 2 (specialization)**: 3 LoRA adapters (DocVQA, Nemotron, FineVision)
- **Phase 3 (merge)**: weighted merge biased toward DocVQA (0.90)
- **Phase 4 (polish)**: 2 epochs full fine-tune, lr=1e-6

## Datasets

- `HuggingFaceM4/DocumentVQA` (Phase 1, 2)
- `nvidia/Nemotron-VLM-Dataset-v1` (Phase 1, 2)
- `HuggingFaceM4/FineVision` (Phase 1, 2)
- `ChartGen` (Phase 1)

All datasets were streamed; no full local copies were retained.

## Lessons learned (for future fine-tuners)

- **Always include the conditioning input (question) in your data collator from the first epoch**, especially when using a custom collator that builds the model input text from multiple fields.
- Florence-2's processor enforces that special task tokens (``, ``, etc.) are *the only content* in the input text. To inject extra text, manually expand the token to its English prompt (e.g., `<OCR_WITH_REGION>` → `"What is the text in the image, with regions?"`) before concatenating user text.
- A late-stage low-lr "polish" phase **cannot** fix a behavioral bug introduced in earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.

## Demo

Try it live: [Space](https://huggingface.co/spaces/d3p4rt/newtype-cognition-demo)

## License

MIT (inherited from base model).

## Citation

```bibtex
@misc{newtype-cognition,
  author = {d3p4rt},
  title = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
  year = {2026},
  howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
}
```