---
license: mit
language:
- en
pipeline_tag: image-text-to-text
tags:
- florence-2
- document-understanding
- ocr
- fine-tuned
- vision-language
base_model: microsoft/Florence-2-large
datasets:
- HuggingFaceM4/DocumentVQA
- nvidia/Nemotron-VLM-Dataset-v1
- HuggingFaceM4/FineVision
---

# Newtype Cognition
<p align="center">
  <img src="logo.png" alt="Newtype Cognition Logo" width="400"/>
</p>

## Florence-2 Document OCR Captioner (4-Phase Fine-tuned)

A 4-phase fine-tuned variant of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)
trained on document images (DocumentVQA, Nemotron, FineVision). The model performs
**document text extraction and document-flavored captioning** but does **not** function
as a true Visual Question Answering (VQA) model. See "Limitations" below.

## What this model actually does

Given a document image and a Florence-2 task token (`<OCR_WITH_REGION>`, `<CAPTION>`,
`<MORE_DETAILED_CAPTION>`, etc.), the model produces:

- The **dominant visible text** of the document (e.g., title, biggest number, masthead)
- A **descriptive caption** of the document layout
- **Extracted text** (OCR-style, without region coordinates)

It is best thought of as a *document-aware OCR captioner*: useful for indexing,
thumbnail descriptions, or as a starting checkpoint for further fine-tuning, but **not**
as a question-answering system.

## Limitations (read this before using)

**The model is question-blind.** During Phase 1–4 training, the data collator fed only
the Florence-2 task token (e.g., `<OCR_WITH_REGION>`) to the model and dropped the user's
question. The model therefore learned a fixed `image → text` mapping, independent of
what the user asks. Concrete behavior on the same image with different questions:

| Question | Predicted answer |
|---|---|
| "What is the name of the university?" | `'2:10:48'` |
| "Where is the university located?" | `'2:10:48'` |
| "To whom is the document sent?" | `'Willow 155-8056'` |

A Phase-4-only retrain with a patched (question-aware) collator did **not** fix the
behavior, because Phases 1–3 had already saturated the question-blind mapping at higher
learning rates. A full Phase 1→4 retrain with the corrected collator would be required.

The collator fix lives in [`train_phase1.py`](https://github.com/Praxisyn/newtype_cognition/blob/main/train_phase1.py)
on the `main` branch; this checkpoint was trained before that fix took effect end-to-end.

**`<OCR_WITH_REGION>` does not produce region coordinates.** Despite its name, the token
outputs plain text identical to `<OCR>`; bounding box coordinates are not generated.
Fine-tuning appears to have collapsed the two tokens to the same behavior,
likely because the region format (`<loc_N>` tokens) was not well represented in the
training data or was discarded by the collator.

**OCR extracts dominant text only.** Both `<OCR>` and `<OCR_WITH_REGION>` return the
most visually prominent text in the document (e.g., the largest number or title), not
a full transcription of all text regions. On template documents with placeholder data,
this may produce a single value such as `$30` or `$0.00`.

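The first two limitations can be checked on any checkpoint with a couple of small helpers. The sketch below is an illustration, not the script used for this card: `distinct_outputs` and `count_loc_tokens` are hypothetical names, and `generate` stands for any callable you write that runs the model on one fixed image and returns decoded text for a given prompt.

```python
import re
from typing import Callable, Iterable, Set

# Florence-2 writes box coordinates as location tokens such as <loc_12>.
LOC_TOKEN = re.compile(r"<loc_\d+>")


def distinct_outputs(generate: Callable[[str], str], prompts: Iterable[str]) -> Set[str]:
    """Collect the distinct outputs produced for one fixed image.

    `generate` is a hypothetical wrapper around model inference. A single
    distinct output across very different prompts suggests question-blindness.
    """
    return {generate(prompt).strip() for prompt in prompts}


def count_loc_tokens(raw_output: str) -> int:
    """Count <loc_N> tokens in text decoded with skip_special_tokens=False."""
    return len(LOC_TOKEN.findall(raw_output))
```

A count of zero for `<OCR_WITH_REGION>` output would be consistent with the collapsed behavior described above.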
## Evaluation

Evaluated on a 50-sample slice of the `HuggingFaceM4/DocumentVQA` validation split:

| Metric | Value |
|---|---|
| Exact match | 10.00% |
| Token F1 (avg) | 13.67% |
| Answer-substring hits | 12.00% |

Most of the 10% exact match (5 of 50 samples) comes from samples where the expected answer
is the dominant visible text on the document (e.g., the company title is the answer to
"What is the company name?"). It is **not** evidence of question understanding.

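For readers who want to reproduce metrics of this kind, here is a minimal sketch of the three scores. The normalization (lowercasing, stripping ASCII punctuation, collapsing whitespace) and the SQuAD-style token F1 are assumptions; the card does not state exactly how the reported numbers were computed, and all function names are illustrative.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed scheme)."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_hit(pred: str, gold: str) -> bool:
    """True when the normalized reference appears inside the normalized prediction."""
    gold_norm = normalize(gold)
    return bool(gold_norm) and gold_norm in normalize(pred)
```

DocumentVQA samples can carry several acceptable answers; one reasonable choice is to score each prediction against every reference and keep the best value.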
## Token behavior and recommendations

Based on empirical testing, the three most useful task tokens behave as follows:

| Token | Behavior | Recommended for |
|---|---|---|
| `<OCR>` | Extracts the single most visually dominant text | Quick salience extraction, document indexing |
| `<OCR_WITH_REGION>` | Identical output to `<OCR>`; region coordinates are not generated | Avoid; use `<OCR>` directly |
| `<MORE_DETAILED_CAPTION>` | Produces a structured natural-language description of the document layout | Thumbnail descriptions, alt-text generation, layout understanding |

**`<MORE_DETAILED_CAPTION>` is the most informative token** for document understanding tasks.
On a standard service invoice it correctly identified the document type, described the
column structure, and extracted footer text, all without being asked any question. Output
length is capped by `max_new_tokens`; set it to 256 or higher to avoid truncation on dense documents.

Example output on a service invoice template (`<MORE_DETAILED_CAPTION>`):
```
The image is a service invoice template with hourly rate. It has a white background
and black text. The template is divided into two columns, with the left column containing
the company name, address, phone number, and email address. The right column contains
the total amount of the service invoice, which includes the total cost, subtotal,
discount, and other details. At the bottom of the template, there is a note that reads
"Thank you" and "Please make check payable to your company name."
```

## Use Cases

This model is best suited for tasks that do **not** require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:

**Document indexing and search**
Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.

**Alt-text and thumbnail description generation**
Automatically generating descriptions of document images for accessibility purposes or content management system previews.

**Visual salience detection**
Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.

**Hybrid OCR pipelines**
Using the model as a first stage to extract the dominant text and a layout description, then passing that output to a separate reasoning model downstream.

**Fine-tuning checkpoint**
Starting a domain-specific fine-tune from this checkpoint rather than from the vanilla `microsoft/Florence-2-large`, particularly for document-heavy domains.

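As an illustration of the indexing use case, the sketch below builds a tiny inverted index over text the model has already extracted. The file names, the `build_index` and `search` helpers, and the whitespace tokenization are all hypothetical; a real archive would more likely use a proper search engine.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(extracted: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased token to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in extracted.items():
        for raw_token in text.lower().split():
            token = raw_token.strip(".,:;()\"'")
            if token:
                index[token].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents containing every token of the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)
```

Because `<OCR>` returns only the dominant text, an index built this way covers titles and headline values rather than full document bodies.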
## When Not to Use This Model

- **Document Question Answering (DocQA):** The model is question-blind and will ignore any natural language question you provide. Do not use it in any pipeline where the output must depend on what the user asks.
- **Conversational document assistants:** Chatbots, legal assistants, medical record reviewers, or any interactive system where a user expects answers grounded in a specific question.
- **Multi-document reasoning:** The model processes a single image and has no cross-document or contextual reasoning capability.
- **Production-critical extraction:** With 10% exact match on a 50-sample DocumentVQA validation slice, accuracy is not sufficient for any use case where extraction errors have significant consequences.

## Recommended use

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

MODEL_ID = "d3p4rt/newtype-cognition"

# Load the fine-tuned checkpoint in half precision on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="eager",
).cuda().eval()
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

image = Image.open("document.jpg").convert("RGB")

# Use a Florence-2 task token as the whole prompt; do NOT pass arbitrary questions.
# <OCR_WITH_REGION> returns the same plain text as <OCR> on this checkpoint.
task = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt")
# Move tensors to the GPU and cast float32 inputs to match the fp16 weights.
inputs = {
    k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
    for k, v in inputs.items()
}

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,  # raise further if dense documents are truncated
        num_beams=3,
        do_sample=False,
        early_stopping=True,
    )

print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Training Details

- **Base model**: `microsoft/Florence-2-large`
- **Hardware**: NVIDIA RTX 4090 (24 GB) on Vast.ai
- **Precision**: bfloat16
- **Optimizer**: `paged_adamw_8bit`
- **Memory tricks**: gradient checkpointing, `expandable_segments` allocator
- **Phase 1 (warm-up)**: 5 epochs, full fine-tune, lr=2e-5
- **Phase 2 (specialization)**: 3 LoRA adapters (DocVQA, Nemotron, FineVision)
- **Phase 3 (merge)**: weighted merge biased toward DocVQA (0.90)
- **Phase 4 (polish)**: 2 epochs full fine-tune, lr=1e-6

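The Phase 3 merge can be pictured as a weighted average of adapter parameters. The sketch below is only an illustration on plain Python lists, not the merge code used for this checkpoint: the 0.90 weight for DocVQA comes from the list above, while the 0.05 weights for the other two adapters, the parameter name `w`, and the `weighted_merge` helper are assumptions.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def weighted_merge(adapters: Dict[str, Params], weights: Dict[str, float]) -> Params:
    """Average same-shaped parameter vectors, weighting each adapter by name."""
    total = sum(weights[name] for name in adapters)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    names = list(adapters)
    reference = adapters[names[0]]
    merged: Params = {}
    for key, values in reference.items():
        merged[key] = [
            sum(weights[name] * adapters[name][key][i] for name in names) / total
            for i in range(len(values))
        ]
    return merged


# Hypothetical weights: only the 0.90 for DocVQA is stated in the card.
example_weights = {"docvqa": 0.90, "nemotron": 0.05, "finevision": 0.05}
example_adapters = {
    "docvqa": {"w": [1.0, 0.0]},
    "nemotron": {"w": [0.0, 1.0]},
    "finevision": {"w": [0.0, 0.0]},
}
merged = weighted_merge(example_adapters, example_weights)
```

With a weight this skewed, the merged parameters stay very close to the DocVQA adapter, so any behavior it learned tends to dominate the result.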
## Datasets

- `HuggingFaceM4/DocumentVQA` (Phase 1, 2)
- `nvidia/Nemotron-VLM-Dataset-v1` (Phase 1, 2)
- `HuggingFaceM4/FineVision` (Phase 1, 2)
- `ChartGen` (Phase 1)

All datasets were streamed; no full local copies were retained.

## Lessons learned (for future fine-tuners)

- **Always include the conditioning input (question) in your data collator from the
  first epoch**, especially when using a custom collator that builds the model input
  text from multiple fields.
- Florence-2's processor requires that a special task token (`<OCR_WITH_REGION>`,
  `<CAPTION>`, etc.) be *the only content* in the input text. To inject extra text,
  manually expand the token to its English prompt (e.g., `<OCR_WITH_REGION>` →
  `"What is the text in the image, with regions?"`) before concatenating user text.
- A late-stage low-lr "polish" phase **cannot** fix a behavioral bug introduced in
  earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.

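The token-expansion step from the second lesson can be sketched as a small prompt builder for a question-aware collator. This is an illustration, not the code in `train_phase1.py`: `build_prompt` is a hypothetical helper, and only the `<OCR_WITH_REGION>` expansion is taken from this card. The other three strings are assumptions and should be checked against the processor's own prompt table before use.

```python
# Task token -> English prompt. Only the <OCR_WITH_REGION> entry is quoted
# from this card; the remaining entries are assumptions to verify.
TASK_PROMPTS = {
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
    "<OCR>": "What is the text in the image?",
    "<CAPTION>": "What does the image describe?",
    "<MORE_DETAILED_CAPTION>": "Describe with a paragraph what is shown in the image.",
}


def build_prompt(task_token: str, question: str = "") -> str:
    """Build the text input for one training sample.

    Without a question, the bare task token is returned so the processor can
    expand it itself. With a question, the token is expanded manually first,
    because the processor rejects a task token mixed with other text.
    """
    if task_token not in TASK_PROMPTS:
        raise KeyError(f"unknown task token: {task_token}")
    question = question.strip()
    if not question:
        return task_token
    return f"{TASK_PROMPTS[task_token]} {question}"
```

A collator built on this would call `build_prompt(task_token, sample_question)` for every sample from the first epoch onward, so the question always reaches the model.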
## Demo

Try it live: [Space](https://huggingface.co/spaces/d3p4rt/newtype-cognition-demo)

## License

MIT (inherited from base model).

## Citation

```bibtex
@misc{newtype-cognition,
  author       = {d3p4rt},
  title        = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
}
```