---
license: mit
language:
- en
pipeline_tag: image-text-to-text
tags:
- florence-2
- document-understanding
- ocr
- fine-tuned
- vision-language
base_model: microsoft/Florence-2-large
datasets:
- HuggingFaceM4/DocumentVQA
- nvidia/Nemotron-VLM-Dataset-v1
- HuggingFaceM4/FineVision
---

# Newtype Cognition
<p align="center">
  <img src="logo.png" alt="Newtype Cognition Logo" width="400"/>
</p>


## Florence-2 Document OCR Captioner (4-Phase Fine-tuned)

A 4-phase fine-tuned variant of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)
trained on document images (DocumentVQA, Nemotron, FineVision). The model performs
**document text extraction and document-flavored captioning** but does **not** function
as a true Visual Question Answering (VQA) model. See "Limitations" below.

## What this model actually does

Given a document image and a Florence-2 task token (`<OCR_WITH_REGION>`, `<CAPTION>`,
`<MORE_DETAILED_CAPTION>`, etc.), the model produces:

- The **dominant visible text** of the document (e.g., title, biggest number, masthead)
- A **descriptive caption** of the document layout
- **Extracted text** (OCR-style, without region coordinates)

It is best thought of as a *document-aware OCR captioner* — useful for indexing,
thumbnail descriptions, or as a starting checkpoint for further fine-tuning, **not**
as a question-answering system.

## Limitations (read this before using)

**The model is question-blind.** During Phase 1–4 training, the data collator fed only
the Florence-2 task token (e.g., `<OCR_WITH_REGION>`) to the model and dropped the user's
question. The model therefore learned a fixed `image → text` mapping, independent of
what the user asks. Concrete behavior on the same image with different questions:

| Question | Predicted answer |
|---|---|
| "What is the name of the university?" | `'2:10:48'` |
| "Where is the university located?" | `'2:10:48'` |
| "To whom is the document sent?" | `'Willow 155-8056'` |

A Phase-4-only retrain with a patched (question-aware) collator did **not** fix the
behavior, because Phase 1–3 had already saturated the question-blind mapping at higher
learning rates. A full Phase 1→4 retrain with the corrected collator would be required.

The collator fix lives in [`train_phase1.py`](https://github.com/Praxisyn/newtype_cognition/blob/main/train_phase1.py)
on the `main` branch; this checkpoint was trained before that fix took effect end-to-end.
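
A quick way to catch this failure early is to ask several different questions about one image and count how many distinct answers come back. The sketch below assumes a hypothetical `predict(image, question)` wrapper around model inference; neither that wrapper nor the function name is part of this repository. A low ratio is a warning sign rather than proof, so it is worth pairing with a correctness check.

```python
from typing import Callable, Sequence


def distinct_answer_ratio(
    predict: Callable[[object, str], str],
    image: object,
    questions: Sequence[str],
) -> float:
    """Return (number of distinct answers) / (number of questions).

    `predict` is a hypothetical inference wrapper that takes an image and a
    question string and returns the decoded answer. A ratio of 1/len(questions)
    means every question produced the same answer for this image.
    """
    if len(questions) < 2:
        raise ValueError("need at least two questions to compare answers")
    # Strip whitespace so trivial formatting differences do not count as
    # distinct answers.
    answers = {predict(image, q).strip() for q in questions}
    return len(answers) / len(questions)
```

Running this on a handful of images at the end of Phase 1 is cheap, and a ratio that stays near its minimum across images suggests the question is not reaching the model.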

**`<OCR_WITH_REGION>` does not produce region coordinates.** Despite its name, the token
outputs plain text identical to `<OCR>` — bounding box coordinates are not generated.
The fine-tuning process appears to have collapsed the two tokens to the same behavior,
likely because the region format (`<loc_N>` tokens) was not well represented in the
training data or was discarded by the collator.

**OCR extracts dominant text only.** Both `<OCR>` and `<OCR_WITH_REGION>` return the
most visually prominent text in the document (e.g., the largest number or title), not
a full transcription of all text regions. On template documents with placeholder data,
this may produce a single value such as `$30` or `$0.00`.

## Evaluation

Evaluated on a 50-sample slice of `HuggingFaceM4/DocumentVQA` validation:

| Metric | Value |
|---|---|
| Exact match | 10.00% |
| Token F1 (avg) | 13.67% |
| Answer-substring hits | 12.00% |

Most of the 10% exact match comes from samples where the expected answer is the dominant
visible text on the document (e.g., the company title is the answer to "What is the
company name?"). It is **not** evidence of question understanding.
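
For reference, the three metrics can be computed roughly as follows. This is a minimal sketch: the normalization (lowercasing, stripping punctuation) and the function names are assumptions, and the evaluation script actually used for the numbers above may differ in detail.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and reference are identical after normalization."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_hit(pred: str, gold: str) -> bool:
    """True when the normalized reference appears inside the prediction."""
    gold_norm = normalize(gold)
    return bool(gold_norm) and gold_norm in normalize(pred)
```

DocumentVQA lists several acceptable answers per question, so a scorer along these lines would typically take the best score over all references for each sample.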

## Token behavior and recommendations

Based on empirical testing, the main task tokens behave as follows:

| Token | Behavior | Recommended for |
|---|---|---|
| `<OCR>` | Extracts the single most visually dominant text | Quick salience extraction, document indexing |
| `<OCR_WITH_REGION>` | Identical output to `<OCR>` — region coordinates not generated | Avoid; use `<OCR>` directly |
| `<MORE_DETAILED_CAPTION>` | Produces a structured natural-language description of the document layout | Thumbnail descriptions, alt-text generation, layout understanding |

**`<MORE_DETAILED_CAPTION>` is the most informative token** for document understanding tasks.
On a standard service invoice it correctly identified the document type, described the
column structure, and extracted footer text — without being asked any question. Output is
capped by `max_new_tokens`; increase to 256 or higher to avoid truncation on dense documents.

Example output on a service invoice template (`<MORE_DETAILED_CAPTION>`):
```
The image is a service invoice template with hourly rate. It has a white background
and black text. The template is divided into two columns, with the left column containing
the company name, address, phone number, and email address. The right column contains
the total amount of the service invoice, which includes the total cost, subtotal,
discount, and other details. At the bottom of the template, there is a note that reads
"Thank you" and "Please make check payable to your company name."
```

## Use Cases

This model is best suited for tasks that do **not** require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:

**Document indexing and search**
Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.

**Alt-text and thumbnail description generation**
Automatically generating descriptions of document images for accessibility purposes or content management system previews.

**Visual salience detection**
Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.

**Hybrid OCR pipelines**
Using the model as a first stage to extract the dominant text or a layout caption, then passing that output to a separate reasoning model downstream.

**Fine-tuning checkpoint**
Starting a domain-specific fine-tune from this checkpoint rather than from vanilla `microsoft/Florence-2-large`, particularly for document-heavy domains.

## When Not to Use This Model

- **Document Question Answering (DocQA):** The model is question-blind and will ignore any natural language question you provide. Do not use it in any pipeline where the output must depend on what the user asks.
- **Conversational document assistants:** Chatbots, legal assistants, medical record reviewers, or any interactive system where a user expects answers grounded in a specific question.
- **Multi-document reasoning:** The model processes a single image and has no cross-document or contextual reasoning capability.
- **Production-critical extraction:** With 10% exact match on a 50-sample DocumentVQA validation slice, accuracy is not sufficient for any use case where extraction errors have significant consequences.

## Recommended use

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

MODEL_ID = "d3p4rt/newtype-cognition"

# Florence-2 ships custom modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="eager",
).cuda().eval()
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("document.jpg").convert("RGB")

# Use a Florence-2 task token on its own; do NOT pass arbitrary questions.
# <MORE_DETAILED_CAPTION> is the most informative token; use <OCR> for the
# dominant text. <OCR_WITH_REGION> returns the same output as <OCR>.
task = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt")
# Move tensors to the GPU and cast float inputs (pixel values) to float16.
inputs = {k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
          for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,  # lower values can truncate captions of dense documents
        num_beams=3,
        do_sample=False,
        early_stopping=True,
    )

print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Training Details

- **Base model**: `microsoft/Florence-2-large`
- **Hardware**: NVIDIA RTX 4090 (24 GB) on Vast.ai
- **Precision**: bfloat16
- **Optimizer**: `paged_adamw_8bit`
- **Memory tricks**: gradient checkpointing, `expandable_segments` allocator
- **Phase 1 (warm-up)**: 5 epochs, full fine-tune, lr=2e-5
- **Phase 2 (specialization)**: 3 LoRA adapters (DocVQA, Nemotron, FineVision)
- **Phase 3 (merge)**: weighted merge biased toward DocVQA (0.90)
- **Phase 4 (polish)**: 2 epochs full fine-tune, lr=1e-6
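
To illustrate what a weighted merge such as Phase 3 amounts to, here is a minimal sketch that linearly combines per-adapter weight deltas. Only the 0.90 weight for DocVQA comes from the list above; the 0.05/0.05 split for the other two adapters, the flat-list representation, and the function name are assumptions made for illustration, not the merge code used for this checkpoint.

```python
from typing import Dict, List

# parameter name -> flattened weight delta contributed by one adapter
Delta = Dict[str, List[float]]


def weighted_merge(deltas: Dict[str, Delta], weights: Dict[str, float]) -> Delta:
    """Linearly combine adapter deltas: merged = sum_i weight_i * delta_i."""
    if set(deltas) != set(weights):
        raise ValueError("every adapter needs exactly one weight")
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights are expected to sum to 1")
    merged: Delta = {}
    for adapter, delta in deltas.items():
        w = weights[adapter]
        for name, values in delta.items():
            # Start each parameter at zero, then accumulate weighted deltas.
            acc = merged.setdefault(name, [0.0] * len(values))
            for i, v in enumerate(values):
                acc[i] += w * v
    return merged


# Hypothetical weights: 0.90 is from the training details; the rest is assumed.
EXAMPLE_WEIGHTS = {"docvqa": 0.90, "nemotron": 0.05, "finevision": 0.05}
```

In practice the deltas would be tensors rather than lists, and libraries such as PEFT provide `add_weighted_adapter` for combining LoRA adapters; the card does not say which approach was used here.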

## Datasets

- `HuggingFaceM4/DocumentVQA` (Phase 1, 2)
- `nvidia/Nemotron-VLM-Dataset-v1` (Phase 1, 2)
- `HuggingFaceM4/FineVision` (Phase 1, 2)
- `ChartGen` (Phase 1)

All datasets streamed; no full local copies retained.

## Lessons learned (for future fine-tuners)

- **Always include the conditioning input (question) in your data collator from the
  first epoch**, especially when using a custom collator that builds the model input
  text from multiple fields.
- Florence-2's processor enforces that special task tokens (`<OCR_WITH_REGION>`,
  `<CAPTION>`, etc.) are *the only content* in the input text. To inject extra text,
  manually expand the token to its English prompt (e.g., `<OCR_WITH_REGION>` →
  `"What is the text in the image, with regions?"`) before concatenating user text.
- A late-stage low-lr "polish" phase **cannot** fix a behavioral bug introduced in
  earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.
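
The token-expansion step from the second point can be sketched as a small prompt builder for a question-aware collator. The `<OCR_WITH_REGION>` prompt is the one quoted above; the other two entries, the function name, and the assumption that the processor passes plain text without a task token through unchanged should all be verified against the processor shipped with the checkpoint.

```python
from typing import Optional

# Task token -> English prompt. Only the <OCR_WITH_REGION> entry is taken
# from the text above; the other entries are assumptions to verify.
TASK_PROMPTS = {
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
    "<OCR>": "What is the text in the image?",
    "<CAPTION>": "What does the image describe?",
}


def build_prompt(task_token: str, question: Optional[str] = None) -> str:
    """Build the input text for one training or inference sample.

    Without a question, the bare task token is returned so the processor can
    expand it itself. With a question, the token is expanded manually and the
    question is appended, because the processor rejects a task token that is
    mixed with other text.
    """
    if not question or not question.strip():
        return task_token
    if task_token not in TASK_PROMPTS:
        raise ValueError(f"no English prompt known for {task_token!r}")
    return f"{TASK_PROMPTS[task_token]} {question.strip()}"
```

A collator would call `build_prompt` once per sample, so the question reaches the model from the first epoch instead of being dropped.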

## Demo

Try it live: [Space](https://huggingface.co/spaces/d3p4rt/newtype-cognition-demo)

## License

MIT (inherited from base model).

## Citation

```bibtex
@misc{newtype-cognition,
  author       = {d3p4rt},
  title        = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
}
```