license: mit
language:
- en
pipeline_tag: image-text-to-text
tags:
- florence-2
- document-understanding
- ocr
- fine-tuned
- vision-language
base_model: microsoft/Florence-2-large
datasets:
- HuggingFaceM4/DocumentVQA
- nvidia/Nemotron-VLM-Dataset-v1
- HuggingFaceM4/FineVision
Newtype Cognition
Florence-2 Document OCR Captioner (4-Phase Fine-tuned)
A 4-phase fine-tuned variant of microsoft/Florence-2-large trained on document images (DocumentVQA, Nemotron, FineVision). The model performs document text extraction and document-flavored captioning but does not function as a true Visual Question Answering (VQA) model. See "Limitations" below.
What this model actually does
Given a document image and a Florence-2 task token (<OCR_WITH_REGION>, <CAPTION>,
<MORE_DETAILED_CAPTION>, etc.), the model produces:
- The dominant visible text of the document (e.g., title, biggest number, masthead)
- A descriptive caption of the document layout
- Extracted text regions (OCR-style)
It is best thought of as a document-aware OCR captioner: useful for indexing, thumbnail descriptions, or as a starting checkpoint for further fine-tuning, not as a question-answering system.
Limitations (read this before using)
The model is question-blind. During Phases 1-4 of training, the data collator fed only
the Florence-2 task token (e.g., <OCR_WITH_REGION>) to the model and dropped the user's
question. The model therefore learned a fixed image-to-text mapping, independent of
what the user asks. Concrete behavior on the same image with different questions:
| Question | Predicted answer |
|---|---|
| "What is the name of the university?" | '2:10:48' |
| "Where is the university located?" | '2:10:48' |
| "To whom is the document sent?" | 'Willow 155-8056' |
A Phase-4-only retrain with a patched (question-aware) collator did not fix the behavior, because Phases 1-3 had already saturated the question-blind mapping at higher learning rates. A full Phase 1-4 retrain with the corrected collator would be required.
The collator fix lives in train_phase1.py
on the main branch; this checkpoint was trained before that fix took effect end-to-end.
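One way to catch this failure early is to send several different questions about the same image through the model and measure how much the answers vary. The sketch below assumes a hypothetical `predict(image, question)` callable wrapping your own inference code; that name, `distinct_answer_ratio`, and the stub predictor are illustrative and not part of this repository.

```python
from typing import Any, Callable, Sequence


def distinct_answer_ratio(
    predict: Callable[[Any, str], str],
    image: Any,
    questions: Sequence[str],
) -> float:
    """Share of distinct answers across questions asked about one image.

    `predict` is a hypothetical wrapper around your inference code.
    A value near 1/len(questions) suggests the question is being ignored.
    """
    if not questions:
        raise ValueError("need at least one question")
    answers = {predict(image, q).strip() for q in questions}
    return len(answers) / len(questions)


# Stub that ignores the question, mimicking the behavior described above.
def blind_stub(image: Any, question: str) -> str:
    return "2:10:48"


questions = [
    "What is the name of the university?",
    "Where is the university located?",
]
ratio = distinct_answer_ratio(blind_stub, None, questions)
print(f"distinct answer ratio: {ratio:.2f}")
```

Identical answers on a single image can be legitimate, so the ratio is more informative when averaged over many images with deliberately unrelated questions.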
<OCR_WITH_REGION> does not produce region coordinates. Despite its name, the token
outputs plain text identical to <OCR>; bounding box coordinates are not generated.
The fine-tuning process appears to have collapsed the two tokens to the same behavior,
likely because the region format (<loc_N> tokens) was not well represented in the
training data or was discarded by the collator.
OCR extracts dominant text only. Both <OCR> and <OCR_WITH_REGION> return the
most visually prominent text in the document (e.g., the largest number or title), not
a full transcription of all text regions. On template documents with placeholder data,
this may produce a single value such as $30 or $0.00.
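To check whether a given output contains any region coordinates, you can count `<loc_N>` tokens in the raw decoded string. This is a minimal sketch: it assumes the string was decoded with `skip_special_tokens=False` (otherwise location tokens may already have been stripped), and the helper name `count_location_tokens` is made up for illustration.

```python
import re

# Assumption: location tokens are written as <loc_N> with an integer N,
# matching the format mentioned in the text above.
LOC_TOKEN = re.compile(r"<loc_(\d+)>")


def count_location_tokens(raw_text: str) -> int:
    """Count <loc_N> tokens in a decoded string (special tokens kept)."""
    return len(LOC_TOKEN.findall(raw_text))


# Hand-written example strings, not real model output.
with_regions = "Invoice<loc_12><loc_34><loc_56><loc_78>"
plain_text = "Invoice"

print(count_location_tokens(with_regions))
print(count_location_tokens(plain_text))
```

A count of zero on `<OCR_WITH_REGION>` output is consistent with the collapse described above.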
Evaluation
Evaluated on a 50-sample slice of HuggingFaceM4/DocumentVQA validation:
| Metric | Value |
|---|---|
| Exact match | 10.00% |
| Token F1 (avg) | 13.67% |
| Answer-substring hits | 12.00% |
Most of the 10% exact match comes from samples where the expected answer is the dominant visible text on the document (e.g., the company title is the answer to "What is the company name?"). It is not evidence of question understanding.
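For readers who want to reproduce comparable numbers, the three metrics can be sketched as follows. The exact normalization used for the table above is not documented here, so this assumes SQuAD-style normalization (lowercasing, punctuation removal, whitespace collapsing) and a single gold answer per sample; results may differ from the reported values.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(pred).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_hit(pred: str, gold: str) -> bool:
    """True if the normalized gold answer appears inside the prediction."""
    gold_norm = normalize(gold)
    return bool(gold_norm) and gold_norm in normalize(pred)


print(exact_match("ITC Limited", "itc limited."))
print(round(token_f1("itc limited india", "itc limited"), 2))
print(substring_hit("The ITC Limited report", "ITC Limited"))
```

Averaging each function over the evaluation slice gives the per-metric percentages.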
Token behavior and recommendations
Based on empirical testing, the three most useful task tokens behave as follows:
| Token | Behavior | Recommended for |
|---|---|---|
| <OCR> | Extracts the single most visually dominant text | Quick salience extraction, document indexing |
| <OCR_WITH_REGION> | Identical output to <OCR>; region coordinates not generated | Avoid; use <OCR> directly |
| <MORE_DETAILED_CAPTION> | Produces a structured natural-language description of the document layout | Thumbnail descriptions, alt-text generation, layout understanding |
<MORE_DETAILED_CAPTION> is the most informative token for document understanding tasks.
On a standard service invoice it correctly identified the document type, described the
column structure, and extracted footer text, all without being asked any question. Output is
capped by max_new_tokens; increase to 256 or higher to avoid truncation on dense documents.
Example output on a service invoice template (<MORE_DETAILED_CAPTION>):
The image is a service invoice template with hourly rate. It has a white background
and black text. The template is divided into two columns, with the left column containing
the company name, address, phone number, and email address. The right column contains
the total amount of the service invoice, which includes the total cost, subtotal,
discount, and other details. At the bottom of the template, there is a note that reads
"Thank you" and "Please make check payable to your company name."
Use Cases
This model is best suited for tasks that do not require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:
- Document indexing and search: Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.
- Alt-text and thumbnail description generation: Automatically generating descriptions of document images for accessibility purposes or content management system previews.
- Visual salience detection: Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.
- Hybrid OCR pipelines: Using the model as a first stage to extract the most prominent text, then passing that text to a separate reasoning model downstream.
- Fine-tuning checkpoint: Starting a domain-specific fine-tune from this checkpoint rather than from vanilla microsoft/Florence-2-large, particularly for document-heavy domains.
When Not to Use This Model
- Document Question Answering (DocQA): The model is question-blind and will ignore any natural language question you provide. Do not use it in any pipeline where the output must depend on what the user asks.
- Conversational document assistants: Chatbots, legal assistants, medical record reviewers, or any interactive system where a user expects answers grounded in a specific question.
- Multi-document reasoning: The model processes a single image and has no cross-document or contextual reasoning capability.
- Production-critical extraction: With 10% exact match on DocumentVQA, accuracy is not sufficient for any use case where extraction errors have significant consequences.
Recommended use
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

# Load the fine-tuned checkpoint in half precision on the GPU.
model = AutoModelForCausalLM.from_pretrained(
    "d3p4rt/newtype-cognition",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="eager",
).cuda().eval()
processor = AutoProcessor.from_pretrained(
    "d3p4rt/newtype-cognition",
    trust_remote_code=True,
)

image = Image.open("document.jpg").convert("RGB")

# Use Florence-2 task tokens only; do NOT pass arbitrary questions.
# On this checkpoint <OCR_WITH_REGION> returns the same text as <OCR>; swap in
# <MORE_DETAILED_CAPTION> (with a larger max_new_tokens) for layout descriptions.
inputs = processor(text="<OCR_WITH_REGION>", images=image, return_tensors="pt")

# Move tensors to the GPU, casting float32 tensors (pixel values) to float16.
inputs = {
    k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
    for k, v in inputs.items()
}

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=3,
        do_sample=False,
        early_stopping=True,
    )

print(processor.batch_decode(out, skip_special_tokens=True)[0])
Training Details
- Base model: microsoft/Florence-2-large
- Hardware: NVIDIA RTX 4090 (24 GB) on Vast.ai
- Precision: bfloat16
- Optimizer: paged_adamw_8bit
- Memory tricks: gradient checkpointing, expandable_segments allocator
- Phase 1 (warm-up): 5 epochs, full fine-tune, lr=2e-5
- Phase 2 (specialization): 3 LoRA adapters (DocVQA, Nemotron, FineVision)
- Phase 3 (merge): weighted merge biased toward DocVQA (0.90)
- Phase 4 (polish): 2 epochs full fine-tune, lr=1e-6
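The Phase 3 merge can be pictured as a linear combination of adapter weights. The sketch below is only an illustration of that idea in plain Python; it is not the training code used for this checkpoint. Only the 0.90 DocVQA coefficient comes from the list above. The 0.05 coefficients for the other two adapters, the parameter name `layer.w`, and the toy values are hypothetical placeholders.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def weighted_merge(adapters: Dict[str, Weights], coeffs: Dict[str, float]) -> Weights:
    """Linearly combine same-shaped adapter weights using normalized coefficients."""
    if set(adapters) != set(coeffs):
        raise ValueError("adapters and coeffs must have the same keys")
    total = sum(coeffs.values())
    names = list(adapters)
    reference = adapters[names[0]]
    merged: Weights = {}
    for param, values in reference.items():
        merged[param] = [
            sum(coeffs[n] / total * adapters[n][param][i] for n in names)
            for i in range(len(values))
        ]
    return merged


# Toy adapters with a single two-element parameter each (hypothetical values).
adapters = {
    "docvqa": {"layer.w": [1.0, 0.0]},
    "nemotron": {"layer.w": [0.0, 1.0]},
    "finevision": {"layer.w": [0.0, 0.0]},
}
# 0.90 is from the training details; the two 0.05 values are assumed.
coeffs = {"docvqa": 0.90, "nemotron": 0.05, "finevision": 0.05}

merged = weighted_merge(adapters, coeffs)
print(merged)
```

With a coefficient this skewed, the merged weights stay very close to the DocVQA adapter, which fits the "biased toward DocVQA" description.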
Datasets
- HuggingFaceM4/DocumentVQA (Phase 1, 2)
- nvidia/Nemotron-VLM-Dataset-v1 (Phase 1, 2)
- HuggingFaceM4/FineVision (Phase 1, 2)
- ChartGen (Phase 1)
All datasets streamed; no full local copies retained.
Lessons learned (for future fine-tuners)
- Always include the conditioning input (question) in your data collator from the first epoch, especially when using a custom collator that builds the model input text from multiple fields.
- Florence-2's processor enforces that special task tokens (<OCR_WITH_REGION>, <CAPTION>, etc.) are the only content in the input text. To inject extra text, manually expand the token to its English prompt (e.g., <OCR_WITH_REGION> → "What is the text in the image, with regions?") before concatenating user text.
- A late-stage low-lr "polish" phase cannot fix a behavioral bug introduced in earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.
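The first two lessons can be sketched as a small text-building step for a question-aware collator. This is a minimal illustration, not the contents of train_phase1.py: the function names and the `task` / `question` field names are hypothetical, and only the `<OCR_WITH_REGION>` prompt string is taken from the text above. Prompts for other task tokens should be copied from the processor's own mapping instead of guessed.

```python
from typing import Dict, List

# Only this entry is taken from the text above; add others from the
# processor's task-prompt mapping as needed.
TASK_PROMPTS: Dict[str, str] = {
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
}


def build_prompt(task_token: str, question: str = "") -> str:
    """Expand a task token to its English prompt and append the user question."""
    base = TASK_PROMPTS[task_token]
    question = question.strip()
    return f"{base} {question}" if question else base


def collate_texts(batch: List[Dict[str, str]]) -> List[str]:
    """Build one input string per example, keeping the question in the text.

    The "task" and "question" keys are hypothetical field names.
    """
    return [build_prompt(ex["task"], ex.get("question", "")) for ex in batch]


batch = [
    {"task": "<OCR_WITH_REGION>", "question": "What is the company name?"},
    {"task": "<OCR_WITH_REGION>"},
]
for text in collate_texts(batch):
    print(text)
```

The resulting strings would then be passed to the processor in place of the bare task token, so the question reaches the model from the first epoch onward.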
Demo
Try it live: Space
License
MIT (inherited from base model).
Citation
@misc{newtype-cognition,
author = {d3p4rt},
title = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
year = {2026},
howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
}