license: mit
language:
  - en
pipeline_tag: image-text-to-text
tags:
  - florence-2
  - document-understanding
  - ocr
  - fine-tuned
  - vision-language
base_model: microsoft/Florence-2-large
datasets:
  - HuggingFaceM4/DocumentVQA
  - nvidia/Nemotron-VLM-Dataset-v1
  - HuggingFaceM4/FineVision

Newtype Cognition

Florence-2 Document OCR Captioner (4-Phase Fine-tuned)

A 4-phase fine-tuned variant of microsoft/Florence-2-large trained on document images (DocumentVQA, Nemotron, FineVision). The model performs document text extraction and document-flavored captioning but does not function as a true Visual Question Answering (VQA) model. See "Limitations" below.

What this model actually does

Given a document image and a Florence-2 task token (<OCR_WITH_REGION>, <CAPTION>, <MORE_DETAILED_CAPTION>, etc.), the model produces:

The dominant visible text of the document (e.g., title, biggest number, masthead)
A descriptive caption of the document layout
OCR-style extracted text, without region coordinates (see "Limitations" below)

It is best thought of as a document-aware OCR captioner — useful for indexing, thumbnail descriptions, or as a starting checkpoint for further fine-tuning, not as a question-answering system.

Limitations (read this before using)

The model is question-blind. During Phase 1–4 training, the data collator fed only the Florence-2 task token (e.g., <OCR_WITH_REGION>) to the model and dropped the user's question. The model therefore learned a fixed image → text mapping, independent of what the user asks. Concrete behavior on the same image with different questions:

| Question | Predicted answer |
| --- | --- |
| "What is the name of the university?" | `'2:10:48'` |
| "Where is the university located?" | `'2:10:48'` |
| "To whom is the document sent?" | `'Willow 155-8056'` |

A Phase-4-only retrain with a patched (question-aware) collator did not fix the behavior, because Phase 1–3 had already saturated the question-blind mapping at higher learning rates. A full Phase 1→4 retrain with the corrected collator would be required.

The collator fix lives in train_phase1.py on the main branch; this checkpoint was trained before that fix took effect end-to-end.
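The question-blind behavior described above can be checked with a small harness before committing to a long training run. The sketch below is illustrative only: `question_sensitivity` and `blind_predict` are hypothetical names, and `predict` stands in for whatever function wraps your model's inference on one fixed image.

```python
from typing import Callable, Dict, List


def question_sensitivity(
    predict: Callable[[str], str], questions: List[str]
) -> Dict[str, object]:
    """Run `predict` on each question for one fixed image and summarize variety."""
    answers = [predict(q) for q in questions]
    distinct = sorted(set(answers))
    return {
        "answers": answers,
        "distinct": distinct,
        # A question-blind model tends to return few distinct answers.
        "distinct_ratio": len(distinct) / len(answers) if answers else 0.0,
    }


# Hypothetical stand-in predictor that ignores its input,
# mimicking the question-blind behavior shown in the table.
def blind_predict(question: str) -> str:
    return "2:10:48"


report = question_sensitivity(
    blind_predict,
    ["What is the name of the university?", "Where is the university located?"],
)
print(report["distinct_ratio"])  # 0.5: two questions, one distinct answer
```

A low ratio is a hint rather than proof, since unrelated questions can legitimately share an answer. It is most useful when the questions are chosen so that their correct answers must differ.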

<OCR_WITH_REGION> does not produce region coordinates. Despite its name, the token outputs plain text identical to <OCR> — bounding box coordinates are not generated. The fine-tuning process appears to have collapsed the two tokens to the same behavior, likely because the region format (<loc_N> tokens) was not well represented in the training data or was discarded by the collator.

OCR extracts dominant text only. Both <OCR> and <OCR_WITH_REGION> return the most visually prominent text in the document (e.g., the largest number or title), not a full transcription of all text regions. On template documents with placeholder data, this may produce a single value such as $30 or $0.00.
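To verify on your own documents whether an output carries region coordinates, as discussed for <OCR_WITH_REGION> above, one option is to count <loc_N> tokens in the raw decoded string. This is a minimal sketch: the function name and sample strings are made up for illustration, and it assumes the output was decoded with skip_special_tokens=False so that any location tokens are not stripped.

```python
import re

# Florence-2 writes box coordinates as quantized location tokens such as <loc_123>.
LOC_TOKEN = re.compile(r"<loc_(\d+)>")


def count_region_tokens(raw_output: str) -> int:
    """Count <loc_N> tokens in a raw decoded string."""
    return len(LOC_TOKEN.findall(raw_output))


# Made-up sample strings, not real model outputs.
with_regions = "Invoice<loc_10><loc_20><loc_300><loc_40>"
plain_text = "Invoice"

print(count_region_tokens(with_regions))  # 4
print(count_region_tokens(plain_text))    # 0
```

A count of zero across a varied set of documents is consistent with the collapse described above.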

Evaluation

Evaluated on a 50-sample slice of HuggingFaceM4/DocumentVQA validation:

| Metric | Value |
| --- | --- |
| Exact match | 10.00% |
| Token F1 (avg) | 13.67% |
| Answer-substring hits | 12.00% |

Most of the 10% exact match comes from samples where the expected answer is the dominant visible text on the document (e.g., the company title is the answer to "What is the company name?"). It is not evidence of question understanding.
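For readers who want to reproduce metrics of this kind, the sketch below shows one common way to compute them. The exact normalization and matching rules used for the numbers above are not documented here, so the lowercase-and-strip-punctuation step and the direction of the substring test (gold answer inside the prediction) are assumptions.

```python
import re
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace (assumed)."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and gold answer agree after normalization."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def substring_hit(pred: str, gold: str) -> bool:
    """True when the normalized gold answer appears inside the prediction."""
    g = " ".join(normalize(gold))
    return bool(g) and g in " ".join(normalize(pred))


print(exact_match("ACME Corp.", "acme corp"))                   # True
print(round(token_f1("acme corp invoice", "acme corp"), 2))     # 0.8
print(substring_hit("acme corp invoice", "acme corp"))          # True
```

Averaging `token_f1` over all samples gives a figure comparable in spirit to the "Token F1 (avg)" row, though not necessarily identical to it.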

Token behavior and recommendations

Based on empirical testing, the three main task tokens behave as follows:

| Token | Behavior | Recommended for |
| --- | --- | --- |
| `<OCR>` | Extracts the single most visually dominant text | Quick salience extraction, document indexing |
| `<OCR_WITH_REGION>` | Identical output to `<OCR>`; region coordinates are not generated | Avoid; use `<OCR>` directly |
| `<MORE_DETAILED_CAPTION>` | Produces a structured natural-language description of the document layout | Thumbnail descriptions, alt-text generation, layout understanding |

<MORE_DETAILED_CAPTION> is the most informative token for document understanding tasks. On a standard service invoice it correctly identified the document type, described the column structure, and extracted footer text — without being asked any question. Output is capped by max_new_tokens; increase to 256 or higher to avoid truncation on dense documents.

Example output on a service invoice template (<MORE_DETAILED_CAPTION>):

The image is a service invoice template with hourly rate. It has a white background
and black text. The template is divided into two columns, with the left column containing
the company name, address, phone number, and email address. The right column contains
the total amount of the service invoice, which includes the total cost, subtotal,
discount, and other details. At the bottom of the template, there is a note that reads
"Thank you" and "Please make check payable to your company name."

Use Cases

This model is best suited for tasks that do not require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:

Document indexing and search: Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.

Alt-text and thumbnail description generation: Automatically generating descriptions of document images for accessibility purposes or content management system previews.

Visual salience detection: Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.

Hybrid OCR pipelines: Using the model as a first stage to extract the dominant text and a layout description, then passing that output to a separate reasoning model downstream.

Fine-tuning checkpoint: Starting a domain-specific fine-tune from this checkpoint rather than from vanilla microsoft/Florence-2-large, particularly for document-heavy domains.
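As an illustration of the document indexing use case, the sketch below builds a tiny inverted index over text that has already been extracted. The file names and extracted strings are hypothetical placeholders, not real model outputs, and a production archive would more likely use a dedicated search engine.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(extracted: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase token to the ids of documents whose text contains it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in extracted.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return sorted ids of documents that contain every query token."""
    tokens = query.lower().split()
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)


# Hypothetical extracted text keyed by file name.
extracted = {
    "scan_001.jpg": "Service Invoice",
    "scan_002.jpg": "Rental Agreement",
}
index = build_index(extracted)
print(search(index, "invoice"))  # ['scan_001.jpg']
```

Because the model returns only the most prominent text, an index built this way covers titles and headline values rather than full document contents.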

When Not to Use This Model

Document Question Answering (DocQA): The model is question-blind and will ignore any natural language question you provide. Do not use it in any pipeline where the output must depend on what the user asks.
Conversational document assistants: Chatbots, legal assistants, medical record reviewers, or any interactive system where a user expects answers grounded in a specific question.
Multi-document reasoning: The model processes a single image and has no cross-document or contextual reasoning capability.
Production-critical extraction: With 10% exact match on a 50-sample DocumentVQA validation slice, accuracy is not sufficient for any use case where extraction errors have significant consequences.

Recommended use

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model = AutoModelForCausalLM.from_pretrained(
    "d3p4rt/newtype-cognition",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="eager",
).cuda().eval()
processor = AutoProcessor.from_pretrained(
    "d3p4rt/newtype-cognition",
    trust_remote_code=True,
)

image = Image.open("document.jpg").convert("RGB")

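# Use a Florence-2 task token as the entire prompt; do NOT pass arbitrary questions.
# <MORE_DETAILED_CAPTION> is the most informative token; use "<OCR>" for dominant text only.
task = "<MORE_DETAILED_CAPTION>"
inputs = processor(text=task, images=image, return_tensors="pt")
# Cast float tensors (pixel values) to fp16 to match the model; keep token ids as integers.
inputs = {k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
          for k, v in inputs.items()}

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,  # 128 can truncate captions of dense documents
        num_beams=3,
        do_sample=False,
        early_stopping=True,
    )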

print(processor.batch_decode(out, skip_special_tokens=True)[0])

Training Details

Base model: microsoft/Florence-2-large
Hardware: NVIDIA RTX 4090 (24 GB) on Vast.ai
Precision: bfloat16
Optimizer: paged_adamw_8bit
Memory tricks: gradient checkpointing, expandable_segments allocator
Phase 1 (warm-up): 5 epochs, full fine-tune, lr=2e-5
Phase 2 (specialization): 3 LoRA adapters (DocVQA, Nemotron, FineVision)
Phase 3 (merge): weighted merge biased toward DocVQA (0.90)
Phase 4 (polish): 2 epochs, full fine-tune, lr=1e-6
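The Phase 3 merge method is not specified beyond the 0.90 weight on DocVQA. As one plausible reading, the sketch below shows a linear weighted combination of per-parameter deltas. The 0.05 weights for the other two adapters, the `weighted_merge` helper, and the toy values are all assumptions for illustration; a real merge would operate on adapter tensors rather than Python lists.

```python
from typing import Dict, List

Delta = Dict[str, List[float]]


def weighted_merge(deltas: Dict[str, Delta], weights: Dict[str, float]) -> Delta:
    """Linearly combine per-parameter deltas, one weight per adapter."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    names = list(deltas)
    merged: Delta = {}
    for param, reference in deltas[names[0]].items():
        merged[param] = [
            sum(weights[n] * deltas[n][param][i] for n in names)
            for i in range(len(reference))
        ]
    return merged


# Toy deltas for a single two-element parameter "w" (hypothetical values).
deltas = {
    "docvqa": {"w": [1.0, 0.0]},
    "nemotron": {"w": [0.0, 1.0]},
    "finevision": {"w": [0.0, 1.0]},
}
# 0.90 comes from the list above; the two 0.05 values are assumed.
weights = {"docvqa": 0.90, "nemotron": 0.05, "finevision": 0.05}

merged = weighted_merge(deltas, weights)
print([round(x, 3) for x in merged["w"]])  # [0.9, 0.1]
```

With a weight this skewed, the merged model stays close to the DocVQA adapter, which matches the description of the merge as biased toward DocVQA.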

Datasets

HuggingFaceM4/DocumentVQA (Phase 1, 2)
nvidia/Nemotron-VLM-Dataset-v1 (Phase 1, 2)
HuggingFaceM4/FineVision (Phase 1, 2)
ChartGen (Phase 1)

All datasets streamed; no full local copies retained.

Lessons learned (for future fine-tuners)

Always include the conditioning input (question) in your data collator from the first epoch, especially when using a custom collator that builds the model input text from multiple fields.
Florence-2's processor enforces that special task tokens (<OCR_WITH_REGION>, <CAPTION>, etc.) are the only content in the input text. To inject extra text, manually expand the token to its English prompt (e.g., <OCR_WITH_REGION> → "What is the text in the image, with regions?") before concatenating user text.
A late-stage low-lr "polish" phase cannot fix a behavioral bug introduced in earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.
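The token-expansion approach from the second lesson could look roughly like the sketch below. It is not the code from train_phase1.py: `build_prompt` and `TASK_PROMPTS` are hypothetical names, only the <OCR_WITH_REGION> expansion is quoted from the text above, and joining prompt and question with a single space is an assumption.

```python
from typing import Optional

# Task token -> English prompt. The <OCR_WITH_REGION> entry is quoted from the
# lesson above; the other entries are assumptions and should be checked against
# the Florence-2 processor source before use.
TASK_PROMPTS = {
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
    "<OCR>": "What is the text in the image?",
    "<CAPTION>": "What does the image describe?",
}


def build_prompt(task_token: str, question: Optional[str] = None) -> str:
    """Expand the task token to plain English, then append the user question."""
    if task_token not in TASK_PROMPTS:
        raise KeyError(f"no expansion known for {task_token}")
    prompt = TASK_PROMPTS[task_token]
    question = (question or "").strip()
    return f"{prompt} {question}" if question else prompt


print(build_prompt("<OCR_WITH_REGION>", "What is the name of the university?"))
# What is the text in the image, with regions? What is the name of the university?
```

Calling a helper like this inside the collator from the first epoch keeps the question in the model input, which is the point of the first lesson. Comparing outputs for two different questions on the same image after Phase 1 then serves as the sanity check recommended in the third.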

Demo

Try it live: Space

License

MIT (inherited from base model).

Citation

@misc{newtype-cognition,
  author       = {d3p4rt},
  title        = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
}