	---
	license: mit
	language:
	- en
	pipeline_tag: image-text-to-text
	tags:
	- florence-2
	- document-understanding
	- ocr
	- fine-tuned
	- vision-language
	base_model: microsoft/Florence-2-large
	datasets:
	- HuggingFaceM4/DocumentVQA
	- nvidia/Nemotron-VLM-Dataset-v1
	- HuggingFaceM4/FineVision
	---

	# Newtype Cognition
	<p align="center">
	<img src="logo.png" alt="Newtype Cognition Logo" width="400"/>
	</p>


	## Florence-2 Document OCR Captioner (4-Phase Fine-tuned)

	A 4-phase fine-tuned variant of [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large)
	trained on document images (DocumentVQA, Nemotron, FineVision). The model performs
	document text extraction and document-flavored captioning but does not function
	as a true Visual Question Answering (VQA) model. See "Limitations" below.

	## What this model actually does

	Given a document image and a Florence-2 task token (`<OCR_WITH_REGION>`, `<CAPTION>`,
	`<MORE_DETAILED_CAPTION>`, etc.), the model produces:

	- The dominant visible text of the document (e.g., title, biggest number, masthead)
	- A descriptive caption of the document layout
	- Extracted text (OCR-style, plain text without region coordinates)

	It is best thought of as a document-aware OCR captioner — useful for indexing,
	thumbnail descriptions, or as a starting checkpoint for further fine-tuning, not
	as a question-answering system.

	## Limitations (read this before using)

	The model is question-blind. During Phase 1–4 training, the data collator fed only
	the Florence-2 task token (e.g., `<OCR_WITH_REGION>`) to the model and dropped the user's
	question. The model therefore learned a fixed `image → text` mapping, independent of
	what the user asks. Concrete behavior on the same image with different questions:

	| Question | Predicted answer |
	|---|---|
	| "What is the name of the university?" | `'2:10:48'` |
	| "Where is the university located?" | `'2:10:48'` |
	| "To whom is the document sent?" | `'Willow 155-8056'` |
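A quick way to check for this behavior is to send several different questions about one image and measure how often the answer repeats. The sketch below assumes a hypothetical `predict(image_id, question)` callable that wraps your own inference code; `blind_predict` is a stand-in used only for illustration and is not part of this repository.

```python
from collections import Counter
from typing import Callable, Sequence


def question_blindness_rate(
    predict: Callable[[str, str], str],
    image_id: str,
    questions: Sequence[str],
) -> float:
    """Return the share of questions that receive the most common answer.

    1.0 means every question produced the same output (fully question-blind
    on this image); lower values mean the answers vary with the question.
    """
    if not questions:
        raise ValueError("need at least one question")
    answers = [predict(image_id, q).strip() for q in questions]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


# Hypothetical stand-in for real inference: it ignores the question entirely.
def blind_predict(image_id: str, question: str) -> str:
    return "2:10:48"


questions = [
    "What is the name of the university?",
    "Where is the university located?",
]
rate = question_blindness_rate(blind_predict, "sample.png", questions)
print(rate)  # 1.0
```

A rate close to 1.0 across many images, with questions whose answers should differ, would point to a model that is ignoring the question.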

	A Phase-4-only retrain with a patched (question-aware) collator did not fix the
	behavior, because Phase 1–3 had already saturated the question-blind mapping at higher
	learning rates. A full Phase 1→4 retrain with the corrected collator would be required.

	The collator fix lives in [`train_phase1.py`](https://github.com/Praxisyn/newtype_cognition/blob/main/train_phase1.py)
	on the `main` branch; this checkpoint was trained before that fix took effect end-to-end.

	`<OCR_WITH_REGION>` does not produce region coordinates. Despite its name, the token
	outputs plain text identical to `<OCR>` — bounding box coordinates are not generated.
	The fine-tuning process appears to have collapsed the two tokens to the same behavior,
	likely because the region format (`<loc_N>` tokens) was not well represented in the
	training data or was discarded by the collator.

	OCR extracts dominant text only. Both `<OCR>` and `<OCR_WITH_REGION>` return the
	most visually prominent text in the document (e.g., the largest number or title), not
	a full transcription of all text regions. On template documents with placeholder data,
	this may produce a single value such as `$30` or `$0.00`.

	## Evaluation

	Evaluated on a 50-sample slice of the `HuggingFaceM4/DocumentVQA` validation split:

	| Metric | Value |
	|---|---|
	| Exact match | 10.00% |
	| Token F1 (avg) | 13.67% |
	| Answer-substring hits | 12.00% |

	Most of the 10% exact match comes from samples where the expected answer is the dominant
	visible text on the document (e.g., the company title is the answer to "What is the
	company name?"). It is not evidence of question understanding.
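For readers who want to reproduce comparable numbers, the three metrics can be sketched as follows. This card does not state the exact normalization used for the reported values, so the lowercase-and-whitespace tokenization here is an assumption, and the function names are illustrative.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (assumed normalization)."""
    return text.lower().split()


def exact_match(pred: str, gold: str) -> bool:
    """True when prediction and reference agree after normalization."""
    return normalize(pred) == normalize(gold)


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens, gold_tokens = normalize(pred), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_hit(pred: str, gold: str) -> bool:
    """True when the normalized reference appears inside the prediction."""
    gold_norm = " ".join(normalize(gold))
    pred_norm = " ".join(normalize(pred))
    return bool(gold_norm) and gold_norm in pred_norm


print(exact_match("2:10:48", "Stanford University"))  # False
print(token_f1("Acme Corp", "acme corp"))  # 1.0
```

When a question has several acceptable reference answers, a common choice is to score against each one and keep the best value.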

	## Token behavior and recommendations

	Based on empirical testing, the three most useful task tokens behave as follows:

	| Token | Behavior | Recommended for |
	|---|---|---|
	| `<OCR>` | Extracts the single most visually dominant text | Quick salience extraction, document indexing |
	| `<OCR_WITH_REGION>` | Identical output to `<OCR>`; region coordinates are not generated | Avoid; use `<OCR>` directly |
	| `<MORE_DETAILED_CAPTION>` | Produces a structured natural-language description of the document layout | Thumbnail descriptions, alt-text generation, layout understanding |

	`<MORE_DETAILED_CAPTION>` is the most informative token for document understanding tasks.
	On a standard service invoice it correctly identified the document type, described the
	column structure, and extracted footer text, all without being asked any question. Output
	length is capped by `max_new_tokens`; raise it to 256 or higher (the usage example in this
	card uses 128) to avoid truncation on dense documents.

	Example output on a service invoice template (`<MORE_DETAILED_CAPTION>`):
	```
	The image is a service invoice template with hourly rate. It has a white background
	and black text. The template is divided into two columns, with the left column containing
	the company name, address, phone number, and email address. The right column contains
	the total amount of the service invoice, which includes the total cost, subtotal,
	discount, and other details. At the bottom of the template, there is a note that reads
	"Thank you" and "Please make check payable to your company name."
	```
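Since the caption token needs a larger token budget than the OCR tokens, it can help to choose generation settings per task token. The helper below is a minimal sketch: the beam settings and the 128-token default mirror the usage example in this card, the 256-token budget follows the recommendation above, and `looks_truncated` is only a rough heuristic introduced here for illustration.

```python
def generation_kwargs(task_token: str) -> dict:
    """Pick generation settings for a Florence-2 task token."""
    kwargs = {"num_beams": 3, "do_sample": False, "early_stopping": True}
    # Captions are long-form, so give them a larger budget than OCR output.
    if task_token == "<MORE_DETAILED_CAPTION>":
        kwargs["max_new_tokens"] = 256
    else:
        kwargs["max_new_tokens"] = 128
    return kwargs


def looks_truncated(caption: str) -> bool:
    """Heuristic: a caption that lacks closing punctuation was probably cut off."""
    return not caption.rstrip().endswith((".", "!", "?", '"'))


print(generation_kwargs("<MORE_DETAILED_CAPTION>")["max_new_tokens"])  # 256
print(looks_truncated("The template is divided into two"))  # True
```

The resulting dictionary can be unpacked into `model.generate(**inputs, **generation_kwargs(token))`.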

	## Use Cases

	This model is best suited for tasks that do not require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:

	- Document indexing and search: Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.
	- Alt-text and thumbnail description generation: Automatically generating descriptions of document images for accessibility purposes or content management system previews.
	- Visual salience detection: Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.
	- Hybrid OCR pipelines: Using the model as a first stage to extract the dominant text, then passing that text to a separate reasoning model downstream.
	- Fine-tuning checkpoint: Starting a domain-specific fine-tune from this checkpoint rather than from the unmodified `microsoft/Florence-2-large`, particularly for document-heavy domains.
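As an illustration of the indexing use case, the model's text output for each document can feed a simple keyword index. The sketch below is one possible approach, not part of this repository: it assumes the outputs have already been collected into a dictionary keyed by a document identifier, and the sample entries are made up.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; punctuation acts as a separator."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_by_doc: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of documents whose output contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in ocr_by_doc.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical model outputs keyed by file name.
ocr_by_doc = {
    "invoice_001.png": "Service Invoice",
    "memo_017.png": "Willow 155-8056",
}
index = build_index(ocr_by_doc)
print(search(index, "invoice"))  # {'invoice_001.png'}
```

Because the model returns only the dominant text, such an index covers titles and headline values rather than the full document body.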

	## When Not to Use This Model

	- Document Question Answering (DocQA): The model is question-blind and will ignore any natural language question you provide. Do not use it in any pipeline where the output must depend on what the user asks.
	- Conversational document assistants: Chatbots, legal assistants, medical record reviewers, or any interactive system where a user expects answers grounded in a specific question.
	- Multi-document reasoning: The model processes a single image and has no cross-document or contextual reasoning capability.
	- Production-critical extraction: With 10% exact match on DocumentVQA, accuracy is not sufficient for any use case where extraction errors have significant consequences.

	## Recommended use

	```python
	from transformers import AutoModelForCausalLM, AutoProcessor
	from PIL import Image
	import torch

	# Load the fine-tuned checkpoint in half precision on the GPU.
	model = AutoModelForCausalLM.from_pretrained(
	    "d3p4rt/newtype-cognition",
	    trust_remote_code=True,
	    torch_dtype=torch.float16,
	    attn_implementation="eager",
	).cuda().eval()
	processor = AutoProcessor.from_pretrained(
	    "d3p4rt/newtype-cognition",
	    trust_remote_code=True,
	)

	image = Image.open("document.jpg").convert("RGB")

	# Use Florence-2 task tokens only; do NOT pass arbitrary questions.
	# <OCR> is used here because <OCR_WITH_REGION> returns the same plain text
	# on this checkpoint.
	inputs = processor(text="<OCR>", images=image, return_tensors="pt")
	# Move tensors to the GPU, casting float32 inputs to float16 to match the model.
	inputs = {
	    k: v.to("cuda").to(torch.float16) if v.dtype == torch.float32 else v.to("cuda")
	    for k, v in inputs.items()
	}

	with torch.no_grad():
	    out = model.generate(
	        **inputs,
	        max_new_tokens=128,
	        num_beams=3,
	        do_sample=False,
	        early_stopping=True,
	    )

	print(processor.batch_decode(out, skip_special_tokens=True)[0])
	```

	## Training Details

	- Base model: `microsoft/Florence-2-large`
	- Hardware: NVIDIA RTX 4090 (24 GB) on Vast.ai
	- Precision: bfloat16
	- Optimizer: `paged_adamw_8bit`
	- Memory tricks: gradient checkpointing, `expandable_segments` allocator
	- Phase 1 (warm-up): 5 epochs, full fine-tune, lr=2e-5
	- Phase 2 (specialization): 3 LoRA adapters (DocVQA, Nemotron, FineVision)
	- Phase 3 (merge): weighted merge biased toward DocVQA (0.90)
	- Phase 4 (polish): 2 epochs full fine-tune, lr=1e-6
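The Phase 3 merge can be pictured as a weighted average of the adapter parameters. The sketch below is an assumption-laden illustration, not the training code: this card states only the 0.90 bias toward DocVQA, so the linear combination, the 0.05 weights for the other two adapters, and the plain-list parameter format are all hypothetical.

```python
def weighted_merge(
    adapters: dict[str, dict[str, list[float]]],
    weights: dict[str, float],
) -> dict[str, list[float]]:
    """Linearly combine same-shaped adapter parameters using the given weights."""
    if set(adapters) != set(weights):
        raise ValueError("adapters and weights must use the same names")
    if abs(sum(weights.values()) - 1.0) > 1e-6:
        raise ValueError("weights should sum to 1.0")
    names = list(adapters)
    merged: dict[str, list[float]] = {}
    for key, reference in adapters[names[0]].items():
        merged[key] = [
            sum(weights[name] * adapters[name][key][i] for name in names)
            for i in range(len(reference))
        ]
    return merged


# Toy parameters; real adapters hold tensors, not two-element lists.
adapters = {
    "docvqa": {"w": [1.0, 0.0]},
    "nemotron": {"w": [0.0, 1.0]},
    "finevision": {"w": [0.0, 0.0]},
}
# Only the 0.90 value comes from this card; the 0.05 split is assumed.
weights = {"docvqa": 0.90, "nemotron": 0.05, "finevision": 0.05}
merged = weighted_merge(adapters, weights)
print(merged)
```

With a weight this heavily skewed, the merged parameters stay close to the DocVQA adapter, which matches the "biased toward DocVQA" description above.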

	## Datasets

	- `HuggingFaceM4/DocumentVQA` (Phase 1, 2)
	- `nvidia/Nemotron-VLM-Dataset-v1` (Phase 1, 2)
	- `HuggingFaceM4/FineVision` (Phase 1, 2)
	- `ChartGen` (Phase 1)

	All datasets streamed; no full local copies retained.

	## Lessons learned (for future fine-tuners)

	- **Always include the conditioning input (question) in your data collator from the
	first epoch**, especially when using a custom collator that builds the model input
	text from multiple fields.
	- Florence-2's processor enforces that special task tokens (`<OCR_WITH_REGION>`,
	`<CAPTION>`, etc.) are the only content in the input text. To inject extra text,
	manually expand the token to its English prompt (e.g., `<OCR_WITH_REGION>` →
	`"What is the text in the image, with regions?"`) before concatenating user text.
	- A late-stage low-lr "polish" phase cannot fix a behavioral bug introduced in
	earlier phases. Sanity-check inference behavior at the end of Phase 1, not at Phase 4.
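The first two lessons can be sketched as a pair of text builders for a collator. This is a minimal illustration under stated assumptions, not the code in `train_phase1.py`: the field names `task_token` and `question` are hypothetical, and the prompt table contains only the one expansion quoted above.

```python
TASK_PROMPTS = {
    # Expansion quoted in this card; other task tokens would need their own entries.
    "<OCR_WITH_REGION>": "What is the text in the image, with regions?",
}


def build_blind_text(task_token: str, question: str) -> str:
    """Question-blind variant: the question is silently dropped."""
    return task_token


def build_question_aware_text(task_token: str, question: str) -> str:
    """Expand the task token to its English prompt, then append the question."""
    if task_token not in TASK_PROMPTS:
        raise KeyError(f"no prompt expansion registered for {task_token}")
    prompt = TASK_PROMPTS[task_token]
    question = question.strip()
    return f"{prompt} {question}" if question else prompt


def collate_texts(batch: list[dict], build_text) -> list[str]:
    """Build the model input text for each example in a batch."""
    return [build_text(ex["task_token"], ex["question"]) for ex in batch]


batch = [
    {"task_token": "<OCR_WITH_REGION>", "question": "What is the name of the university?"},
    {"task_token": "<OCR_WITH_REGION>", "question": "To whom is the document sent?"},
]
blind = collate_texts(batch, build_blind_text)
aware = collate_texts(batch, build_question_aware_text)
print(len(set(blind)), len(set(aware)))  # 1 2
```

Comparing the number of distinct input texts in a batch, as the last line does, is a cheap check that the question actually reaches the model.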

	## Demo

	Try it live: [Space](https://huggingface.co/spaces/d3p4rt/newtype-cognition-demo)

	## License

	MIT (inherited from base model).

	## Citation

	```bibtex
	@misc{newtype-cognition,
	author = {d3p4rt},
	title = {Newtype Cognition: Florence-2 Document OCR Captioner (4-Phase Fine-tuned)},
	year = {2026},
	howpublished = {\url{https://huggingface.co/d3p4rt/newtype-cognition}},
	}
	```