d3p4rt/newtype-cognition

Image-Text-to-Text

document-understanding

vision-language

d3p4rt committed 14 days ago

Commit f5d6d42 · verified · 1 Parent(s): 15f2a3d

Update README.md

Files changed (1)

README.md +27 -0

@@ -72,6 +72,33 @@ Most of the 10% exact match comes from samples where the expected answer is the
 visible text on the document (e.g., the company title is the answer to "What is the
 company name?"). It is **not** evidence of question understanding.
+## Use Cases
+This model is best suited for tasks that do **not** require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:
+**Document indexing and search**
+Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.
+**Alt-text and thumbnail description generation**
+Automatically generating descriptions of document images for accessibility purposes or content management system previews.
+**Visual salience detection**
+Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.
+**Hybrid OCR pipelines**
+Using the model as a first stage to extract text regions, then passing those regions to a separate reasoning model downstream.
+**Fine-tuning checkpoint**
+Starting a domain-specific fine-tune from this checkpoint rather than from the vanilla `microsoft/Florence-2-large`, particularly for document-heavy domains.
+## When Not to Use This Model
+- **Document Question Answering (DocQA):** The model is question-blind and will ignore any natural language question you provide. Do not use it in any pipeline where the output must depend on what the user asks.
+- **Conversational document assistants:** Chatbots, legal assistants, medical record reviewers, or any interactive system where a user expects answers grounded in a specific question.
+- **Multi-document reasoning:** The model processes a single image and has no cross-document or contextual reasoning capability.
+- **Production-critical extraction:** With only 10% exact match on DocumentVQA, accuracy is too low for any use case where extraction errors have significant consequences.
 ## Recommended use
 ```python