Update README.md
README.md
CHANGED
|
@@ -72,6 +72,33 @@ Most of the 10% exact match comes from samples where the expected answer is the
|
|
| 72 |
visible text on the document (e.g., the company title is the answer to "What is the
|
| 73 |
company name?"). It is **not** evidence of question understanding.
|
| 74 |
| 75 |
## Recommended use
|
| 76 |
|
| 77 |
```python
|
|
|
|
| 72 |
visible text on the document (e.g., the company title is the answer to "What is the
|
| 73 |
company name?"). It is **not** evidence of question understanding.
|
| 74 |
|
| 75 |
+
## Use Cases
|
| 76 |
+
|
| 77 |
+
This model is best suited for tasks that do **not** require understanding a specific question about the document. Given its question-blind behavior, it works well as a document-aware OCR captioner in the following scenarios:
|
| 78 |
+
|
| 79 |
+
**Document indexing and search**
|
| 80 |
+
Extracting the dominant visible text from large archives of scanned documents (invoices, contracts, forms) to make them keyword-searchable without any question-answering step.
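
As a rough illustration of the indexing use case, the sketch below builds a small in-memory inverted index over captions. The `captions` dictionary stands in for model outputs and its file names and text are hypothetical; `build_index` and `search` are illustrative helpers, not part of this repository.

```python
import re
from collections import defaultdict


def build_index(captions):
    """Map each lowercase token to the set of document IDs whose caption contains it."""
    index = defaultdict(set)
    for doc_id, text in captions.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents whose caption contains every token in the query."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    if not tokens:
        return set()
    # Copy the first posting set so the intersection never mutates the index.
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Hypothetical captions, standing in for text the model extracted from scans.
captions = {
    "invoice_001.png": "ACME Corporation Invoice Total $1,250.00",
    "contract_017.png": "Service Agreement ACME Corporation",
}
index = build_index(captions)
matches = search(index, "invoice total")
```

No question is passed to the model at any point; the query is matched only against text that was already extracted.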
|
| 81 |
+
|
| 82 |
+
**Alt-text and thumbnail description generation**
|
| 83 |
+
Automatically generating descriptions of document images for accessibility purposes or content management system previews.
|
| 84 |
+
|
| 85 |
+
**Visual salience detection**
|
| 86 |
+
Identifying the most visually prominent text in a document (title, total amount, masthead). The model appears to have learned a form of salience awareness, which can be useful for extracting the "headline" information from structured documents.
|
| 87 |
+
|
| 88 |
+
**Hybrid OCR pipelines**
|
| 89 |
+
Using the model as a first stage to extract text regions, then passing those regions to a separate reasoning model downstream.
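
A minimal sketch of such a two-stage pipeline is shown below. Both stages are passed in as plain callables, so `extract_text` and `reason` are placeholders for whatever wrappers you write around this model and a downstream reasoning model; the stub functions and their outputs are hypothetical and exist only to make the example runnable.

```python
from typing import Callable


def hybrid_answer(
    image_path: str,
    question: str,
    extract_text: Callable[[str], str],
    reason: Callable[[str, str], str],
) -> str:
    """Run question-blind text extraction, then question-aware reasoning."""
    # Stage 1: the extractor never sees the question.
    text = extract_text(image_path)
    # Stage 2: only the downstream model conditions on the question.
    return reason(text, question)


# Hypothetical stand-ins for the two models.
def fake_extract(image_path: str) -> str:
    return "ACME Corporation Invoice Total 1250.00"


def fake_reason(text: str, question: str) -> str:
    # Toy rule: return the first word when asked about the company.
    return text.split()[0] if "company" in question.lower() else text


answer = hybrid_answer(
    "invoice_001.png", "What is the company name?", fake_extract, fake_reason
)
```

Keeping the question out of the first stage makes the division of labor explicit: any question-dependent behavior has to come from the second model.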
|
| 90 |
+
|
| 91 |
+
**Fine-tuning checkpoint**
|
| 92 |
+
Starting a domain-specific fine-tune from this checkpoint rather than from the vanilla `microsoft/Florence-2-large`, particularly for document-heavy domains.
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
## When Not to Use This Model
|
| 96 |
+
|
| 97 |
+
- **Document Question Answering (DocQA):** The model is question-blind and will ignore any natural language question you provide. Do not use it in any pipeline where the output must depend on what the user asks.
|
| 98 |
+
- **Conversational document assistants:** Do not use it for chatbots, legal assistants, medical record reviewers, or any other interactive system where a user expects answers grounded in a specific question.
|
| 99 |
+
- **Multi-document reasoning:** The model processes a single image and has no cross-document or contextual reasoning capability.
|
| 100 |
+
- **Production-critical extraction:** With only 10% exact match on DocumentVQA, the model is not accurate enough for use cases where extraction errors have significant consequences.
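
For context on the exact-match figure cited in the last item, the sketch below shows one common way such a score is computed. The normalization (lowercasing and collapsing whitespace) is an assumption; the evaluation script behind the reported number is not shown in this README, so treat this as an illustration rather than a reproduction.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def exact_match(predictions, references):
    """Fraction of predictions equal to any accepted answer after normalization."""
    if not predictions:
        return 0.0
    hits = 0
    for pred, accepted in zip(predictions, references):
        if normalize(pred) in {normalize(ans) for ans in accepted}:
            hits += 1
    return hits / len(predictions)


# Hypothetical predictions and accepted answers.
predictions = ["ACME  Corp", "42"]
references = [["acme corp"], ["41", "43"]]
score = exact_match(predictions, references)
```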
|
| 101 |
+
|
| 102 |
## Recommended use
|
| 103 |
|
| 104 |
```python
|