update_colab
README.md
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
tags:
- OCR
- vision-language
- document-understanding
- multilingual
- thai
---

**Typhoon-OCR-7B**: A bilingual document parsing model built specifically for real-world documents in Thai and English, inspired by models like olmOCR and based on Qwen2.5-VL-7B-Instruct.
## **Model Description**
- **Model type**: A 7B vision-language model (VLM) based on Qwen2.5-VL-7B-Instruct.
- **Requirement**: transformers 4.50.0 or newer (a quick version check is sketched after this list).
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **License**:
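
A quick environment check for the transformers requirement above, as a minimal sketch (the `packaging` helper is a standard dependency of transformers, not something prescribed by this card):

```python
from packaging import version
import transformers

# Typhoon-OCR-7B requires transformers 4.50.0 or newer.
assert version.parse(transformers.__version__) >= version.parse("4.50.0"), (
    "Please upgrade transformers: pip install -U 'transformers>=4.50.0'"
)
```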
## **Real-World Document Support**
For this version, our primary focus has been on achieving high-quality OCR for both English and Thai text. Future releases may extend support to more advanced image analysis and figure interpretation.
## Usage Example

**Recommended: full inference code is available on [Colab](https://colab.research.google.com/drive/1z4Fm2BZnKcFIoWuyxzzIIIn8oI2GKl3r?usp=sharing).**

Below is a partial snippet. You can run inference using either the API or a local model.

**API**:
```python
import base64
from io import BytesIO
from typing import Callable

from openai import OpenAI
from PIL import Image

# NOTE: render_pdf_to_base64png and get_anchor_text are the PDF helper utilities used in
# the full Colab notebook linked above; they are not defined in this partial snippet.

PROMPTS_SYS = {
    "default": lambda base_text: (
        f"Below is an image of a document page along with its dimensions. "
        f"Simply return the markdown representation of this document, presenting tables in markdown format as they naturally appear.\n"
        f"If the document contains images, use a placeholder like dummy.png for each image.\n"
        f"Your final output must be in JSON format with a single key `natural_text` containing the response.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    ),
    "structure": lambda base_text: (
        f"Below is an image of a document page, along with its dimensions and possibly some raw textual content previously extracted from it. "
        f"Note that the text extraction may be incomplete or partially missing. Carefully consider both the layout and any available text to reconstruct the document accurately.\n"
        f"Your task is to return the markdown representation of this document, presenting tables in HTML format as they naturally appear.\n"
        f"If the document contains images or figures, analyze them and include the tag <figure>IMAGE_ANALYSIS</figure> in the appropriate location.\n"
        f"Your final output must be in JSON format with a single key `natural_text` containing the response.\n"
        f"RAW_TEXT_START\n{base_text}\nRAW_TEXT_END"
    ),
}


def get_prompt(prompt_name: str) -> Callable[[str], str]:
    """
    Fetches the prompt template for the given prompt_name.

    :param prompt_name: The identifier for the desired prompt ("default" or "structure").
    :return: A callable that fills the template with the anchor text.
    """
    return PROMPTS_SYS.get(prompt_name, lambda x: "Invalid PROMPT_NAME provided.")


# Example inputs for this snippet (replace with your own).
filename = "document.pdf"
page_num = 1
task_type = "default"  # or "structure"

# Render the first page to base64 PNG and then load it into a PIL image.
image_base64 = render_pdf_to_base64png(filename, page_num, target_longest_image_dim=1800)
image_pil = Image.open(BytesIO(base64.b64decode(image_base64)))

# Extract anchor text from the PDF (first page).
anchor_text = get_anchor_text(filename, page_num, pdf_engine="pdfreport", target_length=8000)

# Retrieve and fill in the prompt template with the anchor_text.
prompt_template_fn = get_prompt(task_type)
PROMPT = prompt_template_fn(anchor_text)

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": PROMPT},
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_base64}"}},
    ],
}]

# Send the messages to the OpenAI-compatible Typhoon API.
openai = OpenAI(base_url="https://api.opentyphoon.ai/v1", api_key="TYPHOON_API_KEY")  # replace with your API key
response = openai.chat.completions.create(
    model="typhoon-ocr-preview",
    messages=messages,
    max_tokens=16384,
    extra_body={
        "repetition_penalty": 1.2,
        "temperature": 0.1,
        "top_p": 0.6,
    },
)
text_output = response.choices[0].message.content
print(text_output)
```
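
Both prompt templates instruct the model to return a JSON object with a single `natural_text` key, so you will usually want to parse that key out of `text_output`. A minimal sketch (the fallback handling here is illustrative and not part of the original snippet):

```python
import json

# The prompts request a JSON object with a single `natural_text` key;
# fall back to the raw string if the model did not return valid JSON.
try:
    markdown_text = json.loads(text_output)["natural_text"]
except (json.JSONDecodeError, KeyError, TypeError):
    markdown_text = text_output

print(markdown_text)
```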
**Local Model (GPU Required)**:
```python
import base64
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

# Initialize the model and processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "scb10x/typhoon-ocr-7b", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained("scb10x/typhoon-ocr-7b")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Apply the chat template and processor.
# `messages` and `image_base64` are built exactly as in the API example above.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
main_image = Image.open(BytesIO(base64.b64decode(image_base64)))

inputs = processor(
    text=[text],
    images=[main_image],
    padding=True,
    return_tensors="pt",
)
inputs = {key: value.to(device) for (key, value) in inputs.items()}

# Generate the output.
output = model.generate(
    **inputs,
    temperature=0.1,
    max_new_tokens=12000,
    num_return_sequences=1,
    repetition_penalty=1.2,
    do_sample=True,
)

# Decode only the newly generated tokens (strip the prompt).
prompt_length = inputs["input_ids"].shape[1]
new_tokens = output[:, prompt_length:]
text_output = processor.tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)
print(text_output[0])
```
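
As in the API example, the decoded `text_output[0]` is expected to be a JSON string with a single `natural_text` key (per the prompts above), so the same `json.loads` parsing sketch shown earlier applies here as well.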
## **Intended Uses & Limitations**
This is an instruction-following model; however, it is still under development. It incorporates some level of guardrails, but it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.
## **Follow us**
**https://twitter.com/opentyphoon**
## **Support**
**https://discord.gg/us5gAYmrxw**
## **Citation**