pitapo committed on
Commit 8eddd2a · 0 Parent(s):

adding all surya model files

Files changed (40)
  1. README.md +575 -0
  2. layout/2025_02_18/.gitattributes +35 -0
  3. layout/2025_02_18/README.md +3 -0
  4. layout/2025_02_18/config.json +267 -0
  5. layout/2025_02_18/manifest.json +1 -0
  6. layout/2025_02_18/model.safetensors +3 -0
  7. layout/2025_02_18/preprocessor_config.json +33 -0
  8. ocr_error_detection/2025_02_18/.gitattributes +35 -0
  9. ocr_error_detection/2025_02_18/README.md +3 -0
  10. ocr_error_detection/2025_02_18/config.json +26 -0
  11. ocr_error_detection/2025_02_18/manifest.json +1 -0
  12. ocr_error_detection/2025_02_18/model.safetensors +3 -0
  13. ocr_error_detection/2025_02_18/special_tokens_map.json +37 -0
  14. ocr_error_detection/2025_02_18/tokenizer.json +0 -0
  15. ocr_error_detection/2025_02_18/tokenizer_config.json +55 -0
  16. ocr_error_detection/2025_02_18/vocab.txt +0 -0
  17. table_recognition/2025_02_18/.gitattributes +35 -0
  18. table_recognition/2025_02_18/README.md +3 -0
  19. table_recognition/2025_02_18/config.json +292 -0
  20. table_recognition/2025_02_18/manifest.json +1 -0
  21. table_recognition/2025_02_18/model.safetensors +3 -0
  22. table_recognition/2025_02_18/preprocessor_config.json +35 -0
  23. text_detection/2025_05_07/.gitattributes +35 -0
  24. text_detection/2025_05_07/README.md +15 -0
  25. text_detection/2025_05_07/config.json +54 -0
  26. text_detection/2025_05_07/manifest.json +1 -0
  27. text_detection/2025_05_07/model.safetensors +3 -0
  28. text_detection/2025_05_07/preprocessor_config.json +23 -0
  29. text_detection/2025_05_07/training_args.bin +3 -0
  30. text_recognition/2025_05_16/.gitattributes +35 -0
  31. text_recognition/2025_05_16/README.md +4 -0
  32. text_recognition/2025_05_16/added_tokens.json +24 -0
  33. text_recognition/2025_05_16/config.json +2290 -0
  34. text_recognition/2025_05_16/manifest.json +1 -0
  35. text_recognition/2025_05_16/merges.txt +0 -0
  36. text_recognition/2025_05_16/model.safetensors +3 -0
  37. text_recognition/2025_05_16/preprocessor_config.json +23 -0
  38. text_recognition/2025_05_16/special_tokens_map.json +31 -0
  39. text_recognition/2025_05_16/tokenizer_config.json +208 -0
  40. text_recognition/2025_05_16/vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,575 @@
# Surya

Surya is a document OCR toolkit that does:

- OCR in 90+ languages that benchmarks favorably vs cloud services
- Line-level text detection in any language
- Layout analysis (table, image, header, etc. detection)
- Reading order detection
- Table recognition (detecting rows/columns)
- LaTeX OCR

It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks) for more details).

| Detection | OCR |
|:---------:|:---:|
| <img src="static/images/excerpt.png" width="500px"/> | <img src="static/images/excerpt_text.png" width="500px"/> |

| Layout | Reading Order |
|:------:|:-------------:|
| <img src="static/images/excerpt_layout.png" width="500px"/> | <img src="static/images/excerpt_reading.jpg" width="500px"/> |

| Table Recognition | LaTeX OCR |
|:-----------------:|:---------:|
| <img src="static/images/scanned_tablerec.png" width="500px"/> | <img src="static/images/latex_ocr.png" width="500px"/> |

Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.

## Community

[Discord](https://discord.gg//KuZwXNGnfH) is where we discuss future development.

## Examples

| Name | Detection | OCR | Layout | Order | Table Rec |
|------|:---------:|----:|-------:|------:|----------:|
| Japanese | [Image](static/images/japanese.jpg) | [Image](static/images/japanese_text.jpg) | [Image](static/images/japanese_layout.jpg) | [Image](static/images/japanese_reading.jpg) | [Image](static/images/japanese_tablerec.png) |
| Chinese | [Image](static/images/chinese.jpg) | [Image](static/images/chinese_text.jpg) | [Image](static/images/chinese_layout.jpg) | [Image](static/images/chinese_reading.jpg) | |
| Hindi | [Image](static/images/hindi.jpg) | [Image](static/images/hindi_text.jpg) | [Image](static/images/hindi_layout.jpg) | [Image](static/images/hindi_reading.jpg) | |
| Arabic | [Image](static/images/arabic.jpg) | [Image](static/images/arabic_text.jpg) | [Image](static/images/arabic_layout.jpg) | [Image](static/images/arabic_reading.jpg) | |
| Chinese + Hindi | [Image](static/images/chi_hind.jpg) | [Image](static/images/chi_hind_text.jpg) | [Image](static/images/chi_hind_layout.jpg) | [Image](static/images/chi_hind_reading.jpg) | |
| Presentation | [Image](static/images/pres.png) | [Image](static/images/pres_text.jpg) | [Image](static/images/pres_layout.jpg) | [Image](static/images/pres_reading.jpg) | [Image](static/images/pres_tablerec.png) |
| Scientific Paper | [Image](static/images/paper.jpg) | [Image](static/images/paper_text.jpg) | [Image](static/images/paper_layout.jpg) | [Image](static/images/paper_reading.jpg) | [Image](static/images/paper_tablerec.png) |
| Scanned Document | [Image](static/images/scanned.png) | [Image](static/images/scanned_text.jpg) | [Image](static/images/scanned_layout.jpg) | [Image](static/images/scanned_reading.jpg) | [Image](static/images/scanned_tablerec.png) |
| New York Times | [Image](static/images/nyt.jpg) | [Image](static/images/nyt_text.jpg) | [Image](static/images/nyt_layout.jpg) | [Image](static/images/nyt_order.jpg) | |
| Scanned Form | [Image](static/images/funsd.png) | [Image](static/images/funsd_text.jpg) | [Image](static/images/funsd_layout.jpg) | [Image](static/images/funsd_reading.jpg) | [Image](static/images/scanned_tablerec2.png) |
| Textbook | [Image](static/images/textbook.jpg) | [Image](static/images/textbook_text.jpg) | [Image](static/images/textbook_layout.jpg) | [Image](static/images/textbook_order.jpg) | |

# Hosted API

There is a hosted API for all surya models available [here](https://www.datalab.to/):

- Works with PDF, images, word docs, and powerpoints
- Consistent speed, with no latency spikes
- High reliability and uptime

# Commercial usage

I want surya to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under \$2M USD in gross revenue in the most recent 12-month period AND under \$2M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).

# Installation

You'll need python 3.10+ and PyTorch. You may need to install the CPU version of torch first if you're not using a Mac or a GPU machine. See [here](https://pytorch.org/get-started/locally/) for more details.

Install with:

```shell
pip install surya-ocr
```

Model weights will automatically download the first time you run surya.

# Usage

- Inspect the settings in `surya/settings.py`. You can override any settings with environment variables (see the sketch below).
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
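Since settings are environment-driven, one way to override them from Python is to set the variables before importing surya. A minimal sketch (the setting names come from the docs in this README; the values are illustrative):

```python
import os

# Settings are read from environment variables, so set them before
# importing surya (illustrative values; see surya/settings.py).
os.environ["TORCH_DEVICE"] = "cuda"
os.environ["RECOGNITION_BATCH_SIZE"] = "256"

from surya.recognition import RecognitionPredictor

recognition_predictor = RecognitionPredictor()
```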
## Interactive App

I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:

```shell
pip install streamlit pdftext
surya_gui
```

## OCR (text recognition)

This command will write out a json file with the detected text and bboxes:

```shell
surya_ocr DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--task_name` specifies which task to use for predicting the lines. `ocr_with_boxes` is the default, which will format text and give you bboxes. If you get bad performance, try `ocr_without_boxes`, which may give you better performance but no bboxes. For blocks like equations and paragraphs, try `block_without_boxes`.
- `--images` will save images of the pages and detected text lines (optional)
- `--output_dir` specifies the directory to save results to instead of the default
- `--page_range` specifies the page range to process in the PDF, given as a single number, a comma-separated list, a range, or comma-separated ranges - example: `0,5-10,20`.
- `--disable_math` - by default, surya will recognize math in text. This can lead to false positives - you can disable it with this flag.

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

- `text_lines` - the detected text and bounding boxes for each line
  - `text` - the text in the line
  - `confidence` - the confidence of the model in the detected text (0-1)
  - `polygon` - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
  - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - `chars` - the individual characters in the line
    - `text` - the text of the character
    - `bbox` - the character bbox (same format as line bbox)
    - `polygon` - the character polygon (same format as line polygon)
    - `confidence` - the confidence of the model in the detected character (0-1)
    - `bbox_valid` - if the character is a special token or math, the bbox may not be valid
  - `words` - the individual words in the line (computed from the characters)
    - `text` - the text of the word
    - `bbox` - the word bbox (same format as line bbox)
    - `polygon` - the word polygon (same format as line polygon)
    - `confidence` - mean character confidence
    - `bbox_valid` - if the word is a special token or math, the bbox may not be valid
- `page` - the page number in the file
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.

**Performance tips**

Setting the `RECOGNITION_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `40MB` of VRAM, so very high batch sizes are possible. The default is a batch size of `512`, which will use about 20GB of VRAM. Depending on your CPU core count, it may help on CPU too - the default CPU batch size is `32`.

### From python

```python
from PIL import Image
from surya.recognition import RecognitionPredictor
from surya.detection import DetectionPredictor

image = Image.open(IMAGE_PATH)
recognition_predictor = RecognitionPredictor()
detection_predictor = DetectionPredictor()

predictions = recognition_predictor([image], det_predictor=detection_predictor)
```
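The returned objects mirror the `results.json` structure described above. A minimal sketch of reading them out, assuming the prediction objects expose the same fields as the JSON (`text_lines`, `text`, `confidence`, `bbox`):

```python
# A minimal sketch: print each recognized line with its confidence and bbox.
for page in predictions:
    for line in page.text_lines:
        print(f"{line.confidence:.2f} {line.bbox} {line.text}")
```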
## Text line detection

This command will write out a json file with the detected bboxes.

```shell
surya_detect DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--images` will save images of the pages and detected text lines (optional)
- `--output_dir` specifies the directory to save results to instead of the default
- `--page_range` specifies the page range to process in the PDF, given as a single number, a comma-separated list, a range, or comma-separated ranges - example: `0,5-10,20`.

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

- `bboxes` - detected bounding boxes for text
  - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - `polygon` - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
  - `confidence` - the confidence of the model in the detected text (0-1)
- `vertical_lines` - vertical lines detected in the document
  - `bbox` - the axis-aligned line coordinates.
- `page` - the page number in the file
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.

**Performance tips**

Setting the `DETECTOR_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `440MB` of VRAM, so very high batch sizes are possible. The default is a batch size of `36`, which will use about 16GB of VRAM. Depending on your CPU core count, it might help on CPU too - the default CPU batch size is `6`.

### From python

```python
from PIL import Image
from surya.detection import DetectionPredictor

image = Image.open(IMAGE_PATH)
det_predictor = DetectionPredictor()

# predictions is a list of dicts, one per image
predictions = det_predictor([image])
```
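To eyeball the results, you can draw the detected boxes back onto the page with PIL. A minimal sketch, assuming each result exposes `bboxes` with a `bbox` field as described above:

```python
from PIL import ImageDraw

# A minimal sketch: draw each detected line bbox onto a copy of the page.
annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for box in predictions[0].bboxes:
    draw.rectangle(box.bbox, outline="red", width=2)
annotated.save("detected_lines.png")
```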
## Layout and reading order

This command will write out a json file with the detected layout and reading order.

```shell
surya_layout DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--images` will save images of the pages and detected text lines (optional)
- `--output_dir` specifies the directory to save results to instead of the default
- `--page_range` specifies the page range to process in the PDF, given as a single number, a comma-separated list, a range, or comma-separated ranges - example: `0,5-10,20`.

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

- `bboxes` - detected bounding boxes for text
  - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - `polygon` - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
  - `position` - the reading order of the box.
  - `label` - the label for the bbox. One of `Caption`, `Footnote`, `Formula`, `List-item`, `Page-footer`, `Page-header`, `Picture`, `Figure`, `Section-header`, `Table`, `Form`, `Table-of-contents`, `Handwriting`, `Text`, `Text-inline-math`.
  - `top_k` - the top-k other potential labels for the box. A dictionary with labels as keys and confidences as values.
- `page` - the page number in the file
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.

**Performance tips**

Setting the `LAYOUT_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `220MB` of VRAM, so very high batch sizes are possible. The default is a batch size of `32`, which will use about 7GB of VRAM. Depending on your CPU core count, it might help on CPU too - the default CPU batch size is `4`.

### From python

```python
from PIL import Image
from surya.layout import LayoutPredictor

image = Image.open(IMAGE_PATH)
layout_predictor = LayoutPredictor()

# layout_predictions is a list of dicts, one per image
layout_predictions = layout_predictor([image])
```
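Since each box carries a `position` field, you can recover the predicted reading order by sorting on it. A minimal sketch, assuming the result objects mirror the JSON fields above:

```python
# A minimal sketch: walk the layout boxes in predicted reading order.
boxes = sorted(layout_predictions[0].bboxes, key=lambda b: b.position)
for box in boxes:
    print(f"{box.position:3d} {box.label:20s} {box.bbox}")
```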
## Table Recognition

This command will write out a json file with the detected table cells and row/column ids, along with row/column bounding boxes. If you want to get cell positions and text, along with nice formatting, check out the [marker](https://www.github.com/VikParuchuri/marker) repo. You can use the `TableConverter` to detect and extract tables in images and PDFs. It supports output in json (with bboxes), markdown, and html.

```shell
surya_table DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--images` will save images of the pages and detected table cells + rows and columns (optional)
- `--output_dir` specifies the directory to save results to instead of the default
- `--page_range` specifies the page range to process in the PDF, given as a single number, a comma-separated list, a range, or comma-separated ranges - example: `0,5-10,20`.
- `--detect_boxes` specifies if cells should be detected. By default, they're pulled out of the PDF, but this is not always possible.
- `--skip_table_detection` tells table recognition not to detect tables first. Use this if your image is already cropped to a table.

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:

- `rows` - detected table rows
  - `bbox` - the bounding box of the table row
  - `row_id` - the id of the row
  - `is_header` - if it is a header row.
- `cols` - detected table columns
  - `bbox` - the bounding box of the table column
  - `col_id` - the id of the column
  - `is_header` - if it is a header column
- `cells` - detected table cells
  - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
  - `text` - if text could be pulled out of the pdf, the text of this cell.
  - `row_id` - the id of the row the cell belongs to.
  - `col_id` - the id of the column the cell belongs to.
  - `colspan` - the number of columns spanned by the cell.
  - `rowspan` - the number of rows spanned by the cell.
  - `is_header` - whether it is a header cell.
- `page` - the page number in the file
- `table_idx` - the index of the table on the page (sorted in vertical order)
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.

**Performance tips**

Setting the `TABLE_REC_BATCH_SIZE` env var properly will make a big difference when using a GPU. Each batch item will use `150MB` of VRAM, so very high batch sizes are possible. The default is a batch size of `64`, which will use about 10GB of VRAM. Depending on your CPU core count, it might help on CPU too - the default CPU batch size is `8`.

### From python

```python
from PIL import Image
from surya.table_rec import TableRecPredictor

image = Image.open(IMAGE_PATH)
table_rec_predictor = TableRecPredictor()

table_predictions = table_rec_predictor([image])
```
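The cells carry `row_id`/`col_id`, so a simple grid can be rebuilt by grouping on those ids. A minimal sketch, assuming the result objects mirror the JSON fields above (cell `text` may be absent, so it is defaulted to an empty string):

```python
from collections import defaultdict

# A minimal sketch: group cells into rows by row_id, ordered by col_id.
grid = defaultdict(list)
for cell in table_predictions[0].cells:
    grid[cell.row_id].append(cell)
for row_id in sorted(grid):
    cells = sorted(grid[row_id], key=lambda c: c.col_id)
    print(" | ".join(c.text or "" for c in cells))
```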
## LaTeX OCR

This command will write out a json file with the LaTeX of the equations. You must pass in images that are already cropped to the equations. You can do this by running the layout model, then cropping, if you want (see the sketch after the python example below).

```shell
surya_latex_ocr DATA_PATH
```

- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
- `--output_dir` specifies the directory to save results to instead of the default
- `--page_range` specifies the page range to process in the PDF, given as a single number, a comma-separated list, a range, or comma-separated ranges - example: `0,5-10,20`.

The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. See the OCR section above for the format of the output.

### From python

```python
from PIL import Image
from surya.texify import TexifyPredictor

image = Image.open(IMAGE_PATH)
predictor = TexifyPredictor()

predictor([image])
```
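To OCR equations inside a full page, one option is to run the layout model first and crop out the `Formula` boxes, as suggested above. A minimal sketch (the cropping logic and `PAGE_IMAGE_PATH` are illustrative; the `Formula` label comes from the layout section):

```python
from surya.layout import LayoutPredictor

# A minimal sketch: crop layout boxes labeled Formula, then run texify.
page = Image.open(PAGE_IMAGE_PATH)
layout = LayoutPredictor()([page])[0]
equation_crops = [
    page.crop(tuple(map(int, box.bbox)))
    for box in layout.bboxes
    if box.label == "Formula"
]
if equation_crops:
    latex_predictions = predictor(equation_crops)
```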
### Interactive app

You can also run a special interactive app that lets you select equations and OCR them (kind of like MathPix snip) with:

```shell
pip install streamlit==1.40 streamlit-drawable-canvas-jsretry
texify_gui
```

## Compilation

The following models have support for compilation. Set these environment variables to enable it:

- Detection: `COMPILE_DETECTOR=true`
- Layout: `COMPILE_LAYOUT=true`
- Table recognition: `COMPILE_TABLE_REC=true`

Alternatively, you can set `COMPILE_ALL=true`, which will compile all models.

Here are the speedups on an A10 GPU:

| Model             | Time per page (s) | Compiled time per page (s) | Speedup (%) |
|-------------------|-------------------|----------------------------|-------------|
| Detection         | 0.108808          | 0.10521                    | 3.306742151 |
| Layout            | 0.27319           | 0.27063                    | 0.93707676  |
| Table recognition | 0.0219            | 0.01938                    | 11.50684932 |

# Limitations

- This is specialized for document OCR. It will likely not work on photos or other images.
- It is for printed text, not handwriting (though it may work on some handwriting).
- The text detection model has trained itself to ignore advertisements.
- You can find language support for OCR in `surya/recognition/languages.py`. Text detection, layout analysis, and reading order will work with any language.

## Troubleshooting

If OCR isn't working properly:

- Try increasing the resolution of the image so the text is bigger. If the resolution is already very high, try decreasing it to no more than a `2048px` width.
- Preprocessing the image (binarizing, deskewing, etc.) can help with very old/blurry images.
- You can adjust `DETECTOR_BLANK_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. `DETECTOR_BLANK_THRESHOLD` controls the space between lines - any prediction below this number will be considered blank space. `DETECTOR_TEXT_THRESHOLD` controls how text is joined - anything above this number is considered text. `DETECTOR_TEXT_THRESHOLD` should always be higher than `DETECTOR_BLANK_THRESHOLD`, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).

# Manual install

If you want to develop surya, you can install it manually:

- `git clone https://github.com/VikParuchuri/surya.git`
- `cd surya`
- `poetry install` - installs main and dev dependencies
- `poetry shell` - activates the virtual environment

# Benchmarks

## OCR

![Benchmark chart tesseract](static/images/benchmark_rec_chart.png)

| Model     | Time per page (s) | Avg similarity (⬆) |
|-----------|-------------------|--------------------|
| surya     | .62               | 0.97               |
| tesseract | .45               | 0.88               |

[Full language results](static/images/rec_acc_table.png)

Tesseract is CPU-based, and surya is CPU or GPU. I tried to cost-match the resources used, so I used a 1xA6000 (48GB VRAM) for surya, and 28 CPU cores for Tesseract (same price on Lambda Labs/DigitalOcean).

### Google Cloud Vision

I benchmarked OCR against Google Cloud Vision since it has similar language coverage to Surya.

![Benchmark chart google cloud](static/images/gcloud_rec_bench.png)

[Full language results](static/images/gcloud_full_langs.png)

**Methodology**

I measured normalized sentence similarity (0-1, higher is better) based on a set of real-world and synthetic pdfs. I sampled PDFs from common crawl, then filtered out the ones with bad OCR. I couldn't find PDFs for some languages, so I also generated simple synthetic PDFs for those.

I used the reference line bboxes from the PDFs with both tesseract and surya, to just evaluate the OCR quality.

For Google Cloud, I aligned the output from Google Cloud with the ground truth. I had to skip RTL languages since they didn't align well.

## Text line detection

![Benchmark chart](static/images/benchmark_chart_small.png)

| Model     | Time (s) | Time per page (s) | precision | recall   |
|-----------|----------|-------------------|-----------|----------|
| surya     | 47.2285  | 0.094452          | 0.835857  | 0.960807 |
| tesseract | 74.4546  | 0.290838          | 0.631498  | 0.997694 |

Tesseract is CPU-based, and surya is CPU or GPU. I ran the benchmarks on a system with an A10 GPU and a 32 core CPU. This was the resource usage:

- tesseract - 32 CPU cores, or 8 workers using 4 cores each
- surya - 36 batch size, for 16GB VRAM usage

**Methodology**

Surya predicts line-level bboxes, while tesseract and others predict word-level or character-level. It's hard to find 100% correct datasets with line-level annotations. Merging bboxes can be noisy, so I chose not to use IoU as the metric for evaluation.

I instead used coverage, which calculates:

- Precision - how well the predicted bboxes cover ground truth bboxes
- Recall - how well ground truth bboxes cover predicted bboxes

First calculate coverage for each bbox, then add a small penalty for double coverage, since we want the detection to have non-overlapping bboxes. Anything with a coverage of 0.5 or higher is considered a match.

Then we calculate precision and recall for the whole dataset.
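A minimal sketch of this kind of coverage matching, following the definitions above (the 0.5 threshold comes from the text; the overlap computation is illustrative, and the double-coverage penalty is omitted for brevity):

```python
def overlap_fraction(box, other):
    # Fraction of box's area covered by other (boxes are x1, y1, x2, y2).
    ix1, iy1 = max(box[0], other[0]), max(box[1], other[1])
    ix2, iy2 = min(box[2], other[2]), min(box[3], other[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area else 0.0

def coverage_metrics(pred_boxes, gt_boxes, threshold=0.5):
    # How well predictions cover ground truth, and the reverse.
    covered_gt = [
        max((overlap_fraction(gt, p) for p in pred_boxes), default=0.0)
        for gt in gt_boxes
    ]
    covered_pred = [
        max((overlap_fraction(p, gt) for gt in gt_boxes), default=0.0)
        for p in pred_boxes
    ]
    precision = sum(c >= threshold for c in covered_gt) / len(gt_boxes)
    recall = sum(c >= threshold for c in covered_pred) / len(pred_boxes)
    return precision, recall
```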
## Layout analysis

| Layout Type | precision | recall  |
|-------------|-----------|---------|
| Image       | 0.91265   | 0.93976 |
| List        | 0.80849   | 0.86792 |
| Table       | 0.84957   | 0.96104 |
| Text        | 0.93019   | 0.94571 |
| Title       | 0.92102   | 0.95404 |

Time per image - .13 seconds on GPU (A10).

**Methodology**

I benchmarked the layout analysis on [Publaynet](https://github.com/ibm-aur-nlp/PubLayNet), which was not in the training data. I had to align publaynet labels with the surya layout labels. I was then able to find coverage for each layout type:

- Precision - how well the predicted bboxes cover ground truth bboxes
- Recall - how well ground truth bboxes cover predicted bboxes

## Reading Order

88% mean accuracy, and .4 seconds per image on an A10 GPU. See methodology for notes - this benchmark is not a perfect measure of accuracy, and is more useful as a sanity check.

**Methodology**

I benchmarked the reading order on the layout dataset from [here](https://www.icst.pku.edu.cn/cpdp/sjzy/), which was not in the training data. Unfortunately, this dataset is fairly noisy, and not all the labels are correct. It was very hard to find a dataset annotated with both reading order and layout information. I wanted to avoid using a cloud service for the ground truth.

The accuracy is computed by finding if each pair of layout boxes is in the correct order, then taking the % that are correct.
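A minimal sketch of that pairwise metric (names are illustrative): given predicted and ground-truth orderings of the same boxes, count the box pairs whose relative order agrees.

```python
from itertools import combinations

def pairwise_order_accuracy(predicted_order, true_order):
    # predicted_order / true_order: lists of the same box ids, e.g.
    # predicted_order=[0, 2, 1], true_order=[0, 1, 2] -> 2/3 correct pairs.
    pred_rank = {box: i for i, box in enumerate(predicted_order)}
    true_rank = {box: i for i, box in enumerate(true_order)}
    pairs = list(combinations(true_rank, 2))
    correct = sum(
        (pred_rank[a] < pred_rank[b]) == (true_rank[a] < true_rank[b])
        for a, b in pairs
    )
    return correct / len(pairs)
```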
445
+
446
+ ## Table Recognition
447
+
448
+ | Model | Row Intersection | Col Intersection | Time Per Image |
449
+ |-------------------|--------------------|--------------------|------------------|
450
+ | Surya | 1 | 0.98625 | 0.30202 |
451
+ | Table transformer | 0.84 | 0.86857 | 0.08082 |
452
+
453
+ Higher is better for intersection, which the percentage of the actual row/column overlapped by the predictions. This benchmark is mostly a sanity check - there is a more rigorous one in [marker](https://www.github.com/VikParuchuri/marker)
454
+
455
+ **Methodology**
456
+
457
+ The benchmark uses a subset of [Fintabnet](https://developer.ibm.com/exchanges/data/all/fintabnet/) from IBM. It has labeled rows and columns. After table recognition is run, the predicted rows and columns are compared to the ground truth. There is an additional penalty for predicting too many or too few rows/columns.
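A minimal sketch of the row intersection score, reusing the `overlap_fraction` helper from the detection methodology sketch above (the extra over/under-prediction penalty is omitted):

```python
def mean_intersection(gt_rows, pred_rows):
    # For each ground-truth row bbox, take the best overlap fraction
    # achieved by any predicted row, then average over rows.
    scores = [
        max((overlap_fraction(gt, pred) for pred in pred_rows), default=0.0)
        for gt in gt_rows
    ]
    return sum(scores) / len(scores)
```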
## LaTeX OCR

| Method | edit ⬇   | time taken (s) ⬇ |
|--------|----------|------------------|
| texify | 0.122617 | 35.6345          |

This runs texify on a ground truth set of LaTeX, then computes the edit distance. This is a bit noisy, since two LaTeX strings that render the same can contain different symbols.
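A minimal sketch of a normalized edit distance of the kind reported above (a plain Levenshtein distance divided by the longer string's length; the benchmark's exact normalization may differ):

```python
def normalized_edit_distance(pred, truth):
    # Classic dynamic-programming Levenshtein distance, normalized so
    # 0 means identical strings and 1 means nothing matches.
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(pred, start=1):
        cur = [i]
        for j, t in enumerate(truth, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (p != t),  # substitution
            ))
        prev = cur
    return prev[-1] / max(len(pred), len(truth), 1)
```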
## Running your own benchmarks

You can benchmark the performance of surya on your machine.

- Follow the manual install instructions above.
- `poetry install --group dev` - installs dev dependencies

**Text line detection**

This will evaluate tesseract and surya for text line detection across a randomly sampled set of images from [doclaynet](https://huggingface.co/datasets/vikp/doclaynet_bench).

```shell
python benchmark/detection.py --max_rows 256
```

- `--max_rows` controls how many images to process for the benchmark
- `--debug` will render images and detected bboxes
- `--pdf_path` will let you specify a pdf to benchmark instead of the default data
- `--results_dir` will let you specify a directory to save results to instead of the default one

**Text recognition**

This will evaluate surya and optionally tesseract on multilingual pdfs from common crawl (with synthetic data for missing languages).

```shell
python benchmark/recognition.py --tesseract
```

- `--max_rows` controls how many images to process for the benchmark
- `--debug 2` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one
- `--tesseract` will run the benchmark with tesseract. You have to run `sudo apt-get install tesseract-ocr-all` to install all tesseract data, and set `TESSDATA_PREFIX` to the path to the tesseract data folder.

- Set `RECOGNITION_BATCH_SIZE=864` to use the same batch size as the benchmark.
- Set `RECOGNITION_BENCH_DATASET_NAME=vikp/rec_bench_hist` to use the historical document data for benchmarking. This data comes from the [tapuscorpus](https://github.com/HTR-United/tapuscorpus).

**Layout analysis**

This will evaluate surya on the publaynet dataset.

```shell
python benchmark/layout.py
```

- `--max_rows` controls how many images to process for the benchmark
- `--debug` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one

**Reading Order**

```shell
python benchmark/ordering.py
```

- `--max_rows` controls how many images to process for the benchmark
- `--debug` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one

**Table Recognition**

```shell
python benchmark/table_recognition.py --max_rows 1024 --tatr
```

- `--max_rows` controls how many images to process for the benchmark
- `--debug` will render images with detected text
- `--results_dir` will let you specify a directory to save results to instead of the default one
- `--tatr` specifies whether to also run table transformer

**LaTeX OCR**

```shell
python benchmark/texify.py --max_rows 128
```

- `--max_rows` controls how many images to process for the benchmark
- `--results_dir` will let you specify a directory to save results to instead of the default one

# Training

Text detection was trained on 4x A6000s for 3 days. It used a diverse set of images as training data. It was trained from scratch using a modified efficientvit architecture for semantic segmentation.

Text recognition was trained on 4x A6000s for 2 weeks. It was trained using a modified donut model (GQA, MoE layer, UTF-16 decoding, layer config changes).

# Thanks

This work would not have been possible without amazing open source AI work:

- [Segformer](https://arxiv.org/pdf/2105.15203.pdf) from NVIDIA
- [EfficientViT](https://github.com/mit-han-lab/efficientvit) from MIT
- [timm](https://github.com/huggingface/pytorch-image-models) from Ross Wightman
- [Donut](https://github.com/clovaai/donut) from Naver
- [transformers](https://github.com/huggingface/transformers) from huggingface
- [CRAFT](https://github.com/clovaai/CRAFT-pytorch), a great scene text detection model

Thank you to everyone who makes open source AI possible.

# Citation

If you use surya (or the associated models) in your work or research, please consider citing us using the following BibTeX entry:

```bibtex
@misc{paruchuri2025surya,
  author = {Vikas Paruchuri and Datalab Team},
  title = {Surya: A lightweight document OCR and analysis toolkit},
  year = {2025},
  howpublished = {\url{https://github.com/VikParuchuri/surya}},
  note = {GitHub repository},
}
```
layout/2025_02_18/.gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
layout/2025_02_18/README.md ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-nc-sa-4.0
---
layout/2025_02_18/config.json ADDED
@@ -0,0 +1,267 @@
{
  "_name_or_path": "datalab-to/layout_order_hr3",
  "architectures": [
    "SuryaLayoutModel"
  ],
  "decoder": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_bias": false,
    "attention_dropout": 0.0,
    "attention_window_size": 16,
    "aux_heads": 0,
    "bad_words_ids": null,
    "bbox_size": 1024,
    "begin_suppress_tokens": null,
    "block_types": [
      "attention"
    ],
    "bos_token_id": 1,
    "causal": true,
    "chunk_size_feed_forward": 0,
    "conv1d_width": 4,
    "cross_attention_hidden_size": null,
    "cross_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "double_residual_flow": true,
    "dropout": 0.1,
    "early_stopping": false,
    "encoder_cross_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "encoder_hidden_size": 1024,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 1,
    "exponential_decay_length_penalty": null,
    "final_w_init_variance_scale": 0.2,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "global_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "head_dim": 64,
    "hidden_activation": "gelu_pytorch_tanh",
    "hidden_size": 512,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "img_size_bucket": 100,
    "init_std": 0.02,
    "intermediate_size": 2048,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "label_count": 20,
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "logits_soft_cap": 30.0,
    "lru_width": 512,
    "max_length": 20,
    "max_pause_tokens": 0,
    "min_length": 0,
    "model_type": "surya_layout",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 8,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 10,
    "num_key_value_heads": 4,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 0,
    "pause_token_count": 0,
    "pause_token_id": 2,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-06,
    "rope_theta": 10000.0,
    "self_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "sep_token_id": null,
    "skew_scaler": 512,
    "special_token_count": 3,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "vocab_size": 1025,
    "w_init_variance_scale": 0.01
  },
  "decoder_end_token_id": 1,
  "decoder_start_token_id": 1,
  "encoder": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_probs_dropout_prob": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "depths": [
      2,
      2,
      12,
      2
    ],
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.1,
    "early_stopping": false,
    "embed_dim": 128,
    "encoder_length": 1024,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.0,
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": [
      768,
      768
    ],
    "initializer_range": 0.02,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "mlp_ratio": 4.0,
    "model_type": "donut-swin",
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_heads": [
      4,
      8,
      16,
      32
    ],
    "num_kv_heads": [
      4,
      8,
      16,
      32
    ],
    "num_layers": 4,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 4,
    "prefix": null,
    "problem_type": null,
    "pruned_heads": {},
    "qkv_bias": true,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "sep_token_id": null,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": true,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_absolute_embeddings": false,
    "use_bfloat16": false,
    "use_positional_embeddings": true,
    "window_size": 8
  },
  "eos_token_id": 1,
  "is_encoder_decoder": true,
  "model_type": "vision-encoder-decoder",
  "pad_token_id": 0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.47.1"
}
layout/2025_02_18/manifest.json ADDED
@@ -0,0 +1 @@
{"files": ["model.safetensors", "config.json", "README.md", ".gitattributes", "preprocessor_config.json"]}
layout/2025_02_18/model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dae53c6e227d20e4241d285dac162c6ff04fa5c63e168a36e667749cb3566936
size 252815964
layout/2025_02_18/preprocessor_config.json ADDED
@@ -0,0 +1,33 @@
{
  "do_align_long_axis": false,
  "do_normalize": true,
  "do_pad": true,
  "do_rescale": true,
  "do_resize": true,
  "do_thumbnail": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "SuryaEncoderImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "max_size": {
    "height": 768,
    "width": 768
  },
  "patch_size": [
    4,
    4
  ],
  "resample": 2,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "height": 2560,
    "width": 1920
  }
}
ocr_error_detection/2025_02_18/.gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
ocr_error_detection/2025_02_18/README.md ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-nc-sa-4.0
---
ocr_error_detection/2025_02_18/config.json ADDED
@@ -0,0 +1,26 @@
{
  "_name_or_path": "tarun-menta/ocr_error_detection",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float16",
  "transformers_version": "4.46.3",
  "vocab_size": 119547
}
ocr_error_detection/2025_02_18/manifest.json ADDED
@@ -0,0 +1 @@
{"files": ["model.safetensors", "tokenizer_config.json", "special_tokens_map.json", "config.json", "tokenizer.json", "README.md", "vocab.txt", ".gitattributes"]}
ocr_error_detection/2025_02_18/model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:42cd5698d62dcdd408628fe6d67c3e1981fc90523ca286f358f1dea8e212de3e
size 270664948
ocr_error_detection/2025_02_18/special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
ocr_error_detection/2025_02_18/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
ocr_error_detection/2025_02_18/tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
ocr_error_detection/2025_02_18/vocab.txt ADDED
The diff for this file is too large to render. See raw diff
 
table_recognition/2025_02_18/.gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
table_recognition/2025_02_18/README.md ADDED
@@ -0,0 +1,3 @@
---
license: cc-by-nc-sa-4.0
---
table_recognition/2025_02_18/config.json ADDED
@@ -0,0 +1,292 @@
{
  "_name_or_path": "datalab-to/table_rec_4",
  "architectures": [
    "TableRecEncoderDecoderModel"
  ],
  "decoder": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_bias": false,
    "attention_dropout": 0.0,
    "attention_window_size": 16,
    "aux_heads": 0,
    "bad_words_ids": null,
    "bbox_size": 1024,
    "begin_suppress_tokens": null,
    "block_types": [
      "attention"
    ],
    "bos_token_id": 1,
    "box_embed_size": 448,
    "box_properties": [
      [
        "bbox",
        6,
        "regression"
      ],
      [
        "category",
        5,
        "classification"
      ],
      [
        "merges",
        4,
        "classification"
      ],
      [
        "colspan",
        1,
        "regression"
      ],
      [
        "is_header",
        2,
        "classification"
      ]
    ],
    "causal": true,
    "chunk_size_feed_forward": 0,
    "conv1d_width": 4,
    "cross_attention_hidden_size": null,
    "cross_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "decoder_start_token_id": null,
    "diversity_penalty": 0.0,
    "do_sample": false,
    "double_residual_flow": false,
    "dropout": 0.1,
    "early_stopping": false,
    "encoder_cross_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "encoder_hidden_size": 1024,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": 1,
    "exponential_decay_length_penalty": null,
    "final_w_init_variance_scale": 0.3333333333333333,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "global_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "head_dim": 64,
    "hidden_activation": "gelu_pytorch_tanh",
    "hidden_size": 512,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "init_std": 0.02,
    "intermediate_size": 2048,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "logits_soft_cap": 30.0,
    "lru_width": 512,
    "max_length": 20,
    "min_length": 0,
    "model_type": "surya_tablerec",
    "no_repeat_ngram_size": 0,
    "num_attention_heads": 8,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_hidden_layers": 6,
    "num_key_value_heads": 4,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": 0,
    "pause_token_id": 2,
    "prefix": null,
    "problem_type": null,
    "property_embed_size": 64,
    "pruned_heads": {},
    "query_end_token_id": 4,
    "remove_invalid_values": false,
    "repetition_penalty": 1.0,
    "return_dict": true,
    "return_dict_in_generate": false,
    "rms_norm_eps": 1e-06,
    "rope_theta": 10000.0,
    "self_attn_layers": [
      0,
      1,
      2,
      3,
      4,
      5,
      6,
      7,
      8,
      9
    ],
    "sep_token_id": null,
    "special_token_count": 5,
    "suppress_tokens": null,
    "task_specific_params": null,
    "temperature": 1.0,
    "tf_legacy_loss": false,
    "tie_encoder_decoder": false,
    "tie_word_embeddings": false,
    "tokenizer_class": null,
    "top_k": 50,
    "top_p": 1.0,
    "torch_dtype": null,
    "torchscript": false,
    "typical_p": 1.0,
    "use_bfloat16": false,
    "use_cache": true,
    "vocab_size": 1025,
    "w_init_variance_scale": 0.01
  },
  "decoder_end_token_id": 1,
  "decoder_start_token_id": 1,
  "encoder": {
    "_attn_implementation_autoset": true,
    "_name_or_path": "",
    "add_cross_attention": false,
    "architectures": null,
    "attention_probs_dropout_prob": 0.0,
    "bad_words_ids": null,
    "begin_suppress_tokens": null,
    "bos_token_id": null,
    "chunk_size_feed_forward": 0,
    "cross_attention_hidden_size": null,
    "decoder_start_token_id": null,
    "depths": [
      2,
      2,
      12,
      2
    ],
    "diversity_penalty": 0.0,
    "do_sample": false,
    "drop_path_rate": 0.1,
    "early_stopping": false,
    "embed_dim": 128,
    "encoder_length": 1024,
    "encoder_no_repeat_ngram_size": 0,
    "eos_token_id": null,
    "exponential_decay_length_penalty": null,
    "finetuning_task": null,
    "forced_bos_token_id": null,
    "forced_eos_token_id": null,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.0,
    "hidden_size": 1024,
    "id2label": {
      "0": "LABEL_0",
      "1": "LABEL_1"
    },
    "image_size": [
      768,
      768
    ],
    "initializer_range": 0.02,
    "is_decoder": false,
    "is_encoder_decoder": false,
    "label2id": {
      "LABEL_0": 0,
      "LABEL_1": 1
    },
    "layer_norm_eps": 1e-05,
    "length_penalty": 1.0,
    "max_length": 20,
    "min_length": 0,
    "mlp_ratio": 4.0,
    "model_type": "donut-swin",
    "no_repeat_ngram_size": 0,
    "num_beam_groups": 1,
    "num_beams": 1,
    "num_channels": 3,
    "num_heads": [
      4,
      8,
      16,
      32
    ],
    "num_kv_heads": [
      4,
      8,
      16,
      32
    ],
    "num_layers": 4,
    "num_return_sequences": 1,
    "output_attentions": false,
    "output_hidden_states": false,
    "output_scores": false,
    "pad_token_id": null,
    "patch_size": 4,
    "prefix": null,
    "problem_type": null,
+ "pruned_heads": {},
262
+ "qkv_bias": true,
263
+ "remove_invalid_values": false,
264
+ "repetition_penalty": 1.0,
265
+ "return_dict": true,
266
+ "return_dict_in_generate": false,
267
+ "sep_token_id": null,
268
+ "suppress_tokens": null,
269
+ "task_specific_params": null,
270
+ "temperature": 1.0,
271
+ "tf_legacy_loss": false,
272
+ "tie_encoder_decoder": false,
273
+ "tie_word_embeddings": true,
274
+ "tokenizer_class": null,
275
+ "top_k": 50,
276
+ "top_p": 1.0,
277
+ "torch_dtype": null,
278
+ "torchscript": false,
279
+ "typical_p": 1.0,
280
+ "use_absolute_embeddings": false,
281
+ "use_bfloat16": false,
282
+ "use_positional_embeddings": true,
283
+ "window_size": 8
284
+ },
285
+ "eos_token_id": 1,
286
+ "is_encoder_decoder": true,
287
+ "model_type": "vision-encoder-decoder",
288
+ "pad_token_id": 0,
289
+ "tie_word_embeddings": false,
290
+ "torch_dtype": "float16",
291
+ "transformers_version": "4.47.1"
292
+ }
table_recognition/2025_02_18/manifest.json ADDED
@@ -0,0 +1 @@
+ {"files": ["model.safetensors", "config.json", "README.md", ".gitattributes", "preprocessor_config.json"]}
table_recognition/2025_02_18/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8e5093c424a4c8b98d153519f5240532388e209158f656ee701989174dcad6c4
+ size 211226808
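model.safetensors is stored as a Git LFS pointer: the repo tracks only the sha256 oid and byte size, and the actual weights are fetched separately. A hedged sketch for verifying a downloaded copy against this pointer (the local file path is an assumption) could be:

```python
import hashlib
from pathlib import Path

# Hypothetical path to the resolved (downloaded) weights file.
weights = Path("table_recognition/2025_02_18/model.safetensors")

# Values copied from the LFS pointer above.
expected_oid = "8e5093c424a4c8b98d153519f5240532388e209158f656ee701989174dcad6c4"
expected_size = 211226808

assert weights.stat().st_size == expected_size, "size mismatch"

sha = hashlib.sha256()
with weights.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        sha.update(chunk)
assert sha.hexdigest() == expected_oid, "sha256 mismatch"
print("LFS pointer verified")
```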
table_recognition/2025_02_18/preprocessor_config.json ADDED
@@ -0,0 +1,35 @@
+ {
+ "do_align_long_axis": false,
+ "do_normalize": false,
+ "do_pad": false,
+ "do_rescale": false,
+ "do_resize": false,
+ "do_thumbnail": false,
+ "image_mean": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "image_processor_type": "SuryaEncoderImageProcessor",
+ "image_std": [
+ 0.5,
+ 0.5,
+ 0.5
+ ],
+ "max_size": {
+ "height": 768,
+ "width": 768
+ },
+ "patch_size": [
+ 4,
+ 4
+ ],
+ "processor_class": "SuryaProcessor",
+ "resample": 2,
+ "rescale_factor": 0.00392156862745098,
+ "size": {
+ "height": 2560,
+ "width": 1920
+ },
+ "train": false
+ }
text_detection/2025_05_07/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
text_detection/2025_05_07/README.md ADDED
@@ -0,0 +1,15 @@
+ ---
+ library_name: transformers
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: line_det_2.20
+ results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # line_det_2.20
+
+ Text detection model for [surya](https://github.com/VikParuchuri/surya/)
text_detection/2025_05_07/config.json ADDED
@@ -0,0 +1,54 @@
+ {
+ "_name_or_path": "datalab-to/line_det_2.20",
+ "architectures": [
+ "EfficientViTForSemanticSegmentation"
+ ],
+ "attention_probs_dropout_prob": 0.0,
+ "classifier_dropout_prob": 0.0,
+ "decoder_hidden_size": 512,
+ "decoder_layer_hidden_size": 128,
+ "depths": [
+ 1,
+ 1,
+ 1,
+ 6,
+ 6
+ ],
+ "head_dim": 32,
+ "hidden_dropout_prob": 0.0,
+ "hidden_sizes": [
+ 32,
+ 64,
+ 160,
+ 256
+ ],
+ "initializer_range": 0.02,
+ "layer_norm_eps": 1e-06,
+ "model_type": "efficientvit",
+ "num_channels": 3,
+ "num_classes": 2,
+ "num_stages": 4,
+ "patch_size": [
+ 7,
+ 7
+ ],
+ "pos_weight": 1.0,
+ "semantic_loss_ignore_index": -1,
+ "strides": [
+ 2,
+ 2,
+ 2,
+ 2,
+ 2
+ ],
+ "torch_dtype": "float16",
+ "transformers_version": "4.49.0",
+ "use_focal": false,
+ "widths": [
+ 32,
+ 64,
+ 128,
+ 256,
+ 512
+ ]
+ }
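The config describes an EfficientViT semantic-segmentation model (`EfficientViTForSemanticSegmentation`, a custom architecture registered by surya rather than a stock transformers class) with two output classes, float16 weights, and a five-stage backbone. A small sketch for inspecting the checkpoint without loading the full model (assuming the safetensors file is available locally and torch is installed) could be:

```python
from safetensors import safe_open

# Hypothetical local path to the detection weights.
path = "text_detection/2025_05_07/model.safetensors"

with safe_open(path, framework="pt", device="cpu") as f:
    total = 0
    for name in f.keys():
        tensor = f.get_tensor(name)
        total += tensor.numel()
        print(name, tuple(tensor.shape), tensor.dtype)
    print(f"total parameters: {total:,}")
```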
text_detection/2025_05_07/manifest.json ADDED
@@ -0,0 +1 @@
+ {"files": ["model.safetensors", "preprocessor_config.json", ".gitattributes", "README.md", "training_args.bin", "config.json"]}
text_detection/2025_05_07/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:38c3749eeb5f06fc93ed71eeee5cbd86b1945d08f8f74746bda035d41324bd3e
+ size 76930732
text_detection/2025_05_07/preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+ "do_normalize": true,
+ "do_reduce_labels": true,
+ "do_rescale": true,
+ "do_resize": true,
+ "image_mean": [
+ 0.485,
+ 0.456,
+ 0.406
+ ],
+ "image_processor_type": "SegformerImageProcessor",
+ "image_std": [
+ 0.229,
+ 0.224,
+ 0.225
+ ],
+ "resample": 2,
+ "rescale_factor": 0.00392156862745098,
+ "size": {
+ "height": 1200,
+ "width": 1200
+ }
+ }
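The detection preprocessor is a standard `SegformerImageProcessor`: images are resized to 1200×1200, rescaled by 1/255, and normalized with ImageNet mean/std. Since that class ships with transformers, a hedged sketch of running the preprocessing on its own (the image path and snapshot directory are assumptions) might look like:

```python
from PIL import Image
from transformers import SegformerImageProcessor

# Hypothetical local paths.
snapshot_dir = "text_detection/2025_05_07"
image = Image.open("page.png").convert("RGB")

# from_pretrained picks up the preprocessor_config.json shown above.
processor = SegformerImageProcessor.from_pretrained(snapshot_dir)
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # e.g. torch.Size([1, 3, 1200, 1200])
```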
text_detection/2025_05_07/training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7e605d76a38a3739d8372218370b9d5a008b8ff4299cfe61fb2a46e1fa91b525
+ size 5624
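training_args.bin is the pickled `TrainingArguments` object saved by the transformers Trainer, not model weights. A sketch for inspecting it (assuming transformers is installed, the file is local, and a recent torch version that accepts `weights_only`; since it is a pickle, only load files you trust):

```python
import torch

# Hypothetical local path to the pickled training arguments.
args = torch.load("text_detection/2025_05_07/training_args.bin",
                  map_location="cpu", weights_only=False)
print(type(args).__name__)                      # TrainingArguments
print(args.learning_rate, args.num_train_epochs)
```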
text_recognition/2025_05_16/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
text_recognition/2025_05_16/README.md ADDED
@@ -0,0 +1,4 @@
+ ---
+ library_name: transformers
+ license: cc-by-nc-sa-4.0
+ ---
text_recognition/2025_05_16/added_tokens.json ADDED
@@ -0,0 +1,24 @@
+ {
+ "</tool_call>": 151658,
+ "<tool_call>": 151657,
+ "<|box_end|>": 151649,
+ "<|box_start|>": 151648,
+ "<|endoftext|>": 151643,
+ "<|file_sep|>": 151664,
+ "<|fim_middle|>": 151660,
+ "<|fim_pad|>": 151662,
+ "<|fim_prefix|>": 151659,
+ "<|fim_suffix|>": 151661,
+ "<|im_end|>": 151645,
+ "<|im_start|>": 151644,
+ "<|image_pad|>": 151655,
+ "<|object_ref_end|>": 151647,
+ "<|object_ref_start|>": 151646,
+ "<|quad_end|>": 151651,
+ "<|quad_start|>": 151650,
+ "<|repo_name|>": 151663,
+ "<|video_pad|>": 151656,
+ "<|vision_end|>": 151653,
+ "<|vision_pad|>": 151654,
+ "<|vision_start|>": 151652
+ }
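added_tokens.json maps the extra special tokens that extend the base Qwen2 vocabulary to their ids. A minimal sketch for looking ids back up as tokens (the file path is an assumption) could be:

```python
import json

# Hypothetical local path to the recognition tokenizer's added tokens.
with open("text_recognition/2025_05_16/added_tokens.json") as f:
    token_to_id = json.load(f)

# Invert the mapping to resolve generated ids back to token strings.
id_to_token = {v: k for k, v in token_to_id.items()}
print(id_to_token[151644])            # -> "<|im_start|>"
print(token_to_id["<|endoftext|>"])   # -> 151643
```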
text_recognition/2025_05_16/config.json ADDED
@@ -0,0 +1,2290 @@
1
+ {
2
+ "architectures": [
3
+ "MultimodalFoundationModel"
4
+ ],
5
+ "bbox_embed_size": 64,
6
+ "bbox_size": 1024,
7
+ "blank_bbox_token_id": 1025,
8
+ "bos_token_id": {
9
+ "block_without_boxes": 151673,
10
+ "ocr_with_boxes": 151671,
11
+ "ocr_without_boxes": 151672
12
+ },
13
+ "decoder": {
14
+ "_attn_implementation_autoset": true,
15
+ "_name_or_path": "",
16
+ "add_cross_attention": false,
17
+ "architectures": null,
18
+ "attention_dropout": 0.0,
19
+ "bad_words_ids": null,
20
+ "begin_suppress_tokens": null,
21
+ "bos_token_id": null,
22
+ "chunk_size_feed_forward": 0,
23
+ "cross_attention_hidden_size": null,
24
+ "decoder_start_token_id": null,
25
+ "diversity_penalty": 0.0,
26
+ "do_sample": false,
27
+ "early_stopping": false,
28
+ "encoder_no_repeat_ngram_size": 0,
29
+ "eos_token_id": null,
30
+ "exponential_decay_length_penalty": null,
31
+ "finetuning_task": null,
32
+ "forced_bos_token_id": null,
33
+ "forced_eos_token_id": null,
34
+ "hidden_act": "silu",
35
+ "hidden_size": 1280,
36
+ "id2label": {
37
+ "0": "LABEL_0",
38
+ "1": "LABEL_1"
39
+ },
40
+ "initializer_range": 0.02,
41
+ "intermediate_size": 6400,
42
+ "is_decoder": false,
43
+ "is_encoder_decoder": false,
44
+ "label2id": {
45
+ "LABEL_0": 0,
46
+ "LABEL_1": 1
47
+ },
48
+ "length_penalty": 1.0,
49
+ "max_length": 20,
50
+ "max_position_embeddings": 131072,
51
+ "max_window_layers": 12,
52
+ "min_length": 0,
53
+ "model_type": "qwen2",
54
+ "no_repeat_ngram_size": 0,
55
+ "num_attention_heads": 16,
56
+ "num_beam_groups": 1,
57
+ "num_beams": 1,
58
+ "num_hidden_layers": 12,
59
+ "num_key_value_heads": 4,
60
+ "num_return_sequences": 1,
61
+ "output_attentions": false,
62
+ "output_hidden_states": false,
63
+ "output_scores": false,
64
+ "pad_token_id": null,
65
+ "prefix": null,
66
+ "problem_type": null,
67
+ "pruned_heads": {},
68
+ "remove_invalid_values": false,
69
+ "repetition_penalty": 1.0,
70
+ "return_dict": true,
71
+ "return_dict_in_generate": false,
72
+ "rms_norm_eps": 1e-06,
73
+ "rope_scaling": null,
74
+ "rope_theta": 1000000.0,
75
+ "sep_token_id": null,
76
+ "sliding_window": 131072,
77
+ "suppress_tokens": null,
78
+ "task_specific_params": null,
79
+ "temperature": 1.0,
80
+ "tf_legacy_loss": false,
81
+ "tie_encoder_decoder": false,
82
+ "tie_word_embeddings": true,
83
+ "tokenizer_class": null,
84
+ "top_k": 50,
85
+ "top_p": 1.0,
86
+ "torch_dtype": null,
87
+ "torchscript": false,
88
+ "typical_p": 1.0,
89
+ "use_bfloat16": false,
90
+ "use_cache": true,
91
+ "use_sliding_window": false,
92
+ "vocab_size": 218225
93
+ },
94
+ "embedding_dropout_prob": 0.05,
95
+ "eoi_token_id": 151666,
96
+ "eos_token_id": 151665,
97
+ "hidden_size": 1280,
98
+ "image_embed_encoding_multiplier": 256,
99
+ "image_embed_encoding_size": 1024,
100
+ "image_token_id": 151667,
101
+ "max_sequence_length": 1024,
102
+ "merge_size": 2,
103
+ "model_type": "surya-multimodal-foundation",
104
+ "num_attention_heads": 16,
105
+ "num_hidden_layers": 12,
106
+ "num_key_value_heads": 4,
107
+ "num_register_tokens": 4,
108
+ "pad_token_id": 151668,
109
+ "patch_size": 14,
110
+ "register_token_ids": [
111
+ 151674,
112
+ 151675,
113
+ 151676,
114
+ 151677
115
+ ],
116
+ "special_ocr_tokens": {
117
+ "all": [
118
+ "</S>",
119
+ "<EOI>",
120
+ "<IMAGE>",
121
+ "<PAD>",
122
+ "<NOP>",
123
+ "<ROT>",
124
+ "<OCR-WB>",
125
+ "<OCR-WOB>",
126
+ "<BLOCKS-WOB>",
127
+ "<REG1>",
128
+ "<REG2>",
129
+ "<REG3>",
130
+ "<REG4>",
131
+ "<NO-MATH>",
132
+ "<b>",
133
+ "</b>",
134
+ "<del>",
135
+ "</del>",
136
+ "<mark>",
137
+ "</mark>",
138
+ "<sub>",
139
+ "</sub>",
140
+ "<br>",
141
+ "<sup>",
142
+ "</sup>",
143
+ "<i>",
144
+ "</i>",
145
+ "<small>",
146
+ "</small>",
147
+ "<u>",
148
+ "</u>",
149
+ "<code>",
150
+ "</code>",
151
+ "<math>",
152
+ "<math display='block'>",
153
+ "<math display=\"block\">",
154
+ "<math display='inline'>",
155
+ "<math display=\"inline\">",
156
+ "</math>",
157
+ "<SCRIPT-LATIN>",
158
+ "<SCRIPT-PUNCTUATION>",
159
+ "<SCRIPT-CYRILLIC>",
160
+ "<SCRIPT-ARABIC>",
161
+ "<SCRIPT-CHINESE>",
162
+ "<SCRIPT-JAPANESE>",
163
+ "<SCRIPT-KOREAN>",
164
+ "<SCRIPT-SYMBOLS>",
165
+ "<SCRIPT-GREEK>",
166
+ "<SCRIPT-ARMENIAN>",
167
+ "<SCRIPT-HEBREW>",
168
+ "<SCRIPT-DEVANAGARI>",
169
+ "<SCRIPT-BENGALI>",
170
+ "<SCRIPT-GURMUKHI>",
171
+ "<SCRIPT-GUJARATI>",
172
+ "<SCRIPT-ORIYA>",
173
+ "<SCRIPT-TAMIL>",
174
+ "<SCRIPT-TELUGU>",
175
+ "<SCRIPT-KANNADA>",
176
+ "<SCRIPT-MALAYALAM>",
177
+ "<SCRIPT-SINHALA>",
178
+ "<SCRIPT-THAI>",
179
+ "<SCRIPT-LAO>",
180
+ "<SCRIPT-MYANMAR>",
181
+ "<SCRIPT-GEORGIAN>",
182
+ "<SCRIPT-ETHIOPIC>",
183
+ "<SCRIPT-KHMER>",
184
+ "<SCRIPT-MONGOLIAN>",
185
+ "<SCRIPT-MATH>",
186
+ "<reserved_0>",
187
+ "<reserved_1>",
188
+ "<reserved_2>",
189
+ "<reserved_3>",
190
+ "<reserved_4>",
191
+ "<reserved_5>",
192
+ "<reserved_6>",
193
+ "<reserved_7>",
194
+ "<reserved_8>",
195
+ "<reserved_9>",
196
+ "<reserved_10>",
197
+ "<reserved_11>",
198
+ "<reserved_12>",
199
+ "<reserved_13>",
200
+ "<reserved_14>",
201
+ "<reserved_15>",
202
+ "<reserved_16>",
203
+ "<reserved_17>",
204
+ "<reserved_18>",
205
+ "<reserved_19>",
206
+ "<reserved_20>",
207
+ "<reserved_21>",
208
+ "<reserved_22>",
209
+ "<reserved_23>",
210
+ "<reserved_24>",
211
+ "<reserved_25>",
212
+ "<reserved_26>",
213
+ "<reserved_27>",
214
+ "<reserved_28>",
215
+ "<reserved_29>",
216
+ "<reserved_30>",
217
+ "<reserved_31>",
218
+ "<reserved_32>",
219
+ "<reserved_33>",
220
+ "<reserved_34>",
221
+ "<reserved_35>",
222
+ "<reserved_36>",
223
+ "<reserved_37>",
224
+ "<reserved_38>",
225
+ "<reserved_39>",
226
+ "<reserved_40>",
227
+ "<reserved_41>",
228
+ "<reserved_42>",
229
+ "<reserved_43>",
230
+ "<reserved_44>",
231
+ "<reserved_45>",
232
+ "<reserved_46>",
233
+ "<reserved_47>",
234
+ "<reserved_48>",
235
+ "<reserved_49>",
236
+ "<reserved_50>",
237
+ "<reserved_51>",
238
+ "<reserved_52>",
239
+ "<reserved_53>",
240
+ "<reserved_54>",
241
+ "<reserved_55>",
242
+ "<reserved_56>",
243
+ "<reserved_57>",
244
+ "<reserved_58>",
245
+ "<reserved_59>",
246
+ "<reserved_60>",
247
+ "<reserved_61>",
248
+ "<reserved_62>",
249
+ "<reserved_63>",
250
+ "<reserved_64>",
251
+ "<reserved_65>",
252
+ "<reserved_66>",
253
+ "<reserved_67>",
254
+ "<reserved_68>",
255
+ "<reserved_69>",
256
+ "<reserved_70>",
257
+ "<reserved_71>",
258
+ "<reserved_72>",
259
+ "<reserved_73>",
260
+ "<reserved_74>",
261
+ "<reserved_75>",
262
+ "<reserved_76>",
263
+ "<reserved_77>",
264
+ "<reserved_78>",
265
+ "<reserved_79>",
266
+ "<reserved_80>",
267
+ "<reserved_81>",
268
+ "<reserved_82>",
269
+ "<reserved_83>",
270
+ "<reserved_84>",
271
+ "<reserved_85>",
272
+ "<reserved_86>",
273
+ "<reserved_87>",
274
+ "<reserved_88>",
275
+ "<reserved_89>",
276
+ "<reserved_90>",
277
+ "<reserved_91>",
278
+ "<reserved_92>",
279
+ "<reserved_93>",
280
+ "<reserved_94>",
281
+ "<reserved_95>",
282
+ "<reserved_96>",
283
+ "<reserved_97>",
284
+ "<reserved_98>",
285
+ "<reserved_99>",
286
+ "<reserved_100>",
287
+ "<reserved_101>",
288
+ "<reserved_102>",
289
+ "<reserved_103>",
290
+ "<reserved_104>",
291
+ "<reserved_105>",
292
+ "<reserved_106>",
293
+ "<reserved_107>",
294
+ "<reserved_108>",
295
+ "<reserved_109>",
296
+ "<reserved_110>",
297
+ "<reserved_111>",
298
+ "<reserved_112>",
299
+ "<reserved_113>",
300
+ "<reserved_114>",
301
+ "<reserved_115>",
302
+ "<reserved_116>",
303
+ "<reserved_117>",
304
+ "<reserved_118>",
305
+ "<reserved_119>",
306
+ "<reserved_120>",
307
+ "<reserved_121>",
308
+ "<reserved_122>",
309
+ "<reserved_123>",
310
+ "<reserved_124>",
311
+ "<reserved_125>",
312
+ "<reserved_126>",
313
+ "<reserved_127>",
314
+ "<reserved_128>",
315
+ "<reserved_129>",
316
+ "<reserved_130>",
317
+ "<reserved_131>",
318
+ "<reserved_132>",
319
+ "<reserved_133>",
320
+ "<reserved_134>",
321
+ "<reserved_135>",
322
+ "<reserved_136>",
323
+ "<reserved_137>",
324
+ "<reserved_138>",
325
+ "<reserved_139>",
326
+ "<reserved_140>",
327
+ "<reserved_141>",
328
+ "<reserved_142>",
329
+ "<reserved_143>",
330
+ "<reserved_144>",
331
+ "<reserved_145>",
332
+ "<reserved_146>",
333
+ "<reserved_147>",
334
+ "<reserved_148>",
335
+ "<reserved_149>",
336
+ "<reserved_150>",
337
+ "<reserved_151>",
338
+ "<reserved_152>",
339
+ "<reserved_153>",
340
+ "<reserved_154>",
341
+ "<reserved_155>",
342
+ "<reserved_156>",
343
+ "<reserved_157>",
344
+ "<reserved_158>",
345
+ "<reserved_159>",
346
+ "<reserved_160>",
347
+ "<reserved_161>",
348
+ "<reserved_162>",
349
+ "<reserved_163>",
350
+ "<reserved_164>",
351
+ "<reserved_165>",
352
+ "<reserved_166>",
353
+ "<reserved_167>",
354
+ "<reserved_168>",
355
+ "<reserved_169>",
356
+ "<reserved_170>",
357
+ "<reserved_171>",
358
+ "<reserved_172>",
359
+ "<reserved_173>",
360
+ "<reserved_174>",
361
+ "<reserved_175>",
362
+ "<reserved_176>",
363
+ "<reserved_177>",
364
+ "<reserved_178>",
365
+ "<reserved_179>",
366
+ "<reserved_180>",
367
+ "<reserved_181>",
368
+ "<reserved_182>",
369
+ "<reserved_183>",
370
+ "<reserved_184>",
371
+ "<reserved_185>",
372
+ "<reserved_186>",
373
+ "<reserved_187>",
374
+ "<reserved_188>",
375
+ "<reserved_189>",
376
+ "<reserved_190>",
377
+ "<reserved_191>",
378
+ "<reserved_192>",
379
+ "<reserved_193>",
380
+ "<reserved_194>",
381
+ "<reserved_195>",
382
+ "<reserved_196>",
383
+ "<reserved_197>",
384
+ "<reserved_198>",
385
+ "<reserved_199>",
386
+ "<reserved_200>",
387
+ "<reserved_201>",
388
+ "<reserved_202>",
389
+ "<reserved_203>",
390
+ "<reserved_204>",
391
+ "<reserved_205>",
392
+ "<reserved_206>",
393
+ "<reserved_207>",
394
+ "<reserved_208>",
395
+ "<reserved_209>",
396
+ "<reserved_210>",
397
+ "<reserved_211>",
398
+ "<reserved_212>",
399
+ "<reserved_213>",
400
+ "<reserved_214>",
401
+ "<reserved_215>",
402
+ "<reserved_216>",
403
+ "<reserved_217>",
404
+ "<reserved_218>",
405
+ "<reserved_219>",
406
+ "<reserved_220>",
407
+ "<reserved_221>",
408
+ "<reserved_222>",
409
+ "<reserved_223>",
410
+ "<reserved_224>",
411
+ "<reserved_225>",
412
+ "<reserved_226>",
413
+ "<reserved_227>",
414
+ "<reserved_228>",
415
+ "<reserved_229>",
416
+ "<reserved_230>",
417
+ "<reserved_231>",
418
+ "<reserved_232>",
419
+ "<reserved_233>",
420
+ "<reserved_234>",
421
+ "<reserved_235>",
422
+ "<reserved_236>",
423
+ "<reserved_237>",
424
+ "<reserved_238>",
425
+ "<reserved_239>",
426
+ "<reserved_240>",
427
+ "<reserved_241>",
428
+ "<reserved_242>",
429
+ "<reserved_243>",
430
+ "<reserved_244>",
431
+ "<reserved_245>",
432
+ "<reserved_246>",
433
+ "<reserved_247>",
434
+ "<reserved_248>",
435
+ "<reserved_249>",
436
+ "<reserved_250>",
437
+ "<reserved_251>",
438
+ "<reserved_252>",
439
+ "<reserved_253>",
440
+ "<reserved_254>",
441
+ "<reserved_255>",
442
+ "<reserved_256>",
443
+ "<reserved_257>",
444
+ "<reserved_258>",
445
+ "<reserved_259>",
446
+ "<reserved_260>",
447
+ "<reserved_261>",
448
+ "<reserved_262>",
449
+ "<reserved_263>",
450
+ "<reserved_264>",
451
+ "<reserved_265>",
452
+ "<reserved_266>",
453
+ "<reserved_267>",
454
+ "<reserved_268>",
455
+ "<reserved_269>",
456
+ "<reserved_270>",
457
+ "<reserved_271>",
458
+ "<reserved_272>",
459
+ "<reserved_273>",
460
+ "<reserved_274>",
461
+ "<reserved_275>",
462
+ "<reserved_276>",
463
+ "<reserved_277>",
464
+ "<reserved_278>",
465
+ "<reserved_279>",
466
+ "<reserved_280>",
467
+ "<reserved_281>",
468
+ "<reserved_282>",
469
+ "<reserved_283>",
470
+ "<reserved_284>",
471
+ "<reserved_285>",
472
+ "<reserved_286>",
473
+ "<reserved_287>",
474
+ "<reserved_288>",
475
+ "<reserved_289>",
476
+ "<reserved_290>",
477
+ "<reserved_291>",
478
+ "<reserved_292>",
479
+ "<reserved_293>",
480
+ "<reserved_294>",
481
+ "<reserved_295>",
482
+ "<reserved_296>",
483
+ "<reserved_297>",
484
+ "<reserved_298>",
485
+ "<reserved_299>",
486
+ "<reserved_300>",
487
+ "<reserved_301>",
488
+ "<reserved_302>",
489
+ "<reserved_303>",
490
+ "<reserved_304>",
491
+ "<reserved_305>",
492
+ "<reserved_306>",
493
+ "<reserved_307>",
494
+ "<reserved_308>",
495
+ "<reserved_309>",
496
+ "<reserved_310>",
497
+ "<reserved_311>",
498
+ "<reserved_312>",
499
+ "<reserved_313>",
500
+ "<reserved_314>",
501
+ "<reserved_315>",
502
+ "<reserved_316>",
503
+ "<reserved_317>",
504
+ "<reserved_318>",
505
+ "<reserved_319>",
506
+ "<reserved_320>",
507
+ "<reserved_321>",
508
+ "<reserved_322>",
509
+ "<reserved_323>",
510
+ "<reserved_324>",
511
+ "<reserved_325>",
512
+ "<reserved_326>",
513
+ "<reserved_327>",
514
+ "<reserved_328>",
515
+ "<reserved_329>",
516
+ "<reserved_330>",
517
+ "<reserved_331>",
518
+ "<reserved_332>",
519
+ "<reserved_333>",
520
+ "<reserved_334>",
521
+ "<reserved_335>",
522
+ "<reserved_336>",
523
+ "<reserved_337>",
524
+ "<reserved_338>",
525
+ "<reserved_339>",
526
+ "<reserved_340>",
527
+ "<reserved_341>",
528
+ "<reserved_342>",
529
+ "<reserved_343>",
530
+ "<reserved_344>",
531
+ "<reserved_345>",
532
+ "<reserved_346>",
533
+ "<reserved_347>",
534
+ "<reserved_348>",
535
+ "<reserved_349>",
536
+ "<reserved_350>",
537
+ "<reserved_351>",
538
+ "<reserved_352>",
539
+ "<reserved_353>",
540
+ "<reserved_354>",
541
+ "<reserved_355>",
542
+ "<reserved_356>",
543
+ "<reserved_357>",
544
+ "<reserved_358>",
545
+ "<reserved_359>",
546
+ "<reserved_360>",
547
+ "<reserved_361>",
548
+ "<reserved_362>",
549
+ "<reserved_363>",
550
+ "<reserved_364>",
551
+ "<reserved_365>",
552
+ "<reserved_366>",
553
+ "<reserved_367>",
554
+ "<reserved_368>",
555
+ "<reserved_369>",
556
+ "<reserved_370>",
557
+ "<reserved_371>",
558
+ "<reserved_372>",
559
+ "<reserved_373>",
560
+ "<reserved_374>",
561
+ "<reserved_375>",
562
+ "<reserved_376>",
563
+ "<reserved_377>",
564
+ "<reserved_378>",
565
+ "<reserved_379>",
566
+ "<reserved_380>",
567
+ "<reserved_381>",
568
+ "<reserved_382>",
569
+ "<reserved_383>",
570
+ "<reserved_384>",
571
+ "<reserved_385>",
572
+ "<reserved_386>",
573
+ "<reserved_387>",
574
+ "<reserved_388>",
575
+ "<reserved_389>",
576
+ "<reserved_390>",
577
+ "<reserved_391>",
578
+ "<reserved_392>",
579
+ "<reserved_393>",
580
+ "<reserved_394>",
581
+ "<reserved_395>",
582
+ "<reserved_396>",
583
+ "<reserved_397>",
584
+ "<reserved_398>",
585
+ "<reserved_399>",
586
+ "<reserved_400>",
587
+ "<reserved_401>",
588
+ "<reserved_402>",
589
+ "<reserved_403>",
590
+ "<reserved_404>",
591
+ "<reserved_405>",
592
+ "<reserved_406>",
593
+ "<reserved_407>",
594
+ "<reserved_408>",
595
+ "<reserved_409>",
596
+ "<reserved_410>",
597
+ "<reserved_411>",
598
+ "<reserved_412>",
599
+ "<reserved_413>",
600
+ "<reserved_414>",
601
+ "<reserved_415>",
602
+ "<reserved_416>",
603
+ "<reserved_417>",
604
+ "<reserved_418>",
605
+ "<reserved_419>",
606
+ "<reserved_420>",
607
+ "<reserved_421>",
608
+ "<reserved_422>",
609
+ "<reserved_423>",
610
+ "<reserved_424>",
611
+ "<reserved_425>",
612
+ "<reserved_426>",
613
+ "<reserved_427>",
614
+ "<reserved_428>",
615
+ "<reserved_429>",
616
+ "<reserved_430>",
617
+ "<reserved_431>",
618
+ "<reserved_432>",
619
+ "<reserved_433>",
620
+ "<reserved_434>",
621
+ "<reserved_435>",
622
+ "<reserved_436>",
623
+ "<reserved_437>",
624
+ "<reserved_438>",
625
+ "<reserved_439>",
626
+ "<reserved_440>",
627
+ "<reserved_441>",
628
+ "<reserved_442>",
629
+ "<reserved_443>",
630
+ "<reserved_444>",
631
+ "<reserved_445>",
632
+ "<reserved_446>",
633
+ "<reserved_447>",
634
+ "<reserved_448>",
635
+ "<reserved_449>",
636
+ "<reserved_450>",
637
+ "<reserved_451>",
638
+ "<reserved_452>",
639
+ "<reserved_453>",
640
+ "<reserved_454>",
641
+ "<reserved_455>",
642
+ "<reserved_456>",
643
+ "<reserved_457>",
644
+ "<reserved_458>",
645
+ "<reserved_459>",
646
+ "<reserved_460>",
647
+ "<reserved_461>",
648
+ "<reserved_462>",
649
+ "<reserved_463>",
650
+ "<reserved_464>",
651
+ "<reserved_465>",
652
+ "<reserved_466>",
653
+ "<reserved_467>",
654
+ "<reserved_468>",
655
+ "<reserved_469>",
656
+ "<reserved_470>",
657
+ "<reserved_471>",
658
+ "<reserved_472>",
659
+ "<reserved_473>",
660
+ "<reserved_474>",
661
+ "<reserved_475>",
662
+ "<reserved_476>",
663
+ "<reserved_477>",
664
+ "<reserved_478>",
665
+ "<reserved_479>",
666
+ "<reserved_480>",
667
+ "<reserved_481>",
668
+ "<reserved_482>",
669
+ "<reserved_483>",
670
+ "<reserved_484>",
671
+ "<reserved_485>",
672
+ "<reserved_486>",
673
+ "<reserved_487>",
674
+ "<reserved_488>",
675
+ "<reserved_489>",
676
+ "<reserved_490>",
677
+ "<reserved_491>",
678
+ "<reserved_492>",
679
+ "<reserved_493>",
680
+ "<reserved_494>",
681
+ "<reserved_495>",
682
+ "<reserved_496>",
683
+ "<reserved_497>",
684
+ "<reserved_498>",
685
+ "<reserved_499>",
686
+ "<reserved_500>",
687
+ "<reserved_501>",
688
+ "<reserved_502>",
689
+ "<reserved_503>",
690
+ "<reserved_504>",
691
+ "<reserved_505>",
692
+ "<reserved_506>",
693
+ "<reserved_507>",
694
+ "<reserved_508>",
695
+ "<reserved_509>",
696
+ "<reserved_510>",
697
+ "<reserved_511>",
698
+ "<reserved_512>",
699
+ "<reserved_513>",
700
+ "<reserved_514>",
701
+ "<reserved_515>",
702
+ "<reserved_516>",
703
+ "<reserved_517>",
704
+ "<reserved_518>",
705
+ "<reserved_519>",
706
+ "<reserved_520>",
707
+ "<reserved_521>",
708
+ "<reserved_522>",
709
+ "<reserved_523>",
710
+ "<reserved_524>",
711
+ "<reserved_525>",
712
+ "<reserved_526>",
713
+ "<reserved_527>",
714
+ "<reserved_528>",
715
+ "<reserved_529>",
716
+ "<reserved_530>",
717
+ "<reserved_531>",
718
+ "<reserved_532>",
719
+ "<reserved_533>",
720
+ "<reserved_534>",
721
+ "<reserved_535>",
722
+ "<reserved_536>",
723
+ "<reserved_537>",
724
+ "<reserved_538>",
725
+ "<reserved_539>",
726
+ "<reserved_540>",
727
+ "<reserved_541>",
728
+ "<reserved_542>",
729
+ "<reserved_543>",
730
+ "<reserved_544>",
731
+ "<reserved_545>",
732
+ "<reserved_546>",
733
+ "<reserved_547>",
734
+ "<reserved_548>",
735
+ "<reserved_549>",
736
+ "<reserved_550>",
737
+ "<reserved_551>",
738
+ "<reserved_552>",
739
+ "<reserved_553>",
740
+ "<reserved_554>",
741
+ "<reserved_555>",
742
+ "<reserved_556>",
743
+ "<reserved_557>",
744
+ "<reserved_558>",
745
+ "<reserved_559>",
746
+ "<reserved_560>",
747
+ "<reserved_561>",
748
+ "<reserved_562>",
749
+ "<reserved_563>",
750
+ "<reserved_564>",
751
+ "<reserved_565>",
752
+ "<reserved_566>",
753
+ "<reserved_567>",
754
+ "<reserved_568>",
755
+ "<reserved_569>",
756
+ "<reserved_570>",
757
+ "<reserved_571>",
758
+ "<reserved_572>",
759
+ "<reserved_573>",
760
+ "<reserved_574>",
761
+ "<reserved_575>",
762
+ "<reserved_576>",
763
+ "<reserved_577>",
764
+ "<reserved_578>",
765
+ "<reserved_579>",
766
+ "<reserved_580>",
767
+ "<reserved_581>",
768
+ "<reserved_582>",
769
+ "<reserved_583>",
770
+ "<reserved_584>",
771
+ "<reserved_585>",
772
+ "<reserved_586>",
773
+ "<reserved_587>",
774
+ "<reserved_588>",
775
+ "<reserved_589>",
776
+ "<reserved_590>",
777
+ "<reserved_591>",
778
+ "<reserved_592>",
779
+ "<reserved_593>",
780
+ "<reserved_594>",
781
+ "<reserved_595>",
782
+ "<reserved_596>",
783
+ "<reserved_597>",
784
+ "<reserved_598>",
785
+ "<reserved_599>",
786
+ "<reserved_600>",
787
+ "<reserved_601>",
788
+ "<reserved_602>",
789
+ "<reserved_603>",
790
+ "<reserved_604>",
791
+ "<reserved_605>",
792
+ "<reserved_606>",
793
+ "<reserved_607>",
794
+ "<reserved_608>",
795
+ "<reserved_609>",
796
+ "<reserved_610>",
797
+ "<reserved_611>",
798
+ "<reserved_612>",
799
+ "<reserved_613>",
800
+ "<reserved_614>",
801
+ "<reserved_615>",
802
+ "<reserved_616>",
803
+ "<reserved_617>",
804
+ "<reserved_618>",
805
+ "<reserved_619>",
806
+ "<reserved_620>",
807
+ "<reserved_621>",
808
+ "<reserved_622>",
809
+ "<reserved_623>",
810
+ "<reserved_624>",
811
+ "<reserved_625>",
812
+ "<reserved_626>",
813
+ "<reserved_627>",
814
+ "<reserved_628>",
815
+ "<reserved_629>",
816
+ "<reserved_630>",
817
+ "<reserved_631>",
818
+ "<reserved_632>",
819
+ "<reserved_633>",
820
+ "<reserved_634>",
821
+ "<reserved_635>",
822
+ "<reserved_636>",
823
+ "<reserved_637>",
824
+ "<reserved_638>",
825
+ "<reserved_639>",
826
+ "<reserved_640>",
827
+ "<reserved_641>",
828
+ "<reserved_642>",
829
+ "<reserved_643>",
830
+ "<reserved_644>",
831
+ "<reserved_645>",
832
+ "<reserved_646>",
833
+ "<reserved_647>",
834
+ "<reserved_648>",
835
+ "<reserved_649>",
836
+ "<reserved_650>",
837
+ "<reserved_651>",
838
+ "<reserved_652>",
839
+ "<reserved_653>",
840
+ "<reserved_654>",
841
+ "<reserved_655>",
842
+ "<reserved_656>",
843
+ "<reserved_657>",
844
+ "<reserved_658>",
845
+ "<reserved_659>",
846
+ "<reserved_660>",
847
+ "<reserved_661>",
848
+ "<reserved_662>",
849
+ "<reserved_663>",
850
+ "<reserved_664>",
851
+ "<reserved_665>",
852
+ "<reserved_666>",
853
+ "<reserved_667>",
854
+ "<reserved_668>",
855
+ "<reserved_669>",
856
+ "<reserved_670>",
857
+ "<reserved_671>",
858
+ "<reserved_672>",
859
+ "<reserved_673>",
860
+ "<reserved_674>",
861
+ "<reserved_675>",
862
+ "<reserved_676>",
863
+ "<reserved_677>",
864
+ "<reserved_678>",
865
+ "<reserved_679>",
866
+ "<reserved_680>",
867
+ "<reserved_681>",
868
+ "<reserved_682>",
869
+ "<reserved_683>",
870
+ "<reserved_684>",
871
+ "<reserved_685>",
872
+ "<reserved_686>",
873
+ "<reserved_687>",
874
+ "<reserved_688>",
875
+ "<reserved_689>",
876
+ "<reserved_690>",
877
+ "<reserved_691>",
878
+ "<reserved_692>",
879
+ "<reserved_693>",
880
+ "<reserved_694>",
881
+ "<reserved_695>",
882
+ "<reserved_696>",
883
+ "<reserved_697>",
884
+ "<reserved_698>",
885
+ "<reserved_699>",
886
+ "<reserved_700>",
887
+ "<reserved_701>",
888
+ "<reserved_702>",
889
+ "<reserved_703>",
890
+ "<reserved_704>",
891
+ "<reserved_705>",
892
+ "<reserved_706>",
893
+ "<reserved_707>",
894
+ "<reserved_708>",
895
+ "<reserved_709>",
896
+ "<reserved_710>",
897
+ "<reserved_711>",
898
+ "<reserved_712>",
899
+ "<reserved_713>",
900
+ "<reserved_714>",
901
+ "<reserved_715>",
902
+ "<reserved_716>",
903
+ "<reserved_717>",
904
+ "<reserved_718>",
905
+ "<reserved_719>",
906
+ "<reserved_720>",
907
+ "<reserved_721>",
908
+ "<reserved_722>",
909
+ "<reserved_723>",
910
+ "<reserved_724>",
911
+ "<reserved_725>",
912
+ "<reserved_726>",
913
+ "<reserved_727>",
914
+ "<reserved_728>",
915
+ "<reserved_729>",
916
+ "<reserved_730>",
917
+ "<reserved_731>",
918
+ "<reserved_732>",
919
+ "<reserved_733>",
920
+ "<reserved_734>",
921
+ "<reserved_735>",
922
+ "<reserved_736>",
923
+ "<reserved_737>",
924
+ "<reserved_738>",
925
+ "<reserved_739>",
926
+ "<reserved_740>",
927
+ "<reserved_741>",
928
+ "<reserved_742>",
929
+ "<reserved_743>",
930
+ "<reserved_744>",
931
+ "<reserved_745>",
932
+ "<reserved_746>",
933
+ "<reserved_747>",
934
+ "<reserved_748>",
935
+ "<reserved_749>",
936
+ "<reserved_750>",
937
+ "<reserved_751>",
938
+ "<reserved_752>",
939
+ "<reserved_753>",
940
+ "<reserved_754>",
941
+ "<reserved_755>",
942
+ "<reserved_756>",
943
+ "<reserved_757>",
944
+ "<reserved_758>",
945
+ "<reserved_759>",
946
+ "<reserved_760>",
947
+ "<reserved_761>",
948
+ "<reserved_762>",
949
+ "<reserved_763>",
950
+ "<reserved_764>",
951
+ "<reserved_765>",
952
+ "<reserved_766>",
953
+ "<reserved_767>",
954
+ "<reserved_768>",
955
+ "<reserved_769>",
956
+ "<reserved_770>",
957
+ "<reserved_771>",
958
+ "<reserved_772>",
959
+ "<reserved_773>",
960
+ "<reserved_774>",
961
+ "<reserved_775>",
962
+ "<reserved_776>",
963
+ "<reserved_777>",
964
+ "<reserved_778>",
965
+ "<reserved_779>",
966
+ "<reserved_780>",
967
+ "<reserved_781>",
968
+ "<reserved_782>",
969
+ "<reserved_783>",
970
+ "<reserved_784>",
971
+ "<reserved_785>",
972
+ "<reserved_786>",
973
+ "<reserved_787>",
974
+ "<reserved_788>",
975
+ "<reserved_789>",
976
+ "<reserved_790>",
977
+ "<reserved_791>",
978
+ "<reserved_792>",
979
+ "<reserved_793>",
980
+ "<reserved_794>",
981
+ "<reserved_795>",
982
+ "<reserved_796>",
983
+ "<reserved_797>",
984
+ "<reserved_798>",
985
+ "<reserved_799>",
986
+ "<reserved_800>",
987
+ "<reserved_801>",
988
+ "<reserved_802>",
989
+ "<reserved_803>",
990
+ "<reserved_804>",
991
+ "<reserved_805>",
992
+ "<reserved_806>",
993
+ "<reserved_807>",
994
+ "<reserved_808>",
995
+ "<reserved_809>",
996
+ "<reserved_810>",
997
+ "<reserved_811>",
998
+ "<reserved_812>",
999
+ "<reserved_813>",
1000
+ "<reserved_814>",
1001
+ "<reserved_815>",
1002
+ "<reserved_816>",
1003
+ "<reserved_817>",
1004
+ "<reserved_818>",
1005
+ "<reserved_819>",
1006
+ "<reserved_820>",
1007
+ "<reserved_821>",
1008
+ "<reserved_822>",
1009
+ "<reserved_823>",
1010
+ "<reserved_824>",
1011
+ "<reserved_825>",
1012
+ "<reserved_826>",
1013
+ "<reserved_827>",
1014
+ "<reserved_828>",
1015
+ "<reserved_829>",
1016
+ "<reserved_830>",
1017
+ "<reserved_831>",
1018
+ "<reserved_832>",
1019
+ "<reserved_833>",
1020
+ "<reserved_834>",
1021
+ "<reserved_835>",
1022
+ "<reserved_836>",
1023
+ "<reserved_837>",
1024
+ "<reserved_838>",
1025
+ "<reserved_839>",
1026
+ "<reserved_840>",
1027
+ "<reserved_841>",
1028
+ "<reserved_842>",
1029
+ "<reserved_843>",
1030
+ "<reserved_844>",
1031
+ "<reserved_845>",
1032
+ "<reserved_846>",
1033
+ "<reserved_847>",
1034
+ "<reserved_848>",
1035
+ "<reserved_849>",
1036
+ "<reserved_850>",
1037
+ "<reserved_851>",
1038
+ "<reserved_852>",
1039
+ "<reserved_853>",
1040
+ "<reserved_854>",
1041
+ "<reserved_855>",
1042
+ "<reserved_856>",
1043
+ "<reserved_857>",
1044
+ "<reserved_858>",
1045
+ "<reserved_859>",
1046
+ "<reserved_860>",
1047
+ "<reserved_861>",
1048
+ "<reserved_862>",
1049
+ "<reserved_863>",
1050
+ "<reserved_864>",
1051
+ "<reserved_865>",
1052
+ "<reserved_866>",
1053
+ "<reserved_867>",
1054
+ "<reserved_868>",
1055
+ "<reserved_869>",
1056
+ "<reserved_870>",
1057
+ "<reserved_871>",
1058
+ "<reserved_872>",
1059
+ "<reserved_873>",
1060
+ "<reserved_874>",
1061
+ "<reserved_875>",
1062
+ "<reserved_876>",
1063
+ "<reserved_877>",
1064
+ "<reserved_878>",
1065
+ "<reserved_879>",
1066
+ "<reserved_880>",
1067
+ "<reserved_881>",
1068
+ "<reserved_882>",
1069
+ "<reserved_883>",
1070
+ "<reserved_884>",
1071
+ "<reserved_885>",
1072
+ "<reserved_886>",
1073
+ "<reserved_887>",
1074
+ "<reserved_888>",
1075
+ "<reserved_889>",
1076
+ "<reserved_890>",
1077
+ "<reserved_891>",
1078
+ "<reserved_892>",
1079
+ "<reserved_893>",
1080
+ "<reserved_894>",
1081
+ "<reserved_895>",
1082
+ "<reserved_896>",
1083
+ "<reserved_897>",
1084
+ "<reserved_898>",
1085
+ "<reserved_899>",
1086
+ "<reserved_900>",
1087
+ "<reserved_901>",
1088
+ "<reserved_902>",
1089
+ "<reserved_903>",
1090
+ "<reserved_904>",
1091
+ "<reserved_905>",
1092
+ "<reserved_906>",
1093
+ "<reserved_907>",
1094
+ "<reserved_908>",
1095
+ "<reserved_909>",
1096
+ "<reserved_910>",
1097
+ "<reserved_911>",
1098
+ "<reserved_912>",
1099
+ "<reserved_913>",
1100
+ "<reserved_914>",
1101
+ "<reserved_915>",
1102
+ "<reserved_916>",
1103
+ "<reserved_917>",
1104
+ "<reserved_918>",
1105
+ "<reserved_919>",
1106
+ "<reserved_920>",
1107
+ "<reserved_921>",
1108
+ "<reserved_922>",
1109
+ "<reserved_923>",
1110
+ "<reserved_924>",
1111
+ "<reserved_925>",
1112
+ "<reserved_926>",
1113
+ "<reserved_927>",
1114
+ "<reserved_928>",
1115
+ "<reserved_929>",
1116
+ "<reserved_930>",
1117
+ "<reserved_931>",
1118
+ "<reserved_932>",
1119
+ "<reserved_933>",
1120
+ "<reserved_934>",
1121
+ "<reserved_935>",
1122
+ "<reserved_936>",
1123
+ "<reserved_937>",
1124
+ "<reserved_938>",
1125
+ "<reserved_939>",
1126
+ "<reserved_940>",
1127
+ "<reserved_941>",
1128
+ "<reserved_942>",
1129
+ "<reserved_943>",
1130
+ "<reserved_944>",
1131
+ "<reserved_945>",
1132
+ "<reserved_946>",
1133
+ "<reserved_947>",
1134
+ "<reserved_948>",
1135
+ "<reserved_949>",
1136
+ "<reserved_950>",
1137
+ "<reserved_951>",
1138
+ "<reserved_952>",
1139
+ "<reserved_953>",
1140
+ "<reserved_954>",
1141
+ "<reserved_955>"
1142
+ ],
1143
+ "formatting": [
1144
+ "<b>",
1145
+ "</b>",
1146
+ "<del>",
1147
+ "</del>",
1148
+ "<mark>",
1149
+ "</mark>",
1150
+ "<sub>",
1151
+ "</sub>",
1152
+ "<br>",
1153
+ "<sup>",
1154
+ "</sup>",
1155
+ "<i>",
1156
+ "</i>",
1157
+ "<small>",
1158
+ "</small>",
1159
+ "<u>",
1160
+ "</u>",
1161
+ "<code>",
1162
+ "</code>"
1163
+ ],
1164
+ "math_external": [
1165
+ "<math>",
1166
+ "<math display='block'>",
1167
+ "<math display=\"block\">",
1168
+ "<math display='inline'>",
1169
+ "<math display=\"inline\">",
1170
+ "</math>"
1171
+ ],
1172
+ "reserved": [
1173
+ "<reserved_0>",
1174
+ "<reserved_1>",
1175
+ "<reserved_2>",
1176
+ "<reserved_3>",
1177
+ "<reserved_4>",
1178
+ "<reserved_5>",
1179
+ "<reserved_6>",
1180
+ "<reserved_7>",
1181
+ "<reserved_8>",
1182
+ "<reserved_9>",
1183
+ "<reserved_10>",
1184
+ "<reserved_11>",
1185
+ "<reserved_12>",
1186
+ "<reserved_13>",
1187
+ "<reserved_14>",
1188
+ "<reserved_15>",
1189
+ "<reserved_16>",
1190
+ "<reserved_17>",
1191
+ "<reserved_18>",
1192
+ "<reserved_19>",
1193
+ "<reserved_20>",
1194
+ "<reserved_21>",
1195
+ "<reserved_22>",
1196
+ "<reserved_23>",
1197
+ "<reserved_24>",
1198
+ "<reserved_25>",
1199
+ "<reserved_26>",
1200
+ "<reserved_27>",
1201
+ "<reserved_28>",
1202
+ "<reserved_29>",
1203
+ "<reserved_30>",
1204
+ "<reserved_31>",
1205
+ "<reserved_32>",
1206
+ "<reserved_33>",
1207
+ "<reserved_34>",
1208
+ "<reserved_35>",
1209
+ "<reserved_36>",
1210
+ "<reserved_37>",
1211
+ "<reserved_38>",
1212
+ "<reserved_39>",
1213
+ "<reserved_40>",
1214
+ "<reserved_41>",
1215
+ "<reserved_42>",
1216
+ "<reserved_43>",
1217
+ "<reserved_44>",
1218
+ "<reserved_45>",
1219
+ "<reserved_46>",
1220
+ "<reserved_47>",
1221
+ "<reserved_48>",
1222
+ "<reserved_49>",
1223
+ "<reserved_50>",
1224
+ "<reserved_51>",
1225
+ "<reserved_52>",
1226
+ "<reserved_53>",
1227
+ "<reserved_54>",
1228
+ "<reserved_55>",
1229
+ "<reserved_56>",
1230
+ "<reserved_57>",
1231
+ "<reserved_58>",
1232
+ "<reserved_59>",
1233
+ "<reserved_60>",
1234
+ "<reserved_61>",
1235
+ "<reserved_62>",
1236
+ "<reserved_63>",
1237
+ "<reserved_64>",
1238
+ "<reserved_65>",
1239
+ "<reserved_66>",
1240
+ "<reserved_67>",
1241
+ "<reserved_68>",
1242
+ "<reserved_69>",
1243
+ "<reserved_70>",
1244
+ "<reserved_71>",
1245
+ "<reserved_72>",
1246
+ "<reserved_73>",
1247
+ "<reserved_74>",
1248
+ "<reserved_75>",
1249
+ "<reserved_76>",
1250
+ "<reserved_77>",
1251
+ "<reserved_78>",
1252
+ "<reserved_79>",
1253
+ "<reserved_80>",
1254
+ "<reserved_81>",
1255
+ "<reserved_82>",
1256
+ "<reserved_83>",
1257
+ "<reserved_84>",
1258
+ "<reserved_85>",
1259
+ "<reserved_86>",
1260
+ "<reserved_87>",
1261
+ "<reserved_88>",
1262
+ "<reserved_89>",
1263
+ "<reserved_90>",
1264
+ "<reserved_91>",
1265
+ "<reserved_92>",
1266
+ "<reserved_93>",
1267
+ "<reserved_94>",
1268
+ "<reserved_95>",
1269
+ "<reserved_96>",
1270
+ "<reserved_97>",
1271
+ "<reserved_98>",
1272
+ "<reserved_99>",
1273
+ "<reserved_100>",
1274
+ "<reserved_101>",
1275
+ "<reserved_102>",
1276
+ "<reserved_103>",
1277
+ "<reserved_104>",
1278
+ "<reserved_105>",
1279
+ "<reserved_106>",
1280
+ "<reserved_107>",
1281
+ "<reserved_108>",
1282
+ "<reserved_109>",
1283
+ "<reserved_110>",
1284
+ "<reserved_111>",
1285
+ "<reserved_112>",
1286
+ "<reserved_113>",
1287
+ "<reserved_114>",
1288
+ "<reserved_115>",
1289
+ "<reserved_116>",
1290
+ "<reserved_117>",
1291
+ "<reserved_118>",
1292
+ "<reserved_119>",
1293
+ "<reserved_120>",
1294
+ "<reserved_121>",
1295
+ "<reserved_122>",
1296
+ "<reserved_123>",
1297
+ "<reserved_124>",
1298
+ "<reserved_125>",
1299
+ "<reserved_126>",
1300
+ "<reserved_127>",
1301
+ "<reserved_128>",
1302
+ "<reserved_129>",
1303
+ "<reserved_130>",
1304
+ "<reserved_131>",
1305
+ "<reserved_132>",
1306
+ "<reserved_133>",
1307
+ "<reserved_134>",
1308
+ "<reserved_135>",
1309
+ "<reserved_136>",
1310
+ "<reserved_137>",
1311
+ "<reserved_138>",
1312
+ "<reserved_139>",
1313
+ "<reserved_140>",
1314
+ "<reserved_141>",
1315
+ "<reserved_142>",
1316
+ "<reserved_143>",
1317
+ "<reserved_144>",
1318
+ "<reserved_145>",
1319
+ "<reserved_146>",
1320
+ "<reserved_147>",
1321
+ "<reserved_148>",
1322
+ "<reserved_149>",
1323
+ "<reserved_150>",
1324
+ "<reserved_151>",
1325
+ "<reserved_152>",
1326
+ "<reserved_153>",
1327
+ "<reserved_154>",
1328
+ "<reserved_155>",
1329
+ "<reserved_156>",
1330
+ "<reserved_157>",
1331
+ "<reserved_158>",
1332
+ "<reserved_159>",
1333
+ "<reserved_160>",
1334
+ "<reserved_161>",
1335
+ "<reserved_162>",
1336
+ "<reserved_163>",
1337
+ "<reserved_164>",
1338
+ "<reserved_165>",
1339
+ "<reserved_166>",
1340
+ "<reserved_167>",
1341
+ "<reserved_168>",
1342
+ "<reserved_169>",
1343
+ "<reserved_170>",
1344
+ "<reserved_171>",
1345
+ "<reserved_172>",
1346
+ "<reserved_173>",
1347
+ "<reserved_174>",
1348
+ "<reserved_175>",
1349
+ "<reserved_176>",
1350
+ "<reserved_177>",
1351
+ "<reserved_178>",
1352
+ "<reserved_179>",
1353
+ "<reserved_180>",
1354
+ "<reserved_181>",
1355
+ "<reserved_182>",
1356
+ "<reserved_183>",
1357
+ "<reserved_184>",
1358
+ "<reserved_185>",
1359
+ "<reserved_186>",
1360
+ "<reserved_187>",
1361
+ "<reserved_188>",
1362
+ "<reserved_189>",
1363
+ "<reserved_190>",
1364
+ "<reserved_191>",
1365
+ "<reserved_192>",
1366
+ "<reserved_193>",
1367
+ "<reserved_194>",
1368
+ "<reserved_195>",
1369
+ "<reserved_196>",
1370
+ "<reserved_197>",
1371
+ "<reserved_198>",
1372
+ "<reserved_199>",
1373
+ "<reserved_200>",
1374
+ "<reserved_201>",
1375
+ "<reserved_202>",
1376
+ "<reserved_203>",
1377
+ "<reserved_204>",
1378
+ "<reserved_205>",
1379
+ "<reserved_206>",
1380
+ "<reserved_207>",
1381
+ "<reserved_208>",
1382
+ "<reserved_209>",
1383
+ "<reserved_210>",
1384
+ "<reserved_211>",
1385
+ "<reserved_212>",
1386
+ "<reserved_213>",
1387
+ "<reserved_214>",
1388
+ "<reserved_215>",
1389
+ "<reserved_216>",
1390
+ "<reserved_217>",
1391
+ "<reserved_218>",
1392
+ "<reserved_219>",
1393
+ "<reserved_220>",
1394
+ "<reserved_221>",
1395
+ "<reserved_222>",
1396
+ "<reserved_223>",
1397
+ "<reserved_224>",
1398
+ "<reserved_225>",
1399
+ "<reserved_226>",
1400
+ "<reserved_227>",
1401
+ "<reserved_228>",
1402
+ "<reserved_229>",
1403
+ "<reserved_230>",
1404
+ "<reserved_231>",
1405
+ "<reserved_232>",
1406
+ "<reserved_233>",
1407
+ "<reserved_234>",
1408
+ "<reserved_235>",
1409
+ "<reserved_236>",
1410
+ "<reserved_237>",
1411
+ "<reserved_238>",
1412
+ "<reserved_239>",
1413
+ "<reserved_240>",
1414
+ "<reserved_241>",
1415
+ "<reserved_242>",
1416
+ "<reserved_243>",
1417
+ "<reserved_244>",
1418
+ "<reserved_245>",
1419
+ "<reserved_246>",
1420
+ "<reserved_247>",
1421
+ "<reserved_248>",
1422
+ "<reserved_249>",
1423
+ "<reserved_250>",
1424
+ "<reserved_251>",
1425
+ "<reserved_252>",
1426
+ "<reserved_253>",
1427
+ "<reserved_254>",
1428
+ "<reserved_255>",
1429
+ "<reserved_256>",
1430
+ "<reserved_257>",
1431
+ "<reserved_258>",
1432
+ "<reserved_259>",
1433
+ "<reserved_260>",
1434
+ "<reserved_261>",
1435
+ "<reserved_262>",
1436
+ "<reserved_263>",
1437
+ "<reserved_264>",
1438
+ "<reserved_265>",
1439
+ "<reserved_266>",
1440
+ "<reserved_267>",
1441
+ "<reserved_268>",
1442
+ "<reserved_269>",
1443
+ "<reserved_270>",
1444
+ "<reserved_271>",
1445
+ "<reserved_272>",
1446
+ "<reserved_273>",
1447
+ "<reserved_274>",
1448
+ "<reserved_275>",
1449
+ "<reserved_276>",
1450
+ "<reserved_277>",
1451
+ "<reserved_278>",
1452
+ "<reserved_279>",
1453
+ "<reserved_280>",
1454
+ "<reserved_281>",
1455
+ "<reserved_282>",
1456
+ "<reserved_283>",
1457
+ "<reserved_284>",
1458
+ "<reserved_285>",
1459
+ "<reserved_286>",
1460
+ "<reserved_287>",
1461
+ "<reserved_288>",
1462
+ "<reserved_289>",
1463
+ "<reserved_290>",
1464
+ "<reserved_291>",
1465
+ "<reserved_292>",
1466
+ "<reserved_293>",
1467
+ "<reserved_294>",
1468
+ "<reserved_295>",
1469
+ "<reserved_296>",
1470
+ "<reserved_297>",
1471
+ "<reserved_298>",
1472
+ "<reserved_299>",
1473
+ "<reserved_300>",
1474
+ "<reserved_301>",
1475
+ "<reserved_302>",
1476
+ "<reserved_303>",
1477
+ "<reserved_304>",
1478
+ "<reserved_305>",
1479
+ "<reserved_306>",
1480
+ "<reserved_307>",
1481
+ "<reserved_308>",
1482
+ "<reserved_309>",
1483
+ "<reserved_310>",
1484
+ "<reserved_311>",
1485
+ "<reserved_312>",
1486
+ "<reserved_313>",
1487
+ "<reserved_314>",
1488
+ "<reserved_315>",
1489
+ "<reserved_316>",
1490
+ "<reserved_317>",
1491
+ "<reserved_318>",
1492
+ "<reserved_319>",
1493
+ "<reserved_320>",
1494
+ "<reserved_321>",
1495
+ "<reserved_322>",
1496
+ "<reserved_323>",
1497
+ "<reserved_324>",
1498
+ "<reserved_325>",
1499
+ "<reserved_326>",
1500
+ "<reserved_327>",
1501
+ "<reserved_328>",
1502
+ "<reserved_329>",
1503
+ "<reserved_330>",
1504
+ "<reserved_331>",
1505
+ "<reserved_332>",
1506
+ "<reserved_333>",
1507
+ "<reserved_334>",
1508
+ "<reserved_335>",
1509
+ "<reserved_336>",
1510
+ "<reserved_337>",
1511
+ "<reserved_338>",
1512
+ "<reserved_339>",
1513
+ "<reserved_340>",
1514
+ "<reserved_341>",
1515
+ "<reserved_342>",
1516
+ "<reserved_343>",
1517
+ "<reserved_344>",
1518
+ "<reserved_345>",
1519
+ "<reserved_346>",
1520
+ "<reserved_347>",
1521
+ "<reserved_348>",
1522
+ "<reserved_349>",
1523
+ "<reserved_350>",
1524
+ "<reserved_351>",
1525
+ "<reserved_352>",
1526
+ "<reserved_353>",
1527
+ "<reserved_354>",
1528
+ "<reserved_355>",
1529
+ "<reserved_356>",
1530
+ "<reserved_357>",
1531
+ "<reserved_358>",
1532
+ "<reserved_359>",
1533
+ "<reserved_360>",
1534
+ "<reserved_361>",
1535
+ "<reserved_362>",
1536
+ "<reserved_363>",
1537
+ "<reserved_364>",
1538
+ "<reserved_365>",
1539
+ "<reserved_366>",
1540
+ "<reserved_367>",
1541
+ "<reserved_368>",
1542
+ "<reserved_369>",
1543
+ "<reserved_370>",
1544
+ "<reserved_371>",
1545
+ "<reserved_372>",
1546
+ "<reserved_373>",
1547
+ "<reserved_374>",
1548
+ "<reserved_375>",
1549
+ "<reserved_376>",
1550
+ "<reserved_377>",
1551
+ "<reserved_378>",
1552
+ "<reserved_379>",
1553
+ "<reserved_380>",
1554
+ "<reserved_381>",
1555
+ "<reserved_382>",
1556
+ "<reserved_383>",
1557
+ "<reserved_384>",
1558
+ "<reserved_385>",
1559
+ "<reserved_386>",
1560
+ "<reserved_387>",
1561
+ "<reserved_388>",
1562
+ "<reserved_389>",
1563
+ "<reserved_390>",
1564
+ "<reserved_391>",
1565
+ "<reserved_392>",
1566
+ "<reserved_393>",
1567
+ "<reserved_394>",
1568
+ "<reserved_395>",
1569
+ "<reserved_396>",
1570
+ "<reserved_397>",
1571
+ "<reserved_398>",
1572
+ "<reserved_399>",
1573
+ "<reserved_400>",
1574
+ "<reserved_401>",
1575
+ "<reserved_402>",
1576
+ "<reserved_403>",
1577
+ "<reserved_404>",
1578
+ "<reserved_405>",
1579
+ "<reserved_406>",
1580
+ "<reserved_407>",
1581
+ "<reserved_408>",
1582
+ "<reserved_409>",
1583
+ "<reserved_410>",
1584
+ "<reserved_411>",
1585
+ "<reserved_412>",
1586
+ "<reserved_413>",
1587
+ "<reserved_414>",
1588
+ "<reserved_415>",
1589
+ "<reserved_416>",
1590
+ "<reserved_417>",
1591
+ "<reserved_418>",
1592
+ "<reserved_419>",
1593
+ "<reserved_420>",
1594
+ "<reserved_421>",
1595
+ "<reserved_422>",
1596
+ "<reserved_423>",
1597
+ "<reserved_424>",
1598
+ "<reserved_425>",
1599
+ "<reserved_426>",
1600
+ "<reserved_427>",
1601
+ "<reserved_428>",
1602
+ "<reserved_429>",
1603
+ "<reserved_430>",
1604
+ "<reserved_431>",
1605
+ "<reserved_432>",
1606
+ "<reserved_433>",
1607
+ "<reserved_434>",
1608
+ "<reserved_435>",
1609
+ "<reserved_436>",
1610
+ "<reserved_437>",
1611
+ "<reserved_438>",
1612
+ "<reserved_439>",
1613
+ "<reserved_440>",
1614
+ "<reserved_441>",
1615
+ "<reserved_442>",
1616
+ "<reserved_443>",
1617
+ "<reserved_444>",
1618
+ "<reserved_445>",
1619
+ "<reserved_446>",
1620
+ "<reserved_447>",
1621
+ "<reserved_448>",
1622
+ "<reserved_449>",
1623
+ "<reserved_450>",
1624
+ "<reserved_451>",
1625
+ "<reserved_452>",
1626
+ "<reserved_453>",
1627
+ "<reserved_454>",
1628
+ "<reserved_455>",
1629
+ "<reserved_456>",
1630
+ "<reserved_457>",
1631
+ "<reserved_458>",
1632
+ "<reserved_459>",
1633
+ "<reserved_460>",
1634
+ "<reserved_461>",
1635
+ "<reserved_462>",
1636
+ "<reserved_463>",
1637
+ "<reserved_464>",
1638
+ "<reserved_465>",
1639
+ "<reserved_466>",
1640
+ "<reserved_467>",
1641
+ "<reserved_468>",
1642
+ "<reserved_469>",
1643
+ "<reserved_470>",
1644
+ "<reserved_471>",
1645
+ "<reserved_472>",
1646
+ "<reserved_473>",
1647
+ "<reserved_474>",
1648
+ "<reserved_475>",
1649
+ "<reserved_476>",
1650
+ "<reserved_477>",
1651
+ "<reserved_478>",
1652
+ "<reserved_479>",
1653
+ "<reserved_480>",
1654
+ "<reserved_481>",
1655
+ "<reserved_482>",
1656
+ "<reserved_483>",
1657
+ "<reserved_484>",
1658
+ "<reserved_485>",
1659
+ "<reserved_486>",
1660
+ "<reserved_487>",
1661
+ "<reserved_488>",
1662
+ "<reserved_489>",
1663
+ "<reserved_490>",
1664
+ "<reserved_491>",
1665
+ "<reserved_492>",
1666
+ "<reserved_493>",
1667
+ "<reserved_494>",
1668
+ "<reserved_495>",
1669
+ "<reserved_496>",
1670
+ "<reserved_497>",
1671
+ "<reserved_498>",
1672
+ "<reserved_499>",
1673
+ "<reserved_500>",
1674
+ "<reserved_501>",
1675
+ "<reserved_502>",
1676
+ "<reserved_503>",
1677
+ "<reserved_504>",
1678
+ "<reserved_505>",
1679
+ "<reserved_506>",
1680
+ "<reserved_507>",
1681
+ "<reserved_508>",
1682
+ "<reserved_509>",
1683
+ "<reserved_510>",
1684
+ "<reserved_511>",
1685
+ "<reserved_512>",
1686
+ "<reserved_513>",
1687
+ "<reserved_514>",
1688
+ "<reserved_515>",
1689
+ "<reserved_516>",
1690
+ "<reserved_517>",
1691
+ "<reserved_518>",
1692
+ "<reserved_519>",
1693
+ "<reserved_520>",
1694
+ "<reserved_521>",
1695
+ "<reserved_522>",
1696
+ "<reserved_523>",
1697
+ "<reserved_524>",
1698
+ "<reserved_525>",
1699
+ "<reserved_526>",
1700
+ "<reserved_527>",
1701
+ "<reserved_528>",
1702
+ "<reserved_529>",
1703
+ "<reserved_530>",
1704
+ "<reserved_531>",
1705
+ "<reserved_532>",
1706
+ "<reserved_533>",
1707
+ "<reserved_534>",
1708
+ "<reserved_535>",
1709
+ "<reserved_536>",
1710
+ "<reserved_537>",
1711
+ "<reserved_538>",
1712
+ "<reserved_539>",
1713
+ "<reserved_540>",
1714
+ "<reserved_541>",
1715
+ "<reserved_542>",
1716
+ "<reserved_543>",
1717
+ "<reserved_544>",
1718
+ "<reserved_545>",
1719
+ "<reserved_546>",
1720
+ "<reserved_547>",
1721
+ "<reserved_548>",
1722
+ "<reserved_549>",
1723
+ "<reserved_550>",
1724
+ "<reserved_551>",
1725
+ "<reserved_552>",
1726
+ "<reserved_553>",
1727
+ "<reserved_554>",
1728
+ "<reserved_555>",
1729
+ "<reserved_556>",
1730
+ "<reserved_557>",
1731
+ "<reserved_558>",
1732
+ "<reserved_559>",
1733
+ "<reserved_560>",
1734
+ "<reserved_561>",
1735
+ "<reserved_562>",
1736
+ "<reserved_563>",
1737
+ "<reserved_564>",
1738
+ "<reserved_565>",
1739
+ "<reserved_566>",
1740
+ "<reserved_567>",
1741
+ "<reserved_568>",
1742
+ "<reserved_569>",
1743
+ "<reserved_570>",
1744
+ "<reserved_571>",
1745
+ "<reserved_572>",
1746
+ "<reserved_573>",
1747
+ "<reserved_574>",
1748
+ "<reserved_575>",
1749
+ "<reserved_576>",
1750
+ "<reserved_577>",
1751
+ "<reserved_578>",
1752
+ "<reserved_579>",
1753
+ "<reserved_580>",
1754
+ "<reserved_581>",
1755
+ "<reserved_582>",
1756
+ "<reserved_583>",
1757
+ "<reserved_584>",
1758
+ "<reserved_585>",
1759
+ "<reserved_586>",
1760
+ "<reserved_587>",
1761
+ "<reserved_588>",
1762
+ "<reserved_589>",
1763
+ "<reserved_590>",
1764
+ "<reserved_591>",
1765
+ "<reserved_592>",
1766
+ "<reserved_593>",
1767
+ "<reserved_594>",
1768
+ "<reserved_595>",
1769
+ "<reserved_596>",
1770
+ "<reserved_597>",
1771
+ "<reserved_598>",
1772
+ "<reserved_599>",
1773
+ "<reserved_600>",
1774
+ "<reserved_601>",
1775
+ "<reserved_602>",
1776
+ "<reserved_603>",
1777
+ "<reserved_604>",
1778
+ "<reserved_605>",
1779
+ "<reserved_606>",
1780
+ "<reserved_607>",
1781
+ "<reserved_608>",
1782
+ "<reserved_609>",
1783
+ "<reserved_610>",
1784
+ "<reserved_611>",
1785
+ "<reserved_612>",
1786
+ "<reserved_613>",
1787
+ "<reserved_614>",
1788
+ "<reserved_615>",
1789
+ "<reserved_616>",
1790
+ "<reserved_617>",
1791
+ "<reserved_618>",
1792
+ "<reserved_619>",
1793
+ "<reserved_620>",
1794
+ "<reserved_621>",
1795
+ "<reserved_622>",
1796
+ "<reserved_623>",
1797
+ "<reserved_624>",
1798
+ "<reserved_625>",
1799
+ "<reserved_626>",
1800
+ "<reserved_627>",
1801
+ "<reserved_628>",
1802
+ "<reserved_629>",
1803
+ "<reserved_630>",
1804
+ "<reserved_631>",
1805
+ "<reserved_632>",
1806
+ "<reserved_633>",
1807
+ "<reserved_634>",
1808
+ "<reserved_635>",
1809
+ "<reserved_636>",
1810
+ "<reserved_637>",
1811
+ "<reserved_638>",
1812
+ "<reserved_639>",
1813
+ "<reserved_640>",
1814
+ "<reserved_641>",
1815
+ "<reserved_642>",
1816
+ "<reserved_643>",
1817
+ "<reserved_644>",
1818
+ "<reserved_645>",
1819
+ "<reserved_646>",
1820
+ "<reserved_647>",
1821
+ "<reserved_648>",
1822
+ "<reserved_649>",
1823
+ "<reserved_650>",
1824
+ "<reserved_651>",
1825
+ "<reserved_652>",
1826
+ "<reserved_653>",
1827
+ "<reserved_654>",
1828
+ "<reserved_655>",
1829
+ "<reserved_656>",
1830
+ "<reserved_657>",
1831
+ "<reserved_658>",
1832
+ "<reserved_659>",
1833
+ "<reserved_660>",
1834
+ "<reserved_661>",
1835
+ "<reserved_662>",
1836
+ "<reserved_663>",
1837
+ "<reserved_664>",
1838
+ "<reserved_665>",
1839
+ "<reserved_666>",
1840
+ "<reserved_667>",
1841
+ "<reserved_668>",
1842
+ "<reserved_669>",
1843
+ "<reserved_670>",
1844
+ "<reserved_671>",
1845
+ "<reserved_672>",
1846
+ "<reserved_673>",
1847
+ "<reserved_674>",
1848
+ "<reserved_675>",
1849
+ "<reserved_676>",
1850
+ "<reserved_677>",
1851
+ "<reserved_678>",
1852
+ "<reserved_679>",
1853
+ "<reserved_680>",
1854
+ "<reserved_681>",
1855
+ "<reserved_682>",
1856
+ "<reserved_683>",
1857
+ "<reserved_684>",
1858
+ "<reserved_685>",
1859
+ "<reserved_686>",
1860
+ "<reserved_687>",
1861
+ "<reserved_688>",
1862
+ "<reserved_689>",
1863
+ "<reserved_690>",
1864
+ "<reserved_691>",
1865
+ "<reserved_692>",
1866
+ "<reserved_693>",
1867
+ "<reserved_694>",
1868
+ "<reserved_695>",
1869
+ "<reserved_696>",
1870
+ "<reserved_697>",
1871
+ "<reserved_698>",
1872
+ "<reserved_699>",
1873
+ "<reserved_700>",
1874
+ "<reserved_701>",
1875
+ "<reserved_702>",
1876
+ "<reserved_703>",
1877
+ "<reserved_704>",
1878
+ "<reserved_705>",
1879
+ "<reserved_706>",
1880
+ "<reserved_707>",
1881
+ "<reserved_708>",
1882
+ "<reserved_709>",
1883
+ "<reserved_710>",
1884
+ "<reserved_711>",
1885
+ "<reserved_712>",
1886
+ "<reserved_713>",
1887
+ "<reserved_714>",
1888
+ "<reserved_715>",
1889
+ "<reserved_716>",
1890
+ "<reserved_717>",
1891
+ "<reserved_718>",
1892
+ "<reserved_719>",
1893
+ "<reserved_720>",
1894
+ "<reserved_721>",
1895
+ "<reserved_722>",
1896
+ "<reserved_723>",
1897
+ "<reserved_724>",
1898
+ "<reserved_725>",
1899
+ "<reserved_726>",
1900
+ "<reserved_727>",
1901
+ "<reserved_728>",
1902
+ "<reserved_729>",
1903
+ "<reserved_730>",
1904
+ "<reserved_731>",
1905
+ "<reserved_732>",
1906
+ "<reserved_733>",
1907
+ "<reserved_734>",
1908
+ "<reserved_735>",
1909
+ "<reserved_736>",
1910
+ "<reserved_737>",
1911
+ "<reserved_738>",
1912
+ "<reserved_739>",
1913
+ "<reserved_740>",
1914
+ "<reserved_741>",
1915
+ "<reserved_742>",
1916
+ "<reserved_743>",
1917
+ "<reserved_744>",
1918
+ "<reserved_745>",
1919
+ "<reserved_746>",
1920
+ "<reserved_747>",
1921
+ "<reserved_748>",
1922
+ "<reserved_749>",
1923
+ "<reserved_750>",
1924
+ "<reserved_751>",
1925
+ "<reserved_752>",
1926
+ "<reserved_753>",
1927
+ "<reserved_754>",
1928
+ "<reserved_755>",
1929
+ "<reserved_756>",
1930
+ "<reserved_757>",
1931
+ "<reserved_758>",
1932
+ "<reserved_759>",
1933
+ "<reserved_760>",
1934
+ "<reserved_761>",
1935
+ "<reserved_762>",
1936
+ "<reserved_763>",
1937
+ "<reserved_764>",
1938
+ "<reserved_765>",
1939
+ "<reserved_766>",
1940
+ "<reserved_767>",
1941
+ "<reserved_768>",
1942
+ "<reserved_769>",
1943
+ "<reserved_770>",
1944
+ "<reserved_771>",
1945
+ "<reserved_772>",
1946
+ "<reserved_773>",
1947
+ "<reserved_774>",
1948
+ "<reserved_775>",
1949
+ "<reserved_776>",
1950
+ "<reserved_777>",
1951
+ "<reserved_778>",
1952
+ "<reserved_779>",
1953
+ "<reserved_780>",
1954
+ "<reserved_781>",
1955
+ "<reserved_782>",
1956
+ "<reserved_783>",
1957
+ "<reserved_784>",
1958
+ "<reserved_785>",
1959
+ "<reserved_786>",
1960
+ "<reserved_787>",
1961
+ "<reserved_788>",
1962
+ "<reserved_789>",
1963
+ "<reserved_790>",
1964
+ "<reserved_791>",
1965
+ "<reserved_792>",
1966
+ "<reserved_793>",
1967
+ "<reserved_794>",
1968
+ "<reserved_795>",
1969
+ "<reserved_796>",
1970
+ "<reserved_797>",
1971
+ "<reserved_798>",
1972
+ "<reserved_799>",
1973
+ "<reserved_800>",
1974
+ "<reserved_801>",
1975
+ "<reserved_802>",
1976
+ "<reserved_803>",
1977
+ "<reserved_804>",
1978
+ "<reserved_805>",
1979
+ "<reserved_806>",
1980
+ "<reserved_807>",
1981
+ "<reserved_808>",
1982
+ "<reserved_809>",
1983
+ "<reserved_810>",
1984
+ "<reserved_811>",
1985
+ "<reserved_812>",
1986
+ "<reserved_813>",
1987
+ "<reserved_814>",
1988
+ "<reserved_815>",
1989
+ "<reserved_816>",
1990
+ "<reserved_817>",
1991
+ "<reserved_818>",
1992
+ "<reserved_819>",
1993
+ "<reserved_820>",
1994
+ "<reserved_821>",
1995
+ "<reserved_822>",
1996
+ "<reserved_823>",
1997
+ "<reserved_824>",
1998
+ "<reserved_825>",
1999
+ "<reserved_826>",
2000
+ "<reserved_827>",
2001
+ "<reserved_828>",
2002
+ "<reserved_829>",
2003
+ "<reserved_830>",
2004
+ "<reserved_831>",
2005
+ "<reserved_832>",
2006
+ "<reserved_833>",
2007
+ "<reserved_834>",
2008
+ "<reserved_835>",
2009
+ "<reserved_836>",
2010
+ "<reserved_837>",
2011
+ "<reserved_838>",
2012
+ "<reserved_839>",
2013
+ "<reserved_840>",
2014
+ "<reserved_841>",
2015
+ "<reserved_842>",
2016
+ "<reserved_843>",
2017
+ "<reserved_844>",
2018
+ "<reserved_845>",
2019
+ "<reserved_846>",
2020
+ "<reserved_847>",
2021
+ "<reserved_848>",
2022
+ "<reserved_849>",
2023
+ "<reserved_850>",
2024
+ "<reserved_851>",
2025
+ "<reserved_852>",
2026
+ "<reserved_853>",
2027
+ "<reserved_854>",
2028
+ "<reserved_855>",
2029
+ "<reserved_856>",
2030
+ "<reserved_857>",
2031
+ "<reserved_858>",
2032
+ "<reserved_859>",
2033
+ "<reserved_860>",
2034
+ "<reserved_861>",
2035
+ "<reserved_862>",
2036
+ "<reserved_863>",
2037
+ "<reserved_864>",
2038
+ "<reserved_865>",
2039
+ "<reserved_866>",
2040
+ "<reserved_867>",
2041
+ "<reserved_868>",
2042
+ "<reserved_869>",
2043
+ "<reserved_870>",
2044
+ "<reserved_871>",
2045
+ "<reserved_872>",
2046
+ "<reserved_873>",
2047
+ "<reserved_874>",
2048
+ "<reserved_875>",
2049
+ "<reserved_876>",
2050
+ "<reserved_877>",
2051
+ "<reserved_878>",
2052
+ "<reserved_879>",
2053
+ "<reserved_880>",
2054
+ "<reserved_881>",
2055
+ "<reserved_882>",
2056
+ "<reserved_883>",
2057
+ "<reserved_884>",
2058
+ "<reserved_885>",
2059
+ "<reserved_886>",
2060
+ "<reserved_887>",
2061
+ "<reserved_888>",
2062
+ "<reserved_889>",
2063
+ "<reserved_890>",
2064
+ "<reserved_891>",
2065
+ "<reserved_892>",
2066
+ "<reserved_893>",
2067
+ "<reserved_894>",
2068
+ "<reserved_895>",
2069
+ "<reserved_896>",
2070
+ "<reserved_897>",
2071
+ "<reserved_898>",
2072
+ "<reserved_899>",
2073
+ "<reserved_900>",
2074
+ "<reserved_901>",
2075
+ "<reserved_902>",
2076
+ "<reserved_903>",
2077
+ "<reserved_904>",
2078
+ "<reserved_905>",
2079
+ "<reserved_906>",
2080
+ "<reserved_907>",
2081
+ "<reserved_908>",
2082
+ "<reserved_909>",
2083
+ "<reserved_910>",
2084
+ "<reserved_911>",
2085
+ "<reserved_912>",
2086
+ "<reserved_913>",
2087
+ "<reserved_914>",
2088
+ "<reserved_915>",
2089
+ "<reserved_916>",
2090
+ "<reserved_917>",
2091
+ "<reserved_918>",
2092
+ "<reserved_919>",
2093
+ "<reserved_920>",
2094
+ "<reserved_921>",
2095
+ "<reserved_922>",
2096
+ "<reserved_923>",
2097
+ "<reserved_924>",
2098
+ "<reserved_925>",
2099
+ "<reserved_926>",
2100
+ "<reserved_927>",
2101
+ "<reserved_928>",
2102
+ "<reserved_929>",
2103
+ "<reserved_930>",
2104
+ "<reserved_931>",
2105
+ "<reserved_932>",
2106
+ "<reserved_933>",
2107
+ "<reserved_934>",
2108
+ "<reserved_935>",
2109
+ "<reserved_936>",
2110
+ "<reserved_937>",
2111
+ "<reserved_938>",
2112
+ "<reserved_939>",
2113
+ "<reserved_940>",
2114
+ "<reserved_941>",
2115
+ "<reserved_942>",
2116
+ "<reserved_943>",
2117
+ "<reserved_944>",
2118
+ "<reserved_945>",
2119
+ "<reserved_946>",
2120
+ "<reserved_947>",
2121
+ "<reserved_948>",
2122
+ "<reserved_949>",
2123
+ "<reserved_950>",
2124
+ "<reserved_951>",
2125
+ "<reserved_952>",
2126
+ "<reserved_953>",
2127
+ "<reserved_954>",
2128
+ "<reserved_955>"
2129
+ ],
2130
+ "script": [
2131
+ "<SCRIPT-LATIN>",
2132
+ "<SCRIPT-PUNCTUATION>",
2133
+ "<SCRIPT-CYRILLIC>",
2134
+ "<SCRIPT-ARABIC>",
2135
+ "<SCRIPT-CHINESE>",
2136
+ "<SCRIPT-JAPANESE>",
2137
+ "<SCRIPT-KOREAN>",
2138
+ "<SCRIPT-SYMBOLS>",
2139
+ "<SCRIPT-GREEK>",
2140
+ "<SCRIPT-ARMENIAN>",
2141
+ "<SCRIPT-HEBREW>",
2142
+ "<SCRIPT-DEVANAGARI>",
2143
+ "<SCRIPT-BENGALI>",
2144
+ "<SCRIPT-GURMUKHI>",
2145
+ "<SCRIPT-GUJARATI>",
2146
+ "<SCRIPT-ORIYA>",
2147
+ "<SCRIPT-TAMIL>",
2148
+ "<SCRIPT-TELUGU>",
2149
+ "<SCRIPT-KANNADA>",
2150
+ "<SCRIPT-MALAYALAM>",
2151
+ "<SCRIPT-SINHALA>",
2152
+ "<SCRIPT-THAI>",
2153
+ "<SCRIPT-LAO>",
2154
+ "<SCRIPT-MYANMAR>",
2155
+ "<SCRIPT-GEORGIAN>",
2156
+ "<SCRIPT-ETHIOPIC>",
2157
+ "<SCRIPT-KHMER>",
2158
+ "<SCRIPT-MONGOLIAN>",
2159
+ "<SCRIPT-MATH>"
2160
+ ],
2161
+ "system": [
2162
+ "</S>",
2163
+ "<EOI>",
2164
+ "<IMAGE>",
2165
+ "<PAD>",
2166
+ "<NOP>",
2167
+ "<ROT>",
2168
+ "<OCR-WB>",
2169
+ "<OCR-WOB>",
2170
+ "<BLOCKS-WOB>",
2171
+ "<REG1>",
2172
+ "<REG2>",
2173
+ "<REG3>",
2174
+ "<REG4>",
2175
+ "<NO-MATH>"
2176
+ ]
2177
+ },
2178
+ "special_token_count": 4,
2179
+ "tasks": {
2180
+ "block_without_boxes": {
2181
+ "img_size": [
2182
+ 1024,
2183
+ 512
2184
+ ]
2185
+ },
2186
+ "ocr_with_boxes": {
2187
+ "img_size": [
2188
+ 1024,
2189
+ 256
2190
+ ]
2191
+ },
2192
+ "ocr_without_boxes": {
2193
+ "img_size": [
2194
+ 1024,
2195
+ 256
2196
+ ]
2197
+ }
2198
+ },
2199
+ "torch_dtype": "bfloat16",
2200
+ "transformers_version": "4.50.3",
2201
+ "unmask_image": false,
2202
+ "use_ce_loss": false,
2203
+ "vision_encoder": {
2204
+ "_attn_implementation_autoset": true,
2205
+ "_name_or_path": "",
2206
+ "add_cross_attention": false,
2207
+ "architectures": null,
2208
+ "bad_words_ids": null,
2209
+ "begin_suppress_tokens": null,
2210
+ "bos_token_id": null,
2211
+ "chunk_size_feed_forward": 0,
2212
+ "cross_attention_hidden_size": null,
2213
+ "decoder_start_token_id": null,
2214
+ "depth": 8,
2215
+ "diversity_penalty": 0.0,
2216
+ "do_sample": false,
2217
+ "early_stopping": false,
2218
+ "encoder_no_repeat_ngram_size": 0,
2219
+ "eos_token_id": null,
2220
+ "exponential_decay_length_penalty": null,
2221
+ "finetuning_task": null,
2222
+ "forced_bos_token_id": null,
2223
+ "forced_eos_token_id": null,
2224
+ "fullatt_block_indexes": [
2225
+ 3,
2226
+ 7
2227
+ ],
2228
+ "hidden_act": "silu",
2229
+ "hidden_size": 1280,
2230
+ "id2label": {
2231
+ "0": "LABEL_0",
2232
+ "1": "LABEL_1"
2233
+ },
2234
+ "image_size": [
2235
+ 1024,
2236
+ 256
2237
+ ],
2238
+ "in_channels": 3,
2239
+ "initializer_range": 0.02,
2240
+ "intermediate_size": 3420,
2241
+ "is_decoder": false,
2242
+ "is_encoder_decoder": false,
2243
+ "label2id": {
2244
+ "LABEL_0": 0,
2245
+ "LABEL_1": 1
2246
+ },
2247
+ "length_penalty": 1.0,
2248
+ "max_length": 20,
2249
+ "min_length": 0,
2250
+ "model_type": "qwen2_5_vl",
2251
+ "no_repeat_ngram_size": 0,
2252
+ "num_beam_groups": 1,
2253
+ "num_beams": 1,
2254
+ "num_heads": 16,
2255
+ "num_return_sequences": 1,
2256
+ "out_hidden_size": 1280,
2257
+ "output_attentions": false,
2258
+ "output_hidden_states": false,
2259
+ "output_scores": false,
2260
+ "pad_token_id": null,
2261
+ "patch_size": 14,
2262
+ "prefix": null,
2263
+ "problem_type": null,
2264
+ "pruned_heads": {},
2265
+ "remove_invalid_values": false,
2266
+ "repetition_penalty": 1.0,
2267
+ "return_dict": true,
2268
+ "return_dict_in_generate": false,
2269
+ "sep_token_id": null,
2270
+ "spatial_merge_size": 2,
2271
+ "spatial_patch_size": 14,
2272
+ "suppress_tokens": null,
2273
+ "task_specific_params": null,
2274
+ "temperature": 1.0,
2275
+ "temporal_patch_size": 1,
2276
+ "tf_legacy_loss": false,
2277
+ "tie_encoder_decoder": false,
2278
+ "tie_word_embeddings": true,
2279
+ "tokenizer_class": null,
2280
+ "tokens_per_second": 4,
2281
+ "top_k": 50,
2282
+ "top_p": 1.0,
2283
+ "torch_dtype": null,
2284
+ "torchscript": false,
2285
+ "typical_p": 1.0,
2286
+ "use_bfloat16": false,
2287
+ "window_size": 112
2288
+ },
2289
+ "vocab_size": 218225
2290
+ }
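The `tasks` block above is what ties each recognition mode to its expected input resolution, and `vocab_size` covers the base Qwen2 vocabulary plus the math, script, system, and reserved special tokens. A minimal sketch of pulling those values out with the standard library (the local path is an assumption based on this commit's layout):

```python
import json
from pathlib import Path

# Assumed local checkout of this commit's text recognition checkpoint.
config_path = Path("text_recognition/2025_05_16/config.json")
config = json.loads(config_path.read_text())

# Each task maps to the image size its inputs are resized/padded to.
for task, params in config["tasks"].items():
    print(f"{task}: img_size={params['img_size']}")

print("vocab_size:", config["vocab_size"])                             # 218225
print("encoder patch size:", config["vision_encoder"]["patch_size"])  # 14
```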
text_recognition/2025_05_16/manifest.json ADDED
@@ -0,0 +1 @@
+ {"files": ["model.safetensors", "added_tokens.json", "tokenizer_config.json", "special_tokens_map.json", "config.json", "README.md", "merges.txt", ".gitattributes", "vocab.json", "preprocessor_config.json"]}
text_recognition/2025_05_16/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
text_recognition/2025_05_16/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:30818e0e67898880036bb6b0738104366e8fe4197dc1f21b96c3c21d6ee2d671
+ size 1634909470
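`model.safetensors` is stored as a Git LFS pointer: the three lines above record the LFS spec version, the SHA-256 of the real weights, and their size in bytes (about 1.6 GB). After fetching the actual file, both fields can be verified with the standard library; a sketch, assuming the downloaded weights sit at the pointer's path:

```python
import hashlib
from pathlib import Path

# The real weights file (not the LFS pointer); path assumed from this commit.
weights = Path("text_recognition/2025_05_16/model.safetensors")

h = hashlib.sha256()
with weights.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
        h.update(chunk)

assert weights.stat().st_size == 1634909470
assert h.hexdigest() == "30818e0e67898880036bb6b0738104366e8fe4197dc1f21b96c3c21d6ee2d671"
print("LFS pointer matches the downloaded weights")
```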
text_recognition/2025_05_16/preprocessor_config.json ADDED
@@ -0,0 +1,23 @@
+ {
+ "do_convert_rgb": null,
+ "do_normalize": true,
+ "do_rescale": true,
+ "do_resize": false,
+ "image_mean": [
+ 0.485,
+ 0.456,
+ 0.406
+ ],
+ "image_processor_type": "ViTImageProcessor",
+ "image_std": [
+ 0.229,
+ 0.224,
+ 0.225
+ ],
+ "resample": 3,
+ "rescale_factor": 0.00392156862745098,
+ "size": {
+ "height": 1024,
+ "width": 256
+ }
+ }
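This preprocessor rescales pixels by 1/255 and then normalizes with the ImageNet mean/std; resizing is disabled (`"do_resize": false`), so inputs are expected to already match the configured size. A numpy sketch of the equivalent transform (this is an illustration, not the actual surya code path; the channel-first output layout is an assumption):

```python
import numpy as np

IMAGE_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGE_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image: np.ndarray) -> np.ndarray:
    """Apply the rescale + normalize steps from preprocessor_config.json."""
    x = image.astype(np.float32) * (1.0 / 255.0)  # rescale_factor = 0.00392156862745098
    x = (x - IMAGE_MEAN) / IMAGE_STD              # per-channel normalization
    return x.transpose(2, 0, 1)                   # HWC -> CHW (assumed model layout)

pixels = preprocess(np.zeros((1024, 256, 3), dtype=np.uint8))  # uint8 RGB input
print(pixels.shape)  # (3, 1024, 256)
```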
text_recognition/2025_05_16/special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
+ {
+ "additional_special_tokens": [
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|object_ref_start|>",
+ "<|object_ref_end|>",
+ "<|box_start|>",
+ "<|box_end|>",
+ "<|quad_start|>",
+ "<|quad_end|>",
+ "<|vision_start|>",
+ "<|vision_end|>",
+ "<|vision_pad|>",
+ "<|image_pad|>",
+ "<|video_pad|>"
+ ],
+ "eos_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
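Note that eos and pad map to the same token (`<|endoftext|>`), the usual Qwen2 convention. A quick sanity check with `transformers` (a sketch; `AutoTokenizer` reads this file together with tokenizer_config.json, and the local path is assumed):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("text_recognition/2025_05_16")  # assumed local path
print(tok.eos_token, tok.pad_token)        # <|endoftext|> <|endoftext|>
print(tok.eos_token_id, tok.pad_token_id)  # 151643 151643 (per added_tokens_decoder)
```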
text_recognition/2025_05_16/tokenizer_config.json ADDED
@@ -0,0 +1,208 @@
+ {
+ "add_bos_token": false,
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "151643": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151644": {
+ "content": "<|im_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151645": {
+ "content": "<|im_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151646": {
+ "content": "<|object_ref_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151647": {
+ "content": "<|object_ref_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151648": {
+ "content": "<|box_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151649": {
+ "content": "<|box_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151650": {
+ "content": "<|quad_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151651": {
+ "content": "<|quad_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151652": {
+ "content": "<|vision_start|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151653": {
+ "content": "<|vision_end|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151654": {
+ "content": "<|vision_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151655": {
+ "content": "<|image_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151656": {
+ "content": "<|video_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "151657": {
+ "content": "<tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151658": {
+ "content": "</tool_call>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151659": {
+ "content": "<|fim_prefix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151660": {
+ "content": "<|fim_middle|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151661": {
+ "content": "<|fim_suffix|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151662": {
+ "content": "<|fim_pad|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151663": {
+ "content": "<|repo_name|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "151664": {
+ "content": "<|file_sep|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "additional_special_tokens": [
+ "<|im_start|>",
+ "<|im_end|>",
+ "<|object_ref_start|>",
+ "<|object_ref_end|>",
+ "<|box_start|>",
+ "<|box_end|>",
+ "<|quad_start|>",
+ "<|quad_end|>",
+ "<|vision_start|>",
+ "<|vision_end|>",
+ "<|vision_pad|>",
+ "<|image_pad|>",
+ "<|video_pad|>"
+ ],
+ "bos_token": null,
+ "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0]['role'] == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0]['content'] + '<|im_end|>\\n' }}\n {%- else %}\n {{- '<|im_start|>system\\nYou are a helpful assistant.<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}\n {{- '<|im_start|>' + message.role + '\\n' + message.content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role }}\n {%- if message.content %}\n {{- '\\n' + message.content }}\n {%- endif %}\n {%- for tool_call in message.tool_calls %}\n {%- if tool_call.function is defined %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '\\n<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- message.content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}\n",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|endoftext|>",
+ "errors": "replace",
+ "extra_special_tokens": {},
+ "model_max_length": 131072,
+ "pad_token": "<|endoftext|>",
+ "split_special_tokens": false,
+ "tokenizer_class": "Qwen2Tokenizer",
+ "unk_token": null
+ }
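The embedded `chat_template` is the stock Qwen2.5 ChatML template (including tool-calling support) inherited from the base tokenizer rather than anything OCR-specific. If needed, it can be exercised through the standard `apply_chat_template` API; a sketch, with the local path assumed as before:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("text_recognition/2025_05_16")  # assumed local path
messages = [{"role": "user", "content": "hello"}]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# hello<|im_end|>
# <|im_start|>assistant
```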
text_recognition/2025_05_16/vocab.json ADDED
The diff for this file is too large to render. See raw diff