colibrisson1 committed
Commit 662a0c6 · 1 Parent(s): ccbc2b4

Initial commit
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,174 @@
---
license: cc-by-nc-4.0
license_name: Creative Commons Attribution-NonCommercial 4.0 International
license_link: https://creativecommons.org/licenses/by-nc/4.0/
tags:
- ocr
- htr
- vision-language-model
- historical-documents
- chinese
- classical-chinese
- image-to-text
library_name: transformers
pipeline_tag: image-to-text
---

# AnandaSky

**AnandaSky** is a vision-language model for line-level transcription of historical sinographic documents.

The name combines *Ananda*—the disciple of the Buddha traditionally associated with the "encoding" of early Buddhist texts—and *Sky*, the opening character of the *Thousand Character Classic*, a text long used in premodern China to enumerate items.

## Paper

This model is described in the following paper:

[**AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents**](https://hal.science/view/index/docid/5548531)

## Model Overview

AnandaSky is a vision-language model for efficient transcription of historical sinographic line images. It contains approximately **626M parameters** and combines:

- a Vision Transformer (ViT) encoder
- an autoregressive Qwen3-based decoder

The model was trained on **4 million line images** extracted from historical documents produced in China and Korea between the **8th and 20th centuries**, including both printed editions and handwritten manuscripts.

A full description of the datasets, preprocessing pipeline, and training procedure is provided in the accompanying paper.

## Evaluation Results

The model achieves the following character error rates (CER) on in-domain test sets:

| Dataset | CER |
|---|---:|
| MTHv2 | 0.92% |
| Sibu Congkan | 0.43% |
| Korean Anthologies | 0.33% |
| Dunhuang Manuscripts | 1.38% |
| Qing Legal Documents | 4.89% |

On held-out benchmarks, it achieves:

| Dataset | CER |
|---|---:|
| ICDAR2019-HDRC | 0.96% |
| CUHK Challenge 2021 | 0.82% |
| CUHK Challenge 2022 | 1.61% |

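For reference, CER is the character-level Levenshtein (edit) distance between the predicted and ground-truth transcriptions, divided by the number of ground-truth characters. A minimal sketch of the metric (this `cer` helper is illustrative and not part of the repository):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / number of reference characters."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # edit distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            substitution_cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                     # deletion
                curr[j - 1] + 1,                 # insertion
                prev[j - 1] + substitution_cost  # substitution
            )
        prev = curr
    return prev[n] / max(m, 1)

print(cer("天地玄黃", "天地玄黄"))  # one substituted character out of four -> 0.25
```

For large-scale evaluation, a library such as `jiwer` implements the same metric.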
## Intended Use

AnandaSky is intended for line-level transcription of historical sinographic documents. It can process both single-column and double-column vertical text layouts.

## Transcription Normalization

If you notice that the model systematically produces an incorrect transcription for a specific character or glyph form, please consider opening an issue in the repository. Such reports are valuable for improving the normalization pipeline and future model releases.

## Hardware and Dependencies

This model has a hard dependency on **FlashAttention**.

### Required Environment

- Python >= 3.10
- PyTorch >= 2.1
- NVIDIA Ampere-or-newer GPU
- `transformers`
- `flash-attn`

### Install FlashAttention

```bash
pip install flash-attn --no-build-isolation
```

> ⚠️ **FlashAttention is required.** The model will not run without it.

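Because FlashAttention kernels only support Ampere-class (compute capability 8.0) and newer NVIDIA GPUs, a quick pre-flight check can save a confusing import error. This is an illustrative sketch (the helper name is ours, not part of the repository); feed it the values of `torch.cuda.is_available()` and `torch.cuda.get_device_capability(0)`:

```python
def flash_attn_supported(cuda_available: bool, compute_capability: tuple) -> bool:
    """FlashAttention requires an NVIDIA GPU with compute capability >= 8.0 (Ampere or newer)."""
    return bool(cuda_available) and tuple(compute_capability) >= (8, 0)

# Example: an A100 reports capability (8, 0); a T4 reports (7, 5).
print(flash_attn_supported(True, (8, 0)))  # True
print(flash_attn_supported(True, (7, 5)))  # False
```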
## Loading the Model

Because this repository uses custom Transformers modeling code, the model must be loaded with `trust_remote_code=True`.

### Example

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```

## Minimal Inference Example

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = torch.device("cuda")
DTYPE = torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype=DTYPE,
)
model = model.to(DEVICE)

image = Image.open("line_image.png")

processor = AutoProcessor.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
)

inputs = processor(images=image, return_tensors="pt")

# Move inputs to the GPU; pixel values also need the model dtype.
inputs["input_ids"] = inputs["input_ids"].to(device=DEVICE, non_blocking=True)
inputs["attention_mask"] = inputs["attention_mask"].to(device=DEVICE, non_blocking=True)
inputs["pixel_values"] = inputs["pixel_values"].to(device=DEVICE, dtype=DTYPE, non_blocking=True)
inputs["patch_attention_mask"] = inputs["patch_attention_mask"].to(device=DEVICE, non_blocking=True)

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=DTYPE, enabled=True):
        output = model.generate(
            **inputs,
            use_cache=True,
        )

# Skip the first token (the generation prompt) before decoding.
text = processor.decode(output[0, 1:], skip_special_tokens=True).strip()
print(text)
```

## License

This model is released under the **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license.

It may be used for **research, academic, and other non-commercial purposes only**. Commercial use is **not permitted** without prior permission from the authors.

## Citation

```bibtex
@inproceedings{brisson:hal-05548531,
  TITLE = {{AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents}},
  AUTHOR = {Brisson, Colin and Kahfy, Ayoub and Constant, Fr{\'e}d{\'e}ric and Bui, Marc},
  URL = {https://hal.science/hal-05548531},
  NOTE = {BnF DataLab Projet READ\_Chinese},
  BOOKTITLE = {{The Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026)}},
  ADDRESS = {Majorca, Spain},
  YEAR = {2026},
  MONTH = May,
  KEYWORDS = {Dunhuang manuscripts ; long-tailed distribution ; vision-language models ; HTR ; OCR ; Classical Chinese ; Historical documents},
  PDF = {https://hal.science/hal-05548531v1/file/AnandaSky_Technical_Report.pdf},
  HAL_ID = {hal-05548531},
  HAL_VERSION = {v1},
}
```

## Contact

For questions, bug reports, or feedback, please open an issue in the repository.
added_tokens.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c0284b582e14987fbd3d5a2cb2bd139084371ed9acbae488829a1c900833c680
size 707
chat_template.jinja ADDED
@@ -0,0 +1,89 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}
config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:632e450a6764e7e8f4f343c6c42d1073093bfd56813443211ed957507b1fbb92
size 2073
generation_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7908ed1e22dee278b73b06e307117a0479c2e2d80e6931156a5793790f72e60b
size 147
inference_processor.py ADDED
@@ -0,0 +1,454 @@
from __future__ import annotations

import json
import math
import os
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union

import numpy as np
from PIL import Image

import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, BatchFeature, ProcessorMixin
from transformers.image_processing_utils import BaseImageProcessor
from transformers.utils import TensorType, cached_file


CONFIG_NAME = "config.json"
PREPROCESSOR_CONFIG_NAME = "preprocessor_config.json"
PROCESSOR_CONFIG_NAME = "processor_config.json"


ImageLike = Union[Image.Image, np.ndarray, torch.Tensor]


def _select_cached_file_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    allowed = {
        "cache_dir",
        "force_download",
        "proxies",
        "token",
        "local_files_only",
        "revision",
        "subfolder",
    }
    out = {k: v for k, v in kwargs.items() if k in allowed}
    out.setdefault("_raise_exceptions_for_missing_entries", False)
    out.setdefault("_raise_exceptions_for_gated_repo", False)
    out.setdefault("_raise_exceptions_for_connection_errors", False)
    return out


def _resolve_repo_file(pretrained_model_name_or_path: Union[str, os.PathLike], filename: str, **kwargs) -> Optional[str]:
    path = str(pretrained_model_name_or_path)

    if os.path.isdir(path):
        candidate = os.path.join(path, filename)
        return candidate if os.path.exists(candidate) else None

    if os.path.isfile(path):
        return path if os.path.basename(path) == filename else None

    try:
        return cached_file(path, filename, **_select_cached_file_kwargs(kwargs))
    except Exception:
        return None


def _load_json_file(path: str) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def _load_image_processor_dict(pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> Dict[str, Any]:
    processor_path = _resolve_repo_file(pretrained_model_name_or_path, PROCESSOR_CONFIG_NAME, **kwargs)
    if processor_path is not None:
        processor_dict = _load_json_file(processor_path)
        nested = processor_dict.get("image_processor")
        if isinstance(nested, dict):
            return nested

    preprocessor_path = _resolve_repo_file(pretrained_model_name_or_path, PREPROCESSOR_CONFIG_NAME, **kwargs)
    if preprocessor_path is not None:
        return _load_json_file(preprocessor_path)

    config_path = _resolve_repo_file(pretrained_model_name_or_path, CONFIG_NAME, **kwargs)
    if config_path is not None:
        return _load_json_file(config_path)

    raise FileNotFoundError(
        f"Could not find {PREPROCESSOR_CONFIG_NAME}, {PROCESSOR_CONFIG_NAME}, or {CONFIG_NAME} in {pretrained_model_name_or_path!r}."
    )


class AnandaImageProcessor(BaseImageProcessor):
    """Image processor for Ananda OCR-style visual prefix inputs.

    Behavior mirrored from the development inference path:
    1. Convert to RGB / 3 channels.
    2. Convert to CHW float32 in [0, 1].
    3. Normalize with config mean/std.
    4. Pad H/W up to a multiple of patch_size.
    5. Pad again up to a multiple of patch_size * merge_factor.
    6. Emit `pixel_values` and `patch_attention_mask`.
    """

    model_input_names = ["pixel_values", "patch_attention_mask"]

    def __init__(
        self,
        patch_size: int = 16,
        merge_factor: int = 1,
        do_convert_rgb: bool = True,
        do_rescale: bool = True,
        rescale_factor: float = 1.0 / 255.0,
        do_normalize: bool = True,
        image_mean: Optional[Sequence[float]] = None,
        image_std: Optional[Sequence[float]] = None,
        pad_value: float = 0.0,
        processor_class: Optional[str] = "AnandaProcessor",
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.patch_size = int(patch_size)
        self.merge_factor = max(int(merge_factor), 1)
        self.do_convert_rgb = bool(do_convert_rgb)
        self.do_rescale = bool(do_rescale)
        self.rescale_factor = float(rescale_factor)
        self.do_normalize = bool(do_normalize)
        self.image_mean = list(image_mean) if image_mean is not None else [0.5, 0.5, 0.5]
        self.image_std = list(image_std) if image_std is not None else [0.5, 0.5, 0.5]
        self.pad_value = float(pad_value)
        self.processor_class = processor_class

    @classmethod
    def from_model_config(cls, model_config: Union[Dict[str, Any], Any]) -> "AnandaImageProcessor":
        if isinstance(model_config, dict):
            cfg = model_config
        else:
            cfg = vars(model_config)

        return cls(
            patch_size=int(cfg.get("patch_size", 16)),
            merge_factor=int(cfg.get("encoder_2d_merge_factor", 1)),
            image_mean=cfg.get("image_normalization_mean", [0.5, 0.5, 0.5]),
            image_std=cfg.get("image_normalization_std", [0.5, 0.5, 0.5]),
        )

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs: Any) -> "AnandaImageProcessor":
        config_dict = _load_image_processor_dict(pretrained_model_name_or_path, **kwargs)
        nested = config_dict.get("image_processor")
        if isinstance(nested, dict):
            config_dict = nested

        return cls(
            patch_size=int(config_dict.get("patch_size", 16)),
            merge_factor=int(config_dict.get("merge_factor", config_dict.get("encoder_2d_merge_factor", 1))),
            do_convert_rgb=bool(config_dict.get("do_convert_rgb", True)),
            do_rescale=bool(config_dict.get("do_rescale", True)),
            rescale_factor=float(config_dict.get("rescale_factor", 1.0 / 255.0)),
            do_normalize=bool(config_dict.get("do_normalize", True)),
            image_mean=config_dict.get("image_mean", config_dict.get("image_normalization_mean", [0.5, 0.5, 0.5])),
            image_std=config_dict.get("image_std", config_dict.get("image_normalization_std", [0.5, 0.5, 0.5])),
            pad_value=float(config_dict.get("pad_value", 0.0)),
            processor_class=config_dict.get("processor_class", "AnandaProcessor"),
        )

    def to_dict(self) -> Dict[str, Any]:
        return {
            "image_processor_type": self.__class__.__name__,
            "processor_class": self.processor_class,
            "auto_map": {
                "AutoImageProcessor": "inference_processor.AnandaImageProcessor",
                "AutoProcessor": "inference_processor.AnandaProcessor",
            },
            "patch_size": self.patch_size,
            "merge_factor": self.merge_factor,
            "do_convert_rgb": self.do_convert_rgb,
            "do_rescale": self.do_rescale,
            "rescale_factor": self.rescale_factor,
            "do_normalize": self.do_normalize,
            "image_mean": list(self.image_mean),
            "image_std": list(self.image_std),
            "pad_value": self.pad_value,
        }

    def save_pretrained(self, save_directory: Union[str, os.PathLike], **_: Any) -> List[str]:
        os.makedirs(save_directory, exist_ok=True)
        output_path = os.path.join(save_directory, PREPROCESSOR_CONFIG_NAME)
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f, ensure_ascii=False, indent=2)
        return [output_path]

    @staticmethod
    def _ensure_list(images: Union[ImageLike, Sequence[ImageLike]]) -> List[ImageLike]:
        if isinstance(images, (list, tuple)):
            return list(images)
        return [images]

    def _to_chw_uint8(self, image: ImageLike) -> torch.Tensor:
        if isinstance(image, Image.Image):
            img = image.convert("RGB") if self.do_convert_rgb else image
            arr = np.array(img, dtype=np.uint8)
            tensor = torch.from_numpy(arr)
            if tensor.ndim == 2:
                tensor = tensor.unsqueeze(-1)
            tensor = tensor.permute(2, 0, 1).contiguous()
        elif isinstance(image, np.ndarray):
            arr = image
            if arr.ndim == 2:
                arr = arr[..., None]
            if arr.ndim != 3:
                raise ValueError(f"Expected 2D or 3D ndarray image, got shape={arr.shape}")
            tensor = torch.from_numpy(arr)
            if tensor.shape[0] in (1, 3, 4):
                pass
            elif tensor.shape[-1] in (1, 3, 4):
                tensor = tensor.permute(2, 0, 1)
            else:
                raise ValueError(f"Could not infer channel dimension from ndarray shape={arr.shape}")
            tensor = tensor.contiguous()
        elif torch.is_tensor(image):
            tensor = image.detach().cpu()
            if tensor.ndim == 2:
                tensor = tensor.unsqueeze(0)
            if tensor.ndim != 3:
                raise ValueError(f"Expected 2D or 3D tensor image, got shape={tuple(tensor.shape)}")
            if tensor.shape[0] in (1, 3, 4):
                pass
            elif tensor.shape[-1] in (1, 3, 4):
                tensor = tensor.permute(2, 0, 1)
            else:
                raise ValueError(f"Could not infer channel dimension from tensor shape={tuple(tensor.shape)}")
            tensor = tensor.contiguous()
        else:
            raise TypeError(f"Unsupported image type: {type(image)!r}")

        if tensor.shape[0] == 1:
            tensor = tensor.expand(3, -1, -1)
        elif tensor.shape[0] == 4:
            tensor = tensor[:3]
        elif tensor.shape[0] != 3:
            raise ValueError(f"Expected 1, 3, or 4 channels, got {tensor.shape[0]}")

        if tensor.dtype.is_floating_point:
            max_val = float(tensor.max().item()) if tensor.numel() else 0.0
            if max_val <= 1.0 + 1e-6:
                tensor = tensor * 255.0
            tensor = tensor.round().clamp_(0.0, 255.0).to(torch.uint8)
        else:
            tensor = tensor.clamp_(0, 255).to(torch.uint8)

        return tensor.contiguous()

    def _normalize(self, chw_u8: torch.Tensor) -> torch.Tensor:
        x = chw_u8.to(torch.float32)
        if self.do_rescale:
            x = x * self.rescale_factor
        mean = torch.tensor(self.image_mean, dtype=torch.float32).view(3, 1, 1)
        std = torch.tensor(self.image_std, dtype=torch.float32).view(3, 1, 1)
        if self.do_normalize:
            x = (x - mean) / std
        return x

    def _pad_to_patch_multiple(self, img: torch.Tensor) -> torch.Tensor:
        _, h, w = img.shape
        p = self.patch_size
        target_h = int(math.ceil(h / p) * p)
        target_w = int(math.ceil(w / p) * p)
        if target_h != h or target_w != w:
            img = F.pad(img, (0, target_w - w, 0, target_h - h), value=self.pad_value)
        return img

    def _pad_for_merge_factor(self, img_norm: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        if img_norm.ndim != 3:
            raise ValueError(f"Expected image tensor with shape (3,H,W), got {tuple(img_norm.shape)}")

        p = self.patch_size
        m = self.merge_factor
        base = p * m
        _, h, w = img_norm.shape

        if h % p != 0 or w % p != 0:
            raise ValueError(f"Image must be patch-multiple before merge padding, got H={h}, W={w}, patch_size={p}")

        target_h = int(math.ceil(h / base) * base)
        target_w = int(math.ceil(w / base) * base)

        ph, pw = h // p, w // p
        target_ph, target_pw = target_h // p, target_w // p

        mask_2d = torch.ones((ph, pw), dtype=torch.bool)
        if target_ph != ph or target_pw != pw:
            mask_2d = F.pad(mask_2d, (0, target_pw - pw, 0, target_ph - ph), value=False)

        if target_h != h or target_w != w:
            img_norm = F.pad(img_norm, (0, target_w - w, 0, target_h - h), value=self.pad_value)

        return img_norm, mask_2d.reshape(-1).to(torch.long)

    def _preprocess_single(self, image: ImageLike) -> Tuple[torch.Tensor, torch.Tensor]:
        chw_u8 = self._to_chw_uint8(image)
        img = self._normalize(chw_u8)
        img = self._pad_to_patch_multiple(img)
        return self._pad_for_merge_factor(img)

    def preprocess(
        self,
        images: Union[ImageLike, Sequence[ImageLike]],
        return_tensors: Optional[Union[str, TensorType]] = None,
        **_: Any,
    ) -> BatchFeature:
        image_list = self._ensure_list(images)
        if len(image_list) == 0:
            raise ValueError("`images` must contain at least one image")

        processed: List[torch.Tensor] = []
        patch_masks: List[torch.Tensor] = []
        for image in image_list:
            px, pm = self._preprocess_single(image)
            processed.append(px)
            patch_masks.append(pm)

        max_h = max(t.shape[1] for t in processed)
        max_w = max(t.shape[2] for t in processed)
        p = self.patch_size
        batch_patch_h = max_h // p
        batch_patch_w = max_w // p

        batch_pixels: List[torch.Tensor] = []
        batch_masks: List[torch.Tensor] = []
        for px, pm in zip(processed, patch_masks):
            _, h, w = px.shape
            ph, pw = h // p, w // p

            if h != max_h or w != max_w:
                px = F.pad(px, (0, max_w - w, 0, max_h - h), value=self.pad_value)

            pm_2d = pm.view(ph, pw).to(torch.bool)
            if ph != batch_patch_h or pw != batch_patch_w:
                pm_2d = F.pad(pm_2d, (0, batch_patch_w - pw, 0, batch_patch_h - ph), value=False)

            batch_pixels.append(px)
            batch_masks.append(pm_2d.reshape(-1).to(torch.long))

        data = {
            "pixel_values": torch.stack(batch_pixels, dim=0),
            "patch_attention_mask": torch.stack(batch_masks, dim=0),
        }
        return BatchFeature(data=data, tensor_type=return_tensors)

    __call__ = preprocess


class AnandaProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"
    model_input_names = ["input_ids", "attention_mask", "pixel_values", "patch_attention_mask"]

    def __init__(self, image_processor: AnandaImageProcessor, tokenizer, **kwargs: Any) -> None:
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.current_processor = self.image_processor
        self._in_target_context_manager = False
        super().__init__(image_processor, tokenizer, **kwargs)

    @classmethod
    def from_model_config(cls, tokenizer, model_config: Union[Dict[str, Any], Any]) -> "AnandaProcessor":
        image_processor = AnandaImageProcessor.from_model_config(model_config)
        return cls(image_processor=image_processor, tokenizer=tokenizer)

    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        trust_remote_code: bool = True,
        **kwargs: Any,
    ) -> "AnandaProcessor":
        tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path,
            trust_remote_code=trust_remote_code,
            **kwargs,
        )
        image_processor = AnandaImageProcessor.from_pretrained(pretrained_model_name_or_path, **kwargs)
        return cls(image_processor=image_processor, tokenizer=tokenizer)

    def save_pretrained(self, save_directory: Union[str, os.PathLike], **kwargs: Any) -> List[str]:
        os.makedirs(save_directory, exist_ok=True)

        saved_files: List[str] = []
        saved_files.extend(self.image_processor.save_pretrained(save_directory))
        saved_files.extend(self.tokenizer.save_pretrained(save_directory))

        processor_dict = {
            "processor_class": self.__class__.__name__,
            "auto_map": {"AutoProcessor": "inference_processor.AnandaProcessor"},
            "image_processor": self.image_processor.to_dict(),
        }
        output_path = os.path.join(save_directory, PROCESSOR_CONFIG_NAME)
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(processor_dict, f, ensure_ascii=False, indent=2)
        saved_files.append(output_path)
        return saved_files

    def __call__(
        self,
        text: Optional[Union[str, Sequence[str]]] = None,
        images: Optional[Union[ImageLike, Sequence[ImageLike]]] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        add_special_tokens: bool = True,
        **kwargs: Any,
    ) -> BatchFeature:
        if text is None and images is None:
            raise ValueError("At least one of `text` or `images` must be provided")

        encoding: Dict[str, Any] = {}

        if images is not None:
            image_features = self.image_processor(images=images, return_tensors=return_tensors)
            encoding.update(image_features)
            batch_size = int(image_features["pixel_values"].shape[0])
        else:
            batch_size = None

        if text is None:
            bos_id = self.tokenizer.bos_token_id
            eos_id = self.tokenizer.eos_token_id
            prompt_id = bos_id if bos_id is not None else eos_id
            if prompt_id is None:
                raise ValueError("Tokenizer must define bos_token_id or eos_token_id.")
            if batch_size is None:
                batch_size = 1

            input_ids = [[int(prompt_id)] for _ in range(batch_size)]
            attention_mask = [[1] for _ in range(batch_size)]
            if return_tensors == "pt" or return_tensors == TensorType.PYTORCH:
                encoding["input_ids"] = torch.tensor(input_ids, dtype=torch.long)
                encoding["attention_mask"] = torch.tensor(attention_mask, dtype=torch.long)
            else:
                encoding["input_ids"] = input_ids
                encoding["attention_mask"] = attention_mask
        else:
            text_encoding = self.tokenizer(
                text,
                add_special_tokens=add_special_tokens,
                return_tensors=return_tensors,
                **kwargs,
            )
            encoding.update(text_encoding)

        return BatchFeature(data=encoding, tensor_type=return_tensors)

    def batch_decode(self, *args: Any, **kwargs: Any):
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args: Any, **kwargs: Any):
        return self.tokenizer.decode(*args, **kwargs)

    def apply_chat_template(self, *args: Any, **kwargs: Any):
        return self.tokenizer.apply_chat_template(*args, **kwargs)
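The padding scheme implemented by `AnandaImageProcessor` above (docstring steps 4-5) rounds each spatial dimension up to a multiple of `patch_size`, then up to a multiple of `patch_size * merge_factor`. The arithmetic can be checked in isolation with a small stand-alone helper (illustrative, not part of the repository):

```python
import math

def pad_target(size: int, patch_size: int, merge_factor: int = 1) -> int:
    """Round `size` up as the image processor does: first to a patch multiple,
    then to a multiple of patch_size * merge_factor."""
    base = patch_size * max(merge_factor, 1)
    patch_multiple = math.ceil(size / patch_size) * patch_size
    return math.ceil(patch_multiple / base) * base

print(pad_target(100, 16))     # 112 (7 patches of 16)
print(pad_target(100, 16, 2))  # 128 (next multiple of 32)
```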
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7fa72c09c7ba378981b6fc6c90cb6d49d5cd69267cd759f037ec269a3377c73d
size 3126311968
modeling_anandasky.py ADDED
The diff for this file is too large to render. See raw diff
 
processor_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f04692ec710107538335a4ba0c5ce6f7946bf042e122281da6117da5d10350d2
size 748
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:51d28ee6e40cc506625950bf3983cdf9212d8f939dd52622ee4efd7ee3342a8b
size 756
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:87d541a08b9cd877495739a38ee30bac3164b9b680cb0cd67131e1c15a0b081e
size 5412
vocab.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910
size 2776833