colibrisson1 committed
Commit 662a0c6 · 1 Parent(s): ccbc2b4

Initial commit
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,174 @@
---
license: cc-by-nc-4.0
license_name: Creative Commons Attribution-NonCommercial 4.0 International
license_link: https://creativecommons.org/licenses/by-nc/4.0/
tags:
- ocr
- htr
- vision-language-model
- historical-documents
- chinese
- classical-chinese
- image-to-text
library_name: transformers
pipeline_tag: image-to-text
---

# AnandaSky

**AnandaSky** is a vision-language model for line-level transcription of historical sinographic documents.

The name combines *Ananda*—the disciple of the Buddha traditionally associated with the "encoding" of early Buddhist texts—and *Sky*, the opening character of the *Thousand Character Classic*, a text long used in premodern China to enumerate items.

## Paper

This model is described in the following paper:

[**AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents**](https://hal.science/view/index/docid/5548531)

## Model Overview

AnandaSky is a vision-language model for efficient transcription of historical sinographic line images. It contains approximately **626M parameters** and combines:

- a Vision Transformer (ViT) encoder
- an autoregressive Qwen3-based decoder

The model was trained on **4 million line images** extracted from historical documents produced in China and Korea between the **8th and 20th centuries**, including both printed editions and handwritten manuscripts.

A full description of the datasets, preprocessing pipeline, and training procedure is provided in the accompanying paper.

## Evaluation Results

The model achieves the following character error rates (CER) on in-domain test sets:

| Dataset | CER |
|---|---:|
| MTHv2 | 0.92% |
| Sibu Congkan | 0.43% |
| Korean Anthologies | 0.33% |
| Dunhuang Manuscripts | 1.38% |
| Qing Legal Documents | 4.89% |

On held-out benchmarks, it achieves:

| Dataset | CER |
|---|---:|
| ICDAR2019-HDRC | 0.96% |
| CUHK Challenge 2021 | 0.82% |
| CUHK Challenge 2022 | 1.61% |

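For reference, CER is the character-level Levenshtein (edit) distance between the predicted and ground-truth transcriptions, divided by the number of ground-truth characters. A minimal sketch of the metric (this `cer` helper is illustrative and not part of the repository):

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / number of reference characters."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # edit distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            substitution_cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                     # deletion
                curr[j - 1] + 1,                 # insertion
                prev[j - 1] + substitution_cost  # substitution
            )
        prev = curr
    return prev[n] / max(m, 1)

print(cer("天地玄黃", "天地玄黄"))  # one substituted character out of four -> 0.25
```

For large-scale evaluation, a library such as `jiwer` implements the same metric.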
## Intended Use

AnandaSky is intended for line-level transcription of historical sinographic documents. It can process both single-column and double-column vertical text layouts.

## Transcription Normalization

If you notice that the model systematically produces an incorrect transcription for a specific character or glyph form, please consider opening an issue in the repository. Such reports are valuable for improving the normalization pipeline and future model releases.

## Hardware and Dependencies

This model has a hard dependency on **FlashAttention**.

### Required Environment

- Python >= 3.10
- PyTorch >= 2.1
- NVIDIA Ampere-or-newer GPU
- `transformers`
- `flash-attn`

### Install FlashAttention

```bash
pip install flash-attn --no-build-isolation
```

> ⚠️ **FlashAttention is required.** The model will not run without it.

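Because FlashAttention kernels only support Ampere-class (compute capability 8.0) and newer NVIDIA GPUs, a quick pre-flight check can save a confusing import error. This is an illustrative sketch (the helper name is ours, not part of the repository); feed it the values of `torch.cuda.is_available()` and `torch.cuda.get_device_capability(0)`:

```python
def flash_attn_supported(cuda_available: bool, compute_capability: tuple) -> bool:
    """FlashAttention requires an NVIDIA GPU with compute capability >= 8.0 (Ampere or newer)."""
    return bool(cuda_available) and tuple(compute_capability) >= (8, 0)

# Example: an A100 reports capability (8, 0); a T4 reports (7, 5).
print(flash_attn_supported(True, (8, 0)))  # True
print(flash_attn_supported(True, (7, 5)))  # False
```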
## Loading the Model

Because this repository uses custom Transformers modeling code, the model must be loaded with `trust_remote_code=True`.

### Example

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```

## Minimal Inference Example

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = torch.device("cuda")
DTYPE = torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
    torch_dtype=DTYPE,
)
model = model.to(DEVICE)

image = Image.open("line_image.png")

processor = AutoProcessor.from_pretrained(
    "badianeai/AnandaSky",
    trust_remote_code=True,
)

inputs = processor(images=image, return_tensors="pt")

# Move inputs to the GPU; pixel values also need the model dtype.
inputs["input_ids"] = inputs["input_ids"].to(device=DEVICE, non_blocking=True)
inputs["attention_mask"] = inputs["attention_mask"].to(device=DEVICE, non_blocking=True)
inputs["pixel_values"] = inputs["pixel_values"].to(device=DEVICE, dtype=DTYPE, non_blocking=True)
inputs["patch_attention_mask"] = inputs["patch_attention_mask"].to(device=DEVICE, non_blocking=True)

with torch.no_grad():
    with torch.autocast(device_type="cuda", dtype=DTYPE, enabled=True):
        output = model.generate(
            **inputs,
            use_cache=True,
        )

# Skip the first token (the generation prompt) before decoding.
text = processor.decode(output[0, 1:], skip_special_tokens=True).strip()
print(text)
```

## License

This model is released under the **Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)** license.

It may be used for **research, academic, and other non-commercial purposes only**. Commercial use is **not permitted** without prior permission from the authors.

## Citation

```bibtex
@inproceedings{brisson:hal-05548531,
  TITLE = {{AnandaSky: A Vision-Language Model for Line-Level Transcription of Historical Sinographic Documents}},
  AUTHOR = {Brisson, Colin and Kahfy, Ayoub and Constant, Fr{\'e}d{\'e}ric and Bui, Marc},
  URL = {https://hal.science/hal-05548531},
  NOTE = {BnF DataLab Projet READ\_Chinese},
  BOOKTITLE = {{The Fourth Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA 2026)}},
  ADDRESS = {Majorca, Spain},
  YEAR = {2026},
  MONTH = May,
  KEYWORDS = {Dunhuang manuscripts ; long-tailed distribution ; vision-language models ; HTR ; OCR ; Classical Chinese ; Historical documents},
  PDF = {https://hal.science/hal-05548531v1/file/AnandaSky_Technical_Report.pdf},
  HAL_ID = {hal-05548531},
  HAL_VERSION = {v1},
}
```

## Contact

For questions, bug reports, or feedback, please open an issue in the repository.
added_tokens.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c0284b582e14987fbd3d5a2cb2bd139084371ed9acbae488829a1c900833c680
size 707
chat_template.jinja ADDED
@@ -0,0 +1,89 @@
{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- messages[0].content + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson }}
{%- endfor %}
{{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + messages[0].content + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith('<tool_response>') and message.content.endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- if message.content is string %}
{%- set content = message.content %}
{%- else %}
{%- set content = '' %}
{%- endif %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{{- '<|im_start|>' + message.role + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = message.reasoning_content %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index %}
{%- if loop.last or (not loop.last and reasoning_content) %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.strip('\n') + '\n</think>\n\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if (loop.first and content) or (not loop.first) %}
{{- '\n' }}
{%- endif %}
{%- if tool_call.function %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n{"name": "' }}
{{- tool_call.name }}
{{- '", "arguments": ' }}
{%- if tool_call.arguments is string %}
{{- tool_call.arguments }}
{%- else %}
{{- tool_call.arguments | tojson }}
{%- endif %}
{{- '}\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>user' }}
{%- endif %}
{{- '\n<tool_response>\n' }}
{{- content }}
{{- '\n</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking is defined and enable_thinking is false %}
{{- '<think>\n\n</think>\n\n' }}
{%- endif %}
{%- endif %}
config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:632e450a6764e7e8f4f343c6c42d1073093bfd56813443211ed957507b1fbb92
size 2073
generation_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7908ed1e22dee278b73b06e307117a0479c2e2d80e6931156a5793790f72e60b
size 147
inference_processor.py ADDED
@@ -0,0 +1,454 @@
from __future__ import annotations

import json
import math
import os
from pathlib import Path
from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple, Union

import numpy as np
from PIL import Image

import torch
import torch.nn.functional as F

from transformers import AutoTokenizer, BatchFeature, ProcessorMixin
from transformers.image_processing_utils import BaseImageProcessor
from transformers.utils import TensorType, cached_file


CONFIG_NAME = "config.json"
PREPROCESSOR_CONFIG_NAME = "preprocessor_config.json"
PROCESSOR_CONFIG_NAME = "processor_config.json"


ImageLike = Union[Image.Image, np.ndarray, torch.Tensor]


def _select_cached_file_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    allowed = {
        "cache_dir",
        "force_download",
        "proxies",
        "token",
        "local_files_only",
        "revision",
        "subfolder",
    }
    out = {k: v for k, v in kwargs.items() if k in allowed}
    out.setdefault("_raise_exceptions_for_missing_entries", False)
    out.setdefault("_raise_exceptions_for_gated_repo", False)
    out.setdefault("_raise_exceptions_for_connection_errors", False)
    return out


def _resolve_repo_file(pretrained_model_name_or_path: Union[str, os.PathLike], filename: str, **kwargs) -> Optional[str]:
    path = str(pretrained_model_name_or_path)

    if os.path.isdir(path):
        candidate = os.path.join(path, filename)
        return candidate if os.path.exists(candidate) else None

    if os.path.isfile(path):
        return path if os.path.basename(path) == filename else None

    try:
        return cached_file(path, filename, **_select_cached_file_kwargs(kwargs))
    except Exception:
        return None


def _load_json_file(path: str) -> Dict[str, Any]:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def _load_image_processor_dict(pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> Dict[str, Any]:
    processor_path = _resolve_repo_file(pretrained_model_name_or_path, PROCESSOR_CONFIG_NAME, **kwargs)
    if processor_path is not None:
        processor_dict = _load_json_file(processor_path)
        nested = processor_dict.get("image_processor")
        if isinstance(nested, dict):
            return nested

    preprocessor_path = _resolve_repo_file(pretrained_model_name_or_path, PREPROCESSOR_CONFIG_NAME, **kwargs)
    if preprocessor_path is not None:
        return _load_json_file(preprocessor_path)

    config_path = _resolve_repo_file(pretrained_model_name_or_path, CONFIG_NAME, **kwargs)
    if config_path is not None:
        return _load_json_file(config_path)

    raise FileNotFoundError(
        f"Could not find {PREPROCESSOR_CONFIG_NAME}, {PROCESSOR_CONFIG_NAME}, or {CONFIG_NAME} in {pretrained_model_name_or_path!r}."
    )


class AnandaImageProcessor(BaseImageProcessor):
    """Image processor for Ananda OCR-style visual prefix inputs.

    Behavior mirrored from the development inference path:
    1. Convert to RGB / 3 channels.
    2. Convert to CHW float32 in [0, 1].
    3. Normalize with config mean/std.
    4. Pad H/W up to a multiple of patch_size.
    5. Pad again up to a multiple of patch_size * merge_factor.
    6. Emit `pixel_values` and `patch_attention_mask`.
    """

    model_input_names = ["pixel_values", "patch_attention_mask"]

    def __init__(
        self,
        patch_size: int = 16,
        merge_factor: int = 1,
        do_convert_rgb: bool = True,
        do_rescale: bool = True,
        rescale_factor: float = 1.0 / 255.0,
        do_normalize: bool = True,
        image_mean: Optional[Sequence[float]] = None,
        image_std: Optional[Sequence[float]] = None,
        pad_value: float = 0.0,
        processor_class: Optional[str] = "AnandaProcessor",
        **kwargs: Any,
    ) -> None:
        super().__init__(**kwargs)
        self.patch_size = int(patch_size)
        self.merge_factor = max(int(merge_factor), 1)
        self.do_convert_rgb = bool(do_convert_rgb)
        self.do_rescale = bool(do_rescale)
        self.rescale_factor = float(rescale_factor)
        self.do_normalize = bool(do_normalize)
        self.image_mean = list(image_mean) if image_mean is not None else [0.5, 0.5, 0.5]
        self.image_std = list(image_std) if image_std is not None else [0.5, 0.5, 0.5]
        self.pad_value = float(pad_value)
        self.processor_class = processor_class

    @classmethod
    def from_model_config(cls, model_config: Union[Dict[str, Any], Any]) -> "AnandaImageProcessor":
        if isinstance(model_config, dict):
            cfg = model_config
        else:
            cfg = vars(model_config)

        return cls(
            patch_size=int(cfg.get("patch_size", 16)),
            merge_factor=int(cfg.get("encoder_2d_merge_factor", 1)),
            image_mean=cfg.get("image_normalization_mean", [0.5, 0.5, 0.5]),
            image_std=cfg.get("image_normalization_std", [0.5, 0.5, 0.5]),
        )

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs: Any) -> "AnandaImageProcessor":
        config_dict = _load_image_processor_dict(pretrained_model_name_or_path, **kwargs)
        nested = config_dict.get("image_processor")
        if isinstance(nested, dict):
            config_dict = nested

        return cls(
            patch_size=int(config_dict.get("patch_size", 16)),
            merge_factor=int(config_dict.get("merge_factor", config_dict.get("encoder_2d_merge_factor", 1))),
            do_convert_rgb=bool(config_dict.get("do_convert_rgb", True)),
            do_rescale=bool(config_dict.get("do_rescale", True)),
            rescale_factor=float(config_dict.get("rescale_factor", 1.0 / 255.0)),
            do_normalize=bool(config_dict.get("do_normalize", True)),
            image_mean=config_dict.get("image_mean", config_dict.get("image_normalization_mean", [0.5, 0.5, 0.5])),
            image_std=config_dict.get("image_std", config_dict.get("image_normalization_std", [0.5, 0.5, 0.5])),
            pad_value=float(config_dict.get("pad_value", 0.0)),
            processor_class=config_dict.get("processor_class", "AnandaProcessor"),
        )

    def to_dict(self) -> Dict[str, Any]:
        return {
            "image_processor_type": self.__class__.__name__,
            "processor_class": self.processor_class,
            "auto_map": {
                "AutoImageProcessor": "inference_processor.AnandaImageProcessor",
                "AutoProcessor": "inference_processor.AnandaProcessor",
            },
            "patch_size": self.patch_size,
            "merge_factor": self.merge_factor,
            "do_convert_rgb": self.do_convert_rgb,
            "do_rescale": self.do_rescale,
            "rescale_factor": self.rescale_factor,
            "do_normalize": self.do_normalize,
            "image_mean": list(self.image_mean),
            "image_std": list(self.image_std),
            "pad_value": self.pad_value,
        }

    def save_pretrained(self, save_directory: Union[str, os.PathLike], **_: Any) -> List[str]:
        os.makedirs(save_directory, exist_ok=True)
        output_path = os.path.join(save_directory, PREPROCESSOR_CONFIG_NAME)
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(self.to_dict(), f, ensure_ascii=False, indent=2)
        return [output_path]

    @staticmethod
    def _ensure_list(images: Union[ImageLike, Sequence[ImageLike]]) -> List[ImageLike]:
        if isinstance(images, (list, tuple)):
            return list(images)
        return [images]

    def _to_chw_uint8(self, image: ImageLike) -> torch.Tensor:
        if isinstance(image, Image.Image):
            img = image.convert("RGB") if self.do_convert_rgb else image
            arr = np.array(img, dtype=np.uint8)
            tensor = torch.from_numpy(arr)
            if tensor.ndim == 2:
                tensor = tensor.unsqueeze(-1)
            tensor = tensor.permute(2, 0, 1).contiguous()
        elif isinstance(image, np.ndarray):
            arr = image
            if arr.ndim == 2:
                arr = arr[..., None]
            if arr.ndim != 3:
                raise ValueError(f"Expected 2D or 3D ndarray image, got shape={arr.shape}")
            tensor = torch.from_numpy(arr)
            if tensor.shape[0] in (1, 3, 4):
                pass
            elif tensor.shape[-1] in (1, 3, 4):
                tensor = tensor.permute(2, 0, 1)
            else:
                raise ValueError(f"Could not infer channel dimension from ndarray shape={arr.shape}")
            tensor = tensor.contiguous()
        elif torch.is_tensor(image):
            tensor = image.detach().cpu()
            if tensor.ndim == 2:
                tensor = tensor.unsqueeze(0)
            if tensor.ndim != 3:
                raise ValueError(f"Expected 2D or 3D tensor image, got shape={tuple(tensor.shape)}")
            if tensor.shape[0] in (1, 3, 4):
                pass
            elif tensor.shape[-1] in (1, 3, 4):
                tensor = tensor.permute(2, 0, 1)
            else:
                raise ValueError(f"Could not infer channel dimension from tensor shape={tuple(tensor.shape)}")
            tensor = tensor.contiguous()
        else:
            raise TypeError(f"Unsupported image type: {type(image)!r}")

        if tensor.shape[0] == 1:
            tensor = tensor.expand(3, -1, -1)
        elif tensor.shape[0] == 4:
            tensor = tensor[:3]
        elif tensor.shape[0] != 3:
            raise ValueError(f"Expected 1, 3, or 4 channels, got {tensor.shape[0]}")

        if tensor.dtype.is_floating_point:
            max_val = float(tensor.max().item()) if tensor.numel() else 0.0
            if max_val <= 1.0 + 1e-6:
                tensor = tensor * 255.0
            tensor = tensor.round().clamp_(0.0, 255.0).to(torch.uint8)
        else:
            tensor = tensor.clamp_(0, 255).to(torch.uint8)

        return tensor.contiguous()

    def _normalize(self, chw_u8: torch.Tensor) -> torch.Tensor:
        x = chw_u8.to(torch.float32)
        if self.do_rescale:
            x = x * self.rescale_factor
        mean = torch.tensor(self.image_mean, dtype=torch.float32).view(3, 1, 1)
        std = torch.tensor(self.image_std, dtype=torch.float32).view(3, 1, 1)
        if self.do_normalize:
            x = (x - mean) / std
        return x

    def _pad_to_patch_multiple(self, img: torch.Tensor) -> torch.Tensor:
        _, h, w = img.shape
        p = self.patch_size
        target_h = int(math.ceil(h / p) * p)
        target_w = int(math.ceil(w / p) * p)
        if target_h != h or target_w != w:
            img = F.pad(img, (0, target_w - w, 0, target_h - h), value=self.pad_value)
        return img

    def _pad_for_merge_factor(self, img_norm: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        if img_norm.ndim != 3:
            raise ValueError(f"Expected image tensor with shape (3,H,W), got {tuple(img_norm.shape)}")

        p = self.patch_size
        m = self.merge_factor
        base = p * m
        _, h, w = img_norm.shape

        if h % p != 0 or w % p != 0:
            raise ValueError(f"Image must be patch-multiple before merge padding, got H={h}, W={w}, patch_size={p}")

        target_h = int(math.ceil(h / base) * base)
        target_w = int(math.ceil(w / base) * base)

        ph, pw = h // p, w // p
        target_ph, target_pw = target_h // p, target_w // p

        mask_2d = torch.ones((ph, pw), dtype=torch.bool)
        if target_ph != ph or target_pw != pw:
            mask_2d = F.pad(mask_2d, (0, target_pw - pw, 0, target_ph - ph), value=False)

        if target_h != h or target_w != w:
            img_norm = F.pad(img_norm, (0, target_w - w, 0, target_h - h), value=self.pad_value)

        return img_norm, mask_2d.reshape(-1).to(torch.long)

    def _preprocess_single(self, image: ImageLike) -> Tuple[torch.Tensor, torch.Tensor]:
        chw_u8 = self._to_chw_uint8(image)
        img = self._normalize(chw_u8)
        img = self._pad_to_patch_multiple(img)
        return self._pad_for_merge_factor(img)

    def preprocess(
        self,
        images: Union[ImageLike, Sequence[ImageLike]],
        return_tensors: Optional[Union[str, TensorType]] = None,
        **_: Any,
    ) -> BatchFeature:
        image_list = self._ensure_list(images)
        if len(image_list) == 0:
            raise ValueError("`images` must contain at least one image")

        processed: List[torch.Tensor] = []
        patch_masks: List[torch.Tensor] = []
        for image in image_list:
            px, pm = self._preprocess_single(image)
            processed.append(px)
            patch_masks.append(pm)

        max_h = max(t.shape[1] for t in processed)
        max_w = max(t.shape[2] for t in processed)
        p = self.patch_size
        batch_patch_h = max_h // p
        batch_patch_w = max_w // p

        batch_pixels: List[torch.Tensor] = []
        batch_masks: List[torch.Tensor] = []
        for px, pm in zip(processed, patch_masks):
            _, h, w = px.shape
            ph, pw = h // p, w // p

            if h != max_h or w != max_w:
                px = F.pad(px, (0, max_w - w, 0, max_h - h), value=self.pad_value)

            pm_2d = pm.view(ph, pw).to(torch.bool)
            if ph != batch_patch_h or pw != batch_patch_w:
                pm_2d = F.pad(pm_2d, (0, batch_patch_w - pw, 0, batch_patch_h - ph), value=False)

            batch_pixels.append(px)
            batch_masks.append(pm_2d.reshape(-1).to(torch.long))

        data = {
            "pixel_values": torch.stack(batch_pixels, dim=0),
            "patch_attention_mask": torch.stack(batch_masks, dim=0),
        }
        return BatchFeature(data=data, tensor_type=return_tensors)

    __call__ = preprocess


class AnandaProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"
    model_input_names = ["input_ids", "attention_mask", "pixel_values", "patch_attention_mask"]

    def __init__(self, image_processor: AnandaImageProcessor, tokenizer, **kwargs: Any) -> None:
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.current_processor = self.image_processor
        self._in_target_context_manager = False
        super().__init__(image_processor, tokenizer, **kwargs)

    @classmethod
    def from_model_config(cls, tokenizer, model_config: Union[Dict[str, Any], Any]) -> "AnandaProcessor":
        image_processor = AnandaImageProcessor.from_model_config(model_config)
        return cls(image_processor=image_processor, tokenizer=tokenizer)

    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: Union[str, os.PathLike],
        trust_remote_code: bool = True,
        **kwargs: Any,
    ) -> "AnandaProcessor":
        tokenizer = AutoTokenizer.from_pretrained(
            pretrained_model_name_or_path,
            trust_remote_code=trust_remote_code,
            **kwargs,
        )
        image_processor = AnandaImageProcessor.from_pretrained(pretrained_model_name_or_path, **kwargs)
        return cls(image_processor=image_processor, tokenizer=tokenizer)

    def save_pretrained(self, save_directory: Union[str, os.PathLike], **kwargs: Any) -> List[str]:
        os.makedirs(save_directory, exist_ok=True)

        saved_files: List[str] = []
        saved_files.extend(self.image_processor.save_pretrained(save_directory))
        saved_files.extend(self.tokenizer.save_pretrained(save_directory))

        processor_dict = {
            "processor_class": self.__class__.__name__,
            "auto_map": {"AutoProcessor": "inference_processor.AnandaProcessor"},
            "image_processor": self.image_processor.to_dict(),
        }
        output_path = os.path.join(save_directory, PROCESSOR_CONFIG_NAME)
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(processor_dict, f, ensure_ascii=False, indent=2)
        saved_files.append(output_path)
        return saved_files

    def __call__(
        self,
        text: Optional[Union[str, Sequence[str]]] = None,
        images: Optional[Union[ImageLike, Sequence[ImageLike]]] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        add_special_tokens: bool = True,
        **kwargs: Any,
    ) -> BatchFeature:
        if text is None and images is None:
            raise ValueError("At least one of `text` or `images` must be provided")

        encoding: Dict[str, Any] = {}

        if images is not None:
            image_features = self.image_processor(images=images, return_tensors=return_tensors)
            encoding.update(image_features)
            batch_size = int(image_features["pixel_values"].shape[0])
        else:
            batch_size = None

        if text is None:
            bos_id = self.tokenizer.bos_token_id
            eos_id = self.tokenizer.eos_token_id
            prompt_id = bos_id if bos_id is not None else eos_id
            if prompt_id is None:
                raise ValueError("Tokenizer must define bos_token_id or eos_token_id.")
            if batch_size is None:
                batch_size = 1

            input_ids = [[int(prompt_id)] for _ in range(batch_size)]
            attention_mask = [[1] for _ in range(batch_size)]
            if return_tensors == "pt" or return_tensors == TensorType.PYTORCH:
                encoding["input_ids"] = torch.tensor(input_ids, dtype=torch.long)
                encoding["attention_mask"] = torch.tensor(attention_mask, dtype=torch.long)
            else:
                encoding["input_ids"] = input_ids
                encoding["attention_mask"] = attention_mask
        else:
            text_encoding = self.tokenizer(
                text,
                add_special_tokens=add_special_tokens,
                return_tensors=return_tensors,
                **kwargs,
            )
            encoding.update(text_encoding)

        return BatchFeature(data=encoding, tensor_type=return_tensors)

    def batch_decode(self, *args: Any, **kwargs: Any):
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args: Any, **kwargs: Any):
        return self.tokenizer.decode(*args, **kwargs)

    def apply_chat_template(self, *args: Any, **kwargs: Any):
        return self.tokenizer.apply_chat_template(*args, **kwargs)
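The padding scheme implemented by `AnandaImageProcessor` above (docstring steps 4-5) rounds each spatial dimension up to a multiple of `patch_size`, then up to a multiple of `patch_size * merge_factor`. The arithmetic can be checked in isolation with a small stand-alone helper (illustrative, not part of the repository):

```python
import math

def pad_target(size: int, patch_size: int, merge_factor: int = 1) -> int:
    """Round `size` up as the image processor does: first to a patch multiple,
    then to a multiple of patch_size * merge_factor."""
    base = patch_size * max(merge_factor, 1)
    patch_multiple = math.ceil(size / patch_size) * patch_size
    return math.ceil(patch_multiple / base) * base

print(pad_target(100, 16))     # 112 (7 patches of 16)
print(pad_target(100, 16, 2))  # 128 (next multiple of 32)
```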
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7fa72c09c7ba378981b6fc6c90cb6d49d5cd69267cd759f037ec269a3377c73d
size 3126311968
modeling_anandasky.py ADDED
The diff for this file is too large to render. See raw diff
 
processor_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f04692ec710107538335a4ba0c5ce6f7946bf042e122281da6117da5d10350d2
size 748
special_tokens_map.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:51d28ee6e40cc506625950bf3983cdf9212d8f939dd52622ee4efd7ee3342a8b
size 756
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aeb13307a71acd8fe81861d94ad54ab689df773318809eed3cbe794b4492dae4
size 11422654
tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:87d541a08b9cd877495739a38ee30bac3164b9b680cb0cd67131e1c15a0b081e
size 5412
vocab.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ca10d7e9fb3ed18575dd1e277a2579c16d108e32f27439684afa0e10b1440910
size 2776833