Instructions for using MSALab/PerceptionDLM with libraries, inference providers, notebooks, and local apps.

Libraries

How to use MSALab/PerceptionDLM with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="MSALab/PerceptionDLM", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("MSALab/PerceptionDLM", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use MSALab/PerceptionDLM with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "MSALab/PerceptionDLM"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MSALab/PerceptionDLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
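
The same request can be sent from Python. The following is a minimal sketch using only the standard library. It assumes the server started above is listening on `localhost:8000`, and the helper names `build_chat_payload` and `post_chat` are hypothetical, not part of vLLM or of this repository.

```python
import json
import urllib.request

MODEL_ID = "MSALab/PerceptionDLM"
# Assumption: the server from the previous step is reachable at this address.
SERVER_URL = "http://localhost:8000/v1/chat/completions"


def build_chat_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the same JSON body as the curl example: one user turn with text and an image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload: dict, url: str = SERVER_URL, timeout: float = 60.0) -> dict:
    """POST the payload and return the decoded JSON response (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# Uncomment once the server is running:
# print(post_chat(payload)["choices"][0]["message"]["content"])
```

Keeping payload construction separate from the network call makes the request body easy to inspect or log before anything is sent.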

Use Docker

docker model run hf.co/MSALab/PerceptionDLM

SGLang

How to use MSALab/PerceptionDLM with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "MSALab/PerceptionDLM" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MSALab/PerceptionDLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "MSALab/PerceptionDLM" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "MSALab/PerceptionDLM",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use MSALab/PerceptionDLM with Docker Model Runner:
```
docker model run hf.co/MSALab/PerceptionDLM
```

MSALab committed 15 days ago

Commit cadf670 (verified) · 1 parent: dd8925b

Add files using upload-large-folder tool


Files changed (23)

.gitattributes +1 -0
README.md +86 -0
cache.py +94 -0
chat_template.json +3 -0
chat_template_utils.py +533 -0
config.json +333 -0
configuration_llada.py +175 -0
configuration_pdmllm.py +95 -0
model-00001-of-00005.safetensors +3 -0
model-00002-of-00005.safetensors +3 -0
model-00003-of-00005.safetensors +3 -0
model-00004-of-00005.safetensors +3 -0
model-00005-of-00005.safetensors +3 -0
model.safetensors.index.json +0 -0
modeling_abstractor.py +30 -0
modeling_llada.py +0 -0
modeling_pdmllm.py +1194 -0
preprocessor_config.json +27 -0
processing_pdmllm.py +382 -0
processor_config.json +15 -0
special_tokens_map.json +172 -0
tokenizer.json +0 -0
tokenizer_config.json +2359 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+__pycache__/modeling_llada.cpython-311.pyc filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,86 @@

+---
+license: apache-2.0
+language:
+- en
+library_name: transformers
+pipeline_tag: image-text-to-text
+base_model:
+- MSALab/PerceptionDLM-Base
+tags:
+- multimodal
+- diffusion-language-model
+- dllm
+- region-captioning
+- dense-captioning
+- parallel-decoding
+---
+# PerceptionDLM
+**PerceptionDLM** is a multimodal **diffusion** language model optimized for **efficient parallel region perception**. Built upon [**PerceptionDLM-Base**](https://huggingface.co/MSALab/PerceptionDLM-Base), it fully leverages the parallel decoding nature of diffusion language models (DLMs): given an image and multiple region masks, it generates descriptions for **all regions simultaneously** within a single denoising process — avoiding the linear latency growth of autoregressive (AR) region captioners.
+To the best of our knowledge, this is the first model to achieve **parallel region captioning and perception** by leveraging the advantages of diffusion language models.
+<p align="center">
+  📄 <a href="https://arxiv.org/abs/2606.19534">Paper</a> &nbsp;|&nbsp;
+  💻 <a href="https://github.com/MSALab-PKU/PerceptionDLM">Code</a> &nbsp;|&nbsp;
+  📊 <a href="https://huggingface.co/datasets/MSALab/ParaDLC-Bench">ParaDLC-Bench</a>
+</p>
+## Highlights
+- 🧩 **Parallel region captioning.** Region prompting + structured attention masking describe many masked regions in a single denoising pass.
+- ⚡ **Up to 3.44× throughput speedup** in dense multi-region scenarios, with stable per-image latency (~2.9s).
+- 🎯 **Competitive quality** with strong AR region captioners while being substantially faster.
+## Model Details
+| | |
+| :--- | :--- |
+| Base model | [MSALab/PerceptionDLM-Base](https://huggingface.co/MSALab/PerceptionDLM-Base) |
+| Key modules | Region prompting, RoI-aligned feature replay, structured attention masking |
+| Region prompts | up to 6 per image |
+| Default inference | 32 diffusion steps, generation length 32 per mask |
+| Training | full ParaCaption corpus, ~2 days on 32× H100 |
+| Precision | bfloat16 |
+## Results (ParaDLC-Bench)
+| Method | Type | Avg (%) | TPF ↑ | Time (s) ↓ |
+| :--- | :--- | :---: | :---: | :---: |
+| GAR-8B | AR (sequential) | 69.5 | 1.0 | 479 |
+| LLaDA-V-8B | Diffusion | 35.2 | 1.0 | 3241 |
+| **PerceptionDLM** | **Diffusion (parallel)** | **62.4** | **2.9** | **276** |
+`TPF` = Tokens Per Forward (higher = more parallel). PerceptionDLM nearly doubles the accuracy of prior diffusion VLMs while drastically reducing inference time.
+## Usage
+Full inference scripts are provided in the [GitHub repository](https://github.com/MSALab-PKU/PerceptionDLM).
+```bash
+python demo/infer_pdmllm.py \
+  --model-path MSALab/PerceptionDLM \
+  --image assets/demo.jpg \
+  --masks assets/demo_mask_0.jpg \
+          assets/demo_mask_1.jpg \
+          assets/demo_mask_2.jpg \
+  --gen-length 32 --steps 32 --temperature 0.0 --top-p 1.0
+```
+The model takes an RGB image plus one or more binary masks, and returns one caption per region — all generated in parallel.
+## Citation
+```bibtex
+@article{sun2026perceptiondlm,
+  title   = {PerceptionDLM: Parallel Region Perception with Multimodal Diffusion Language Models},
+  author  = {Sun, Yueyi and Wang, Yuhao and Li, Jason and Tian, Ye and Zhang, Tao and Mai, Jacky and Wang, Yihan and Wang, Haochen and Bai, Jinbin and Yang, Ling and Tong, Yunhai},
+  journal = {arXiv preprint arXiv:2606.19534},
+  year    = {2026}
+}
+```
+## License
+Released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
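
The comparisons drawn from the README's results table can be checked with a little arithmetic. The sketch below uses only the numbers reported in that table; the variable names are illustrative. The "up to 3.44×" throughput figure in the Highlights is a separate measurement and cannot be derived from these rows.

```python
# Values copied from the ParaDLC-Bench table in the README above.
results = {
    "GAR-8B": {"avg": 69.5, "tpf": 1.0, "time_s": 479},
    "LLaDA-V-8B": {"avg": 35.2, "tpf": 1.0, "time_s": 3241},
    "PerceptionDLM": {"avg": 62.4, "tpf": 2.9, "time_s": 276},
}

ours = results["PerceptionDLM"]

# Accuracy relative to the prior diffusion VLM ("nearly doubles").
accuracy_ratio = ours["avg"] / results["LLaDA-V-8B"]["avg"]
# Benchmark wall-clock time relative to each baseline.
time_vs_ar = results["GAR-8B"]["time_s"] / ours["time_s"]
time_vs_diffusion = results["LLaDA-V-8B"]["time_s"] / ours["time_s"]
# Remaining accuracy gap to the autoregressive baseline, in percentage points.
gap_to_ar = results["GAR-8B"]["avg"] - ours["avg"]

print(f"accuracy vs LLaDA-V-8B: {accuracy_ratio:.2f}x")
print(f"time vs GAR-8B:         {time_vs_ar:.2f}x faster")
print(f"time vs LLaDA-V-8B:     {time_vs_diffusion:.1f}x faster")
print(f"accuracy gap to GAR-8B: {gap_to_ar:.1f} points")
```

On these figures the model reaches about 1.77 times the accuracy of LLaDA-V-8B and finishes the benchmark about 1.74 times faster than GAR-8B, at a cost of roughly 7 accuracy points against that autoregressive baseline.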

cache.py ADDED

	@@ -0,0 +1,94 @@

+from collections import defaultdict
+from dataclasses import dataclass
+import torch
+@dataclass
+class dLLMCacheConfig:
+    prompt_interval_steps: int = 1
+    gen_interval_steps: int = 1
+    transfer_ratio: float = 0.0
+    cfg_interval_steps: int = 1
+class Singleton(type):
+    _instances = {}
+    def __call__(cls, *args, **kwargs):
+        if cls not in cls._instances:
+            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
+        return cls._instances[cls]
+class dLLMCache(metaclass=Singleton):
+    gen_interval_steps: int
+    prompt_interval_steps: int
+    cfg_interval_steps: int
+    prompt_length: int
+    transfer_ratio: float
+    __cache: defaultdict
+    __step_counter: defaultdict
+    @classmethod
+    def new_instance(
+        cls,
+        prompt_interval_steps: int = 1,
+        gen_interval_steps: int = 1,
+        cfg_interval_steps: int = 1,
+        transfer_ratio: float = 0.0,
+    ) -> "dLLMCache":
+        ins = cls()
+        setattr(ins, "prompt_interval_steps", prompt_interval_steps)
+        setattr(ins, "gen_interval_steps", gen_interval_steps)
+        setattr(ins, "cfg_interval_steps", cfg_interval_steps)
+        setattr(ins, "transfer_ratio", transfer_ratio)
+        ins.init()
+        return ins
+    def init(self) -> None:
+        self.__cache = defaultdict(
+            lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
+        )
+        self.__step_counter = defaultdict(lambda: defaultdict(lambda: 0))
+    def reset_cache(self, prompt_length: int = 0) -> None:
+        self.init()
+        torch.cuda.empty_cache()
+        self.prompt_length = prompt_length
+        self.cache_type = "no_cfg"
+    def set_cache(
+        self, layer_id: int, feature_name: str, features: torch.Tensor, cache_type: str
+    ) -> None:
+        self.__cache[self.cache_type][cache_type][layer_id][feature_name] = {
+            0: features
+        }
+    def get_cache(
+        self, layer_id: int, feature_name: str, cache_type: str
+    ) -> torch.Tensor:
+        output = self.__cache[self.cache_type][cache_type][layer_id][feature_name][0]
+        return output
+    def update_step(self, layer_id: int) -> None:
+        self.__step_counter[self.cache_type][layer_id] += 1
+    def refresh_gen(self, layer_id: int = 0) -> bool:
+        return (self.current_step - 1) % self.gen_interval_steps == 0
+    def refresh_prompt(self, layer_id: int = 0) -> bool:
+        return (self.current_step - 1) % self.prompt_interval_steps == 0
+    def refresh_cfg(self, layer_id: int = 0) -> bool:
+        return (
+            self.current_step - 1
+        ) % self.cfg_interval_steps == 0 or self.current_step <= 5
+    @property
+    def current_step(self) -> int:
+        return max(list(self.__step_counter[self.cache_type].values()), default=1)
+    def __repr__(self):
+        return f"USE dLLMCache"
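
The refresh predicates in `dLLMCache` reduce to a modulo test on the current step. The sketch below reproduces that rule without torch so the resulting schedule can be inspected. It assumes `current_step` advances by one per denoising step, and the function names are hypothetical.

```python
def refresh_steps(total_steps: int, interval: int) -> list[int]:
    """1-based steps where (step - 1) % interval == 0, as in refresh_gen / refresh_prompt."""
    return [step for step in range(1, total_steps + 1) if (step - 1) % interval == 0]


def cfg_refresh_steps(total_steps: int, interval: int, warmup: int = 5) -> list[int]:
    """Same rule as refresh_cfg: every step up to `warmup` also triggers a refresh."""
    return [
        step
        for step in range(1, total_steps + 1)
        if (step - 1) % interval == 0 or step <= warmup
    ]


print(refresh_steps(12, 4))      # [1, 5, 9]
print(cfg_refresh_steps(12, 4))  # [1, 2, 3, 4, 5, 9]
print(refresh_steps(4, 1))       # [1, 2, 3, 4]
```

With the default interval of 1 every step triggers a refresh, so cached features are only reused when an interval greater than 1 is configured. Note also that `cache_type` is assigned only in `reset_cache`, so `set_cache` and `get_cache` presumably expect `reset_cache` to have been called first.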

chat_template.json ADDED

	@@ -0,0 +1,3 @@

+{
+  "chat_template": "{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant.<|eot_id|>\n{% endif %}<|start_header_id|>{{ message['role'] }}<|end_header_id|>\n{% if message['role'] == 'assistant' %}{% generation %}{{ message['content'][0]['text'] }}<|eot_id|>{% endgeneration %}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}<img><IMG_CONTEXT></img>{% elif content['type'] == 'video' or 'video' in content %}<video><VIDEO_CONTEXT></video>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|eot_id|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|start_header_id|>assistant<|end_header_id|>\n{% endif %}"
+}
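
To see what prompt string this template produces, its branching can be mirrored in plain Python. The function below is a hand-written approximation for illustration, not the Jinja renderer itself; `render_chat` is a hypothetical name, and exact whitespace in real use depends on how the Jinja environment is configured.

```python
SYSTEM_PROMPT = "You are a helpful assistant."


def render_chat(messages: list[dict], add_generation_prompt: bool = False) -> str:
    """Mirror the chat template's control flow for a list of messages."""
    parts: list[str] = []
    for index, message in enumerate(messages):
        role = message["role"]
        # A default system turn is prepended when the conversation does not start with one.
        if index == 0 and role != "system":
            parts.append(
                f"<|start_header_id|>system<|end_header_id|>\n{SYSTEM_PROMPT}<|eot_id|>\n"
            )
        parts.append(f"<|start_header_id|>{role}<|end_header_id|>\n")
        if role == "assistant":
            # Assistant turns use only the text of the first content item.
            parts.append(f"{message['content'][0]['text']}<|eot_id|>")
        else:
            for item in message["content"]:
                kind = item.get("type")
                if kind == "image" or "image" in item or "image_url" in item:
                    parts.append("<img><IMG_CONTEXT></img>")
                elif kind == "video" or "video" in item:
                    parts.append("<video><VIDEO_CONTEXT></video>")
                elif "text" in item:
                    parts.append(item["text"])
            parts.append("<|eot_id|>\n")
    if add_generation_prompt:
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(parts)


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/candy.jpg"},  # placeholder URL
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    }
]
prompt = render_chat(messages, add_generation_prompt=True)
print(prompt)
```

Each image item collapses to a single `<img><IMG_CONTEXT></img>` placeholder regardless of its URL, so the image data itself has to reach the model through the processor rather than through the prompt text.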

chat_template_utils.py ADDED

	@@ -0,0 +1,533 @@

+# Copyright 2024 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import inspect
+import json
+import re
+import types
+from contextlib import contextmanager
+from datetime import datetime
+from functools import lru_cache
+from inspect import isfunction
+from typing import Any, Callable, Optional, Union, get_args, get_origin, get_type_hints
+from packaging import version
+from transformers.utils import logging
+from transformers.utils.import_utils import is_jinja_available, is_torch_available, is_vision_available
+logger = logging.get_logger(__name__)
+if is_jinja_available():
+    import jinja2
+    from jinja2.ext import Extension
+    from jinja2.sandbox import ImmutableSandboxedEnvironment
+else:
+    jinja2 = None
+if is_vision_available():
+    from PIL.Image import Image
+if is_torch_available():
+    from torch import Tensor
+BASIC_TYPES = (int, float, str, bool, Any, type(None), ...)
+# Extracts the initial segment of the docstring, containing the function description
+description_re = re.compile(r"^(.*?)[\n\s]*(Args:|Returns:|Raises:|\Z)", re.DOTALL)
+# Extracts the Args: block from the docstring
+args_re = re.compile(r"\n\s*Args:\n\s*(.*?)[\n\s]*(Returns:|Raises:|\Z)", re.DOTALL)
+# Splits the Args: block into individual arguments
+args_split_re = re.compile(
+    r"""
+(?:^|\n)  # Match the start of the args block, or a newline
+\s*(\w+):\s*  # Capture the argument name and strip spacing
+(.*?)\s*  # Capture the argument description, which can span multiple lines, and strip trailing spacing
+(?=\n\s*\w+:|\Z)  # Stop when you hit the next argument or the end of the block
+""",
+    re.DOTALL | re.VERBOSE,
+)
+# Extracts the Returns: block from the docstring, if present. Note that most chat templates ignore the return type/doc!
+returns_re = re.compile(r"\n\s*Returns:\n\s*(.*?)[\n\s]*(Raises:|\Z)", re.DOTALL)
+class TypeHintParsingException(Exception):
+    """Exception raised for errors in parsing type hints to generate JSON schemas"""
+    pass
+class DocstringParsingException(Exception):
+    """Exception raised for errors in parsing docstrings to generate JSON schemas"""
+    pass
+def _get_json_schema_type(param_type: str) -> dict[str, str]:
+    type_mapping = {
+        int: {"type": "integer"},
+        float: {"type": "number"},
+        str: {"type": "string"},
+        bool: {"type": "boolean"},
+        type(None): {"type": "null"},
+        Any: {},
+    }
+    if is_vision_available():
+        type_mapping[Image] = {"type": "image"}
+    if is_torch_available():
+        type_mapping[Tensor] = {"type": "audio"}
+    return type_mapping.get(param_type, {"type": "object"})
+def _parse_type_hint(hint: str) -> dict:
+    origin = get_origin(hint)
+    args = get_args(hint)
+    if origin is None:
+        try:
+            return _get_json_schema_type(hint)
+        except KeyError:
+            raise TypeHintParsingException(
+                "Couldn't parse this type hint, likely due to a custom class or object: ", hint
+            )
+    elif origin is Union or (hasattr(types, "UnionType") and origin is types.UnionType):
+        # Recurse into each of the subtypes in the Union, except None, which is handled separately at the end
+        subtypes = [_parse_type_hint(t) for t in args if t is not type(None)]
+        if len(subtypes) == 1:
+            # A single non-null type can be expressed directly
+            return_dict = subtypes[0]
+        elif all(isinstance(subtype["type"], str) for subtype in subtypes):
+            # A union of basic types can be expressed as a list in the schema
+            return_dict = {"type": sorted([subtype["type"] for subtype in subtypes])}
+        else:
+            # A union of more complex types requires "anyOf"
+            return_dict = {"anyOf": subtypes}
+        if type(None) in args:
+            return_dict["nullable"] = True
+        return return_dict
+    elif origin is list:
+        if not args:
+            return {"type": "array"}
+        else:
+            # Lists can only have a single type argument, so recurse into it
+            return {"type": "array", "items": _parse_type_hint(args[0])}
+    elif origin is tuple:
+        if not args:
+            return {"type": "array"}
+        if len(args) == 1:
+            raise TypeHintParsingException(
+                f"The type hint {str(hint).replace('typing.', '')} is a Tuple with a single element, which "
+                "we do not automatically convert to JSON schema as it is rarely necessary. If this input can contain "
+                "more than one element, we recommend "
+                "using a List[] type instead, or if it really is a single element, remove the Tuple[] wrapper and just "
+                "pass the element directly."
+            )
+        if ... in args:
+            raise TypeHintParsingException(
+                "Conversion of '...' is not supported in Tuple type hints. "
+                "Use List[] types for variable-length"
+                " inputs instead."
+            )
+        return {"type": "array", "prefixItems": [_parse_type_hint(t) for t in args]}
+    elif origin is dict:
+        # The JSON equivalent to a dict is 'object', which mandates that all keys are strings
+        # However, we can specify the type of the dict values with "additionalProperties"
+        out = {"type": "object"}
+        if len(args) == 2:
+            out["additionalProperties"] = _parse_type_hint(args[1])
+        return out
+    raise TypeHintParsingException("Couldn't parse this type hint, likely due to a custom class or object: ", hint)
+def _convert_type_hints_to_json_schema(func: Callable) -> dict:
+    type_hints = get_type_hints(func)
+    signature = inspect.signature(func)
+    required = []
+    for param_name, param in signature.parameters.items():
+        if param.annotation == inspect.Parameter.empty:
+            raise TypeHintParsingException(f"Argument {param.name} is missing a type hint in function {func.__name__}")
+        if param.default == inspect.Parameter.empty:
+            required.append(param_name)
+    properties = {}
+    for param_name, param_type in type_hints.items():
+        properties[param_name] = _parse_type_hint(param_type)
+    schema = {"type": "object", "properties": properties}
+    if required:
+        schema["required"] = required
+    return schema
+def parse_google_format_docstring(docstring: str) -> tuple[Optional[str], Optional[dict], Optional[str]]:
+    """
+    Parses a Google-style docstring to extract the function description,
+    argument descriptions, and return description.
+    Args:
+        docstring (str): The docstring to parse.
+    Returns:
+        The function description, arguments, and return description.
+    """
+    # Extract the sections
+    description_match = description_re.search(docstring)
+    args_match = args_re.search(docstring)
+    returns_match = returns_re.search(docstring)
+    # Clean and store the sections
+    description = description_match.group(1).strip() if description_match else None
+    docstring_args = args_match.group(1).strip() if args_match else None
+    returns = returns_match.group(1).strip() if returns_match else None
+    # Parsing the arguments into a dictionary
+    if docstring_args is not None:
+        docstring_args = "\n".join([line for line in docstring_args.split("\n") if line.strip()])  # Remove blank lines
+        matches = args_split_re.findall(docstring_args)
+        args_dict = {match[0]: re.sub(r"\s*\n+\s*", " ", match[1].strip()) for match in matches}
+    else:
+        args_dict = {}
+    return description, args_dict, returns
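
As an illustration of how the docstring regexes split a Google-style docstring, the sketch below copies the patterns into a standalone script and applies them to a small example. `args_split_re` is written here without `re.VERBOSE` but is otherwise the same pattern; `parse_docstring` and the sample docstring are made up for the demonstration.

```python
import re

# Patterns copied from chat_template_utils.py.
description_re = re.compile(r"^(.*?)[\n\s]*(Args:|Returns:|Raises:|\Z)", re.DOTALL)
args_re = re.compile(r"\n\s*Args:\n\s*(.*?)[\n\s]*(Returns:|Raises:|\Z)", re.DOTALL)
args_split_re = re.compile(r"(?:^|\n)\s*(\w+):\s*(.*?)\s*(?=\n\s*\w+:|\Z)", re.DOTALL)
returns_re = re.compile(r"\n\s*Returns:\n\s*(.*?)[\n\s]*(Raises:|\Z)", re.DOTALL)


def parse_docstring(docstring: str):
    """Return (description, {argument: description}, returns) for a Google-style docstring."""
    description_match = description_re.search(docstring)
    args_match = args_re.search(docstring)
    returns_match = returns_re.search(docstring)
    description = description_match.group(1).strip() if description_match else None
    returns = returns_match.group(1).strip() if returns_match else None
    args = {}
    if args_match:
        # Drop blank lines, then split the block into (name, text) pairs.
        lines = args_match.group(1).strip().split("\n")
        block = "\n".join(line for line in lines if line.strip())
        for name, text in args_split_re.findall(block):
            # Collapse continuation lines into a single line of text.
            args[name] = re.sub(r"\s*\n+\s*", " ", text.strip())
    return description, args, returns


doc = (
    "Multiplies two numbers.\n"
    "\n"
    "Args:\n"
    "    x: The first number\n"
    "    y: The second number,\n"
    "        possibly negative\n"
    "Returns:\n"
    "    The product"
)
description, args, returns = parse_docstring(doc)
print(description)  # Multiplies two numbers.
print(args)         # {'x': 'The first number', 'y': 'The second number, possibly negative'}
print(returns)      # The product
```

The continuation line of `y` is folded into its description because the split pattern only stops at a following `name:` line or at the end of the block.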
+def get_json_schema(func: Callable) -> dict:
+    """
+    This function generates a JSON schema for a given function, based on its docstring and type hints. This is
+    mostly used for passing lists of tools to a chat template. The JSON schema contains the name and description of
+    the function, as well as the names, types and descriptions for each of its arguments. `get_json_schema()` requires
+    that the function has a docstring, and that each argument has a description in the docstring, in the standard
+    Google docstring format shown below. It also requires that all the function arguments have a valid Python type hint.
+    Although it is not required, a `Returns` block can also be added, which will be included in the schema. This is
+    optional because most chat templates ignore the return value of the function.
+    Args:
+        func: The function to generate a JSON schema for.
+    Returns:
+        A dictionary containing the JSON schema for the function.
+    Examples:
+    ```python
+    >>> def multiply(x: float, y: float):
+    >>>    '''
+    >>>    A function that multiplies two numbers
+    >>>
+    >>>    Args:
+    >>>        x: The first number to multiply
+    >>>        y: The second number to multiply
+    >>>    '''
+    >>>    return x * y
+    >>>
+    >>> print(get_json_schema(multiply))
+    {
+        "name": "multiply",
+        "description": "A function that multiplies two numbers",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "x": {"type": "number", "description": "The first number to multiply"},
+                "y": {"type": "number", "description": "The second number to multiply"}
+            },
+            "required": ["x", "y"]
+        }
+    }
+    ```
+    The general use for these schemas is that they are used to generate tool descriptions for chat templates that
+    support them, like so:
+    ```python
+    >>> from transformers import AutoTokenizer
+    >>> from transformers.utils import get_json_schema
+    >>>
+    >>> def multiply(x: float, y: float):
+    >>>    '''
+    >>>    A function that multiplies two numbers
+    >>>
+    >>>    Args:
+    >>>        x: The first number to multiply
+    >>>        y: The second number to multiply
+    >>>    '''
+    >>>    return x * y
+    >>>
+    >>> multiply_schema = get_json_schema(multiply)
+    >>> tokenizer = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-v01")
+    >>> messages = [{"role": "user", "content": "What is 179 x 4571?"}]
+    >>> formatted_chat = tokenizer.apply_chat_template(
+    >>>     messages,
+    >>>     tools=[multiply_schema],
+    >>>     chat_template="tool_use",
+    >>>     return_dict=True,
+    >>>     return_tensors="pt",
+    >>>     add_generation_prompt=True
+    >>> )
+    >>> # The formatted chat can now be passed to model.generate()
+    ```
+    Each argument description can also have an optional `(choices: ...)` block at the end, such as
+    `(choices: ["tea", "coffee"])`, which will be parsed into an `enum` field in the schema. Note that this will
+    only be parsed correctly if it is at the end of the line:
+    ```python
+    >>> def drink_beverage(beverage: str):
+    >>>    '''
+    >>>    A function that drinks a beverage
+    >>>
+    >>>    Args:
+    >>>        beverage: The beverage to drink (choices: ["tea", "coffee"])
+    >>>    '''
+    >>>    pass
+    >>>
+    >>> print(get_json_schema(drink_beverage))
+    {
+        'name': 'drink_beverage',
+        'description': 'A function that drinks a beverage',
+        'parameters': {
+            'type': 'object',
+            'properties': {
+                'beverage': {
+                    'type': 'string',
+                    'enum': ['tea', 'coffee'],
+                    'description': 'The beverage to drink'
+                    }
+                },
+            'required': ['beverage']
+        }
+    }
+    ```
+    """
+    doc = inspect.getdoc(func)
+    if not doc:
+        raise DocstringParsingException(
+            f"Cannot generate JSON schema for {func.__name__} because it has no docstring!"
+        )
+    doc = doc.strip()
+    main_doc, param_descriptions, return_doc = parse_google_format_docstring(doc)
+    json_schema = _convert_type_hints_to_json_schema(func)
+    if (return_dict := json_schema["properties"].pop("return", None)) is not None:
+        if return_doc is not None:  # We allow a missing return docstring since most templates ignore it
+            return_dict["description"] = return_doc
+    for arg, schema in json_schema["properties"].items():
+        if arg not in param_descriptions:
+            raise DocstringParsingException(
+                f"Cannot generate JSON schema for {func.__name__} because the docstring has no description for the argument '{arg}'"
+            )
+        desc = param_descriptions[arg]
+        enum_choices = re.search(r"\(choices:\s*(.*?)\)\s*$", desc, flags=re.IGNORECASE)
+        if enum_choices:
+            schema["enum"] = [c.strip() for c in json.loads(enum_choices.group(1))]
+            desc = enum_choices.string[: enum_choices.start()].strip()
+        schema["description"] = desc
+    output = {"name": func.__name__, "description": main_doc, "parameters": json_schema}
+    if return_dict is not None:
+        output["return"] = return_dict
+    return {"type": "function", "function": output}
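
The `(choices: ...)` handling in `get_json_schema` can be tried in isolation. The sketch below reuses the same regular expression and JSON parsing; `split_choices` is a hypothetical helper written for this example.

```python
import json
import re

# Same pattern as in get_json_schema: the block must sit at the end of the description.
CHOICES_RE = re.compile(r"\(choices:\s*(.*?)\)\s*$", flags=re.IGNORECASE)


def split_choices(description: str):
    """Return (description without the choices block, list of choices or None)."""
    match = CHOICES_RE.search(description)
    if match is None:
        return description, None
    choices = [choice.strip() for choice in json.loads(match.group(1))]
    return description[: match.start()].strip(), choices


print(split_choices('The beverage to drink (choices: ["tea", "coffee"])'))
# ('The beverage to drink', ['tea', 'coffee'])
print(split_choices('Pick (choices: ["tea"]) carefully'))
# ('Pick (choices: ["tea"]) carefully', None)
```

Because the list is parsed with `json.loads`, the choices must be written as a JSON array with double-quoted strings; a single-quoted list such as `['tea']` raises `json.JSONDecodeError`.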
+def _render_with_assistant_indices(
+    compiled_template, messages, tools, documents, add_generation_prompt, **template_kwargs
+):
+    rendered_blocks = []
+    generation_indices = []
+    with compiled_template.environment.activate_tracker(rendered_blocks, generation_indices):
+        for block in compiled_template.generate(
+            messages=messages,
+            tools=tools,
+            documents=documents,
+            add_generation_prompt=add_generation_prompt,
+            **template_kwargs,
+        ):
+            rendered_blocks.append(block)
+        rendered_chat = "".join(rendered_blocks)
+    return rendered_chat, generation_indices
+@lru_cache
+def _compile_jinja_template(chat_template):
+    if not is_jinja_available():
+        raise ImportError(
+            "apply_chat_template requires jinja2 to be installed. Please install it using `pip install jinja2`."
+        )
+    class AssistantTracker(Extension):
+        # This extension is used to track the indices of assistant-generated tokens in the rendered chat
+        tags = {"generation"}
+        def __init__(self, environment: ImmutableSandboxedEnvironment):
+            # The class is only initiated by jinja.
+            super().__init__(environment)
+            environment.extend(activate_tracker=self.activate_tracker)
+            self._rendered_blocks = None
+            self._generation_indices = None
+        def parse(self, parser: jinja2.parser.Parser) -> jinja2.nodes.CallBlock:
+            lineno = next(parser.stream).lineno
+            body = parser.parse_statements(["name:endgeneration"], drop_needle=True)
+            return jinja2.nodes.CallBlock(self.call_method("_generation_support"), [], [], body).set_lineno(lineno)
+        @jinja2.pass_eval_context
+        def _generation_support(self, context: jinja2.nodes.EvalContext, caller: jinja2.runtime.Macro) -> str:
+            rv = caller()
+            if self.is_active():
+                # Only track generation indices if the tracker is active
+                start_index = len("".join(self._rendered_blocks))
+                end_index = start_index + len(rv)
+                self._generation_indices.append((start_index, end_index))
+            return rv
+        def is_active(self) -> bool:
+            return bool(self._rendered_blocks or self._generation_indices)
+        @contextmanager
+        def activate_tracker(self, rendered_blocks: list[int], generation_indices: list[int]):
+            try:
+                if self.is_active():
+                    raise ValueError("AssistantTracker should not be reused before closed")
+                self._rendered_blocks = rendered_blocks
+                self._generation_indices = generation_indices
+                yield
+            finally:
+                self._rendered_blocks = None
+                self._generation_indices = None
+    if version.parse(jinja2.__version__) < version.parse("3.1.0"):
+        raise ImportError(
+            f"apply_chat_template requires jinja2>=3.1.0 to be installed. Your version is {jinja2.__version__}."
+        )
+    def raise_exception(message):
+        raise jinja2.exceptions.TemplateError(message)
+    def tojson(x, ensure_ascii=False, indent=None, separators=None, sort_keys=False):
+        # We override the built-in tojson filter because Jinja's default filter escapes HTML characters
+        # We also expose some options like custom indents and separators
+        return json.dumps(x, ensure_ascii=ensure_ascii, indent=indent, separators=separators, sort_keys=sort_keys)
+    def strftime_now(format):
+        return datetime.now().strftime(format)
+    jinja_env = ImmutableSandboxedEnvironment(
+        trim_blocks=True, lstrip_blocks=True, extensions=[AssistantTracker, jinja2.ext.loopcontrols]
+    )
+    jinja_env.filters["tojson"] = tojson
+    jinja_env.globals["raise_exception"] = raise_exception
+    jinja_env.globals["strftime_now"] = strftime_now
+    return jinja_env.from_string(chat_template)
+def render_jinja_template(
+    conversations: list[list[dict[str, str]]],
+    tools: Optional[list[Union[dict, Callable]]] = None,
+    documents: Optional[list[dict[str, str]]] = None,
+    chat_template: Optional[str] = None,
+    return_assistant_tokens_mask: Optional[bool] = False,
+    continue_final_message: Optional[bool] = False,
+    add_generation_prompt: Optional[bool] = False,
+    **kwargs,
+) -> str:
+    if return_assistant_tokens_mask and not re.search(r"\{\%-?\s*generation\s*-?\%\}", chat_template):
+        logger.warning_once(
+            "return_assistant_tokens_mask==True but chat template does not contain `{% generation %}` keyword."
+        )
+    # Compilation function uses a cache to avoid recompiling the same template
+    compiled_template = _compile_jinja_template(chat_template)
+    # We accept either JSON schemas or functions for tools. If we get functions, we convert them to schemas
+    if tools is not None:
+        tool_schemas = []
+        for tool in tools:
+            if isinstance(tool, dict):
+                tool_schemas.append(tool)
+            elif isfunction(tool):
+                tool_schemas.append(get_json_schema(tool))
+            else:
+                raise ValueError(
+                    "Tools should either be a JSON schema, or a callable function with type hints "
+                    "and a docstring suitable for auto-conversion to a schema."
+                )
+    else:
+        tool_schemas = None
+    if documents is not None:
+        for document in documents:
+            if not isinstance(document, dict):
+                raise TypeError("Documents should be a list of dicts with 'title' and 'text' keys!")
+    rendered = []
+    all_generation_indices = []
+    for chat in conversations:
+        if hasattr(chat, "messages"):
+            # Indicates it's a Conversation object
+            chat = chat.messages
+        if return_assistant_tokens_mask:
+            rendered_chat, generation_indices = _render_with_assistant_indices(
+                compiled_template=compiled_template,
+                messages=chat,
+                tools=tool_schemas,
+                documents=documents,
+                add_generation_prompt=add_generation_prompt,
+                **kwargs,
+            )
+            all_generation_indices.append(generation_indices)
+        else:
+            rendered_chat = compiled_template.render(
+                messages=chat,
+                tools=tool_schemas,
+                documents=documents,
+                add_generation_prompt=add_generation_prompt,
+                **kwargs,
+            )
+        if continue_final_message:
+            final_message = chat[-1]["content"]
+            if isinstance(final_message, (list, tuple)):
+                for content_block in reversed(final_message):
+                    if "text" in content_block:
+                        # Pick the last text block in the message (the first one we hit while iterating in reverse)
+                        final_message = content_block["text"]
+                        break
+                else:
+                    raise ValueError(
+                        "continue_final_message is set but we could not find any text to continue in the final message!"
+                    )
+            if final_message.strip() not in rendered_chat:
+                raise ValueError(
+                    "continue_final_message is set but the final message does not appear in the chat after "
+                    "applying the chat template! This can happen if the chat template deletes portions of "
+                    "the final message. Please verify the chat template and final message in your chat to "
+                    "ensure they are compatible."
+                )
+            final_msg_loc = rendered_chat.rindex(final_message.strip())
+            if rendered_chat[final_msg_loc : final_msg_loc + len(final_message.lstrip())] == final_message:
+                # The template preserves spacing or the message doesn't have trailing spacing, so things are simple
+                rendered_chat = rendered_chat[: final_msg_loc + len(final_message.lstrip())]
+            else:
+                # The message has trailing spacing that was trimmed, so we must be more cautious
+                rendered_chat = rendered_chat[: final_msg_loc + len(final_message.strip())]
+        rendered.append(rendered_chat)
+    return rendered, all_generation_indices
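
Note on the `continue_final_message` branch above: it cuts the rendered chat right after the last occurrence of the final message, so that generation resumes inside that message rather than after the template's end-of-turn markers. The following is a minimal standalone sketch of that truncation step only. The helper name `truncate_after_final_message` and the `<|user|>`-style markers are hypothetical and used purely for illustration; they are not taken from this repository's chat template.

```python
def truncate_after_final_message(rendered_chat: str, final_message: str) -> str:
    """Cut the rendered chat right after the last occurrence of the final message."""
    stripped = final_message.strip()
    if stripped not in rendered_chat:
        raise ValueError("final message does not appear in the rendered chat")
    start = rendered_chat.rindex(stripped)
    keep = len(final_message.lstrip())
    if rendered_chat[start:start + keep] == final_message:
        # The template kept the message verbatim (including any trailing spacing).
        return rendered_chat[:start + keep]
    # The template trimmed trailing spacing, so cut after the stripped text instead.
    return rendered_chat[:start + len(stripped)]


# Hypothetical rendered output; the markers stand in for real template tokens.
rendered = "<|user|>Hi<|end|><|assistant|>The answer is<|end|>"
prefix = truncate_after_final_message(rendered, "The answer is")
prefix_trailing_space = truncate_after_final_message(rendered, "The answer is ")
```

In both calls the end-of-turn marker after the assistant text is dropped, which is what lets the model continue the unfinished message.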

config.json ADDED

	@@ -0,0 +1,333 @@

+{
+  "architectures": [
+    "PDMLLM"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_pdmllm.PDMLLMConfig",
+    "AutoModel": "modeling_pdmllm.PDMLLM",
+    "AutoModelForCausalLM": "modeling_pdmllm.PDMLLM"
+  },
+  "downsample_ratio": 0.5,
+  "image_size": 512,
+  "image_token_id": 126349,
+  "kernel_size": [
+    16,
+    16
+  ],
+  "language_model_config": {
+    "_attn_implementation_autoset": true,
+    "_name_or_path": "bitersun/LLaDA-8B-Instruct-HF",
+    "add_cross_attention": false,
+    "architectures": [
+      "LLaDAModelLM"
+    ],
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "auto_map": {
+      "AutoConfig": "configuration_llada.LLaDAConfig",
+      "AutoModel": "modeling_llada.LLaDAModelLM",
+      "AutoModelForCausalLM": "modeling_llada.LLaDAModelLM"
+    },
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": 128000,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": 126081,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "silu",
+    "hidden_size": 4096,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "initializer_range": 0.02,
+    "intermediate_size": 12288,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "max_position_embeddings": 16384,
+    "min_length": 0,
+    "model_type": "llada",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 32,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_hidden_layers": 32,
+    "num_key_value_heads": 32,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "prefix": null,
+    "pretraining_tp": 1,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "rms_norm_eps": 1e-05,
+    "rope_scaling": null,
+    "rope_theta": 500000.0,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": false,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": "bfloat16",
+    "torchscript": false,
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "use_cache": false,
+    "vocab_size": 126464
+  },
+  "mask_patch_embedding_in_channels": 3,
+  "mask_patch_embedding_out_channels": 1152,
+  "model_type": "pdmllm",
+  "num_image_token": 256,
+  "patch_size": 16,
+  "prompt_numbers": 6,
+  "replacement_noise_mode": false,
+  "roi_output_size": 4,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.51.3",
+  "vision_abstractor_config": {
+    "projection_type": "mlp2x_gelu"
+  },
+  "vision_model_config": {
+    "_attn_implementation_autoset": true,
+    "_name_or_path": "google/siglip2-so400m-patch16-512",
+    "add_cross_attention": false,
+    "architectures": null,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_size": 1152,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "initializer_factor": 1.0,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "model_type": "siglip",
+    "no_repeat_ngram_size": 0,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "text_config": {
+      "_attn_implementation_autoset": false,
+      "_name_or_path": "",
+      "add_cross_attention": false,
+      "architectures": null,
+      "attention_dropout": 0.0,
+      "bad_words_ids": null,
+      "begin_suppress_tokens": null,
+      "bos_token_id": 49406,
+      "chunk_size_feed_forward": 0,
+      "cross_attention_hidden_size": null,
+      "decoder_start_token_id": null,
+      "diversity_penalty": 0.0,
+      "do_sample": false,
+      "early_stopping": false,
+      "encoder_no_repeat_ngram_size": 0,
+      "eos_token_id": 49407,
+      "exponential_decay_length_penalty": null,
+      "finetuning_task": null,
+      "forced_bos_token_id": null,
+      "forced_eos_token_id": null,
+      "hidden_act": "gelu_pytorch_tanh",
+      "hidden_size": 1152,
+      "id2label": {
+        "0": "LABEL_0",
+        "1": "LABEL_1"
+      },
+      "intermediate_size": 4304,
+      "is_decoder": false,
+      "is_encoder_decoder": false,
+      "label2id": {
+        "LABEL_0": 0,
+        "LABEL_1": 1
+      },
+      "layer_norm_eps": 1e-06,
+      "length_penalty": 1.0,
+      "max_length": 20,
+      "max_position_embeddings": 64,
+      "min_length": 0,
+      "model_type": "siglip_text_model",
+      "no_repeat_ngram_size": 0,
+      "num_attention_heads": 16,
+      "num_beam_groups": 1,
+      "num_beams": 1,
+      "num_hidden_layers": 27,
+      "num_return_sequences": 1,
+      "output_attentions": false,
+      "output_hidden_states": false,
+      "output_scores": false,
+      "pad_token_id": 1,
+      "prefix": null,
+      "problem_type": null,
+      "projection_size": 1152,
+      "pruned_heads": {},
+      "remove_invalid_values": false,
+      "repetition_penalty": 1.0,
+      "return_dict": true,
+      "return_dict_in_generate": false,
+      "sep_token_id": null,
+      "suppress_tokens": null,
+      "task_specific_params": null,
+      "temperature": 1.0,
+      "tf_legacy_loss": false,
+      "tie_encoder_decoder": false,
+      "tie_word_embeddings": true,
+      "tokenizer_class": null,
+      "top_k": 50,
+      "top_p": 1.0,
+      "torch_dtype": "bfloat16",
+      "torchscript": false,
+      "typical_p": 1.0,
+      "use_bfloat16": false,
+      "vocab_size": 256000
+    },
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": "bfloat16",
+    "torchscript": false,
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "vision_config": {
+      "_attn_implementation_autoset": false,
+      "_name_or_path": "",
+      "add_cross_attention": false,
+      "architectures": null,
+      "attention_dropout": 0.0,
+      "bad_words_ids": null,
+      "begin_suppress_tokens": null,
+      "bos_token_id": null,
+      "chunk_size_feed_forward": 0,
+      "cross_attention_hidden_size": null,
+      "decoder_start_token_id": null,
+      "diversity_penalty": 0.0,
+      "do_sample": false,
+      "early_stopping": false,
+      "encoder_no_repeat_ngram_size": 0,
+      "eos_token_id": null,
+      "exponential_decay_length_penalty": null,
+      "finetuning_task": null,
+      "forced_bos_token_id": null,
+      "forced_eos_token_id": null,
+      "hidden_act": "gelu_pytorch_tanh",
+      "hidden_size": 1152,
+      "id2label": {
+        "0": "LABEL_0",
+        "1": "LABEL_1"
+      },
+      "image_size": 512,
+      "intermediate_size": 4304,
+      "is_decoder": false,
+      "is_encoder_decoder": false,
+      "label2id": {
+        "LABEL_0": 0,
+        "LABEL_1": 1
+      },
+      "layer_norm_eps": 1e-06,
+      "length_penalty": 1.0,
+      "max_length": 20,
+      "min_length": 0,
+      "model_type": "siglip_vision_model",
+      "no_repeat_ngram_size": 0,
+      "num_attention_heads": 16,
+      "num_beam_groups": 1,
+      "num_beams": 1,
+      "num_channels": 3,
+      "num_hidden_layers": 27,
+      "num_return_sequences": 1,
+      "output_attentions": false,
+      "output_hidden_states": false,
+      "output_scores": false,
+      "pad_token_id": null,
+      "patch_size": 16,
+      "prefix": null,
+      "problem_type": null,
+      "pruned_heads": {},
+      "remove_invalid_values": false,
+      "repetition_penalty": 1.0,
+      "return_dict": true,
+      "return_dict_in_generate": false,
+      "sep_token_id": null,
+      "suppress_tokens": null,
+      "task_specific_params": null,
+      "temperature": 1.0,
+      "tf_legacy_loss": false,
+      "tie_encoder_decoder": false,
+      "tie_word_embeddings": true,
+      "tokenizer_class": null,
+      "top_k": 50,
+      "top_p": 1.0,
+      "torch_dtype": "bfloat16",
+      "torchscript": false,
+      "typical_p": 1.0,
+      "use_bfloat16": false
+    }
+  },
+  "vision_output_key": null,
+  "vision_select_layer": -2
+}
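
The `num_image_token` value of 256 in this config can be re-derived from `image_size`, `patch_size` and `downsample_ratio` using the formula in `configuration_pdmllm.py`. The sketch below redoes that arithmetic, plus the abstractor input width as computed in `modeling_pdmllm.py`, with the values from this file. The function name `derived_sizes` is hypothetical and exists only for this illustration.

```python
def derived_sizes(image_size: int, patch_size: int, downsample_ratio: float,
                  vision_hidden_size: int) -> dict:
    """Recompute sizes that follow from the top-level config values."""
    patches_per_side = image_size // patch_size
    # Same expression as PDMLLMConfig.__init__ uses for num_image_token.
    num_image_token = int(patches_per_side ** 2 * (downsample_ratio ** 2))
    # Same expression as PDMLLM.__init__ uses for the abstractor's in_dim.
    abstractor_in_dim = vision_hidden_size * (int(1 / downsample_ratio) ** 2)
    return {
        "patches_per_side": patches_per_side,
        "num_image_token": num_image_token,
        "abstractor_in_dim": abstractor_in_dim,
    }


# Values copied from config.json above.
sizes = derived_sizes(image_size=512, patch_size=16, downsample_ratio=0.5,
                      vision_hidden_size=1152)
```

With these inputs the vision tower yields a 32 x 32 patch grid, and the 0.5 downsample ratio merges each 2 x 2 neighbourhood into one token of width 4 x 1152.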

configuration_llada.py ADDED

	@@ -0,0 +1,175 @@

+# coding=utf-8
+# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+#
+# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+# and OPT implementations in this library. It has been modified from its
+# original forms to accommodate minor architectural differences compared
+# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+""" LLaDA model configuration"""
+from transformers.configuration_utils import PretrainedConfig
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+LLaDA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+class LLaDAConfig(PretrainedConfig):
+    r"""
+    This is the configuration class to store the configuration of a [`LLaDAModel`]. It is used to instantiate an LLaDA
+    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+    defaults will yield a similar configuration to that of the LLaDA-8B.
+    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+    documentation from [`PretrainedConfig`] for more information.
+    Args:
+        vocab_size (`int`, *optional*, defaults to 32000):
+            Vocabulary size of the LLaDA model. Defines the number of different tokens that can be represented by the
+            `input_ids` passed when calling [`LLaDAModel`]
+        hidden_size (`int`, *optional*, defaults to 4096):
+            Dimension of the hidden representations.
+        intermediate_size (`int`, *optional*, defaults to 11008):
+            Dimension of the MLP representations.
+        num_hidden_layers (`int`, *optional*, defaults to 32):
+            Number of hidden layers in the Transformer decoder.
+        num_attention_heads (`int`, *optional*, defaults to 32):
+            Number of attention heads for each attention layer in the Transformer decoder.
+        num_key_value_heads (`int`, *optional*):
+            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+            `num_key_value_heads=1` the model will use Multi Query Attention (MQA); otherwise GQA is used. When
+            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+            by meanpooling all the original heads within that group. For more details, check out [this
+            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+            `num_attention_heads`.
+        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+            The non-linear activation function (function or string) in the decoder.
+        max_position_embeddings (`int`, *optional*, defaults to 2048):
+            The maximum sequence length that this model might ever be used with.
+        initializer_range (`float`, *optional*, defaults to 0.02):
+            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
+            The epsilon used by the rms normalization layers.
+        use_cache (`bool`, *optional*, defaults to `True`):
+            Whether or not the model should return the last key/values attentions (not used by all models). Only
+            relevant if `config.is_decoder=True`.
+        pad_token_id (`int`, *optional*):
+            Padding token id.
+        bos_token_id (`int`, *optional*, defaults to 1):
+            Beginning of stream token id.
+        eos_token_id (`int`, *optional*, defaults to 2):
+            End of stream token id.
+        pretraining_tp (`int`, *optional*, defaults to 1):
+            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to understand more about it. This value is
+            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+            issue](https://github.com/pytorch/pytorch/issues/76232).
+        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+            Whether to tie weight embeddings
+        rope_theta (`float`, *optional*, defaults to 10000.0):
+            The base period of the RoPE embeddings.
+        rope_scaling (`Dict`, *optional*):
+            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+            `max_position_embeddings` to the expected new maximum.
+        attention_bias (`bool`, *optional*, defaults to `False`):
+            Whether to use a bias in the query, key, value and output projection layers during self-attention.
+        attention_dropout (`float`, *optional*, defaults to 0.0):
+            The dropout ratio for the attention probabilities.
+    """
+    model_type = "llada"
+    keys_to_ignore_at_inference = ["past_key_values"]
+    def __init__(
+        self,
+        vocab_size=32000,
+        hidden_size=4096,
+        intermediate_size=11008,
+        num_hidden_layers=32,
+        num_attention_heads=32,
+        num_key_value_heads=None,
+        hidden_act="silu",
+        max_position_embeddings=2048,
+        initializer_range=0.02,
+        rms_norm_eps=1e-6,
+        use_cache=True,
+        pad_token_id=None,
+        bos_token_id=1,
+        eos_token_id=2,
+        pretraining_tp=1,
+        tie_word_embeddings=False,
+        rope_theta=10000.0,
+        rope_scaling=None,
+        attention_bias=False,
+        attention_dropout=0.0,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.max_position_embeddings = max_position_embeddings
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        # for backward compatibility
+        if num_key_value_heads is None:
+            num_key_value_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.hidden_act = hidden_act
+        self.initializer_range = initializer_range
+        self.rms_norm_eps = rms_norm_eps
+        self.pretraining_tp = pretraining_tp
+        self.use_cache = use_cache
+        self.rope_theta = rope_theta
+        self.rope_scaling = rope_scaling
+        self._rope_scaling_validation()
+        self.attention_bias = attention_bias
+        self.attention_dropout = attention_dropout
+        super().__init__(
+            pad_token_id=pad_token_id,
+            bos_token_id=bos_token_id,
+            eos_token_id=eos_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+    def _rope_scaling_validation(self):
+        """
+        Validate the `rope_scaling` configuration.
+        """
+        if self.rope_scaling is None:
+            return
+        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
+            raise ValueError(
+                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
+                f"got {self.rope_scaling}"
+            )
+        rope_scaling_type = self.rope_scaling.get("type", None)
+        rope_scaling_factor = self.rope_scaling.get("factor", None)
+        if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
+            raise ValueError(
+                f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
+            )
+        if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+            raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
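
The `_rope_scaling_validation` rules above are strict about types: the factor must be a Python `float`, so an integer such as `2` is rejected even though it is greater than 1. The sketch below mirrors those checks in a free function to make that behaviour easy to try out. The name `validate_rope_scaling` is hypothetical; the shipped logic lives on `LLaDAConfig`.

```python
def validate_rope_scaling(rope_scaling) -> None:
    """Standalone mirror of the checks in LLaDAConfig._rope_scaling_validation."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(f"expected a dict with `type` and `factor`, got {rope_scaling}")
    scaling_type = rope_scaling.get("type", None)
    factor = rope_scaling.get("factor", None)
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"type must be 'linear' or 'dynamic', got {scaling_type}")
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"factor must be a float > 1, got {factor}")


def is_valid(rope_scaling) -> bool:
    """Convenience wrapper that reports validity instead of raising."""
    try:
        validate_rope_scaling(rope_scaling)
    except ValueError:
        return False
    return True


accepted = is_valid({"type": "linear", "factor": 2.0})
rejected_int_factor = is_valid({"type": "linear", "factor": 2})
rejected_unknown_type = is_valid({"type": "yarn", "factor": 2.0})
```

The config in this commit sets `"rope_scaling": null`, which takes the early-return path and skips all of these checks.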

configuration_pdmllm.py ADDED

	@@ -0,0 +1,95 @@

+from transformers import PretrainedConfig, AutoConfig, CONFIG_MAPPING
+from transformers.dynamic_module_utils import get_class_from_dynamic_module
+from transformers.utils import logging
+logger = logging.get_logger(__name__)
+class PDMLLMConfig(PretrainedConfig):
+    model_type = "pdmllm"
+    is_composition = True
+    def __init__(self,
+                 language_model_config=None,
+                 vision_model_config=None,
+                 vision_abstractor_config=None,
+                 image_token_id=None,
+                 image_size=512,
+                 patch_size=16,
+                 downsample_ratio=0.5,
+                 vision_select_layer=-2,
+                 replacement_noise_mode=False,
+                 prompt_numbers=5,
+                 mask_patch_embedding_in_channels=3,
+                 mask_patch_embedding_out_channels=1152,
+                 kernel_size=[16, 16],
+                 roi_output_size=None,
+                 **kwargs):
+        super().__init__(**kwargs)
+        self.replacement_noise_mode = replacement_noise_mode
+        self.image_size = image_size
+        self.patch_size = patch_size
+        self.downsample_ratio = downsample_ratio
+        self.num_image_token = int((image_size // patch_size) ** 2 * (downsample_ratio ** 2))
+        self.vision_select_layer = vision_select_layer
+        self.prompt_numbers = prompt_numbers
+        self.mask_patch_embedding_in_channels = mask_patch_embedding_in_channels
+        # self.mask_patch_embedding_out_channels = mask_patch_embedding_out_channels
+        # roi_output_size controls how many RoI-aligned tokens replace each crop token.
+        # None => keep original (feat_h, feat_w); int => square grid; tuple => (h, w).
+        self.roi_output_size = roi_output_size
+        if isinstance(language_model_config, dict):
+            if '_name_or_path' not in language_model_config:
+                language_model_config['_name_or_path'] = self._name_or_path
+            language_model_type = language_model_config.get('model_type', '')
+            is_remote_code = '.' in language_model_config.get('auto_map', {}).get('AutoConfig', '')
+            if language_model_type in CONFIG_MAPPING and not is_remote_code:
+                language_model_config = AutoConfig.for_model(**language_model_config)
+            elif language_model_type:
+                Config = get_class_from_dynamic_module(language_model_config["auto_map"]["AutoConfig"],
+                                                       language_model_config['_name_or_path'])
+                language_model_config = Config(**language_model_config)
+        self.language_model_config = language_model_config
+        if isinstance(vision_model_config, dict):
+            if '_name_or_path' not in vision_model_config:
+                vision_model_config['_name_or_path'] = self._name_or_path
+            vision_model_type = vision_model_config.get('model_type', '')
+            is_remote_code = '.' in vision_model_config.get('auto_map', {}).get('AutoConfig', '')
+            if vision_model_type in CONFIG_MAPPING and not is_remote_code:
+                vision_model_config = AutoConfig.for_model(**vision_model_config)
+            elif vision_model_type:
+                Config = get_class_from_dynamic_module(vision_model_config["auto_map"]["AutoConfig"],
+                                                       vision_model_config['_name_or_path'])
+                vision_model_config = Config(**vision_model_config)
+        self.vision_model_config = vision_model_config
+        self.vision_abstractor_config = vision_abstractor_config
+        self.image_token_id = image_token_id
+        try:
+            self.mask_patch_embedding_out_channels = self.vision_model_config.vision_config.hidden_size
+        except AttributeError:
+            self.mask_patch_embedding_out_channels = mask_patch_embedding_out_channels
+        self.kernel_size = kernel_size
+    @property
+    def hidden_size(self):
+        return self.language_model_config.hidden_size
+    def to_dict(self):
+        ret_dict = super().to_dict()
+        ret_dict["auto_map"] = {
+            "AutoConfig": "configuration_pdmllm.PDMLLMConfig",
+            "AutoModel": "modeling_pdmllm.PDMLLM",
+            "AutoModelForCausalLM": "modeling_pdmllm.PDMLLM"
+        }
+        return ret_dict
+    @classmethod
+    def from_dict(cls, config_dict, **kwargs):
+        if 'name_or_path' in kwargs:
+            config_dict['_name_or_path'] = kwargs.pop('name_or_path')
+        return super().from_dict(config_dict, **kwargs)
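
`PDMLLMConfig.__init__` decides per sub-config whether to build it from a built-in Transformers config class or from remote code, based on `model_type` and on whether `auto_map["AutoConfig"]` contains a dot. The sketch below isolates that decision as a pure function. The name `pick_config_loader`, the returned labels and the `known_model_types` set (a stand-in for `CONFIG_MAPPING`) are assumptions made for illustration.

```python
def pick_config_loader(sub_config: dict, known_model_types: set) -> str:
    """Return which loading path PDMLLMConfig.__init__ would take for a sub-config dict."""
    model_type = sub_config.get("model_type", "")
    # A dotted AutoConfig entry such as "configuration_llada.LLaDAConfig" marks remote code.
    is_remote_code = "." in sub_config.get("auto_map", {}).get("AutoConfig", "")
    if model_type in known_model_types and not is_remote_code:
        return "builtin"    # corresponds to AutoConfig.for_model(**sub_config)
    if model_type:
        return "remote"     # corresponds to get_class_from_dynamic_module(...)
    return "unchanged"      # the dict is stored as-is


# Stand-in for CONFIG_MAPPING; the real mapping is provided by transformers.
known = {"siglip"}

language_cfg = {
    "model_type": "llada",
    "auto_map": {"AutoConfig": "configuration_llada.LLaDAConfig"},
}
vision_cfg = {"model_type": "siglip"}

language_path = pick_config_loader(language_cfg, known)
vision_path = pick_config_loader(vision_cfg, known)
```

Note that the dotted `auto_map` entry forces the remote path even when the model type is also registered locally.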

model-00001-of-00005.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:4ce9307665e98867ed074d88e163135df3dbf752102330c7f8831be63836450d
+size 3950989604

model-00002-of-00005.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:64331db82bdf3706da1c58ac678d9c76644619753acb1ab575ea548f9656a3a4
+size 3926026584

model-00003-of-00005.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9fd6da345b0f5ad381e446101fe12c166537565a73bc20358e460674bbb6b25e
+size 3926026664

model-00004-of-00005.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bc9fdfa279c5bf910f9bd5358c79385247b4d1b56411e77bb61cfbedc7f223e3
+size 3926026664

model-00005-of-00005.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f573bf881d7b601eda01ab540a7aa57eb0beec140324ee02f0e0445b1d7b4280
+size 2646697936

model.safetensors.index.json ADDED

The diff for this file is too large to render. See raw diff

modeling_abstractor.py ADDED

	@@ -0,0 +1,30 @@

+import re
+import torch
+from torch import nn
+from torch.nn import functional as F
+def build_projection(projection_type: str, in_dim: int, out_dim: int) -> nn.Module:
+    mlp_gelu_match = re.match(r'^mlp(\d+)x_gelu$', projection_type)
+    if mlp_gelu_match:
+        mlp_depth = int(mlp_gelu_match.group(1))
+        modules = [nn.Linear(in_dim, out_dim)]
+        for _ in range(1, mlp_depth):
+            modules.append(nn.GELU())
+            modules.append(nn.Linear(out_dim, out_dim))
+        projection = nn.Sequential(*modules)
+        return projection
+    raise ValueError(f'Unknown projector type: {projection_type}')
+class PerceiverProjection(nn.Module):
+    def __init__(self, projection_type: str, in_dim: int, out_dim: int):
+        super().__init__()
+        self.projection = build_projection(projection_type, in_dim, out_dim)
+    def forward(self, input_embeds: torch.Tensor):
+        input_embeds.requires_grad_(True)
+        embeds = self.projection(input_embeds)
+        embeds.requires_grad_(True)
+        return embeds
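
`build_projection` parses the depth out of a `mlp<N>x_gelu` string and stacks `N` linear layers with GELU between them. The sketch below reproduces only the parsing and layer ordering, returning plain tuples instead of `torch.nn` modules so it runs without PyTorch. The function name `projection_layer_plan` and the tuple representation are hypothetical. The input width 4608 assumes the values in `config.json` (1152 vision channels, 2 x 2 token merging).

```python
import re


def projection_layer_plan(projection_type: str, in_dim: int, out_dim: int) -> list:
    """List the layers build_projection would stack, as plain tuples."""
    match = re.match(r"^mlp(\d+)x_gelu$", projection_type)
    if not match:
        raise ValueError(f"Unknown projector type: {projection_type}")
    depth = int(match.group(1))
    # The first layer maps vision features to the language model width.
    plan = [("Linear", in_dim, out_dim)]
    # Every further layer is preceded by a GELU and keeps the width fixed.
    for _ in range(1, depth):
        plan.append(("GELU",))
        plan.append(("Linear", out_dim, out_dim))
    return plan


# "mlp2x_gelu" is the projection_type set in vision_abstractor_config.
plan = projection_layer_plan("mlp2x_gelu", in_dim=4608, out_dim=4096)
```

For `mlp2x_gelu` this gives a two-layer MLP: a 4608 to 4096 projection, a GELU, then a 4096 to 4096 layer.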

modeling_llada.py ADDED

The diff for this file is too large to render. See raw diff

modeling_pdmllm.py ADDED

	@@ -0,0 +1,1194 @@

+from typing import Optional, List
+import re
+import torch
+import torchvision
+import transformers
+from einops import rearrange
+from torch import nn
+from torch.nn import functional as F
+from transformers import PreTrainedModel, AutoModel, AutoModelForCausalLM, GenerationConfig
+from transformers import AutoConfig
+from transformers.modeling_outputs import BaseModelOutputWithPooling
+from transformers.feature_extraction_utils import BatchFeature
+from .configuration_pdmllm import PDMLLMConfig
+from .modeling_abstractor import PerceiverProjection
+from .modeling_llada import LLaDAModelLM
+from .cache import *
+from .configuration_llada import LLaDAConfig
+def build_vision_model(config, model=None):
+    assert hasattr(config, "name_or_path")
+    if model is None:
+        model = AutoModel.from_pretrained(
+            config.name_or_path, config=config, trust_remote_code=True)
+    return model
+def vit_forward_with_mask(
+    self,
+    pixel_values,
+    interpolate_pos_encoding: bool = False,
+    mask_embeddings=None,
+    output_hidden_states: bool = False,
+    **kwargs,
+):
+    attention_mask = kwargs.pop("attention_mask", None)
+    kwargs.pop("output_hidden_states", None)
+    kwargs.pop("output_attentions", None)
+    _, _, height, width = pixel_values.shape
+    target_dtype = self.embeddings.patch_embedding.weight.dtype
+    patch_embeds = self.embeddings.patch_embedding(pixel_values.to(dtype=target_dtype))  # shape = [*, width, grid, grid]
+    embeddings = patch_embeds.flatten(2).transpose(1, 2)
+    #hidden_states = self.embeddings(pixel_values, interpolate_pos_encoding=interpolate_pos_encoding)
+    if mask_embeddings is not None:
+        embeddings = embeddings + mask_embeddings.to(embeddings.device, dtype=embeddings.dtype)
+    if interpolate_pos_encoding:
+        embeddings = embeddings + self.embeddings.interpolate_pos_encoding(embeddings, height, width)
+    else:
+        embeddings = embeddings + self.embeddings.position_embedding(self.embeddings.position_ids)
+    collected_hs = [] if output_hidden_states else None
+    for layer in self.encoder.layers:
+        hs = layer(embeddings, attention_mask=attention_mask)
+        if isinstance(hs, tuple):
+            hs = hs[0]
+        embeddings = hs
+        if collected_hs is not None:
+            collected_hs.append(embeddings)
+    last_hidden_state = self.post_layernorm(embeddings)
+    pooler_output = self.head(last_hidden_state) if self.use_head else None
+    return BaseModelOutputWithPooling(
+        last_hidden_state=last_hidden_state,
+        pooler_output=pooler_output,
+        hidden_states=tuple(collected_hs) if collected_hs is not None else None,
+    )
+class PDMLLM(PreTrainedModel):
+    config_class = PDMLLMConfig
+    supports_gradient_checkpointing = True
+    _skip_keys_device_placement = "past_key_values"
+    _supports_cache_class = False
+    _supports_flash_attn_2 = True
+    _supports_sdpa = True
+    accepts_loss_kwargs = False
+    def __init__(self,
+                 config: PDMLLMConfig,
+                 language_model=None,
+                 vision_model=None,
+                 processor=None,
+                ):
+        super().__init__(config)
+        self.image_size = config.image_size
+        self.patch_size = config.patch_size
+        self.downsample_ratio = config.downsample_ratio
+        self.num_image_token = config.num_image_token
+        self.vision_select_layer = config.vision_select_layer
+        self.replacement_noise_mode = config.replacement_noise_mode
+        try:
+            vision_hidden_states = self.config.vision_model_config.hidden_size
+        except AttributeError:
+            vision_hidden_states = self.config.vision_model_config.vision_config.hidden_size
+            self.config.vision_model_config.hidden_size = vision_hidden_states
+        vision_model = build_vision_model(config.vision_model_config, vision_model)
+        vision_abstractor = PerceiverProjection(**config.vision_abstractor_config,
+                                                in_dim=self.config.vision_model_config.hidden_size * (int(1 / self.downsample_ratio) ** 2),
+                                                out_dim=self.config.language_model_config.hidden_size)
+        if language_model is None:
+            kwargs_ = {}
+            if config._attn_implementation_internal is not None:
+                kwargs_['attn_implementation'] = config._attn_implementation_internal
+            if 'llada' in config.language_model_config.name_or_path.lower():
+                with transformers.modeling_utils.no_init_weights():
+                    language_model = LLaDAModelLM(config.language_model_config)
+            else:
+                raise ValueError(f"Unsupported language model: {config.language_model_config.name_or_path}")
+        self.vision_model = vision_model
+        self.vision_abstractor = vision_abstractor
+        self.language_model = language_model
+        # self.mask_patch_embedding = nn.Conv2d(
+        #     in_channels=1,
+        #     out_channels=config.mask_patch_embedding_out_channels,
+        #     kernel_size=config.kernel_size,
+        #     stride=config.kernel_size,
+        #     bias=False,
+        # )
+        self.mask_id_embedding = nn.Embedding(config.prompt_numbers, vision_hidden_states)
+        #self.vit = self.vision_model.vision_model
+        #self.vit.forward = vit_forward_with_mask.__get__(self.vit, self.vit.__class__)
+        self.vision_model.vision_model.forward = vit_forward_with_mask.__get__(self.vision_model.vision_model, self.vision_model.vision_model.__class__)
+        # zero-init
+        # for param in self.mask_patch_embedding.parameters():
+        #     nn.init.zeros_(param)
+        if processor is not None:
+            self.processor = processor
+        self.prompt_numbers = config.prompt_numbers
+        # Optional override for how many RoI-aligned tokens replace a crop token.
+        self.roi_output_size = getattr(config, "roi_output_size", None)
+        # Only add special tokens when a processor is available (i.e. during training).
+        # During inference via from_pretrained, the tokens are already in the saved tokenizer.
+        if hasattr(self, "processor"):
+            self._add_special_tokens()
+        self.gradient_checkpointing_enable()
+    def _add_special_tokens(self):
+        assert hasattr(self, "processor")
+        visual_prompt_nums = self.prompt_numbers
+        visual_prompt_tokens = [f"<Prompt{i}>" for i in range(visual_prompt_nums)]
+        visual_prompt_tokens.append("<NO_Prompt>")
+        special_tokens = visual_prompt_tokens
+        num_new_tokens = self.processor.tokenizer.add_tokens(
+            special_tokens, special_tokens=True
+        )
+        self.language_model.resize_token_embeddings(len(self.processor.tokenizer))
+        print(f"Added {num_new_tokens} special tokens.")
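+    def forward_vision(self, pixel_values, global_mask_values_list=None, prompt_tokens=None):
+        """Encode image tiles, injecting visual-prompt mask embeddings into the ViT.
+        Args:
+            pixel_values: (n, c, h, w) tiles, or a BatchFeature holding "pixel_values".
+            global_mask_values_list: one mask tensor per prompt, or a BatchFeature holding
+                "pixel_values_list". Mask pixels that do not map to 255 after rescaling
+                from [-1, 1] mark the prompted region.
+            prompt_tokens: "<Prompt{i}>" strings, aligned one-to-one with the masks.
+        Returns:
+            Vision embeddings projected by ``self.vision_abstractor``.
+        """
+        # Unwrap BatchFeature if needed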
+        if isinstance(pixel_values, BatchFeature):
+            pixel_values = pixel_values["pixel_values"]
+        # Precompute mask embeddings so they can be injected before the vision encoder.
+        mask_embeds = None
+        if global_mask_values_list is not None:
+            if isinstance(global_mask_values_list, BatchFeature):
+                mask_values_list = global_mask_values_list.get("pixel_values_list", None)
+            else:
+                mask_values_list = global_mask_values_list
+            if mask_values_list is not None:
+                K = self.config.kernel_size[0]
+                h_patches = pixel_values.shape[2] // K
+                w_patches = pixel_values.shape[3] // K
+                mask_embeds = torch.zeros(
+                    pixel_values.shape[0],
+                    self.config.vision_model_config.vision_config.hidden_size,
+                    h_patches, w_patches,
+                    dtype=pixel_values.dtype,
+                    device=pixel_values.device,
+                )
+                for prompt_token, mask_values in zip(prompt_tokens, mask_values_list):
+                    prompt_id = int(re.search(r"<Prompt(\d+)>", prompt_token).group(1))
+                    vp_id = torch.tensor(prompt_id, device=pixel_values.device)
+                    vp_embed = self.mask_id_embedding(vp_id).to(pixel_values.device)  # (C,)
+                    if mask_values.shape[1] > 1:
+                        mask_values = mask_values.mean(dim=1, keepdim=True)
+                    mask_values = mask_values.to(pixel_values.device)
+                    mask_values = torch.round((mask_values + 1.0) / 2.0 * 255.0).long()
+                    mask_values = torch.clamp(mask_values, min=0, max=255)
+                    binary_mask = (mask_values != 255).to(pixel_values.dtype)  # (B, 1, H, W)
+                    ## mask_patch_embeds = self.mask_patch_embedding(binary_mask)  # (B, C, h_patches, w_patches)
+                    active_patches = torch.nn.functional.interpolate(
+                        binary_mask,
+                        size=(h_patches, w_patches),
+                        mode='nearest'
+                    )  # (B, 1, h_patches, w_patches)
+                    # Add the mask id embedding at active patches (the mask conv embedding is currently disabled)
+                    mask_embeds = mask_embeds + vp_embed.view(1, -1, 1, 1) * active_patches ## + mask_patch_embeds
+                mask_embeds = mask_embeds.flatten(2).transpose(1, 2)  # (B, num_patches, C)
+        if mask_embeds is None:
+            raise ValueError(
+                "forward_vision could not build mask embeddings: "
+                "global_mask_values_list is missing or empty."
+            )
+        vision_outputs = self.vision_model.vision_model(
+            pixel_values=pixel_values,
+            mask_embeddings=mask_embeds,
+            output_hidden_states=True,
+        )
+        if self.vision_select_layer == -1:
+            image_embeddings = vision_outputs.last_hidden_state
+        else:
+            image_embeddings = vision_outputs.hidden_states[self.vision_select_layer] # (B, N, C)
+        # Keep all tile embeddings — do NOT filter by image_flags.
+        # All tiles are real crops from a single image (produced by dynamic_preprocess).
+        # Filtering by pixel-sum==0 can incorrectly drop tiles whose normalized
+        # pixel values happen to sum to zero, causing shape mismatches with
+        # input_ids image tokens and aspect_ratios in downstream _merge / RoI-align.
+        vit_embeds = image_embeddings
+        if self.downsample_ratio != 1:
+            patch_num = self.image_size // self.patch_size
+            vit_embeds = vit_embeds.reshape(vit_embeds.shape[0], patch_num, patch_num, vit_embeds.shape[-1])
+            vit_embeds = self.pixel_shuffle(vit_embeds, scale_factor=self.downsample_ratio)
+            vit_embeds = vit_embeds.flatten(1, 2)
+        vit_embeds = self.vision_abstractor(vit_embeds)
+        return vit_embeds
+    def prepare_for_lm(self, input_ids, vision_embeds):
+        inputs_embeds = self.get_input_embeddings()(input_ids)
+        vision_embeds_ = vision_embeds
+        if vision_embeds is not None:
+            try:
+                vision_mask = input_ids == self.config.image_token_id
+                if torch.count_nonzero(vision_mask).item() != vision_embeds.shape[:-1].numel():
+                    info = "vision embeddings mismatch input embeddings: " \
+                           f"vision_mask shape={vision_mask.shape}; " \
+                           f"vision_mask count={torch.count_nonzero(vision_mask)}; " \
+                           f"vision_embeds shape={vision_embeds.shape}"
+                    # print(info)
+                    num_vision_1 = torch.count_nonzero(vision_mask).item()
+                    num_vision_2 = vision_embeds.shape[:-1].numel()
+                    vision_embeds = vision_embeds.contiguous()
+                    if num_vision_1 <= num_vision_2:
+                        vision_embeds = vision_embeds.view(-1, vision_embeds.size(-1))[:num_vision_1]
+                    else:
+                        vision_embeds = vision_embeds.view(-1, vision_embeds.size(-1))
+                        less_nums = num_vision_1 - num_vision_2
+                        vision_embeds = torch.cat([vision_embeds, vision_embeds[-less_nums:]], dim=0)
+                    vision_embeds = vision_embeds.contiguous()
+                # assert torch.count_nonzero(vision_mask).item() == vision_embeds.shape[:-1].numel(), \
+                #     "vision embeddings mismatch input embeddings: " \
+                #     f"vision_mask shape={vision_mask.shape}; " \
+                #     f"vision_mask count={torch.count_nonzero(vision_mask)}; " \
+                #     f"vision_embeds shape={vision_embeds.shape}"
+                inputs_embeds = torch.masked_scatter(inputs_embeds, vision_mask.unsqueeze(-1),
+                                                     vision_embeds.to(inputs_embeds.dtype).view(-1,
+                                                                                                vision_embeds.size(-1)))
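+            except Exception:
+                # Fallback: add a zero-weighted vision term so the vision branch stays in the
+                # autograd graph while the text embeddings are left unchanged.
+                inputs_embeds = inputs_embeds + torch.sum(vision_embeds_[0, 0, :]) * 0.0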
+        return inputs_embeds
+    def _prepare_inputs_for_generation(
+        self,
+        input_ids,
+        pixel_values=None,
+        global_mask_values_list=None,
+        aspect_ratios=None,
+        bboxes=None,
+        prompt_tokens=None,
+        attention_mask=None,
+        position_ids=None,
+        tokenizer=None,
+    ):
+        vision_embeds = None
+        if pixel_values is not None:
+            vision_embeds = self.forward_vision(pixel_values, global_mask_values_list=global_mask_values_list, prompt_tokens=prompt_tokens)
+        inputs_embeds = self.prepare_for_lm(input_ids, vision_embeds)
+        reserved_token_spans: List[List[tuple]] = [[] for _ in range(input_ids.shape[0])]
+        length_changed = False
+        if vision_embeds is not None and aspect_ratios is not None and bboxes is not None:
+            crop_tokens = [
+                tokenizer.convert_tokens_to_ids(f"<|reserved_token_{pid}|>")
+                for pid in range(self.prompt_numbers)
+            ]
+            patch_num = self.image_size // self.patch_size
+            if self.downsample_ratio != 1:
+                feat_h = int(patch_num * self.downsample_ratio)
+                feat_w = int(patch_num * self.downsample_ratio)
+            else:
+                feat_h = patch_num
+                feat_w = patch_num
+            if vision_embeds.shape[0] != 1:
+                image_features_tiles = rearrange(
+                    vision_embeds[1:].unsqueeze(0), "b n (h w) c -> b n c h w", h=feat_h, w=feat_w
+                )
+            else:
+                image_features_tiles = rearrange(
+                    vision_embeds.unsqueeze(0), "b n (h w) c -> b n c h w", h=feat_h, w=feat_w
+                )
+            new_inputs_embeds = []
+            new_input_ids_list = []
+            assert inputs_embeds.shape[0] == 1, "Currently only support batch_size=1"
+            for batch_idx in range(inputs_embeds.shape[0]):
+                curr_inputs_embeds = inputs_embeds[batch_idx]
+                curr_input_ids = input_ids[batch_idx]
+                replacements = []
+                orig_input_ids = input_ids[batch_idx]
+                for cap_idx, crop_token in enumerate(crop_tokens):
+                    target_mask = orig_input_ids.eq(crop_token)
+                    if not target_mask.any():
+                        continue
+                    target_indices = target_mask.nonzero().squeeze()
+                    if target_indices.ndim == 0:
+                        head_idx = tail_idx = target_indices.item()
+                    else:
+                        head_idx = target_indices.min().item()
+                        tail_idx = target_indices.max().item()
+                    replacements.append((head_idx, tail_idx, cap_idx, crop_token))
+                # Apply replacements in ascending order with running shift to keep spans aligned
+                replacements.sort(key=lambda x: x[0])
+                running_shift = 0
+                for head_idx, tail_idx, cap_idx, crop_token in replacements:
+                    adj_head = head_idx + running_shift
+                    adj_tail = tail_idx + running_shift
+                    image_features_recover = self._merge(
+                        image_features_tiles,
+                        aspect_ratios[batch_idx][0],
+                        aspect_ratios[batch_idx][1],
+                    )
+                    feat_h, feat_w = image_features_recover.shape[2:]
+                    x1, y1, x2, y2 = bboxes[batch_idx][str(crop_token)]
+                    # NOTE: forward() uses a factor of 28 here instead of 16 * 2 = 32.
+                    orig_h, orig_w = feat_h * 16 * 2, feat_w * 16 * 2
+                    roi_orig_x1 = x1 * orig_w
+                    roi_orig_y1 = y1 * orig_h
+                    roi_orig_x2 = x2 * orig_w
+                    roi_orig_y2 = y2 * orig_h
+                    spatial_scale = feat_w / orig_w
+                    roi_feat_x1 = roi_orig_x1 * spatial_scale
+                    roi_feat_y1 = roi_orig_y1 * spatial_scale
+                    roi_feat_x2 = roi_orig_x2 * spatial_scale
+                    roi_feat_y2 = roi_orig_y2 * spatial_scale
+                    roi = torch.tensor(
+                        [0, roi_feat_x1, roi_feat_y1, roi_feat_x2, roi_feat_y2],
+                        dtype=torch.float32,
+                        device=image_features_recover.device,
+                    )
+                    if self.roi_output_size is None:
+                        output_h, output_w = feat_h, feat_w
+                    elif isinstance(self.roi_output_size, int):
+                        output_h = output_w = self.roi_output_size
+                    else:
+                        output_h, output_w = self.roi_output_size
+                    roi_features = torchvision.ops.roi_align(
+                        input=image_features_recover.float(),
+                        boxes=roi.unsqueeze(0),
+                        output_size=(output_h, output_w),
+                        spatial_scale=spatial_scale,
+                        sampling_ratio=2,
+                        aligned=True,
+                    )
+                    image_features_replay = (
+                        roi_features.permute(0, 2, 3, 1)
+                        .flatten(1, 2)
+                        .to(image_features_recover.dtype)
+                        .squeeze()
+                    )
+                    curr_inputs_embeds = torch.cat(
+                        [
+                            curr_inputs_embeds[:adj_head],
+                            image_features_replay,
+                            curr_inputs_embeds[adj_tail + 1 :],
+                        ]
+                    )
+                    curr_input_ids = torch.cat(
+                        [
+                            curr_input_ids[:adj_head],
+                            torch.full(
+                                (image_features_replay.shape[0],),
+                                crop_token,
+                                dtype=torch.long,
+                                device=curr_input_ids.device,
+                            ),
+                            curr_input_ids[adj_tail + 1 :],
+                        ]
+                    )
+                    reserved_token_spans[batch_idx].append(
+                        (cap_idx, adj_head, adj_head + image_features_replay.shape[0])
+                    )
+                    length_changed = True
+                    delta = image_features_replay.shape[0] - (tail_idx - head_idx + 1)
+                    running_shift += delta
+                if reserved_token_spans[batch_idx]:
+                    reserved_token_spans[batch_idx].sort(key=lambda x: x[1])
+                new_inputs_embeds.append(curr_inputs_embeds.unsqueeze(0))
+                new_input_ids_list.append(curr_input_ids.unsqueeze(0))
+            inputs_embeds = torch.cat(new_inputs_embeds, dim=0)
+            input_ids = torch.cat(new_input_ids_list, dim=0)
+        if (
+            length_changed
+            or attention_mask is None
+            or attention_mask.shape[1] != inputs_embeds.shape[1]
+            or position_ids is None
+            or position_ids.shape[1] != inputs_embeds.shape[1]
+        ):
+            attention_mask = torch.ones(
+                inputs_embeds.shape[0],
+                inputs_embeds.shape[1],
+                dtype=torch.long,
+                device=inputs_embeds.device,
+            )
+            position_ids = (
+                torch.arange(
+                    0,
+                    inputs_embeds.shape[1],
+                    dtype=torch.long,
+                    device=inputs_embeds.device,
+                )
+                .unsqueeze(0)
+                .repeat(inputs_embeds.shape[0], 1)
+            )
+        return inputs_embeds, attention_mask, position_ids, input_ids, reserved_token_spans
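+    def pixel_shuffle(self, x, scale_factor=0.5):
+        """Trade spatial resolution for channel depth.
+        Maps (N, W, H, C) to (N, W * scale, H * scale, C / scale ** 2).
+        """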
+        x = x.contiguous()
+        n, w, h, c = x.size()
+        # N, W, H, C --> N, W, H * scale, C // scale
+        x = x.view(n, w, int(h * scale_factor), int(c / scale_factor))
+        # N, W, H * scale, C // scale --> N, H * scale, W, C // scale
+        x = x.permute(0, 2, 1, 3).contiguous()
+        # N, H * scale, W, C // scale --> N, H * scale, W * scale, C // (scale ** 2)
+        x = x.view(n, int(h * scale_factor), int(w * scale_factor),
+                   int(c / (scale_factor * scale_factor)))
+        x = x.permute(0, 2, 1, 3).contiguous()
+        return x
+    def _merge(self, tiles: torch.Tensor, ncw: int, nch: int) -> torch.Tensor:
+        """Merge image tiles back to original spatial layout."""
+        batch_size, num_tiles, num_channels, tile_height, tile_width = tiles.size()
+        assert num_tiles == ncw * nch, f"{ncw * nch} != {num_tiles}"
+        tiles = tiles.view(batch_size, nch, ncw, num_channels, tile_height, tile_width)
+        tiles = tiles.permute(0, 3, 1, 4, 2, 5).contiguous()
+        original_height = nch * tile_height
+        original_width = ncw * tile_width
+        image = tiles.view(batch_size, num_channels, original_height, original_width)
+        return image
+    def _build_custom_4d_mask(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask_2d: torch.Tensor,
+        tokenizer,
+        dtype: torch.dtype,
+        reserved_token_spans: Optional[List[List[tuple]]] = None,
+    ) -> Optional[torch.Tensor]:
+        """Construct a 4D attention mask so each Mask_Cap_i block only attends to itself,
+        image tokens, and its corresponding reserved token embeddings.
+        Args:
+            input_ids: (B, L)
+            attention_mask_2d: (B, L) padding mask
+            tokenizer: tokenizer with convert_tokens_to_ids
+            dtype: target dtype for the mask (match hidden states)
+            reserved_token_spans: optional per-batch list of (idx, start, end) spans that
+                replaced <|reserved_token_i|>. End is exclusive.
+        Returns:
+            mask_4d: (B, 1, L, L) or None if tokenizer is missing
+        """
+        if tokenizer is None:
+            return None
+        device = input_ids.device
+        batch_size, seq_len = input_ids.shape
+        neg_value = torch.finfo(dtype).min
+        image_token_id = getattr(self.config, "image_token_id", None)
+        image_positions = input_ids.eq(image_token_id) if image_token_id is not None else None
+        eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
+        # Precompute Mask_Cap and reserved token ids
+        mask_cap_ids = []
+        reserved_token_ids = []
+        for i in range(self.prompt_numbers):
+            mask_cap_ids.append((i, tokenizer.convert_tokens_to_ids(f"<|Mask_Cap_{i}|>")))
+            reserved_token_ids.append(tokenizer.convert_tokens_to_ids(f"<|reserved_token_{i}|>"))
+        mask_4d = torch.zeros((batch_size, 1, seq_len, seq_len), device=device, dtype=dtype)
+        for b in range(batch_size):
+            seq = input_ids[b]
+            valid_positions = attention_mask_2d[b].bool()
+            valid_indices = torch.nonzero(valid_positions, as_tuple=False).flatten().tolist()
+            img_idx = (
+                torch.nonzero(image_positions[b], as_tuple=False).flatten().tolist()
+                if image_positions is not None
+                else []
+            )
+            for cap_idx, cap_token_id in mask_cap_ids:
+                if cap_token_id is None or cap_token_id < 0:
+                    continue
+                cap_locs = torch.nonzero(seq == cap_token_id, as_tuple=False).flatten()
+                if cap_locs.numel() == 0:
+                    continue
+                start = cap_locs[0].item()
+                # Determine the end boundary: next mask_cap or last token in the sentence.
+                # NOTE: <|eot_id|> is NOT used as boundary because it now serves as
+                # padding within each caption block after the caption-padding change.
+                end_candidates = []
+                for later_idx, later_token_id in mask_cap_ids:
+                    if later_idx <= cap_idx:
+                        continue
+                    later_pos = torch.nonzero(seq == later_token_id, as_tuple=False).flatten()
+                    if later_pos.numel() > 0:
+                        end_candidates.append(later_pos[0].item())
+                end = min(end_candidates) if len(end_candidates) > 0 else seq_len
+                group_tokens = [i for i in range(start, end) if valid_positions[i]]
+                if len(group_tokens) == 0:
+                    continue
+                # Collect reserved token spans for this caption block
+                allowed_reserved_positions: List[int] = []
+                if reserved_token_spans is not None and len(reserved_token_spans) > b:
+                    for idx, span_start, span_end in reserved_token_spans[b]:
+                        if idx == cap_idx:
+                            allowed_reserved_positions.extend(range(span_start, min(span_end, seq_len)))
+                # Fallback to original reserved token id if no recorded span
+                if len(allowed_reserved_positions) == 0:
+                    reserved_id = reserved_token_ids[cap_idx]
+                    if reserved_id is not None and reserved_id >= 0:
+                        allowed_reserved_positions.extend(
+                            torch.nonzero(seq == reserved_id, as_tuple=False).flatten().tolist()
+                        )
+                fix_prompt_positions = torch.nonzero(
+                    seq == tokenizer.convert_tokens_to_ids('<|reserved_token_0|>'),
+                    as_tuple=False,
+                ).flatten()
+                fix_prompt_len = fix_prompt_positions[0].item() if fix_prompt_positions.numel() > 0 else 0
+                # Use the latest recorded reserved span (after sorting) when available
+                last_span_end = (
+                    reserved_token_spans[b][-1][2]
+                    if reserved_token_spans is not None
+                    and len(reserved_token_spans) > b
+                    and len(reserved_token_spans[b]) > 0
+                    else fix_prompt_len
+                )
+                mask_cap_0_position = torch.nonzero(
+                    seq == tokenizer.convert_tokens_to_ids('<|Mask_Cap_0|>'),
+                    as_tuple=False,
+                ).flatten().tolist()
+                fix_prompt_idx = torch.arange(fix_prompt_len, device=device).tolist() + list(range(last_span_end, mask_cap_0_position[0]))
+                allowed_targets = set(group_tokens) | set(fix_prompt_idx) | set(allowed_reserved_positions)
+                disallowed = set(valid_indices) - allowed_targets
+                if len(disallowed) == 0:
+                    continue
+                disallowed_tensor = torch.tensor(list(disallowed), device=device)
+                for q in group_tokens:
+                    mask_4d[b, 0, q, disallowed_tensor] = neg_value
+            # Optionally mask out padding for all queries (consistency)
+            if len(valid_indices) < seq_len:
+                invalid = torch.nonzero(~valid_positions, as_tuple=False).flatten()
+                if invalid.numel() > 0:
+                    mask_4d[b, 0, :, invalid] = neg_value
+        return mask_4d
+    def forward(self,
+                input_ids: torch.LongTensor = None,
+                attention_mask: Optional[torch.BoolTensor] = None,
+                position_ids: Optional[torch.LongTensor] = None,
+                pixel_values: Optional[torch.Tensor] = None,
+                global_mask_values_list: Optional[List[torch.Tensor]] = None,
+                aspect_ratios: Optional[List] = None,
+                bboxes: Optional[List] = None,
+                prompt_tokens: Optional[List] = None,
+                past_key_values: Optional[List[torch.FloatTensor]] = None,
+                labels: Optional[torch.LongTensor] = None,
+                return_dict: bool = True,
+                **kwargs,
+                ):
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        # ========Get visual embedding========
+        if pixel_values is not None:
+            vision_embeds = self.forward_vision(pixel_values, global_mask_values_list=global_mask_values_list, prompt_tokens=prompt_tokens)
+        else:
+            vision_embeds = None
+        # ========Prepare inputs for LM========
+        # print(f"input_ids.shape: {input_ids.shape}", {vision_embeds.shape})
+        inputs_embeds = self.prepare_for_lm(input_ids, vision_embeds)
+        # print(f"inputs_embeds.shape: {inputs_embeds.shape}")
+        p_mask = None
+        answer_length = None
+        reserved_token_spans = [[] for _ in range(input_ids.shape[0])]
+        # ========Feature Replay (from grasp_any_region)========
+        if vision_embeds is not None and aspect_ratios is not None and bboxes is not None:
+            # Get crop tokens from reserved special tokens
+            crop_tokens = [
+                self.processor.tokenizer.convert_tokens_to_ids(
+                    f"<|reserved_token_{pid}|>"
+                )
+                for pid in range(self.prompt_numbers)
+            ]
+            # Reshape vision_embeds to tiles format for feature replay
+            # Assuming vision_embeds shape: (num_tiles, num_tokens, hidden_dim)
+            # Need to convert to (batch, num_tiles, channels, h, w) format
+            patch_num = self.image_size // self.patch_size
+            if self.downsample_ratio != 1:
+                feat_h = int(patch_num * self.downsample_ratio)
+                feat_w = int(patch_num * self.downsample_ratio)
+            else:
+                feat_h = patch_num
+                feat_w = patch_num
+            # Reshape vision_embeds: (num_tiles, num_tokens, hidden_dim) -> (1, num_tiles, hidden_dim, h, w)
+            if vision_embeds.shape[0] != 1:
+                image_features_tiles = rearrange(
+                    vision_embeds[1:].unsqueeze(0), "b n (h w) c -> b n c h w", h=feat_h, w=feat_w
+                )
+            else:
+                image_features_tiles = rearrange(
+                    vision_embeds.unsqueeze(0), "b n (h w) c -> b n c h w", h=feat_h, w=feat_w
+                )
+            new_inputs_embeds = []
+            new_input_ids_list = []
+            new_labels = [] if labels is not None else None
+            length_changed = False
+            assert inputs_embeds.shape[0] == 1, "Currently only support batch_size=1"
+            for batch_idx in range(inputs_embeds.shape[0]):
+                curr_inputs_embeds = inputs_embeds[batch_idx]
+                curr_input_ids = input_ids[batch_idx]
+                curr_labels = labels[batch_idx] if labels is not None else None
+                # Collect all replacements first to avoid index shifting during insertion
+                orig_input_ids = input_ids[batch_idx]
+                replacements = []
+                for cap_idx, crop_token in enumerate(crop_tokens):
+                    target_mask = orig_input_ids.eq(crop_token)
+                    if not target_mask.any():
+                        continue
+                    target_indices = target_mask.nonzero().squeeze()
+                    if target_indices.ndim == 0:
+                        head_idx = tail_idx = target_indices.item()
+                    else:
+                        head_idx = target_indices.min().item()
+                        tail_idx = target_indices.max().item()
+                    replacements.append((head_idx, tail_idx, cap_idx, crop_token))
+                # Apply replacements in ascending order with running shift to keep spans aligned
+                replacements.sort(key=lambda x: x[0])
+                running_shift = 0
+                for head_idx, tail_idx, cap_idx, crop_token in replacements:
+                    adj_head = head_idx + running_shift
+                    adj_tail = tail_idx + running_shift
+                    # Merge tiles back to original spatial layout
+                    image_features_recover = self._merge(
+                        image_features_tiles,
+                        aspect_ratios[batch_idx][0],
+                        aspect_ratios[batch_idx][1],
+                    )
+                    feat_h, feat_w = image_features_recover.shape[2:]
+                    # Get bbox coordinates
+                    x1, y1, x2, y2 = bboxes[batch_idx][str(crop_token)]
+                    # RoI-Align
+                    orig_h, orig_w = feat_h * 28, feat_w * 28  # Original image size
+                    # Origin box
+                    roi_orig_x1 = x1 * orig_w
+                    roi_orig_y1 = y1 * orig_h
+                    roi_orig_x2 = x2 * orig_w
+                    roi_orig_y2 = y2 * orig_h
+                    # Feature box
+                    spatial_scale = feat_w / orig_w
+                    roi_feat_x1 = roi_orig_x1 * spatial_scale
+                    roi_feat_y1 = roi_orig_y1 * spatial_scale
+                    roi_feat_x2 = roi_orig_x2 * spatial_scale
+                    roi_feat_y2 = roi_orig_y2 * spatial_scale
+                    roi = torch.tensor(
+                        [0, roi_feat_x1, roi_feat_y1, roi_feat_x2, roi_feat_y2],
+                        dtype=torch.float32,
+                        device=image_features_recover.device,
+                    )
+                    # output_size controls how many tokens are inserted (output_h * output_w)
+                    if self.roi_output_size is None:
+                        output_h, output_w = feat_h, feat_w
+                    elif isinstance(self.roi_output_size, int):
+                        output_h = output_w = self.roi_output_size
+                    else:
+                        output_h, output_w = self.roi_output_size
+                    roi_features = torchvision.ops.roi_align(
+                        input=image_features_recover.float(),
+                        boxes=roi.unsqueeze(0),
+                        output_size=(output_h, output_w),
+                        spatial_scale=spatial_scale,
+                        sampling_ratio=2,
+                        aligned=True,
+                    )
+                    image_features_replay = (
+                        roi_features.permute(0, 2, 3, 1)
+                        .flatten(1, 2)
+                        .to(image_features_recover.dtype)
+                        .squeeze()
+                    )
+                    # Replace crop token embeddings with RoI features
+                    curr_inputs_embeds = torch.cat(
+                        [
+                            curr_inputs_embeds[:adj_head],
+                            image_features_replay,
+                            curr_inputs_embeds[adj_tail + 1 :],
+                        ]
+                    )
+                    curr_input_ids = torch.cat(
+                        [
+                            curr_input_ids[:adj_head],
+                            torch.full(
+                                (image_features_replay.shape[0],),
+                                crop_token,
+                                dtype=torch.long,
+                                device=input_ids.device,
+                            ),
+                            curr_input_ids[adj_tail + 1 :],
+                        ]
+                    )
+                    reserved_token_spans[batch_idx].append(
+                        (cap_idx, adj_head, adj_head + image_features_replay.shape[0])
+                    )
+                    if curr_labels is not None:
+                        curr_labels = torch.cat(
+                            [
+                                curr_labels[:adj_head],
+                                -100 * torch.ones(
+                                    image_features_replay.shape[0],
+                                    dtype=torch.long,
+                                    device=labels.device,
+                                ),
+                                curr_labels[adj_tail + 1 :],
+                            ]
+                        )
+                    assert (
+                        curr_labels is None or curr_inputs_embeds.shape[0] == curr_labels.shape[0]
+                    ), f"shape mismatch, got {curr_inputs_embeds.shape[0]} != {curr_labels.shape[0]}"
+                    length_changed = True
+                    # Track shift caused by this replacement for subsequent insertions
+                    delta = image_features_replay.shape[0] - (tail_idx - head_idx + 1)
+                    running_shift += delta
+                # Keep spans ordered by start so downstream masking reads consistent positions
+                if reserved_token_spans[batch_idx]:
+                    reserved_token_spans[batch_idx].sort(key=lambda x: x[1])
+                new_inputs_embeds.append(curr_inputs_embeds.unsqueeze(0))
+                new_input_ids_list.append(curr_input_ids.unsqueeze(0))
+                if new_labels is not None:
+                    new_labels.append(curr_labels)
+            inputs_embeds = torch.cat(new_inputs_embeds, dim=0)
+            input_ids = torch.cat(new_input_ids_list, dim=0)
+            if new_labels is not None:
+                labels = torch.cat(new_labels, dim=0)
+            if (
+                length_changed
+                or attention_mask is None
+                or attention_mask.shape[1] != inputs_embeds.shape[1]
+                or position_ids is None
+                or position_ids.shape[1] != inputs_embeds.shape[1]
+            ):
+                attention_mask = torch.ones(
+                    inputs_embeds.shape[0],
+                    inputs_embeds.shape[1],
+                    dtype=torch.long,
+                    device=inputs_embeds.device,
+                )
+                position_ids = (
+                    torch.arange(
+                        0,
+                        inputs_embeds.shape[1],
+                        dtype=torch.long,
+                        device=inputs_embeds.device,
+                    )
+                    .unsqueeze(0)
+                    .repeat(inputs_embeds.shape[0], 1)
+                )
+        if attention_mask is None:
+            attention_mask = torch.ones(
+                inputs_embeds.shape[0],
+                inputs_embeds.shape[1],
+                dtype=torch.long,
+                device=inputs_embeds.device,
+            )
+        if position_ids is None:
+            position_ids = (
+                torch.arange(
+                    0,
+                    inputs_embeds.shape[1],
+                    dtype=torch.long,
+                    device=inputs_embeds.device,
+                )
+                .unsqueeze(0)
+                .repeat(inputs_embeds.shape[0], 1)
+            )
+        tokenizer_for_mask = kwargs.pop("tokenizer", None)
+        if tokenizer_for_mask is None and hasattr(self, "processor") and hasattr(self.processor, "tokenizer"):
+            tokenizer_for_mask = self.processor.tokenizer
+        custom_mask = self._build_custom_4d_mask(
+            input_ids=input_ids,
+            attention_mask_2d=attention_mask,
+            tokenizer=tokenizer_for_mask,
+            dtype=inputs_embeds.dtype,
+            reserved_token_spans=reserved_token_spans,
+        )
+        if custom_mask is not None:
+            attention_mask = custom_mask
+        if self.is_gradient_checkpointing and torch.is_grad_enabled():
+            inputs_embeds.requires_grad_(True)
+        # Normalize label shape to (batch, seq_len) to match logits masking in language model
+        if labels is not None and labels.dim() == 1:
+            expected_tokens = inputs_embeds.shape[0] * inputs_embeds.shape[1]
+            if labels.numel() == expected_tokens:
+                labels = labels.view(inputs_embeds.shape[0], inputs_embeds.shape[1])
+        # ========Forward into LM========
+        outputs = self.language_model(
+            input_ids=None,
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            return_dict=return_dict,
+            labels=labels,
+            use_cache=False,
+            conversation_ids=None,
+            replacement_noise_mode=self.replacement_noise_mode,
+            p_mask=p_mask,
+            answer_length=answer_length,
+            **kwargs,
+        )
+        return outputs
+    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
+        super().gradient_checkpointing_enable(gradient_checkpointing_kwargs)
+        self.language_model.gradient_checkpointing_enable()
+        self.language_model.enable_input_require_grads()
+    def get_input_embeddings(self):
+        return self.language_model.get_input_embeddings()
+    def set_input_embeddings(self, value):
+        self.language_model.set_input_embeddings(value)
+    def get_output_embeddings(self):
+        return self.language_model.get_output_embeddings()
+    def set_output_embeddings(self, new_embeddings):
+        self.language_model.set_output_embeddings(new_embeddings)
+    def set_decoder(self, decoder):
+        self.language_model.set_decoder(decoder)
+    def get_decoder(self):
+        return self.language_model.get_decoder()
+    def tie_weights(self):
+        return self.language_model.tie_weights()
+    @torch.no_grad()
+    def generate(
+            self,
+            pixel_values: Optional[torch.FloatTensor] = None,
+            input_ids: Optional[torch.LongTensor] = None,
+            global_mask_values_list: Optional[torch.FloatTensor] = None,
+            aspect_ratios: Optional[List] = None,
+            bboxes: Optional[List] = None,
+            prompt_tokens: Optional[List] = None,
+            tokenizer=None,
+            **generate_kwargs,
+    ) -> torch.LongTensor:
+        inputs_embeds, attention_mask, position_ids, input_ids, reserved_token_spans = self._prepare_inputs_for_generation(
+            input_ids=input_ids,
+            pixel_values=pixel_values,
+            global_mask_values_list=global_mask_values_list,
+            aspect_ratios=aspect_ratios,
+            bboxes=bboxes,
+            prompt_tokens=prompt_tokens,
+            tokenizer=tokenizer,
+        )
+        tokenizer_for_mask = tokenizer
+        if tokenizer_for_mask is None and hasattr(self, "processor") and hasattr(self.processor, "tokenizer"):
+            tokenizer_for_mask = self.processor.tokenizer
+        custom_mask = self._build_custom_4d_mask(
+            input_ids=input_ids,
+            attention_mask_2d=attention_mask,
+            tokenizer=tokenizer_for_mask,
+            dtype=inputs_embeds.dtype,
+            reserved_token_spans=reserved_token_spans,
+        )
+        if custom_mask is not None:
+            attention_mask = custom_mask
+        if 'llada' in self.config.language_model_config.name_or_path.lower():
+            outputs = self.language_model.generate_with_embeds_nonblock(
+                inputs_embeds=inputs_embeds,
+                input_ids=input_ids,
+                attention_mask=attention_mask,
+                **generate_kwargs,
+            )
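+        else:
+            # Only LLaDA backbones are handled; fail loudly instead of hitting an unbound `outputs`.
+            raise NotImplementedError(
+                f"generate() only supports LLaDA language models, got {self.config.language_model_config.name_or_path!r}"
+            )
+        return outputs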
+    @torch.no_grad()
+    def generate_replace_noise(
+            self,
+            pixel_values: Optional[torch.FloatTensor] = None,
+            input_ids: Optional[torch.LongTensor] = None,
+            global_mask_values_list: Optional[torch.FloatTensor] = None,
+            aspect_ratios: Optional[List] = None,
+            bboxes: Optional[List] = None,
+            prompt_tokens: Optional[List] = None,
+            tokenizer=None,
+            **generate_kwargs,
+    ) -> torch.LongTensor:
+        inputs_embeds, attention_mask, position_ids, input_ids, reserved_token_spans = self._prepare_inputs_for_generation(
+            input_ids=input_ids,
+            pixel_values=pixel_values,
+            global_mask_values_list=global_mask_values_list,
+            aspect_ratios=aspect_ratios,
+            bboxes=bboxes,
+            prompt_tokens=prompt_tokens,
+            tokenizer=tokenizer,
+        )
+        tokenizer_for_mask = tokenizer
+        if tokenizer_for_mask is None and hasattr(self, "processor") and hasattr(self.processor, "tokenizer"):
+            tokenizer_for_mask = self.processor.tokenizer
+        custom_mask = self._build_custom_4d_mask(
+            input_ids=input_ids,
+            attention_mask_2d=attention_mask,
+            tokenizer=tokenizer_for_mask,
+            dtype=inputs_embeds.dtype,
+            reserved_token_spans=reserved_token_spans,
+        )
+        if custom_mask is not None:
+            attention_mask = custom_mask
+        outputs, all_steps_response = self.language_model.generate_with_embeds_replace_noise(
+            inputs_embeds=inputs_embeds,
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            **generate_kwargs,
+        )
+        return outputs, all_steps_response
+    def get_template(self):
+        if 'llada' in self.config.language_model_config.name_or_path.lower():
+            template = dict(
+                SYSTEM=("<|start_header_id|>system<|end_header_id|>\n{system}<|eot_id|>\n"),
+                INSTRUCTION=("<|start_header_id|>user<|end_header_id|>\n{input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"),
+                SUFFIX="<|eot_id|>",
+                SUFFIX_AS_EOS=True,
+                SEP="\n",
+                STOP_WORDS=["<|eot_id|>"],
+            )
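+        else:
+            # No template is defined for other backbones; avoid returning an unbound `template`.
+            raise NotImplementedError(
+                f"No chat template for language model {self.config.language_model_config.name_or_path!r}"
+            )
+        return template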
+    @torch.no_grad()
+    def chat(
+            self,
+            tokenizer,
+            pixel_values,
+            question,
+            generation_config,
+            global_mask_values=None,
+            aspect_ratios=None,
+            bboxes=None,
+            history=None,
+            return_history=False,
+            num_patches_list=None,
+            IMG_START_TOKEN='<img>',
+            IMG_END_TOKEN='</img>',
+            IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
+            verbose=False
+    ):
+        if history is None and pixel_values is not None and '<image>' not in question:
+            question = '<image>\n' + question
+        if num_patches_list is None:
+            num_patches_list = [pixel_values.shape[0]] if pixel_values is not None else []
+        assert pixel_values is None or len(pixel_values) == sum(num_patches_list)
+        img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
+        self.img_context_token_id = img_context_token_id
+        template = self.get_template()
+        eos_token_id = tokenizer.convert_tokens_to_ids(template["SUFFIX"].strip())
+        history = "" if history is None else history
+        prompt = history
+        prompt = prompt + template["INSTRUCTION"].format(input=question)
+        if verbose and pixel_values is not None:
+            image_bs = pixel_values.shape[0]
+            print(f'dynamic ViT batch size: {image_bs}')
+        prompt = prompt[::-1]
+        for num_patches in num_patches_list[::-1]:
+            image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
+            prompt = prompt.replace('<image>'[::-1], image_tokens[::-1], 1)
+        prompt = prompt[::-1]
+        model_inputs = tokenizer(prompt, return_tensors='pt')
+        device = torch.device(self.language_model.device if torch.cuda.is_available() else 'cpu')
+        input_ids = model_inputs['input_ids'].to(device)
+        attention_mask = model_inputs['attention_mask'].to(device)
+        generation_config['eos_token_id'] = eos_token_id
+        generation_output = self.generate(
+            pixel_values=pixel_values,
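+            # `generate` names this parameter `global_mask_values_list`; the old keyword fell into **generate_kwargs.
+            global_mask_values_list=global_mask_values,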
+            aspect_ratios=aspect_ratios,
+            bboxes=bboxes,
+            input_ids=input_ids,
+            **generation_config
+        )
+        response = [
+            tokenizer.decode(g[len(p) :].tolist())
+            for p, g in zip(input_ids, generation_output)
+        ][0]
+        # response = tokenizer.batch_decode(generation_output, skip_special_tokens=False)[0]
+        history = history + prompt + response
+        response = response.split(template["SUFFIX"].strip())[0].strip()
+        if return_history:
+            return response, history
+        if verbose:
+            print(response)
+        return response
+    @torch.no_grad()
+    def chat_replace_noise(
+            self,
+            tokenizer,
+            pixel_values,
+            question,
+            generation_config,
+            global_mask_values=None,
+            aspect_ratios=None,
+            bboxes=None,
+            history=None,
+            return_history=False,
+            num_patches_list=None,
+            IMG_START_TOKEN='<img>',
+            IMG_END_TOKEN='</img>',
+            IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
+            verbose=False
+    ):
+        if history is None and pixel_values is not None and '<image>' not in question:
+            question = '<image>\n' + question
+        if num_patches_list is None:
+            num_patches_list = [pixel_values.shape[0]] if pixel_values is not None else []
+        assert pixel_values is None or len(pixel_values) == sum(num_patches_list)
+        img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
+        self.img_context_token_id = img_context_token_id
+        template = self.get_template()
+        eos_token_id = tokenizer.convert_tokens_to_ids(template["SUFFIX"].strip())
+        history = "" if history is None else history
+        prompt = history
+        prompt = prompt + template["INSTRUCTION"].format(input=question)
+        if verbose and pixel_values is not None:
+            image_bs = pixel_values.shape[0]
+            print(f'dynamic ViT batch size: {image_bs}')
+        prompt = prompt[::-1]
+        for num_patches in num_patches_list[::-1]:
+            image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * self.num_image_token * num_patches + IMG_END_TOKEN
+            prompt = prompt.replace('<image>'[::-1], image_tokens[::-1], 1)
+        prompt = prompt[::-1]
+        model_inputs = tokenizer(prompt, return_tensors='pt')
+        device = torch.device(self.language_model.device if torch.cuda.is_available() else 'cpu')
+        input_ids = model_inputs['input_ids'].to(device)
+        attention_mask = model_inputs['attention_mask'].to(device)
+        generation_config['eos_token_id'] = eos_token_id
+        generation_output, all_steps_response = self.generate_replace_noise(
+            pixel_values=pixel_values,
+            # `generate_replace_noise` names this parameter `global_mask_values_list`.
+            global_mask_values_list=global_mask_values,
+            aspect_ratios=aspect_ratios,
+            bboxes=bboxes,
+            input_ids=input_ids,
+            **generation_config
+        )
+        response = tokenizer.batch_decode(generation_output, skip_special_tokens=False)[0]
+        all_steps_response_ = []
+        for step_response in all_steps_response:
+            step_response = tokenizer.batch_decode(step_response, skip_special_tokens=False)[0]
+            all_steps_response_.append(step_response)
+        all_steps_response = all_steps_response_
+        for i, step_response in enumerate(all_steps_response):
+            print(f"Step {i}: {step_response}\n")
+        history = history + prompt + response
+        response = response.split(template["SUFFIX"].strip())[0].strip()
+        if return_history:
+            return response, history
+        if verbose:
+            print(response)
+        return response
+AutoConfig.register("pdmllm", PDMLLMConfig)
+AutoModel.register(PDMLLMConfig, PDMLLM)
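
Review note: the crop-token replacement loop in `forward` above swaps each placeholder span for a block of RoI features and keeps a `running_shift` so later spans still land at the right offset. The sketch below reproduces only that index bookkeeping on plain Python lists. The helper name `splice_spans` and the rule `adj_head = head + running_shift` are assumptions made for illustration; the lines that compute `adj_head` and `adj_tail` are not part of this hunk.

```python
def splice_spans(tokens, spans, fill_token):
    """Replace inclusive (head, tail) spans with `new_len` copies of `fill_token`.

    `spans` holds (head, tail, new_len) tuples in ORIGINAL coordinates,
    sorted by head and non-overlapping (assumed, as in the loop above).
    """
    out = list(tokens)
    running_shift = 0  # how far earlier replacements have moved later indices
    reserved = []      # (start, end) of each inserted block, end exclusive
    for head, tail, new_len in spans:
        adj_head = head + running_shift
        adj_tail = tail + running_shift
        out = out[:adj_head] + [fill_token] * new_len + out[adj_tail + 1:]
        reserved.append((adj_head, adj_head + new_len))
        # Same delta as in the diff: inserted length minus replaced length.
        running_shift += new_len - (tail - head + 1)
    return out, reserved


tokens = ["a", "X", "b", "c", "X", "X", "d"]
new_tokens, reserved = splice_spans(tokens, [(1, 1, 3), (4, 5, 1)], "R")
print(new_tokens)
print(reserved)
```

The first span grows from one token to three, so the second span is applied two positions later than its original index. The recorded ranges play the role of `reserved_token_spans`.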

preprocessor_config.json ADDED

	@@ -0,0 +1,27 @@

+{
+  "auto_map": {
+    "AutoProcessor": "processing_pdmllm.PDMLLMProcessor"
+  },
+  "do_convert_rgb": null,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "image_processor_type": "SiglipImageProcessor",
+  "image_std": [
+    0.5,
+    0.5,
+    0.5
+  ],
+  "processor_class": "PDMLLMProcessor",
+  "resample": 2,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "height": 512,
+    "width": 512
+  }
+}
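
Review note: a small sketch of the arithmetic this config implies, assuming the usual rescale-then-normalize order of a Hugging Face image processor. `rescale_factor` is 1/255, so with mean and std both 0.5 an 8-bit channel value is mapped into roughly [-1, 1]. The tile token count uses the `PDMLLMProcessor.__init__` defaults from `processing_pdmllm.py` in this same commit (`patch_size=16`, `downsample_ratio=0.5`). The function names are hypothetical and exist only for this example.

```python
# Values copied from preprocessor_config.json.
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5
IMAGE_SIZE = 512

# Defaults of PDMLLMProcessor.__init__ in this commit.
PATCH_SIZE = 16
DOWNSAMPLE_RATIO = 0.5


def normalize_pixel(value: int) -> float:
    """Rescale an 8-bit channel value to [0, 1], then normalize it."""
    return (value * RESCALE_FACTOR - IMAGE_MEAN) / IMAGE_STD


def tokens_per_tile(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    """Same expression as `num_image_token` in the processor."""
    return int((image_size // patch_size) ** 2 * (downsample_ratio ** 2))


low, high = normalize_pixel(0), normalize_pixel(255)
num_tokens = tokens_per_tile(IMAGE_SIZE, PATCH_SIZE, DOWNSAMPLE_RATIO)
print(low, high, num_tokens)
```

With these values each 512x512 tile contributes 256 context tokens, which matches the processor's default `image_token_len` of 256.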

processing_pdmllm.py ADDED

	@@ -0,0 +1,382 @@

+import math
+import torch
+import warnings
+import PIL.Image
+from torch.nn import functional as F
+from collections import UserDict, OrderedDict
+from typing import Union, Optional, Tuple, List, Dict, Any
+from transformers.image_utils import load_image
+from transformers.feature_extraction_utils import BatchFeature
+from .chat_template_utils import render_jinja_template
+from transformers.processing_utils import ProcessorMixin, AllKwargsForChatTemplate
+class PDMLLMProcessor(ProcessorMixin):
+    attributes = ["tokenizer", "image_processor"]
+    optional_attributes = ['chat_template']
+    model_input_names = ['input_ids', 'attention_mask', 'pixel_values']
+    image_processor_class = "AutoImageProcessor"
+    tokenizer_class = "AutoTokenizer"
+    def __init__(
+            self, tokenizer, image_processor, chat_template=None,
+            image_size=512,
+            patch_size=16,
+            downsample_ratio=0.5,
+            max_sub_img=6,
+            min_sub_img=1,
+            image_token='<IMG_CONTEXT>',
+            image_start_token='<img>',
+            image_end_token='</img>',
+            special_tokens=['<IMG_CONTEXT>', '<img>', '</img>'],
+            **kwargs):
+        if chat_template is None:
+            chat_template = "{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|start_header_id|>system<|end_header_id|>\nYou are a helpful assistant.<|eot_id|>\n{% endif %}<|start_header_id|>{{ message['role'] }}<|end_header_id|>\n{% if message['role'] == 'assistant' %}{% generation %}{{ message['content'][0]['text'] }}<|eot_id|>{% endgeneration %}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}<img><IMG_CONTEXT></img>{% elif content['type'] == 'video' or 'video' in content %}<video><VIDEO_CONTEXT></video>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|eot_id|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|start_header_id|>assistant<|end_header_id|>\n{% endif %}"
+        super().__init__(tokenizer=tokenizer, image_processor=image_processor, chat_template=chat_template)
+        if isinstance(image_size, (list, tuple)):
+            image_size = image_size[0]
+        self.num_image_token = int((image_size // patch_size) ** 2 * (downsample_ratio ** 2))
+        self.vision_token_share_pe = kwargs.get('vision_token_share_pe', True)
+        self.image_token_len = kwargs.pop('image_token_len', 256)
+        self.max_sub_img = max_sub_img
+        self.min_sub_img = min_sub_img
+        self.image_token = image_token
+        self.image_start_token = image_start_token
+        self.image_end_token = image_end_token
+        special_tokens = special_tokens + [f'<|Mask_Cap_{i}|>' for i in range(16)]
+        self.tokenizer.add_special_tokens({'additional_special_tokens': special_tokens}, replace_additional_special_tokens=False)
+        self.image_token_id = self.tokenizer.convert_tokens_to_ids(self.image_token)
+        self.image_start_token_id = self.tokenizer.convert_tokens_to_ids(self.image_start_token)
+        self.image_end_token_id = self.tokenizer.convert_tokens_to_ids(self.image_end_token)
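+        if 'llada' in tokenizer.name_or_path.lower():
+            self._pad_token_id = self.tokenizer.convert_tokens_to_ids("<|eot_id|>")
+        else:
+            # Fall back to the tokenizer's own pad token so `pad_token_id` is always defined.
+            self._pad_token_id = self.tokenizer.pad_token_id
+        # A list/tuple `image_size` was reduced to its first element above; store it as (height, width).
+        if isinstance(image_size, int):
+            image_size = (image_size, image_size)
+        self.image_size = image_size
+        assert image_size[0] == image_size[1]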
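+    def apply_chat_template(self, conversation, chat_template=None, **kwargs) -> Union[tuple, BatchFeature]:
+        # Returns (prompt, generation_indices) when `tokenize` is False, else the processed model inputs.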
+        if chat_template is None:
+            chat_template = self.chat_template
+        # Split template kwargs from processor/tokenization kwargs so that
+        # `tokenize=True` can reuse the processor pipeline without polluting
+        # the template rendering inputs.
+        tokenize = kwargs.pop("tokenize", False)
+        return_dict = kwargs.pop("return_dict", False)
+        return_tensors = kwargs.pop("return_tensors", None)
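+        # Copy the list (and tolerate `images=None`) so the caller's list is never mutated below.
+        images = list(kwargs.pop("images", None) or [])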
+        videos = kwargs.pop("videos", None)
+        if not images:
+            for message in conversation:
+                content = message.get("content", [])
+                if isinstance(content, list):
+                    for item in content:
+                        if isinstance(item, dict) and (item.get("type") == "image" or "image" in item):
+                            image = item.get("image") or item.get("image_url")
+                            if image is not None:
+                                images.append(image)
+        processor_kwargs = {}
+        for key in ("padding", "truncation", "max_length"):
+            if key in kwargs:
+                processor_kwargs[key] = kwargs.pop(key)
+        if return_tensors is not None:
+            processor_kwargs["return_tensors"] = return_tensors
+        processed_kwargs = {
+            "mm_load_kwargs": {},
+            "template_kwargs": {},
+        }
+        # for kwarg_type in processed_kwargs:
+        #     for key in AllKwargsForChatTemplate.__annotations__[kwarg_type].__annotations__.keys():
+        #         kwarg_type_defaults = AllKwargsForChatTemplate.__annotations__[kwarg_type]
+        #         default_value = getattr(kwarg_type_defaults, key, None)
+        #         value = kwargs.pop(key, default_value)
+        #         if value is not None and not isinstance(value, dict):
+        #             processed_kwargs[kwarg_type][key] = value
+        # Pass unprocessed custom kwargs
+        processed_kwargs["template_kwargs"].update(kwargs)
+        conversations = [conversation]
+        prompt, generation_indices = render_jinja_template(
+            conversations=conversations,
+            chat_template=chat_template,
+            return_assistant_tokens_mask=True,
+            **processed_kwargs["template_kwargs"],  # different flags such as `return_assistant_mask`
+            **self.tokenizer.special_tokens_map,  # tokenizer special tokens are used by some templates
+        )
+        if not tokenize:
+            return prompt, generation_indices
+        # Reuse the processor pipeline to produce tokenized inputs.
+        model_inputs = self(
+            text=prompt,
+            images=images,
+            videos=videos,
+            generation_indices=generation_indices,
+            **processor_kwargs,
+        )
+        # if return_dict:
+        #     return model_inputs
+        return model_inputs
+    def __call__(self, text=None, images=None, videos=None, generation_indices=None, **kwargs) -> BatchFeature:
+        # Avoid a shared mutable default and accept a single prompt string.
+        images = [] if images is None else images
+        if isinstance(text, str):
+            text = [text]
+        inputs = self.tokenizer(text, padding=False, truncation=False, return_attention_mask=False)
+        assistant_masks = []
+        input_ids = inputs["input_ids"]
+        for i in range(len(input_ids)):
+            current_mask = [0] * len(input_ids[i])
+            # Without generation indices there are no assistant spans to mark.
+            if generation_indices is not None and 'llada' in self.tokenizer.name_or_path.lower():
+                for assistant_start_char, assistant_end_char in generation_indices[i]:
+                    start_token = inputs.char_to_token(i, assistant_start_char)
+                    end_token = inputs.char_to_token(i, assistant_end_char - 1)
+                    if start_token is None:
+                        # start_token is out of bounds, possibly due to truncation.
+                        break
+                    # `end_token` may legitimately be 0, so compare against None rather than truthiness.
+                    for token_id in range(start_token, end_token + 1 if end_token is not None else len(input_ids[i])):
+                        current_mask[token_id] = 1
+            assistant_masks.append(current_mask)
+        inputs["assistant_masks"] = assistant_masks[0]
+        inputs['input_ids'] = input_ids[0]
+        truncation = kwargs.pop('truncation', False)
+        max_length = kwargs.pop('max_length', 1024)
+        padding = kwargs.pop('padding', False)
+        inputs = self.process_images(images, inputs=inputs)
+        if isinstance(inputs, UserDict):
+            inputs = inputs.data
+        if 'attention_mask' not in inputs:
+            inputs['attention_mask'] = [1] * len(inputs['input_ids'])
+        if 'assistant_masks' in inputs:
+            inputs['prompt_mask'] = [1-x for x in inputs.pop('assistant_masks')]
+        inputs = self.process_inputs(inputs)
+        if truncation and len(inputs['input_ids']) > max_length:
+            inputs = self.truncate(inputs, max_length)
+        if padding and len(inputs['input_ids']) < max_length:
+            inputs = self.padding(inputs, max_length)
+        inputs = self.to_tensor(inputs)
+        self.check(inputs)
+        if self.vision_token_share_pe:
+            position_ids = self.get_position_ids(inputs)
+            position_ids = torch.tensor([position_ids], dtype=torch.long)
+            inputs['position_ids'] = position_ids
+        inputs.pop('sub_image_nums', None)
+        return BatchFeature(inputs)
+    def get_position_ids(self, inputs: Dict[str, Any]):
+        input_ids = inputs['input_ids'][0]
+        image_token_lens = self.get_image_token_length(inputs)
+        position_ids = []
+        i, j = 0, 0
+        while len(position_ids) < len(input_ids):
+            if input_ids[len(position_ids)] == self.image_token_id:
+                image_token_len = image_token_lens[j]
+                assert image_token_len % self.image_token_len == 0
+                num_views = image_token_len // self.image_token_len
+                for _ in range(num_views):
+                    position_ids += [i] * self.image_token_len  # all tokens of one image view share the same position id
+                    i += 1
+                j += 1
+            else:
+                position_ids.append(i)
+                i += 1
+        assert j == len(image_token_lens) and len(position_ids) == len(input_ids), \
+            f"Wrong position_ids, {j} != {len(image_token_lens)} or {len(position_ids)} != {len(input_ids)}"
+        return position_ids
+    def process_images(self, images, inputs):
+        images = [load_image(img) for img in images]
+        if len(images) > 0:
+            processed_images = []
+            sub_image_nums = []
+            for image in images:
+                if len(images) > 1:
+                    # for multi images, remove the split strategy
+                    sub_images = dynamic_preprocess(
+                        image, min_num=1,
+                        max_num=1,
+                        image_size=self.image_size[0], use_thumbnail=True)
+                else:
+                    sub_images = dynamic_preprocess(
+                        image, min_num=self.min_sub_img,
+                        max_num=self.max_sub_img,
+                        image_size=self.image_size[0], use_thumbnail=True)
+                sub_image_nums.append(len(sub_images))
+                processed_images += sub_images
+            # print([_img.size for _img in processed_images])
+            pixel_values = self.image_processor.preprocess(
+                images=processed_images, return_tensors="pt"
+            )["pixel_values"] # (N, c, h, w)
+        else:
+            pixel_values = torch.zeros((
+                1, 3, self.image_size[0], self.image_size[1]), dtype=torch.float32
+            )
+            sub_image_nums = []
+        inputs['pixel_values'] = pixel_values
+        inputs['sub_image_nums'] = sub_image_nums
+        return inputs
+    def truncate(self, inputs: Dict[str, Any], max_length: int):
+        assert self.image_token_id not in inputs['input_ids'][max_length:], "Truncating image tokens is not allowed."
+        inputs['input_ids'] = inputs['input_ids'][:max_length]
+        inputs['attention_mask'] = inputs['attention_mask'][:max_length]
+        if 'prompt_mask' in inputs:
+            inputs['prompt_mask'] = inputs['prompt_mask'][:max_length]
+        return inputs
+    def get_image_token_length(self, inputs: Dict[str, Any]) -> List[int]:
+        sub_image_nums = inputs.get('sub_image_nums', None)
+        if sub_image_nums is None or len(sub_image_nums) == 0:
+            return []
+        image_token_lens = [_num * self.num_image_token for _num in sub_image_nums]
+        return image_token_lens
+    def process_inputs(self, inputs: Dict[str, Any]):
+        graft_token_lens = self._get_graft_token_length(inputs)
+        inputs['input_ids'] = self._graft_token(inputs['input_ids'], graft_token_lens, self.image_token_id)
+        inputs['attention_mask'] = self._graft_token(inputs['attention_mask'], graft_token_lens, 'replicate')
+        if 'prompt_mask' in inputs:
+            inputs['prompt_mask'] = self._graft_token(inputs['prompt_mask'], graft_token_lens, 'replicate')
+        return inputs
+    def _graft_token(self, seq, graft_token_lens, value):
+        if value == 'replicate':
+            for i in reversed(graft_token_lens.keys()):
+                seq[i:] = [seq[i]] * graft_token_lens[i] + seq[i+1:]
+        else:
+            for i in reversed(graft_token_lens.keys()):
+                seq[i:] = [value] * graft_token_lens[i] + seq[i+1:]
+        return seq
+    def _get_graft_token_length(self, inputs: Dict[str, Any]) -> Dict[int, int]:
+        image_token_pos = [i for i, x in enumerate(inputs['input_ids']) if x == self.image_token_id]
+        image_token_lens = self.get_image_token_length(inputs)
+        assert len(image_token_pos) == len(image_token_lens), \
+            "Wrong image token count, " \
+            f"image_token_count({len(image_token_pos)}) != image_count({len(image_token_lens)})"
+        graft_token_lens = OrderedDict(zip(image_token_pos, image_token_lens))
+        return graft_token_lens
+    def check(self, inputs: Dict[str, Any]):
+        image_embed_token_count = torch.count_nonzero(inputs['input_ids'] == self.image_token_id).item()
+        image_embed_count = sum(self.get_image_token_length(inputs))
+        assert image_embed_token_count == image_embed_count, \
+            "Wrong image embed token count, " \
+            f"image_embed_token_count({image_embed_token_count}) != image_embed_count({image_embed_count})"
+    def padding(self, inputs: Dict[str, Any], max_length: int):
+        padding_len = max_length - len(inputs['input_ids'])
+        inputs['input_ids'] += [self.pad_token_id] * padding_len
+        inputs['attention_mask'] += [0] * padding_len
+        if 'prompt_mask' in inputs:
+            inputs['prompt_mask'] += [0] * padding_len
+        return inputs
+    def decode(self, token_ids: Union[List[int], torch.Tensor], **kwargs):
+        if isinstance(token_ids, torch.Tensor):
+            token_ids = token_ids.tolist()
+        text = self.tokenizer.decode(token_ids, **kwargs)
+        return text
+    def batch_decode(self, sequences: Union[List[List[int]], torch.Tensor], **kwargs):
+        if isinstance(sequences, torch.Tensor):
+            sequences = sequences.tolist()
+        texts = self.tokenizer.batch_decode(sequences, **kwargs)
+        return texts
+    def to_tensor(self, inputs):
+        inputs['input_ids'] = torch.tensor([inputs['input_ids']], dtype=torch.long)
+        inputs['attention_mask'] = torch.tensor([inputs['attention_mask']], dtype=torch.bool)
+        if 'prompt_mask' in inputs:
+            inputs['prompt_mask'] = torch.tensor([inputs['prompt_mask']], dtype=torch.bool)
+        return inputs
+    @property
+    def pad_token_id(self):
+        return self._pad_token_id
+    def __repr__(self):
+        return 'PDMLLMProcessor'
+    def __str__(self):
+        return 'PDMLLMProcessor'
+def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+    best_ratio_diff = float('inf')
+    best_ratio = (1, 1)
+    area = width * height
+    for ratio in target_ratios:
+        target_aspect_ratio = ratio[0] / ratio[1]
+        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+        if ratio_diff < best_ratio_diff:
+            best_ratio_diff = ratio_diff
+            best_ratio = ratio
+        elif ratio_diff == best_ratio_diff:
+            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                best_ratio = ratio
+    # print(f'width: {width}, height: {height}, best_ratio: {best_ratio}')
+    return best_ratio
+def dynamic_preprocess(image, min_num=1, max_num=6, image_size=512, use_thumbnail=True):
+    # calculate the existing image aspect ratio
+    orig_width, orig_height = image.size
+    aspect_ratio = orig_width / orig_height
+    # enumerate every (columns, rows) grid whose tile count lies in [min_num, max_num]
+    target_ratios = set(
+        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+        i * j <= max_num and i * j >= min_num)
+    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+    # find the closest aspect ratio to the target
+    target_aspect_ratio = find_closest_aspect_ratio(
+        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+    # calculate the target width and height
+    target_width = image_size * target_aspect_ratio[0]
+    target_height = image_size * target_aspect_ratio[1]
+    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+    # resize the image
+    resized_img = image.resize((target_width, target_height))
+    processed_images = []
+    for i in range(blocks):
+        box = (
+            (i % (target_width // image_size)) * image_size,
+            (i // (target_width // image_size)) * image_size,
+            ((i % (target_width // image_size)) + 1) * image_size,
+            ((i // (target_width // image_size)) + 1) * image_size
+        )
+        # split the image
+        split_img = resized_img.crop(box)
+        processed_images.append(split_img)
+    assert len(processed_images) == blocks
+    if use_thumbnail and len(processed_images) != 1:
+        thumbnail_img = image.resize((image_size, image_size))
+        processed_images.append(thumbnail_img)
+    return processed_images
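
The grid selection in `find_closest_aspect_ratio` and the crop boxes in `dynamic_preprocess` can be followed without loading an image. The sketch below mirrors that logic on plain integers. The helper names `candidate_grids`, `pick_grid`, and `tile_boxes` are hypothetical and are not part of the committed file.

```python
from typing import List, Tuple

Grid = Tuple[int, int]  # (columns, rows)


def candidate_grids(min_num: int = 1, max_num: int = 6) -> List[Grid]:
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (cols, rows)
        for cols in range(1, max_num + 1)
        for rows in range(1, max_num + 1)
        if min_num <= cols * rows <= max_num
    }
    # Same ordering key as the diff: fewer tiles first.
    return sorted(grids, key=lambda g: g[0] * g[1])


def pick_grid(width: int, height: int, image_size: int = 512,
              min_num: int = 1, max_num: int = 6) -> Grid:
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect_ratio = width / height
    best_diff, best = float("inf"), (1, 1)
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best_diff, best = diff, (cols, rows)
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # On an exact tie, a sufficiently large image takes the later grid.
            best = (cols, rows)
    return best


def tile_boxes(grid: Grid, image_size: int = 512) -> List[Tuple[int, int, int, int]]:
    """Crop boxes (left, upper, right, lower) in row-major order."""
    cols, rows = grid
    return [
        ((i % cols) * image_size, (i // cols) * image_size,
         (i % cols + 1) * image_size, (i // cols + 1) * image_size)
        for i in range(cols * rows)
    ]


grid = pick_grid(1024, 512)   # a 2:1 landscape image
boxes = tile_boxes(grid)      # two 512x512 tiles, left then right
```

With `use_thumbnail=True`, `dynamic_preprocess` appends one extra resized copy of the whole image whenever the grid has more than one tile, so a two-tile grid yields three images in total.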

processor_config.json ADDED

	@@ -0,0 +1,15 @@

+{
+  "auto_map": {
+    "AutoProcessor": "processing_pdmllm.PDMLLMProcessor"
+  },
+  "image_end_token": "</img>",
+  "image_size": [
+    512,
+    512
+  ],
+  "image_start_token": "<img>",
+  "image_token": "<IMG_CONTEXT>",
+  "max_sub_img": 6,
+  "min_sub_img": 1,
+  "processor_class": "PDMLLMProcessor"
+}
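
The `image_token` named in this config is the placeholder that `process_inputs` expands: each occurrence is replaced by `sub_image_nums[k] * num_image_token` copies of itself. The following is a minimal sketch of that expansion on a plain list. The function name `expand_image_placeholders`, the token id `99`, and the value `num_image_token=3` are illustrative assumptions, since the real id and per-tile token count do not appear in this diff.

```python
from typing import List


def expand_image_placeholders(input_ids: List[int], image_token_id: int,
                              sub_image_nums: List[int],
                              num_image_token: int) -> List[int]:
    """Replace the k-th placeholder with sub_image_nums[k] * num_image_token copies."""
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    if len(positions) != len(sub_image_nums):
        raise ValueError(
            f"placeholder count ({len(positions)}) != image count ({len(sub_image_nums)})"
        )
    out = list(input_ids)  # work on a copy; the processor in the diff edits in place
    # Go right to left so earlier placeholder positions remain valid.
    for pos, n_sub in reversed(list(zip(positions, sub_image_nums))):
        out[pos:pos + 1] = [image_token_id] * (n_sub * num_image_token)
    return out


# Hypothetical values: 99 stands in for the <IMG_CONTEXT> id.
expanded = expand_image_placeholders(
    input_ids=[1, 99, 2, 99, 3],
    image_token_id=99,
    sub_image_nums=[2, 1],   # first image split into 2 tiles, second into 1
    num_image_token=3,
)
```

The processor applies the same expansion to `attention_mask` and `prompt_mask` in `'replicate'` mode, which keeps all three sequences the same length.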

special_tokens_map.json ADDED

	@@ -0,0 +1,172 @@

+{
+  "additional_special_tokens": [
+    "<|mdm_mask|>",
+    "<role>",
+    "</role>",
+    "<|arithmetic_start|>",
+    "<|arithmetic_end|>",
+    "<|number_start|>",
+    "<|number_end|>",
+    {
+      "content": "<IMG_CONTEXT>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<img>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "</img>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_0|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_1|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_2|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_3|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_4|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_5|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_6|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_7|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_8|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_9|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_10|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_11|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_12|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_13|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_14|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    },
+    {
+      "content": "<|Mask_Cap_15|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false
+    }
+  ],
+  "bos_token": {
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
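
The map above mixes two entry shapes: plain strings such as `"<|mdm_mask|>"` and objects with a `content` field. It also assigns the same string to `eos_token` and `pad_token`. A reader who wants a flat list of the token strings could normalize both shapes as sketched below. The function name `special_token_strings` is hypothetical, and the sample is a trimmed excerpt of the file, not its full contents.

```python
import json
from typing import Any, Dict, List


def special_token_strings(special_tokens_map: Dict[str, Any]) -> List[str]:
    """Flatten a special-tokens map into unique token strings, keeping first-seen order."""
    tokens: List[str] = []
    for value in special_tokens_map.values():
        entries = value if isinstance(value, list) else [value]
        for entry in entries:
            # Entries are either bare strings or dicts carrying a "content" field.
            tokens.append(entry["content"] if isinstance(entry, dict) else entry)
    return list(dict.fromkeys(tokens))  # de-duplicate, order preserved


# Trimmed excerpt of special_tokens_map.json for illustration.
sample = json.loads("""
{
  "additional_special_tokens": [
    "<|mdm_mask|>",
    {"content": "<IMG_CONTEXT>", "lstrip": false, "normalized": false,
     "rstrip": false, "single_word": false}
  ],
  "eos_token": {"content": "<|endoftext|>"},
  "pad_token": {"content": "<|endoftext|>"}
}
""")

flat = special_token_strings(sample)
```

Because padding and end-of-text share one token here, code that masks padding by token id alone would also mask a genuine end-of-text token. The `padding` method in the processor sets `attention_mask` to 0 for padded positions explicitly, so it does not depend on the id.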

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,2359 @@

+{
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "added_tokens_decoder": {
+    "126080": {
+      "content": "<|startoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126081": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126082": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126083": {
+      "content": "[gMASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126084": {
+      "content": "<|reserved_token_0|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126085": {
+      "content": "<|reserved_token_1|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126086": {
+      "content": "<|reserved_token_2|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126087": {
+      "content": "<|reserved_token_3|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126088": {
+      "content": "<|reserved_token_4|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126089": {
+      "content": "<|reserved_token_5|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126090": {
+      "content": "<|reserved_token_6|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126091": {
+      "content": "<|reserved_token_7|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126092": {
+      "content": "<|reserved_token_8|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126093": {
+      "content": "<|reserved_token_9|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126094": {
+      "content": "<|reserved_token_10|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126095": {
+      "content": "<|reserved_token_11|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126096": {
+      "content": "<|reserved_token_12|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126097": {
+      "content": "<|reserved_token_13|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126098": {
+      "content": "<|reserved_token_14|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126099": {
+      "content": "<|reserved_token_15|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126100": {
+      "content": "<|reserved_token_16|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126101": {
+      "content": "<|reserved_token_17|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126102": {
+      "content": "<|reserved_token_18|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126103": {
+      "content": "<|reserved_token_19|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126104": {
+      "content": "<|reserved_token_20|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126105": {
+      "content": "<|reserved_token_21|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126106": {
+      "content": "<|reserved_token_22|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126107": {
+      "content": "<|reserved_token_23|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126108": {
+      "content": "<|reserved_token_24|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126109": {
+      "content": "<|reserved_token_25|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126110": {
+      "content": "<|reserved_token_26|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126111": {
+      "content": "<|reserved_token_27|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126112": {
+      "content": "<|reserved_token_28|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126113": {
+      "content": "<|reserved_token_29|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126114": {
+      "content": "<|reserved_token_30|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126115": {
+      "content": "<|reserved_token_31|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126116": {
+      "content": "<|reserved_token_32|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126117": {
+      "content": "<|reserved_token_33|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126118": {
+      "content": "<|reserved_token_34|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126119": {
+      "content": "<|reserved_token_35|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126120": {
+      "content": "<|reserved_token_36|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126121": {
+      "content": "<|reserved_token_37|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126122": {
+      "content": "<|reserved_token_38|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126123": {
+      "content": "<|reserved_token_39|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126124": {
+      "content": "<|reserved_token_40|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126125": {
+      "content": "<|reserved_token_41|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126126": {
+      "content": "<|reserved_token_42|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126127": {
+      "content": "<|reserved_token_43|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126128": {
+      "content": "<|reserved_token_44|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126129": {
+      "content": "<|reserved_token_45|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126130": {
+      "content": "<|reserved_token_46|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126131": {
+      "content": "<|reserved_token_47|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126132": {
+      "content": "<|reserved_token_48|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126133": {
+      "content": "<|reserved_token_49|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126134": {
+      "content": "<|reserved_token_50|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126135": {
+      "content": "<|reserved_token_51|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126136": {
+      "content": "<|reserved_token_52|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126137": {
+      "content": "<|reserved_token_53|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126138": {
+      "content": "<|reserved_token_54|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126139": {
+      "content": "<|reserved_token_55|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126140": {
+      "content": "<|reserved_token_56|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126141": {
+      "content": "<|reserved_token_57|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126142": {
+      "content": "<|reserved_token_58|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126143": {
+      "content": "<|reserved_token_59|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126144": {
+      "content": "<|reserved_token_60|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126145": {
+      "content": "<|reserved_token_61|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126146": {
+      "content": "<|reserved_token_62|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126147": {
+      "content": "<|reserved_token_63|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126148": {
+      "content": "<|reserved_token_64|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126149": {
+      "content": "<|reserved_token_65|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126150": {
+      "content": "<|reserved_token_66|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126151": {
+      "content": "<|reserved_token_67|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126152": {
+      "content": "<|reserved_token_68|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126153": {
+      "content": "<|reserved_token_69|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126154": {
+      "content": "<|reserved_token_70|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126155": {
+      "content": "<|reserved_token_71|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126156": {
+      "content": "<|reserved_token_72|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126157": {
+      "content": "<|reserved_token_73|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126158": {
+      "content": "<|reserved_token_74|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126159": {
+      "content": "<|reserved_token_75|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126160": {
+      "content": "<|reserved_token_76|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126161": {
+      "content": "<|reserved_token_77|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126162": {
+      "content": "<|reserved_token_78|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126163": {
+      "content": "<|reserved_token_79|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126164": {
+      "content": "<|reserved_token_80|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126165": {
+      "content": "<|reserved_token_81|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126166": {
+      "content": "<|reserved_token_82|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126167": {
+      "content": "<|reserved_token_83|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126168": {
+      "content": "<|reserved_token_84|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126169": {
+      "content": "<|reserved_token_85|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126170": {
+      "content": "<|reserved_token_86|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126171": {
+      "content": "<|reserved_token_87|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126172": {
+      "content": "<|reserved_token_88|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126173": {
+      "content": "<|reserved_token_89|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126174": {
+      "content": "<|reserved_token_90|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126175": {
+      "content": "<|reserved_token_91|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126176": {
+      "content": "<|reserved_token_92|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126177": {
+      "content": "<|reserved_token_93|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126178": {
+      "content": "<|reserved_token_94|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126179": {
+      "content": "<|reserved_token_95|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126180": {
+      "content": "<|reserved_token_96|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126181": {
+      "content": "<|reserved_token_97|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126182": {
+      "content": "<|reserved_token_98|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126183": {
+      "content": "<|reserved_token_99|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126184": {
+      "content": "<|reserved_token_100|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126185": {
+      "content": "<|reserved_token_101|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126186": {
+      "content": "<|reserved_token_102|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126187": {
+      "content": "<|reserved_token_103|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126188": {
+      "content": "<|reserved_token_104|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126189": {
+      "content": "<|reserved_token_105|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126190": {
+      "content": "<|reserved_token_106|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126191": {
+      "content": "<|reserved_token_107|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126192": {
+      "content": "<|reserved_token_108|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126193": {
+      "content": "<|reserved_token_109|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126194": {
+      "content": "<|reserved_token_110|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126195": {
+      "content": "<|reserved_token_111|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126196": {
+      "content": "<|reserved_token_112|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126197": {
+      "content": "<|reserved_token_113|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126198": {
+      "content": "<|reserved_token_114|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126199": {
+      "content": "<|reserved_token_115|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126200": {
+      "content": "<|reserved_token_116|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126201": {
+      "content": "<|reserved_token_117|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126202": {
+      "content": "<|reserved_token_118|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126203": {
+      "content": "<|reserved_token_119|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126204": {
+      "content": "<|reserved_token_120|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126205": {
+      "content": "<|reserved_token_121|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126206": {
+      "content": "<|reserved_token_122|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126207": {
+      "content": "<|reserved_token_123|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126208": {
+      "content": "<|reserved_token_124|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126209": {
+      "content": "<|reserved_token_125|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126210": {
+      "content": "<|reserved_token_126|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126211": {
+      "content": "<|reserved_token_127|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126212": {
+      "content": "<|reserved_token_128|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126213": {
+      "content": "<|reserved_token_129|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126214": {
+      "content": "<|reserved_token_130|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126215": {
+      "content": "<|reserved_token_131|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126216": {
+      "content": "<|reserved_token_132|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126217": {
+      "content": "<|reserved_token_133|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126218": {
+      "content": "<|reserved_token_134|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126219": {
+      "content": "<|reserved_token_135|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126220": {
+      "content": "<|reserved_token_136|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126221": {
+      "content": "<|reserved_token_137|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126222": {
+      "content": "<|reserved_token_138|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126223": {
+      "content": "<|reserved_token_139|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126224": {
+      "content": "<|reserved_token_140|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126225": {
+      "content": "<|reserved_token_141|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126226": {
+      "content": "<|reserved_token_142|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126227": {
+      "content": "<|reserved_token_143|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126228": {
+      "content": "<|reserved_token_144|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126229": {
+      "content": "<|reserved_token_145|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126230": {
+      "content": "<|reserved_token_146|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126231": {
+      "content": "<|reserved_token_147|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126232": {
+      "content": "<|reserved_token_148|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126233": {
+      "content": "<|reserved_token_149|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126234": {
+      "content": "<|reserved_token_150|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126235": {
+      "content": "<|reserved_token_151|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126236": {
+      "content": "<|reserved_token_152|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126237": {
+      "content": "<|reserved_token_153|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126238": {
+      "content": "<|reserved_token_154|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126239": {
+      "content": "<|reserved_token_155|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126240": {
+      "content": "<|reserved_token_156|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126241": {
+      "content": "<|reserved_token_157|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126242": {
+      "content": "<|reserved_token_158|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126243": {
+      "content": "<|reserved_token_159|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126244": {
+      "content": "<|reserved_token_160|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126245": {
+      "content": "<|reserved_token_161|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126246": {
+      "content": "<|reserved_token_162|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126247": {
+      "content": "<|reserved_token_163|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126248": {
+      "content": "<|reserved_token_164|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126249": {
+      "content": "<|reserved_token_165|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126250": {
+      "content": "<|reserved_token_166|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126251": {
+      "content": "<|reserved_token_167|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126252": {
+      "content": "<|reserved_token_168|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126253": {
+      "content": "<|reserved_token_169|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126254": {
+      "content": "<|reserved_token_170|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126255": {
+      "content": "<|reserved_token_171|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126256": {
+      "content": "<|reserved_token_172|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126257": {
+      "content": "<|reserved_token_173|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126258": {
+      "content": "<|reserved_token_174|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126259": {
+      "content": "<|reserved_token_175|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126260": {
+      "content": "<|reserved_token_176|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126261": {
+      "content": "<|reserved_token_177|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126262": {
+      "content": "<|reserved_token_178|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126263": {
+      "content": "<|reserved_token_179|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126264": {
+      "content": "<|reserved_token_180|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126265": {
+      "content": "<|reserved_token_181|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126266": {
+      "content": "<|reserved_token_182|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126267": {
+      "content": "<|reserved_token_183|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126268": {
+      "content": "<|reserved_token_184|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126269": {
+      "content": "<|reserved_token_185|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126270": {
+      "content": "<|reserved_token_186|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126271": {
+      "content": "<|reserved_token_187|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126272": {
+      "content": "<|reserved_token_188|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126273": {
+      "content": "<|reserved_token_189|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126274": {
+      "content": "<|reserved_token_190|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126275": {
+      "content": "<|reserved_token_191|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126276": {
+      "content": "<|reserved_token_192|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126277": {
+      "content": "<|reserved_token_193|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126278": {
+      "content": "<|reserved_token_194|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126279": {
+      "content": "<|reserved_token_195|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126280": {
+      "content": "<|reserved_token_196|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126281": {
+      "content": "<|reserved_token_197|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126282": {
+      "content": "<|reserved_token_198|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126283": {
+      "content": "<|reserved_token_199|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126284": {
+      "content": "<|reserved_token_200|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126285": {
+      "content": "<|reserved_token_201|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126286": {
+      "content": "<|reserved_token_202|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126287": {
+      "content": "<|reserved_token_203|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126288": {
+      "content": "<|reserved_token_204|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126289": {
+      "content": "<|reserved_token_205|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126290": {
+      "content": "<|reserved_token_206|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126291": {
+      "content": "<|reserved_token_207|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126292": {
+      "content": "<|reserved_token_208|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126293": {
+      "content": "<|reserved_token_209|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126294": {
+      "content": "<|reserved_token_210|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126295": {
+      "content": "<|reserved_token_211|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126296": {
+      "content": "<|reserved_token_212|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126297": {
+      "content": "<|reserved_token_213|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126298": {
+      "content": "<|reserved_token_214|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126299": {
+      "content": "<|reserved_token_215|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126300": {
+      "content": "<|reserved_token_216|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126301": {
+      "content": "<|reserved_token_217|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126302": {
+      "content": "<|reserved_token_218|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126303": {
+      "content": "<|reserved_token_219|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126304": {
+      "content": "<|reserved_token_220|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126305": {
+      "content": "<|reserved_token_221|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126306": {
+      "content": "<|reserved_token_222|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126307": {
+      "content": "<|reserved_token_223|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126308": {
+      "content": "<|reserved_token_224|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126309": {
+      "content": "<|reserved_token_225|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126310": {
+      "content": "<|reserved_token_226|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126311": {
+      "content": "<|reserved_token_227|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126312": {
+      "content": "<|reserved_token_228|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126313": {
+      "content": "<|reserved_token_229|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126314": {
+      "content": "<|reserved_token_230|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126315": {
+      "content": "<|reserved_token_231|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126316": {
+      "content": "<|reserved_token_232|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126317": {
+      "content": "<|reserved_token_233|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126318": {
+      "content": "<|reserved_token_234|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126319": {
+      "content": "<|reserved_token_235|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126320": {
+      "content": "<|reserved_token_236|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126321": {
+      "content": "<|reserved_token_237|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126322": {
+      "content": "<|reserved_token_238|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126323": {
+      "content": "<|reserved_token_239|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126324": {
+      "content": "<|reserved_token_240|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126325": {
+      "content": "<|reserved_token_241|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126326": {
+      "content": "<|reserved_token_242|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126327": {
+      "content": "<|reserved_token_243|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126328": {
+      "content": "<|reserved_token_244|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126329": {
+      "content": "<|reserved_token_245|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126330": {
+      "content": "<|reserved_token_246|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126331": {
+      "content": "<|reserved_token_247|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126332": {
+      "content": "<|reserved_token_248|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126333": {
+      "content": "<|reserved_token_249|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126334": {
+      "content": "<|reserved_token_250|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126335": {
+      "content": "<|reserved_token_251|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126336": {
+      "content": "<|mdm_mask|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126337": {
+      "content": "<|reserved_token_253|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126338": {
+      "content": "<|reserved_token_254|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126339": {
+      "content": "<|reserved_token_255|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126340": {
+      "content": "<role>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126341": {
+      "content": "</role>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126342": {
+      "content": "<|arithmetic_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126343": {
+      "content": "<|arithmetic_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126344": {
+      "content": "<|number_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126345": {
+      "content": "<|number_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126346": {
+      "content": "<|start_header_id|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126347": {
+      "content": "<|end_header_id|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126348": {
+      "content": "<|eot_id|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126349": {
+      "content": "<IMG_CONTEXT>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126350": {
+      "content": "<img>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126351": {
+      "content": "</img>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126352": {
+      "content": "<|Mask_Cap_0|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126353": {
+      "content": "<|Mask_Cap_1|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126354": {
+      "content": "<|Mask_Cap_2|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126355": {
+      "content": "<|Mask_Cap_3|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126356": {
+      "content": "<|Mask_Cap_4|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126357": {
+      "content": "<|Mask_Cap_5|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126358": {
+      "content": "<|Mask_Cap_6|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126359": {
+      "content": "<|Mask_Cap_7|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126360": {
+      "content": "<|Mask_Cap_8|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126361": {
+      "content": "<|Mask_Cap_9|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126362": {
+      "content": "<|Mask_Cap_10|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126363": {
+      "content": "<|Mask_Cap_11|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126364": {
+      "content": "<|Mask_Cap_12|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126365": {
+      "content": "<|Mask_Cap_13|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126366": {
+      "content": "<|Mask_Cap_14|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "126367": {
+      "content": "<|Mask_Cap_15|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|mdm_mask|>",
+    "<role>",
+    "</role>",
+    "<|arithmetic_start|>",
+    "<|arithmetic_end|>",
+    "<|number_start|>",
+    "<|number_end|>",
+    "<IMG_CONTEXT>",
+    "<img>",
+    "</img>",
+    "<|Mask_Cap_0|>",
+    "<|Mask_Cap_1|>",
+    "<|Mask_Cap_2|>",
+    "<|Mask_Cap_3|>",
+    "<|Mask_Cap_4|>",
+    "<|Mask_Cap_5|>",
+    "<|Mask_Cap_6|>",
+    "<|Mask_Cap_7|>",
+    "<|Mask_Cap_8|>",
+    "<|Mask_Cap_9|>",
+    "<|Mask_Cap_10|>",
+    "<|Mask_Cap_11|>",
+    "<|Mask_Cap_12|>",
+    "<|Mask_Cap_13|>",
+    "<|Mask_Cap_14|>",
+    "<|Mask_Cap_15|>"
+  ],
+  "auto_map": {
+    "AutoProcessor": "processing_pdmllm.PDMLLMProcessor"
+  },
+  "bos_token": "<|startoftext|>",
+  "chat_template": "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "eos_token": "<|endoftext|>",
+  "extra_special_tokens": {},
+  "fast_tokenizer": true,
+  "gmask_token": "[gMASK]",
+  "merges_file": null,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": "<|endoftext|>",
+  "processor_class": "PDMLLMProcessor",
+  "tokenizer_class": "PreTrainedTokenizer",
+  "trust_remote_code": true
+}
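The `chat_template` entry above is a Jinja string, which is hard to read in its escaped one-line form. As a minimal sketch, the following plain-Python paraphrase shows the prompt layout it appears to produce. The helper name `render_chat` is hypothetical and is not part of the repository or of Transformers, and the sketch assumes each message's `content` is a plain string. The marker strings and `bos_token` value are copied from the config.

```python
from typing import Dict, List

BOS_TOKEN = "<|startoftext|>"  # value of "bos_token" in the config above


def render_chat(messages: List[Dict[str, str]], bos_token: str = BOS_TOKEN) -> str:
    """Plain-Python paraphrase of the Jinja chat_template (hypothetical helper)."""
    parts = []
    for index, message in enumerate(messages):
        # Role header, blank line, whitespace-trimmed content, end-of-turn marker.
        turn = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip()
            + "<|eot_id|>"
        )
        # Only the first message is prefixed with the BOS token.
        if index == 0:
            turn = bos_token + turn
        parts.append(turn)
    # The template always ends with an open assistant header.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "  Describe the image. "}])
print(repr(prompt))
```

In practice the tokenizer's own `apply_chat_template` method would render the template; the sketch is only meant to make the structure visible.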