shruthib committed on Oct 3, 2024

Commit

2553e7a

verified ·

1 Parent(s): b9432bf

Model Release (#1)

- Model Release (ba5c7c8a5203ec65aaec329a9f890b8b780d9301)
- Remove whitespace changes in LICENSE (fd1d2008bcc32388dea23cba1849b39c5ae4ad2c)

Files changed (21)

README.md +248 -5
added_tokens.json +209 -0
chat_template.json +3 -0
config.json +212 -0
configuration_maira2.py +32 -0
generation_config.json +9 -0
model-00001-of-00006.safetensors +3 -0
model-00002-of-00006.safetensors +3 -0
model-00003-of-00006.safetensors +3 -0
model-00004-of-00006.safetensors +3 -0
model-00005-of-00006.safetensors +3 -0
model-00006-of-00006.safetensors +3 -0
model.safetensors.index.json +529 -0
modeling_maira2.py +88 -0
preprocessor_config.json +28 -0
processing_maira2.py +646 -0
processor_config.json +14 -0
special_tokens_map.json +30 -0
tokenizer.json +0 -0
tokenizer.model +3 -0
tokenizer_config.json +1701 -0

README.md CHANGED Viewed

@@ -1,5 +1,248 @@
----
-license: other
-license_name: msrla
-license_link: LICENSE
----

+---
+license: other
+license_name: msrla
+license_link: https://huggingface.co/microsoft/maira-2/blob/main/LICENSE
+library_name: transformers
+---
+# Model Card for MAIRA-2
+<!-- Provide a quick summary of what the model is/does. -->
+MAIRA-2 is a multimodal transformer designed for the generation of grounded or non-grounded radiology reports from chest X-rays. It is described in more detail in [MAIRA-2: Grounded Radiology Report Generation (S. Bannur, K. Bouzid et al., 2024)](https://arxiv.org/abs/2406.04449). MAIRA-2 has been built for research purposes only and is being shared to facilitate comparison and further research.
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+MAIRA-2 is composed of the image encoder [RAD-DINO-MAIRA-2](https://huggingface.co/microsoft/rad-dino-maira-2) (used frozen), a projection layer (trained from scratch), and the language model [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) (fully fine-tuned).
+- **Developed by:** Microsoft Research Health Futures
+- **Model type:** Multimodal transformer
+- **Language(s) (NLP):** English
+- **License:** [MSRLA](./LICENSE)
+- **Finetuned from model:** [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5), [RAD-DINO-MAIRA-2](https://huggingface.co/microsoft/rad-dino-maira-2)
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+MAIRA-2 is shared for research purposes only. It is **not meant to be used for clinical practice.** MAIRA-2 was not extensively tested for its capabilities and properties, including its accuracy and reliability in application settings, fairness across different demographics and uses, and security and privacy.
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+As inputs, MAIRA-2 takes a frontal chest X-ray, and any of the following:
+- A lateral view from the current study
+- A frontal view from the *prior* study, with accompanying prior report
+- The indication for the current study
+- The technique and comparison sections for the current study
+MAIRA-2 can generate the _findings_ section of the current study, in one of two forms:
+- Narrative text, without any image annotations (this is the typical report generation scenario).
+- As a grounded report, wherein all described findings are accompanied by zero or more bounding boxes indicating their location on the current frontal image.
+MAIRA-2 can also perform phrase grounding. In this case, it must also be provided with an input phrase. It will then repeat the phrase and generate a bounding box localising the finding described in the phrase.
+These use-cases are illustrated with [sample code below](README.md#how-to-get-started-with-the-model).
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+MAIRA-2 was trained on chest X-rays from adults with English language reports only, and is not expected to work on any other imaging modality or anatomy. Variations in the input prompt (e.g. changing the instruction) are likely to degrade performance, as this model was *not* optimised for arbitrary user inputs.
+As above, this is a research model which should not be used in any real clinical or production scenario.
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+### Data biases
+MAIRA-2 was trained on chest X-ray report datasets from Spain (translated from the original Spanish to English) and the USA, listed below. Reporting styles, patient demographics and disease prevalence, and image acquisition protocols can vary across health systems and regions. These factors will impact the generalisability of the model.
+### Model errors (fabrication, omission)
+This model does not perform perfectly on its tasks, as outlined in more detail in the [MAIRA-2 report](https://arxiv.org/abs/2406.04449). Hence, errors can be present in the generated (grounded) reports.
+## How to Get Started with the Model
+We demonstrate below how to run inference with MAIRA-2 for its three capabilities: findings generation with and without grounding, and phrase grounding.
+### Setup
+First, initialise the model and put it in eval mode.
+```python
+from transformers import AutoModelForCausalLM, AutoProcessor
+from pathlib import Path
+import torch
+model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)
+processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)
+device = torch.device("cuda")
+model = model.eval()
+model = model.to(device)
+```
+We need to get some data to demonstrate the forward pass.
+For this example, we'll collect an example from the IU X-ray dataset, which has a permissive license.
+```python
+import requests
+from PIL import Image
+def get_sample_data() -> dict[str, Image.Image | str]:
+    """
+    Download chest X-rays from IU-Xray, which we didn't train MAIRA-2 on. License is CC.
+    We modified this function from the Rad-DINO repository on Huggingface.
+    """
+    frontal_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-1001.png"
+    lateral_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-2001.png"
+    headers = {"User-Agent": "MAIRA-2"}
+    frontal_response = requests.get(frontal_image_url, headers=headers, stream=True)
+    frontal_image = Image.open(frontal_response.raw)
+    lateral_response = requests.get(lateral_image_url, headers=headers, stream=True)
+    lateral_image = Image.open(lateral_response.raw)
+    sample_data = {
+        "frontal": frontal_image,
+        "lateral": lateral_image,
+        "indication": "Dyspnea.",
+        "comparison": "None.",
+        "technique": "PA and lateral views of the chest.",
+        "phrase": "Pleural effusion."    # For the phrase grounding example. This patient has pleural effusion.
+    }
+    return sample_data
+sample_data = get_sample_data()
+```
+### Use-case 1 and 2: Findings generation with or without grounding
+We can toggle whether MAIRA-2 generates a grounded report based on how we preprocess the inputs, as it uses a different prompt. Let's start without grounding (`get_grounding=False`). While generating, use `max_new_tokens=300` for non-grounded reporting and `max_new_tokens=450` for grounded reporting, to accommodate the additional box and object tokens.
+```python
+processed_inputs = processor.format_and_preprocess_reporting_input(
+    current_frontal=sample_data["frontal"],
+    current_lateral=sample_data["lateral"],
+    prior_frontal=None, # Our example has no prior
+    indication=sample_data["indication"],
+    technique=sample_data["technique"],
+    comparison=sample_data["comparison"],
+    prior_report=None,  # Our example has no prior
+    return_tensors="pt",
+    get_grounding=False # For this example we generate a non-grounded report
+)
+processed_inputs = processed_inputs.to(device)
+with torch.no_grad():
+    output_decoding = model.generate(
+        **processed_inputs,
+        max_new_tokens=300,  # Set to 450 for grounded reporting.
+        use_cache=True,
+    )
+prompt_length = processed_inputs["input_ids"].shape[-1]
+decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
+decoded_text = decoded_text.lstrip()  # Findings generation completions have a single leading space
+print("Parsed prediction:", processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text))
+```
+We get something that looks like this:
+> "There is a large right pleural effusion with associated right basilar atelectasis. The left lung is clear. No pneumothorax is identified. The cardiomediastinal silhouette and hilar contours are normal. There is no free air under the diaphragm. Surgical clips are noted in the right upper quadrant of the abdomen."
+If we had set `get_grounding=True`, MAIRA-2 would generate a grounded report. For this example, that looks like this:
+```python
+('There is a large right pleural effusion.', [(0.055, 0.275, 0.445, 0.665)]),
+('The left lung is clear.', None),
+('No pneumothorax is identified.', None),
+('The cardiomediastinal silhouette is within normal limits.', None),
+('The visualized osseous structures are unremarkable.', None)
+```
+The generated bounding box coordinates are the `(x, y)` coordinates of the top left and bottom right corners of the box, i.e. `(x_topleft, y_topleft, x_bottomright, y_bottomright)`. These are relative to the _cropped_ image (that is, the image that MAIRA-2 ultimately got as input), so be careful while visualising. The processor provides a method `adjust_box_for_original_image_size` to get boxes relative to the original image shape.
+Note that MAIRA-2 generates slightly different reports for grounded and non-grounded reporting scenarios, a side-effect of its grounded reporting training data coming from a different data distribution.
+### Use-case 3: Phrase Grounding
+Here the input is different as we provide the model with a phrase to ground in the image. Recall (`get_sample_data`) that our phrase here is just "Pleural effusion", which we already know is present in this image.
+```python
+processed_inputs = processor.format_and_preprocess_phrase_grounding_input(
+    frontal_image=sample_data["frontal"],
+    phrase=sample_data["phrase"],
+    return_tensors="pt",
+)
+processed_inputs = processed_inputs.to(device)
+with torch.no_grad():
+    output_decoding = model.generate(
+        **processed_inputs,
+        max_new_tokens=150,
+        use_cache=True,
+    )
+prompt_length = processed_inputs["input_ids"].shape[-1]
+decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
+print("Parsed prediction:", processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text))
+```
+This gives us something like this:
+```python
+('Pleural effusion.', [(0.025, 0.345, 0.425, 0.575)])
+```
+As with grounded reporting, the bounding box coordinates are relative to the cropped image seen by MAIRA-2. Use `processor.adjust_box_for_original_image_size` to get boxes adjusted for the original image shape.
+## Training details
+We did not originally train MAIRA-2 using the exact model class provided here, but we have checked that its behaviour is the same. We provide this class to facilitate research re-use and inference.
+### Training data
+MAIRA-2 was trained on a mix of public and private chest X-ray datasets. Each example comprises one or more CXR images and associated report text, with or without grounding (spatial annotations). The model is trained to generate the _findings_ section of the report, with or without grounding.
+| Dataset | Country | # examples (ungrounded) | # examples (grounded) |
+| ----- | ------ |------- | ----- |
+| [MIMIC-CXR](https://www.nature.com/articles/s41597-019-0322-0) | USA | 55 218 | 595* |
+| [PadChest](https://www.sciencedirect.com/science/article/abs/pii/S1361841520301614) | Spain | 52 828 | 3 122 |
+| USMix (Private) | USA | 118 031 | 53 613 |
+*We use the [MS-CXR](https://physionet.org/content/ms-cxr/) phrase grounding dataset to provide 'grounding' examples from MIMIC-CXR.
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** NVIDIA A100 GPUs
+- **Hours used:** 1432
+- **Cloud Provider:**  Azure
+- **Compute Region:** West US 2
+- **Carbon Emitted:** 107.4 CO₂ eq _(ostensibly offset by this provider)_
+## Citation
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+```
+@article{Bannur2024MAIRA2GR,
+  title={MAIRA-2: Grounded Radiology Report Generation},
+  author={Shruthi Bannur and Kenza Bouzid and Daniel C. Castro and Anton Schwaighofer and Anja Thieme and Sam Bond-Taylor and Maximilian Ilse and Fernando P{\'e}rez-Garc{\'i}a and Valentina Salvatelli and Harshita Sharma and Felix Meissen and Mercy Prasanna Ranjit and Shaury Srivastav and Julia Gong and Noel C. F. Codella and Fabian Falck and Ozan Oktay and Matthew P. Lungren and Maria T. A. Wetscherek and Javier Alvarez-Valle and Stephanie L. Hyland},
+  journal={arXiv},
+  year={2024},
+  volume={abs/2406.04449},
+  url={https://arxiv.org/abs/2406.04449}
+}
+```
+**APA:**
+> Bannur*, S., Bouzid*, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., Meissen, F., Ranjit, M.P., Srivastav, S., Gong, J., Codella, N.C.F., Falck, F., Oktay, O., Lungren, M.P., Wetscherek, M.T., Alvarez-Valle, J., & Hyland, S. L. (2024). *MAIRA-2: Grounded Radiology Report Generation*. arXiv preprint abs/2406.04449.
+## Model Card Contact
+- Stephanie Hyland ([`stephanie.hyland@microsoft.com`](mailto:stephanie.hyland@microsoft.com))
+- Shruthi Bannur ([`shruthi.bannur@microsoft.com`](mailto:shruthi.bannur@microsoft.com))
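
Note on the README above: the parsed grounded outputs pair each sentence with either `None` or a list of boxes given as relative `(x_topleft, y_topleft, x_bottomright, y_bottomright)` coordinates. The sketch below shows one way such boxes could be scaled to pixel coordinates for an image of known size. The helper names `to_pixel_box` and `boxes_in_pixels` are hypothetical and are not part of the MAIRA-2 processor. The sketch also assumes the target image is the cropped image the model saw; boxes intended for the original image should first go through `processor.adjust_box_for_original_image_size`, as the README advises.

```python
from typing import Optional

Box = tuple[float, float, float, float]
PixelBox = tuple[int, int, int, int]


def to_pixel_box(box: Box, width: int, height: int) -> PixelBox:
    """Scale a relative (x0, y0, x1, y1) box to integer pixel coordinates."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height), round(x1 * width), round(y1 * height))


def boxes_in_pixels(
    grounded: list[tuple[str, Optional[list[Box]]]], width: int, height: int
) -> list[tuple[str, list[PixelBox]]]:
    """Convert every box in a parsed grounded sequence.

    Sentences without boxes (None) are kept and paired with an empty list.
    """
    return [
        (text, [to_pixel_box(b, width, height) for b in (boxes or [])])
        for text, boxes in grounded
    ]


# Illustrative input in the same shape as the README's parsed predictions.
example = [
    ("Finding with one box.", [(0.25, 0.5, 0.75, 1.0)]),
    ("Finding without a box.", None),
]
pixel_example = boxes_in_pixels(example, width=400, height=200)
print(pixel_example)
```

Returning an empty list instead of `None` keeps downstream drawing code free of special cases; that is a design choice of this sketch, not a convention of the model.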

added_tokens.json ADDED Viewed

	@@ -0,0 +1,209 @@

+{
+  "</box>": 32203,
+  "</obj>": 32001,
+  "<box>": 32202,
+  "<image>": 32204,
+  "<lat_image>": 32206,
+  "<obj>": 32000,
+  "<prev_im>": 32205,
+  "<x0>": 32002,
+  "<x10>": 32012,
+  "<x11>": 32013,
+  "<x12>": 32014,
+  "<x13>": 32015,
+  "<x14>": 32016,
+  "<x15>": 32017,
+  "<x16>": 32018,
+  "<x17>": 32019,
+  "<x18>": 32020,
+  "<x19>": 32021,
+  "<x1>": 32003,
+  "<x20>": 32022,
+  "<x21>": 32023,
+  "<x22>": 32024,
+  "<x23>": 32025,
+  "<x24>": 32026,
+  "<x25>": 32027,
+  "<x26>": 32028,
+  "<x27>": 32029,
+  "<x28>": 32030,
+  "<x29>": 32031,
+  "<x2>": 32004,
+  "<x30>": 32032,
+  "<x31>": 32033,
+  "<x32>": 32034,
+  "<x33>": 32035,
+  "<x34>": 32036,
+  "<x35>": 32037,
+  "<x36>": 32038,
+  "<x37>": 32039,
+  "<x38>": 32040,
+  "<x39>": 32041,
+  "<x3>": 32005,
+  "<x40>": 32042,
+  "<x41>": 32043,
+  "<x42>": 32044,
+  "<x43>": 32045,
+  "<x44>": 32046,
+  "<x45>": 32047,
+  "<x46>": 32048,
+  "<x47>": 32049,
+  "<x48>": 32050,
+  "<x49>": 32051,
+  "<x4>": 32006,
+  "<x50>": 32052,
+  "<x51>": 32053,
+  "<x52>": 32054,
+  "<x53>": 32055,
+  "<x54>": 32056,
+  "<x55>": 32057,
+  "<x56>": 32058,
+  "<x57>": 32059,
+  "<x58>": 32060,
+  "<x59>": 32061,
+  "<x5>": 32007,
+  "<x60>": 32062,
+  "<x61>": 32063,
+  "<x62>": 32064,
+  "<x63>": 32065,
+  "<x64>": 32066,
+  "<x65>": 32067,
+  "<x66>": 32068,
+  "<x67>": 32069,
+  "<x68>": 32070,
+  "<x69>": 32071,
+  "<x6>": 32008,
+  "<x70>": 32072,
+  "<x71>": 32073,
+  "<x72>": 32074,
+  "<x73>": 32075,
+  "<x74>": 32076,
+  "<x75>": 32077,
+  "<x76>": 32078,
+  "<x77>": 32079,
+  "<x78>": 32080,
+  "<x79>": 32081,
+  "<x7>": 32009,
+  "<x80>": 32082,
+  "<x81>": 32083,
+  "<x82>": 32084,
+  "<x83>": 32085,
+  "<x84>": 32086,
+  "<x85>": 32087,
+  "<x86>": 32088,
+  "<x87>": 32089,
+  "<x88>": 32090,
+  "<x89>": 32091,
+  "<x8>": 32010,
+  "<x90>": 32092,
+  "<x91>": 32093,
+  "<x92>": 32094,
+  "<x93>": 32095,
+  "<x94>": 32096,
+  "<x95>": 32097,
+  "<x96>": 32098,
+  "<x97>": 32099,
+  "<x98>": 32100,
+  "<x99>": 32101,
+  "<x9>": 32011,
+  "<y0>": 32102,
+  "<y10>": 32112,
+  "<y11>": 32113,
+  "<y12>": 32114,
+  "<y13>": 32115,
+  "<y14>": 32116,
+  "<y15>": 32117,
+  "<y16>": 32118,
+  "<y17>": 32119,
+  "<y18>": 32120,
+  "<y19>": 32121,
+  "<y1>": 32103,
+  "<y20>": 32122,
+  "<y21>": 32123,
+  "<y22>": 32124,
+  "<y23>": 32125,
+  "<y24>": 32126,
+  "<y25>": 32127,
+  "<y26>": 32128,
+  "<y27>": 32129,
+  "<y28>": 32130,
+  "<y29>": 32131,
+  "<y2>": 32104,
+  "<y30>": 32132,
+  "<y31>": 32133,
+  "<y32>": 32134,
+  "<y33>": 32135,
+  "<y34>": 32136,
+  "<y35>": 32137,
+  "<y36>": 32138,
+  "<y37>": 32139,
+  "<y38>": 32140,
+  "<y39>": 32141,
+  "<y3>": 32105,
+  "<y40>": 32142,
+  "<y41>": 32143,
+  "<y42>": 32144,
+  "<y43>": 32145,
+  "<y44>": 32146,
+  "<y45>": 32147,
+  "<y46>": 32148,
+  "<y47>": 32149,
+  "<y48>": 32150,
+  "<y49>": 32151,
+  "<y4>": 32106,
+  "<y50>": 32152,
+  "<y51>": 32153,
+  "<y52>": 32154,
+  "<y53>": 32155,
+  "<y54>": 32156,
+  "<y55>": 32157,
+  "<y56>": 32158,
+  "<y57>": 32159,
+  "<y58>": 32160,
+  "<y59>": 32161,
+  "<y5>": 32107,
+  "<y60>": 32162,
+  "<y61>": 32163,
+  "<y62>": 32164,
+  "<y63>": 32165,
+  "<y64>": 32166,
+  "<y65>": 32167,
+  "<y66>": 32168,
+  "<y67>": 32169,
+  "<y68>": 32170,
+  "<y69>": 32171,
+  "<y6>": 32108,
+  "<y70>": 32172,
+  "<y71>": 32173,
+  "<y72>": 32174,
+  "<y73>": 32175,
+  "<y74>": 32176,
+  "<y75>": 32177,
+  "<y76>": 32178,
+  "<y77>": 32179,
+  "<y78>": 32180,
+  "<y79>": 32181,
+  "<y7>": 32109,
+  "<y80>": 32182,
+  "<y81>": 32183,
+  "<y82>": 32184,
+  "<y83>": 32185,
+  "<y84>": 32186,
+  "<y85>": 32187,
+  "<y86>": 32188,
+  "<y87>": 32189,
+  "<y88>": 32190,
+  "<y89>": 32191,
+  "<y8>": 32110,
+  "<y90>": 32192,
+  "<y91>": 32193,
+  "<y92>": 32194,
+  "<y93>": 32195,
+  "<y94>": 32196,
+  "<y95>": 32197,
+  "<y96>": 32198,
+  "<y97>": 32199,
+  "<y98>": 32200,
+  "<y99>": 32201,
+  "<y9>": 32111
+}
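
Note on `added_tokens.json`: the file is sorted alphabetically by token string, which hides the fact that the coordinate tokens occupy two contiguous id ranges: `<x0>` to `<x99>` map to 32002 through 32101, and `<y0>` to `<y99>` map to 32102 through 32201. The sketch below captures that layout. The function names are hypothetical, and the bin-centre decoding in `bin_center` is an assumption; it is consistent with the README's sample boxes (values such as 0.055 and 0.275) but is not confirmed by the files shown here.

```python
NUM_BINS = 100    # <x0>..<x99> and <y0>..<y99> in added_tokens.json
X_OFFSET = 32002  # id of <x0>
Y_OFFSET = 32102  # id of <y0>


def coord_token_id(axis: str, bin_index: int) -> int:
    """Return the token id for a coordinate token such as <x17> or <y42>."""
    if axis not in ("x", "y"):
        raise ValueError(f"axis must be 'x' or 'y', got {axis!r}")
    if not 0 <= bin_index < NUM_BINS:
        raise ValueError(f"bin_index must be in [0, {NUM_BINS}), got {bin_index}")
    offset = X_OFFSET if axis == "x" else Y_OFFSET
    return offset + bin_index


def bin_center(bin_index: int) -> float:
    """ASSUMPTION: bin i covers [i/100, (i+1)/100) and decodes to its centre."""
    return (bin_index + 0.5) / NUM_BINS


print(coord_token_id("x", 10), coord_token_id("y", 99), bin_center(5))
```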

chat_template.json ADDED Viewed

	@@ -0,0 +1,3 @@

+{
+  "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}You are an expert radiology assistant tasked with interpreting a chest X-ray study.  {% for message in messages %}{% if message[\"role\"] == \"user\" %}USER:  {% else %}ASSISTANT: {% endif %}{% for item in message[\"content\"] %}{% if item[\"type\"] == \"text\" %}{{ item[\"text\"] }}{% elif item[\"type\"] == \"image\" %}<image>{% endif %}{% endfor %}{% if message[\"role\"] == \"user\" %}  {% else %}{{eos_token}}{% endif %}{% endfor %}{% if add_generation_prompt %}ASSISTANT: {% endif %}"
+}
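
Note on `chat_template.json`: the Jinja template is dense on one line, so here is a plain-Python sketch that mirrors its control flow, to make the resulting prompt layout easier to see. This is a re-implementation for illustration only, not how `transformers` renders the template. The function name `render_prompt`, the `</s>` default for `eos_token`, and the example user text are assumptions; whitespace is copied as it appears in this diff view and may differ in the raw file.

```python
SYSTEM_PROMPT = (
    "You are an expert radiology assistant tasked with interpreting a chest X-ray study.  "
)


def render_prompt(messages, eos_token="</s>", add_generation_prompt=False) -> str:
    """Mirror the template: system text, then USER/ASSISTANT turns, then an optional cue."""
    parts = [SYSTEM_PROMPT]
    for message in messages:
        is_user = message["role"] == "user"
        parts.append("USER:  " if is_user else "ASSISTANT: ")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                # Every image item is rendered as the same <image> placeholder.
                parts.append("<image>")
        # User turns end with two spaces; assistant turns end with the EOS token.
        parts.append("  " if is_user else eos_token)
    if add_generation_prompt:
        parts.append("ASSISTANT: ")
    return "".join(parts)


# Hypothetical message; the real prompts are built by the MAIRA-2 processor.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": " Describe the findings."},
        ],
    }
]
prompt = render_prompt(messages, add_generation_prompt=True)
print(prompt)
```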

config.json ADDED Viewed

	@@ -0,0 +1,212 @@

+{
+  "architectures": [
+    "Maira2ForConditionalGeneration"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_maira2.Maira2Config",
+    "AutoModelForCausalLM": "modeling_maira2.Maira2ForConditionalGeneration",
+    "AutoModelForVision2Seq": "modeling_maira2.Maira2ForConditionalGeneration"
+  },
+  "hidden_size": 4096,
+  "ignore_index": -100,
+  "image_seq_length": 576,
+  "image_token_index": 32204,
+  "model_type": "maira2",
+  "pad_token_id": 0,
+  "projector_hidden_act": "gelu",
+  "projector_n_layers": 4,
+  "text_config": {
+    "_name_or_path": "lmsys/vicuna-7b-v1.5",
+    "add_cross_attention": false,
+    "architectures": [
+      "LlamaForCausalLM"
+    ],
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": 1,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": 2,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 4096,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "initializer_range": 0.02,
+    "intermediate_size": 11008,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "max_position_embeddings": 4096,
+    "min_length": 0,
+    "mlp_bias": false,
+    "model_type": "llama",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 32,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_hidden_layers": 32,
+    "num_key_value_heads": 32,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": 0,
+    "prefix": null,
+    "pretraining_tp": 1,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "rms_norm_eps": 1e-05,
+    "rope_scaling": {
+      "factor": 1.5,
+      "rope_type": "linear"
+    },
+    "rope_theta": 10000.0,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": false,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": "bfloat16",
+    "torchscript": false,
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "use_cache": true,
+    "vocab_size": 32207
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.46.0.dev0",
+  "vision_config": {
+    "_name_or_path": "",
+    "add_cross_attention": false,
+    "apply_layernorm": true,
+    "architectures": [
+      "Dinov2Model"
+    ],
+    "attention_probs_dropout_prob": 0.0,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "drop_path_rate": 0.0,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "gelu",
+    "hidden_dropout_prob": 0.0,
+    "hidden_size": 768,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "image_size": 518,
+    "initializer_range": 0.02,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "layer_norm_eps": 1e-06,
+    "layerscale_value": 1.0,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "mlp_ratio": 4,
+    "model_type": "dinov2",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 12,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_channels": 3,
+    "num_hidden_layers": 12,
+    "num_return_sequences": 1,
+    "out_features": [
+      "stage12"
+    ],
+    "out_indices": [
+      12
+    ],
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "patch_size": 14,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "qkv_bias": true,
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "reshape_hidden_states": false,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "stage_names": [
+      "stem",
+      "stage1",
+      "stage2",
+      "stage3",
+      "stage4",
+      "stage5",
+      "stage6",
+      "stage7",
+      "stage8",
+      "stage9",
+      "stage10",
+      "stage11",
+      "stage12"
+    ],
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torch_dtype": "float32",
+    "torchscript": false,
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "use_swiglu_ffn": false
+  },
+  "vision_feature_layer": -1,
+  "vision_feature_select_strategy": "default"
+}
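
Note on `config.json`: a few of its values can be cross-checked against each other and against `added_tokens.json`. The sketch below does the arithmetic with values copied from this commit. With `image_size` 518 and `patch_size` 14, the vision encoder sees 37 patches per side. The config also lists `image_seq_length` as 576, which does not match that patch count; this excerpt does not show which value the processing code relies on, so treat the patch count here as arithmetic only.

```python
# Values copied from config.json and added_tokens.json in this commit.
image_size = 518
patch_size = 14
text_vocab_size = 32207
image_token_index = 32204
added_token_ids = {
    "<obj>": 32000,       # lowest added id
    "<image>": 32204,
    "<prev_im>": 32205,
    "<lat_image>": 32206,  # highest added id
}

# Patch grid of the vision encoder input.
patches_per_side = image_size // patch_size
num_patches = patches_per_side**2

# The vocabulary must be large enough to index the highest added token.
lowest_added = min(added_token_ids.values())
highest_added = max(added_token_ids.values())
num_added = highest_added - lowest_added + 1

print(f"patches per side: {patches_per_side}, total patches: {num_patches}")
print(f"added tokens: {num_added}, required vocab size: {highest_added + 1}")
print(f"image_token_index is <image>: {image_token_index == added_token_ids['<image>']}")
```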

configuration_maira2.py ADDED Viewed

	@@ -0,0 +1,32 @@

+#  Copyright 2024 Microsoft. All rights reserved.
+#  Licensed under the MSRLA License. See LICENSE in the repo root for license information.
+from typing import Any
+from transformers import LlavaConfig
+class Maira2Config(LlavaConfig):
+    """
+    This is the configuration class to store the configuration of a `Maira2ForConditionalGeneration` model. It is
+    used to instantiate a MAIRA-2 model according to the specified arguments, defining the model architecture.
+    It inherits from `LlavaConfig`. In addition to the inherited attributes, it adds the
+    ability to customize the multimodal projector through following attributes:
+    Args:
+        projector_n_layers (`int`, *optional*, defaults to 4):
+            Number of layers in the multimodal projector.
+    """
+    model_type = "maira2"
+    def __init__(
+        self,
+        projector_n_layers: int = 4,
+        **kwargs: Any,
+    ) -> None:
+        super().__init__(**kwargs)
+        self.hidden_size = self.text_config.hidden_size
+        self.projector_n_layers = projector_n_layers

generation_config.json ADDED Viewed

	@@ -0,0 +1,9 @@

+{
+  "_from_model_config": true,
+  "bos_token_id": 1,
+  "eos_token_id": 2,
+  "max_length": 4096,
+  "max_new_tokens": 450,
+  "pad_token_id": 0,
+  "transformers_version": "4.46.0.dev0"
+}

model-00001-of-00006.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e899381d0b4def093d86a599663831282240a57676bd0f6b9646c37c83dce682
+size 4955289768

model-00002-of-00006.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1309c85392505e2fc421c741b7f3b56f4146f24662e26837220f32d53401aa80
+size 4857207664

model-00003-of-00006.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:95ef725a26f95751c4e181fa44230b5793a297fc84928532797b0c51947e29e7
+size 4857207704

model-00004-of-00006.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:735d55ba903a0a5733af227fea50da1588588a9bf6512b3747acda91184a96df
+size 4857207704

model-00005-of-00006.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:38a7879c501f697864a36ca338866be4d45dfb2fc9b8d4f995d65cc0e984bac5
+size 4857207704

model-00006-of-00006.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9131ce3ccd96fb1c2654048ba8fb8e4bb9eb131b4395d06c5c1330d0d608fb72
+size 3136688192
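
Note on the six `.safetensors` entries: each diff shows only a Git LFS pointer, whose `size` field is the byte size of the real shard. Summing them gives the download size. The parameter estimate below is a rough sketch that assumes every tensor is stored as 4-byte float32 (the top-level `torch_dtype` in `config.json`) and ignores the small safetensors header in each shard, so it slightly overestimates.

```python
# Byte sizes copied from the six Git LFS pointer files in this commit.
shard_sizes = [
    4955289768,
    4857207664,
    4857207704,
    4857207704,
    4857207704,
    3136688192,
]

total_bytes = sum(shard_sizes)
total_gib = total_bytes / 2**30

# ASSUMPTION: all weights are float32 (4 bytes each); headers are ignored.
approx_params = total_bytes // 4

print(f"total download: {total_bytes} bytes ({total_gib:.2f} GiB)")
print(f"approximate parameter count: {approx_params / 1e9:.2f} billion")
```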

model.safetensors.index.json ADDED Viewed

	@@ -0,0 +1,529 @@

+{
+  "metadata": {
+    "total_size": 27520742400
+  },
+  "weight_map": {
+    "language_model.lm_head.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.embed_tokens.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.10.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.10.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.10.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.10.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.10.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.10.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.10.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.10.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.10.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.11.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.11.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.12.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.13.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.14.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.input_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.15.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.16.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.16.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.16.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.16.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.16.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.16.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.16.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.16.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.16.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+    "language_model.model.layers.17.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.17.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.18.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.19.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.2.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.20.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.20.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.input_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.21.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.22.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.22.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.22.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.22.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.22.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.22.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.22.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.22.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.22.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+    "language_model.model.layers.23.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.23.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.24.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.25.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.26.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.input_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.27.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.28.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.28.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.28.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.28.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.28.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.28.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.28.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.28.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.28.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+    "language_model.model.layers.29.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.self_attn.k_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.self_attn.q_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.29.self_attn.v_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.3.input_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.30.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.self_attn.k_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.self_attn.q_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.30.self_attn.v_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.input_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.self_attn.k_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.self_attn.q_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.31.self_attn.v_proj.weight": "model-00006-of-00006.safetensors",
+    "language_model.model.layers.4.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.4.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.4.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.4.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+    "language_model.model.layers.5.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.5.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.6.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.7.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.8.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.input_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.layers.9.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
+    "language_model.model.norm.weight": "model-00006-of-00006.safetensors",
+    "multi_modal_projector.layers.0.bias": "model-00001-of-00006.safetensors",
+    "multi_modal_projector.layers.0.weight": "model-00001-of-00006.safetensors",
+    "multi_modal_projector.layers.2.bias": "model-00001-of-00006.safetensors",
+    "multi_modal_projector.layers.2.weight": "model-00001-of-00006.safetensors",
+    "multi_modal_projector.layers.4.bias": "model-00001-of-00006.safetensors",
+    "multi_modal_projector.layers.4.weight": "model-00001-of-00006.safetensors",
+    "multi_modal_projector.layers.6.bias": "model-00001-of-00006.safetensors",
+    "multi_modal_projector.layers.6.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.embeddings.cls_token": "model-00001-of-00006.safetensors",
+    "vision_tower.embeddings.mask_token": "model-00001-of-00006.safetensors",
+    "vision_tower.embeddings.patch_embeddings.projection.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.embeddings.patch_embeddings.projection.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.embeddings.position_embeddings": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.0.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.1.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.10.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.11.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.2.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.3.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.4.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.5.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.6.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.7.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.8.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.attention.key.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.attention.key.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.attention.query.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.attention.query.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.attention.value.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.attention.value.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.output.dense.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.attention.output.dense.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.mlp.fc1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.mlp.fc1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.mlp.fc2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.mlp.fc2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.norm1.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.norm1.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.norm2.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.encoder.layer.9.norm2.weight": "model-00001-of-00006.safetensors",
+    "vision_tower.layernorm.bias": "model-00001-of-00006.safetensors",
+    "vision_tower.layernorm.weight": "model-00001-of-00006.safetensors"
+  }
+}
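
The index file above maps each parameter name to the shard file that stores it. As a minimal sketch of how such a map can be inspected, the snippet below groups parameter names by shard using only the standard library. It assumes the mapping sits under a top-level `weight_map` key (the usual layout of a sharded safetensors index, though that key is not visible in this excerpt). `INDEX_JSON` and `group_by_shard` are hypothetical names introduced for illustration, and the sample holds just two entries copied from the listing.

```python
import json
from collections import defaultdict

# Hypothetical two-entry excerpt; the "weight_map" key is an assumption
# about the top-level layout of model.safetensors.index.json.
INDEX_JSON = """
{
  "weight_map": {
    "vision_tower.layernorm.bias": "model-00001-of-00006.safetensors",
    "vision_tower.layernorm.weight": "model-00001-of-00006.safetensors"
  }
}
"""


def group_by_shard(index: dict) -> dict[str, list[str]]:
    """Return a mapping from shard file name to the sorted parameter names it holds."""
    shards: dict[str, list[str]] = defaultdict(list)
    for parameter_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(parameter_name)
    return {shard_file: sorted(names) for shard_file, names in shards.items()}


shards = group_by_shard(json.loads(INDEX_JSON))
for shard_file, names in shards.items():
    # Print how many parameters each shard contains.
    print(shard_file, len(names))
```

Run against the full index file, the same grouping would show which of the six shards hold the vision tower, the projector, and the language model weights.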

modeling_maira2.py ADDED

	@@ -0,0 +1,88 @@

+#  Copyright 2024 Microsoft. All rights reserved.
+#  Licensed under the MSRLA License. See LICENSE in the repo root for license information.
+import torch
+from torch.nn import Linear, Module, Sequential
+from transformers import AutoBackbone, AutoModelForCausalLM, LlavaForConditionalGeneration, LlavaPreTrainedModel
+from transformers.activations import ACT2FN
+from transformers.utils import check_min_version
+from .configuration_maira2 import Maira2Config
+class Maira2MultiModalProjector(Module):
+    """
+    This class implements the multimodal projector for the MAIRA-2 model. It projects the image features to the text
+    hidden size via a series of linear layers (4 layers in MAIRA-2) separated by activation functions.
+    """
+    def __init__(self, config: Maira2Config):
+        super().__init__()
+        n_layers = config.projector_n_layers
+        if n_layers < 1:
+            raise ValueError(f"Number of layers should be at least 1, got {n_layers=}")
+        text_hidden_size = config.text_config.hidden_size
+        vision_hidden_size = config.vision_config.hidden_size
+        _layers = [Linear(vision_hidden_size, text_hidden_size, bias=True)]
+        for _ in range(n_layers - 1):
+            _layers.append(ACT2FN[config.projector_hidden_act])
+            _layers.append(Linear(text_hidden_size, text_hidden_size, bias=True))
+        self.layers = Sequential(*_layers)
+    def forward(self, image_features: torch.Tensor) -> torch.FloatTensor:
+        hidden_states = self.layers(image_features)
+        return hidden_states  # type: ignore[no-any-return]
+class Maira2ForConditionalGeneration(LlavaForConditionalGeneration):
+    """
+    This model implements the multimodal model MAIRA-2. It consists of a vision backbone, a multimodal projector, and a
+    language model. The model can be used for grounded and ungrounded report generation tasks as well as phrase grounding.
+    This class inherits from `LlavaForConditionalGeneration`, defining a custom multimodal projector and changing image
+    feature selection.
+    """
+    config_class = Maira2Config
+    def __init__(self, config: Maira2Config) -> None:
+        # Require transformers >= 4.46.0.dev0. With older versions the model fails
+        # silently because get_image_features is not called in the forward pass.
+        check_min_version("4.46.0.dev0")
+        super(LlavaPreTrainedModel, self).__init__(config)
+        self.vision_tower = AutoBackbone.from_config(config.vision_config)
+        self.multi_modal_projector = Maira2MultiModalProjector(config)
+        self.vocab_size = config.text_config.vocab_size
+        self.language_model = AutoModelForCausalLM.from_config(
+            config.text_config,
+            attn_implementation=config._attn_implementation,
+        )
+        self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
+        self.post_init()
+    def get_image_features(
+        self, pixel_values: torch.FloatTensor, vision_feature_layer: int, vision_feature_select_strategy: str
+    ) -> torch.Tensor:
+        """
+        This method extracts the image features from the vision backbone using the specified feature layer and
+        selection strategy. This is custom to the MAIRA-2 model since we want to use the `feature_maps` from the
+        Dinov2Backbone class instead of the `hidden_states`, which are used in the default implementation of
+        `get_image_features` in LlavaForConditionalGeneration.
+        The feature_maps returned by Dinov2Backbone are the hidden_states with a layernorm applied to them.
+        """
+        image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
+        selected_image_feature = image_outputs.feature_maps[vision_feature_layer]
+        if vision_feature_select_strategy == "default":
+            selected_image_feature = selected_image_feature[:, 1:]
+        elif vision_feature_select_strategy == "full":
+            selected_image_feature = selected_image_feature
+        else:
+            raise ValueError(f"Unexpected select feature strategy: {vision_feature_select_strategy}")
+        image_features = self.multi_modal_projector(selected_image_feature)
+        return image_features  # type: ignore[no-any-return]
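
The projector and the feature selection in the file above can be illustrated without any deep learning dependency. The following is a minimal sketch, not the model code: `projector_layout` and `select_tokens` are hypothetical helpers, the activation name `"gelu"` and the hidden sizes 768 and 4096 are placeholder assumptions rather than values read from `config.json`, and `select_tokens` works on a single token sequence (the real method slices the token axis of a batched tensor).

```python
def projector_layout(
    n_layers: int, vision_hidden_size: int, text_hidden_size: int, act: str = "gelu"
) -> list[tuple]:
    """Describe the module sequence built by the projector: one linear layer, then (activation, linear) pairs."""
    if n_layers < 1:
        raise ValueError(f"Number of layers should be at least 1, got {n_layers=}")
    layout: list[tuple] = [("linear", vision_hidden_size, text_hidden_size)]
    for _ in range(n_layers - 1):
        layout.append(("act", act))
        layout.append(("linear", text_hidden_size, text_hidden_size))
    return layout


def select_tokens(tokens: list, strategy: str) -> list:
    """Mirror the selection strategies: "default" drops the first token, "full" keeps all tokens."""
    if strategy == "default":
        return tokens[1:]
    if strategy == "full":
        return tokens
    raise ValueError(f"Unexpected select feature strategy: {strategy}")


# Placeholder sizes (assumptions): 768 for the vision encoder, 4096 for the language model.
layout = projector_layout(n_layers=4, vision_hidden_size=768, text_hidden_size=4096)
for module in layout:
    print(module)

# The first token is presumably the class token; the rest are patch tokens.
print(select_tokens(["cls", "patch_0", "patch_1"], "default"))
```

With four layers the sequence holds four linear modules and three activations; only the first linear layer changes the feature width, and the rest operate at the text hidden size.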

preprocessor_config.json ADDED

	@@ -0,0 +1,28 @@

+{
+  "crop_size": {
+    "height": 518,
+    "width": 518
+  },
+  "do_center_crop": true,
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.5307,
+    0.5307,
+    0.5307
+  ],
+  "image_processor_type": "BitImageProcessor",
+  "image_std": [
+    0.2583,
+    0.2583,
+    0.2583
+  ],
+  "processor_class": "Maira2Processor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "shortest_edge": 518
+  }
+}
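
The `do_rescale` and `do_normalize` settings above amount to simple per-pixel arithmetic. As a rough sketch of that arithmetic only (resizing, cropping, and RGB conversion are left out), the snippet below applies the configured `rescale_factor`, `image_mean`, and `image_std` to single 8-bit pixel values. The function name `rescale_and_normalize` is hypothetical, and the assumption is that rescaling is applied before normalization, as is conventional for this kind of image processor.

```python
import math

# Values copied from preprocessor_config.json (the same value is used for all three channels).
RESCALE_FACTOR = 0.00392156862745098
IMAGE_MEAN = 0.5307
IMAGE_STD = 0.2583


def rescale_and_normalize(pixel: int) -> float:
    """Map an 8-bit pixel value to the normalized range: rescale to [0, 1], then standardize."""
    rescaled = pixel * RESCALE_FACTOR
    return (rescaled - IMAGE_MEAN) / IMAGE_STD


# The rescale factor is 1/255, so 8-bit pixel values land in [0, 1] before normalization.
print(math.isclose(RESCALE_FACTOR, 1 / 255))

low = rescale_and_normalize(0)
high = rescale_and_normalize(255)
print(low, high)
```

Because the mean is slightly above 0.5, the normalized range is not symmetric around zero: the darkest pixel sits further from zero than the brightest one.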

processing_maira2.py ADDED

	@@ -0,0 +1,646 @@

+#  Copyright 2024 Microsoft. All rights reserved.
+#  Licensed under the MSRLA License. See LICENSE in the repo root for license information.
+import re
+from typing import Any, TypeAlias
+import numpy as np
+from PIL import Image
+from transformers import BaseImageProcessor, LlavaProcessor, PreTrainedTokenizer
+from transformers.feature_extraction_utils import BatchFeature
+SingleChatMessageType: TypeAlias = dict[str, str | int | None]
+ChatMessageListType: TypeAlias = list[dict[str, str | list[SingleChatMessageType]]]
+BoxType: TypeAlias = tuple[float, float, float, float]
+class Maira2Processor(LlavaProcessor):
+    """
+    Constructs a Maira2 processor similar to LlavaProcessor but with additional arguments and functions to support
+    multi-image grounded and non-grounded radiology report generation.
+    In addition to the arguments of LlavaProcessor, Maira2Processor has the following extra arguments:
+    Args:
+        phrase_start_token (`str`, *optional*, defaults to `"<obj>"`):
+            Special token used to denote the start of a grounded phrase (with or without box).
+        phrase_end_token (`str`, *optional*, defaults to `"</obj>"`):
+            Special token used to denote the end of a grounded phrase.
+        box_start_token (`str`, *optional*, defaults to `"<box>"`):
+            Special token used to denote the start of a bounding box.
+        box_end_token (`str`, *optional*, defaults to `"</box>"`):
+            Special token used to denote the end of a bounding box.
+        num_box_coord_bins (`int`, *optional*, defaults to `100`):
+            Number of bins used to represent the bounding box coordinates.
+    """
+    valid_kwargs = [
+        "chat_template",
+        "patch_size",
+        "vision_feature_select_strategy",
+        "image_token",
+        "phrase_start_token",
+        "phrase_end_token",
+        "box_start_token",
+        "box_end_token",
+        "num_box_coord_bins",
+    ]
+    def __init__(
+        self,
+        image_processor: BaseImageProcessor = None,
+        tokenizer: PreTrainedTokenizer = None,
+        patch_size: int | None = None,
+        vision_feature_select_strategy: str | None = None,
+        chat_template: str | None = None,
+        image_token: str = "<image>",
+        phrase_start_token: str = "<obj>",
+        phrase_end_token: str = "</obj>",
+        box_start_token: str = "<box>",
+        box_end_token: str = "</box>",
+        num_box_coord_bins: int = 100,
+        **kwargs: Any,
+    ) -> None:
+        super().__init__(
+            image_processor=image_processor,
+            tokenizer=tokenizer,
+            patch_size=patch_size,
+            vision_feature_select_strategy=vision_feature_select_strategy,
+            chat_template=chat_template,
+            image_token=image_token,
+            **kwargs,
+        )
+        self.phrase_start_token = phrase_start_token
+        self.phrase_end_token = phrase_end_token
+        self.box_start_token = box_start_token
+        self.box_end_token = box_end_token
+        self.num_box_coord_bins = num_box_coord_bins
+    @staticmethod
+    def _normalize_image(image: Image.Image) -> Image.Image:
+        """
+        This function normalizes the input image to have pixel values in the range [0, 255].
+        Args:
+            image (Image.Image):
+                The input image to be normalized.
+        Returns:
+            Image.Image: The normalized image in grayscale.
+        """
+        image_np = np.array(image.convert("L"))
+        image_np = image_np.astype(float)
+        image_np -= image_np.min()
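+        max_value = image_np.max()
+        if max_value > 0:  # Skip the division for constant images to avoid dividing by zero
+            image_np /= max_value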
+        image_np *= 255
+        image_np = image_np.astype(np.uint8)
+        return Image.fromarray(image_np).convert("L")
+    def _normalize_and_stack_images(
+        self,
+        current_frontal: Image.Image,
+        current_lateral: Image.Image | None,
+        prior_frontal: Image.Image | None,
+    ) -> list[Image.Image]:
+        """
+        This function normalizes the input images and stacks them together. The images are stacked in the order of
+        current_frontal, current_lateral, and prior_frontal. The order of images is important, since it must match the
+        order of the images in the prompt, which is frontal, then lateral then prior.
+        Args:
+            current_frontal (Image.Image):
+                The current frontal image.
+            current_lateral (Image.Image | None):
+                The current lateral image.
+            prior_frontal (Image.Image | None):
+                The prior frontal image.
+        Returns:
+            list[Image.Image]: The normalized images stacked together.
+        """
+        images = [self._normalize_image(current_frontal)]
+        if current_lateral is not None:
+            images.append(self._normalize_image(current_lateral))
+        if prior_frontal is not None:
+            images.append(self._normalize_image(prior_frontal))
+        return images
+    @staticmethod
+    def _get_section_text_or_missing_text(section: str | None) -> str:
+        """
+        This function returns the input section text if it is not None and not empty, otherwise it returns a missing
+        section text "N/A".
+        Args:
+            section (str | None):
+                The input section text.
+        Returns:
+            str: The section text if it is not None and not empty, otherwise "N/A".
+        """
+        missing_section_text = "N/A"
+        if not isinstance(section, str) or len(section) == 0:
+            return missing_section_text
+        return section
+    @staticmethod
+    def _construct_image_chat_messages_for_reporting(has_prior: bool, has_lateral: bool) -> list[SingleChatMessageType]:
+        """
+        This function constructs user chat messages based on the presence of the prior and lateral images.
+        Args:
+            has_prior (bool):
+                A boolean indicating whether the prior image is present.
+            has_lateral (bool):
+                A boolean indicating whether the lateral image is present.
+        Returns:
+            list[SingleChatMessageType]: The image prompt messages in the form of a list of dictionaries.
+        Example:
+        ```python
+        >>> _construct_image_chat_messages_for_reporting(has_prior=True, has_lateral=True)
+        >>> # [
+        >>> #     {"index": None, "text": "Given the current frontal image", "type": "text"},
+        >>> #     {"index": 0, "text": None, "type": "image"},
+        >>> #     {"index": None, "text": " the current lateral image", "type": "text"},
+        >>> #     {"index": 1, "text": None, "type": "image"},
+        >>> #     {"index": None, "text": " and the prior frontal image", "type": "text"},
+        >>> #     {"index": 2, "text": None, "type": "image"},
+        >>> # ]
+        ```
+        """
+        def _add_single_image_to_chat_messages(prompt_text: str, image_index: int) -> None:
+            image_prompt.extend(
+                [
+                    {"index": None, "text": prompt_text, "type": "text"},
+                    {"index": image_index, "text": None, "type": "image"},
+                ]
+            )
+        image_prompt: list[SingleChatMessageType] = []
+        image_index = 0
+        if not has_prior and not has_lateral:
+            _add_single_image_to_chat_messages("Given the current frontal image only", image_index)
+        else:
+            _add_single_image_to_chat_messages("Given the current frontal image", image_index)
+            image_index += 1
+            if has_prior:
+                if has_lateral:
+                    _add_single_image_to_chat_messages(" the current lateral image", image_index)
+                    image_index += 1
+                _add_single_image_to_chat_messages(" and the prior frontal image", image_index)
+            else:
+                if has_lateral:
+                    _add_single_image_to_chat_messages(" and the current lateral image", image_index)
+        return image_prompt
+    def _construct_chat_messages_reporting(
+        self,
+        has_prior: bool,
+        has_lateral: bool,
+        indication: str | None,
+        technique: str | None,
+        comparison: str | None,
+        prior_report: str | None,
+        get_grounding: bool = False,
+        assistant_text: str | None = None,
+    ) -> ChatMessageListType:
+        """
+        This function constructs the chat messages for reporting used in the grounded and non-grounded reporting tasks.
+        Args:
+            has_prior (bool):
+                A boolean indicating whether the prior image is present.
+            has_lateral (bool):
+                A boolean indicating whether the lateral image is present.
+            indication (str | None):
+                The indication section text.
+            technique (str | None):
+                The technique section text.
+            comparison (str | None):
+                The comparison section text.
+            prior_report (str | None):
+                The prior report section text.
+            get_grounding (bool):
+                A boolean indicating whether to get the grounding information.
+            assistant_text (str | None):
+                The assistant text (can be set to None for ordinary inference).
+        Returns:
+            ChatMessageListType: The chat messages for reporting in the form of a list of dictionaries.
+        Example:
+        ```python
+        >>> _construct_chat_messages_reporting(
+        >>>     has_prior=True,
+        >>>     has_lateral=True,
+        >>>     indication="indication text from report goes here",
+        >>>     technique="technique text from report goes here",
+        >>>     comparison="comparison text from report goes here",
+        >>>     prior_report="prior reporting text goes here",
+        >>>     get_grounding=False,
+        >>>     assistant_text=None,
+        >>> )
+        >>> # [
+        >>> #     {
+        >>> #         "content": [
+        >>> #             {"index": None, "text": "Given the current frontal image", "type": "text"},
+        >>> #             {"index": 0, "text": None, "type": "image"},
+        >>> #             {"index": None, "text": " the current lateral image", "type": "text"},
+        >>> #             {"index": 1, "text": None, "type": "image"},
+        >>> #             {"index": None, "text": " and the prior frontal image", "type": "text"},
+        >>> #             {"index": 2, "text": None, "type": "image"},
+        >>> #             {"index": None, "text": " PRIOR_REPORT: prior reporting text goes here", "type": "text"},
+        >>> #             {"index": None, "text": " Provide a description of the findings in the radiology study in "
+        >>> #             "comparison to the prior frontal image. INDICATION: indication text from report goes here "
+        >>> #             "TECHNIQUE: technique text from report goes here COMPARISON: comparison text from report "
+        >>> #             "goes here", "type": "text"},
+        >>> #         ],
+        >>> #         "role": "user",
+        >>> #     }
+        >>> # ]
+        ```
+        """
+        indication = self._get_section_text_or_missing_text(indication)
+        technique = self._get_section_text_or_missing_text(technique)
+        comparison = self._get_section_text_or_missing_text(comparison)
+        prior_report = self._get_section_text_or_missing_text(prior_report)
+        prompt = self._construct_image_chat_messages_for_reporting(has_prior=has_prior, has_lateral=has_lateral)
+        if has_prior:
+            prompt.append({"index": None, "text": f" PRIOR_REPORT: {prior_report}", "type": "text"})
+        if get_grounding:
+            prompt.append(
+                {
+                    "index": None,
+                    "text": " Provide a description of the findings in the radiology study in comparison to the "
+                    "prior frontal image. Each finding should be described as a self-contained plain-text sentence."
+                    " If the finding is groundable, locate the finding in the current frontal chest X-ray image, "
+                    "with bounding boxes indicating all locations where it can be seen in the current frontal "
+                    "image. Otherwise, generate just the ungrounded finding without bounding boxes. INDICATION: "
+                    f"{indication} TECHNIQUE: {technique} COMPARISON: {comparison}",
+                    "type": "text",
+                }
+            )
+        else:
+            prompt.append(
+                {
+                    "index": None,
+                    "text": " Provide a description of the findings in the radiology study in comparison to the "
+                    f"prior frontal image. INDICATION: {indication} TECHNIQUE: {technique} COMPARISON: "
+                    f"{comparison}",
+                    "type": "text",
+                }
+            )
+        messages: ChatMessageListType = [{"content": prompt, "role": "user"}]
+        if assistant_text is not None:
+            messages.append({"content": [{"index": None, "text": assistant_text, "type": "text"}], "role": "assistant"})
+        return messages
+    def _construct_chat_messages_phrase_grounding(
+        self, phrase: str, assistant_text: str | None = None
+    ) -> ChatMessageListType:
+        """
+        This function constructs the chat messages for the phrase grounding task.
+        Args:
+            phrase (str):
+                The phrase to be grounded.
+            assistant_text (str | None):
+                The assistant text (can be set to None for ordinary inference).
+        Returns:
+            ChatMessageListType: The chat messages for phrase grounding in the form of a list of dictionaries.
+        """
+        prompt: list[SingleChatMessageType] = [
+            {"index": None, "text": "Given the current frontal image", "type": "text"},
+            {"index": 0, "text": None, "type": "image"},
+            {
+                "index": None,
+                "text": " Repeat the following finding as a grounded phrase with bounding boxes indicating all "
+                f"locations where it can be seen in the given chest X-ray image. Finding: {phrase}",
+                "type": "text",
+            },
+        ]
+        messages: ChatMessageListType = [{"content": prompt, "role": "user"}]
+        if assistant_text is not None:
+            messages.append({"content": [{"index": None, "text": assistant_text, "type": "text"}], "role": "assistant"})
+        return messages
+    def format_reporting_input(
+        self,
+        current_frontal: Image.Image,
+        current_lateral: Image.Image | None,
+        prior_frontal: Image.Image | None,
+        indication: str | None,
+        technique: str | None,
+        comparison: str | None,
+        prior_report: str | None,
+        get_grounding: bool = False,
+        assistant_text: str | None = None,
+    ) -> tuple[str, list[Image.Image]]:
+        """
+        This function formats the reporting prompt for the grounded and non-grounded reporting tasks from the given
+        input images and text sections. The images are normalized and stacked together in the right order.
+        Args:
+            current_frontal (Image.Image):
+                The current frontal image.
+            current_lateral (Image.Image | None):
+                The current lateral image.
+            prior_frontal (Image.Image | None):
+                The prior frontal image.
+            indication (str | None):
+                The indication section text.
+            technique (str | None):
+                The technique section text.
+            comparison (str | None):
+                The comparison section text.
+            prior_report (str | None):
+                The prior report section text.
+            get_grounding (bool):
+                A boolean indicating whether to construct the prompt for grounded or non-grounded reporting.
+            assistant_text (str | None):
+                The assistant text (can be set to None for ordinary inference).
+        Returns:
+            tuple[str, list[Image.Image]]: The formatted prompt text and the normalized images stacked in the right order.
+        """
+        images = self._normalize_and_stack_images(
+            current_frontal=current_frontal,
+            current_lateral=current_lateral,
+            prior_frontal=prior_frontal,
+        )
+        messages = self._construct_chat_messages_reporting(
+            has_prior=prior_frontal is not None,
+            has_lateral=current_lateral is not None,
+            indication=indication,
+            technique=technique,
+            comparison=comparison,
+            prior_report=prior_report,
+            get_grounding=get_grounding,
+            assistant_text=assistant_text,
+        )
+        add_generation_prompt = assistant_text is None
+        text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=add_generation_prompt, tokenize=False)
+        return text, images
+    def format_phrase_grounding_input(
+        self,
+        frontal_image: Image.Image,
+        phrase: str,
+        assistant_text: str | None = None,
+    ) -> tuple[str, list[Image.Image]]:
+        """
+        This function formats the phrase grounding prompt for the phrase grounding task from the given input
+        image and phrase.
+        Args:
+            frontal_image (Image.Image):
+                The frontal image.
+            phrase (str):
+                The phrase to be grounded.
+            assistant_text (str | None):
+                The assistant text (can be set to None for ordinary inference).
+        Returns:
+            tuple[str, list[Image.Image]]: The formatted phrase grounding prompt text and the normalized image.
+        """
+        images = self._normalize_and_stack_images(
+            current_frontal=frontal_image,
+            current_lateral=None,
+            prior_frontal=None,
+        )
+        messages = self._construct_chat_messages_phrase_grounding(phrase, assistant_text=assistant_text)
+        add_generation_prompt = assistant_text is None
+        text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=add_generation_prompt, tokenize=False)
+        return text, images
+    def format_and_preprocess_reporting_input(
+        self,
+        current_frontal: Image.Image,
+        current_lateral: Image.Image | None,
+        prior_frontal: Image.Image | None,
+        indication: str | None,
+        technique: str | None,
+        comparison: str | None,
+        prior_report: str | None,
+        get_grounding: bool = False,
+        assistant_text: str | None = None,
+        **kwargs: Any,
+    ) -> BatchFeature:
+        """
+        This function formats and then preprocesses the input for the grounded and non-grounded reporting tasks from
+        the given input images and text sections and returns the batch feature for the model. It calls format_reporting_input
+        internally to format the input prompt and stack the images together in the right order.
+        Args:
+            current_frontal (Image.Image):
+                The current frontal image.
+            current_lateral (Image.Image | None):
+                The current lateral image.
+            prior_frontal (Image.Image | None):
+                The prior frontal image.
+            indication (str | None):
+                The indication section text.
+            technique (str | None):
+                The technique section text.
+            comparison (str | None):
+                The comparison section text.
+            prior_report (str | None):
+                The prior report section text.
+            get_grounding (bool):
+                A boolean indicating whether to preprocess the input for grounded or non-grounded reporting.
+            assistant_text (str | None):
+                The assistant text (can be set to None for ordinary inference).
+        Returns:
+            BatchFeature: The batch feature for the model, ready to be passed to the model.
+        """
+        text, images = self.format_reporting_input(
+            current_frontal=current_frontal,
+            current_lateral=current_lateral,
+            prior_frontal=prior_frontal,
+            indication=indication,
+            technique=technique,
+            comparison=comparison,
+            prior_report=prior_report,
+            get_grounding=get_grounding,
+            assistant_text=assistant_text,
+        )
+        return self(text=text, images=images, **kwargs)
+    def format_and_preprocess_phrase_grounding_input(
+        self,
+        frontal_image: Image.Image,
+        phrase: str,
+        assistant_text: str | None = None,
+        **kwargs: Any,
+    ) -> BatchFeature:
+        """
+        This function formats and then preprocesses the input for the phrase grounding task from the given input image and
+        phrase and returns the batch feature for the model. It calls format_phrase_grounding_input internally to format
+        the input prompt and normalize the image.
+        Args:
+            frontal_image (Image.Image):
+                The frontal image.
+            phrase (str):
+                The phrase to be grounded.
+            assistant_text (str | None):
+                The assistant text (can be set to None for ordinary inference).
+        Returns:
+            BatchFeature: The batch feature for the model, ready to be passed to the model.
+        """
+        text, images = self.format_phrase_grounding_input(
+            frontal_image=frontal_image,
+            phrase=phrase,
+            assistant_text=assistant_text,
+        )
+        return self(text=text, images=images, **kwargs)
+    def _get_text_between_delimiters(self, text: str, begin_token: str, end_token: str) -> list[str]:
+        """
+        This function splits the input text into a list of substrings based on the given begin and end tokens.
+        Args:
+            text (str):
+                The input text to be split.
+            begin_token (str):
+                The begin token.
+            end_token (str):
+                The end token.
+        Returns:
+            list[str]: The list of substrings between the given begin and end tokens.
+        Example:
+        ```python
+        >>> _get_text_between_delimiters("<obj>This is a grounded phrase</obj><obj>This is another grounded phrase</obj>", "<obj>", "</obj>")
+        >>> # ["This is a grounded phrase", "This is another grounded phrase"]
+        >>> _get_text_between_delimiters("<box><x10><y20><x30><y40></box><box><x50><y60><x70><y80></box>", "<box>", "</box>")
+        >>> # ["<x10><y20><x30><y40>", "<x50><y60><x70><y80>"]
+        ```
+        """
+        split_text = []
+        while begin_token in text:
+            assert text.startswith(begin_token)
+            end_index = text.find(end_token)
+            assert end_index != -1
+            split_text.append(text[len(begin_token) : end_index])
+            text = text[end_index + len(end_token) :]
+        assert len(text) == 0
+        return split_text
+    def convert_output_to_plaintext_or_grounded_sequence(
+        self, text: str
+    ) -> str | list[tuple[str, list[BoxType] | None]]:
+        """
+        This function converts the input text to a grounded sequence by extracting the grounded phrases and bounding
+        boxes from the text. If the text is plaintext without any grounded phrases, it returns the text as is.
+        Args:
+            text (str):
+                The input text to be converted.
+        Returns:
+            str | list[tuple[str, list[BoxType] | None]]: The grounded sequence.
+        Example:
+        ```python
+        >>> convert_output_to_plaintext_or_grounded_sequence("<obj>grounded phrase <box><x55><y45><x70><y56></box></obj><obj>ungrounded phrase</obj>")
+        >>> # [
+        >>> #     ("grounded phrase", [(0.555, 0.455, 0.705, 0.565)]),
+        >>> #     ("ungrounded phrase", None),
+        >>> # ]
+        >>> convert_output_to_plaintext_or_grounded_sequence("plain text")
+        >>> # "plain text"
+        ```
+        """
+        text = text.strip()
+        # Plain text
+        if not any(
+            [
+                self.phrase_start_token in text,
+                self.phrase_end_token in text,
+                self.box_start_token in text,
+                self.box_end_token in text,
+            ]
+        ):
+            return text
+        # One or more grounded phrases
+        grounded_phrase_texts = self._get_text_between_delimiters(text, self.phrase_start_token, self.phrase_end_token)
+        grounded_phrases: list[tuple[str, list[BoxType] | None]] = []
+        for grounded_phrase_text in grounded_phrase_texts:
+            if self.box_start_token in grounded_phrase_text or self.box_end_token in grounded_phrase_text:
+                first_box_start_index = grounded_phrase_text.find(self.box_start_token)
+                phrase_text = grounded_phrase_text[:first_box_start_index].strip()
+                boxes_text = grounded_phrase_text[first_box_start_index:]
+                boxes_text_list = self._get_text_between_delimiters(
+                    boxes_text, self.box_start_token, self.box_end_token
+                )
+                boxes: list[BoxType] = []
+                for box_text in boxes_text_list:
+                    # extract from <x_><y_><x_><y_>
+                    regex = r"<x(\d+?)><y(\d+?)><x(\d+?)><y(\d+?)>"
+                    match = re.search(regex, box_text)
+                    if match:
+                        x_min, y_min, x_max, y_max = match.groups()
+                        box: BoxType = tuple(  # type: ignore[assignment]
+                            (int(coord) + 0.5) / self.num_box_coord_bins for coord in (x_min, y_min, x_max, y_max)
+                        )
+                        assert all(0 <= coord <= 1 for coord in box), f"Invalid box coordinates: {box}"
+                        boxes.append(box)
+                    else:
+                        raise ValueError(f"Invalid box coordinates: {box_text} not matching regex {regex}")
+                grounded_phrases.append((phrase_text, boxes))
+            else:
+                grounded_phrases.append((grounded_phrase_text.lstrip(), None))
+        return grounded_phrases
+    @staticmethod
+    def adjust_box_for_original_image_size(box: BoxType, width: int, height: int) -> BoxType:
+        """
+        This function adjusts the bounding boxes from the MAIRA-2 model output to account for the image processor
+        cropping the image to be square prior to the model forward pass. The box coordinates are adjusted to be
+        relative to the original shape of the image assuming the image processor cropped the image based on the length
+        of the shortest side.
+        Args:
+            box (BoxType):
+                The box to be adjusted, normalised to (0, 1).
+            width (int):
+                Original width of the image, in pixels.
+            height (int):
+                Original height of the image, in pixels.
+        Returns:
+            BoxType: The box normalised relative to the original size of the image.
+        """
+        crop_width = crop_height = min(width, height)
+        x_offset = (width - crop_width) // 2
+        y_offset = (height - crop_height) // 2
+        norm_x_min, norm_y_min, norm_x_max, norm_y_max = box
+        abs_x_min = int(norm_x_min * crop_width + x_offset)
+        abs_x_max = int(norm_x_max * crop_width + x_offset)
+        abs_y_min = int(norm_y_min * crop_height + y_offset)
+        abs_y_max = int(norm_y_max * crop_height + y_offset)
+        adjusted_norm_x_min = abs_x_min / width
+        adjusted_norm_x_max = abs_x_max / width
+        adjusted_norm_y_min = abs_y_min / height
+        adjusted_norm_y_max = abs_y_max / height
+        return (adjusted_norm_x_min, adjusted_norm_y_min, adjusted_norm_x_max, adjusted_norm_y_max)
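
The crop adjustment in `adjust_box_for_original_image_size` can be checked by hand. The sketch below restates the same arithmetic as a standalone function (the name `adjust_box_sketch` is hypothetical and not part of this commit) and assumes, as the docstring does, that the image processor center-crops to a square whose side is the shorter image dimension.

```python
def adjust_box_sketch(box, width, height):
    """Map a box normalised to the square crop back to the original image.

    Standalone restatement of the arithmetic in
    adjust_box_for_original_image_size; the function name is illustrative only.
    """
    crop_size = min(width, height)        # side of the assumed center crop
    x_offset = (width - crop_size) // 2   # pixels cropped from the left
    y_offset = (height - crop_size) // 2  # pixels cropped from the top
    x_min, y_min, x_max, y_max = box
    # Convert to absolute pixels in the original image, then renormalise.
    abs_x_min = int(x_min * crop_size + x_offset)
    abs_x_max = int(x_max * crop_size + x_offset)
    abs_y_min = int(y_min * crop_size + y_offset)
    abs_y_max = int(y_max * crop_size + y_offset)
    return (abs_x_min / width, abs_y_min / height, abs_x_max / width, abs_y_max / height)


# A 2000x1000 landscape image: the crop is 1000x1000 with 500 px removed on each side.
landscape_box = adjust_box_sketch((0.25, 0.5, 0.75, 1.0), width=2000, height=1000)
# A square image needs no offset, so the box is unchanged.
square_box = adjust_box_sketch((0.25, 0.5, 0.75, 1.0), width=1000, height=1000)
print(landscape_box, square_box)
```

For the landscape case only the x coordinates move, because the crop removes columns and keeps every row.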

processor_config.json ADDED Viewed

	@@ -0,0 +1,14 @@

+{
+  "auto_map": {
+    "AutoProcessor": "processing_maira2.Maira2Processor"
+  },
+  "box_end_token": "</box>",
+  "box_start_token": "<box>",
+  "image_token": "<image>",
+  "num_box_coord_bins": 100,
+  "patch_size": 14,
+  "phrase_end_token": "</obj>",
+  "phrase_start_token": "<obj>",
+  "processor_class": "Maira2Processor",
+  "vision_feature_select_strategy": "default"
+}
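
As a rough illustration of how `num_box_coord_bins` and the box tokens fit together, the sketch below decodes one `<box>…</box>` span the way `convert_output_to_plaintext_or_grounded_sequence` does: each bin index is mapped to the center of its bin, `(index + 0.5) / num_box_coord_bins`. The helper name `decode_box` and the module-level constant are assumptions made for this example; they are not defined in the commit.

```python
import re

# Value of "num_box_coord_bins" in processor_config.json.
NUM_BOX_COORD_BINS = 100
# Four coordinate tokens in the order x_min, y_min, x_max, y_max.
BOX_PATTERN = re.compile(r"<x(\d+)><y(\d+)><x(\d+)><y(\d+)>")


def decode_box(box_text, num_bins=NUM_BOX_COORD_BINS):
    """Return (x_min, y_min, x_max, y_max) in [0, 1] for one box span.

    Hypothetical helper: mirrors the bin-center formula used by the processor.
    """
    match = BOX_PATTERN.search(box_text)
    if match is None:
        raise ValueError(f"Invalid box coordinates: {box_text}")
    return tuple((int(coord) + 0.5) / num_bins for coord in match.groups())


decoded = decode_box("<box><x55><y45><x70><y56></box>")
print(decoded)
```

With 100 bins the decoded coordinates sit at bin centers, so `<x55>` becomes 0.555 rather than 0.55, and the largest bin index (99) maps to 0.995.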

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,30 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer.model ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
+size 499723

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,1701 @@

+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "add_prefix_space": true,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "32000": {
+      "content": "<obj>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32001": {
+      "content": "</obj>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32002": {
+      "content": "<x0>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32003": {
+      "content": "<x1>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32004": {
+      "content": "<x2>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32005": {
+      "content": "<x3>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32006": {
+      "content": "<x4>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32007": {
+      "content": "<x5>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32008": {
+      "content": "<x6>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32009": {
+      "content": "<x7>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32010": {
+      "content": "<x8>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32011": {
+      "content": "<x9>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32012": {
+      "content": "<x10>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32013": {
+      "content": "<x11>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32014": {
+      "content": "<x12>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32015": {
+      "content": "<x13>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32016": {
+      "content": "<x14>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32017": {
+      "content": "<x15>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32018": {
+      "content": "<x16>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32019": {
+      "content": "<x17>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32020": {
+      "content": "<x18>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32021": {
+      "content": "<x19>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32022": {
+      "content": "<x20>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32023": {
+      "content": "<x21>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32024": {
+      "content": "<x22>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32025": {
+      "content": "<x23>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32026": {
+      "content": "<x24>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32027": {
+      "content": "<x25>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32028": {
+      "content": "<x26>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32029": {
+      "content": "<x27>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32030": {
+      "content": "<x28>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32031": {
+      "content": "<x29>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32032": {
+      "content": "<x30>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32033": {
+      "content": "<x31>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32034": {
+      "content": "<x32>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32035": {
+      "content": "<x33>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32036": {
+      "content": "<x34>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32037": {
+      "content": "<x35>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32038": {
+      "content": "<x36>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32039": {
+      "content": "<x37>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32040": {
+      "content": "<x38>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32041": {
+      "content": "<x39>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32042": {
+      "content": "<x40>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32043": {
+      "content": "<x41>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32044": {
+      "content": "<x42>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32045": {
+      "content": "<x43>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32046": {
+      "content": "<x44>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32047": {
+      "content": "<x45>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32048": {
+      "content": "<x46>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32049": {
+      "content": "<x47>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32050": {
+      "content": "<x48>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32051": {
+      "content": "<x49>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32052": {
+      "content": "<x50>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32053": {
+      "content": "<x51>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32054": {
+      "content": "<x52>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32055": {
+      "content": "<x53>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32056": {
+      "content": "<x54>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32057": {
+      "content": "<x55>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32058": {
+      "content": "<x56>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32059": {
+      "content": "<x57>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32060": {
+      "content": "<x58>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32061": {
+      "content": "<x59>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32062": {
+      "content": "<x60>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32063": {
+      "content": "<x61>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32064": {
+      "content": "<x62>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32065": {
+      "content": "<x63>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32066": {
+      "content": "<x64>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32067": {
+      "content": "<x65>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32068": {
+      "content": "<x66>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32069": {
+      "content": "<x67>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32070": {
+      "content": "<x68>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32071": {
+      "content": "<x69>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32072": {
+      "content": "<x70>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32073": {
+      "content": "<x71>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32074": {
+      "content": "<x72>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32075": {
+      "content": "<x73>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32076": {
+      "content": "<x74>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32077": {
+      "content": "<x75>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32078": {
+      "content": "<x76>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32079": {
+      "content": "<x77>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32080": {
+      "content": "<x78>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32081": {
+      "content": "<x79>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32082": {
+      "content": "<x80>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32083": {
+      "content": "<x81>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32084": {
+      "content": "<x82>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32085": {
+      "content": "<x83>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32086": {
+      "content": "<x84>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32087": {
+      "content": "<x85>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32088": {
+      "content": "<x86>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32089": {
+      "content": "<x87>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32090": {
+      "content": "<x88>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32091": {
+      "content": "<x89>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32092": {
+      "content": "<x90>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32093": {
+      "content": "<x91>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32094": {
+      "content": "<x92>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32095": {
+      "content": "<x93>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32096": {
+      "content": "<x94>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32097": {
+      "content": "<x95>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32098": {
+      "content": "<x96>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32099": {
+      "content": "<x97>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32100": {
+      "content": "<x98>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32101": {
+      "content": "<x99>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32102": {
+      "content": "<y0>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32103": {
+      "content": "<y1>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32104": {
+      "content": "<y2>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32105": {
+      "content": "<y3>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32106": {
+      "content": "<y4>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32107": {
+      "content": "<y5>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32108": {
+      "content": "<y6>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32109": {
+      "content": "<y7>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32110": {
+      "content": "<y8>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32111": {
+      "content": "<y9>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32112": {
+      "content": "<y10>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32113": {
+      "content": "<y11>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32114": {
+      "content": "<y12>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32115": {
+      "content": "<y13>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32116": {
+      "content": "<y14>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32117": {
+      "content": "<y15>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32118": {
+      "content": "<y16>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32119": {
+      "content": "<y17>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32120": {
+      "content": "<y18>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32121": {
+      "content": "<y19>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32122": {
+      "content": "<y20>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32123": {
+      "content": "<y21>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32124": {
+      "content": "<y22>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32125": {
+      "content": "<y23>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32126": {
+      "content": "<y24>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32127": {
+      "content": "<y25>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32128": {
+      "content": "<y26>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32129": {
+      "content": "<y27>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32130": {
+      "content": "<y28>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32131": {
+      "content": "<y29>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32132": {
+      "content": "<y30>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32133": {
+      "content": "<y31>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32134": {
+      "content": "<y32>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32135": {
+      "content": "<y33>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32136": {
+      "content": "<y34>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32137": {
+      "content": "<y35>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32138": {
+      "content": "<y36>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32139": {
+      "content": "<y37>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32140": {
+      "content": "<y38>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32141": {
+      "content": "<y39>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32142": {
+      "content": "<y40>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32143": {
+      "content": "<y41>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32144": {
+      "content": "<y42>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32145": {
+      "content": "<y43>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32146": {
+      "content": "<y44>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32147": {
+      "content": "<y45>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32148": {
+      "content": "<y46>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32149": {
+      "content": "<y47>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32150": {
+      "content": "<y48>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32151": {
+      "content": "<y49>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32152": {
+      "content": "<y50>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32153": {
+      "content": "<y51>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32154": {
+      "content": "<y52>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32155": {
+      "content": "<y53>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32156": {
+      "content": "<y54>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32157": {
+      "content": "<y55>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32158": {
+      "content": "<y56>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32159": {
+      "content": "<y57>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32160": {
+      "content": "<y58>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32161": {
+      "content": "<y59>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32162": {
+      "content": "<y60>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32163": {
+      "content": "<y61>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32164": {
+      "content": "<y62>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32165": {
+      "content": "<y63>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32166": {
+      "content": "<y64>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32167": {
+      "content": "<y65>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32168": {
+      "content": "<y66>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32169": {
+      "content": "<y67>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32170": {
+      "content": "<y68>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32171": {
+      "content": "<y69>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32172": {
+      "content": "<y70>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32173": {
+      "content": "<y71>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32174": {
+      "content": "<y72>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32175": {
+      "content": "<y73>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32176": {
+      "content": "<y74>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32177": {
+      "content": "<y75>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32178": {
+      "content": "<y76>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32179": {
+      "content": "<y77>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32180": {
+      "content": "<y78>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32181": {
+      "content": "<y79>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32182": {
+      "content": "<y80>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32183": {
+      "content": "<y81>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32184": {
+      "content": "<y82>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32185": {
+      "content": "<y83>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32186": {
+      "content": "<y84>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32187": {
+      "content": "<y85>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32188": {
+      "content": "<y86>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32189": {
+      "content": "<y87>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32190": {
+      "content": "<y88>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32191": {
+      "content": "<y89>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32192": {
+      "content": "<y90>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32193": {
+      "content": "<y91>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32194": {
+      "content": "<y92>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32195": {
+      "content": "<y93>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32196": {
+      "content": "<y94>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32197": {
+      "content": "<y95>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32198": {
+      "content": "<y96>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32199": {
+      "content": "<y97>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32200": {
+      "content": "<y98>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32201": {
+      "content": "<y99>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32202": {
+      "content": "<box>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32203": {
+      "content": "</box>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32204": {
+      "content": "<image>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32205": {
+      "content": "<prev_im>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "32206": {
+      "content": "<lat_image>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    }
+  },
+  "bos_token": "<s>",
+  "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}You are an expert radiology assistant tasked with interpreting a chest X-ray study.  {% for message in messages %}{% if message[\"role\"] == \"user\" %}USER:  {% else %}ASSISTANT: {% endif %}{% for item in message[\"content\"] %}{% if item[\"type\"] == \"text\" %}{{ item[\"text\"] }}{% elif item[\"type\"] == \"image\" %}<image>{% endif %}{% endfor %}{% if message[\"role\"] == \"user\" %}  {% else %}{{eos_token}}{% endif %}{% endfor %}{% if add_generation_prompt %}ASSISTANT: {% endif %}",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "legacy": false,
+  "model_max_length": 4096,
+  "pad_token": "<unk>",
+  "padding_side": "left",
+  "processor_class": "Maira2Processor",
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": false
+}
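
The `chat_template` and the added-token entries in this file can be read as a small prompt-layout specification. The sketch below is a plain-Python mirror of that template and of the ID layout visible in the entries above. It is not part of the commit and does not use the `Maira2Processor` API. The names `render_chat`, `coord_token_id`, `X_BASE` and `Y_BASE` are hypothetical. The whitespace is copied as it appears in the template string here, and the `<x0>` base ID is an assumption extrapolated from the listed `<x72>`..`<x99>` entries.

```python
# Illustrative sketch only: mirrors the Jinja chat_template in this config by hand.
SYSTEM = (
    "You are an expert radiology assistant tasked with interpreting "
    "a chest X-ray study.  "
)
EOS_TOKEN = "</s>"  # value of "eos_token" in this config

# IDs read off the entries above: <x72>..<x99> -> 32074..32101,
# <y0>..<y99> -> 32102..32201.
# Assumption: the x range continues down to <x0> at 32002 (not shown in this part).
X_BASE = 32002
Y_BASE = 32102


def render_chat(messages, add_generation_prompt=False):
    """Build the prompt string the way the template's loops and branches do."""
    parts = [SYSTEM]
    for message in messages:
        is_user = message["role"] == "user"
        parts.append("USER:  " if is_user else "ASSISTANT: ")
        for item in message["content"]:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                parts.append("<image>")
        # User turns end with two spaces; any other role ends with the EOS token.
        parts.append("  " if is_user else EOS_TOKEN)
    if add_generation_prompt:
        parts.append("ASSISTANT: ")
    return "".join(parts)


def coord_token_id(axis, index):
    """Return the vocabulary ID of <x{index}> or <y{index}> under the layout above."""
    if axis not in ("x", "y") or not 0 <= index <= 99:
        raise ValueError("axis must be 'x' or 'y' and index must be in 0..99")
    return (X_BASE if axis == "x" else Y_BASE) + index


if __name__ == "__main__":
    chat = [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "Describe the study."},
            ],
        }
    ]
    print(repr(render_chat(chat, add_generation_prompt=True)))
    print(coord_token_id("y", 0))
```

In practice the template would be rendered by the tokenizer or processor itself. The hand-written version is only meant to make the turn separators and the position of the `<image>` placeholder easy to inspect.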