gnedivad committed · verified
Commit d4394af · 1 Parent(s): 6051590

Upload folder using huggingface_hub
LICENSE ADDED
@@ -0,0 +1,36 @@
+ MICROSOFT RESEARCH LICENSE TERMS
+ IF YOU LIVE IN THE UNITED STATES, PLEASE READ THE “BINDING ARBITRATION AND CLASS ACTION WAIVER” SECTION BELOW. IT AFFECTS HOW DISPUTES ARE RESOLVED.
+ These license terms are an agreement between you and Microsoft Corporation (or one of its affiliates). They apply to the source code, object code, machine learning models, or data (collectively “Materials”) that accompany this license. IF YOU COMPLY WITH THESE LICENSE TERMS, YOU HAVE THE RIGHTS BELOW. BY USING THE MATERIALS, YOU ACCEPT THESE TERMS.
+ 1) INSTALLATION AND USE RIGHTS TO THE MATERIALS.
+ Subject to the terms of this agreement, you have the below rights, if applicable, to use the Materials solely for non-commercial, non-revenue generating, research purposes:
+ a) Source Code. If source code is included, you may use and modify the source code, but you may not distribute the source code.
+ b) Object Code. If object code is included, you may use the object code, but you may not distribute the object code.
+ c) Models. If machine learning model(s) are included, you may use the model(s), but you may not distribute the models.
+ d) Data. If data is included, you may use and modify the data, but your use and modification must be consistent with the consent under which the data was provided and/or gathered and you may not distribute the data or your modifications to the data.
+ 2) SCOPE OF LICENSE. The Materials are licensed, not sold. Microsoft reserves all other rights. Unless applicable law gives you more rights despite this limitation, you will not (and have no right to):
+ a) work around any technical limitations in the Materials that only allow you to use it in certain ways;
+ b) reverse engineer, decompile or disassemble the Materials;
+ c) remove, minimize, block, or modify any notices of Microsoft or its suppliers in the Materials;
+ d) use the Materials in any way that is against the law or to create or propagate malware; or
+ e) share, publish, distribute or lend the Materials, provide the Materials as a stand-alone hosted solution for others to use, or transfer the Materials or this agreement to any third party.
+ 3) PERSONAL DATA. If the data (set forth in Section 1(c) above) includes or is found to include any data that enables any ability to identify an individual (“Personal Data”), you will not use such Personal Data for any purpose other than was authorized and consented to by the data subject/research participant. You will not use Personal Data to contact any person. You will keep Personal Data in strict confidence. You will not share any Personal Data that is collected or in your possession with any third party for any reason and as required under the original consent agreement. Further, you will destroy the Personal Data and any backup or copies, immediately upon the completion of your research.
+ 4) LICENSE TO MICROSOFT. Notwithstanding the limitations in Section 1, you may distribute your modifications back to Microsoft, and if you do provide Microsoft with modifications of the Materials, you hereby grant Microsoft, without any restrictions or limitations, a non-exclusive, perpetual, irrevocable, royalty-free, assignable and sub-licensable license, to reproduce, publicly perform or display, install, use, modify, post, distribute, make and have made, sell and transfer such modifications and derivatives for any purpose.
+ 5) PUBLICATION. You may publish (or present papers or articles) on your results from using the Materials provided that no material or substantial portion of the Materials is included in any such publication or presentation.
+ 6) FEEDBACK. Any feedback about the Materials provided by you to us is voluntarily given, and Microsoft shall be free to use the feedback as it sees fit without obligation or restriction of any kind, even if the feedback is designated by you as confidential. Such feedback shall be considered a contribution and licensed to Microsoft under the terms of Section 4 above.
+ 7) COMPLIANCE WITH TRADE LAWS. You acknowledge that the Materials may be subject to applicable trade laws in one or more countries. You will comply with all relevant laws and regulations applicable to the import or export of the Materials, including but not limited to, trade laws such as the U.S. Export Administration Regulations or other end-user, end use, and destination restrictions by the U.S. and other governments, as well as sanctions regulations administered by the U.S. Office of Foreign Assets Control. Microsoft may suspend or terminate the agreement immediately to the extent that Microsoft reasonably concludes that continued performance would violate trade laws or put it at risk of becoming subject to sanctions or penalties under trade laws. For additional information, see www.microsoft.com/exporting.
+ 8) SUPPORT SERVICES. Microsoft is not obligated under this agreement to provide any support services for the Materials. Any support provided is “as is”, “with all faults”, and without warranty of any kind.
+ 9) BINDING ARBITRATION AND CLASS ACTION WAIVER. This Section applies if you live in (or, if a business, your principal place of business is in) the United States. If you and Microsoft have a dispute, you and Microsoft agree to try for 60 days to resolve it informally. If you and Microsoft can’t, you and Microsoft agree to binding individual arbitration before the American Arbitration Association under the Federal Arbitration Act (“FAA”), and not to sue in court in front of a judge or jury. Instead, a neutral arbitrator will decide. Class action lawsuits, class-wide arbitrations, private attorney-general actions, and any other proceeding where someone acts in a representative capacity are not allowed; nor is combining individual proceedings without the consent of all parties. The complete Arbitration Agreement contains more terms and is at aka.ms/arb-agreement-1. You and Microsoft agree to these terms.
+ 10) ENTIRE AGREEMENT. This agreement, and any other terms Microsoft may provide for supplements, updates, or third-party applications, is the entire agreement for the Materials.
+ 11) APPLICABLE LAW AND PLACE TO RESOLVE DISPUTES. If you acquired the Materials in the United States or Canada, the laws of the state or province where you live (or, if a business, where your principal place of business is located) govern the interpretation of this agreement, claims for its breach, and all other claims (including consumer protection, unfair competition, and tort claims), regardless of conflict of laws principles, except that the FAA governs everything related to arbitration. If you acquired the Materials in any other country, its laws apply, except that the FAA governs everything related to arbitration. If U.S. federal jurisdiction exists, you and Microsoft consent to exclusive jurisdiction and venue in the federal court in King County, Washington for all disputes heard in court (excluding arbitration). If not, you and Microsoft consent to exclusive jurisdiction and venue in the Superior Court of King County, Washington for all disputes heard in court (excluding arbitration).
+ 12) CONSUMER RIGHTS; REGIONAL VARIATIONS. This agreement describes certain legal rights. You may have other rights, including consumer rights, under the laws of your state, province, or country. Separate and apart from your relationship with Microsoft, you may also have rights with respect to the party from which you acquired the Materials. This agreement does not change those other rights if the laws of your state, province, or country do not permit it to do so. For example, if you acquired the Materials in one of the below regions, or mandatory country law applies, then the following provisions apply to you:
+ a) Australia. You have statutory guarantees under the Australian Consumer Law and nothing in this agreement is intended to affect those rights.
+ b) Canada. If you acquired this software in Canada, you may stop receiving updates by turning off the automatic update feature, disconnecting your device from the Internet (if and when you re-connect to the Internet, however, the Materials will resume checking for and installing updates), or uninstalling the Materials. The product documentation, if any, may also specify how to turn off updates for your specific device or software.
+ c) Germany and Austria.
+ i. Warranty. The properly licensed software will perform substantially as described in any Microsoft materials that accompany the Materials. However, Microsoft gives no contractual guarantee in relation to the licensed software.
+ ii. Limitation of Liability. In case of intentional conduct, gross negligence, claims based on the Product Liability Act, as well as, in case of death or personal or physical injury, Microsoft is liable according to the statutory law.
+ Subject to the foregoing clause (ii), Microsoft will only be liable for slight negligence if Microsoft is in breach of such material contractual obligations, the fulfillment of which facilitate the due performance of this agreement, the breach of which would endanger the purpose of this agreement and the compliance with which a party may constantly trust in (so-called "cardinal obligations"). In other cases of slight negligence, Microsoft will not be liable for slight negligence.
+ 13) DISCLAIMER OF WARRANTY. THE MATERIALS ARE LICENSED “AS IS.” YOU BEAR THE RISK OF USING THEM. MICROSOFT GIVES NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS. TO THE EXTENT PERMITTED UNDER APPLICABLE LAWS, MICROSOFT EXCLUDES ALL IMPLIED WARRANTIES, INCLUDING MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT.
+
+ 14) LIMITATION ON AND EXCLUSION OF DAMAGES. IF YOU HAVE ANY BASIS FOR RECOVERING DAMAGES DESPITE THE PRECEDING DISCLAIMER OF WARRANTY, YOU CAN RECOVER FROM MICROSOFT AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP TO U.S. $5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL, LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.
+ This limitation applies to (a) anything related to the Materials, services, content (including code) on third party Internet sites, or third party applications; and (b) claims for breach of contract, warranty, guarantee, or condition; strict liability, negligence, or other tort; or any other claim; in each case to the extent permitted by applicable law.
+ It also applies even if Microsoft knew or should have known about the possibility of the damages. The above limitation or exclusion may not apply to you because your state, province, or country may not allow the exclusion or limitation of incidental, consequential, or other damages.
README.md ADDED
@@ -0,0 +1,273 @@
+ ---
+ license: other
+ license_name: msrla
+ license_link: https://huggingface.co/microsoft/maira-2/blob/main/LICENSE
+ library_name: transformers
+ extra_gated_prompt: >-
+   Please confirm that you have read and agree to the following disclaimer.
+
+   The model(s) and/or software described in this repository are provided for research and development use only. The model(s) and/or software are not intended for use in clinical decision-making or for any other clinical use, and performance for clinical use has not been established. You bear sole responsibility for any use of these model(s) and/or software, including incorporation into any product intended for clinical use.
+ extra_gated_fields:
+   I have read and agree to the disclaimer: checkbox
+ ---
+
+ # Model Card for MAIRA-2
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ MAIRA-2 is a multimodal transformer designed for the generation of grounded or non-grounded radiology reports from chest X-rays. It is described in more detail in [MAIRA-2: Grounded Radiology Report Generation (S. Bannur, K. Bouzid et al., 2024)](https://arxiv.org/abs/2406.04449). MAIRA-2 has been built for research purposes only and is being shared to facilitate comparison and further research.
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+ MAIRA-2 is composed of the image encoder [RAD-DINO-MAIRA-2](https://huggingface.co/microsoft/rad-dino-maira-2) (used frozen), a projection layer (trained from scratch), and the language model [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5) (fully fine-tuned).
+
+ - **Developed by:** Microsoft Research Health Futures
+ - **Model type:** Multimodal transformer
+ - **Language(s) (NLP):** English
+ - **License:** [MSRLA](./LICENSE)
+ - **Finetuned from model:** [vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5), [RAD-DINO-MAIRA-2](https://huggingface.co/microsoft/rad-dino-maira-2)
+
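For orientation, the three components can be inspected on the loaded model. A minimal sketch, assuming the LLaVA-style submodule names (`vision_tower`, `multi_modal_projector`, `language_model`) that a `LlavaConfig`-derived model typically exposes; verify against the actual module tree of the checkpoint:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)

# Submodule names below are assumptions based on the LLaVA-style layout.
print(type(model.vision_tower).__name__)           # DINOv2-based image encoder (used frozen)
print(type(model.multi_modal_projector).__name__)  # projection layer trained from scratch
print(type(model.language_model).__name__)         # Vicuna/Llama language model (fully fine-tuned)
```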
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ MAIRA-2 is shared for research purposes only. It is **not meant to be used for clinical practice.** MAIRA-2 was not extensively tested for its capabilities and properties, including its accuracy and reliability in application settings, fairness across different demographics and uses, and security and privacy.
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ As inputs, MAIRA-2 takes a frontal chest X-ray, and any of the following:
+ - A lateral view from the current study
+ - A frontal view from the *prior* study, with accompanying prior report
+ - The indication for the current study
+ - The technique and comparison sections for the current study
+
+ MAIRA-2 can generate the _findings_ section of the current study, in one of two forms:
+ - Narrative text, without any image annotations (this is the typical report generation scenario).
+ - As a grounded report, wherein all described findings are accompanied by zero or more bounding boxes indicating their location on the current frontal image.
+
+ MAIRA-2 can also perform phrase grounding. In this case, it must also be provided with an input phrase. It will then repeat the phrase and generate a bounding box localising the finding described in the phrase.
+
+ These use-cases are illustrated with [sample code below](README.md#use-case-1-and-2-findings-generation-with-or-without-grounding).
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ MAIRA-2 was trained on chest X-rays from adults with English language reports only, and is not expected to work on any other imaging modality or anatomy. Variations in the input prompt (e.g. changing the instruction) are likely to degrade performance, as this model was *not* optimised for arbitrary user inputs.
+
+ As above, this is a research model which should not be used in any real clinical or production scenario.
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ ### Data biases
+ MAIRA-2 was trained on chest X-ray report datasets from Spain (translated from the original Spanish to English) and the USA, listed below. Reporting styles, patient demographics and disease prevalence, and image acquisition protocols can vary across health systems and regions. These factors will impact the generalisability of the model.
+
+ ### Model errors (fabrication, omission)
+
+ This model does not perform perfectly on its tasks, as outlined in more detail in the [MAIRA-2 report](https://arxiv.org/abs/2406.04449). Hence, errors can be present in the generated (grounded) reports.
+
+ ## How to Get Started with the Model
+
+ We demonstrate below how to run inference with MAIRA-2 for its three capabilities: findings generation with and without grounding, and phrase grounding.
+
+ ### Setup
+
+ To run this sample code, you will need the following packages:
+ ```
+ pillow
+ protobuf
+ sentencepiece
+ torch
+ transformers>=4.48.0,<4.52
+ ```
+
+ Note: MAIRA-2 was last tested with transformers v4.51.3.
91
+
92
+ First, initialise the model and put it in eval mode.
93
+ ```python
94
+ from transformers import AutoModelForCausalLM, AutoProcessor
95
+ from pathlib import Path
96
+ import torch
97
+
98
+ model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)
99
+ processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)
100
+
101
+ device = torch.device("cuda")
102
+ model = model.eval()
103
+ model = model.to(device)
104
+ ```
105
+
106
+ We need to get some data to demonstrate the forward pass.
107
+ For this example, we'll collect an example from the IU X-ray dataset, which has a permissive license.
108
+
109
+ ```python
110
+ import requests
111
+ from PIL import Image
112
+
113
+ def get_sample_data() -> dict[str, Image.Image | str]:
114
+ """
115
+ Download chest X-rays from IU-Xray, which we didn't train MAIRA-2 on. License is CC.
116
+ We modified this function from the Rad-DINO repository on Huggingface.
117
+ """
118
+ frontal_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-1001.png"
119
+ lateral_image_url = "https://openi.nlm.nih.gov/imgs/512/145/145/CXR145_IM-0290-2001.png"
120
+
121
+ def download_and_open(url: str) -> Image.Image:
122
+ response = requests.get(url, headers={"User-Agent": "MAIRA-2"}, stream=True)
123
+ return Image.open(response.raw)
124
+
125
+ frontal_image = download_and_open(frontal_image_url)
126
+ lateral_image = download_and_open(lateral_image_url)
127
+
128
+ sample_data = {
129
+ "frontal": frontal_image,
130
+ "lateral": lateral_image,
131
+ "indication": "Dyspnea.",
132
+ "comparison": "None.",
133
+ "technique": "PA and lateral views of the chest.",
134
+ "phrase": "Pleural effusion." # For the phrase grounding example. This patient has pleural effusion.
135
+ }
136
+ return sample_data
137
+
138
+ sample_data = get_sample_data()
139
+ ```
+
+ ### Use-case 1 and 2: Findings generation with or without grounding
+
+ We can toggle whether MAIRA-2 generates a grounded report based on how we preprocess the inputs, as it uses a different prompt. Let's start without grounding (`get_grounding=False`). While generating, for non-grounded reporting use `max_new_tokens=300`, and for grounded reporting use `max_new_tokens=450` to accommodate additional box and object tokens.
+ ```python
+ processed_inputs = processor.format_and_preprocess_reporting_input(
+     current_frontal=sample_data["frontal"],
+     current_lateral=sample_data["lateral"],
+     prior_frontal=None,  # Our example has no prior
+     indication=sample_data["indication"],
+     technique=sample_data["technique"],
+     comparison=sample_data["comparison"],
+     prior_report=None,  # Our example has no prior
+     return_tensors="pt",
+     get_grounding=False,  # For this example we generate a non-grounded report
+ )
+
+ processed_inputs = processed_inputs.to(device)
+ with torch.no_grad():
+     output_decoding = model.generate(
+         **processed_inputs,
+         max_new_tokens=300,  # Set to 450 for grounded reporting
+         use_cache=True,
+     )
+ prompt_length = processed_inputs["input_ids"].shape[-1]
+ decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
+ decoded_text = decoded_text.lstrip()  # Findings generation completions have a single leading space
+ prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)
+ print("Parsed prediction:", prediction)
+ ```
+
+ We get something that looks like this:
+ > There is a large right pleural effusion with associated right basilar atelectasis. The left lung is clear. No pneumothorax is identified. The cardiomediastinal silhouette and hilar contours are normal. There is no free air under the diaphragm. Surgical clips are noted in the right upper quadrant of the abdomen.
+
+ If we had set `get_grounding=True`, MAIRA-2 would generate a grounded report. For this example, that looks like this:
+
+ ```python
+ ('There is a large right pleural effusion.', [(0.055, 0.275, 0.445, 0.665)]),
+ ('The left lung is clear.', None),
+ ('No pneumothorax is identified.', None),
+ ('The cardiomediastinal silhouette is within normal limits.', None),
+ ('The visualized osseous structures are unremarkable.', None)
+ ```
+
+ The generated bounding box coordinates are the `(x, y)` coordinates of the top left and bottom right corners of the box, i.e. `(x_topleft, y_topleft, x_bottomright, y_bottomright)`. These are relative to the _cropped_ image (that is, the image that MAIRA-2 ultimately got as input), so be careful while visualising. The processor provides a method `adjust_box_for_original_image_size` to get boxes relative to the original image shape.
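For instance, to map one of these relative boxes onto pixels for plotting, scale each coordinate by the size of the image the model saw. A minimal, hypothetical helper (the 512×512 crop size below is assumed purely for illustration; `adjust_box_for_original_image_size` remains the supported route back to the original image frame):

```python
def box_to_pixels(box: tuple[float, float, float, float], width: int, height: int) -> tuple[int, int, int, int]:
    """Scale a relative (x_min, y_min, x_max, y_max) box to integer pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return (round(x_min * width), round(y_min * height), round(x_max * width), round(y_max * height))

# The pleural effusion box from the grounded report above, on an assumed 512x512 crop:
print(box_to_pixels((0.055, 0.275, 0.445, 0.665), 512, 512))  # (28, 141, 228, 340)
```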
+
+ Note that MAIRA-2 generates slightly different reports for grounded and non-grounded reporting scenarios, a side-effect of its grounded reporting training data coming from a different data distribution.
+
+ ### Use-case 3: Phrase Grounding
+
+ Here the input is different, as we provide the model with a phrase to ground in the image. Recall from `get_sample_data` that our phrase here is just "Pleural effusion.", which we already know is present in this image.
191
+
192
+ ```python
193
+ processed_inputs = processor.format_and_preprocess_phrase_grounding_input(
194
+ frontal_image=sample_data["frontal"],
195
+ phrase=sample_data["phrase"],
196
+ return_tensors="pt",
197
+ )
198
+
199
+ processed_inputs = processed_inputs.to(device)
200
+ with torch.no_grad():
201
+ output_decoding = model.generate(
202
+ **processed_inputs,
203
+ max_new_tokens=150,
204
+ use_cache=True,
205
+ )
206
+ prompt_length = processed_inputs["input_ids"].shape[-1]
207
+ decoded_text = processor.decode(output_decoding[0][prompt_length:], skip_special_tokens=True)
208
+ prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded_text)
209
+
210
+ print("Parsed prediction:", prediction)
211
+ ```
212
+
213
+ This gives us something like this:
214
+
215
+ ```python
216
+ ('Pleural effusion.', [(0.025, 0.345, 0.425, 0.575)])
217
+ ```
218
+
219
+ Again, as for grounded reporting we must remember the bbox coordinates are relative to the cropped image seen by MAIRA-2, use `processor.adjust_box_for_original_image_size` to get boxes adjusted for the original image shape.
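To sanity-check a grounding visually, the box can be drawn on the image with Pillow (already a dependency). A minimal sketch, assuming `box_pixels` has already been converted to the pixel frame of the image being drawn on (e.g. via `adjust_box_for_original_image_size` followed by scaling):

```python
from PIL import ImageDraw

image = sample_data["frontal"].convert("RGB")
draw = ImageDraw.Draw(image)
box_pixels = (28, 141, 228, 340)  # hypothetical pixel-space box, for illustration only
draw.rectangle(box_pixels, outline="red", width=3)
image.save("phrase_grounding_overlay.png")
```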
+
+ ## Training details
+
+ We did not originally train MAIRA-2 using the exact model class provided here; however, we have checked that its behaviour is the same. We provide this class to facilitate research re-use and inference.
+
+ ### Training data
+
+ MAIRA-2 was trained on a mix of public and private chest X-ray datasets. Each example comprises one or more CXR images and associated report text, with or without grounding (spatial annotations). The model is trained to generate the _findings_ section of the report, with or without grounding.
+
+ | Dataset | Country | # examples (ungrounded) | # examples (grounded) |
+ | ----- | ------ | ------- | ----- |
+ | [MIMIC-CXR](https://www.nature.com/articles/s41597-019-0322-0) | USA | 55 218 | 595* |
+ | [PadChest](https://www.sciencedirect.com/science/article/abs/pii/S1361841520301614) | Spain | 52 828 | 3 122 |
+ | USMix (Private) | USA | 118 031 | 53 613 |
+
+ *We use the [MS-CXR](https://physionet.org/content/ms-cxr/) phrase grounding dataset to provide "grounding" examples from MIMIC-CXR.
236
+
237
+ ## Environmental Impact
238
+
239
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
240
+
241
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
242
+
243
+ - **Hardware Type:** NVIDIA A100 GPUs
244
+ - **Hours used:** 1432
245
+ - **Cloud Provider:** Azure
246
+ - **Compute Region:** West US 2
247
+ - **Carbon Emitted:** 107.4 CO₂ eq _(ostensibly offset by this provider)_
248
+
249
+ ## Citation
250
+
251
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
252
+
253
+ **BibTeX:**
254
+
255
+ ```
256
+ @article{Bannur2024MAIRA2GR,
257
+ title={MAIRA-2: Grounded Radiology Report Generation},
258
+ author={Shruthi Bannur and Kenza Bouzid and Daniel C. Castro and Anton Schwaighofer and Anja Thieme and Sam Bond-Taylor and Maximilian Ilse and Fernando P\'{e}rez-Garc\'{i}a and Valentina Salvatelli and Harshita Sharma and Felix Meissen and Mercy Prasanna Ranjit and Shaury Srivastav and Julia Gong and Noel C. F. Codella and Fabian Falck and Ozan Oktay and Matthew P. Lungren and Maria T. A. Wetscherek and Javier Alvarez-Valle and Stephanie L. Hyland},
259
+ journal={arXiv},
260
+ year={2024},
261
+ volume={abs/2406.04449},
262
+ url={https://arxiv.org/abs/2406.04449}
263
+ }
264
+ ```
265
+
266
+ **APA:**
267
+
268
+ > Bannur*, S., Bouzid*, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., Ilse, M., Pérez-García, F., Salvatelli, V., Sharma, H., Meissen, F., Ranjit, M.P., Srivastav, S., Gong, J., Codella, N.C.F., Falck, F., Oktay, O., Lungren, M.P., Wetscherek, M.T., Alvarez-Valle, J., & Hyland, S. L. (2024). *MAIRA-2: Grounded Radiology Report Generation*. arXiv preprint abs/2406.04449.
269
+
270
+ ## Model Card Contact
271
+
272
+ - Stephanie Hyland ([`stephanie.hyland@microsoft.com`](mailto:stephanie.hyland@microsoft.com))
273
+ - Shruthi Bannur ([`shruthi.bannur@microsoft.com`](mailto:shruthi.bannur@microsoft.com))
added_tokens.json ADDED
@@ -0,0 +1,209 @@
+ {
+   "</box>": 32203,
+   "</obj>": 32001,
+   "<box>": 32202,
+   "<image>": 32204,
+   "<lat_image>": 32206,
+   "<obj>": 32000,
+   "<prev_im>": 32205,
+   "<x0>": 32002,
+   "<x10>": 32012,
+   "<x11>": 32013,
+   "<x12>": 32014,
+   "<x13>": 32015,
+   "<x14>": 32016,
+   "<x15>": 32017,
+   "<x16>": 32018,
+   "<x17>": 32019,
+   "<x18>": 32020,
+   "<x19>": 32021,
+   "<x1>": 32003,
+   "<x20>": 32022,
+   "<x21>": 32023,
+   "<x22>": 32024,
+   "<x23>": 32025,
+   "<x24>": 32026,
+   "<x25>": 32027,
+   "<x26>": 32028,
+   "<x27>": 32029,
+   "<x28>": 32030,
+   "<x29>": 32031,
+   "<x2>": 32004,
+   "<x30>": 32032,
+   "<x31>": 32033,
+   "<x32>": 32034,
+   "<x33>": 32035,
+   "<x34>": 32036,
+   "<x35>": 32037,
+   "<x36>": 32038,
+   "<x37>": 32039,
+   "<x38>": 32040,
+   "<x39>": 32041,
+   "<x3>": 32005,
+   "<x40>": 32042,
+   "<x41>": 32043,
+   "<x42>": 32044,
+   "<x43>": 32045,
+   "<x44>": 32046,
+   "<x45>": 32047,
+   "<x46>": 32048,
+   "<x47>": 32049,
+   "<x48>": 32050,
+   "<x49>": 32051,
+   "<x4>": 32006,
+   "<x50>": 32052,
+   "<x51>": 32053,
+   "<x52>": 32054,
+   "<x53>": 32055,
+   "<x54>": 32056,
+   "<x55>": 32057,
+   "<x56>": 32058,
+   "<x57>": 32059,
+   "<x58>": 32060,
+   "<x59>": 32061,
+   "<x5>": 32007,
+   "<x60>": 32062,
+   "<x61>": 32063,
+   "<x62>": 32064,
+   "<x63>": 32065,
+   "<x64>": 32066,
+   "<x65>": 32067,
+   "<x66>": 32068,
+   "<x67>": 32069,
+   "<x68>": 32070,
+   "<x69>": 32071,
+   "<x6>": 32008,
+   "<x70>": 32072,
+   "<x71>": 32073,
+   "<x72>": 32074,
+   "<x73>": 32075,
+   "<x74>": 32076,
+   "<x75>": 32077,
+   "<x76>": 32078,
+   "<x77>": 32079,
+   "<x78>": 32080,
+   "<x79>": 32081,
+   "<x7>": 32009,
+   "<x80>": 32082,
+   "<x81>": 32083,
+   "<x82>": 32084,
+   "<x83>": 32085,
+   "<x84>": 32086,
+   "<x85>": 32087,
+   "<x86>": 32088,
+   "<x87>": 32089,
+   "<x88>": 32090,
+   "<x89>": 32091,
+   "<x8>": 32010,
+   "<x90>": 32092,
+   "<x91>": 32093,
+   "<x92>": 32094,
+   "<x93>": 32095,
+   "<x94>": 32096,
+   "<x95>": 32097,
+   "<x96>": 32098,
+   "<x97>": 32099,
+   "<x98>": 32100,
+   "<x99>": 32101,
+   "<x9>": 32011,
+   "<y0>": 32102,
+   "<y10>": 32112,
+   "<y11>": 32113,
+   "<y12>": 32114,
+   "<y13>": 32115,
+   "<y14>": 32116,
+   "<y15>": 32117,
+   "<y16>": 32118,
+   "<y17>": 32119,
+   "<y18>": 32120,
+   "<y19>": 32121,
+   "<y1>": 32103,
+   "<y20>": 32122,
+   "<y21>": 32123,
+   "<y22>": 32124,
+   "<y23>": 32125,
+   "<y24>": 32126,
+   "<y25>": 32127,
+   "<y26>": 32128,
+   "<y27>": 32129,
+   "<y28>": 32130,
+   "<y29>": 32131,
+   "<y2>": 32104,
+   "<y30>": 32132,
+   "<y31>": 32133,
+   "<y32>": 32134,
+   "<y33>": 32135,
+   "<y34>": 32136,
+   "<y35>": 32137,
+   "<y36>": 32138,
+   "<y37>": 32139,
+   "<y38>": 32140,
+   "<y39>": 32141,
+   "<y3>": 32105,
+   "<y40>": 32142,
+   "<y41>": 32143,
+   "<y42>": 32144,
+   "<y43>": 32145,
+   "<y44>": 32146,
+   "<y45>": 32147,
+   "<y46>": 32148,
+   "<y47>": 32149,
+   "<y48>": 32150,
+   "<y49>": 32151,
+   "<y4>": 32106,
+   "<y50>": 32152,
+   "<y51>": 32153,
+   "<y52>": 32154,
+   "<y53>": 32155,
+   "<y54>": 32156,
+   "<y55>": 32157,
+   "<y56>": 32158,
+   "<y57>": 32159,
+   "<y58>": 32160,
+   "<y59>": 32161,
+   "<y5>": 32107,
+   "<y60>": 32162,
+   "<y61>": 32163,
+   "<y62>": 32164,
+   "<y63>": 32165,
+   "<y64>": 32166,
+   "<y65>": 32167,
+   "<y66>": 32168,
+   "<y67>": 32169,
+   "<y68>": 32170,
+   "<y69>": 32171,
+   "<y6>": 32108,
+   "<y70>": 32172,
+   "<y71>": 32173,
+   "<y72>": 32174,
+   "<y73>": 32175,
+   "<y74>": 32176,
+   "<y75>": 32177,
+   "<y76>": 32178,
+   "<y77>": 32179,
+   "<y78>": 32180,
+   "<y79>": 32181,
+   "<y7>": 32109,
+   "<y80>": 32182,
+   "<y81>": 32183,
+   "<y82>": 32184,
+   "<y83>": 32185,
+   "<y84>": 32186,
+   "<y85>": 32187,
+   "<y86>": 32188,
+   "<y87>": 32189,
+   "<y88>": 32190,
+   "<y89>": 32191,
+   "<y8>": 32110,
+   "<y90>": 32192,
+   "<y91>": 32193,
+   "<y92>": 32194,
+   "<y93>": 32195,
+   "<y94>": 32196,
+   "<y95>": 32197,
+   "<y96>": 32198,
+   "<y97>": 32199,
+   "<y98>": 32200,
+   "<y99>": 32201,
+   "<y9>": 32111
+ }
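The `<x0>`…`<x99>` and `<y0>`…`<y99>` entries quantise each box coordinate into 100 bins. A bin-centre decoding is an assumption here, but it is consistent with the sample outputs in the README above, whose coordinates (e.g. 0.055, 0.275) are exactly (bin + 0.5) / 100. A minimal sketch:

```python
def decode_coordinate_token(token: str) -> float:
    """Map a coordinate token such as '<x5>' or '<y27>' to an assumed bin-centre value in [0, 1]."""
    bin_index = int(token.strip("<>")[1:])  # drop '<', '>' and the leading 'x'/'y'
    return (bin_index + 0.5) / 100

# '<x5><y27><x44><y66>' would decode to the effusion box shown in the README example:
print([decode_coordinate_token(t) for t in ("<x5>", "<y27>", "<x44>", "<y66>")])
# [0.055, 0.275, 0.445, 0.665]
```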
chat_template.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}You are an expert radiology assistant tasked with interpreting a chest X-ray study. {% for message in messages %}{% if message[\"role\"] == \"user\" %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message[\"content\"] %}{% if item[\"type\"] == \"text\" %}{{ item[\"text\"] }}{% elif item[\"type\"] == \"image\" %}<image>{% endif %}{% endfor %}{% if message[\"role\"] == \"user\" %} {% else %}{{eos_token}}{% endif %}{% endfor %}{% if add_generation_prompt %}ASSISTANT: {% endif %}"
+ }
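To see what this template renders, it can be applied through the tokenizer; messages follow the `type`/`text`/`image` item structure the template iterates over. A minimal sketch for inspection only (the processor's `format_and_preprocess_*` helpers construct the prompts MAIRA-2 actually expects):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe the image."}]},
]
prompt = processor.tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
# "You are an expert radiology assistant tasked with interpreting a chest X-ray study. USER: <image>Describe the image. ASSISTANT: "
```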
config.json ADDED
@@ -0,0 +1,99 @@
+ {
+   "architectures": [
+     "Maira2ForConditionalGeneration"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_maira2.Maira2Config",
+     "AutoModelForCausalLM": "modeling_maira2.Maira2ForConditionalGeneration",
+     "AutoModelForVision2Seq": "modeling_maira2.Maira2ForConditionalGeneration"
+   },
+   "hidden_size": 4096,
+   "image_seq_length": 576,
+   "image_token_index": 32204,
+   "model_type": "maira2",
+   "multimodal_projector_bias": true,
+   "pad_token_id": 0,
+   "projector_hidden_act": "gelu",
+   "projector_n_layers": 4,
+   "text_config": {
+     "_name_or_path": "lmsys/vicuna-7b-v1.5",
+     "architectures": [
+       "LlamaForCausalLM"
+     ],
+     "attention_bias": false,
+     "attention_dropout": 0.0,
+     "head_dim": 128,
+     "hidden_act": "silu",
+     "hidden_size": 4096,
+     "initializer_range": 0.02,
+     "intermediate_size": 11008,
+     "max_position_embeddings": 4096,
+     "mlp_bias": false,
+     "model_type": "llama",
+     "num_attention_heads": 32,
+     "num_hidden_layers": 32,
+     "num_key_value_heads": 32,
+     "pad_token_id": 0,
+     "pretraining_tp": 1,
+     "rms_norm_eps": 1e-05,
+     "rope_scaling": {
+       "factor": 1.5,
+       "rope_type": "linear"
+     },
+     "rope_theta": 10000.0,
+     "torch_dtype": "bfloat16",
+     "use_cache": true,
+     "vocab_size": 32207
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "vision_config": {
+     "apply_layernorm": true,
+     "architectures": [
+       "Dinov2Model"
+     ],
+     "attention_probs_dropout_prob": 0.0,
+     "drop_path_rate": 0.0,
+     "hidden_act": "gelu",
+     "hidden_dropout_prob": 0.0,
+     "hidden_size": 768,
+     "image_size": 518,
+     "initializer_range": 0.02,
+     "layer_norm_eps": 1e-06,
+     "layerscale_value": 1.0,
+     "mlp_ratio": 4,
+     "model_type": "dinov2",
+     "num_attention_heads": 12,
+     "num_channels": 3,
+     "num_hidden_layers": 12,
+     "out_features": [
+       "stage12"
+     ],
+     "out_indices": [
+       12
+     ],
+     "patch_size": 14,
+     "qkv_bias": true,
+     "reshape_hidden_states": false,
+     "stage_names": [
+       "stem",
+       "stage1",
+       "stage2",
+       "stage3",
+       "stage4",
+       "stage5",
+       "stage6",
+       "stage7",
+       "stage8",
+       "stage9",
+       "stage10",
+       "stage11",
+       "stage12"
+     ],
+     "torch_dtype": "float32",
+     "use_mask_token": true,
+     "use_swiglu_ffn": false
+   },
+   "vision_feature_layer": -1,
+   "vision_feature_select_strategy": "default"
+ }
configuration_maira2.py ADDED
@@ -0,0 +1,32 @@
+ # Copyright 2024 Microsoft. All rights reserved.
+ # Licensed under the MSRLA License. See LICENSE in the repo root for license information.
+
+
+ from typing import Any
+
+ from transformers import LlavaConfig
+
+
+ class Maira2Config(LlavaConfig):
+     """
+     This is the configuration class to store the configuration of a `Maira2ForConditionalGeneration` model. It is
+     used to instantiate a MAIRA-2 model according to the specified arguments, defining the model architecture.
+
+     It inherits from `LlavaConfig`. In addition to the inherited attributes, it adds the
+     ability to customize the multimodal projector through the following attributes:
+
+     Args:
+         projector_n_layers (`int`, *optional*, defaults to 4):
+             Number of layers in the multimodal projector.
+     """
+
+     model_type = "maira2"
+
+     def __init__(
+         self,
+         projector_n_layers: int = 4,
+         **kwargs: Any,
+     ) -> None:
+         super().__init__(**kwargs)
+         self.hidden_size = self.text_config.hidden_size
+         self.projector_n_layers = projector_n_layers
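Since `config.json` maps `AutoConfig` to this class, the configuration can be loaded and inspected directly. A minimal sketch (printed values taken from the `config.json` above):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/maira-2", trust_remote_code=True)
print(config.model_type)          # "maira2"
print(config.projector_n_layers)  # 4
print(config.hidden_size)         # 4096, mirrored from text_config.hidden_size
```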
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "max_length": 4096,
+   "max_new_tokens": 450,
+   "pad_token_id": 0,
+   "transformers_version": "4.51.3"
+ }
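These values act as defaults for `model.generate`; the README's explicit `max_new_tokens=300`/`450` arguments override them per call. A minimal sketch for inspecting the defaults:

```python
from transformers import GenerationConfig

generation_config = GenerationConfig.from_pretrained("microsoft/maira-2")
print(generation_config.max_new_tokens)  # 450
print(generation_config.eos_token_id)    # 2
```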
model-00001-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0582f1d522390f92f3ebaaa5fa01d2e1a6b7f090f6aad33e32476f109128d767
+ size 135
model-00002-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e23f185c5830f812c171bf510c79a6b51a42f85e72da90d84b62d9433c2fbb1d
+ size 135
model-00003-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a2b6ae86f3a49f69e7f68915dd1a4edab79117dc5d8df7348e01175ac184f02d
+ size 135
model-00004-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5f70554d96f021d607d4f443b92d987c3b9434a058350cdc9cde02fb2918c89b
+ size 135
model-00005-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9bf5d731c86abe7f7c5aef9be877cf15e6baa59c2f83b4b450a0e337a38f26b1
+ size 135
model-00006-of-00006.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ca650a64438cbdc0a6e7f346fd9042446f2d625a85ecc7da96298888956b2a13
+ size 135
model.safetensors.index.json ADDED
@@ -0,0 +1,529 @@
+ {
+   "metadata": {
+     "total_size": 27520742400
+   },
+   "weight_map": {
+     "language_model.lm_head.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.embed_tokens.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.input_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.0.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.input_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.1.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.10.input_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.10.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.10.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
+     "language_model.model.layers.10.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
+     "language_model.model.layers.10.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.10.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
+     "language_model.model.layers.10.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
+     "language_model.model.layers.10.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
+     "language_model.model.layers.10.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
+     "language_model.model.layers.11.input_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.11.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.input_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.12.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.input_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.13.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.input_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.14.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.input_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.mlp.down_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.post_attention_layernorm.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.15.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.16.input_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.16.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.16.mlp.gate_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.16.mlp.up_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.16.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.16.self_attn.k_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.16.self_attn.o_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.16.self_attn.q_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.16.self_attn.v_proj.weight": "model-00003-of-00006.safetensors",
+     "language_model.model.layers.17.input_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.17.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.input_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.18.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.input_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.19.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.2.input_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.2.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.20.input_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.20.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.input_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.mlp.down_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.post_attention_layernorm.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.21.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.22.input_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.22.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.22.mlp.gate_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.22.mlp.up_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.22.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.22.self_attn.k_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.22.self_attn.o_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.22.self_attn.q_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.22.self_attn.v_proj.weight": "model-00004-of-00006.safetensors",
+     "language_model.model.layers.23.input_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.23.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.input_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.24.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.input_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.25.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.input_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.26.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.input_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.mlp.down_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.post_attention_layernorm.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.27.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.28.input_layernorm.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.28.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.28.mlp.gate_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.28.mlp.up_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.28.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.28.self_attn.k_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.28.self_attn.o_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.28.self_attn.q_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.28.self_attn.v_proj.weight": "model-00005-of-00006.safetensors",
+     "language_model.model.layers.29.input_layernorm.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.self_attn.k_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.self_attn.q_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.29.self_attn.v_proj.weight": "model-00006-of-00006.safetensors",
+     "language_model.model.layers.3.input_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.3.mlp.down_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.3.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.3.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.3.post_attention_layernorm.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.3.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.3.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
+     "language_model.model.layers.3.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
223
+ "language_model.model.layers.3.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
224
+ "language_model.model.layers.30.input_layernorm.weight": "model-00006-of-00006.safetensors",
225
+ "language_model.model.layers.30.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
226
+ "language_model.model.layers.30.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
227
+ "language_model.model.layers.30.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
228
+ "language_model.model.layers.30.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
229
+ "language_model.model.layers.30.self_attn.k_proj.weight": "model-00006-of-00006.safetensors",
230
+ "language_model.model.layers.30.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
231
+ "language_model.model.layers.30.self_attn.q_proj.weight": "model-00006-of-00006.safetensors",
232
+ "language_model.model.layers.30.self_attn.v_proj.weight": "model-00006-of-00006.safetensors",
233
+ "language_model.model.layers.31.input_layernorm.weight": "model-00006-of-00006.safetensors",
234
+ "language_model.model.layers.31.mlp.down_proj.weight": "model-00006-of-00006.safetensors",
235
+ "language_model.model.layers.31.mlp.gate_proj.weight": "model-00006-of-00006.safetensors",
236
+ "language_model.model.layers.31.mlp.up_proj.weight": "model-00006-of-00006.safetensors",
237
+ "language_model.model.layers.31.post_attention_layernorm.weight": "model-00006-of-00006.safetensors",
238
+ "language_model.model.layers.31.self_attn.k_proj.weight": "model-00006-of-00006.safetensors",
239
+ "language_model.model.layers.31.self_attn.o_proj.weight": "model-00006-of-00006.safetensors",
240
+ "language_model.model.layers.31.self_attn.q_proj.weight": "model-00006-of-00006.safetensors",
241
+ "language_model.model.layers.31.self_attn.v_proj.weight": "model-00006-of-00006.safetensors",
242
+ "language_model.model.layers.4.input_layernorm.weight": "model-00002-of-00006.safetensors",
243
+ "language_model.model.layers.4.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
244
+ "language_model.model.layers.4.mlp.gate_proj.weight": "model-00001-of-00006.safetensors",
245
+ "language_model.model.layers.4.mlp.up_proj.weight": "model-00001-of-00006.safetensors",
246
+ "language_model.model.layers.4.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
247
+ "language_model.model.layers.4.self_attn.k_proj.weight": "model-00001-of-00006.safetensors",
248
+ "language_model.model.layers.4.self_attn.o_proj.weight": "model-00001-of-00006.safetensors",
249
+ "language_model.model.layers.4.self_attn.q_proj.weight": "model-00001-of-00006.safetensors",
250
+ "language_model.model.layers.4.self_attn.v_proj.weight": "model-00001-of-00006.safetensors",
251
+ "language_model.model.layers.5.input_layernorm.weight": "model-00002-of-00006.safetensors",
252
+ "language_model.model.layers.5.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
253
+ "language_model.model.layers.5.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
254
+ "language_model.model.layers.5.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
255
+ "language_model.model.layers.5.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
256
+ "language_model.model.layers.5.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
257
+ "language_model.model.layers.5.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
258
+ "language_model.model.layers.5.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
259
+ "language_model.model.layers.5.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
260
+ "language_model.model.layers.6.input_layernorm.weight": "model-00002-of-00006.safetensors",
261
+ "language_model.model.layers.6.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
262
+ "language_model.model.layers.6.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
263
+ "language_model.model.layers.6.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
264
+ "language_model.model.layers.6.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
265
+ "language_model.model.layers.6.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
266
+ "language_model.model.layers.6.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
267
+ "language_model.model.layers.6.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
268
+ "language_model.model.layers.6.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
269
+ "language_model.model.layers.7.input_layernorm.weight": "model-00002-of-00006.safetensors",
270
+ "language_model.model.layers.7.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
271
+ "language_model.model.layers.7.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
272
+ "language_model.model.layers.7.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
273
+ "language_model.model.layers.7.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
274
+ "language_model.model.layers.7.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
275
+ "language_model.model.layers.7.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
276
+ "language_model.model.layers.7.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
277
+ "language_model.model.layers.7.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
278
+ "language_model.model.layers.8.input_layernorm.weight": "model-00002-of-00006.safetensors",
279
+ "language_model.model.layers.8.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
280
+ "language_model.model.layers.8.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
281
+ "language_model.model.layers.8.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
282
+ "language_model.model.layers.8.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
283
+ "language_model.model.layers.8.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
284
+ "language_model.model.layers.8.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
285
+ "language_model.model.layers.8.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
286
+ "language_model.model.layers.8.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
287
+ "language_model.model.layers.9.input_layernorm.weight": "model-00002-of-00006.safetensors",
288
+ "language_model.model.layers.9.mlp.down_proj.weight": "model-00002-of-00006.safetensors",
289
+ "language_model.model.layers.9.mlp.gate_proj.weight": "model-00002-of-00006.safetensors",
290
+ "language_model.model.layers.9.mlp.up_proj.weight": "model-00002-of-00006.safetensors",
291
+ "language_model.model.layers.9.post_attention_layernorm.weight": "model-00002-of-00006.safetensors",
292
+ "language_model.model.layers.9.self_attn.k_proj.weight": "model-00002-of-00006.safetensors",
293
+ "language_model.model.layers.9.self_attn.o_proj.weight": "model-00002-of-00006.safetensors",
294
+ "language_model.model.layers.9.self_attn.q_proj.weight": "model-00002-of-00006.safetensors",
295
+ "language_model.model.layers.9.self_attn.v_proj.weight": "model-00002-of-00006.safetensors",
296
+ "language_model.model.norm.weight": "model-00006-of-00006.safetensors",
297
+ "multi_modal_projector.layers.0.bias": "model-00001-of-00006.safetensors",
298
+ "multi_modal_projector.layers.0.weight": "model-00001-of-00006.safetensors",
299
+ "multi_modal_projector.layers.2.bias": "model-00001-of-00006.safetensors",
300
+ "multi_modal_projector.layers.2.weight": "model-00001-of-00006.safetensors",
301
+ "multi_modal_projector.layers.4.bias": "model-00001-of-00006.safetensors",
302
+ "multi_modal_projector.layers.4.weight": "model-00001-of-00006.safetensors",
303
+ "multi_modal_projector.layers.6.bias": "model-00001-of-00006.safetensors",
304
+ "multi_modal_projector.layers.6.weight": "model-00001-of-00006.safetensors",
305
+ "vision_tower.embeddings.cls_token": "model-00001-of-00006.safetensors",
306
+ "vision_tower.embeddings.mask_token": "model-00001-of-00006.safetensors",
307
+ "vision_tower.embeddings.patch_embeddings.projection.bias": "model-00001-of-00006.safetensors",
308
+ "vision_tower.embeddings.patch_embeddings.projection.weight": "model-00001-of-00006.safetensors",
309
+ "vision_tower.embeddings.position_embeddings": "model-00001-of-00006.safetensors",
310
+ "vision_tower.encoder.layer.0.attention.attention.key.bias": "model-00001-of-00006.safetensors",
311
+ "vision_tower.encoder.layer.0.attention.attention.key.weight": "model-00001-of-00006.safetensors",
312
+ "vision_tower.encoder.layer.0.attention.attention.query.bias": "model-00001-of-00006.safetensors",
313
+ "vision_tower.encoder.layer.0.attention.attention.query.weight": "model-00001-of-00006.safetensors",
314
+ "vision_tower.encoder.layer.0.attention.attention.value.bias": "model-00001-of-00006.safetensors",
315
+ "vision_tower.encoder.layer.0.attention.attention.value.weight": "model-00001-of-00006.safetensors",
316
+ "vision_tower.encoder.layer.0.attention.output.dense.bias": "model-00001-of-00006.safetensors",
317
+ "vision_tower.encoder.layer.0.attention.output.dense.weight": "model-00001-of-00006.safetensors",
318
+ "vision_tower.encoder.layer.0.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
319
+ "vision_tower.encoder.layer.0.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
320
+ "vision_tower.encoder.layer.0.mlp.fc1.bias": "model-00001-of-00006.safetensors",
321
+ "vision_tower.encoder.layer.0.mlp.fc1.weight": "model-00001-of-00006.safetensors",
322
+ "vision_tower.encoder.layer.0.mlp.fc2.bias": "model-00001-of-00006.safetensors",
323
+ "vision_tower.encoder.layer.0.mlp.fc2.weight": "model-00001-of-00006.safetensors",
324
+ "vision_tower.encoder.layer.0.norm1.bias": "model-00001-of-00006.safetensors",
325
+ "vision_tower.encoder.layer.0.norm1.weight": "model-00001-of-00006.safetensors",
326
+ "vision_tower.encoder.layer.0.norm2.bias": "model-00001-of-00006.safetensors",
327
+ "vision_tower.encoder.layer.0.norm2.weight": "model-00001-of-00006.safetensors",
328
+ "vision_tower.encoder.layer.1.attention.attention.key.bias": "model-00001-of-00006.safetensors",
329
+ "vision_tower.encoder.layer.1.attention.attention.key.weight": "model-00001-of-00006.safetensors",
330
+ "vision_tower.encoder.layer.1.attention.attention.query.bias": "model-00001-of-00006.safetensors",
331
+ "vision_tower.encoder.layer.1.attention.attention.query.weight": "model-00001-of-00006.safetensors",
332
+ "vision_tower.encoder.layer.1.attention.attention.value.bias": "model-00001-of-00006.safetensors",
333
+ "vision_tower.encoder.layer.1.attention.attention.value.weight": "model-00001-of-00006.safetensors",
334
+ "vision_tower.encoder.layer.1.attention.output.dense.bias": "model-00001-of-00006.safetensors",
335
+ "vision_tower.encoder.layer.1.attention.output.dense.weight": "model-00001-of-00006.safetensors",
336
+ "vision_tower.encoder.layer.1.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
337
+ "vision_tower.encoder.layer.1.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
338
+ "vision_tower.encoder.layer.1.mlp.fc1.bias": "model-00001-of-00006.safetensors",
339
+ "vision_tower.encoder.layer.1.mlp.fc1.weight": "model-00001-of-00006.safetensors",
340
+ "vision_tower.encoder.layer.1.mlp.fc2.bias": "model-00001-of-00006.safetensors",
341
+ "vision_tower.encoder.layer.1.mlp.fc2.weight": "model-00001-of-00006.safetensors",
342
+ "vision_tower.encoder.layer.1.norm1.bias": "model-00001-of-00006.safetensors",
343
+ "vision_tower.encoder.layer.1.norm1.weight": "model-00001-of-00006.safetensors",
344
+ "vision_tower.encoder.layer.1.norm2.bias": "model-00001-of-00006.safetensors",
345
+ "vision_tower.encoder.layer.1.norm2.weight": "model-00001-of-00006.safetensors",
346
+ "vision_tower.encoder.layer.10.attention.attention.key.bias": "model-00001-of-00006.safetensors",
347
+ "vision_tower.encoder.layer.10.attention.attention.key.weight": "model-00001-of-00006.safetensors",
348
+ "vision_tower.encoder.layer.10.attention.attention.query.bias": "model-00001-of-00006.safetensors",
349
+ "vision_tower.encoder.layer.10.attention.attention.query.weight": "model-00001-of-00006.safetensors",
350
+ "vision_tower.encoder.layer.10.attention.attention.value.bias": "model-00001-of-00006.safetensors",
351
+ "vision_tower.encoder.layer.10.attention.attention.value.weight": "model-00001-of-00006.safetensors",
352
+ "vision_tower.encoder.layer.10.attention.output.dense.bias": "model-00001-of-00006.safetensors",
353
+ "vision_tower.encoder.layer.10.attention.output.dense.weight": "model-00001-of-00006.safetensors",
354
+ "vision_tower.encoder.layer.10.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
355
+ "vision_tower.encoder.layer.10.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
356
+ "vision_tower.encoder.layer.10.mlp.fc1.bias": "model-00001-of-00006.safetensors",
357
+ "vision_tower.encoder.layer.10.mlp.fc1.weight": "model-00001-of-00006.safetensors",
358
+ "vision_tower.encoder.layer.10.mlp.fc2.bias": "model-00001-of-00006.safetensors",
359
+ "vision_tower.encoder.layer.10.mlp.fc2.weight": "model-00001-of-00006.safetensors",
360
+ "vision_tower.encoder.layer.10.norm1.bias": "model-00001-of-00006.safetensors",
361
+ "vision_tower.encoder.layer.10.norm1.weight": "model-00001-of-00006.safetensors",
362
+ "vision_tower.encoder.layer.10.norm2.bias": "model-00001-of-00006.safetensors",
363
+ "vision_tower.encoder.layer.10.norm2.weight": "model-00001-of-00006.safetensors",
364
+ "vision_tower.encoder.layer.11.attention.attention.key.bias": "model-00001-of-00006.safetensors",
365
+ "vision_tower.encoder.layer.11.attention.attention.key.weight": "model-00001-of-00006.safetensors",
366
+ "vision_tower.encoder.layer.11.attention.attention.query.bias": "model-00001-of-00006.safetensors",
367
+ "vision_tower.encoder.layer.11.attention.attention.query.weight": "model-00001-of-00006.safetensors",
368
+ "vision_tower.encoder.layer.11.attention.attention.value.bias": "model-00001-of-00006.safetensors",
369
+ "vision_tower.encoder.layer.11.attention.attention.value.weight": "model-00001-of-00006.safetensors",
370
+ "vision_tower.encoder.layer.11.attention.output.dense.bias": "model-00001-of-00006.safetensors",
371
+ "vision_tower.encoder.layer.11.attention.output.dense.weight": "model-00001-of-00006.safetensors",
372
+ "vision_tower.encoder.layer.11.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
373
+ "vision_tower.encoder.layer.11.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
374
+ "vision_tower.encoder.layer.11.mlp.fc1.bias": "model-00001-of-00006.safetensors",
375
+ "vision_tower.encoder.layer.11.mlp.fc1.weight": "model-00001-of-00006.safetensors",
376
+ "vision_tower.encoder.layer.11.mlp.fc2.bias": "model-00001-of-00006.safetensors",
377
+ "vision_tower.encoder.layer.11.mlp.fc2.weight": "model-00001-of-00006.safetensors",
378
+ "vision_tower.encoder.layer.11.norm1.bias": "model-00001-of-00006.safetensors",
379
+ "vision_tower.encoder.layer.11.norm1.weight": "model-00001-of-00006.safetensors",
380
+ "vision_tower.encoder.layer.11.norm2.bias": "model-00001-of-00006.safetensors",
381
+ "vision_tower.encoder.layer.11.norm2.weight": "model-00001-of-00006.safetensors",
382
+ "vision_tower.encoder.layer.2.attention.attention.key.bias": "model-00001-of-00006.safetensors",
383
+ "vision_tower.encoder.layer.2.attention.attention.key.weight": "model-00001-of-00006.safetensors",
384
+ "vision_tower.encoder.layer.2.attention.attention.query.bias": "model-00001-of-00006.safetensors",
385
+ "vision_tower.encoder.layer.2.attention.attention.query.weight": "model-00001-of-00006.safetensors",
386
+ "vision_tower.encoder.layer.2.attention.attention.value.bias": "model-00001-of-00006.safetensors",
387
+ "vision_tower.encoder.layer.2.attention.attention.value.weight": "model-00001-of-00006.safetensors",
388
+ "vision_tower.encoder.layer.2.attention.output.dense.bias": "model-00001-of-00006.safetensors",
389
+ "vision_tower.encoder.layer.2.attention.output.dense.weight": "model-00001-of-00006.safetensors",
390
+ "vision_tower.encoder.layer.2.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
391
+ "vision_tower.encoder.layer.2.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
392
+ "vision_tower.encoder.layer.2.mlp.fc1.bias": "model-00001-of-00006.safetensors",
393
+ "vision_tower.encoder.layer.2.mlp.fc1.weight": "model-00001-of-00006.safetensors",
394
+ "vision_tower.encoder.layer.2.mlp.fc2.bias": "model-00001-of-00006.safetensors",
395
+ "vision_tower.encoder.layer.2.mlp.fc2.weight": "model-00001-of-00006.safetensors",
396
+ "vision_tower.encoder.layer.2.norm1.bias": "model-00001-of-00006.safetensors",
397
+ "vision_tower.encoder.layer.2.norm1.weight": "model-00001-of-00006.safetensors",
398
+ "vision_tower.encoder.layer.2.norm2.bias": "model-00001-of-00006.safetensors",
399
+ "vision_tower.encoder.layer.2.norm2.weight": "model-00001-of-00006.safetensors",
400
+ "vision_tower.encoder.layer.3.attention.attention.key.bias": "model-00001-of-00006.safetensors",
401
+ "vision_tower.encoder.layer.3.attention.attention.key.weight": "model-00001-of-00006.safetensors",
402
+ "vision_tower.encoder.layer.3.attention.attention.query.bias": "model-00001-of-00006.safetensors",
403
+ "vision_tower.encoder.layer.3.attention.attention.query.weight": "model-00001-of-00006.safetensors",
404
+ "vision_tower.encoder.layer.3.attention.attention.value.bias": "model-00001-of-00006.safetensors",
405
+ "vision_tower.encoder.layer.3.attention.attention.value.weight": "model-00001-of-00006.safetensors",
406
+ "vision_tower.encoder.layer.3.attention.output.dense.bias": "model-00001-of-00006.safetensors",
407
+ "vision_tower.encoder.layer.3.attention.output.dense.weight": "model-00001-of-00006.safetensors",
408
+ "vision_tower.encoder.layer.3.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
409
+ "vision_tower.encoder.layer.3.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
410
+ "vision_tower.encoder.layer.3.mlp.fc1.bias": "model-00001-of-00006.safetensors",
411
+ "vision_tower.encoder.layer.3.mlp.fc1.weight": "model-00001-of-00006.safetensors",
412
+ "vision_tower.encoder.layer.3.mlp.fc2.bias": "model-00001-of-00006.safetensors",
413
+ "vision_tower.encoder.layer.3.mlp.fc2.weight": "model-00001-of-00006.safetensors",
414
+ "vision_tower.encoder.layer.3.norm1.bias": "model-00001-of-00006.safetensors",
415
+ "vision_tower.encoder.layer.3.norm1.weight": "model-00001-of-00006.safetensors",
416
+ "vision_tower.encoder.layer.3.norm2.bias": "model-00001-of-00006.safetensors",
417
+ "vision_tower.encoder.layer.3.norm2.weight": "model-00001-of-00006.safetensors",
418
+ "vision_tower.encoder.layer.4.attention.attention.key.bias": "model-00001-of-00006.safetensors",
419
+ "vision_tower.encoder.layer.4.attention.attention.key.weight": "model-00001-of-00006.safetensors",
420
+ "vision_tower.encoder.layer.4.attention.attention.query.bias": "model-00001-of-00006.safetensors",
421
+ "vision_tower.encoder.layer.4.attention.attention.query.weight": "model-00001-of-00006.safetensors",
422
+ "vision_tower.encoder.layer.4.attention.attention.value.bias": "model-00001-of-00006.safetensors",
423
+ "vision_tower.encoder.layer.4.attention.attention.value.weight": "model-00001-of-00006.safetensors",
424
+ "vision_tower.encoder.layer.4.attention.output.dense.bias": "model-00001-of-00006.safetensors",
425
+ "vision_tower.encoder.layer.4.attention.output.dense.weight": "model-00001-of-00006.safetensors",
426
+ "vision_tower.encoder.layer.4.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
427
+ "vision_tower.encoder.layer.4.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
428
+ "vision_tower.encoder.layer.4.mlp.fc1.bias": "model-00001-of-00006.safetensors",
429
+ "vision_tower.encoder.layer.4.mlp.fc1.weight": "model-00001-of-00006.safetensors",
430
+ "vision_tower.encoder.layer.4.mlp.fc2.bias": "model-00001-of-00006.safetensors",
431
+ "vision_tower.encoder.layer.4.mlp.fc2.weight": "model-00001-of-00006.safetensors",
432
+ "vision_tower.encoder.layer.4.norm1.bias": "model-00001-of-00006.safetensors",
433
+ "vision_tower.encoder.layer.4.norm1.weight": "model-00001-of-00006.safetensors",
434
+ "vision_tower.encoder.layer.4.norm2.bias": "model-00001-of-00006.safetensors",
435
+ "vision_tower.encoder.layer.4.norm2.weight": "model-00001-of-00006.safetensors",
436
+ "vision_tower.encoder.layer.5.attention.attention.key.bias": "model-00001-of-00006.safetensors",
437
+ "vision_tower.encoder.layer.5.attention.attention.key.weight": "model-00001-of-00006.safetensors",
438
+ "vision_tower.encoder.layer.5.attention.attention.query.bias": "model-00001-of-00006.safetensors",
439
+ "vision_tower.encoder.layer.5.attention.attention.query.weight": "model-00001-of-00006.safetensors",
440
+ "vision_tower.encoder.layer.5.attention.attention.value.bias": "model-00001-of-00006.safetensors",
441
+ "vision_tower.encoder.layer.5.attention.attention.value.weight": "model-00001-of-00006.safetensors",
442
+ "vision_tower.encoder.layer.5.attention.output.dense.bias": "model-00001-of-00006.safetensors",
443
+ "vision_tower.encoder.layer.5.attention.output.dense.weight": "model-00001-of-00006.safetensors",
444
+ "vision_tower.encoder.layer.5.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
445
+ "vision_tower.encoder.layer.5.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
446
+ "vision_tower.encoder.layer.5.mlp.fc1.bias": "model-00001-of-00006.safetensors",
447
+ "vision_tower.encoder.layer.5.mlp.fc1.weight": "model-00001-of-00006.safetensors",
448
+ "vision_tower.encoder.layer.5.mlp.fc2.bias": "model-00001-of-00006.safetensors",
449
+ "vision_tower.encoder.layer.5.mlp.fc2.weight": "model-00001-of-00006.safetensors",
450
+ "vision_tower.encoder.layer.5.norm1.bias": "model-00001-of-00006.safetensors",
451
+ "vision_tower.encoder.layer.5.norm1.weight": "model-00001-of-00006.safetensors",
452
+ "vision_tower.encoder.layer.5.norm2.bias": "model-00001-of-00006.safetensors",
453
+ "vision_tower.encoder.layer.5.norm2.weight": "model-00001-of-00006.safetensors",
454
+ "vision_tower.encoder.layer.6.attention.attention.key.bias": "model-00001-of-00006.safetensors",
455
+ "vision_tower.encoder.layer.6.attention.attention.key.weight": "model-00001-of-00006.safetensors",
456
+ "vision_tower.encoder.layer.6.attention.attention.query.bias": "model-00001-of-00006.safetensors",
457
+ "vision_tower.encoder.layer.6.attention.attention.query.weight": "model-00001-of-00006.safetensors",
458
+ "vision_tower.encoder.layer.6.attention.attention.value.bias": "model-00001-of-00006.safetensors",
459
+ "vision_tower.encoder.layer.6.attention.attention.value.weight": "model-00001-of-00006.safetensors",
460
+ "vision_tower.encoder.layer.6.attention.output.dense.bias": "model-00001-of-00006.safetensors",
461
+ "vision_tower.encoder.layer.6.attention.output.dense.weight": "model-00001-of-00006.safetensors",
462
+ "vision_tower.encoder.layer.6.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
463
+ "vision_tower.encoder.layer.6.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
464
+ "vision_tower.encoder.layer.6.mlp.fc1.bias": "model-00001-of-00006.safetensors",
465
+ "vision_tower.encoder.layer.6.mlp.fc1.weight": "model-00001-of-00006.safetensors",
466
+ "vision_tower.encoder.layer.6.mlp.fc2.bias": "model-00001-of-00006.safetensors",
467
+ "vision_tower.encoder.layer.6.mlp.fc2.weight": "model-00001-of-00006.safetensors",
468
+ "vision_tower.encoder.layer.6.norm1.bias": "model-00001-of-00006.safetensors",
469
+ "vision_tower.encoder.layer.6.norm1.weight": "model-00001-of-00006.safetensors",
470
+ "vision_tower.encoder.layer.6.norm2.bias": "model-00001-of-00006.safetensors",
471
+ "vision_tower.encoder.layer.6.norm2.weight": "model-00001-of-00006.safetensors",
472
+ "vision_tower.encoder.layer.7.attention.attention.key.bias": "model-00001-of-00006.safetensors",
473
+ "vision_tower.encoder.layer.7.attention.attention.key.weight": "model-00001-of-00006.safetensors",
474
+ "vision_tower.encoder.layer.7.attention.attention.query.bias": "model-00001-of-00006.safetensors",
475
+ "vision_tower.encoder.layer.7.attention.attention.query.weight": "model-00001-of-00006.safetensors",
476
+ "vision_tower.encoder.layer.7.attention.attention.value.bias": "model-00001-of-00006.safetensors",
477
+ "vision_tower.encoder.layer.7.attention.attention.value.weight": "model-00001-of-00006.safetensors",
478
+ "vision_tower.encoder.layer.7.attention.output.dense.bias": "model-00001-of-00006.safetensors",
479
+ "vision_tower.encoder.layer.7.attention.output.dense.weight": "model-00001-of-00006.safetensors",
480
+ "vision_tower.encoder.layer.7.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
481
+ "vision_tower.encoder.layer.7.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
482
+ "vision_tower.encoder.layer.7.mlp.fc1.bias": "model-00001-of-00006.safetensors",
483
+ "vision_tower.encoder.layer.7.mlp.fc1.weight": "model-00001-of-00006.safetensors",
484
+ "vision_tower.encoder.layer.7.mlp.fc2.bias": "model-00001-of-00006.safetensors",
485
+ "vision_tower.encoder.layer.7.mlp.fc2.weight": "model-00001-of-00006.safetensors",
486
+ "vision_tower.encoder.layer.7.norm1.bias": "model-00001-of-00006.safetensors",
487
+ "vision_tower.encoder.layer.7.norm1.weight": "model-00001-of-00006.safetensors",
488
+ "vision_tower.encoder.layer.7.norm2.bias": "model-00001-of-00006.safetensors",
489
+ "vision_tower.encoder.layer.7.norm2.weight": "model-00001-of-00006.safetensors",
490
+ "vision_tower.encoder.layer.8.attention.attention.key.bias": "model-00001-of-00006.safetensors",
491
+ "vision_tower.encoder.layer.8.attention.attention.key.weight": "model-00001-of-00006.safetensors",
492
+ "vision_tower.encoder.layer.8.attention.attention.query.bias": "model-00001-of-00006.safetensors",
493
+ "vision_tower.encoder.layer.8.attention.attention.query.weight": "model-00001-of-00006.safetensors",
494
+ "vision_tower.encoder.layer.8.attention.attention.value.bias": "model-00001-of-00006.safetensors",
495
+ "vision_tower.encoder.layer.8.attention.attention.value.weight": "model-00001-of-00006.safetensors",
496
+ "vision_tower.encoder.layer.8.attention.output.dense.bias": "model-00001-of-00006.safetensors",
497
+ "vision_tower.encoder.layer.8.attention.output.dense.weight": "model-00001-of-00006.safetensors",
498
+ "vision_tower.encoder.layer.8.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
499
+ "vision_tower.encoder.layer.8.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
500
+ "vision_tower.encoder.layer.8.mlp.fc1.bias": "model-00001-of-00006.safetensors",
501
+ "vision_tower.encoder.layer.8.mlp.fc1.weight": "model-00001-of-00006.safetensors",
502
+ "vision_tower.encoder.layer.8.mlp.fc2.bias": "model-00001-of-00006.safetensors",
503
+ "vision_tower.encoder.layer.8.mlp.fc2.weight": "model-00001-of-00006.safetensors",
504
+ "vision_tower.encoder.layer.8.norm1.bias": "model-00001-of-00006.safetensors",
505
+ "vision_tower.encoder.layer.8.norm1.weight": "model-00001-of-00006.safetensors",
506
+ "vision_tower.encoder.layer.8.norm2.bias": "model-00001-of-00006.safetensors",
507
+ "vision_tower.encoder.layer.8.norm2.weight": "model-00001-of-00006.safetensors",
508
+ "vision_tower.encoder.layer.9.attention.attention.key.bias": "model-00001-of-00006.safetensors",
509
+ "vision_tower.encoder.layer.9.attention.attention.key.weight": "model-00001-of-00006.safetensors",
510
+ "vision_tower.encoder.layer.9.attention.attention.query.bias": "model-00001-of-00006.safetensors",
511
+ "vision_tower.encoder.layer.9.attention.attention.query.weight": "model-00001-of-00006.safetensors",
512
+ "vision_tower.encoder.layer.9.attention.attention.value.bias": "model-00001-of-00006.safetensors",
513
+ "vision_tower.encoder.layer.9.attention.attention.value.weight": "model-00001-of-00006.safetensors",
514
+ "vision_tower.encoder.layer.9.attention.output.dense.bias": "model-00001-of-00006.safetensors",
515
+ "vision_tower.encoder.layer.9.attention.output.dense.weight": "model-00001-of-00006.safetensors",
516
+ "vision_tower.encoder.layer.9.layer_scale1.lambda1": "model-00001-of-00006.safetensors",
517
+ "vision_tower.encoder.layer.9.layer_scale2.lambda1": "model-00001-of-00006.safetensors",
518
+ "vision_tower.encoder.layer.9.mlp.fc1.bias": "model-00001-of-00006.safetensors",
519
+ "vision_tower.encoder.layer.9.mlp.fc1.weight": "model-00001-of-00006.safetensors",
520
+ "vision_tower.encoder.layer.9.mlp.fc2.bias": "model-00001-of-00006.safetensors",
521
+ "vision_tower.encoder.layer.9.mlp.fc2.weight": "model-00001-of-00006.safetensors",
522
+ "vision_tower.encoder.layer.9.norm1.bias": "model-00001-of-00006.safetensors",
523
+ "vision_tower.encoder.layer.9.norm1.weight": "model-00001-of-00006.safetensors",
524
+ "vision_tower.encoder.layer.9.norm2.bias": "model-00001-of-00006.safetensors",
525
+ "vision_tower.encoder.layer.9.norm2.weight": "model-00001-of-00006.safetensors",
526
+ "vision_tower.layernorm.bias": "model-00001-of-00006.safetensors",
527
+ "vision_tower.layernorm.weight": "model-00001-of-00006.safetensors"
528
+ }
529
+ }
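The weight_map above tells loaders which of the six safetensors shards holds each tensor. A minimal sketch of resolving one tensor by hand, assuming the shards and the index file have been downloaded to a local directory (the directory name and tensor choice below are illustrative, not part of this commit):

import json
from pathlib import Path

from safetensors import safe_open

checkpoint_dir = Path("maira-2-checkpoint")  # hypothetical local download location
index = json.loads((checkpoint_dir / "model.safetensors.index.json").read_text())

# Look up which shard holds the final language-model norm, then read just that tensor.
tensor_name = "language_model.model.norm.weight"
shard_file = index["weight_map"][tensor_name]  # "model-00006-of-00006.safetensors"
with safe_open(str(checkpoint_dir / shard_file), framework="pt") as f:
    tensor = f.get_tensor(tensor_name)
print(tensor.shape)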
modeling_maira2.py ADDED
@@ -0,0 +1,112 @@
1
+ # Copyright 2024 Microsoft. All rights reserved.
2
+ # Licensed under the MSRLA License. See LICENSE in the repo root for license information.
3
+
4
+
5
+ from typing import Any
6
+
7
+ import torch
8
+ from torch.nn import Linear, Module, Sequential
9
+ from transformers import (
10
+ AutoBackbone,
11
+ AutoModelForCausalLM,
12
+ LlavaForConditionalGeneration,
13
+ LlavaPreTrainedModel,
14
+ )
15
+ from transformers.activations import ACT2FN
16
+ from transformers.utils import check_min_version
17
+
18
+ from .configuration_maira2 import Maira2Config
19
+
20
+
21
+ class Maira2MultiModalProjector(Module):
22
+ """
23
+ This class implements the multimodal projector for the MAIRA-2 model. It projects the image features to the text
24
+ hidden size via a series of linear layers (4 layers in MAIRA-2).
25
+ """
26
+
27
+ def __init__(self, config: Maira2Config):
28
+ super().__init__()
29
+
30
+ n_layers = config.projector_n_layers
31
+ if n_layers < 1:
32
+ raise ValueError(f"Number of layers should be at least 1, got {n_layers=}")
33
+ text_hidden_size = config.text_config.hidden_size
34
+ vision_hidden_size = config.vision_config.hidden_size
35
+ _layers = [Linear(vision_hidden_size, text_hidden_size, bias=True)]
36
+ for _ in range(n_layers - 1):
37
+ _layers.append(ACT2FN[config.projector_hidden_act])
38
+ _layers.append(Linear(text_hidden_size, text_hidden_size, bias=True))
39
+
40
+ self.layers = Sequential(*_layers)
41
+
42
+ def forward(self, image_features: torch.Tensor) -> torch.FloatTensor:
43
+ hidden_states = self.layers(image_features)
44
+ return hidden_states # type: ignore[no-any-return]
45
+
46
+
47
+ class Maira2ForConditionalGeneration(LlavaForConditionalGeneration):
48
+ """
49
+ This model implements the multimodal model MAIRA-2. It consists of a vision backbone, a multimodal projector, and a
50
+ language model. The model can be used for grounded and ungrounded report generation tasks as well as phrase grounding.
51
+ This class inherits from `LlavaForConditionalGeneration`, defining a custom multimodal projector and changing image
52
+ feature selection.
53
+ """
54
+
55
+ config_class = Maira2Config
56
+
57
+ def __init__(self, config: Maira2Config) -> None:
58
+ # Check transformers version is at least 4.46.0.dev0, otherwise the model fails
59
+ # silently since get_image_features is not called in the forward pass
60
+ check_min_version("4.46.0.dev0")
61
+
62
+ super(LlavaPreTrainedModel, self).__init__(config)
63
+ self.vision_tower = AutoBackbone.from_config(config.vision_config)
64
+
65
+ self.multi_modal_projector = Maira2MultiModalProjector(config)
66
+ self.vocab_size = config.text_config.vocab_size
67
+ self.language_model = AutoModelForCausalLM.from_config(
68
+ config.text_config,
69
+ attn_implementation=config._attn_implementation,
70
+ )
71
+ self.pad_token_id = (
72
+ self.config.pad_token_id if self.config.pad_token_id is not None else -1
73
+ )
74
+ self.post_init()
75
+
76
+ def get_image_features(
77
+ self,
78
+ pixel_values: torch.FloatTensor,
79
+ vision_feature_layer: int | list[int],
80
+ vision_feature_select_strategy: str,
81
+ **kwargs: Any,
82
+ ) -> torch.Tensor:
83
+ """
84
+ This method extracts the image features from the vision backbone using the specified feature layer and
85
+ selection strategy. This is custom to the MAIRA-2 model, since we want to use the `feature_maps` from the Dinov2Backbone
86
+ class instead of the `hidden_states`, which are used in the default implementation of `get_image_features` in LlavaForConditionalGeneration.
87
+ The feature_maps returned by Dinov2Backbone are the hidden_states with a layernorm applied to them.
88
+ """
89
+ if isinstance(vision_feature_layer, list):
90
+ raise ValueError(
91
+ "MAIRA-2 does not support list values for vision_feature_layer."
92
+ )
93
+
94
+ if vision_feature_select_strategy not in ["default", "full"]:
95
+ raise ValueError(
96
+ f"Unexpected select feature strategy: {self.config.vision_feature_select_strategy}"
97
+ )
98
+
99
+ extra_kwargs = {k: v for k, v in kwargs.items() if v is not None}
100
+ if extra_kwargs:
101
+ raise ValueError(
102
+ f"MAIRA-2 does not support passing extra kwargs to the vision tower, received: {extra_kwargs}"
103
+ )
104
+ image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
105
+
106
+ selected_image_feature = image_outputs.feature_maps[vision_feature_layer]
107
+
108
+ if vision_feature_select_strategy == "default":
109
+ selected_image_feature = selected_image_feature[:, 1:]
110
+
111
+ image_features = self.multi_modal_projector(selected_image_feature)
112
+ return image_features # type: ignore[no-any-return]
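Because the modeling code ships with the checkpoint, the model is loaded through the Transformers auto classes with remote code enabled. A minimal loading sketch; the repo id "microsoft/maira-2" is assumed here, and transformers >= 4.46 is required per the check_min_version call above:

from transformers import AutoModelForCausalLM, AutoProcessor

# trust_remote_code pulls in modeling_maira2.py / processing_maira2.py from the repo.
model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)
model.eval()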
preprocessor_config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_maira2.Maira2Processor"
4
+ },
5
+ "crop_size": {
6
+ "height": 518,
7
+ "width": 518
8
+ },
9
+ "do_center_crop": true,
10
+ "do_convert_rgb": true,
11
+ "do_normalize": true,
12
+ "do_rescale": true,
13
+ "do_resize": true,
14
+ "image_mean": [
15
+ 0.5307,
16
+ 0.5307,
17
+ 0.5307
18
+ ],
19
+ "image_processor_type": "BitImageProcessor",
20
+ "image_std": [
21
+ 0.2583,
22
+ 0.2583,
23
+ 0.2583
24
+ ],
25
+ "processor_class": "Maira2Processor",
26
+ "resample": 3,
27
+ "rescale_factor": 0.00392156862745098,
28
+ "size": {
29
+ "shortest_edge": 518
30
+ }
31
+ }
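In this config, resample=3 is PIL bicubic and rescale_factor is 1/255, so BitImageProcessor resizes the shortest edge to 518, center-crops to 518x518, scales pixels to [0, 1], and normalizes every channel with the same mean/std. A sketch of the equivalent arithmetic on an already-cropped array (illustrative, not part of the commit):

import numpy as np

x = np.random.randint(0, 256, (518, 518, 3)).astype(np.float32)  # cropped RGB image
x = x * 0.00392156862745098   # rescale_factor: maps [0, 255] to [0, 1]
x = (x - 0.5307) / 0.2583     # image_mean / image_std from the config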
processing_maira2.py ADDED
@@ -0,0 +1,649 @@
1
+ # Copyright 2024 Microsoft. All rights reserved.
2
+ # Licensed under the MSRLA License. See LICENSE in the repo root for license information.
3
+
4
+
5
+ import re
6
+ from typing import Any, TypeAlias
7
+
8
+ import numpy as np
9
+ from PIL import Image
10
+ from transformers import BaseImageProcessor, LlavaProcessor, PreTrainedTokenizer
11
+ from transformers.feature_extraction_utils import BatchFeature
12
+
13
+ SingleChatMessageType: TypeAlias = dict[str, str | int | None]
14
+ ChatMessageListType: TypeAlias = list[dict[str, str | list[SingleChatMessageType]]]
15
+ BoxType: TypeAlias = tuple[float, float, float, float]
16
+
17
+
18
+ class Maira2Processor(LlavaProcessor):
19
+ """
20
+ Constructs a Maira2 processor similar to LlavaProcessor but with additional arguments and functions to support
21
+ multi-image grounded and non-grounded radiology report generation.
22
+
23
+ In addition to the arguments of LlavaProcessor, Maira2Processor has the following extra arguments:
24
+
25
+ Args:
26
+ phrase_start_token (`str`, *optional*, defaults to `"<obj>"`):
27
+ Special token used to denote the start of a grounded phrase (with or without box).
28
+ phrase_end_token (`str`, *optional*, defaults to `"</obj>"`):
29
+ Special token used to denote the end of a grounded phrase.
30
+ box_start_token (`str`, *optional*, defaults to `"<box>"`):
31
+ Special token used to denote the start of a bounding box.
32
+ box_end_token (`str`, *optional*, defaults to `"</box>"`):
33
+ Special token used to denote the end of a bounding box.
34
+ num_box_coord_bins (`int`, *optional*, defaults to `100`):
35
+ Number of bins used to represent the bounding box coordinates.
36
+ """
37
+
38
+ valid_kwargs = [
39
+ "chat_template",
40
+ "patch_size",
41
+ "vision_feature_select_strategy",
42
+ "image_token",
43
+ "num_additional_image_tokens",
44
+ "phrase_start_token",
45
+ "phrase_end_token",
46
+ "box_start_token",
47
+ "box_end_token",
48
+ "num_box_coord_bins",
49
+ ]
50
+
51
+ def __init__(
52
+ self,
53
+ image_processor: BaseImageProcessor | None = None,
54
+ tokenizer: PreTrainedTokenizer | None = None,
55
+ patch_size: int | None = None,
56
+ vision_feature_select_strategy: str | None = None,
57
+ chat_template: str | None = None,
58
+ image_token: str = "<image>",
59
+ num_additional_image_tokens: int = 1,
60
+ phrase_start_token: str = "<obj>",
61
+ phrase_end_token: str = "</obj>",
62
+ box_start_token: str = "<box>",
63
+ box_end_token: str = "</box>",
64
+ num_box_coord_bins: int = 100,
65
+ **kwargs: Any,
66
+ ) -> None:
67
+ super().__init__(
68
+ image_processor=image_processor,
69
+ tokenizer=tokenizer,
70
+ patch_size=patch_size,
71
+ vision_feature_select_strategy=vision_feature_select_strategy,
72
+ chat_template=chat_template,
73
+ image_token=image_token,
74
+ num_additional_image_tokens=num_additional_image_tokens,
75
+ **kwargs,
76
+ )
77
+
78
+ self.phrase_start_token = phrase_start_token
79
+ self.phrase_end_token = phrase_end_token
80
+ self.box_start_token = box_start_token
81
+ self.box_end_token = box_end_token
82
+ self.num_box_coord_bins = num_box_coord_bins
83
+
84
+ @staticmethod
85
+ def _normalize_image(image: Image.Image) -> Image.Image:
86
+ """
87
+ This function normalizes the input image to have pixel values in the range [0, 255].
88
+
89
+ Args:
90
+ image (Image.Image):
91
+ The input image to be normalized.
92
+
93
+ Returns:
94
+ Image.Image: The normalized image in grayscale.
95
+ """
96
+ image_np = np.array(image.convert("L"))
97
+ image_np = image_np.astype(float)
98
+ image_np -= image_np.min()
99
+ image_np /= image_np.max()  # assumes a non-constant image; a constant image would divide by zero here
100
+ image_np *= 255
101
+ image_np = image_np.astype(np.uint8)
102
+
103
+ return Image.fromarray(image_np).convert("L")
104
+
105
+ def _normalize_and_stack_images(
106
+ self,
107
+ current_frontal: Image.Image,
108
+ current_lateral: Image.Image | None,
109
+ prior_frontal: Image.Image | None,
110
+ ) -> list[Image.Image]:
111
+ """
112
+ This function normalizes the input images and stacks them together. The images are stacked in the order of
113
+ current_frontal, current_lateral, and prior_frontal. The order of images is important, since it must match the
114
+ order of the images in the prompt, which is frontal, then lateral, then prior.
115
+
116
+ Args:
117
+ current_frontal (Image.Image):
118
+ The current frontal image.
119
+ current_lateral (Image.Image | None):
120
+ The current lateral image.
121
+ prior_frontal (Image.Image | None):
122
+ The prior frontal image.
123
+
124
+ Returns:
125
+ list[Image.Image]: The normalized images stacked together.
126
+ """
127
+ images = [self._normalize_image(current_frontal)]
128
+ if current_lateral is not None:
129
+ images.append(self._normalize_image(current_lateral))
130
+ if prior_frontal is not None:
131
+ images.append(self._normalize_image(prior_frontal))
132
+ return images
133
+
134
+ @staticmethod
135
+ def _get_section_text_or_missing_text(section: str | None) -> str:
136
+ """
137
+ This function returns the input section text if it is not None and not empty, otherwise it returns a missing
138
+ section text "N/A".
139
+
140
+ Args:
141
+ section (str | None):
142
+ The input section text.
143
+
144
+ Returns:
145
+ str: The section text if it is not None and not empty, otherwise "N/A".
146
+ """
147
+ missing_section_text = "N/A"
148
+ if not isinstance(section, str) or len(section) == 0:
149
+ return missing_section_text
150
+ return section
151
+
152
+ @staticmethod
153
+ def _construct_image_chat_messages_for_reporting(has_prior: bool, has_lateral: bool) -> list[SingleChatMessageType]:
154
+ """
155
+ This function constructs user chat messages based on the presence of the prior and lateral images.
156
+
157
+ Args:
158
+ has_prior (bool):
159
+ A boolean indicating whether the prior image is present.
160
+ has_lateral (bool):
161
+ A boolean indicating whether the lateral image is present.
162
+
163
+ Returns:
164
+ list[SingleChatMessageType]: The image prompt messages in the form of a list of dictionaries.
165
+
166
+ Example:
167
+
168
+ ```python
169
+ >>> _construct_image_chat_messages_for_reporting(has_prior=True, has_lateral=True)
170
+ >>> # [
171
+ >>> # {"index": None, "text": "Given the current frontal image", "type": "text"},
172
+ >>> # {"index": 0, "text": None, "type": "image"},
173
+ >>> # {"index": None, "text": " the current lateral image", "type": "text"},
174
+ >>> # {"index": 1, "text": None, "type": "image"},
175
+ >>> # {"index": None, "text": " and the prior frontal image", "type": "text"},
176
+ >>> # {"index": 2, "text": None, "type": "image"},
177
+ >>> # ]
178
+ ```
179
+ """
180
+
181
+ def _add_single_image_to_chat_messages(prompt_text: str, image_index: int) -> None:
182
+ image_prompt.extend(
183
+ [
184
+ {"index": None, "text": prompt_text, "type": "text"},
185
+ {"index": image_index, "text": None, "type": "image"},
186
+ ]
187
+ )
188
+
189
+ image_prompt: list[SingleChatMessageType] = []
190
+ image_index = 0
191
+ if not has_prior and not has_lateral:
192
+ _add_single_image_to_chat_messages("Given the current frontal image only", image_index)
193
+ else:
194
+ _add_single_image_to_chat_messages("Given the current frontal image", image_index)
195
+ image_index += 1
196
+ if has_prior:
197
+ if has_lateral:
198
+ _add_single_image_to_chat_messages(" the current lateral image", image_index)
199
+ image_index += 1
200
+ _add_single_image_to_chat_messages(" and the prior frontal image", image_index)
201
+ else:
202
+ if has_lateral:
203
+ _add_single_image_to_chat_messages(" and the current lateral image", image_index)
204
+ return image_prompt
205
+
206
+ def _construct_chat_messages_reporting(
207
+ self,
208
+ has_prior: bool,
209
+ has_lateral: bool,
210
+ indication: str | None,
211
+ technique: str | None,
212
+ comparison: str | None,
213
+ prior_report: str | None,
214
+ get_grounding: bool = False,
215
+ assistant_text: str | None = None,
216
+ ) -> ChatMessageListType:
217
+ """
218
+ This function constructs the chat messages for reporting used in the grounded and non-grounded reporting tasks.
219
+
220
+ Args:
221
+ has_prior (bool):
222
+ A boolean indicating whether the prior image is present.
223
+ has_lateral (bool):
224
+ A boolean indicating whether the lateral image is present.
225
+ indication (str | None):
226
+ The indication section text.
227
+ technique (str | None):
228
+ The technique section text.
229
+ comparison (str | None):
230
+ The comparison section text.
231
+ prior_report (str | None):
232
+ The prior report section text.
233
+ get_grounding (bool):
234
+ A boolean indicating whether to get the grounding information.
235
+ assistant_text (str | None):
236
+ The assistant text (can be set to None for ordinary inference).
237
+
238
+ Returns:
239
+ ChatMessageListType: The chat messages for reporting in the form of a list of dictionaries.
240
+
241
+ Example:
242
+
243
+ ```python
244
+ >>> _construct_chat_messages_reporting(
245
+ >>> has_prior=True,
246
+ >>> has_lateral=True,
247
+ >>> indication="indication text from report goes here",
248
+ >>> technique="technique text from report goes here",
249
+ >>> comparison="comparison text from report goes here",
250
+ >>> prior_report="prior reporting text goes here",
251
+ >>> get_grounding=False,
252
+ >>> assistant_text=None,
253
+ >>> )
254
+ >>> # [{"role": "user", "content": [
255
+ >>> # {"index": None, "text": "Given the current frontal image", "type": "text"},
256
+ >>> # {"index": 0, "text": None, "type": "image"},
257
+ >>> # {"index": None, "text": " the current lateral image", "type": "text"},
258
+ >>> # {"index": 1, "text": None, "type": "image"},
259
+ >>> # {"index": None, "text": " and the prior frontal image", "type": "text"},
260
+ >>> # {"index": 2, "text": None, "type": "image"},
261
+ >>> # {"index": None, "text": " PRIOR_REPORT: prior reporting text goes here", "type": "text"},
262
+ >>> # {"index": None, "text": " Provide a description of the findings in the radiology study in comparison to the "
263
+ >>> # "prior frontal image. INDICATION: indication text from report goes here TECHNIQUE: technique text from report "
264
+ >>> # "goes here COMPARISON: comparison text from report goes here", "type": "text"},
265
+ >>> # ]}]
266
+ ```
267
+ """
268
+ indication = self._get_section_text_or_missing_text(indication)
269
+ technique = self._get_section_text_or_missing_text(technique)
270
+ comparison = self._get_section_text_or_missing_text(comparison)
271
+ prior_report = self._get_section_text_or_missing_text(prior_report)
272
+
273
+ prompt = self._construct_image_chat_messages_for_reporting(has_prior=has_prior, has_lateral=has_lateral)
274
+
275
+ if has_prior:
276
+ prompt.append({"index": None, "text": f" PRIOR_REPORT: {prior_report}", "type": "text"})
277
+
278
+ if get_grounding:
279
+ prompt.append(
280
+ {
281
+ "index": None,
282
+ "text": " Provide a description of the findings in the radiology study in comparison to the "
283
+ "prior frontal image. Each finding should be described as a self-contained plain-text sentence."
284
+ " If the finding is groundable, locate the finding in the current frontal chest X-ray image, "
285
+ "with bounding boxes indicating all locations where it can be seen in the current frontal "
286
+ "image. Otherwise, generate just the ungrounded finding without bounding boxes. INDICATION: "
287
+ f"{indication} TECHNIQUE: {technique} COMPARISON: {comparison}",
288
+ "type": "text",
289
+ }
290
+ )
291
+ else:
292
+ prompt.append(
293
+ {
294
+ "index": None,
295
+ "text": " Provide a description of the findings in the radiology study in comparison to the "
296
+ f"prior frontal image. INDICATION: {indication} TECHNIQUE: {technique} COMPARISON: "
297
+ f"{comparison}",
298
+ "type": "text",
299
+ }
300
+ )
301
+ messages: ChatMessageListType = [{"content": prompt, "role": "user"}]
302
+ if assistant_text is not None:
303
+ messages.append({"content": [{"index": None, "text": assistant_text, "type": "text"}], "role": "assistant"})
304
+ return messages
305
+
306
+ def _construct_chat_messages_phrase_grounding(
307
+ self, phrase: str, assistant_text: str | None = None
308
+ ) -> ChatMessageListType:
309
+ """
310
+ This function constructs the chat messages for phrase grounding used in the phrase grounding task.
311
+
312
+ Args:
313
+ phrase (str):
314
+ The phrase to be grounded.
315
+ assistant_text (str | None):
316
+ The assistant text (can be set to None for ordinary inference).
317
+
318
+ Returns:
319
+ ChatMessageListType: The chat messages for phrase grounding in the form of a list of dictionaries.
320
+ """
321
+ prompt: list[SingleChatMessageType] = [
322
+ {"index": None, "text": "Given the current frontal image", "type": "text"},
323
+ {"index": 0, "text": None, "type": "image"},
324
+ {
325
+ "index": None,
326
+ "text": f" Repeat the following finding as a grounded phrase with bounding boxes indicating all "
327
+ f"locations where it can be seen in the given chest X-ray image. Finding: {phrase}",
328
+ "type": "text",
329
+ },
330
+ ]
331
+ messages: ChatMessageListType = [{"content": prompt, "role": "user"}]
332
+ if assistant_text is not None:
333
+ messages.append({"content": [{"index": None, "text": assistant_text, "type": "text"}], "role": "assistant"})
334
+ return messages
335
+
336
+ def format_reporting_input(
337
+ self,
338
+ current_frontal: Image.Image,
339
+ current_lateral: Image.Image | None,
340
+ prior_frontal: Image.Image | None,
341
+ indication: str | None,
342
+ technique: str | None,
343
+ comparison: str | None,
344
+ prior_report: str | None,
345
+ get_grounding: bool = False,
346
+ assistant_text: str | None = None,
347
+ ) -> tuple[str, list[Image.Image]]:
348
+ """
349
+ This function formats the reporting prompt for the grounded and non-grounded reporting tasks from the given
350
+ input images and text sections. The images are normalized and stacked together in the right order.
351
+
352
+ Args:
353
+ current_frontal (Image.Image):
354
+ The current frontal image.
355
+ current_lateral (Image.Image | None):
356
+ The current lateral image.
357
+ prior_frontal (Image.Image | None):
358
+ The prior frontal image.
359
+ indication (str | None):
360
+ The indication section text.
361
+ technique (str | None):
362
+ The technique section text.
363
+ comparison (str | None):
364
+ The comparison section text.
365
+ prior_report (str | None):
366
+ The prior report section text.
367
+ get_grounding (bool):
368
+ A boolean indicating whether to construct the prompt for grounded or non-grounded reporting.
369
+ assistant_text (str | None): The assistant text (can be set to None for ordinary inference).
370
+
371
+ Returns:
372
+ tuple[str, list[Image.Image]]: The formatted prompt text and the normalized images stacked in the right order.
373
+ """
374
+ images = self._normalize_and_stack_images(
375
+ current_frontal=current_frontal,
376
+ current_lateral=current_lateral,
377
+ prior_frontal=prior_frontal,
378
+ )
379
+ messages = self._construct_chat_messages_reporting(
380
+ has_prior=prior_frontal is not None,
381
+ has_lateral=current_lateral is not None,
382
+ indication=indication,
383
+ technique=technique,
384
+ comparison=comparison,
385
+ prior_report=prior_report,
386
+ get_grounding=get_grounding,
387
+ assistant_text=assistant_text,
388
+ )
389
+ add_generation_prompt = assistant_text is None
390
+ text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=add_generation_prompt, tokenize=False)
391
+ return text, images
392
+
393
+ def format_phrase_grounding_input(
394
+ self,
395
+ frontal_image: Image.Image,
396
+ phrase: str,
397
+ assistant_text: str | None = None,
398
+ ) -> tuple[str, list[Image.Image]]:
399
+ """
400
+ This function formats the phrase grounding prompt for the phrase grounding task from the given input
401
+ image and phrase.
402
+
403
+ Args:
404
+ frontal_image (Image.Image):
405
+ The frontal image.
406
+ phrase (str):
407
+ The phrase to be grounded.
408
+ assistant_text (str | None):
409
+ The assistant text (can be set to None for ordinary inference).
410
+
411
+ Returns:
412
+ tuple[str, list[Image.Image]]: The formatted phrase grounding prompt text and the normalized image.
413
+ """
414
+ images = self._normalize_and_stack_images(
415
+ current_frontal=frontal_image,
416
+ current_lateral=None,
417
+ prior_frontal=None,
418
+ )
419
+ messages = self._construct_chat_messages_phrase_grounding(phrase, assistant_text=assistant_text)
420
+ add_generation_prompt = assistant_text is None
421
+ text = self.tokenizer.apply_chat_template(messages, add_generation_prompt=add_generation_prompt, tokenize=False)
422
+ return text, images
423
+
424
+ def format_and_preprocess_reporting_input(
425
+ self,
426
+ current_frontal: Image.Image,
427
+ current_lateral: Image.Image | None,
428
+ prior_frontal: Image.Image | None,
429
+ indication: str | None,
430
+ technique: str | None,
431
+ comparison: str | None,
432
+ prior_report: str | None,
433
+ get_grounding: bool = False,
434
+ assistant_text: str | None = None,
435
+ **kwargs: Any,
436
+ ) -> BatchFeature:
437
+ """
438
+ This function formats and then preprocesses the input for either the grounded or non-grounded reporting task from
439
+ the given input images and text sections and returns the batch feature for the model. It calls format_reporting_input
440
+ internally to format the input prompt and stack the images together in the order expected by the model.
441
+
442
+ Args:
443
+ current_frontal (Image.Image):
444
+ The current frontal image.
445
+ current_lateral (Image.Image | None):
446
+ The current lateral image.
447
+ prior_frontal (Image.Image | None):
448
+ The prior frontal image.
449
+ indication (str | None):
450
+ The indication section text.
451
+ technique (str | None):
452
+ The technique section text.
453
+ comparison (str | None):
454
+ The comparison section text.
455
+ prior_report (str | None):
456
+ The prior report section text.
457
+ get_grounding (bool):
458
+ A boolean indicating whether to preprocess the input for grounded or non-grounded reporting.
459
+ assistant_text (str | None):
460
+ The assistant text (can be set to None for ordinary inference).
461
+
462
+ Returns:
463
+ BatchFeature: The preprocessed batch, ready to be passed to the model.
464
+
465
+ """
466
+ text, images = self.format_reporting_input(
467
+ current_frontal=current_frontal,
468
+ current_lateral=current_lateral,
469
+ prior_frontal=prior_frontal,
470
+ indication=indication,
471
+ technique=technique,
472
+ comparison=comparison,
473
+ prior_report=prior_report,
474
+ get_grounding=get_grounding,
475
+ assistant_text=assistant_text,
476
+ )
477
+ return self(text=text, images=images, **kwargs)
478
+
479
+ def format_and_preprocess_phrase_grounding_input(
480
+ self,
481
+ frontal_image: Image.Image,
482
+ phrase: str,
483
+ assistant_text: str | None = None,
484
+ **kwargs: Any,
485
+ ) -> BatchFeature:
486
+ """
487
+ This function formats and then preprocesses the input for the phrase grounding task from the given input image and
488
+ phrase and returns the batch feature for the model. It calls format_phrase_grounding_input internally to format
489
+ the input prompt and normalize the image.
490
+
491
+ Args:
492
+ frontal_image (Image.Image):
493
+ The frontal image.
494
+ phrase (str):
495
+ The phrase to be grounded.
496
+ assistant_text (str | None):
497
+ The assistant text (can be set to None for ordinary inference).
498
+
499
+ Returns:
500
+ BatchFeature: The preprocessed batch, ready to be passed to the model.
501
+ """
502
+ text, images = self.format_phrase_grounding_input(
503
+ frontal_image=frontal_image,
504
+ phrase=phrase,
505
+ assistant_text=assistant_text,
506
+ )
507
+ return self(text=text, images=images, **kwargs)
508
+
509
+ def _get_text_between_delimiters(self, text: str, begin_token: str, end_token: str) -> list[str]:
510
+ """
511
+ This function splits the input text into a list of substrings based on the given begin and end tokens.
512
+
513
+ Args:
514
+ text (str):
515
+ The input text to be split.
516
+ begin_token (str):
517
+ The begin token.
518
+ end_token (str):
519
+ The end token.
520
+
521
+ Returns:
522
+ list[str]: The list of substrings between the given begin and end tokens.
523
+
524
+ Example:
525
+
526
+ ```python
527
+ >>> _get_text_between_delimiters("<obj>This is a grounded phrase</obj><obj>This is another grounded phrase</obj>", "<obj>", "</obj>")
528
+ >>> # ["This is a grounded phrase", "This is another grounded phrase"]
529
+
530
+ >>> _get_text_between_delimiters("<box><x10><y20><x30><y40></box><box><x50><y60><x70><y80></box>", "<box>", "</box>")
531
+ >>> # ["<x10><y20><x30><y40>", "<x50><y60><x70><y80>"]
532
+ ```
533
+ """
534
+ split_text = []
535
+ while begin_token in text:
536
+ assert text.startswith(begin_token)
537
+ end_index = text.find(end_token)
538
+ assert end_index != -1
539
+ split_text.append(text[len(begin_token) : end_index])
540
+ text = text[end_index + len(end_token) :]
541
+ assert len(text) == 0
542
+ return split_text
543
+
544
+ def convert_output_to_plaintext_or_grounded_sequence(
545
+ self, text: str
546
+ ) -> str | list[tuple[str, list[BoxType] | None]]:
547
+ """
548
+ This function converts the input text to a grounded sequence by extracting the grounded phrases and bounding
549
+ boxes from the text. If the text is plaintext without any grounded phrases, it returns the text as is.
550
+
551
+ Args:
552
+ text (str):
553
+ The input text to be converted.
554
+
555
+ Returns:
556
+ str | list[tuple[str, list[BoxType] | None]]: The grounded sequence.
557
+
558
+ Example:
559
+
560
+ ```python
561
+ >>> convert_output_to_plaintext_or_grounded_sequence("<obj>grounded phrase <box><x55><y45><x70><y56></box></obj><obj>ungrounded phrase</obj>")
562
+ >>> # [
563
+ >>> # ("grounded phrase", [(0.55, 0.45, 0.70, 0.56)]),
564
+ >>> # ("ungrounded phrase", None),
565
+ >>> # ]
566
+
567
+ >>> convert_output_to_plaintext_or_grounded_sequence("plain text")
568
+ >>> # "plain text"
569
+ ```
570
+ """
571
+ text = text.strip()
572
+
573
+ # Plain text
574
+ if not any(
575
+ [
576
+ self.phrase_start_token in text,
577
+ self.phrase_end_token in text,
578
+ self.box_start_token in text,
579
+ self.box_end_token in text,
580
+ ]
581
+ ):
582
+ return text
583
+
584
+ # One or more grounded phrases
585
+ grounded_phrase_texts = self._get_text_between_delimiters(text, self.phrase_start_token, self.phrase_end_token)
586
+ grounded_phrases: list[tuple[str, list[BoxType] | None]] = []
587
+ for grounded_phrase_text in grounded_phrase_texts:
588
+ if self.box_start_token in grounded_phrase_text or self.box_end_token in grounded_phrase_text:
589
+ first_box_start_index = grounded_phrase_text.find(self.box_start_token)
590
+ phrase_text = grounded_phrase_text[:first_box_start_index].strip()
591
+ boxes_text = grounded_phrase_text[first_box_start_index:]
592
+ boxes_text_list = self._get_text_between_delimiters(
593
+ boxes_text, self.box_start_token, self.box_end_token
594
+ )
595
+ boxes: list[BoxType] = []
596
+ for box_text in boxes_text_list:
597
+ # extract from <x_><y_><x_><y_>
598
+ regex = r"<x(\d+?)><y(\d+?)><x(\d+?)><y(\d+?)>"
599
+ match = re.search(regex, box_text)
600
+ if match:
601
+ x_min, y_min, x_max, y_max = match.groups()
602
+ box: BoxType = tuple( # type: ignore[assignment]
603
+ (int(coord) + 0.5) / self.num_box_coord_bins for coord in (x_min, y_min, x_max, y_max)
604
+ )
605
+ assert all(0 <= coord <= 1 for coord in box), f"Invalid box coordinates: {box}"
606
+ boxes.append(box)
607
+ else:
608
+ raise ValueError(f"Invalid box coordinates: {box_text} not matching regex {regex}")
609
+ grounded_phrases.append((phrase_text, boxes))
610
+ else:
611
+ grounded_phrases.append((grounded_phrase_text.lstrip(), None))
612
+ return grounded_phrases
613
+
614
+ @staticmethod
615
+ def adjust_box_for_original_image_size(box: BoxType, width: int, height: int) -> BoxType:
616
+ """
617
+ This function adjusts the bounding boxes from the MAIRA-2 model output to account for the image processor
618
+ cropping the image to be square prior to the model forward pass. The box coordinates are adjusted to be
619
+ relative to the original shape of the image assuming the image processor cropped the image based on the length
620
+ of the shortest side.
621
+
622
+ Args:
623
+ box (BoxType):
624
+ The box to be adjusted, normalised to (0, 1).
625
+ width (int):
626
+ Original width of the image, in pixels.
627
+ height (int):
628
+ Original height of the image, in pixels.
629
+
630
+ Returns:
631
+ BoxType: The box normalised relative to the original size of the image.
632
+ """
633
+ crop_width = crop_height = min(width, height)
634
+ x_offset = (width - crop_width) // 2
635
+ y_offset = (height - crop_height) // 2
636
+
637
+ norm_x_min, norm_y_min, norm_x_max, norm_y_max = box
638
+
639
+ abs_x_min = int(norm_x_min * crop_width + x_offset)
640
+ abs_x_max = int(norm_x_max * crop_width + x_offset)
641
+ abs_y_min = int(norm_y_min * crop_height + y_offset)
642
+ abs_y_max = int(norm_y_max * crop_height + y_offset)
643
+
644
+ adjusted_norm_x_min = abs_x_min / width
645
+ adjusted_norm_x_max = abs_x_max / width
646
+ adjusted_norm_y_min = abs_y_min / height
647
+ adjusted_norm_y_max = abs_y_max / height
648
+
649
+ return (adjusted_norm_x_min, adjusted_norm_y_min, adjusted_norm_x_max, adjusted_norm_y_max)
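A minimal end-to-end sketch of how this processor is intended to be driven (not part of the diff). The repo id, the `AutoModelForCausalLM` entry point, and the input path are assumptions; only `format_and_preprocess_reporting_input`, `convert_output_to_plaintext_or_grounded_sequence`, and `adjust_box_for_original_image_size` come from the file above.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumed repo id; point this at wherever the checkpoint and this processor live.
processor = AutoProcessor.from_pretrained("microsoft/maira-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/maira-2", trust_remote_code=True)

frontal = Image.open("frontal.png")  # hypothetical input image

# Build a model-ready batch for non-grounded reporting with a frontal view only.
inputs = processor.format_and_preprocess_reporting_input(
    current_frontal=frontal,
    current_lateral=None,
    prior_frontal=None,
    indication="Shortness of breath.",
    technique="PA view.",
    comparison=None,
    prior_report=None,
    get_grounding=False,
    return_tensors="pt",
)

outputs = model.generate(**inputs, max_new_tokens=300)
prompt_length = inputs["input_ids"].shape[-1]
decoded = processor.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

# Plain text for non-grounded reporting; (phrase, boxes) tuples when get_grounding=True.
prediction = processor.convert_output_to_plaintext_or_grounded_sequence(decoded)

# Grounded boxes are relative to the centre square crop; map them back like so:
# adjusted = processor.adjust_box_for_original_image_size(box, *frontal.size)
```

Note that `<obj>` and `<box>` are registered as non-special added tokens (see `tokenizer_config.json` below), so `skip_special_tokens=True` strips only `<s>`/`</s>`/`<unk>` and leaves the grounding markup intact for parsing.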
processor_config.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "auto_map": {
3
+ "AutoProcessor": "processing_maira2.Maira2Processor"
4
+ },
5
+ "box_end_token": "</box>",
6
+ "box_start_token": "<box>",
7
+ "image_token": "<image>",
8
+ "num_additional_image_tokens": 1,
9
+ "num_box_coord_bins": 100,
10
+ "patch_size": 14,
11
+ "phrase_end_token": "</obj>",
12
+ "phrase_start_token": "<obj>",
13
+ "processor_class": "Maira2Processor",
14
+ "vision_feature_select_strategy": "default"
15
+ }
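The `num_box_coord_bins` value above is what drives the coordinate decoding in `convert_output_to_plaintext_or_grounded_sequence`: a token such as `<x55>` denotes the centre of bin 55 out of 100, i.e. 0.555. Below is a self-contained sketch of that mapping plus the inverse crop adjustment; the function names are local illustrations, not part of the repo, and the crop logic mirrors `adjust_box_for_original_image_size`, including its integer truncation to pixel coordinates.

```python
import re

NUM_BOX_COORD_BINS = 100  # matches "num_box_coord_bins" in the config above

def decode_box(box_text: str) -> tuple[float, ...]:
    """Map '<x55><y45><x70><y56>' to bin-centre coordinates normalised to (0, 1)."""
    match = re.fullmatch(r"<x(\d+)><y(\d+)><x(\d+)><y(\d+)>", box_text)
    assert match is not None, f"unparseable box: {box_text}"
    return tuple((int(c) + 0.5) / NUM_BOX_COORD_BINS for c in match.groups())

def adjust_for_original_size(box, width, height):
    """Undo the centre square crop: re-normalise crop-relative coords to the full image."""
    crop = min(width, height)
    x_off, y_off = (width - crop) // 2, (height - crop) // 2
    x0, y0, x1, y1 = box
    return (
        int(x0 * crop + x_off) / width,
        int(y0 * crop + y_off) / height,
        int(x1 * crop + x_off) / width,
        int(y1 * crop + y_off) / height,
    )

box = decode_box("<x55><y45><x70><y56>")        # (0.555, 0.455, 0.705, 0.565)
print(adjust_for_original_size(box, 600, 400))  # x values shift by the 100px crop offset
```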
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "<unk>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<unk>",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
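For completeness, a quick way to sanity-check this map after cloning the repo (a sketch; the local path is an assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # assumed: run from a local clone of this repo
print(tok.bos_token, tok.eos_token, tok.unk_token)  # <s> </s> <unk>
print(tok.pad_token == tok.unk_token)  # True: <unk> doubles as the padding token
```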
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a8f238a200be6c23fbba0f9a999ab4fe3c09ca303b29805e68cf6659bfb7d89
3
+ size 131
tokenizer_config.json ADDED
@@ -0,0 +1,1702 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "add_prefix_space": true,
5
+ "added_tokens_decoder": {
6
+ "0": {
7
+ "content": "<unk>",
8
+ "lstrip": false,
9
+ "normalized": false,
10
+ "rstrip": false,
11
+ "single_word": false,
12
+ "special": true
13
+ },
14
+ "1": {
15
+ "content": "<s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false,
20
+ "special": true
21
+ },
22
+ "2": {
23
+ "content": "</s>",
24
+ "lstrip": false,
25
+ "normalized": false,
26
+ "rstrip": false,
27
+ "single_word": false,
28
+ "special": true
29
+ },
30
+ "32000": {
31
+ "content": "<obj>",
32
+ "lstrip": false,
33
+ "normalized": true,
34
+ "rstrip": false,
35
+ "single_word": false,
36
+ "special": false
37
+ },
38
+ "32001": {
39
+ "content": "</obj>",
40
+ "lstrip": false,
41
+ "normalized": true,
42
+ "rstrip": false,
43
+ "single_word": false,
44
+ "special": false
45
+ },
46
+ "32002": {
47
+ "content": "<x0>",
48
+ "lstrip": false,
49
+ "normalized": true,
50
+ "rstrip": false,
51
+ "single_word": false,
52
+ "special": false
53
+ },
54
+ "32003": {
55
+ "content": "<x1>",
56
+ "lstrip": false,
57
+ "normalized": true,
58
+ "rstrip": false,
59
+ "single_word": false,
60
+ "special": false
61
+ },
62
+ "32004": {
63
+ "content": "<x2>",
64
+ "lstrip": false,
65
+ "normalized": true,
66
+ "rstrip": false,
67
+ "single_word": false,
68
+ "special": false
69
+ },
70
+ "32005": {
71
+ "content": "<x3>",
72
+ "lstrip": false,
73
+ "normalized": true,
74
+ "rstrip": false,
75
+ "single_word": false,
76
+ "special": false
77
+ },
78
+ "32006": {
79
+ "content": "<x4>",
80
+ "lstrip": false,
81
+ "normalized": true,
82
+ "rstrip": false,
83
+ "single_word": false,
84
+ "special": false
85
+ },
86
+ "32007": {
87
+ "content": "<x5>",
88
+ "lstrip": false,
89
+ "normalized": true,
90
+ "rstrip": false,
91
+ "single_word": false,
92
+ "special": false
93
+ },
94
+ "32008": {
95
+ "content": "<x6>",
96
+ "lstrip": false,
97
+ "normalized": true,
98
+ "rstrip": false,
99
+ "single_word": false,
100
+ "special": false
101
+ },
102
+ "32009": {
103
+ "content": "<x7>",
104
+ "lstrip": false,
105
+ "normalized": true,
106
+ "rstrip": false,
107
+ "single_word": false,
108
+ "special": false
109
+ },
110
+ "32010": {
111
+ "content": "<x8>",
112
+ "lstrip": false,
113
+ "normalized": true,
114
+ "rstrip": false,
115
+ "single_word": false,
116
+ "special": false
117
+ },
118
+ "32011": {
119
+ "content": "<x9>",
120
+ "lstrip": false,
121
+ "normalized": true,
122
+ "rstrip": false,
123
+ "single_word": false,
124
+ "special": false
125
+ },
126
+ "32012": {
127
+ "content": "<x10>",
128
+ "lstrip": false,
129
+ "normalized": true,
130
+ "rstrip": false,
131
+ "single_word": false,
132
+ "special": false
133
+ },
134
+ "32013": {
135
+ "content": "<x11>",
136
+ "lstrip": false,
137
+ "normalized": true,
138
+ "rstrip": false,
139
+ "single_word": false,
140
+ "special": false
141
+ },
142
+ "32014": {
143
+ "content": "<x12>",
144
+ "lstrip": false,
145
+ "normalized": true,
146
+ "rstrip": false,
147
+ "single_word": false,
148
+ "special": false
149
+ },
150
+ "32015": {
151
+ "content": "<x13>",
152
+ "lstrip": false,
153
+ "normalized": true,
154
+ "rstrip": false,
155
+ "single_word": false,
156
+ "special": false
157
+ },
158
+ "32016": {
159
+ "content": "<x14>",
160
+ "lstrip": false,
161
+ "normalized": true,
162
+ "rstrip": false,
163
+ "single_word": false,
164
+ "special": false
165
+ },
166
+ "32017": {
167
+ "content": "<x15>",
168
+ "lstrip": false,
169
+ "normalized": true,
170
+ "rstrip": false,
171
+ "single_word": false,
172
+ "special": false
173
+ },
174
+ "32018": {
175
+ "content": "<x16>",
176
+ "lstrip": false,
177
+ "normalized": true,
178
+ "rstrip": false,
179
+ "single_word": false,
180
+ "special": false
181
+ },
182
+ "32019": {
183
+ "content": "<x17>",
184
+ "lstrip": false,
185
+ "normalized": true,
186
+ "rstrip": false,
187
+ "single_word": false,
188
+ "special": false
189
+ },
190
+ "32020": {
191
+ "content": "<x18>",
192
+ "lstrip": false,
193
+ "normalized": true,
194
+ "rstrip": false,
195
+ "single_word": false,
196
+ "special": false
197
+ },
198
+ "32021": {
199
+ "content": "<x19>",
200
+ "lstrip": false,
201
+ "normalized": true,
202
+ "rstrip": false,
203
+ "single_word": false,
204
+ "special": false
205
+ },
206
+ "32022": {
207
+ "content": "<x20>",
208
+ "lstrip": false,
209
+ "normalized": true,
210
+ "rstrip": false,
211
+ "single_word": false,
212
+ "special": false
213
+ },
214
+ "32023": {
215
+ "content": "<x21>",
216
+ "lstrip": false,
217
+ "normalized": true,
218
+ "rstrip": false,
219
+ "single_word": false,
220
+ "special": false
221
+ },
222
+ "32024": {
223
+ "content": "<x22>",
224
+ "lstrip": false,
225
+ "normalized": true,
226
+ "rstrip": false,
227
+ "single_word": false,
228
+ "special": false
229
+ },
230
+ "32025": {
231
+ "content": "<x23>",
232
+ "lstrip": false,
233
+ "normalized": true,
234
+ "rstrip": false,
235
+ "single_word": false,
236
+ "special": false
237
+ },
238
+ "32026": {
239
+ "content": "<x24>",
240
+ "lstrip": false,
241
+ "normalized": true,
242
+ "rstrip": false,
243
+ "single_word": false,
244
+ "special": false
245
+ },
246
+ "32027": {
247
+ "content": "<x25>",
248
+ "lstrip": false,
249
+ "normalized": true,
250
+ "rstrip": false,
251
+ "single_word": false,
252
+ "special": false
253
+ },
254
+ "32028": {
255
+ "content": "<x26>",
256
+ "lstrip": false,
257
+ "normalized": true,
258
+ "rstrip": false,
259
+ "single_word": false,
260
+ "special": false
261
+ },
262
+ "32029": {
263
+ "content": "<x27>",
264
+ "lstrip": false,
265
+ "normalized": true,
266
+ "rstrip": false,
267
+ "single_word": false,
268
+ "special": false
269
+ },
270
+ "32030": {
271
+ "content": "<x28>",
272
+ "lstrip": false,
273
+ "normalized": true,
274
+ "rstrip": false,
275
+ "single_word": false,
276
+ "special": false
277
+ },
278
+ "32031": {
279
+ "content": "<x29>",
280
+ "lstrip": false,
281
+ "normalized": true,
282
+ "rstrip": false,
283
+ "single_word": false,
284
+ "special": false
285
+ },
286
+ "32032": {
287
+ "content": "<x30>",
288
+ "lstrip": false,
289
+ "normalized": true,
290
+ "rstrip": false,
291
+ "single_word": false,
292
+ "special": false
293
+ },
294
+ "32033": {
295
+ "content": "<x31>",
296
+ "lstrip": false,
297
+ "normalized": true,
298
+ "rstrip": false,
299
+ "single_word": false,
300
+ "special": false
301
+ },
302
+ "32034": {
303
+ "content": "<x32>",
304
+ "lstrip": false,
305
+ "normalized": true,
306
+ "rstrip": false,
307
+ "single_word": false,
308
+ "special": false
309
+ },
310
+ "32035": {
311
+ "content": "<x33>",
312
+ "lstrip": false,
313
+ "normalized": true,
314
+ "rstrip": false,
315
+ "single_word": false,
316
+ "special": false
317
+ },
318
+ "32036": {
319
+ "content": "<x34>",
320
+ "lstrip": false,
321
+ "normalized": true,
322
+ "rstrip": false,
323
+ "single_word": false,
324
+ "special": false
325
+ },
326
+ "32037": {
327
+ "content": "<x35>",
328
+ "lstrip": false,
329
+ "normalized": true,
330
+ "rstrip": false,
331
+ "single_word": false,
332
+ "special": false
333
+ },
334
+ "32038": {
335
+ "content": "<x36>",
336
+ "lstrip": false,
337
+ "normalized": true,
338
+ "rstrip": false,
339
+ "single_word": false,
340
+ "special": false
341
+ },
342
+ "32039": {
343
+ "content": "<x37>",
344
+ "lstrip": false,
345
+ "normalized": true,
346
+ "rstrip": false,
347
+ "single_word": false,
348
+ "special": false
349
+ },
350
+ "32040": {
351
+ "content": "<x38>",
352
+ "lstrip": false,
353
+ "normalized": true,
354
+ "rstrip": false,
355
+ "single_word": false,
356
+ "special": false
357
+ },
358
+ "32041": {
359
+ "content": "<x39>",
360
+ "lstrip": false,
361
+ "normalized": true,
362
+ "rstrip": false,
363
+ "single_word": false,
364
+ "special": false
365
+ },
366
+ "32042": {
367
+ "content": "<x40>",
368
+ "lstrip": false,
369
+ "normalized": true,
370
+ "rstrip": false,
371
+ "single_word": false,
372
+ "special": false
373
+ },
374
+ "32043": {
375
+ "content": "<x41>",
376
+ "lstrip": false,
377
+ "normalized": true,
378
+ "rstrip": false,
379
+ "single_word": false,
380
+ "special": false
381
+ },
382
+ "32044": {
383
+ "content": "<x42>",
384
+ "lstrip": false,
385
+ "normalized": true,
386
+ "rstrip": false,
387
+ "single_word": false,
388
+ "special": false
389
+ },
390
+ "32045": {
391
+ "content": "<x43>",
392
+ "lstrip": false,
393
+ "normalized": true,
394
+ "rstrip": false,
395
+ "single_word": false,
396
+ "special": false
397
+ },
398
+ "32046": {
399
+ "content": "<x44>",
400
+ "lstrip": false,
401
+ "normalized": true,
402
+ "rstrip": false,
403
+ "single_word": false,
404
+ "special": false
405
+ },
406
+ "32047": {
407
+ "content": "<x45>",
408
+ "lstrip": false,
409
+ "normalized": true,
410
+ "rstrip": false,
411
+ "single_word": false,
412
+ "special": false
413
+ },
414
+ "32048": {
415
+ "content": "<x46>",
416
+ "lstrip": false,
417
+ "normalized": true,
418
+ "rstrip": false,
419
+ "single_word": false,
420
+ "special": false
421
+ },
422
+ "32049": {
423
+ "content": "<x47>",
424
+ "lstrip": false,
425
+ "normalized": true,
426
+ "rstrip": false,
427
+ "single_word": false,
428
+ "special": false
429
+ },
430
+ "32050": {
431
+ "content": "<x48>",
432
+ "lstrip": false,
433
+ "normalized": true,
434
+ "rstrip": false,
435
+ "single_word": false,
436
+ "special": false
437
+ },
438
+ "32051": {
439
+ "content": "<x49>",
440
+ "lstrip": false,
441
+ "normalized": true,
442
+ "rstrip": false,
443
+ "single_word": false,
444
+ "special": false
445
+ },
446
+ "32052": {
447
+ "content": "<x50>",
448
+ "lstrip": false,
449
+ "normalized": true,
450
+ "rstrip": false,
451
+ "single_word": false,
452
+ "special": false
453
+ },
454
+ "32053": {
455
+ "content": "<x51>",
456
+ "lstrip": false,
457
+ "normalized": true,
458
+ "rstrip": false,
459
+ "single_word": false,
460
+ "special": false
461
+ },
462
+ "32054": {
463
+ "content": "<x52>",
464
+ "lstrip": false,
465
+ "normalized": true,
466
+ "rstrip": false,
467
+ "single_word": false,
468
+ "special": false
469
+ },
470
+ "32055": {
471
+ "content": "<x53>",
472
+ "lstrip": false,
473
+ "normalized": true,
474
+ "rstrip": false,
475
+ "single_word": false,
476
+ "special": false
477
+ },
478
+ "32056": {
479
+ "content": "<x54>",
480
+ "lstrip": false,
481
+ "normalized": true,
482
+ "rstrip": false,
483
+ "single_word": false,
484
+ "special": false
485
+ },
486
+ "32057": {
487
+ "content": "<x55>",
488
+ "lstrip": false,
489
+ "normalized": true,
490
+ "rstrip": false,
491
+ "single_word": false,
492
+ "special": false
493
+ },
494
+ "32058": {
495
+ "content": "<x56>",
496
+ "lstrip": false,
497
+ "normalized": true,
498
+ "rstrip": false,
499
+ "single_word": false,
500
+ "special": false
501
+ },
502
+ "32059": {
503
+ "content": "<x57>",
504
+ "lstrip": false,
505
+ "normalized": true,
506
+ "rstrip": false,
507
+ "single_word": false,
508
+ "special": false
509
+ },
510
+ "32060": {
511
+ "content": "<x58>",
512
+ "lstrip": false,
513
+ "normalized": true,
514
+ "rstrip": false,
515
+ "single_word": false,
516
+ "special": false
517
+ },
518
+ "32061": {
519
+ "content": "<x59>",
520
+ "lstrip": false,
521
+ "normalized": true,
522
+ "rstrip": false,
523
+ "single_word": false,
524
+ "special": false
525
+ },
526
+ "32062": {
527
+ "content": "<x60>",
528
+ "lstrip": false,
529
+ "normalized": true,
530
+ "rstrip": false,
531
+ "single_word": false,
532
+ "special": false
533
+ },
534
+ "32063": {
535
+ "content": "<x61>",
536
+ "lstrip": false,
537
+ "normalized": true,
538
+ "rstrip": false,
539
+ "single_word": false,
540
+ "special": false
541
+ },
542
+ "32064": {
543
+ "content": "<x62>",
544
+ "lstrip": false,
545
+ "normalized": true,
546
+ "rstrip": false,
547
+ "single_word": false,
548
+ "special": false
549
+ },
550
+ "32065": {
551
+ "content": "<x63>",
552
+ "lstrip": false,
553
+ "normalized": true,
554
+ "rstrip": false,
555
+ "single_word": false,
556
+ "special": false
557
+ },
558
+ "32066": {
559
+ "content": "<x64>",
560
+ "lstrip": false,
561
+ "normalized": true,
562
+ "rstrip": false,
563
+ "single_word": false,
564
+ "special": false
565
+ },
566
+ "32067": {
567
+ "content": "<x65>",
568
+ "lstrip": false,
569
+ "normalized": true,
570
+ "rstrip": false,
571
+ "single_word": false,
572
+ "special": false
573
+ },
574
+ "32068": {
575
+ "content": "<x66>",
576
+ "lstrip": false,
577
+ "normalized": true,
578
+ "rstrip": false,
579
+ "single_word": false,
580
+ "special": false
581
+ },
582
+ "32069": {
583
+ "content": "<x67>",
584
+ "lstrip": false,
585
+ "normalized": true,
586
+ "rstrip": false,
587
+ "single_word": false,
588
+ "special": false
589
+ },
590
+ "32070": {
591
+ "content": "<x68>",
592
+ "lstrip": false,
593
+ "normalized": true,
594
+ "rstrip": false,
595
+ "single_word": false,
596
+ "special": false
597
+ },
598
+ "32071": {
599
+ "content": "<x69>",
600
+ "lstrip": false,
601
+ "normalized": true,
602
+ "rstrip": false,
603
+ "single_word": false,
604
+ "special": false
605
+ },
606
+ "32072": {
607
+ "content": "<x70>",
608
+ "lstrip": false,
609
+ "normalized": true,
610
+ "rstrip": false,
611
+ "single_word": false,
612
+ "special": false
613
+ },
614
+ "32073": {
615
+ "content": "<x71>",
616
+ "lstrip": false,
617
+ "normalized": true,
618
+ "rstrip": false,
619
+ "single_word": false,
620
+ "special": false
621
+ },
622
+ "32074": {
623
+ "content": "<x72>",
624
+ "lstrip": false,
625
+ "normalized": true,
626
+ "rstrip": false,
627
+ "single_word": false,
628
+ "special": false
629
+ },
630
+ "32075": {
631
+ "content": "<x73>",
632
+ "lstrip": false,
633
+ "normalized": true,
634
+ "rstrip": false,
635
+ "single_word": false,
636
+ "special": false
637
+ },
638
+ "32076": {
639
+ "content": "<x74>",
640
+ "lstrip": false,
641
+ "normalized": true,
642
+ "rstrip": false,
643
+ "single_word": false,
644
+ "special": false
645
+ },
646
+ "32077": {
647
+ "content": "<x75>",
648
+ "lstrip": false,
649
+ "normalized": true,
650
+ "rstrip": false,
651
+ "single_word": false,
652
+ "special": false
653
+ },
654
+ "32078": {
655
+ "content": "<x76>",
656
+ "lstrip": false,
657
+ "normalized": true,
658
+ "rstrip": false,
659
+ "single_word": false,
660
+ "special": false
661
+ },
662
+ "32079": {
663
+ "content": "<x77>",
664
+ "lstrip": false,
665
+ "normalized": true,
666
+ "rstrip": false,
667
+ "single_word": false,
668
+ "special": false
669
+ },
670
+ "32080": {
671
+ "content": "<x78>",
672
+ "lstrip": false,
673
+ "normalized": true,
674
+ "rstrip": false,
675
+ "single_word": false,
676
+ "special": false
677
+ },
678
+ "32081": {
679
+ "content": "<x79>",
680
+ "lstrip": false,
681
+ "normalized": true,
682
+ "rstrip": false,
683
+ "single_word": false,
684
+ "special": false
685
+ },
686
+ "32082": {
687
+ "content": "<x80>",
688
+ "lstrip": false,
689
+ "normalized": true,
690
+ "rstrip": false,
691
+ "single_word": false,
692
+ "special": false
693
+ },
694
+ "32083": {
695
+ "content": "<x81>",
696
+ "lstrip": false,
697
+ "normalized": true,
698
+ "rstrip": false,
699
+ "single_word": false,
700
+ "special": false
701
+ },
702
+ "32084": {
703
+ "content": "<x82>",
704
+ "lstrip": false,
705
+ "normalized": true,
706
+ "rstrip": false,
707
+ "single_word": false,
708
+ "special": false
709
+ },
710
+ "32085": {
711
+ "content": "<x83>",
712
+ "lstrip": false,
713
+ "normalized": true,
714
+ "rstrip": false,
715
+ "single_word": false,
716
+ "special": false
717
+ },
718
+ "32086": {
719
+ "content": "<x84>",
720
+ "lstrip": false,
721
+ "normalized": true,
722
+ "rstrip": false,
723
+ "single_word": false,
724
+ "special": false
725
+ },
726
+ "32087": {
727
+ "content": "<x85>",
728
+ "lstrip": false,
729
+ "normalized": true,
730
+ "rstrip": false,
731
+ "single_word": false,
732
+ "special": false
733
+ },
734
+ "32088": {
735
+ "content": "<x86>",
736
+ "lstrip": false,
737
+ "normalized": true,
738
+ "rstrip": false,
739
+ "single_word": false,
740
+ "special": false
741
+ },
742
+ "32089": {
743
+ "content": "<x87>",
744
+ "lstrip": false,
745
+ "normalized": true,
746
+ "rstrip": false,
747
+ "single_word": false,
748
+ "special": false
749
+ },
750
+ "32090": {
751
+ "content": "<x88>",
752
+ "lstrip": false,
753
+ "normalized": true,
754
+ "rstrip": false,
755
+ "single_word": false,
756
+ "special": false
757
+ },
758
+ "32091": {
759
+ "content": "<x89>",
760
+ "lstrip": false,
761
+ "normalized": true,
762
+ "rstrip": false,
763
+ "single_word": false,
764
+ "special": false
765
+ },
766
+ "32092": {
767
+ "content": "<x90>",
768
+ "lstrip": false,
769
+ "normalized": true,
770
+ "rstrip": false,
771
+ "single_word": false,
772
+ "special": false
773
+ },
774
+ "32093": {
775
+ "content": "<x91>",
776
+ "lstrip": false,
777
+ "normalized": true,
778
+ "rstrip": false,
779
+ "single_word": false,
780
+ "special": false
781
+ },
782
+ "32094": {
783
+ "content": "<x92>",
784
+ "lstrip": false,
785
+ "normalized": true,
786
+ "rstrip": false,
787
+ "single_word": false,
788
+ "special": false
789
+ },
790
+ "32095": {
791
+ "content": "<x93>",
792
+ "lstrip": false,
793
+ "normalized": true,
794
+ "rstrip": false,
795
+ "single_word": false,
796
+ "special": false
797
+ },
798
+ "32096": {
799
+ "content": "<x94>",
800
+ "lstrip": false,
801
+ "normalized": true,
802
+ "rstrip": false,
803
+ "single_word": false,
804
+ "special": false
805
+ },
806
+ "32097": {
807
+ "content": "<x95>",
808
+ "lstrip": false,
809
+ "normalized": true,
810
+ "rstrip": false,
811
+ "single_word": false,
812
+ "special": false
813
+ },
814
+ "32098": {
815
+ "content": "<x96>",
816
+ "lstrip": false,
817
+ "normalized": true,
818
+ "rstrip": false,
819
+ "single_word": false,
820
+ "special": false
821
+ },
822
+ "32099": {
823
+ "content": "<x97>",
824
+ "lstrip": false,
825
+ "normalized": true,
826
+ "rstrip": false,
827
+ "single_word": false,
828
+ "special": false
829
+ },
830
+ "32100": {
831
+ "content": "<x98>",
832
+ "lstrip": false,
833
+ "normalized": true,
834
+ "rstrip": false,
835
+ "single_word": false,
836
+ "special": false
837
+ },
838
+ "32101": {
839
+ "content": "<x99>",
840
+ "lstrip": false,
841
+ "normalized": true,
842
+ "rstrip": false,
843
+ "single_word": false,
844
+ "special": false
845
+ },
846
+ "32102": {
847
+ "content": "<y0>",
848
+ "lstrip": false,
849
+ "normalized": true,
850
+ "rstrip": false,
851
+ "single_word": false,
852
+ "special": false
853
+ },
854
+ "32103": {
855
+ "content": "<y1>",
856
+ "lstrip": false,
857
+ "normalized": true,
858
+ "rstrip": false,
859
+ "single_word": false,
860
+ "special": false
861
+ },
862
+ "32104": {
863
+ "content": "<y2>",
864
+ "lstrip": false,
865
+ "normalized": true,
866
+ "rstrip": false,
867
+ "single_word": false,
868
+ "special": false
869
+ },
870
+ "32105": {
871
+ "content": "<y3>",
872
+ "lstrip": false,
873
+ "normalized": true,
874
+ "rstrip": false,
875
+ "single_word": false,
876
+ "special": false
877
+ },
878
+ "32106": {
879
+ "content": "<y4>",
880
+ "lstrip": false,
881
+ "normalized": true,
882
+ "rstrip": false,
883
+ "single_word": false,
884
+ "special": false
885
+ },
886
+ "32107": {
887
+ "content": "<y5>",
888
+ "lstrip": false,
889
+ "normalized": true,
890
+ "rstrip": false,
891
+ "single_word": false,
892
+ "special": false
893
+ },
894
+ "32108": {
895
+ "content": "<y6>",
896
+ "lstrip": false,
897
+ "normalized": true,
898
+ "rstrip": false,
899
+ "single_word": false,
900
+ "special": false
901
+ },
902
+ "32109": {
903
+ "content": "<y7>",
904
+ "lstrip": false,
905
+ "normalized": true,
906
+ "rstrip": false,
907
+ "single_word": false,
908
+ "special": false
909
+ },
910
+ "32110": {
911
+ "content": "<y8>",
912
+ "lstrip": false,
913
+ "normalized": true,
914
+ "rstrip": false,
915
+ "single_word": false,
916
+ "special": false
917
+ },
918
+ "32111": {
919
+ "content": "<y9>",
920
+ "lstrip": false,
921
+ "normalized": true,
922
+ "rstrip": false,
923
+ "single_word": false,
924
+ "special": false
925
+ },
926
+ "32112": {
927
+ "content": "<y10>",
928
+ "lstrip": false,
929
+ "normalized": true,
930
+ "rstrip": false,
931
+ "single_word": false,
932
+ "special": false
933
+ },
934
+ "32113": {
935
+ "content": "<y11>",
936
+ "lstrip": false,
937
+ "normalized": true,
938
+ "rstrip": false,
939
+ "single_word": false,
940
+ "special": false
941
+ },
942
+ "32114": {
943
+ "content": "<y12>",
944
+ "lstrip": false,
945
+ "normalized": true,
946
+ "rstrip": false,
947
+ "single_word": false,
948
+ "special": false
949
+ },
950
+ "32115": {
951
+ "content": "<y13>",
952
+ "lstrip": false,
953
+ "normalized": true,
954
+ "rstrip": false,
955
+ "single_word": false,
956
+ "special": false
957
+ },
958
+ "32116": {
959
+ "content": "<y14>",
960
+ "lstrip": false,
961
+ "normalized": true,
962
+ "rstrip": false,
963
+ "single_word": false,
964
+ "special": false
965
+ },
966
+ "32117": {
967
+ "content": "<y15>",
968
+ "lstrip": false,
969
+ "normalized": true,
970
+ "rstrip": false,
971
+ "single_word": false,
972
+ "special": false
973
+ },
974
+ "32118": {
975
+ "content": "<y16>",
976
+ "lstrip": false,
977
+ "normalized": true,
978
+ "rstrip": false,
979
+ "single_word": false,
980
+ "special": false
981
+ },
982
+ "32119": {
983
+ "content": "<y17>",
984
+ "lstrip": false,
985
+ "normalized": true,
986
+ "rstrip": false,
987
+ "single_word": false,
988
+ "special": false
989
+ },
990
+ "32120": {
991
+ "content": "<y18>",
992
+ "lstrip": false,
993
+ "normalized": true,
994
+ "rstrip": false,
995
+ "single_word": false,
996
+ "special": false
997
+ },
998
+ "32121": {
999
+ "content": "<y19>",
1000
+ "lstrip": false,
1001
+ "normalized": true,
1002
+ "rstrip": false,
1003
+ "single_word": false,
1004
+ "special": false
1005
+ },
1006
+ "32122": {
1007
+ "content": "<y20>",
1008
+ "lstrip": false,
1009
+ "normalized": true,
1010
+ "rstrip": false,
1011
+ "single_word": false,
1012
+ "special": false
1013
+ },
1014
+ "32123": {
1015
+ "content": "<y21>",
1016
+ "lstrip": false,
1017
+ "normalized": true,
1018
+ "rstrip": false,
1019
+ "single_word": false,
1020
+ "special": false
1021
+ },
1022
+ "32124": {
1023
+ "content": "<y22>",
1024
+ "lstrip": false,
1025
+ "normalized": true,
1026
+ "rstrip": false,
1027
+ "single_word": false,
1028
+ "special": false
1029
+ },
1030
+ "32125": {
1031
+ "content": "<y23>",
1032
+ "lstrip": false,
1033
+ "normalized": true,
1034
+ "rstrip": false,
1035
+ "single_word": false,
1036
+ "special": false
1037
+ },
1038
+ "32126": {
1039
+ "content": "<y24>",
1040
+ "lstrip": false,
1041
+ "normalized": true,
1042
+ "rstrip": false,
1043
+ "single_word": false,
1044
+ "special": false
1045
+ },
1046
+ "32127": {
1047
+ "content": "<y25>",
1048
+ "lstrip": false,
1049
+ "normalized": true,
1050
+ "rstrip": false,
1051
+ "single_word": false,
1052
+ "special": false
1053
+ },
1054
+ "32128": {
1055
+ "content": "<y26>",
1056
+ "lstrip": false,
1057
+ "normalized": true,
1058
+ "rstrip": false,
1059
+ "single_word": false,
1060
+ "special": false
1061
+ },
1062
+ "32129": {
1063
+ "content": "<y27>",
1064
+ "lstrip": false,
1065
+ "normalized": true,
1066
+ "rstrip": false,
1067
+ "single_word": false,
1068
+ "special": false
1069
+ },
1070
+ "32130": {
1071
+ "content": "<y28>",
1072
+ "lstrip": false,
1073
+ "normalized": true,
1074
+ "rstrip": false,
1075
+ "single_word": false,
1076
+ "special": false
1077
+ },
1078
+ "32131": {
1079
+ "content": "<y29>",
1080
+ "lstrip": false,
1081
+ "normalized": true,
1082
+ "rstrip": false,
1083
+ "single_word": false,
1084
+ "special": false
1085
+ },
1086
+ "32132": {
1087
+ "content": "<y30>",
1088
+ "lstrip": false,
1089
+ "normalized": true,
1090
+ "rstrip": false,
1091
+ "single_word": false,
1092
+ "special": false
1093
+ },
1094
+ "32133": {
1095
+ "content": "<y31>",
1096
+ "lstrip": false,
1097
+ "normalized": true,
1098
+ "rstrip": false,
1099
+ "single_word": false,
1100
+ "special": false
1101
+ },
1102
+ "32134": {
1103
+ "content": "<y32>",
1104
+ "lstrip": false,
1105
+ "normalized": true,
1106
+ "rstrip": false,
1107
+ "single_word": false,
1108
+ "special": false
1109
+ },
1110
+ "32135": {
1111
+ "content": "<y33>",
1112
+ "lstrip": false,
1113
+ "normalized": true,
1114
+ "rstrip": false,
1115
+ "single_word": false,
1116
+ "special": false
1117
+ },
1118
+ "32136": {
1119
+ "content": "<y34>",
1120
+ "lstrip": false,
1121
+ "normalized": true,
1122
+ "rstrip": false,
1123
+ "single_word": false,
1124
+ "special": false
1125
+ },
1126
+ "32137": {
1127
+ "content": "<y35>",
1128
+ "lstrip": false,
1129
+ "normalized": true,
1130
+ "rstrip": false,
1131
+ "single_word": false,
1132
+ "special": false
1133
+ },
1134
+ "32138": {
1135
+ "content": "<y36>",
1136
+ "lstrip": false,
1137
+ "normalized": true,
1138
+ "rstrip": false,
1139
+ "single_word": false,
1140
+ "special": false
1141
+ },
1142
+ "32139": {
1143
+ "content": "<y37>",
1144
+ "lstrip": false,
1145
+ "normalized": true,
1146
+ "rstrip": false,
1147
+ "single_word": false,
1148
+ "special": false
1149
+ },
1150
+ "32140": {
1151
+ "content": "<y38>",
1152
+ "lstrip": false,
1153
+ "normalized": true,
1154
+ "rstrip": false,
1155
+ "single_word": false,
1156
+ "special": false
1157
+ },
1158
+ "32141": {
1159
+ "content": "<y39>",
1160
+ "lstrip": false,
1161
+ "normalized": true,
1162
+ "rstrip": false,
1163
+ "single_word": false,
1164
+ "special": false
1165
+ },
1166
+ "32142": {
1167
+ "content": "<y40>",
1168
+ "lstrip": false,
1169
+ "normalized": true,
1170
+ "rstrip": false,
1171
+ "single_word": false,
1172
+ "special": false
1173
+ },
1174
+ "32143": {
1175
+ "content": "<y41>",
1176
+ "lstrip": false,
1177
+ "normalized": true,
1178
+ "rstrip": false,
1179
+ "single_word": false,
1180
+ "special": false
1181
+ },
1182
+ "32144": {
1183
+ "content": "<y42>",
1184
+ "lstrip": false,
1185
+ "normalized": true,
1186
+ "rstrip": false,
1187
+ "single_word": false,
1188
+ "special": false
1189
+ },
1190
+ "32145": {
1191
+ "content": "<y43>",
1192
+ "lstrip": false,
1193
+ "normalized": true,
1194
+ "rstrip": false,
1195
+ "single_word": false,
1196
+ "special": false
1197
+ },
1198
+ "32146": {
1199
+ "content": "<y44>",
1200
+ "lstrip": false,
1201
+ "normalized": true,
1202
+ "rstrip": false,
1203
+ "single_word": false,
1204
+ "special": false
1205
+ },
1206
+ "32147": {
1207
+ "content": "<y45>",
1208
+ "lstrip": false,
1209
+ "normalized": true,
1210
+ "rstrip": false,
1211
+ "single_word": false,
1212
+ "special": false
1213
+ },
1214
+ "32148": {
1215
+ "content": "<y46>",
1216
+ "lstrip": false,
1217
+ "normalized": true,
1218
+ "rstrip": false,
1219
+ "single_word": false,
1220
+ "special": false
1221
+ },
1222
+ "32149": {
1223
+ "content": "<y47>",
1224
+ "lstrip": false,
1225
+ "normalized": true,
1226
+ "rstrip": false,
1227
+ "single_word": false,
1228
+ "special": false
1229
+ },
1230
+ "32150": {
1231
+ "content": "<y48>",
1232
+ "lstrip": false,
1233
+ "normalized": true,
1234
+ "rstrip": false,
1235
+ "single_word": false,
1236
+ "special": false
1237
+ },
1238
+ "32151": {
1239
+ "content": "<y49>",
1240
+ "lstrip": false,
1241
+ "normalized": true,
1242
+ "rstrip": false,
1243
+ "single_word": false,
1244
+ "special": false
1245
+ },
1246
+ "32152": {
1247
+ "content": "<y50>",
1248
+ "lstrip": false,
1249
+ "normalized": true,
1250
+ "rstrip": false,
1251
+ "single_word": false,
1252
+ "special": false
1253
+ },
1254
+ "32153": {
1255
+ "content": "<y51>",
1256
+ "lstrip": false,
1257
+ "normalized": true,
1258
+ "rstrip": false,
1259
+ "single_word": false,
1260
+ "special": false
1261
+ },
1262
+ "32154": {
1263
+ "content": "<y52>",
1264
+ "lstrip": false,
1265
+ "normalized": true,
1266
+ "rstrip": false,
1267
+ "single_word": false,
1268
+ "special": false
1269
+ },
1270
+ "32155": {
1271
+ "content": "<y53>",
1272
+ "lstrip": false,
1273
+ "normalized": true,
1274
+ "rstrip": false,
1275
+ "single_word": false,
1276
+ "special": false
1277
+ },
1278
+ "32156": {
1279
+ "content": "<y54>",
1280
+ "lstrip": false,
1281
+ "normalized": true,
1282
+ "rstrip": false,
1283
+ "single_word": false,
1284
+ "special": false
1285
+ },
1286
+ "32157": {
1287
+ "content": "<y55>",
1288
+ "lstrip": false,
1289
+ "normalized": true,
1290
+ "rstrip": false,
1291
+ "single_word": false,
1292
+ "special": false
1293
+ },
1294
+ "32158": {
1295
+ "content": "<y56>",
1296
+ "lstrip": false,
1297
+ "normalized": true,
1298
+ "rstrip": false,
1299
+ "single_word": false,
1300
+ "special": false
1301
+ },
1302
+ "32159": {
1303
+ "content": "<y57>",
1304
+ "lstrip": false,
1305
+ "normalized": true,
1306
+ "rstrip": false,
1307
+ "single_word": false,
1308
+ "special": false
1309
+ },
1310
+ "32160": {
1311
+ "content": "<y58>",
1312
+ "lstrip": false,
1313
+ "normalized": true,
1314
+ "rstrip": false,
1315
+ "single_word": false,
1316
+ "special": false
1317
+ },
1318
+ "32161": {
1319
+ "content": "<y59>",
1320
+ "lstrip": false,
1321
+ "normalized": true,
1322
+ "rstrip": false,
1323
+ "single_word": false,
1324
+ "special": false
1325
+ },
1326
+ "32162": {
1327
+ "content": "<y60>",
1328
+ "lstrip": false,
1329
+ "normalized": true,
1330
+ "rstrip": false,
1331
+ "single_word": false,
1332
+ "special": false
1333
+ },
1334
+ "32163": {
1335
+ "content": "<y61>",
1336
+ "lstrip": false,
1337
+ "normalized": true,
1338
+ "rstrip": false,
1339
+ "single_word": false,
1340
+ "special": false
1341
+ },
1342
+ "32164": {
1343
+ "content": "<y62>",
1344
+ "lstrip": false,
1345
+ "normalized": true,
1346
+ "rstrip": false,
1347
+ "single_word": false,
1348
+ "special": false
1349
+ },
1350
+ "32165": {
1351
+ "content": "<y63>",
1352
+ "lstrip": false,
1353
+ "normalized": true,
1354
+ "rstrip": false,
1355
+ "single_word": false,
1356
+ "special": false
1357
+ },
1358
+ "32166": {
1359
+ "content": "<y64>",
1360
+ "lstrip": false,
1361
+ "normalized": true,
1362
+ "rstrip": false,
1363
+ "single_word": false,
1364
+ "special": false
1365
+ },
1366
+ "32167": {
1367
+ "content": "<y65>",
1368
+ "lstrip": false,
1369
+ "normalized": true,
1370
+ "rstrip": false,
1371
+ "single_word": false,
1372
+ "special": false
1373
+ },
1374
+ "32168": {
1375
+ "content": "<y66>",
1376
+ "lstrip": false,
1377
+ "normalized": true,
1378
+ "rstrip": false,
1379
+ "single_word": false,
1380
+ "special": false
1381
+ },
1382
+ "32169": {
1383
+ "content": "<y67>",
1384
+ "lstrip": false,
1385
+ "normalized": true,
1386
+ "rstrip": false,
1387
+ "single_word": false,
1388
+ "special": false
1389
+ },
1390
+ "32170": {
1391
+ "content": "<y68>",
1392
+ "lstrip": false,
1393
+ "normalized": true,
1394
+ "rstrip": false,
1395
+ "single_word": false,
1396
+ "special": false
1397
+ },
1398
+ "32171": {
1399
+ "content": "<y69>",
1400
+ "lstrip": false,
1401
+ "normalized": true,
1402
+ "rstrip": false,
1403
+ "single_word": false,
1404
+ "special": false
1405
+ },
1406
+ "32172": {
1407
+ "content": "<y70>",
1408
+ "lstrip": false,
1409
+ "normalized": true,
1410
+ "rstrip": false,
1411
+ "single_word": false,
1412
+ "special": false
1413
+ },
1414
+ "32173": {
1415
+ "content": "<y71>",
1416
+ "lstrip": false,
1417
+ "normalized": true,
1418
+ "rstrip": false,
1419
+ "single_word": false,
1420
+ "special": false
1421
+ },
1422
+ "32174": {
1423
+ "content": "<y72>",
1424
+ "lstrip": false,
1425
+ "normalized": true,
1426
+ "rstrip": false,
1427
+ "single_word": false,
1428
+ "special": false
1429
+ },
1430
+ "32175": {
1431
+ "content": "<y73>",
1432
+ "lstrip": false,
1433
+ "normalized": true,
1434
+ "rstrip": false,
1435
+ "single_word": false,
1436
+ "special": false
1437
+ },
1438
+ "32176": {
1439
+ "content": "<y74>",
1440
+ "lstrip": false,
1441
+ "normalized": true,
1442
+ "rstrip": false,
1443
+ "single_word": false,
1444
+ "special": false
1445
+ },
1446
+ "32177": {
1447
+ "content": "<y75>",
1448
+ "lstrip": false,
1449
+ "normalized": true,
1450
+ "rstrip": false,
1451
+ "single_word": false,
1452
+ "special": false
1453
+ },
1454
+ "32178": {
1455
+ "content": "<y76>",
1456
+ "lstrip": false,
1457
+ "normalized": true,
1458
+ "rstrip": false,
1459
+ "single_word": false,
1460
+ "special": false
1461
+ },
1462
+ "32179": {
1463
+ "content": "<y77>",
1464
+ "lstrip": false,
1465
+ "normalized": true,
1466
+ "rstrip": false,
1467
+ "single_word": false,
1468
+ "special": false
1469
+ },
1470
+ "32180": {
1471
+ "content": "<y78>",
1472
+ "lstrip": false,
1473
+ "normalized": true,
1474
+ "rstrip": false,
1475
+ "single_word": false,
1476
+ "special": false
1477
+ },
1478
+ "32181": {
1479
+ "content": "<y79>",
1480
+ "lstrip": false,
1481
+ "normalized": true,
1482
+ "rstrip": false,
1483
+ "single_word": false,
1484
+ "special": false
1485
+ },
1486
+ "32182": {
1487
+ "content": "<y80>",
1488
+ "lstrip": false,
1489
+ "normalized": true,
1490
+ "rstrip": false,
1491
+ "single_word": false,
1492
+ "special": false
1493
+ },
1494
+ "32183": {
1495
+ "content": "<y81>",
1496
+ "lstrip": false,
1497
+ "normalized": true,
1498
+ "rstrip": false,
1499
+ "single_word": false,
1500
+ "special": false
1501
+ },
1502
+ "32184": {
1503
+ "content": "<y82>",
1504
+ "lstrip": false,
1505
+ "normalized": true,
1506
+ "rstrip": false,
1507
+ "single_word": false,
1508
+ "special": false
1509
+ },
1510
+ "32185": {
1511
+ "content": "<y83>",
1512
+ "lstrip": false,
1513
+ "normalized": true,
1514
+ "rstrip": false,
1515
+ "single_word": false,
1516
+ "special": false
1517
+ },
1518
+ "32186": {
1519
+ "content": "<y84>",
1520
+ "lstrip": false,
1521
+ "normalized": true,
1522
+ "rstrip": false,
1523
+ "single_word": false,
1524
+ "special": false
1525
+ },
1526
+ "32187": {
1527
+ "content": "<y85>",
1528
+ "lstrip": false,
1529
+ "normalized": true,
1530
+ "rstrip": false,
1531
+ "single_word": false,
1532
+ "special": false
1533
+ },
1534
+ "32188": {
1535
+ "content": "<y86>",
1536
+ "lstrip": false,
1537
+ "normalized": true,
1538
+ "rstrip": false,
1539
+ "single_word": false,
1540
+ "special": false
1541
+ },
1542
+ "32189": {
1543
+ "content": "<y87>",
1544
+ "lstrip": false,
1545
+ "normalized": true,
1546
+ "rstrip": false,
1547
+ "single_word": false,
1548
+ "special": false
1549
+ },
1550
+ "32190": {
1551
+ "content": "<y88>",
1552
+ "lstrip": false,
1553
+ "normalized": true,
1554
+ "rstrip": false,
1555
+ "single_word": false,
1556
+ "special": false
1557
+ },
1558
+ "32191": {
1559
+ "content": "<y89>",
1560
+ "lstrip": false,
1561
+ "normalized": true,
1562
+ "rstrip": false,
1563
+ "single_word": false,
1564
+ "special": false
1565
+ },
1566
+ "32192": {
1567
+ "content": "<y90>",
1568
+ "lstrip": false,
1569
+ "normalized": true,
1570
+ "rstrip": false,
1571
+ "single_word": false,
1572
+ "special": false
1573
+ },
1574
+ "32193": {
1575
+ "content": "<y91>",
1576
+ "lstrip": false,
1577
+ "normalized": true,
1578
+ "rstrip": false,
1579
+ "single_word": false,
1580
+ "special": false
1581
+ },
1582
+ "32194": {
1583
+ "content": "<y92>",
1584
+ "lstrip": false,
1585
+ "normalized": true,
1586
+ "rstrip": false,
1587
+ "single_word": false,
1588
+ "special": false
1589
+ },
1590
+ "32195": {
1591
+ "content": "<y93>",
1592
+ "lstrip": false,
1593
+ "normalized": true,
1594
+ "rstrip": false,
1595
+ "single_word": false,
1596
+ "special": false
1597
+ },
1598
+ "32196": {
1599
+ "content": "<y94>",
1600
+ "lstrip": false,
1601
+ "normalized": true,
1602
+ "rstrip": false,
1603
+ "single_word": false,
1604
+ "special": false
1605
+ },
1606
+ "32197": {
1607
+ "content": "<y95>",
1608
+ "lstrip": false,
1609
+ "normalized": true,
1610
+ "rstrip": false,
1611
+ "single_word": false,
1612
+ "special": false
1613
+ },
1614
+ "32198": {
1615
+ "content": "<y96>",
1616
+ "lstrip": false,
1617
+ "normalized": true,
1618
+ "rstrip": false,
1619
+ "single_word": false,
1620
+ "special": false
1621
+ },
1622
+ "32199": {
1623
+ "content": "<y97>",
1624
+ "lstrip": false,
1625
+ "normalized": true,
1626
+ "rstrip": false,
1627
+ "single_word": false,
1628
+ "special": false
1629
+ },
1630
+ "32200": {
1631
+ "content": "<y98>",
1632
+ "lstrip": false,
1633
+ "normalized": true,
1634
+ "rstrip": false,
1635
+ "single_word": false,
1636
+ "special": false
1637
+ },
1638
+ "32201": {
1639
+ "content": "<y99>",
1640
+ "lstrip": false,
1641
+ "normalized": true,
1642
+ "rstrip": false,
1643
+ "single_word": false,
1644
+ "special": false
1645
+ },
1646
+ "32202": {
1647
+ "content": "<box>",
1648
+ "lstrip": false,
1649
+ "normalized": true,
1650
+ "rstrip": false,
1651
+ "single_word": false,
1652
+ "special": false
1653
+ },
1654
+ "32203": {
1655
+ "content": "</box>",
1656
+ "lstrip": false,
1657
+ "normalized": true,
1658
+ "rstrip": false,
1659
+ "single_word": false,
1660
+ "special": false
1661
+ },
1662
+ "32204": {
1663
+ "content": "<image>",
1664
+ "lstrip": false,
1665
+ "normalized": true,
1666
+ "rstrip": false,
1667
+ "single_word": false,
1668
+ "special": false
1669
+ },
1670
+ "32205": {
1671
+ "content": "<prev_im>",
1672
+ "lstrip": false,
1673
+ "normalized": true,
1674
+ "rstrip": false,
1675
+ "single_word": false,
1676
+ "special": false
1677
+ },
1678
+ "32206": {
1679
+ "content": "<lat_image>",
1680
+ "lstrip": false,
1681
+ "normalized": true,
1682
+ "rstrip": false,
1683
+ "single_word": false,
1684
+ "special": false
1685
+ }
1686
+ },
1687
+ "bos_token": "<s>",
1688
+ "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}You are an expert radiology assistant tasked with interpreting a chest X-ray study. {% for message in messages %}{% if message[\"role\"] == \"user\" %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message[\"content\"] %}{% if item[\"type\"] == \"text\" %}{{ item[\"text\"] }}{% elif item[\"type\"] == \"image\" %}<image>{% endif %}{% endfor %}{% if message[\"role\"] == \"user\" %} {% else %}{{eos_token}}{% endif %}{% endfor %}{% if add_generation_prompt %}ASSISTANT: {% endif %}",
1689
+ "clean_up_tokenization_spaces": false,
1690
+ "eos_token": "</s>",
1691
+ "extra_special_tokens": {},
1692
+ "legacy": false,
1693
+ "model_max_length": 4096,
1694
+ "pad_token": "<unk>",
1695
+ "padding_side": "left",
1696
+ "processor_class": "Maira2Processor",
1697
+ "sp_model_kwargs": {},
1698
+ "spaces_between_special_tokens": false,
1699
+ "tokenizer_class": "LlamaTokenizer",
1700
+ "unk_token": "<unk>",
1701
+ "use_default_system_prompt": false
1702
+ }
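The `chat_template` above fully determines the prompt layout: a fixed radiology-assistant preamble, `USER:`/`ASSISTANT:` turns, `<image>` placeholders for image content items, and a trailing `ASSISTANT: ` when a generation prompt is requested. A small sketch of rendering it (the local path is an assumption):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")  # assumed: run from a local clone of this repo

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Provide a description of the findings."},
        ],
    }
]
prompt = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
# Renders as a single line:
# You are an expert radiology assistant tasked with interpreting a chest X-ray
# study. USER: <image>Provide a description of the findings. ASSISTANT:
```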