---
base_model: Qwen/Qwen2-VL-7B
library_name: peft
pipeline_tag: image-text-to-text
tags:
- base_model:adapter:Qwen/Qwen2-VL-7B
- lora
- qwen2_vl
- multimodal
- transformers
license: apache-2.0
language:
- en
---

# MATRIX-PT

MATRIX-PT is a parameter-efficient LoRA adapter released by **Radical AI** for **Qwen/Qwen2-VL-7B**. It is designed to study post-training adaptations for materials science tasks, with a focus on theoretical reasoning, scientific problem solving, and multimodal reasoning over experimental images.

This model is released alongside the **MATRIX** benchmark ([dataset link](https://huggingface.co/datasets/radical-ai/MATRIX)), which is used to evaluate reasoning across text- and image-based materials science tasks.

---

## Model Details

### Model Description

- **Developed by:** Radical AI
- **Model type:** LoRA adapter (PEFT) for a multimodal transformer
- **Base model:** `Qwen/Qwen2-VL-7B`
- **Language(s):** English
- **License:** Apache-2.0 (adapter); the base model's license applies to `Qwen/Qwen2-VL-7B`
- **Finetuned from model:** `Qwen/Qwen2-VL-7B`

MATRIX-PT modifies the base model through lightweight post-training to better surface domain-relevant reasoning patterns in materials science. The adapter primarily affects inference-time behavior, improving the model's ability to reason about structured scientific concepts and experimental imagery without altering the underlying base weights.
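
The LoRA mechanism behind this can be sketched numerically: at inference, the adapter contributes a low-rank update scaled by `alpha / r` on top of each frozen base weight matrix. A toy illustration with tiny hand-made matrices (not actual model weights):

```python
# Toy illustration of a LoRA update: W_eff = W + (alpha / r) * (B @ A).
# The matrices here are small stand-ins, not real Qwen2-VL weights.

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_effective_weight(W, A, B, r, alpha):
    """Frozen base weight W plus the scaled low-rank update (alpha / r) * (B @ A)."""
    scale = alpha / r
    BA = matmul(B, A)  # (d_out x r) @ (r x d_in) -> full-size update
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

# Rank-1 update on a 2x2 "weight": the base W is never modified,
# only the effective weight seen at inference changes.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d_out x r
A = [[0.5, 0.5]]     # r x d_in
print(lora_effective_weight(W, A, B, r=1, alpha=2))  # [[2.0, 1.0], [2.0, 3.0]]
```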

### Model Sources

- **Repository:** https://huggingface.co/radical-ai/MATRIX-PT
- **Paper:** *[MATRIX: A Multimodal Benchmark and Post-Training Framework for Materials Science](https://www.arxiv.org/pdf/2602.00376)*
- **Benchmark:** https://huggingface.co/datasets/radical-ai/MATRIX

---

## Uses

### Direct Use

MATRIX-PT is intended for:

- Evaluating multimodal reasoning in materials science
- Studying post-training effects on scientific reasoning behavior
- Benchmarking model performance on theory-driven and experiment-driven tasks using MATRIX

The adapter can be loaded on top of `Qwen/Qwen2-VL-7B` with PEFT, without modifying the base model weights.

### Downstream Use

The adapter may serve as a starting point for:

- Further domain-specific fine-tuning
- Diagnostic studies of reasoning behavior in scientific models
- Comparative evaluation against other multimodal or domain-adapted models

### Out-of-Scope Use

MATRIX-PT is **not** intended for:

- General-purpose conversational use
- High-stakes decision making (e.g., medical, legal, industrial control)
- Deployment without human oversight in safety-critical settings

---

## Bias, Risks, and Limitations

- MATRIX-PT inherits limitations and biases from the base model, including potential hallucinations and incorrect reasoning.
- The adapter is trained and evaluated on a focused materials science benchmark and may not generalize outside this domain.
- Performance improvements are task- and prompt-dependent and should not be interpreted as broad scientific understanding.
- As with most LLMs/VLMs, the model may produce plausible-sounding but incorrect explanations.

### Recommendations

Users should:

- Treat outputs as assistive rather than authoritative
- Validate results against domain expertise or ground truth
- Use MATRIX-PT primarily for evaluation, analysis, and research purposes

---

## How to Get Started with the Model

### Install

**Tested versions:**

```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect.
pip install "torch>=2.0.0" "torchvision>=0.15.0"
pip install "transformers>=4.56.0" "peft>=0.17.0" "accelerate>=1.10.0"
pip install "pillow>=10.0.0" "qwen-vl-utils>=0.0.8"
```

**Or install everything at once:**

```bash
pip install "torch>=2.0.0" "torchvision>=0.15.0" "transformers>=4.56.0" "peft>=0.17.0" "accelerate>=1.10.0" "pillow>=10.0.0" "qwen-vl-utils>=0.0.8"
```

### Load the Adapter

```python
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from peft import PeftModel

DEFAULT_EOS_TOKEN = "</s>"
DEFAULT_BOS_TOKEN = "<s>"
DEFAULT_UNK_TOKEN = "<unk>"


def align_tokenizer_and_model(tokenizer, model):
    """
    Ensure required special tokens exist and resize embeddings to match the tokenizer vocab.
    This is necessary because the adapter was trained with this alignment.
    """
    special_tokens = {}
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    if tokenizer.eos_token is None:
        special_tokens["eos_token"] = DEFAULT_EOS_TOKEN
    if tokenizer.bos_token is None:
        special_tokens["bos_token"] = DEFAULT_BOS_TOKEN
    if tokenizer.unk_token is None:
        special_tokens["unk_token"] = DEFAULT_UNK_TOKEN

    num_new_tokens = tokenizer.add_special_tokens(special_tokens)
    if num_new_tokens > 0 or model.get_input_embeddings().weight.shape[0] != len(tokenizer):
        model.resize_token_embeddings(len(tokenizer))
    if num_new_tokens > 0:
        input_embeds = model.get_input_embeddings().weight.data
        output_embeds = model.get_output_embeddings().weight.data

        # Initialize the new embedding rows from the <unk> embedding if available,
        # otherwise from the mean of the existing embeddings.
        if tokenizer.unk_token_id is not None:
            input_init = input_embeds[tokenizer.unk_token_id].unsqueeze(0)
            output_init = output_embeds[tokenizer.unk_token_id].unsqueeze(0)
        else:
            input_init = input_embeds[:-num_new_tokens].mean(dim=0, keepdim=True)
            output_init = output_embeds[:-num_new_tokens].mean(dim=0, keepdim=True)

        input_embeds[-num_new_tokens:] = input_init
        output_embeds[-num_new_tokens:] = output_init


# Model IDs
base_model_id = "Qwen/Qwen2-VL-7B"
adapter_id = "radical-ai/MATRIX-PT"

# Load the processor from the base model
processor = AutoProcessor.from_pretrained(base_model_id, trust_remote_code=True)
tokenizer = processor.tokenizer
tokenizer.padding_side = "left"
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

# Use the Instruct processor for the chat template (the base model's template has issues)
instruct_processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    trust_remote_code=True,
)
processor.chat_template = instruct_processor.chat_template
tokenizer.chat_template = instruct_processor.tokenizer.chat_template

# Load the base model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# IMPORTANT: Align tokenizer and model before loading the adapter
align_tokenizer_and_model(tokenizer, model)

# Load the adapter
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()
```

### Run Inference

```python
# Text-only inference
question = "What is a phase diagram?"
messages = [{"role": "user", "content": question}]

rendered = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([rendered], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

# Decode only the new tokens
input_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, input_len:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)[0].strip()

print(response)
```

### With Images

```python
from PIL import Image

# Load the image
image = Image.open("path/to/image.png").convert("RGB")

# Create a message with an image placeholder
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this experimental image."},
        ],
    }
]

# Process text and image together
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Convert pixel_values to bfloat16 if present, to match the model dtype
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        pad_token_id=tokenizer.pad_token_id,
    )

input_len = inputs["input_ids"].shape[1]
generated_ids = outputs[:, input_len:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)[0].strip()

print(response)
```

## Training Details

### Training Data

The adapter was trained on a curated materials science dataset emphasizing:

- Foundational theory questions
- Research-level reasoning
- Hypothesis generation
- Multimodal reasoning over experimental imagery

For evaluation details, see the [MATRIX dataset](https://huggingface.co/datasets/radical-ai/MATRIX) card and the accompanying paper.

### Training Procedure

- Method: LoRA (parameter-efficient fine-tuning)
- LoRA rank (r): 8
- LoRA alpha: 32
- LoRA dropout: 0.05
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
- Objective: Improve accessibility of materials-science-relevant reasoning patterns during inference
- Training regime: Mixed precision (bf16)
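
With rank r = 8, each adapted matrix of shape (d_in, d_out) contributes r * (d_in + d_out) trainable parameters. A rough per-layer calculator (the hidden, KV, and intermediate sizes below are assumptions about Qwen2-VL-7B for illustration; check the actual model config):

```python
def lora_param_count(shapes, r=8):
    """Trainable LoRA params: each adapted (d_in, d_out) matrix adds r * (d_in + d_out)."""
    return sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Assumed per-layer shapes (hidden=3584, grouped-query KV dim=512, MLP intermediate=18944);
# illustrative values only -- verify against the Qwen/Qwen2-VL-7B config.
hidden, inter, kv = 3584, 18944, 512
per_layer = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, inter),   # gate_proj
    (hidden, inter),   # up_proj
    (inter, hidden),   # down_proj
]
print(lora_param_count(per_layer))  # trainable params per transformer layer
```

Multiplying by the number of layers gives the adapter's total trainable parameters, a tiny fraction of the 7B base model.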

## Evaluation

### Testing Data

MATRIX-PT is benchmarked on the **MATRIX** dataset, which consists of both textual and visual reasoning tasks in materials science. Evaluation compares the adapted model against the base `Qwen/Qwen2-VL-7B` model under identical prompting and decoding settings.

### Metrics

- Task accuracy
- Reasoning consistency across related prompts
- Qualitative error analysis (see the accompanying paper)
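
For short-answer items, task accuracy is typically computed as normalized exact match between predictions and gold answers. A minimal sketch; the normalization rules here are illustrative and not the paper's exact scoring procedure:

```python
import re

def normalize(text):
    """Lowercase and strip punctuation/extra whitespace for exact-match comparison."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s.%-]", "", text)  # keep word chars, spaces, '.', '%', '-'
    return re.sub(r"\s+", " ", text)

def exact_match_accuracy(predictions, references):
    """Fraction of items where the normalized prediction equals the normalized gold answer."""
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)

print(exact_match_accuracy(["FCC", "bcc"], ["fcc", "HCP"]))  # 0.5
```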

## Results

Across MATRIX tasks, MATRIX-PT demonstrates improved performance relative to the base model, particularly on:

- Theory-driven reasoning questions
- Structured scientific problem solving
- Interpretation of experimental images

These improvements manifest primarily at inference time, highlighting the role of post-training in shaping reasoning accessibility rather than training-time memorization alone.

## Citation

If you use this model or the MATRIX benchmark, please cite the accompanying paper:

[MATRIX: A Multimodal Benchmark and Post-Training Framework for Materials Science](https://www.arxiv.org/pdf/2602.00376)

### BibTeX

```
@article{mcgrath2026matrix,
  title   = {MATRIX: A Multimodal Benchmark and Post-Training Framework for Materials Science},
  author  = {McGrath, Delia and Chong, Curtis and Kulkarni, Rohil and Ceder, Gerbrand and Kolluru, Adeesh},
  journal = {arXiv preprint arXiv:2602.00376},
  year    = {2026}
}
```

### Framework Versions

- PEFT: 0.18.0
- Transformers: 4.56.0+
- PyTorch: 2.0.0+
- Python: 3.10+