---
base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- mllama
license: apache-2.0
language:
- en
---

# Fine-tuned Vision-Language Model for Radiology Report Generation

This repository contains a fine-tuned vision-language model for generating radiology reports. It was fine-tuned with the [Unsloth](https://github.com/unslothai/unsloth) library and uses Llama-3.2-11B-Vision-Instruct as its base model.

## Model Description

This model is fine-tuned on a sampled version of the ROCO radiography dataset ([Radiology_mini](https://huggingface.co/datasets/unsloth/Radiology_mini)). It's designed to assist medical professionals by providing accurate descriptions of medical images, such as X-rays, CT scans, and ultrasounds. 

The fine-tuning process uses Low-Rank Adaptation (LoRA) to efficiently train the model, focusing on the language layers while keeping the vision layers frozen. This approach minimizes the computational resources required for fine-tuning while achieving significant performance improvements.
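The low-rank update at the heart of LoRA can be illustrated with a toy example (a minimal NumPy sketch; the dimensions and rank below are illustrative, not this model's actual configuration):

```python
import numpy as np

# Frozen pretrained weight matrix (d_out x d_in).
d_out, d_in, r = 64, 128, 8  # r is the LoRA rank, with r << min(d_out, d_in)
W = np.random.randn(d_out, d_in)

# Trainable low-rank factors: only A and B receive gradients during fine-tuning.
A = np.random.randn(r, d_in) * 0.01
B = np.zeros((d_out, r))  # B starts at zero, so the update is initially a no-op

# Effective forward pass uses W + B @ A in place of W.
x = np.random.randn(d_in)
y = W @ x + B @ (A @ x)

# Parameter count: LoRA trains r * (d_in + d_out) values instead of d_out * d_in.
lora_params = A.size + B.size  # 8 * (128 + 64) = 1536
full_params = W.size           # 64 * 128 = 8192
```

Because only `A` and `B` are updated, the optimizer state and gradients stay small, which is what makes fine-tuning an 11B-parameter model feasible on modest hardware.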

## Usage

To use this model, you'll need the Unsloth library:

```bash
pip install unsloth
```

Then, you can load the model and tokenizer:

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained("awaliuddin/unsloth_finetune", load_in_4bit=True)
FastVisionModel.for_inference(model)
```

```python
from PIL import Image

image = Image.open("path/to/your/image.jpg") # Replace with your image path
instruction = "You are an expert radiographer. Describe accurately what you see in this image."
messages = [ {"role": "user", "content": [ {"type": "image"}, {"type": "text", "text": instruction} ]} ]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True, temperature=1.5, min_p=0.1)
```

## Training Details

* **Base Model:** Llama-3.2-11B-Vision-Instruct
* **Dataset:** Radiology_mini (sampled from ROCO radiography dataset)
* **Fine-tuning Method:** LoRA (language layers only)
* **Optimizer:** AdamW 8-bit
* **Learning Rate:** 2e-4
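A training setup matching the details above might look like the following. This is a hedged sketch modeled on Unsloth's vision fine-tuning examples, not this repository's exact recipe: the batch size, step count, and `dataset` variable (assumed to hold prepared Radiology_mini samples in chat format) are assumptions.

```python
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit", load_in_4bit=True,
)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,   # vision layers stay frozen
    finetune_language_layers=True,  # LoRA applied to language layers only
    r=16, lora_alpha=16,            # illustrative rank/alpha, an assumption
)
FastVisionModel.for_training(model)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=dataset,  # assumed: prepared Radiology_mini samples
    args=SFTConfig(
        per_device_train_batch_size=2,  # assumption
        gradient_accumulation_steps=4,  # assumption
        learning_rate=2e-4,             # matches the training details above
        optim="adamw_8bit",             # AdamW 8-bit, as listed above
        max_steps=30,                   # assumption
        bf16=is_bf16_supported(),
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
    ),
)
trainer.train()
```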

## Limitations

* This model is trained on a limited dataset and might not generalize well to all types of medical images.
* The generated reports should be reviewed by qualified medical professionals before being used for diagnostic purposes.

## Acknowledgements

* The Unsloth library for efficient fine-tuning of vision-language models.
* The Hugging Face team for providing the platform and tools for model sharing.
* The authors of the ROCO radiography dataset.

## License

This model is released under the Apache-2.0 license.

# Uploaded Finetuned Model

- **Developed by:** Awaliuddin
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit

This mllama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)