---
base_model: unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- mllama
license: apache-2.0
language:
- en
---

# Fine-tuned Vision-Language Model for Radiology Report Generation

This repository contains a vision-language model fine-tuned to generate radiology reports. It was fine-tuned with the [Unsloth](https://github.com/unslothai/unsloth) library, using Llama-3.2-11B-Vision-Instruct as the base model.

## Model Description

This model is fine-tuned on [Radiology_mini](https://huggingface.co/datasets/unsloth/Radiology_mini), a sampled version of the ROCO radiography dataset. It is intended to assist medical professionals by drafting descriptions of medical images such as X-rays, CT scans, and ultrasounds.

Fine-tuning uses Low-Rank Adaptation (LoRA), which trains small adapter matrices on the language layers while the vision layers stay frozen. This keeps the compute and memory needed for fine-tuning well below that of full fine-tuning, while still adapting the model's output to radiology captions.

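To make the resource saving concrete, the sketch below counts the trainable parameters LoRA adds to a single weight matrix. The function name `lora_param_count`, the 4096 x 4096 layer size, and the rank of 16 are illustrative assumptions; this card does not state the rank or the layer shapes used.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns the update as B @ A, with A of shape (rank, d_in) and
    B of shape (d_out, rank); the original weight stays frozen.
    """
    return rank * (d_in + d_out)


# Illustrative sizes only (assumed, not read from this model's config).
d_in, d_out, rank = 4096, 4096, 16

full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)

print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {lora:,} parameters ({lora / full:.2%} of full)")
```

With these assumed sizes, the LoRA update has 131,072 trainable parameters, under 1% of the 16,777,216 in the full matrix.
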
## Usage

To use this model, install the Unsloth library:

```bash
pip install unsloth
```

Then load the model and tokenizer:

```python
from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained("awaliuddin/unsloth_finetune", load_in_4bit=True)
FastVisionModel.for_inference(model)  # Switch the model to inference mode
```

Next, generate a report for an image. This continues from the loading step and assumes a CUDA GPU:

```python
from PIL import Image
from transformers import TextStreamer

image = Image.open("path/to/your/image.jpg")  # Replace with your image path
instruction = "You are an expert radiographer. Describe accurately what you see in this image."

# One user turn: an image placeholder followed by the text instruction
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction},
    ]}
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

# Stream generated tokens to stdout as they are produced
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=128,
                   use_cache=True, temperature=1.5, min_p=0.1)
```

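The chat structure used above can be wrapped in a small helper when trying several instructions. This is only a sketch: `build_messages` is a hypothetical name, not part of Unsloth or Transformers, and the example instruction is made up for illustration.

```python
def build_messages(instruction: str) -> list[dict]:
    """Return a single-turn chat with one image slot followed by text."""
    return [
        {
            "role": "user",
            "content": [
                # Placeholder only; the image itself is passed to the tokenizer call.
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]


# Hypothetical instruction, for illustration
messages = build_messages("Describe the key findings in this image.")
print(messages[0]["content"][1]["text"])
```

The returned list can be passed to `tokenizer.apply_chat_template` in place of the hand-written `messages` list.
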
## Training Details

* **Base Model:** Llama-3.2-11B-Vision-Instruct
* **Dataset:** Radiology_mini (sampled from ROCO radiography dataset)
* **Fine-tuning Method:** LoRA (language layers only)
* **Optimizer:** AdamW 8-bit
* **Learning Rate:** 2e-4

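The optimizer and learning rate listed above correspond to the `optim` and `learning_rate` fields of Hugging Face `TrainingArguments`, which TRL's `SFTConfig` inherits. The sketch below records them as a plain dictionary. Only these two values come from this card; the variable name `training_overrides` and the `output_dir` in the commented usage are assumptions, and all other training settings are left out rather than guessed.

```python
# Only these two values are stated in this model card.
training_overrides = {
    "optim": "adamw_8bit",   # AdamW 8-bit
    "learning_rate": 2e-4,
}

# Hypothetical usage, assuming TRL is installed:
#   from trl import SFTConfig
#   args = SFTConfig(output_dir="outputs", **training_overrides)
for name, value in training_overrides.items():
    print(f"{name} = {value}")
```
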
## Limitations

* This model was trained on a small, sampled dataset and may not generalize well to all types of medical images.
* Generated reports should be reviewed by qualified medical professionals before being used for diagnostic purposes.

## Acknowledgements

* The Unsloth library for efficient fine-tuning of vision-language models.
* The Hugging Face team for providing the platform and tools for model sharing.
* The authors of the ROCO radiography dataset.

## License

This model is released under the Apache-2.0 license.

# Uploaded finetuned model

- **Developed by:** Awaliuddin
- **License:** apache-2.0
- **Finetuned from model:** unsloth/llama-3.2-11b-vision-instruct-unsloth-bnb-4bit

This mllama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)