iamcode6
/

llama32-vision-ccmt-mi300x

Image-Text-to-Text

llama-3.2-vision

Model card Files Files and versions

llama32-vision-ccmt-mi300x / README.md

iamcode6's picture

Link to live Gradio demo Space

2ffc817 about 1 month ago

|

history blame contribute delete

3.15 kB

	---
	license: apache-2.0
	library_name: peft
	tags:
	- llama-3.2-vision
	- plant-disease
	- lora
	- rocm
	- mi300x
	- amd
	- vlm
	- image-text-to-text
	base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
	---

	# Llama 3.2 Vision 11B LoRA — Plant Disease Diagnosis (MI300X fine-tune)

	Fine-tuned Llama 3.2 Vision 11B with LoRA on a plant disease QA dataset
	(cashew, cassava, maize, tomato — 22 disease classes) for visual diagnosis and
	treatment recommendations.

	Trained on a single AMD Instinct MI300X using PyTorch + ROCm, as a submission to the
	lablab.ai AMD hackathon Track 3 — Building AI-Powered Applications on AMD GPUs.

	## 🌱 Try the live demo

	This adapter powers the conversational layer of an interactive Plant Disease Assistant — upload a leaf photo to get a diagnosis and treatment guide:

	👉 [huggingface.co/spaces/lablab-ai-amd-developer-hackathon/merolav-space](https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/merolav-space)

	(The hosted Space runs the lighter DINOv2-L classifier on free CPU. Load this LoRA adapter on a GPU machine for the full conversational VLM experience — see the code under [`vision/`](https://github.com/genyarko/amd-merolav/tree/main/vision) in the training repo.)

	## Results

	\| Epoch \| Train Loss \| Val Loss \| Throughput \| Wall Time \|
	\|------:\|-----------:\|---------:\|:----------:\|:---------:\|
	\| 1 \| 0.9530 \| 0.0244 \| 1.3 \| 3331.1s \|
	\| 2 \| 0.0203 \| 0.0177 \| 1.3 \| 3318.3s \|
	\| 3 \| 0.0151 \| 0.0151 \| 1.3 \| 3321.8s \|
	\| 4 \| 0.0125 \| 0.0147 \| 1.3 \| 3318.8s \|
	\| 5 \| 0.0109 \| 0.0146 \| 1.3 \| 3317.5s \|

	Best val_loss: 0.0146

	### Epoch 3 adapter

	An `epoch3_adapter/` checkpoint is included for A/B comparison. Epoch 3 had val_loss=0.0151 vs epoch 5's 0.0146 — the difference is marginal and epoch 3 may generalize equally well in practice.

	## Training Details

	- Base model: meta-llama/Llama-3.2-11B-Vision-Instruct (11B params, ~4B vision + ~7B language)
	- Method: LoRA (rank=16, alpha=32, dropout=0.05)
	- Target modules: q_proj, v_proj, k_proj, o_proj, gate_proj, up_proj, down_proj
	- Precision: bf16 (native MI300X)
	- Epochs: 5
	- Effective batch size: 16
	- Learning rate: 2e-05 with cosine decay + 0.1 warmup
	- Optimizer: AdamW (weight_decay=0.01)
	- Max sequence length: 2048
	- Hardware: 1x AMD Instinct MI300X (192 GB HBM3)

	## Usage

	```python
	from peft import PeftModel
	from transformers import AutoProcessor, MllamaForConditionalGeneration

	base = MllamaForConditionalGeneration.from_pretrained(
	"meta-llama/Llama-3.2-11B-Vision-Instruct",
	torch_dtype=torch.bfloat16,
	device_map="auto",
	)
	model = PeftModel.from_pretrained(base, "best_adapter")
	processor = AutoProcessor.from_pretrained("best_adapter")
	```

	## Artifacts

	- `best_adapter/` — LoRA weights from the best validation epoch
	- `epoch3_adapter/` — LoRA weights from epoch 3 (for A/B comparison)
	- `config.yaml` — training hyperparameters
	- `metrics.json` — per-epoch training history

	See `config.yaml` for the full hyperparameter set.

	## Source

	Training code: <https://github.com/genyarko/amd-merolav/tree/main/vision>