---
library_name: transformers
license: mit
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
tags:
- vision-language
- product-descriptions
- e-commerce
- fine-tuned
- lora
- llama
datasets:
- philschmid/amazon-product-descriptions-vlm
language:
- en
pipeline_tag: image-text-to-text
---
# Finetuned Llama 3.2 Vision for Product Description Generation
A fine-tuned version of Meta's Llama-3.2-11B-Vision-Instruct model specialized for generating SEO-optimized product descriptions from product images, names, and categories.
## Model Details
### Model Description
This model generates concise, SEO-optimized product descriptions for e-commerce applications. Given a product image, name, and category, it produces mobile-friendly descriptions suitable for online marketplaces and product catalogs.
- **Developed by:** Aayush672
- **Model type:** Vision-Language Model (Multimodal)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
### Model Sources
- **Repository:** [Aayush672/Finetuned-llama3.2-Vision-Model](https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model)
- **Base Model:** [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
## Uses
### Direct Use
The model is designed for generating product descriptions in e-commerce scenarios:
- Product catalog automation
- SEO-optimized content generation
- Mobile-friendly product descriptions
- Marketplace listing optimization
### Example Usage
```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model = AutoModelForVision2Seq.from_pretrained(
    "Aayush672/Finetuned-llama3.2-Vision-Model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Aayush672/Finetuned-llama3.2-Vision-Model")

# Prepare your inputs: a product image plus its name and category
image = Image.open("product_image.jpg")
product_name = "Wireless Bluetooth Headphones"
category = "Electronics | Audio | Headphones"
prompt = f"""Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.
##PRODUCT NAME##: {product_name}
##CATEGORY##: {category}"""

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": prompt},
        {"type": "image", "image": image},
    ],
}]

# Apply the chat template, then tokenize text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

# Sampling must be enabled for temperature to take effect
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the prompt
description = processor.tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
```
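For catalog automation, the prompt construction above can be factored into a small helper applied once per catalog row. This is an illustrative sketch (the `build_prompt` name is not part of the released model); the prompt text mirrors the fine-tuning format shown in the example.

```python
def build_prompt(product_name: str, category: str) -> str:
    """Build the fine-tuning prompt for one catalog entry."""
    return (
        "Create a Short Product description based on the provided "
        "##PRODUCT NAME## and ##CATEGORY## and image.\n"
        "Only return description. The description should be SEO optimized "
        "and for a better mobile search experience.\n"
        f"##PRODUCT NAME##: {product_name}\n"
        f"##CATEGORY##: {category}"
    )

prompt = build_prompt("Wireless Bluetooth Headphones", "Electronics | Audio | Headphones")
```

Keeping the prompt format identical to the one used during fine-tuning matters: the model was trained on this exact template, and deviating from it can degrade output quality.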
### Out-of-Scope Use
- General conversation or chat applications
- Complex reasoning tasks
- Non-commercial product descriptions
- Content outside e-commerce domain
## Training Details
### Training Data
The model was fine-tuned on the [philschmid/amazon-product-descriptions-vlm](https://huggingface.co/datasets/philschmid/amazon-product-descriptions-vlm) dataset, which contains Amazon product images with corresponding names, categories, and descriptions.
### Training Procedure
#### Fine-tuning Method
- **Technique:** LoRA (Low-Rank Adaptation) with PEFT
- **Target modules:** q_proj, v_proj
- **LoRA rank (r):** 8
- **LoRA alpha:** 16
- **LoRA dropout:** 0.05
#### Training Hyperparameters
- **Training regime:** bf16 mixed precision with 4-bit quantization (QLoRA)
- **Number of epochs:** 1
- **Batch size:** 8 per device
- **Gradient accumulation steps:** 4
- **Learning rate:** 2e-4
- **Optimizer:** AdamW (torch fused)
- **LR scheduler:** Constant
- **Warmup ratio:** 0.03
- **Max gradient norm:** 0.3
- **Quantization:** 4-bit with double quantization (nf4)
#### Hardware & Software
- **Quantization:** BitsAndBytesConfig with 4-bit precision
- **Gradient checkpointing:** Enabled
- **Memory optimization:** QLoRA technique
- **Framework:** Transformers, TRL, PEFT
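The hyperparameters listed above map roughly onto the following PEFT/bitsandbytes/Transformers configuration. This is a hedged sketch, not the exact training script; the argument values are taken from the lists above, and everything else (e.g. output paths) is a placeholder.

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention query/value projections
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Training hyperparameters as listed above
training_args = TrainingArguments(
    output_dir="llama32-vision-product-desc",  # placeholder path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    bf16=True,
    gradient_checkpointing=True,
)
```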
## Bias, Risks, and Limitations
### Limitations
- Trained specifically on Amazon product data, so it may not generalize well to other e-commerce platforms
- Limited to English-language descriptions
- Optimized for mobile/SEO-style copy, so it may not suit all description formats
- Output quality depends on image quality and product visibility
### Recommendations
- Test thoroughly on your specific product categories before production use
- Consider additional fine-tuning for domain-specific products
- Implement content moderation for generated descriptions
- Validate SEO effectiveness for your target keywords
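A minimal post-generation check along the lines recommended above might look like the following sketch. The length limit and blocklist are illustrative placeholders to be replaced with your own moderation rules.

```python
def validate_description(text: str, max_chars: int = 400,
                         banned: tuple = ("free", "guarantee")) -> list:
    """Return a list of problems with a generated description (empty list = OK)."""
    problems = []
    stripped = text.strip()
    if not stripped:
        problems.append("empty description")
    elif len(stripped) > max_chars:
        problems.append(f"too long for mobile ({len(stripped)} > {max_chars} chars)")
    lowered = stripped.lower()
    for word in banned:
        if word in lowered:
            problems.append(f"contains banned term: {word!r}")
    return problems

print(validate_description("Compact wireless headphones with long battery life."))  # []
```

Running descriptions through a check like this before publishing catches empty or over-long generations and flags terms that may violate marketplace policies.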
## Environmental Impact
Training utilized quantized models (4-bit) to reduce computational requirements and carbon footprint compared to full-precision training.
## Technical Specifications
### Model Architecture
- **Base Architecture:** Llama 3.2 Vision (11B parameters)
- **Vision Encoder:** Integrated multimodal architecture
- **Fine-tuning:** LoRA adapters (trainable parameters: ~16M)
- **Quantization:** 4-bit with double quantization
### Compute Infrastructure
- **Training:** Optimized with gradient checkpointing and mixed precision
- **Memory:** Reduced via 4-bit quantization and LoRA
- **Inference:** Supports both quantized and full precision modes
## Citation
```bibtex
@misc{finetuned-llama32-vision-product,
  title={Fine-tuned Llama 3.2 Vision for Product Description Generation},
  author={Aayush672},
  year={2025},
  howpublished={\url{https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model}}
}
```
## Model Card Contact
For questions or issues, please open an issue in the model repository or contact the model author.