---
library_name: transformers
license: mit
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
tags:
- vision-language
- product-descriptions
- e-commerce
- fine-tuned
- lora
- llama
datasets:
- philschmid/amazon-product-descriptions-vlm
language:
- en
pipeline_tag: image-text-to-text
---
# Finetuned Llama 3.2 Vision for Product Description Generation
A fine-tuned version of Meta's Llama-3.2-11B-Vision-Instruct model specialized for generating SEO-optimized product descriptions from product images, names, and categories.
## Model Details
### Model Description
This model generates concise, SEO-optimized product descriptions for e-commerce applications. Given a product image, name, and category, it produces mobile-friendly descriptions suitable for online marketplaces and product catalogs.
- **Developed by:** Aayush672
- **Model type:** Vision-Language Model (Multimodal)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct
### Model Sources
- **Repository:** [Aayush672/Finetuned-llama3.2-Vision-Model](https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model)
- **Base Model:** [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
## Uses
### Direct Use
The model is designed for generating product descriptions in e-commerce scenarios:
- Product catalog automation
- SEO-optimized content generation
- Mobile-friendly product descriptions
- Marketplace listing optimization
### Example Usage
```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained(
    "Aayush672/Finetuned-llama3.2-Vision-Model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Aayush672/Finetuned-llama3.2-Vision-Model")

# Prepare your inputs: product image, name, and category
image = Image.open("product_image.jpg")
product_name = "Wireless Bluetooth Headphones"
category = "Electronics | Audio | Headphones"

prompt = f"""Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT NAME##: {product_name}
##CATEGORY##: {category}"""

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]

# Build the chat-formatted prompt, then tokenize it together with the image
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, skipping the echoed prompt
description = processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(description)
```
### Out-of-Scope Use
- General conversation or chat applications
- Complex reasoning tasks
- Non-commercial product descriptions
- Content outside e-commerce domain
## Training Details
### Training Data
The model was fine-tuned on the [philschmid/amazon-product-descriptions-vlm](https://huggingface.co/datasets/philschmid/amazon-product-descriptions-vlm) dataset, which contains Amazon product images with corresponding names, categories, and descriptions.
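The dataset can be inspected with the `datasets` library before training or evaluation (a minimal sketch; the exact column names follow the dataset card and should be verified there):

```python
from datasets import load_dataset

# Load the fine-tuning dataset from the Hugging Face Hub
dataset = load_dataset("philschmid/amazon-product-descriptions-vlm", split="train")

# Each row pairs a product image with its name, category, and target description
sample = dataset[0]
print(sample.keys())  # column names per the dataset card
```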
### Training Procedure
#### Fine-tuning Method
- **Technique:** LoRA (Low-Rank Adaptation) with PEFT
- **Target modules:** q_proj, v_proj
- **LoRA rank (r):** 8
- **LoRA alpha:** 16
- **LoRA dropout:** 0.05
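The LoRA settings above correspond to a PEFT configuration along these lines (a sketch reconstructed from the listed values, not the verbatim training script):

```python
from peft import LoraConfig

# LoRA adapter configuration matching the values listed above
lora_config = LoraConfig(
    r=8,                                  # LoRA rank
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    bias="none",
    task_type="CAUSAL_LM",
)
```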
#### Training Hyperparameters
- **Training regime:** bf16 mixed precision with 4-bit quantization (QLoRA)
- **Number of epochs:** 1
- **Batch size:** 8 per device
- **Gradient accumulation steps:** 4
- **Learning rate:** 2e-4
- **Optimizer:** AdamW (torch fused)
- **LR scheduler:** Constant
- **Warmup ratio:** 0.03
- **Max gradient norm:** 0.3
- **Quantization:** 4-bit with double quantization (nf4)
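Taken together, the quantization and optimizer settings above map onto a configuration roughly like the following (a hedged sketch of the QLoRA setup; `output_dir` is a placeholder, not the original path):

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization with double quantization, as described above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hyperparameters listed above
training_args = TrainingArguments(
    output_dir="llama32-vision-product-desc",  # placeholder
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    bf16=True,
    gradient_checkpointing=True,
)
```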
#### Hardware & Software
- **Quantization:** BitsAndBytesConfig with 4-bit precision
- **Gradient checkpointing:** Enabled
- **Memory optimization:** QLoRA technique
- **Framework:** Transformers, TRL, PEFT
## Bias, Risks, and Limitations
### Limitations
- Trained specifically on Amazon product data, so it may not generalize well to other e-commerce platforms
- Limited to English-language descriptions
- Optimized for a mobile/SEO format, which may not suit all description styles
- Performance depends on image quality and product visibility
### Recommendations
- Test thoroughly on your specific product categories before production use
- Consider additional fine-tuning for domain-specific products
- Implement content moderation for generated descriptions
- Validate SEO effectiveness for your target keywords
## Environmental Impact
Training utilized quantized models (4-bit) to reduce computational requirements and carbon footprint compared to full-precision training.
## Technical Specifications
### Model Architecture
- **Base Architecture:** Llama 3.2 Vision (11B parameters)
- **Vision Encoder:** Integrated multimodal architecture
- **Fine-tuning:** LoRA adapters (trainable parameters: ~16M)
- **Quantization:** 4-bit with double quantization
### Compute Infrastructure
- **Training:** Optimized with gradient checkpointing and mixed precision
- **Memory:** Reduced via 4-bit quantization and LoRA
- **Inference:** Supports both quantized and full precision modes
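For memory-constrained inference, the model can be loaded in 4-bit the same way it was trained (a sketch; for full precision, simply omit `quantization_config`):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit quantized inference; drop quantization_config for full bf16 precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "Aayush672/Finetuned-llama3.2-Vision-Model",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Aayush672/Finetuned-llama3.2-Vision-Model")
```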
## Citation
```bibtex
@misc{finetuned-llama32-vision-product,
  title        = {Fine-tuned Llama 3.2 Vision for Product Description Generation},
  author       = {Aayush672},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model}}
}
```
## Model Card Contact
For questions or issues, please open an issue in the model repository or contact the model author.