---
library_name: transformers
license: mit
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
tags:
- vision-language
- product-descriptions
- e-commerce
- fine-tuned
- lora
- llama
datasets:
- philschmid/amazon-product-descriptions-vlm
language:
- en
pipeline_tag: image-text-to-text
---
|
|
|
|
|
# Fine-tuned Llama 3.2 Vision for Product Description Generation
|
|
|
|
|
A fine-tuned version of Meta's Llama-3.2-11B-Vision-Instruct model specialized for generating SEO-optimized product descriptions from product images, names, and categories. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model generates concise, SEO-optimized product descriptions for e-commerce applications. Given a product image, name, and category, it produces mobile-friendly descriptions suitable for online marketplaces and product catalogs. |
|
|
|
|
|
- **Developed by:** Aayush672 |
|
|
- **Model type:** Vision-Language Model (Multimodal) |
|
|
- **Language(s):** English |
|
|
- **License:** MIT |
|
|
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [Aayush672/Finetuned-llama3.2-Vision-Model](https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model) |
|
|
- **Base Model:** [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
The model is designed for generating product descriptions in e-commerce scenarios: |
|
|
- Product catalog automation |
|
|
- SEO-optimized content generation |
|
|
- Mobile-friendly product descriptions |
|
|
- Marketplace listing optimization |
|
|
|
|
|
### Example Usage |
|
|
|
|
|
```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained(
    "Aayush672/Finetuned-llama3.2-Vision-Model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Aayush672/Finetuned-llama3.2-Vision-Model")

# Prepare your inputs
image = Image.open("product_image.jpg")
product_name = "Wireless Bluetooth Headphones"
category = "Electronics | Audio | Headphones"

prompt = f"""Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT NAME##: {product_name}
##CATEGORY##: {category}"""

# The image placeholder comes first so the image token precedes the text
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, skipping the prompt
description = processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
```
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- General conversation or chat applications |
|
|
- Complex reasoning tasks |
|
|
- Non-commercial product descriptions |
|
|
- Content outside e-commerce domain |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was fine-tuned on the [philschmid/amazon-product-descriptions-vlm](https://huggingface.co/datasets/philschmid/amazon-product-descriptions-vlm) dataset, which contains Amazon product images with corresponding names, categories, and descriptions. |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Fine-tuning Method |
|
|
- **Technique:** LoRA (Low-Rank Adaptation) with PEFT |
|
|
- **Target modules:** q_proj, v_proj |
|
|
- **LoRA rank (r):** 8 |
|
|
- **LoRA alpha:** 16 |
|
|
- **LoRA dropout:** 0.05 |
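
The LoRA settings above map directly onto PEFT's `LoraConfig`. A minimal sketch, assuming the standard `peft` API; the `task_type` value is an assumption, not stated in this card:

```python
from peft import LoraConfig

# LoRA configuration matching the reported fine-tuning settings
peft_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling factor (alpha / r = 2)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    task_type="CAUSAL_LM",                # assumed task type
)
```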
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Training regime:** bf16 mixed precision with 4-bit quantization (QLoRA) |
|
|
- **Number of epochs:** 1 |
|
|
- **Batch size:** 8 per device |
|
|
- **Gradient accumulation steps:** 4 |
|
|
- **Learning rate:** 2e-4 |
|
|
- **Optimizer:** AdamW (torch fused) |
|
|
- **LR scheduler:** Constant |
|
|
- **Warmup ratio:** 0.03 |
|
|
- **Max gradient norm:** 0.3 |
|
|
- **Quantization:** 4-bit with double quantization (nf4) |
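
The hyperparameters above can be expressed as `transformers` configuration objects. A sketch, assuming standard `BitsAndBytesConfig` and `TrainingArguments` fields; `output_dir` is a hypothetical path:

```python
import torch
from transformers import BitsAndBytesConfig, TrainingArguments

# QLoRA: 4-bit NF4 quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Training hyperparameters as reported above
training_args = TrainingArguments(
    output_dir="llama32-vision-products",  # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,         # effective batch size: 8 * 4 = 32
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    bf16=True,
    gradient_checkpointing=True,
)
```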
|
|
|
|
|
#### Hardware & Software |
|
|
|
|
|
- **Quantization:** BitsAndBytesConfig with 4-bit precision |
|
|
- **Gradient checkpointing:** Enabled |
|
|
- **Memory optimization:** QLoRA technique |
|
|
- **Framework:** Transformers, TRL, PEFT |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
### Limitations |
|
|
|
|
|
- Trained specifically on Amazon product data; it may not generalize well to other e-commerce platforms


- Limited to English-language descriptions


- Optimized for a mobile/SEO description format; it may not suit all description styles


- Performance depends on image quality and how clearly the product is visible
|
|
|
|
|
### Recommendations |
|
|
|
|
|
- Test thoroughly on your specific product categories before production use |
|
|
- Consider additional fine-tuning for domain-specific products |
|
|
- Implement content moderation for generated descriptions |
|
|
- Validate SEO effectiveness for your target keywords |
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
Training utilized quantized models (4-bit) to reduce computational requirements and carbon footprint compared to full-precision training. |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Base Architecture:** Llama 3.2 Vision (11B parameters) |
|
|
- **Vision Encoder:** Integrated multimodal architecture |
|
|
- **Fine-tuning:** LoRA adapters (trainable parameters: ~16M) |
|
|
- **Quantization:** 4-bit with double quantization |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
- **Training:** Optimized with gradient checkpointing and mixed precision |
|
|
- **Memory:** Reduced via 4-bit quantization and LoRA |
|
|
- **Inference:** Supports both quantized and full precision modes |
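
For memory-constrained inference, the model can be loaded with the same 4-bit quantization used during training. A sketch, assuming a CUDA device and the `bitsandbytes` package are available:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

model_id = "Aayush672/Finetuned-llama3.2-Vision-Model"

# 4-bit NF4 quantization with double quantization, matching training
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)
```

For full-precision inference, drop `quantization_config` and pass `torch_dtype=torch.bfloat16` instead.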
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex |
|
|
@misc{finetuned-llama32-vision-product, |
|
|
title={Fine-tuned Llama 3.2 Vision for Product Description Generation}, |
|
|
author={Aayush672}, |
|
|
year={2025}, |
|
|
howpublished={\url{https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Contact |
|
|
|
|
|
For questions or issues, please open an issue in the model repository or contact the model author. |