---
library_name: transformers
license: mit
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
tags:
- vision-language
- product-descriptions
- e-commerce
- fine-tuned
- lora
- llama
datasets:
- philschmid/amazon-product-descriptions-vlm
language:
- en
pipeline_tag: image-text-to-text
---

# Finetuned Llama 3.2 Vision for Product Description Generation

A fine-tuned version of Meta's Llama-3.2-11B-Vision-Instruct model specialized for generating SEO-optimized product descriptions from product images, names, and categories.

## Model Details

### Model Description

This model generates concise, SEO-optimized product descriptions for e-commerce applications. Given a product image, name, and category, it produces mobile-friendly descriptions suitable for online marketplaces and product catalogs.

- **Developed by:** Aayush672
- **Model type:** Vision-Language Model (Multimodal)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

### Model Sources

- **Repository:** [Aayush672/Finetuned-llama3.2-Vision-Model](https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model)
- **Base Model:** [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)

## Uses

### Direct Use

The model is designed for generating product descriptions in e-commerce scenarios:
- Product catalog automation
- SEO-optimized content generation
- Mobile-friendly product descriptions
- Marketplace listing optimization

### Example Usage

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained(
    "Aayush672/Finetuned-llama3.2-Vision-Model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Aayush672/Finetuned-llama3.2-Vision-Model")

# Prepare your inputs
image = Image.open("product_image.jpg")
product_name = "Wireless Bluetooth Headphones"
category = "Electronics | Audio | Headphones"

prompt = f"""Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT NAME##: {product_name}
##CATEGORY##: {category}"""

# The image placeholder goes into the chat template; the actual image
# is passed to the processor below.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the prompt
description = processor.tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
```

### Out-of-Scope Use

- General conversation or chat applications
- Complex reasoning tasks
- Non-commercial product descriptions
- Content outside e-commerce domain

## Training Details

### Training Data

The model was fine-tuned on the [philschmid/amazon-product-descriptions-vlm](https://huggingface.co/datasets/philschmid/amazon-product-descriptions-vlm) dataset, which contains Amazon product images with corresponding names, categories, and descriptions.

### Training Procedure

#### Fine-tuning Method
- **Technique:** LoRA (Low-Rank Adaptation) with PEFT
- **Target modules:** q_proj, v_proj
- **LoRA rank (r):** 8
- **LoRA alpha:** 16
- **LoRA dropout:** 0.05
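
The adapter settings above can be sketched as a PEFT configuration. This is a reconstruction from the listed values, not the author's exact training script; the `bias` and `task_type` fields are assumptions:

```python
from peft import LoraConfig

# Reconstruction of the adapter settings listed above.
lora_config = LoraConfig(
    r=8,                                   # LoRA rank
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention query/value projections
    bias="none",                           # assumption: biases left frozen
    task_type="CAUSAL_LM",
)
```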

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision with 4-bit quantization (QLoRA)
- **Number of epochs:** 1
- **Batch size:** 8 per device
- **Gradient accumulation steps:** 4
- **Learning rate:** 2e-4
- **Optimizer:** AdamW (torch fused)
- **LR scheduler:** Constant
- **Warmup ratio:** 0.03
- **Max gradient norm:** 0.3
- **Quantization:** 4-bit with double quantization (nf4)

#### Hardware & Software

- **Quantization:** BitsAndBytesConfig with 4-bit precision
- **Gradient checkpointing:** Enabled
- **Memory optimization:** QLoRA technique
- **Framework:** Transformers, TRL, PEFT
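
Put together, the quantization and trainer setup implied by the hyperparameters above might look like the following sketch. Field names follow `BitsAndBytesConfig` and TRL's `SFTConfig`; the output path is hypothetical and any value not listed above is an assumption:

```python
import torch
from transformers import BitsAndBytesConfig
from trl import SFTConfig

# 4-bit NF4 quantization with double quantization (QLoRA), as listed above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Trainer arguments matching the listed hyperparameters
training_args = SFTConfig(
    output_dir="llama32-vision-product",   # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    bf16=True,
    gradient_checkpointing=True,
)
```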

## Bias, Risks, and Limitations

### Limitations

- Trained specifically on Amazon product data; it may not generalize well to other e-commerce platforms
- Limited to English-language descriptions
- Optimized for a mobile/SEO format, so it may not suit all description styles
- Performance depends on image quality and product visibility

### Recommendations

- Test thoroughly on your specific product categories before production use
- Consider additional fine-tuning for domain-specific products
- Implement content moderation for generated descriptions
- Validate SEO effectiveness for your target keywords

## Environmental Impact

Training utilized quantized models (4-bit) to reduce computational requirements and carbon footprint compared to full-precision training.

## Technical Specifications

### Model Architecture

- **Base Architecture:** Llama 3.2 Vision (11B parameters)
- **Vision Encoder:** Integrated multimodal architecture
- **Fine-tuning:** LoRA adapters (trainable parameters: ~16M)
- **Quantization:** 4-bit with double quantization

### Compute Infrastructure

- **Training:** Optimized with gradient checkpointing and mixed precision
- **Memory:** Reduced via 4-bit quantization and LoRA
- **Inference:** Supports both quantized and full precision modes
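
For memory-constrained inference, the model can plausibly be loaded in 4-bit the same way it was trained. A sketch (full-precision loading simply drops `quantization_config`):

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 4-bit loading for inference on limited VRAM
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForVision2Seq.from_pretrained(
    "Aayush672/Finetuned-llama3.2-Vision-Model",
    quantization_config=bnb_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Aayush672/Finetuned-llama3.2-Vision-Model")
```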

## Citation

```bibtex
@misc{finetuned-llama32-vision-product,
  title={Fine-tuned Llama 3.2 Vision for Product Description Generation},
  author={Aayush672},
  year={2025},
  howpublished={\url{https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model}}
}
```

## Model Card Contact

For questions or issues, please open an issue in the model repository or contact the model author.