---
base_model:
- meta-llama/Llama-3.2-3B-Instruct
datasets:
- NingLab/MMECInstruct
license: cc-by-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# CASLIE-S

This repo contains the models for "[Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data](https://huggingface.co/papers/2410.17337)".

- 📚 [Paper](https://huggingface.co/papers/2410.17337)
- 🌐 [Project Page](https://ninglab.github.io/CASLIE/)
- 💻 [Code](https://github.com/ninglab/CASLIE)

## Introduction
We introduce [MMECInstruct](https://huggingface.co/datasets/NingLab/MMECInstruct), the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information. Leveraging MMECInstruct, we fine-tune a series of e-commerce Multimodal Foundation Models (MFMs) within CASLIE.

## CASLIE Models
The CASLIE-S model is instruction-tuned from the small base model [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).

## Sample Usage

The example below runs multimodal inference with CASLIE-S using the Hugging Face `transformers` library. It loads the model and processor, then performs a basic image-text-to-text generation.

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

# Load model and processor
model_path = "NingLab/CASLIE-S"
# The `trust_remote_code=True` is necessary to load custom model and processor definitions.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True)

# Example: Image and text input for a product description task
# Replace "image.png" with the actual path to your image file
try:
    image = Image.open("image.png").convert("RGB")
except FileNotFoundError:
    print("Warning: 'image.png' not found. Using a dummy image for demonstration. Please replace with a real image path.")
    # Create a dummy image for demonstration if actual image is not found
    image = Image.new("RGB", (256, 256), color="red")

question = "Describe the product in detail."

# Prepare the conversation in a chat template format
# The "<image>" token is a placeholder which the processor handles to embed image features.
messages = [{"role": "user", "content": f"{question} <image>"}]

# Apply the chat template and process inputs (image and text)
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)

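# Generate a response from the model
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# output_ids[0] also contains the prompt tokens, so decode only the newly generated part
prompt_len = inputs["input_ids"].shape[1]
response = processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)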

print(f"Question: {question}")
print(f"Response: {response}")

# For more advanced usage, specific tasks, and detailed inference scripts,
# please refer to the project's official GitHub repository:
# https://github.com/ninglab/CASLIE
```
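
The message construction in the snippet above can be factored into a small helper when several questions are asked in a row. The following is a minimal sketch, not part of the CASLIE codebase: `build_messages` and `IMAGE_TOKEN` are hypothetical names, the `<image>` placeholder and its position after the question are copied from the example above, and support for more than one placeholder per message is an assumption that should be checked against the official repository.

```python
from typing import Dict, List

# Placeholder string copied from the usage example above.
IMAGE_TOKEN = "<image>"


def build_messages(question: str, num_images: int = 1) -> List[Dict[str, str]]:
    """Build a single-turn chat message list with image placeholders.

    Hypothetical helper: it mirrors the layout used in the example above,
    where the question text is followed by the image placeholder.
    """
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Assumption: multiple images are expressed as repeated placeholders.
    placeholders = " ".join([IMAGE_TOKEN] * num_images)
    content = f"{question} {placeholders}".strip()
    return [{"role": "user", "content": content}]


messages = build_messages("Describe the product in detail.")
print(messages)
```

The returned list can be passed to `processor.apply_chat_template` in place of the hand-written `messages` literal.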

## Citation
```bibtex
@article{ling2024captions,
    title={Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data},
    author={Ling, Xinyi and Peng, Bo and Du, Hanwen and Zhu, Zhihui and Ning, Xia},
    journal={arXiv preprint arXiv:2410.17337},
    year={2024}
}
```