---
base_model:
- meta-llama/Llama-3.2-3B-Instruct
datasets:
- NingLab/MMECInstruct
license: cc-by-4.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# CASLIE-S

This repo contains the models for "[Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data](https://huggingface.co/papers/2410.17337)".

- [Paper](https://huggingface.co/papers/2410.17337)
- [Project Page](https://ninglab.github.io/CASLIE/)
- [Code](https://github.com/ninglab/CASLIE)

## Introduction

We introduce [MMECInstruct](https://huggingface.co/datasets/NingLab/MMECInstruct), the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information. Leveraging MMECInstruct, we fine-tune a series of e-commerce Multimodal Foundation Models (MFMs) within CASLIE.

## CASLIE Models

The CASLIE-S model is instruction-tuned from the small base model [Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct).

## Sample Usage

The example below uses the Hugging Face `transformers` library to load the CASLIE-S model and processor and to run a basic image-text-to-text generation.

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image

# Load model and processor
model_path = "NingLab/CASLIE-S"
# `trust_remote_code=True` lets transformers load any custom model or
# processor code shipped with the checkpoint.
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

# Example: image and text input for a product description task.
# Replace "image.png" with the actual path to your image file.
try:
    image = Image.open("image.png").convert("RGB")
except FileNotFoundError:
    print("Warning: 'image.png' not found. Using a dummy image for demonstration. Please replace with a real image path.")
    # Fall back to a solid-color placeholder so the script still runs
    image = Image.new("RGB", (256, 256), color="red")

question = "Describe the product in detail."

# Prepare the conversation in chat template format.
# The "<image>" token is a placeholder that the processor replaces with image features.
messages = [{"role": "user", "content": f"{question} <image>"}]

# Apply the chat template and process the inputs (image and text).
# This assumes the loaded processor accepts an `images` argument; if it does not,
# use the inference scripts in the GitHub repository instead.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to(model.device)

# Generate a response from the model
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens so the prompt is not echoed back
prompt_len = inputs["input_ids"].shape[1]
response = processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)

print(f"Question: {question}")
print(f"Response: {response}")

# For more advanced usage, specific tasks, and detailed inference scripts,
# please refer to the project's official GitHub repository:
# https://github.com/ninglab/CASLIE
```
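
The paper title suggests that image content can be passed to the model as captions rather than raw pixels. The sketch below assumes, without confirmation from this card, that CASLIE-S can be prompted with text only once a caption is available. The helper names `build_caption_messages` and `generate_from_caption`, and the `Product:` / `Image caption:` labels, are hypothetical and are not the official prompt template; check the GitHub repository for the exact format used in training.

```python
from typing import Dict, List


def build_caption_messages(instruction: str, product_text: str, caption: str) -> List[Dict[str, str]]:
    """Build a single-turn chat message combining product text with an image caption.

    The field labels are illustrative assumptions, not the official template.
    """
    parts = [instruction.strip(), f"Product: {product_text.strip()}"]
    # Only mention the caption when one is actually available
    if caption.strip():
        parts.append(f"Image caption: {caption.strip()}")
    return [{"role": "user", "content": "\n".join(parts)}]


def generate_from_caption(messages: List[Dict[str, str]],
                          model_path: str = "NingLab/CASLIE-S",
                          max_new_tokens: int = 256) -> str:
    """Run text-only generation, assuming the checkpoint loads as a causal LM."""
    # Imported lazily so the prompt helper above works without these packages
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the model's answer is returned
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `build_caption_messages("Describe the product.", "Red mug", "a red ceramic mug on a table")` yields one user message whose content can be passed to `generate_from_caption`. Keeping prompt construction separate from generation makes it easy to swap in the official template later.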

## Citation

```bibtex
@article{ling2024captions,
    title={Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data},
    author={Ling, Xinyi and Peng, Bo and Du, Hanwen and Zhu, Zhihui and Ning, Xia},
    journal={arXiv preprint arXiv:2410.17337},
    year={2024}
}
```