---
base_model:
- meta-llama/Llama-2-13b-chat-hf
datasets:
- NingLab/MMECInstruct
license: cc-by-4.0
library_name: transformers
pipeline_tag: image-text-to-text
---

# CASLIE-L

This repository contains the CASLIE-L model from the paper "[Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data](https://huggingface.co/papers/2410.17337)".

**Project Page**: [https://ninglab.github.io/CASLIE/](https://ninglab.github.io/CASLIE/)

**Code Repository**: [https://github.com/ninglab/CASLIE](https://github.com/ninglab/CASLIE)

## Introduction

Multimodal Foundation Models (MFMs) that leverage multimodal data for e-commerce applications are attracting growing attention. This work introduces [MMECInstruct](https://huggingface.co/datasets/NingLab/MMECInstruct), the first large-scale, high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information in e-commerce tasks. Using MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, referred to as CASLIE models.

## CASLIE Models

The CASLIE-L model is instruction-tuned from the large base model [Llama-2-13b-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf).

## Sample Usage (Modality-unified Inference)

To run inference with the CASLIE models, follow the example below, taken directly from the [official GitHub repository](https://github.com/ninglab/CASLIE#modality-unified-inference). The command takes three arguments:

`$model_path` is the path of the instruction-tuned model.

`$task` specifies the task to be tested.

`$output_path` specifies the path where the inference output is saved.

Example:

```bash
python inference.py --model_path NingLab/CASLIE-M --task answerability_prediction --output_path ap.json
```
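
For scripted runs, the same command can be assembled from Python. The following is a minimal sketch and not part of the official repository: the helper names `build_inference_command` and `run_inference` are hypothetical, and the sketch assumes that `inference.py` sits in the current working directory (for instance, a checkout of the CASLIE repository) and that only the three flags shown above are needed.

```python
import shlex
import subprocess
import sys


def build_inference_command(model_path, task, output_path, script="inference.py"):
    """Assemble the argument list for the inference script.

    Hypothetical helper: it uses only the three flags from the example
    command (--model_path, --task, --output_path) and nothing else.
    """
    return [
        sys.executable, script,
        "--model_path", model_path,
        "--task", task,
        "--output_path", output_path,
    ]


def run_inference(model_path, task, output_path):
    """Run the script and raise CalledProcessError on a non-zero exit.

    Assumption: the current working directory contains inference.py.
    """
    subprocess.run(build_inference_command(model_path, task, output_path), check=True)


# Reproduce the example command and print it without executing it.
cmd = build_inference_command("NingLab/CASLIE-M", "answerability_prediction", "ap.json")
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a model path or output path contains spaces. Calling `run_inference` with the same three values would launch the actual inference run.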

## Citation

```bibtex
@article{ling2024captions,
    title={Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data},
    author={Ling, Xinyi and Peng, Bo and Du, Hanwen and Zhu, Zhihui and Ning, Xia},
    journal={arXiv preprint arXiv:2410.17337},
    year={2024}
}
```