---
license: mit
---
## Model Details
### Model Description
This model aims to compose user-provided graphic elements into a pleasing graphical design. It takes graphic elements (i.e., the images and texts) from users as input and generates the position, color and font information of each element as output.
- **Developed by:** Jiawei Lin, Shizhao Sun, Danqing Huang, Ting Liu, Ji Li and Jiang Bian
- **Model type:** Large language model
- **Language(s):** Python
- **License:** MIT
- **Finetuned from model:** Llama-3.1-8B
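The input/output relationship described above can be pictured with a toy example. This is an illustrative sketch only; all field names here (`type`, `content`, `left`, `color`, `font`, ...) are hypothetical and chosen for exposition, not the model's actual schema:

```python
# Illustrative only: hypothetical records showing what the model consumes
# (raw elements) and what it predicts (layout attributes per element).
input_elements = [
    {"type": "image", "content": "logo.png"},
    {"type": "text",  "content": "Summer Sale"},
]

# One predicted record per input element: position/size for every element,
# plus color and font information for text elements.
predicted_attributes = [
    {"left": 0.10, "top": 0.05, "width": 0.20, "height": 0.20},
    {"left": 0.35, "top": 0.40, "width": 0.55, "height": 0.15,
     "color": "#D7263D", "font": "Montserrat"},
]

assert len(predicted_attributes) == len(input_elements)
```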
### Model Sources
- **Repository:** https://github.com/microsoft/elem2design
- **Paper:** https://arxiv.org/abs/2412.19712
## Uses
### Direct Use
Compose user-provided graphic elements (i.e., images and texts) into a pleasing graphic design.
Elem2Design is being shared with the research community to facilitate reproduction of our results and foster further research in this area.
### Out-of-Scope Use
We do not recommend using Elem2Design in commercial or real-world applications without further testing and development; it is being released for research purposes.
Out-of-scope uses include any use that violates applicable laws or regulations.
## Risks and Limitations
Elem2Design inherits any biases, errors, or omissions produced by its base model. Developers are advised to choose an appropriate base LLM/MLLM carefully, depending on the intended use case. 
Elem2Design uses the Llama model. See https://huggingface.co/meta-llama/Llama-3.1-8B to understand the capabilities and limitations of this model.  
As the model is fine-tuned on very specific data about design composition, it is unlikely to generate information other than position, color and font. However, this remains possible, and is more likely when instructions unrelated to graphic design composition (e.g., "How has social media influenced our daily life?") are fed into the model.
Graphic designs generated by Elem2Design may not be technically accurate or meet user specifications in all cases. Users are responsible for assessing the acceptability of generated content for each intended use case.
Elem2Design was developed for research and experimental purposes. Further testing and validation are needed before considering its application in commercial or real-world scenarios.
### Recommendations
Please provide the model only with the images and texts that you want to appear in the graphic design.
Users are responsible for sourcing their content legally and ethically. This may include securing appropriate copyrights, ensuring consent for the use of images of people, and/or anonymizing data prior to use in research.
## How to Get Started with the Model
```bash
python llava/infer/infer.py \
--model_name_or_path /path/to/model/checkpoint-xxxx \
--data_path /path/to/data/test.json \
--image_folder /path/to/crello_images \
--output_dir /path/to/output_dir \
--start_layer_index 0 \
--end_layer_index 4
```
For more information, please visit our GitHub repo: https://github.com/microsoft/elem2design.
## Training Details
### Training Data
The training data is from an open-source dataset (https://huggingface.co/datasets/cyberagent/crello).
### Training Procedure
#### Preprocessing
Training samples with more than 25 design elements are filtered out to bound the sequence length and thereby improve training efficiency.
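This filtering step can be sketched as follows; the sample structure and the `elements` field name are assumptions for illustration, not the repo's actual data schema:

```python
# Minimal sketch of the described preprocessing filter: drop any design
# sample whose element count exceeds the threshold (25 in the paper's setup).
def filter_samples(samples, max_elements=25):
    """Keep only design samples with at most `max_elements` elements."""
    return [s for s in samples if len(s["elements"]) <= max_elements]

samples = [
    {"id": "short", "elements": list(range(10))},  # 10 elements: kept
    {"id": "long",  "elements": list(range(30))},  # 30 elements: dropped
]
kept = filter_samples(samples)
assert [s["id"] for s in kept] == ["short"]
```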
#### Training Hyperparameters
- Learning rate: 2e-4
- Global batch size: 128
- Number of training steps: 7000
- Rank and alpha of LoRA: 32 and 64
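The hyperparameters above can be collected in one place as a sanity check; the container class below is our own illustrative construct (the repo's training scripts may organize these values differently), but the alpha / rank scaling relation is standard LoRA:

```python
from dataclasses import dataclass

# Hypothetical container for the hyperparameters listed above; only the
# values themselves come from the model card.
@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 2e-4
    global_batch_size: int = 128
    training_steps: int = 7000
    lora_rank: int = 32   # LoRA "r"
    lora_alpha: int = 64

    @property
    def lora_scaling(self) -> float:
        # In standard LoRA, the low-rank update is scaled by alpha / rank
        # before being added to the frozen base weights.
        return self.lora_alpha / self.lora_rank

cfg = TrainingConfig()
assert cfg.lora_scaling == 2.0  # alpha=64, rank=32
```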
#### Speeds, Sizes, Times
- Llama-3.1-8B: 8B parameters
- CLIP ViT-Large-Patch14: 428M parameters
## Evaluation
### Testing Data
The testing data is from an open-source dataset (https://huggingface.co/datasets/cyberagent/crello).
### Metrics
- Overall metrics. We use a robust proxy model (https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf) for comprehensive evaluation from five aspects: (i) design and layout, (ii) content relevance, (iii) typography and color, (iv) graphics and images, and (v) innovation and originality. We use the same prompts as presented in COLE [[1](https://arxiv.org/abs/2311.16974)].
- Geometry-related metrics. These metrics focus purely on the geometric attributes of elements without considering their content, including element validity (Val), overlap (Ove), alignment (Ali) and underlay effectiveness (Undl, Unds) [[2](https://arxiv.org/abs/2303.15937)][[3](https://arxiv.org/pdf/2404.00995)].
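To make the geometry-related metrics concrete, here is an illustrative overlap-style check. This is a simplified stand-in, not the exact metric from the cited papers: it averages the pairwise intersection area between element bounding boxes, where lower is better.

```python
# Simplified stand-in for an overlap (Ove) style metric on bounding boxes.
def intersection_area(a, b):
    """Boxes are (left, top, right, bottom) in normalized canvas coordinates."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def overlap_score(boxes):
    """Mean pairwise intersection area; 0.0 means no two elements overlap."""
    pairs = [(i, j) for i in range(len(boxes)) for j in range(i + 1, len(boxes))]
    if not pairs:
        return 0.0
    return sum(intersection_area(boxes[i], boxes[j]) for i, j in pairs) / len(pairs)

# Two disjoint boxes do not overlap; two identical unit boxes overlap fully.
assert overlap_score([(0, 0, 0.4, 0.4), (0.6, 0.6, 1, 1)]) == 0.0
assert overlap_score([(0, 0, 1, 1), (0, 0, 1, 1)]) == 1.0
```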
### Results
We use prior work FlexDM [[1](https://arxiv.org/pdf/2303.18248)] and prompting GPT-4o [[2](https://platform.openai.com/docs/models#gpt-4o)] as baselines. In comparison, Elem2Design demonstrates superior performance across nearly all metrics. For example, on the overall metrics, Elem2Design achieves 8.08, 7.92, 8.00, 7.82 and 6.98 on the five evaluated aspects. Regarding geometry-related metrics, Elem2Design and FlexDM achieve Ove scores of 0.0865 and 0.3242 respectively, indicating that Elem2Design effectively addresses the overlap issue whereas FlexDM encounters difficulties in this area. See Table 1 in our paper (https://arxiv.org/pdf/2412.19712) for the complete evaluation.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
## Citation
```bibtex
@InProceedings{lin2024elements,
title={From Elements to Design: A Layered Approach for Automatic Graphic Design Composition},
author={Lin, Jiawei and Sun, Shizhao and Huang, Danqing and Liu, Ting and Li, Ji and Bian, Jiang},
booktitle={CVPR},
year={2025}
}
```
## Model Card Contact
We welcome feedback and collaboration from our audience. If you have suggestions, questions, or observe unexpected/offensive behavior in our technology, please contact Shizhao Sun at shizsu@microsoft.com.
If the team receives reports of undesired behavior or identifies issues independently, we will update this repository with appropriate mitigations.