CADFusion
This repo is the official implementation of the paper Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models (ICML 2025) by Ruiyu Wang, Yu Yuan, Shizhao Sun, and Jiang Bian.
Paper | Video | Huggingface
CADFusion is a text-to-CAD generation framework that leverages visual feedback to enhance the performance of large language models (LLMs) in generating CAD models from textual descriptions. It consists of two main components: sequential learning and visual learning. The sequential learning component fine-tunes LLMs on a text-to-CAD dataset, while the visual learning component alternates between training a visual feedback model and fine-tuning the LLM with the generated visual feedback.
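The alternating schedule described above can be sketched as a simple loop. The function names below are illustrative placeholders, not this repo's actual API:

```python
# Illustrative sketch of CADFusion's training schedule; the function
# names are placeholders, not the actual API of this repo.

def sequential_learning(model, text_to_cad_data):
    """Fine-tune the LLM on paired (description, CAD sequence) data."""
    return model  # placeholder

def visual_feedback(model):
    """Render the model's own generations, score them visually, and
    update the policy (e.g. via DPO) with that feedback."""
    return model  # placeholder

def train(model, sl_data, num_rounds):
    model = sequential_learning(model, sl_data)   # initial SL stage
    for _ in range(num_rounds):                   # alternate VF and SL
        model = visual_feedback(model)
        model = sequential_learning(model, sl_data)
    return model
```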
Installation
- Create a conda environment and install the generic dependencies.
name=<your-env-name>
conda create -n $name python=3.9
conda activate $name
python -m pip install -e .
- Install the additional dependencies for training.
python -m pip install -e .["train"]
- Install the additional dependencies for evaluation and rendering.
python -m pip install -e .["render"]
conda install -c conda-forge pythonocc-core=7.7.0
python -m pip install git+https://github.com/otaheri/chamfer_distance@dc9987dcf70888d387d96893ba1fb9ba9a333992
python -m pip install -e .["eval"]
Data Preparation
CADFusion is trained by alternating between the Sequential Learning (SL) stage and the Visual Feedback (VF) stage. Below we describe how to prepare the training data for each stage.
Data for Sequential Learning
Approach 1: use human-annotated textual descriptions provided by us
We provide human-annotated textual descriptions and their corresponding SkexGen CAD model IDs under data/sl_data/sl_data.zip. It contains the following files after unzipping:
data/sl_data
├── train.json
├── val.json
└── test.json
To use our annotated data, download the SkexGen data, unzip it as the reference dataset, and run the conversion script to build the dataset. In detail, run the following commands:
# make sure you are in the root directory of this repo and have the 'data/sl_data/sl_data.zip' unzipped
gdown --id 1so_CCGLIhqGEDQxMoiR--A4CQk4MjuOp
unzip cad_data.zip
python3 data/sl_data/convert.py
The train.json, val.json and test.json files under data/sl_data are the resulting datasets.
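Once the splits are in place, they can be loaded like any other JSON dataset. A minimal sketch (the record schema inside each file is repo-specific; inspect the files to confirm the field names):

```python
import json

def load_split(path):
    """Load one of the train/val/test splits produced above, e.g.
    data/sl_data/train.json. The field names inside each record are
    repo-specific; check the actual files before relying on them."""
    with open(path) as f:
        return json.load(f)
```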
Approach 2: create human-annotated textual descriptions by yourself
We provide a script that executes all preprocessing steps up to (but not including) human annotation.
./scripts/preprocess_skexgen.sh
If you want to customize the internal steps, expand the following section for more details.
Start from scratch (click to expand).
- Download the SkexGen data by: Google Drive link.
gdown --id 1so_CCGLIhqGEDQxMoiR--A4CQk4MjuOp
unzip cad_data.zip
- Convert the SkexGen data into sequences. Note that train_deduplicate_s.pkl, val.pkl and test.pkl should be converted separately.
python3 src/data_preprocessing/convert.py --in_path <skexgen_path> --out_path <sequence_path>
- Render the sequences into images. Note that running the last step on Linux requires an X server (e.g. xvfb). See this discussion.
python3 src/rendering_utils/parser.py --in-path <sequence_path> --out-path <visual_object_folder>
timeout 180 python3 src/rendering_utils/parser_visual.py --data_folder <visual_object_folder>
python3 src/rendering_utils/img_renderer.py --input_dir <visual_object_folder> --output_dir <image_folder>
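The three rendering commands above can also be chained from Python. A sketch using subprocess that mirrors the 180-second timeout on the visual parsing step (the paths are placeholders to be substituted, exactly as in the shell version):

```python
import subprocess
import sys

def run_step(args, timeout=None):
    """Run one pipeline stage, raising on failure; `timeout` mirrors
    the `timeout 180` wrapper used in the shell version."""
    try:
        subprocess.run(args, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"step timed out: {' '.join(args)}", file=sys.stderr)

# Placeholder paths -- substitute your own folders before running.
seq, vis, img = "<sequence_path>", "<visual_object_folder>", "<image_folder>"
steps = [
    (["python3", "src/rendering_utils/parser.py", "--in-path", seq, "--out-path", vis], None),
    (["python3", "src/rendering_utils/parser_visual.py", "--data_folder", vis], 180),
    (["python3", "src/rendering_utils/img_renderer.py", "--input_dir", vis, "--output_dir", img], None),
]
# for args, t in steps:
#     run_step(args, timeout=t)
```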
- Annotate these data with LLM captioning.
# Generic:
python3 src/data_preprocessing/captioning.py --image-folder-path <image_folder> --out-path <sl_data_path>
- We use the OpenAI and Azure services for LLM calls. You are welcome to use your own LLMs and prompts by changing lines 21-22 of src/data_preprocessing/captioning.py to your own client definition and function calls.
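If you swap in your own LLM service, the only contract the captioning step really needs is a callable that maps a rendered image to a caption. A hedged sketch of that seam (this interface is our illustration, not the script's actual structure):

```python
def caption_image(image_path, client):
    """Send one rendered image to a captioning client and return the
    textual description. `client` is any object exposing a `describe`
    method -- this interface is illustrative, not the repo's API."""
    return client.describe(image_path)

class EchoClient:
    """Stand-in for an OpenAI/Azure (or any other) vision client."""
    def describe(self, image_path):
        return f"caption for {image_path}"
```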
Data for Visual Feedback
The Visual Feedback dataset should be automatically generated from the Visual Feedback pipeline described in the Training section.
We provide an example under data/vf_data/example_vf_data.json to illustrate the expected format.
You can retrieve this file by unzipping data/vf_data/example_vf_data.zip.
We do not recommend using this example data as training data, because each policy update should depend on the model's own generations.
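For orientation, a visual-feedback record for DPO-style training is essentially a preference pair over the model's own generations. The field names below are illustrative assumptions; consult data/vf_data/example_vf_data.json for the real schema:

```python
# Illustrative preference record for visual-feedback training.
# Field names are assumptions; check example_vf_data.json for the
# schema actually used by this repo.
def make_preference_record(prompt, chosen_seq, rejected_seq):
    return {
        "prompt": prompt,         # the textual CAD description
        "chosen": chosen_seq,     # generation preferred by the visual scorer
        "rejected": rejected_seq, # generation it was preferred over
    }

rec = make_preference_record("a plate with four holes", "seq_a", "seq_b")
```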
Training
Our training recipe consists of two parts. In the first part, we conduct initial sequential learning. In the second part, we alternate between sequential learning and visual feedback.
Initial Sequential Learning
We use the following script to train the model in the sequential learning stage.
./scripts/train_with_shuffling.sh <run_name>
You are also welcome to customize the training procedure; a standard multi-GPU training command is provided. Change num_processes in ds_config.yaml to specify how many GPUs will be used.
CUDA_VISIBLE_DEVICES=<gpu_ids> accelerate launch --config_file ds_config.yaml src/train/llama_finetune.py \
--num-epochs <num_epochs> --run-name <run_name> --data-path <train_data> --eval-data-path <eval_data> \
--device-map accelerate --model-name llama3 --expdir <model_saving_path>
In our work we shuffle the dataset every x epochs. To train the model with this implementation, inspect and modify scripts/train_with_shuffling.sh.
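The shuffle-every-x-epochs idea can be sketched as follows (a toy illustration, not the repo's training loop):

```python
import random

def epochs_with_shuffling(data, num_epochs, shuffle_every, seed=0):
    """Yield the dataset order for each epoch, reshuffling every
    `shuffle_every` epochs (a toy model of train_with_shuffling.sh)."""
    rng = random.Random(seed)
    order = list(data)
    for epoch in range(num_epochs):
        if epoch % shuffle_every == 0:
            rng.shuffle(order)
        yield list(order)
```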
Alternate Training between Sequential Learning and Visual Feedback
We provide a script for executing our alternate training round. See scripts/alternate_VF.sh.
./scripts/alternate_VF.sh # change the value of base_name in the script as instructed
We also provide a script for training on multiple GPUs to save time: scripts/alternate_VF_quadra_gpu.sh. In our setting we use 4 GPUs for training; you can adapt the script to use more GPUs if you have them available.
If you only want to conduct a single round of visual learning, run
python src/train/dpo.py --run-name <dpo_run_name> --pretrained-path <pretrained_model_path> --data-path <dpo_data_path> --output-path <model_saving_path>
By default it runs DPO for 3 epochs; you can change this by adding the flag --num-epochs <x>.
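For reference, the objective optimized in standard DPO is a logistic loss over the policy/reference log-probability margins of the chosen and rejected sequences. A scalar sketch (real implementations, including src/train/dpo.py, operate on batched tensors):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).
    Scalar toy version for illustration only."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```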
Model Checkpoints
We provide two versions. v1.0 uses 5 rounds of alternate training and is the checkpoint evaluated in our paper. v1.1 uses 9 rounds of alternate training and is considered to perform better than v1.0.
Download and unzip the checkpoints, then place them under the exp/model_ckpt folder before use.
Inference & Visualization
Use scripts/generate_samples.sh.
./scripts/generate_samples.sh <run_name> test --full
You can find the generated samples in exp/model_generation/<run_name>.jsonl and the rendered figures under the exp/figures/<run_name> folder. The point clouds and the .obj, .step and .stl files are saved under the exp/visual_objects/<run_name> directory for your own use and evaluation.
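The generations file is standard JSON Lines, so it can be inspected with a few lines of Python (the per-record fields are an assumption; print one record to see the actual keys):

```python
import json

def read_generations(path):
    """Read exp/model_generation/<run_name>.jsonl: one JSON object per
    line. The field names inside each record are repo-specific."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```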
Evaluation
Use the functions in src/test. This includes the Chamfer Distance (chamfer_dist.py), Minimum Matching Distance, Coverage, Jensen-Shannon Divergence (dist_eval.py), and the VLM score (VLM_score.py).
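As a reminder of what chamfer_dist.py measures: the symmetric Chamfer distance between two point clouds averages each point's squared distance to its nearest neighbour in the other cloud, summed over both directions. A brute-force sketch in pure Python (the repo uses an optimized CUDA implementation instead):

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets `a` and `b`
    (lists of coordinate tuples). Brute force O(|a|*|b|); for
    illustration only -- use the CUDA package for real evaluation."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)
```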
For the VLM score, we use the Azure OpenAI API to access the GPT-4o model for scoring the CAD objects, so you should log in to your own Azure account before using this module. If you are using another LLM/VLM service and find it difficult to adapt our setup, the prompt is provided in the Python module so you can integrate it into your own testing pipeline.
Acknowledgements
We would like to acknowledge that the CAD rendering and distributional metrics in this repository are partially based on and adapted from the SkexGen project.
Citation
If you find our work useful, please cite the following paper:
@inproceedings{wang2025texttocad,
  title     = {Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models},
  author    = {Wang, Ruiyu and Yuan, Yu and Sun, Shizhao and Bian, Jiang},
  booktitle = {International Conference on Machine Learning},
  year      = {2025}
}
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.