CADFusion
This repo is the official implementation of the paper Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models (ICML 2025) by Ruiyu Wang, Yu Yuan, Shizhao Sun, and Jiang Bian.
Paper | Video | Huggingface
CADFusion is a text-to-CAD generation framework that leverages visual feedback to enhance the performance of large language models (LLMs) in generating CAD models from textual descriptions. It consists of two main components: sequential learning and visual learning. The sequential learning component fine-tunes LLMs on a text-to-CAD dataset, while the visual learning component alternates between training a visual feedback model and fine-tuning the LLM with the generated visual feedback.
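The alternating schedule described above can be sketched as a simple loop. The function names below are illustrative placeholders, not this repo's actual API:

```python
# Illustrative sketch of CADFusion's training schedule; the function
# names are placeholders, not the actual API of this repo.

def sequential_learning(model, text_to_cad_data):
    """Fine-tune the LLM on paired (description, CAD sequence) data."""
    return model  # placeholder

def visual_feedback(model):
    """Render the model's own generations, score them visually, and
    update the policy (e.g. via DPO) with that feedback."""
    return model  # placeholder

def train(model, sl_data, num_rounds):
    model = sequential_learning(model, sl_data)   # initial SL stage
    for _ in range(num_rounds):                   # alternate VF and SL
        model = visual_feedback(model)
        model = sequential_learning(model, sl_data)
    return model
```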
Installation
- Create a conda environment and install the generic dependencies.
name=<your-env-name>
conda create -n $name python=3.9
conda activate $name
python -m pip install -e .
- Install the additional dependencies for training.
python -m pip install -e .["train"]
- Install the additional dependencies for evaluation and rendering.
python -m pip install -e .["render"]
conda install -c conda-forge pythonocc-core=7.7.0
python -m pip install git+https://github.com/otaheri/chamfer_distance@dc9987dcf70888d387d96893ba1fb9ba9a333992
python -m pip install -e .["eval"]
Data Preparation
CADFusion is trained by alternating between the Sequential Learning (SL) stage and the Visual Feedback (VF) stage. Below we describe how to prepare the training data for each stage.
Data for Sequential Learning
Approach 1: use human-annotated textual descriptions provided by us
We provide human-annotated textual descriptions and their corresponding SkexGen CAD model IDs under data/sl_data/sl_data.zip. It contains the following files after unzipping:
data/sl_data
├── train.json
├── val.json
└── test.json
To use our annotated data, download the SkexGen data, unzip it as the reference dataset, and run the conversion script to build the dataset. In detail, run the following commands:
# make sure you are in the root directory of this repo and have the 'data/sl_data/sl_data.zip' unzipped
gdown --id 1so_CCGLIhqGEDQxMoiR--A4CQk4MjuOp
unzip cad_data.zip
python3 data/sl_data/convert.py
The train.json, val.json and test.json files under data/sl_data are the resulting datasets.
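Once the splits are in place, they can be loaded like any other JSON dataset. A minimal sketch (the record schema inside each file is repo-specific; inspect the files to confirm the field names):

```python
import json

def load_split(path):
    """Load one of the train/val/test splits produced above, e.g.
    data/sl_data/train.json. The field names inside each record are
    repo-specific; check the actual files before relying on them."""
    with open(path) as f:
        return json.load(f)
```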
Approach 2: create human-annotated textual descriptions by yourself
We provide a script that executes all preprocessing steps up to (but not including) human annotation.
./scripts/preprocess_skexgen.sh
If you want to customize the internal steps, expand the following section for more details.
Start from scratch (click to expand).
- Download the SkexGen data by: Google Drive link.
gdown --id 1so_CCGLIhqGEDQxMoiR--A4CQk4MjuOp
unzip cad_data.zip
- Convert the SkexGen data into sequences. Note that train_deduplicate_s.pkl, val.pkl and test.pkl should be converted separately.
python3 src/data_preprocessing/convert.py --in_path <skexgen_path> --out_path <sequence_path>
- Render the sequences into images. Note that running the last step on Linux requires an X server (e.g. xvfb). See this discussion.
python3 src/rendering_utils/parser.py --in-path <sequence_path> --out-path <visual_object_folder>
timeout 180 python3 src/rendering_utils/parser_visual.py --data_folder <visual_object_folder>
python3 src/rendering_utils/img_renderer.py --input_dir <visual_object_folder> --output_dir <image_folder>
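The three rendering commands above can also be chained from Python. A sketch using subprocess that mirrors the 180-second timeout on the visual parsing step (the paths are placeholders to be substituted, exactly as in the shell version):

```python
import subprocess
import sys

def run_step(args, timeout=None):
    """Run one pipeline stage, raising on failure; `timeout` mirrors
    the `timeout 180` wrapper used in the shell version."""
    try:
        subprocess.run(args, check=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        print(f"step timed out: {' '.join(args)}", file=sys.stderr)

# Placeholder paths -- substitute your own folders before running.
seq, vis, img = "<sequence_path>", "<visual_object_folder>", "<image_folder>"
steps = [
    (["python3", "src/rendering_utils/parser.py", "--in-path", seq, "--out-path", vis], None),
    (["python3", "src/rendering_utils/parser_visual.py", "--data_folder", vis], 180),
    (["python3", "src/rendering_utils/img_renderer.py", "--input_dir", vis, "--output_dir", img], None),
]
# for args, t in steps:
#     run_step(args, timeout=t)
```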
- Annotate these data with LLM captioning.
# Generic:
python3 src/data_preprocessing/captioning.py --image-folder-path <image_folder> --out-path <sl_data_path>
- We use the OpenAI and Azure services for LLM calls. You are welcome to use your own LLMs and prompts by changing lines 21-22 of src/data_preprocessing/captioning.py to your own client definition and function calls.
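If you swap in your own LLM service, the only contract the captioning step really needs is a callable that maps a rendered image to a caption. A hedged sketch of that seam (this interface is our illustration, not the script's actual structure):

```python
def caption_image(image_path, client):
    """Send one rendered image to a captioning client and return the
    textual description. `client` is any object exposing a `describe`
    method -- this interface is illustrative, not the repo's API."""
    return client.describe(image_path)

class EchoClient:
    """Stand-in for an OpenAI/Azure (or any other) vision client."""
    def describe(self, image_path):
        return f"caption for {image_path}"
```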
Data for Visual Feedback
The Visual Feedback dataset should be automatically generated from the Visual Feedback pipeline described in the Training section.
We provide an example under data/vf_data/example_vf_data.json to illustrate the expected format.
You can retrieve this file by unzipping data/vf_data/example_vf_data.zip.
We do not recommend using this example data as training data, because each policy update should depend on the model's own generations.
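For orientation, a visual-feedback record for DPO-style training is essentially a preference pair over the model's own generations. The field names below are illustrative assumptions; consult data/vf_data/example_vf_data.json for the real schema:

```python
# Illustrative preference record for visual-feedback training.
# Field names are assumptions; check example_vf_data.json for the
# schema actually used by this repo.
def make_preference_record(prompt, chosen_seq, rejected_seq):
    return {
        "prompt": prompt,         # the textual CAD description
        "chosen": chosen_seq,     # generation preferred by the visual scorer
        "rejected": rejected_seq, # generation it was preferred over
    }

rec = make_preference_record("a plate with four holes", "seq_a", "seq_b")
```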
Training
Our training recipe consists of two parts. In the first part, we conduct initial sequential learning. In the second part, we alternate between sequential learning and visual feedback.
Initial Sequential Learning
We use the following script to train the model in the sequential learning stage.
./scripts/train_with_shuffling.sh <run_name>
You are also welcome to customize the training procedure; a standard multi-GPU training command is provided. Change num_processes in ds_config.yaml to specify how many GPUs will be used.
CUDA_VISIBLE_DEVICES=<gpu_ids> accelerate launch --config_file ds_config.yaml src/train/llama_finetune.py \
--num-epochs <num_epochs> --run-name <run_name> --data-path <train_data> --eval-data-path <eval_data> \
--device-map accelerate --model-name llama3 --expdir <model_saving_path>
In our work we shuffle the dataset every x epochs. To train the model with this implementation, inspect and modify scripts/train_with_shuffling.sh.
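The shuffle-every-x-epochs idea can be sketched as follows (a toy illustration, not the repo's training loop):

```python
import random

def epochs_with_shuffling(data, num_epochs, shuffle_every, seed=0):
    """Yield the dataset order for each epoch, reshuffling every
    `shuffle_every` epochs (a toy model of train_with_shuffling.sh)."""
    rng = random.Random(seed)
    order = list(data)
    for epoch in range(num_epochs):
        if epoch % shuffle_every == 0:
            rng.shuffle(order)
        yield list(order)
```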
Alternate Training between Sequential Learning and Visual Feedback
We provide a script for executing our alternate training round. See scripts/alternate_VF.sh.
./scripts/alternate_VF.sh # change the value of base_name in the script as instructed
We also provide a script for training on multiple GPUs to save time: scripts/alternate_VF_quadra_gpu.sh. In our setting we use 4 GPUs for training; you can adapt the script to use more GPUs if you have them available.
If you only want to conduct a single round of visual learning, run
python src/train/dpo.py --run-name <dpo_run_name> --pretrained-path <pretrained_model_path> --data-path <dpo_data_path> --output-path <model_saving_path>
By default it runs DPO for 3 epochs; you can change this by adding the flag --num-epochs <x>.
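For reference, the objective optimized in standard DPO is a logistic loss over the policy/reference log-probability margins of the chosen and rejected sequences. A scalar sketch (real implementations, including src/train/dpo.py, operate on batched tensors):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair:
    -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).
    Scalar toy version for illustration only."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```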
Model Checkpoints
We provide two versions. v1.0 uses 5 rounds of alternate training and is the checkpoint evaluated in our paper. v1.1 uses 9 rounds of alternate training and is considered to perform better than v1.0.
Download and unzip the checkpoints, then place them under the exp/model_ckpt folder before use.
Inference & Visualization
Use scripts/generate_samples.sh.
./scripts/generate_samples.sh <run_name> test --full
You can find the generated samples in exp/model_generation/<run_name>.jsonl and the rendered figures under the exp/figures/<run_name> folder. The point clouds and the .obj, .step and .stl files are saved under the exp/visual_objects/<run_name> directory for your own use and evaluation.
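The generations file is standard JSON Lines, so it can be inspected with a few lines of Python (the per-record fields are an assumption; print one record to see the actual keys):

```python
import json

def read_generations(path):
    """Read exp/model_generation/<run_name>.jsonl: one JSON object per
    line. The field names inside each record are repo-specific."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]
```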
Evaluation
Use the functions in src/test. This includes the Chamfer Distance (chamfer_dist.py), Minimum Matching Distance, Coverage, Jensen-Shannon Divergence (dist_eval.py), and the VLM score (VLM_score.py).
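As a reminder of what chamfer_dist.py measures: the symmetric Chamfer distance between two point clouds averages each point's squared distance to its nearest neighbour in the other cloud, summed over both directions. A brute-force sketch in pure Python (the repo uses an optimized CUDA implementation instead):

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets `a` and `b`
    (lists of coordinate tuples). Brute force O(|a|*|b|); for
    illustration only -- use the CUDA package for real evaluation."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    def one_way(src, dst):
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)
```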
For the VLM score, we use the Azure OpenAI API to access the GPT-4o model for scoring the CAD objects, so you should log in to your own Azure account before using this module. If you are using another LLM/VLM service and find it difficult to adapt our setup, the prompt is provided in the Python module so you can integrate it into your own testing pipeline.
Acknowledgements
We would like to acknowledge that the CAD rendering and distributional metrics in this repository are partially based on and adapted from the SkexGen project.
Citation
If you find our work useful, please cite the following paper:
@inproceedings{wang2025texttocad,
  title     = {Text-to-CAD Generation Through Infusing Visual Feedback in Large Language Models},
  author    = {Wang, Ruiyu and Yuan, Yu and Sun, Shizhao and Bian, Jiang},
  booktitle = {International Conference on Machine Learning},
  year      = {2025}
}
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.