# NeurIPS11092: Text-to-CadQuery Repo

This repository contains all resources used to train and evaluate large language models for generating CadQuery code from natural language descriptions.

## Contents

- `data/`  
  Contains the prompt-completion pairs used to finetune six open-source LLMs.  
  The pairs are split into `data_train.jsonl`, `data_val.jsonl`, and `data_test.jsonl` in a 90/5/5 train/validation/test ratio.

- `CadQuery.zip`  
  Includes all **170,000 CadQuery programs** we generated from the [Text2CAD](https://github.com/SadilKhan/Text2CAD) dataset using Gemini 2.0 Flash.

- `text2cad_v1.1.csv`  
  Original source data provided by the Text2CAD authors, in minimal JSON format.
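
As a rough illustration of how the JSONL files in `data/` could be read, and of what a 90/5/5 split looks like, here is a minimal sketch using only the Python standard library. The field names `prompt` and `completion` in the usage note, the helper names `load_jsonl` and `split_90_5_5`, and the shuffle-then-cut strategy are assumptions made for illustration; this README does not state the exact schema or how the split was actually produced.

```python
import json
import random
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def split_90_5_5(records, seed=0):
    """Shuffle a copy of the records and cut it into 90/5/5 parts.

    Hypothetical helper: the repository's own split procedure is not
    documented here, so this only shows the ratio.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test split
    return train, val, test
```

For example, `load_jsonl("data/data_train.jsonl")` would return a list of dictionaries; if the records do use `prompt` and `completion` keys, `records[0]["prompt"]` would hold the natural language description and `records[0]["completion"]` the CadQuery code.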

## Finetuned Models

We trained the following models on this dataset:

- [CodeGPT-small](https://huggingface.co/ricemonster/codegpt-small-sft)  
- [GPT-2 Medium](https://huggingface.co/ricemonster/gpt2-medium-sft)  
- [GPT-2 Large](https://huggingface.co/ricemonster/gpt2-large-sft)  
- [Gemma-1B](https://huggingface.co/ricemonster/gemma-1B-SFT)  
- [Qwen2.5-3B](https://huggingface.co/ricemonster/qwen2.5-3B-SFT)
- [Mistral-7B (LoRA)](https://huggingface.co/ricemonster/Mistral-7B-lora)  

## Acknowledgements

We gratefully acknowledge the authors of [Text2CAD](https://github.com/SadilKhan/Text2CAD) and [DeepCAD](https://github.com/ChrisWu1997/DeepCAD) for their foundational datasets and inspiration.