---
license: apache-2.0
datasets:
- ibrahimhamamci/CT-RATE
- dmolino/CT-RATE_Generated_Scans
language:
- en
- tr
- it
pipeline_tag: text-to-3d
tags:
- medical
---
# Text-to-CT Model Weights

Checkpoints for **“Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining”** (Molino et al., 2025).

---

## Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

### Model Description
- **Authors:** Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi  
- **Model type:** 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation.  
- **License:** Apache 2.0 (same as the code release).  
- **Sources:** Code https://github.com/cosbidev/Text2CT | Paper https://arxiv.org/abs/2506.00633  
- **Demo:** Use `diff_model_demo.py` from the code release for a one-off generation from text.
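At a high level, generation encodes the report with CLIP3D, integrates the rectified-flow ODE in the VAE latent space, and decodes the result. The sampling loop can be sketched in NumPy as below; `v_model(x, t, text_emb)` stands in for the text-conditioned UNet, and the function names and Euler integrator are illustrative assumptions, not the release API:

```python
import numpy as np

def sample_latent(v_model, text_emb, shape, steps=50, rng=None):
    """Euler integration of the rectified-flow ODE from noise (t=1) to data (t=0).

    `v_model(x, t, text_emb)` is a stand-in for the conditioned UNet;
    this is an illustrative sketch of the pipeline, not the release code.
    """
    rng = rng or np.random.default_rng()
    x = rng.standard_normal(shape)            # start from Gaussian noise at t = 1
    dt = 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * v_model(x, t, text_emb)  # step toward the data end (t = 0)
    return x
```

In the release this role is played by `diff_model_demo.py`; the sketch only shows the direction of integration and how the text embedding conditions each velocity evaluation.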

### Intended Use
- **Direct use:** Research/experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation.  
- **Downstream use:** Fine-tuning or integration into broader research pipelines.  
- **Out of scope:** Clinical decision-making, diagnostic use, or deployment without proper validation and approvals.

### Risks & Limitations
- Trained on CT-RATE; may encode dataset biases and is not validated for clinical use.
- Synthetic outputs may contain artifacts; do not use for patient care.

### Files included
- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — Diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.
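After download, the three files can be resolved against the config keys mentioned in this card with a small helper; the filenames come from this repository, while the dictionary and helper themselves are an illustrative sketch, not part of the release:

```python
from pathlib import Path

# Checkpoint filenames shipped in this repository, keyed by the
# corresponding config entries from the code release. The helper
# itself is a convenience sketch, not part of the release.
CHECKPOINTS = {
    "trained_autoencoder_path": "autoencoder_epoch273.pt",
    "model_filename": "unet_rflow_200ep.pt",
    "clip_weights": "CLIP3D_Finding_Impression_30ep.pt",
}

def checkpoint_path(local_dir: str, key: str) -> Path:
    """Resolve a config key to the full path of its checkpoint file."""
    return Path(local_dir) / CHECKPOINTS[key]
```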

### How to Get Started (Python)
```python
from huggingface_hub import snapshot_download

repo_id = "dmolino/text2ct-weights"

snapshot_download(
    repo_id=repo_id,
    repo_type="model",
    local_dir="your_local_path",
)

# Use these in the code release configs:
# trained_autoencoder_path -> autoencoder_path
# existing_ckpt_filepath / model_filename -> unet_path
# clip_weights (for report embeddings) -> clip_path
```

### Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.

### Training Procedure (summary)
- CLIP3D trained for vision-language alignment on CT+reports.
- VAE checkpoint from https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings.
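The rectified-flow objective pairs each clean latent with a noise sample along a straight path and regresses the constant velocity between them. A minimal NumPy sketch of that loss, with toy latent shapes (nothing here is the release implementation, and the text conditioning that enters through the UNet is omitted to keep the loss visible):

```python
import numpy as np

def rflow_pair(x0, x1, t):
    """Interpolate x_t = (1 - t) * x0 + t * x1 and return the velocity target."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over latent dims
    x_t = (1.0 - t) * x0 + t * x1              # point on the straight path
    v_target = x1 - x0                         # constant velocity along the path
    return x_t, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2, 2, 2))  # clean latents (from the VAE; random here)
x1 = rng.standard_normal((4, 2, 2, 2))  # Gaussian noise endpoints
t = rng.uniform(size=4)                 # one time per sample

x_t, v_target = rflow_pair(x0, x1, t)
v_hat = np.zeros_like(v_target)          # stand-in for the UNet prediction
loss = np.mean((v_hat - v_target) ** 2)  # MSE regression on the velocity
```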

### Evaluation
- See paper for quantitative and qualitative results.

### Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.

### Environmental Impact
- Not reported; training used a multi-GPU setup.

### Citation
If you use these weights or code, please cite the paper:
```
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```