---
license: apache-2.0
datasets:
- ibrahimhamamci/CT-RATE
- dmolino/CT-RATE_Generated_Scans
language:
- en
- tr
- it
pipeline_tag: text-to-3d
tags:
- medical
---
# Text-to-CT Model Weights
Checkpoints for **“Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining”** (Molino et al., 2025).
---
## Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
### Model Description
- **Authors:** Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
- **Model type:** 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation.
- **License:** Apache 2.0 (same as the code release).
- **Sources:** Code https://github.com/cosbidev/Text2CT | Paper https://arxiv.org/abs/2506.00633
- **Demo:** Use `diff_model_demo.py` from the code release for a one-off generation from text.
### Intended Use
- **Direct use:** Research/experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation.
- **Downstream use:** Fine-tuning or integration into broader research pipelines.
- **Out of scope:** Clinical decision-making, diagnostic use, or deployment without proper validation and approvals.
### Risks & Limitations
- Trained on CT-RATE; may encode dataset biases and is not validated for clinical use.
- Synthetic outputs may contain artifacts; do not use for patient care.
### Files included
- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — Diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.
### How to Get Started (Python)
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="dmolino/text2ct-weights",
    repo_type="model",
    local_dir="your_local_path",
)

# Use these files in the code release configs:
#   trained_autoencoder_path -> autoencoder_path
#   existing_ckpt_filepath / model_filename -> unet_path
#   clip_weights (for report embeddings) -> clip_path
```
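The downloaded checkpoint paths then need to be wired into the code release configs. A minimal sketch of that mapping (the dictionary layout and the `your_local_path` directory are assumptions for illustration; the file names are the ones shipped in this repo):

```python
from pathlib import Path

# Directory passed as local_dir to snapshot_download (assumed path).
weights_dir = Path("your_local_path")

# Map the downloaded files onto the config keys named above.
config = {
    "autoencoder_path": weights_dir / "autoencoder_epoch273.pt",
    "unet_path": weights_dir / "unet_rflow_200ep.pt",
    "clip_path": weights_dir / "CLIP3D_Finding_Impression_30ep.pt",
}

for key, path in config.items():
    print(f"{key}: {path.name}")
```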
### Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.
### Training Procedure (summary)
- CLIP3D trained for vision-language alignment on CT+reports.
- VAE checkpoint from https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings.
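For intuition, rectified flow trains the network to regress a constant velocity along the straight line between a latent sample and noise. A toy NumPy sketch of the interpolation and regression target (a stand-in for illustration only; the actual training operates on 3D VAE latents inside the code release):

```python
import numpy as np

def rectified_flow_target(x0, x1, t):
    """Rectified-flow interpolation: x_t = (1 - t) * x0 + t * x1.
    The network is trained to regress the constant velocity v = x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 8))  # toy "latent" batch
x1 = rng.standard_normal((2, 8))  # Gaussian noise sample
xt, v = rectified_flow_target(x0, x1, t=0.5)
```

At training time a `t` is sampled per example and the UNet's predicted velocity is compared to `v_target` with an MSE loss.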
### Evaluation
- See paper for quantitative and qualitative results.
### Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.
### Environmental Impact
- Not reported. Training used a multi-GPU setup.
### Citation
If you use these weights or code, please cite the paper:
```bibtex
@article{molino2025text,
title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
journal={arXiv preprint arXiv:2506.00633},
year={2025}
}
```