---
license: apache-2.0
datasets:
- ibrahimhamamci/CT-RATE
- dmolino/CT-RATE_Generated_Scans
language:
- en
- tr
- it
pipeline_tag: text-to-3d
tags:
- medical
---
# Text-to-CT Model Weights
Checkpoints for **“Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining”** (Molino et al., 2025).
---
## Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
### Model Description
- **Authors:** Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
- **Model type:** 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation.
- **License:** Apache 2.0 (same as code release).
- **Sources:** Code https://github.com/cosbidev/Text2CT | Paper https://arxiv.org/abs/2506.00633
- **Demo:** Use `diff_model_demo.py` from the code release for a one-off generation from text.
### Intended Use
- **Direct use:** Research/experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation.
- **Downstream use:** Fine-tuning or integration into broader research pipelines.
- **Out of scope:** Clinical decision-making, diagnostic use, or deployment without proper validation and approvals.
### Risks & Limitations
- Trained on CT-RATE; may encode dataset biases and is not validated for clinical use.
- Synthetic outputs may contain artifacts; do not use for patient care.
### Files included
- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — Diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.
### How to Get Started (Python)
```python
from huggingface_hub import snapshot_download
repo_id = "dmolino/text2ct-weights"
snapshot_download(
    repo_id=repo_id,
    repo_type="model",
    local_dir="your_local_path",
)

# Then point the code release configs at the downloaded files:
#   trained_autoencoder_path -> autoencoder_path
#   existing_ckpt_filepath / model_filename -> unet_path
#   clip_weights (for report embeddings) -> clip_path
```
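Once downloaded, the checkpoints can be deserialized with plain PyTorch. A minimal sketch, assuming `your_local_path` is the `local_dir` used above; the matching model classes (3D VAE, RFlow UNet, CLIP3D) come from the code release, and this only loads the raw state dicts:

```python
from pathlib import Path

# Directory passed to snapshot_download above (placeholder path)
CKPT_DIR = Path("your_local_path")

# Checkpoint filenames shipped in this repo (see "Files included")
CHECKPOINTS = {
    "vae": "autoencoder_epoch273.pt",
    "unet": "unet_rflow_200ep.pt",
    "clip": "CLIP3D_Finding_Impression_30ep.pt",
}

def load_state_dicts(ckpt_dir: Path = CKPT_DIR) -> dict:
    """Load whichever checkpoint files exist onto CPU.

    Instantiating the actual networks is done by the code release;
    this helper only deserializes the weight files.
    """
    import torch  # deferred so listing filenames works without PyTorch installed

    return {
        name: torch.load(ckpt_dir / fname, map_location="cpu")
        for name, fname in CHECKPOINTS.items()
        if (ckpt_dir / fname).exists()
    }
```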
### Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.
### Training Procedure (summary)
- CLIP3D trained for vision-language alignment on CT+reports.
- VAE checkpoint from https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings.
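As an illustration only (not the repository's training code), rectified flow linearly interpolates between a latent sample and Gaussian noise and trains the UNet to regress the constant velocity along that straight path. A toy NumPy sketch of how one training pair is formed:

```python
import numpy as np

def rflow_training_pair(x0: np.ndarray, rng: np.random.Generator):
    """Build one rectified-flow training example from a latent x0.

    The interpolant moves on a straight line from data (t=0) to noise (t=1),
    so the regression target is the constant velocity x1 - x0.
    """
    x1 = rng.standard_normal(x0.shape)   # Gaussian noise endpoint
    t = rng.uniform()                    # random time in [0, 1)
    xt = (1.0 - t) * x0 + t * x1         # straight-line interpolant
    v_target = x1 - x0                   # velocity the UNet should predict at (xt, t)
    return xt, t, v_target
```

In the actual pipeline the UNet receives `xt`, `t`, and the CLIP3D text embedding, and is optimized to match `v_target` in latent space.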
### Evaluation
- See paper for quantitative and qualitative results.
### Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.
### Environmental Impact
- Not reported; training used a multi-GPU setup.
### Citation
If you use these weights or code, please cite the paper:
```
@article{molino2025text,
title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
journal={arXiv preprint arXiv:2506.00633},
year={2025}
}
```