---
license: apache-2.0
datasets:
- ibrahimhamamci/CT-RATE
- dmolino/CT-RATE_Generated_Scans
language:
- en
- tr
- it
pipeline_tag: text-to-3d
tags:
- medical
---
|
|
# Text-to-CT Model Weights |
|
|
|
|
|
Checkpoints for **“Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining”** (Molino et al., 2025). |
|
|
--- |
|
|
|
|
|
## Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining |
|
|
|
|
|
### Model Description |
|
|
- **Authors:** Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi |
|
|
- **Model type:** 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation. |
|
|
- **License:** Apache 2.0 (same as code release). |
|
|
- **Sources:** Code: https://github.com/cosbidev/Text2CT | Paper: https://arxiv.org/abs/2506.00633
|
|
- **Demo:** Use `diff_model_demo.py` from the code release for a one-off generation from text. |
|
|
|
|
|
### Intended Use |
|
|
- **Direct use:** Research/experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation. |
|
|
- **Downstream use:** Fine-tuning or integration into broader research pipelines. |
|
|
- **Out of scope:** Clinical decision-making, diagnostic use, or deployment without proper validation and approvals. |
|
|
|
|
|
### Risks & Limitations |
|
|
- Trained on CT-RATE; may encode dataset biases and is not validated for clinical use. |
|
|
- Synthetic outputs may contain artifacts; do not use for patient care. |
|
|
|
|
|
### Files included |
|
|
- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding. |
|
|
- `unet_rflow_200ep.pt` — Diffusion UNet trained with rectified flow. |
|
|
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports. |
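Once downloaded (see "How to Get Started" below), the three checkpoints can be resolved programmatically. This is a minimal sketch, assuming the files sit directly in the download directory; the `CHECKPOINTS` mapping and `find_checkpoints` helper are illustrative and not part of the code release:

```python
from pathlib import Path

# Expected checkpoint filenames from this repository (see "Files included").
CHECKPOINTS = {
    "autoencoder": "autoencoder_epoch273.pt",
    "unet": "unet_rflow_200ep.pt",
    "clip": "CLIP3D_Finding_Impression_30ep.pt",
}

def find_checkpoints(local_dir: str) -> dict:
    """Resolve the three checkpoint paths, failing fast if any is missing."""
    root = Path(local_dir)
    paths = {}
    for role, name in CHECKPOINTS.items():
        path = root / name
        if not path.is_file():
            raise FileNotFoundError(f"missing {role} checkpoint: {path}")
        paths[role] = path
    return paths
```

Each resolved file can then be loaded with `torch.load(path, map_location="cpu")` before being handed to the corresponding model class from the code release.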
|
|
|
|
|
### How to Get Started (Python) |
|
|
```python
from huggingface_hub import snapshot_download

repo_id = "dmolino/text2ct-weights"

snapshot_download(
    repo_id=repo_id,
    repo_type="model",
    local_dir="your_local_path",
)
```

Use the downloaded files in the code release configs:

- `trained_autoencoder_path` -> `autoencoder_path`
- `existing_ckpt_filepath` / `model_filename` -> `unet_path`
- `clip_weights` (for report embeddings) -> `clip_path`
|
|
|
|
|
### Training Data (for these weights) |
|
|
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports. |
|
|
|
|
|
### Training Procedure (summary) |
|
|
- CLIP3D trained for vision-language alignment on CT+reports. |
|
|
- VAE checkpoint from https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi. |
|
|
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings. |
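The rectified-flow objective in the last step can be sketched as follows. This is a generic illustration of RFlow training in latent space, not the release's actual training code: the latent is linearly interpolated between Gaussian noise and data, and the network regresses the constant velocity `x1 - x0` along that straight path (shapes and helper names here are assumptions):

```python
import numpy as np

def rflow_pair(x1: np.ndarray, t: float, rng: np.random.Generator):
    """Build one rectified-flow training pair for a data latent x1.

    x0 is Gaussian noise, x_t the straight-line interpolation at time t,
    and the regression target is the constant velocity along that line.
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint (t = 0)
    x_t = (1.0 - t) * x0 + t * x1        # point on the straight path
    v_target = x1 - x0                   # velocity the network should predict
    return x_t, v_target

def rflow_loss(v_pred: np.ndarray, v_target: np.ndarray) -> float:
    """Mean-squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))
```

In the full model the velocity network is the 3D UNet, conditioned on the CLIP3D report embedding, and sampling integrates the learned velocity field from noise (t = 0) to data (t = 1).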
|
|
|
|
|
### Evaluation |
|
|
- See paper for quantitative and qualitative results. |
|
|
|
|
|
### Further Information |
|
|
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans. |
|
|
|
|
|
### Environmental Impact |
|
|
- Not reported; training used a multi-GPU setup.
|
|
|
|
|
### Citation |
|
|
If you use these weights or code, please cite the paper: |
|
|
```
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```