---
license: apache-2.0
datasets:
- ibrahimhamamci/CT-RATE
- dmolino/CT-RATE_Generated_Scans
language:
- en
- tr
- it
pipeline_tag: text-to-3d
tags:
- medical
---

# Text-to-CT Model Weights

Checkpoints for **“Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining”** (Molino et al., 2025).

---

## Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining

### Model Description

- **Authors:** Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
- **Model type:** 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation
- **License:** Apache 2.0 (same as the code release)
- **Sources:** Code: https://github.com/cosbidev/Text2CT | Paper: https://arxiv.org/abs/2506.00633
- **Demo:** Use `diff_model_demo.py` from the code release for a one-off generation from text.

### Intended Use

- **Direct use:** Research and experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation.
- **Downstream use:** Fine-tuning or integration into broader research pipelines.
- **Out of scope:** Clinical decision-making, diagnostic use, or deployment without proper validation and approvals.

### Risks & Limitations

- Trained on CT-RATE; may encode dataset biases and is not validated for clinical use.
- Synthetic outputs may contain artifacts; do not use for patient care.

### Files included

- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.
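The three checkpoints above are plain files inside the repository, so pointing the code release at them is just a matter of joining a local directory with the filenames. A minimal sketch (the `resolve_weights` helper and the `"your_local_path"` directory are illustrative, not part of the release):

```python
import os

# Checkpoint filenames shipped in this repository (see "Files included").
CHECKPOINTS = {
    "autoencoder": "autoencoder_epoch273.pt",
    "unet": "unet_rflow_200ep.pt",
    "clip": "CLIP3D_Finding_Impression_30ep.pt",
}

def resolve_weights(local_dir: str) -> dict:
    """Hypothetical helper: map each model component to its checkpoint
    path under the directory the repo was downloaded into."""
    return {name: os.path.join(local_dir, fname)
            for name, fname in CHECKPOINTS.items()}

paths = resolve_weights("your_local_path")
print(paths["unet"])
```

Pass `local_dir` the same path you give `snapshot_download` in the snippet below, then feed the resulting paths to the corresponding config entries of the code release.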
### How to Get Started (Python)

```python
from huggingface_hub import snapshot_download

repo_id = "dmolino/text2ct-weights"
snapshot_download(
    repo_id=repo_id,
    repo_type="model",
    local_dir="your_local_path",
)
```

Use the downloaded files in the code release configs:

- `trained_autoencoder_path` → `autoencoder_path`
- `existing_ckpt_filepath` / `model_filename` → `unet_path`
- `clip_weights` (for report embeddings) → `clip_path`

### Training Data (for these weights)

- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.

### Training Procedure (summary)

- CLIP3D trained for vision-language alignment on CT volumes and reports.
- VAE checkpoint from https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings.

### Evaluation

- See the paper for quantitative and qualitative results.

### Further Information

- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.

### Environmental Impact

- Not reported. Training used a multi-GPU setup.

### Citation

If you use these weights or the code, please cite the paper:

```
@article{molino2025text,
  title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
  author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
  journal={arXiv preprint arXiv:2506.00633},
  year={2025}
}
```