---
license: apache-2.0
datasets:
- ibrahimhamamci/CT-RATE
- dmolino/CT-RATE_Generated_Scans
language:
- en
- tr
- it
pipeline_tag: text-to-3d
tags:
- medical
---
# Text-to-CT Model Weights
Checkpoints for **“Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining”** (Molino et al., 2025).
---
## Model Card for Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
### Model Description
- **Authors:** Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
- **Model type:** 3D latent diffusion (RFlow) + 3D VAE + CLIP3D text encoder for CT generation.
- **License:** Apache 2.0 (same as the code release).
- **Sources:** Code https://github.com/cosbidev/Text2CT | Paper https://arxiv.org/abs/2506.00633
- **Demo:** Use `diff_model_demo.py` from the code release for a one-off generation from text.
### Intended Use
- **Direct use:** Research/experimentation on text-conditioned 3D CT synthesis; generating synthetic data for benchmarking or augmentation.
- **Downstream use:** Fine-tuning or integration into broader research pipelines.
- **Out of scope:** Clinical decision-making, diagnostic use, or deployment without proper validation and approvals.
### Risks & Limitations
- Trained on CT-RATE; may encode dataset biases and is not validated for clinical use.
- Synthetic outputs may contain artifacts; do not use for patient care.
### Files included
- `autoencoder_epoch273.pt` — 3D VAE for latent compression/decoding.
- `unet_rflow_200ep.pt` — Diffusion UNet trained with rectified flow.
- `CLIP3D_Finding_Impression_30ep.pt` — CLIP3D weights for encoding reports.
### How to Get Started (Python)
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="dmolino/text2ct-weights",
    repo_type="model",
    local_dir="your_local_path",
)

# Use these files in the code release configs:
#   trained_autoencoder_path -> autoencoder_path
#   existing_ckpt_filepath / model_filename -> unet_path
#   clip_weights (for report embeddings) -> clip_path
```
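The downloaded checkpoint paths then need to be wired into the code release configs. A minimal sketch of that mapping (the dictionary layout and the `your_local_path` directory are assumptions for illustration; the file names are the ones shipped in this repo):

```python
from pathlib import Path

# Directory passed as local_dir to snapshot_download (assumed path).
weights_dir = Path("your_local_path")

# Map the downloaded files onto the config keys named above.
config = {
    "autoencoder_path": weights_dir / "autoencoder_epoch273.pt",
    "unet_path": weights_dir / "unet_rflow_200ep.pt",
    "clip_path": weights_dir / "CLIP3D_Finding_Impression_30ep.pt",
}

for key, path in config.items():
    print(f"{key}: {path.name}")
```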
### Training Data (for these weights)
- CT-RATE dataset (public on Hugging Face) for CT volumes and reports.
### Training Procedure (summary)
- CLIP3D trained for vision-language alignment on CT+reports.
- VAE checkpoint from https://github.com/Project-MONAI/tutorials/tree/main/generation/maisi.
- Diffusion UNet trained with rectified flow (RFlow) in latent space, conditioned on text embeddings.
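For intuition, rectified flow trains the network to regress a constant velocity along the straight line between a latent sample and noise. A toy NumPy sketch of the interpolation and regression target (a stand-in for illustration only; the actual training operates on 3D VAE latents inside the code release):

```python
import numpy as np

def rectified_flow_target(x0, x1, t):
    """Rectified-flow interpolation: x_t = (1 - t) * x0 + t * x1.
    The network is trained to regress the constant velocity v = x1 - x0."""
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((2, 8))  # toy "latent" batch
x1 = rng.standard_normal((2, 8))  # Gaussian noise sample
xt, v = rectified_flow_target(x0, x1, t=0.5)
```

At training time a `t` is sampled per example and the UNet's predicted velocity is compared to `v_target` with an MSE loss.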
### Evaluation
- See paper for quantitative and qualitative results.
### Further Information
- 1,000 generated CT scans are available at https://huggingface.co/datasets/dmolino/CT-RATE_Generated_Scans.
### Environmental Impact
- Not reported. Training used a multi-GPU setup.
### Citation
If you use these weights or code, please cite the paper:
```bibtex
@article{molino2025text,
title={Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining},
author={Molino, Daniele and Caruso, Camillo Maria and Ruffini, Filippo and Soda, Paolo and Guarrasi, Valerio},
journal={arXiv preprint arXiv:2506.00633},
year={2025}
}
```