---
library_name: transformers
license: apache-2.0
datasets:
- Jialuo21/Science-T2I-Trainset
base_model:
- laion/CLIP-ViT-H-14-laion2B-s32B-b79K
---

<img src="teaser.png" align="center">

# SciScore
SciScore is fine-tuned from the [CLIP-H](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) base model on the [Science-T2I](https://huggingface.co/datasets/Jialuo21/Science-T2I-Trainset) dataset. It takes an implicit prompt and a generated image as input and outputs a score that represents the scientific alignment between them.

## Resources
- [Website](https://jialuo-li.github.io/Science-T2I-Web/)
- [arXiv: Paper](https://arxiv.org/abs/2504.13129)
- [GitHub: Code](https://github.com/Jialuo-Li/Science-T2I)
- [Hugging Face: Science-T2I-S&C Benchmark](https://huggingface.co/collections/Jialuo21/science-t2i-67d3bfe43253da2bc7cfaf06)
- [Hugging Face: Science-T2I Trainset](https://huggingface.co/datasets/Jialuo21/Science-T2I-Trainset)

## Feature
<img src="exp.png" align="center">

## Quick Start
```python
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor_name_or_path = "Jialuo21/SciScore"
model_pretrained_name_or_path = "Jialuo21/SciScore"

processor = AutoProcessor.from_pretrained(processor_name_or_path)
model = AutoModel.from_pretrained(model_pretrained_name_or_path).eval().to(device)

def calc_probs(prompt, images):
    # Preprocess the candidate images.
    image_inputs = processor(
        images=images,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)

    # Tokenize the prompt, truncated to 77 tokens.
    text_inputs = processor(
        text=prompt,
        padding=True,
        truncation=True,
        max_length=77,
        return_tensors="pt",
    ).to(device)

    with torch.no_grad():
        # L2-normalize the embeddings so their dot product is a cosine similarity.
        image_embs = model.get_image_features(**image_inputs)
        image_embs = image_embs / torch.norm(image_embs, dim=-1, keepdim=True)

        text_embs = model.get_text_features(**text_inputs)
        text_embs = text_embs / torch.norm(text_embs, dim=-1, keepdim=True)

        # Scale the similarities by the learned temperature, then softmax over the images.
        scores = model.logit_scale.exp() * (text_embs @ image_embs.T)[0]
        probs = torch.softmax(scores, dim=-1)
    return probs.cpu().tolist()

pil_images = [Image.open("./examples/camera_1.png"), Image.open("./examples/camera_2.png")]
prompt = "A camera screen without electricity sits beside the window, realistic."
print(calc_probs(prompt, pil_images))
```
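
The scoring step inside `calc_probs` can be read in isolation: normalize the embeddings, take cosine similarities, multiply by a scale, and apply a softmax across the candidate images. The following is a minimal dependency-free sketch of that arithmetic, not the model itself. The helper names (`l2_normalize`, `softmax`, `score_images`), the toy 2-D embeddings, and the scale value of 10.0 are all assumptions made for illustration; the real model produces much larger embeddings and uses its own learned `logit_scale`.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def score_images(text_emb, image_embs, scale):
    """Return one probability per image for a single prompt embedding.

    `scale` plays the role of `model.logit_scale.exp()` in the snippet above
    (hypothetical value here, not the one learned by the model).
    """
    text_unit = l2_normalize(text_emb)
    sims = [
        sum(t * i for t, i in zip(text_unit, l2_normalize(image_emb)))
        for image_emb in image_embs
    ]
    return softmax([scale * s for s in sims])

# Toy 2-D embeddings, made up for illustration (not real model outputs).
probs = score_images([1.0, 0.0], [[2.0, 0.0], [1.0, 1.0]], scale=10.0)

# Index of the candidate image with the highest probability.
best = max(range(len(probs)), key=probs.__getitem__)
print(probs, best)
```

Because the softmax runs over the candidate images, the returned values sum to 1 and are only meaningful relative to one another: they rank the images supplied for the same prompt rather than giving an absolute quality score for a single image.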

## Citation

```
@misc{li2025sciencet2iaddressingscientificillusions,
      title={Science-T2I: Addressing Scientific Illusions in Image Synthesis},
      author={Jialuo Li and Wenhao Chai and Xingyu Fu and Haiyang Xu and Saining Xie},
      year={2025},
      eprint={2504.13129},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.13129},
}
```