---
license: mit
language:
- en
---
|
|
# Physics Foundation Vision Transformer (PhysicsViT-ExtendedVersion) |
|
|
|
|
|
A Vision Transformer model trained on multi-physics simulation data for scientific computing applications. This model is specifically designed for understanding and analyzing physics simulations across multiple domains. |
|
|
|
|
|
**Model Version:** Extended Version - Trained for 195,930 steps |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
- **Developed by:** PhysicsAlchemists Research Team |
|
|
- **Model type:** Vision Transformer (ViT-Huge) |
|
|
- **License:** MIT
|
|
- **Finetuned from model:** None; trained from scratch on physics simulation data
|
|
- **Training Steps:** 195,930 steps |
|
|
|
|
|
### Model Architecture |
|
|
|
|
|
- **Architecture:** ViT-Huge (Feature Extraction) |
|
|
- **Hidden size:** 1280 |
|
|
- **Number of layers:** 32 |
|
|
- **Number of attention heads:** 16 |
|
|
- **Intermediate size:** 5120 |
|
|
- **Image size:** 224×224 |
|
|
- **Patch size:** 16×16 |
|
|
- **Embedding dimension:** 1280 |
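These numbers fix the model's token layout: a 224×224 image split into 16×16 patches yields a 14×14 grid of patches, plus one CLS token. A quick check:

```python
image_size, patch_size = 224, 16

# 224 / 16 = 14 patches per side
num_patches = (image_size // patch_size) ** 2
seq_len = num_patches + 1  # +1 for the CLS token

print(num_patches, seq_len)  # 196 197
```

This matches the embedding shapes in the usage example below: 196 patch tokens plus the CLS token, each 1280-dimensional.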
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on a comprehensive dataset of physics simulations including: |
|
|
|
|
|
- Acoustic scattering (inclusions, discontinuous, maze) |
|
|
- Active matter simulations |
|
|
- Euler equations (multi-quadrants with open/periodic BC) |
|
|
- Gray-Scott reaction-diffusion |
|
|
- Helmholtz staircase |
|
|
- Planetary shallow water equations |
|
|
- Rayleigh-Bénard convection (standard and uniform) |
|
|
- Shear flow dynamics |
|
|
- Turbulent radiative layer (2D) |
|
|
- Viscoelastic instability |
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
- **Training regime:** 195,930 steps |
|
|
- **Batch size:** 1,470 |
|
|
- **Learning rate:** 0.0005 (with warmup and cosine decay) |
|
|
- **Optimizer:** Adam (β₁=0.9, β₂=0.999, weight_decay=0.0003) |
|
|
- **Mixed precision:** bfloat16 |
|
|
- **Hardware:** Cerebras CS-X systems |
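For a sense of scale, and assuming each step processes one full batch, the extended run sees roughly batch size × steps samples:

```python
batch_size = 1_470
steps = 195_930

# Total samples processed over the extended training run
total_samples = batch_size * steps
print(total_samples)  # 288017100 (~288M samples)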
|
|
|
|
|
### Data Augmentation |
|
|
|
|
|
- Random colormap application (viridis, plasma, inferno, coolwarm) |
|
|
- Grayscale conversion (30% probability) |
|
|
- Temporal trajectory preservation during training |
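The exact augmentation pipeline is not published here, but the colormap step can be sketched as mapping a normalized 2D scalar field through one of the listed colormaps (this uses `matplotlib`, which is an extra dependency beyond the install list below):

```python
import numpy as np
from matplotlib import colormaps

# Hypothetical sketch: render a normalized scalar field with viridis,
# one of the colormaps listed above.
rng = np.random.default_rng(0)
field = rng.random((224, 224))            # scalar field in [0, 1]

rgba = colormaps["viridis"](field)        # shape (224, 224, 4), floats in [0, 1]
rgb = (rgba[..., :3] * 255).astype(np.uint8)  # drop alpha, convert to uint8
print(rgb.shape)  # (224, 224, 3)
```

The grayscale branch would then convert `rgb` to a single channel with 30% probability before feeding the image to the model.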
|
|
|
|
|
## Usage |
|
|
|
|
|
⚠️ **Important:** This model requires specific preprocessing that differs from standard ViT models. |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from transformers import AutoModel, AutoImageProcessor
from torchvision import transforms
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained("JessicaE/physics-vit-full")
processor = AutoImageProcessor.from_pretrained("JessicaE/physics-vit-full")

# Load your physics image
image = Image.open("physics_simulation.png").convert("RGB")

# Apply custom preprocessing (expand_to_square is defined below)
image = expand_to_square(image, background_color=(128, 128, 128))
image = image.resize((224, 224), Image.BILINEAR)

# Convert to tensor and add batch dimension
tensor = transforms.ToTensor()(image).unsqueeze(0)

# Extract physics-aware embeddings
with torch.no_grad():
    outputs = model(pixel_values=tensor)

# CLS token embedding (best for classification tasks)
cls_embedding = outputs.last_hidden_state[:, 0, :]  # Shape: [1, 1280]

# Average-pooled embedding (good for trajectory prediction)
pooled_embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]

# Patch embeddings (for spatial analysis)
patch_embeddings = outputs.last_hidden_state[:, 1:, :]  # Shape: [1, 196, 1280]

print(f"CLS embedding shape: {cls_embedding.shape}")
```
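For spatial analysis, the 196 patch embeddings can be reshaped back into their 14×14 grid layout. A sketch with a random stand-in tensor in place of real model output:

```python
import torch

# Stand-in for outputs.last_hidden_state[:, 1:, :] from the model
patch_embeddings = torch.randn(1, 196, 1280)

# 224 / 16 = 14 patches per side: recover the spatial layout
grid = patch_embeddings.reshape(1, 14, 14, 1280)
print(grid.shape)  # torch.Size([1, 14, 14, 1280])
```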
|
|
|
|
|
### Required Preprocessing Function |
|
|
|
|
|
```python
from PIL import Image

def expand_to_square(pil_img, background_color):
    """
    Pad image to square with background color, keeping image centered.

    REQUIRED for Physics ViT - this preprocessing was used during training.
    """
    background_color = tuple(background_color)
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
```
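A standalone sanity check of the padding behavior, using a synthetic image (the function is repeated here so the snippet runs on its own):

```python
from PIL import Image

def expand_to_square(pil_img, background_color):
    # Same logic as the function above, repeated for a self-contained demo
    background_color = tuple(background_color)
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result

# A 300x200 test image becomes 300x300, with gray bands top and bottom
img = Image.new("RGB", (300, 200), (255, 0, 0))
padded = expand_to_square(img, background_color=(128, 128, 128))
print(padded.size)              # (300, 300)
print(padded.getpixel((0, 0)))  # (128, 128, 128): padding band
```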
|
|
|
|
|
### Downstream Tasks |
|
|
|
|
|
This model produces rich 1280-dimensional embeddings optimized for: |
|
|
|
|
|
- **Physics Domain Classification:** Use CLS token embeddings |
|
|
- **Temporal Forecasting:** Use pooled embeddings for trajectory prediction |
|
|
- **Clustering & Similarity:** Use CLS or pooled embeddings |
|
|
- **Spatial Analysis:** Use patch embeddings |
|
|
- **Transfer Learning:** Fine-tune embeddings for new physics domains |
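For the clustering and similarity use case, a similarity search over a library of precomputed embeddings reduces to a dot product on normalized vectors. A sketch with a hypothetical random `embeddings` tensor standing in for real CLS outputs:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-in: in practice these would be CLS embeddings
# produced by the model for a library of simulation snapshots.
torch.manual_seed(0)
embeddings = F.normalize(torch.randn(100, 1280), dim=1)  # [N, 1280], unit norm
query = embeddings[0]

# Cosine similarity is a dot product once vectors are normalized
scores = embeddings @ query              # [N]
top5 = torch.topk(scores, k=5).indices   # indices of the 5 nearest snapshots
print(top5[0].item())  # 0 - the query is most similar to itself
```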
|
|
|
|
|
## Performance |
|
|
|
|
|
The model has been evaluated against DINO v2 and CLIP on physics-specific tasks: |
|
|
|
|
|
- **Classification:** Superior performance on physics domain classification |
|
|
- **Temporal Forecasting:** Better prediction of physics evolution |
|
|
- **Clustering:** Clearer separation of physics simulation types |
|
|
- **Transfer Learning:** Robust features for new physics applications |
|
|
|
|
|
*Detailed benchmarks available in the original research.* |
|
|
|
|
|
## Model Versions |
|
|
|
|
|
- **Standard Version:** 78,372 training steps; good balance of performance and training efficiency

- **Extended Version:** 195,930 training steps; maximum performance at the cost of longer training
|
|
|
|
|
## Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch torchvision pillow |
|
|
``` |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Domain Specific:** Optimized for physics simulations, may not generalize to natural images |
|
|
- **Preprocessing Required:** Must use expand_to_square preprocessing for correct results |
|
|
- **Resolution:** Optimized for 224×224 input images |
|
|
- **Physics Domains:** Trained on specific simulation types listed above |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{physics-vit-2025,
  title={PhySiViT: A Physics Simulation Vision Transformer},
  author={Ezemba, Jessica and Afful, James and Wang, Mei-Yu},
  year={2025},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/JessicaE/physics-vit-full}
}
```
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- Built using [Cerebras ModelZoo](https://github.com/Cerebras/modelzoo) |
|
|
- Trained on Cerebras CS-X systems and Bridges-2 GPUs (Pittsburgh Supercomputing Center) |
|
|
- Based on Vision Transformer architecture |
|
|
- This work was made possible by the ByteBoost cybertraining program, funded by National Science Foundation Cybertraining awards 2320990, 2320991, and 2320992, and by the Neocortex project, the ACES platform, and the Ookami cluster.
|
|
- The Neocortex project is supported by National Science Foundation award number 2005597. |
|
|
- The ACES (Accelerating Computing for Emerging Sciences) platform was funded by National Science Foundation award number 2112356. |
|
|
- The Ookami cluster is supported by National Science Foundation award number 1927880. |