---
license: mit
language:
- en
---
# Physics Foundation Vision Transformer (PhysicsViT-ExtendedVersion)

A Vision Transformer model trained on multi-physics simulation data for scientific computing applications. It is designed to extract embeddings from images of physics simulations across multiple domains.

**Model Version:** Extended Version - Trained for 195,930 steps

## Model Details

### Model Description

- **Developed by:** PhysicsAlchemists Research Team
- **Model type:** Vision Transformer (ViT-Huge)
- **License:** MIT
- **Finetuned from model:** None (trained from scratch on physics simulation data)
- **Training Steps:** 195,930 steps

### Model Architecture

- **Architecture:** ViT-Huge (Feature Extraction)
- **Hidden size:** 1280
- **Number of layers:** 32
- **Number of attention heads:** 16
- **Intermediate size:** 5120
- **Image size:** 224×224
- **Patch size:** 16×16
- **Embedding dimension:** 1280
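
The sequence length the encoder sees follows from the image and patch sizes listed above. A quick arithmetic check, assuming non-overlapping patches and a single prepended CLS token (consistent with the output shapes shown under Basic Usage); the variable names are illustrative only:

```python
# Token-count arithmetic for a 224x224 input split into 16x16 patches.
image_size = 224
patch_size = 16
hidden_size = 1280
num_heads = 16

patches_per_side = image_size // patch_size   # patches along each image edge
num_patches = patches_per_side ** 2           # patch tokens per image
seq_len = num_patches + 1                     # plus one CLS token (assumed)
head_dim = hidden_size // num_heads           # dimensions per attention head

print(patches_per_side, num_patches, seq_len, head_dim)  # 14 196 197 80
```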

## Training Details

### Training Data

The model was trained on a comprehensive dataset of physics simulations including:

- Acoustic scattering (inclusions, discontinuous, maze)
- Active matter simulations
- Euler equations (multi-quadrants with open/periodic BC)
- Gray-Scott reaction-diffusion
- Helmholtz staircase
- Planetary shallow water equations
- Rayleigh-Bénard convection (standard and uniform)
- Shear flow dynamics
- Turbulent radiative layer (2D)
- Viscoelastic instability

### Training Configuration

- **Training regime:** 195,930 steps
- **Batch size:** 1,470
- **Learning rate:** 0.0005 (with warmup and cosine decay)
- **Optimizer:** Adam (β₁=0.9, β₂=0.999, weight_decay=0.0003)
- **Mixed precision:** bfloat16
- **Hardware:** Cerebras CS-X systems
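
The "warmup and cosine decay" schedule can be pictured with a small function. This is a minimal sketch, not the training code: the peak learning rate and total step count come from the list above, while `warmup_steps`, `min_lr`, and the linear warmup shape are assumptions, since the card does not state them.

```python
import math

def lr_at_step(step, peak_lr=5e-4, total_steps=195_930,
               warmup_steps=2_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    warmup_steps and min_lr are hypothetical values chosen for illustration.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp in case step exceeds total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Print the learning rate at a few points along the run
for s in (0, 1_999, 100_000, 195_930):
    print(s, f"{lr_at_step(s):.2e}")
```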

### Data Augmentation

- Random colormap application (viridis, plasma, inferno, coolwarm)
- Grayscale conversion (30% probability)
- Temporal trajectory preservation during training
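
As a rough illustration of the first two items, the sketch below renders a 2D scalar field with a randomly chosen colormap and converts it to grayscale with 30% probability. It is not the original pipeline: the per-frame min-max normalization, the order of operations, and the name `render_field` are assumptions, and it handles a single frame only, so it does not cover the temporal-trajectory item.

```python
import random

import matplotlib
import numpy as np
from PIL import Image

COLORMAPS = ["viridis", "plasma", "inferno", "coolwarm"]  # from the list above

def render_field(field, rng, gray_prob=0.3):
    """Render a 2D scalar field as an RGB image with a random colormap.

    Assumptions: per-frame min-max normalization, and grayscale conversion
    applied to the already colormapped image.
    """
    field = np.asarray(field, dtype=np.float64)
    span = field.max() - field.min()
    norm = (field - field.min()) / span if span > 0 else np.zeros_like(field)

    cmap = matplotlib.colormaps[rng.choice(COLORMAPS)]
    rgb = (cmap(norm)[..., :3] * 255).astype(np.uint8)  # drop the alpha channel
    img = Image.fromarray(rgb)

    if rng.random() < gray_prob:
        img = img.convert("L").convert("RGB")  # grayscale, still 3 channels
    return img

rng = random.Random(0)
frame = np.random.default_rng(0).normal(size=(64, 96))  # stand-in simulation frame
image = render_field(frame, rng)
print(image.size, image.mode)
```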

## Usage

⚠️ **Important:** This model requires specific preprocessing that differs from standard ViT models.

### Basic Usage

```python
from transformers import AutoModel, AutoImageProcessor
from torchvision import transforms
from PIL import Image
import torch

# Load model and processor
model = AutoModel.from_pretrained("JessicaE/physics-vit-full")
# Note: the processor is loaded here but not used below; the custom preprocessing replaces it
processor = AutoImageProcessor.from_pretrained("JessicaE/physics-vit-full")

# Load your physics image
image = Image.open("physics_simulation.png").convert("RGB")

# Apply custom preprocessing
# expand_to_square is defined under "Required Preprocessing Function" below; define it before running this
image = expand_to_square(image, background_color=(128, 128, 128))
image = image.resize((224, 224), Image.BILINEAR)

# Convert to a [1, 3, 224, 224] float tensor in [0, 1] (unsqueeze adds the batch dimension)
tensor = transforms.ToTensor()(image).unsqueeze(0)

# Extract physics-aware embeddings
with torch.no_grad():
    outputs = model(pixel_values=tensor)
    
    # CLS token embedding (best for classification tasks)
    cls_embedding = outputs.last_hidden_state[:, 0, :]  # Shape: [1, 1280]
    
    # Average pooled embedding (good for trajectory prediction)  
    pooled_embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
    
    # Patch embeddings (for spatial analysis)
    patch_embeddings = outputs.last_hidden_state[:, 1:, :]  # Shape: [1, 196, 1280]

print(f"CLS embedding shape: {cls_embedding.shape}")
```
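
For spatial analysis it is often convenient to view the 196 patch tokens as a 14×14 feature map. A minimal sketch, using a random tensor as a stand-in for `outputs.last_hidden_state` and assuming the patches are ordered row by row (left to right, top to bottom), which is the usual ViT convention but is not confirmed for this checkpoint:

```python
import torch

# Stand-in for outputs.last_hidden_state: [batch, 1 + 196, 1280]
last_hidden_state = torch.randn(1, 197, 1280)

patch_tokens = last_hidden_state[:, 1:, :]   # drop CLS -> [1, 196, 1280]
side = int(patch_tokens.shape[1] ** 0.5)     # 14 patches per side

# Assumes row-major patch order; result is [batch, channels, height, width]
feature_map = patch_tokens.reshape(1, side, side, -1).permute(0, 3, 1, 2)

print(feature_map.shape)  # torch.Size([1, 1280, 14, 14])
```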

### Required Preprocessing Function

```python
from PIL import Image

def expand_to_square(pil_img, background_color):
    """
    Pad image to square with background color, keeping image centered.
    
    REQUIRED for Physics ViT - this preprocessing was used during training.
    """
    background_color = tuple(background_color)
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
```
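
The pad, resize, and scale steps can also be wrapped in a single helper and checked on a synthetic image. This is a sketch under assumptions: `preprocess_physics_image` is a hypothetical name, the gray fill `(128, 128, 128)` is taken from the Basic Usage example, and NumPy stands in for `transforms.ToTensor()` so the block runs without torchvision.

```python
import numpy as np
from PIL import Image

def preprocess_physics_image(img, size=224, fill=(128, 128, 128)):
    """Pad to a centered square, resize, and return a CHW float array in [0, 1]."""
    img = img.convert("RGB")
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), fill)
    # Same centering as expand_to_square: offset the shorter dimension
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    canvas = canvas.resize((size, size), Image.BILINEAR)
    return np.asarray(canvas, dtype=np.float32).transpose(2, 0, 1) / 255.0

wide = Image.new("RGB", (300, 100), (255, 0, 0))  # synthetic non-square input
x = preprocess_physics_image(wide)

# Corner pixel lies in the gray padding; center pixel lies in the red image
print(x.shape, x[:, 0, 0], x[:, 112, 112])
```

Wrapping the result with `torch.from_numpy(x).unsqueeze(0)` would give a batch of one in the layout used in Basic Usage.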

### Downstream Tasks

This model produces 1280-dimensional embeddings that can be used for:

- **Physics Domain Classification:** Use CLS token embeddings
- **Temporal Forecasting:** Use pooled embeddings for trajectory prediction
- **Clustering & Similarity:** Use CLS or pooled embeddings
- **Spatial Analysis:** Use patch embeddings
- **Transfer Learning:** Fine-tune embeddings for new physics domains
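
As an example of the clustering and similarity use case, embeddings can be compared with cosine similarity. The sketch below uses random tensors as stand-ins for CLS (or mean-pooled) embeddings, so it shows only the mechanics, not actual model behavior:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for embeddings of 5 reference images and 1 query image
reference = torch.randn(5, 1280)
query = reference[2:3] + 0.01 * torch.randn(1, 1280)  # near-duplicate of item 2

# Cosine similarity = dot product of L2-normalized embeddings
sims = F.normalize(query, dim=-1) @ F.normalize(reference, dim=-1).T  # [1, 5]
best = sims.argmax(dim=-1).item()

print(sims.shape, best)
```

With real embeddings, the same similarity matrix can feed a nearest-neighbor lookup or a clustering routine.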

## Performance

The model has been evaluated against DINOv2 and CLIP on physics-specific tasks:

- **Classification:** Superior performance on physics domain classification
- **Temporal Forecasting:** Better prediction of physics evolution
- **Clustering:** Clearer separation of physics simulation types
- **Transfer Learning:** Robust features for new physics applications

*Detailed benchmarks available in the original research.*

## Model Versions

- **Standard Version:** 78,372 training steps. Good balance of performance and training efficiency.
- **Extended Version:** 195,930 training steps. Maximum performance, longer training.

## Installation

```bash
pip install transformers torch torchvision pillow
```

## Limitations

- **Domain Specific:** Optimized for physics simulations; may not generalize to natural images
- **Preprocessing Required:** Inputs must go through the `expand_to_square` preprocessing for correct results
- **Resolution:** Optimized for 224×224 input images
- **Physics Domains:** Trained on specific simulation types listed above

## Citation

```bibtex
@misc{physics-vit-2025,
  title={PhySiViT: A Physics Simulation Vision Transformer},
  author={Jessica Ezemba and James Afful and Mei-Yu Wang},
  year={2025},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/JessicaE/physics-vit-full}
}
```

## Acknowledgments

- Built using [Cerebras ModelZoo](https://github.com/Cerebras/modelzoo)
- Trained on Cerebras CS-X systems and Bridges-2 GPUs (Pittsburgh Supercomputing Center)
- Based on Vision Transformer architecture
- This work was made possible by the ByteBoost cybertraining program, which is funded by National Science Foundation Cybertraining awards 2320990, 2320991, and 2320992, and by the Neocortex project, the ACES platform, and the Ookami cluster.
- The Neocortex project is supported by National Science Foundation award number 2005597.
- The ACES (Accelerating Computing for Emerging Sciences) platform was funded by National Science Foundation award number 2112356.
- The Ookami cluster is supported by National Science Foundation award number 1927880.