SciMLx_Production / docs /ARCHITECTURE.md

Moatasim Farooque

Remove problematic files

54fa103 27 days ago


	# System Architecture (Hardware Agnostic)

	Deep technical reference for the SciMLx autonomous research loop components, optimized for both NVIDIA GPUs (PyTorch) and Apple Silicon (MLX).

	---

	## 3-Tier Scientific Implementation (SI) Layer

	SciMLx uses a modular SI layer in `core/` to decouple scientific logic from the underlying hardware and compute frameworks. The layer is organized into three tiers:

	### Tier 1: Foundations
	Core math and device abstractions that provide a stable, hardware-agnostic base.
	- `device.py` (Hardware Agnostic Dispatch): Automatically detects the best available backend (CUDA, MLX, MPS, or CPU) and provides a single API for framework-agnostic tensor creation (`to_array()`) and device placement (`to_device()`).
	- `units.py` (`SciMLTensor`): Attaches physical units to tensors and checks dimensional consistency on every operation using the `pint` unit registry.
	- `lie_math.py` & `heat_kernels.py` (Geometric Math): Implements Lie Algebra foundations and mesh-based heat kernel signatures for geometry-aware operators.
	- `oracle_constants.py` (Buckingham Pi Theorem): Identifies dimensionless groups (like Reynolds or Péclet numbers) to aid in feature discovery and similarity analysis.

	### Tier 2: Models
	Production-grade neural operators and the infrastructure to build them.
	- Dual-Backend Operators: All new models are implemented with both PyTorch (`_torch.py`) and MLX (`_mlx.py`) backends, managed by a central dispatcher (e.g., `models/mff.py`).
	- `scaffold.py` (Automated Scaffolding): Generates dual-backend model stubs from research proposals, ensuring feature parity across hardware.
	- `losses.py` (Physics-Informed Operators): Framework-agnostic implementations of Sobolev ($H^1$, $H^2$) and Spectral losses that penalize unphysical oscillations.
	- `spectral_governor.py` (Frequency Governance): Dynamically monitors the Fourier spectrum of residuals across backends and adjusts loss weighting to ensure high-frequency features are captured.
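
For reference, one common Fourier-space form of the Sobolev norm on a periodic domain is given below. Whether `losses.py` uses exactly this weighting is an assumption; $e$ denotes the prediction error and $\hat{e}(\xi)$ its Fourier coefficient at wavenumber $\xi$.

$$
\|e\|_{H^k}^2 = \sum_{\xi} \left(1 + |\xi|^2\right)^{k} \, |\hat{e}(\xi)|^2, \qquad k \in \{1, 2\}
$$

The weight $(1 + |\xi|^2)^k$ grows with frequency, which is why an $H^1$ or $H^2$ loss penalizes high-frequency error more heavily than a plain $L^2$ loss ($k = 0$).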

	### Tier 3: Production
	Systems for scaling, deploying, and automating the research cycle.
	- `deployment.py`: Integration with Google Cloud Platform (Vertex AI, Compute Engine) for serverless GPU training.
	- `model_versioning.py`: A centralized model registry and lineage tracking system for managing champion models.
	- `hpo.py` (Bayesian Optimization): Automated hyperparameter search that adapts to hardware-specific constraints.
	- `dp_federated.py` (Differential Privacy): Logic for secure, privacy-preserving federated training of scientific models.
	- `arxiv_agent.py` (ASIL Pipeline): Orchestrates the Agentic Scientist Ideation Loop, from literature review to automated model scaffolding.

	---

	## Two-Mode Operation

	The system supports two operating modes that can be mixed within a session:

	Mode A — Human-Guided
	A human (or AI agent) reads `RESEARCH_BRAIN.md`, interprets results, edits `experiments.yaml` directly, and invokes `autorun.py` to execute the queue. The system handles execution, retry, and logging; the human handles strategy.

	Mode B — Fully Autonomous (`agent_loop.py`)
	`agent_loop.py` performs one full autonomous cycle:
	1. Calls `tracker.analyze_lineage()` to build per-benchmark summaries.
	2. Calls `HypothesisEngine.analyze_benchmark()` for each priority benchmark.
	3. Calls `BayesianHPO.ask()` to sample hyperparameters.
	4. Generates new `ExperimentConfig` entries and appends them to `experiments.yaml`.
	5. Triggers `autorun.py` to process the queue on NVIDIA GPUs.
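
The five steps above can be sketched as a single function. This is an illustration under stated assumptions, not the contents of `agent_loop.py`: the method names come from the list, but their signatures, the `ExperimentConfig` fields, the stub classes, and the placeholder benchmark name are all hypothetical. An in-memory list stands in for `experiments.yaml`, and a callback stands in for launching `autorun.py`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ExperimentConfig:
    # Hypothetical fields; the real schema is not shown in this document.
    benchmark: str
    hypothesis: str
    hparams: Dict[str, float] = field(default_factory=dict)


def run_cycle(tracker, engine, hpo,
              queue: List[ExperimentConfig],
              launch: Callable[[List[ExperimentConfig]], None],
              priority_benchmarks: List[str]) -> List[ExperimentConfig]:
    """Run one autonomous cycle and return the newly queued experiments."""
    summaries = tracker.analyze_lineage()                    # step 1
    new_configs = []
    for name in priority_benchmarks:
        hypothesis = engine.analyze_benchmark(name, summaries[name])  # step 2
        hparams = hpo.ask()                                  # step 3
        new_configs.append(ExperimentConfig(name, hypothesis, hparams))  # step 4
    queue.extend(new_configs)  # stands in for appending to experiments.yaml
    launch(queue)              # step 5: stands in for triggering autorun.py
    return new_configs


# Stand-in collaborators with placeholder return values.
class StubTracker:
    def analyze_lineage(self):
        return {"benchmark_a": {"runs": 3}}


class StubEngine:
    def analyze_benchmark(self, name, summary):
        return "spectral_bias"


class StubHPO:
    def ask(self):
        return {"lr": 1e-3, "hidden_dim": 64}


if __name__ == "__main__":
    queue: List[ExperimentConfig] = []
    run_cycle(StubTracker(), StubEngine(), StubHPO(),
              queue, lambda q: print(f"launching {len(q)} experiment(s)"),
              ["benchmark_a"])
```

Passing the collaborators in as arguments keeps the cycle testable without a GPU or a real experiment queue.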

	---

	## Unified Trainer (`core/trainer.py`)

	The trainer is designed to be high-performance while remaining flexible across backends:

	### Compute Optimizations
	- NVIDIA/PyTorch: Uses `torch.compile()` for kernel fusion and `torch.amp` for mixed-precision training.
	- Apple/MLX: Leverages MLX's lazy evaluation and unified memory for efficient processing on M-series chips.
	- Precision Management: Configurable precision levels (float32, bfloat16) mapped to hardware-specific best practices.

	### Training Logic
	- EMA (Exponential Moving Average): Maintains a shadow copy of model weights for more stable evaluation.
	- Dynamic Budget Extension: Automatically extends training time by 20% if the loss is still decreasing significantly at the end of the budget.
	- Snapshot Ensembling: Optionally saves and averages multiple model states throughout the run.
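
Two of the behaviors above reduce to short formulas. The sketch below is framework-free and uses plain floats in place of tensors. The function names, the `decay` default, and the `min_rel_drop` threshold are assumptions: the document states the 20% extension but does not define what counts as "decreasing significantly".

```python
from typing import Dict, List


def ema_update(shadow: Dict[str, float], weights: Dict[str, float],
               decay: float = 0.999) -> None:
    """Update the shadow weights in place: s <- decay * s + (1 - decay) * w."""
    for name, w in weights.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * w


def extended_budget(budget_s: float, recent_losses: List[float],
                    min_rel_drop: float = 0.01,
                    extension: float = 0.20) -> float:
    """Return the training budget, grown by `extension` if the loss is still falling.

    `recent_losses` is a window of losses from the end of the run, oldest first.
    `min_rel_drop` is an assumed threshold for a significant decrease.
    """
    if len(recent_losses) < 2 or recent_losses[0] <= 0:
        return budget_s
    rel_drop = (recent_losses[0] - recent_losses[-1]) / recent_losses[0]
    if rel_drop > min_rel_drop:
        return budget_s * (1.0 + extension)
    return budget_s


if __name__ == "__main__":
    shadow = {"w": 1.0}
    ema_update(shadow, {"w": 0.0}, decay=0.9)
    print(shadow)
    print(extended_budget(100.0, [1.0, 0.9]))
```

Evaluating with the shadow weights rather than the live weights smooths out step-to-step noise, which is the stability benefit the EMA bullet refers to.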

	---

	## Hardware-Accelerated PDE Solvers (`data/simulations/`)

	All PDE solvers are implemented using framework-native spectral methods to ensure high-speed simulation on the active device:
	- Spectral Methods: Use fast Fourier transforms (`torch.fft` or `mlx.core.fft`) to compute spectral derivatives and integration.
	- Zero-Copy Data: Solvers execute directly on the `DEVICE`, producing tensors that never leave high-speed device memory during training.
	- Batch Processing: All simulations are vectorized to solve multiple initial conditions in parallel, maximizing device throughput.
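
To illustrate what a spectral derivative is, the sketch below differentiates a periodic signal by multiplying its Fourier coefficients by $i k$. To stay dependency-free it swaps in a naive $O(N^2)$ discrete Fourier transform written with `cmath`; the solvers described above would use the framework FFT routines instead. The function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive discrete Fourier transform (stand-in for a framework FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[j] * cmath.exp(sign * 2j * math.pi * k * j / n)
                for j in range(n))
        out.append(s / n if inverse else s)
    return out


def spectral_derivative(u: Sequence[float],
                        length: float = 2 * math.pi) -> List[float]:
    """First derivative of a periodic signal sampled on a uniform grid."""
    n = len(u)
    u_hat = dft(u)
    # Integer wavenumbers in FFT order: 0, 1, ..., then the negative ones.
    freqs = [k if k < n / 2 else k - n for k in range(n)]
    if n % 2 == 0:
        freqs[n // 2] = 0  # zero the Nyquist mode for an odd-order derivative
    scale = 2 * math.pi / length
    du_hat = [1j * scale * f * c for f, c in zip(freqs, u_hat)]
    return [v.real for v in dft(du_hat, inverse=True)]


if __name__ == "__main__":
    n = 16
    xs = [2 * math.pi * j / n for j in range(n)]
    du = spectral_derivative([math.sin(x) for x in xs])
    # The derivative of sin(x) should match cos(x) to rounding error.
    print(max(abs(d - math.cos(x)) for d, x in zip(du, xs)))
```

For smooth periodic fields this derivative is accurate to rounding error even on coarse grids, which is the main reason spectral solvers are attractive for generating training data.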

	---

	## High-Throughput Data Pipeline

	I/O bottlenecks are mitigated through:
	1. `PDEDataset`: An `IterableDataset` that interfaces with cached `.npz` files or on-the-fly solvers.
	2. `DataLoader`: Standard PyTorch implementation with:
	   - `pin_memory=True`: For faster host-to-device transfer.
	   - `num_workers > 0`: For multi-process data prefetching.
	   - `prefetch_factor`: To keep the GPU saturated.

	---

	## HypothesisEngine (`core/hypothesis.py`)

	The engine classifies experiment outcomes to guide follow-up logic:

	| Mode | Detection Logic |
	|---|---|
	| `gradient_collapse` | `val_l2_rel ≥ 1.0`, NaN loss, or CUDA launch errors. |
	| `spectral_bias` | High-frequency error > 0.3 in spectral diagnostic. |
	| `capacity_limited` | Small model (`hidden_dim < 64`) with high error. |
	| `cuda_oom` | Log analysis detects "Out of Memory" on GPU. |
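
The table can be read as a small rule-based classifier. The sketch below is one possible reading, not the code in `core/hypothesis.py`. The thresholds `1.0`, `0.3`, and `64` come from the table; the function name, the order in which rules are checked, the `high_error` cutoff, the log substrings, and the use of a NaN `val_l2_rel` as a proxy for a NaN loss are assumptions.

```python
import math
from typing import Optional


def classify_outcome(val_l2_rel: float, hf_error: float, hidden_dim: int,
                     log_text: str = "",
                     high_error: float = 0.5) -> Optional[str]:
    """Map experiment diagnostics to one of the failure modes in the table.

    Returns None when no rule matches. Rule precedence is an assumption.
    """
    log = log_text.lower()

    # Memory failures are checked first, since metrics are meaningless after one.
    if "out of memory" in log:
        return "cuda_oom"

    # Divergence: relative error at or above 1.0, NaN, or a kernel launch error.
    if math.isnan(val_l2_rel) or val_l2_rel >= 1.0 or "launch failure" in log:
        return "gradient_collapse"

    # High-frequency content is not being captured.
    if hf_error > 0.3:
        return "spectral_bias"

    # Small model with high error (the cutoff for "high" is assumed).
    if hidden_dim < 64 and val_l2_rel > high_error:
        return "capacity_limited"

    return None


if __name__ == "__main__":
    print(classify_outcome(0.2, 0.4, 128))
    print(classify_outcome(0.6, 0.1, 32))
```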

	---

	## Retry Escalation (`autorun.py`)

	In NVIDIA environments, `autorun.py` escalates through three recovery levels when a run fails:
	- r1 — `smart_fix()`: Detects CUDA OOM and automatically halves `hidden_dim`.
	- r2: Aggressive reduction of `hidden_dim`, `n_layers`, and `n_modes`.
	- r3: Minimal viable fallback (`h=32, l=2, lr=1e-4`).
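
The three levels can be expressed as a function from a failed configuration to a reduced one. This is a sketch under assumptions: the halving at r1 and the r3 values come from the list above (reading `h` as `hidden_dim` and `l` as `n_layers`), while the function name, the r2 reduction factors, and the floor values are hypothetical. The out-of-memory detection that triggers r1 is omitted.

```python
from typing import Any, Dict


def escalate(config: Dict[str, Any], level: int) -> Dict[str, Any]:
    """Return a reduced copy of `config` for retry level 1, 2, or 3."""
    cfg = dict(config)  # never mutate the caller's config
    if level == 1:
        # r1: halve the hidden dimension.
        cfg["hidden_dim"] = max(1, cfg["hidden_dim"] // 2)
    elif level == 2:
        # r2: aggressive reduction (factors and floors are assumed).
        cfg["hidden_dim"] = max(32, cfg["hidden_dim"] // 4)
        cfg["n_layers"] = max(2, cfg["n_layers"] // 2)
        cfg["n_modes"] = max(4, cfg["n_modes"] // 2)
    elif level == 3:
        # r3: minimal viable fallback from the list above.
        cfg.update(hidden_dim=32, n_layers=2, lr=1e-4)
    else:
        raise ValueError(f"unknown retry level: {level}")
    return cfg


if __name__ == "__main__":
    base = {"hidden_dim": 256, "n_layers": 8, "n_modes": 32, "lr": 1e-3}
    for level in (1, 2, 3):
        print(level, escalate(base, level))
```

Returning a copy keeps the original configuration available for logging, so each retry can be traced back to the run that failed.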

	---

	## Cloud Infrastructure (GCP)

	Configured for project `gdpr-494411`:
	- Vertex AI: Custom container execution using the project's Artifact Registry.
	- Compute Engine: G2-standard instances with NVIDIA L4 GPUs for development.