Moatasim Farooque

System Architecture (Hardware Agnostic)

Deep technical reference for the SciMLx autonomous research loop components, optimized for both NVIDIA GPUs (PyTorch) and Apple Silicon (MLX).


3-Tier Scientific Implementation (SI) Layer

SciMLx uses a modular SI layer in core/ to decouple scientific logic from the underlying hardware and compute frameworks. The layer is organized into three tiers:

Tier 1: Foundations

Core math and device abstractions that provide a stable, hardware-agnostic base.

  • device.py (Hardware Agnostic Dispatch): Automatically detects the best available backend (CUDA, MLX, MPS, or CPU) and provides a single API for framework-agnostic tensor creation (to_array()) and device placement (to_device()).
  • units.py (SciMLTensor): Enforces dimensional consistency by running dimensional analysis on every operation through the pint unit registry.
  • lie_math.py & heat_kernels.py (Geometric Math): Implements Lie Algebra foundations and mesh-based heat kernel signatures for geometry-aware operators.
  • oracle_constants.py (Buckingham Pi Theorem): Identifies dimensionless groups (like Reynolds or Péclet numbers) to aid in feature discovery and similarity analysis.
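
The backend dispatch described for device.py can be sketched roughly as follows. This is a minimal illustration, not the actual module: the function bodies, the `to_array()` signature, and the priority order (taken from the order in which the backends are listed above) are all assumptions.

```python
from importlib import import_module
from importlib.util import find_spec


def detect_backend() -> str:
    """Return the name of the preferred available backend.

    Assumed priority: cuda, then mlx, then mps, then cpu.
    """
    torch = import_module("torch") if find_spec("torch") else None
    if torch is not None and torch.cuda.is_available():
        return "cuda"
    if find_spec("mlx") is not None:
        return "mlx"
    if torch is not None:
        # torch.backends.mps is missing on old PyTorch builds, so guard it.
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"
    return "cpu"


def to_array(values, backend: str):
    """Create a backend-native array from a Python sequence (assumed signature)."""
    if backend in ("cuda", "mps"):
        torch = import_module("torch")
        return torch.tensor(values, device=backend)
    if backend == "mlx":
        mx = import_module("mlx.core")
        return mx.array(values)
    return list(values)  # plain-Python stand-in for the CPU path


backend = detect_backend()
print(f"selected backend: {backend}")
```

Keeping the framework imports inside the functions means the module can still be imported on a machine where only one of PyTorch or MLX is installed.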

Tier 2: Models

Production-grade neural operators and the infrastructure to build them.

  • Dual-Backend Operators: All new models are implemented with both PyTorch (_torch.py) and MLX (_mlx.py) backends, managed by a central dispatcher (e.g., models/mff.py).
  • scaffold.py (Automated Scaffolding): Generates dual-backend model stubs from research proposals, ensuring feature parity across hardware.
  • losses.py (Physics-Informed Operators): Framework-agnostic implementations of Sobolev ($H^1$, $H^2$) and Spectral losses that penalize unphysical oscillations.
  • spectral_governor.py (Frequency Governance): Dynamically monitors the Fourier spectrum of residuals across backends and adjusts loss weighting to ensure high-frequency features are captured.
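
To illustrate why a Sobolev loss penalizes unphysical oscillations, here is a dependency-free sketch of a discrete $H^1$ loss on a periodic 1D grid. It is not the code in losses.py: the function name, the `grad_weight` parameter, and the use of central differences (rather than a spectral derivative) are assumptions made to keep the example short.

```python
import math


def h1_loss(pred, target, dx, grad_weight=1.0):
    """Discrete squared H^1 error on a periodic 1D grid.

    L2 term: mean of e^2. Gradient term: mean of (de/dx)^2, where de/dx
    comes from periodic central differences. grad_weight is hypothetical.
    """
    n = len(pred)
    err = [p - t for p, t in zip(pred, target)]
    l2_term = sum(e * e for e in err) / n
    # err[i - 1] wraps to the last sample when i == 0 (periodic boundary).
    grad = [(err[(i + 1) % n] - err[i - 1]) / (2.0 * dx) for i in range(n)]
    grad_term = sum(g * g for g in grad) / n
    return l2_term + grad_weight * grad_term


n = 64
dx = 2.0 * math.pi / n
xs = [i * dx for i in range(n)]
target = [math.sin(x) for x in xs]
# Two predictions with the same error amplitude but different frequency.
smooth = [t + 0.1 * math.sin(x) for t, x in zip(target, xs)]
rough = [t + 0.1 * math.sin(8.0 * x) for t, x in zip(target, xs)]

loss_smooth = h1_loss(smooth, target, dx)
loss_rough = h1_loss(rough, target, dx)
```

Both predictions have the same plain L2 error, but the high-frequency one scores a much larger $H^1$ loss because the gradient term grows with the wavenumber of the error.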

Tier 3: Production

Systems for scaling, deploying, and automating the research cycle.

  • deployment.py: Integration with Google Cloud Platform (Vertex AI, Compute Engine) for serverless GPU training.
  • model_versioning.py: A centralized model registry and lineage tracking system for managing champion models.
  • hpo.py (Bayesian Optimization): Automated hyperparameter search that adapts to hardware-specific constraints.
  • dp_federated.py (Differential Privacy): Logic for secure, privacy-preserving federated training of scientific models.
  • arxiv_agent.py (ASIL Pipeline): Orchestrates the Agentic Scientist Ideation Loop, from literature review to automated model scaffolding.

Two-Mode Operation

The system supports two operating modes that can be mixed within a session:

Mode A — Human-Guided
A human (or AI agent) reads RESEARCH_BRAIN.md, interprets results, edits experiments.yaml directly, and invokes autorun.py to execute the queue. The system handles execution, retry, and logging; the human handles strategy.

Mode B — Fully Autonomous (agent_loop.py)
agent_loop.py performs one full autonomous cycle:

  1. Calls tracker.analyze_lineage() to build per-benchmark summaries.
  2. Calls HypothesisEngine.analyze_benchmark() for each priority benchmark.
  3. Calls BayesianHPO.ask() to sample hyperparameters.
  4. Generates new ExperimentConfig entries and appends them to experiments.yaml.
  5. Triggers autorun.py to process the queue on NVIDIA GPUs.
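
The five steps above can be sketched as a single function. Only the method names analyze_lineage(), analyze_benchmark(), and ask() come from this document; their signatures, the ExperimentConfig fields, the stub classes, and the "burgers" benchmark name are placeholders invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentConfig:
    # Hypothetical fields; the real schema is whatever experiments.yaml uses.
    benchmark: str
    hypothesis: str
    params: dict = field(default_factory=dict)


def run_cycle(tracker, engine, hpo, queue, run_queue, priority_benchmarks):
    """One autonomous cycle, mirroring steps 1-5 above."""
    summaries = tracker.analyze_lineage()                        # step 1
    for name in priority_benchmarks:
        hypothesis = engine.analyze_benchmark(summaries[name])   # step 2
        params = hpo.ask()                                       # step 3
        queue.append(ExperimentConfig(name, hypothesis, params)) # step 4
    return run_queue(queue)                                      # step 5


# Stand-in collaborators so the sketch runs on its own.
class StubTracker:
    def analyze_lineage(self):
        return {"burgers": {"best_val_l2_rel": 0.12}}


class StubEngine:
    def analyze_benchmark(self, summary):
        return "spectral_bias" if summary["best_val_l2_rel"] > 0.1 else "ok"


class StubHPO:
    def ask(self):
        return {"hidden_dim": 64, "lr": 1e-3}


queue = []
# len stands in for autorun.py: it just reports how many entries were queued.
processed = run_cycle(StubTracker(), StubEngine(), StubHPO(), queue,
                      run_queue=len, priority_benchmarks=["burgers"])
```

Passing the collaborators in as arguments keeps the loop testable without GPUs or a real experiment queue.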

Unified Trainer (core/trainer.py)

The trainer is designed to be high-performance while remaining flexible across backends:

Compute Optimizations

  • NVIDIA/PyTorch: Uses torch.compile() for kernel fusion and torch.amp for mixed-precision training.
  • Apple/MLX: Leverages MLX's lazy evaluation and unified memory for efficient processing on M-series chips.
  • Precision Management: Configurable precision levels (float32, bfloat16) mapped to hardware-specific best practices.

Training Logic

  • EMA (Exponential Moving Average): Maintains a shadow copy of model weights for more stable evaluation.
  • Dynamic Budget Extension: Automatically extends training time by 20% if the loss is still decreasing significantly at the end of the budget.
  • Snapshot Ensembling: Optionally saves and averages multiple model states throughout the run.
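
A framework-free sketch of two of the mechanisms above, EMA weight tracking and the 20% budget extension. The function names, the `decay` default, and the `min_rel_drop` threshold are assumptions; the text does not define what counts as "decreasing significantly", so the criterion here is only one plausible choice.

```python
def ema_update(shadow, weights, decay=0.999):
    """In-place EMA: shadow <- decay * shadow + (1 - decay) * weights."""
    for name, value in weights.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow


def extended_budget(budget_s, recent_losses, min_rel_drop=0.01):
    """Return the budget extended by 20% if the loss is still falling.

    min_rel_drop (relative decrease across the recent window) is an
    assumed definition of "decreasing significantly".
    """
    first, last = recent_losses[0], recent_losses[-1]
    still_improving = first > 0 and (first - last) / first >= min_rel_drop
    return budget_s * 1.2 if still_improving else budget_s


shadow = {"w": 1.0}
ema_update(shadow, {"w": 0.0}, decay=0.9)      # shadow["w"] moves toward 0.0

longer = extended_budget(300.0, [1.0, 0.9])    # 10% drop: budget is extended
same = extended_budget(300.0, [1.0, 0.999])    # 0.1% drop: budget is kept
```

In a real trainer the shadow weights would be backend tensors and evaluation would run against the shadow copy rather than the live weights.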

Hardware-Accelerated PDE Solvers (data/simulations/)

All PDE solvers are implemented using framework-native spectral methods to ensure high-speed simulation on the active device:

  • Spectral Methods: Utilize fast Fourier transforms (torch.fft or mlx.fft) for high-speed spectral derivatives and integration.
  • Zero-Copy Data: Solvers execute directly on the DEVICE, producing tensors that never leave high-speed device memory during training.
  • Batch Processing: All simulations are vectorized to solve multiple initial conditions in parallel, maximizing device throughput.
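
The idea behind a spectral derivative is to transform to Fourier space, multiply each mode by $i k$, and transform back. The sketch below shows this on a periodic 1D grid using a naive $O(N^2)$ DFT from the standard library so it runs without any framework; it is an illustration only, and on device the same three steps would use the FFT routines named above.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for a real FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * m / n)
                  for m, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def spectral_derivative(u, length):
    """First derivative of periodic samples u on a domain of given length."""
    n = len(u)
    u_hat = dft(u)
    du_hat = []
    for k, coeff in enumerate(u_hat):
        wave = k if k < n / 2 else k - n   # map index to signed wavenumber
        if n % 2 == 0 and k == n // 2:
            wave = 0                       # drop the Nyquist mode for odd derivatives
        du_hat.append(1j * (2.0 * math.pi * wave / length) * coeff)
    return [z.real for z in dft(du_hat, inverse=True)]


n = 16
length = 2.0 * math.pi
xs = [length * i / n for i in range(n)]
du = spectral_derivative([math.sin(x) for x in xs], length)
max_err = max(abs(d - math.cos(x)) for d, x in zip(du, xs))
```

For a smooth periodic function such as sin(x), the result matches the analytic derivative cos(x) to roughly machine precision even on this coarse grid, which is the main attraction of spectral methods.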

High-Throughput Data Pipeline

The pipeline minimizes the I/O bottleneck through:

  1. PDEDataset: An IterableDataset that interfaces with cached .npz files or on-the-fly solvers.
  2. DataLoader: Standard PyTorch implementation with:
    • pin_memory=True: For faster Host-to-Device transfer.
    • num_workers > 0: For multi-process data pre-fetching.
    • prefetch_factor: To keep the GPU saturated.

HypothesisEngine (core/hypothesis.py)

The engine classifies experiment outcomes to guide follow-up logic:

| Mode | Detection Logic |
| --- | --- |
| gradient_collapse | val_l2_rel ≥ 1.0, NaN loss, or CUDA launch errors. |
| spectral_bias | High-frequency error > 0.3 in spectral diagnostic. |
| capacity_limited | Small model (hidden_dim < 64) with high error. |
| cuda_oom | Log analysis detects "Out of Memory" on GPU. |
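
The classification rules above can be sketched as a single function. The thresholds 1.0, 0.3, and 64 come from the rules themselves; the function name, argument names, check order, the `high_error` cutoff, and the log substrings used to spot OOM and launch errors are assumptions for illustration.

```python
def classify_outcome(val_l2_rel, hf_error, hidden_dim,
                     loss_is_nan=False, log_text="", high_error=0.5):
    """Map one experiment result to a failure-mode label, or None.

    The check order and the high_error threshold are assumed; the text
    only says "high error" for the capacity_limited case.
    """
    log = log_text.lower()
    if "out of memory" in log:
        return "cuda_oom"
    # Matching "launch failure" in the log is an assumed heuristic.
    if loss_is_nan or val_l2_rel >= 1.0 or "launch failure" in log:
        return "gradient_collapse"
    if hf_error > 0.3:
        return "spectral_bias"
    if hidden_dim < 64 and val_l2_rel >= high_error:
        return "capacity_limited"
    return None


examples = [
    classify_outcome(1.2, 0.0, 128),
    classify_outcome(0.2, 0.4, 128),
    classify_outcome(0.6, 0.1, 32),
    classify_outcome(0.2, 0.1, 128, log_text="RuntimeError: CUDA out of memory"),
    classify_outcome(0.05, 0.1, 128),
]
```

Checking for OOM first is a deliberate choice in this sketch: an out-of-memory run usually has no meaningful metrics, so the metric-based rules should not be consulted for it.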

Retry Escalation (autorun.py)

Failed runs escalate through three recovery levels (NVIDIA environments):

  • r1 — smart_fix(): Detects CUDA OOM and automatically halves hidden_dim.
  • r2: Aggressive reduction of hidden_dim, n_layers, and n_modes.
  • r3: Minimal viable fallback (hidden_dim=32, n_layers=2, lr=1e-4).
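
A rough sketch of the three escalation levels as a pure function over a config dictionary. The function name, the dictionary keys, and the factor-of-four reduction at r2 are assumptions ("aggressive" is not quantified above); the r3 values assume that h and l refer to hidden_dim and n_layers.

```python
def escalate(config, level):
    """Return a reduced copy of config for retry level 'r1', 'r2', or 'r3'."""
    new = dict(config)
    if level == "r1":
        # Halve hidden_dim, as described for the OOM fix.
        new["hidden_dim"] = max(1, config["hidden_dim"] // 2)
    elif level == "r2":
        # Assumed reduction factor; the text only says "aggressive".
        for key in ("hidden_dim", "n_layers", "n_modes"):
            new[key] = max(1, config[key] // 4)
    elif level == "r3":
        # Minimal viable fallback values.
        new.update(hidden_dim=32, n_layers=2, lr=1e-4)
    else:
        raise ValueError(f"unknown retry level: {level!r}")
    return new


base = {"hidden_dim": 256, "n_layers": 8, "n_modes": 32, "lr": 1e-3}
r1 = escalate(base, "r1")
r2 = escalate(base, "r2")
r3 = escalate(base, "r3")
```

Each level is computed from the original config rather than from the previous retry, so the outcome of a given level does not depend on how many earlier retries ran.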

Cloud Infrastructure (GCP)

Configured for project gdpr-494411:

  • Vertex AI: Custom container execution using the project's Artifact Registry.
  • Compute Engine: G2-standard instances with NVIDIA L4 GPUs for development.