# System Architecture (Hardware Agnostic)

Deep technical reference for the SciMLx autonomous research loop components, optimized for both NVIDIA GPUs (PyTorch) and Apple Silicon (MLX).

---

## 3-Tier Scientific Implementation (SI) Layer

SciMLx utilizes a modular SI layer in `core/` to decouple scientific logic from underlying hardware and compute frameworks. The layer is organized into three tiers:

### Tier 1: Foundations

Core math and device abstractions that provide a stable, hardware-agnostic base.

- **`device.py` (Hardware Agnostic Dispatch)**: Automatically detects the best available backend (CUDA, MLX, MPS, or CPU) and provides a single API for framework-agnostic tensor creation (`to_array()`) and device placement (`to_device()`).
- **`units.py` (`SciMLTensor`)**: Enforces physical consistency by checking units on every operation through dimensional analysis with the `pint` unit registry.
- **`lie_math.py` & `heat_kernels.py` (Geometric Math)**: Implements Lie algebra foundations and mesh-based heat kernel signatures for geometry-aware operators.
- **`oracle_constants.py` (Buckingham Pi Theorem)**: Identifies dimensionless groups (like Reynolds or Péclet numbers) to aid in feature discovery and similarity analysis.
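
The dimensional bookkeeping behind `units.py` and `oracle_constants.py` can be illustrated with a minimal, dependency-free sketch. This is not the SciMLx implementation, which relies on `pint`; the names `DIMENSIONS`, `combined_dimension`, and `is_dimensionless` are hypothetical. Each quantity is described by its exponents over (mass, length, time), and a group is dimensionless when all exponents cancel.

```python
# Illustrative only: these names are hypothetical and not part of SciMLx.
# Each entry holds exponents over the base dimensions (mass, length, time).
DIMENSIONS = {
    "density": (1, -3, 0),     # kg / m^3
    "velocity": (0, 1, -1),    # m / s
    "length": (0, 1, 0),       # m
    "viscosity": (1, -1, -1),  # dynamic viscosity, kg / (m s)
}


def combined_dimension(powers):
    """Return the (M, L, T) exponents of the product of quantity ** power."""
    total = [0, 0, 0]
    for name, power in powers.items():
        for axis, exponent in enumerate(DIMENSIONS[name]):
            total[axis] += power * exponent
    return tuple(total)


def is_dimensionless(powers):
    """True when every base-dimension exponent cancels to zero."""
    return all(exponent == 0 for exponent in combined_dimension(powers))


# Reynolds number: Re = density * velocity * length / viscosity
reynolds = {"density": 1, "velocity": 1, "length": 1, "viscosity": -1}
print(is_dimensionless(reynolds))
```

The Buckingham Pi theorem generalizes this check: dimensionless groups correspond to vectors in the null space of the matrix whose columns are the exponent tuples above.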

### Tier 2: Models

Production-grade neural operators and the infrastructure to build them.

- **Dual-Backend Operators**: All new models are implemented with both PyTorch (`_torch.py`) and MLX (`_mlx.py`) backends, managed by a central dispatcher (e.g., `models/mff.py`).
- **`scaffold.py` (Automated Scaffolding)**: Generates dual-backend model stubs from research proposals, ensuring feature parity across hardware.
- **`losses.py` (Physics-Informed Operators)**: Framework-agnostic implementations of Sobolev ($H^1$, $H^2$) and Spectral losses that penalize unphysical oscillations.
- **`spectral_governor.py` (Frequency Governance)**: Dynamically monitors the Fourier spectrum of residuals across backends and adjusts loss weighting to ensure high-frequency features are captured.
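
To show why a Sobolev loss penalizes oscillations, here is a minimal sketch of a discrete $H^1$ loss, $\|e\|_{L^2}^2 + \|\partial_x e\|_{L^2}^2$ for the error $e$, on a periodic 1-D grid. It is pure Python with forward differences, chosen for self-containment; the function name `h1_loss` and the finite-difference derivative are assumptions, and `losses.py` may compute derivatives differently (for example spectrally).

```python
import math


def h1_loss(pred, target, length=1.0):
    """Discrete H^1 loss on a periodic 1-D grid: mean squared error of the
    values plus mean squared error of forward-difference derivatives."""
    n = len(pred)
    dx = length / n
    err = [p - t for p, t in zip(pred, target)]
    # Periodic forward difference of the error field.
    d_err = [(err[(i + 1) % n] - err[i]) / dx for i in range(n)]
    value_term = sum(e * e for e in err) / n
    gradient_term = sum(d * d for d in d_err) / n
    return value_term + gradient_term


# Two errors with the same amplitude but different frequency.
n = 64
grid = [i / n for i in range(n)]
target = [0.0] * n
smooth_error = [math.sin(2 * math.pi * 1 * x) for x in grid]
oscillatory_error = [math.sin(2 * math.pi * 8 * x) for x in grid]

print(h1_loss(smooth_error, target), h1_loss(oscillatory_error, target))
```

Both error fields have the same plain $L^2$ error, but the derivative term grows with frequency, so the oscillatory error receives the larger loss.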

### Tier 3: Production

Systems for scaling, deploying, and automating the research cycle.

- **`deployment.py`**: Integration with Google Cloud Platform (Vertex AI, Compute Engine) for serverless GPU training.
- **`model_versioning.py`**: A centralized model registry and lineage tracking system for managing champion models.
- **`hpo.py` (Bayesian Optimization)**: Automated hyperparameter search that adapts to hardware-specific constraints.
- **`dp_federated.py` (Differential Privacy)**: Logic for secure, privacy-preserving federated training of scientific models.
- **`arxiv_agent.py` (ASIL Pipeline)**: Orchestrates the Agentic Scientist Ideation Loop, from literature review to automated model scaffolding.

---

## Two-Mode Operation

The system supports two operating modes that can be mixed within a session:

**Mode A: Human-Guided**

A human (or AI agent) reads `RESEARCH_BRAIN.md`, interprets results, edits `experiments.yaml` directly, and invokes `autorun.py` to execute the queue. The system handles execution, retry, and logging; the human handles strategy.

**Mode B: Fully Autonomous (`agent_loop.py`)**

`agent_loop.py` performs one full autonomous cycle:

1. Calls `tracker.analyze_lineage()` to build per-benchmark summaries.
2. Calls `HypothesisEngine.analyze_benchmark()` for each priority benchmark.
3. Calls `BayesianHPO.ask()` to sample hyperparameters.
4. Generates new `ExperimentConfig` entries and appends them to `experiments.yaml`.
5. Triggers `autorun.py` to process the queue on NVIDIA GPUs.
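
The five steps above can be sketched as a single function. This is a hedged outline, not the contents of `agent_loop.py`: only the method names `analyze_lineage()`, `analyze_benchmark()`, and `ask()` come from the list, while their arguments and return types, the plain dicts standing in for `ExperimentConfig`, the `append_experiments` and `launch_queue` callables, and the `"burgers"` benchmark name are all assumptions made for illustration.

```python
from types import SimpleNamespace


def run_cycle(tracker, engine, hpo, append_experiments, launch_queue,
              priority_benchmarks):
    """One autonomous cycle. Argument shapes and return types are assumed."""
    summaries = tracker.analyze_lineage()                            # step 1
    new_configs = []
    for benchmark in priority_benchmarks:
        diagnosis = engine.analyze_benchmark(summaries[benchmark])   # step 2
        hyperparams = hpo.ask()                                      # step 3
        new_configs.append({                                         # step 4
            "benchmark": benchmark,
            "diagnosis": diagnosis,
            "hyperparams": hyperparams,
        })
    append_experiments(new_configs)  # stand-in for writing experiments.yaml
    launch_queue()                   # step 5: stand-in for invoking autorun.py
    return new_configs


# Stub collaborators so the sketch runs without SciMLx installed.
tracker = SimpleNamespace(
    analyze_lineage=lambda: {"burgers": {"best_val_l2_rel": 0.12}})
engine = SimpleNamespace(analyze_benchmark=lambda summary: "spectral_bias")
hpo = SimpleNamespace(ask=lambda: {"lr": 1e-3, "hidden_dim": 64})

queued, launched = [], []
configs = run_cycle(tracker, engine, hpo, queued.extend,
                    lambda: launched.append(True), ["burgers"])
print(configs)
```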

---

## Unified Trainer (`core/trainer.py`)

The trainer is designed to be high-performance while remaining flexible across backends:

### Compute Optimizations

- **NVIDIA/PyTorch**: Utilizes `torch.compile()` for kernel fusion and `torch.amp` for mixed precision training.
- **Apple/MLX**: Leverages MLX's lazy evaluation and unified memory for efficient processing on M-series chips.
- **Precision Management**: Configurable precision levels (float32, bfloat16) mapped to hardware-specific best practices.

### Training Logic

- **EMA (Exponential Moving Average)**: Maintains a shadow copy of model weights for more stable evaluation.
- **Dynamic Budget Extension**: Automatically extends training time by 20% if the loss is still decreasing significantly at the end of the budget.
- **Snapshot Ensembling**: Optionally saves and averages multiple model states throughout the run.
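
As a rough illustration of the first two items, the sketch below shows an EMA update over a dict of scalar weights and a budget-extension rule. The 20% extension comes from the description above; the function names, the `decay` default, and the definition of "decreasing significantly" (a relative drop of at least `min_rel_drop` over the last `window` steps) are assumptions, not the trainer's actual criteria.

```python
def ema_update(shadow, params, decay=0.999):
    """Blend current parameters into the shadow copy, in place."""
    for name, value in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * value
    return shadow


def extended_budget(budget_s, losses, window=10, min_rel_drop=0.01):
    """Return the budget plus 20% if the loss fell by at least min_rel_drop
    (relative) over the last `window` recorded steps; otherwise unchanged."""
    if len(losses) <= window or losses[-window - 1] <= 0:
        return budget_s
    reference = losses[-window - 1]
    rel_drop = (reference - losses[-1]) / reference
    return budget_s * 1.2 if rel_drop >= min_rel_drop else budget_s


shadow = ema_update({"w": 1.0}, {"w": 2.0}, decay=0.9)
still_improving = [1.0 - 0.05 * i for i in range(12)]
plateaued = [0.5] * 12
print(shadow, extended_budget(300.0, still_improving),
      extended_budget(300.0, plateaued))
```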

---

## Hardware-Accelerated PDE Solvers (`data/simulations/`)

All PDE solvers are implemented using framework-native spectral methods so that simulations run on the active device:

- **Spectral Methods**: Use fast Fourier transforms (`torch.fft` or `mlx.core.fft`) to compute spectral derivatives and integrals.
- **Zero-Copy Data**: Solvers execute directly on the `DEVICE`, producing tensors that never leave high-speed device memory during training.
- **Batch Processing**: All simulations are vectorized to solve multiple initial conditions in parallel, maximizing device throughput.
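
A spectral derivative multiplies each Fourier coefficient by $i k$ and transforms back. The sketch below shows the idea on a periodic 1-D grid. To stay dependency-free it uses a naive $O(n^2)$ DFT as a stand-in for the framework FFT, so it demonstrates the math rather than the speed; `dft` and `spectral_derivative` are hypothetical names, not functions from `data/simulations/`.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for a framework FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def spectral_derivative(u, length):
    """Differentiate periodic samples u on [0, length) in Fourier space."""
    n = len(u)
    u_hat = dft(u)
    du_hat = []
    for k in range(n):
        # Signed integer wavenumber: 0, 1, ..., then the negative modes.
        m = k if k < n / 2 else k - n
        if n % 2 == 0 and k == n // 2:
            m = 0  # drop the unpaired Nyquist mode for a real-valued result
        du_hat.append(1j * (2 * math.pi * m / length) * u_hat[k])
    return [z.real for z in dft(du_hat, inverse=True)]


n = 16
length = 2 * math.pi
x = [length * i / n for i in range(n)]
du = spectral_derivative([math.sin(v) for v in x], length)
print(max(abs(d - math.cos(v)) for d, v in zip(du, x)))
```

For band-limited inputs such as `sin(x)` the result matches the analytic derivative to floating-point precision, which is why spectral methods suit smooth periodic PDE benchmarks.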

---

## High-Throughput Data Pipeline

The I/O bottleneck is reduced through:

1. **`PDEDataset`**: An `IterableDataset` that reads from cached `.npz` files or on-the-fly solvers.
2. **`DataLoader`**: Standard PyTorch implementation with:
   - `pin_memory=True`: For faster host-to-device transfer.
   - `num_workers > 0`: For multi-process data prefetching.
   - `prefetch_factor`: To keep the GPU saturated (only takes effect when `num_workers > 0`).

---

## HypothesisEngine (`core/hypothesis.py`)

The engine classifies experiment outcomes to guide follow-up logic:

| Mode | Detection Logic |
|---|---|
| `gradient_collapse` | `val_l2_rel ≥ 1.0`, NaN loss, or CUDA launch errors. |
| `spectral_bias` | High-frequency error > 0.3 in spectral diagnostic. |
| `capacity_limited` | Small model (`hidden_dim < 64`) with high error. |
| `cuda_oom` | Log analysis detects "Out of Memory" on GPU. |
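
The table can be read as a rule-based classifier. The sketch below is one possible encoding, not the code in `core/hypothesis.py`. The thresholds `1.0`, `0.3`, and `64` come from the table; the function name, the order in which rules are checked, the `high_error` cutoff for "high error", the log substrings used for matching, and the use of a NaN `val_l2_rel` as a proxy for NaN loss are assumptions.

```python
import math


def classify_failure(val_l2_rel, high_freq_error, hidden_dim, log_text="",
                     high_error=0.5):
    """Map run diagnostics to one of the modes in the table, or None.

    Rule order and the high_error cutoff are illustrative assumptions.
    """
    log = log_text.lower()
    if "out of memory" in log:
        return "cuda_oom"
    launch_error = "cuda error" in log and "launch" in log
    if math.isnan(val_l2_rel) or val_l2_rel >= 1.0 or launch_error:
        return "gradient_collapse"
    if high_freq_error > 0.3:
        return "spectral_bias"
    if hidden_dim < 64 and val_l2_rel > high_error:
        return "capacity_limited"
    return None


print(classify_failure(0.2, 0.4, 128))
```

Checking the OOM rule first is a design choice in this sketch: a run that dies from memory exhaustion may also report a NaN or missing metric, and the memory error is the more actionable diagnosis.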

---

## Retry Escalation (`autorun.py`)

`autorun.py` escalates through three recovery levels in NVIDIA environments:

- **r1: `smart_fix()`**: Detects CUDA OOM and automatically halves `hidden_dim`.
- **r2**: Aggressive reduction of `hidden_dim`, `n_layers`, and `n_modes`.
- **r3**: Minimal viable fallback (`h=32, l=2, lr=1e-4`).
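
The escalation ladder can be sketched as a pure function over a config dict. Halving `hidden_dim` at r1 and the r3 values come from the list above; the function name `escalate`, the r2 reduction factors and floors, and the mapping of `h` and `l` to `hidden_dim` and `n_layers` are assumptions. The real `smart_fix()` also inspects the failure for CUDA OOM first, which this sketch omits.

```python
def escalate(config, level):
    """Return a reduced copy of `config` for retry level 1, 2, or 3."""
    cfg = dict(config)  # never mutate the caller's config
    if level == 1:
        # r1: halve the hidden width.
        cfg["hidden_dim"] = max(1, cfg["hidden_dim"] // 2)
    elif level == 2:
        # r2: assumed factors; the source only says "aggressive reduction".
        cfg["hidden_dim"] = max(32, cfg["hidden_dim"] // 4)
        cfg["n_layers"] = max(2, cfg["n_layers"] // 2)
        cfg["n_modes"] = max(4, cfg["n_modes"] // 2)
    elif level == 3:
        # r3: minimal viable fallback.
        cfg.update(hidden_dim=32, n_layers=2, lr=1e-4)
    else:
        raise ValueError(f"unknown retry level: {level}")
    return cfg


base = {"hidden_dim": 256, "n_layers": 8, "n_modes": 16, "lr": 1e-3}
for level in (1, 2, 3):
    print(level, escalate(base, level))
```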

---

## Cloud Infrastructure (GCP)

Configured for project `gdpr-494411`:

- **Vertex AI**: Custom container execution using the project's Artifact Registry.
- **Compute Engine**: G2-standard instances with NVIDIA L4 GPUs for development.