	# Portfolio Engine — Complete Architecture Reference

	## Abstract

	This document is the master reference for the entire Portfolio Engine codebase. It describes every module, its purpose, its key functions, and how it connects to the rest of the system. When a topic is explained in full depth in a dedicated document, this file links to it rather than duplicating the content. After reading this document, you should understand what every file does, how data flows through the system, and where to find detailed explanations of each subsystem.

	---

	## 1. System Overview

	The Portfolio Engine is an institutional-grade quantitative portfolio allocation system. It ingests market data, estimates expected returns and risk, solves a constrained convex optimization problem to produce target portfolio weights, validates those weights via out-of-sample econometric tests, and generates interactive HTML reports.

	```
	┌───────────────────────────────────────────────────────────────────────┐
	│                           Portfolio Engine                            │
	│                                                                       │
	│  ┌──────────┐  ┌──────────────┐  ┌───────────────┐  ┌──────────────┐  │
	│  │   Data   │─▶│ Risk & Return│─▶│    Convex     │─▶│ Reporting &  │  │
	│  │ Ingestion│  │   Modeling   │  │ Optimization  │  │  Analytics   │  │
	│  └──────────┘  └──────────────┘  └───────────────┘  └──────────────┘  │
	│       │               │                  │                 │          │
	│  data.py       models.py         solver.py          report.py         │
	│  database.py   dl_models.py      cvxpy_engine.py    analytics.py      │
	│  alternative_  forecast_         hrp_engine.py      validation.py     │
	│  data.py       generation.py     erc_engine.py      backtest.py       │
	└───────────────────────────────────────────────────────────────────────┘
	```

	---

	## 2. Complete File Map

	Every Python file in the project, grouped by functional layer.

	### 2.1 Orchestration & Entry Points

	| File | Purpose |
	|------|---------|
	| `main.py` | CLI entry point; invokes the pipeline |
	| `core_engine.py` | `PortfolioPipeline` class, the orchestrator (validate → optimize → report) |
	| `config.py` | Configuration Facade importing schema, IO, logging, and constants |
	| `config_schema.py` | Pydantic `AppConfig` and validation rules |
	| `config_io.py` | File loading/saving for configuration dictionaries |
	| `constants.py` | Centralized magic numbers, UI formatting, and mapping dictionaries |
	| `logger.py` | JSON rotating log configuration |
	| `core_types.py` | Shared dataclasses: `PortfolioState`, `ForecastResult`, `CovarianceResult`, `OptimizationResult`, `OptimizationError`, `EngineConfig`, etc. |
	| `api.py` | FastAPI REST endpoints for headless/programmatic execution |
	| `server.py` | Lightweight HTTP server to serve generated HTML reports |
	| `dashboard.py` | Interactive CLI wizard for portfolio configuration |

	### 2.2 Data Ingestion & Persistence

	| File | Purpose |
	|------|---------|
	| `data.py` | Market data fetching (yfinance), Fama-French factor download, ML feature engineering (`build_ml_features()`), credit spread proxies, extended history stitching, and block bootstrapping |
	| `data_repository.py` | [NEW] `DataRepository` class. Centralized abstraction layer responsible for invoking data fetchers, cleaning returned series, standardizing timestamps, and returning a unified `DataSnapshot` for the engine. |
	| `database.py` | SQLAlchemy ORM models (`DailyPrice`, `DailyYield`), PostgreSQL/SQLite connection pooling via `get_pg_engine()`, schema initialization |
	| `alternative_data.py` | [NEW] Options flow sentiment: Put/Call volume ratios, Implied Volatility skew extraction from yfinance options chains. Parallelized across assets |
	| `fixed_income.py` | Bond pricing: clean price from yield, duration, convexity, and synthetic historical price generation for direct bonds |
	| `futures_data.py` | Futures continuous contract construction via Panama Canal stitching method |

	#### Data Flow

	```
	Yahoo Finance ──┐
	FRED API ───────┤
	Kenneth French ─┤──▶ data.py ──▶ PostgreSQL/SQLite ──▶ data_repository.py ──▶ core_engine.load_data()
	Options Chains ─┘                                              │
	                                                        ┌──────┴──────┐
	                                                        │ DataSnapshot│
	                                                        │ (returns_df,│
	                                                        │  ff_df, rfr,│
	                                                        │  yield_df)  │
	                                                        └─────────────┘
	```

	The feature engineering pipeline (`build_ml_features()`) transforms raw returns into a per-asset feature matrix with momentum, volatility, factor exposure, and alternative data columns. Non-overlapping sampling prevents serial correlation in the training target. See [MODELS.md](MODELS.md) § 6 for the full feature list.
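The non-overlapping sampling idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the body of `build_ml_features()`: the function name `non_overlapping_samples`, the `lookback` and `horizon` parameters, and the toy return series are all hypothetical. The point it shows is that stepping forward by a full forecast horizon means no two training targets share a day, which is what keeps the targets free of mechanically induced serial correlation.

```python
from typing import List, Tuple


def non_overlapping_samples(
    returns: List[float], lookback: int, horizon: int
) -> List[Tuple[List[float], float]]:
    """Pair a lookback window of past returns with the compounded return
    over the next `horizon` periods, stepping by `horizon` each time."""
    samples = []
    start = lookback
    while start + horizon <= len(returns):
        features = returns[start - lookback:start]
        growth = 1.0
        for r in returns[start:start + horizon]:
            growth *= 1.0 + r
        samples.append((features, growth - 1.0))
        start += horizon  # jump a full horizon so target windows never overlap
    return samples


# Toy daily returns, invented for illustration only.
daily = [0.01, -0.02, 0.005, 0.0, 0.01, 0.02, -0.01, 0.0, 0.015, -0.005]
samples = non_overlapping_samples(daily, lookback=4, horizon=3)
print(len(samples))  # two samples: targets cover days 4-6 and 7-9
```

Feature windows may still overlap (here the lookback is longer than the step); only the target windows are kept disjoint.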

	### 2.3 Return Forecasting & Risk Modeling

	> Detailed reference: [MODELS.md](MODELS.md)

	| File | Purpose |
	|------|---------|
	| `models.py` | All 7 return forecasting models (CAPM, BL, Bayesian, FF, ML Ensemble, E2E, Regime-Adaptive), covariance estimation (Ledoit-Wolf, hybrid block-diagonal), GARCH scaling, and the meta-learner stacking pipeline |
	| `dl_models.py` | [NEW] PyTorch `NoiseFilteredTransformer` (Conv1D + Transformer Encoder), `CrossAssetSequenceDataset`, and `train_cross_asset_transformer()` training loop |
	| `forecast_generation.py` | `_generate_forecasts()`: the Strategy Pattern router that selects and executes the correct model, applies fixed-income overrides, and returns a `ForecastResult` |
	| `bl_bridge.py` | Black-Litterman integration bridge: `compute_bl_posterior()` combines ML views with the BL equilibrium prior; `scale_uncertainty_by_regime()` modulates view confidence |
	| `e2e_forecast_model.py` | End-to-End Differentiable Optimization (Model 6): forecast network, differentiable CVXPY layer, SPO+ loss training |
	| `regime_detection.py` | Hidden Markov Model (HMM) regime classifier for benchmark returns; `dynamic_risk_aversion()` VIX-based risk adjustment |
	| `bayesian_online.py` | Bayesian Online Change-Point Detection (BOCD) for structural break identification in return series |
	| `generative_scenarios.py` | Monte Carlo scenario generation from fitted covariance models |
	| `math_utils.py` | Shared mathematical utilities: `compute_risk_contributions()` for marginal risk decomposition |

	### 2.4 Portfolio Optimization

	> Detailed reference: [ALLOCATION_ENGINES.md](ALLOCATION_ENGINES.md)

	| File | Purpose |
	|------|---------|
	| `solver.py` | Master optimization router: `build_and_optimize()` for single-period; `multi_period_optimize()` for MPC stochastic programming. Routes to Engine 1, 2, or 3. Computes efficient frontier, risk contributions, and sensitivity analysis |
	| `cvxpy_engine.py` | `CVXPYOptimizationEngine`: Mean-Variance quadratic programming with full constraint suite, 7-stage relaxation cascade, cardinality heuristic, CVaR tail-risk, and Almgren-Chriss market impact |
	| `hrp_engine.py` | Hierarchical Risk Parity: agglomerative clustering, quasi-diagonalisation, recursive bisection, and tax-aware blending |
	| `erc_engine.py` | [NEW] Exact True Risk Parity: Spinu logarithmic barrier formulation via CVXPY (SCS/ECOS solver) |
	| `constraints.py` | Constraint pre-processing: `check_and_fix_bounds()` for sanitising user inputs, `make_nearest_psd()` for covariance matrix repair |
	| `differentiable_optimizer.py` | `cvxpylayers`-based differentiable portfolio layer for gradient flow in Model 6 |
	| `futures_overlay.py` | Futures overlay optimizer: beta hedge, duration hedge, or volatility dampening via ES/MES futures |
	| `safety.py` | Pre-trade safety checks: position limits, concentration alerts, and drawdown circuit breakers |

	#### Optimization Flow

	```
	forecast_generation.py
	    │
	    ▼
	ForecastResult (exp_rets, covariance, betas, garch_info)
	    │
	    ▼
	solver.py ── allocation_engine == 1 ──▶ cvxpy_engine.py (Mean-Variance)
	          ├─ allocation_engine == 2 ──▶ hrp_engine.py   (HRP)
	          └─ allocation_engine == 3 ──▶ erc_engine.py   (Exact Risk Parity)
	    │
	    ▼
	OptimizationResult (weights, model_info, risk_contributions, ef_curve)
	```
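The routing step in the flow above amounts to a dispatch on `allocation_engine`. The sketch below illustrates that pattern only; it is not the code in `solver.py`. The names `route`, `ENGINES`, `run_mean_variance`, `run_hrp`, and `run_erc` are hypothetical, and the stand-in engines simply return equal weights so the example runs without a solver.

```python
from typing import Callable, Dict, List

Weights = Dict[str, float]


def _equal_weights(tickers: List[str]) -> Weights:
    """Placeholder allocation so the sketch runs without a solver."""
    return {ticker: 1.0 / len(tickers) for ticker in tickers}


# Hypothetical stand-ins: the real engines live in cvxpy_engine.py,
# hrp_engine.py, and erc_engine.py and do far more than this.
def run_mean_variance(tickers: List[str]) -> Dict[str, object]:
    return {"engine": "cvxpy_engine", "weights": _equal_weights(tickers)}


def run_hrp(tickers: List[str]) -> Dict[str, object]:
    return {"engine": "hrp_engine", "weights": _equal_weights(tickers)}


def run_erc(tickers: List[str]) -> Dict[str, object]:
    return {"engine": "erc_engine", "weights": _equal_weights(tickers)}


# Dispatch table keyed by the documented `allocation_engine` values.
ENGINES: Dict[int, Callable[[List[str]], Dict[str, object]]] = {
    1: run_mean_variance,
    2: run_hrp,
    3: run_erc,
}


def route(allocation_engine: int, tickers: List[str]) -> Dict[str, object]:
    """Select an engine by its integer id and run it."""
    if allocation_engine not in ENGINES:
        raise ValueError(
            f"allocation_engine must be 1, 2, or 3; got {allocation_engine!r}"
        )
    return ENGINES[allocation_engine](tickers)


result = route(2, ["AAPL", "MSFT", "TLT", "GLD"])
print(result["engine"], result["weights"])
```

A dispatch table keeps the router free of engine-specific logic, so adding a fourth engine would only mean adding one entry.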

	### 2.5 Validation & Econometrics

	| File | Purpose |
	|------|---------|
	| `validation.py` | Four econometric tests: Christoffersen Conditional Coverage, Diebold-Mariano, Probabilistic Sharpe Ratio (PSR), Deflated Sharpe Ratio (DSR). See [PIPELINE.md](PIPELINE.md) § 3 for mathematical formulations |
	| `backtest.py` | Walk-forward expanding window cross-validation (`expanding_window_backtest()`), Monte Carlo simulation (`monte_carlo()`), and rolling performance metrics |
	| `analytics.py` | Portfolio sensitivity analysis (±10% return perturbation), historical stress testing (2008 GFC, 2020 COVID, rate shock, tech crash), behavioural diagnostics |
	| `risk_attribution.py` | Factor exposure decomposition, marginal VaR, CVaR component attribution, and stress correlation analysis |
	| `overlay_analytics.py` | Futures overlay analytics: aggregated overlay returns, margin call simulation |
	| `simulation.py` | Monte Carlo and historical simulation engines for risk budgeting |
	| `audit_reproducibility.py` | Bit-exact reproducibility verification: hashes inputs and outputs across runs |

	### 2.6 Reporting & Output

	> Detailed reference: [OUTPUT.md](OUTPUT.md)

	| File | Purpose |
	|------|---------|
	| `report.py` | Report orchestrator: coordinates data preparation → HTML rendering → file output |
	| `report_data.py` | `prepare_template_variables()`: transforms mathematical outputs into HTML fragments and Chart.js data payloads (~675 lines) |
	| `report_html.py` | HTML rendering layer: substitutes template variables into `report_template.html` |
	| `report_template.html` | 26KB static HTML template with Chart.js initialization, dark theme, and responsive CSS |
	| `report_chart.py` | Chart.js payload generators for equity curves, pie charts, efficient frontiers, Monte Carlo fans |
	| `chart_data.py` | Lightweight chart data serialization utilities |
	| `model_visuals.py` | Model-specific visualization helpers (factor exposure plots, GARCH regime charts) |
	| `narrative.py` | Natural-language narrative generation summarising portfolio strategy and market conditions |
	| `table_builder.py` | HTML table construction utilities |
	| `exports.py` | CSV, Excel, and PDF export (`export_csv()`, `export_excel()`) |
	| `report_builders/` | Modular HTML section builders for performance, risk, and tax report sections |

	### 2.7 Execution & Infrastructure

	| File | Purpose |
	|------|---------|
	| `execution.py` | IBKR execution stubs, order management, and paper trading interface (19KB, not yet production-connected) |
	| `Dockerfile` | Container image definition (Python 3.11-slim) |
	| `docker-compose.yml` | Local development environment with PostgreSQL 15 and Redis 7 |
	| `deploy/helm/` | Helm chart for Kubernetes deployment (see [DEPLOY.md](DEPLOY.md)) |
	| `pyproject.toml` | Project metadata, pytest configuration, and build system |
	| `requirements.txt` | Python dependency manifest |
	| `setup.py` | Legacy setuptools configuration |
	| `.github/workflows/ci.yml` | GitHub Actions CI pipeline (lint, type-check, test) |
	| `.pre-commit-config.yaml` | Pre-commit hooks configuration |

	### 2.8 Research & Experimental

	> Detailed reference: [RESEARCH.md](RESEARCH.md)

	| File | Purpose |
	|------|---------|
	| `research/dreamer/` | DreamerV2 world-model RL agent adapted for financial time series |
	| `research/cybernetic.py` | PID volatility controller and adaptive risk setpoint |
	| `research/cybernetic_ensemble.py` | Three-layer cybernetic control hierarchy |
	| `run_simulation.py` | Standalone simulation script for research experiments |
	| `debug_validation.py` | Debugging utilities for validation pipeline |

	### 2.9 Tests

	> Detailed reference: [TESTS.md](TESTS.md)

	| File | Purpose |
	|------|---------|
	| `tests/test_optimize.py` | Constraint logic, mean-variance, HRP, multi-period optimization |
	| `tests/test_simulate.py` | End-to-end integration test |
	| `tests/test_e2e.py` | Differentiable optimization pipeline |
	| `tests/test_models.py` | Return model correctness |
	| `tests/test_analytics.py` | Backtest engine, Sharpe, Sortino, Calmar |
	| `tests/test_data.py` | Data fetching, missing-data handling |
	| `tests/test_validation.py` | Econometric test statistical properties |
	| `tests/test_new_features.py` | [NEW] Transformer training/inference, options flow sentiment extraction, and exact risk parity mathematical verification |
	| `test_audit.py` | Reproducibility audit |
	| `test_perf.py` | Performance benchmarks |

	---

	## 3. Configuration System

	`config.py` → `AppConfig`

	The engine is driven by a Pydantic-validated configuration schema. The `AppConfig` class enforces type safety and cross-field validation (e.g., `single_asset_min` ≤ `single_asset_max`). Configuration is loaded from `output/portfolio_config.json`, merged with `constraints.json`, and can be overridden programmatically via the API or CLI.

	Key configuration axes:

	| Parameter | Type | Default | Description |
	|-----------|------|---------|-------------|
	| `model` | int (1–7) | 5 | Return forecasting model selection |
	| `allocation_engine` | int (1–3) | 1 | Optimization engine: 1=MV, 2=HRP, 3=ERC |
	| `max_assets` | int | None | Cardinality constraint (max non-zero positions) |
	| `risk_free_rate` | float | 0.04 | Annual risk-free rate |
	| `single_asset_min` | float | -1.0 | Min weight per asset (negative = shorting) |
	| `single_asset_max` | float | 0.40 | Max weight per asset |
	| `sector_limit` | float | 0.40 | Max aggregate weight per sector |
	| `gross_leverage_cap` | float | 2.0 | Maximum gross leverage (L1 norm of weights) |
	| `max_turnover` | float | 3.0 | Maximum total turnover per rebalance |
	| `garch_enabled` | bool | True | Enable GARCH(1,1) covariance scaling |
	| `cvar_enabled` | bool | True | Enable CVaR tail-risk constraint |
	| `tax_enabled` | bool | False | Enable tax-aware optimization |
	| `hmm_regime` | bool | True | Enable HMM regime detection |
	| `dynamic_risk` | bool | True | VIX-based risk aversion adjustment |
	| `with_futures` | bool | False | Enable futures overlay |
	| `extended_history` | bool | False | Extended history via proxy stitching |

	See `config_schema.py` for the full schema and validation rules; `config.py` re-exports it as part of the configuration facade.
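To make the cross-field validation concrete, here is a minimal sketch that deliberately swaps in a standard-library dataclass for the Pydantic model, so it runs with no dependencies. It is an illustration under stated assumptions, not `AppConfig` itself: the class name `AppConfigSketch`, the helper `load_config`, and the precedence order (defaults, then file values, then programmatic overrides) are hypothetical, and only a subset of the documented parameters is included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AppConfigSketch:
    """Illustrative subset of the documented parameters and defaults."""

    model: int = 5
    allocation_engine: int = 1
    max_assets: Optional[int] = None
    risk_free_rate: float = 0.04
    single_asset_min: float = -1.0
    single_asset_max: float = 0.40
    gross_leverage_cap: float = 2.0

    def __post_init__(self) -> None:
        # Range checks taken from the "Type" column of the table above.
        if not 1 <= self.model <= 7:
            raise ValueError(f"model must be in 1..7, got {self.model}")
        if self.allocation_engine not in (1, 2, 3):
            raise ValueError(
                f"allocation_engine must be 1, 2, or 3, got {self.allocation_engine}"
            )
        # Cross-field rule quoted in the text: min weight <= max weight.
        if self.single_asset_min > self.single_asset_max:
            raise ValueError("single_asset_min must be <= single_asset_max")


def load_config(file_values: dict, overrides: Optional[dict] = None) -> AppConfigSketch:
    """Assumed precedence: defaults < file values < programmatic overrides."""
    merged = {**file_values, **(overrides or {})}
    return AppConfigSketch(**merged)


cfg = load_config(
    {"model": 3, "single_asset_max": 0.25},
    overrides={"allocation_engine": 2},
)
print(cfg)
```

Validating in the constructor means an invalid combination fails at load time rather than surfacing later as an infeasible optimization.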

	---

	## 4. Data Structures

	The engine communicates between layers via typed dataclasses defined in `core_types.py`:

	### PortfolioState
	Tracks the current portfolio: `total_capital`, `current_weights`, `cost_basis`, `tax_rates`, `gain_fractions`, and `tickers`. Created empty for new portfolios or loaded from `portfolio_state.json`.

	### ForecastResult
	Output of `_generate_forecasts()`: contains `expected_returns`, `covariance_result`, `betas`, `garch_info`, `js_alpha`, `capm_rets`, `ff_betas`, `periods`, `historical_returns`, and `feature_importances`.

	### CovarianceResult
	Wraps the covariance matrix with its derived correlation matrix, per-asset volatility series, and the Ledoit-Wolf shrinkage intensity α.

	### OptimizationResult
	Final output: `weights`, `expected_returns`, `covariance_matrix`, `volatility`, `correlation_matrix`, `betas`, and a `model_info` dictionary containing all metadata (risk contributions, efficient frontier, relaxation log, binding constraints, duration, GARCH info, feature importances, etc.).
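As a rough picture of what a wrapper like `CovarianceResult` carries, the sketch below derives the correlation matrix and per-asset volatilities from a covariance matrix. The class name `CovarianceResultSketch`, its field names, and the helper `wrap_covariance` are hypothetical; the actual dataclass in `core_types.py` may be laid out differently. The shrinkage intensity is passed through unchanged, since estimating it is the job of the Ledoit-Wolf estimator.

```python
import math
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass(frozen=True)
class CovarianceResultSketch:
    covariance: Matrix
    correlation: Matrix
    volatility: List[float]
    shrinkage_alpha: float  # Ledoit-Wolf shrinkage intensity, supplied by the caller


def wrap_covariance(cov: Matrix, shrinkage_alpha: float) -> CovarianceResultSketch:
    """Derive volatilities (sqrt of the diagonal) and correlations
    (cov_ij / (vol_i * vol_j)) from a covariance matrix."""
    n = len(cov)
    vol = [math.sqrt(cov[i][i]) for i in range(n)]
    corr = [[cov[i][j] / (vol[i] * vol[j]) for j in range(n)] for i in range(n)]
    return CovarianceResultSketch(cov, corr, vol, shrinkage_alpha)


# Two assets with 20% and 30% volatility; numbers invented for illustration.
result = wrap_covariance([[0.04, 0.006], [0.006, 0.09]], shrinkage_alpha=0.2)
print(result.volatility, result.correlation[0][1])
```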

	---

	## 5. Concurrency & Thread Safety

	| Mechanism | Location | Purpose |
	|-----------|----------|---------|
	| `_yf_lock` (threading.Lock) | `data.py` | Rate-limits yfinance API calls to max 2/sec |
	| `_ML_CACHE_LOCK` (threading.Lock) | `models.py` | Thread-safe caching of trained ML ensemble models |
	| `_ef_cache_lock` (threading.Lock) | `solver.py` | Thread-safe efficient frontier LRU cache |
	| `ThreadPoolExecutor` | `data.py`, `alternative_data.py` | Parallel data fetching (max 10 workers) |
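One way a lock can cap a call rate while a thread pool fetches in parallel is sketched below. This is an assumption about the general pattern, not the code in `data.py`: the `RateLimiter` class, the `fetch_all` helper, and the `fetch_one` callback are hypothetical names. The lock guards a "last call" timestamp, and each worker sleeps while holding it until the minimum interval has elapsed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


class RateLimiter:
    """Allow at most `max_per_sec` calls per second across all threads."""

    def __init__(self, max_per_sec: float) -> None:
        self._min_interval = 1.0 / max_per_sec
        self._lock = threading.Lock()
        self._last = float("-inf")  # no call has happened yet

    def wait(self) -> None:
        # Holding the lock while sleeping queues the other workers behind us,
        # which is what spaces successive calls at least _min_interval apart.
        with self._lock:
            delay = self._min_interval - (time.monotonic() - self._last)
            if delay > 0:
                time.sleep(delay)
            self._last = time.monotonic()


def fetch_all(
    tickers: Iterable[str],
    fetch_one: Callable[[str], T],
    limiter: RateLimiter,
    max_workers: int = 10,
) -> Dict[str, T]:
    """Fetch every ticker on a thread pool, pacing call starts via the limiter."""

    def task(ticker: str):
        limiter.wait()  # pace the start of each call
        return ticker, fetch_one(ticker)  # the call itself runs outside the lock

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, tickers))
```

With `RateLimiter(max_per_sec=2)` and ten workers, call starts are spaced at least half a second apart while slow responses still overlap, because the lock is released before `fetch_one` runs.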

	---

	## 6. Graceful Degradation

	The engine is designed to always produce output, even under degraded conditions:

	| Failure Mode | Fallback |
	|-------------|----------|
	| PostgreSQL unreachable | Falls back to local SQLite |
	| ML ensemble training fails | Falls back to CAPM expected returns |
	| PyTorch not installed | Transformer predictions silently skipped; ensemble uses only XGBoost + ElasticNet |
	| Options data fetch fails | Returns neutral defaults (PCR=1.0, skew=0.0) |
	| GARCH fitting fails | Uses unconditional covariance (no scaling) |
	| Fama-French download fails | Models 4/5 fall back to CAPM |
	| CVXPY solver infeasible | 7-stage constraint relaxation cascade |
	| All constraints infeasible | 100% cash allocation |
	| PDF export fails | Only HTML report generated |
	| MPC multi-period fails | Falls back to single-period optimization |
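Most rows in the table share one shape: try the preferred path, and on failure log the problem and move to the next option. A minimal sketch of that shape follows. The helper `first_success` and the two toy forecast functions are hypothetical, and the returned numbers are invented; the engine's own fallback logic is spread across the modules listed earlier and may be structured differently.

```python
import logging
from typing import Callable, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")
log = logging.getLogger("fallback_sketch")


def first_success(steps: Sequence[Tuple[str, Callable[[], T]]]) -> Tuple[str, T]:
    """Run each step in order; return (name, result) of the first that succeeds."""
    last_error: Optional[Exception] = None
    for name, step in steps:
        try:
            return name, step()
        except Exception as exc:  # degrade instead of aborting the whole run
            log.warning("%s failed (%s); trying next fallback", name, exc)
            last_error = exc
    raise RuntimeError("every fallback failed") from last_error


def ml_ensemble_forecast() -> dict:
    raise RuntimeError("training did not converge")  # simulated failure


def capm_forecast() -> dict:
    return {"AAPL": 0.08, "TLT": 0.03}  # made-up expected returns


used, forecast = first_success(
    [("ml_ensemble", ml_ensemble_forecast), ("capm", capm_forecast)]
)
print(used, forecast)
```

Returning the name of the step that succeeded lets the caller record which fallback was used, for example in a report or log.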

	---

	## 7. Dependency Graph

	```
	core_engine.py
	├── config.py
	├── core_types.py
	├── data.py
	│   ├── database.py
	│   ├── fixed_income.py
	│   └── alternative_data.py
	├── regime_detection.py
	├── solver.py
	│   ├── forecast_generation.py
	│   │   ├── models.py
	│   │   │   └── dl_models.py
	│   │   └── bl_bridge.py
	│   ├── cvxpy_engine.py
	│   │   └── constraints.py
	│   ├── hrp_engine.py
	│   └── erc_engine.py
	├── backtest.py
	├── validation.py
	├── analytics.py
	│   └── risk_attribution.py
	├── report.py
	│   ├── report_data.py
	│   ├── report_html.py
	│   └── report_builders/
	├── exports.py
	├── futures_overlay.py
	│   ├── futures_data.py
	│   └── overlay_analytics.py
	├── execution.py
	└── server.py

	Circular dependency rule: imports flow in one direction only, `config` ← `core_types` ← `data` ← `models` ← `solver` ← `analytics` ← `report`, where `a` ← `b` is read here as "`b` may import `a`, but not the reverse". Where a cycle would otherwise arise, `forecast_generation.py` and `solver.py` use lazy imports, deferring the import to call time so that it is not resolved at module load time.
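A lazy import simply moves the `import` statement from module level into the function that needs it. The sketch below shows the mechanism with a standard-library module standing in for a project module; the function name `solve` and the use of `json` are placeholders, not code from `solver.py`.

```python
def solve(payload: dict) -> str:
    # Imported on first call rather than at module load. In a would-be cycle
    # (A imports B, B imports A), both modules have finished initialising by
    # the time this line runs, so the import resolves cleanly.
    import json  # stand-in for a project module such as models.py

    return json.dumps(payload, sort_keys=True)


print(solve({"b": 1, "a": 2}))
```

Repeated calls stay cheap: after the first import the module is cached in `sys.modules`, so later calls only pay for a dictionary lookup.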

	---

	## 8. Document Index

	| Document | What It Covers |
	|----------|----------------|
	| [ARCHITECTURE.md](ARCHITECTURE.md) (this file) | Master reference: complete file map, data flow, dependency graph |
	| [PIPELINE.md](PIPELINE.md) | 4-stage pipeline execution model, data flow diagrams, configuration axes |
	| [MODELS.md](MODELS.md) | All 7 return forecasting models, covariance estimators, GARCH, BL bridge, alternative data, and Transformer |
	| [ALLOCATION_ENGINES.md](ALLOCATION_ENGINES.md) | Mean-Variance (CVXPY), HRP, and Exact Risk Parity engines; cardinality constraints; MPC multi-period |
	| [RELAXATION_CASCADE.md](RELAXATION_CASCADE.md) | 7-stage progressive constraint relaxation for infeasible CVXPY solves |
	| [TESTS.md](TESTS.md) | Test suite design, mocking strategy, property-based testing, and full test inventory |
	| [OUTPUT.md](OUTPUT.md) | Output directory structure and artefact descriptions |
	| [DEPLOY.md](DEPLOY.md) | Docker, Helm, Kubernetes, CI/CD, and production considerations |
	| [RESEARCH.md](RESEARCH.md) | Experimental modules: PID controller, Dreamer RL, cybernetic ensemble |