
Portfolio Engine: Complete Architecture Reference

Abstract

This document is the master reference for the entire Portfolio Engine codebase. It describes every module, its purpose, its key functions, and how it connects to the rest of the system. When a topic is explained in full depth in a dedicated document, this file links to it rather than duplicating the content. After reading this document, you should understand what every file does, how data flows through the system, and where to find detailed explanations of each subsystem.


1. System Overview

The Portfolio Engine is an institutional-grade quantitative portfolio allocation system. It ingests market data, estimates expected returns and risk, solves a constrained convex optimization problem to produce target portfolio weights, validates those weights via out-of-sample econometric tests, and generates interactive HTML reports.

┌─────────────────────────────────────────────────────────────────────────┐
│                           Portfolio Engine                              │
│                                                                         │
│  ┌──────────┐  ┌──────────────┐  ┌───────────────┐  ┌──────────────┐    │
│  │ Data     │─▶│ Risk & Return│─▶│ Convex        │─▶│ Reporting &  │    │
│  │ Ingestion│  │ Modeling     │  │ Optimization  │  │ Analytics    │    │
│  └──────────┘  └──────────────┘  └───────────────┘  └──────────────┘    │
│       │              │                   │                  │           │
│   data.py        models.py          solver.py          report.py        │
│   database.py    dl_models.py       cvxpy_engine.py    analytics.py     │
│   alternative_   forecast_          hrp_engine.py      validation.py    │
│    data.py        generation.py     erc_engine.py      backtest.py      │
└─────────────────────────────────────────────────────────────────────────┘

2. Complete File Map

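Every file in the project, grouped by functional layer. Most are Python modules; templates, container definitions, and CI configuration are listed alongside them.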

2.1 Orchestration & Entry Points

| File | Purpose |
| --- | --- |
| main.py | CLI entry point; invokes the pipeline |
| core_engine.py | PortfolioPipeline class: the orchestrator (validate → optimize → report) |
| config.py | Configuration Facade importing schema, IO, logging, and constants |
| config_schema.py | Pydantic AppConfig and validation rules |
| config_io.py | File loading/saving for configuration dictionaries |
| constants.py | Centralized magic numbers, UI formatting, and mapping dictionaries |
| logger.py | JSON rotating log configuration |
| core_types.py | Shared dataclasses: PortfolioState, ForecastResult, CovarianceResult, OptimizationResult, OptimizationError, EngineConfig, etc. |
| api.py | FastAPI REST endpoints for headless/programmatic execution |
| server.py | Lightweight HTTP server to serve generated HTML reports |
| dashboard.py | Interactive CLI wizard for portfolio configuration |

2.2 Data Ingestion & Persistence

| File | Purpose |
| --- | --- |
| data.py | Market data fetching (yfinance), Fama-French factor download, ML feature engineering (build_ml_features()), credit spread proxies, extended history stitching, and block bootstrapping |
| data_repository.py | [NEW] DataRepository class. Centralized abstraction layer responsible for invoking data fetchers, cleaning returned series, standardizing timestamps, and returning a unified DataSnapshot for the engine |
| database.py | SQLAlchemy ORM models (DailyPrice, DailyYield), PostgreSQL/SQLite connection pooling via get_pg_engine(), schema initialization |
| alternative_data.py | [NEW] Options flow sentiment: Put/Call volume ratios, Implied Volatility skew extraction from yfinance options chains. Parallelized across assets |
| fixed_income.py | Bond pricing: clean price from yield, duration, convexity, and synthetic historical price generation for direct bonds |
| futures_data.py | Futures continuous contract construction via Panama Canal stitching method |
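
To make the bond analytics listed for fixed_income.py concrete, the sketch below prices a fixed-coupon bond from its yield and derives modified duration and convexity. It assumes whole coupon periods (no accrued interest, so clean and dirty price coincide) and a face value of 100. The function name `bond_metrics` and its parameters are hypothetical and are not taken from the module.

```python
def bond_metrics(coupon_rate, ytm, years, freq=1, face=100.0):
    """Price, modified duration, and convexity of a fixed-coupon bond.

    Illustrative only: assumes whole coupon periods (no accrued interest),
    so the clean price equals the discounted cash-flow value.
    """
    n = int(round(years * freq))          # number of coupon periods
    y = ytm / freq                        # periodic yield
    coupon = face * coupon_rate / freq    # periodic coupon payment

    price = 0.0
    weighted_time = 0.0                   # sum of t * PV(cf_t), t in periods
    convexity_sum = 0.0                   # sum of t * (t + 1) * PV(cf_t)
    for t in range(1, n + 1):
        cash_flow = coupon + (face if t == n else 0.0)
        pv = cash_flow / (1.0 + y) ** t
        price += pv
        weighted_time += t * pv
        convexity_sum += t * (t + 1) * pv

    macaulay = weighted_time / price / freq             # in years
    modified = macaulay / (1.0 + y)                     # in years
    convexity = convexity_sum / (price * (1.0 + y) ** 2 * freq ** 2)
    return price, modified, convexity


# A bond whose coupon rate equals its yield prices at par.
price, duration, convexity = bond_metrics(coupon_rate=0.05, ytm=0.05, years=10)
```

A first-order estimate of the price change for a small yield move `dy` is then `-duration * price * dy`, with `0.5 * convexity * price * dy ** 2` as the second-order correction.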

Data Flow

Yahoo Finance ──┐
FRED API ───────┤
Kenneth French ─┼──▶ data.py ──▶ PostgreSQL/SQLite ──▶ data_repository.py ──▶ core_engine.load_data()
Options Chains ─┘                                           │
                                                     ┌──────┴──────┐
                                                     │ DataSnapshot│
                                                     │ (returns_df,│
                                                     │  ff_df, rfr,│
                                                     │  yield_df)  │
                                                     └─────────────┘

The feature engineering pipeline (build_ml_features()) transforms raw returns into a per-asset feature matrix with momentum, volatility, factor exposure, and alternative data columns. Non-overlapping sampling prevents serial correlation in the training target. See MODELS.md § 6 for the full feature list.
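
The non-overlapping sampling mentioned above can be sketched as follows. If the training target is a forward return spanning `horizon` periods, consecutive samples would share `horizon - 1` observations and the targets would be serially correlated by construction; stepping the start index by the horizon keeps the windows disjoint. The function name, the compounding convention, and the target definition are assumptions for illustration, not the behaviour of build_ml_features().

```python
def non_overlapping_targets(returns, horizon):
    """Return (start_index, forward_return) pairs whose windows do not overlap.

    Each target compounds the `horizon` simple returns that follow
    `start_index`. Stepping the start index by `horizon` keeps windows disjoint.
    """
    samples = []
    start = 0
    while start + horizon < len(returns):
        window = returns[start + 1 : start + 1 + horizon]
        growth = 1.0
        for r in window:
            growth *= 1.0 + r
        samples.append((start, growth - 1.0))
        start += horizon
    return samples


# Ten periods with a 3-period horizon give windows 1-3, 4-6, and 7-9.
samples = non_overlapping_targets([0.01] * 10, horizon=3)
```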

2.3 Return Forecasting & Risk Modeling

Detailed reference: MODELS.md

| File | Purpose |
| --- | --- |
| models.py | All 7 return forecasting models (CAPM, BL, Bayesian, FF, ML Ensemble, E2E, Regime-Adaptive), covariance estimation (Ledoit-Wolf, hybrid block-diagonal), GARCH scaling, and the meta-learner stacking pipeline |
| dl_models.py | [NEW] PyTorch NoiseFilteredTransformer (Conv1D + Transformer Encoder), CrossAssetSequenceDataset, and train_cross_asset_transformer() training loop |
| forecast_generation.py | _generate_forecasts(): the Strategy Pattern router that selects and executes the correct model, applies fixed-income overrides, and returns a ForecastResult |
| bl_bridge.py | Black-Litterman integration bridge: compute_bl_posterior() combines ML views with the BL equilibrium prior; scale_uncertainty_by_regime() modulates view confidence |
| e2e_forecast_model.py | End-to-End Differentiable Optimization (Model 6): forecast network, differentiable CVXPY layer, SPO+ loss training |
| regime_detection.py | Hidden Markov Model (HMM) regime classifier for benchmark returns; dynamic_risk_aversion() VIX-based risk adjustment |
| bayesian_online.py | Bayesian Online Change-Point Detection (BOCD) for structural break identification in return series |
| generative_scenarios.py | Monte Carlo scenario generation from fitted covariance models |
| math_utils.py | Shared mathematical utilities: compute_risk_contributions() for marginal risk decomposition |
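
The marginal risk decomposition attributed to math_utils.py is commonly written as the Euler decomposition of portfolio volatility, RC_i = w_i (Σw)_i / σ_p, so that the contributions sum to σ_p. The sketch below is one way to compute it in plain Python; the function name and return shape are hypothetical, and the actual signature of compute_risk_contributions() is not documented here.

```python
import math


def risk_contributions(weights, cov):
    """Euler decomposition of portfolio volatility.

    RC_i = w_i * (cov @ w)_i / sigma_p, so the contributions sum to sigma_p.
    `cov` is an n x n nested list; `weights` is a list of length n.
    """
    n = len(weights)
    # Marginal term (cov @ w)_i for each asset.
    marginal = [sum(cov[i][j] * weights[j] for j in range(n)) for i in range(n)]
    variance = sum(weights[i] * marginal[i] for i in range(n))
    sigma_p = math.sqrt(variance)
    contributions = [weights[i] * marginal[i] / sigma_p for i in range(n)]
    return contributions, sigma_p


# Two uncorrelated assets with 20% and 10% volatility, equally weighted
# (illustrative numbers only).
rc, sigma = risk_contributions([0.5, 0.5], [[0.04, 0.0], [0.0, 0.01]])
shares = [c / sigma for c in rc]  # fraction of total risk per asset
```

In this example the higher-volatility asset carries about 80% of total risk despite holding half the capital, which is the imbalance a risk parity engine is meant to remove.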

2.4 Portfolio Optimization

Detailed reference: ALLOCATION_ENGINES.md

| File | Purpose |
| --- | --- |
| solver.py | Master optimization router: build_and_optimize() for single-period; multi_period_optimize() for MPC stochastic programming. Routes to Engine 1, 2, or 3. Computes efficient frontier, risk contributions, and sensitivity analysis |
| cvxpy_engine.py | CVXPYOptimizationEngine: Mean-Variance quadratic programming with full constraint suite, 7-stage relaxation cascade, cardinality heuristic, CVaR tail-risk, and Almgren-Chriss market impact |
| hrp_engine.py | Hierarchical Risk Parity: agglomerative clustering, quasi-diagonalisation, recursive bisection, and tax-aware blending |
| erc_engine.py | [NEW] Exact True Risk Parity: Spinu logarithmic barrier formulation via CVXPY (SCS/ECOS solver) |
| constraints.py | Constraint pre-processing: check_and_fix_bounds() for sanitising user inputs, make_nearest_psd() for covariance matrix repair |
| differentiable_optimizer.py | cvxpylayers-based differentiable portfolio layer for gradient flow in Model 6 |
| futures_overlay.py | Futures overlay optimizer: beta hedge, duration hedge, or volatility dampening via ES/MES futures |
| safety.py | Pre-trade safety checks: position limits, concentration alerts, and drawdown circuit breakers |

Optimization Flow

forecast_generation.py
        │
        ▼
    ForecastResult (exp_rets, covariance, betas, garch_info)
        │
        ▼
    solver.py ─┬─ allocation_engine == 1 ──▶ cvxpy_engine.py (Mean-Variance)
               ├─ allocation_engine == 2 ──▶ hrp_engine.py (HRP)
               └─ allocation_engine == 3 ──▶ erc_engine.py (Exact Risk Parity)
        │
        ▼
    OptimizationResult (weights, model_info, risk_contributions, ef_curve)
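
The routing step in the flow above amounts to a dispatch on allocation_engine. A minimal sketch, assuming each engine can be called with the forecast alone, is shown below; the three `run_*` functions are placeholders that only return a label, and `route` is a hypothetical name rather than the interface of solver.py.

```python
def run_mean_variance(forecast):
    return "mean-variance"   # placeholder for cvxpy_engine.py


def run_hrp(forecast):
    return "hrp"             # placeholder for hrp_engine.py


def run_erc(forecast):
    return "erc"             # placeholder for erc_engine.py


# Keys mirror the allocation_engine values 1, 2, and 3 in the diagram.
ENGINES = {1: run_mean_variance, 2: run_hrp, 3: run_erc}


def route(allocation_engine, forecast):
    """Dispatch to the engine selected by `allocation_engine`."""
    if allocation_engine not in ENGINES:
        raise ValueError(f"unknown allocation_engine: {allocation_engine!r}")
    return ENGINES[allocation_engine](forecast)


selected = route(2, forecast={})
```

A lookup table keeps the router free of engine-specific branching, so adding a fourth engine would only require a new entry.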

2.5 Validation & Econometrics

| File | Purpose |
| --- | --- |
| validation.py | Four econometric tests: Christoffersen Conditional Coverage, Diebold-Mariano, Probabilistic Sharpe Ratio (PSR), Deflated Sharpe Ratio (DSR). See PIPELINE.md § 3 for mathematical formulations |
| backtest.py | Walk-forward expanding window cross-validation (expanding_window_backtest()), Monte Carlo simulation (monte_carlo()), and rolling performance metrics |
| analytics.py | Portfolio sensitivity analysis (±10% return perturbation), historical stress testing (2008 GFC, 2020 COVID, rate shock, tech crash), behavioural diagnostics |
| risk_attribution.py | Factor exposure decomposition, marginal VaR, CVaR component attribution, and stress correlation analysis |
| overlay_analytics.py | Futures overlay analytics: aggregated overlay returns, margin call simulation |
| simulation.py | Monte Carlo and historical simulation engines for risk budgeting |
| audit_reproducibility.py | Bit-exact reproducibility verification: hashes inputs and outputs across runs |
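
As an example of the econometric tests listed for validation.py, the Probabilistic Sharpe Ratio can be sketched with the standard library alone. The code follows the usual published form, PSR = Φ((SR − SR*) · sqrt(n − 1) / sqrt(1 − γ₃·SR + (γ₄ − 1)/4 · SR²)), with per-period Sharpe ratios and raw (non-excess) kurtosis. The function name and argument order are assumptions; PIPELINE.md § 3 remains the reference for the project's own formulation.

```python
import math
from statistics import NormalDist


def probabilistic_sharpe_ratio(sharpe, benchmark_sharpe, n_obs,
                               skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe ratio exceeds `benchmark_sharpe`.

    `sharpe` and `benchmark_sharpe` are per-period (not annualised);
    `kurtosis` is raw kurtosis, i.e. 3.0 for normally distributed returns.
    """
    # Variance adjustment of the Sharpe estimator for non-normal returns.
    denom = math.sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe ** 2)
    z = (sharpe - benchmark_sharpe) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Illustrative inputs: per-period Sharpe of 0.1 over 101 observations,
# tested against a benchmark Sharpe of zero.
psr = probabilistic_sharpe_ratio(0.1, 0.0, n_obs=101)
```

Negative skew or fat tails enlarge the denominator's variance term and therefore lower the PSR for the same observed Sharpe ratio.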

2.6 Reporting & Output

Detailed reference: OUTPUT.md

| File | Purpose |
| --- | --- |
| report.py | Report orchestrator: coordinates data preparation → HTML rendering → file output |
| report_data.py | prepare_template_variables(): transforms mathematical outputs into HTML fragments and Chart.js data payloads (~675 lines) |
| report_html.py | HTML rendering layer: substitutes template variables into report_template.html |
| report_template.html | 26KB static HTML template with Chart.js initialization, dark theme, and responsive CSS |
| report_chart.py | Chart.js payload generators for equity curves, pie charts, efficient frontiers, Monte Carlo fans |
| chart_data.py | Lightweight chart data serialization utilities |
| model_visuals.py | Model-specific visualization helpers (factor exposure plots, GARCH regime charts) |
| narrative.py | Natural-language narrative generation summarising portfolio strategy and market conditions |
| table_builder.py | HTML table construction utilities |
| exports.py | CSV, Excel, and PDF export (export_csv(), export_excel()) |
| report_builders/ | Modular HTML section builders for performance, risk, and tax report sections |

2.7 Execution & Infrastructure

| File | Purpose |
| --- | --- |
| execution.py | IBKR execution stubs, order management, and paper trading interface (19KB, not yet production-connected) |
| Dockerfile | Container image definition (Python 3.11-slim) |
| docker-compose.yml | Local development environment with PostgreSQL 15 and Redis 7 |
| deploy/helm/ | Helm chart for Kubernetes deployment (see DEPLOY.md) |
| pyproject.toml | Project metadata, pytest configuration, and build system |
| requirements.txt | Python dependency manifest |
| setup.py | Legacy setuptools configuration |
| .github/workflows/ci.yml | GitHub Actions CI pipeline (lint, type-check, test) |
| .pre-commit-config.yaml | Pre-commit hooks configuration |

2.8 Research & Experimental

Detailed reference: RESEARCH.md

| File | Purpose |
| --- | --- |
| research/dreamer/ | DreamerV2 world-model RL agent adapted for financial time series |
| research/cybernetic.py | PID volatility controller and adaptive risk setpoint |
| research/cybernetic_ensemble.py | Three-layer cybernetic control hierarchy |
| run_simulation.py | Standalone simulation script for research experiments |
| debug_validation.py | Debugging utilities for validation pipeline |

2.9 Tests

Detailed reference: TESTS.md

| File | Purpose |
| --- | --- |
| tests/test_optimize.py | Constraint logic, mean-variance, HRP, multi-period optimization |
| tests/test_simulate.py | End-to-end integration test |
| tests/test_e2e.py | Differentiable optimization pipeline |
| tests/test_models.py | Return model correctness |
| tests/test_analytics.py | Backtest engine, Sharpe, Sortino, Calmar |
| tests/test_data.py | Data fetching, missing-data handling |
| tests/test_validation.py | Econometric test statistical properties |
| tests/test_new_features.py | [NEW] Transformer training/inference, options flow sentiment extraction, and exact risk parity mathematical verification |
| test_audit.py | Reproducibility audit |
| test_perf.py | Performance benchmarks |

3. Configuration System

config.py → AppConfig

The engine is driven by a Pydantic-validated configuration schema. The AppConfig class enforces type safety and cross-field validation (e.g., single_asset_min ≤ single_asset_max). Configuration is loaded from output/portfolio_config.json, merged with constraints.json, and can be overridden programmatically via the API or CLI.

Key configuration axes:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | int (1–7) | 5 | Return forecasting model selection |
| allocation_engine | int (1–3) | 1 | Optimization engine: 1=MV, 2=HRP, 3=ERC |
| max_assets | int | None | Cardinality constraint (max non-zero positions) |
| risk_free_rate | float | 0.04 | Annual risk-free rate |
| single_asset_min | float | -1.0 | Min weight per asset (negative = shorting) |
| single_asset_max | float | 0.40 | Max weight per asset |
| sector_limit | float | 0.40 | Max aggregate weight per sector |
| gross_leverage_cap | float | 2.0 | Maximum gross leverage (L1 norm of weights) |
| max_turnover | float | 3.0 | Maximum total turnover per rebalance |
| garch_enabled | bool | True | Enable GARCH(1,1) covariance scaling |
| cvar_enabled | bool | True | Enable CVaR tail-risk constraint |
| tax_enabled | bool | False | Enable tax-aware optimization |
| hmm_regime | bool | True | Enable HMM regime detection |
| dynamic_risk | bool | True | VIX-based risk aversion adjustment |
| with_futures | bool | False | Enable futures overlay |
| extended_history | bool | False | Extended history via proxy stitching |

See config.py for the full schema and validation rules.
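
The kind of range and cross-field validation described in this section can be illustrated with a small sketch. The project uses Pydantic for this; to stay dependency-free the example swaps in a standard-library dataclass with `__post_init__` checks, so it is a stand-in for the technique rather than the actual AppConfig. The class name `AppConfigSketch` is hypothetical, and only a few fields from the table are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfigSketch:
    """Stand-in for a validated config; defaults copied from the table above."""

    model: int = 5
    allocation_engine: int = 1
    single_asset_min: float = -1.0
    single_asset_max: float = 0.40

    def __post_init__(self):
        # Range checks on the two selector fields.
        if not 1 <= self.model <= 7:
            raise ValueError("model must be between 1 and 7")
        if not 1 <= self.allocation_engine <= 3:
            raise ValueError("allocation_engine must be between 1 and 3")
        # Cross-field rule: the lower weight bound cannot exceed the upper one.
        if self.single_asset_min > self.single_asset_max:
            raise ValueError("single_asset_min must be <= single_asset_max")


default_config = AppConfigSketch()
long_only = AppConfigSketch(single_asset_min=0.0)
```

Rejecting an inconsistent configuration at construction time means the optimizer never receives bounds that are infeasible by definition.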


4. Data Structures

The engine communicates between layers via typed dataclasses defined in core_types.py:

PortfolioState

Tracks the current portfolio: total_capital, current_weights, cost_basis, tax_rates, gain_fractions, and tickers. Created empty for new portfolios or loaded from portfolio_state.json.

ForecastResult

Output of _generate_forecasts(): contains expected_returns, covariance_result, betas, garch_info, js_alpha, capm_rets, ff_betas, periods, historical_returns, and feature_importances.

CovarianceResult

Wraps the covariance matrix with its derived correlation matrix, per-asset volatility series, and the Ledoit-Wolf shrinkage intensity α.
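
How the derived fields relate to the covariance matrix can be shown in a few lines: volatility is the square root of the diagonal, and correlation is covariance divided by the product of the two volatilities. The class below is a hypothetical sketch using nested lists; the field names and layout of the real CovarianceResult in core_types.py may differ.

```python
import math
from dataclasses import dataclass


@dataclass
class CovarianceSketch:
    """Hypothetical container deriving volatility and correlation on demand."""

    covariance: list          # n x n nested list
    shrinkage: float = 0.0    # shrinkage intensity, stored as metadata only

    @property
    def volatility(self):
        # sigma_i = sqrt(cov_ii)
        return [math.sqrt(self.covariance[i][i])
                for i in range(len(self.covariance))]

    @property
    def correlation(self):
        # rho_ij = cov_ij / (sigma_i * sigma_j)
        vol = self.volatility
        n = len(vol)
        return [[self.covariance[i][j] / (vol[i] * vol[j]) for j in range(n)]
                for i in range(n)]


# Illustrative two-asset covariance matrix.
result = CovarianceSketch([[0.04, 0.01], [0.01, 0.09]])
vols = result.volatility
corr = result.correlation
```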

OptimizationResult

Final output: weights, expected_returns, covariance_matrix, volatility, correlation_matrix, betas, and a model_info dictionary containing all metadata (risk contributions, efficient frontier, relaxation log, binding constraints, duration, GARCH info, feature importances, etc.).


5. Concurrency & Thread Safety

| Mechanism | Location | Purpose |
| --- | --- | --- |
| _yf_lock (threading.Lock) | data.py | Rate-limits yfinance API calls to max 2/sec |
| _ML_CACHE_LOCK (threading.Lock) | models.py | Thread-safe caching of trained ML ensemble models |
| _ef_cache_lock (threading.Lock) | solver.py | Thread-safe efficient frontier LRU cache |
| ThreadPoolExecutor | data.py, alternative_data.py | Parallel data fetching (max 10 workers) |
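
One way a lock can enforce a limit such as 2 calls per second across worker threads is to reserve the next permitted start time under the lock and sleep outside it. The sketch below shows that pattern; the `RateLimiter` class, its injectable `clock` and `sleep` parameters, and the slot-reservation approach are assumptions for illustration, since the table does not say how _yf_lock is used internally.

```python
import threading
import time


class RateLimiter:
    """Allow at most `max_per_sec` calls per second across threads."""

    def __init__(self, max_per_sec, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = 1.0 / max_per_sec
        self._lock = threading.Lock()
        self._next_allowed = 0.0
        self._clock = clock
        self._sleep = sleep

    def wait(self):
        """Block until this caller's reserved slot arrives."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_allowed)
            # Reserve the following slot for the next caller.
            self._next_allowed = slot + self._min_interval
            delay = slot - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so others can queue


limiter = RateLimiter(max_per_sec=2)
limiter.wait()  # first call returns immediately
```

Each worker would call `limiter.wait()` immediately before its network request, which spaces requests at least 0.5 seconds apart regardless of how many threads are active.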

6. Graceful Degradation

The engine is designed to always produce output, even under degraded conditions:

| Failure Mode | Fallback |
| --- | --- |
| PostgreSQL unreachable | Falls back to local SQLite |
| ML ensemble training fails | Falls back to CAPM expected returns |
| PyTorch not installed | Transformer predictions silently skipped; ensemble uses only XGBoost + ElasticNet |
| Options data fetch fails | Returns neutral defaults (PCR=1.0, skew=0.0) |
| GARCH fitting fails | Uses unconditional covariance (no scaling) |
| Fama-French download fails | Models 4/5 fall back to CAPM |
| CVXPY solver infeasible | 7-stage constraint relaxation cascade |
| All constraints infeasible | 100% cash allocation |
| PDF export fails | Only HTML report generated |
| MPC multi-period fails | Falls back to single-period optimization |
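
Most rows in the table share one shape: try the preferred path, and on failure log the problem and move to the next option. A generic helper for that pattern might look like the sketch below. The name `first_successful`, the two step functions, and the ticker and return value are all hypothetical placeholders, not code or data from the engine.

```python
import logging

logger = logging.getLogger("fallback_sketch")


def first_successful(steps):
    """Run (name, callable) steps in order; return the first that succeeds."""
    failed = []
    for name, step in steps:
        try:
            return name, step()
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            logger.warning("%s failed (%s); trying next fallback", name, exc)
            failed.append(name)
    raise RuntimeError(f"all fallbacks failed: {failed}")


def ml_ensemble_forecast():
    raise RuntimeError("training failed")   # simulate the failure mode


def capm_forecast():
    return {"AAA": 0.06}                    # placeholder expected return


used, expected_returns = first_successful([
    ("ml_ensemble", ml_ensemble_forecast),
    ("capm", capm_forecast),
])
```

Returning the name of the step that succeeded lets the caller record which fallback was used, so a degraded run can be flagged in the report rather than passing silently.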

7. Dependency Graph

core_engine.py
├── config.py
├── core_types.py
├── data.py
│   ├── database.py
│   ├── fixed_income.py
│   └── alternative_data.py
├── regime_detection.py
├── solver.py
│   ├── forecast_generation.py
│   │   ├── models.py
│   │   │   └── dl_models.py
│   │   └── bl_bridge.py
│   ├── cvxpy_engine.py
│   │   └── constraints.py
│   ├── hrp_engine.py
│   └── erc_engine.py
├── backtest.py
├── validation.py
├── analytics.py
│   └── risk_attribution.py
├── report.py
│   ├── report_data.py
│   ├── report_html.py
│   └── report_builders/
├── exports.py
├── futures_overlay.py
│   ├── futures_data.py
│   └── overlay_analytics.py
├── execution.py
└── server.py

Circular dependency rule: The dependency chain follows a strict unidirectional flow: config ← core_types ← data ← models ← solver ← analytics ← report. Lazy imports are used in forecast_generation.py and solver.py to avoid circular references at module load time.
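
The lazy-import technique mentioned above moves an import statement from module level into the function that needs it, so the dependency is resolved on first call instead of at load time. In the sketch below the standard-library `statistics` module stands in for a sibling module such as forecast_generation, and the function name is hypothetical.

```python
def optimize_sketch(returns):
    """Hypothetical stand-in for a solver entry point."""
    # Deferred import: resolved on the first call rather than at module load,
    # so two modules that reference each other can both finish importing first.
    # `statistics` stands in for a sibling project module.
    import statistics

    return statistics.mean(returns)


average = optimize_sketch([0.01, 0.03])
```

After the first call the module is cached in `sys.modules`, so repeated calls pay only a dictionary lookup for the import.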


8. Document Index

| Document | What It Covers |
| --- | --- |
| ARCHITECTURE.md (this file) | Master reference: complete file map, data flow, dependency graph |
| PIPELINE.md | 4-stage pipeline execution model, data flow diagrams, configuration axes |
| MODELS.md | All 7 return forecasting models, covariance estimators, GARCH, BL bridge, alternative data, and Transformer |
| ALLOCATION_ENGINES.md | Mean-Variance (CVXPY), HRP, and Exact Risk Parity engines; cardinality constraints; MPC multi-period |
| RELAXATION_CASCADE.md | 7-stage progressive constraint relaxation for infeasible CVXPY solves |
| TESTS.md | Test suite design, mocking strategy, property-based testing, and full test inventory |
| OUTPUT.md | Output directory structure and artefact descriptions |
| DEPLOY.md | Docker, Helm, Kubernetes, CI/CD, and production considerations |
| RESEARCH.md | Experimental modules: PID controller, Dreamer RL, cybernetic ensemble |