Pipeline Architecture

Abstract

The Portfolio Engine implements a multi-stage pipeline that transforms raw market data into optimised portfolio allocations, validated through rigorous out-of-sample econometric testing, and exported as interactive reports. This document describes the full execution flow, the data structures that mediate inter-stage communication, the mathematical validation framework, and the report generation subsystem. It serves as the architectural reference for understanding how the engine's components compose into a coherent analytical system.

1. Pipeline Overview

The engine is orchestrated by the PortfolioPipeline class in core_engine.py, which implements a four-stage execution model:

┌────────────────────────────────────────────────────────────────────────┐
│                        Pipeline Stages                                 │
│                                                                        │
│   ┌───────────┐   ┌────────────────┐   ┌───────────┐   ┌───────────┐ │
│   │  Stage 1   │──▶│    Stage 2      │──▶│  Stage 3   │──▶│  Stage 4   │ │
│   │ load_data()│   │run_validation()│   │ optimize() │   │ reports() │ │
│   └───────────┘   └────────────────┘   └───────────┘   └───────────┘ │
│                                                                        │
│   Data Fetch      Walk-Forward CV      Full-Sample      HTML + CSV    │
│   Regime Detect   Econometric Tests    Optimisation      PDF Export   │
│   Risk Aversion   DM / Christoffersen  Sensitivity       Serve       │
│   Adjustment      PSR / DSR            Stress Test                    │
└────────────────────────────────────────────────────────────────────────┘

Entry Point

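def run_engine(overrides=None):
    # overrides: optional dict of config values for headless runs (API, tests, batch jobs)
    pipeline = PortfolioPipeline(overrides=overrides)
    pipeline.load_data()                               # Stage 1: data fetch, regime detection
    val_bundle = pipeline.run_validation()             # Stage 2: walk-forward CV, econometric tests
    opt_bundle = pipeline.optimize()                   # Stage 3: full-sample optimisation, diagnostics
    pipeline.generate_reports(val_bundle, opt_bundle)  # Stage 4: HTML, CSV and PDF export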

The overrides dictionary enables headless execution from the API layer (api.py), test harnesses, or scheduled batch jobs, bypassing the interactive CLI wizard.
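As a rough illustration of headless use, the sketch below layers an overrides dictionary over a set of defaults. The key names come from the configuration table in Section 7; the `DEFAULTS` values marked as assumed and the `resolve_config` helper are hypothetical and are not taken from config.py.

```python
# Hypothetical defaults. Only the hmm_regime and dynamic_risk defaults are
# stated in this document; the other values are assumptions for illustration.
DEFAULTS = {
    "hmm_regime": True,
    "dynamic_risk": True,
    "return_frequency": "daily",  # assumed default
    "garch_enabled": False,       # assumed default
}


def resolve_config(overrides=None):
    """Return a new dict with overrides layered on top of DEFAULTS."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


config = resolve_config({"return_frequency": "monthly", "hmm_regime": False})
```

Rejecting unknown keys is a design choice of this sketch: a misspelt override in a batch job would otherwise run with defaults and go unnoticed.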

2. Stage 1 — Data Loading (`load_data`)

2.1 Data Sources

| Source | Target | Module |
| --- | --- | --- |
| Yahoo Finance / DB Cache | Daily OHLCV prices | `data.py` |
| Kenneth French Library | Fama-French factors | `data.py` |
| FRED / ^TNX proxy | Risk-free rate series | `data.py` |
| PostgreSQL / SQLite | Cached price data | `database.py` |

2.2 Data Validation

Minimum History: Assets must have ≥ 2× trading_days_per_year (default: 504 business days) of return history to be included. Assets with insufficient history are silently dropped.
Missing Data: Rows containing any missing value are removed from the returns DataFrames with pd.DataFrame.dropna(), so that all assets share a common date index.
Frequency Conversion: When return_frequency = 'monthly', daily returns are geometrically compounded to monthly via build_monthly_returns().
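The three rules above can be sketched with the standard library alone. This is a simplified stand-in, not the code in data.py: the engine works on pandas DataFrames, and the helper names `filter_min_history`, `align_common_dates` and `compound_monthly` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def filter_min_history(returns_by_asset, trading_days_per_year=252):
    """Keep assets with at least 2 x trading_days_per_year observations."""
    min_obs = 2 * trading_days_per_year
    return {a: s for a, s in returns_by_asset.items() if len(s) >= min_obs}


def align_common_dates(returns_by_asset):
    """Keep only dates present for every asset (the effect of dropna on rows)."""
    common = set.intersection(*(set(s) for s in returns_by_asset.values()))
    return {a: {d: s[d] for d in sorted(common)}
            for a, s in returns_by_asset.items()}


def compound_monthly(daily):
    """Geometrically compound (date, return) pairs into calendar months."""
    growth = defaultdict(lambda: 1.0)
    for day, ret in daily:
        growth[(day.year, day.month)] *= 1.0 + ret
    return {month: g - 1.0 for month, g in sorted(growth.items())}


daily = [(date(2024, 1, 2), 0.10), (date(2024, 1, 3), -0.10),
         (date(2024, 2, 1), 0.05)]
monthly = compound_monthly(daily)  # January compounds to 1.1 * 0.9 - 1
```

Geometric compounding matters here: a +10% day followed by a −10% day yields −1% for the month, not zero.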

2.3 Regime Detection

If hmm_regime = True (default), the engine fits a Hidden Markov Model to benchmark returns via regime_detection.detect_volatility_regime(). The detected regime (Bull, Normal, Crash) informs:

Dynamic risk aversion adjustment (Stage 2 and 3).
PID volatility target in the research cybernetic ensemble.
Report visualisation annotations.

2.4 Dynamic Risk Aversion

If dynamic_risk = True (default), the VIX level is used to adjust the user's stated risk aversion via regime_detection.dynamic_risk_aversion(). This implements a counter-cyclical risk management policy: risk aversion increases during high-volatility episodes, reducing exposure before drawdowns deepen.
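The functional form used by regime_detection.dynamic_risk_aversion() is not given in this document. Purely as an illustration of a counter-cyclical rule, the sketch below scales risk aversion linearly once the VIX rises above a reference level and caps the multiplier. The function name `scale_risk_aversion` and all parameter values are assumptions.

```python
def scale_risk_aversion(base_lambda, vix, vix_ref=20.0,
                        sensitivity=0.05, max_multiplier=3.0):
    """Hypothetical rule: raise risk aversion as the VIX exceeds vix_ref.

    Below vix_ref the stated risk aversion is returned unchanged; above it
    the multiplier grows by `sensitivity` per VIX point, up to a cap.
    """
    multiplier = 1.0 + sensitivity * max(0.0, vix - vix_ref)
    return base_lambda * min(multiplier, max_multiplier)


calm = scale_risk_aversion(2.0, vix=15.0)      # unchanged in calm markets
stressed = scale_risk_aversion(2.0, vix=40.0)  # doubled under these assumptions
```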

3. Stage 2 — Walk-Forward Validation (`run_validation`)

3.1 Expanding Window Cross-Validation

The engine performs expanding-window (walk-forward) backtesting via backtest.expanding_window_backtest():

An initial training window of OOS_TRAIN_DAYS (total days − 252) is established.
The model is trained on the expanding window and produces out-of-sample weights.
Weights are rebalanced every trading_days / 4 periods (quarterly; 63 periods when trading_days is 252).
An out-of-sample equity curve is constructed from realised returns.

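Because each set of weights is fitted only on data available before the period in which it is applied, this procedure guards against look-ahead bias. It is a standard approach to strategy validation in quantitative finance (Bailey et al., 2014).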
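The walk-forward loop can be sketched as follows. This is a minimal illustration, not backtest.expanding_window_backtest() itself: the signature, the list-based data layout and the `equal_weight` placeholder model are assumptions, and weight drift between refits is ignored for brevity.

```python
def expanding_window_backtest(returns, fit_weights, oos_days=252,
                              rebalance_every=63):
    """returns: list of per-day lists of asset returns, oldest first.
    fit_weights: callable(history) -> list of weights summing to 1.
    """
    train_end = len(returns) - oos_days  # initial window: total days - 252
    if train_end <= 0:
        raise ValueError("not enough history for the requested OOS window")
    oos_returns, equity, level, weights = [], [], 1.0, None
    for t in range(train_end, len(returns)):
        if (t - train_end) % rebalance_every == 0:
            # Refit on data strictly before day t, so no look-ahead.
            weights = fit_weights(returns[:t])
        day_ret = sum(w * r for w, r in zip(weights, returns[t]))
        level *= 1.0 + day_ret
        oos_returns.append(day_ret)
        equity.append(level)
    return oos_returns, equity


def equal_weight(history):
    """Placeholder model: 1/N across all assets."""
    n_assets = len(history[0])
    return [1.0 / n_assets] * n_assets


data = [[0.01, -0.01]] * 300
oos, eq = expanding_window_backtest(data, equal_weight,
                                    oos_days=100, rebalance_every=25)
```

The slice `returns[:t]` is the important detail: the training window grows at every refit but never includes the day being traded.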

3.2 Econometric Tests

The validation stage runs four statistical tests on the out-of-sample returns:

Christoffersen Conditional Coverage Test

Tests whether Value-at-Risk (VaR) exceedances are both correctly calibrated (unconditional coverage) and serially independent (violations do not cluster in time). A joint likelihood ratio statistic is computed:

LR_cc = LR_uc + LR_ind ~ χ²(2)

Pass Criterion: p-value > 0.05 for both components.
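A standard-library sketch of the test is given below, following Christoffersen (1998). The function name, the input layout (a 0/1 series of VaR violations) and the returned dictionary keys are assumptions and may differ from the engine's var_results. The χ² tail probabilities use the closed forms for one and two degrees of freedom.

```python
import math


def _xlogy(x, y):
    """x * log(y) with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def _loglik(n_zero, n_one, prob):
    """Bernoulli log-likelihood for n_one hits and n_zero non-hits."""
    return _xlogy(n_zero, 1.0 - prob) + _xlogy(n_one, prob)


def christoffersen_test(hits, var_level=0.05):
    """hits: sequence of 0/1 flags, 1 when the loss exceeded VaR."""
    n, n1 = len(hits), sum(hits)
    n0 = n - n1
    # Unconditional coverage: nominal rate vs observed violation rate.
    lr_uc = -2.0 * (_loglik(n0, n1, var_level) - _loglik(n0, n1, n1 / n))

    # Independence: first-order Markov chain vs i.i.d. violations.
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(hits, hits[1:]):
        counts[(prev, cur)] += 1
    n00, n01 = counts[(0, 0)], counts[(0, 1)]
    n10, n11 = counts[(1, 0)], counts[(1, 1)]
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi = (n01 + n11) / (n - 1)
    lr_ind = -2.0 * (_loglik(n00 + n10, n01 + n11, pi)
                     - _loglik(n00, n01, pi01)
                     - _loglik(n10, n11, pi11))

    lr_uc, lr_ind = max(lr_uc, 0.0), max(lr_ind, 0.0)  # guard rounding
    lr_cc = lr_uc + lr_ind
    return {
        "lr_uc": lr_uc, "p_uc": math.erfc(math.sqrt(lr_uc / 2.0)),     # chi2(1)
        "lr_ind": lr_ind, "p_ind": math.erfc(math.sqrt(lr_ind / 2.0)),  # chi2(1)
        "lr_cc": lr_cc, "p_cc": math.exp(-lr_cc / 2.0),                 # chi2(2)
    }


clustered = christoffersen_test([1] * 10 + [0] * 190)
```

In the `clustered` example the violation rate is exactly 5%, so unconditional coverage passes, but the ten violations occur back to back and the independence component rejects.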

Diebold-Mariano Test

Tests whether the engine's expected return model statistically outperforms a naive historical mean baseline in terms of out-of-sample prediction accuracy:

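DM = d̄ / √(V̂(d̄)) ~ N(0, 1)

where d_t = |e₁_t| − |e₂_t| is the loss differential (MAE loss function), d̄ is its sample mean, and V̂(d̄) is the estimated variance of that mean. The variance is estimated with the Newey-West estimator, which makes the test robust to heteroskedasticity and autocorrelation in d_t.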

Pass Criterion: p-value < 0.05 and the engine's model wins.
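The statistic can be sketched with the standard library as follows. The function name, the default lag and the returned tuple are assumptions; the engine's dm_results may be structured differently. The long-run variance uses Bartlett weights, which is the usual Newey-West choice, and the p-value is two-sided so that the pass criterion can be checked as "p-value < 0.05 and d̄ < 0".

```python
import math
from statistics import NormalDist


def diebold_mariano(errors_model, errors_baseline, lag=1):
    """Return (dm_stat, two_sided_p, model_wins) under MAE loss."""
    d = [abs(e1) - abs(e2) for e1, e2 in zip(errors_model, errors_baseline)]
    n = len(d)
    d_bar = sum(d) / n
    centred = [x - d_bar for x in d]

    def autocov(k):
        return sum(centred[t] * centred[t - k] for t in range(k, n)) / n

    # Newey-West long-run variance of d_t with Bartlett weights.
    long_run_var = autocov(0)
    for k in range(1, lag + 1):
        long_run_var += 2.0 * (1.0 - k / (lag + 1)) * autocov(k)
    if long_run_var <= 0.0:
        raise ValueError("degenerate loss differential; variance is zero")

    dm_stat = d_bar / math.sqrt(long_run_var / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(dm_stat)))
    # A negative mean differential means the model's errors are smaller.
    return dm_stat, p_value, d_bar < 0.0


model_err = [0.1, -0.2, 0.1, -0.1, 0.2, -0.1] * 10
naive_err = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6] * 10
dm_stat, p_value, model_wins = diebold_mariano(model_err, naive_err)
```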

Probabilistic Sharpe Ratio (PSR)

Accounts for the non-normality of returns (skewness and kurtosis) when evaluating whether the observed Sharpe ratio is statistically distinguishable from a benchmark value of zero (Bailey & López de Prado, 2012):

PSR = Φ[(SR − SR*) · √(n-1) / √(1 − γ₃·SR + (γ₄−1)/4 · SR²)]

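where SR is the observed Sharpe ratio, SR* is the benchmark value (zero here), n is the number of return observations, and γ₃ and γ₄ are the sample skewness and kurtosis of returns. γ₄ is the raw (non-excess) kurtosis, so γ₄ = 3 for normally distributed returns.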

Pass Criterion: PSR > 0.95 (95% confidence that the true Sharpe exceeds zero).
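The PSR formula above translates directly into code. The sketch below assumes the input is a series of per-period excess returns and uses population moments; the function name and these conventions are assumptions rather than the engine's implementation.

```python
import math
from statistics import NormalDist, fmean, pstdev


def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0):
    """PSR for a per-period (non-annualised) Sharpe ratio."""
    n = len(returns)
    mean, std = fmean(returns), pstdev(returns)
    z = [(r - mean) / std for r in returns]
    skew = fmean([v ** 3 for v in z])
    kurt = fmean([v ** 4 for v in z])  # raw kurtosis, 3 for a normal
    sr = mean / std
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return NormalDist().cdf((sr - sr_benchmark) * math.sqrt(n - 1) / denom)


psr_flat = probabilistic_sharpe_ratio([-0.01, 0.01] * 50)
psr_good = probabilistic_sharpe_ratio([0.01, 0.03] * 50)
```

A zero-mean series gives a PSR of 0.5: the data carry no evidence either way about whether the true Sharpe ratio exceeds zero.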

Deflated Sharpe Ratio (DSR)

Adjusts for multiple testing bias when the engine evaluates K candidate models (Bailey & López de Prado, 2014). The expected maximum Sharpe ratio under the null hypothesis (all models have zero alpha) is:

E[max(SR)] ≈ √(2·ln(K)) − [γ + ln(π/2)] / [2·√(2·ln(K))]

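Here γ ≈ 0.5772 denotes the Euler-Mascheroni constant, not to be confused with the moments γ₃ and γ₄ used in the PSR. The DSR then tests whether the observed Sharpe ratio significantly exceeds this multiple-testing threshold.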

Pass Criterion: DSR > 0.95.
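The sketch below transcribes the expected-maximum approximation exactly as written above and feeds it into the PSR formula as the benchmark. Two points are assumptions: γ is read as the Euler-Mascheroni constant, and the threshold is scaled by the standard deviation of Sharpe ratios across the K trials, as in Bailey & López de Prado (2014). Function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

EULER_MASCHERONI = 0.5772156649015329


def expected_max_sharpe(n_trials):
    """Approximation above, in units of the cross-trial SR standard deviation.

    Requires n_trials >= 2, since ln(1) = 0 would divide by zero.
    """
    root = math.sqrt(2.0 * math.log(n_trials))
    return root - (EULER_MASCHERONI + math.log(math.pi / 2.0)) / (2.0 * root)


def deflated_sharpe_ratio(sr, n_obs, skew, kurt, n_trials, sr_std_trials):
    """PSR evaluated against the multiple-testing threshold."""
    sr_threshold = sr_std_trials * expected_max_sharpe(n_trials)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return NormalDist().cdf((sr - sr_threshold) * math.sqrt(n_obs - 1) / denom)


dsr_10 = deflated_sharpe_ratio(0.1, 1000, 0.0, 3.0, n_trials=10,
                               sr_std_trials=0.03)
dsr_100 = deflated_sharpe_ratio(0.1, 1000, 0.0, 3.0, n_trials=100,
                                sr_std_trials=0.03)
```

With everything else held fixed, raising the number of candidate models from 10 to 100 raises the threshold and lowers the DSR, which is the intended penalty for searching over more models.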

3.3 Output

The validation stage produces a ValidationBundle dataclass:

from dataclasses import dataclass

import pandas as pd


@dataclass
class ValidationBundle:
    oos_eq: pd.Series            # Out-of-sample equity curve
    oos_bench_curve: pd.Series   # Benchmark equity curve
    oos_port_rets: pd.Series     # Out-of-sample portfolio returns
    wf_ann_ret: float            # Walk-forward annualised return
    var_results: dict            # Christoffersen test results
    dm_results: dict             # Diebold-Mariano test results
    psr_results: dict            # Probabilistic Sharpe Ratio
    dsr_results: dict            # Deflated Sharpe Ratio

4. Stage 3 — Full-Sample Optimisation (`optimize`)

4.1 Solver Invocation

The full historical dataset is passed to solver.build_and_optimize(), which:

Computes expected returns using the selected model (CAPM, Black-Litterman, Fama-French, Bayesian, or ML Stacking).
Estimates the covariance matrix with Ledoit-Wolf shrinkage and optional GARCH scaling.
Formulates and solves the convex optimisation problem via the CVXPY engine.
Applies the 7-stage constraint relaxation cascade if the initial formulation is infeasible (see docs/RELAXATION_CASCADE.md).

4.2 Sensitivity & Stress Analysis

Post-optimisation, the engine runs two diagnostic analyses:

Sensitivity Analysis (analytics.portfolio_sensitivity): Perturbs expected returns by ±10% and re-solves, measuring the range of each asset's weight across the perturbed solutions. Assets whose weight swings by more than 15 percentage points are flagged as "fragile."
Stress Testing (analytics.portfolio_stress_test): Evaluates portfolio impact under historical crash scenarios (e.g., the 2008 Global Financial Crisis, the 2020 COVID crash, a rate shock, a tech crash).

If fragile allocations are detected and the allocation engine is Mean-Variance (engine 1), a stability penalty is added to the objective function and the solver is re-invoked.
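The perturb-and-re-solve loop can be illustrated with a toy optimiser. The sketch below is not analytics.portfolio_sensitivity: `toy_solver` is a hypothetical stand-in for the CVXPY solve, and shocking one asset at a time is an assumption about how the perturbation is applied.

```python
def toy_solver(exp_rets, variances):
    """Stand-in optimiser: long-only weights proportional to mu / variance."""
    scores = [max(mu, 0.0) / var for mu, var in zip(exp_rets, variances)]
    total = sum(scores)
    if total == 0.0:
        raise ValueError("no asset has a positive expected return")
    return [s / total for s in scores]


def sensitivity_report(exp_rets, variances, solver=toy_solver,
                       shock=0.10, fragile_pp=15.0):
    """Return (weight range per asset in percentage points, fragile indices)."""
    base = solver(exp_rets, variances)
    lows, highs = list(base), list(base)
    for i in range(len(exp_rets)):
        for sign in (-1.0, 1.0):
            shocked = list(exp_rets)
            shocked[i] *= 1.0 + sign * shock  # +/-10% on one asset's return
            weights = solver(shocked, variances)
            lows = [min(a, b) for a, b in zip(lows, weights)]
            highs = [max(a, b) for a, b in zip(highs, weights)]
    ranges_pp = [100.0 * (hi - lo) for lo, hi in zip(lows, highs)]
    fragile = [i for i, r in enumerate(ranges_pp) if r > fragile_pp]
    return ranges_pp, fragile


ranges_pp, fragile = sensitivity_report([0.08, 0.06], [0.04, 0.04])
```

In this two-asset example each weight moves by roughly five percentage points, so nothing is flagged. The length of `fragile` plays the role of the n_fragile count that triggers the stability-penalty re-solve.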

4.3 Output

from dataclasses import dataclass

import pandas as pd


@dataclass
class OptimizationBundle:
    weights: pd.Series           # Final target weights
    exp_rets: pd.Series          # Expected returns per asset
    cov_mat: pd.DataFrame        # Covariance matrix
    vol: float                   # Portfolio volatility
    corr_matrix: pd.DataFrame    # Correlation matrix
    betas: pd.Series             # Market betas
    model_info: dict             # Model metadata
    sens_report: dict            # Sensitivity analysis
    stress_report: dict          # Stress test results
    n_fragile: int               # Count of fragile allocations

5. Stage 4 — Report Generation (`generate_reports`)

5.1 Architecture

Report generation follows a three-layer architecture:

┌─────────────────────────────────────────────────┐
│              report.py (Orchestrator)             │
│  Coordinates data → template → file pipeline     │
├────────────────┬────────────────────────────────┤
│ report_data.py │      report_html.py             │
│ (Data Layer)   │      (Rendering Layer)          │
│ Formats all    │      Injects variables into     │
│ mathematical   │      report_template.html       │
│ outputs into   │      static template            │
│ template vars  │                                  │
└────────────────┴────────────────────────────────┘

5.2 Report Data Layer — `report_data.py`

The prepare_template_variables() function is the largest single function in the codebase (~675 lines). It transforms raw mathematical outputs into presentation-ready HTML fragments and Chart.js data payloads. Key computations include:

Advanced Risk Metrics: CVaR (95%), Conditional Drawdown-at-Risk (CDaR), Mean Absolute Deviation (MAD), and semi-deviation.
Transition Comparisons: When the user provides current holdings, the report computes before/after comparisons for all metrics.
Chart Payload: A JSON dictionary consumed by Chart.js for interactive equity curves, allocation pie charts, efficient frontier plots, Monte Carlo fan charts, and risk contribution bar charts.
Narrative Generation: narrative.py produces a natural-language summary of the portfolio strategy, market conditions, and key risk factors.
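The advanced risk metrics listed above can be sketched with the standard library. The definitions below are common textbook conventions and may differ in detail from report_data.py: tail metrics are averaged over the worst 5% of observations, losses and drawdowns are reported as positive numbers, and semi-deviation is measured against a target of zero. All function names are hypothetical.

```python
import math
from statistics import fmean


def _tail_mean(values, tail):
    """Mean of the largest `tail` fraction of values (at least one)."""
    ordered = sorted(values, reverse=True)
    k = max(1, round(len(ordered) * tail))
    return fmean(ordered[:k])


def cvar(returns, tail=0.05):
    """Conditional VaR: average loss in the worst tail."""
    return _tail_mean([-r for r in returns], tail)


def drawdowns(returns):
    """Fractional drawdown from the running peak of the equity curve."""
    level, peak, out = 1.0, 1.0, []
    for r in returns:
        level *= 1.0 + r
        peak = max(peak, level)
        out.append(1.0 - level / peak)
    return out


def cdar(returns, tail=0.05):
    """Conditional Drawdown-at-Risk: average of the worst drawdowns."""
    return _tail_mean(drawdowns(returns), tail)


def mean_absolute_deviation(returns):
    mu = fmean(returns)
    return fmean([abs(r - mu) for r in returns])


def semi_deviation(returns, target=0.0):
    """Root mean square of shortfalls below the target."""
    return math.sqrt(fmean([min(r - target, 0.0) ** 2 for r in returns]))


rets = [0.02, -0.10, 0.01, -0.04, 0.03, 0.00, 0.01, -0.01, 0.02, 0.01]
tail_loss = cvar(rets, tail=0.2)  # mean of the two worst days
```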

5.3 HTML Rendering — `report_html.py`

The rendering layer substitutes template variables into report_template.html, a 26 KB static template with Chart.js initialisation scripts. Styling is embedded in the template as CSS, using a dark theme optimised for screen presentation.
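The substitution mechanism used by report_html.py is not specified here. As one possible shape, the sketch below uses `string.Template` from the standard library with a two-line stand-in for report_template.html; the placeholder names `title` and `chart_json` are assumptions.

```python
import json
from string import Template

# Stand-in for report_template.html; the real template is far larger.
TEMPLATE = Template(
    "<h1>$title</h1>\n"
    "<script>const chartPayload = $chart_json;</script>"
)


def render_report(variables):
    """Fill the template; substitute() raises KeyError on a missing variable."""
    return TEMPLATE.substitute(variables)


html = render_report({
    "title": "Portfolio Report",
    # Chart data is serialised once and embedded as a JavaScript literal.
    "chart_json": json.dumps({"labels": ["AAA", "BBB"], "weights": [0.6, 0.4]}),
})
```

Using `substitute()` rather than `safe_substitute()` means a variable missing from the data layer fails the render instead of leaving a raw placeholder in the published report.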

5.4 Export Formats

| Format | Module | Content |
| --- | --- | --- |
| HTML | `report.py` | Interactive report with Chart.js |
| PDF | `exports.py` | Static rendering via headless browser |
| CSV | `exports.py` | Tabular weight/allocation summary |
| Excel | `exports.py` | Multi-sheet workbook (optional) |

6. Data Flow Diagram

External APIs ──▶ data.py ──▶ PostgreSQL/SQLite
                                     │
                              ┌──────┴──────┐
                              │ core_engine  │
                              │ load_data()  │
                              └──────┬───────┘
                                     │
                         ┌───────────┼───────────┐
                         ▼           ▼           ▼
                   solver.py   backtest.py  validation.py
                         │           │           │
                         ▼           ▼           ▼
                   OptBundle   ValBundle    Test Results
                         │           │           │
                         └─────┬─────┘───────────┘
                               ▼
                    report_data.py ──▶ report_html.py
                               │
                               ▼
                         output/*.html
                         output/*.csv
                         output/*.pdf

7. Configuration-Driven Behaviour

The pipeline's behaviour is heavily parameterised via config.py. Key configuration axes include:

| Parameter | Effect |
| --- | --- |
| `model` (1–7) | Selects expected return model (see `docs/MODELS.md`) |
| `allocation_engine` (1–3) | Mean-Variance (CVXPY), HRP, or Exact Risk Parity (see `docs/ALLOCATION_ENGINES.md`) |
| `max_assets` | Cardinality constraint: max number of non-zero positions |
| `garch_enabled` | Enables GARCH(1,1) covariance scaling |
| `cvar_enabled` | Adds CVaR tail-risk constraint to CVXPY formulation |
| `tax_enabled` | Activates tax-aware optimisation with cost-basis tracking |
| `hmm_regime` | Enables HMM regime detection |
| `dynamic_risk` | Enables VIX-based risk aversion adjustment |
| `with_futures` | Enables futures overlay optimisation |
| `return_frequency` | Daily or monthly return aggregation |

8. Error Handling & Graceful Degradation

The pipeline employs multiple fallback mechanisms:

Constraint Relaxation Cascade: 7-stage progressive constraint relaxation (see RELAXATION_CASCADE.md).
Data Fallback: If PostgreSQL is unreachable, the engine falls back to local SQLite.
Model Fallback: If ML ensemble training fails, the engine falls back to CAPM.
Report Fallback: If PDF export fails (no headless browser), only HTML is generated.

These mechanisms are intended to let the pipeline produce at least a partial output under the degraded conditions listed above, rather than failing outright.
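Each fallback above follows the same pattern: try the preferred option, log the failure, and move to the next. A generic sketch of that pattern is shown below using the PostgreSQL to SQLite case. The helper `first_successful` and both connection functions are hypothetical; `connect_postgres` simply simulates an unreachable server.

```python
import logging
import sqlite3

logger = logging.getLogger("pipeline.fallback")


def first_successful(steps):
    """steps: list of (label, zero-argument callable).

    Returns (label, result) for the first step that does not raise.
    """
    failed = []
    for label, step in steps:
        try:
            return label, step()
        except Exception as exc:  # broad on purpose: any failure falls through
            logger.warning("%s failed (%s); trying next option", label, exc)
            failed.append(label)
    raise RuntimeError(f"all options failed: {failed}")


def connect_postgres():
    """Simulated outage for the example."""
    raise ConnectionError("PostgreSQL unreachable")


def connect_sqlite():
    return sqlite3.connect(":memory:")


backend, conn = first_successful([
    ("postgres", connect_postgres),
    ("sqlite", connect_sqlite),
])
```

Returning the label alongside the result lets the caller record which path was taken, which is useful when a report needs to note that it was produced in degraded mode.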

References

Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. Notices of the AMS, 61(5), 458–471.
Bailey, D. H., & López de Prado, M. (2012). The Sharpe ratio efficient frontier. Journal of Risk, 15(2), 3–44.
Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. Journal of Portfolio Management, 40(5), 94–107.
Christoffersen, P. (1998). Evaluating interval forecasts. International Economic Review, 39(4), 841–862.
Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3), 253–263.
Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2), 365–411.
Markowitz, H. (1952). Portfolio selection. Journal of Finance, 7(1), 77–91.