# Pipeline Architecture
## Abstract
The Portfolio Engine implements a multi-stage pipeline that transforms raw market data into optimised portfolio allocations, validated through rigorous out-of-sample econometric testing, and exported as interactive reports. This document describes the full execution flow, the data structures that mediate inter-stage communication, the mathematical validation framework, and the report generation subsystem. It serves as the architectural reference for understanding how the engine's components compose into a coherent analytical system.
---
## 1. Pipeline Overview
The engine is orchestrated by the `PortfolioPipeline` class in `core_engine.py`, which implements a four-stage execution model:
```
┌──────────────────────────────────────────────────────────────────────────┐
│                             Pipeline Stages                              │
│                                                                          │
│ ┌─────────────┐   ┌────────────────┐   ┌─────────────┐   ┌─────────────┐ │
│ │   Stage 1   │──▶│    Stage 2     │──▶│   Stage 3   │──▶│   Stage 4   │ │
│ │ load_data() │   │run_validation()│   │ optimize()  │   │  reports()  │ │
│ └─────────────┘   └────────────────┘   └─────────────┘   └─────────────┘ │
│                                                                          │
│ Data Fetch        Walk-Forward CV      Full-Sample       HTML + CSV      │
│ Regime Detect     Econometric Tests    Optimisation      PDF Export      │
│ Risk Aversion     DM / Christoffersen  Sensitivity       Serve           │
│ Adjustment        PSR / DSR            Stress Test                       │
└──────────────────────────────────────────────────────────────────────────┘
```
### Entry Point
```python
def run_engine(overrides=None):
    """Run all four pipeline stages in order."""
    pipeline = PortfolioPipeline(overrides=overrides)
    pipeline.load_data()                                # Stage 1
    val_bundle = pipeline.run_validation()              # Stage 2
    opt_bundle = pipeline.optimize()                    # Stage 3
    pipeline.generate_reports(val_bundle, opt_bundle)   # Stage 4
```
The `overrides` dictionary enables headless execution from the API layer (`api.py`), test harnesses, or scheduled batch jobs, bypassing the interactive CLI wizard.
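As a rough illustration of how an `overrides` dictionary might be layered onto defaults, the sketch below merges caller-supplied values over a default configuration. The `DEFAULTS` mapping and the `resolve_config` helper are hypothetical names introduced here; only the keys and default values shown are taken from this document, and `PortfolioPipeline` may resolve overrides differently.

```python
# Hypothetical defaults for illustration; the real values live in config.py.
DEFAULTS = {
    "hmm_regime": True,
    "dynamic_risk": True,
    "trading_days_per_year": 252,
}


def resolve_config(overrides=None):
    """Return a copy of DEFAULTS with caller overrides applied on top."""
    overrides = overrides or {}
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Fail loudly rather than silently ignoring a misspelled key.
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


# A headless caller switches off regime detection and keeps everything else.
config = resolve_config({"hmm_regime": False})
```

Rejecting unknown keys is a design choice of this sketch: a headless caller has no wizard to catch a typo, so an early error is easier to diagnose than a silently ignored setting.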
---
## 2. Stage 1: Data Loading (`load_data`)
### 2.1 Data Sources
| Source | Target | Module |
|---------------------------|-----------------------|---------------------|
| Yahoo Finance / DB Cache | Daily OHLCV prices | `data.py` |
| Kenneth French Library | Fama-French factors | `data.py` |
| FRED / ^TNX proxy | Risk-free rate series | `data.py` |
| PostgreSQL / SQLite | Cached price data | `database.py` |
### 2.2 Data Validation
- **Minimum History:** Assets must have ≥ 2× `trading_days_per_year` (default: 504 business days) of return history to be included. Assets with insufficient history are silently dropped.
- **Missing Data:** Returns DataFrames are cleaned with `pd.DataFrame.dropna()`, which drops every date on which any asset has a missing value, so all assets share a common date index.
- **Frequency Conversion:** When `return_frequency = 'monthly'`, daily returns are geometrically compounded to monthly via `build_monthly_returns()`.
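
The three validation rules above can be sketched without pandas as follows. This is a dependency-free stand-in, not the engine's code: the function names are hypothetical, each asset is assumed to be a plain `dict` mapping `datetime.date` to a simple return, and `compound_to_monthly` only mimics what `build_monthly_returns()` is described as doing.

```python
from collections import defaultdict
from datetime import date, timedelta


def filter_min_history(returns_by_asset, min_obs):
    """Keep only assets with at least `min_obs` return observations."""
    return {name: r for name, r in returns_by_asset.items() if len(r) >= min_obs}


def align_common_dates(returns_by_asset):
    """Restrict every asset to the dates on which all assets have a return."""
    if not returns_by_asset:
        return {}
    common = set.intersection(*(set(r) for r in returns_by_asset.values()))
    return {
        name: {d: r[d] for d in sorted(common)}
        for name, r in returns_by_asset.items()
    }


def compound_to_monthly(daily):
    """Geometrically compound daily returns: prod(1 + r) - 1 per calendar month."""
    growth = defaultdict(lambda: 1.0)
    for d, r in sorted(daily.items()):
        growth[(d.year, d.month)] *= 1.0 + r
    return {month: g - 1.0 for month, g in growth.items()}


# Tiny example: "SHORT" has too little history and is dropped.
start = date(2024, 1, 30)
a = {start + timedelta(days=i): 0.01 for i in range(4)}     # Jan 30 to Feb 2
b = {start + timedelta(days=i): 0.02 for i in range(1, 4)}  # Jan 31 to Feb 2
kept = filter_min_history({"A": a, "B": b, "SHORT": {start: 0.5}}, min_obs=3)
aligned = align_common_dates(kept)
monthly_a = compound_to_monthly(aligned["A"])
```

After alignment, asset `A` loses its Jan 30 observation because `B` has no return on that date; this is the same trade-off `dropna()` makes between sample length and a common index.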
### 2.3 Regime Detection
If `hmm_regime = True` (default), the engine fits a Hidden Markov Model to benchmark returns via `regime_detection.detect_volatility_regime()`. The detected regime (Bull, Normal, Crash) informs:
- Dynamic risk aversion adjustment (Stage 2 and 3).
- PID volatility target in the research cybernetic ensemble.
- Report visualisation annotations.
### 2.4 Dynamic Risk Aversion
If `dynamic_risk = True` (default), the VIX level is used to adjust the user's stated risk aversion via `regime_detection.dynamic_risk_aversion()`. This implements a counter-cyclical risk management policy: risk aversion increases during high-volatility episodes, reducing exposure before drawdowns deepen.
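
A counter-cyclical adjustment of this kind could look like the sketch below. The functional form (a clamped linear ratio), the baseline VIX of 20, and the floor and cap are all assumptions made for illustration; the actual rule lives in `regime_detection.dynamic_risk_aversion()` and may differ.

```python
def adjust_risk_aversion(base_risk_aversion, vix, vix_baseline=20.0,
                         floor=0.5, cap=3.0):
    """Scale risk aversion up when VIX is above a baseline, down when below.

    Every parameter other than `vix` is an illustrative assumption.
    """
    if vix <= 0 or vix_baseline <= 0:
        raise ValueError("VIX levels must be positive")
    # Clamp the multiplier so an extreme VIX print cannot dominate the result.
    multiplier = min(max(vix / vix_baseline, floor), cap)
    return base_risk_aversion * multiplier


calm = adjust_risk_aversion(2.0, vix=20.0)       # at baseline: unchanged
stressed = adjust_risk_aversion(2.0, vix=40.0)   # doubled VIX: doubled aversion
panic = adjust_risk_aversion(2.0, vix=100.0)     # capped at 3x
```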
---
## 3. Stage 2: Walk-Forward Validation (`run_validation`)
### 3.1 Expanding Window Cross-Validation
The engine performs expanding-window (walk-forward) backtesting via `backtest.expanding_window_backtest()`:
1. An initial training window of `OOS_TRAIN_DAYS` (total days − 252) is established.
2. The model is trained on the expanding window and produces out-of-sample weights.
3. Weights are rebalanced every `trading_days / 4` periods (quarterly).
4. An out-of-sample equity curve is constructed from realised returns.
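Because each set of weights is evaluated only on returns that follow its training window, this methodology guards against look-ahead bias. Walk-forward evaluation is a widely used standard for strategy validation in quantitative finance (Bailey et al., 2014).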
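
The window arithmetic in the steps above can be sketched as a generator of train/test index slices. The function name and defaults are hypothetical; 252 and 63 correspond to the "total days − 252" initial window and the `trading_days / 4` rebalancing step under the assumption of 252 trading days per year.

```python
def walk_forward_splits(total_days, oos_days=252, rebalance_every=63):
    """Yield (train, test) slices for an expanding-window backtest.

    Training always starts at index 0 and stops where the test block
    begins, so no test observation is visible while fitting.
    """
    train_end = total_days - oos_days
    if train_end <= 0:
        raise ValueError("not enough history for the requested out-of-sample span")
    while train_end < total_days:
        test_end = min(train_end + rebalance_every, total_days)
        yield slice(0, train_end), slice(train_end, test_end)
        train_end = test_end  # the window expands to absorb the block just tested


splits = list(walk_forward_splits(total_days=1000))
```

With 1,000 days of history this yields four quarterly test blocks, and each training slice ends exactly where its test slice starts.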
### 3.2 Econometric Tests
The validation stage runs four statistical tests on the out-of-sample returns:
#### Christoffersen Conditional Coverage Test
Tests whether Value-at-Risk (VaR) exceedances are both correctly calibrated (unconditional coverage) and serially independent (no volatility clustering in violations). A joint likelihood ratio statistic is computed:
```
LR_cc = LR_uc + LR_ind ~ χ²(2)
```
**Pass Criterion:** p-value > 0.05 for both components.
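
For reference, the two likelihood-ratio components can be computed from a 0/1 exceedance sequence as in the sketch below. It is a standard-library illustration of the textbook test, not the engine's implementation; the function name and return keys are assumptions. The p-values use the closed forms for χ²(1) and χ²(2) tail probabilities.

```python
import math


def _xlogy(count, prob):
    """count * log(prob), with the convention 0 * log(0) = 0."""
    return 0.0 if count == 0 else count * math.log(prob)


def christoffersen_test(hits, alpha=0.05):
    """Return LR statistics and p-values for a 0/1 VaR exceedance sequence."""
    hits = [int(bool(h)) for h in hits]
    n = len(hits)
    if n < 2:
        raise ValueError("need at least two observations")
    n1 = sum(hits)
    n0 = n - n1

    # Unconditional coverage: is the exceedance rate equal to alpha?
    pi_hat = n1 / n
    ll_null = _xlogy(n0, 1 - alpha) + _xlogy(n1, alpha)
    ll_alt = _xlogy(n0, 1 - pi_hat) + _xlogy(n1, pi_hat)
    lr_uc = max(0.0, -2.0 * (ll_null - ll_alt))

    # Independence: does yesterday's exceedance change today's probability?
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(hits, hits[1:]):
        counts[(prev, cur)] += 1
    n00, n01 = counts[(0, 0)], counts[(0, 1)]
    n10, n11 = counts[(1, 0)], counts[(1, 1)]
    pi01 = n01 / (n00 + n01) if n00 + n01 else 0.0
    pi11 = n11 / (n10 + n11) if n10 + n11 else 0.0
    pi = (n01 + n11) / (n - 1)
    ll_null = _xlogy(n00 + n10, 1 - pi) + _xlogy(n01 + n11, pi)
    ll_alt = (_xlogy(n00, 1 - pi01) + _xlogy(n01, pi01)
              + _xlogy(n10, 1 - pi11) + _xlogy(n11, pi11))
    lr_ind = max(0.0, -2.0 * (ll_null - ll_alt))

    lr_cc = lr_uc + lr_ind
    return {
        "lr_uc": lr_uc, "p_uc": math.erfc(math.sqrt(lr_uc / 2.0)),      # chi2(1)
        "lr_ind": lr_ind, "p_ind": math.erfc(math.sqrt(lr_ind / 2.0)),  # chi2(1)
        "lr_cc": lr_cc, "p_cc": math.exp(-lr_cc / 2.0),                 # chi2(2)
    }


# Five isolated exceedances in 100 days versus ten exceedances in a row.
isolated = [1 if i in (10, 30, 50, 70, 90) else 0 for i in range(100)]
clustered = [1] * 10 + [0] * 90
res_isolated = christoffersen_test(isolated)
res_clustered = christoffersen_test(clustered)
```

The isolated sequence has exactly the 5% exceedance rate and no clustering, so both components pass; the clustered sequence fails the independence component decisively.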
#### Diebold-Mariano Test
Tests whether the engine's expected return model statistically outperforms a naive historical mean baseline in terms of out-of-sample prediction accuracy:
```
DM = d̄ / σ̂(d) ~ N(0, 1)
```
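where d_t = |e₁_t| − |e₂_t| is the loss differential under an absolute-error loss and d̄ is its sample mean. The standard error σ̂(d) is computed with the Newey-West estimator, which makes the test robust to heteroskedasticity and autocorrelation in d_t.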
**Pass Criterion:** p-value < 0.05 and the engine's model wins.
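
A minimal standard-library sketch of the statistic follows. The function name, the lag rule of thumb (the integer part of n^(1/3)), and the use of Bartlett weights for the Newey-West long-run variance are assumptions of this sketch; the engine's own lag selection is not documented here.

```python
import math
from statistics import NormalDist, fmean


def diebold_mariano(errors_1, errors_2, lags=None):
    """DM statistic and two-sided p-value under absolute-error loss."""
    if len(errors_1) != len(errors_2):
        raise ValueError("forecast error series must have equal length")
    d = [abs(a) - abs(b) for a, b in zip(errors_1, errors_2)]
    n = len(d)
    if lags is None:
        lags = int(n ** (1 / 3))  # rule of thumb, assumed for this sketch
    d_bar = fmean(d)
    centred = [x - d_bar for x in d]

    def autocov(k):
        return sum(centred[t] * centred[t - k] for t in range(k, n)) / n

    # Newey-West long-run variance with Bartlett kernel weights.
    long_run_var = autocov(0)
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1)
        long_run_var += 2.0 * weight * autocov(k)
    if long_run_var <= 0:
        raise ValueError("loss differential has no usable variance")

    dm = d_bar / math.sqrt(long_run_var / n)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(dm)))
    return dm, p_value


# Model 1 has consistently smaller absolute errors than model 2.
e1 = [0.1] * 50
e2 = [0.3 + 0.01 * (i % 5) for i in range(50)]
dm_stat, dm_p = diebold_mariano(e1, e2)
```

With d_t defined as above, a negative statistic means the first series has the lower absolute error; which series is the engine's model depends on the argument order chosen by the caller.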
#### Probabilistic Sharpe Ratio (PSR)
Accounts for the non-normality of returns (skewness and kurtosis) when evaluating whether the observed Sharpe ratio is statistically distinguishable from a benchmark value of zero (Bailey & López de Prado, 2012):
```
PSR = Φ[(SR − SR*) · √(n−1) / √(1 − γ₃·SR + (γ₄−1)/4 · SR²)]
```
where γ₃ and γ₄ are the sample skewness and kurtosis, n is the number of return observations, and SR* is the benchmark Sharpe ratio (zero here).
**Pass Criterion:** PSR > 0.95 (95% confidence that the true Sharpe exceeds zero).
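
The formula translates directly into code. In this sketch the function name and argument names are hypothetical, and `sr` is assumed to be the per-period (non-annualised) Sharpe ratio measured at the same frequency as the `n` observations.

```python
import math
from statistics import NormalDist


def probabilistic_sharpe_ratio(sr, n, skew=0.0, kurt=3.0, sr_benchmark=0.0):
    """PSR for an observed Sharpe ratio `sr` estimated from `n` returns."""
    if n < 2:
        raise ValueError("need at least two observations")
    denom_sq = 1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2
    if denom_sq <= 0:
        raise ValueError("skewness/kurtosis inputs give a non-positive variance term")
    z = (sr - sr_benchmark) * math.sqrt(n - 1) / math.sqrt(denom_sq)
    return NormalDist().cdf(z)


psr_normal = probabilistic_sharpe_ratio(0.2, n=101)              # Gaussian returns
psr_skewed = probabilistic_sharpe_ratio(0.2, n=101, skew=-1.0)   # negative skew
passes = psr_normal > 0.95
```

Negative skewness enlarges the denominator, so the same observed Sharpe ratio earns a lower PSR than it would under normally distributed returns.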
#### Deflated Sharpe Ratio (DSR)
Adjusts for multiple testing bias when the engine evaluates K candidate models (Bailey & López de Prado, 2014). The expected maximum Sharpe ratio under the null hypothesis (all models have zero alpha) is:
```
E[max(SR)] ≈ √(2·ln(K)) − [γ + ln(π/2)] / [2·√(2·ln(K))]
```
The DSR then tests whether the observed Sharpe significantly exceeds this multiple-testing threshold.
**Pass Criterion:** DSR > 0.95.
### 3.3 Output
The validation stage produces a `ValidationBundle` dataclass:
```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class ValidationBundle:
    oos_eq: pd.Series           # Out-of-sample equity curve
    oos_bench_curve: pd.Series  # Benchmark equity curve
    oos_port_rets: pd.Series    # Out-of-sample portfolio returns
    wf_ann_ret: float           # Walk-forward annualised return
    var_results: dict           # Christoffersen test results
    dm_results: dict            # Diebold-Mariano test results
    psr_results: dict           # Probabilistic Sharpe Ratio
    dsr_results: dict           # Deflated Sharpe Ratio
```
---
## 4. Stage 3: Full-Sample Optimisation (`optimize`)
### 4.1 Solver Invocation
The full historical dataset is passed to `solver.build_and_optimize()`, which:
1. Computes expected returns using the selected model (CAPM, Black-Litterman, Fama-French, Bayesian, or ML Stacking).
2. Estimates the covariance matrix with Ledoit-Wolf shrinkage and optional GARCH scaling.
3. Formulates and solves the convex optimisation problem via the CVXPY engine.
4. Applies the 7-stage constraint relaxation cascade if the initial formulation is infeasible (see `docs/RELAXATION_CASCADE.md`).
### 4.2 Sensitivity & Stress Analysis
Post-optimisation, the engine runs two diagnostic analyses:
- **Sensitivity Analysis** (`analytics.portfolio_sensitivity`): Perturbs expected returns by ±10% and re-solves, measuring the weight response range per asset. Assets whose weights swing by more than 15 percentage points are flagged as "fragile."
- **Stress Testing** (`analytics.portfolio_stress_test`): Evaluates portfolio impact under historical crash scenarios (e.g., 2008 GFC, 2020 COVID, rate shock, tech crash).
If fragile allocations are detected and the allocation engine is Mean-Variance (engine 1), a stability penalty is added to the objective function and the solver is re-invoked.
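
The perturb-and-re-solve loop can be sketched as follows. Two things here are assumptions rather than the engine's behaviour: `toy_solver` is a deliberately twitchy softmax stand-in for `solver.build_and_optimize()`, and the perturbation scheme shocks one asset at a time, which the description above does not specify.

```python
import math


def toy_solver(exp_rets, temperature=0.002):
    """Stand-in optimiser: softmax weights over expected returns."""
    top = max(exp_rets.values())
    scores = {a: math.exp((mu - top) / temperature) for a, mu in exp_rets.items()}
    total = sum(scores.values())
    return {a: s / total for a, s in scores.items()}


def sensitivity_report(exp_rets, solve, shock=0.10, fragile_threshold=0.15):
    """Shock each expected return by +/- `shock` and track weight ranges."""
    base = solve(exp_rets)
    lowest, highest = dict(base), dict(base)
    for name in exp_rets:
        for sign in (-1.0, 1.0):
            shocked = dict(exp_rets)
            shocked[name] = exp_rets[name] * (1.0 + sign * shock)
            for asset, weight in solve(shocked).items():
                lowest[asset] = min(lowest[asset], weight)
                highest[asset] = max(highest[asset], weight)
    ranges = {a: highest[a] - lowest[a] for a in base}
    # A range above the threshold (0.15 = 15 percentage points) is "fragile".
    fragile = [a for a, r in ranges.items() if r > fragile_threshold]
    return ranges, fragile


ranges, fragile = sensitivity_report({"A": 0.08, "B": 0.079, "C": 0.02}, toy_solver)
```

Assets `A` and `B` have nearly identical expected returns, so a small shock flips most of the weight between them and both are flagged; `C` is never competitive and its weight barely moves.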
### 4.3 Output
```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class OptimizationBundle:
    weights: pd.Series         # Final target weights
    exp_rets: pd.Series        # Expected returns per asset
    cov_mat: pd.DataFrame      # Covariance matrix
    vol: float                 # Portfolio volatility
    corr_matrix: pd.DataFrame  # Correlation matrix
    betas: pd.Series           # Market betas
    model_info: dict           # Model metadata
    sens_report: dict          # Sensitivity analysis
    stress_report: dict        # Stress test results
    n_fragile: int             # Count of fragile allocations
```
---
## 5. Stage 4: Report Generation (`generate_reports`)
### 5.1 Architecture
Report generation follows a three-layer architecture:
```
┌─────────────────────────────────────────────────┐
│            report.py (Orchestrator)             │
│   Coordinates data → template → file pipeline   │
├────────────────┬────────────────────────────────┤
│ report_data.py │ report_html.py                 │
│ (Data Layer)   │ (Rendering Layer)              │
│ Formats all    │ Injects variables into         │
│ mathematical   │ report_template.html           │
│ outputs into   │ static template                │
│ template vars  │                                │
└────────────────┴────────────────────────────────┘
```
### 5.2 Report Data Layer: `report_data.py`
The `prepare_template_variables()` function is the largest single function in the codebase (~675 lines). It transforms raw mathematical outputs into presentation-ready HTML fragments and Chart.js data payloads. Key computations include:
- **Advanced Risk Metrics:** CVaR (95%), Conditional Drawdown-at-Risk (CDaR), Mean Absolute Deviation (MAD), and semi-deviation.
- **Transition Comparisons:** When the user provides current holdings, the report computes before/after comparisons for all metrics.
- **Chart Payload:** A JSON dictionary consumed by Chart.js for interactive equity curves, allocation pie charts, efficient frontier plots, Monte Carlo fan charts, and risk contribution bar charts.
- **Narrative Generation:** `narrative.py` produces a natural-language summary of the portfolio strategy, market conditions, and key risk factors.
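
Three of the risk metrics listed above have short closed-form definitions, sketched below with the standard library only. The function names are hypothetical and the conventions are assumptions: CVaR is the average of the worst 5% of periodic returns reported as a positive loss, and semi-deviation is measured below the sample mean. `report_data.py` may use different conventions, and CDaR is omitted for brevity.

```python
import math
from statistics import fmean


def cvar(returns, alpha=0.05):
    """Average loss over the worst `alpha` share of periods (positive = loss)."""
    tail_size = max(1, math.ceil(len(returns) * alpha))
    worst = sorted(returns)[:tail_size]
    return -fmean(worst)


def mean_absolute_deviation(returns):
    """Mean absolute distance of returns from their average."""
    mu = fmean(returns)
    return fmean([abs(r - mu) for r in returns])


def semi_deviation(returns):
    """Root-mean-square of below-mean deviations only."""
    mu = fmean(returns)
    return math.sqrt(fmean([min(r - mu, 0.0) ** 2 for r in returns]))


# One bad period among ten otherwise steady ones.
rets = [-0.10] + [0.02] * 9
report = {
    "cvar_95": cvar(rets),
    "mad": mean_absolute_deviation(rets),
    "semi_dev": semi_deviation(rets),
}
```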
### 5.3 HTML Rendering: `report_html.py`
The rendering layer substitutes template variables into `report_template.html`, a 26 KB static template with Chart.js initialisation scripts. The template's CSS is embedded in the HTML file and uses a dark theme optimised for screen presentation.
### 5.4 Export Formats
| Format | Module | Content |
|----------|----------------|--------------------------------------------|
| HTML | `report.py` | Interactive report with Chart.js |
| PDF | `exports.py` | Static rendering via headless browser |
| CSV | `exports.py` | Tabular weight/allocation summary |
| Excel | `exports.py` | Multi-sheet workbook (optional) |
---
## 6. Data Flow Diagram
```
External APIs ──▶ data.py ──▶ PostgreSQL/SQLite
                     │
              ┌──────┴──────┐
              │ core_engine │
              │ load_data() │
              └──────┬──────┘
                     │
       ┌─────────────┼─────────────┐
       ▼             ▼             ▼
   solver.py    backtest.py  validation.py
       │             │             │
       ▼             ▼             ▼
   OptBundle     ValBundle   Test Results
       │             │             │
       └──────┬──────┴─────────────┘
              ▼
       report_data.py ──▶ report_html.py
                                │
                                ▼
                          output/*.html
                          output/*.csv
                          output/*.pdf
```
---
## 7. Configuration-Driven Behaviour
The pipeline's behaviour is heavily parameterised via `config.py`. Key configuration axes include:
| Parameter | Effect |
|--------------------------|-------------------------------------------------------|
| `model` (1–7) | Selects expected return model (see `docs/MODELS.md`) |
| `allocation_engine` (1–3)| Mean-Variance (CVXPY), HRP, or Exact Risk Parity (see `docs/ALLOCATION_ENGINES.md`) |
| `max_assets` | Cardinality constraint: max number of non-zero positions |
| `garch_enabled` | Enables GARCH(1,1) covariance scaling |
| `cvar_enabled` | Adds CVaR tail-risk constraint to CVXPY formulation |
| `tax_enabled` | Activates tax-aware optimisation with cost-basis tracking |
| `hmm_regime` | Enables HMM regime detection |
| `dynamic_risk` | Enables VIX-based risk aversion adjustment |
| `with_futures` | Enables futures overlay optimisation |
| `return_frequency` | Daily or monthly return aggregation |
---
## 8. Error Handling & Graceful Degradation
The pipeline employs multiple fallback mechanisms:
1. **Constraint Relaxation Cascade:** 7-stage progressive constraint relaxation (see `docs/RELAXATION_CASCADE.md`).
2. **Data Fallback:** If PostgreSQL is unreachable, the engine falls back to local SQLite.
3. **Model Fallback:** If ML ensemble training fails, the engine falls back to CAPM.
4. **Report Fallback:** If PDF export fails (no headless browser), only HTML is generated.
These mechanisms are designed to let the pipeline produce output even under degraded conditions.
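
The fallbacks listed above share one pattern: try the preferred option, and on failure log the problem and move to the next. A generic sketch of that pattern is shown below; `first_successful` and the two stand-in model functions are hypothetical and are not taken from the codebase.

```python
import logging

logger = logging.getLogger("pipeline.fallback")


def first_successful(candidates):
    """Run (label, callable) pairs in order and return the first result.

    Each failure is logged and the next candidate is tried; if every
    candidate fails, the last exception is re-raised.
    """
    last_error = None
    for label, func in candidates:
        try:
            return label, func()
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            logger.warning("%s failed (%s); trying next fallback", label, exc)
            last_error = exc
    if last_error is None:
        raise ValueError("no candidates supplied")
    raise last_error


def ml_ensemble():
    # Stand-in for a model whose training fails.
    raise RuntimeError("training did not converge")


def capm():
    # Stand-in for the simpler fallback model.
    return {"model": "CAPM"}


used, result = first_successful([("ML ensemble", ml_ensemble), ("CAPM", capm)])
```

Re-raising the last exception when every candidate fails keeps a total failure visible instead of returning an empty result.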
---
## References
- Bailey, D. H., Borwein, J. M., López de Prado, M., & Zhu, Q. J. (2014). Pseudo-mathematics and financial charlatanism: The effects of backtest overfitting on out-of-sample performance. *Notices of the AMS*, 61(5), 458–471.
- Bailey, D. H., & López de Prado, M. (2012). The Sharpe ratio efficient frontier. *Journal of Risk*, 15(2), 3–44.
- Bailey, D. H., & López de Prado, M. (2014). The deflated Sharpe ratio: Correcting for selection bias, backtest overfitting, and non-normality. *Journal of Portfolio Management*, 40(5), 94–107.
- Christoffersen, P. (1998). Evaluating interval forecasts. *International Economic Review*, 39(4), 841–862.
- Diebold, F. X., & Mariano, R. S. (1995). Comparing predictive accuracy. *Journal of Business & Economic Statistics*, 13(3), 253–263.
- Ledoit, O., & Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. *Journal of Multivariate Analysis*, 88(2), 365–411.
- Markowitz, H. (1952). Portfolio selection. *Journal of Finance*, 7(1), 77–91.