# Research Module

## Abstract

The `research/` directory houses experimental and pre-production components that extend the Portfolio Engine beyond classical mean-variance optimisation. These modules investigate cybernetic control theory and model-based reinforcement learning as complementary paradigms for adaptive portfolio management. None of these components are currently integrated into the production pipeline; they constitute a forward-looking research agenda grounded in control-theoretic and decision-theoretic foundations.

---

## 1. Cybernetic Control Systems

### 1.1 PID Volatility Controller — `cybernetic.py`

**Theoretical Basis.** The Proportional-Integral-Derivative (PID) controller, first formalised by Minorsky (1922) and refined by Ziegler & Nichols (1942), is the workhorse of industrial process control. We adapt this framework to the problem of *volatility targeting*, an approach widely used in institutional risk management (Moreira & Muir, 2017; Harvey et al., 2018).

**Mechanism.** The controller measures a portfolio's realised volatility over a rolling window (default: 21 trading days, annualised via the √252 scaling convention). It computes the error signal *e(t) = σ_target − σ_realised* and derives a leverage multiplier:

```
leverage(t) = 1 + K_p · e(t) + K_i · ∫e(τ)dτ + K_d · de(t)/dt
```

- **Proportional gain (K_p = 2.0):** Immediate response to current deviation.
- **Integral gain (K_i = 0.5):** Corrects persistent bias; subject to anti-windup clamping at ±0.5 to prevent integrator saturation.
- **Derivative gain (K_d = 0.3):** Anticipates the direction of error change, providing damping.

Leverage is hard-clamped to [0.3, 1.5] to enforce position limits consistent with institutional mandates.
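
The mechanism above can be sketched in a few lines of plain Python. This is a minimal illustration, not the contents of `cybernetic.py`: the names `realised_vol` and `PIDVolController` are hypothetical, the 15% target and the one-day time step are assumptions, and the anti-windup clamp is assumed to apply to the accumulated integral rather than to the integral term.

```python
import math
from dataclasses import dataclass
from typing import Optional

TRADING_DAYS = 252  # annualisation convention stated above


def realised_vol(returns, window=21):
    """Annualised sample standard deviation of the last `window` daily returns."""
    recent = returns[-window:]
    mean = sum(recent) / len(recent)
    var = sum((r - mean) ** 2 for r in recent) / (len(recent) - 1)
    return math.sqrt(var) * math.sqrt(TRADING_DAYS)


@dataclass
class PIDVolController:
    """Hypothetical sketch of a discrete-time PID leverage controller."""
    target: float = 0.15          # assumed volatility target (annualised)
    kp: float = 2.0
    ki: float = 0.5
    kd: float = 0.3
    integral_limit: float = 0.5   # anti-windup clamp on the accumulated error
    min_leverage: float = 0.3
    max_leverage: float = 1.5
    dt: float = 1.0               # assumption: one step per trading day
    integral: float = 0.0
    prev_error: Optional[float] = None

    def step(self, sigma_realised):
        error = self.target - sigma_realised
        # Accumulate the error, then clamp it to prevent integrator saturation.
        accumulated = self.integral + error * self.dt
        self.integral = max(-self.integral_limit, min(self.integral_limit, accumulated))
        # No derivative on the first step, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / self.dt
        self.prev_error = error
        raw = 1.0 + self.kp * error + self.ki * self.integral + self.kd * derivative
        # Hard clamp to the permitted leverage band.
        return max(self.min_leverage, min(self.max_leverage, raw))
```

When realised volatility sits above the target the error is negative, so the controller returns leverage below 1; a very large overshoot is floored at 0.3 rather than driving leverage negative.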

**Key Insight.** Volatility is substantially more predictable than returns (Andersen et al., 2003). A PID controller that exploits this property can stabilise risk exposure without requiring accurate return forecasts; whether it does so on real data is still to be evaluated (see Section 4).

### 1.2 Adaptive Risk Controller — `cybernetic.py`

**Concept.** This module implements a *homeostatic setpoint adjustment* for the PID controller. Rather than using a fixed volatility target, the outer loop scales a base σ_target (15% annualised, the value that corresponds to the 1.0× row below) according to the prevailing market regime:

| Regime                    | Multiplier | Effective Target |
|---------------------------|-----------|------------------|
| Bull / Low Volatility     | 1.2×      | 18% annualised   |
| Normal / Chop             | 1.0×      | 15% annualised   |
| Crash / High Volatility   | 0.5×      | 7.5% annualised  |

This nested-loop architecture mirrors Ashby's Law of Requisite Variety (1956): the controller must possess at least as much regulatory diversity as the environment it governs.
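
As a rough sketch, the outer loop reduces to a lookup that rescales the base target before it is handed to the inner PID loop. The regime keys and the function name `effective_target` are hypothetical, and how the regime itself is detected is not shown here.

```python
BASE_TARGET = 0.15  # 15% annualised, the 1.0x row of the table

# Hypothetical regime labels mapped to the multipliers from the table.
REGIME_MULTIPLIER = {
    "bull": 1.2,    # Bull / Low Volatility
    "normal": 1.0,  # Normal / Chop
    "crash": 0.5,   # Crash / High Volatility
}


def effective_target(regime, base_target=BASE_TARGET):
    """Return the volatility setpoint for the given regime label."""
    return base_target * REGIME_MULTIPLIER[regime]
```

For example, `effective_target("crash")` gives a 7.5% target, which would make the inner loop cut leverage sooner for the same realised volatility.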

### 1.3 Three-Layer Cybernetic Ensemble — `cybernetic_ensemble.py`

**Architecture.** The `CyberneticPortfolioController` implements a hierarchical control system with three timescales, inspired by Wiener's cybernetic feedback principles (1948):

| Layer | Component                  | Timescale     | Function                          |
|-------|----------------------------|---------------|-----------------------------------|
| 1     | PID Controller             | Intraday–Daily | Instant volatility regulation     |
| 2     | Differentiable Optimiser   | Daily          | Mean-variance weight computation  |
| 3     | Dreamer RL Agent           | Weekly–Monthly | Meta-parameter adaptation         |

Each layer operates on the output of the layer below it. Faster layers handle high-frequency perturbations; slower layers learn structural adaptations from accumulated performance data.

**MetaController.** Above these three layers sits a supervisory component (`MetaController`) that monitors tracking error against a benchmark. Based on rolling performance diagnostics, it raises or lowers control complexity by adjusting PID gains and exploration parameters. This applies Ashby's principle at the architectural level.

---

## 2. Dreamer World-Model Agent — `research/dreamer/`

### 2.1 Overview

The `dreamer/` package implements a variant of the DreamerV2 world-model agent (Hafner et al., 2021) adapted for financial time series. The architecture learns a latent dynamics model from historical observation-action-reward trajectories and then trains an actor-critic pair *entirely in imagination*, which reduces the amount of real experience needed compared with model-free reinforcement learning.

### 2.2 Components

| Module         | Class / Function           | Purpose                                                         |
|----------------|----------------------------|-----------------------------------------------------------------|
| `rssm.py`      | `RSSM`                     | Recurrent State-Space Model with GRU dynamics, stochastic latent state, and prior/posterior networks |
| `rssm.py`      | `RSSMState`                | Lightweight container for the concatenated deterministic–stochastic state vector |
| `networks.py`  | `Encoder`, `Decoder`       | Observation embedding and reconstruction networks (2-layer MLP with ELU activations) |
| `networks.py`  | `RewardModel`              | Predicts scalar reward (Sharpe ratio proxy) from latent features |
| `networks.py`  | `Actor`                    | Policy network outputting portfolio weights via softmax (ensures simplex constraint) |
| `networks.py`  | `Critic`, `HomeostaticCritic` | Value estimation; the homeostatic variant maintains a slowly-adapting setpoint |
| `buffer.py`    | `ReplayBuffer`             | Sequence replay buffer storing variable-length episodes with edge-padding |
| `agent.py`     | `AgenticForecaster`        | Main agent class: world-model training, latent-space actor-critic training, and inference |
| `agent.py`     | `HomeostaticAgenticForecaster` | Extension with homeostatic critic and periodic setpoint updates |
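
The simplex constraint mentioned for `Actor` follows directly from the softmax: its outputs are non-negative and sum to one, so they can be read as long-only, fully invested weights. A minimal sketch in plain Python, with the hypothetical helper `softmax_weights` standing in for the network's output layer:

```python
import math


def softmax_weights(logits):
    """Map unconstrained scores to portfolio weights on the probability simplex."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Equal logits produce equal weights; a softmax output cannot represent short positions or leverage, so those would have to be applied elsewhere (for instance by the PID leverage multiplier).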

### 2.3 Training Procedure

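**World Model.** Given batches of `(observations, actions, rewards)` sequences of shape `(B, T, D)`, where `B` is the batch size, `T` the sequence length, and `D` the per-step feature size of the tensor in question (it differs between observations, actions, and the scalar rewards):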

1. The encoder maps each observation to an embedding.
2. The RSSM rolls forward through time, producing posterior states conditioned on real observations and prior states from the dynamics model alone.
3. The decoder reconstructs observations from latent features; the reward model predicts scalar rewards.
4. Loss = Reconstruction MSE + Reward MSE + KL(posterior ‖ prior), with free-nats clamping (default: 3.0 nats) to prevent posterior collapse.
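
The loss in step 4 can be illustrated with scalars. The sketch below assumes diagonal-Gaussian latents (the source only says "stochastic latent state", and DreamerV2 itself uses categorical latents), and the function names are hypothetical. Free-nats clamping is written as `max(kl, free_nats)`: below the threshold the KL term is constant, so it contributes no gradient.

```python
import math


def kl_diag_gaussian(mu_q, std_q, mu_p, std_p):
    """KL(q || p) in nats for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def world_model_loss(recon_mse, reward_mse, kl, free_nats=3.0):
    """Reconstruction MSE + reward MSE + KL, with the KL floored at `free_nats`."""
    return recon_mse + reward_mse + max(kl, free_nats)
```

With the default of 3.0 nats, a posterior that is within 3 nats of the prior is not pushed any closer, which leaves the latent free to carry information about the observation.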

**Actor-Critic.** Training occurs entirely in the latent imagination space:

1. From a batch of start states sampled from the posterior, the actor rolls out an imagined trajectory of *H* steps (default: 15).
2. The target critic estimates values along the trajectory; TD(λ) returns are computed with γ = 0.99, λ = 0.95.
3. The actor maximises expected λ-returns; the critic minimises MSE against the λ-return targets.
4. A Polyak-averaged target critic (τ = 0.05) stabilises training.
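
Steps 2 and 4 can be sketched on plain lists. The recursion used here is the standard one, G_t = r_t + γ[(1 − λ)V(s_{t+1}) + λG_{t+1}] with G_H = V(s_H); the function names are hypothetical, and treating τ as the weight on the online parameters is an assumption about the convention used in `agent.py`.

```python
def lambda_returns(rewards, values, bootstrap, gamma=0.99, lam=0.95):
    """TD(lambda) returns along one imagined trajectory of H steps.

    rewards[t] and values[t] = V(s_t) for t = 0..H-1; bootstrap = V(s_H).
    """
    next_values = list(values[1:]) + [bootstrap]
    returns = [0.0] * len(rewards)
    running = bootstrap
    # Work backwards from the horizon so each step reuses the later return.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * running)
        returns[t] = running
    return returns


def polyak_update(target_params, online_params, tau=0.05):
    """Move each target parameter a fraction `tau` towards its online counterpart."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]
```

Setting `lam=0` recovers one-step TD targets and `lam=1` recovers the discounted sum of rewards plus the bootstrapped final value; the default of 0.95 sits between the two.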

### 2.4 Homeostatic Critic

The `HomeostaticCritic` decomposes value prediction as:

```
V(s) = setpoint + deviation(s)
```

The setpoint adapts slowly via exponential moving average (rate = 0.01), implementing a biological homeostasis analogy: the critic maintains a baseline expectation and learns only *deviations* from it. This stabilises learning in non-stationary financial environments where the absolute scale of returns drifts over time.
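
A minimal sketch of the decomposition, under the assumption that the setpoint tracks an exponential moving average of observed return targets (the source does not say which quantity the average is taken over). The class `HomeostaticValue` is hypothetical, and the learned `deviation(s)` network is replaced by a plain number.

```python
class HomeostaticValue:
    """Hypothetical stand-in for the setpoint logic of a homeostatic critic."""

    def __init__(self, rate=0.01, setpoint=0.0):
        self.rate = rate
        self.setpoint = setpoint

    def value(self, deviation):
        """V(s) = setpoint + deviation(s); the deviation comes from a network."""
        return self.setpoint + deviation

    def update_setpoint(self, observed_return):
        """Exponential moving average: move a fraction `rate` towards the new value."""
        self.setpoint += self.rate * (observed_return - self.setpoint)
        return self.setpoint
```

At a rate of 0.01 a single update moves the setpoint only 1% of the way towards the new observation, so the baseline follows slow drift in return scale while the deviation term absorbs state-specific differences.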

---

## 3. Integration Test — `test_dreamer.py`

A standalone integration test validates the full Dreamer pipeline:

1. Generates 20 synthetic episodes of random observations, normalised portfolio-weight actions, and scalar rewards.
2. Populates a `ReplayBuffer` and samples a batch.
3. Trains the world model for one gradient step and checks the resulting loss. A single step is a sanity check on the loss computation, not evidence of convergence.
4. Trains the actor-critic in imagination from the RSSM initial state, verifying gradient flow.

This test is intended for rapid smoke-testing during development and does not constitute a performance benchmark.
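
The synthetic data in step 1 and the edge-padding used by the buffer can be approximated with the standard library alone. This sketch does not reproduce `test_dreamer.py` or the `ReplayBuffer` API: the helper names, episode lengths, and dimensions are all assumptions, and edge-padding is taken to mean repeating the final step of a short episode.

```python
import random


def make_episode(rng, length, obs_dim, n_assets):
    """One synthetic episode: random observations, simplex actions, scalar rewards."""
    obs = [[rng.gauss(0.0, 1.0) for _ in range(obs_dim)] for _ in range(length)]
    raw = [[rng.random() + 1e-8 for _ in range(n_assets)] for _ in range(length)]
    # Normalise each action so the portfolio weights sum to one.
    actions = [[w / sum(row) for w in row] for row in raw]
    rewards = [rng.gauss(0.0, 1.0) for _ in range(length)]
    return obs, actions, rewards


def edge_pad(seq, length):
    """Repeat the last element until `seq` has `length` items (no truncation)."""
    return list(seq) + [seq[-1]] * (length - len(seq))


rng = random.Random(0)  # fixed seed so the smoke test is repeatable
episodes = [
    make_episode(rng, rng.randint(5, 10), obs_dim=8, n_assets=4) for _ in range(20)
]
max_len = max(len(ep[2]) for ep in episodes)
padded_rewards = [edge_pad(ep[2], max_len) for ep in episodes]
```

After padding, every reward sequence has the same length and can be stacked into a `(B, T)` batch.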

---

## 4. Status and Roadmap

| Item                                   | Status         |
|----------------------------------------|----------------|
| PID volatility controller              | ✅ Implemented  |
| Adaptive regime-based target           | ✅ Implemented  |
| Cybernetic ensemble (3-layer)          | ✅ Implemented  |
| Dreamer world-model training           | ✅ Implemented  |
| Homeostatic critic variant             | ✅ Implemented  |
| Integration with production pipeline   | ⬜ Not started  |
| Hyperparameter tuning on real data     | ⬜ Not started  |
| Out-of-sample performance evaluation   | ⬜ Not started  |

---

## References

- Andersen, T. G., Bollerslev, T., Diebold, F. X., & Labys, P. (2003). Modeling and forecasting realized volatility. *Econometrica*, 71(2), 579–625.
- Ashby, W. R. (1956). *An Introduction to Cybernetics*. Chapman & Hall.
- Hafner, D., Lillicrap, T., Norouzi, M., & Ba, J. (2021). Mastering Atari with discrete world models. *ICLR 2021*.
- Harvey, C. R., Hoyle, E., Korgaonkar, R., Rattray, S., Sargaison, M., & Van Hemert, O. (2018). The impact of volatility targeting. *Journal of Portfolio Management*, 45(1), 14–33.
- Minorsky, N. (1922). Directional stability of automatically steered bodies. *Journal of the American Society of Naval Engineers*, 34(2), 280–309.
- Moreira, A., & Muir, T. (2017). Volatility-managed portfolios. *Journal of Finance*, 72(4), 1611–1644.
- Wiener, N. (1948). *Cybernetics: Or Control and Communication in the Animal and the Machine*. MIT Press.
- Ziegler, J. G., & Nichols, N. B. (1942). Optimum settings for automatic controllers. *Transactions of the ASME*, 64(11), 759–768.