Reinforcement Learning Framework — Dueling DQN

This engine implements a Dueling Deep Q-Network (Dueling DQN) for daily ETF selection, directly extending the RL framework proposed by Yasin & Gill (2024) — "Reinforcement Learning Framework for Quantitative Trading", presented at the ICAIF 2024 FM4TS Workshop (arXiv:2411.07585).

From the Paper → Our Implementation

The paper benchmarks DQN, PPO, and A2C agents on single-stock buy/sell decisions using 20 technical indicators. It finds that DQN with an MLP policy significantly outperforms the policy-gradient methods (PPO, A2C) on daily financial time series, and that higher learning rates (lr = 0.001) produce the most profitable signals.

We extend this methodology in three key ways:

Multi-Asset Action Space: Rather than binary buy/sell on a single asset, the agent selects from 8 discrete actions — CASH or one of 7 ETFs (TLT, VCIT, LQD, HYG, VNQ, GLD, SLV). This is fundamentally a harder problem than the paper's setup, requiring the agent to learn relative value across assets.
Dueling Architecture (Wang et al., 2016): We replace the paper's standard DQN with a Dueling DQN, which separates the Q-function into a state-value stream V(s) and an advantage stream A(s,a):
Q(s,a) = V(s) + A(s,a) − mean_a'(A(s,a'))
This decomposition suits larger action spaces because the network learns how valuable a state is independently of which action is taken. That matters when several actions are near-substitutes, for example when TLT and VCIT have similar Q-values in a falling-rate regime.
Macro State Augmentation: The paper's state space uses only price-derived technical indicators. We add six FRED macro signals to the state: VIX, T10Y2Y (yield curve slope), TBILL_3M, DXY, Corp Spread, and HY Spread. These directly encode the macro regime that drives fixed-income and credit ETF selection.
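The dueling aggregation described in the second item can be illustrated with a minimal, framework-free sketch. The numeric stream outputs below are made up for illustration, and the function name dueling_q_values is hypothetical; in the engine these values would come from the network's two output heads.

```python
from statistics import fmean

# Action order is an assumption for this sketch: CASH first, then the 7 ETFs.
ACTIONS = ["CASH", "TLT", "VCIT", "LQD", "HYG", "VNQ", "GLD", "SLV"]


def dueling_q_values(state_value, advantages):
    """Combine V(s) and A(s, a) into Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = fmean(advantages)
    return [state_value + adv - mean_advantage for adv in advantages]


# Hypothetical outputs of the value and advantage streams for one state.
v = 0.40
a = [0.00, 0.30, 0.28, 0.10, -0.20, -0.10, 0.05, -0.03]

q = dueling_q_values(v, a)
best_action = ACTIONS[max(range(len(q)), key=q.__getitem__)]
```

Subtracting the mean advantage makes the Q-values average to V(s), so the value stream carries the overall quality of the state while the advantage stream only has to rank the actions, including close pairs such as TLT and VCIT.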

State Space (per trading day)

Each trading day contributes 20 technical indicators for each of the 7 ETFs plus 6 macro signals (plus z-scored variants). The state stacks these daily features over a rolling 20-day lookback window, and the flattened window is fed to the DQN as a single state vector. Indicators follow the paper exactly: RSI(14), MACD(12/26/9), Stochastic(14), CCI(20), ROC(10), CMO(14), Williams%R, ATR, Bollinger %B + Width, StochRSI, Ultimate Oscillator, Momentum(10), rolling returns at 1/5/10/21d, and 21d realised volatility.
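As a rough sizing sketch, the state dimension can be worked out from the counts above. This assumes the window holds 20 daily rows of features, flattened oldest day first. The number of z-scored columns is not specified in the text, so it is left as a hypothetical parameter (n_zscored_extra) that defaults to zero.

```python
N_ETFS = 7
N_INDICATORS = 20   # technical indicators per ETF
N_MACRO = 6         # FRED macro signals
LOOKBACK = 20       # trading days in the rolling window


def features_per_day(n_zscored_extra=0):
    """Daily feature count; n_zscored_extra is a placeholder for the z-scored variants."""
    return N_ETFS * N_INDICATORS + N_MACRO + n_zscored_extra


def flatten_window(window):
    """Flatten LOOKBACK rows of daily features (oldest first) into one state vector."""
    if len(window) != LOOKBACK:
        raise ValueError(f"expected {LOOKBACK} rows, got {len(window)}")
    return [value for row in window for value in row]


# Dummy window: every feature on day d is set to d, just to show the layout.
n = features_per_day()
window = [[float(day)] * n for day in range(LOOKBACK)]
state = flatten_window(window)
state_dim = len(state)
```

Without any z-scored columns this gives 146 features per day and a 2,920-element state vector; each additional z-scored column adds 20 elements.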

Reward Function

Reward = excess daily return over 3m T-bill, minus transaction cost on switches, scaled by inverse 21d realised volatility to penalise drawdown-prone positions. This replaces the paper's raw P&L reward with a risk-adjusted signal aligned with Sharpe Ratio maximisation.
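One way to read the reward definition is sketched below. Several details are assumptions rather than statements from the source: the T-bill rate is taken as an annualised percentage converted to a daily rate over 252 trading days, the transaction cost is a flat 5 bps charged only on a switch, volatility is in daily units, and the cost is subtracted before the volatility scaling. The function name step_reward and the vol_floor guard are hypothetical.

```python
def step_reward(
    asset_return,        # daily simple return of the held asset, e.g. 0.004
    tbill_annual_pct,    # 3-month T-bill rate in percent per year, e.g. 5.04
    realised_vol_21d,    # 21-day realised volatility, assumed in daily units
    switched,            # True if the position changed on this step
    cost_bps=5.0,        # assumed flat cost per switch, in basis points
    trading_days=252,    # assumed day-count for de-annualising the T-bill rate
    vol_floor=1e-4,      # guard against division by near-zero volatility
):
    """Excess return over T-bill, net of switching cost, scaled by inverse volatility."""
    rf_daily = tbill_annual_pct / 100.0 / trading_days
    excess = asset_return - rf_daily
    cost = cost_bps / 10_000.0 if switched else 0.0
    return (excess - cost) / max(realised_vol_21d, vol_floor)


# Same market move with and without a switch on that day.
r_switch = step_reward(0.004, 5.04, 0.01, switched=True)
r_hold = step_reward(0.004, 5.04, 0.01, switched=False)
```

With these inputs the same 0.4% daily return is worth less when it required a switch, and the same excess return earns a smaller reward when 21-day volatility is higher.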

Training

Data is split 80/10/10 (train/val/test) from the user-selected start year to the present. The best weights are saved by validation-set Sharpe Ratio. The agent uses Double DQN (the online network selects the action, the frozen target network evaluates it) to reduce Q-value overestimation, a known source of instability in financial RL applications. Training uses an experience replay buffer of 100k transitions, a hard target-network update every 500 steps, and ε-greedy exploration with ε decaying from 1.0 to 0.05 over the first 50% of training.
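The training mechanics above can be sketched without a deep-learning framework. The linear shape of the ε decay and the discount factor gamma are assumptions (the text gives only the endpoints and the decay horizon), and all function names are hypothetical.

```python
def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.05, decay_fraction=0.5):
    """Exploration rate at a given step; linear decay is an assumption of this sketch."""
    decay_steps = max(1, int(total_steps * decay_fraction))
    progress = min(step / decay_steps, 1.0)
    return eps_start + progress * (eps_end - eps_start)


def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best]


def should_sync_target(step, every=500):
    """Hard update: copy online weights into the target network every `every` steps."""
    return step > 0 and step % every == 0


# Hypothetical next-state Q-values from the two networks (3 actions for brevity).
q_online_next = [0.2, 0.9, 0.5]
q_target_next = [0.3, 0.4, 0.8]
target = double_dqn_target(0.1, False, q_online_next, q_target_next, gamma=0.9)
```

In this example the online network prefers action 1, so the target uses the target network's estimate for that action (0.4) rather than its own maximum (0.8). That decoupling of selection from evaluation is what curbs overestimation.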

Risk Controls

A post-signal Trailing Stop Loss overrides the DQN signal to CASH if the 2-day cumulative return of the held ETF breaches the configured threshold. Re-entry from CASH requires the DQN's best-action Z-score to clear the re-entry threshold, ensuring the model has recovered conviction before re-entering risk.
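A minimal sketch of this overlay follows. The threshold values are placeholders for the user-configured settings, the 2-day return is assumed to be compounded, and the Z-score is taken as a precomputed input because its definition is not given here. The sketch also applies the re-entry gate whenever the current position is CASH, which is one possible reading of the rule.

```python
CASH = "CASH"


def two_day_return(r_prev, r_today):
    """Compounded 2-day return from two daily simple returns (compounding is assumed)."""
    return (1.0 + r_prev) * (1.0 + r_today) - 1.0


def apply_risk_controls(
    signal,              # action proposed by the DQN, e.g. "TLT" or "CASH"
    held,                # position currently held
    last_two_returns,    # (previous day, today) returns of the held ETF
    best_action_z,       # Z-score of the DQN's best action (definition not given)
    stop_threshold=-0.03,  # placeholder for the configured stop-loss level
    reentry_z=1.0,         # placeholder for the configured re-entry level
):
    """Return the final position after the stop-loss and re-entry checks."""
    if held != CASH:
        if two_day_return(*last_two_returns) <= stop_threshold:
            return CASH          # stop-loss overrides the DQN signal
        return signal
    if signal != CASH and best_action_z < reentry_z:
        return CASH              # conviction too low to re-enter risk
    return signal


stopped = apply_risk_controls("TLT", "TLT", (-0.02, -0.015), best_action_z=2.0)
blocked = apply_risk_controls("TLT", CASH, (0.0, 0.0), best_action_z=0.5)
reentered = apply_risk_controls("TLT", CASH, (0.0, 0.0), best_action_z=1.5)
```

Here the first call is stopped out because the compounded 2-day loss of about 3.5% breaches the placeholder threshold. The second stays in CASH on weak conviction, and the third re-enters.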