nflow-architecture-spec / docs /LOOPS_AND_RECURRENCE.md

docs: LOOPS_AND_RECURRENCE.md — how LSTM/GRU/Mamba/scan work in nFlow (DAG-based, explicit unrolling + nn.scan pattern)

fbb55cc verified 5 days ago

5.05 kB

	# Loops, Recurrence, and Sequential Models in nFlow

	## The Core Principle: DAGs Only, No Cycles

	nFlow's graph IR is a directed acyclic graph (DAG). This is not a limitation —
	it is the correct model for deep learning computation graphs. PyTorch, TensorFlow,
	ONNX, and JAX all use DAGs. Cycles in a computation graph don't have a well-defined
	execution order and can't be differentiated or compiled.

	What looks like a "loop" in RNNs is always one of two things:
	1. Explicit unrolling — you wire T time steps manually (classical RNNs)
	2. Scan/fold operation — a single node that maps over a sequence (modern approach)

	---

	## Pattern 1: LSTM Cell (Single Step)

	`nn.lstm_cell` — computes one time step of an LSTM:
	- Inputs: `x` [B, input_size], `h` [B, hidden_size], `c` [B, hidden_size]
	- Outputs: `h_new` [B, hidden_size], `c_new` [B, hidden_size]

	To model a 3-step LSTM in nFlow:
	```
	input[0] → lstm_cell → h1, c1
	↑ h0, c0
	input[1] → lstm_cell → h2, c2
	↑ h1, c1
	input[2] → lstm_cell → h3, c3
	↑ h2, c2
	```
	You wire 3 `nn.lstm_cell` nodes in sequence. The "loop" is explicit at the graph level.
	This is exactly how `torch.nn.LSTMCell` works — and it's perfectly expressible as a DAG.

	When to use: Fixed-length sequences known at model-design time. Clean, simple,
	easy to debug. Good for teaching, small models, and architectures with specific step counts.

	---

	## Pattern 2: Scan / Fold Over Sequence (Recommended)

	`nn.scan` — maps a cell over a variable-length sequence:
	- Inputs: `xs` [B, T, input_size], `h0` [B, hidden_size], `c0` [B, hidden_size]
	- Outputs: `hs` [B, T, hidden_size], `h_T` [B, hidden_size], `c_T` [B, hidden_size]
	- Attrs: `cell: str` — uid of the cell composite to scan over (e.g. `"user.lstm_cell"`)

	This is equivalent to JAX's `lax.scan`, Haskell's `scanl`, and how modern implementations
	of RWKV, Mamba, and Linear RNNs are expressed.

	```
	xs [B,T,D] → nn.scan(cell="user.lstm_cell") → hs [B,T,H]
	↑ h0, c0 h_T, c_T
	```

	One node represents the entire recurrence. This is how `torch.nn.LSTM` and Mamba's SSM
	scan are compiled internally. The scan node exports to:
	- PyTorch: `nn.LSTM(...)` or a `torch.func.scan` / manual for-loop
	- Keras: `keras.layers.RNN(cell, return_sequences=True)`
	- ONNX: `ONNX::Loop` / `ONNX::Scan` operators

	When to use: Variable-length sequences, production models, Mamba/RWKV/S4/linear
	attention. The scan node abstracts away the loop entirely.

	---

	## Pattern 3: Custom SSM (Mamba-style)

	For Mamba, S4, RWKV, and similar linear recurrences (`h_t = A·h_{t-1} + B·x_t`):

	Build a `UserComposite` for the SSM kernel:
	1. Create a `user.mamba_ssm` composite in the Custom Nodes panel
	2. Wire the inner graph: `dt, A, B, C, x → selective_scan → y`
	3. The selective scan is itself a `nn.scan` over a custom cell

	The `.nfl` file carries the `paper_url: "https://arxiv.org/abs/2312.00752"` so
	collaborators always know the source architecture.

	---

	## Pattern 4: Bidirectional RNNs

	Wire two scan nodes:
	```
	xs → scan_forward → hs_fwd [B,T,H] ──┐
	xs → scan_backward → hs_bwd [B,T,H] ──┤→ concat → output
	```
	`scan_backward` can use the same cell with `direction: "backward"` attr.

	---

	## Why Not Explicit Cycles?

	Some graph editors allow "feedback edges" that create cycles. This is fundamentally
	broken for ML:

	1. No well-defined topo order → can't run inference, can't compute shapes
	2. Not differentiable as written (need time-unrolling to get gradients)
	3. ONNX, PyTorch JIT, TensorRT don't support cyclic graphs — you'd need a
	special lowering pass that nFlow avoids by design

	The DAG-based approach (explicit unrolling + scan) is used by every production ML
	framework and compiler. nFlow follows this industry standard.

	---

	## Quick Reference

	\| Architecture \| nFlow Pattern \|
	\|---\|---\|
	\| LSTM (fixed T steps) \| T × `nn.lstm_cell` wired in sequence \|
	\| LSTM (variable length) \| 1 × `nn.scan(cell="user.lstm_cell")` \|
	\| GRU \| `nn.gru_cell` or `nn.scan(cell="user.gru_cell")` \|
	\| Mamba SSM \| `user.mamba_ssm` composite with `nn.scan` inside \|
	\| RWKV \| `user.rwkv_block` composite with `nn.scan` inside \|
	\| Transformer (no recurrence) \| Standard DAG — MHA + FFN blocks \|
	\| Bidirectional \| Two parallel `nn.scan` nodes + concat \|
	\| Temporal Conv (TCN) \| Standard conv nodes — fully DAG-compatible \|

	---

	## Implementing `nn.scan` in nFlow

	The `nn.scan` primitive is registered in `nflow-ops/src/prims.rs`. It:
	- Declares attrs: `cell: str` (the composite uid to scan), `axis: int = 1`
	- Shape inference: reads the cell's declared `ports_out`, infers sequence output shape
	- Codegen: emits `nn.LSTM` / `nn.GRU` for built-in cells; `torch.func.scan` or
	manual loop for custom cells
	- Execution: the CPU evaluator calls `infer_user_composite` iteratively for each step

	This is the recommended way to add any sequential model to nFlow.