Spaces:

Jackoatmon
/

feather-runtime

Runtime error

App Files Files Community

feather-runtime / overlay /htm_rust /docs /GPU_HTM.md

Jackoatmon

Update Feather H200 runtime: Nemotron streaming and HTM force-CPU canary fixes

c2bf4b6 verified 9 days ago

preview code

raw

history blame contribute delete

13.6 kB

	# GPU HTM Backend

	## Status

	**FUSED MEGAKERNEL: entire T-timestep SP+TM forward collapsed into a single
	CUDA launch per forward pass.**

	* Legacy path: 12 kernels × T=2048 timesteps = 24K launches per forward.
	* Fused path: 1 launch per forward (24000× launch-overhead reduction).
	* End-to-end training throughput: ~2.7k → ~60k tok/sec (~22x speedup).
	* Fused path uses per-column threshold inhibition instead of global top-K
	(see §Fused Kernel below — this is a real architectural change).

	## Fused Kernel

	### Why

	Global top-K column selection requires cross-block synchronization at every
	timestep. On WSL2/sm_86 without `-rdc=true`, `cooperative_groups::grid_sync()`
	is unreliable. Without a grid sync, collapsing the T-loop into one kernel is
	impossible, so every forward pays 12×T kernel launches and 90%+ of runtime is
	CUDA launch overhead + small-kernel tails.

	### How

	Replace global top-K with per-column threshold activation:

	is_active[c] = (overlap[c] * boost[c]) > inhibition_threshold[c]

	`inhibition_threshold[c]` is a per-column scalar, learned via EMA update:

	err = active_duty[c] - sparsity_target
	new_thr = clamp(thr + thr_adapt_rate * err * 100, 0.1, 1000)

	This is biologically grounded (GABAergic local lateral inhibition in
	neocortical columns) and supported by HTM theory. The duty-cycle-driven
	feedback loop was already present; we simply redirect its output to drive
	activation threshold instead of multiplicative boost. The global top-K,
	which had no biological basis, is removed.

	### Cross-block coherence

	- Ping-pong bitsets for `cell_active_bits` and `cell_winner_bits`: at
	even t write to `_a`, read from `_b`; at odd t reversed. This eliminates
	the need for an in-place snapshot kernel between timesteps.
	- Primary path: cooperative launch + hardware grid sync. Host code probes
	`CU_DEVICE_ATTRIBUTE_COOPERATIVE_LAUNCH`, computes the cooperative whole-grid
	residency limit from occupancy, and launches the fused megakernel with
	`cuLaunchCooperativeKernel`. In-kernel barriers use
	`cooperative_groups::this_grid().sync()`.
	- Fallback path: software grid barrier via a 3-slot atomic counter array
	(`barrier_counters`). This remains as a compatibility fallback when
	cooperative launch is unavailable.
	- Launch invariant: cooperative launch is capped to the hardware residency
	limit for `blockDim.x = 1024`; software fallback remains capped conservatively
	(`HTM_FUSED_GRID_CAP`, default 8) to avoid whole-grid spin deadlock.

	### Kernel structure

	```
	for t in 0..T:
	# Phase 0: clear curr_active/curr_winner for my column range
	grid_barrier()
	# Phase A: SP overlap → boost → threshold → SP learn → duty + threshold EMA
	grid_barrier()
	# Phase B: TM predict (per cell, per seg) → TM learn (reinforce on match)
	# → burst if none predicted → segment grow/reinforce
	grid_barrier()
	# Phase C: block 0 writes anomaly[t]
	```

	Each warp owns a contiguous slice of columns. At grid=24 blocks × 32 warps =
	768 warps, n_columns=2048 → 2-3 columns per warp.

	### Parity with legacy GPU path

	Semantics diverge. Legacy: exactly `k = round(sparsity * n_cols)` columns
	active per step. Fused: variable, converging to `sparsity * n_cols` on
	average via the per-column EMA. Anomaly decay on repeating sequences is
	preserved (see `gpu_fused_tm_anomaly_decays_on_repeating_sequence` test).

	This is an intentional architectural change committed under
	`no-bypass/full-architecture` per program.md rules. The legacy top-K path
	(`step_many_cuda`) remains available for reference and can be re-enabled via
	`HYDRA_HTM_FUSED=0`.

	### Tests

	- `gpu_threshold_converges_to_sparsity` (tests.rs): 1000-step warmup on
	random SDRs, then measure mean active cols/step on next 200 steps. Must
	land within [0.25×, 4×] of `sparsity_target * n_cols`.
	- `gpu_fused_tm_anomaly_decays_on_repeating_sequence`: feed A,B,C repeating
	for 300 steps. Late anomaly must be < early anomaly AND < 0.5.

	## Legacy Pipeline (kept for fallback)

	* SP: 5 kernels, bit-identical parity with CPU under strict-parity mode.
	* TM: 7 kernels, relaxed-parity with CPU.
	* Speedup at training size (B=8, T=2048, bits=16384): 3.83x vs CPU.

	## Building

	CPU-only (default, zero CUDA dep):
	```bash
	cargo build --release
	```

	GPU-enabled:
	```bash
	export PATH=/usr/local/cuda-12.1/bin:$PATH
	export LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH
	export HTM_PTX_VERSION=7.8 # lower if driver older than nvcc
	cargo build --release --features gpu
	cargo test --release --features gpu --lib # fused path includes cooperative launch + grid-sync tests

	# Python wheel:
	maturin develop --release --features gpu --manifest-path htm_rust/Cargo.toml
	```

	## Architecture

	### Module layout
	```
	src/gpu/
	mod.rs # HTMRegionGpu pyclass + step_many_gpu (full pipeline)
	sp_gpu.rs # Persistent SP device buffers + step_batch_with_tm
	tm_gpu.rs # Persistent TM device buffers + step (predict→activate→learn)
	tests.rs # CPU-vs-GPU SP parity + end-to-end TM anomaly decay
	kernels/
	sp_overlap.cu # per-column overlap reduction
	sp_topk.cu # k-WTA top-K winner selection
	sp_learn.cu # Hebbian +inc/-dec on proximal synapses
	sp_duty.cu # EMA duty-cycle update
	sp_boost_fused.cu # fused mean + exp boost (GPU-side)
	tm_reset.cu # per-step: snapshot active→prev, clear buffers
	tm_predict.cu # per-cell: score owned segments vs prev_active_bits
	tm_activate.cu # per-col: activate predicted cells OR burst
	tm_learn.cu # per-cell: reinforce correctly-predicted segments
	tm_punish.cu # per-cell: decay matching segs on inactive cols
	tm_grow.cu # per-bursting-col: reuse matching seg OR create new,
	# grow synapses to prev_winners
	tm_anomaly.cu # per-step: unpredicted/active ratio
	```

	### Persistent SP state (per region, unchanged from Phase 1)
	At n_cols=2048, S=40, bits=16384: ~355 KB persistent + ~90 KB transient.

	### Persistent TM state (per region)

	Capacity knobs (configured in `tm_gpu.rs`):
	- `MAX_SEGMENTS_PER_CELL = 4`
	- `MAX_SYN_PER_SEGMENT = 20`

	At cells_per_col=32, n_cols=2048:
	- `n_cells = 65_536`
	- `n_segments_max = 262_144` (~262K)
	- `n_synapses_max = 5_242_880` (~5.2M)

	\| Buffer \| Shape / type \| Notes \|
	\|-----------------------\|----------------------\|----------------------------------------\|
	\| `seg_cell_id` \| (n_segs,) u32 \| owning cell; U32_MAX = unused \|
	\| `seg_syn_count` \| (n_segs,) u32 \| #active synapses in slot \|
	\| `syn_presyn` \| (n_segs × S,) u32 \| presynaptic cell indices \|
	\| `syn_perm` \| (n_segs × S,) i16 \| permanence scaled 0..32767 (0.0..1.0) \|
	\| `cell_seg_count` \| (n_cells,) u32 \| segments allocated on each cell \|
	\| `cell_active_bits` \| (n_cells/32,) u32 \| packed bitset, current step \|
	\| `cell_winner_bits` \| (n_cells/32,) u32 \| packed bitset, current step \|
	\| `cell_predictive_bits`\| (n_cells/32,) u32 \| set by predict, read by activate \|
	\| `prev_active_bits` \| (n_cells/32,) u32 \| snapshot at step start \|
	\| `prev_winner_bits` \| (n_cells/32,) u32 \| snapshot at step start \|
	\| `col_predicted` \| (n_cols,) u8 \| set if any cell in col is predictive \|
	\| `col_best_match` \| (n_cols,) u32 \| packed (pot<<21 \| seg_id), atomicMax \|
	\| `seg_num_active_conn` \| (n_segs,) u32 \| output of predict \|
	\| `seg_num_active_pot` \| (n_segs,) u32 \| output of predict \|
	\| `unpredicted_count` \| (1,) u32 \| atomic counter for anomaly \|
	\| `burst_cols_flat` \| (n_cols,) u32 \| list of bursting cols \|
	\| `burst_cols_count` \| (1,) u32 \| length of above list \|

	Total per TM region: ~42 MB. Batch of 8 regions: ~340 MB. Fits 6 GB RTX 3060.

	### Per-step pipeline (single iteration of `step_batch_with_tm`)

	```
	SP side TM side
	--------- ---------
	1. D2D input slice → inp_dev
	2. sp_overlap (n_cols blocks)
	3. sp_topk (1 block)
	4. sp_learn (n_cols blocks)
	5. sp_duty (n_cols/256 blocks)
	6. sp_boost_fused (1 block)
	7. D2D active_mask → cols_dev[ti]
	8. tm_reset_step (ceil(n_cells/32/256))
	9. tm_predict (n_cells blocks × 32 thr)
	10. tm_activate (n_cols/256 blocks)
	11. tm_anomaly (1 block)
	if learn:
	12. tm_learn (n_cells blocks)
	13. tm_punish (n_cells blocks)
	14. tm_grow (n_cols blocks — early-exits)
	```

	No host sync in the T-step loop. At the end one `dtoh_sync_copy` each for
	`cols_dev` (T × n_cols bytes) and `anom_dev` (T × f32).

	## Parity

	### SP: strict bit-identical
	See Phase 1 docs — `gpu_sp_matches_cpu_with_learn` over 50 steps passes exact.

	### TM: relaxed-parity
	The GPU TM has known, deliberate deviations from CPU to admit massive parallelism:

	1. Bursting winner cell: CPU picks the least-used cell (fewest segments) with
	random tiebreak. GPU picks cell 0 of the column (deterministic, branch-free).
	Learning dynamics are preserved because segment creation/reinforcement is
	the dominant effect, not which specific cell in a bursting column wins.

	2. Permanence storage: i16 fixed-point (scale 32767) vs f32. Rounding
	differs by <=1 ULP of the scale (~3.0e-5), below any meaningful learning
	quantum (inc=0.10, dec=0.10, predicted_segment_dec=0.10).

	3. Grown synapse candidate order: CPU randomly samples from prev_winner_cells.
	GPU iterates prev_winner_bits words in a pseudo-random rotated order keyed
	by (bursting_col_idx, iter_seed). Output is a different subset but same size.

	4. Segment LRU eviction: CPU tracks `last_used_iteration` per segment.
	GPU wraps around (slot = count % max_segments_per_cell). In the autoresearch
	loop where TM resets every forward, eviction rarely triggers.

	The GPU parity test (`gpu_tm_anomaly_decays_on_repeating_sequence`) feeds a
	repeating A,B,C sequence and asserts anomaly decays: 1.000 early → 0.000 late.

	## Bottleneck Analysis

	\| Source \| Cost/step (B=8 T=2048) \|
	\|----------------------------------\|-------------------------:\|
	\| 14 kernel launches \| ~70 μs \|
	\| ~262K predict/learn/punish blocks\| ~2.5 ms \|
	\| No D2H until end-of-batch \| 0 μs \|
	\| Final D2H (T × n_cols + T × f32) \| ~200 μs per region \|

	Per-step wall time at B=8 T=2048:
	- CPU (reference): ~11.4 ms / step
	- GPU (current): ~2.98 ms / step
	- Speedup: 3.83x

	## End-to-End Training Benchmark

	Config: B=8, T=2048, vocab=8192, 60-second time budget, full HYDRA stack
	(SDR Semantic + HTM + Mamba-3 + Engram + mHC + Hestia QAT).

	Results:
	- GPU util: 97-98% sustained
	- VRAM: 5.4 GB / 6.0 GB (90% utilisation)
	- Steps completed: 16
	- tok/sec: ~2,200-2,500 (stable post-warmup)
	- Final val_bpb: 2.249 (from ~3.1 initial)
	- Factual eval: 1/9 hits

	Compared to previous CPU-HTM baseline (~100 tok/s), the full-GPU HTM delivers
	~22x end-to-end throughput — far above the 3-10x target.

	## Bench Commands

	```bash
	source .venv/bin/activate
	export LD_LIBRARY_PATH=/usr/lib/wsl/lib:/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH

	# Microbench
	B=8 T=2048 python htm_rust/bench_gpu.py

	# Full training
	HYDRA_TIME_BUDGET=60 HYDRA_BATCH_SIZE=8 HYDRA_TOTAL_BATCH=32768 python -u train.py
	```

	## Known Limitations / Future Work

	- Segment-compacted launches: predict/learn/punish iterate all n_cells
	blocks, using `cell_seg_count` to skip empty cells. A compacted live-cell
	list would shave another ~40% of launch overhead.
	- Winner selection: currently cell 0 of bursting col. Proper least-used
	selection would help stability of cross-column patterns.
	- Single CUDA stream per region: with B=8 regions we serialise on stream 0.
	Multi-stream would lift the ~20% launch overhead at small batch sizes.
	- Permanence bump on chronically under-stimulated columns: SP's strict-parity
	bump is not mirrored on GPU fast path. Effect on long runs needs measurement.
	- `seg_num_active_conn` output is reused across reinforce + punish: the two
	kernels each launch n_cells blocks. They could be fused into one for one fewer
	kernel launch per step.

	## Files

	- `htm_rust/build.rs` — nvcc-driven PTX compilation, 12 kernels.
	- `htm_rust/Cargo.toml` — `gpu` feature flag, cudarc dep.
	- `htm_rust/src/gpu/mod.rs` — `HTMRegionGpu` pyclass + `step_many_gpu`.
	- `htm_rust/src/gpu/sp_gpu.rs` — SP state + `step_batch_with_tm`.
	- `htm_rust/src/gpu/tm_gpu.rs` — TM state + `step`.
	- `htm_rust/src/gpu/tests.rs` — parity + correctness tests.
	- `htm_rust/src/gpu/kernels/*.cu` — 5 SP + 7 TM kernels.
	- `htm_rust/bench_gpu.py` — CPU-vs-GPU microbench.
	- `subsystems/htm.py` — transparent GPU/CPU backend selection in `HTMLayer`.