perf(htm): T9 input-tile memcpy_async to cluster smem (bandwidth reduction) 09aa6be verified icarus112 commited on Apr 20
fix(htm): add cluster.sync between Stage A writes and next-timestep reads (T8 bimodal fix) b4f024c verified icarus112 commited on Apr 20
perf(htm): cluster distributed shared memory for inhib_thr/boost/active_duty (T8) 321ed28 verified icarus112 commited on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 9b51027 verified icarus112 commited on Apr 19
perf(htm): DLB software grid barrier + non-cooperative launch (lifts 132-SM cap) d09ca5e verified icarus112 commited on Apr 19
perf(htm): batched cooperative kernel — B=8 regions in ONE launch via blockIdx.y indexing 20577da verified icarus112 commited on Apr 19