perf(htm): T9 input-tile memcpy_async to cluster smem (bandwidth reduction) 09aa6be verified icarus112 committed on Apr 20
fix(htm): add cluster.sync between Stage A writes and next-timestep reads (T8 bimodal fix) b4f024c verified icarus112 committed on Apr 20
perf(htm): cluster distributed shared memory for inhib_thr/boost/active_duty (T8) 321ed28 verified icarus112 committed on Apr 19
fix(htm): set NON_PORTABLE_CLUSTER_SIZE_ALLOWED=1 for cluster_size=16 19dee83 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 278f184 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 73e6160 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 7721a60 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 9b51027 verified icarus112 committed on Apr 19
perf(htm): DLB software grid barrier + non-cooperative launch (lifts 132-SM cap) 5c751ce verified icarus112 committed on Apr 19
perf(htm): DLB software grid barrier + non-cooperative launch (lifts 132-SM cap) d09ca5e verified icarus112 committed on Apr 19
perf(htm): batched cooperative kernel: B=8 regions in ONE launch via blockIdx.y indexing 20577da verified icarus112 committed on Apr 19
perf(htm): batched cooperative kernel: B=8 regions in ONE launch via blockIdx.y indexing bd5981d verified icarus112 committed on Apr 19
perf(htm): batched cooperative kernel: B=8 regions in ONE launch via blockIdx.y indexing b4bfe98 verified icarus112 committed on Apr 19
perf(htm): remove per-call dev.sync, cache SDR, device_sync once per step 4ef1558 verified icarus112 committed on Apr 19