fix: use wte.weight.shape instead of undefined config var 7d0a5d8 verified icarus112 committed on Apr 20
fix: lm_head init std=0.02 (was 0.001, pathological at V=65k) + retina HF cache 1627611 verified icarus112 committed on Apr 20
fix: lm_head init std=0.02 (was 0.001, pathological at V=65k) + retina HF cache b8e2ac7 verified icarus112 committed on Apr 20
fix(mid_val): use prepare.get_token_bytes(device=cuda) + model(x,y,reduction=none) API ea7be5a verified icarus112 committed on Apr 20
feat(sdr_retina): streaming Nemotron path when HYDRA_USE_NEMOTRON=1 c13016b verified icarus112 committed on Apr 20
fix(tokenizer): use rustbpe.train_from_iterator API; bump vocab to 65536 89bd6c2 verified icarus112 committed on Apr 20
fix(tokenizer): use rustbpe.train_from_iterator API; bump vocab to 65536 ea3cc17 verified icarus112 committed on Apr 20
perf(nemotron): 2-stage prefetch pipeline (HF -> tokenizer -> packer) zero-tps-loss f4757c2 verified icarus112 committed on Apr 20
feat(nemotron): streaming pretraining loader for Specialized-v1.1 (Super3 recipe) 94eabe0 verified icarus112 committed on Apr 20
feat(nemotron): streaming pretraining loader for Specialized-v1.1 (Super3 recipe) fc6373a verified icarus112 committed on Apr 20
feat: add ppl to per-step log + MID_VAL every 500 steps for learnability visibility 565fb9e verified icarus112 committed on Apr 20
perf(htm): T9 input-tile memcpy_async to cluster smem (bandwidth reduction) 09aa6be verified icarus112 committed on Apr 20
fix(htm): add cluster.sync between Stage A writes and next-timestep reads (T8 bimodal fix) b4f024c verified icarus112 committed on Apr 20
feat(triton-cache): wire setup/teardown into entrypoint 6ef2d06 verified icarus112 committed on Apr 20
feat(triton-cache): HF Hub-backed compilation cache persistence 313e3b0 verified icarus112 committed on Apr 20
perf(htm): cluster distributed shared memory for inhib_thr/boost/active_duty (T8) 321ed28 verified icarus112 committed on Apr 19
fix(htm): slice regions[:B] to handle eval batch size differing from training 230b3b5 verified icarus112 committed on Apr 19
fix(htm): set NON_PORTABLE_CLUSTER_SIZE_ALLOWED=1 for cluster_size=16 19dee83 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 278f184 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 73e6160 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 7721a60 verified icarus112 committed on Apr 19
perf(htm): Hopper cluster::sync hardware barrier + sm_90a + cluster launch attr 9b51027 verified icarus112 committed on Apr 19
perf(htm): DLB software grid barrier + non-cooperative launch (lifts 132-SM cap) 5c751ce verified icarus112 committed on Apr 19
perf(htm): DLB software grid barrier + non-cooperative launch (lifts 132-SM cap) d09ca5e verified icarus112 committed on Apr 19
perf(htm): batched cooperative kernel: B=8 regions in ONE launch via blockIdx.y indexing 20577da verified icarus112 committed on Apr 19
perf(htm): batched cooperative kernel: B=8 regions in ONE launch via blockIdx.y indexing bd5981d verified icarus112 committed on Apr 19
perf(htm): batched cooperative kernel: B=8 regions in ONE launch via blockIdx.y indexing b4bfe98 verified icarus112 committed on Apr 19
perf(htm): batched cooperative kernel: B=8 regions in ONE launch via blockIdx.y indexing 772ee76 verified icarus112 committed on Apr 19
perf(htm): thread-pool dispatch of B regions concurrently + HTM_FUSED_GRID_CAP=16 for kernel concurrency 8ff9654 verified icarus112 committed on Apr 19
perf(htm): remove per-call dev.sync, cache SDR, device_sync once per step 4ef1558 verified icarus112 committed on Apr 19
perf(htm): remove per-call dev.sync, cache SDR, device_sync once per step c9d5ed8 verified icarus112 committed on Apr 19
perf(htm): remove per-call dev.sync, cache SDR, device_sync once per step 020c87a verified icarus112 committed on Apr 19
patch: 7 triton 3.4 compat fixes (3 tl.sum + 3 bwd desc.dtype + 3 fwd desc.dtype) 8bec152 verified icarus112 committed on Apr 19
patch: add bf16 casts on TMA descriptor stores (triton 3.4 strict dtype match) ec7cc74 verified icarus112 committed on Apr 19
compat: triton 3.4 M,N,K>=16 patches for mamba3 SISO kernels (tl.dot -> tl.sum for sum-reductions) 4b6a863 verified icarus112 committed on Apr 19