Kernels
activation / torch-ext

Commit History

Merge pull request #22 from MotifTechnologies/jangwoong/mla-rope-fa4-port
5adea7d

Jangwoong Kim committed

cleanup: drop k_pe RoPE custom kernel (caller uses PyTorch native)
7e86d2e

3v324v23 and Claude Opus 4.6 (1M context) committed

refactor: replace warp shuffle with CUB BlockReduce
79a877a

wyldecat and Claude Opus 4.6 (1M context) committed

fix: unify all backward kernels to input-based math + fix test import
09ecd67

wyldecat and Claude Opus 4.6 (1M context) committed

style: fix yapf/isort/clang-format for CI --all-files
9dcee96

wyldecat and Claude Opus 4.6 (1M context) committed

feat: update RMSNorm Python interface for optimized kernels
4bb42a5

wyldecat and Claude Opus 4.6 (1M context) committed

perf: optimize RMSNorm CUDA kernels for all dims
dc88599

wyldecat and Claude Opus 4.6 (1M context) committed

feat: dedicated _kv_rope_bwd_kernel (register-sum + copy-fused)
35a25ee

3v324v23 and Claude Opus 4.6 (1M context) committed

perf: remove autotune, hard-code per-kernel configs from live dump
1e2bc2b

3v324v23 and Claude Opus 4.6 (1M context) committed

cleanup: remove dead Phase 3 Q kernel + shrink autotune to hand-picked configs
4d94a7d

3v324v23 and Claude Opus 4.6 (1M context) committed

review fixups: stride asserts, autotune split, intent comments
2712745

3v324v23 and Claude Opus 4.6 (1M context) committed

feat: MLA RoPE Triton kernels (port from llm-training)
f61868b

3v324v23 and Claude Opus 4.6 (1M context) committed

style: fix yapf/isort formatting for CI --all-files check
3f2678c

wyldecat and Claude Opus 4.6 (1M context) committed

grouped polynorm with padding aware (#19)
972d63b

TaehyunKim committed

chore: remove pre-built binaries and add local build loader shim (#18)
1e08296

wyldecat and Claude Opus 4.6 (1M context) committed

style: apply yapf, isort, and clang-format
6436ad6

wyldecat and Claude Opus 4.6 (1M context) committed

style: fix clang-format on torch_binding.h
344ed39

wyldecat and Claude Opus 4.6 (1M context) committed

fix: rename stale references and clean up Triton remnants
5a9d09d

wyldecat and Claude Opus 4.6 (1M context) committed

refactor: remove Triton kernels, add hidden_clamp to unscored ops
906e125

wyldecat and Claude Opus 4.6 (1M context) committed

feat: add grouped poly norm CUDA kernel with scores and hidden_clamp fusion
0045757

wyldecat and Claude Opus 4.6 (1M context) committed

refactor: rename grouped_fused_mul_poly_norm → fused_mul_grouped_poly_norm
60a628a

wyldecat and Claude Opus 4.6 (1M context) committed

feat: add GroupedFusedMulPolyNorm Triton kernel for MoE models (#16)
e195bbb

TaehyunKim, Claude Opus 4.6, and github-actions[bot] committed

fix: support PyTorch 2.10 register_op_strategy import path change
ad23c2a

wyldecat and Claude Opus 4.6 committed

fix: fix fused add rms norm sharding strategy
a35a092

wyldecat committed

fix: fix rms norm sharding strategy
138159c

wyldecat committed

fix(rms_norm.py): add assertion for input gradients to handle unsupported cases in backward pass
f19f8f4

wyldecat committed

feat: support sequence parallel with fused_add_rms_norm
151bb5a

wyldecat committed

refactor(activation): change fused_add_rms_norm and fused_add_rms_norm_backward to out-place operations
7e4334d

wyldecat committed

refactor(rms_norm): move RMS normalization logic to a new module for better organization and maintainability
66b3c5e

wyldecat committed

feat: support sequence parallel with rms_norm
06d6367

wyldecat committed

feat: make rms_norm as out-place
9d0a235

wyldecat committed

Fix fused add rms norm (#4)
a1e5ca8

TaehyunKim and TaehyunKimMotif committed

Add fusion (#3)
e5e2eeb

TaehyunKim and TaehyunKimMotif committed

feat: support reset_parameters()
605f22e

iamwyldecat committed

feat(rms-norm): Impl fused RMSNorm
f3b99fb

iamwyldecat committed

refactor(poly-norm): use const for immutable args
e85ecc9

iamwyldecat committed

chore: use latest build image and misc
f5a7d38

iamwyldecat committed

feat(poly-norm): add default value for eps argument
afd2a56

iamwyldecat committed

feat(poly-norm): Add PolyNorm
44e9845

iamwyldecat committed