activation / torch-ext

Commit History

Merge pull request #22 from MotifTechnologies/jangwoong/mla-rope-fa4-port

5adea7d
unverified

Jangwoong Kim committed on Apr 15

cleanup: drop k_pe RoPE custom kernel (caller uses PyTorch native)

7e86d2e

3v324v23 Claude Opus 4.6 (1M context) committed on Apr 14

refactor: replace warp shuffle with CUB BlockReduce

79a877a

wyldecat Claude Opus 4.6 (1M context) committed on Apr 14

fix: unify all backward kernels to input-based math + fix test import

09ecd67

wyldecat Claude Opus 4.6 (1M context) committed on Apr 13

style: fix yapf/isort/clang-format for CI --all-files

9dcee96

wyldecat Claude Opus 4.6 (1M context) committed on Apr 13

feat: update RMSNorm Python interface for optimized kernels

4bb42a5

wyldecat Claude Opus 4.6 (1M context) committed on Apr 13

perf: optimize RMSNorm CUDA kernels for all dims

dc88599

wyldecat Claude Opus 4.6 (1M context) committed on Apr 13

feat: dedicated _kv_rope_bwd_kernel (register-sum + copy-fused)

35a25ee

3v324v23 Claude Opus 4.6 (1M context) committed on Apr 14

perf: remove autotune, hard-code per-kernel configs from live dump

1e2bc2b

3v324v23 Claude Opus 4.6 (1M context) committed on Apr 14

cleanup: remove dead Phase 3 Q kernel + shrink autotune to hand-picked configs

4d94a7d

3v324v23 Claude Opus 4.6 (1M context) committed on Apr 14

review fixups: stride asserts, autotune split, intent comments

2712745

3v324v23 Claude Opus 4.6 (1M context) committed on Apr 14

feat: MLA RoPE Triton kernels (port from llm-training)

f61868b

3v324v23 Claude Opus 4.6 (1M context) committed on Apr 13

style: fix yapf/isort formatting for CI --all-files check

3f2678c

wyldecat Claude Opus 4.6 (1M context) committed on Apr 13

grouped polynorm with padding aware (#19)

972d63b
unverified

TaehyunKim committed on Apr 9

chore: remove pre-built binaries and add local build loader shim (#18)

1e08296
unverified

wyldecat Claude Opus 4.6 (1M context) committed on Apr 8

style: apply yapf, isort, and clang-format

6436ad6

wyldecat Claude Opus 4.6 (1M context) committed on Apr 6

style: fix clang-format on torch_binding.h

344ed39

wyldecat Claude Opus 4.6 (1M context) committed on Apr 6

fix: rename stale references and clean up Triton remnants

5a9d09d

wyldecat Claude Opus 4.6 (1M context) committed on Apr 6

refactor: remove Triton kernels, add hidden_clamp to unscored ops

906e125

wyldecat Claude Opus 4.6 (1M context) committed on Apr 6

feat: add grouped poly norm CUDA kernel with scores and hidden_clamp fusion

0045757

wyldecat Claude Opus 4.6 (1M context) committed on Apr 6

refactor: rename grouped_fused_mul_poly_norm → fused_mul_grouped_poly_norm

60a628a

wyldecat Claude Opus 4.6 (1M context) committed on Apr 4

feat: add GroupedFusedMulPolyNorm Triton kernel for MoE models (#16)

e195bbb
unverified

TaehyunKim Claude Opus 4.6 github-actions[bot] committed on Mar 6

fix: support PyTorch 2.10 register_op_strategy import path change

ad23c2a

wyldecat Claude Opus 4.6 committed on Feb 19

fix: fix fused add rms norm sharding strategy

a35a092

wyldecat committed on Nov 11, 2025

fix: fix rms norm sharding strategy

138159c

wyldecat committed on Nov 10, 2025

fix(rms_norm.py): add assertion for input gradients to handle unsupported cases in backward pass

f19f8f4

wyldecat committed on Oct 13, 2025

feat: support sequence parallel with fused_add_rms_norm

151bb5a

wyldecat committed on Oct 13, 2025

refactor(activation): change fused_add_rms_norm and fused_add_rms_norm_backward to out-place operations

7e4334d

wyldecat committed on Oct 13, 2025

refactor(rms_norm): move RMS normalization logic to a new module for better organization and maintainability

66b3c5e

wyldecat committed on Oct 13, 2025

feat: support sequence parallel with rms_norm

06d6367

wyldecat committed on Oct 1, 2025

feat: make rms_norm as out-place

9d0a235

wyldecat committed on Oct 1, 2025

Fix fused add rms norm (#4)

a1e5ca8
unverified

TaehyunKim

TaehyunKimMotif committed on Sep 9, 2025

Add fusion (#3)

e5e2eeb
unverified

TaehyunKim

TaehyunKimMotif committed on Sep 8, 2025

Optimize kernel (#2)

97825b8
verified

TaehyunKimMotif committed on Aug 22, 2025

feat: support reset_parameters()

605f22e

iamwyldecat committed on Jun 28, 2025

feat(rms-norm): Impl fused RMSNorm

f3b99fb

iamwyldecat committed on Jun 28, 2025

refactor(poly-norm): use const for immutable args

e85ecc9

iamwyldecat committed on Jun 2, 2025

chore: use latest build image and misc

f5a7d38

iamwyldecat committed on Jun 2, 2025

feat(poly-norm): add default value for eps argument

afd2a56

iamwyldecat committed on May 30, 2025

feat(poly-norm): Add PolyNorm

44e9845

iamwyldecat committed on May 30, 2025

