Commit History

working resume | classic input embed | nGPT logit scaling | XSA | del M3 as_strided

10aee3a

alexandretl committed on Mar 27

FAN MLP | input norm is back | nGPT lm head | geo loss

b9faeb1

alexandretl committed on Mar 17

fix proper scale for CDP | safer Geodesic

0427acd

alexandretl committed on Mar 16

mamba3 inference (correct prefill)

8833933

alexandretl committed on Mar 15

[big refactoring] remove ngram, ddl, stem, vwn, reduce lm head, seednorm, mla, uscaling, derf, old args, old layers, diffattnV1, old mamba3s + MAMBA3 TP AWARE

64e97d9

alexandretl committed on Mar 14

cosnet | complete SLW | mamba3 fast inference | fix position_ids shape | ProRes | proper print0 | optim work

24d333c

alexandretl committed on Mar 12

geodesic norm | shared expert gate | new mg dataset | no more buffers | moe router in fp32

70543c0

alexandretl committed on Mar 5

new TP mamba | derf mamba

f168e50

alexandretl committed on Feb 19

MG equivalence | fast mamba3 | ddl update

702de97

alexandretl committed on Feb 16

ngram embeds | DDL new code + EC | Hyperball (AdamH, AdEMAMixH) | lr_expert

940f633

alexandretl committed on Feb 2

SLW end

b5b44c3

alexandretl committed on Jan 29

DDL

70d8309

alexandretl committed on Jan 28

STEM

fe38bae

alexandretl committed on Jan 27

fix n_kv_heads + pos_id GTPAv2

069edd6

alexandretl committed on Jan 26

M3 MIMO large output fix (see discussion)

e1d8de1

alexandretl committed on Jan 26

FA3 fixes | some refactoring

98a21d2

alexandretl committed on Jan 26

equivalence with MG (updated GTDPA, GDN, M3MIMO ) | some refactoring

87ea3a8

alexandretl committed on Jan 23

Grouped Differential Attn v2 | GDAv2 for TPA

d4bb0ff

alexandretl committed on Jan 22

GDN proj same as in MG

d4b5b99

alexandretl committed on Jan 22

CDP LR expert scaling tests (linear or sqrt, inc) | missing args in config (not critical) |

b690eb3

alexandretl committed on Jan 22

Differential Attn v2 | proper coord check for MoE | proper init for MoE | some refactoring

2f56a15

alexandretl committed on Jan 21

attn layer simplification & IDM (for gpt baseline) | local/global removed | CDP coord check | CDP fix lm_head init scaling |

d10acb2

alexandretl committed on Jan 19

vwn start

d431617

alexandretl committed on Jan 13

coordcheck v2 (gqa paper) | some fixes relative to previous rehaul

e4fd764

alexandretl committed on Jan 13

big overhaul (simplication, removal of unused things)

9fd69c3

alexandretl committed on Jan 12

some prints and imports fixes

6686210

alexandretl committed on Jan 12

CompletedP | memory+norm logging | proper MoE with ScatterMoE, update bias, Latent-MoE | Muon experiments | VE for Mamba3 | fix torch recompiles during varlen training

b9f197c

alexandretl committed on Jan 2

Value Embeddings (attn & gdn) | MoE (dont use this one)

79c75e5

alexandretl committed on Dec 8, 2025

alpha normalize ademamix | mamba norms and gate | VWN | wnorm (nemotron-flash) | MG equivalence | fix IDM config saving | CCAv2 | MoBA | reduce lm head

d79da9a

alexandretl committed on Dec 3, 2025

head tying | gated mlp | gate of Mamba3 inside module

3b164a1

alexandretl committed on Nov 10, 2025

mamba3 flags | mamba3 default state size to 128, headdim to 64 | mamba2 | fix mamba3 mimo (JG) | (fake) moe | intra doc maskiiiing (with SS) | seednorm tests | coord checks

58b82e2

alexandretl committed on Nov 7, 2025

MLA | KDA | TPA | GDA | ResFormer | Mamba3 | DragonMimo (WIP) | tokenshift | SeeDNorm | shrink DA/GDN | gate shared across all block types |

bc8288b

alexandretl committed on Nov 4, 2025

CCE | Gate attn | ZCG | RoPE GDN | GQA GDN | uniconv GDN ||CCA | NSA | PLT (not tested) | DMA fix | SWR

959cbe5

alexandretl committed on Oct 22, 2025

Update training_dragon.py

8c51dce
verified

alexandretl committed on Oct 16, 2025

fix massi bugs prod

9dde504
verified

alexandretl committed on Oct 16, 2025

three conv (for now), works with main MG branch

6296289

alexandretl committed on Oct 13, 2025

uscaling training | resume training | slw training | eval loss training | fix data offset training | DSA & DMA (testing) | Qwen3Next-like arch

19e6554

alexandretl committed on Oct 13, 2025

fixes+refactoring

b94a4d0

alexandretl committed on Oct 2, 2025

zero centered gamma for norm | proper qkv proj in GDN (tp aware, same as MG) | training script (wip)

bd7b3d1

alexandretl committed on Oct 2, 2025

revamp GDN cache (as QwenNext) & conv1d

4f09326

alexandretl committed on Sep 24, 2025

refactor backends selection | fix eager attn softcap & window | fix flex backend window | flex attn & eager backends for DA | eager backend for GDN | refactor GDN variables

b914c22

alexandretl committed on Sep 23, 2025

flex attn backend for ATTN (tested) [ inc

9872c32

alexandretl committed on Sep 23, 2025

revert vLLM modifications (separate head mult back to lm_head)

4e57133

alexandretl committed on Sep 22, 2025

merged three convolutions of GDN (test on PIQA&SWDE)

54fbeee

alexandretl committed on Sep 22, 2025

merged in_proj and gate_proj of GDN (tested on PIQA)

c7717bf

alexandretl committed on Sep 19, 2025

Update configuration_dragon.py

a745443
verified

alexandretl committed on Sep 18, 2025

max pos embeddings

9053077

alexandretl committed on Sep 12, 2025

manual automap for DragonModel | vLLM compat (alpha head in model, persistant=False, contiguous for conv1d, max pos hardcoded for now)

58a7542

alexandretl committed on Sep 10, 2025

diff attn backend FA works (eager, no)

2db3d5e

alexandretl committed on Sep 10, 2025

diff attn FA2+FA3+eager backends (WIP)

92fd2b1

jgcb00 committed on Sep 10, 2025

