Lgr54HFi
/
chomera
chimera51
custom_code
arxiv:
12 papers
main
chomera
/
chimera
179 kB
1 contributor
History:
23 commits
Lgr54HFi
fix: MoE intermediate_size not scaled for tiny — 158M→4M MoE params
6cb7b4d
verified
about 1 month ago
training
fix: MoE intermediate_size not scaled for tiny — 158M→4M MoE params
about 1 month ago
__init__.py
Safe
2.43 kB
Upload folder using huggingface_hub
about 1 month ago
__main__.py
Safe
894 Bytes
Upload folder using huggingface_hub
about 1 month ago
cli.py
Safe
1.97 kB
Upload folder using huggingface_hub
about 1 month ago
config.py
Safe
3.11 kB
Upload folder using huggingface_hub
about 1 month ago
evolution.py
Safe
23.3 kB
perf: eliminate .item() graph breaks in evolution.py - use tensor comparisons for torch.compile compat
about 1 month ago
hyper.py
Safe
18.7 kB
Upload folder using huggingface_hub
about 1 month ago
inference.py
Safe
15.1 kB
Upload folder using huggingface_hub
about 1 month ago
layers.py
Safe
21.1 kB
Upload folder using huggingface_hub
about 1 month ago
looping.py
Safe
2.82 kB
Upload folder using huggingface_hub
about 1 month ago
model.py
Safe
15.9 kB
Skip SpanEngine/Grammar/DebtLedger during training (inference-only ops on 200K logits)
about 1 month ago
moe.py
Safe
4.29 kB
Upload folder using huggingface_hub
about 1 month ago
multimodal.py
Safe
5.15 kB
Upload folder using huggingface_hub
about 1 month ago
paths.py
Safe
358 Bytes
Upload folder using huggingface_hub
about 1 month ago
quantization.py
Safe
17.4 kB
fix: NaN at step 150 - add gradient clamping to STE detach trick + lower max_grad_norm to 0.5

The pure detach() STE passes gradients through unbounded, causing gradient explosion around step 140-150 when loss is still high.

Fix: clamp the gradient contribution within the detach trick:
    w_q = clamp(w_scaled, -1, 1) + (round(clamped) - clamped).detach()
This ensures gradients are zero outside [-1, 1] (weights already at the quantization boundary get no gradient push) while keeping the STE identity pass-through inside the valid range.

Also reduces max_grad_norm from 1.0 to 0.5 for additional stability.

Ref: 4-bit CPU training paper (2603.13931) uses tanh soft clipping for the same reason.
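The clamped detach trick described in this commit message can be sketched in plain Python for a single scalar weight. This is an illustration only, not the code in quantization.py: the function names are hypothetical, the gradient is written out by hand instead of relying on autograd, and the norm-clipping helper assumes the usual global-norm rule (scale by max_norm / (norm + eps), capped at 1).

```python
import math


def clamped_ste_forward(w_scaled: float) -> float:
    """Forward value of clamp(w, -1, 1) + (round(clamped) - clamped).detach()."""
    clamped = max(-1.0, min(1.0, w_scaled))
    # The detached term changes the value but carries no gradient, so the
    # forward result equals round(clamped), one of -1, 0, 1 for this range.
    return clamped + (round(clamped) - clamped)


def clamped_ste_grad(w_scaled: float, upstream: float = 1.0) -> float:
    """Hand-written gradient of the forward pass with respect to w_scaled.

    Only the clamp term is differentiable: identity inside [-1, 1],
    zero outside, which is what bounds the gradient.
    """
    inside = -1.0 <= w_scaled <= 1.0
    return upstream if inside else 0.0


def clip_by_global_norm(grads: list[float], max_norm: float, eps: float = 1e-6) -> list[float]:
    """Rescale grads so their L2 norm does not exceed max_norm (assumed rule)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = min(1.0, max_norm / (total_norm + eps))
    return [g * coef for g in grads]


if __name__ == "__main__":
    for w in (0.3, 0.7, 2.5, -3.0):
        print(w, clamped_ste_forward(w), clamped_ste_grad(w))
    print(clip_by_global_norm([3.0, 4.0], max_norm=0.5))
```

Weights already pushed past the boundary (2.5 and -3.0 above) still quantize to the nearest end value but receive zero gradient, while weights inside the range receive the upstream gradient unchanged.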
about 1 month ago
tokenizer.py
Safe
6.84 kB
Upload folder using huggingface_hub
about 1 month ago