NEO-unify: Building Native Multimodal Unified Models End to End
Same, but 40k on hardware and then train hard
Could sell my goats and get some API subscriptions.
Final recipe locked: Qwen3-MoE with 3 experts and top-1 routing; vocab 262144 (Gemma 3 SentencePiece, per-digit input splitting); GQA 3:1; Muon for the hidden 2D weight matrices and AdamW for embeddings and router; WSD schedule with sqrt cooldown; beta2 ramped from 0.95 to 0.97; z-loss 1e-4, now actually in the gradient graph (the last build wrapped it in no_grad, which silently disabled it); Qwen3-style aux load-balancing loss with coefficient 0.001; and an expert-load monitor that warns on expert starvation. Three phases: 8K pretrain, then 32K continued pretrain, then 8K SFT.
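A minimal sketch of the router-side pieces of that recipe: the z-loss, a Switch-style top-1 aux load-balancing loss, the expert-load starvation monitor, and a WSD learning-rate curve with sqrt cooldown. Function names, the warmup/cooldown fractions, the starvation floor, and the exact cooldown shape are my assumptions, not from the post; the coefficients (1e-4, 0.001) are from the recipe. Shown in NumPy for clarity, with the autograd caveat as a comment.

```python
import numpy as np

def router_losses(logits, z_coef=1e-4, aux_coef=1e-3):
    """Top-1 MoE router losses. logits: [tokens, experts].

    NOTE: in the real training build these terms must live inside the
    autograd graph (no no_grad wrapper) -- wrapping them in no_grad
    silently zeroes their contribution, the bug described above.
    """
    n_tok, n_exp = logits.shape
    # z-loss: penalize the squared logsumexp of the router logits,
    # keeping them from drifting to large magnitudes.
    m = logits.max(axis=1, keepdims=True)
    lse = (np.log(np.exp(logits - m).sum(axis=1, keepdims=True)) + m).squeeze(1)
    z_loss = z_coef * float(np.mean(lse ** 2))
    # Softmax probabilities and hard top-1 assignment.
    probs = np.exp(logits - m)
    probs /= probs.sum(axis=1, keepdims=True)
    top1 = probs.argmax(axis=1)
    load = np.bincount(top1, minlength=n_exp) / n_tok   # f_i: token fraction per expert
    mean_prob = probs.mean(axis=0)                      # P_i: mean router prob per expert
    # Switch-style balance loss: n_exp * sum_i f_i * P_i (== aux_coef at perfect balance).
    aux_loss = aux_coef * n_exp * float(load @ mean_prob)
    return z_loss, aux_loss, load

def starved_experts(load, floor=0.05):
    """Expert-load monitor: return experts receiving < floor of tokens."""
    return [i for i, f in enumerate(load) if f < floor]

def wsd_lr(step, total, base_lr, warmup_frac=0.02, cooldown_frac=0.2):
    """WSD schedule: linear warmup, constant plateau, sqrt-shaped cooldown to 0."""
    w, c = int(total * warmup_frac), int(total * cooldown_frac)
    if step < w:
        return base_lr * step / max(w, 1)
    if step < total - c:
        return base_lr
    t = (step - (total - c)) / c
    return base_lr * (1.0 - np.sqrt(t))
```

The aux loss is minimized at uniform load, so it pushes against exactly the collapse the starvation monitor is watching for; the monitor is just the cheap early-warning signal on top of it.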
I might. Join https://discord.gg/vaEquJ6UJT if you need to contact me.