[DAY ONE] PROJECT CROWFEATHER 4/30/2026 ...the day I forgot to attach wandb.ai. Just dropped Crowfeather-50m, the first checkpoint in a series, and yeah, no graphs.
54.5M params. Pretrain only. 17,500 steps banked on FineWeb-edu before Thunder credits ran dry. About 2.3B tokens, no SFT yet.
Architecture: Gemma-4 alternating sliding/global attention (1024 window, last layer always global) + DeepSeek-V4 Muon optimizer + WSD scheduler + Gemma-2 logit soft-cap + PaLM z-loss. Recipe in the model card.
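For the curious, here's a minimal PyTorch sketch of the two loss-side pieces, the Gemma-2-style logit soft-cap and the PaLM z-loss. The cap value and z-loss weight below are illustrative defaults, not the exact numbers from my recipe:

```python
import torch
import torch.nn.functional as F

def soft_cap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Gemma-2-style soft cap: squash logits smoothly into (-cap, cap)
    # instead of letting them grow unbounded.
    return cap * torch.tanh(logits / cap)

def lm_loss_with_z_loss(logits: torch.Tensor, targets: torch.Tensor,
                        z_weight: float = 1e-4) -> torch.Tensor:
    # Standard next-token cross-entropy over the vocab dimension.
    ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # PaLM z-loss: penalize log(Z)^2, where Z is the softmax normalizer,
    # to keep logit magnitudes from drifting during training.
    log_z = torch.logsumexp(logits, dim=-1)
    return ce + z_weight * (log_z ** 2).mean()
```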
What it can do: writes grammatical English. Knows that France has Rhine-adjacent monasteries (it picked Rouen instead of Paris but the vocabulary is in there). Tells stories about Mr. Fabien.
What it can't do yet: facts, code, math. Base LM, no SFT, no instruction tuning.
The series:
- Every additional training run becomes another model card here.
- Every model card gets a matching post on this profile.
- Continuation goes to Colab next, picking up from step 17,500 of 100k.
Limited to one post a day on Hugging Face, so updates will trickle out at that pace. Follow [@Crownelius](@Crownelius) and [@Crowfeather](@Crowfeather) if you want to watch this thing learn in public. The next drop will come either with the finished pre-train or with whatever step I land on before the bank takes my credit card away.
EARLY SNEAK PREVIEW of our first DeepSeek-V4-Pro dataset, Tachibana 4!
Tachibana 4 is our upcoming agentic coding dataset:
- Questions prioritize real-world, challenging agentic coding tasks across a variety of programming languages and topics.
- Areas of focus include back-end and front-end development, systems programming, distributed systems, performance optimization, data structures, databases and data engineering, game and mobile development, security engineering, compiler design, custom tooling, task automation, practical bugfixes, and more!
- A wide variety of emphasized languages improves development capability: Python, C, C++, C#, Go, TypeScript, Java, JavaScript, Rust, Haskell, SQL, Shell, R, Ruby, assembly code, and more!
- Synthetic prompts utilize a variety of personas, experience levels, and styles of communication to maximize real-world flexibility and usability.
These agentic datasets will power the upcoming Esper 4, and whatever you can build! We'll have more finetunes on the way as well! :) We're going to make open source better and better for your work!
If you would like to see Esper 4 and these datasets faster, this is the best way you can help us: sequelbox/SupportOpenSource
Same-base DARE-TIES merge of Qwen3.6-27B + 3 fine-tunes (rico03 Claude distill, Esper3.1, kai-os Opus reasoning anchor) via my Omnimerge_v2 method (OBIM-lite + DAREx-q + EMR election).
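For anyone new to the base technique: below is a minimal state-dict-level sketch of plain same-base DARE-TIES in PyTorch, without the Omnimerge_v2 extras (OBIM-lite, DAREx-q, EMR election aren't shown). The drop rate and the simple majority-sign election are illustrative, not my exact settings:

```python
import torch

def dare_ties_merge(base: dict, finetunes: list, drop_p: float = 0.9,
                    seed: int = 0) -> dict:
    """Plain DARE-TIES over same-base checkpoints (dicts of tensors)."""
    g = torch.Generator().manual_seed(seed)
    merged = {}
    for name, w_base in base.items():
        deltas = []
        for ft in finetunes:
            delta = ft[name] - w_base                     # task vector
            keep = torch.rand(delta.shape, generator=g) >= drop_p
            deltas.append(delta * keep / (1.0 - drop_p))  # DARE: drop + rescale
        deltas = torch.stack(deltas)
        # TIES-style sign election: pick the majority sign per parameter...
        sign = torch.sign(deltas.sum(dim=0))
        agree = torch.where(torch.sign(deltas) == sign, deltas,
                            torch.zeros_like(deltas))
        # ...then average only the deltas that agree with it.
        count = (agree != 0).sum(dim=0).clamp(min=1)
        merged[name] = w_base + agree.sum(dim=0) / count
    return merged
```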
Hit a Qwen3.6-specific fragility: hyperparams that work flawlessly on 3.5 produced an 80% unclosed-<think> rate on 3.6, collapsing pass@1 to ~20%. Per-tensor delta forensics localized the failure to mlp.{gate,up,down}_proj in layers 27–52. Fix: MLP-passthrough surgery, copying MLPs verbatim from base while keeping merged attn + linear_attn. Leak → 0%.
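The surgery itself is a few lines if the checkpoints are single-shard safetensors. File names and the key layout below are assumptions based on the usual HF naming; sharded checkpoints need an index-aware loop:

```python
from safetensors.torch import load_file, save_file

# Hypothetical single-shard checkpoints; paths are placeholders.
base = load_file("base/model.safetensors")
merged = load_file("merged/model.safetensors")

# Copy MLP weights verbatim from base for the fragile layers (27-52),
# keeping the merged attention (and linear_attn) everywhere.
for layer in range(27, 53):
    for proj in ("gate_proj", "up_proj", "down_proj"):
        key = f"model.layers.{layer}.mlp.{proj}.weight"  # assumed key format
        merged[key] = base[key]

save_file(merged, "merged/model.patched.safetensors")
```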
Q6_K results (vs Qwen3.6 base / vs Omnimerge-v2 on Qwen3.5):
- HumanEval: 84.76% (= base, +5.49 pp vs v2)
- MBPP corrected: 73.40% (+15.80 pp vs base, ≈ v2)
- GPQA Diamond: ~84.75% partial, 192/198 (+15.5 pp vs v2)
▶ Qwen3.5-4B Importance-Signal Study (M1..M5)
Controlled 5-way comparison: same Qwen3.5-4B base, same 2 fine-tunes (Jackrong Claude-4.5 distill + Crow Opus-4.6 distill), only the importance signal driving DARE-TIES sparsification varies.
Findings: Fisher wins HumanEval (+4.88 pp over vanilla), LRP wins MBPP (+2.60 pp). Both signals + the Omnimerge_v2 recipe beat vanilla. To make multimodal-LM ex-LRP work end-to-end against Qwen3_5ForConditionalGeneration, I filed 5 patches against arcee-ai/mergekit PR #682 + 1 against rachtibat/lxt.
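For reference, the Fisher signal here is just a diagonal Fisher proxy: the mean squared gradient per parameter over a calibration stream. A minimal sketch (function and argument names are mine, not from either repo):

```python
import torch

def diagonal_fisher(model, batches, loss_fn):
    """Diagonal Fisher proxy: mean squared gradient per parameter.
    The per-tensor scores can then weight which deltas DARE-TIES keeps."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    seen = 0
    for batch in batches:
        model.zero_grad()
        loss_fn(model, batch).backward()     # gradients of the LM loss
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        seen += 1
    return {n: f / max(seen, 1) for n, f in fisher.items()}
```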
All five Mx checkpoints + Fisher/LRP signal safetensors + reproducer scripts published.
Poll: Will 2026 be the year of subquadratic attention?
The transformer architecture is cursed by its computational complexity: attention cost grows quadratically with context length. It is why you run out of tokens and have to compact. But some would argue that this is a feature, not a bug, and that it is also why these models are so good. We've been doing a lot of research on making equally good models that are computationally cheaper, but so far none of the approaches have stood the test of time. Or so it seems.
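Rough arithmetic, with made-up GPT-class dimensions, showing the quadratic blow-up:

```python
def attn_flops(n_tokens: int, d_model: int, n_layers: int) -> float:
    # QK^T and attn @ V are each roughly 2 * n^2 * d multiply-adds per layer.
    return n_layers * 4 * n_tokens**2 * d_model

# Doubling the context quadruples the attention cost:
for n in (8_192, 16_384, 32_768):
    print(f"{n:>6} tokens -> {attn_flops(n, 4096, 32) / 1e12:.0f} TFLOPs")
```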
Please vote, don't be shy. Remember that the Dunning-Kruger effect is very real, so the person who knows less about transformers than you is going to vote anyway. We want everyone's opinion, no matter your confidence level.
👍 if you think at least one frontier model* will have no O(n^2) attention by the end of 2026; 🔥 if you disagree
* Frontier models: models that match or outperform the flagship Claude, Gemini, or ChatGPT of the time on multiple popular benchmarks.
I just pushed Claude Code Agent Swarm with 20 coding agents on my desktop GPU workstation.
With local AI, I don't have the /fast CC switch, but I have /absurdlyfast:
- 100,499 tokens/second read (yeah, 100k, not a typo) | 811 tok/sec generation
- KV cache: 707,200 tokens
- Hardware: 5+ year old GPUs, 4x A6K (gen 1). It's not the car. It's the driver.
Qwen3 Coder Next AWQ with the KV cache at BF16. Scores 82.1% in C# on a codebase that's been in development for 29 years, vs Opus 4.5 at only 57.5%. When your codebase predates Stack Overflow, you don't need the biggest model; you need the one that actually remembers Windows 95.
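For scale, a back-of-envelope for what a ~707k-token BF16 KV cache costs. The layer/head dimensions below are assumptions for illustration, not Qwen3 Coder Next's actual config:

```python
def kv_cache_gib(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for K and V; BF16 = 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem / 2**30

# Hypothetical GQA config: 48 layers, 8 KV heads of dim 128, BF16 cache.
print(f"{kv_cache_gib(707_200, 48, 8, 128):.0f} GiB")  # ~130 GiB with these dims
```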
My current bottleneck is my 27" monitor. Can't fit all 20 Theos on screen without squinting.