We present a methodology for training small language models on CPU at FP32 precision that achieves capability-per-dollar efficiency orders of magnitude beyond GPU-based training. Across 15 models spanning four novel architecture families, namely Mixture of Attentions (MoA), cross-architecture fusion (Qemma), swarm intelligence (SAGI), and metric-space causal language models (DiscoverLM), total compute cost was $24 on a single AMD EPYC 9454P processor. We introduce seven methodological pillars: (1) FP32 precision preservation, with experiments demonstrating a 5,810× single-operation error ratio and a 23,225× compounding error ratio for FP16 at network depth; (2) sparse cognitive architectures where 0.02–7% of parameters activate per token, matching CPU branching rather than GPU SIMD; (3) developmental curriculum training progressing from language to logic to transfer to depth; (4) continuous belt-fed data ingestion eliminating truncation waste; (5) hardware-native optimization for AMD Zen 4 via AOCL/OpenMP/NUMA-aware allocation; (6) self-regulating thermodynamic governance with emergent temperature measurement grounded in L2-star discrepancy; and (7) open-standard compute (AVX2 SIMD at FP32) free of proprietary vendor dependency. We argue that transformers were designed for GPU hardware rather than mathematical optimality, and that architectures designed for geometric correctness (metric-space attention, triangle inequality enforcement, sparse expert routing) naturally favor CPU execution. For sub-2B parameter models, CPU training produces more capable models at a fraction of the cost.
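To make the precision claim in pillar (1) concrete, the following is a minimal sketch of how rounding error accumulates differently in FP16 and FP32. It is not the experiment behind the reported 5,810× and 23,225× figures and will not reproduce them. It emulates reduced precision with the standard library `struct` module and compares a rounded dot product against an FP64 reference. The function names, vector length, and seed are illustrative assumptions.

```python
import math
import random
import struct


def quantize(x: float, fmt: str) -> float:
    """Round an FP64 value to the nearest value representable in `fmt`.

    "e" is IEEE 754 binary16 (FP16) and "f" is binary32 (FP32).
    """
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def rounded_dot(a, b, fmt: str) -> float:
    """Dot product that rounds to `fmt` after every multiply and every add."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = quantize(quantize(x, fmt) * quantize(y, fmt), fmt)
        acc = quantize(acc + prod, fmt)
    return acc


def relative_errors(n: int = 1000, seed: int = 0):
    """Return (FP16 relative error, FP32 relative error) for one random dot product.

    The vector length and seed are arbitrary choices for this illustration.
    """
    rng = random.Random(seed)
    a = [rng.random() for _ in range(n)]
    b = [rng.random() for _ in range(n)]
    exact = math.fsum(x * y for x, y in zip(a, b))  # FP64 reference value
    err16 = abs(rounded_dot(a, b, "e") - exact) / exact
    err32 = abs(rounded_dot(a, b, "f") - exact) / exact
    return err16, err32


if __name__ == "__main__":
    err16, err32 = relative_errors()
    print(f"FP16 relative error: {err16:.3e}")
    print(f"FP32 relative error: {err32:.3e}")
```

The gap between the two errors widens as the accumulator grows relative to each addend, because FP16 carries only 11 significand bits against 24 for FP32. The exact ratio depends on the data and on the length of the accumulation.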
Today we're publicly releasing Kanon 2 Enricher, and with it, an entirely new class of AI model that we're calling a hierarchical graphitization model. This is fundamentally different from both universal extraction models and generative models.
As a hierarchical graphitization model, Kanon 2 Enricher natively outputs a knowledge graph rather than tokens, which makes it architecturally incapable of hallucinating or inventing text that wasn't present in the input.
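One way to picture output that is a graph over the input rather than generated tokens is a structure whose nodes and links hold only character offsets into the source text. The sketch below is an illustration under that assumption. The `Span`, `Node`, and `Link` types, the `locate` helper, and the sample text are all hypothetical and are not Kanon 2 Enricher's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Span:
    """A half-open character range [start, end) into the source document."""
    start: int
    end: int

    def text(self, document: str) -> str:
        # A span can only ever yield text that already exists in the document.
        if not (0 <= self.start <= self.end <= len(document)):
            raise ValueError("span falls outside the document")
        return document[self.start:self.end]


@dataclass
class Node:
    """A segment in the hierarchy (document, section, sentence, ...)."""
    kind: str
    span: Span
    children: list = field(default_factory=list)


@dataclass(frozen=True)
class Link:
    """A typed edge between two spans, such as a term use and its definition."""
    relation: str
    source: Span
    target: Span


def locate(document: str, needle: str, start: int = 0) -> Span:
    """Hypothetical helper: span of the first occurrence of `needle` at or after `start`."""
    begin = document.index(needle, start)
    return Span(begin, begin + len(needle))


DOCUMENT = (
    '1. Definitions. "Supplier" means Acme Pty Ltd.\n'
    "2. Payment. The Supplier must invoice monthly."
)


def build_example(document: str = DOCUMENT):
    """Build a two-section hierarchy plus one cross-reference link by hand."""
    first = Node("section", locate(document, '1. Definitions. "Supplier" means Acme Pty Ltd.'))
    second = Node("section", locate(document, "2. Payment. The Supplier must invoice monthly."))
    root = Node("document", Span(0, len(document)), [first, second])
    definition = locate(document, '"Supplier"')
    mention = locate(document, "Supplier", second.span.start)
    return root, Link("refers_to", mention, definition)


if __name__ == "__main__":
    root, link = build_example()
    print(link.source.text(DOCUMENT), "->", link.target.text(DOCUMENT))
```

Because no field in this structure stores free text, any string a consumer reads back is a slice of the original document, and an out-of-range span fails loudly instead of producing invented content.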
What that enables in practice is unlike any other model or ML architecture on the market:
• No hallucinations: It cannot hallucinate. All references and links are stored as spans, meaning exact character offsets anchored to the original text.
• Hierarchical segmentation, not just extraction: It deconstructs a document's full nested hierarchy, down to chapters, sections, clauses, schedules, signatures, and even individual sentences, and classifies each span with dozens of contextual features.
• Entity extraction, disambiguation, and linking: It resolves what references actually point to, then links entities, citations, and cross-references into a single coherent graph.
• Graph-first efficiency: Small enough to run locally on a consumer PC with sub-second latency, and it stays reliable on long documents where front