Centenario-21B-MXFP4
Your Daily Driver, Reimagined
We did not build Centenario to win benchmarks. We built it because the way people actually work with AI, the real daily grind of coding, thinking, and solving problems, deserves a model that understands that rhythm. This is your AI, running on your hardware, built into Bodega OS, always ready, never sending your prompts to someone else's cloud.
Centenario is Spanish for "centennial." Form follows function: this model is designed to charge through your workload with relentless efficiency and power.
Built Different
Most models are trained once and shipped. Centenario uses what we call cascade reinforcement learning, a fundamentally different approach to getting AI right. We start with offline RL: think of it as learning from a carefully curated playbook of excellent responses. This warm-up phase establishes strong fundamentals and ensures that when we move to the next stage, we are not starting from chaos.
Then comes online RL, where the model learns from its own outputs in real scenarios. It continuously adapts, refining its understanding of what good responses actually look like in practice. This two-stage cascade gives us the best of both worlds: the efficiency of offline learning and the adaptability of online refinement. The result is a model that performs significantly better than single-stage approaches while using a fraction of the GPU time.
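The two-stage cascade can be sketched with a toy policy. This is purely illustrative (a three-armed bandit, not a language model, and not our actual training pipeline): the policy is first warmed up offline on curated (action, reward) demonstrations, then refined online from its own sampled rollouts.

```python
import random

# Toy sketch of cascade RL: offline warm-up on curated data, then
# online refinement from the policy's own samples. All numbers here
# are made up for illustration.

random.seed(0)
N_ACTIONS = 3
TRUE_REWARD = [0.2, 0.5, 0.9]          # hidden reward per action

def sample(prefs):
    """Pick an action with probability proportional to its preference."""
    total = sum(prefs)
    r = random.uniform(0, total)
    for a, p in enumerate(prefs):
        r -= p
        if r <= 0:
            return a
    return N_ACTIONS - 1

prefs = [1.0] * N_ACTIONS              # uniform initial policy

# Stage 1: offline RL -- learn from a curated playbook of good actions.
curated = [(2, 0.9), (2, 0.9), (1, 0.5)]   # (action, reward) demonstrations
for action, reward in curated:
    prefs[action] += reward            # reward-weighted imitation

# Stage 2: online RL -- learn from the policy's own rollouts.
for _ in range(500):
    a = sample(prefs)
    reward = TRUE_REWARD[a] + random.gauss(0, 0.05)
    prefs[a] += 0.1 * reward           # reinforce actions that scored well

best = max(range(N_ACTIONS), key=lambda a: prefs[a])
print(best)
```

The offline stage puts the policy in a sensible region before any self-sampling happens, which is exactly why the online stage stays stable: it starts from demonstrated good behavior rather than from noise.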
How It Works
Centenario runs on a mixture-of-experts architecture with 21 billion total parameters, but here is the clever part: it activates only 3.6 billion per token. This is not about having the biggest model; it is about having the smartest one. Every inference is efficient, every response is considered, and you get the intelligence you need without the overhead you do not.
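Sparse mixture-of-experts routing is easy to see in miniature. The sketch below is schematic (the expert counts and dimensions are invented for illustration, not Centenario's configuration): a router scores every expert for each token, but only the top-k experts actually run, so compute scales with k rather than with the total number of experts.

```python
import numpy as np

# Schematic top-k mixture-of-experts routing. Sizes below are
# hypothetical, chosen only to keep the example small.

rng = np.random.default_rng(0)
N_EXPERTS, TOP_K, D = 8, 2, 16

router_w = rng.standard_normal((D, N_EXPERTS))
experts = [rng.standard_normal((D, D)) for _ in range(N_EXPERTS)]

def moe_forward(x):
    """Route one token vector x through its top-k experts only."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]                # chosen expert indices
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax weights
    out = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return out, top

token = rng.standard_normal(D)
y, chosen = moe_forward(token)
print(len(chosen))   # only TOP_K of N_EXPERTS experts were activated
```

The 21B-total / 3.6B-active split works the same way at scale: every token pays only for the experts the router selects.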
The architecture uses alternating dense and locally banded sparse attention, which means it can maintain a 128K token context window without the memory explosion you would normally expect. Rotary position embeddings handle the positional information, and grouped multi-query attention with a group size of 8 keeps things moving fast. These are not just technical details—they are the reason Centenario can process your entire codebase without breaking a sweat.
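The attention pattern described above can be visualized with masks. This is a simplified sketch (tiny sizes, boolean masks only): a causal "dense" layer lets each token attend to all earlier tokens, while a "locally banded" layer restricts attention to a sliding window, which is what keeps memory bounded as context grows.

```python
import numpy as np

# Illustrative attention masks: dense causal vs. locally banded causal.
# Sequence length and window size are toy values.

def dense_causal_mask(n):
    """Each token attends to itself and all earlier tokens."""
    return np.tril(np.ones((n, n), dtype=bool))

def banded_causal_mask(n, window):
    """Each token attends only to the last `window` tokens (inclusive)."""
    m = dense_causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False   # drop keys outside the band
    return m

n, window = 8, 3
dense = dense_causal_mask(n)
banded = banded_causal_mask(n, window)
print(dense.sum(), banded.sum())   # banded keeps far fewer key/query pairs
```

Dense layers cost O(n^2) in attended pairs while banded layers cost O(n * window); alternating the two trades a little global reach per layer for a context window that stays affordable at 128K tokens.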
We quantized it to MXFP4 at 4.25 bits per parameter, which brings the memory footprint down to as low as 11GB. On an M1 Max, you are looking at up to 95 tokens per second with first-token latency under 150ms. That is fast enough to feel instant, efficient enough to run all day on a laptop, and capable enough to handle whatever you throw at it.
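The 11GB figure follows directly from the quantization math: 21 billion parameters at 4.25 bits each.

```python
# Back-of-the-envelope check of the memory figure above.

total_params = 21e9
bits_per_param = 4.25       # MXFP4: 4-bit values plus shared scale overhead

weight_bytes = total_params * bits_per_param / 8
print(round(weight_bytes / 1e9, 2))   # ~11.16 GB for the weights alone
```

That covers the weights alone; KV cache and runtime buffers account for the rest of the 11-21GB range quoted later, depending on context length and configuration.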
Running on Bodega
Centenario is not just optimized for Bodega OS—it is designed as the core reasoning engine for it. When you are running Bodega, Centenario handles the orchestration, the decision-making, the understanding of what you are trying to accomplish. It coordinates with specialized worker models, manages your context, and maintains the thread of your work across sessions.
The MLX-based inference leverages the unified memory architecture of Apple Silicon, which means no costly data transfers between CPU and GPU. Streaming token generation keeps interactions responsive. Advanced memory management ensures sustained performance even during extended sessions. And because it is all running on your hardware, your data never leaves your machine.
We have also optimized it for the Harmony format, which is specifically designed for tool use and agentic workflows. You can configure reasoning effort—low for quick responses, medium for balanced performance, high for complex problem-solving. This flexibility comes from our most advanced internal research on how AI should adapt to different task demands.
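Reasoning effort in the Harmony format is set through the system message. The sketch below is schematic: the special-token spelling follows the published Harmony format, but in practice you should let a Harmony-aware chat template or library render these strings rather than hand-building them.

```python
# Schematic Harmony-style system message with a configurable
# reasoning-effort level. Illustrative only; use the real Harmony
# tooling to render prompts in production.

def system_message(effort="medium"):
    assert effort in ("low", "medium", "high")
    return (
        "<|start|>system<|message|>"
        "You are a helpful assistant.\n"
        f"Reasoning: {effort}"
        "<|end|>"
    )

msg = system_message("high")
print("Reasoning: high" in msg)
```

Switching the effort level changes how much hidden reasoning the model spends before answering: "low" for quick responses, "high" for complex problem-solving.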
What It Does Well
Centenario excels at the messy, real work of software development. Code generation across any major language. Architectural planning for complex systems. Debugging that actually understands the broader context of your codebase. Refactoring that maintains the spirit of your original design while improving the implementation.
But it is not just a coding tool. Centenario handles research and analysis, document synthesis, technical writing, data interpretation. It is conversational when you need to think through a problem, precise when you need specific answers, creative when you are exploring new ideas.
The key is that it does not try to be everything. It is a daily driver. Reliable, capable, efficient. The kind of tool that becomes invisible because it just works.
On-Premises AI That Respects You
Here is what matters: this runs on your machine. Your code, your ideas, your half-formed thoughts at 2 AM—none of it goes to the cloud. You are not rate-limited by some API. You are not hoping the service is up when you need it. You are not wondering what is being logged or how it will be used to train the next model.
Centenario is part of Bodega OS, which means it is part of a complete on-premises AI infrastructure. Fast retrieval engines for your documents and code. Efficient inference that does not cook your laptop. Models that work together instead of being siloed API endpoints. This is what local-first AI should be.
The Technical Reality
Let us talk specifics. 21 billion total parameters, 3.6 billion active per forward pass. 128K native context window with full RoPE support. Mixture of experts with intelligent sparse activation. Alternating dense and locally banded sparse attention layers. Grouped multi-query attention optimized for inference speed.
MXFP4 quantization at 4.25 bits per parameter brings memory usage to 11-21GB depending on your configuration. On M-series Apple Silicon, you are getting 70 tokens per second sustained throughput. First token latency stays under 150ms. The context window does not degrade at length—you can use all 128K tokens without the model falling apart.
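Grouped multi-query attention is a large part of why 128K context fits in this memory budget. The arithmetic below is a rough sketch; the group size of 8 comes from the spec above, but the layer and head counts are assumed for illustration, not Centenario's published configuration.

```python
# Rough KV-cache arithmetic: grouped multi-query attention with group
# size 8 means 8 query heads share each KV head, shrinking the cache
# 8x. Layer/head counts below are ASSUMED for illustration.

layers, q_heads, head_dim = 24, 64, 64      # assumed values
group_size = 8                               # from the spec above
kv_heads = q_heads // group_size
seq_len, bytes_per_elem = 128_000, 2         # full context, fp16 cache

def kv_cache_bytes(n_kv_heads):
    # 2x for keys and values
    return 2 * layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

full = kv_cache_bytes(q_heads)    # multi-head baseline
gqa = kv_cache_bytes(kv_heads)    # grouped multi-query, group size 8
print(round(full / 1e9, 1), round(gqa / 1e9, 1), full // gqa)
```

Under these assumed sizes, the grouped cache is exactly group_size times smaller than a full multi-head cache, which is the difference between a 128K context fitting on a laptop and not fitting at all.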
Inference is handled through MLX, which means native optimization for Apple's unified memory architecture. No PyTorch overhead, no CUDA complexity, just clean efficient code that makes full use of your hardware.
Why Cascade RL Matters
Traditional fine-tuning approaches hit a wall. You can train on static datasets, but the model learns the average of its training data, not the ceiling of what is possible. Online RL solves this by learning from the model's own generations, but it is expensive and unstable if you start from a weak foundation.
Cascade RL solves both problems. The offline phase establishes that strong foundation efficiently. The online phase then pushes performance higher by learning from the model's best outputs in real scenarios. You get production-ready stability with frontier-level capabilities, and you do it with a fraction of the compute cost.
This is why Centenario punches above its weight class. It is not just about parameter count—it is about training methodology that actually works in the real world.
Disclaimer
SRSWTI is not the creator or owner of the underlying foundation model architecture. The foundation model is created and provided by third parties. SRSWTI has trained this model on top of the foundation model but does not endorse, support, represent, or guarantee the completeness, truthfulness, accuracy, or reliability of any outputs. You understand that this model can produce content that might be offensive, harmful, inaccurate, deceptive, or otherwise inappropriate. SRSWTI may not monitor or control all model outputs and cannot, and does not, take responsibility for any such outputs. SRSWTI disclaims all warranties or guarantees about the accuracy, reliability, or benefits of this model. SRSWTI further disclaims any warranty that the model will meet your requirements; be secure, uninterrupted, or available at any time or location; be error-free or virus-free; or that any errors will be corrected. You will be solely responsible for any damage resulting from your use of or access to this model, your downloading of this model, or use of this model provided by or through SRSWTI.
Crafted by the Bodega team at SRSWTI Research Labs
Building the world's fastest inference and retrieval engines
Making AI accessible, efficient, and powerful for everyone
