Bodega-Raptor-90M
Intelligence in Your Pocket
Bodega-Raptor-90M represents the extreme end of model miniaturization. At just 90 million parameters, this model proves that useful AI does not require billions of parameters and gigabytes of RAM. With a memory footprint under 600MB and inference speeds exceeding 1,000 tokens per second, this is the model you deploy when size and speed matter more than anything else.
Extreme Miniaturization
Ninety million parameters. That is smaller than most people's photo libraries. The entire model, loaded and ready to run, consumes less than 600MB of RAM. On Apple Silicon, it delivers over 1,000 tokens per second with sub-10ms latency for short completions. This is not just fast—it is instantaneous in a way that fundamentally changes how you can use AI.
The model fits on smartwatches. It runs on microcontrollers. It works on devices where you measure available memory in megabytes, not gigabytes. And it does all this while maintaining structured reasoning capabilities that make it genuinely useful rather than just impressively small.
What Raptor-90M Does
Despite its size, Raptor-90M supports tool calling. This is critical for practical deployment—the model can parse DOM structures, extract title headers, fetch specific data elements, and call functions to perform structured tasks. It understands the format and conventions of tool use, which means you can build actual agents around it rather than just using it for text generation.
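As a minimal sketch of what that looks like in practice: the example below assumes a hypothetical `raptor_generate` function standing in for whatever local inference runtime you use (here it is stubbed with a canned response), and the tool names, prompt format, and JSON conventions are illustrative rather than the model's documented interface.

```python
import json

# Hypothetical stand-in for your local inference runtime; any binding
# that takes a prompt and returns the model's text would slot in here.
def raptor_generate(prompt: str) -> str:
    # Stubbed response illustrating the tool-call format we assume.
    return '{"tool": "extract_title", "arguments": {"selector": "h1"}}'

# Tools the model is allowed to call; names and schemas are illustrative.
TOOLS = {
    "extract_title": lambda selector: f"<contents of first {selector}>",
}

def run_tool_call(user_request: str, page_html: str) -> str:
    prompt = (
        "You can call these tools: extract_title(selector).\n"
        'Respond with JSON: {"tool": ..., "arguments": {...}}.\n'
        f"Request: {user_request}\nHTML:\n{page_html}"
    )
    call = json.loads(raptor_generate(prompt))   # parse the model's tool call
    tool = TOOLS[call["tool"]]                   # dispatch to a local function
    return tool(**call["arguments"])

print(run_tool_call("Get the page title", "<html><h1>Hello</h1></html>"))
```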
The model excels at parsing and extraction tasks. Give it HTML and it can identify headers, extract links, pull out metadata, and structure the information for further processing. This makes it valuable for web scraping, content extraction, and data pipeline tasks where you need fast, reliable parsing without the overhead of larger models.
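One way such an extraction step might look, again with `raptor_generate` stubbed in place of a real inference call and the prompt and JSON shape assumed rather than specified by this card:

```python
import json

def raptor_generate(prompt: str) -> str:
    # Stub standing in for a local Raptor-90M call; replace with your runtime.
    return '{"headers": ["Hello"], "links": ["https://example.com"]}'

def build_prompt(html: str) -> str:
    return (
        "Extract all header text and link URLs from the HTML below. "
        'Reply with JSON of the form {"headers": [...], "links": [...]}.\n\n'
        + html
    )

def extract(html: str) -> dict:
    raw = raptor_generate(build_prompt(html))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Small models occasionally emit malformed JSON; fail soft.
        return {"headers": [], "links": []}

page = '<h1>Hello</h1><a href="https://example.com">link</a>'
print(extract(page))
```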
Summarization happens in milliseconds. The model can condense text, extract key points, and generate concise summaries with latency low enough for real-time applications. This is useful for live transcription summaries, quick document previews, or any scenario where you need rapid text processing.
Speed Above All Else
When we say this model is fast, we mean it changes the performance characteristics of what you can build. Sub-10ms latency means you can call the model hundreds of times per second if needed. You can use it in hot paths of your application without introducing noticeable delay. You can chain multiple inference calls together and still complete the entire workflow faster than a single call to a larger model.
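To make the chaining point concrete, the sketch below times three dependent calls in sequence. `raptor_generate` is a stub for your local runtime and the prompts are illustrative; the point is that the whole chain fits inside one larger model's single-call latency.

```python
import time

def raptor_generate(prompt: str) -> str:
    # Stub for a local inference call; swap in your actual runtime.
    return "ok"

def pipeline(text: str) -> dict:
    # Three dependent model calls chained in a hot path.
    steps = {}
    start = time.perf_counter()
    steps["keypoints"] = raptor_generate("List key points:\n" + text)
    steps["summary"] = raptor_generate("Summarize:\n" + steps["keypoints"])
    steps["label"] = raptor_generate("One-word topic label:\n" + steps["summary"])
    steps["elapsed_ms"] = (time.perf_counter() - start) * 1000
    return steps

print(pipeline("Long article text ..."))
```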
This speed enables new use cases. Real-time analysis of streaming data. Instant feedback on user inputs. Rapid iteration through multiple candidate responses. Processing thousands of documents per minute. The model is fast enough that inference time stops being a constraint on your application design.
On battery-powered devices, the efficiency matters even more. The model uses minimal power, generates minimal heat, and can run continuously without draining the battery. This makes it practical for always-on applications where larger models would be infeasible.
Extreme Edge Deployment
Raptor-90M runs on devices where you would not normally consider deploying AI. Smartwatches can load the full model and run inference locally. Raspberry Pi and similar single-board computers handle it easily. Microcontrollers with sufficient RAM can run lightweight inference for embedded applications.
The model is offline-first by design. Once loaded, it requires no network connection, no API calls, no external dependencies. This is essential for embedded systems, edge devices, and any application where network access is unreliable or undesirable.
Startup time is instant. The model loads in milliseconds, not seconds. This matters for applications that need to spawn model instances on demand, or for devices that need to preserve battery by unloading the model when not in use.
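One way to exploit millisecond-scale startup is a lazy load/unload wrapper like the sketch below. The loader here is a placeholder for a real runtime's load call; the pattern, not the API, is the point.

```python
# Sketch of an on-demand load/unload pattern; the loader callable is
# a hypothetical stand-in for your runtime's model-loading function.
class OnDemandModel:
    def __init__(self, loader):
        self._loader = loader
        self._model = None

    def __call__(self, prompt: str) -> str:
        if self._model is None:           # load only when first needed;
            self._model = self._loader()  # millisecond-scale for 90M params
        return self._model(prompt)

    def unload(self):
        self._model = None                # free RAM when idle or on battery

model = OnDemandModel(lambda: (lambda prompt: "stubbed output"))
print(model("hello"))
model.unload()
```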
Tool Calling at the Edge
The fact that this model supports tool calling despite its tiny size opens up possibilities that were not practical before. You can build agents that run entirely on edge devices, calling local functions to interact with hardware, retrieve sensor data, or control actuators. The model can orchestrate multi-step workflows, decide which tools to call, and process the results—all while consuming less memory than a typical system service.
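A hedged sketch of such an edge agent step: the sensor and actuator hooks (`read_temp`, `set_fan`) are invented placeholders, and the stubbed model call decides which local tool to invoke.

```python
import json
import random

def raptor_generate(prompt: str) -> str:
    # Stub: the model decides which local tool to call next.
    return '{"tool": "read_temp", "arguments": {}}'

# Local hardware hooks; names are illustrative placeholders.
def read_temp() -> float:
    return 20.0 + random.random() * 5      # pretend temperature sensor

def set_fan(on: bool) -> str:
    return f"fan {'on' if on else 'off'}"  # pretend actuator

TOOLS = {"read_temp": read_temp, "set_fan": set_fan}

def agent_step(goal: str) -> str:
    call = json.loads(raptor_generate(f"Goal: {goal}. Pick one tool."))
    result = TOOLS[call["tool"]](**call["arguments"])
    return f"{call['tool']} -> {result}"

print(agent_step("keep the enclosure below 24C"))
```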
DOM parsing and web extraction become viable on resource-constrained devices. The model can navigate web pages, extract structured data, identify relevant content, and format it for downstream processing. This enables local web scrapers, content monitors, and data extraction pipelines that run on minimal hardware.
The structured output capabilities mean the model can generate JSON, fill templates, and produce machine-readable formats with high reliability. This is crucial for tool calling, where incorrect formatting breaks the entire workflow. Despite its size, the model maintains the precision needed for structured generation.
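Because malformed output is the main failure mode at this scale, a validate-and-retry wrapper is a common defensive pattern. The sketch below assumes a stubbed `raptor_generate` and a small fixed retry budget; nothing here is a documented interface.

```python
import json

ATTEMPTS = 3

def raptor_generate(prompt: str) -> str:
    # Stub for a local call; real small models sometimes need a retry.
    return '{"name": "widget", "qty": 3}'

def generate_json(prompt: str, required_keys: set) -> dict:
    for _ in range(ATTEMPTS):
        raw = raptor_generate(prompt)
        try:
            obj = json.loads(raw)
            if required_keys <= obj.keys():  # all required fields present
                return obj
        except json.JSONDecodeError:
            pass  # malformed output: retry rather than break the workflow
    raise ValueError("model failed to produce valid JSON")

print(generate_json("Emit {name, qty} for: three widgets", {"name", "qty"}))
```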
Technical Reality
Ninety million parameters. Memory footprint under 600MB when loaded. Inference speeds over 1,000 tokens per second on Apple Silicon. Sub-10ms first token latency for typical queries. Minimal power consumption suitable for battery-operated devices.
The model runs efficiently on M-series chips, but its size means it also works on older hardware, embedded systems, and devices with limited compute resources. You do not need a high-end machine to deploy this model—you just need a few hundred megabytes of available RAM.
The context window is optimized for the kinds of tasks this model handles well: parsing, extraction, summarization, and short-form reasoning. The model does not try to maintain book-length context because that is not what you use a 90M parameter model for. It focuses on doing small tasks extremely well rather than trying to be a general-purpose assistant.
Part of the Raptor Series
Raptor-90M is the smallest member of our Raptor series of generalist models. It shares the same design philosophy: be good at real tasks rather than chasing benchmark numbers. The model is not trying to compete with billion-parameter models on complex reasoning—it is trying to do simple things fast enough that you can use it in places where larger models are impractical.
For applications that need more capability, you can pair Raptor-90M with larger models in the Bodega OS ecosystem. Use the small model for rapid filtering, extraction, and parsing, then send only relevant content to larger models for deeper analysis. This hybrid approach gives you the speed of a tiny model where it matters while maintaining access to more sophisticated reasoning when you need it.
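A minimal sketch of that cascade, with both model calls stubbed (`raptor_generate` for the 90M triage pass, `large_model_generate` standing in for a bigger Bodega OS model; both names are placeholders):

```python
def raptor_generate(prompt: str) -> str:
    return "RELEVANT"               # stub for the 90M filter pass

def large_model_generate(prompt: str) -> str:
    return "deep analysis ..."      # stub for a larger Bodega OS model

def analyze(documents: list[str], query: str) -> list[str]:
    results = []
    for doc in documents:
        # Cheap first pass: the 90M model triages every document.
        verdict = raptor_generate(f"Is this relevant to '{query}'? "
                                  f"Answer RELEVANT or SKIP.\n{doc}")
        if "RELEVANT" in verdict:
            # Only survivors pay the cost of the larger model.
            results.append(large_model_generate(f"{query}\n\n{doc}"))
    return results

print(analyze(["doc about raptors", "unrelated doc"], "raptor deployment"))
```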
Disclaimer
SRSWTI is not the creator or owner of the underlying foundation model architecture. The foundation model is created and provided by third parties. SRSWTI has trained this model on top of the foundation model but does not endorse, support, represent or guarantee the completeness, truthfulness, accuracy, or reliability of any outputs. You understand that this model can produce content that might be offensive, harmful, inaccurate, deceptive, or otherwise inappropriate. SRSWTI may not monitor or control all model outputs and cannot, and does not, take responsibility for any such outputs. SRSWTI disclaims all warranties or guarantees about the accuracy, reliability or benefits of this model. SRSWTI further disclaims any warranty that the model will meet your requirements, be secure, uninterrupted or available at any time or location, or be error-free or virus-free, or that any errors will be corrected. You will be solely responsible for any damage resulting from your use of or access to this model, your downloading of this model, or use of this model provided by or through SRSWTI.
Crafted by the Bodega team at SRSWTI Research Labs
Building the world's fastest inference and retrieval engines
Making AI accessible, efficient, and powerful for everyone
