Bodega-Vertex-4B
The Sweet Spot of Capability and Efficiency
Bodega-Vertex-4B occupies the balance point within Bodega OS: capable enough for sophisticated tasks, yet efficient enough to run constantly without draining resources. With 4 billion parameters quantized down to a 1GB memory footprint, this model serves as the workhorse of Bodega's routing, preprocessing, and context management systems.
The Router Model
Within Bodega OS, Vertex-4B functions primarily as an intelligent router. When a query comes in, this model determines which specialized models should handle it, whether it needs retrieval before inference, and how to structure the workflow for optimal results. Routing decisions happen in milliseconds at 60-100 tokens per second, making the model fast enough to sit in the critical path without adding noticeable latency.
The model analyzes query intent, complexity, and domain to make routing decisions. Simple questions go to lighter models. Complex reasoning tasks route to larger models. Questions requiring current information trigger retrieval workflows. The model understands the capabilities and limitations of other models in the Bodega ecosystem, routing requests to maximize quality while minimizing compute.
This routing intelligence extends to multi-step workflows. The model can decompose complex requests into subtasks, determine the sequence of operations needed, and orchestrate calls to retrieval systems and specialized models. All of this orchestration happens locally as part of Bodega OS.
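The pattern can be pictured with a small sketch. This is illustrative only, not the actual Bodega OS routing API; the model names, intent fields, and thresholds below are assumptions.

```python
# Illustrative sketch of the routing pattern described above -- not the
# actual Bodega OS API. Model names and intent fields are assumptions.
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    target_model: str        # which specialized model should handle the query
    needs_retrieval: bool    # whether to run a retrieval step first
    subtasks: list[str]      # decomposition for multi-step workflows

def route(query: str, classify) -> RoutingDecision:
    """Use a small classifier model (the 'classify' callable) to pick a workflow."""
    intent = classify(query)  # e.g. {"complexity": "high", "needs_fresh_data": True}
    if intent["needs_fresh_data"]:
        return RoutingDecision("reasoning-large", needs_retrieval=True,
                               subtasks=["retrieve", "summarize", "answer"])
    if intent["complexity"] == "low":
        return RoutingDecision("general-small", needs_retrieval=False,
                               subtasks=["answer"])
    return RoutingDecision("reasoning-large", needs_retrieval=False,
                           subtasks=["answer"])
```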
Summarization and Context Management
Vertex-4B handles summarization throughout Bodega's inference pipeline. When retrieved documents exceed context limits of larger models, Vertex-4B condenses them into focused summaries that preserve key information while fitting within token budgets. This summarization happens fast enough to preprocess retrieval results in real-time.
The model extends effective context limits for other models by intelligently compressing information. Long conversation histories get summarized to maintain coherent context without consuming the entire context window. Retrieved document collections get condensed into digestible summaries that larger models can reason over. Code files get abstracted to their essential structure and logic.
This context management is critical for Bodega's efficiency. Rather than forcing larger models to process verbose inputs, Vertex-4B preprocesses information into dense, information-rich summaries. The larger models then work with cleaner inputs, producing better results faster.
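A minimal sketch of this compression step, assuming a summarization callable backed by Vertex-4B and a whitespace token counter as a stand-in; these names are illustrative, not Bodega internals.

```python
# Minimal sketch of the context-compression step described above: condense
# retrieved documents so they fit a downstream model's token budget.
# The 'summarize' callable and token counter are assumptions.
def compress_context(documents: list[str], summarize, token_budget: int,
                     count_tokens=lambda text: len(text.split())) -> str:
    """Summarize each document, then trim the combined summary to the budget."""
    summaries = [summarize(doc) for doc in documents]
    combined = "\n\n".join(summaries)
    if count_tokens(combined) <= token_budget:
        return combined
    # Still too long: summarize the summaries into a single focused digest.
    return summarize(combined)
```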
Input Processing and Structuring
The model serves as a preprocessing layer for other language models in Bodega OS. User inputs get cleaned, structured, and enriched before being passed to specialized models. Queries get reformulated for clarity. Ambiguous requests get disambiguated. Poorly structured inputs get reorganized into formats that downstream models handle better.
For retrieval workflows, Vertex-4B processes user queries into optimized search queries. It expands terms, identifies key concepts, and generates alternative phrasings to improve retrieval recall. The preprocessed queries produce better retrieval results, which means downstream models work with higher quality context.
Code inputs benefit from Vertex-4B's analysis. The model can identify the programming language, extract relevant context, and structure code snippets in ways that make them easier for specialized code models to process. This preprocessing improves the quality of code generation and analysis throughout Bodega.
Architecture and Performance
The model's four billion parameters are quantized aggressively down to a 1GB memory footprint. This extreme quantization is possible because the model's tasks—routing, summarization, preprocessing—are less sensitive to quantization degradation than complex reasoning. The model maintains accuracy on these focused tasks while consuming minimal resources.
On Apple Silicon, the model delivers 60-100 tokens per second sustained throughput. This speed is essential for its role as a router and preprocessor—it needs to be fast enough that adding it to the pipeline does not slow down overall response times. MLX-based inference leverages unified memory architecture for efficient processing.
The tiny memory footprint means Vertex-4B can stay loaded alongside larger models without resource contention. It runs continuously, handling routing and preprocessing tasks as they arise, without needing to be loaded and unloaded based on demand.
Code Generation and Analysis
Beyond its core routing and preprocessing duties, Vertex-4B handles code generation and analysis for routine tasks. Simple code completion, basic refactoring suggestions, and straightforward bug detection work well at this model size. For complex architectural analysis or sophisticated code generation, the model routes requests to larger specialized models.
The model understands code structure well enough to extract relevant context, identify dependencies, and organize code snippets for analysis. This makes it valuable for preprocessing code before passing it to larger models, ensuring they receive well-structured inputs.
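A toy illustration of that preprocessing idea follows; the language heuristics and field names are assumptions for the sake of the example, not Bodega internals.

```python
# Toy illustration of code preprocessing: guess the language and strip blank
# lines before handing a snippet to a larger code model. Heuristics are
# assumptions, not Bodega internals.
def preprocess_code(snippet: str) -> dict:
    """Return a structured payload a downstream code model can consume."""
    lines = [line for line in snippet.splitlines() if line.strip()]
    if any(line.lstrip().startswith(("def ", "import ", "from ")) for line in lines):
        language = "python"
    elif any("function " in line or line.rstrip().endswith("{") for line in lines):
        language = "javascript-like"
    else:
        language = "unknown"
    return {
        "language": language,
        "line_count": len(lines),
        "code": "\n".join(lines),
    }
```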
Running Within Bodega OS
Vertex-4B is deeply integrated into Bodega's architecture. It is not a standalone model you interact with directly—it is the orchestration layer that makes other models work together efficiently. When you query Bodega, Vertex-4B is likely the first model to see your request, determining how to handle it and preparing inputs for downstream processing.
The model runs entirely on your hardware with all other Bodega components. Routing decisions, summarization, preprocessing—everything stays local. This maintains privacy while enabling sophisticated multi-model workflows that would typically require cloud orchestration.
Technical Details
Four billion parameters quantized to a 1GB memory footprint. Sustained throughput of 60-100 tokens per second on Apple Silicon. Sub-50ms first-token latency for typical routing and preprocessing tasks. MLX-based inference optimized for unified memory architecture.
The model runs efficiently on all M-series chips, from M1 to M3 and beyond. The minimal memory requirement means it coexists easily with larger models, retrieval indices, and other Bodega components without resource conflicts.
Context window is optimized for the model's use cases: query analysis, document summarization, and input preprocessing. The model does not need to maintain book-length context because it processes information in focused chunks before passing it to other models.
Disclaimer
SRSWTI is not the creator or owner of the underlying foundation model architecture. The foundation model is created and provided by third parties. SRSWTI has trained this model on top of the foundation model but does not endorse, support, represent, or guarantee the completeness, truthfulness, accuracy, or reliability of any outputs. You understand that this model can produce content that might be offensive, harmful, inaccurate, deceptive, or otherwise inappropriate. SRSWTI may not monitor or control all model outputs and cannot, and does not, take responsibility for any such outputs. SRSWTI disclaims all warranties or guarantees about the accuracy, reliability, or benefits of this model. SRSWTI further disclaims any warranty that the model will meet your requirements, be secure, uninterrupted, or available at any time or location, be error-free or virus-free, or that any errors will be corrected. You will be solely responsible for any damage resulting from your use of or access to this model, your downloading of this model, or use of this model provided by or through SRSWTI.
Crafted by the Bodega team at SRSWTI Research Labs
Building the world's fastest inference and retrieval engines
Making AI accessible, efficient, and powerful for everyone
