Astria
Astria is a next-generation, fully local multimodal foundation model built on top of a Ministral-based language backbone and a custom vision encoder. This architecture significantly improves visual grounding, multilingual reasoning, and agentic reliability while remaining efficient enough for edge deployment.
Astria Update Highlights
Me7war's latest Astria update pushes the limits of small-scale multimodal AI, combining efficiency, reasoning, and vision capabilities:
Key Features
- Vision Mastery: Custom encoder enables deep image understanding and precise visual-text alignment.
- Multilingual Support: Handles dozens of languages (including English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Chinese, Japanese, and Korean) while maintaining strong reasoning and generation.
- Agent-Ready: Native function calls, reliable JSON outputs, and strict prompt adherence make Astria fully agentic-capable.
- Edge Efficiency: Optimized for minimal hardware without sacrificing performance.
- Large Context Window: Up to 256k tokens for long-form reasoning, document-level comprehension, and complex multi-step tasks.
- Enhanced Reasoning: Ministral backbone ensures stronger factual grounding, smoother multimodal alignment, and improved long-horizon reasoning.
A fully local, compact model redefining what edge-deployable multimodal AI can achieve.
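The agent-ready claims above (native function calls, strict JSON outputs) can be illustrated with a minimal validation sketch. The `parse_tool_call` helper and the `get_weather` tool schema below are hypothetical examples, not part of Astria's API; the raw string stands in for a model response:

```python
import json

# Hypothetical tool schema an agent framework might expose to the model.
TOOLS = {
    "get_weather": {"required": ["city"]},
}

def parse_tool_call(raw: str) -> dict:
    """Validate a response that is expected to be a strict JSON tool call."""
    call = json.loads(raw)  # raises an error on malformed JSON
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = [k for k in TOOLS[name]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return call

# Simulated model output (strict JSON, as the model card claims).
raw = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = parse_tool_call(raw)
print(call["name"], call["arguments"]["city"])  # get_weather Paris
```

Strict JSON adherence matters precisely because a loop like this rejects any response that is not a well-formed, schema-conforming tool call.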
Visual Reasoning Performance
Astria is scored with a custom evaluation using GPT-5 PRO as the judge:
- Astria: 92.53% (new SOTA)
- LLaVA baseline: 90.92%
Evaluation: Astria vs GPT-5
A custom evaluation set of 30 unseen images was constructed. Each image includes three instruction types:
- Conversational understanding
- Detailed visual description
- Complex multimodal reasoning
This yields 90 unique imageβlanguage tasks, evaluated on:
- Astria
- GPT-5
Scoring was performed by GPT-5 PRO, using a 1-10 scale per task.
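To make the protocol concrete, here is a sketch of how per-task judge scores (1-10, over 3 categories x 30 images = 90 tasks) could be aggregated into category and overall percentages. The `aggregate` function is my illustration, and the scores in the demo are placeholders, not the real evaluation data:

```python
from statistics import mean

def aggregate(scores_by_category: dict) -> dict:
    """Convert 1-10 judge scores into per-category and overall percentages."""
    per_cat = {cat: mean(s) / 10 * 100 for cat, s in scores_by_category.items()}
    all_scores = [s for scores in scores_by_category.values() for s in scores]
    per_cat["overall"] = mean(all_scores) / 10 * 100
    return per_cat

# 30 tasks per category; the scores are illustrative placeholders.
demo = {
    "conversation": [9] * 30,
    "description": [9] * 30,
    "complex_reasoning": [10] * 30,
}
print(aggregate(demo)["overall"])  # approximately 93.3
```

A mean judge score of about 9.25/10 across the 90 tasks would correspond to the 92.53% figure reported above.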
Results
Astria outperforms GPT-5 across all instruction categories, validating the effectiveness of the custom vision encoder combined with the Ministral knowledge-enhanced language model.
Model Summary
- Vision Encoder: Custom-built, with precise visual-text alignment
- Language Backbone: Ministral-based, optimized for reasoning and factual accuracy
- Training: End-to-end multimodal alignment with knowledge supervision
- Output: Grounded, structured, and context-aware responses
- Deployment: Fully local and edge-optimized, supporting up to 256k token context
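For the 256k-token context claim, a local deployment typically needs a pre-flight check that a long document actually fits. The sketch below uses a crude 4-characters-per-token heuristic (not Astria's real tokenizer) and a hypothetical `fits_in_context` helper:

```python
# Rough budget check against a 256k-token context window.
# CHARS_PER_TOKEN is a crude heuristic, not Astria's actual tokenizer ratio.
CONTEXT_WINDOW = 256_000
CHARS_PER_TOKEN = 4

def fits_in_context(text: str, reserved_for_output: int = 4_096) -> bool:
    """Estimate whether a prompt plus an output reserve fits in the window."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserved_for_output <= CONTEXT_WINDOW

doc = "x" * 1_100_000  # ~275k estimated tokens
print(fits_in_context(doc))  # False: exceeds the 256k window
```

In practice the real tokenizer should be used for the count; the heuristic only shows the shape of the check.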
License
Astria is released under the Astria License for personal and non-commercial use. Commercial use requires explicit permission from the creator.