Astria
Astria is a next-generation, fully local multimodal foundation model built on top of a Ministral-based language backbone and a custom vision encoder. This architecture significantly improves visual grounding, multilingual reasoning, and agentic reliability while remaining efficient enough for edge deployment.
Astria Update Highlights
Me7war's latest Astria update pushes the limits of small-scale multimodal AI, combining efficiency, reasoning, and vision capabilities:
Key Features
- Vision Mastery: Custom encoder enables deep image understanding and precise visual-text alignment.
- Multilingual Support: Handles dozens of languages (including English, French, Spanish, German, Italian, Portuguese, Dutch, Arabic, Chinese, Japanese, and Korean) while maintaining strong reasoning and generation.
- Agent-Ready: Native function calls, reliable JSON outputs, and strict prompt adherence make Astria fully agentic-capable.
- Edge Efficiency: Optimized for minimal hardware without sacrificing performance.
- Large Context Window: Up to 256k tokens for long-form reasoning, document-level comprehension, and complex multi-step tasks.
- Enhanced Reasoning: Ministral backbone ensures stronger factual grounding, smoother multimodal alignment, and improved long-horizon reasoning.
A fully local, compact model redefining what edge-deployable multimodal AI can achieve.
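The agent-ready claims above (native function calls, strict JSON outputs) can be illustrated with a minimal validation sketch. The `parse_tool_call` helper and the `get_weather` tool schema below are hypothetical examples, not part of Astria's API; the raw string stands in for a model response:

```python
import json

# Hypothetical tool schema an agent framework might expose to the model.
TOOLS = {
    "get_weather": {"required": ["city"]},
}

def parse_tool_call(raw: str) -> dict:
    """Validate a response that is expected to be a strict JSON tool call."""
    call = json.loads(raw)  # raises an error on malformed JSON
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = [k for k in TOOLS[name]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return call

# Simulated model output (strict JSON, as the model card claims).
raw = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
call = parse_tool_call(raw)
print(call["name"], call["arguments"]["city"])  # get_weather Paris
```

Strict JSON adherence matters precisely because a loop like this rejects any response that is not a well-formed, schema-conforming tool call.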
Visual Reasoning Performance
Astria is scored with a custom evaluation using GPT-5 PRO as the judge:
- Astria: 92.53% (new SOTA)
- LLaVA baseline: 90.92%
Evaluation: Astria vs GPT-5
A custom evaluation set of 30 unseen images was constructed. Each image includes three instruction types:
- Conversational understanding
- Detailed visual description
- Complex multimodal reasoning
This yields 90 unique imageβlanguage tasks, evaluated on:
- Astria
- GPT-5
Scoring was performed by GPT-5 PRO, using a 1-10 scale per task.
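To make the protocol concrete, here is a sketch of how per-task judge scores (1-10, over 3 categories x 30 images = 90 tasks) could be aggregated into category and overall percentages. The `aggregate` function is my illustration, and the scores in the demo are placeholders, not the real evaluation data:

```python
from statistics import mean

def aggregate(scores_by_category: dict) -> dict:
    """Convert 1-10 judge scores into per-category and overall percentages."""
    per_cat = {cat: mean(s) / 10 * 100 for cat, s in scores_by_category.items()}
    all_scores = [s for scores in scores_by_category.values() for s in scores]
    per_cat["overall"] = mean(all_scores) / 10 * 100
    return per_cat

# 30 tasks per category; the scores are illustrative placeholders.
demo = {
    "conversation": [9] * 30,
    "description": [9] * 30,
    "complex_reasoning": [10] * 30,
}
print(aggregate(demo)["overall"])  # approximately 93.3
```

A mean judge score of about 9.25/10 across the 90 tasks would correspond to the 92.53% figure reported above.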
Results
Astria outperforms GPT-5 across all instruction categories, validating the effectiveness of the custom vision encoder combined with the Ministral knowledge-enhanced language model.
Model Summary
- Vision Encoder: Custom-built, with precise visual-text alignment
- Language Backbone: Ministral-based, optimized for reasoning and factual accuracy
- Training: End-to-end multimodal alignment with knowledge supervision
- Output: Grounded, structured, and context-aware responses
- Deployment: Fully local and edge-optimized, supporting up to 256k token context
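For the 256k-token context claim, a local deployment typically needs a pre-flight check that a long document actually fits. The sketch below uses a crude 4-characters-per-token heuristic (not Astria's real tokenizer) and a hypothetical `fits_in_context` helper:

```python
# Rough budget check against a 256k-token context window.
# CHARS_PER_TOKEN is a crude heuristic, not Astria's actual tokenizer ratio.
CONTEXT_WINDOW = 256_000
CHARS_PER_TOKEN = 4

def fits_in_context(text: str, reserved_for_output: int = 4_096) -> bool:
    """Estimate whether a prompt plus an output reserve fits in the window."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserved_for_output <= CONTEXT_WINDOW

doc = "x" * 1_100_000  # ~275k estimated tokens
print(fits_in_context(doc))  # False: exceeds the 256k window
```

In practice the real tokenizer should be used for the count; the heuristic only shows the shape of the check.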
License
Astria is released under the Astria License for personal and non-commercial use. Commercial use requires explicit permission from the creator.