🚀 Xoron-Dev: State-of-the-Art Multimodal MoE
Xoron-Dev is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a Mixture of Experts (MoE) backbone with DeepSeek-style shared experts and integrates state-of-the-art encoders (SigLIP-2) and diffusion generators (MobileDiffusion) for comprehensive any-to-any capabilities.
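As a rough illustration of the DeepSeek-style routing described above, the sketch below combines top-k routing over a pool of experts with one always-active shared expert. All names, dimensions, and weights here are toy placeholders, not Xoron-Dev's actual implementation.

```python
# Minimal sketch of MoE routing with a shared expert (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 16, 8, 2

# Toy "experts": one weight matrix each, plus a shared expert that
# bypasses the router and processes every token.
experts = [rng.standard_normal((D, D)) * 0.02 for _ in range(N_EXPERTS)]
shared_expert = rng.standard_normal((D, D)) * 0.02
router = rng.standard_normal((D, N_EXPERTS)) * 0.02

def moe_forward(x):
    """x: (D,) token hidden state -> (D,) output."""
    logits = x @ router                       # router scores, (N_EXPERTS,)
    top = np.argsort(logits)[-TOP_K:]         # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                      # softmax over selected experts
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return routed + x @ shared_expert         # shared expert is always on

x = rng.standard_normal(D)
y = moe_forward(x)
```

The shared expert captures knowledge common to all tokens, so the routed experts are free to specialize; this is the design motivation usually cited for DeepSeek-style MoE layers.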
🌟 Model Highlights
- Architecture: Mixture of Experts (8 Experts + 1 Shared) with Sliding Window Attention.
- Vision: Native understanding of images (384px) and video (up to 32 frames) via SigLIP-2.
- Generation: Integrated MobileDiffusion for fast on-device Image & Video generation.
- Audio: Full-duplex capabilities with Conformer-based ASR (speech-to-text) and neural TTS (text-to-speech).
- Agentic: Trained for tool calling, file operations, and code execution with uncertainty estimation.
- Context: Efficient 128K context window using sliding window attention (4096-token local window).
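The sliding-window attention mentioned in the highlights restricts each token to a fixed band of recent positions, which keeps attention cost linear in sequence length. A minimal sketch of the causal sliding-window mask follows; the model uses a 4096-token window, but a tiny window is used here for readability.

```python
# Illustrative causal sliding-window attention mask (not Xoron-Dev code).
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask; True = query i may attend key j."""
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    # Causal (j <= i) and within the local window of `window` tokens.
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(8, window=3)
# Query position 5 attends only to keys 3, 4, and 5.
```

Stacking many such layers lets information propagate well beyond the local window, which is how a 4096-token window can serve a 128K context.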
📚 Training Data
Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.
🌐 Open Source Datasets
We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:
- Text & Code: Includes `Code-Feedback`, `HumanEvalPack`, `OpenOrca`, and `AgentInstruct` for robust coding and reasoning capabilities.
- Tool Use: Datasets like `Function-Calling-ChatML` and `Synth-APIGen` enable precise tool invocation.
- Vision (Image/Video): Visual understanding is grounded in `ScienceQA`, `Video-MME`, and `VideoInstruct-100K`.
- Generation: Text-to-Image/Video capabilities are fine-tuned on `Stable-Diffusion-Prompts`, the `Sora-Likert-Scoring` datasets by Rapidata, and `WebVid-10M`.
- Audio: Speech tasks are powered by `LibriSpeech`, `LibriTTS-R`, and `HiFi-TTS`.
🧪 Synthetic Data Pipeline
To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets focus on complex behaviors often missing from public corpora:
| Category | Description |
|---|---|
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), and provide citations (Synth-Citation) rather than fabricating information. |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution including Shell errors, timeouts, and multi-step debugging workflows to teach the model how to recover from errors. |
| Git Operations | Simulated version control tasks including committing, handling diffs, and resolving merge conflicts. |
| Chain-of-Thought | Explicit Synth-CoT data to encourage internal reasoning before generating final answers. |
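To make the anti-hallucination category concrete, here is a hypothetical shape for a single Synth-IDK training sample, where the target behavior is abstention rather than fabrication. The field names and schema are illustrative assumptions, not the actual Xoron-Dev data format.

```python
# Hypothetical Synth-IDK sample: the assistant is trained to abstain
# when the question is unanswerable. Schema is illustrative only.
idk_sample = {
    "messages": [
        {
            "role": "user",
            "content": "What was the exact closing price of XYZ stock on 2031-04-01?",
        },
        {
            "role": "assistant",
            "content": (
                "I don't know. That date is in the future and I have no "
                "market data for it, so I can't provide a price."
            ),
        },
    ],
    # Supervision labels consumed by the training pipeline.
    "labels": {"behavior": "abstain", "uncertainty": "high"},
}
```

Synth-FactCheck and Synth-Citation samples would follow the same conversational shape, with targets that verify a claim or attach a source instead of abstaining.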