File size: 5,747 Bytes
325016a 2ef001e 325016a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 |
---
language:
- en
license: mit
library_name: transformers
tags:
- multimodal
- moe
- text-to-image
- image editing
- image to video
- text-to-video
- video editing
- text-to-speech
- speech-to-text
- image-to-text
- video-to-text
- agentic
- tool-use
pipeline_tag: any-to-any
inference: false
datasets:
# === Code & Programming ===
- m-a-p/Code-Feedback
- iamtarun/python_code_instructions_18k_alpaca
- codeparrot/codeparrot-clean
- bigcode/humanevalpack
- loubnabnl/github-jupyter-code-to-text
- saurabh5/rlvr-code-data-Swift
- finbarr/rlvr-code-data-swift-code-edit
- ExAi/Code-Golang-QA-2k
- smcleod/golang-coder
# === Conversation & Agentic ===
- databricks/databricks-dolly-15k
- OpenAssistant/oasst1
- HuggingFaceH4/no_robots
- Open-Orca/OpenOrca
- abhi227070/converstion-to-summarization-dataset
- allenai/WildChat-1M
- THUDM/AgentInstruct
- glaiveai/glaive-code-assistant-v2
- stingning/ultrachat
- RyokoAI/ShareGPT52K
- AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset
# === Tool Use ===
- Locutusque/function-calling-chatml
- driaforall/pythonic-function-calling
- argilla/Synth-APIGen-v0.1
- interstellarninja/tool-calls-singleturn
- interstellarninja/tool-calls-multiturn
# === Vision (Image & Video) ===
- Naveengo/flickr8k
- ybelkada/football-dataset
- jmhessel/newyorker_caption_contest
- derek-thomas/ScienceQA
- HuggingFaceM4/WebSight
- lmms-lab/Video-MME
- MBZUAI/VideoInstruct-100K
# === Generation (Prompts & Media) ===
- Gustavosta/Stable-Diffusion-Prompts
- FredZhang7/stable-diffusion-prompts-2.47M
- succinctly/midjourney-prompts
- osunlp/MagicBrush
- timbrooks/instructpix2pix-clip-filtered
- Rapidata/sora-video-generation-physics-likert-scoring
- Rapidata/sora-video-generation-style-likert-scoring
- Rapidata/sora-video-generation-alignment-likert-scoring
- Rapidata/text-2-video-human-preferences
- Rapidata/text-2-video-human-preferences-sora-2
- TempoFunk/webvid-10M
- multimodalart/panda-70m
- nkp37/OpenVid-1M
- WenhaoWang/VidProM
- WenhaoWang/TIP-I2V
- jovianzm/img2vid-pexels-350k
- TencentARC/MiraData
- APRIL-AIGC/UltraVideo
- Mutonix/Vript
- Rapidata/image-to-video-human-preference-seedance-1-pro
# === Audio ===
- openslr/librispeech_asr
- blabble-io/libritts_r
- parler-tts/mls_eng_10k
- MikhailT/hifi-tts
# === File Ops ===
- renjiepi/medium_20000-file_operations_n100k1
---
# ๐ Xoron-Dev: State-of-the-Art Multimodal MoE
<div align="center">




</div>
**Xoron-Dev** is a unified, multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It leverages a **Mixture of Experts (MoE)** backbone with DeepSeek-style shared experts and integrates SOTA encoders (SigLIP-2) and diffusers (MobileDiffusion) for comprehensive any-to-any capabilities.
## ๐ Model Highlights
* **Architecture:** Mixture of Experts (8 Experts + 1 Shared) with Sliding Window Attention.
* **Vision:** Native understanding of images (384px) and video (up to 32 frames) via SigLIP-2.
* **Generation:** Integrated MobileDiffusion for fast on-device Image & Video generation.
* **Audio:** Full duplex capabilities with Conformer-based ASR (Speech-to-Text) and Neural TTS.
* **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
* **Context:** Efficient 128K context window using sliding window attention (4096 local window).
---
## ๐ Training Data
Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.
### ๐ Open Source Datasets
We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:
* **Text & Code:** Includes `Code-Feedback`, `HumanEvalPack`, `OpenOrca`, and `AgentInstruct` for robust coding and reasoning capabilities.
* **Tool Use:** Datasets like `Function-Calling-ChatML` and `Synth-APIGen` enable precise tool invocation.
* **Vision (Image/Video):** Visual understanding is grounded in `ScienceQA`, `Video-MME`, and `VideoInstruct-100K`.
* **Generation:** Text-to-Image/Video capabilities are fine-tuned on `Stable-Diffusion-Prompts`, `Sora-Likert-Scoring` datasets by Rapidata, and `WebVid-10M`.
* **Audio:** Speech tasks are powered by `LibriSpeech`, `LibriTTS-R`, and `HiFi-TTS`.
### ๐งช Synthetic Data Pipeline
To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom `synth` engine. These datasets focus on complex behaviors often missing from public corpuses:
| Category | Description |
|----------|-------------|
| **Anti-Hallucination** | Training the model to say "I don't know" (`Synth-IDK`), verify facts (`Synth-FactCheck`), and provide citations (`Synth-Citation`) rather than fabricating information. |
| **System Administration** | Simulated environments for `Docker` setup, `SSH` configuration, database management, and package installation (`Synth-AptInstall`). |
| **Code Execution** | Traces of code execution including `Shell` errors, timeouts, and multi-step debugging workflows to teach the model how to recover from errors. |
| **Git Operations** | Simulated version control tasks including committing, handling diffs, and resolving merge conflicts. |
| **Chain-of-Thought** | explicit `Synth-CoT` data to encourage internal reasoning before generating final answers. | |