---
language:
- en
license: mit
library_name: transformers
tags:
- multimodal
- moe
- text-to-image
- image editing
- image to video
- text-to-video
- video editing
- text-to-speech
- speech-to-text
- image-to-text
- video-to-text
- agentic
- tool-use
pipeline_tag: any-to-any
inference: false
datasets:
# === Code & Programming ===
- m-a-p/Code-Feedback
- iamtarun/python_code_instructions_18k_alpaca
- codeparrot/codeparrot-clean
- bigcode/humanevalpack
- loubnabnl/github-jupyter-code-to-text
- saurabh5/rlvr-code-data-Swift
- finbarr/rlvr-code-data-swift-code-edit
- ExAi/Code-Golang-QA-2k
- smcleod/golang-coder
# === Conversation & Agentic ===
- databricks/databricks-dolly-15k
- OpenAssistant/oasst1
- HuggingFaceH4/no_robots
- Open-Orca/OpenOrca
- abhi227070/converstion-to-summarization-dataset
- allenai/WildChat-1M
- THUDM/AgentInstruct
- glaiveai/glaive-code-assistant-v2
- stingning/ultrachat
- RyokoAI/ShareGPT52K
- AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset
# === Tool Use ===
- Locutusque/function-calling-chatml
- driaforall/pythonic-function-calling
- argilla/Synth-APIGen-v0.1
- interstellarninja/tool-calls-singleturn
- interstellarninja/tool-calls-multiturn
# === Vision (Image & Video) ===
- Naveengo/flickr8k
- ybelkada/football-dataset
- jmhessel/newyorker_caption_contest
- derek-thomas/ScienceQA
- HuggingFaceM4/WebSight
- lmms-lab/Video-MME
- MBZUAI/VideoInstruct-100K
# === Generation (Prompts & Media) ===
- Gustavosta/Stable-Diffusion-Prompts
- FredZhang7/stable-diffusion-prompts-2.47M
- succinctly/midjourney-prompts
- osunlp/MagicBrush
- timbrooks/instructpix2pix-clip-filtered
- Rapidata/sora-video-generation-physics-likert-scoring
- Rapidata/sora-video-generation-style-likert-scoring
- Rapidata/sora-video-generation-alignment-likert-scoring
- Rapidata/text-2-video-human-preferences
- Rapidata/text-2-video-human-preferences-sora-2
- TempoFunk/webvid-10M
- multimodalart/panda-70m
- nkp37/OpenVid-1M
- WenhaoWang/VidProM
- WenhaoWang/TIP-I2V
- jovianzm/img2vid-pexels-350k
- TencentARC/MiraData
- APRIL-AIGC/UltraVideo
- Mutonix/Vript
- Rapidata/image-to-video-human-preference-seedance-1-pro
# === Audio ===
- openslr/librispeech_asr
- blabble-io/libritts_r
- parler-tts/mls_eng_10k
- MikhailT/hifi-tts
# === File Ops ===
- renjiepi/medium_20000-file_operations_n100k1
---

# 🚀 Xoron-Dev: State-of-the-Art Multimodal MoE

**Xoron-Dev** is a unified multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It pairs a **Mixture of Experts (MoE)** backbone featuring DeepSeek-style shared experts with state-of-the-art encoders (SigLIP-2) and diffusion generators (MobileDiffusion) for comprehensive any-to-any capabilities.
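
The shared-expert idea is easiest to see in code. Below is a generic, illustrative PyTorch sketch of a DeepSeek-style MoE layer, with made-up dimensions and not the actual Xoron-Dev implementation: one always-on shared expert captures common features, while a router sends each token to its top-k of the routed experts.

```python
# Illustrative DeepSeek-style shared-expert MoE layer (toy dimensions,
# not the actual Xoron-Dev code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        def ffn():
            return nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = ffn()  # always active: no routing decision needed
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the top-k
        out = self.shared(x)                     # shared-expert path, every token
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                hit = idx[:, k] == e             # tokens routed to expert e
                if hit.any():
                    out[hit] = out[hit] + weights[hit, k, None] * expert(x[hit])
        return out

print(SharedExpertMoE()(torch.randn(4, 512)).shape)  # torch.Size([4, 512])
```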

## ✨ Model Highlights

* **Architecture:** Mixture of Experts (8 routed experts + 1 shared expert) with sliding window attention.
* **Vision:** Native understanding of images (384 px) and video (up to 32 frames) via SigLIP-2.
* **Generation:** Integrated MobileDiffusion for fast, on-device image and video generation.
* **Audio:** Full-duplex speech via Conformer-based ASR (speech-to-text) and neural TTS.
* **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
* **Context:** Efficient 128K-token context window using sliding window attention (4,096-token local window); a toy mask construction appears after the usage example below.
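
Because the repository ships its own modeling code, loading goes through `trust_remote_code=True`. The following is a minimal text-generation sketch with the Hugging Face Transformers API; the repo id `xoron/Xoron-Dev` is a placeholder, and the exact interfaces for the image/video/audio heads are defined by the custom code, so treat this as an assumption rather than a guaranteed API:

```python
# Minimal sketch: text-only generation against the MoE backbone.
# Assumes the custom code exposes a standard causal-LM interface;
# "xoron/Xoron-Dev" is a placeholder repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "xoron/Xoron-Dev"  # hypothetical: substitute the real repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # halves memory for the MoE weights
    device_map="auto",
    trust_remote_code=True,      # required: custom MoE/multimodal classes
)

inputs = tokenizer(
    "Explain what the shared expert contributes in an MoE layer.",
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```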
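
The 128K context claim rests on sliding window attention: each token attends only to the previous 4,096 positions, so attention cost grows linearly with sequence length rather than quadratically. A toy mask construction, illustrating the generic technique rather than Xoron-Dev's exact kernel:

```python
# Toy sliding-window causal mask: True = attention allowed.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    i = torch.arange(seq_len).unsqueeze(1)  # query positions (rows)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions (columns)
    causal = j <= i                          # never attend to the future
    local = (i - j) < window                 # stay within the local window
    return causal & local

mask = sliding_window_mask(seq_len=8, window=4)
print(mask.int())
# Row k has at most 4 ones: token k sees only tokens k-3..k.
```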

---

## 📊 Training Data

Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.

### 📚 Open Source Datasets

We use over 50 high-quality Hugging Face datasets (the full list is in the metadata header above), grouped by modality:

* **Text & Code:** `Code-Feedback`, `HumanEvalPack`, `OpenOrca`, and `AgentInstruct` provide robust coding and reasoning coverage.
* **Tool Use:** Datasets such as `function-calling-chatml` and `Synth-APIGen` enable precise tool invocation.
* **Vision (Image/Video):** Visual understanding is grounded in `ScienceQA`, `Video-MME`, and `VideoInstruct-100K`.
* **Generation:** Text-to-image/video capabilities are fine-tuned on `Stable-Diffusion-Prompts`, Rapidata's Sora Likert-scoring sets, and `WebVid-10M`.
* **Audio:** Speech tasks are powered by `LibriSpeech`, `LibriTTS-R`, and `HiFi-TTS`.
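
Each of these corpora can be streamed straight from the Hub with the `datasets` library and blended into a single training stream. The sketch below shows one plausible mixing setup; the chosen datasets, the JSON flattening, and the sampling probabilities are illustrative, not the actual Xoron-Dev recipe:

```python
# Sketch: stream two of the listed corpora and interleave them.
import json
from datasets import load_dataset, interleave_datasets

def as_text(example):
    # Illustrative normalizer: collapse any schema into one "text" field.
    return {"text": json.dumps(example, default=str)}

code = load_dataset("m-a-p/Code-Feedback", split="train", streaming=True)
chat = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

code = code.map(as_text).select_columns(["text"])
chat = chat.map(as_text).select_columns(["text"])

# Weighted interleaving approximates a fixed modality mix without
# materializing either corpus on disk.
mixed = interleave_datasets([code, chat], probabilities=[0.4, 0.6], seed=42)
print(next(iter(mixed))["text"][:200])
```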

### 🧪 Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom `synth` engine. These datasets target complex behaviors that are often missing from public corpora:

| Category | Description |
|----------|-------------|
| **Anti-Hallucination** | Trains the model to say "I don't know" (`Synth-IDK`), verify facts (`Synth-FactCheck`), and provide citations (`Synth-Citation`) rather than fabricate information. |
| **System Administration** | Simulated environments for Docker setup, SSH configuration, database management, and package installation (`Synth-AptInstall`). |
| **Code Execution** | Code-execution traces covering shell errors, timeouts, and multi-step debugging workflows that teach the model to recover from failures. |
| **Git Operations** | Simulated version-control tasks, including committing, handling diffs, and resolving merge conflicts. |
| **Chain-of-Thought** | Explicit `Synth-CoT` data that encourages internal reasoning before the final answer is generated. |
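
To make the anti-hallucination rows concrete, here is one plausible shape for a `Synth-IDK` record. The schema is hypothetical (the actual `synth` engine format is not documented here); it only illustrates preferring a calibrated refusal over a fabricated answer:

```python
# Hypothetical Synth-IDK preference record: the "chosen" reply admits
# uncertainty, while the "rejected" reply fabricates a specific answer.
idk_sample = {
    "prompt": "Which year was the first `synth` engine release tagged?",
    "chosen": (
        "I don't know. I have no reliable source for that release date, "
        "so I won't guess."
    ),
    "rejected": "The first release was tagged in March 2019.",
    "label": "idk_preferred",  # hypothetical field naming
}

assert idk_sample["label"] == "idk_preferred"
```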