---
language:
  - en
license: mit
library_name: transformers
tags:
  - multimodal
  - moe
  - text-to-image
  - image-editing
  - image-to-video
  - text-to-video
  - video-editing
  - text-to-speech
  - speech-to-text
  - image-to-text
  - video-to-text
  - agentic
  - tool-use
pipeline_tag: any-to-any
inference: false
datasets:
  - m-a-p/Code-Feedback
  - iamtarun/python_code_instructions_18k_alpaca
  - codeparrot/codeparrot-clean
  - bigcode/humanevalpack
  - loubnabnl/github-jupyter-code-to-text
  - saurabh5/rlvr-code-data-Swift
  - finbarr/rlvr-code-data-swift-code-edit
  - ExAi/Code-Golang-QA-2k
  - smcleod/golang-coder
  - databricks/databricks-dolly-15k
  - OpenAssistant/oasst1
  - HuggingFaceH4/no_robots
  - Open-Orca/OpenOrca
  - abhi227070/converstion-to-summarization-dataset
  - allenai/WildChat-1M
  - THUDM/AgentInstruct
  - glaiveai/glaive-code-assistant-v2
  - stingning/ultrachat
  - RyokoAI/ShareGPT52K
  - AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset
  - Locutusque/function-calling-chatml
  - driaforall/pythonic-function-calling
  - argilla/Synth-APIGen-v0.1
  - interstellarninja/tool-calls-singleturn
  - interstellarninja/tool-calls-multiturn
  - Naveengo/flickr8k
  - ybelkada/football-dataset
  - jmhessel/newyorker_caption_contest
  - derek-thomas/ScienceQA
  - HuggingFaceM4/WebSight
  - lmms-lab/Video-MME
  - MBZUAI/VideoInstruct-100K
  - Gustavosta/Stable-Diffusion-Prompts
  - FredZhang7/stable-diffusion-prompts-2.47M
  - succinctly/midjourney-prompts
  - osunlp/MagicBrush
  - timbrooks/instructpix2pix-clip-filtered
  - Rapidata/sora-video-generation-physics-likert-scoring
  - Rapidata/sora-video-generation-style-likert-scoring
  - Rapidata/sora-video-generation-alignment-likert-scoring
  - Rapidata/text-2-video-human-preferences
  - Rapidata/text-2-video-human-preferences-sora-2
  - TempoFunk/webvid-10M
  - multimodalart/panda-70m
  - nkp37/OpenVid-1M
  - WenhaoWang/VidProM
  - WenhaoWang/TIP-I2V
  - jovianzm/img2vid-pexels-350k
  - TencentARC/MiraData
  - APRIL-AIGC/UltraVideo
  - Mutonix/Vript
  - Rapidata/image-to-video-human-preference-seedance-1-pro
  - openslr/librispeech_asr
  - blabble-io/libritts_r
  - parler-tts/mls_eng_10k
  - MikhailT/hifi-tts
  - renjiepi/medium_20000-file_operations_n100k1
---

# 🚀 Xoron-Dev: State-of-the-Art Multimodal MoE


Xoron-Dev is a unified multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It pairs a Mixture of Experts (MoE) backbone featuring DeepSeek-style shared experts with a SOTA vision encoder (SigLIP-2) and a lightweight diffusion generator (MobileDiffusion) for comprehensive any-to-any capabilities.
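
The shared-expert MoE routing described above can be sketched in a few lines. The hidden sizes, router weights, and `top_k=2` below are illustrative assumptions (the model card does not publish these details), not the actual implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, routed_experts, shared_expert, router_w, top_k=2):
    """Sketch of a DeepSeek-style MoE layer: every token passes through
    the always-on shared expert, plus a gated sum of its top-k routed
    experts (8 routed + 1 shared, per the model highlights).
    top_k=2 is an illustrative assumption."""
    probs = softmax(x @ router_w)                  # (tokens, n_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # top-k expert ids per token
    out = shared_expert(x)                         # shared expert: no routing
    for t in range(x.shape[0]):
        gates = probs[t, top[t]]
        gates /= gates.sum()                       # renormalize over top-k
        for e, g in zip(top[t], gates):
            out[t] += g * routed_experts[e](x[t:t+1])[0]
    return out
```

The shared expert captures common knowledge every token needs, so the routed experts are free to specialize; that is the motivation DeepSeek gives for the design.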

## 🌟 Model Highlights

- **Architecture:** Mixture of Experts (8 routed experts + 1 shared) with sliding-window attention.
- **Vision:** Native understanding of images (384 px) and video (up to 32 frames) via SigLIP-2.
- **Generation:** Integrated MobileDiffusion for fast on-device image and video generation.
- **Audio:** Full-duplex capabilities with Conformer-based ASR (speech-to-text) and neural TTS.
- **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
- **Context:** Efficient 128K context window using sliding-window attention (4096-token local window).
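
The 4096-token local window can be pictured as a banded causal attention mask. A minimal sketch (the real masking lives inside the attention kernel; this only illustrates which positions may attend to which):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 4096) -> np.ndarray:
    """Causal sliding-window mask: token i may attend to token j iff
    i - window < j <= i. With window=4096 (the local window above),
    each token attends to a bounded span, keeping attention cost
    linear in sequence length instead of quadratic."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Tiny illustration: 8 tokens, window of 4.
mask = sliding_window_mask(8, window=4)
```

Stacking many such layers still propagates information beyond the window, which is how a 4096-token window can serve a 128K context.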

## 📚 Training Data

Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.

### 🌐 Open Source Datasets

We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:

- **Text & Code:** Includes Code-Feedback, HumanEvalPack, OpenOrca, and AgentInstruct for robust coding and reasoning capabilities.
- **Tool Use:** Datasets like Function-Calling-ChatML and Synth-APIGen enable precise tool invocation.
- **Vision (Image/Video):** Visual understanding is grounded in ScienceQA, Video-MME, and VideoInstruct-100K.
- **Generation:** Text-to-image/video capabilities are fine-tuned on Stable-Diffusion-Prompts, Rapidata's Sora Likert-scoring datasets, and WebVid-10M.
- **Audio:** Speech tasks are powered by LibriSpeech, LibriTTS-R, and HiFi-TTS.
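
Multi-source training like this is usually driven by per-category sampling weights. The ratios below are purely hypothetical (the card does not publish its mixture) and only illustrate the mechanism:

```python
import random

# Hypothetical sampling weights per modality bucket; the real training
# mixture for Xoron-Dev is not published.
MIXTURE = {
    "text_code": 0.40,
    "tool_use": 0.15,
    "vision": 0.20,
    "generation": 0.15,
    "audio": 0.10,
}

def sample_buckets(n: int, seed: int = 0) -> list:
    """Draw n modality buckets according to the mixture weights, as a
    data loader would when interleaving the source datasets."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[k] for k in names], k=n)

draws = sample_buckets(1000)
```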

### 🧪 Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets target complex behaviors often missing from public corpora:

| Category | Description |
| --- | --- |
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), and provide citations (Synth-Citation) rather than fabricating information. |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution, including shell errors, timeouts, and multi-step debugging workflows, teaching the model to recover from errors. |
| Git Operations | Simulated version-control tasks, including committing, handling diffs, and resolving merge conflicts. |
| Chain-of-Thought | Explicit Synth-CoT data encouraging internal reasoning before generating final answers. |
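
A Synth-IDK record might be emitted as a chat-format JSONL line like the one below; the field names and the refusal string are illustrative guesses, not the pipeline's actual schema:

```python
import json

def make_idk_record(question, answer=None):
    """Build one anti-hallucination SFT record: when no verified answer
    is available, the target response is an explicit refusal rather
    than a fabricated answer (the Synth-IDK idea)."""
    target = answer if answer is not None else "I don't know."
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": target},
        ]
    }

record = make_idk_record("What is the population of Atlantis?")
line = json.dumps(record)  # one JSONL line for the synthetic corpus
```

Mixing refusal targets with answerable questions teaches the model to condition the refusal on its own uncertainty rather than refuse indiscriminately.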