---
language:
  - en
license: mit
library_name: transformers
tags:
  - multimodal
  - moe
  - text-to-image
  - image-editing
  - image-to-video
  - text-to-video
  - video-editing
  - text-to-speech
  - speech-to-text
  - image-to-text
  - video-to-text
  - agentic
  - tool-use
pipeline_tag: any-to-any
inference: false
datasets:
  - m-a-p/Code-Feedback
  - iamtarun/python_code_instructions_18k_alpaca
  - codeparrot/codeparrot-clean
  - bigcode/humanevalpack
  - loubnabnl/github-jupyter-code-to-text
  - saurabh5/rlvr-code-data-Swift
  - finbarr/rlvr-code-data-swift-code-edit
  - ExAi/Code-Golang-QA-2k
  - smcleod/golang-coder
  - databricks/databricks-dolly-15k
  - OpenAssistant/oasst1
  - HuggingFaceH4/no_robots
  - Open-Orca/OpenOrca
  - abhi227070/converstion-to-summarization-dataset
  - allenai/WildChat-1M
  - THUDM/AgentInstruct
  - glaiveai/glaive-code-assistant-v2
  - stingning/ultrachat
  - RyokoAI/ShareGPT52K
  - AlicanKiraz0/Agentic-Chain-of-Thought-Coding-SFT-Dataset
  - Locutusque/function-calling-chatml
  - driaforall/pythonic-function-calling
  - argilla/Synth-APIGen-v0.1
  - interstellarninja/tool-calls-singleturn
  - interstellarninja/tool-calls-multiturn
  - Naveengo/flickr8k
  - ybelkada/football-dataset
  - jmhessel/newyorker_caption_contest
  - derek-thomas/ScienceQA
  - HuggingFaceM4/WebSight
  - lmms-lab/Video-MME
  - MBZUAI/VideoInstruct-100K
  - Gustavosta/Stable-Diffusion-Prompts
  - FredZhang7/stable-diffusion-prompts-2.47M
  - succinctly/midjourney-prompts
  - osunlp/MagicBrush
  - timbrooks/instructpix2pix-clip-filtered
  - Rapidata/sora-video-generation-physics-likert-scoring
  - Rapidata/sora-video-generation-style-likert-scoring
  - Rapidata/sora-video-generation-alignment-likert-scoring
  - Rapidata/text-2-video-human-preferences
  - Rapidata/text-2-video-human-preferences-sora-2
  - TempoFunk/webvid-10M
  - multimodalart/panda-70m
  - nkp37/OpenVid-1M
  - WenhaoWang/VidProM
  - WenhaoWang/TIP-I2V
  - jovianzm/img2vid-pexels-350k
  - TencentARC/MiraData
  - APRIL-AIGC/UltraVideo
  - Mutonix/Vript
  - Rapidata/image-to-video-human-preference-seedance-1-pro
  - openslr/librispeech_asr
  - blabble-io/libritts_r
  - parler-tts/mls_eng_10k
  - MikhailT/hifi-tts
  - renjiepi/medium_20000-file_operations_n100k1
---

# 🚀 Xoron-Dev: State-of-the-Art Multimodal MoE


Xoron-Dev is a unified multimodal AI model designed to understand and generate text, images, video, and audio within a single architecture. It pairs a Mixture of Experts (MoE) backbone featuring DeepSeek-style shared experts with a SOTA vision encoder (SigLIP-2) and a lightweight diffusion generator (MobileDiffusion) for comprehensive any-to-any capabilities.
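
The shared-expert MoE routing described above can be sketched in a few lines. The hidden sizes, router weights, and `top_k=2` below are illustrative assumptions (the model card does not publish these details), not the actual implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, routed_experts, shared_expert, router_w, top_k=2):
    """Sketch of a DeepSeek-style MoE layer: every token passes through
    the always-on shared expert, plus a gated sum of its top-k routed
    experts (8 routed + 1 shared, per the model highlights).
    top_k=2 is an illustrative assumption."""
    probs = softmax(x @ router_w)                  # (tokens, n_experts)
    top = np.argsort(-probs, axis=-1)[:, :top_k]   # top-k expert ids per token
    out = shared_expert(x)                         # shared expert: no routing
    for t in range(x.shape[0]):
        gates = probs[t, top[t]]
        gates /= gates.sum()                       # renormalize over top-k
        for e, g in zip(top[t], gates):
            out[t] += g * routed_experts[e](x[t:t+1])[0]
    return out
```

The shared expert captures common knowledge every token needs, so the routed experts are free to specialize; that is the motivation DeepSeek gives for the design.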

## 🌟 Model Highlights

- **Architecture:** Mixture of Experts (8 routed experts + 1 shared) with sliding-window attention.
- **Vision:** Native understanding of images (384 px) and video (up to 32 frames) via SigLIP-2.
- **Generation:** Integrated MobileDiffusion for fast on-device image and video generation.
- **Audio:** Full-duplex capabilities with Conformer-based ASR (speech-to-text) and neural TTS.
- **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
- **Context:** Efficient 128K context window using sliding-window attention (4096-token local window).
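
The 4096-token local window can be pictured as a banded causal attention mask. A minimal sketch (the real masking lives inside the attention kernel; this only illustrates which positions may attend to which):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int = 4096) -> np.ndarray:
    """Causal sliding-window mask: token i may attend to token j iff
    i - window < j <= i. With window=4096 (the local window above),
    each token attends to a bounded span, keeping attention cost
    linear in sequence length instead of quadratic."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

# Tiny illustration: 8 tokens, window of 4.
mask = sliding_window_mask(8, window=4)
```

Stacking many such layers still propagates information beyond the window, which is how a 4096-token window can serve a 128K context.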

## 📚 Training Data

Xoron-Dev is trained on a massive, curated mix of open-source Hugging Face datasets and specialized synthetic data generated to enhance agentic capabilities and reduce hallucinations.

### 🌐 Open Source Datasets

We utilize over 50 high-quality datasets from Hugging Face, categorized by modality:

- **Text & Code:** Includes Code-Feedback, HumanEvalPack, OpenOrca, and AgentInstruct for robust coding and reasoning capabilities.
- **Tool Use:** Datasets like Function-Calling-ChatML and Synth-APIGen enable precise tool invocation.
- **Vision (Image/Video):** Visual understanding is grounded in ScienceQA, Video-MME, and VideoInstruct-100K.
- **Generation:** Text-to-image/video capabilities are fine-tuned on Stable-Diffusion-Prompts, Rapidata's Sora Likert-scoring datasets, and WebVid-10M.
- **Audio:** Speech tasks are powered by LibriSpeech, LibriTTS-R, and HiFi-TTS.
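
Multi-source training like this is usually driven by per-category sampling weights. The ratios below are purely hypothetical (the card does not publish its mixture) and only illustrate the mechanism:

```python
import random

# Hypothetical sampling weights per modality bucket; the real training
# mixture for Xoron-Dev is not published.
MIXTURE = {
    "text_code": 0.40,
    "tool_use": 0.15,
    "vision": 0.20,
    "generation": 0.15,
    "audio": 0.10,
}

def sample_buckets(n: int, seed: int = 0) -> list:
    """Draw n modality buckets according to the mixture weights, as a
    data loader would when interleaving the source datasets."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    return rng.choices(names, weights=[MIXTURE[k] for k in names], k=n)

draws = sample_buckets(1000)
```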

### 🧪 Synthetic Data Pipeline

To bridge the gap between general knowledge and actionable agentic behavior, we generate extensive synthetic datasets locally using our custom synth engine. These datasets target complex behaviors often missing from public corpora:

| Category | Description |
| --- | --- |
| Anti-Hallucination | Training the model to say "I don't know" (Synth-IDK), verify facts (Synth-FactCheck), and provide citations (Synth-Citation) rather than fabricating information. |
| System Administration | Simulated environments for Docker setup, SSH configuration, database management, and package installation (Synth-AptInstall). |
| Code Execution | Traces of code execution, including shell errors, timeouts, and multi-step debugging workflows, teaching the model to recover from errors. |
| Git Operations | Simulated version-control tasks, including committing, handling diffs, and resolving merge conflicts. |
| Chain-of-Thought | Explicit Synth-CoT data encouraging internal reasoning before generating final answers. |
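
A Synth-IDK record might be emitted as a chat-format JSONL line like the one below; the field names and the refusal string are illustrative guesses, not the pipeline's actual schema:

```python
import json

def make_idk_record(question, answer=None):
    """Build one anti-hallucination SFT record: when no verified answer
    is available, the target response is an explicit refusal rather
    than a fabricated answer (the Synth-IDK idea)."""
    target = answer if answer is not None else "I don't know."
    return {
        "messages": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": target},
        ]
    }

record = make_idk_record("What is the population of Atlantis?")
line = json.dumps(record)  # one JSONL line for the synthetic corpus
```

Mixing refusal targets with answerable questions teaches the model to condition the refusal on its own uncertainty rather than refuse indiscriminately.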