14.4 GB
32 files
Updated 28 days ago

🚀 Xoron-Dev: State-of-the-Art Multimodal MoE

✨ Xoron-Dev: The Elite SOTA Omni-Modal Intelligence

Xoron-Dev is the definitive open-source architecture for Omni-Modal Artificial Intelligence. Unlike legacy models that treat vision and audio as plugins, Xoron-Dev is designed for native, high-fidelity perception across every major sensory dimension.


🌟 Why Xoron-Dev?

Xoron-Dev represents a massive leap in multimodal reasoning, combining cutting-edge Sparse MoE architecture with a refined sensory stack.

1. 👁️ SOTA Vision (SigLIP-2 & TiTok)

Xoron-Dev exclusively uses SigLIP-2 for superior zero-shot performance and semantic alignment.

  • TiTok 1D VAE: Images are compressed into 256 ultra-dense tokens, allowing Xoron to "see" high-resolution scenes with unprecedented efficiency.
  • 2D-RoPE: Integrated positional embeddings that maintain spatial relationships regardless of aspect ratio.

2. 🎬 Native Video Intelligence (VidTok)

Our custom VidTok encoder uses 3D Volumetric Compression to ingest up to 32 frames of high-definition video natively. Xoron doesn't just see a sequence of images; it understands motion, causality, and temporal context.

3. 🎙️ Raw PCM Audio (Conformer + BigVGAN)

Xoron-Dev processes Raw 16kHz PCM Audio directly. No Mel Spectrograms, no lossy Fourier transforms.

  • Micro-Latency S2S: True Speech-to-Speech interactions (<200ms) for natural, fluid conversations.
  • Zero-Shot Voice Cloning: Instantly clone any voice from a 5-second sample for high-fidelity personalized output.

🧠 The Brain: Aux-Lossless MoE & 128K Ring Attention

A sophisticated Mixture of Experts (MoE) backbone that dynamically routes the logic of every token through specialized hardware-aware sub-networks.

🏗️ Deep Expert Hierarchy

Unlike standard MoE models with uniform experts, Xoron-Dev implements a specialized Deep Expert system.

  • Expert Pool: 16 Experts Total (8 Standard + 8 Deep).
  • Variable Logical Depth: Deep Experts possess internal depths scaling from 2 up to 9 layers.
  • Expert Penalty Routing: A soft utilization penalty ($Cost \propto Depth$) ensures that the model only invokes deeper computation for tasks requiring maximum logical precision, maintaining high inference throughput for simpler tokens.

⚑ Reasoning Acceleration: Fast Ponder

Xoron-Dev features a dedicated FastPonderBlock for near-instant latent deliberation.

  • Attention-Free Reasoning: By bypassing the $O(N^2)$ Self-Attention stack during thought loops, the Depth-3 reasoning block propagates logic at 120+ thoughts/sec.
  • Dynamic Halting: A learned halt_head monitors latent entropy. Once the model reaches a decision (entropy threshold < 0.2), it breaks the ponder loop and returns to token decoding, reducing unnecessary FLOPs by up to 90%.

🔘 Infinite Context

Using Ring Attention, Xoron-Dev can analyze books, hour-long videos, or massive codebases with native 128K context window support.
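
Ring Attention relies on the fact that softmax attention can be computed one key/value block at a time, keeping only a running maximum, a running denominator, and a running weighted sum. The sketch below shows just that numerical core for a single query on a single device; the device-to-device passing of blocks around the ring is omitted, score scaling is left out for brevity, and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention(q, keys, values):
    """Reference: softmax attention for one query over all keys at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]

def blockwise_attention(q, kv_blocks):
    """Same result, visiting one key/value block at a time."""
    running_max = -math.inf
    denom = 0.0
    acc = None
    for keys, values in kv_blocks:
        if acc is None:
            acc = [0.0] * len(values[0])
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.2, -0.4, 1.0]
keys = [[1.0, 0.0, 0.5], [0.3, 0.8, -0.2], [-0.5, 0.1, 0.9], [0.7, -0.6, 0.4]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.5]]

blocks = [(keys[:2], values[:2]), (keys[2:], values[2:])]
expected = full_attention(q, keys, values)
got = blockwise_attention(q, blocks)
print(all(abs(a - b) < 1e-12 for a, b in zip(expected, got)))
```

Since each step needs only one block in memory, the full key/value sequence never has to sit on a single device, which is what makes long context windows practical.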


🚀 Get Started with Xorfice

The easiest way to experience Xoron-Dev is via the xorfice engine, the SOTA orchestrator for multimodal deployment.

Installation

```
pip install xorfice
```

High-Fidelity Interaction

```python
from xorfice import XoronEngine

# The engine automatically handles weights and optimizations
# Correct model slug: Backup-bdg/Xoron-Dev-MultiMoe
engine = XoronEngine(model_path="Backup-bdg/Xoron-Dev-MultiMoe")

# Start an omni-modal conversation
response = engine.generate(
    prompt="Who is this person and what are they doing?",
    images="https://example.com/interview.jpg",
    videos="https://example.com/interview.mp4"
)
print(response["text"])
```

📈 SOTA Benchmarks & Features

| Feature | Xoron-Dev |
| --- | --- |
| Vision Backbone | SigLIP-2 |
| Video Compression | VidTok 3D |
| Audio Ingestion | Raw PCM |
| Inference Efficiency | Sparse MoE (5B) |
| Context Window | 128K (Ring) |

🎨 Creative Generation

Fully integrated with MobileDiffusion, Xoron-Dev doesn't just understand; it creates.

  • Text-to-Video (T2V)
  • Image-to-Video (I2V)
  • Text-to-Image (T2I)
  • Image-to-Image (I2I)
  • Video-to-Video (V2V)

Join the Revolution

Xoron-Dev is more than a model; it's a vision for the future of AI. Build your own multimodal agent today.

Powered by Xoron-Dev Team

Last updated
Jun 6