Tags: Any-to-Any · Transformers · Safetensors · English · xoron · multimodal · Mixture of Experts · text-to-image · image editing · image to video · text-to-video · video editing · text-to-speech · speech-to-text · speech-to-speech · image-to-text · video-to-text · agentic · tool-use · flow-matching · 3d-rope · titok · vidtok · dual-stream-attention · zero-shot-voice-cloning · bigvgan · snake-activation · multi-receptive-field-fusion · custom_code
Update README.md
README.md CHANGED

@@ -103,6 +103,7 @@ datasets:
 🎦
 🎬
 🎤
+🎤
 
 </div>
 
@@ -111,14 +112,15 @@ datasets:
 ## 🌟 Model Highlights
 
 * **Architecture:** Mixture of Experts (8 Experts + 1 Shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
-* **
-* **
-* **
+* **Multi-Scale Training (NEW):** Random scale selection per batch - images (128-512px), videos (128-384px), frames (8-32 including 20).
+* **Vision Encoder:** SigLIP-2 (384px native) with **TiTok-style 1D tokenization** (256 compressed tokens), **Dual-Stream Attention** (2 layers), and **2D-RoPE** for images; **3D-RoPE** + **Temporal MoE** (4 experts) for video (8-32 frames).
+* **Image Generation:** **MoE-DiT** (Diffusion Transformer with 4 MoE experts) using **Flow Matching**, **2D-RoPE**, and **Symmetric Dual-Stream Attention** (SD3/Flux-style). Multi-scale output: 256-512px, 50 inference steps.
+* **Video Generation:** **3D Causal Transformers** (4 layers) with **Flow Matching**, **3D-RoPE** for (x, y, t) positions, and **Temporal Expert Routing** (4 experts). Multi-scale: 8-32 frames @ 128-384px.
 * **Audio (Speech-to-Speech):** **Conformer encoder with RMLA** and **Raw Waveform Tokenizer** for ASR; **Direct waveform decoder** (no vocoder needed!) with **MAS** for TTS; **Zero-Shot Speaker Cloning** with In-Context Audio Prompting. Talk to it, and it talks back!
 * **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
 * **Context:** Efficient 128K context using Ring Attention (4096 chunk size).
-* **Fine-tuning:** LoRA variants including **rsLoRA**, **DoRA**, and **LoRA+**
-* **Multimodal Fusion:** Cross-Attention layers (4 layers, 8 heads)
+* **Fine-tuning:** LoRA variants including **rsLoRA**, **DoRA**, and **LoRA+** (r=32, α=64, 4x B matrix learning rate).
+* **Multimodal Fusion:** Cross-Attention layers (4 layers, 8 heads) + Perceiver Resampler for vision projection.
 * **Performance:** Flash Attention support with FP16-native numerical stability.
 
 ---
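The highlights hunk above describes top-2 routing over 8 experts (plus one shared expert that bypasses the router). As a rough illustrative sketch in NumPy (not the model's actual implementation; function name and shapes are assumed):

```python
import numpy as np

def top2_route(router_logits: np.ndarray):
    """Pick the two highest-scoring experts per token and softmax their gates.

    Illustrative sketch of top-2 MoE routing, not this model's real code.
    router_logits: (tokens, num_experts) raw router scores.
    """
    top2 = np.argsort(router_logits, axis=-1)[:, -2:][:, ::-1]   # (tokens, 2) expert ids, best first
    scores = np.take_along_axis(router_logits, top2, axis=-1)    # scores of the chosen pair
    gates = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable 2-way softmax
    gates /= gates.sum(axis=-1, keepdims=True)
    return top2, gates

# 4 tokens routed across 8 experts; the "+1 Shared" expert would run on every token regardless.
experts, gates = top2_route(np.random.randn(4, 8))
```

Each token's two gate weights sum to 1, so the combined expert output is a convex mixture; the shared expert is added on top without consulting the router.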
@@ -152,7 +154,8 @@ datasets:
 ### 🎬 Video Encoder (3D Causal Transformers)
 | Feature | Description |
 |---------|-------------|
-|
+| Frame Scales | 8, 12, 16, 24, 32 frames (multi-scale) |
+| Resolution Scales | 128, 192, 256, 320, 384px (multi-scale) |
 | Position Encoding | **3D-RoPE** for (x, y, t) coordinates |
 | Attention | 3D Causal Self-Attention |
 | Expert Routing | **Temporal MoE** (4 experts, temporally-aware) |
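The video-encoder hunk above lists 3D-RoPE over (x, y, t) coordinates. One common construction (a hedged sketch; this model's exact variant may differ) splits the head dimension into three chunks and applies ordinary 1D rotary embedding per axis:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Standard RoPE on the last dim: rotate consecutive channel pairs by
    position-dependent angles. x: (n, d) with d even, pos: (n,)."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # (d/2,) inverse frequencies
    ang = pos[:, None] * freqs[None, :]                 # (n, d/2) rotation angles
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

def rope_3d(x: np.ndarray, xs: np.ndarray, ys: np.ndarray, ts: np.ndarray) -> np.ndarray:
    """3D-RoPE sketch: one third of the head dim encodes each of (x, y, t)."""
    d = x.shape[-1] // 3
    return np.concatenate([rope_1d(x[..., :d], xs),
                           rope_1d(x[..., d:2 * d], ys),
                           rope_1d(x[..., 2 * d:], ts)], axis=-1)
```

Because each pair is a pure rotation, vector norms are preserved and relative positions along every axis fall out of the query-key dot product, just as in 1D RoPE.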
@@ -163,7 +166,7 @@ datasets:
 |---------|-------------|
 | Architecture | **MoE-DiT** (Diffusion Transformer with MoE) |
 | Scheduler | **Flow Matching** (not DDPM) |
-| Output Resolution | 384
+| Output Resolution | 256-512px (multi-scale: 256, 320, 384, 448, 512) |
 | Position Encoding | 2D-RoPE |
 | Attention | **Symmetric Dual-Stream Attention** (SD3/Flux-style) |
 | MoE Experts | 4 experts in DiT blocks |
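The image-generation hunk above uses Flow Matching rather than DDPM. The core idea, shown here as a hedged rectified-flow sketch with a straight interpolation path (not necessarily this model's exact formulation), is to regress a velocity field during training and integrate it at inference:

```python
import numpy as np

def flow_matching_target(x0: np.ndarray, x1: np.ndarray, t: float):
    """Straight-path flow matching: x_t interpolates noise x0 -> data x1,
    and the regression target is the constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def euler_sample(v_fn, x: np.ndarray, steps: int = 50) -> np.ndarray:
    """Inference: integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps
    (cf. the 50 inference steps mentioned in the highlights)."""
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v_fn(x, i * dt)
    return x
```

With a perfectly learned (constant) velocity field the Euler integration lands exactly on the data point; in practice the sampler also applies CFG (cf. the 7.5 guidance scale in the video table) by mixing conditional and unconditional velocity predictions.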
@@ -173,14 +176,23 @@ datasets:
 ### 📹 Video Generation (3D Causal + Flow Matching)
 | Feature | Description |
 |---------|-------------|
-| Output Resolution | 256
-| Output Frames |
+| Output Resolution | 128-384px (multi-scale: 128, 192, 256, 320, 384) |
+| Output Frames | 8-32 frames (multi-scale: 8, 12, 16, 20, 24, 32) |
 | Scheduler | **Flow Matching** |
 | Position Encoding | **3D-RoPE** for (x, y, t) |
 | Attention | Factorized Spatial-Temporal (3D Causal) |
 | Expert Routing | **Temporal MoE** (4 experts) |
 | Guidance Scale | 7.5 (CFG) |
 
+### 📐 Multi-Scale Training Configuration
+| Type | Scales | Probabilities |
+|------|--------|---------------|
+| **Image** | 128, 192, 256, 320, 384, 448, 512px | 5%, 10%, 30%, 25%, 15%, 10%, 5% |
+| **Video** | 128, 192, 256, 320, 384px | 10%, 20%, 35%, 25%, 10% |
+| **Frames** | 8, 12, 16, 20, 24, 32 | 10%, 15%, 30%, 20%, 15%, 10% |
+
+Multi-scale training is **enabled by default** with **random** strategy - each batch samples a different scale for variety.
+
 ### 🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)
 | Feature | Description |
 |---------|-------------|
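The multi-scale configuration added above pairs each scale list with sampling probabilities, and the "random" strategy amounts to an independent categorical draw per batch. A minimal sketch using those exact tables (function and constant names are assumed, not the training code):

```python
import numpy as np

# Scales and probabilities copied from the multi-scale training table above.
IMAGE_SCALES = [128, 192, 256, 320, 384, 448, 512]
IMAGE_PROBS  = [.05, .10, .30, .25, .15, .10, .05]
VIDEO_SCALES = [128, 192, 256, 320, 384]
VIDEO_PROBS  = [.10, .20, .35, .25, .10]
FRAME_COUNTS = [8, 12, 16, 20, 24, 32]
FRAME_PROBS  = [.10, .15, .30, .20, .15, .10]

def sample_batch_scale(rng: np.random.Generator) -> dict:
    """'random' strategy: each batch independently draws its resolution and frame count."""
    return {
        "image_px": int(rng.choice(IMAGE_SCALES, p=IMAGE_PROBS)),
        "video_px": int(rng.choice(VIDEO_SCALES, p=VIDEO_PROBS)),
        "frames":   int(rng.choice(FRAME_COUNTS, p=FRAME_PROBS)),
    }

cfg = sample_batch_scale(np.random.default_rng(0))
```

Weighting the middle scales most heavily (e.g. 30% at 256px for images) keeps most compute at the workhorse resolution while still exposing the model to the extremes.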
@@ -199,53 +211,11 @@ Direct audio output without external vocoder:
 |---------|-------------|
 | Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
 | **Snake Activation** | `x + sin²(αx)/α` - preserves audio periodicity |
-| **Multi-Receptive Field Fusion** | Parallel residual stacks (kernels 3, 7, 11) |
+| **Multi-Receptive Field Fusion** | Parallel residual stacks (kernels 3, 7, 11, dilations 1/3/5) |
 | Weight Normalization | Stable training, faster convergence |
-| Upsampling | 256x (rates: 8, 8, 2, 2) |
+| Upsampling | 256x total (rates: 8, 8, 2, 2) from features to 16kHz audio |
 | Streaming | `stream_decode()` for low-latency real-time output |
-
-### 🗣️ Speech-to-Speech API
-The model provides three main methods for voice interaction:
-
-| Method | Description |
-|--------|-------------|
-| `model.listen(audio)` | Encode speech to embeddings (ASR) |
-| `model.speak(text)` | Generate playable audio from text (TTS) |
-| `model.listen_and_respond(audio)` | Full conversation: listen → think → speak back |
-
-```python
-# Example: Talk to the model and it talks back
-response_audio = model.listen_and_respond(your_audio)  # Returns playable waveform
-
-# Example: Make the model say something
-audio = model.speak(tokenizer.encode("Hello, how can I help you?"))
-
-# Save as WAV file
-import soundfile as sf
-sf.write("response.wav", audio.cpu().numpy(), 16000)
-
-# Streaming for real-time (low latency)
-for chunk in model.waveform_decoder.stream_decode(features, chunk_size=10):
-    play_audio(chunk)  # Play each chunk as it's generated
-```
-
-### 🎯 Training Pipeline for Speech
-The model learns to speak using these datasets and losses:
-
-| Dataset | Type | Purpose |
-|---------|------|---------|
-| `openslr/librispeech_asr` | ASR | Learn to transcribe speech |
-| `blabble-io/libritts_r` | TTS | Learn to generate speech |
-| `parler-tts/mls_eng_10k` | TTS | Multi-speaker variety |
-| `MikhailT/hifi-tts` | TTS | High-fidelity speech |
-
-**Training Losses:**
-- **Mel Loss**: MSE between predicted and target mel spectrograms
-- **Duration Loss**: MSE for MAS-predicted durations
-- **Waveform L1 Loss**: Time-domain reconstruction
-- **Multi-Scale STFT Loss**: Frequency-domain quality (512/1024/2048 FFT)
-
----
+| Output Range | [-1, 1] normalized waveform via tanh |
 
 ## 📚 Training Data
 
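The waveform-decoder hunk above gives the Snake activation as `x + sin²(αx)/α`. Written out directly (illustrative only; in BigVGAN-style decoders α is a learned per-channel parameter):

```python
import numpy as np

def snake(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Snake activation x + sin²(αx)/α: identity plus a non-negative
    periodic term, which gives the decoder a built-in bias toward
    oscillatory (audio-like) signals."""
    return x + np.sin(alpha * x) ** 2 / alpha
```

Since sin²(α(x + π/α)) = sin²(αx), the nonlinear part repeats with period π/α, and because it is non-negative, snake(x) ≥ x everywhere.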
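The removed training-pipeline section lists a Multi-Scale STFT loss over 512/1024/2048-point FFTs. A minimal NumPy sketch of that loss (hand-rolled Hann-window STFT for self-containedness; not the actual training code, which would typically use mel/log-magnitude terms as well):

```python
import numpy as np

def stft_mag(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude STFT via a sliding Hann window. x: 1-D waveform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_scale_stft_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Average L1 distance between magnitude spectra at 512/1024/2048-point FFTs,
    so errors are penalized at several time-frequency resolutions at once."""
    total = 0.0
    for n_fft in (512, 1024, 2048):
        p = stft_mag(pred, n_fft, hop=n_fft // 4)
        t = stft_mag(target, n_fft, hop=n_fft // 4)
        total += float(np.mean(np.abs(p - t)))
    return total / 3.0
```

Combining this with the time-domain Waveform L1 loss covers both phase-sensitive sample accuracy and perceptually relevant spectral structure.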