Tags: Any-to-Any · Transformers · Safetensors · English · xoron · multimodal · Mixture of Experts · text-to-image · image editing · image to video · text-to-video · video editing · text-to-speech · speech-to-text · speech-to-speech · image-to-text · video-to-text · agentic · tool-use · flow-matching · 3d-rope · titok · vidtok · dual-stream-attention · zero-shot-voice-cloning · bigvgan · snake-activation · multi-receptive-field-fusion · custom_code
Update README.md
README.md CHANGED

@@ -103,6 +103,7 @@ datasets:
 🎦
 🎬
 🎤
+🎤
 
 </div>
 
@@ -111,14 +112,15 @@ datasets:
 ## 🌟 Model Highlights
 
 * **Architecture:** Mixture of Experts (8 Experts + 1 Shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
-* **
-* **
-* **
+* **Multi-Scale Training (NEW):** Random scale selection per batch - images (128-512px), videos (128-384px), frames (8-32 including 20).
+* **Vision Encoder:** SigLIP-2 (384px native) with **TiTok-style 1D tokenization** (256 compressed tokens), **Dual-Stream Attention** (2 layers), and **2D-RoPE** for images; **3D-RoPE** + **Temporal MoE** (4 experts) for video (8-32 frames).
+* **Image Generation:** **MoE-DiT** (Diffusion Transformer with 4 MoE experts) using **Flow Matching**, **2D-RoPE**, and **Symmetric Dual-Stream Attention** (SD3/Flux-style). Multi-scale output: 256-512px, 50 inference steps.
+* **Video Generation:** **3D Causal Transformers** (4 layers) with **Flow Matching**, **3D-RoPE** for (x, y, t) positions, and **Temporal Expert Routing** (4 experts). Multi-scale: 8-32 frames @ 128-384px.
 * **Audio (Speech-to-Speech):** **Conformer encoder with RMLA** and **Raw Waveform Tokenizer** for ASR; **Direct waveform decoder** (no vocoder needed!) with **MAS** for TTS; **Zero-Shot Speaker Cloning** with In-Context Audio Prompting. Talk to it, and it talks back!
 * **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
 * **Context:** Efficient 128K context using Ring Attention (4096 chunk size).
-* **Fine-tuning:** LoRA variants including **rsLoRA**, **DoRA**, and **LoRA+**
-* **Multimodal Fusion:** Cross-Attention layers (4 layers, 8 heads)
+* **Fine-tuning:** LoRA variants including **rsLoRA**, **DoRA**, and **LoRA+** (r=32, α=64, 4x B matrix learning rate).
+* **Multimodal Fusion:** Cross-Attention layers (4 layers, 8 heads) + Perceiver Resampler for vision projection.
 * **Performance:** Flash Attention support with FP16-native numerical stability.
 
 ---
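The highlights hunk above describes top-2 routing over 8 experts (plus one shared expert that bypasses the router). As a rough illustrative sketch in NumPy (not the model's actual implementation; function name and shapes are assumed):

```python
import numpy as np

def top2_route(router_logits: np.ndarray):
    """Pick the two highest-scoring experts per token and softmax their gates.

    Illustrative sketch of top-2 MoE routing, not this model's real code.
    router_logits: (tokens, num_experts) raw router scores.
    """
    top2 = np.argsort(router_logits, axis=-1)[:, -2:][:, ::-1]   # (tokens, 2) expert ids, best first
    scores = np.take_along_axis(router_logits, top2, axis=-1)    # scores of the chosen pair
    gates = np.exp(scores - scores.max(axis=-1, keepdims=True))  # numerically stable 2-way softmax
    gates /= gates.sum(axis=-1, keepdims=True)
    return top2, gates

# 4 tokens routed across 8 experts; the "+1 Shared" expert would run on every token regardless.
experts, gates = top2_route(np.random.randn(4, 8))
```

Each token's two gate weights sum to 1, so the combined expert output is a convex mixture; the shared expert is added on top without consulting the router.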
@@ -152,7 +154,8 @@ datasets:
 ### 🎬 Video Encoder (3D Causal Transformers)
 | Feature | Description |
 |---------|-------------|
-|
+| Frame Scales | 8, 12, 16, 24, 32 frames (multi-scale) |
+| Resolution Scales | 128, 192, 256, 320, 384px (multi-scale) |
 | Position Encoding | **3D-RoPE** for (x, y, t) coordinates |
 | Attention | 3D Causal Self-Attention |
 | Expert Routing | **Temporal MoE** (4 experts, temporally-aware) |
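The video-encoder hunk above lists 3D-RoPE over (x, y, t) coordinates. One common construction (a hedged sketch; this model's exact variant may differ) splits the head dimension into three chunks and applies ordinary 1D rotary embedding per axis:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Standard RoPE on the last dim: rotate consecutive channel pairs by
    position-dependent angles. x: (n, d) with d even, pos: (n,)."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))   # (d/2,) inverse frequencies
    ang = pos[:, None] * freqs[None, :]                 # (n, d/2) rotation angles
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

def rope_3d(x: np.ndarray, xs: np.ndarray, ys: np.ndarray, ts: np.ndarray) -> np.ndarray:
    """3D-RoPE sketch: one third of the head dim encodes each of (x, y, t)."""
    d = x.shape[-1] // 3
    return np.concatenate([rope_1d(x[..., :d], xs),
                           rope_1d(x[..., d:2 * d], ys),
                           rope_1d(x[..., 2 * d:], ts)], axis=-1)
```

Because each pair is a pure rotation, vector norms are preserved and relative positions along every axis fall out of the query-key dot product, just as in 1D RoPE.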
@@ -163,7 +166,7 @@ datasets:
 |---------|-------------|
 | Architecture | **MoE-DiT** (Diffusion Transformer with MoE) |
 | Scheduler | **Flow Matching** (not DDPM) |
-| Output Resolution | 384
+| Output Resolution | 256-512px (multi-scale: 256, 320, 384, 448, 512) |
 | Position Encoding | 2D-RoPE |
 | Attention | **Symmetric Dual-Stream Attention** (SD3/Flux-style) |
 | MoE Experts | 4 experts in DiT blocks |
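The image-generation hunk above uses Flow Matching rather than DDPM. The core idea, shown here as a hedged rectified-flow sketch with a straight interpolation path (not necessarily this model's exact formulation), is to regress a velocity field during training and integrate it at inference:

```python
import numpy as np

def flow_matching_target(x0: np.ndarray, x1: np.ndarray, t: float):
    """Straight-path flow matching: x_t interpolates noise x0 -> data x1,
    and the regression target is the constant velocity x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    return x_t, x1 - x0

def euler_sample(v_fn, x: np.ndarray, steps: int = 50) -> np.ndarray:
    """Inference: integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps
    (cf. the 50 inference steps mentioned in the highlights)."""
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * v_fn(x, i * dt)
    return x
```

With a perfectly learned (constant) velocity field the Euler integration lands exactly on the data point; in practice the sampler also applies CFG (cf. the 7.5 guidance scale in the video table) by mixing conditional and unconditional velocity predictions.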
@@ -173,14 +176,23 @@ datasets:
 ### 📹 Video Generation (3D Causal + Flow Matching)
 | Feature | Description |
 |---------|-------------|
-| Output Resolution | 256
-| Output Frames |
+| Output Resolution | 128-384px (multi-scale: 128, 192, 256, 320, 384) |
+| Output Frames | 8-32 frames (multi-scale: 8, 12, 16, 20, 24, 32) |
 | Scheduler | **Flow Matching** |
 | Position Encoding | **3D-RoPE** for (x, y, t) |
 | Attention | Factorized Spatial-Temporal (3D Causal) |
 | Expert Routing | **Temporal MoE** (4 experts) |
 | Guidance Scale | 7.5 (CFG) |
 
+### 📐 Multi-Scale Training Configuration
+| Type | Scales | Probabilities |
+|------|--------|---------------|
+| **Image** | 128, 192, 256, 320, 384, 448, 512px | 5%, 10%, 30%, 25%, 15%, 10%, 5% |
+| **Video** | 128, 192, 256, 320, 384px | 10%, 20%, 35%, 25%, 10% |
+| **Frames** | 8, 12, 16, 20, 24, 32 | 10%, 15%, 30%, 20%, 15%, 10% |
+
+Multi-scale training is **enabled by default** with **random** strategy - each batch samples a different scale for variety.
+
 ### 🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)
 | Feature | Description |
 |---------|-------------|
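The multi-scale configuration added above pairs each scale list with sampling probabilities, and the "random" strategy amounts to an independent categorical draw per batch. A minimal sketch using those exact tables (function and constant names are assumed, not the training code):

```python
import numpy as np

# Scales and probabilities copied from the multi-scale training table above.
IMAGE_SCALES = [128, 192, 256, 320, 384, 448, 512]
IMAGE_PROBS  = [.05, .10, .30, .25, .15, .10, .05]
VIDEO_SCALES = [128, 192, 256, 320, 384]
VIDEO_PROBS  = [.10, .20, .35, .25, .10]
FRAME_COUNTS = [8, 12, 16, 20, 24, 32]
FRAME_PROBS  = [.10, .15, .30, .20, .15, .10]

def sample_batch_scale(rng: np.random.Generator) -> dict:
    """'random' strategy: each batch independently draws its resolution and frame count."""
    return {
        "image_px": int(rng.choice(IMAGE_SCALES, p=IMAGE_PROBS)),
        "video_px": int(rng.choice(VIDEO_SCALES, p=VIDEO_PROBS)),
        "frames":   int(rng.choice(FRAME_COUNTS, p=FRAME_PROBS)),
    }

cfg = sample_batch_scale(np.random.default_rng(0))
```

Weighting the middle scales most heavily (e.g. 30% at 256px for images) keeps most compute at the workhorse resolution while still exposing the model to the extremes.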
@@ -199,53 +211,11 @@ Direct audio output without external vocoder:
 |---------|-------------|
 | Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
 | **Snake Activation** | `x + sin²(αx)/α` - preserves audio periodicity |
-| **Multi-Receptive Field Fusion** | Parallel residual stacks (kernels 3, 7, 11) |
+| **Multi-Receptive Field Fusion** | Parallel residual stacks (kernels 3, 7, 11, dilations 1/3/5) |
 | Weight Normalization | Stable training, faster convergence |
-| Upsampling | 256x (rates: 8, 8, 2, 2) |
+| Upsampling | 256x total (rates: 8, 8, 2, 2) from features to 16kHz audio |
 | Streaming | `stream_decode()` for low-latency real-time output |
-
-### 🗣️ Speech-to-Speech API
-The model provides three main methods for voice interaction:
-
-| Method | Description |
-|--------|-------------|
-| `model.listen(audio)` | Encode speech to embeddings (ASR) |
-| `model.speak(text)` | Generate playable audio from text (TTS) |
-| `model.listen_and_respond(audio)` | Full conversation: listen → think → speak back |
-
-```python
-# Example: Talk to the model and it talks back
-response_audio = model.listen_and_respond(your_audio)  # Returns playable waveform
-
-# Example: Make the model say something
-audio = model.speak(tokenizer.encode("Hello, how can I help you?"))
-
-# Save as WAV file
-import soundfile as sf
-sf.write("response.wav", audio.cpu().numpy(), 16000)
-
-# Streaming for real-time (low latency)
-for chunk in model.waveform_decoder.stream_decode(features, chunk_size=10):
-    play_audio(chunk)  # Play each chunk as it's generated
-```
-
-### 🎯 Training Pipeline for Speech
-The model learns to speak using these datasets and losses:
-
-| Dataset | Type | Purpose |
-|---------|------|---------|
-| `openslr/librispeech_asr` | ASR | Learn to transcribe speech |
-| `blabble-io/libritts_r` | TTS | Learn to generate speech |
-| `parler-tts/mls_eng_10k` | TTS | Multi-speaker variety |
-| `MikhailT/hifi-tts` | TTS | High-fidelity speech |
-
-**Training Losses:**
-- **Mel Loss**: MSE between predicted and target mel spectrograms
-- **Duration Loss**: MSE for MAS-predicted durations
-- **Waveform L1 Loss**: Time-domain reconstruction
-- **Multi-Scale STFT Loss**: Frequency-domain quality (512/1024/2048 FFT)
-
----
+| Output Range | [-1, 1] normalized waveform via tanh |
 
 ## 📚 Training Data
 
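The waveform-decoder hunk above gives the Snake activation as `x + sin²(αx)/α`. Written out directly (illustrative only; in BigVGAN-style decoders α is a learned per-channel parameter):

```python
import numpy as np

def snake(x: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Snake activation x + sin²(αx)/α: identity plus a non-negative
    periodic term, which gives the decoder a built-in bias toward
    oscillatory (audio-like) signals."""
    return x + np.sin(alpha * x) ** 2 / alpha
```

Since sin²(α(x + π/α)) = sin²(αx), the nonlinear part repeats with period π/α, and because it is non-negative, snake(x) ≥ x everywhere.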
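The removed training-pipeline section lists a Multi-Scale STFT loss over 512/1024/2048-point FFTs. A minimal NumPy sketch of that loss (hand-rolled Hann-window STFT for self-containedness; not the actual training code, which would typically use mel/log-magnitude terms as well):

```python
import numpy as np

def stft_mag(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude STFT via a sliding Hann window. x: 1-D waveform."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_scale_stft_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Average L1 distance between magnitude spectra at 512/1024/2048-point FFTs,
    so errors are penalized at several time-frequency resolutions at once."""
    total = 0.0
    for n_fft in (512, 1024, 2048):
        p = stft_mag(pred, n_fft, hop=n_fft // 4)
        t = stft_mag(target, n_fft, hop=n_fft // 4)
        total += float(np.mean(np.abs(p - t)))
    return total / 3.0
```

Combining this with the time-domain Waveform L1 loss covers both phase-sensitive sample accuracy and perceptually relevant spectral structure.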