Backup-bdg committed
Commit e2adcb4 · verified · 1 parent: 45002c6

Update README.md

Files changed (1): README.md (+24, −54)
README.md CHANGED
@@ -103,6 +103,7 @@ datasets:
 ![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)
 ![Params](https://img.shields.io/badge/Parameters-1.5B_MoE-yellow?style=for-the-badge)
 ![Context](https://img.shields.io/badge/Context-128K-red?style=for-the-badge)
+![Version](https://img.shields.io/badge/Version-2.1-purple?style=for-the-badge)
 
 </div>
 
@@ -111,14 +112,15 @@ datasets:
 ## 🌟 Model Highlights
 
 * **Architecture:** Mixture of Experts (8 Experts + 1 Shared, top-2 routing) with Ring Attention and Aux-Lossless routing.
-* **Vision Encoder:** SigLIP-2 (384px) with **TiTok-style 1D tokenization**, **Dual-Stream Attention**, and **2D-RoPE** for images; **3D-RoPE** + **Temporal MoE** for video (up to 16 frames).
-* **Image Generation:** **MoE-DiT** (Diffusion Transformer with MoE) using **Flow Matching**, **2D-RoPE**, and **Symmetric Dual-Stream Attention** (SD3/Flux-style).
-* **Video Generation:** **3D Causal Transformers** with **Flow Matching**, **3D-RoPE** for (x,y,t) positions, and **Temporal Expert Routing**.
+* **Multi-Scale Training (NEW):** Random scale selection per batch - images (128-512px), videos (128-384px), frames (8-32 including 20).
+* **Vision Encoder:** SigLIP-2 (384px native) with **TiTok-style 1D tokenization** (256 compressed tokens), **Dual-Stream Attention** (2 layers), and **2D-RoPE** for images; **3D-RoPE** + **Temporal MoE** (4 experts) for video (8-32 frames).
+* **Image Generation:** **MoE-DiT** (Diffusion Transformer with 4 MoE experts) using **Flow Matching**, **2D-RoPE**, and **Symmetric Dual-Stream Attention** (SD3/Flux-style). Multi-scale output: 256-512px, 50 inference steps.
+* **Video Generation:** **3D Causal Transformers** (4 layers) with **Flow Matching**, **3D-RoPE** for (x,y,t) positions, and **Temporal Expert Routing** (4 experts). Multi-scale: 8-32 frames @ 128-384px.
 * **Audio (Speech-to-Speech):** **Conformer encoder with RMLA** and **Raw Waveform Tokenizer** for ASR; **Direct waveform decoder** (no vocoder needed!) with **MAS** for TTS; **Zero-Shot Speaker Cloning** with In-Context Audio Prompting. Talk to it, and it talks back!
 * **Agentic:** Trained for tool calling, file operations, and code execution with uncertainty estimation.
 * **Context:** Efficient 128K context using Ring Attention (4096 chunk size).
-* **Fine-tuning:** LoRA variants including **rsLoRA**, **DoRA**, and **LoRA+** with configurable learning rate ratio.
-* **Multimodal Fusion:** Cross-Attention layers (4 layers, 8 heads) for deep multimodal integration.
+* **Fine-tuning:** LoRA variants including **rsLoRA**, **DoRA**, and **LoRA+** (r=32, α=64, 4x B matrix learning rate).
+* **Multimodal Fusion:** Cross-Attention layers (4 layers, 8 heads) + Perceiver Resampler for vision projection.
 * **Performance:** Flash Attention support with FP16-native numerical stability.
 
 ---
@@ -152,7 +154,8 @@ datasets:
 ### 🎬 Video Encoder (3D Causal Transformers)
 | Feature | Description |
 |---------|-------------|
-| Max Frames | 16 frames |
+| Frame Scales | 8, 12, 16, 24, 32 frames (multi-scale) |
+| Resolution Scales | 128, 192, 256, 320, 384px (multi-scale) |
 | Position Encoding | **3D-RoPE** for (x, y, t) coordinates |
 | Attention | 3D Causal Self-Attention |
 | Expert Routing | **Temporal MoE** (4 experts, temporally-aware) |
@@ -163,7 +166,7 @@ datasets:
 |---------|-------------|
 | Architecture | **MoE-DiT** (Diffusion Transformer with MoE) |
 | Scheduler | **Flow Matching** (not DDPM) |
-| Output Resolution | 384×384 |
+| Output Resolution | 256-512px (multi-scale: 256, 320, 384, 448, 512) |
 | Position Encoding | 2D-RoPE |
 | Attention | **Symmetric Dual-Stream Attention** (SD3/Flux-style) |
 | MoE Experts | 4 experts in DiT blocks |
@@ -173,14 +176,23 @@ datasets:
 ### 📹 Video Generation (3D Causal + Flow Matching)
 | Feature | Description |
 |---------|-------------|
-| Output Resolution | 256×256 |
-| Output Frames | 16 frames (default), up to 32 frames (max capacity) |
+| Output Resolution | 128-384px (multi-scale: 128, 192, 256, 320, 384) |
+| Output Frames | 8-32 frames (multi-scale: 8, 12, 16, 20, 24, 32) |
 | Scheduler | **Flow Matching** |
 | Position Encoding | **3D-RoPE** for (x, y, t) |
 | Attention | Factorized Spatial-Temporal (3D Causal) |
 | Expert Routing | **Temporal MoE** (4 experts) |
 | Guidance Scale | 7.5 (CFG) |
 
+### 📐 Multi-Scale Training Configuration
+| Type | Scales | Probabilities |
+|------|--------|---------------|
+| **Image** | 128, 192, 256, 320, 384, 448, 512px | 5%, 10%, 30%, 25%, 15%, 10%, 5% |
+| **Video** | 128, 192, 256, 320, 384px | 10%, 20%, 35%, 25%, 10% |
+| **Frames** | 8, 12, 16, 20, 24, 32 | 10%, 15%, 30%, 20%, 15%, 10% |
+
+Multi-scale training is **enabled by default** with **random** strategy - each batch samples a different scale for variety.
+
 ### 🎤 Audio (Speech-to-Speech with RMLA + MAS + Zero-Shot Cloning)
 | Feature | Description |
 |---------|-------------|
@@ -199,53 +211,11 @@ Direct audio output without external vocoder:
 |---------|-------------|
 | Architecture | BigVGAN/HiFi-GAN style with transposed convolutions |
 | **Snake Activation** | `x + sin²(αx)/α` - preserves audio periodicity |
-| **Multi-Receptive Field Fusion** | Parallel residual stacks (kernels 3, 7, 11) |
+| **Multi-Receptive Field Fusion** | Parallel residual stacks (kernels 3, 7, 11, dilations 1/3/5) |
 | Weight Normalization | Stable training, faster convergence |
-| Upsampling | 256x (rates: 8, 8, 2, 2) |
+| Upsampling | 256x total (rates: 8, 8, 2, 2) from features to 16kHz audio |
 | Streaming | `stream_decode()` for low-latency real-time output |
-
-### 🗣️ Speech-to-Speech API
-The model provides three main methods for voice interaction:
-
-| Method | Description |
-|--------|-------------|
-| `model.listen(audio)` | Encode speech to embeddings (ASR) |
-| `model.speak(text)` | Generate playable audio from text (TTS) |
-| `model.listen_and_respond(audio)` | Full conversation: listen → think → speak back |
-
-```python
-# Example: Talk to the model and it talks back
-response_audio = model.listen_and_respond(your_audio)  # Returns playable waveform
-
-# Example: Make the model say something
-audio = model.speak(tokenizer.encode("Hello, how can I help you?"))
-
-# Save as WAV file
-import soundfile as sf
-sf.write("response.wav", audio.cpu().numpy(), 16000)
-
-# Streaming for real-time (low latency)
-for chunk in model.waveform_decoder.stream_decode(features, chunk_size=10):
-    play_audio(chunk)  # Play each chunk as it's generated
-```
-
-### 🎯 Training Pipeline for Speech
-The model learns to speak using these datasets and losses:
-
-| Dataset | Type | Purpose |
-|---------|------|---------|
-| `openslr/librispeech_asr` | ASR | Learn to transcribe speech |
-| `blabble-io/libritts_r` | TTS | Learn to generate speech |
-| `parler-tts/mls_eng_10k` | TTS | Multi-speaker variety |
-| `MikhailT/hifi-tts` | TTS | High-fidelity speech |
-
-**Training Losses:**
-- **Mel Loss**: MSE between predicted and target mel spectrograms
-- **Duration Loss**: MSE for MAS-predicted durations
-- **Waveform L1 Loss**: Time-domain reconstruction
-- **Multi-Scale STFT Loss**: Frequency-domain quality (512/1024/2048 FFT)
-
----
+| Output Range | [-1, 1] normalized waveform via tanh |
 
 ## 📚 Training Data
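A note on the waveform-decoder table above: the Snake activation row gives the formula `x + sin²(αx)/α` directly, so it can be sketched in a few lines. This is an illustrative sketch, not code from the repository; the function name `snake` and the default `alpha=1.0` are assumptions.

```python
import numpy as np

def snake(x, alpha=1.0):
    # Snake activation from the decoder table: x + sin^2(alpha * x) / alpha.
    # The periodic sin^2 term biases the decoder toward the periodic
    # structure of raw audio waveforms.
    return x + np.sin(alpha * x) ** 2 / alpha
```

The `sin²` term vanishes at multiples of π/α, so the activation reduces to the identity there (e.g. `snake(0.0)` is `0.0`).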
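The "random" multi-scale strategy described in the diff (each batch samples one scale from a fixed distribution) can be sketched with the image-scale row of the Multi-Scale Training Configuration table. The helper name `sample_scale` is an assumption for illustration; the scales and probabilities are copied from the table.

```python
import random

# Image scales and probabilities from the Multi-Scale Training Configuration table.
IMAGE_SCALES = [128, 192, 256, 320, 384, 448, 512]
IMAGE_PROBS  = [0.05, 0.10, 0.30, 0.25, 0.15, 0.10, 0.05]

def sample_scale(scales, probs, rng=random):
    # "random" strategy: each batch independently draws one scale
    # from the configured distribution.
    return rng.choices(scales, weights=probs, k=1)[0]

batch_resolution = sample_scale(IMAGE_SCALES, IMAGE_PROBS)
```

The same pattern applies to the video-resolution and frame-count rows, with their own scale lists and weights.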
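Both generation tables list Flow Matching as the scheduler. As a minimal sketch of what that objective looks like, assuming the common rectified-flow-style linear probability path (the README does not specify the exact variant), the model regresses a constant velocity along the straight line between noise and data:

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    # Linear path between noise x0 and data x1: x_t = (1 - t) * x0 + t * x1.
    # The network is trained to predict the constant velocity v = x1 - x0
    # given (x_t, t); sampling then integrates v from t = 0 to t = 1.
    xt = (1.0 - t) * x0 + t * x1
    v = x1 - x0
    return xt, v
```

At `t = 0` the interpolant is pure noise and at `t = 1` it is the data sample, which is what lets sampling run as a simple ODE integration rather than a DDPM-style reverse chain.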
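The architecture bullet describes top-2 routing over 8 routed experts plus 1 always-on shared expert. A generic sketch of that gating step, with assumed names (`top2_gate`) and not the repository's actual implementation:

```python
import numpy as np

def top2_gate(logits):
    # Keep the two highest-scoring of the routed experts and renormalize
    # their softmax weights; a shared expert would be applied unconditionally
    # alongside whatever this gate selects.
    top2 = np.argsort(logits)[-2:]                  # indices of the top-2 experts
    w = np.exp(logits[top2] - np.max(logits[top2])) # stable softmax over the pair
    return top2, w / w.sum()

experts, weights = top2_gate(np.array([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.9]))
```

Each token's output is then the weighted sum of its two selected experts plus the shared expert, so only a fraction of the 1.5B parameters is active per token.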