# Data Flow
```mermaid
flowchart TD
    %% User Input Sources
    subgraph "User Input"
        MIC[Microphone Recording<br/>Raw audio + timestamps]
        FILE[File Upload<br/>WAV/FLAC files]
        TEXT[Manual Transcripts<br/>Text input]
        LANG[Language Selection<br/>25+ languages]
    end
    %% Data Processing Pipeline
    subgraph "Data Processing"
        AUDIO_PROC[Audio Processing<br/>Resampling to 16kHz<br/>Format conversion]
        TEXT_PROC[Text Processing<br/>Transcript validation<br/>Cleaning & formatting]
        JSONL_CONV["JSONL Conversion<br/>{#quot;audio_path#quot;: #quot;...#quot;, #quot;text#quot;: #quot;...#quot;}"]
    end
    %% Dataset Storage
    subgraph "Dataset Storage"
        LOCAL_DS["Local Dataset<br/>datasets/voxtral_user/<br/>data.jsonl + wavs/"]
        HF_DS[HF Hub Dataset<br/>username/dataset-name<br/>Public sharing]
    end
    %% Training Data Flow
    subgraph "Training Data Pipeline"
        DS_LOADER["Dataset Loader<br/>_load_jsonl_dataset()<br/>or load_dataset()"]
        AUDIO_CAST["Audio Casting<br/>Audio(sampling_rate=16000)"]
        TRAIN_SPLIT[Train Split<br/>train_dataset]
        EVAL_SPLIT[Eval Split<br/>eval_dataset]
    end
    %% Model Training
    subgraph "Model Training"
        COLLATOR[VoxtralDataCollator<br/>Audio + Text batching<br/>Prompt construction]
        FORWARD[Forward Pass<br/>Audio → Features → Text]
        LOSS[Loss Calculation<br/>Masked LM loss]
        BACKWARD[Backward Pass<br/>Gradient computation]
        OPTIMIZE[Parameter Updates<br/>LoRA or full fine-tuning]
    end
    %% Training Outputs
    subgraph "Training Outputs"
        MODEL_FILES[Model Files<br/>model.safetensors<br/>config.json<br/>tokenizer.json]
        TRAINING_LOGS[Training Logs<br/>train_results.json<br/>training_config.json<br/>loss curves]
        CHECKPOINTS[Checkpoints<br/>Intermediate models<br/>best model tracking]
    end
    %% Publishing Pipeline
    subgraph "Publishing Pipeline"
        HF_REPO[HF Repository<br/>username/model-name<br/>Model hosting]
        MODEL_CARD[Model Card<br/>README.md<br/>Training details<br/>Usage examples]
        METADATA[Training Metadata<br/>Config + results<br/>Performance metrics]
    end
    %% Demo Deployment
    subgraph "Demo Deployment"
        SPACE_REPO[HF Space Repository<br/>username/model-name-demo<br/>Demo hosting]
        DEMO_APP[Demo Application<br/>Gradio interface<br/>Real-time inference]
        ENV_VARS[Environment Config<br/>HF_MODEL_ID<br/>MODEL_NAME<br/>secrets]
    end
    %% External Data Sources
    subgraph "External Data Sources"
        GRANARY[NVIDIA Granary<br/>Multilingual ASR data<br/>25+ languages]
        HF_COMM[HF Community Datasets<br/>Public ASR datasets<br/>Standard formats]
    end
    %% Data Flow Connections
    MIC --> AUDIO_PROC
    FILE --> AUDIO_PROC
    TEXT --> TEXT_PROC
    LANG --> TEXT_PROC
    AUDIO_PROC --> JSONL_CONV
    TEXT_PROC --> JSONL_CONV
    JSONL_CONV --> LOCAL_DS
    LOCAL_DS --> HF_DS
    LOCAL_DS --> DS_LOADER
    HF_DS --> DS_LOADER
    GRANARY --> DS_LOADER
    HF_COMM --> DS_LOADER
    DS_LOADER --> AUDIO_CAST
    AUDIO_CAST --> TRAIN_SPLIT
    AUDIO_CAST --> EVAL_SPLIT
    TRAIN_SPLIT --> COLLATOR
    EVAL_SPLIT --> COLLATOR
    COLLATOR --> FORWARD
    FORWARD --> LOSS
    LOSS --> BACKWARD
    BACKWARD --> OPTIMIZE
    OPTIMIZE --> MODEL_FILES
    OPTIMIZE --> TRAINING_LOGS
    OPTIMIZE --> CHECKPOINTS
    MODEL_FILES --> HF_REPO
    TRAINING_LOGS --> HF_REPO
    CHECKPOINTS --> HF_REPO
    HF_REPO --> MODEL_CARD
    TRAINING_LOGS --> MODEL_CARD
    MODEL_CARD --> SPACE_REPO
    HF_REPO --> SPACE_REPO
    ENV_VARS --> SPACE_REPO
    SPACE_REPO --> DEMO_APP
    %% Styling
    classDef input fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
    classDef storage fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef training fill:#e8f5e8,stroke:#388e3c,stroke-width:2px
    classDef output fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef publishing fill:#e1f5fe,stroke:#0277bd,stroke-width:2px
    classDef deployment fill:#f5f5f5,stroke:#424242,stroke-width:2px
    classDef external fill:#efebe9,stroke:#5d4037,stroke-width:2px
    class MIC,FILE,TEXT,LANG input
    class AUDIO_PROC,TEXT_PROC,JSONL_CONV processing
    class LOCAL_DS,HF_DS storage
    class DS_LOADER,AUDIO_CAST,TRAIN_SPLIT,EVAL_SPLIT,COLLATOR,FORWARD,LOSS,BACKWARD,OPTIMIZE training
    class MODEL_FILES,TRAINING_LOGS,CHECKPOINTS output
    class HF_REPO,MODEL_CARD,METADATA publishing
    class SPACE_REPO,DEMO_APP,ENV_VARS deployment
    class GRANARY,HF_COMM external
```
## Data Flow Overview
This diagram illustrates the complete data flow through the Voxtral ASR Fine-tuning application, from user input to deployed demo.
### Data Input Sources
#### User-Generated Data
- **Microphone Recording**: Raw audio captured through browser microphone
- **File Upload**: Existing WAV/FLAC audio files
- **Manual Transcripts**: User-provided text transcriptions
- **Language Selection**: Influences phrase selection from NVIDIA Granary
#### External Data Sources
- **NVIDIA Granary**: High-quality multilingual ASR dataset
- **HF Community Datasets**: Public datasets from Hugging Face Hub
### Data Processing Pipeline
#### Audio Processing
```python
import librosa
import soundfile as sf

# Resample to 16 kHz; librosa.load returns a (samples, sample_rate) tuple
audio_data, sample_rate = librosa.load(audio_path, sr=16000)
# Write as WAV for consistency across the dataset
sf.write(output_path, audio_data, sample_rate)
```
#### Text Processing
```python
# Text cleaning and validation
text = text.strip()
# Basic validation: reject empty transcriptions
# (raise rather than assert, so the check still runs under `python -O`)
if not text:
    raise ValueError("Empty transcription")
```
#### JSONL Conversion
```python
import json

# Standard format for all datasets
entry = {
    "audio_path": str(audio_file_path),
    "text": cleaned_transcription,
}
# Append one JSON object per line; keep non-ASCII transcripts readable
with open(jsonl_path, "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```
### Dataset Storage
#### Local Storage Structure
```
datasets/voxtral_user/
├── data.jsonl # Main dataset file
├── recorded_data.jsonl # From recordings
└── wavs/ # Audio files
    ├── recording_0000.wav
    ├── recording_0001.wav
    └── ...
```
#### HF Hub Storage
- **Public Datasets**: Shareable with community
- **Version Control**: Dataset versioning and updates
- **Standard Metadata**: Automatic README generation
### Training Data Pipeline
#### Dataset Loading
```python
from datasets import load_dataset

# Load local JSONL (_load_jsonl_dataset is the project's own helper)
ds = _load_jsonl_dataset("datasets/voxtral_user/data.jsonl")
# Or load a dataset from the HF Hub
ds = load_dataset("username/dataset-name", split="train")
```
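
The body of `_load_jsonl_dataset` is not shown here. As a rough sketch of the parsing step such a helper would need, the function below reads the `{"audio_path", "text"}` records written during JSONL conversion and skips blank or incomplete lines. The name `read_jsonl_records` and the renaming of `audio_path` to an `audio` key (so it lines up with the `audio` column cast in the next step) are assumptions for illustration, not the project's actual code.

```python
import json


def read_jsonl_records(jsonl_path):
    """Read {"audio_path", "text"} records, skipping blank or incomplete lines."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            audio_path, text = obj.get("audio_path"), obj.get("text")
            # Assumption: entries missing either field are dropped, not repaired
            if audio_path and text:
                records.append({"audio": str(audio_path), "text": text})
    return records
```

A list like this could then be handed to `datasets.Dataset.from_list(records)` to obtain a dataset object.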
#### Audio Casting
```python
from datasets import Audio

# Ensure consistent sampling rate
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
```
#### Train/Eval Split
```python
# Create train and eval datasets
train_dataset = ds.select(range(train_count))
eval_dataset = ds.select(range(train_count, train_count + eval_count))
```
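
The split above selects rows by index, so it fails if `train_count + eval_count` exceeds the number of rows. A minimal sketch of a guard, assuming a hypothetical helper named `split_ranges` that clamps both counts to the dataset size:

```python
def split_ranges(num_rows, train_count, eval_count):
    """Return (train_range, eval_range), clamped to the dataset size."""
    train_end = min(max(train_count, 0), num_rows)
    eval_end = min(train_end + max(eval_count, 0), num_rows)
    return range(train_end), range(train_end, eval_end)
```

It would be used as `train_idx, eval_idx = split_ranges(len(ds), train_count, eval_count)`, followed by `ds.select(train_idx)` and `ds.select(eval_idx)`. With a small dataset the eval split can come back empty, which is worth checking before training starts.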
### Training Process Flow
#### Data Collation
- **VoxtralDataCollator**: Custom collator for Voxtral model
- **Audio Processing**: Convert audio to model inputs
- **Prompt Construction**: Build `[AUDIO]...[AUDIO] <transcribe>` prompts
- **Text Tokenization**: Process transcription targets
- **Masking**: Mask audio prompt tokens during training
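
The masking step can be illustrated with plain token-id lists. This is a sketch of the general technique, not the `VoxtralDataCollator` implementation: prompt positions receive the label `-100`, which is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so only transcription tokens contribute to the loss. The function names and token ids are made up for the example.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, target_ids):
    """Concatenate prompt and target; hide prompt positions from the loss."""
    input_ids = list(prompt_ids) + list(target_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(target_ids)
    return input_ids, labels


def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (width - len(seq)) for seq in sequences]
```

When batching, label sequences would be padded with `IGNORE_INDEX` as well, so that padding positions are also excluded from the loss.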
#### Forward Pass
1. **Audio Input**: Raw audio waveforms
2. **Audio Tower**: Extract audio features
3. **Language Model**: Predict transcription tokens autoregressively
4. **Loss Calculation**: Compare predicted tokens against the target text
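
To make the loss step concrete, here is a dependency-free sketch of a masked cross-entropy: the mean negative log-likelihood over positions whose label is not `-100`. It assumes `logits` and `labels` are already aligned position by position (causal language models usually shift labels by one token first) and it stands in for the framework's loss function rather than reproducing the project's training code.

```python
import math

IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target token ids, with IGNORE_INDEX at masked positions
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == IGNORE_INDEX:
            continue
        # log-sum-exp with the maximum subtracted for numerical stability
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0
```

With uniform scores over a four-token vocabulary, each unmasked position contributes `ln 4`, which makes a convenient sanity check.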
#### Backward Pass & Optimization
- **Gradient Computation**: Backpropagation
- **LoRA Updates**: Update only adapter parameters (LoRA mode)
- **Full Updates**: Update all parameters (full fine-tuning)
- **Optimizer Step**: Apply gradients with learning rate scheduling
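
The practical difference between the two update modes is how many parameters receive gradients. For a single weight matrix of shape `d_out x d_in`, LoRA trains two low-rank factors instead of the matrix itself. The sketch below only counts parameters; the dimensions in the usage note are illustrative and are not Voxtral's actual layer sizes.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def full_trainable_params(d_in, d_out):
    """Parameters in the dense weight matrix that full fine-tuning updates."""
    return d_in * d_out
```

For example, with `d_in = d_out = 4096` and `rank = 8`, the adapter has 65,536 trainable parameters against 16,777,216 for the dense matrix, roughly 0.39% of the full count.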
### Training Outputs
#### Model Files
- **model.safetensors**: Model weights (safetensors format)
- **config.json**: Model configuration
- **tokenizer.json**: Tokenizer configuration
- **generation_config.json**: Generation parameters
#### Training Logs
- **train_results.json**: Final training metrics
- **eval_results.json**: Evaluation results
- **training_config.json**: Training hyperparameters
- **trainer_state.json**: Training state and checkpoints
#### Checkpoints
- **checkpoint-XXX/**: Intermediate model snapshots
- **best-model/**: Best performing model
- **final-model/**: Final trained model
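
When resuming or cleaning up, the `checkpoint-XXX/` directories have to be ordered by step number rather than alphabetically (`checkpoint-1000` sorts before `checkpoint-500` as a string). A small sketch, assuming `XXX` is the integer training step; the helper name is hypothetical:

```python
import re

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(dir_names):
    """Return the checkpoint directory name with the highest step, or None."""
    steps = []
    for name in dir_names:
        match = _CHECKPOINT_RE.match(name)
        if match:
            steps.append((int(match.group(1)), name))
    return max(steps)[1] if steps else None
```

It can be fed the output of `os.listdir(output_dir)`; entries such as `best-model` are ignored because they do not match the pattern.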
### Publishing Pipeline
#### HF Repository Structure
```
username/model-name/
├── model.safetensors.index.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── config.json
├── tokenizer.json
├── training_config.json
├── train_results.json
├── README.md (model card)
└── training_results/
    └── training.log
```
#### Model Card Generation
- **Template Processing**: Fill model_card.md template
- **Variable Injection**: Training config, results, metadata
- **Conditional Sections**: Handle quantized models, etc.
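
The three steps above can be sketched with the standard library's `string.Template`. The template text and field names below are hypothetical; the project's real `model_card.md` may use a different templating syntax and different variables.

```python
from string import Template

# Hypothetical template; the real model_card.md may define other fields
CARD_TEMPLATE = Template(
    "# $model_name\n\n"
    "Base model: $base_model\n\n"
    "Final training loss: $train_loss\n"
    "$quantization_section"
)


def render_model_card(model_name, base_model, train_loss, quantized=False):
    """Fill the template; the quantization section appears only when relevant."""
    section = "\n## Quantization\n\nThis model is quantized.\n" if quantized else ""
    return CARD_TEMPLATE.substitute(
        model_name=model_name,
        base_model=base_model,
        train_loss=f"{train_loss:.4f}",
        quantization_section=section,
    )
```

`substitute` raises `KeyError` when a field is missing, which surfaces incomplete training metadata before the card is uploaded.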
### Demo Deployment
#### Space Repository Structure
```
username/model-name-demo/
├── app.py # Gradio demo application
├── requirements.txt # Python dependencies
├── README.md # Space documentation
└── .env # Environment variables
```
#### Environment Configuration
```bash
# Space environment variables
HF_MODEL_ID=username/model-name
MODEL_NAME="Voxtral Fine-tuned Model"
HF_TOKEN=read_only_token # For model access
BRAND_OWNER_NAME=username
# ... other branding variables
```
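
How `app.py` consumes these variables is not shown here. The sketch below is one plausible way to read them, assuming `HF_MODEL_ID` is required and the others are optional; the function name and the fallback to the repository name are illustrative choices, not the demo's actual behavior.

```python
import os


def load_demo_config(environ=None):
    """Collect demo settings from environment variables, with defaults."""
    env = os.environ if environ is None else environ
    model_id = env.get("HF_MODEL_ID")
    if not model_id:
        raise RuntimeError("HF_MODEL_ID must be set to the fine-tuned model repo")
    return {
        "model_id": model_id,
        # Assumption: fall back to the repo name when MODEL_NAME is unset
        "model_name": env.get("MODEL_NAME", model_id.split("/")[-1]),
        "token": env.get("HF_TOKEN"),  # None is acceptable for public models
    }
```

Passing a plain dictionary as `environ` keeps the function easy to test without touching the real process environment.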
### Data Flow Patterns
#### Streaming vs Batch Processing
- **Training Data**: Batch processing for efficiency
- **External Datasets**: Streaming loading for memory efficiency
- **User Input**: Real-time processing with immediate feedback
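
Streaming and batching combine naturally: records are pulled lazily from any iterable and grouped into fixed-size batches, so only one batch is held in memory at a time. A minimal sketch with a hypothetical `batched` helper (it is generic and makes no assumption about how the project streams external datasets):

```python
from itertools import islice


def batched(records, batch_size):
    """Group any iterable of records into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Because `batched` is itself a generator, it can wrap a streaming source without forcing the whole dataset to load.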
#### Data Validation
- **Input Validation**: Check audio format, sampling rate, text length
- **Quality Assurance**: Filter out empty or invalid entries
- **Consistency Checks**: Ensure audio-text alignment
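
As an illustration of these checks, the sketch below validates one audio/text pair using only the standard library. It assumes PCM WAV input (the `wave` module cannot read FLAC), a 16 kHz target rate matching the resampling step, and an arbitrary `max_chars` limit chosen for the example; none of these thresholds come from the project itself.

```python
import wave

TARGET_SAMPLE_RATE = 16000  # matches the 16 kHz resampling step


def validate_entry(wav_path, text, max_chars=1000):
    """Return a list of problems; an empty list means the entry is usable."""
    problems = []
    if not text.strip():
        problems.append("empty transcription")
    elif len(text) > max_chars:
        problems.append("transcription too long")
    try:
        with wave.open(wav_path, "rb") as wav:
            rate = wav.getframerate()
            if rate != TARGET_SAMPLE_RATE:
                problems.append(f"sample rate {rate} != {TARGET_SAMPLE_RATE}")
            if wav.getnframes() == 0:
                problems.append("audio has no frames")
    except (wave.Error, OSError, EOFError) as exc:
        problems.append(f"unreadable audio: {exc}")
    return problems
```

Returning a list rather than raising lets the caller log every problem for an entry and then filter it out in one pass.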
#### Error Handling
- **Graceful Degradation**: Fallback to local data if external sources fail
- **Retry Logic**: Automatic retry for network failures
- **Logging**: Comprehensive error logging and debugging
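
These three behaviors fit into one small wrapper. The sketch below retries a callable with exponential backoff, logs each failure, and falls back to a second callable once the attempts are exhausted. The function name, the retry count, and the choice of `ConnectionError`/`TimeoutError` as retryable errors are assumptions for illustration.

```python
import logging
import time

logger = logging.getLogger(__name__)


def with_retries(fetch, fallback, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); retry on connection errors with exponential backoff.

    If every attempt fails, return fallback() instead of raising.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as exc:
            logger.warning("attempt %d/%d failed: %s", attempt + 1, attempts, exc)
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return fallback()
```

Here `fetch` might load an external dataset and `fallback` might load the local JSONL file. The `sleep` parameter exists so tests can replace the real delay.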
### Performance Considerations
#### Memory Management
- **Streaming Loading**: Process large datasets without loading everything
- **Audio Caching**: Cache processed audio features
- **Batch Optimization**: Balance batch size with available memory
#### Storage Optimization
- **Compression**: Use efficient audio formats
- **Deduplication**: Avoid duplicate data entries
- **Cleanup**: Remove temporary files after processing
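
Deduplication can be done on content rather than file names, since the same recording may be saved twice under different names. A sketch under the assumption that two entries are duplicates when both the audio bytes and the transcript match; `read_bytes` is an injected callable (for example `lambda p: Path(p).read_bytes()`) and is not part of the project's API.

```python
import hashlib


def deduplicate(entries, read_bytes):
    """Keep the first entry for each distinct (audio content, text) pair."""
    seen, unique = set(), []
    for entry in entries:
        audio_digest = hashlib.sha256(read_bytes(entry["audio_path"])).hexdigest()
        key = (audio_digest, entry["text"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```

Hashing keeps memory use small even for long recordings, because only the digests are retained between entries.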
#### Network Efficiency
- **Incremental Uploads**: Upload files as they're ready
- **Resume Capability**: Resume interrupted uploads
- **Caching**: Cache frequently accessed data
### Security & Privacy
#### Data Privacy
- **Local Processing**: Audio files processed locally when possible
- **User Consent**: Clear data usage policies
- **Anonymization**: Remove personally identifiable information
#### Access Control
- **Token Management**: Secure HF token storage
- **Repository Permissions**: Appropriate public/private settings
- **Rate Limiting**: Prevent abuse of demo interfaces
### Monitoring & Analytics
#### Data Quality Metrics
- **Audio Quality**: Sampling rate, format validation
- **Text Quality**: Length, language detection, consistency
- **Dataset Statistics**: Size, distribution, coverage
#### Performance Metrics
- **Processing Time**: Data loading, preprocessing, training time
- **Model Metrics**: Loss, perplexity, WER (if available)
- **Resource Usage**: Memory, CPU/GPU utilization
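
Of the model metrics listed, WER (word error rate) is the one most specific to ASR: the word-level edit distance between the reference and the model's transcription, divided by the number of reference words. The sketch below is a plain implementation for illustration. It splits on whitespace and applies no text normalization (casing, punctuation), which a real evaluation would normally add.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription is empty")
    # previous[j] = edit distance between the ref prefix so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)
```

For the reference `the cat sat on the mat` and the hypothesis `the cat sit on mat`, there is one substitution and one deletion over six reference words, giving a WER of 2/6. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.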
#### User Analytics
- **Usage Patterns**: Popular languages, dataset sizes
- **Success Rates**: Training completion, deployment success
- **Error Patterns**: Common failure modes and solutions

See also:
- [Architecture Overview](architecture.md)
- [Interface Workflow](interface-workflow.md)
- [Training Pipeline](training-pipeline.md)