| # Interface Workflow | |
| ```mermaid | |
| stateDiagram-v2 | |
| [*] --> LangSetup: User opens interface | |
| state "Language & Dataset Setup" as LangSetup { | |
| [*] --> LanguageSelection | |
| LanguageSelection --> LoadPhrases: Select language | |
| LoadPhrases --> DisplayPhrases: Load from NVIDIA Granary | |
| DisplayPhrases --> RecordingInterface: Show phrases & recording UI | |
| state RecordingInterface { | |
| [*] --> ShowInitialRows: Display first 10 phrases | |
| ShowInitialRows --> RecordAudio: User can record audio | |
| RecordAudio --> AddMoreRows: Optional - add 10 more rows | |
| AddMoreRows --> RecordAudio | |
| } | |
| } | |
| RecordingInterface --> DatasetCreation: User finishes recording | |
| state "Dataset Creation Options" as DatasetCreation { | |
| [*] --> FromRecordings: Create from recorded audio | |
| [*] --> FromUploads: Upload existing files | |
| FromRecordings --> ProcessRecordings: Save WAV files + transcripts | |
| FromUploads --> ProcessUploads: Process uploaded files + transcripts | |
| ProcessRecordings --> CreateJSONL: Generate JSONL dataset | |
| ProcessUploads --> CreateJSONL | |
| CreateJSONL --> DatasetReady: Dataset saved locally | |
| } | |
| DatasetCreation --> TrainingConfiguration: Dataset ready | |
| state "Training Setup" as TrainingConfiguration { | |
| [*] --> BasicSettings: Model, LoRA/full, batch size | |
| [*] --> AdvancedSettings: Learning rate, epochs, LoRA params | |
| BasicSettings --> ConfigureDeployment: Repo name, push options | |
| AdvancedSettings --> ConfigureDeployment | |
| ConfigureDeployment --> StartTraining: All settings configured | |
| } | |
| TrainingConfiguration --> TrainingProcess: Start training | |
| state "Training Process" as TrainingProcess { | |
| [*] --> InitializeTrackio: Setup experiment tracking | |
| InitializeTrackio --> RunTrainingScript: Execute train.py or train_lora.py | |
| RunTrainingScript --> StreamLogs: Show real-time training logs | |
| StreamLogs --> MonitorProgress: Track metrics & checkpoints | |
| MonitorProgress --> TrainingComplete: Training finished | |
| MonitorProgress --> HandleErrors: Training failed | |
| HandleErrors --> RetryOrExit: User can retry or exit | |
| } | |
| TrainingProcess --> PostTraining: Training complete | |
| state "Post-Training Actions" as PostTraining { | |
| [*] --> PushToHub: Push model to HF Hub | |
| [*] --> GenerateModelCard: Create model card | |
| [*] --> DeployDemoSpace: Deploy interactive demo | |
| PushToHub --> ModelPublished: Model available on HF Hub | |
| GenerateModelCard --> ModelDocumented: Model card created | |
| DeployDemoSpace --> DemoReady: Demo space deployed | |
| } | |
| PostTraining --> [*]: Process complete | |
| %% Alternative paths | |
| DatasetCreation --> PushDatasetOnly: Skip training, push dataset only | |
| PushDatasetOnly --> DatasetPublished: Dataset on HF Hub | |
| %% Error handling | |
| TrainingProcess --> ErrorRecovery: Handle training errors | |
| ErrorRecovery --> RetryTraining: Retry with different settings | |
| RetryTraining --> TrainingConfiguration | |
| %% Styling and notes | |
| note right of LanguageSelection : User selects language for<br/>authentic phrases from<br/>NVIDIA Granary dataset | |
| note right of RecordingInterface : Users record themselves<br/>reading displayed phrases | |
| note right of DatasetCreation : JSONL format: {"audio_path": "...", "text": "..."} | |
| note right of TrainingConfiguration : Configure LoRA parameters,<br/>learning rate, epochs, etc. | |
| note right of TrainingProcess : Real-time log streaming<br/>with Trackio integration | |
| note right of PostTraining : Automated deployment<br/>pipeline | |
| ``` | |
| ## Interface Workflow Overview | |
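| This diagram traces the full user journey through the Voxtral ASR fine-tuning interface: language selection, dataset creation, training configuration, training, and post-training publishing. Interface components are revealed stage by stage, so users are guided through the fine-tuning process one step at a time. | |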
| ### Key Workflow Stages | |
| #### 1. Language & Dataset Setup | |
| - **Language Selection**: Users choose from 25+ European languages supported by NVIDIA Granary | |
| - **Phrase Loading**: System loads authentic, high-quality phrases in the selected language | |
| - **Recording Interface**: Dynamic interface showing phrases with audio recording components | |
| - **Progressive Disclosure**: The interface starts with 10 phrases, and users can add rows in batches of 10 as needed (up to 100 recordings) | |
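
The row-reveal behavior described above can be sketched as a small counter that grows in steps of 10 and is capped at 100. This is a minimal illustration, not the logic in `interface.py`: the constant and function names below are hypothetical, and only the numbers (10 initial rows, 10 per step, 100 maximum) come from this document.

```python
from __future__ import annotations

# Values taken from the workflow description; names are hypothetical.
INITIAL_ROWS = 10   # rows shown when the recording UI first appears
ROWS_PER_STEP = 10  # rows revealed by each "add more rows" action
MAX_ROWS = 100      # upper bound on recordings per session


def next_visible_rows(current: int) -> int:
    """Return the visible row count after one 'add more rows' action."""
    return min(current + ROWS_PER_STEP, MAX_ROWS)


def visibility_flags(visible: int, total: int = MAX_ROWS) -> list[bool]:
    """One flag per pre-built row: True if that row should be shown."""
    return [index < visible for index in range(total)]


if __name__ == "__main__":
    rows = INITIAL_ROWS
    rows = next_visible_rows(rows)
    print(f"{rows} rows visible, {sum(visibility_flags(rows))} flags set")
```

Building all rows up front and toggling visibility is one common way to get this effect in UI toolkits that cannot add components after launch; whether the interface does it this way is an assumption.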
| #### 2. Dataset Creation | |
| - **From Recordings**: Processes microphone recordings into WAV files and a JSONL dataset | |
| - **From Uploads**: Handles existing WAV/FLAC files with manually supplied transcripts | |
| - **JSONL Format**: One JSON object per line, with `audio_path` and `text` fields | |
| - **Local Storage**: Datasets are stored in the `datasets/voxtral_user/` directory | |
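
As a rough illustration of the JSONL format above, the following sketch writes one JSON object per line with the `audio_path` and `text` fields and reads it back. The helper names and the `data.jsonl` file name are hypothetical; only the two field names and the `datasets/voxtral_user/` directory come from this document.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, out_path):
    """Write (audio_path, text) pairs as one JSON object per line."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as handle:
        for audio_path, text in rows:
            record = {"audio_path": str(audio_path), "text": text}
            # ensure_ascii=False keeps accented characters readable on disk
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def read_jsonl(path):
    """Load every non-empty line of a JSONL file into a list of dicts."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # A temporary directory stands in for datasets/voxtral_user/ here.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "voxtral_user" / "data.jsonl"  # hypothetical file name
        write_jsonl([("recording_0000.wav", "Bonjour tout le monde")], target)
        print(read_jsonl(target))
```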
| #### 3. Training Configuration | |
| - **Basic Settings**: Model selection, LoRA vs full fine-tuning, batch size | |
| - **Advanced Settings**: Learning rate, epochs, gradient accumulation | |
| - **LoRA Parameters**: r, alpha, dropout, audio tower freezing options | |
| - **Repository Setup**: Model naming and Hugging Face Hub integration | |
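
One way to picture these settings is as a single configuration object that is validated before training starts. The sketch below is an assumption for illustration only: the class, field names, and default values are hypothetical and are not taken from `interface.py` or the training scripts.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    """Hypothetical container for the settings listed above."""

    # Basic settings (placeholder defaults, not the project's defaults)
    model_name: str = "your-org/your-base-model"
    use_lora: bool = True
    batch_size: int = 2
    # Advanced settings
    learning_rate: float = 5e-5
    epochs: float = 3.0
    gradient_accumulation_steps: int = 4
    # LoRA parameters, only checked when use_lora is True
    lora_r: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.0
    freeze_audio_tower: bool = True
    # Repository setup
    repo_name: str = ""
    push_to_hub: bool = False

    def validate(self) -> list:
        """Return a list of human-readable problems; empty means usable."""
        problems = []
        if not self.model_name:
            problems.append("model_name is required")
        if self.batch_size < 1:
            problems.append("batch_size must be at least 1")
        if self.gradient_accumulation_steps < 1:
            problems.append("gradient_accumulation_steps must be at least 1")
        if self.learning_rate <= 0:
            problems.append("learning_rate must be positive")
        if self.epochs <= 0:
            problems.append("epochs must be positive")
        if self.use_lora:
            if self.lora_r < 1:
                problems.append("lora_r must be at least 1")
            if self.lora_alpha <= 0:
                problems.append("lora_alpha must be positive")
            if not 0.0 <= self.lora_dropout < 1.0:
                problems.append("lora_dropout must be in [0, 1)")
        if self.push_to_hub and not self.repo_name:
            problems.append("repo_name is required when push_to_hub is set")
        return problems


if __name__ == "__main__":
    print(TrainingConfig(batch_size=0).validate())
```

Returning a list of problems rather than raising on the first one lets a UI show every issue at once.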
| #### 4. Training Process | |
| - **Trackio Integration**: Automatic experiment tracking setup | |
| - **Script Execution**: Calls the training script that matches the chosen mode (`train.py` or `train_lora.py`) | |
| - **Log Streaming**: Real-time display of training progress and metrics | |
| - **Error Handling**: Graceful handling of training failures with retry options | |
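
Script execution with live log streaming can be sketched with the standard-library `subprocess` module. This is a minimal sketch under stated assumptions: the function names are hypothetical, the mapping of `train_lora.py` to LoRA runs and `train.py` to full fine-tuning is inferred from the file names, and no command-line arguments of the real scripts are shown because this document does not list them.

```python
import subprocess
import sys


def select_script(use_lora: bool) -> str:
    """Pick a training script; the mapping is inferred from the file names."""
    return "train_lora.py" if use_lora else "train.py"


def stream_command(cmd):
    """Run cmd and yield its combined stdout/stderr line by line."""
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors appear in the log
        text=True,
        bufsize=1,                 # line-buffered reads in text mode
    ) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
    # Leaving the with-block waits for the process, so returncode is set.
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


if __name__ == "__main__":
    # A stand-in command; a real run would use [sys.executable, select_script(...), ...].
    demo = [sys.executable, "-c", "print('step 1'); print('step 2')"]
    for log_line in stream_command(demo):
        print(log_line)
```

Because `stream_command` is a generator, a caller can append each yielded line to a log view as it arrives and treat the `CalledProcessError` as the signal to enter the error-handling path.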
| #### 5. Post-Training Actions | |
| - **Model Publishing**: Automatic push to Hugging Face Hub | |
| - **Model Card Generation**: Automated creation using `generate_model_card.py` | |
| - **Demo Deployment**: One-click deployment of interactive demo spaces | |
| ### Alternative Paths | |
| #### Dataset-Only Workflow | |
| - Users can create and publish datasets without training models | |
| - Useful for dataset curation and sharing | |
| #### Error Recovery | |
| - Training failures trigger error recovery flows | |
| - Users can retry with modified parameters | |
| - Comprehensive error logging and debugging information | |
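
The retry-with-modified-parameters flow can be sketched as a loop that records each failure and applies an adjustment before the next attempt. Everything here is hypothetical: in the interface the user chooses the new settings, so `adjust_fn` only stands in for that step, and `train_fn` stands in for launching a training run.

```python
def run_with_retries(train_fn, settings, adjust_fn, max_attempts=3):
    """Call train_fn(settings); on failure, log the error, adjust, and retry.

    Returns (result, errors), where errors lists the failed attempts.
    Raises RuntimeError if every attempt fails.
    """
    errors = []
    for attempt in range(1, max_attempts + 1):
        try:
            return train_fn(settings), errors
        except Exception as exc:  # broad on purpose: any failure is logged
            errors.append(f"attempt {attempt}: {exc}")
            settings = adjust_fn(settings)
    raise RuntimeError("training failed after retries:\n" + "\n".join(errors))


if __name__ == "__main__":
    def fake_train(settings):
        # Simulated failure for large batch sizes
        if settings["batch_size"] > 2:
            raise MemoryError("out of memory")
        return "ok"

    def halve_batch(settings):
        return {**settings, "batch_size": max(1, settings["batch_size"] // 2)}

    print(run_with_retries(fake_train, {"batch_size": 8}, halve_batch))
```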
| ### Technical Integration Points | |
| #### External Services | |
| - **NVIDIA Granary**: Source of high-quality multilingual ASR data | |
| - **Hugging Face Hub**: Model and dataset storage and sharing | |
| - **Trackio Spaces**: Experiment tracking and visualization | |
| #### Script Integration | |
| - **interface.py**: Main Gradio application orchestrating the workflow | |
| - **train.py/train_lora.py**: Core training scripts with Trackio integration | |
| - **push_to_huggingface.py**: Model/dataset publishing | |
| - **deploy_demo_space.py**: Automated demo deployment | |
| - **generate_model_card.py**: Model documentation generation | |
| ### User Experience Features | |
| #### Progressive Interface Reveal | |
| - Interface components are revealed as users progress through the workflow | |
| - Reduces cognitive load and guides users step-by-step | |
| #### Real-time Feedback | |
| - Live log streaming during training | |
| - Progress indicators and status updates | |
| - Immediate feedback on dataset creation and validation | |
| #### Flexible Input Methods | |
| - Support for both live recording and file uploads | |
| - Multiple language options for diverse user needs | |
| - Scalable recording interface (10 to 100 samples) | |
| See also: | |
| - [Architecture Overview](architecture.md) | |
| - [Training Pipeline](training-pipeline.md) | |
| - [Data Flow](data-flow.md) | |