# Malayalam Whisper to FUTO Keyboard: Full Pipeline
This repository contains an end-to-end Jupyter Notebook pipeline designed to prepare Hugging Face Whisper models for use with the FUTO Keyboard on Android.
The pipeline specifically focuses on adapting Malayalam Whisper models (such as `whisper-small-malayalam`) by applying Audio Context Fine-Tuning (ACFT), converting the standard Hugging Face weights into the GGML format, and quantizing the model for efficient mobile inference.
## Features
- Audio Context Fine-Tuning (ACFT): Trains the target model to handle dynamic audio contexts (under 30 seconds) without endless looping or repetition, using a frozen reference model.
- GGML Conversion: Automatically clones the required OpenAI and `whisper.cpp` repositories to convert the fine-tuned `.safetensors` model into a standard `.bin` file.
- Mobile Quantization: Compiles the `whisper-quantize` tool and generates optimized, quantized versions (e.g., `q5_0`) of the model tailored for smartphone hardware constraints.
- Fast Dependency Management: Utilizes `uv` for rapid package installation and environment setup within the Colab runtime.
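The ACFT objective above can be sketched as a simple distillation loop: the target model is trained to match the outputs of a frozen reference model under an MSE loss. In the sketch below, toy linear layers stand in for the actual Whisper encoders (an illustration-only assumption; the notebook operates on the real Hugging Face models):

```python
# Minimal sketch of an ACFT-style objective: train a "target" model to match a
# frozen "reference" model's outputs under MSE loss. Toy linear layers stand in
# for the Whisper models; shapes and hyperparameters are illustrative only.
import torch

torch.manual_seed(0)
reference = torch.nn.Linear(80, 16)   # frozen reference model
target = torch.nn.Linear(80, 16)      # model being fine-tuned
for p in reference.parameters():
    p.requires_grad = False           # the reference stays fixed

optimizer = torch.optim.AdamW(target.parameters(), lr=1e-2)
losses = []
for step in range(200):               # the real notebook runs 1500 steps
    batch = torch.randn(8, 80)        # stand-in for mel-spectrogram features
    loss = torch.nn.functional.mse_loss(target(batch), reference(batch))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f} -> last loss {losses[-1]:.4f}")
```

The MSE between target and reference outputs drives the target to reproduce the reference's behavior on shorter-than-30-second audio contexts.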
## Requirements
This notebook is designed to be executed in Google Colab to leverage cloud GPU acceleration and avoid local hardware memory constraints.
- Environment: Google Colab
- Hardware: T4 GPU (Required for the ACFT training loop to complete in a reasonable timeframe)
- Storage: Access to Google Drive (if loading local datasets such as Mozilla Common Voice `.tar.gz` archives, or saving the final output directly to Drive).
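If you load a local Common Voice archive from Drive, unpacking it can be done with a small helper like the following (the archive path shown in the comment is a placeholder, not the notebook's actual path):

```python
# Helper sketch for unpacking a Common Voice .tar.gz archive (e.g. one copied
# from a mounted Google Drive) into a working directory.
import tarfile
from pathlib import Path

def extract_dataset(archive: str, dest: str) -> list[str]:
    """Extract a .tar.gz archive into dest and return the member names."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
        return tar.getnames()

# Example (placeholder paths for a mounted Drive):
# extract_dataset("/content/drive/MyDrive/cv-corpus-ml.tar.gz", "/content/data")
```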
## Usage Instructions
### 1. Environment Setup
- Upload the `malayalam_whisper_full_pipeline.ipynb` notebook to your Google Colab environment.
- Navigate to Runtime > Change runtime type and select T4 GPU.
- Run the first execution cell to install the required dependencies (`torch`, `transformers`, `datasets`, `librosa`, etc.) via `uv`.
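The install step might look roughly like this (the exact package list and flags are assumptions; the notebook's first cell is authoritative):

```shell
# Install uv, then use it for fast package installation in the Colab runtime.
# Package set is an assumption; check the notebook's first cell for the
# definitive list and any version pins.
pip install uv
uv pip install --system torch transformers datasets librosa
```

`--system` installs into Colab's global interpreter, since the runtime does not use a virtual environment by default.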
### 2. Execution
Execute the notebook cells sequentially. The pipeline handles:
- Downloading the target Whisper model and the Common Voice dataset.
- Running the 1500-step ACFT training loop to minimize MSE loss between the target and reference models.
- Merging the updated weights and saving the PyTorch structure.
- Running `convert-h5-to-ggml.py` to generate the base `ggml-model.bin` file.
- Executing the `whisper.cpp` Makefile and generating quantized `.bin` files.
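The conversion and quantization stages correspond roughly to the commands below. Repository layout, the `./finetuned-model` directory, and output paths are placeholder assumptions; the notebook wires these up automatically:

```shell
# Clone the repos the conversion script needs (it reads tokenizer/mel assets
# from the OpenAI whisper checkout).
git clone https://github.com/openai/whisper
git clone https://github.com/ggerganov/whisper.cpp

# Convert the fine-tuned Hugging Face model into a single GGML .bin file.
# Arguments: <HF model dir> <path to openai/whisper clone> <output dir>
python whisper.cpp/models/convert-h5-to-ggml.py ./finetuned-model ./whisper ./output

# Build the quantization tool and produce the mobile-friendly q5_0 model.
make -C whisper.cpp quantize
./whisper.cpp/quantize ./output/ggml-model.bin ./output/malayalam-futo-q5_0.bin q5_0
```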
Example ACFT training log:

```text
Training ACFT (1500 steps)...
Step 0 | Loss: 1.133623
Step 50 | Loss: 0.156094
Step 100 | Loss: 0.239148
Step 150 | Loss: 0.100868
Step 200 | Loss: 0.100326
Step 250 | Loss: 0.082330
Step 300 | Loss: 0.065249
Step 350 | Loss: 0.133438
Step 400 | Loss: 0.105161
Step 450 | Loss: 0.083460
Step 500 | Loss: 0.185798
Step 550 | Loss: 0.116877
Step 600 | Loss: 0.069572
Step 650 | Loss: 0.139821
Step 700 | Loss: 0.291859
Step 750 | Loss: 0.053645
Step 800 | Loss: 0.068235
Step 850 | Loss: 0.041750
Step 900 | Loss: 0.049185
Step 950 | Loss: 0.106350
Step 1000 | Loss: 0.154282
Step 1050 | Loss: 0.124018
Step 1100 | Loss: 0.120467
Step 1150 | Loss: 0.046497
Step 1200 | Loss: 0.032196
```
### 3. Deployment
Once the notebook finishes executing all quantization steps, the final models will be available in the designated `/content/output/` directory (or your mounted Google Drive).
- Download the recommended `malayalam-futo-q5_0.bin` file.
- Transfer the `.bin` file to your Android device's internal storage.
- Open the FUTO Keyboard settings, navigate to the Voice Input section, and import the downloaded model.