Walkthrough: Phoneme Pronunciation Correction System (Design A)
I have successfully implemented the new Design A (Embedding-based) architecture for your phoneme pronunciation correction system. This architecture is optimized for your H100 GPU and 50GB storage constraints.
Changes Implemented
1. Core Model: phoneme_embedder.py
A custom Wav2Vec2PhonemeEmbedder class that replaces the standard linear classification head with a Cosine Similarity Embedding Head. This allows for a more robust acoustic-phoneme mapping.
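To make the idea of a cosine similarity head concrete, here is a minimal, dependency-free sketch. It is not the code in phoneme_embedder.py: the function name cosine_logits, the phoneme_table argument, and the scale parameter are assumptions made for illustration, and the real model would do this with batched tensor operations.

```python
import math


def cosine_logits(frame, phoneme_table, scale=1.0):
    """Score one acoustic frame against every phoneme embedding.

    frame: list of floats (one encoder output vector).
    phoneme_table: dict mapping phoneme label -> embedding (hypothetical layout).
    scale: multiplier applied to the cosine values (an assumed hyperparameter).
    """

    def unit(vector):
        # L2-normalise so the dot product below equals the cosine of the angle.
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        return [x / norm for x in vector]

    frame_unit = unit(frame)
    return {
        label: scale * sum(a * b for a, b in zip(frame_unit, unit(embedding)))
        for label, embedding in phoneme_table.items()
    }


def best_phoneme(logits):
    # The predicted phoneme is the one whose embedding points the same way.
    return max(logits, key=logits.get)
```

Because both vectors are normalised, each logit stays within [-scale, scale] regardless of how large the raw encoder outputs are, which is the main practical difference from an unnormalised linear head.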
4. Custom NPTEL Loader (nptel_loader.py)
To satisfy your requirement of using the official download scripts while staying under 50GB:
- It parses your local download_scripts/ to find the official Zenodo URLs.
- It streams the concatenated parts directly into memory using a custom ConcatenatedStream.
- It pairs .wav and .txt files on the fly and deletes them after yielding, keeping your disk usage at essentially zero.
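The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of nptel_loader.py: only the name ConcatenatedStream comes from the walkthrough, while find_zenodo_urls, iter_wav_txt_pairs, the assumption that the parts form a tar archive, and the assumption that matching files share a base name are all hypothetical.

```python
import io
import re
import tarfile
from pathlib import PurePosixPath

# Assumed URL shape; the real download scripts may format links differently.
ZENODO_URL = re.compile(r"https?://zenodo\.org/[^\s\"']+")


def find_zenodo_urls(script_text):
    """Return every Zenodo URL found in the text of a download script."""
    return ZENODO_URL.findall(script_text)


class ConcatenatedStream(io.RawIOBase):
    """Present several file-like parts as one continuous read-only stream."""

    def __init__(self, parts):
        self._parts = iter(parts)
        self._current = next(self._parts, None)

    def readable(self):
        return True

    def readinto(self, buffer):
        while self._current is not None:
            chunk = self._current.read(len(buffer))
            if chunk:
                buffer[:len(chunk)] = chunk
                return len(chunk)
            # Current part is exhausted; move on to the next one.
            self._current = next(self._parts, None)
        return 0  # All parts consumed: signal end of stream.


def iter_wav_txt_pairs(stream):
    """Yield (name, wav_bytes, transcript) without writing anything to disk."""
    pending = {}
    # "r|*" reads the archive strictly front to back, so no seeking is needed.
    with tarfile.open(fileobj=stream, mode="r|*") as archive:
        for member in archive:
            path = PurePosixPath(member.name)
            if not member.isfile() or path.suffix not in (".wav", ".txt"):
                continue
            entry = pending.setdefault(path.stem, {})
            entry[path.suffix] = archive.extractfile(member).read()
            if len(entry) == 2:
                # Both halves seen: hand them over and drop them from memory.
                del pending[path.stem]
                yield path.stem, entry[".wav"], entry[".txt"].decode("utf-8")
```

In use, each part would be an open HTTP response for one URL, wrapped as io.BufferedReader(ConcatenatedStream(parts)) before being passed to iter_wav_txt_pairs. Note that the pending dictionary only stays small if each .wav sits near its .txt in the archive; if the archive stores all audio first and all transcripts afterwards, a different pairing strategy would be needed.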
How to Start Training
H100 Environment Setup: If you are using a remote H100 (e.g., Lambda, AWS), follow the GPU Setup Guide first to ensure CUDA and libsndfile are ready.
Hugging Face Login:
hf auth login
Verify Download Scripts: Ensure your download_scripts/ directory contains download_train_data.sh. I have already created these for you.
Local Training Test (Dry Run): Before moving to the H100, you can verify the pipeline works on your laptop (even without a GPU) by running:
python train_streaming.py --hub_model_id test/dry-run --dry_run
This will:
- Automatically detect your CPU.
- Run exactly 5 steps.
- Disable Hub uploading and heavy logging.
- Use a batch size of 1 to save RAM.
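For readers curious how a dry-run switch like this can be wired up, here is a minimal sketch. It is not taken from train_streaming.py: RunConfig, build_config, the gpu_available argument, and every full-run default are hypothetical placeholders. Only the two flag names and the dry-run values (5 steps, batch size 1, no Hub upload, reduced logging) come from the description above.

```python
import argparse
from dataclasses import dataclass


@dataclass
class RunConfig:
    device: str
    max_steps: int
    batch_size: int
    push_to_hub: bool
    report_to: str


def build_config(argv, gpu_available):
    """Parse the CLI flags and apply dry-run overrides on top of defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--hub_model_id", required=True)
    parser.add_argument("--dry_run", action="store_true")
    args = parser.parse_args(argv)

    # Full-run defaults are placeholders, not the script's real values.
    config = RunConfig(
        device="cuda" if gpu_available else "cpu",  # fall back to CPU
        max_steps=50_000,
        batch_size=16,
        push_to_hub=True,
        report_to="tensorboard",
    )
    if args.dry_run:
        config.max_steps = 5         # run exactly 5 steps
        config.batch_size = 1        # keep RAM usage low
        config.push_to_hub = False   # no Hub uploading
        config.report_to = "none"    # no heavy logging
    return config
```

In a real training script, gpu_available would typically come from a check such as torch.cuda.is_available().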
How to use on your Local Device (Laptop)
Once you have a trained checkpoint on the H100, you need to prepare it for your local Windows laptop (where there is no H100).
- Prepare the Local Version:
Run this script on the H100 machine after training. It will create a folder with the full-precision weights mapped for CPU use.
python export_for_local.py --checkpoint nptel_embedder_checkpoints/checkpoint-50000 --output my_local_model
- Download the Folder:
Download the my_local_model folder to your laptop.
- Run Inference Locally:
The test_model.py script on your laptop will automatically detect the lack of a GPU and run the full-precision model on your CPU.
python test_model.py --model_dir my_local_model --duration 4.0 --word because
- Resume Training: On your next 24-hour session, simply rerun the same training command. It will detect the local checkpoint or pull the latest one from the Hub.
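The "detect the local checkpoint" part of resuming can be pictured with the sketch below. It assumes checkpoints are saved as checkpoint-<step> folders, as in the nptel_embedder_checkpoints/checkpoint-50000 path above; the helper name latest_local_checkpoint is hypothetical, and the Hub fallback is left to the caller rather than implemented here.

```python
import re
from pathlib import Path

CHECKPOINT_DIR = re.compile(r"^checkpoint-(\d+)$")


def latest_local_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = CHECKPOINT_DIR.match(child.name)
        # Compare steps as integers so checkpoint-10000 beats checkpoint-9000.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

A None result is the signal to fall back to the Hub. If training is built on the Hugging Face Trainer, transformers.trainer_utils.get_last_checkpoint provides similar behaviour.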
How to Test Phoneme Correlation
Once you have a trained model in your output_dir:
python test_model.py --model_dir path_to_your_trained_model --word because
The script will now use the Cosine Similarity logits to identify phonemes and provide granular feedback via the PronunciationScorer.
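To give a feel for what phoneme-level feedback can look like, here is a small sketch that compares an expected phoneme sequence with a predicted one. It is not the PronunciationScorer implementation: the function name diff_phonemes, the use of difflib for alignment, and the example phoneme sequences for "because" are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def diff_phonemes(expected, heard):
    """List the places where the heard phonemes differ from the expected ones.

    Returns tuples of (kind, expected_slice, heard_slice), where kind is
    "replace" (wrong phoneme), "delete" (missing phoneme), or
    "insert" (extra phoneme).
    """
    matcher = SequenceMatcher(a=expected, b=heard, autojunk=False)
    issues = []
    for kind, e_start, e_end, h_start, h_end in matcher.get_opcodes():
        if kind != "equal":
            issues.append((kind, expected[e_start:e_end], heard[h_start:h_end]))
    return issues


# Illustrative sequences only; the real phoneme inventory may differ.
expected = ["B", "IH", "K", "AO", "Z"]
heard = ["B", "IH", "K", "AH", "Z"]
for kind, want, got in diff_phonemes(expected, heard):
    print(kind, "expected", want, "heard", got)
```

A scorer built on cosine similarity logits could go further than this by reporting how close the heard phoneme was to the expected one, rather than only whether they match.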
Next Steps
- Monitor Hub Sync: Confirm that your first few checkpoints (saved every 1000 steps) upload successfully to your Hugging Face Hub repository.
- Evaluate on OOVs: Test how the embedding space handles Out-Of-Vocabulary words compared to the old model.