# Walkthrough: Phoneme Pronunciation Correction System (Design A) I have successfully implemented the new **Design A (Embedding-based)** architecture for your phoneme pronunciation correction system. This architecture is optimized for your H100 GPU and 50GB storage constraints. ## Changes Implemented ### 1. Core Model: `phoneme_embedder.py` A custom `Wav2Vec2PhonemeEmbedder` class that replaces the standard linear classification head with a **Cosine Similarity Embedding Head**. This allows for a more robust acoustic-phoneme mapping. ### 4. Custom NPTEL Loader (`nptel_loader.py`) To satisfy your requirement of using the official download scripts while staying under 50GB: - It parses your local `download_scripts/` to find the official Zenodo URLs. - It streams the concatenated parts directly into memory using a custom `ConcatenatedStream`. - It pairs `.wav` and `.txt` files on the fly and deletes them after yielding, keeping your disk usage at essentially zero. --- ## How to Start Training 1. **H100 Environment Setup:** If you are using a remote H100 (e.g., Lambda, AWS), follow the [GPU Setup Guide](setup_gpu.md) first to ensure CUDA and `libsndfile` are ready. 2. **Hugging Face Login:** ```bash hf auth login ``` 2. **Verify Download Scripts:** Ensure your `download_scripts/` directory contains `download_train_data.sh`. I have already created these for you. 3. **Local Training Test (Dry Run):** Before moving to the H100, you can verify the pipeline works on your laptop (even without a GPU) by running: ```bash python train_streaming.py --hub_model_id test/dry-run --dry_run ``` This will: - Automatically detect your CPU. - Run exactly 5 steps. - Disable Hub uploading and heavy logging. - Use a batch size of 1 to save RAM. --- ## How to use on your Local Device (Laptop) Once you have a trained checkpoint on the H100, you need to prepare it for your local Windows laptop (where there is no H100). 1. **Prepare the Local Version:** Run this script on the H100 machine after training. It will create a folder with the full-precision weights mapped for CPU use. ```bash python export_for_local.py --checkpoint nptel_embedder_checkpoints/checkpoint-50000 --output my_local_model ``` 2. **Download the Folder:** Download the `my_local_model` folder to your laptop. 3. **Run Inference Locally:** The `test_model.py` script on your laptop will automatically detect the lack of a GPU and run the full-precision model on your CPU. ```bash python test_model.py --model_dir my_local_model --duration 4.0 --word because ``` 3. **Resume Training:** On your next 24-hour session, simply run the same command. It will detect the local checkpoint or pull the latest one from the Hub. ## How to Test Phoneme Correlation Once you have a trained model in your `output_dir`: ```bash python test_model.py --model_dir path_to_your_trained_model --word because ``` The script will now use the **Cosine Similarity logits** to identify phonemes and provide granular feedback via the `PronunciationScorer`. --- ## Next Steps - [ ] **Monitor Hub Sync:** Ensure your first few checkpoints (every 1000 steps) are successfully uploading to your HF Hub. - [ ] **Evaluate on OOVs:** Test how the embedding space handles Out-Of-Vocabulary words compared to the old model. ## Reference Documentation - [Architecture Overview](architecture_overview.md) - [GPU Setup Guide](setup_gpu.md) - [G2P Training & Maintenance Guide](../g2p/training_guide.md)