
Walkthrough: Phoneme Pronunciation Correction System (Design A)

I have implemented the new Design A (embedding-based) architecture for your phoneme pronunciation correction system. It is designed to train on your H100 GPU and to stay within your 50 GB storage limit.

Changes Implemented

1. Core Model: phoneme_embedder.py

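A custom Wav2Vec2PhonemeEmbedder class that replaces the standard linear classification head with a cosine-similarity embedding head. Instead of reading phoneme scores off a linear layer, the model scores each acoustic frame by its cosine similarity to the phoneme embeddings, which is intended to give a more robust acoustic-to-phoneme mapping.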
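As a rough illustration of what a cosine-similarity head computes, the sketch below scores one frame embedding against a small table of phoneme embeddings. It is plain Python for readability and is not the code in phoneme_embedder.py: the function names, the `scale` parameter, and the toy 2-D embeddings are all assumptions made for this example. A tensor library would perform the same arithmetic on whole batches.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(frame_embedding, phoneme_embeddings, scale=1.0):
    """Return {phoneme: scale * cos(frame, phoneme_embedding)}.

    `scale` is a hypothetical temperature-style multiplier; it is not a
    parameter taken from the project.
    """
    frame = l2_normalize(frame_embedding)
    logits = {}
    for phoneme, embedding in phoneme_embeddings.items():
        target = l2_normalize(embedding)
        logits[phoneme] = scale * sum(a * b for a, b in zip(frame, target))
    return logits


def best_phoneme(frame_embedding, phoneme_embeddings):
    """Pick the phoneme whose embedding points most nearly the same way."""
    logits = cosine_logits(frame_embedding, phoneme_embeddings)
    return max(logits, key=logits.get)


# Toy 2-D example; real embeddings would have hundreds of dimensions.
toy_table = {"AA": [1.0, 0.0], "IY": [0.0, 2.0]}
print(cosine_logits([3.0, 4.0], toy_table))
print(best_phoneme([3.0, 4.0], toy_table))
```

Because both sides are normalized, only the direction of the frame embedding matters, not its magnitude.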

4. Custom NPTEL Loader: nptel_loader.py

To satisfy your requirement of using the official download scripts while staying under 50GB:

  • It parses your local download_scripts/ to find the official Zenodo URLs.
  • It streams the concatenated parts directly into memory using a custom ConcatenatedStream.
  • It pairs .wav and .txt files on the fly and deletes them after yielding, keeping your disk usage at essentially zero.
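The streaming idea in the list above can be sketched with the standard library alone. This is not the code in nptel_loader.py: it assumes the concatenated parts form an uncompressed tar archive, and it holds each file in memory instead of writing and deleting it on disk. The `ConcatenatedStream` below only shares its name with the project's class, and `iter_pairs` is a hypothetical helper.

```python
import io
import tarfile
from pathlib import PurePosixPath


class ConcatenatedStream(io.RawIOBase):
    """Expose several binary streams as one continuous, read-only stream."""

    def __init__(self, parts):
        super().__init__()
        self._parts = iter(parts)
        self._current = next(self._parts, None)

    def readable(self):
        return True

    def readinto(self, buffer):
        if len(buffer) == 0:
            return 0
        while self._current is not None:
            chunk = self._current.read(len(buffer))
            if chunk:
                buffer[:len(chunk)] = chunk
                return len(chunk)
            # Current part is exhausted; move on to the next one.
            self._current = next(self._parts, None)
        return 0  # every part consumed: end of stream


def iter_pairs(stream):
    """Yield (key, wav_bytes, transcript) once both files of a pair were seen.

    Assumption: a pair shares the same path apart from its .wav/.txt suffix.
    """
    pending = {}
    # "r|" reads the tar strictly front to back, so the stream never seeks.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            path = PurePosixPath(member.name)
            if not member.isfile() or path.suffix not in (".wav", ".txt"):
                continue
            key = str(path.with_suffix(""))
            entry = pending.setdefault(key, {})
            entry[path.suffix] = tar.extractfile(member).read()
            if len(entry) == 2:
                # Drop the buffered bytes as soon as the pair is handed out.
                del pending[key]
                yield key, entry[".wav"], entry[".txt"].decode("utf-8").strip()
```

Any objects with a `read(n)` method can serve as parts, for example open HTTP responses for each archive part in download order. Memory use then depends on how far apart the two files of a pair sit in the archive.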

How to Start Training

  1. H100 Environment Setup: If you are using a remote H100 (e.g., Lambda, AWS), follow the GPU Setup Guide first to ensure CUDA and libsndfile are ready.

  2. Hugging Face Login:

    hf auth login
    
  3. Verify Download Scripts: Ensure your download_scripts/ directory contains download_train_data.sh. I have already created the scripts in that directory for you.

  4. Local Training Test (Dry Run): Before moving to the H100, you can verify the pipeline works on your laptop (even without a GPU) by running:

    python train_streaming.py --hub_model_id test/dry-run --dry_run
    

    This will:

    • Automatically detect that no GPU is present and run on your CPU.
    • Run exactly 5 steps.
    • Disable Hub uploading and heavy logging.
    • Use a batch size of 1 to save RAM.
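One way a `--dry_run` flag could produce the four effects listed above is to override a handful of settings before training starts. The sketch below is an assumption about the shape of such logic, not an excerpt from train_streaming.py: `TrainConfig`, its field names, and its default values are hypothetical, and only the two command-line flags come from the command shown above.

```python
import argparse
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainConfig:
    hub_model_id: str
    device: str = "cuda"
    max_steps: int = 50_000       # placeholder default, not from the project
    batch_size: int = 16          # placeholder default, not from the project
    push_to_hub: bool = True
    verbose_logging: bool = True


def detect_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def apply_dry_run(config):
    """Shrink the run to a quick smoke test that fits on a laptop."""
    return replace(
        config,
        device=detect_device(),
        max_steps=5,              # run exactly 5 steps
        batch_size=1,             # keep RAM use low
        push_to_hub=False,        # no Hub uploads
        verbose_logging=False,    # no heavy logging
    )


def build_config(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--hub_model_id", required=True)
    parser.add_argument("--dry_run", action="store_true")
    args = parser.parse_args(argv)
    config = TrainConfig(hub_model_id=args.hub_model_id)
    return apply_dry_run(config) if args.dry_run else config


print(build_config(["--hub_model_id", "test/dry-run", "--dry_run"]))
```

Keeping the overrides in one function makes it easy to see exactly how a dry run differs from a full run.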

How to Use on Your Local Device (Laptop)

Once you have a trained checkpoint on the H100, you need to prepare it for your local Windows laptop (where there is no H100).

  1. Prepare the Local Version: Run this script on the H100 machine after training. It will create a folder with the full-precision weights mapped for CPU use.

    python export_for_local.py --checkpoint nptel_embedder_checkpoints/checkpoint-50000 --output my_local_model
    
  2. Download the Folder: Download the my_local_model folder to your laptop.
  3. Run Inference Locally: The test_model.py script on your laptop will automatically detect the lack of a GPU and run the full-precision model on your CPU.

    python test_model.py --model_dir my_local_model --duration 4.0 --word because
    
  4. Resume Training: In your next 24-hour session, re-run the same training command you used before. It will detect the local checkpoint or pull the latest one from the Hub.
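Detecting a local checkpoint usually means finding the `checkpoint-<step>` folder with the highest step number. The sketch below shows one way to do that, assuming checkpoints follow the naming seen in `checkpoint-50000`. The helper `find_latest_checkpoint` is hypothetical, and the Hub fallback is only indicated by the `None` return value.

```python
import re
from pathlib import Path

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the largest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match is None or not child.is_dir():
            continue
        step = int(match.group(1))  # compare as numbers, not as strings
        if step > best_step:
            best_step, best_path = step, child
    return best_path


latest = find_latest_checkpoint("nptel_embedder_checkpoints")
if latest is None:
    print("No local checkpoint found; a Hub download would be the fallback.")
else:
    print(f"Resuming from {latest}")
```

Comparing step numbers as integers matters because a plain string sort would rank `checkpoint-9000` above `checkpoint-10000`. If the training script builds on the Hugging Face `Trainer`, `transformers.trainer_utils.get_last_checkpoint` performs a similar lookup.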

How to Test Phoneme Correlation

Once you have a trained model in your output_dir:

    python test_model.py --model_dir path_to_your_trained_model --word because

The script will now use the Cosine Similarity logits to identify phonemes and provide granular feedback via the PronunciationScorer.
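To give a feel for what phoneme-level feedback can look like, the sketch below aligns an expected phoneme sequence with a recognized one and reports the differences. It does not reproduce the PronunciationScorer, whose interface is not described here. The function `phoneme_feedback`, the output format, and the ARPAbet-style phoneme sequences for "because" are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

_ISSUE_NAMES = {"replace": "substituted", "delete": "missing", "insert": "extra"}


def phoneme_feedback(expected, heard):
    """List the places where the recognized phonemes differ from the target."""
    matcher = SequenceMatcher(None, expected, heard, autojunk=False)
    feedback = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # correctly pronounced stretch, nothing to report
        feedback.append(
            {
                "issue": _ISSUE_NAMES[tag],
                "position": i1,            # index into the expected sequence
                "expected": expected[i1:i2],
                "heard": heard[j1:j2],
            }
        )
    return feedback


# Illustrative sequences only; a real system would read them from a lexicon
# and from the model's decoded output.
target = ["B", "IH", "K", "AO", "Z"]
recognized = ["B", "IH", "K", "AH", "Z"]
for item in phoneme_feedback(target, recognized):
    print(item)
```

An alignment like this only says which phonemes differ. Turning that into a score would also need the per-phoneme similarity values.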


Next Steps

  • Monitor Hub Sync: Ensure your first few checkpoints (every 1000 steps) are successfully uploading to your HF Hub.
  • Evaluate on OOVs: Test how the embedding space handles Out-Of-Vocabulary words compared to the old model.

Reference Documentation