ASR / docs /walkthrough.md

MihirRPatil

deploy: CDAC ASR backend with pitch/stress fix and LLM feedback

88a679b 3 days ago

Walkthrough: Phoneme Pronunciation Correction System (Design A)

I have successfully implemented the new Design A (Embedding-based) architecture for your phoneme pronunciation correction system. This architecture is optimized for your H100 GPU and 50GB storage constraints.

Changes Implemented

1. Core Model: `phoneme_embedder.py`

A custom `Wav2Vec2PhonemeEmbedder` class that replaces the standard linear classification head with a cosine-similarity embedding head. Phoneme logits come from cosine similarity rather than a linear projection, which is intended to make the acoustic-to-phoneme mapping more robust.
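
As a rough illustration of what a cosine-similarity head computes, here is a minimal, framework-free sketch. It assumes one embedding vector per phoneme and a fixed logit scale; the function name `cosine_logits`, the toy vectors, and the scale value are hypothetical and are not taken from `phoneme_embedder.py`.

```python
import math


def cosine_logits(frame, phoneme_embeddings, scale=10.0):
    """Return one logit per phoneme: scale * cos(frame, phoneme vector).

    `frame` is a single acoustic feature vector; `phoneme_embeddings` maps a
    phoneme label to its embedding. Both are plain lists of floats here.
    """

    def normalize(vector):
        # Guard against a zero vector so the division below stays defined.
        norm = math.sqrt(sum(x * x for x in vector)) or 1.0
        return [x / norm for x in vector]

    unit_frame = normalize(frame)
    return {
        label: scale * sum(a * b for a, b in zip(unit_frame, normalize(vector)))
        for label, vector in phoneme_embeddings.items()
    }


# Toy 2-D example (illustrative values only).
toy_embeddings = {"AA": [1.0, 0.0], "IY": [0.0, 1.0], "UW": [-1.0, 0.0]}
logits = cosine_logits([2.0, 0.0], toy_embeddings)
best_phoneme = max(logits, key=logits.get)
```

Because both vectors are normalized, the magnitude of the acoustic frame does not change the ranking; only its direction relative to each phoneme embedding matters.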

4. Custom NPTEL Loader (`nptel_loader.py`)

To satisfy your requirement of using the official download scripts while staying under 50GB:

- It parses your local `download_scripts/` to find the official Zenodo URLs.
- It streams the concatenated parts directly into memory using a custom `ConcatenatedStream`.
- It pairs `.wav` and `.txt` files on the fly and deletes them after yielding, keeping your disk usage at essentially zero.
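
The streaming idea in the list above can be sketched as follows. This is not the code in `nptel_loader.py`; it is a minimal illustration that assumes the downloaded parts form a single tar archive when joined, and that an audio file and its transcript share the same base file name. `iter_pairs` is a hypothetical helper name, and the sketch keeps everything in memory instead of writing and deleting files.

```python
import io
import posixpath
import tarfile


class ConcatenatedStream(io.RawIOBase):
    """Expose several byte streams, read back to back, as one file object."""

    def __init__(self, parts):
        super().__init__()
        self._parts = iter(parts)
        self._current = next(self._parts, None)

    def readable(self):
        return True

    def readinto(self, buffer):
        if len(buffer) == 0:
            return 0
        while self._current is not None:
            chunk = self._current.read(len(buffer))
            if chunk:
                buffer[:len(chunk)] = chunk
                return len(chunk)
            # This part is exhausted; continue with the next one.
            self._current = next(self._parts, None)
        return 0  # every part consumed: end of stream


def iter_pairs(stream):
    """Yield (utterance_id, wav_bytes, transcript) from a streamed tar archive."""
    pending = {}  # utterance_id -> {".wav": bytes, ".txt": bytes}
    # "r|*" reads the archive strictly front to back, so no seeking is needed.
    with tarfile.open(fileobj=stream, mode="r|*") as archive:
        for member in archive:
            stem, ext = posixpath.splitext(posixpath.basename(member.name))
            if not member.isfile() or ext not in (".wav", ".txt"):
                continue
            entry = pending.setdefault(stem, {})
            entry[ext] = archive.extractfile(member).read()
            if len(entry) == 2:
                del pending[stem]  # drop the buffered bytes once the pair is out
                yield stem, entry[".wav"], entry[".txt"].decode("utf-8")
```

In a real loader the `parts` would be HTTP response bodies opened in order; any object with a `read(n)` method works.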

How to Start Training

1. H100 Environment Setup: If you are using a remote H100 (e.g., Lambda, AWS), follow the GPU Setup Guide first to ensure CUDA and libsndfile are ready.
2. Hugging Face Login:
```
hf auth login
```
3. Verify Download Scripts: Ensure your `download_scripts/` directory contains `download_train_data.sh`. I have already created the download scripts for you.
4. Local Training Test (Dry Run): Before moving to the H100, you can verify that the pipeline works on your laptop (even without a GPU) by running:
```
python train_streaming.py --hub_model_id test/dry-run --dry_run
```
This will:
- Detect that no GPU is available and fall back to the CPU.
- Run exactly 5 steps.
- Disable Hub uploading and heavy logging.
- Use a batch size of 1 to save RAM.
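
For orientation, the dry-run behaviour listed above could be wired up roughly like this. It is a sketch under assumptions, not the contents of `train_streaming.py`: the `TrainConfig` fields and their default values are placeholders, and only the two flags shown in the command above come from this walkthrough.

```python
import argparse
from dataclasses import dataclass


@dataclass
class TrainConfig:
    hub_model_id: str
    device: str = "cpu"
    max_steps: int = 50000       # placeholder default
    batch_size: int = 8          # placeholder default
    push_to_hub: bool = True
    verbose_logging: bool = True


def detect_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def build_config(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--hub_model_id", required=True)
    parser.add_argument("--dry_run", action="store_true")
    args = parser.parse_args(argv)

    config = TrainConfig(hub_model_id=args.hub_model_id, device=detect_device())
    if args.dry_run:
        # Keep the smoke test small and side-effect free.
        config.max_steps = 5
        config.batch_size = 1
        config.push_to_hub = False
        config.verbose_logging = False
    return config
```

Calling `build_config(["--hub_model_id", "test/dry-run", "--dry_run"])` returns a configuration limited to 5 steps with a batch size of 1 and Hub uploads turned off.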

How to use on your Local Device (Laptop)

Once you have a trained checkpoint on the H100, you need to prepare it for your local Windows laptop (where there is no H100).

1. Prepare the Local Version: Run this script on the H100 machine after training. It will create a folder with the full-precision weights mapped for CPU use.
```
python export_for_local.py --checkpoint nptel_embedder_checkpoints/checkpoint-50000 --output my_local_model
```
2. Download the Folder: Copy the `my_local_model` folder to your laptop.
3. Run Inference Locally: The `test_model.py` script on your laptop will automatically detect the lack of a GPU and run the full-precision model on your CPU.
```
python test_model.py --model_dir my_local_model --duration 4.0 --word because
```
4. Resume Training: On your next 24-hour H100 session, simply re-run the same training command. It will detect the local checkpoint or pull the latest one from the Hub.
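
Resuming depends on finding the most recent checkpoint. A minimal sketch of the local half of that lookup is shown below, assuming checkpoints are directories named `checkpoint-<step>` as in the export command above. The helper name `find_latest_checkpoint` is hypothetical, and the Hub fallback is left out.

```python
import re
from pathlib import Path

_CHECKPOINT_NAME = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir):
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None

    latest, latest_step = None, -1
    for child in root.iterdir():
        match = _CHECKPOINT_NAME.match(child.name)
        if match is None or not child.is_dir():
            continue
        step = int(match.group(1))  # compare numerically, not as text
        if step > latest_step:
            latest, latest_step = child, step
    return latest
```

Comparing the step as an integer matters: sorted as text, `checkpoint-9000` would come after `checkpoint-50000`. If the function returns `None`, the training script would then need to look for a checkpoint on the Hub or start from scratch.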

How to Test Phoneme Correlation

Once you have a trained model in your output_dir:

```
python test_model.py --model_dir path_to_your_trained_model --word because
```

The script will now use the Cosine Similarity logits to identify phonemes and provide granular feedback via the PronunciationScorer.
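
To make "granular feedback" concrete, here is one way per-phoneme feedback could be produced once the predicted phonemes are known. This is an illustrative sketch only, not the actual `PronunciationScorer`: it assumes the scorer compares an expected phoneme sequence with the predicted one, and the function name `score_pronunciation` and the ARPAbet-style sequences in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def score_pronunciation(expected, predicted):
    """Align two phoneme sequences; return (accuracy, list of feedback strings)."""
    matcher = SequenceMatcher(None, expected, predicted, autojunk=False)
    feedback = []
    correct = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        want = " ".join(expected[i1:i2])
        heard = " ".join(predicted[j1:j2])
        if tag == "equal":
            correct += i2 - i1
        elif tag == "replace":
            feedback.append(f"expected {want}, heard {heard}")
        elif tag == "delete":
            feedback.append(f"missing {want}")
        elif tag == "insert":
            feedback.append(f"extra {heard}")
    # Accuracy is the share of expected phonemes that were matched exactly.
    accuracy = correct / len(expected) if expected else 0.0
    return accuracy, feedback
```

For example, `score_pronunciation(["B", "IH", "K", "AO", "Z"], ["B", "IH", "K", "AH", "Z"])` flags a single substitution on the vowel and leaves the other four phonemes counted as correct.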

Next Steps

- Monitor Hub Sync: Ensure your first few checkpoints (every 1000 steps) are successfully uploading to your HF Hub.
- Evaluate on OOVs: Test how the embedding space handles out-of-vocabulary (OOV) words compared to the old model.