---
title: Audio Language Translator
emoji: 🌍
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.5.1
app_file: run.py
pinned: false
license: mit
suggested_hardware: t4-small
---
# 🌍 Audio Language Translator
Translate spoken audio between 15 languages using a three-stage AI pipeline: speech recognition, machine translation, and speech synthesis.
## 🎯 What This Does
1. **Upload or record** audio in any supported language
2. **Automatic detection** of source language
3. **Translation** to your chosen target language
4. **Speech synthesis** in the target language with selectable voices
## 🔌 REST API
This translator is also available as a REST API for developers!
**📚 Interactive API Docs:** [https://nav772-audio-language-translator.hf.space/docs](https://nav772-audio-language-translator.hf.space/docs)
### API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check and model status |
| `/api/languages` | GET | List all 15 supported languages |
| `/api/voices/{lang}` | GET | Get available TTS voices for a language |
| `/api/transcribe` | POST | Transcribe audio only (no translation) |
| `/api/translate` | POST | Full pipeline (returns JSON) |
| `/api/translate/audio` | POST | Full pipeline (returns audio file) |
### Quick Example (Python)
```python
import requests

# Translate audio to Spanish
with open("input.wav", "rb") as f:
    response = requests.post(
        "https://nav772-audio-language-translator.hf.space/api/translate",
        files={"file": f},
        params={"target_language": "es"},
        timeout=120,  # generous limit: the request covers ASR and translation
    )

response.raise_for_status()  # surface HTTP errors before parsing the body
result = response.json()
print(f"Original: {result['original_text']}")
print(f"Translated: {result['translated_text']}")
```
### Quick Example (cURL)
```bash
curl -X POST \
"https://nav772-audio-language-translator.hf.space/api/translate?target_language=es" \
-F "file=@input.wav"
```
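The read-only endpoints in the table above can be queried with the standard library alone. The following is a minimal sketch, not part of the project: `build_url` and `get_json` are hypothetical helper names, the response shapes are not documented here so the JSON is returned as-is, and passing `es` to `/api/voices/{lang}` assumes it accepts the same short codes as `target_language`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://nav772-audio-language-translator.hf.space"


def build_url(path: str, **query: str) -> str:
    """Join the Space base URL with an endpoint path and optional query string."""
    url = f"{BASE_URL}{path}"
    return f"{url}?{urlencode(query)}" if query else url


def get_json(path: str, timeout: float = 30.0):
    """GET an endpoint and decode its JSON body without assuming its shape."""
    with urlopen(build_url(path), timeout=timeout) as resp:
        return json.load(resp)


# Example calls (they need network access, so they are left commented out):
# print(get_json("/api/health"))
# print(get_json("/api/languages"))
# print(get_json("/api/voices/es"))  # assumes short codes such as "es"
```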
## 🛠️ Built With This API
| Project | Developer | Description |
|---------|-----------|-------------|
| [Audio Translator App](https://github.com/kaunghtetsan1101/audio_translator) | [@kaunghtetsan11](https://huggingface.co/kaunghtetsan11) | Mobile app built using this API |
*Want your project featured here? Open a discussion or PR!*
## 🏗️ Architecture
```
Audio Input (any language)
↓
Whisper ASR (transcription + language detection)
↓
NLLB Translation (to target language)
↓
Edge-TTS (neural speech synthesis)
↓
Audio Output + Text Display
```
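The data flow in the diagram can be sketched as three chained stages. This is an illustration only, not the Space's actual code: `run_pipeline`, `TranslationResult`, and the three stage callables are hypothetical names, and the stages are stubbed so the example runs without Whisper, NLLB, or Edge-TTS installed.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TranslationResult:
    source_language: str
    original_text: str
    translated_text: str
    audio: bytes


def run_pipeline(
    audio: bytes,
    target_language: str,
    transcribe: Callable[[bytes], Tuple[str, str]],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str], bytes],
) -> TranslationResult:
    # Stage 1: ASR returns the transcript and the detected source language.
    text, source_language = transcribe(audio)
    # Stage 2: translate from the detected language to the target language.
    translated = translate(text, source_language, target_language)
    # Stage 3: synthesize speech for the translated text.
    speech = synthesize(translated, target_language)
    return TranslationResult(source_language, text, translated, speech)


# Stub stages (hypothetical) standing in for the real models.
result = run_pipeline(
    b"fake-audio",
    "es",
    transcribe=lambda audio: ("hello", "en"),
    translate=lambda text, src, tgt: f"[{src}->{tgt}] {text}",
    synthesize=lambda text, lang: text.encode("utf-8"),
)
print(result.translated_text)  # prints "[en->es] hello"
```

Passing the stages in as callables keeps each model swappable and lets the flow be tested without loading any weights.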
## 🔧 Technical Stack
| Component | Model | Parameters | Purpose |
|-----------|-------|------------|---------|
| **ASR** | openai/whisper-small | 244M | Speech recognition with automatic language detection |
| **Translation** | facebook/nllb-200-distilled-600M | 615M | Multilingual neural machine translation |
| **TTS** | Microsoft Edge-TTS | API | High-quality neural text-to-speech |
| **API** | FastAPI | - | REST API endpoints |
| **UI** | Gradio | - | Interactive web interface |
## 🌍 Supported Languages
### Tier 1: Multiple Voice Options (3 each)
- 🇺🇸 English (US/UK accents)
- 🇪🇸 Spanish (Spain/Mexico)
- 🇫🇷 French (France/Canada)
- 🇩🇪 German (Germany/Austria)
- 🇨🇳 Chinese (Mandarin)
### Tier 2: Single High-Quality Voice
- 🇸🇦 Arabic, 🇮🇳 Hindi, 🇯🇵 Japanese, 🇰🇷 Korean, 🇧🇷 Portuguese
- 🇷🇺 Russian, 🇮🇹 Italian, 🇳🇱 Dutch, 🇵🇱 Polish, 🇹🇷 Turkish
**Total: 15 languages, 25 voices**
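
NLLB identifies languages by FLORES-200 codes such as `spa_Latn` rather than two-letter codes such as `es`, so a pipeline like this one needs a lookup between the two. The table below is a sketch under assumptions: the short codes, the choice of `zho_Hans` for Mandarin and `arb_Arab` for Arabic, and the names `NLLB_CODES` and `voice_count` are illustrative and may differ from the mapping the app actually uses. It also reproduces the voice total stated above.

```python
# Two-letter code -> FLORES-200 code understood by the NLLB tokenizer.
# Assumed mapping for illustration; the app's own table may differ.
NLLB_CODES = {
    "en": "eng_Latn", "es": "spa_Latn", "fr": "fra_Latn",
    "de": "deu_Latn", "zh": "zho_Hans",
    "ar": "arb_Arab", "hi": "hin_Deva", "ja": "jpn_Jpan",
    "ko": "kor_Hang", "pt": "por_Latn", "ru": "rus_Cyrl",
    "it": "ita_Latn", "nl": "nld_Latn", "pl": "pol_Latn",
    "tr": "tur_Latn",
}

# Tier 1 languages offer three voices each; every other language offers one.
TIER_1 = ("en", "es", "fr", "de", "zh")


def voice_count(lang: str) -> int:
    """Number of TTS voices for a supported language code."""
    if lang not in NLLB_CODES:
        raise ValueError(f"unsupported language: {lang}")
    return 3 if lang in TIER_1 else 1


total_voices = sum(voice_count(lang) for lang in NLLB_CODES)
print(len(NLLB_CODES), total_voices)  # prints "15 25"
```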
## 📚 Research Foundation
| Paper | Authors | Year | Contribution |
|-------|---------|------|--------------|
| [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356) | Radford et al. | 2022 | Whisper ASR model |
| [No Language Left Behind](https://arxiv.org/abs/2207.04672) | Costa-jussà et al. | 2022 | NLLB translation model |
## 📝 Limitations
- Audio length: Optimized for clips under 30 seconds
- Internet required: Edge-TTS requires connectivity
- GPU recommended: CPU inference is significantly slower
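
Because the pipeline is tuned for short clips, a client may want to check a file's duration before uploading it. The sketch below uses the standard `wave` module, so it only handles uncompressed PCM WAV files; the helper names and the strict 30-second cutoff are assumptions for illustration, and the API itself may accept longer audio.

```python
import wave

MAX_SECONDS = 30.0  # clip length the pipeline is optimized for


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_within_limit(path: str, limit: float = MAX_SECONDS) -> bool:
    """True when the clip is shorter than the recommended limit."""
    return wav_duration_seconds(path) < limit


# Usage (hypothetical file name):
# if not is_within_limit("input.wav"):
#     print("Clip is 30 seconds or longer; consider trimming it first.")
```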
## ⚠️ Development Challenges & Solutions
### Challenge 1: Gradio 5.x/6.x Giant Audio Icons
**Problem:** The Audio component's SVG icons rendered extremely large, filling the entire screen, in Gradio versions 5.x and 6.x.
**Attempted fixes that didn't work:**
- Custom CSS targeting SVG elements
- Using `elem_classes` and `scale` parameters
- Various Gradio version downgrades
**Solution:** Removed custom CSS entirely and used clean Gradio components. The issue was related to Shadow DOM in newer Gradio versions blocking external CSS.
### Challenge 2: Gradio 4.x + Python 3.13 Incompatibility
**Problem:** Older Gradio versions (4.x) failed to build due to `tokenizers` and `pyo3` not supporting Python 3.13.
**Error:** `Python interpreter version (3.13) is newer than PyO3's maximum supported version (3.12)`
**Solution:** Used Gradio 6.x, which supports Python 3.13 natively.
### Challenge 3: FastAPI + Gradio Mount Conflicts
**Problem:** Combining FastAPI API endpoints with Gradio UI caused "Invalid port" errors and infinite request loops.
**Error pattern:**
```
Invalid port: '7861_appimmutablechunksD2RdMstj.js'
GET /_app/immutable/chunks/D2RdMstj.js HTTP/1.1" 404 Not Found
```
**Root cause:** Using `demo.launch()` after `gr.mount_gradio_app()` created conflicting servers.
**Solution:**
1. Created separate `run.py` to handle uvicorn server
2. Used `gr.mount_gradio_app(api_app, demo, path="/")` without calling `demo.launch()`
3. Let uvicorn serve the combined FastAPI + Gradio app
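
The three steps above can be sketched as follows. This is a minimal illustration, not the project's `run.py`: `build_app` and `main` are hypothetical names, the health payload and the one-line UI are placeholders, the project keeps the app definition in a separate `app.py` whereas this sketch folds both into one file, and port 7860 is assumed because it is the port Hugging Face Spaces serves by default.

```python
def build_app():
    """Return a FastAPI app with the Gradio UI mounted at the root path."""
    # Local imports keep this module importable without side effects.
    import gradio as gr
    from fastapi import FastAPI

    api_app = FastAPI(title="Audio Language Translator API")

    @api_app.get("/api/health")
    def health():
        # Placeholder payload; the real endpoint also reports model status.
        return {"status": "ok"}

    with gr.Blocks() as demo:
        gr.Markdown("Audio Language Translator")  # placeholder UI

    # Mount the UI last so the /api/* routes registered above match first.
    # demo.launch() is never called: uvicorn is the only server.
    return gr.mount_gradio_app(api_app, demo, path="/")


def main():
    import uvicorn

    # Serve the combined FastAPI + Gradio app from a single process.
    uvicorn.run(build_app(), host="0.0.0.0", port=7860)


# A run.py entry point would end by calling main().
```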
### Challenge 4: HuggingFace Hub Compatibility
**Problem:** Older Gradio versions required older `huggingface_hub` versions, causing import errors.
**Error:** `ImportError: cannot import name 'HfFolder' from 'huggingface_hub'`
**Solution:** Removed version pins and let HuggingFace Spaces resolve compatible versions automatically.
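
When versions are resolved automatically, it helps to log what was actually installed so a later breakage can be traced. The sketch below is a debugging aid under stated assumptions: `installed_versions` is a hypothetical helper and the package list is only an example. It relies on the standard `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Print the resolved versions at startup (example package list).
for name, ver in installed_versions(["gradio", "huggingface_hub", "tokenizers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```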
### Key Takeaways
- **Version compatibility** is critical when combining multiple frameworks
- **Simpler is better**: avoid custom CSS when possible
- **Separate concerns**: use `run.py` for server logic, `app.py` for app definition
- **Test incrementally**: verify the UI works before adding API complexity
## 👤 Author
**[Nav772](https://huggingface.co/Nav772)**: Built as part of an AI Engineering portfolio demonstrating multimodal AI capabilities and REST API development.
## 📚 Related Projects
- [LLM Evaluation Dashboard](https://huggingface.co/spaces/Nav772/llm-evaluation-dashboard)
- [RAG Document Q&A](https://huggingface.co/spaces/Nav772/rag-qa-document)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)
## 📄 License
MIT License