---
title: Audio Language Translator
emoji: 🌍
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.11.0
app_file: run.py
pinned: false
license: mit
suggested_hardware: t4-small
---

# 🌍 Audio Language Translator

Translate spoken audio between 15 languages using a three-stage pipeline: speech recognition, machine translation, and speech synthesis.

## 🎯 What This Does

1. **Upload or record** audio in any supported language
2. **Automatic detection** of source language
3. **Translation** to your chosen target language
4. **Speech synthesis** in the target language with selectable voices

## 🔌 REST API

This translator is also available as a REST API for developers!

**📚 Interactive API Docs:** [https://nav772-audio-language-translator.hf.space/docs](https://nav772-audio-language-translator.hf.space/docs)

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check and model status |
| `/api/languages` | GET | List all 15 supported languages |
| `/api/voices/{lang}` | GET | Get available TTS voices for a language |
| `/api/transcribe` | POST | Transcribe audio only (no translation) |
| `/api/translate` | POST | Full pipeline (returns JSON) |
| `/api/translate/audio` | POST | Full pipeline (returns audio file) |
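
The GET endpoints in the table can be probed with the standard library alone. The sketch below assumes the base URL used in the examples in this README and assumes that each GET endpoint returns JSON; the response fields are not documented here, so the code only decodes and returns whatever comes back. The helper names `endpoint_url` and `get_json` are illustrative and not part of the API.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the examples in this README.
BASE_URL = "https://nav772-audio-language-translator.hf.space"


def endpoint_url(path: str, **path_params: str) -> str:
    """Fill placeholders such as {lang} and prepend the base URL."""
    escaped = {name: quote(value, safe="") for name, value in path_params.items()}
    return BASE_URL + path.format(**escaped)


def get_json(path: str, timeout: float = 30.0, **path_params: str):
    """GET an endpoint and decode the body as JSON (assumed response format)."""
    with urlopen(endpoint_url(path, **path_params), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `get_json("/api/health")` or `get_json("/api/voices/{lang}", lang="es")` would fetch the corresponding rows of the table.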

### Quick Example (Python)
```python
import requests

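# Translate audio to Spanish
with open("input.wav", "rb") as f:
    response = requests.post(
        "https://nav772-audio-language-translator.hf.space/api/translate",
        files={"file": f},
        params={"target_language": "es"},
        timeout=120,  # generous: inference can be slow, especially on CPU
    )

# Fail loudly on HTTP errors instead of parsing an error body as a result
response.raise_for_status()
result = response.json()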
print(f"Original: {result['original_text']}")
print(f"Translated: {result['translated_text']}")
```

### Quick Example (cURL)
```bash
curl -X POST \
  "https://nav772-audio-language-translator.hf.space/api/translate?target_language=es" \
  -F "file=@input.wav"
```

## 🛠️ Built With This API

| Project | Developer | Description |
|---------|-----------|-------------|
| [Audio Translator App](https://github.com/kaunghtetsan1101/audio_translator) | [@kaunghtetsan11](https://huggingface.co/kaunghtetsan11) | Mobile app built using this API |

*Want your project featured here? Open a discussion or PR!*

## 🏗️ Architecture
```
Audio Input (any language)
        ↓
Whisper ASR (transcription + language detection)
        ↓
NLLB Translation (to target language)
        ↓
Edge-TTS (neural speech synthesis)
        ↓
Audio Output + Text Display
```
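
The flow in the diagram can be sketched as three pluggable stages chained by one function. This is a minimal illustration, not the app's actual code: `run_pipeline`, `TranslationResult`, and the stand-in stage functions are hypothetical names, and the stand-ins return canned values so the sketch runs without any models installed. The `original_text` and `translated_text` fields mirror the JSON keys used in the Quick Example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class TranslationResult:
    source_language: str
    original_text: str
    translated_text: str
    audio: bytes


def run_pipeline(
    audio: bytes,
    target_language: str,
    transcribe: Callable[[bytes], Tuple[str, str]],
    translate: Callable[[str, str, str], str],
    synthesize: Callable[[str, str], bytes],
) -> TranslationResult:
    # Stage 1: ASR returns the transcript and the detected source language.
    text, source_language = transcribe(audio)
    # Stage 2: machine translation from the detected language to the target.
    translated = translate(text, source_language, target_language)
    # Stage 3: speech synthesis in the target language.
    speech = synthesize(translated, target_language)
    return TranslationResult(source_language, text, translated, speech)


# Stand-in stages (hypothetical); real ones would wrap Whisper, NLLB, and Edge-TTS.
def fake_transcribe(audio: bytes) -> Tuple[str, str]:
    return "hello", "en"


def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def fake_synthesize(text: str, language: str) -> bytes:
    return text.encode("utf-8")


result = run_pipeline(b"\x00", "es", fake_transcribe, fake_translate, fake_synthesize)
print(result.translated_text)
```

Passing the stages in as callables keeps each model swappable and lets the chaining logic be tested without a GPU.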

## 🔧 Technical Stack

| Component | Model | Parameters | Purpose |
|-----------|-------|------------|---------|
| **ASR** | openai/whisper-small | 244M | Speech recognition with automatic language detection |
| **Translation** | facebook/nllb-200-distilled-600M | 615M | Multilingual neural machine translation |
| **TTS** | Microsoft Edge-TTS | n/a (online service) | High-quality neural text-to-speech |
| **API** | FastAPI | - | REST API endpoints |
| **UI** | Gradio | - | Interactive web interface |

## 🌐 Supported Languages

### Tier 1: Multiple Voice Options (3 each)
- 🇺🇸 English (US/UK accents)
- 🇪🇸 Spanish (Spain/Mexico)
- 🇫🇷 French (France/Canada)
- 🇩🇪 German (Germany/Austria)
- 🇨🇳 Chinese (Mandarin)

### Tier 2: Single High-Quality Voice
- 🇸🇦 Arabic, 🇮🇳 Hindi, 🇯🇵 Japanese, 🇰🇷 Korean, 🇧🇷 Portuguese
- 🇷🇺 Russian, 🇮🇹 Italian, 🇳🇱 Dutch, 🇵🇱 Polish, 🇹🇷 Turkish

**Total: 15 languages, 25 voices**
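
Whisper reports two-letter language codes, while NLLB expects FLORES-200 codes, so a pipeline like this one needs a lookup between the two. The table below is a sketch of such a mapping for the 15 languages listed above; the app's actual mapping is not shown in this README, and the choice of Simplified Chinese (`zho_Hans`) and Modern Standard Arabic (`arb_Arab`) is an assumption. The voice counts simply restate the two tiers and check the total.

```python
# Two-letter codes (as reported by Whisper) mapped to FLORES-200 codes (as used by NLLB).
NLLB_CODES = {
    "en": "eng_Latn", "es": "spa_Latn", "fr": "fra_Latn", "de": "deu_Latn",
    "zh": "zho_Hans",  # assumption: Simplified Chinese
    "ar": "arb_Arab",  # assumption: Modern Standard Arabic
    "hi": "hin_Deva", "ja": "jpn_Jpan", "ko": "kor_Hang", "pt": "por_Latn",
    "ru": "rus_Cyrl", "it": "ita_Latn", "nl": "nld_Latn", "pl": "pol_Latn",
    "tr": "tur_Latn",
}

# Tier 1 languages have three voices each; every other language has one.
TIER_1 = ("en", "es", "fr", "de", "zh")
VOICES_PER_LANGUAGE = {code: 3 if code in TIER_1 else 1 for code in NLLB_CODES}

total_voices = sum(VOICES_PER_LANGUAGE.values())
print(len(NLLB_CODES), "languages,", total_voices, "voices")
```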

## 📚 Research Foundation

| Paper | Authors | Year | Contribution |
|-------|---------|------|--------------|
| [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356) | Radford et al. | 2022 | Whisper ASR model |
| [No Language Left Behind](https://arxiv.org/abs/2207.04672) | Costa-jussà et al. | 2022 | NLLB translation model |

## 📝 Limitations

- Audio length: Optimized for clips under 30 seconds
- Internet required: Edge-TTS requires connectivity
- GPU recommended: CPU inference is significantly slower
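
One way a client could work within the 30-second guideline is to split longer recordings before uploading them. The sketch below is a workaround under stated assumptions, not something the app does: `split_clip` is a hypothetical helper that cuts a sequence of samples into fixed windows, and such naive cuts can land mid-word, so a real client would probably prefer to split on silence.

```python
from typing import Iterator, Sequence

MAX_CLIP_SECONDS = 30  # limit quoted in the Limitations list


def split_clip(
    samples: Sequence[float],
    sample_rate: int,
    max_seconds: int = MAX_CLIP_SECONDS,
) -> Iterator[Sequence[float]]:
    """Yield consecutive slices that are each at most max_seconds long."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    chunk_len = sample_rate * max_seconds
    for start in range(0, len(samples), chunk_len):
        yield samples[start:start + chunk_len]


# A 70-second clip at a toy sample rate of 10 Hz splits into 30 s + 30 s + 10 s.
chunk_lengths = [len(chunk) for chunk in split_clip([0.0] * 700, sample_rate=10)]
print(chunk_lengths)
```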

## ⚠️ Development Challenges & Solutions

### Challenge 1: Gradio 5.x/6.x Giant Audio Icons
**Problem:** Audio component SVG icons rendered extremely large (filling the entire screen) in Gradio versions 5.x and 6.x.

**Attempted fixes that didn't work:**
- Custom CSS targeting SVG elements
- Using `elem_classes` and `scale` parameters
- Various Gradio version downgrades

**Solution:** Removed custom CSS entirely and used unstyled Gradio components. The issue appeared to be related to the Shadow DOM in newer Gradio versions blocking external CSS.

### Challenge 2: Gradio 4.x + Python 3.13 Incompatibility
**Problem:** Older Gradio versions (4.x) failed to build because `tokenizers` and `pyo3` did not support Python 3.13.

**Error:** `Python interpreter version (3.13) is newer than PyO3's maximum supported version (3.12)`

**Solution:** Used Gradio 6.x, which supports Python 3.13.

### Challenge 3: FastAPI + Gradio Mount Conflicts
**Problem:** Combining FastAPI API endpoints with Gradio UI caused "Invalid port" errors and infinite request loops.

**Error pattern:**
```
Invalid port: '7861_appimmutablechunksD2RdMstj.js'
GET /_app/immutable/chunks/D2RdMstj.js HTTP/1.1" 404 Not Found
```

**Root cause:** Calling `demo.launch()` after `gr.mount_gradio_app()` started a second Gradio server alongside the mounted app, and the two conflicted.

**Solution:**
1. Created a separate `run.py` to start the uvicorn server
2. Used `gr.mount_gradio_app(api_app, demo, path="/")` without calling `demo.launch()`
3. Let uvicorn serve the combined FastAPI + Gradio app
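
The three steps above can be sketched as a single `run.py`. This is a minimal illustration rather than the project's actual file: the placeholder route, the one-line UI, and the function names `build_app` and `main` are assumptions, as is port 7860 (the usual default for Gradio Spaces). Only `gr.mount_gradio_app(api_app, demo, path="/")` and the use of uvicorn come from the description above.

```python
def build_app():
    """Return one ASGI app that serves both the REST routes and the Gradio UI."""
    # Imported inside the function so that importing this module has no side effects.
    import gradio as gr
    from fastapi import FastAPI

    api_app = FastAPI(title="Audio Language Translator API")

    # Register API routes before mounting the UI at "/", so they are matched first.
    @api_app.get("/api/health")
    def health():
        # Placeholder payload; the real response shape is not shown in this README.
        return {"status": "ok"}

    with gr.Blocks() as demo:
        gr.Markdown("Audio Language Translator")

    # Mount the UI on the FastAPI app and do NOT call demo.launch() afterwards.
    return gr.mount_gradio_app(api_app, demo, path="/")


def main():
    import uvicorn

    # uvicorn is the only server; port 7860 is an assumption (Gradio Spaces default).
    uvicorn.run(build_app(), host="0.0.0.0", port=7860)
```

Calling `main()` at the bottom of `run.py` starts the single combined server. Keeping the app construction in `build_app()` also makes it possible to import the app in tests without binding a port.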

### Challenge 4: HuggingFace Hub Compatibility
**Problem:** Older Gradio versions required older `huggingface_hub` versions, causing import errors.

**Error:** `ImportError: cannot import name 'HfFolder' from 'huggingface_hub'`

**Solution:** Removed version pins and let the Hugging Face Spaces build resolve compatible versions automatically.

### Key Takeaways
- **Version compatibility** is critical when combining multiple frameworks
- **Simpler is better:** avoid custom CSS when possible
- **Separate concerns:** use `run.py` for server logic and `app.py` for the app definition
- **Test incrementally:** verify that the UI works before adding API complexity

## 👤 Author

**[Nav772](https://huggingface.co/Nav772)** — Built as part of an AI Engineering portfolio demonstrating multimodal AI capabilities and REST API development.

## 📚 Related Projects

- [LLM Evaluation Dashboard](https://huggingface.co/spaces/Nav772/llm-evaluation-dashboard)
- [RAG Document Q&A](https://huggingface.co/spaces/Nav772/rag-qa-document)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)

## 📄 License

MIT License