---
title: Audio Language Translator
emoji: 🌍
colorFrom: red
colorTo: yellow
sdk: gradio
sdk_version: 6.11.0
app_file: run.py
pinned: false
license: mit
suggested_hardware: t4-small
---

# 🌍 Audio Language Translator

Translate spoken audio between 15 languages using a three-stage AI pipeline: speech recognition, machine translation, and speech synthesis.

## 🎯 What This Does

1. **Upload or record** audio in any supported language
2. **Automatic detection** of source language
3. **Translation** to your chosen target language
4. **Speech synthesis** in the target language with selectable voices

## 🔌 REST API

This translator is also available as a REST API for developers!

**📚 Interactive API Docs:** [https://nav772-audio-language-translator.hf.space/docs](https://nav772-audio-language-translator.hf.space/docs)

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/health` | GET | Health check and model status |
| `/api/languages` | GET | List all 15 supported languages |
| `/api/voices/{lang}` | GET | Get available TTS voices for a language |
| `/api/transcribe` | POST | Transcribe audio only (no translation) |
| `/api/translate` | POST | Full pipeline (returns JSON) |
| `/api/translate/audio` | POST | Full pipeline (returns audio file) |
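The three GET endpoints in the table can be explored with a small standard-library client. This is a minimal sketch: the helper names `endpoint`, `get_json`, and `main` are hypothetical, the response schemas are not documented here (so the bodies are only printed), and it assumes `{lang}` takes the same short code (`es`) used by `target_language` in the examples further down.

```python
import json
import urllib.request

BASE_URL = "https://nav772-audio-language-translator.hf.space"


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an API path without doubling slashes."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def get_json(path: str, timeout: float = 30.0):
    """GET an endpoint and decode its JSON body (schema is not assumed)."""
    with urllib.request.urlopen(endpoint(path), timeout=timeout) as resp:
        return json.load(resp)


def main() -> None:
    # Network calls live here so that importing this module has no side effects.
    print(get_json("/api/health"))
    print(get_json("/api/languages"))
    # Assumption: the path parameter is a short language code such as "es".
    print(get_json("/api/voices/es"))
```

Calling `main()` prints the raw JSON from each endpoint, which is a quick way to see the actual field names before writing code against them.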

### Quick Example (Python)
```python
import requests

# Translate audio to Spanish
with open("input.wav", "rb") as f:
    response = requests.post(
        "https://nav772-audio-language-translator.hf.space/api/translate",
        files={"file": f},
        params={"target_language": "es"},
        timeout=120,  # model inference can take a while
    )

# Fail early on HTTP errors instead of parsing an error body as a result
response.raise_for_status()

result = response.json()
print(f"Original: {result['original_text']}")
print(f"Translated: {result['translated_text']}")
```

### Quick Example (cURL)
```bash
curl -X POST \
  "https://nav772-audio-language-translator.hf.space/api/translate?target_language=es" \
  -F "file=@input.wav"
```

## 🛠️ Built With This API

| Project | Developer | Description |
|---------|-----------|-------------|
| [Audio Translator App](https://github.com/kaunghtetsan1101/audio_translator) | [@kaunghtetsan11](https://huggingface.co/kaunghtetsan11) | Mobile app built using this API |

*Want your project featured here? Open a discussion or PR!*

## 🏗️ Architecture
```
Audio Input (any language)
        ↓
Whisper ASR (transcription + language detection)
        ↓
NLLB Translation (to target language)
        ↓
Edge-TTS (neural speech synthesis)
        ↓
Audio Output + Text Display
```
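The three stages above could be wired together roughly as follows. This is a sketch under stated assumptions, not the Space's actual code: the function names, the `NLLB_CODES` and `VOICES` tables, and the voice choices are illustrative, the source language is passed in rather than detected by Whisper, and models are reloaded on every call for brevity. It relies on the `transformers` and `edge-tts` packages, plus `ffmpeg` for decoding audio files.

```python
import asyncio

# FLORES-200 codes that NLLB expects, for a few of the supported languages.
NLLB_CODES = {"en": "eng_Latn", "es": "spa_Latn", "fr": "fra_Latn", "de": "deu_Latn"}

# Illustrative voice choices (assumption); any Edge-TTS voice name works here.
VOICES = {"en": "en-US-AriaNeural", "es": "es-ES-ElviraNeural"}


def transcribe(audio_path: str) -> str:
    """Stage 1: speech to text with Whisper."""
    from transformers import pipeline  # imported lazily: heavy dependency

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    return asr(audio_path)["text"].strip()


def translate(text: str, source: str, target: str) -> str:
    """Stage 2: text to text with NLLB."""
    from transformers import pipeline

    translator = pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        src_lang=NLLB_CODES[source],
        tgt_lang=NLLB_CODES[target],
    )
    return translator(text)[0]["translation_text"]


def synthesize(text: str, target: str, out_path: str) -> str:
    """Stage 3: text to speech with Edge-TTS (writes an MP3 file)."""
    import edge_tts

    asyncio.run(edge_tts.Communicate(text, VOICES[target]).save(out_path))
    return out_path


def run_pipeline(audio_path: str, source: str, target: str, out_path: str = "output.mp3"):
    """Run all three stages and return (original text, translation, audio path)."""
    original = transcribe(audio_path)
    translated = translate(original, source, target)
    return original, translated, synthesize(translated, target, out_path)
```

A call such as `run_pipeline("input.wav", "en", "es")` would return both texts and the path of the synthesized audio. In a long-running service the two models would normally be loaded once at startup rather than inside each function.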

## 🔧 Technical Stack

| Component | Model | Parameters | Purpose |
|-----------|-------|------------|---------|
| **ASR** | openai/whisper-small | 244M | Speech recognition with automatic language detection |
| **Translation** | facebook/nllb-200-distilled-600M | 615M | Multilingual neural machine translation |
| **TTS** | Microsoft Edge-TTS | API | High-quality neural text-to-speech |
| **API** | FastAPI | - | REST API endpoints |
| **UI** | Gradio | - | Interactive web interface |

## 🌍 Supported Languages

### Tier 1: Multiple Voice Options (3 each)
- 🇺🇸 English (US/UK accents)
- 🇪🇸 Spanish (Spain/Mexico)
- 🇫🇷 French (France/Canada)
- 🇩🇪 German (Germany/Austria)
- 🇨🇳 Chinese (Mandarin)

### Tier 2: Single High-Quality Voice
- 🇸🇦 Arabic, 🇮🇳 Hindi, 🇯🇵 Japanese, 🇰🇷 Korean, 🇧🇷 Portuguese
- 🇷🇺 Russian, 🇮🇹 Italian, 🇳🇱 Dutch, 🇵🇱 Polish, 🇹🇷 Turkish

**Total: 15 languages, 25 voices**
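The totals follow directly from the two tiers: 5 languages with 3 voices each plus 10 languages with 1 voice each. A small check, with variable names chosen only for this illustration:

```python
# Language tiers as listed above.
TIER_1 = ["English", "Spanish", "French", "German", "Chinese"]
TIER_2 = [
    "Arabic", "Hindi", "Japanese", "Korean", "Portuguese",
    "Russian", "Italian", "Dutch", "Polish", "Turkish",
]

# Tier 1 languages have 3 voices each, Tier 2 languages have 1.
VOICES_PER_LANGUAGE = {
    **{name: 3 for name in TIER_1},
    **{name: 1 for name in TIER_2},
}

total_languages = len(VOICES_PER_LANGUAGE)
total_voices = sum(VOICES_PER_LANGUAGE.values())
print(total_languages, total_voices)  # 15 25
```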

## 📚 Research Foundation

| Paper | Authors | Year | Contribution |
|-------|---------|------|--------------|
| [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356) | Radford et al. | 2022 | Whisper ASR model |
| [No Language Left Behind](https://arxiv.org/abs/2207.04672) | Costa-jussà et al. | 2022 | NLLB translation model |

## 📝 Limitations

- Audio length: Optimized for clips under 30 seconds
- Internet required: Edge-TTS requires connectivity
- GPU recommended: CPU inference is significantly slower
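For recordings longer than 30 seconds, one client-side workaround is to split the audio into windows of at most 30 seconds and submit each one separately. The helper below only computes the window boundaries. It is a hypothetical sketch, not something the Space does itself, and cutting at fixed offsets can split a word in half.

```python
def chunk_bounds(duration_s: float, max_chunk_s: float = 30.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole clip."""
    if max_chunk_s <= 0:
        raise ValueError("max_chunk_s must be positive")
    bounds = []
    start = 0.0
    while start < duration_s:
        # The final window is shortened so it never runs past the clip end.
        end = min(start + max_chunk_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


print(chunk_bounds(75.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```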

## ⚠️ Development Challenges & Solutions

### Challenge 1: Gradio 5.x/6.x Giant Audio Icons
**Problem:** Audio component SVG icons displayed extremely large (filling entire screen) in Gradio versions 5.x and 6.x.

**Attempted fixes that didn't work:**
- Custom CSS targeting SVG elements
- Using `elem_classes` and `scale` parameters
- Various Gradio version downgrades

**Solution:** Removed custom CSS entirely and used clean Gradio components. The issue was related to Shadow DOM in newer Gradio versions blocking external CSS.

### Challenge 2: Gradio 4.x + Python 3.13 Incompatibility
**Problem:** Older Gradio versions (4.x) failed to build due to `tokenizers` and `pyo3` not supporting Python 3.13.

**Error:** `Python interpreter version (3.13) is newer than PyO3's maximum supported version (3.12)`

**Solution:** Used Gradio 6.x which has native Python 3.13 support.

### Challenge 3: FastAPI + Gradio Mount Conflicts
**Problem:** Combining FastAPI API endpoints with Gradio UI caused "Invalid port" errors and infinite request loops.

**Error pattern:**
```
Invalid port: '7861_appimmutablechunksD2RdMstj.js'
GET /_app/immutable/chunks/D2RdMstj.js HTTP/1.1" 404 Not Found
```

**Root cause:** Using `demo.launch()` after `gr.mount_gradio_app()` created conflicting servers.

**Solution:** 
1. Created a separate `run.py` that starts the uvicorn server
2. Used `gr.mount_gradio_app(api_app, demo, path="/")` without calling `demo.launch()`
3. Let uvicorn serve the combined FastAPI + Gradio app as a single server
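The steps above can be sketched as follows. This is a minimal illustration, not the project's actual `app.py` and `run.py`: the two files are folded into one listing, `build_app` and `main` are hypothetical names, the health payload and UI content are placeholders, and port 7860 is an assumption based on the usual Spaces default.

```python
def build_app():
    """app.py side: define the API and the UI, but start no server."""
    import gradio as gr
    from fastapi import FastAPI

    api_app = FastAPI()

    @api_app.get("/api/health")
    def health():
        return {"status": "ok"}  # placeholder payload

    with gr.Blocks() as demo:
        gr.Markdown("Audio Language Translator")  # placeholder UI

    # Mount the UI at the root. Routes registered before the mount are
    # matched first, so /api/* keeps working. demo.launch() is never called.
    return gr.mount_gradio_app(api_app, demo, path="/")


def main():
    """run.py side: a single uvicorn server serves both API and UI."""
    import uvicorn

    uvicorn.run(build_app(), host="0.0.0.0", port=7860)  # port is an assumption
```

Keeping the server start in one place means there is only ever one process listening on the port, which avoids the conflict that arises when `demo.launch()` is called after mounting.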

### Challenge 4: HuggingFace Hub Compatibility
**Problem:** Older Gradio versions required older `huggingface_hub` versions, causing import errors.

**Error:** `ImportError: cannot import name 'HfFolder' from 'huggingface_hub'`

**Solution:** Removed version pins and let HuggingFace Spaces resolve compatible versions automatically.

### Key Takeaways
- **Version compatibility** is critical when combining multiple frameworks
- **Simpler is better**: avoid custom CSS when possible
- **Separate concerns**: use `run.py` for server logic, `app.py` for app definition
- **Test incrementally**: verify UI works before adding API complexity

## 👤 Author

**[Nav772](https://huggingface.co/Nav772)**: Built as part of an AI Engineering portfolio demonstrating multimodal AI capabilities and REST API development.

## 📚 Related Projects

- [LLM Evaluation Dashboard](https://huggingface.co/spaces/Nav772/llm-evaluation-dashboard)
- [RAG Document Q&A](https://huggingface.co/spaces/Nav772/rag-qa-document)
- [Movie Sentiment Analyzer](https://huggingface.co/spaces/Nav772/movie-sentiment-analyzer)

## 📄 License

MIT License