# Architecture and Technical Documentation

## API Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                       Client/Frontend                       │
│        (Web, Mobile, CLI, Python Client, cURL, etc.)        │
└────────────────────┬────────────────────────────────────────┘
                     │ HTTP/REST
┌────────────────────▼────────────────────────────────────────┐
│                      Flask API Server                       │
│  - Health Check        GET /health                          │
│  - Documentation       GET /                                │
│  - Languages           GET /supported-languages             │
│  - ASR (Audio→Text)    POST /asr                            │
│  - TTS (Text→Audio)    POST /tts                            │
└────────────┬──────────────────────────────┬─────────────────┘
             │                              │
    ┌────────▼──────────┐        ┌──────────▼──────────┐
    │   ASR Pipeline    │        │    TTS Pipeline     │
    │                   │        │                     │
    │ 1. Load Audio     │        │ 1. Validate Text    │
    │ 2. Process        │        │ 2. Load Model       │
    │ 3. Tokenize       │        │ 3. Tokenize         │
    │ 4. Infer w/ MMS   │        │ 4. Infer (VITS)     │
    │ 5. Decode         │        │ 5. Generate WAV     │
    └────────┬──────────┘        └──────────┬──────────┘
             │                              │
    ┌────────▼──────────────────────────────▼──────────┐
    │             Model Cache & Management             │
    │  - facebook/mms-1b-all (ASR)                     │
    │  - facebook/mms-tts-* (8 languages)              │
    │  - Thread-safe loading                           │
    │  - Lazy initialization                           │
    └────────┬─────────────────────────────────────────┘
             │
    ┌────────▼─────────────────────────────────────────┐
    │              PyTorch / GPU Support               │
    │  - Automatic GPU/CPU detection                   │
    │  - Device management                             │
    └──────────────────────────────────────────────────┘
```
## Request Flow

### ASR (Automatic Speech Recognition)

```
Audio File
    ↓
[Validation] → Error if invalid
    ↓
[Load & Resample] → Convert to 16 kHz mono
    ↓
[Normalize] → [-1, 1] range
    ↓
[Truncate] → Max 30 seconds
    ↓
[Tokenize] → Convert to features
    ↓
[Infer] → facebook/mms-1b-all (GPU/CPU)
    ↓
[Decode] → Text output
    ↓
JSON Response
```
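The preprocessing steps in the ASR flow (mono mixdown, resampling to 16 kHz, normalization to [-1, 1], truncation to 30 seconds) can be sketched with NumPy alone. This is a minimal illustration, not the service's actual code: `preprocess_audio` is a hypothetical name, the `(n_samples, n_channels)` layout is an assumption, and linear interpolation stands in for whatever resampler the app really uses. A production pipeline would normally use a band-limited resampler.

```python
import numpy as np

SAMPLE_RATE = 16000      # target rate expected by the ASR model
MAX_AUDIO_LENGTH = 30    # seconds kept after truncation


def preprocess_audio(samples: np.ndarray, source_rate: int) -> np.ndarray:
    """Return mono float32 audio at 16 kHz, peak-normalized and truncated."""
    audio = np.asarray(samples, dtype=np.float32)
    if audio.size == 0:
        raise ValueError("empty audio")

    # Mix down to mono; assumes shape (n_samples, n_channels) for multichannel input.
    if audio.ndim == 2:
        audio = audio.mean(axis=1)

    # Resample by linear interpolation (illustrative only, no anti-aliasing filter).
    if source_rate != SAMPLE_RATE:
        duration = audio.shape[0] / source_rate
        target_len = max(1, int(round(duration * SAMPLE_RATE)))
        old_t = np.linspace(0.0, duration, num=audio.shape[0], endpoint=False)
        new_t = np.linspace(0.0, duration, num=target_len, endpoint=False)
        audio = np.interp(new_t, old_t, audio).astype(np.float32)

    # Scale into [-1, 1]; silent input is left untouched to avoid dividing by zero.
    peak = float(np.max(np.abs(audio)))
    if peak > 0:
        audio = audio / peak

    # Keep at most MAX_AUDIO_LENGTH seconds.
    return audio[: SAMPLE_RATE * MAX_AUDIO_LENGTH]
```

Truncating after resampling keeps the 30-second limit independent of the upload's original sample rate.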
### TTS (Text-to-Speech)

```
Text + Language
    ↓
[Validation] → Error if empty/too long
    ↓
[Load Model] → facebook/mms-tts-{lang}
    ↓
[Tokenize] → Convert text to token IDs
    ↓
[Infer] → VITS model (GPU/CPU)
    ↓
[Generate WAV] → Audio synthesis (22050 Hz)
    ↓
WAV File (audio/wav)
```
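The first two TTS steps, validation and model selection, amount to a few checks followed by filling in the `facebook/mms-tts-{lang}` template. The sketch below assumes a hypothetical helper named `resolve_tts_model`. The set of supported codes is passed in by the caller, because this document does not list the eight languages. The two codes in the usage line are placeholders only.

```python
MAX_TEXT_LENGTH = 1000  # character limit from the Configuration section


def resolve_tts_model(text: str, language: str, supported: set) -> str:
    """Validate a TTS request and return the checkpoint name to load."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError(f"text exceeds {MAX_TEXT_LENGTH} characters")
    if language not in supported:
        raise ValueError(f"unsupported language: {language!r}")
    return f"facebook/mms-tts-{language}"


# Illustrative subset only; the real list comes from GET /supported-languages.
EXAMPLE_LANGUAGES = {"eng", "fra"}
print(resolve_tts_model("Bonjour", "fra", EXAMPLE_LANGUAGES))
```

Rejecting bad input before the model is loaded means an invalid request never triggers a slow checkpoint download.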
## Models Used

### ASR: facebook/mms-1b-all

- **Architecture**: wav2vec2
- **Size**: 964.8M parameters
- **Languages**: 100+ (ISO 639-3 codes)
- **Input**: 16 kHz mono audio
- **Output**: Text transcription
- **Training**: XLS-R + multilingual fine-tuning
### TTS: facebook/mms-tts-{language}

- **Architecture**: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- **Size**: ~5-10M parameters per model
- **Languages**: 8 (see supported languages)
- **Input**: Text (max 1000 characters)
- **Output**: 22050 Hz waveform
- **Training**: Multilingual dataset + data augmentation
## Configuration

```python
import threading
import torch

SAMPLE_RATE = 16000      # ASR sampling rate (Hz)
MAX_AUDIO_LENGTH = 30    # Maximum audio duration (seconds)
MAX_TEXT_LENGTH = 1000   # Maximum text length (characters)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # GPU if available
MODEL_CACHE = {}         # Shared dict of loaded models, guarded by a lock
MODEL_CACHE_LOCK = threading.Lock()  # illustrative name for that lock
```
## Performance

| Metric | Value |
|--------|-------|
| First ASR request | 2-5 min (model loading) |
| Subsequent ASR requests | 1-10 s (10 s of audio) |
| First TTS request | 30-60 s (model loading) |
| Subsequent TTS requests | 1-5 s (100 characters) |
| GPU memory | ~2 GB (ASR) + ~1 GB (TTS) |
| RAM | ~1 GB cache |
## Security

### Input Validation

- ✅ Audio file type check
- ✅ Audio length limit (30 s)
- ✅ Text length limit (1000 characters)
- ✅ Non-empty content check
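### Rate Limiting (to be added)

```python
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

# `app` is the existing Flask application object.
# Flask-Limiter 3.x takes key_func first and the app as a keyword argument.
limiter = Limiter(get_remote_address, app=app)

@app.route('/tts', methods=['POST'])
@limiter.limit("10/minute")  # at most 10 requests per minute per client IP
def tts():
    ...
```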
### Authentication (to be added)

```python
from functools import wraps
from flask import request

def require_token(f):
    """Reject requests whose Authorization header fails validation."""
    @wraps(f)
    def decorated(*args, **kwargs):
        token = request.headers.get('Authorization')
        # validate_token is a placeholder: supply your own check.
        if not validate_token(token):
            return {'error': 'Unauthorized'}, 401
        return f(*args, **kwargs)
    return decorated
```
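The decorator above relies on a `validate_token` function that this document does not define. One possible shape for it, purely as an assumption, is a bearer-token check against a shared secret. The `API_TOKEN` environment variable and the `Bearer` scheme below are hypothetical choices, not part of the documented configuration.

```python
import hmac
import os


def validate_token(header_value, expected=None):
    """Return True if the Authorization header carries the expected bearer token."""
    # API_TOKEN is a hypothetical environment variable used only in this sketch.
    if expected is None:
        expected = os.environ.get("API_TOKEN", "")
    if not header_value or not expected:
        return False
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking how much of the token matched.
    return hmac.compare_digest(token.strip().encode(), expected.encode())
```

An unset or empty secret rejects every request, so a missing configuration fails closed rather than open.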
## Optimizations

### Model Cache

- Models are loaded only once
- Shared across all requests
- Thread-safe via locks
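A load-once, lock-protected cache like the one described can be sketched as follows. The names `get_model`, `_MODEL_CACHE`, and `_CACHE_LOCK` are illustrative, and `loader` is a hypothetical callable. In the service it would presumably wrap something like `Wav2Vec2ForCTC.from_pretrained` or `VitsModel.from_pretrained`, but that wiring is an assumption.

```python
import threading
from typing import Any, Callable, Dict

_MODEL_CACHE: Dict[str, Any] = {}
_CACHE_LOCK = threading.Lock()


def get_model(name: str, loader: Callable[[str], Any]) -> Any:
    """Return the cached model for `name`, loading it at most once."""
    model = _MODEL_CACHE.get(name)           # fast path: no lock once loaded
    if model is None:
        with _CACHE_LOCK:
            model = _MODEL_CACHE.get(name)   # re-check: another thread may have loaded it
            if model is None:
                model = loader(name)         # slow path: download and initialize
                _MODEL_CACHE[name] = model
    return model
```

A single global lock keeps the sketch short, but it also serializes the first load of different models. A per-model lock would let two languages load in parallel at the cost of more bookkeeping.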
### GPU Acceleration

- Automatic GPU detection
- Inference on GPU when available
- Automatic CPU fallback

### Memory Management

- Gradients disabled for inference
- Models in eval mode
- Audio and text truncated to the configured limits
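In PyTorch terms, the GPU fallback and the two inference-time memory settings come down to a few calls. The sketch below assumes hypothetical helper names (`pick_device`, `run_inference`) and a model that accepts its inputs as keyword tensors, as Hugging Face models do. It is not taken from the app's source.

```python
import torch


def pick_device() -> torch.device:
    """Use CUDA when a GPU is visible, otherwise fall back to the CPU."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


@torch.inference_mode()  # no autograd bookkeeping, so no gradient buffers
def run_inference(model: torch.nn.Module, inputs: dict, device: torch.device):
    """Move model and inputs to `device` and run one forward pass in eval mode."""
    model.to(device).eval()  # eval() disables dropout and similar training-only layers
    batch = {key: value.to(device) for key, value in inputs.items()}
    return model(**batch)
```

Calling `model.to(device)` on every request is harmless once the model already lives on that device, though a real service would more likely move it once at load time.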
## Deployment

### Local Development

```bash
python app_v2.py
# Runs on http://localhost:7860
```

### Docker

```bash
docker build -t mms-api .
docker run -p 7860:7860 mms-api
```

### Docker Compose (with GPU)

```bash
docker-compose up
```

### Hugging Face Spaces

- Create a Docker Space
- Push the code to Hugging Face
- The Space builds and deploys automatically
- URL: https://huggingface.co/spaces/{user}/{space}
## API Endpoints

### GET /

Documentation and metadata

### GET /health

Service status and device info

### GET /supported-languages

Languages supported for ASR and TTS

### GET /models-info

Detailed information about the models

### POST /asr

Audio transcription

- **Input**: multipart/form-data (audio + language)
- **Output**: JSON (transcription + metadata)

### POST /tts

Speech synthesis

- **Input**: JSON (text + language)
- **Output**: WAV audio file
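As an illustration of the `POST /tts` contract, here is a small standard-library client. It assumes the JSON keys are literally `text` and `language` (the endpoint description names those fields but does not show a payload) and that the server listens on the default local address. `build_tts_request` and `synthesize` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # default local address from the Deployment section


def build_tts_request(text: str, language: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /tts request with a JSON body."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, language: str, out_path: str = "output.wav") -> str:
    """Send the request and save the returned WAV bytes to `out_path`."""
    request = build_tts_request(text, language)
    # Generous timeout: the first request may include model loading.
    with urllib.request.urlopen(request, timeout=120) as response:
        with open(out_path, "wb") as handle:
            handle.write(response.read())
    return out_path
```

With the server running, `synthesize("Bonjour", "fra")` would write the returned audio to `output.wav`.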
## Debugging

### Logs

```bash
# Local
python app_v2.py
# Logs are written to stdout

# Docker
docker logs <container_id>

# HF Spaces
# See the "Logs" tab of the Space
```

### Common Issues

**Issue**: Model not found

**Solution**: Wait for the model download to finish (5-10 min)

**Issue**: CUDA out of memory

**Solution**: Reduce MAX_AUDIO_LENGTH or run on CPU

**Issue**: Port already in use

**Solution**: `PORT=8080 python app_v2.py`
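The `PORT=8080` workaround implies that the app reads its port from the environment. A minimal sketch of that lookup, assuming 7860 as the default and a hypothetical helper name `resolve_port`, could look like this:

```python
import os

DEFAULT_PORT = 7860  # port used in the Deployment section


def resolve_port(environ=os.environ) -> int:
    """Return the port from the PORT variable, or the default if unset or invalid."""
    raw = environ.get("PORT", "")
    try:
        port = int(raw)
    except ValueError:
        return DEFAULT_PORT
    return port if 0 < port < 65536 else DEFAULT_PORT
```

The Flask entry point could then call `app.run(host="0.0.0.0", port=resolve_port())`.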
## Roadmap

- [ ] Streaming ASR/TTS
- [ ] Batch processing
- [ ] WebSockets for streaming
- [ ] Redis caching
- [ ] Database logging
- [ ] Rate limiting
- [ ] Authentication/API keys
- [ ] Metrics (Prometheus)
- [ ] Web UI (Gradio/Streamlit)
- [ ] More languages
- [ ] Emotion synthesis
- [ ] Custom voices
## References

- [Meta MMS Paper](https://arxiv.org/abs/2305.13516)
- [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all)
- [facebook/mms-tts](https://huggingface.co/facebook/mms-tts)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Flask Documentation](https://flask.palletsprojects.com/)