# Architecture and Technical Documentation

## 🏗️ API Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      Client/Frontend                        │
│  (Web, Mobile, CLI, Python Client, cURL, etc.)             │
└────────────────────┬────────────────────────────────────────┘
                     │ HTTP/REST
┌────────────────────▼────────────────────────────────────────┐
│                    Flask API Server                         │
│  - Health Check         GET /health                         │
│  - Documentation        GET /                               │
│  - Languages            GET /supported-languages            │
│  - ASR (Audio→Text)     POST /asr                           │
│  - TTS (Text→Audio)     POST /tts                           │
└────────────┬──────────────────────────────┬─────────────────┘
             │                              │
    ┌────────▼──────────┐        ┌──────────▼──────────┐
    │  ASR Pipeline     │        │  TTS Pipeline       │
    │                   │        │                     │
    │ 1. Load Audio     │        │ 1. Validate Text    │
    │ 2. Process        │        │ 2. Load Model       │
    │ 3. Tokenize       │        │ 3. Tokenize         │
    │ 4. Infer w/ MMS   │        │ 4. Infer (VITS)     │
    │ 5. Decode         │        │ 5. Generate WAV     │
    └────────┬──────────┘        └──────────┬──────────┘
             │                              │
    ┌────────▼──────────────────────────────▼──────────┐
    │         Model Cache & Management                 │
    │  - facebook/mms-1b-all (ASR)                     │
    │  - facebook/mms-tts-* (8 languages)              │
    │  - Thread-safe loading                           │
    │  - Lazy initialization                           │
    └────────┬─────────────────────────────────────────┘
             │
    ┌────────▼─────────────────────────────────────────┐
    │   PyTorch / GPU Support                          │
    │  - Automatic GPU/CPU detection                   │
    │  - Device management                             │
    └──────────────────────────────────────────────────┘
```

## 📊 Request Flow

### ASR (Automatic Speech Recognition)

```
Audio File
    ↓
[Validation] → Error if invalid
    ↓
[Load & Resample] → Convert to 16kHz mono
    ↓
[Normalize] → [-1, 1] range
    ↓
[Truncate] → Max 30 seconds
    ↓
[Tokenize] → Convert to features
    ↓
[Infer] → facebook/mms-1b-all (GPU/CPU)
    ↓
[Decode] → Text output
    ↓
JSON Response
```
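
The normalize and truncate steps above can be sketched in plain Python. This is a minimal illustration rather than the application's actual code: it assumes peak normalization (the flow only states the target range), and the function names `normalize`, `truncate`, and `preprocess` are hypothetical. A real pipeline would more likely operate on NumPy arrays or tensors.

```python
SAMPLE_RATE = 16000      # Hz, from the Configuration section
MAX_AUDIO_LENGTH = 30    # seconds, from the Configuration section


def normalize(samples):
    """Peak-normalize samples into the [-1, 1] range (assumed strategy)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    return [s / peak for s in samples]


def truncate(samples, sample_rate=SAMPLE_RATE, max_seconds=MAX_AUDIO_LENGTH):
    """Keep at most max_seconds of audio."""
    return samples[: sample_rate * max_seconds]


def preprocess(samples):
    """Normalize, then truncate, in the order shown in the flow."""
    return truncate(normalize(samples))
```

Truncating after normalization follows the order of the flow; truncating first would avoid scaling samples that are then discarded.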

### TTS (Text-to-Speech)

```
Text + Language
    ↓
[Validation] → Error if empty/too long
    ↓
[Load Model] → facebook/mms-tts-{lang}
    ↓
[Tokenize] → Convert text to token IDs
    ↓
[Infer] → VITS model (GPU/CPU)
    ↓
[Generate WAV] → Audio synthesis (16000 Hz, per model.config.sampling_rate)
    ↓
WAV File (audio/wav)
```
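
The final "Generate WAV" step can be illustrated with the standard-library `wave` module. This is a sketch under assumptions: the helper name `floats_to_wav_bytes` is hypothetical, the output is assumed to be mono 16-bit PCM, and the sample rate is passed in by the caller rather than hard-coded.

```python
import io
import struct
import wave


def floats_to_wav_bytes(samples, sample_rate):
    """Encode float samples in [-1, 1] as an in-memory mono 16-bit PCM WAV."""
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))               # clip to the valid range
        pcm += struct.pack("<h", int(s * 32767))  # little-endian int16
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()
```

The returned bytes could then be sent as the response body with the `audio/wav` content type shown in the flow.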

## 🧠 Models Used

### ASR: facebook/mms-1b-all

- **Architecture**: wav2vec 2.0
- **Size**: 964.8M parameters
- **Languages**: 1,000+ (ISO 639-3 codes)
- **Input**: 16 kHz mono audio
- **Output**: Text transcription
- **Training**: Self-supervised wav2vec 2.0 pretraining + multilingual fine-tuning

### TTS: facebook/mms-tts-{language}

- **Architecture**: VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech)
- **Size**: ~36M parameters per model
- **Languages**: 8 (see supported languages)
- **Input**: Text (max 1000 chars)
- **Output**: 16 kHz waveform (see `model.config.sampling_rate`)
- **Training**: Multilingual dataset + data augmentation

## 🔧 Configuration

```python
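import threading

import torch

SAMPLE_RATE = 16000           # ASR sampling rate (Hz)
MAX_AUDIO_LENGTH = 30         # Maximum audio length (seconds)
MAX_TEXT_LENGTH = 1000        # Maximum text length (characters)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"  # GPU if available
MODEL_CACHE = {}              # Shared model cache
MODEL_CACHE_LOCK = threading.Lock()  # Guards MODEL_CACHE (assumed name)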
```

## 📈 Performance

| Metric | Value |
|--------|-------|
| First ASR request | 2-5 min (model loading) |
| Subsequent ASR requests | 1-10 s (10 s of audio) |
| First TTS request | 30-60 s (model loading) |
| Subsequent TTS requests | 1-5 s (100 chars) |
| GPU memory | ~2 GB (ASR) + 1 GB (TTS) |
| RAM | ~1 GB cache |

## 🔐 Security

### Input Validation
- ✅ Audio file type check
- ✅ Audio length limit (30 s)
- ✅ Text length limit (1000 chars)
- ✅ Non-empty content check
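
A minimal sketch of the text and file-type checks listed above. The helper names and the extension whitelist are assumptions made for illustration; the source does not specify which audio formats are accepted or how errors are reported.

```python
import os

MAX_TEXT_LENGTH = 1000  # characters, from the Configuration section
# Assumed whitelist; the accepted formats are not listed in this document.
ALLOWED_AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg"}


def validate_text(text):
    """Return an error message, or None if the text is acceptable."""
    if not isinstance(text, str) or not text.strip():
        return "Text must be a non-empty string"
    if len(text) > MAX_TEXT_LENGTH:
        return f"Text exceeds {MAX_TEXT_LENGTH} characters"
    return None


def validate_audio_filename(filename):
    """Return an error message, or None if the file extension is allowed."""
    if not filename:
        return "No audio file provided"
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_AUDIO_EXTENSIONS:
        return f"Unsupported audio type: {ext or 'none'}"
    return None
```

Returning a message instead of raising lets a route handler turn the result directly into a 400 response. Checking the extension alone does not prove the content is audio, so decoding errors still need to be handled downstream.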

### Rate Limiting (To be added)
```python
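from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
# Flask-Limiter 3.x takes key_func first; requests are keyed by client IP
limiter = Limiter(get_remote_address, app=app)

@app.route('/tts', methods=['POST'])
@limiter.limit("10/minute")  # at most 10 TTS requests per minute per client
def tts():
    ...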
```

### Authentication (To be added)
```python
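from functools import wraps

from flask import request

def require_token(f):
    """Reject requests whose Authorization header fails validation."""
    @wraps(f)
    def decorated(*args, **kwargs):
        token = request.headers.get('Authorization')
        # validate_token is a placeholder the application must provide
        if not validate_token(token):
            return {'error': 'Unauthorized'}, 401
        return f(*args, **kwargs)
    return decorated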
```
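
The decorator relies on a `validate_token` helper that this document does not define. One possible shape is sketched below, assuming a single shared secret sent as `Authorization: Bearer <token>` and stored in an environment variable; the variable name `API_TOKEN` and the `expected` parameter are hypothetical.

```python
import hmac
import os


def validate_token(header_value, expected=None):
    """Return True if the Authorization header carries the expected bearer token."""
    if expected is None:
        expected = os.environ.get("API_TOKEN", "")  # hypothetical variable name
    if not header_value or not expected:
        return False  # missing header, or no secret configured
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents through timing
    return hmac.compare_digest(token.encode(), expected.encode())
```

A production setup would more likely use per-client API keys or signed tokens; a single shared secret is the simplest case.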

## 🚀 Optimizations

### Model Cache
- Models are loaded only once
- Shared across all requests
- Thread-safe via locks
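
The load-once, lock-protected behaviour described above can be sketched with a dictionary and a `threading.Lock`. The names `get_model` and `loader` are illustrative assumptions; in the real service the loader would presumably wrap a Transformers `from_pretrained` call.

```python
import threading

_cache = {}
_lock = threading.Lock()


def get_model(name, loader):
    """Return the cached model for name, loading it once on first use."""
    model = _cache.get(name)
    if model is None:
        with _lock:
            # Re-check inside the lock: another thread may have loaded it
            model = _cache.get(name)
            if model is None:
                model = loader(name)
                _cache[name] = model
    return model
```

The check outside the lock keeps the common cached path free of contention, and the second check inside the lock stops two concurrent first requests from loading the same model twice.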

### GPU Acceleration
- Automatic GPU detection
- Inference on GPU when available
- Automatic CPU fallback
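
Device selection with CPU fallback might look like the sketch below. `torch.cuda.is_available()` is the standard PyTorch check; the function name `pick_device` and the guarded import, which lets the snippet run on machines without PyTorch, are choices made for this illustration.

```python
def pick_device():
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch missing: nothing to run on a GPU anyway
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Models and input tensors would then be moved to the chosen device with `.to(pick_device())` before inference.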

### Memory Management
- Gradients disabled for inference
- Models in eval mode
- Audio and text inputs truncated to the configured limits

## 📦 Deployment

### Local Development
```bash
python app_v2.py
# Runs on http://localhost:7860
```

### Docker
```bash
docker build -t mms-api .
docker run -p 7860:7860 mms-api
```

### Docker Compose (with GPU)
```bash
docker-compose up
```

### Hugging Face Spaces
- Create a Docker Space
- Push the code to HF
- Automatic build and deployment
- URL: https://huggingface.co/spaces/{user}/{space}

## 📡 API Endpoints

### GET /
Documentation and metadata

### GET /health
Service status and device info

### GET /supported-languages
Languages supported for ASR/TTS

### GET /models-info
Detailed information about the models

### POST /asr
Audio transcription
- **Input**: multipart/form-data (audio + language)
- **Output**: JSON (transcription + metadata)

### POST /tts
Speech synthesis
- **Input**: JSON (text + language)
- **Output**: WAV audio file
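
As an illustration of calling `POST /tts` from Python, the sketch below builds the request with the standard library only. The JSON field names `text` and `language` follow the input description above; the helper name, the default base URL (the local development address), and the example language code are assumptions.

```python
import json
import urllib.request


def build_tts_request(text, language, base_url="http://localhost:7860"):
    """Build (but do not send) a POST /tts request with a JSON body."""
    body = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_tts_request("Bonjour", "fra")  # "fra" is an assumed language code
```

Sending it with `urllib.request.urlopen(req)` and writing `response.read()` to a `.wav` file would save the synthesized audio, provided the server is running.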

## 🐛 Debugging

### Logs
```bash
# Local
python app_v2.py
# Logs are printed to stdout

# Docker
docker logs <container_id>

# HF Spaces
# See the "Logs" tab in the Space
```

### Common Issues

**Issue**: Model not found
**Solution**: Wait for the models to finish downloading (5-10 min)

**Issue**: CUDA out of memory
**Solution**: Reduce MAX_AUDIO_LENGTH or run on CPU

**Issue**: Port already in use
**Solution**: `PORT=8080 python app_v2.py`
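
The `PORT=8080` override implies that the application reads its port from the environment. A minimal sketch of that lookup, assuming the variable is named `PORT` as in the command above and defaults to 7860 as in the deployment examples (the helper name is hypothetical):

```python
import os

DEFAULT_PORT = 7860  # port used in the deployment examples


def get_port(environ=os.environ):
    """Return the port from the PORT environment variable, or the default."""
    return int(environ.get("PORT", DEFAULT_PORT))
```

The Flask entry point could then call `app.run(host="0.0.0.0", port=get_port())`.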

## 🔮 Roadmap

- [ ] Streaming ASR/TTS
- [ ] Batch processing
- [ ] WebSockets for streaming
- [ ] Redis caching
- [ ] Database logging
- [ ] Rate limiting
- [ ] Authentication/API keys
- [ ] Metrics (Prometheus)
- [ ] Web UI (Gradio/Streamlit)
- [ ] More languages
- [ ] Emotion synthesis
- [ ] Custom voices

## 📚 References

- [Meta MMS Paper](https://arxiv.org/abs/2305.13516)
- [facebook/mms-1b-all](https://huggingface.co/facebook/mms-1b-all)
- [facebook/mms-tts](https://huggingface.co/facebook/mms-tts)
- [Transformers Documentation](https://huggingface.co/docs/transformers)
- [Flask Documentation](https://flask.palletsprojects.com/)