Dama03 commited on
Commit
7b2fa2c
·
1 Parent(s): 8b16f1e

add the necessary code, without voice

Browse files
.gitignore ADDED
@@ -0,0 +1,5 @@
+ .env
+ text
+ __pycache__
+
+
__init__.py ADDED
@@ -0,0 +1 @@
+ # This file makes the ia directory a Python package
sentiment_space/README.fr.md ADDED
@@ -0,0 +1,312 @@
+ # Multilingual Sentiment Analysis Module (Bassa, French, English)
+
+ This document provides the instructions needed to integrate and use the sentiment analysis module. The system is designed to run 100% offline, without depending on external APIs.
+
+ ## 1. Architecture Diagram
+
+ The system is made up of three main parts: a React.js frontend, a Node.js backend, and a standalone Python AI service.
+
+ ```
+ +-----------------------+  (JSON via HTTP POST)  +-------------------------+
+ |                       | <--------------------> |                         |
+ |  Frontend (React.js)  |      /api/analyze      |   Backend (Node.js)     |
+ |                       |                        |   - Express/Fastify     |
+ |  - Records/enters     |                        |   - Route handling      |
+ |    audio or text      |                        |   - Business logic      |
+ |  - Sends as Base64    |                        |                         |
+ +-----------------------+                        +------------+------------+
+                                                              | (HTTP call)
+                                                              |
+                              +-------------------------------v---------------------+
+                              |                                                     |
+                              |              AI Service (Python)                    |
+                              |              - FastAPI                              |
+                              |              - bassa_analyzer.py                    |
+                              |              - AI models (loaded locally)           |
+                              |                - Whisper (transcription)            |
+                              |                - XLM-Roberta (sentiment analysis)   |
+                              |                - Bassa references (audio embeddings)|
+                              |                                                     |
+                              +-----------------------------------------------------+
+ ```
+
+ ## 2. AI Service (Python)
+
+ The core of the analysis lives in `bassa_analyzer.py` and is exposed through the FastAPI application in `app.py`.
+
+ ### Installation
+
+ Make sure your virtual environment is activated, then install the required dependencies:
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ The dependency list includes `torch`, `transformers`, `soundfile`, `pandas`, `scikit-learn`, `langdetect`, `fastapi`, `uvicorn`, and `torchaudio`.
+
+ ### Starting the AI Service
+
+ From the `ia` directory, start the service with Uvicorn:
+
+ ```bash
+ uvicorn app:app --host 0.0.0.0 --port 8000
+ ```
+
+ - `--host 0.0.0.0`: makes the service reachable on your local network (required so the Node.js backend can connect to it).
+ - `--port 8000`: the listening port. Change it if needed.
+
+ The AI service is now running. You can access the interactive API documentation (generated automatically) at [http://127.0.0.1:8000/docs](http://127.0.0.1:8000/docs).
+
+ ### Health Endpoint
+
+ To check that the AI service is up, use:
+
+ ```
+ GET /health
+ ```
+ Response:
+ ```json
+ {"status": "ok", "message": "API opérationnelle"}
+ ```
+
+ ### Using the AI Service
+
+ The service exposes one main endpoint:
+
+ ```
+ POST /analyze
+ ```
+
+ #### Request Format
+
+ ```json
+ {
+   "input_type": "audio" | "text",
+   "content": "..."  // Base64 for audio, raw text otherwise
+ }
+ ```
+
+ #### Response Format
+
+ ```json
+ {
+   "langue_detectee": "fr",                          // "bss", "fr", or "en"
+   "texte_original": "Ce service est vraiment exceptionnel, je suis ravi.",
+   "traduction_anglaise": "[Translation to be implemented]",
+   "sentiment": "positif",                           // "positif", "neutre", or "negatif"
+   "confiance": 0.9999,
+   "methode_utilisee": "nlptown_bert_multilingual",  // or "base_reference"
+   "processing_time": 1.23                           // processing time in seconds
+ }
+ ```
+
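For quick manual testing, the request body documented above can be built in a few lines of Python. This is a sketch, independent of the service itself; the helper name `build_analyze_payload` is ours, not part of the project:

```python
import base64
import json

def build_analyze_payload(input_type: str, content) -> dict:
    """Build the JSON body expected by POST /analyze.

    For "audio", `content` is raw audio bytes and gets Base64-encoded;
    for "text", `content` is the raw string.
    """
    if input_type == "audio":
        body = base64.b64encode(content).decode("ascii")
    elif input_type == "text":
        body = content
    else:
        raise ValueError("input_type must be 'audio' or 'text'")
    return {"input_type": input_type, "content": body}

# Text request
payload = build_analyze_payload("text", "Ce service est vraiment exceptionnel.")
print(json.dumps(payload, ensure_ascii=False))

# Audio request: the Base64 string round-trips back to the original bytes
audio_payload = build_analyze_payload("audio", b"RIFF....WAVE")
assert base64.b64decode(audio_payload["content"]) == b"RIFF....WAVE"
```

The resulting dict can be posted to `/analyze` with any HTTP client.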
+ ## 3. Backend Integration (Node.js)
+
+ The Node.js backend communicates with the Python AI service over HTTP.
+
+ ### Example Call from the Node.js Backend
+
+ Here is how the Node.js backend service can call the Python AI service using `node-fetch` (or another HTTP client such as `axios`).
+
+ ```javascript
+ const fetch = require('node-fetch');
+ const AI_SERVICE_URL = 'http://127.0.0.1:8000/analyze';
+
+ // Example function in your Node.js service
+ async function analyserContenu(inputType, content) {
+   try {
+     console.log(`Sending request to the AI service: ${AI_SERVICE_URL}`);
+
+     const response = await fetch(AI_SERVICE_URL, {
+       method: 'POST',
+       headers: {
+         'Content-Type': 'application/json',
+         'Accept': 'application/json'
+       },
+       body: JSON.stringify({
+         input_type: inputType,
+         content: content
+       })
+     });
+
+     if (!response.ok) {
+       throw new Error(`HTTP error: ${response.status}`);
+     }
+
+     const result = await response.json();
+     console.log('Analysis result:', result);
+     return result;
+
+   } catch (error) {
+     console.error('Error while calling the AI service:', error);
+     throw error;
+   }
+ }
+
+ // Example usage
+ app.post('/api/analyze', async (req, res) => {
+   try {
+     const { input_type, content } = req.body;
+     const result = await analyserContenu(input_type, content);
+     res.json(result);
+   } catch (error) {
+     res.status(500).json({ error: 'Error during analysis' });
+   }
+ });
+ ```
+
+ ### Expected Request Format
+
+ ```json
+ {
+   "input_type": "text",  // or "audio"
+   "content": "..."       // raw text, or audio encoded in Base64
+ }
+ ```
+
+ ### Example JSON Response
+
+ ```json
+ {
+   "sentiment": "negatif",
+   "score_sentiment": 0.87,
+   "theme": "fatigue",
+   "langue_detectee": "fr",
+   "texte_original": "Je me sens très fatigué depuis quelques jours."
+ }
+ ```
+
+ - `sentiment`: positif, neutre, or negatif
+ - `score_sentiment`: the model's confidence score (0 to 1)
+ - `theme`: detected medical theme (in French)
+ - `langue_detectee`: detected language (fr, en, bss)
+ - `texte_original`: analyzed text or transcription
+
+ Your Node.js backend can then relay this result to the admin or to the frontend.
+
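Before relaying a result, the field list above can be checked mechanically. A minimal sketch; the helper name `validate_analysis_result` is ours, not part of the project:

```python
def validate_analysis_result(result: dict) -> dict:
    """Check that an /analyze result carries the documented fields
    with sane values before relaying it further."""
    required = {"sentiment", "score_sentiment", "theme",
                "langue_detectee", "texte_original"}
    missing = required - result.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if result["sentiment"] not in {"positif", "neutre", "negatif"}:
        raise ValueError("unexpected sentiment label")
    if not 0.0 <= result["score_sentiment"] <= 1.0:
        raise ValueError("score_sentiment out of [0, 1]")
    if result["langue_detectee"] not in {"fr", "en", "bss"}:
        raise ValueError("unexpected language code")
    return result

example = {
    "sentiment": "negatif",
    "score_sentiment": 0.87,
    "theme": "fatigue",
    "langue_detectee": "fr",
    "texte_original": "Je me sens très fatigué depuis quelques jours.",
}
validate_analysis_result(example)  # passes silently
```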
+ ## 4. Frontend Integration (React.js)
+
+ The frontend communicates with the Node.js backend through an API endpoint (for example, `/api/analyze`).
+
+ ### Sending an Audio File
+
+ The user records or selects an audio file. The file (Blob) must be converted to a Base64 string before being sent.
+
+ ```typescript
+ // Convert a Blob to Base64
+ const convertBlobToBase64 = (blob: Blob): Promise<string> => {
+   return new Promise((resolve, reject) => {
+     const reader = new FileReader();
+     reader.onerror = reject;
+     reader.onload = () => {
+       // Return only the Base64 part of the data URL
+       const dataUrl = reader.result as string;
+       const base64 = dataUrl.split(',')[1];
+       resolve(base64);
+     };
+     reader.readAsDataURL(blob);
+   });
+ };
+
+ // Call the API
+ const analyserAudio = async (audioBlob: Blob) => {
+   try {
+     const base64Audio = await convertBlobToBase64(audioBlob);
+
+     const response = await fetch('/api/analyze', { // Your Node.js endpoint
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({
+         input_type: "audio",
+         content: base64Audio
+       }),
+     });
+
+     if (!response.ok) {
+       throw new Error(`HTTP error: ${response.status}`);
+     }
+
+     return await response.json();
+   } catch (error) {
+     console.error("Error during audio analysis:", error);
+     return { error: "Unable to analyze the audio." };
+   }
+ };
+ ```
+
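The same Base64 conversion can be reproduced server-side to exercise the API without a browser. A sketch under the assumption that the service expects 16 kHz mono audio, as described earlier; the in-memory WAV stands in for a real recording:

```python
import base64
import io
import wave

# Generate a tiny silent WAV in memory (stands in for a recorded file)
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # 16 kHz, as expected by the service
    w.writeframes(b"\x00\x00" * 16000)  # one second of silence
wav_bytes = buf.getvalue()

# Same payload shape the frontend produces from a Blob
payload = {
    "input_type": "audio",
    "content": base64.b64encode(wav_bytes).decode("ascii"),
}
# The Base64 string decodes back to the exact original bytes
assert base64.b64decode(payload["content"]) == wav_bytes
```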
+ ### Sending Text
+
+ Sending text is more direct.
+
+ ```typescript
+ const analyserTexte = async (texte: string) => {
+   try {
+     const response = await fetch('/api/analyze', { // Your Node.js endpoint
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({
+         input_type: "text",
+         content: texte
+       }),
+     });
+
+     if (!response.ok) {
+       throw new Error(`HTTP error: ${response.status}`);
+     }
+
+     return await response.json();
+   } catch (error) {
+     console.error("Error during text analysis:", error);
+     return { error: "Unable to analyze the text." };
+   }
+ };
+ ```
+
+ ## 5. Integration Architecture
+
+ This decoupled architecture is more robust and easier to maintain and scale:
+
+ 1. **React -> Node.js**: the frontend sends the request (audio/text) to your main Node.js backend.
+ 2. **Node.js -> Python**: the Node.js backend relays the request to the Python AI service.
+ 3. **Python -> Node.js**: the AI service returns the JSON analysis result.
+ 4. **Node.js -> React**: the Node.js backend sends the final response back to the frontend.
+
+ ## 6. Tips for Speed and Robustness
+
+ - **The models are loaded only once, at AI service startup**, to guarantee fast responses.
+
sentiment_space/app.py ADDED
@@ -0,0 +1,59 @@
+ import uvicorn
+ from fastapi import FastAPI, HTTPException
+ from pydantic import BaseModel
+ from bassa_analyzer import analyze, load_models_and_references
+ import time
+
+ # --- API Setup ---
+ app = FastAPI(
+     title="Sentiment Analysis API",
+     description="API rapide pour l'analyse de sentiment à partir de texte ou d'audio (Bassa, Français, Anglais).",
+     version="1.1.0"
+ )
+
+ # --- Load models at startup for speed ---
+ @app.on_event("startup")
+ def startup_event():
+     print("[API] Chargement des modèles IA...")
+     load_models_and_references(offline=True)
+     print("[API] Modèles chargés. API prête !")
+
+ # --- Request & Response Models ---
+ class AnalysisRequest(BaseModel):
+     input_type: str
+     content: str
+
+ class AnalysisResponse(BaseModel):
+     sentiment: str
+     score_sentiment: float
+     theme: str
+     langue_detectee: str
+     texte_original: str
+
+ # --- Health Check Endpoint ---
+ @app.get("/health")
+ def health():
+     return {"status": "ok", "message": "API opérationnelle"}
+
+ # --- API Endpoint ---
+ @app.post("/analyze", response_model=AnalysisResponse)
+ def handle_analysis(request: AnalysisRequest):
+     """
+     Receives an analysis request, processes it with bassa_analyzer,
+     and returns the result.
+     """
+     start = time.time()
+     try:
+         result = analyze(input_type=request.input_type, content=request.content)
+         if "error" in result:
+             raise HTTPException(status_code=400, detail=result["error"])
+         return result
+     except HTTPException:
+         # Re-raise as-is so the 400 above is not swallowed by the 500 handler
+         raise
+     except ValueError as e:
+         raise HTTPException(status_code=422, detail=str(e))
+     except Exception as e:
+         raise HTTPException(status_code=500, detail=f"Erreur interne: {e}")
+
+ # --- How to Run ---
+ # uvicorn app:app --host 0.0.0.0 --port 8000
+ if __name__ == "__main__":
+     uvicorn.run(app, host="127.0.0.1", port=8000)
sentiment_space/auth.py ADDED
@@ -0,0 +1,26 @@
+ import os
+ from huggingface_hub import login, HfFolder
+ from getpass import getpass
+ from dotenv import load_dotenv
+
+ load_dotenv()  # Loads from .env
+
+ def configure_auth():
+     """Safe token handling"""
+     # 1. Check for existing token
+     token = os.getenv("HF_TOKEN") or HfFolder.get_token()
+
+     # 2. Prompt if no token found
+     if not token:
+         print("Enter Hugging Face token (will be hidden):")
+         token = getpass("> ")
+
+     # Verify token format
+     if not token.startswith("hf_"):
+         raise ValueError("Invalid token format. Must start with 'hf_'")
+
+     # 3. Securely store for session
+     os.environ["HF_TOKEN"] = token
+     login(token=token)
+     print("✅ Authentication successful - token stored in memory")
+
+ if __name__ == "__main__":
+     configure_auth()
sentiment_space/bassa_analyzer.py ADDED
@@ -0,0 +1,533 @@
1
+ import base64
2
+ import io
3
+ import os
4
+ import torch
5
+ import soundfile as sf
6
+ import pandas as pd
7
+ import numpy as np
8
+ from transformers import (
9
+ pipeline,
10
+ WhisperForConditionalGeneration,
11
+ WhisperProcessor,
12
+ AutoTokenizer,
13
+ AutoModelForSequenceClassification,
14
+ AutoModelForSeq2SeqLM
15
+ )
16
+ from langdetect import detect, detect_langs, DetectorFactory
17
+ from sklearn.metrics.pairwise import cosine_similarity
18
+ import traceback
19
+
20
+
21
+ # --- Configuration ---
22
+ DATASETS_DIR = os.path.dirname(__file__)
23
+ REFERENCE_CSV_PATH = os.path.join(DATASETS_DIR, 'reference.csv')
24
+ SIMILARITY_THRESHOLD = 0.85 # Minimum similarity score to trust the reference-based analysis
25
+
26
+ # --- Global Variables for Models ---
27
+ whisper_model = None
28
+ whisper_processor = None
29
+ sentiment_pipeline_model = None
30
+ reference_embeddings = []
31
+ translator_model = None
32
+ translator_tokenizer = None
33
+ langid_model = None # SpeechBrain language ID model
34
+
35
+ def get_audio_embedding(audio_data, processor, model):
36
+ """Generates a feature embedding from audio data using the Whisper encoder."""
37
+ input_features = processor(audio_data, sampling_rate=16000, return_tensors="pt").input_features
38
+ # Use the encoder to get a representative embedding
39
+ with torch.no_grad():
40
+ embedding = model.get_encoder()(input_features).last_hidden_state.mean(dim=1)
41
+ return embedding.cpu().numpy()
42
+
43
+ def load_models_and_references(offline=False):
44
+ """Loads all models and reference data into memory. Called once on startup."""
45
+ global whisper_model, whisper_processor, sentiment_pipeline_model, reference_embeddings, translator_model, translator_tokenizer, langid_model
46
+
47
+ print(f"Initializing models in {'OFFLINE' if offline else 'ONLINE'} mode...")
48
+
49
+ # 0. Set device
50
+ device = "cuda" if torch.cuda.is_available() else "cpu"
51
+ print(f"Device set to use {device}")
52
+
53
+ # 1. Load Whisper model (for audio transcription and embeddings)
54
+ whisper_model_name = "openai/whisper-medium" # switched to medium for efficiency
55
+ try:
56
+ whisper_processor = WhisperProcessor.from_pretrained(whisper_model_name, local_files_only=offline)
57
+ whisper_model = WhisperForConditionalGeneration.from_pretrained(
58
+ whisper_model_name,
59
+ local_files_only=offline
60
+ ).to(device)
61
+ # Freeze model parameters to save memory
62
+ whisper_model.eval()
63
+ for param in whisper_model.parameters():
64
+ param.requires_grad = False
65
+ except Exception as e:
66
+ print(f"Error loading Whisper model: {e}")
67
+ if offline:
68
+ print("Hint: Run this script once with an internet connection to download the necessary models.")
69
+ raise
70
+
71
+ # 2. Load Sentiment Analysis Model
72
+ sentiment_model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
73
+ try:
74
+ sentiment_tokenizer = AutoTokenizer.from_pretrained(sentiment_model_name, local_files_only=offline)
75
+ sentiment_model = AutoModelForSequenceClassification.from_pretrained(
76
+ sentiment_model_name,
77
+ local_files_only=offline
78
+ ).to(device)
79
+ sentiment_pipeline_model = pipeline(
80
+ "sentiment-analysis",
81
+ model=sentiment_model,
82
+ tokenizer=sentiment_tokenizer,
83
+ device=0 if device == "cuda" else -1
84
+ )
85
+ except Exception as e:
86
+ print(f"Error loading sentiment pipeline: {e}")
87
+ if offline:
88
+ print("Hint: Run this script once with an internet connection to download the necessary models.")
89
+ raise
90
+
91
+ # 3. Load Translation Model (French to English)
92
+ try:
93
+ translator_model_name = "Helsinki-NLP/opus-mt-fr-en"
94
+ translator_tokenizer = AutoTokenizer.from_pretrained(
95
+ translator_model_name,
96
+ local_files_only=offline
97
+ )
98
+ translator_model = AutoModelForSeq2SeqLM.from_pretrained(
99
+ translator_model_name,
100
+ local_files_only=offline
101
+ ).to(device)
102
+ translator_model.eval()
103
+ except Exception as e:
104
+ print(f"Warning: Could not load translation model. French-to-English translation will be unavailable. Error: {e}")
105
+ translator_model = None
106
+ translator_tokenizer = None
107
+
108
+ # 4. Load and process reference audios
109
+ try:
110
+ df = pd.read_csv(REFERENCE_CSV_PATH)
111
+ reference_embeddings = []
112
+
113
+ for index, row in df.iterrows():
114
+ audio_path = os.path.join(DATASETS_DIR, row['audio_path'])
115
+ try:
116
+ # Load audio file
117
+ audio_data, orig_sr = sf.read(audio_path)
118
+
119
+ # Apply the same preprocessing as input audio: mono, 16kHz, noise-reduced, trimmed
120
+ audio_data = preprocess_audio(audio_data, orig_sr)
121
+
122
+ # Get audio embedding using the same preprocessing as input
123
+ with torch.no_grad():
124
+ input_features = whisper_processor(
125
+ audio_data,
126
+ sampling_rate=16000,
127
+ return_tensors="pt"
128
+ ).input_features.to(whisper_model.device)
129
+ embedding = whisper_model.get_encoder()(input_features).last_hidden_state.mean(dim=1).cpu().numpy()
130
+
131
+ # Store reference data
132
+ reference_embeddings.append({
133
+ 'sentiment': row['sentiment'],
134
+ 'embedding': embedding,
135
+ 'audio_path': row['audio_path']
136
+ })
137
+
138
+ except Exception as e:
139
+ print(f"Warning: Could not process reference audio {row['audio_path']}: {e}")
140
+
141
+ print(f"Loaded {len(reference_embeddings)} reference audio embeddings.")
142
+
143
+ except Exception as e:
144
+ print(f"Error loading reference audios: {e}")
145
+ if not os.path.exists(REFERENCE_CSV_PATH):
146
+ print(f"Error: Reference CSV file not found at {REFERENCE_CSV_PATH}")
147
+ reference_embeddings = []
148
+
149
+ def preprocess_audio(audio_data, orig_sr):
150
+ """Ensure audio is mono, 16kHz, normalized, noise-reduced, and trimmed of silence."""
151
+ import torchaudio.transforms as T
152
+ import torch
153
+ import numpy as np
154
+
155
+ # Convert to mono if needed
156
+ if len(audio_data.shape) > 1:
157
+ audio_data = audio_data.mean(axis=0)
158
+
159
+ # Convert numpy to torch tensor if needed
160
+ if not isinstance(audio_data, torch.Tensor):
161
+ audio_data = torch.tensor(audio_data, dtype=torch.float32)
162
+
163
+ # Resample if needed
164
+ if orig_sr != 16000:
165
+ resampler = T.Resample(orig_sr, 16000)
166
+ audio_data = resampler(audio_data)
167
+
168
+ # Normalize
169
+ audio_data = audio_data / (audio_data.abs().max() + 1e-8)
170
+
171
+ # Simple noise reduction using spectral gating
172
+ try:
173
+ # Convert to frequency domain
174
+ stft = torch.stft(audio_data, n_fft=1024, hop_length=256, return_complex=True)
175
+
176
+ # Calculate noise floor (using first 0.5 seconds as noise estimate)
177
+ noise_samples = min(int(0.5 * 16000), len(audio_data))
178
+ noise_stft = torch.stft(audio_data[:noise_samples], n_fft=1024, hop_length=256, return_complex=True)
179
+ noise_floor = torch.mean(torch.abs(noise_stft), dim=1, keepdim=True)
180
+
181
+ # Apply spectral gating
182
+ magnitude = torch.abs(stft)
183
+ phase = torch.angle(stft)
184
+
185
+ # Gate threshold (adjust this value for more/less aggressive noise reduction)
186
+ gate_threshold = 2.0 * noise_floor
187
+
188
+ # Apply gating
189
+ gated_magnitude = torch.where(magnitude > gate_threshold, magnitude, magnitude * 0.1)
190
+
191
+ # Reconstruct signal
192
+ stft_cleaned = gated_magnitude * torch.exp(1j * phase)
193
+ audio_data = torch.istft(stft_cleaned, n_fft=1024, hop_length=256)
194
+
195
+ except Exception as e:
196
+ print(f"Warning: Noise reduction failed, using original audio: {e}")
197
+ # If noise reduction fails, continue with original audio
198
+
199
+ # Trim silence using VAD
200
+ try:
201
+ audio_data = T.Vad(sample_rate=16000)(audio_data.unsqueeze(0)).squeeze(0)
202
+ except Exception as e:
203
+ print(f"Warning: VAD failed, using original audio: {e}")
204
+
205
+ # Final normalization
206
+ audio_data = audio_data / (audio_data.abs().max() + 1e-8)
207
+
208
+ return audio_data.numpy()
209
+
210
+ def translate_fr_to_en(text):
211
+ """Translate French text to English using the loaded model."""
212
+ global translator_model, translator_tokenizer
213
+
214
+ if translator_model is None or translator_tokenizer is None:
215
+ return "[Translation not available]"
216
+
217
+ try:
218
+ # Tokenize the input text
219
+ inputs = translator_tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
220
+
221
+ # Move to the same device as the model
222
+ inputs = {k: v.to(translator_model.device) for k, v in inputs.items()}
223
+
224
+ # Generate translation
225
+ with torch.no_grad():
226
+ translated = translator_model.generate(**inputs)
227
+
228
+ # Decode and clean up the output
229
+ translation = translator_tokenizer.decode(translated[0], skip_special_tokens=True)
230
+ return translation
231
+ except Exception as e:
232
+ print(f"Translation error: {e}")
233
+ return f"[Translation error: {str(e)}]"
234
+
235
+ def validate_text_semantics(text, language):
236
+ """Check if text makes semantic sense for the given language."""
237
+ if not text or len(text.strip()) < 3:
238
+ return False
239
+
240
+ text_lower = text.lower().strip()
241
+
242
+ # Common meaningful words for each language
243
+ french_words = {
244
+ 'je', 'tu', 'il', 'elle', 'nous', 'vous', 'ils', 'elles', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont',
245
+ 'le', 'la', 'les', 'un', 'une', 'des', 'ce', 'cette', 'ces', 'mon', 'ma', 'mes', 'ton', 'ta', 'tes',
246
+ 'et', 'ou', 'mais', 'avec', 'sans', 'pour', 'dans', 'sur', 'sous', 'devant', 'derrière', 'entre',
247
+ 'bon', 'bonne', 'mauvais', 'mauvaise', 'grand', 'grande', 'petit', 'petite', 'ouveau', 'nouvelle',
248
+ 'content', 'heureux', 'triste', 'fatigué', 'malade', 'fort', 'faible', 'riche', 'pauvre',
249
+ 'maison', 'voiture', 'travail', 'famille', 'ami', 'temps', 'jour', 'nuit', 'matin', 'soir',
250
+ 'manger', 'boire', 'dormir', 'parler', 'écouter', 'voir', 'regarder', 'penser', 'avoir'
251
+ }
252
+
253
+ english_words = {
254
+ 'i', 'you', 'he', 'she', 'it', 'we', 'they', 'am', 'is', 'are', 'was', 'were', 'be', 'been',
255
+ 'the', 'a', 'an', 'this', 'that', 'these', 'those', 'my', 'your', 'his', 'her', 'their', 'our',
256
+ 'and', 'or', 'but', 'with', 'without', 'for', 'in', 'on', 'under', 'over', 'between', 'among',
257
+ 'good', 'bad', 'big', 'small', 'old', 'young', 'happy', 'sad', 'tired', 'sick', 'strong', 'weak',
258
+ 'house', 'car', 'work', 'family', 'friend', 'time', 'day', 'night', 'morning', 'evening',
259
+ 'eat', 'drink', 'sleep', 'talk', 'listen', 'see', 'watch', 'think', 'know'
260
+ }
261
+
262
+ # Count meaningful words
263
+ words = text_lower.split()
264
+ if language == 'fr':
265
+ meaningful_words = sum(1 for word in words if word in french_words)
266
+ else: # english
267
+ meaningful_words = sum(1 for word in words if word in english_words)
268
+
269
+ # Check if at least 30% of words are meaningful
270
+ if len(words) > 0:
271
+ meaningful_ratio = meaningful_words / len(words)
272
+ return meaningful_ratio >= 0.3
273
+ return False
274
+
275
+ def infer_medical_theme(text, sentiment):
276
+ """Infère un thème médical (en français) à partir du texte et du sentiment."""
277
+ text = text.lower() if text else ""
278
+ themes = [
279
+ ("douleur", ["douleur", "mal", "souffrance", "tête", "ventre", "blessure", "brûlure", "crampe", "migraine", "a mal", "j'ai mal", "je souffre", "courbature", "arthrose", "lombalgie", "cervicalgie", "fracture", "entorse"]),
280
+ ("anxiété", ["anxiété", "peur", "angoisse", "stress", "inquiet", "inquiétude", "panique", "crainte", "phobie", "appréhension"]),
281
+ ("satisfaction", ["satisfait", "content", "heureux", "bien", "amélioration", "guéri", "merci", "soulagement", "rassuré", "confiance", "espoir"]),
282
+ ("traitement", ["traitement", "médicament", "opération", "chirurgie", "soin", "injection", "piqûre", "prise en charge", "thérapie", "rééducation", "soins intensifs", "hospitalisation", "transfusion"]),
283
+ ("diagnostic", ["diagnostic", "maladie", "symptôme", "fièvre", "infection", "test", "examen", "analyse", "bilan", "scanner", "irm", "radio", "prise de sang"]),
284
+ ("fatigue", ["fatigue", "épuisé", "fatigué", "lassitude", "sommeil", "insomnie", "épuisement", "endormi", "sommeiller"]),
285
+ ("sommeil", ["sommeil", "dormir", "insomnie", "réveil", "cauchemar", "sommeiller", "nuit blanche"]),
286
+ ("appétit", ["appétit", "manger", "faim", "perte d'appétit", "anorexie", "boulimie", "nourriture", "aliment"]),
287
+ ("mobilité", ["marcher", "mobilité", "déplacement", "fauteuil", "béquille", "boiter", "paralysie", "immobilisation", "chute"]),
288
+ ("respiration", ["respirer", "respiration", "essoufflé", "asthme", "bronchite", "poumon", "dyspnée", "toux", "apnée"]),
289
+ ("humeur", ["humeur", "dépression", "triste", "déprimé", "moral", "colère", "irritable", "pleurer", "joie", "motivation"]),
290
+ ("isolement", ["seul", "isolement", "solitude", "abandonné", "délaissé", "rejeté"]),
291
+ ("famille", ["famille", "enfant", "fils", "fille", "mari", "épouse", "parents", "proche", "visite"]),
292
+ ("médicaments", ["médicament", "comprimé", "pilule", "ordonnance", "pharmacie", "dose", "posologie"]),
293
+ ("suivi", ["suivi", "contrôle", "consultation", "rendez-vous", "visite", "bilan", "monitoring"]),
294
+ ("guérison", ["guérison", "guéri", "rétabli", "rémission", "amélioration", "progression"]),
295
+ ("rechute", ["rechute", "récidive", "retour", "aggravation"]),
296
+ ("prévention", ["prévention", "vaccin", "vaccination", "protection", "dépistage"]),
297
+ ("urgence", ["urgence", "urgence médicale", "samu", "pompiers", "urgence vitale", "urgence absolue"]),
298
+ ]
299
+ for theme, keywords in themes:
300
+ for kw in keywords:
301
+ if kw in text:
302
+ return theme
303
+ if sentiment == "negatif":
304
+ return "douleur"
305
+ elif sentiment == "positif":
306
+ return "satisfaction"
307
+ elif sentiment == "neutre":
308
+ return "traitement"
309
+ else:
310
+ return "traitement"
311
+
312
+ def analyze(input_type: str, content: str) -> dict:
313
+ """
314
+ Analyzes sentiment from an audio or text input in Bassa, French, or English.
315
+ """
316
+ if not content:
317
+ raise ValueError("content cannot be empty")
318
+
319
+ text_original = ""
320
+ audio_embedding = None
321
+ methode_utilisee = "bert_fallback"
322
+ sentiment, confiance = None, 0.0
323
+ lang = None
324
+
325
+ # --- Step 1: Process Input ---
326
+ if input_type == "audio":
327
+ try:
328
+ audio_bytes = base64.b64decode(content)
329
+ audio_data, samplerate = sf.read(io.BytesIO(audio_bytes))
330
+ # Preprocess audio: mono, 16kHz, normalized, trimmed
331
+ audio_data = preprocess_audio(audio_data, samplerate)
332
+ samplerate = 16000
333
+
334
+ if whisper_processor is None or whisper_model is None:
335
+ return {"error": "Whisper model or processor is not loaded. Please check initialization."}
336
+
337
+ detected_lang = None
338
+ processed_input = whisper_processor(audio_data, sampling_rate=16000, return_tensors="pt")
339
+ input_features = processed_input.input_features.to(whisper_model.device)
340
+ attention_mask = processed_input.get("attention_mask")
341
+
342
+ # First, try automatic language detection without forcing any language
343
+ output_auto = whisper_model.generate(
344
+ input_features,
345
+ attention_mask=attention_mask,
346
+ return_dict_in_generate=True,
347
+ output_scores=True
348
+ )
349
+ trans_auto = whisper_processor.batch_decode(output_auto.sequences, skip_special_tokens=True)[0]
350
+
351
+ def avg_logprob(output):
352
+ if output.scores and len(output.scores) > 0:
353
+ return torch.cat([s for s in output.scores]).mean().item()
354
+ return float('-inf')
355
+
356
+ score_auto = avg_logprob(output_auto)
357
+ len_auto = len(trans_auto.strip())
358
+
359
+ print(f"[DEBUG] Auto transcription: '{trans_auto}' (score: {score_auto:.4f})")
360
+
361
+ # Check if auto transcription is meaningful
362
+ if len_auto >= 3:
363
+ try:
364
+ lang_auto = detect(trans_auto)
365
+ print(f"[DEBUG] Auto langdetect: {lang_auto}")
366
+
367
+ # If auto detection is confident French or English, use it directly
368
+ if lang_auto in ['fr', 'en']:
369
+ lang = lang_auto
370
+ text_original = trans_auto
371
+ print(f"[DEBUG] Auto detection confident: {lang_auto}, proceeding directly")
372
+ else:
373
+ # Not French or English, likely Bassa - will check references
374
+ lang = 'bss'
375
+ text_original = "[Bassa audio, no transcription]"
376
+ print(f"[DEBUG] Auto detection not French/English, checking Bassa references")
377
+        except Exception:
378
+ # Langdetect failed, likely Bassa
379
+ lang = 'bss'
380
+ text_original = "[Bassa audio, no transcription]"
381
+ print(f"[DEBUG] Langdetect failed, defaulting to Bassa")
382
+ else:
383
+ # Auto transcription too short, likely Bassa
384
+ lang = 'bss'
385
+ text_original = "[Bassa audio, no transcription]"
386
+ print(f"[DEBUG] Auto transcription too short, defaulting to Bassa")
387
+
388
+ # For Bassa detection, prioritize reference-based analysis over forced decoding
389
+ # Only use forced decoding if auto detection is very confident about French/English
390
+ if lang == 'bss' and len_auto >= 5: # Auto transcription is substantial
391
+ print(f"[DEBUG] Auto detection suggests Bassa, checking reference similarity...")
392
+
393
+            # First, check if this audio matches any Bassa reference.
+            # Compute the encoder embedding here: it is needed both for
+            # this similarity check and for Step 3 below.
394
+            with torch.no_grad():
+                audio_embedding = whisper_model.get_encoder()(input_features, attention_mask=attention_mask).last_hidden_state.mean(dim=1).cpu().numpy()
+            if audio_embedding is not None and reference_embeddings:
395
+ similarities = [cosine_similarity(audio_embedding, ref['embedding'])[0][0] for ref in reference_embeddings]
396
+ max_similarity = max(similarities)
397
+ print(f"[DEBUG] Max reference similarity: {max_similarity:.4f}")
398
+
399
+ if max_similarity > SIMILARITY_THRESHOLD:
400
+ print(f"[DEBUG] High similarity with Bassa reference, confirming Bassa")
401
+ # Keep as Bassa, reference-based analysis will be used
402
+ else:
403
+ print(f"[DEBUG] Low similarity with Bassa references, trying forced decoding...")
404
+
405
+ # Only try forced decoding if reference similarity is low
406
+ # Use Whisper forced decoding for both French and English, pick best
407
+ forced_decoder_ids_fr = whisper_processor.get_decoder_prompt_ids(language="french", task="transcribe")
408
+ forced_decoder_ids_en = whisper_processor.get_decoder_prompt_ids(language="english", task="transcribe")
409
+ output_fr = whisper_model.generate(
410
+ input_features,
411
+ attention_mask=attention_mask,
412
+ forced_decoder_ids=forced_decoder_ids_fr,
413
+ return_dict_in_generate=True,
414
+ output_scores=True
415
+ )
416
+ output_en = whisper_model.generate(
417
+ input_features,
418
+ attention_mask=attention_mask,
419
+ forced_decoder_ids=forced_decoder_ids_en,
420
+ return_dict_in_generate=True,
421
+ output_scores=True
422
+ )
423
+ trans_fr = whisper_processor.batch_decode(output_fr.sequences, skip_special_tokens=True)[0]
424
+ trans_en = whisper_processor.batch_decode(output_en.sequences, skip_special_tokens=True)[0]
425
+
426
+ score_fr = avg_logprob(output_fr)
427
+ score_en = avg_logprob(output_en)
428
+
429
+ print(f"[DEBUG] Forced French: '{trans_fr}' (score: {score_fr:.4f})")
430
+ print(f"[DEBUG] Forced English: '{trans_en}' (score: {score_en:.4f})")
431
+
432
+ # Only switch to French/English if forced decoding produces significantly better results
433
+ # and the text makes clear semantic sense
434
+ len_fr = len(trans_fr.strip())
435
+ len_en = len(trans_en.strip())
436
+
437
+ if len_fr >= 5 and score_fr > score_auto + 1.0 and validate_text_semantics(trans_fr, 'fr'):
438
+ lang = 'fr'
439
+ text_original = trans_fr
440
+ print(f"[DEBUG] Forced French significantly better and makes sense")
441
+ elif len_en >= 5 and score_en > score_auto + 1.0 and validate_text_semantics(trans_en, 'en'):
442
+ lang = 'en'
443
+ text_original = trans_en
444
+ print(f"[DEBUG] Forced English significantly better and makes sense")
445
+                else:
446
+                    # Keep as Bassa
447
+                    print(f"[DEBUG] Forced decoding didn't produce better results, keeping as Bassa")
448
+            else:
+                print(f"[DEBUG] No reference embeddings available, keeping as Bassa")
449
+
450
+        # Compute the embedding for reference-based analysis if not done above
+        if audio_embedding is None:
451
+            with torch.no_grad():
452
+                audio_embedding = whisper_model.get_encoder()(input_features, attention_mask=attention_mask).last_hidden_state.mean(dim=1).cpu().numpy()
453
+
454
+ except Exception as e:
455
+ # Print the full traceback to the console for detailed debugging
456
+ print(f"--- DETAILED ERROR TRACEBACK ---")
457
+ traceback.print_exc()
458
+ print(f"----------------------------------")
459
+ return {"error": f"Failed to process audio file: {e}"}
460
+ else:
461
+ text_original = content
462
+
463
+ if not text_original.strip() and lang != 'bss':
464
+ return {"error": "Transcription failed or text is empty."}
465
+
466
+ # --- Step 2: Language Detection ---
467
+ if lang is None:
468
+ try:
469
+ lang = detect(text_original)
470
+ if lang not in ['fr', 'en']:
471
+ lang = 'bss'
472
+        except Exception:
473
+ lang = 'bss'
474
+
475
+ # --- Step 3: Sentiment Analysis ---
476
+ # A. Bassa Reference-Based Analysis (if applicable)
477
+ if lang == 'bss' and audio_embedding is not None and reference_embeddings:
478
+ similarities = [cosine_similarity(audio_embedding, ref['embedding'])[0][0] for ref in reference_embeddings]
479
+ max_similarity = max(similarities)
480
+ if max_similarity > SIMILARITY_THRESHOLD:
481
+ best_match_index = np.argmax(similarities)
482
+ sentiment = reference_embeddings[best_match_index]['sentiment']
483
+ confiance = max_similarity
484
+ methode_utilisee = "base_referencee"
485
+ # B. Fallback to General Model or Mark as Uncertain
486
+ if sentiment is None:
487
+ if lang == 'bss':
488
+ sentiment = 'neutre'
489
+ confiance = 0.0
490
+ methode_utilisee = "reference_miss"
491
+ else:
492
+ try:
493
+ sentiment_result = sentiment_pipeline_model(text_original)[0]
494
+ star_rating = int(sentiment_result['label'].split()[0])
495
+ if star_rating <= 2:
496
+ sentiment = 'negatif'
497
+ elif star_rating == 3:
498
+ sentiment = 'neutre'
499
+ else:
500
+ sentiment = 'positif'
501
+ confiance = sentiment_result['score']
502
+ methode_utilisee = "nlptown_bert_multilingual"
503
+ except Exception as e:
504
+ print(f"Sentiment analysis error: {e}")
505
+ sentiment = 'neutre'
506
+ confiance = 0.0
507
+ methode_utilisee = "sentiment_error"
508
+
509
+ # --- Step 4: Translation ---
510
+ traduction_anglaise = "[Translation disabled: sentencepiece not installed]"
511
+
512
+ # --- Step 5: Format Output ---
513
+ theme = infer_medical_theme(text_original, sentiment)
514
+ return {
515
+ "sentiment": sentiment,
516
+ "score_sentiment": round(float(confiance), 4),
517
+ "theme": theme,
518
+ "langue_detectee": lang,
519
+ "texte_original": text_original
520
+ }
521
+
522
+ # --- Initialization ---
523
+ # Set offline=False for the very first run to download models.
524
+ # Set offline=True for all subsequent runs to work without internet.
525
+ load_models_and_references(offline=False)
526
+
527
+ if __name__ == '__main__':
528
+ print("\n--- Running Test Analysis ---")
529
+ test_text = "Je suis très content de ce produit, c'est formidable!"
530
+ result = analyze(input_type="text", content=test_text)
531
+ print(f"Analysis for text: '{test_text}'")
532
+ import json
533
+ print(json.dumps(result, indent=2, ensure_ascii=False))
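The Bassa path above boils down to a nearest-neighbour search over mean-pooled Whisper encoder embeddings against `reference_embeddings`. A minimal, self-contained sketch of that matching step — plain NumPy stands in for scikit-learn's `cosine_similarity`, and the toy 3-D embeddings, helper names, and the `0.7` threshold are illustrative assumptions, not the repo's values:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.7  # illustrative; the real value lives in bassa_analyzer.py

def cosine_sim(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_reference(audio_embedding, references):
    """Return (sentiment, score) of the closest reference,
    or (None, best_score) when nothing clears the threshold."""
    sims = [cosine_sim(audio_embedding, ref["embedding"]) for ref in references]
    best = int(np.argmax(sims))
    if sims[best] > SIMILARITY_THRESHOLD:
        return references[best]["sentiment"], sims[best]
    return None, sims[best]

# Toy usage with 3-D "embeddings"
refs = [
    {"embedding": np.array([1.0, 0.0, 0.0]), "sentiment": "positif"},
    {"embedding": np.array([0.0, 1.0, 0.0]), "sentiment": "negatif"},
]
print(match_reference(np.array([0.9, 0.1, 0.0]), refs))  # picks the 'positif' reference
print(match_reference(np.array([0.0, 0.0, 1.0]), refs))  # below threshold -> falls back
```

The threshold is the key tuning knob: too low and unrelated audio gets forced onto a reference sentiment, too high and genuine Bassa utterances fall through to the `reference_miss` branch.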
sentiment_space/requirements.txt ADDED
@@ -0,0 +1,15 @@
1
+ transformers
2
+ huggingface_hub
3
+ datasets[audio]
4
+ accelerate
5
+ torch
6
+ torchaudio
7
+ soundfile
8
+ librosa
9
+ pandas
10
+ scikit-learn
11
+ langdetect
12
+ fastapi
13
+ uvicorn[standard]
14
+ protobuf
15
+ sentencepiece
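Step 3's fallback in `bassa_analyzer.py` converts the multilingual BERT model's star labels (e.g. `"4 stars"`) into the three sentiment classes. Pulled out as a standalone helper for clarity — the function name is ours, not part of the repo:

```python
def stars_to_sentiment(label: str) -> str:
    """Map an nlptown-style label like '4 stars' to negatif/neutre/positif."""
    star_rating = int(label.split()[0])  # leading digit of the label
    if star_rating <= 2:
        return 'negatif'
    elif star_rating == 3:
        return 'neutre'
    return 'positif'

print(stars_to_sentiment("1 star"))   # -> negatif
print(stars_to_sentiment("3 stars"))  # -> neutre
print(stars_to_sentiment("5 stars"))  # -> positif
```

Collapsing five stars into three classes keeps the API's output stable even if the underlying sentiment model is swapped for one with a different label scheme.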