DominiqueLoyer committed on
Commit e70050b · 1 Parent(s): 1ae34e8

Deploy SysCRED with PyTorch

Dockerfile ADDED
@@ -0,0 +1,37 @@
+ # SysCRED Docker Configuration for Hugging Face Spaces
+ # Full version with PyTorch and Transformers
+ FROM python:3.10-slim
+
+ WORKDIR /app
+
+ ENV PYTHONDONTWRITEBYTECODE=1
+ ENV PYTHONUNBUFFERED=1
+ ENV PYTHONPATH=/app
+ ENV SYSCRED_LOAD_ML_MODELS=true
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements (full version with ML)
+ COPY requirements.txt /app/requirements.txt
+
+ # Install dependencies (includes PyTorch, Transformers)
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY syscred/ /app/syscred/
+
+ # Create user for HF Spaces (required)
+ RUN useradd -m -u 1000 user
+ USER user
+ ENV HOME=/home/user
+ ENV PATH=/home/user/.local/bin:$PATH
+
+ WORKDIR /app
+
+ EXPOSE 7860
+
+ # Run with HF Spaces port (7860)
+ CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "2", "--timeout", "300", "syscred.backend_app:app"]
README.md CHANGED
@@ -1,12 +1,27 @@
  ---
- title: Syscred
- emoji: 🌍
- colorFrom: green
- colorTo: indigo
+ title: SysCRED - Credibility Verification System
+ emoji: 🔍
+ colorFrom: purple
+ colorTo: blue
  sdk: docker
  pinned: false
  license: mit
- short_description: Credibility Verification System - Neuro-Symbolic Fact Checki
+ app_port: 7860
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # SysCRED - Credibility Verification System
+
+ A hybrid neuro-symbolic system for credibility verification and fact-checking.
+
+ ## Features
+ - 🔍 URL and text credibility analysis
+ - 🧠 NLP-based coherence analysis with Transformers
+ - 📊 SEO and source reputation scoring
+ - 🌐 Knowledge graph visualization with D3.js
+ - 🔗 Ontology-based reasoning with RDFLib
+
+ ## Author
+ **Dominique S. Loyer** - UQAM
+
+ ## Usage
+ Enter a URL or paste text to analyze its credibility score based on multiple factors.
requirements.txt ADDED
@@ -0,0 +1,31 @@
+ # SysCRED - Full Requirements for Hugging Face Spaces
+ # Hybrid Credibility Verification System
+ # (c) Dominique S. Loyer
+
+ # === Core Dependencies ===
+ requests>=2.28.0
+ beautifulsoup4>=4.11.0
+ python-whois>=0.8.0
+
+ # === RDF/Ontology ===
+ rdflib>=6.0.0
+
+ # === Machine Learning (Full) ===
+ transformers>=4.30.0
+ torch>=2.0.0
+ numpy>=1.24.0
+ sentence-transformers>=2.2.0
+
+ # === Explainability ===
+ lime>=0.2.0
+
+ # === Web Backend ===
+ flask>=2.3.0
+ flask-cors>=4.0.0
+ python-dotenv>=1.0.0
+ pandas>=2.0.0
+
+ # === Production/Database ===
+ gunicorn>=20.1.0
+ psycopg2-binary>=2.9.0
+ flask-sqlalchemy>=3.0.0
syscred/README.md ADDED
@@ -0,0 +1 @@
+ # syscred
syscred/SysCRED_Documentation.md ADDED
@@ -0,0 +1,659 @@
+ # 🔬 SysCRED - Complete Documentation
+
+ ## Neuro-Symbolic Credibility Verification System
+
+ > **Version:** 2.0
+ > **Author:** Dominique S. Loyer
+ > **Citation Key:** `loyerModelingHybridSystem2025`
+ > **DOI:** [10.5281/zenodo.17943226](https://doi.org/10.5281/zenodo.17943226)
+ > **Last updated:** January 2026
+
+ ---
+
+ ## 📋 Table of Contents
+
+ 1. [Overview](#overview)
+ 2. [System architecture](#system-architecture)
+ 3. [Modules and files](#modules-and-files)
+ 4. [Installation and configuration](#installation-and-configuration)
+ 5. [Commands and usage](#commands-and-usage)
+ 6. [Design choices](#design-choices)
+ 7. [Completed improvements](#completed-improvements)
+ 8. [Future improvements](#future-improvements)
+ 9. [API Reference](#api-reference)
+ 10. [OWL Ontology](#owl-ontology)
+
+ ---
+
+ ## Overview
+
+ ### What is SysCRED?
+
+ SysCRED (System for CREdibility Detection) is a **hybrid neuro-symbolic system** designed to automatically assess the credibility of online information. It combines:
+
+ - A **symbolic approach** (explicit, transparent, explainable rules)
+ - A **neural approach** (NLP models for sentiment, bias, entities)
+ - An **OWL ontology** (traceability and semantic reasoning)
+
+ ### Project philosophy
+
+ The system is designed as a **doctoral research prototype** built on these principles:
+
+ 1. **Explainability (xAI)**: every decision can be traced and justified
+ 2. **Hybridity**: combines the best of rules and ML
+ 3. **Reproducibility**: open-source code, complete documentation
+ 4. **Modularity**: each component is independent and testable
+
+ ---
+
+ ## System architecture
+
+ ```
+ ┌─────────────────────────────────────────────────────────────────┐
+ │                         SysCRED v2.0                            │
+ ├─────────────────────────────────────────────────────────────────┤
+ │  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
+ │  │     INPUT     │  │   External    │  │    OUTPUT     │        │
+ │  │  URL / Text   │──│     APIs      │──│    Report     │        │
+ │  └───────────────┘  └───────────────┘  └───────────────┘        │
+ │          │                  │                  ▲                │
+ │          ▼                  ▼                  │                │
+ │  ┌─────────────────────────────────────────────────────┐        │
+ │  │               VERIFICATION SYSTEM                   │        │
+ │  │  ┌─────────────────┐  ┌─────────────────┐           │        │
+ │  │  │   RULE-BASED    │  │  NLP ANALYSIS   │           │        │
+ │  │  │  • Reputation   │  │  • Sentiment    │           │        │
+ │  │  │  • Domain age   │  │  • NER          │           │        │
+ │  │  │  • Fact-check   │  │  • Bias         │           │        │
+ │  │  │  • Markers      │  │  • Coherence    │           │        │
+ │  │  └─────────────────┘  └─────────────────┘           │        │
+ │  │                    ↓                                │        │
+ │  │        ┌─────────────────────────┐                  │        │
+ │  │        │   SCORE CALCULATION     │                  │        │
+ │  │        │   (hybrid weighting)    │                  │        │
+ │  │        └─────────────────────────┘                  │        │
+ │  └─────────────────────────────────────────────────────┘        │
+ │          │                                                      │
+ │          ▼                                                      │
+ │  ┌─────────────────────────────────────────────────────┐        │
+ │  │           ONTOLOGY MANAGER (OWL/RDF)                │        │
+ │  │           Traceability and reasoning                │        │
+ │  └─────────────────────────────────────────────────────┘        │
+ └─────────────────────────────────────────────────────────────────┘
+ ```
+
+ ### Processing flow
+
+ 1. **Input** → URL or raw text
+ 2. **Retrieval** → web content (if URL)
+ 3. **Preprocessing** → text cleaning
+ 4. **External data** → WHOIS, fact-check APIs
+ 5. **Rule analysis** → linguistic markers, reputation
+ 6. **NLP analysis** → sentiment, bias, entities
+ 7. **Score computation** → hybrid weighting (0-1)
+ 8. **Report generation** → structured JSON
+ 9. **Ontology persistence** → RDF triples
+
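To make step 5 concrete, here is a minimal sketch of what a linguistic-marker rule could look like. The marker list and the `marker_penalty` helper are illustrative assumptions, not SysCRED's actual rules:

```python
import re

# Hypothetical sensationalism markers (illustrative list, not SysCRED's rule set)
MARKERS = [r"\bshocking\b", r"\bmiracle\b", r"!{2,}", r"\b100% proven\b"]

def marker_penalty(text: str) -> float:
    """Fraction of marker patterns found; a rule-based score could subtract this."""
    hits = sum(1 for pattern in MARKERS if re.search(pattern, text, re.IGNORECASE))
    return hits / len(MARKERS)

print(marker_penalty("SHOCKING miracle cure!!!"))  # → 0.75
```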
+ ---
+
+ ## Modules and files
+
+ ### Project structure
+
+ ```
+ syscred/
+ ├── __init__.py                   # Package init
+ ├── config.py                     # Centralized configuration
+ ├── verification_system.py        # Main system
+ ├── api_clients.py                # External API clients
+ ├── ontology_manager.py           # OWL/RDF management
+ ├── seo_analyzer.py               # SEO/PageRank analysis
+ ├── backend_app.py                # Flask REST API
+ ├── eval_metrics.py               # Evaluation metrics
+ ├── ir_engine.py                  # Retrieval engine
+ ├── requirements.txt              # Python dependencies
+ ├── setup.py                      # Package installation
+ ├── syscred_kaggle.ipynb          # Kaggle notebook
+ ├── syscred_colab.ipynb           # Colab notebook (with Drive)
+ └── kaggle_to_gdrive_backup.ipynb # Notebook backup
+ ```
+
+ ### Module descriptions
+
+ #### `config.py` - Centralized configuration
+
+ **Purpose:** centralize all system parameters in a single file.
+
+ **Classes:**
+
+ - `Config` - base configuration
+ - `DevelopmentConfig` - for local development
+ - `ProductionConfig` - for production
+ - `TestingConfig` - for tests (ML disabled)
+
+ **Key parameters:**
+
+ | Parameter | Description | Default |
+ |-----------|-------------|---------|
+ | `HOST` | Server address | `0.0.0.0` |
+ | `PORT` | Server port | `5000` |
+ | `DEBUG` | Debug mode | `true` |
+ | `LOAD_ML_MODELS` | Load the ML models | `true` |
+ | `WEB_FETCH_TIMEOUT` | HTTP timeout (sec) | `10` |
+
+ **Score weights:**
+
+ ```python
+ SCORE_WEIGHTS = {
+     'source_reputation': 0.25,    # Source reputation
+     'domain_age': 0.10,           # Domain age
+     'sentiment_neutrality': 0.15, # Neutrality of tone
+     'entity_presence': 0.15,      # Presence of verifiable entities
+     'coherence': 0.15,            # Textual coherence
+     'fact_check': 0.20            # Fact-check results
+ }
+ ```
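As a sketch of how such weights can combine component scores into a single value, here is one plausible weighted average with renormalization over the factors actually present. The `overall_score` helper and its renormalization behavior are assumptions for illustration, not the exact implementation in `verification_system.py`:

```python
# Weights as documented above
SCORE_WEIGHTS = {
    'source_reputation': 0.25,
    'domain_age': 0.10,
    'sentiment_neutrality': 0.15,
    'entity_presence': 0.15,
    'coherence': 0.15,
    'fact_check': 0.20,
}

def overall_score(components: dict) -> float:
    """Weighted average over available factors, renormalized when some are missing."""
    present = {k: w for k, w in SCORE_WEIGHTS.items() if k in components}
    total_weight = sum(present.values())
    return sum(components[k] * w for k, w in present.items()) / total_weight

print(round(overall_score({'source_reputation': 0.9, 'fact_check': 0.5}), 3))
```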
+
+ **Environment variables:**
+
+ ```bash
+ export SYSCRED_ENV=production     # Environment (dev/prod/testing)
+ export SYSCRED_PORT=8080          # Custom port
+ export SYSCRED_GOOGLE_API_KEY=xxx # Google Fact Check key
+ export SYSCRED_LOAD_ML=false      # Disable ML
+ ```
+
+ ---
+
+ #### `verification_system.py` - Main system
+
+ **Purpose:** main credibility verification pipeline.
+
+ **Main class:** `CredibilityVerificationSystem`
+
+ **Main methods:**
+
+ | Method | Description |
+ |--------|-------------|
+ | `__init__()` | Initializes the system, loads the models |
+ | `verify_information(input)` | Main verification pipeline |
+ | `rule_based_analysis(text, data)` | Symbolic analysis |
+ | `nlp_analysis(text)` | NLP (ML) analysis |
+ | `calculate_overall_score()` | Computes the final score |
+ | `generate_report()` | Generates the JSON report |
+
+ **ML models used:**
+
+ | Model | Usage |
+ |-------|-------|
+ | `distilbert-base-uncased-finetuned-sst-2-english` | Sentiment |
+ | `dbmdz/bert-large-cased-finetuned-conll03-english` | NER |
+ | `bert-base-uncased` | Bias detection (placeholder) |
+ | `LIME` | Prediction explanations |
+
+ ---
+
+ #### `api_clients.py` - External API clients
+
+ **Purpose:** abstract all interactions with external APIs.
+
+ **Main class:** `ExternalAPIClients`
+
+ **Integrated APIs:**
+
+ | API | Method | Description |
+ |-----|--------|-------------|
+ | Web Content | `fetch_web_content()` | Fetches and parses HTML |
+ | WHOIS | `whois_lookup()` | Domain age and registrar |
+ | Google Fact Check | `google_fact_check()` | Fact verification |
+ | Source Reputation | `get_source_reputation()` | Internal database |
+ | CommonCrawl | `estimate_backlinks()` | Backlink estimation |
+
+ **Data classes:**
+
+ - `WebContent` - parsed web content
+ - `DomainInfo` - WHOIS information
+ - `FactCheckResult` - fact-check result
+ - `ExternalData` - aggregated data
+
+ ---
+
+ #### `ontology_manager.py` - OWL/RDF management
+
+ **Purpose:** semantic traceability through an OWL ontology.
+
+ **Features:**
+
+ - Loading a base ontology (.ttl)
+ - Adding RDF triples for each evaluation
+ - Saving the accumulated data
+ - SPARQL queries
+
+ **Ontology used:**
+
+ - Format: Turtle (.ttl)
+ - Namespace: `http://syscred.uqam.ca/ontology#`
+ - Concepts: `Evaluation`, `Source`, `CredibilityScore`, `Evidence`
+
+ ---
+
+ #### `backend_app.py` - Flask API
+
+ **Purpose:** expose SysCRED through a REST API.
+
+ **Endpoints:**
+
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/api/verify` | POST | Main verification |
+ | `/api/seo` | POST | SEO analysis only |
+ | `/api/ontology/stats` | GET | Ontology statistics |
+ | `/api/health` | GET | Health check |
+ | `/api/config` | GET | Current configuration |
+
+ **Example request:**
+
+ ```bash
+ curl -X POST http://localhost:5000/api/verify \
+   -H "Content-Type: application/json" \
+   -d '{"input_data": "https://example.com/article"}'
+ ```
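For illustration, a response from `/api/verify` might look like the following. The field names come from the report structure documented in the API Reference below; the values themselves are invented:

```json
{
  "idRapport": "eval_1705890000",
  "scoreCredibilite": 0.85,
  "resumeAnalyse": "Source with a strong reputation; no contradicting fact-checks found.",
  "detailsScore": {"source_reputation": 0.9, "fact_check": 0.8}
}
```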
+
+ ---
+
+ ## Installation and configuration
+
+ ### Prerequisites
+
+ - Python 3.8+
+ - pip
+ - Git
+
+ ### Local installation
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/DominiqueLoyer/syscred.git
+ cd syscred
+
+ # Create a virtual environment
+ python -m venv venv
+ source venv/bin/activate  # Linux/Mac
+ # or: venv\Scripts\activate  # Windows
+
+ # Install the dependencies
+ pip install -r requirements.txt
+
+ # Install the package in development mode
+ pip install -e .
+ ```
+
+ ### Installing dependencies
+
+ ```bash
+ # Main dependencies
+ pip install transformers torch numpy
+ pip install flask flask-cors
+ pip install rdflib owlrl
+ pip install requests beautifulsoup4
+
+ # Optional dependencies
+ pip install python-whois  # For WHOIS
+ pip install lime          # For ML explanations
+ ```
+
+ ### requirements.txt file
+
+ ```
+ transformers>=4.30.0
+ torch>=2.0.0
+ numpy>=1.24.0
+ flask>=2.3.0
+ flask-cors>=4.0.0
+ rdflib>=6.3.0
+ owlrl>=6.0.0
+ requests>=2.31.0
+ beautifulsoup4>=4.12.0
+ python-whois>=0.8.0
+ lime>=0.2.0
+ ```
+
+ ---
+
+ ## Commands and usage
+
+ ### Start the Flask API
+
+ ```bash
+ # Development mode
+ cd /path/to/syscred
+ python backend_app.py
+
+ # With environment variables
+ SYSCRED_PORT=8080 SYSCRED_DEBUG=true python backend_app.py
+
+ # Production mode
+ SYSCRED_ENV=production python backend_app.py
+ ```
+
+ ### Test the system from the command line
+
+ ```bash
+ # Direct module test
+ python -m syscred.verification_system
+
+ # Test with custom input
+ python -c "
+ from syscred.verification_system import CredibilityVerificationSystem
+ sys = CredibilityVerificationSystem(load_ml_models=False)
+ result = sys.verify_information('https://www.lemonde.fr')
+ print(result['scoreCredibilite'])
+ "
+ ```
+
+ ### Usage in Kaggle/Colab
+
+ Open the notebook `syscred_kaggle.ipynb` or `syscred_colab.ipynb`:
+
+ ```python
+ # Cell 1: installation
+ !pip install transformers torch rdflib requests beautifulsoup4
+
+ # Cell 2: import and test
+ from syscred import CredibilityVerificationSystem
+ sys = CredibilityVerificationSystem()
+ result = sys.verify_information("https://example.com")
+ ```
+
+ ### REST API - Examples
+
+ ```bash
+ # Verify a URL
+ curl -X POST http://localhost:5000/api/verify \
+   -H "Content-Type: application/json" \
+   -d '{"input_data": "https://www.bbc.com/article"}'
+
+ # Verify raw text
+ curl -X POST http://localhost:5000/api/verify \
+   -H "Content-Type: application/json" \
+   -d '{"input_data": "This is a verified news report."}'
+
+ # Health check
+ curl http://localhost:5000/api/health
+
+ # Get the configuration
+ curl http://localhost:5000/api/config
+ ```
+
+ ---
+
+ ## Design choices
+
+ ### Why a hybrid neuro-symbolic approach?
+
+ | Approach | Strengths | Weaknesses |
+ |----------|-----------|------------|
+ | **Rules** | Transparent, explainable, fast | Rigid, limited coverage |
+ | **ML/NLP** | Flexible, handles complex patterns | Black box, needs data |
+ | **Hybrid** | Combines both | More complex |
+
+ **Decision:** use rules for clear-cut cases (known reputation, linguistic markers) and ML for nuances (sentiment, bias).
+
+ ### Why these weights?
+
+ The default weights reflect the relative importance of each factor according to the literature:
+
+ ```python
+ SCORE_WEIGHTS = {
+     'source_reputation': 0.25,    # Most important: known source
+     'fact_check': 0.20,           # External verification
+     'sentiment_neutrality': 0.15,
+     'entity_presence': 0.15,
+     'coherence': 0.15,
+     'domain_age': 0.10            # Less important on its own
+ }
+ ```
+
+ ### Why LIME for explainability?
+
+ - **Local Interpretable Model-agnostic Explanations**
+ - Works with any model
+ - Generates human-readable explanations
+ - A recognized academic standard
+
+ ### Why OWL/RDF?
+
+ - **Traceability**: every evaluation is recorded
+ - **Reasoning**: automatic inferences are possible (OWL-RL)
+ - **Interoperability**: W3C standard, SPARQL-compatible
+ - **Publication**: linked data
+
+ ---
+
+ ## Completed improvements
+
+ ### Version 2.0 (January 2026)
+
+ 1. **Centralized configuration** (`config.py`)
+    - Environment variables
+    - dev/prod/testing profiles
+    - Configurable weights
+
+ 2. **Refactored API clients** (`api_clients.py`)
+    - Typed data classes
+    - Robust error handling
+    - Real WHOIS lookup
+
+ 3. **Kaggle/Colab notebooks**
+    - `syscred_kaggle.ipynb` - Kaggle version
+    - `syscred_colab.ipynb` - version with Google Drive
+    - "Open in" badges for convenience
+
+ 4. **Fix for the `NameError: result` bug**
+    - Local variable in the RDF section
+    - Fallback when there is no result
+
+ 5. **Professional README**
+    - Zenodo DOI badge
+    - Quick start
+    - Documented API endpoints
+
+ 6. **Kaggle→Drive backup notebook**
+    - `kaggle_to_gdrive_backup.ipynb`
+    - Automatic backup
+
+ ---
+
+ ## Future improvements
+
+ ### Short term (coming months)
+
+ - [ ] **Real Google Fact Check API** - integrate the API key
+ - [ ] **CommonCrawl backlinks** - real backlink analysis
+ - [ ] **More sources** - extend `SOURCE_REPUTATIONS`
+ - [ ] **Unit tests** - coverage >80%
+
+ ### Medium term (6-12 months)
+
+ - [ ] **Fine-tuned bias model** - train on real data
+ - [ ] **Redis cache** - result caching
+ - [ ] **Modern web interface** - React/Vue frontend
+ - [ ] **Docker** - containerization
+
+ ### Long term (thesis)
+
+ - [ ] **Formal evaluation** - benchmark dataset
+ - [ ] **Multilingual** - native French support
+ - [ ] **Knowledge graph** - Neo4j integration
+ - [ ] **Continuous learning** - feedback loop
+
+ ---
+
+ ## API Reference
+
+ ### `CredibilityVerificationSystem` class
+
+ ```python
+ class CredibilityVerificationSystem:
+     def __init__(
+         self,
+         google_api_key: Optional[str] = None,
+         ontology_base_path: Optional[str] = None,
+         ontology_data_path: Optional[str] = None,
+         load_ml_models: bool = True
+     ):
+         """
+         Initialize the credibility verification system.
+
+         Args:
+             google_api_key: API key for Google Fact Check
+             ontology_base_path: Path to base ontology TTL
+             ontology_data_path: Path to store data
+             load_ml_models: Whether to load ML models
+         """
+
+     def verify_information(self, input_data: str) -> Dict[str, Any]:
+         """
+         Main pipeline to verify credibility.
+
+         Args:
+             input_data: URL or text to verify
+
+         Returns:
+             Complete evaluation report with:
+             - idRapport: Unique report ID
+             - scoreCredibilite: 0.0-1.0
+             - resumeAnalyse: French summary
+             - detailsScore: Score breakdown
+             - reglesAppliquees: Rule-based results
+             - analyseNLP: NLP analysis results
+         """
+ ```
+
+ ### `Config` class
+
+ ```python
+ class Config:
+     # Paths
+     BASE_DIR: Path
+     ONTOLOGY_BASE_PATH: Path
+     ONTOLOGY_DATA_PATH: Path
+
+     # Server
+     HOST: str = "0.0.0.0"
+     PORT: int = 5000
+     DEBUG: bool = True
+
+     # API keys
+     GOOGLE_FACT_CHECK_API_KEY: Optional[str]
+
+     # ML models
+     LOAD_ML_MODELS: bool = True
+     SENTIMENT_MODEL: str
+     NER_MODEL: str
+
+     # Weights
+     SCORE_WEIGHTS: Dict[str, float]
+     CREDIBILITY_THRESHOLDS: Dict[str, float]
+     SOURCE_REPUTATIONS: Dict[str, str]
+
+     @classmethod
+     def load_external_reputations(cls, filepath: str) -> None:
+         """Load reputations from a JSON file."""
+
+     @classmethod
+     def update_weights(cls, new_weights: Dict[str, float]) -> None:
+         """Update the score weights."""
+
+     @classmethod
+     def to_dict(cls) -> Dict:
+         """Export the configuration as a dictionary."""
+ ```
+
+ ---
+
+ ## OWL Ontology
+
+ ### Conceptual structure
+
+ ```
+ syscred:Evaluation
+   └── syscred:evaluates → syscred:Information
+   └── syscred:hasScore → xsd:float
+   └── syscred:hasEvidence → syscred:Evidence
+   └── syscred:generatedAt → xsd:dateTime
+
+ syscred:Information
+   └── syscred:hasSource → syscred:Source
+   └── syscred:hasContent → xsd:string
+
+ syscred:Source
+   └── syscred:hasDomain → xsd:string
+   └── syscred:hasReputation → syscred:ReputationLevel
+   └── syscred:hasDomainAge → xsd:integer
+
+ syscred:Evidence
+   └── syscred:type → xsd:string (Linguistic, FactCheck, etc.)
+   └── syscred:value → xsd:string
+   └── syscred:impact → xsd:float
+ ```
+
+ ### Example of generated triples
+
+ ```turtle
+ @prefix syscred: <http://syscred.uqam.ca/ontology#> .
+ @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
+
+ syscred:eval_1705890000 a syscred:Evaluation ;
+     syscred:evaluates syscred:info_lemonde_article ;
+     syscred:hasScore "0.85"^^xsd:float ;
+     syscred:generatedAt "2026-01-21T13:40:00"^^xsd:dateTime ;
+     syscred:hasEvidence syscred:evidence_1 .
+
+ syscred:evidence_1 a syscred:Evidence ;
+     syscred:type "SourceReputation" ;
+     syscred:value "High" ;
+     syscred:impact "0.25"^^xsd:float .
+ ```
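The SPARQL query support mentioned for `ontology_manager.py` can be illustrated against triples of this shape. This query is a hedged sketch (the actual queries issued by the module are not shown in this commit); it selects every evaluation and its score from the graph:

```sparql
PREFIX syscred: <http://syscred.uqam.ca/ontology#>

SELECT ?eval ?score
WHERE {
    ?eval a syscred:Evaluation ;
          syscred:hasScore ?score .
}
ORDER BY DESC(?score)
```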
+
+ ---
+
+ ## Utility scripts
+
+ ### Backup script for Obsidian/Notion
+
+ Create this script in `/Users/bk280625/documents041025/MonCode/`:
+
+ ```bash
+ #!/bin/bash
+ # save_syscred_docs.sh
+ # Usage: ./save_syscred_docs.sh
+
+ DOC_SOURCE="/Users/bk280625/documents041025/MonCode/syscred/SysCRED_Documentation.md"
+ OBSIDIAN_VAULT="/Users/bk280625/Documents/Obsidian/PhD"
+ DATE=$(date +%Y%m%d)
+
+ # Copy to Obsidian
+ cp "$DOC_SOURCE" "$OBSIDIAN_VAULT/SysCRED_Documentation_$DATE.md"
+ echo "✅ Copied to Obsidian: $OBSIDIAN_VAULT"
+
+ # Open in Obsidian (Mac)
+ open "obsidian://open?vault=PhD&file=SysCRED_Documentation_$DATE"
+
+ # For Notion: use the API or copy manually
+ # Notion has no direct import of local files
+ echo "📋 For Notion: copy the contents of $DOC_SOURCE"
+ echo "   Or use: https://notion.so/import"
+ ```
+
+ ---
+
+ ## References
+
+ - Loyer, D. S. (2025). *Modeling and Hybrid System for Verification of Sources Credibility*. UQAM.
+ - Loyer, D. S. (2025). *Ontology of a Verification System for Liability of the Information*. DIC-9335.
+
+ ---
+
+ *Documentation generated on January 21, 2026*
+ *SysCRED v2.0 - Dominique S. Loyer - UQAM*
syscred/__init__.py ADDED
@@ -0,0 +1,55 @@
+ # -*- coding: utf-8 -*-
+ """
+ SysCRED - Neuro-Symbolic Credibility Verification System
+ ========================================================
+
+ PhD Thesis Prototype - (c) Dominique S. Loyer
+ Citation Key: loyerModelingHybridSystem2025
+
+ Modules:
+ - api_clients: Web scraping, WHOIS, Fact Check APIs
+ - ir_engine: BM25, QLD, TF-IDF, PRF (from TREC)
+ - trec_retriever: Evidence retrieval for fact-checking (NEW v2.3)
+ - trec_dataset: TREC AP88-90 data loader (NEW v2.3)
+ - seo_analyzer: SEO analysis, PageRank estimation
+ - eval_metrics: MAP, NDCG, P@K, Recall, MRR
+ - ontology_manager: RDFLib integration
+ - verification_system: Main credibility pipeline
+ - graph_rag: GraphRAG for contextual memory
+ """
+
+ __version__ = "2.3.0"
+ __author__ = "Dominique S. Loyer"
+ __citation__ = "loyerModelingHybridSystem2025"
+
+ # Core classes
+ from syscred.verification_system import CredibilityVerificationSystem
+ from syscred.api_clients import ExternalAPIClients
+ from syscred.ontology_manager import OntologyManager
+ from syscred.seo_analyzer import SEOAnalyzer
+ from syscred.ir_engine import IREngine
+ from syscred.eval_metrics import EvaluationMetrics
+
+ # TREC Integration (NEW - Feb 2026)
+ from syscred.trec_retriever import TRECRetriever, Evidence, RetrievalResult
+ from syscred.trec_dataset import TRECDataset, TRECTopic
+
+ # Convenience alias
+ SysCRED = CredibilityVerificationSystem
+
+ __all__ = [
+     # Core
+     'CredibilityVerificationSystem',
+     'SysCRED',
+     'ExternalAPIClients',
+     'OntologyManager',
+     'SEOAnalyzer',
+     'IREngine',
+     'EvaluationMetrics',
+     # TREC (NEW)
+     'TRECRetriever',
+     'TRECDataset',
+     'TRECTopic',
+     'Evidence',
+     'RetrievalResult',
+ ]
syscred/api_clients.py ADDED
@@ -0,0 +1,560 @@
+ # -*- coding: utf-8 -*-
+ """
+ API Clients Module - SysCRED
+ ============================
+ Handles all external API calls for the credibility verification system.
+
+ Integrated APIs:
+ - Web content fetching (requests + BeautifulSoup)
+ - WHOIS lookup for domain age
+ - Google Fact Check Tools API
+ - Backlinks estimation via CommonCrawl
+
+ (c) Dominique S. Loyer - PhD Thesis Prototype
+ Citation Key: loyerModelingHybridSystem2025
+ """
+
+ import requests
+ from urllib.parse import urlparse
+ from datetime import datetime, timedelta
+ from typing import Optional, List, Dict, Any
+ from dataclasses import dataclass
+ import re
+ import json
+ from functools import lru_cache
+
+ # Optional imports with fallbacks
+ try:
+     from bs4 import BeautifulSoup
+     HAS_BS4 = True
+ except ImportError:
+     HAS_BS4 = False
+     print("Warning: BeautifulSoup not installed. Run: pip install beautifulsoup4")
+
+ try:
+     import whois
+     HAS_WHOIS = True
+ except ImportError:
+     HAS_WHOIS = False
+     print("Warning: python-whois not installed. Run: pip install python-whois")
+
+
+ # --- Data Classes for Structured Results ---
+
+ @dataclass
+ class WebContent:
+     """Represents fetched web content."""
+     url: str
+     title: Optional[str]
+     text_content: str
+     meta_description: Optional[str]
+     meta_keywords: List[str]
+     links: List[str]
+     fetch_timestamp: str
+     success: bool
+     error: Optional[str] = None
+
+
+ @dataclass
+ class DomainInfo:
+     """Represents domain WHOIS information."""
+     domain: str
+     creation_date: Optional[datetime]
+     expiration_date: Optional[datetime]
+     registrar: Optional[str]
+     age_days: Optional[int]
+     success: bool
+     error: Optional[str] = None
+
+
+ @dataclass
+ class FactCheckResult:
+     """Represents a single fact-check claim review."""
+     claim: str
+     claimant: Optional[str]
+     rating: str
+     publisher: str
+     url: str
+     review_date: Optional[str]
+
+
+ @dataclass
+ class ExternalData:
+     """Combined external data for credibility analysis."""
+     fact_checks: List[FactCheckResult]
+     source_reputation: str
+     domain_age_days: Optional[int]
+     domain_info: Optional[DomainInfo]
+     related_articles: List[Dict[str, str]]
+     backlinks_count: int
+     backlinks_sample: List[Dict[str, str]]
+
+
+ class ExternalAPIClients:
+     """
+     Central class for all external API integrations.
+     Replaces simulated functions with real API calls.
+     """
+
+     def __init__(self, google_api_key: Optional[str] = None):
+         """
+         Initialize API clients.
+
+         Args:
+             google_api_key: API key for Google Fact Check Tools API (optional)
+         """
+         self.google_api_key = google_api_key
+         self.session = requests.Session()
+         self.session.headers.update({
+             'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
+             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8',
+             'Accept-Language': 'en-US,en;q=0.9,fr;q=0.8',
+             'Referer': 'https://www.google.com/',
+             'Upgrade-Insecure-Requests': '1',
+             'Sec-Fetch-Dest': 'document',
+             'Sec-Fetch-Mode': 'navigate',
+             'Sec-Fetch-Site': 'none',
+             'Sec-Fetch-User': '?1'
+         })
+
+         # Reputation database (can be extended or loaded from file)
+         self.known_reputations = {
+             # High credibility sources
+             'lemonde.fr': 'High',
+             'nytimes.com': 'High',
+             'reuters.com': 'High',
+             'bbc.com': 'High',
+             'theguardian.com': 'High',
+             'apnews.com': 'High',
+             'nature.com': 'High',
+             'sciencedirect.com': 'High',
+             'scholar.google.com': 'High',
+             'factcheck.org': 'High',
+             'snopes.com': 'High',
+             'politifact.com': 'High',
+             # Medium credibility
+             'wikipedia.org': 'Medium',
+             'medium.com': 'Medium',
+             'huffpost.com': 'Medium',
+             # Low credibility (known misinformation spreaders)
+             'infowars.com': 'Low',
+             'naturalnews.com': 'Low',
+         }
+
+     def fetch_web_content(self, url: str, timeout: int = 10) -> WebContent:
+         """
+         Fetch and parse web content from a URL.
+
+         Args:
+             url: The URL to fetch
+             timeout: Request timeout in seconds
+
+         Returns:
+             WebContent dataclass with extracted information
+         """
+         timestamp = datetime.now().isoformat()
+
+         if not HAS_BS4:
+             return WebContent(
+                 url=url, title=None, text_content="",
+                 meta_description=None, meta_keywords=[],
+                 links=[], fetch_timestamp=timestamp,
+                 success=False, error="BeautifulSoup not installed"
+             )
+
+         try:
+             try:
+                 response = self.session.get(url, timeout=timeout, allow_redirects=True)
+                 response.raise_for_status()
+             except (requests.exceptions.SSLError, requests.exceptions.ConnectionError):
+                 print(f"[SysCRED] SSL/Connection error for {url}. Retrying without verification...")
171
+ # Suppress warnings for unverified HTTPS request
172
+ import urllib3
173
+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
174
+ response = self.session.get(url, timeout=timeout, allow_redirects=True, verify=False)
175
+ response.raise_for_status()
176
+
177
+ soup = BeautifulSoup(response.text, 'html.parser')
178
+
179
+ # Extract title
180
+ title = soup.title.string.strip() if soup.title else None
181
+
182
+ # Extract meta description
183
+ meta_desc = soup.find('meta', attrs={'name': 'description'})
184
+ meta_description = meta_desc.get('content', '') if meta_desc else None
185
+
186
+ # Extract meta keywords
187
+ meta_kw = soup.find('meta', attrs={'name': 'keywords'})
188
+ meta_keywords = []
189
+ if meta_kw and meta_kw.get('content'):
190
+ meta_keywords = [k.strip() for k in meta_kw.get('content', '').split(',')]
191
+
192
+ # Remove script and style elements
193
+ for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
194
+ element.decompose()
195
+
196
+ # Extract main text content
197
+ text_content = soup.get_text(separator=' ', strip=True)
198
+ # Clean up excessive whitespace
199
+ text_content = re.sub(r'\s+', ' ', text_content)
200
+
201
+ # Extract links
202
+ links = []
203
+ for a_tag in soup.find_all('a', href=True)[:50]: # Limit to 50 links
204
+ href = a_tag['href']
205
+ if href.startswith('http'):
206
+ links.append(href)
207
+
208
+ return WebContent(
209
+ url=url,
210
+ title=title,
211
+ text_content=text_content[:10000], # Limit text size
212
+ meta_description=meta_description,
213
+ meta_keywords=meta_keywords,
214
+ links=links,
215
+ fetch_timestamp=timestamp,
216
+ success=True
217
+ )
218
+
219
+ except requests.exceptions.Timeout:
220
+ return WebContent(
221
+ url=url, title=None, text_content="",
222
+ meta_description=None, meta_keywords=[], links=[],
223
+ fetch_timestamp=timestamp, success=False,
224
+ error=f"Timeout after {timeout}s"
225
+ )
226
+ except requests.exceptions.RequestException as e:
227
+ return WebContent(
228
+ url=url, title=None, text_content="",
229
+ meta_description=None, meta_keywords=[], links=[],
230
+ fetch_timestamp=timestamp, success=False,
231
+ error=str(e)
232
+ )
233
+ except Exception as e:
234
+ return WebContent(
235
+ url=url, title=None, text_content="",
236
+ meta_description=None, meta_keywords=[], links=[],
237
+ fetch_timestamp=timestamp, success=False,
238
+ error=f"Parsing error: {str(e)}"
239
+ )
240
+
241
+ @lru_cache(maxsize=128)
242
+ def whois_lookup(self, url_or_domain: str) -> DomainInfo:
243
+ """
244
+ Perform WHOIS lookup to get domain registration information.
245
+
246
+ Args:
247
+ url_or_domain: URL or domain name
248
+
249
+ Returns:
250
+ DomainInfo dataclass with domain details
251
+ """
252
+ # Extract domain from URL if needed
253
+ if url_or_domain.startswith('http'):
254
+ domain = urlparse(url_or_domain).netloc
255
+ else:
256
+ domain = url_or_domain
257
+
258
+ # Remove 'www.' prefix
259
+ if domain.startswith('www.'):
260
+ domain = domain[4:]
261
+
262
+ if not HAS_WHOIS:
263
+ return DomainInfo(
264
+ domain=domain,
265
+ creation_date=None, expiration_date=None,
266
+ registrar=None, age_days=None,
267
+ success=False, error="python-whois not installed"
268
+ )
269
+
270
+ try:
271
+ w = whois.whois(domain)
272
+
273
+ # Handle creation_date (can be a list or single value)
274
+ creation_date = w.creation_date
275
+ if isinstance(creation_date, list):
276
+ creation_date = creation_date[0]
277
+
278
+ # Handle expiration_date
279
+ expiration_date = w.expiration_date
280
+ if isinstance(expiration_date, list):
281
+ expiration_date = expiration_date[0]
282
+
283
+ # Calculate age in days
284
+ age_days = None
285
+ if creation_date:
286
+ if isinstance(creation_date, datetime):
287
+ age_days = (datetime.now() - creation_date).days
288
+
289
+ return DomainInfo(
290
+ domain=domain,
291
+ creation_date=creation_date,
292
+ expiration_date=expiration_date,
293
+ registrar=w.registrar,
294
+ age_days=age_days,
295
+ success=True
296
+ )
297
+
298
+ except Exception as e:
299
+ return DomainInfo(
300
+ domain=domain,
301
+ creation_date=None, expiration_date=None,
302
+ registrar=None, age_days=None,
303
+ success=False, error=str(e)
304
+ )
305
+
306
+ def google_fact_check(self, query: str, language: str = "fr") -> List[FactCheckResult]:
307
+ """
308
+ Query Google Fact Check Tools API.
309
+
310
+ Args:
311
+ query: The claim or text to check
312
+ language: Language code (default: French)
313
+
314
+ Returns:
315
+ List of FactCheckResult objects
316
+ """
317
+ results = []
318
+
319
+ if not self.google_api_key:
320
+ print("[Info] Google Fact Check API key not configured. Using simulation.")
321
+ return self._simulate_fact_check(query)
322
+
323
+ try:
324
+ api_url = "https://factchecktools.googleapis.com/v1alpha1/claims:search"
325
+ params = {
326
+ 'key': self.google_api_key,
327
+ 'query': query[:200], # API has character limit
328
+ # 'languageCode': language # Removed to allow all languages (e.g. English queries)
329
+ }
330
+
331
+ response = self.session.get(api_url, params=params, timeout=10)
332
+ response.raise_for_status()
333
+ data = response.json()
334
+
335
+ claims = data.get('claims', [])
336
+ for claim in claims[:5]: # Limit to 5 results
337
+ text = claim.get('text', '')
338
+ claimant = claim.get('claimant')
339
+
340
+ for review in claim.get('claimReview', []):
341
+ results.append(FactCheckResult(
342
+ claim=text,
343
+ claimant=claimant,
344
+ rating=review.get('textualRating', 'Unknown'),
345
+ publisher=review.get('publisher', {}).get('name', 'Unknown'),
346
+ url=review.get('url', ''),
347
+ review_date=review.get('reviewDate')
348
+ ))
349
+
350
+ return results
351
+
352
+ except Exception as e:
353
+ print(f"[Warning] Google Fact Check API error: {e}")
354
+ return self._simulate_fact_check(query)
355
+
356
+ def _simulate_fact_check(self, query: str) -> List[FactCheckResult]:
357
+ """Fallback simulation when API is not available."""
358
+ # Check for known misinformation patterns
359
+ misinformation_keywords = [
360
+ 'conspiracy', 'hoax', 'fake', 'miracle cure', 'they don\'t want you to know',
361
+ 'mainstream media lies', 'deep state', 'plandemic'
362
+ ]
363
+
364
+ query_lower = query.lower()
365
+ for keyword in misinformation_keywords:
366
+ if keyword in query_lower:
367
+ return [FactCheckResult(
368
+ claim=f"Text contains potential misinformation marker: '{keyword}'",
369
+ claimant=None,
370
+ rating="Needs Verification",
371
+ publisher="SysCRED Heuristic",
372
+ url="",
373
+ review_date=datetime.now().isoformat()
374
+ )]
375
+
376
+ return [] # No fact checks found
377
+
378
+ @lru_cache(maxsize=128)
379
+ def get_source_reputation(self, url: str) -> str:
380
+ """
381
+ Get reputation score for a source/domain.
382
+
383
+ Args:
384
+ url: URL or domain to check
385
+
386
+ Returns:
387
+ Reputation level: 'High', 'Medium', 'Low', or 'Unknown'
388
+ """
389
+ if url.startswith('http'):
390
+ domain = urlparse(url).netloc
391
+ else:
392
+ domain = url
393
+
394
+ # Remove www prefix
395
+ if domain.startswith('www.'):
396
+ domain = domain[4:]
397
+
398
+ # Check known reputations
399
+ for known_domain, reputation in self.known_reputations.items():
400
+ if domain.endswith(known_domain) or known_domain in domain:
401
+ return reputation
402
+
403
+ # Heuristics for unknown domains
404
+ # Academic domains tend to be more credible
405
+ if domain.endswith('.edu') or domain.endswith('.gov') or domain.endswith('.ac.uk'):
406
+ return 'High'
407
+
408
+ # Personal sites and free hosting are less credible
409
+ if any(x in domain for x in ['.blogspot.', '.wordpress.', '.wix.', '.weebly.']):
410
+ return 'Low'
411
+
412
+ return 'Unknown'
413
+
414
+ def estimate_backlinks(self, url: str) -> Dict[str, Any]:
415
+ """
416
+ Estimate relative authority/backlinks based on available signals.
417
+
418
+ Since real backlink databases (Ahrefs, Moz) are paid/proprietary,
419
+ we use a composite heuristic based on:
420
+ 1. Domain age (older domains tend to have more backlinks)
421
+ 2. Known reputation (High reputation sources imply high backlinks)
422
+ 3. Google Fact Check mentions (as a proxy for visibility in fact-checks)
423
+ """
424
+ domain = urlparse(url).netloc
425
+ if domain.startswith('www.'):
426
+ domain = domain[4:]
427
+
428
+ # 1. Base Score from Reputation
429
+ reputation = self.get_source_reputation(domain)
430
+ base_count = 0
431
+ if reputation == 'High':
432
+ base_count = 10000 # High authority
433
+ elif reputation == 'Medium':
434
+ base_count = 1000 # Medium authority
435
+ elif reputation == 'Low':
436
+ base_count = 50 # Low authority
437
+ else:
438
+ base_count = 100 # Unknown
439
+
440
+ # 2. Multiplier from Domain Age
441
+ age_multiplier = 1.0
442
+ domain_info = self.whois_lookup(domain)
443
+ if domain_info.success and domain_info.age_days:
444
+ # Add 10% for every year of age, max 5x
445
+ years = domain_info.age_days / 365
446
+ age_multiplier = min(5.0, 1.0 + (years * 0.1))
447
+
448
+ estimated_count = int(base_count * age_multiplier)
449
+
450
+ # 3. Adjust for specific TLDs
451
+ if domain.endswith('.edu') or domain.endswith('.gov'):
452
+ estimated_count *= 2
453
+
454
+ return {
455
+ 'estimated_count': estimated_count,
456
+ 'sample_backlinks': [], # Real sample requires SERP API
457
+ 'method': 'heuristic_v2.1',
458
+ 'note': 'Estimated from domain age and reputation (Proxy)'
459
+ }
460
+
461
+ def fetch_external_data(self, input_data: str, fc_query: str = None) -> ExternalData:
462
+ """
463
+ Main method to fetch all external data for credibility analysis.
464
+ This replaces the simulated fetch_external_data function.
465
+
466
+ Args:
467
+ input_data: URL or text to analyze
468
+
469
+ Returns:
470
+ ExternalData with all gathered information
471
+ """
472
+ from urllib.parse import urlparse
473
+
474
+ # Determine if input is URL
475
+ is_url = False
476
+ try:
477
+ result = urlparse(input_data)
478
+ is_url = all([result.scheme, result.netloc])
479
+ except:
480
+ pass
481
+
482
+ # Initialize results
483
+ domain_age_days = None
484
+ domain_info = None
485
+ source_reputation = 'Unknown'
486
+ fact_checks = []
487
+ backlinks_data = {'estimated_count': 0, 'sample_backlinks': []}
488
+
489
+ if is_url:
490
+ # Get domain information
491
+ domain_info = self.whois_lookup(input_data)
492
+ if domain_info.success:
493
+ domain_age_days = domain_info.age_days
494
+
495
+ # Get source reputation
496
+ source_reputation = self.get_source_reputation(input_data)
497
+
498
+ # Get backlink estimation
499
+ backlinks_data = self.estimate_backlinks(input_data)
500
+
501
+ # Perform fact check on the content/URL
502
+ # Use provided query or fall back to input_data
503
+ query_to_use = fc_query if fc_query else input_data
504
+ fact_checks = self.google_fact_check(query_to_use)
505
+
506
+ return ExternalData(
507
+ fact_checks=fact_checks,
508
+ source_reputation=source_reputation,
509
+ domain_age_days=domain_age_days,
510
+ domain_info=domain_info,
511
+ related_articles=[], # TODO: Implement related article search
512
+ backlinks_count=backlinks_data.get('estimated_count', 0),
513
+ backlinks_sample=backlinks_data.get('sample_backlinks', [])
514
+ )
515
+
516
+
517
+ # --- Testing ---
518
+ if __name__ == "__main__":
519
+ print("=== Testing ExternalAPIClients ===\n")
520
+
521
+ client = ExternalAPIClients()
522
+
523
+ # Test 1: Web content fetching
524
+ print("Test 1: Fetching web content from Le Monde...")
525
+ content = client.fetch_web_content("https://www.lemonde.fr")
526
+ print(f" Success: {content.success}")
527
+ print(f" Title: {content.title}")
528
+ print(f" Text length: {len(content.text_content)} chars")
529
+ print(f" Links found: {len(content.links)}")
530
+ print()
531
+
532
+ # Test 2: WHOIS lookup
533
+ print("Test 2: WHOIS lookup for lemonde.fr...")
534
+ domain_info = client.whois_lookup("https://www.lemonde.fr")
535
+ print(f" Success: {domain_info.success}")
536
+ print(f" Domain: {domain_info.domain}")
537
+ print(f" Age: {domain_info.age_days} days")
538
+ print(f" Registrar: {domain_info.registrar}")
539
+ print()
540
+
541
+ # Test 3: Source reputation
542
+ print("Test 3: Source reputation checks...")
543
+ test_urls = [
544
+ "https://www.nytimes.com/article",
545
+ "https://www.infowars.com/post",
546
+ "https://random-blog.wordpress.com"
547
+ ]
548
+ for url in test_urls:
549
+ rep = client.get_source_reputation(url)
550
+ print(f" {url}: {rep}")
551
+ print()
552
+
553
+ # Test 4: Full external data
554
+ print("Test 4: Full external data fetch...")
555
+ external_data = client.fetch_external_data("https://www.bbc.com/news")
556
+ print(f" Source reputation: {external_data.source_reputation}")
557
+ print(f" Domain age: {external_data.domain_age_days} days")
558
+ print(f" Fact checks found: {len(external_data.fact_checks)}")
559
+
560
+ print("\n=== Tests Complete ===")
syscred/backend_app.py ADDED
@@ -0,0 +1,363 @@
# -*- coding: utf-8 -*-
"""
SysCRED Backend API - Flask Server
===================================
REST API for the credibility verification system.

Endpoints:
- POST /api/verify - Verify URL or text credibility
- POST /api/seo - Get SEO analysis only
- GET /api/ontology/stats - Get ontology statistics
- GET /api/health - Health check
- GET /api/config - View current configuration

(c) Dominique S. Loyer - PhD Thesis Prototype
"""

import sys
import os
import traceback
from flask import Flask, request, jsonify, send_from_directory
from flask_cors import CORS

# Add the syscred package to the path
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

# Import SysCRED modules
try:
    from syscred.verification_system import CredibilityVerificationSystem
    from syscred.seo_analyzer import SEOAnalyzer
    from syscred.ontology_manager import OntologyManager
    from syscred.config import config, Config
    from syscred.database import init_db, db, AnalysisResult
    SYSCRED_AVAILABLE = True
    print("[SysCRED Backend] Modules imported successfully")
except ImportError as e:
    SYSCRED_AVAILABLE = False
    print(f"[SysCRED Backend] Warning: Could not import modules: {e}")

    # Define a dummy init_db to prevent a crash
    def init_db(app): pass

    # Fallback config
    class Config:
        HOST = "0.0.0.0"
        PORT = 5000
        DEBUG = True
        ONTOLOGY_BASE_PATH = None
        ONTOLOGY_DATA_PATH = None
        LOAD_ML_MODELS = True
        GOOGLE_FACT_CHECK_API_KEY = None
    config = Config()

# --- Initialize Flask App ---
app = Flask(__name__)
CORS(app)  # Enable CORS for the frontend

# Initialize the database
try:
    init_db(app)  # [NEW] Set up the DB connection
except Exception as e:
    print(f"[SysCRED Backend] Warning: DB init failed: {e}")

# --- Initialize SysCRED System ---
credibility_system = None
seo_analyzer = None

def initialize_system():
    """Initialize the credibility system (lazy loading)."""
    global credibility_system, seo_analyzer

    if not SYSCRED_AVAILABLE:
        print("[SysCRED Backend] Cannot initialize - modules not available")
        return False

    try:
        # Initialize the SEO analyzer (lightweight)
        seo_analyzer = SEOAnalyzer()
        print("[SysCRED Backend] SEO Analyzer initialized")

        # Initialize the full system (loading the ML models may take time)
        print("[SysCRED Backend] Initializing credibility system (loading ML models)...")
        ontology_base = str(config.ONTOLOGY_BASE_PATH) if config.ONTOLOGY_BASE_PATH else None
        ontology_data = str(config.ONTOLOGY_DATA_PATH) if config.ONTOLOGY_DATA_PATH else None
        credibility_system = CredibilityVerificationSystem(
            ontology_base_path=ontology_base if ontology_base and os.path.exists(ontology_base) else None,
            ontology_data_path=ontology_data,
            load_ml_models=config.LOAD_ML_MODELS,
            google_api_key=config.GOOGLE_FACT_CHECK_API_KEY
        )
        print("[SysCRED Backend] System initialized successfully!")
        return True

    except Exception as e:
        print(f"[SysCRED Backend] Error initializing system: {e}")
        traceback.print_exc()
        return False

# --- API Routes ---

@app.route('/')
def index():
    """Serve the frontend."""
    return send_from_directory('static', 'index.html')


@app.route('/api/health', methods=['GET'])
def health_check():
    """Health check endpoint."""
    return jsonify({
        'status': 'healthy',
        'syscred_available': SYSCRED_AVAILABLE,
        'system_initialized': credibility_system is not None,
        'seo_analyzer_ready': seo_analyzer is not None
    })


@app.route('/api/verify', methods=['POST'])
def verify_endpoint():
    """
    Main verification endpoint.

    Request JSON:
    {
        "input_data": "URL or text to verify",
        "include_seo": true/false (optional, default true),
        "include_pagerank": true/false (optional, default true)
    }
    """
    global credibility_system

    # Lazy initialization
    if credibility_system is None:
        if not initialize_system():
            return jsonify({
                'error': 'System initialization failed. Check server logs.'
            }), 503

    # Validate the request
    if not request.is_json:
        return jsonify({'error': 'Request must be JSON'}), 400

    data = request.get_json()
    input_data = data.get('input_data', '').strip()

    if not input_data:
        return jsonify({'error': "'input_data' is required"}), 400

    include_seo = data.get('include_seo', True)
    include_pagerank = data.get('include_pagerank', True)

    print(f"[SysCRED Backend] Verifying: {input_data[:100]}...")

    try:
        # Run the main verification
        result = credibility_system.verify_information(input_data)

        if 'error' in result:
            return jsonify(result), 400

        # Add SEO analysis if requested and the input is a URL
        if include_seo and credibility_system.is_url(input_data):
            try:
                web_content = credibility_system.api_clients.fetch_web_content(input_data)
                if web_content.success:
                    seo_result = seo_analyzer.analyze_seo(
                        url=input_data,
                        title=web_content.title,
                        meta_description=web_content.meta_description,
                        text_content=web_content.text_content
                    )
                    result['seoAnalysis'] = {
                        'titleLength': seo_result.title_length,
                        'titleHasKeywords': seo_result.title_has_keywords,
                        'metaDescriptionLength': seo_result.meta_description_length,
                        'wordCount': seo_result.word_count,
                        'readabilityScore': round(seo_result.readability_score, 2),
                        'seoScore': round(seo_result.seo_score, 2),
                        'topKeywords': list(seo_result.keyword_density.keys())
                    }
            except Exception as e:
                print(f"[SysCRED Backend] SEO analysis error: {e}")
                result['seoAnalysis'] = {'error': str(e)}

        # Add PageRank estimation if requested
        if include_pagerank and credibility_system.is_url(input_data):
            try:
                external_data = credibility_system.api_clients.fetch_external_data(input_data)
                pr_result = seo_analyzer.estimate_pagerank(
                    url=input_data,
                    domain_age_days=external_data.domain_age_days,
                    source_reputation=external_data.source_reputation
                )
                result['pageRankEstimation'] = {
                    'estimatedPR': round(pr_result.estimated_pr, 3),
                    'confidence': round(pr_result.confidence, 2),
                    'factors': pr_result.factors,
                    'explanation': pr_result.explanation_text
                }
            except Exception as e:
                print(f"[SysCRED Backend] PageRank estimation error: {e}")
                result['pageRankEstimation'] = {'error': str(e)}

        print(f"[SysCRED Backend] Score: {result.get('scoreCredibilite', 'N/A')}")

        # [NEW] Persist to the database
        try:
            new_analysis = AnalysisResult(
                url=input_data[:500],
                credibility_score=result.get('scoreCredibilite', 0.5),
                summary=result.get('resumeAnalyse', ''),
                source_reputation=result.get('detailsScore', {}).get('factors', [{}])[0].get('value')
            )
            db.session.add(new_analysis)
            db.session.commit()
            print(f"[SysCRED-DB] Result saved. ID: {new_analysis.id}")
        except Exception as e:
            print(f"[SysCRED-DB] Save failed: {e}")

        return jsonify(result), 200

    except Exception as e:
        print(f"[SysCRED Backend] Error: {e}")
        traceback.print_exc()
        return jsonify({'error': f'Internal error: {str(e)}'}), 500


@app.route('/api/seo', methods=['POST'])
def seo_endpoint():
    """
    SEO-only analysis endpoint (faster, no ML models needed).

    Request JSON:
    {
        "url": "URL to analyze"
    }
    """
    global seo_analyzer

    if seo_analyzer is None:
        seo_analyzer = SEOAnalyzer()

    if not request.is_json:
        return jsonify({'error': 'Request must be JSON'}), 400

    data = request.get_json()
    url = data.get('url', '').strip()

    if not url or not url.startswith('http'):
        return jsonify({'error': 'Valid URL is required'}), 400

    try:
        # Fetch the content
        from syscred.api_clients import ExternalAPIClients
        api_client = ExternalAPIClients()

        web_content = api_client.fetch_web_content(url)
        if not web_content.success:
            return jsonify({'error': f'Failed to fetch URL: {web_content.error}'}), 400

        # SEO analysis
        seo_result = seo_analyzer.analyze_seo(
            url=url,
            title=web_content.title,
            meta_description=web_content.meta_description,
            text_content=web_content.text_content
        )

        # IR metrics
        ir_metrics = seo_analyzer.get_ir_metrics(web_content.text_content)

        # PageRank estimation
        external_data = api_client.fetch_external_data(url)
        pr_result = seo_analyzer.estimate_pagerank(
            url=url,
            domain_age_days=external_data.domain_age_days,
            source_reputation=external_data.source_reputation
        )

        return jsonify({
            'url': url,
            'title': web_content.title,
            'seo': {
                'titleLength': seo_result.title_length,
                'metaDescriptionLength': seo_result.meta_description_length,
                'wordCount': seo_result.word_count,
                'readabilityScore': round(seo_result.readability_score, 2),
                'seoScore': round(seo_result.seo_score, 2),
                'keywordDensity': seo_result.keyword_density
            },
            'irMetrics': {
                'documentLength': ir_metrics.document_length,
                'topTerms': ir_metrics.top_terms[:5],
                'avgTermFrequency': round(ir_metrics.avg_term_frequency, 4)
            },
            'pageRank': {
                'estimated': round(pr_result.estimated_pr, 3),
                'confidence': round(pr_result.confidence, 2),
                'factors': pr_result.factors
            },
            'domain': {
                'reputation': external_data.source_reputation,
                'ageDays': external_data.domain_age_days
            }
        }), 200

    except Exception as e:
        print(f"[SysCRED Backend] SEO endpoint error: {e}")
        traceback.print_exc()
        return jsonify({'error': str(e)}), 500


@app.route('/api/ontology/graph', methods=['GET'])
def ontology_graph():
    """Get ontology graph data for D3.js."""
    global credibility_system

    if credibility_system and credibility_system.ontology_manager:
        graph_data = credibility_system.ontology_manager.get_graph_json()
        return jsonify(graph_data), 200
    else:
        # Return an empty graph rather than a 400 to avoid breaking the frontend
        return jsonify({'nodes': [], 'links': []}), 200


@app.route('/api/ontology/stats', methods=['GET'])
def ontology_stats():
    """Get ontology statistics."""
    global credibility_system

    if credibility_system and credibility_system.ontology_manager:
        stats = credibility_system.ontology_manager.get_statistics()
        return jsonify(stats), 200
    else:
        return jsonify({
            'error': 'Ontology not loaded',
            'base_triples': 0,
            'data_triples': 0
        }), 200


# --- Main ---
if __name__ == '__main__':
    print("=" * 60)
    print("SysCRED Backend API Server")
    print("(c) Dominique S. Loyer - PhD Thesis Prototype")
    print("=" * 60)
    print()

    # Initialize the system at startup
    print("[SysCRED Backend] Pre-initializing system...")
    initialize_system()

    print()
    print("[SysCRED Backend] Starting Flask server...")
    print("[SysCRED Backend] Endpoints:")
    print("  - POST /api/verify         - Full credibility verification")
    print("  - POST /api/seo            - SEO analysis only (faster)")
    print("  - GET  /api/ontology/stats - Ontology statistics")
    print("  - GET  /api/health         - Health check")
    print()

    app.run(host='0.0.0.0', port=5001, debug=True)
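The `/api/verify` contract above can be exercised from a small client. A sketch of the request body the endpoint expects (`build_verify_payload` is a hypothetical helper; the field names and validation rule come from the endpoint code):

```python
import json

def build_verify_payload(input_data: str, include_seo: bool = True,
                         include_pagerank: bool = True) -> str:
    """Build the JSON body expected by POST /api/verify, mirroring its validation."""
    input_data = (input_data or "").strip()
    if not input_data:
        # The endpoint answers 400 with this message when input_data is empty
        raise ValueError("'input_data' is required")
    return json.dumps({
        "input_data": input_data,
        "include_seo": include_seo,
        "include_pagerank": include_pagerank,
    })

print(build_verify_payload("https://www.bbc.com/news"))
```

Sending this body with `Content-Type: application/json` to `/api/verify` should return the credibility result, including the `seoAnalysis` and `pageRankEstimation` blocks when the input is a URL.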
syscred/benchmark_data.json ADDED
@@ -0,0 +1,92 @@
[
  {
    "url": "https://www.lemonde.fr",
    "label": "High",
    "expected_score_range": [0.7, 1.0],
    "category": "News (General)"
  },
  {
    "url": "https://www.bbc.com",
    "label": "High",
    "expected_score_range": [0.7, 1.0],
    "category": "News (International)"
  },
  {
    "url": "https://www.nature.com",
    "label": "High",
    "expected_score_range": [0.8, 1.0],
    "category": "Science"
  },
  {
    "url": "https://www.who.int",
    "label": "High",
    "expected_score_range": [0.8, 1.0],
    "category": "Health/Institution"
  },
  {
    "url": "https://www.reuters.com",
    "label": "High",
    "expected_score_range": [0.7, 1.0],
    "category": "News (Agency)"
  },
  {
    "url": "https://www.infowars.com",
    "label": "Low",
    "expected_score_range": [0.0, 0.4],
    "category": "Conspiracy"
  },
  {
    "url": "https://www.naturalnews.com",
    "label": "Low",
    "expected_score_range": [0.0, 0.4],
    "category": "Pseudoscience"
  },
  {
    "url": "https://truthsocial.com",
    "label": "Low",
    "expected_score_range": [0.0, 0.5],
    "category": "Social/Biased"
  },
  {
    "url": "https://www.theflatearthsociety.org",
    "label": "Low",
    "expected_score_range": [0.0, 0.4],
    "category": "Conspiracy"
  },
  {
    "url": "https://beforeitsnews.com",
    "label": "Low",
    "expected_score_range": [0.0, 0.4],
    "category": "Fake News"
  }
]
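Each benchmark entry pairs a URL with an expected score interval, so a benchmark run reduces to checking the system's score against `expected_score_range`. A minimal checker over that schema (`in_expected_range` is an illustrative helper, not part of the repo; the sample entry is copied from the file above):

```python
import json

# One entry copied from benchmark_data.json above
sample = json.loads('''
[{"url": "https://www.lemonde.fr",
  "label": "High",
  "expected_score_range": [0.7, 1.0],
  "category": "News (General)"}]
''')

def in_expected_range(entry: dict, score: float) -> bool:
    """Return True if a credibility score falls inside the entry's expected interval."""
    lo, hi = entry["expected_score_range"]
    return lo <= score <= hi

print(in_expected_range(sample[0], 0.85))  # 0.85 is inside [0.7, 1.0]
print(in_expected_range(sample[0], 0.50))  # 0.50 is below the interval
```

In a real benchmark run, `score` would come from `CredibilityVerificationSystem.verify_information` for each `url`.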
syscred/config.py ADDED
@@ -0,0 +1,291 @@
# -*- coding: utf-8 -*-
"""
SysCRED Configuration
=====================
Centralized configuration for the credibility verification system.

Usage:
    from syscred.config import Config

    # Access settings
    config = Config()
    port = config.PORT

    # Or via environment variables
    # export SYSCRED_GOOGLE_API_KEY=your_key
    # export SYSCRED_PORT=8080

(c) Dominique S. Loyer - PhD Thesis Prototype
"""

import os
from pathlib import Path
from typing import Dict, Optional
from dotenv import load_dotenv

# Load variables from the project-root .env
# Path: .../systemFactChecking/02_Code/syscred/config.py
# Root .env is at .../systemFactChecking/.env (3 levels up)
current_path = Path(__file__).resolve()
env_path = current_path.parent.parent.parent / '.env'

if not env_path.exists():
    print(f"[Config] WARNING: .env not found at {env_path}")
    # Try an alternate location (sometimes the CWD matters)
    env_path = Path.cwd().parent / '.env'

load_dotenv(dotenv_path=env_path)
print(f"[Config] Loading .env from {env_path}")
print(f"[Config] SYSCRED_GOOGLE_API_KEY loaded: {'Yes' if os.environ.get('SYSCRED_GOOGLE_API_KEY') else 'No'}")


class Config:
    """
    Centralized configuration for SysCRED.

    Values can be overridden by environment variables prefixed with SYSCRED_.
    """

    # === Paths ===
    BASE_DIR = Path(__file__).parent.parent
    ONTOLOGY_BASE_PATH = BASE_DIR / "sysCRED_onto26avrtil.ttl"
    ONTOLOGY_DATA_PATH = BASE_DIR / "ontology" / "sysCRED_data.ttl"

    # === Flask server ===
    HOST = os.getenv("SYSCRED_HOST", "0.0.0.0")
    PORT = int(os.getenv("SYSCRED_PORT", "5000"))
    DEBUG = os.getenv("SYSCRED_DEBUG", "true").lower() == "true"

    # === API keys ===
    GOOGLE_FACT_CHECK_API_KEY = os.getenv("SYSCRED_GOOGLE_API_KEY")
    DATABASE_URL = os.getenv("DATABASE_URL")  # [NEW] Read the DB URL from the env

    # === ML models ===
    # Support both SYSCRED_LOAD_ML and SYSCRED_LOAD_ML_MODELS (for Render)
    LOAD_ML_MODELS = os.getenv("SYSCRED_LOAD_ML_MODELS", os.getenv("SYSCRED_LOAD_ML", "true")).lower() == "true"
    SENTIMENT_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"
    NER_MODEL = "dbmdz/bert-large-cased-finetuned-conll03-english"

    # === Timeouts ===
    WEB_FETCH_TIMEOUT = int(os.getenv("SYSCRED_TIMEOUT", "10"))

    # === TREC IR configuration (NEW - Feb 2026) ===
    TREC_INDEX_PATH = os.getenv("SYSCRED_TREC_INDEX", None)    # Lucene/Pyserini index
    TREC_CORPUS_PATH = os.getenv("SYSCRED_TREC_CORPUS", None)  # JSONL corpus
    TREC_TOPICS_PATH = os.getenv("SYSCRED_TREC_TOPICS", None)  # Topics directory
    TREC_QRELS_PATH = os.getenv("SYSCRED_TREC_QRELS", None)    # Qrels directory

    # BM25 parameters (optimized on AP88-90)
    BM25_K1 = float(os.getenv("SYSCRED_BM25_K1", "0.9"))
    BM25_B = float(os.getenv("SYSCRED_BM25_B", "0.4"))

    # PRF (pseudo-relevance feedback) settings
    ENABLE_PRF = os.getenv("SYSCRED_ENABLE_PRF", "true").lower() == "true"
    PRF_TOP_DOCS = int(os.getenv("SYSCRED_PRF_TOP_DOCS", "3"))
    PRF_EXPANSION_TERMS = int(os.getenv("SYSCRED_PRF_TERMS", "10"))
89
+
90
+ # === Pondération des scores ===
91
+ SCORE_WEIGHTS = {
92
+ 'source_reputation': 0.25,
93
+ 'domain_age': 0.10,
94
+ 'sentiment_neutrality': 0.15,
95
+ 'entity_presence': 0.15,
96
+ 'coherence': 0.15,
97
+ 'fact_check': 0.20
98
+ }
99
+
100
+ # === Seuils de crédibilité ===
101
+ CREDIBILITY_THRESHOLDS = {
102
+ 'HIGH': 0.7,
103
+ 'MEDIUM': 0.4,
104
+ 'LOW': 0.0
105
+ }
106
+
107
+ # === Base de données de réputation ===
108
+ # Les sources peuvent être étendues ou chargées d'un fichier externe
109
+ SOURCE_REPUTATIONS: Dict[str, str] = {
110
+ # === HAUTE CRÉDIBILITÉ ===
111
+ # Médias internationaux
112
+ 'lemonde.fr': 'High',
113
+ 'nytimes.com': 'High',
114
+ 'reuters.com': 'High',
115
+ 'bbc.com': 'High',
116
+ 'bbc.co.uk': 'High',
117
+ 'theguardian.com': 'High',
118
+ 'apnews.com': 'High',
119
+ 'afp.com': 'High',
120
+ 'france24.com': 'High',
121
+
122
+ # Médias canadiens
123
+ 'cbc.ca': 'High',
124
+ 'radio-canada.ca': 'High',
125
+ 'lapresse.ca': 'High',
126
+ 'ledevoir.com': 'High',
127
+ 'theglobeandmail.com': 'High',
128
+
129
+ # Sources académiques
130
+ 'nature.com': 'High',
131
+ 'sciencedirect.com': 'High',
132
+ 'scholar.google.com': 'High',
133
+ 'pubmed.ncbi.nlm.nih.gov': 'High',
134
+ 'jstor.org': 'High',
135
+ 'springer.com': 'High',
136
+ 'ieee.org': 'High',
137
+ 'acm.org': 'High',
138
+ 'arxiv.org': 'High',
139
+
140
+ # Fact-checkers
141
+ 'factcheck.org': 'High',
142
+ 'snopes.com': 'High',
143
+ 'politifact.com': 'High',
144
+ 'fullfact.org': 'High',
145
+ 'checknews.fr': 'High',
146
+
147
+ # Institutions
148
+ 'who.int': 'High',
149
+ 'un.org': 'High',
150
+ 'europa.eu': 'High',
151
+ 'canada.ca': 'High',
152
+ 'gouv.fr': 'High',
153
+ 'gouv.qc.ca': 'High',
154
+
155
+ # === CRÉDIBILITÉ MOYENNE ===
156
+ 'wikipedia.org': 'Medium',
157
+ 'medium.com': 'Medium',
158
+ 'huffpost.com': 'Medium',
159
+ 'buzzfeed.com': 'Medium',
160
+ 'vice.com': 'Medium',
161
+ 'slate.com': 'Medium',
162
+ 'theconversation.com': 'Medium',
163
+
164
+ # === BASSE CRÉDIBILITÉ ===
165
+ 'infowars.com': 'Low',
166
+ 'naturalnews.com': 'Low',
167
+ 'breitbart.com': 'Low',
168
+ 'dailystormer.su': 'Low',
169
+ 'beforeitsnews.com': 'Low',
170
+ 'worldtruth.tv': 'Low',
171
+ 'yournewswire.com': 'Low',
172
+ }
173
+
174
+ # === Patterns de mésinformation ===
175
+ MISINFORMATION_KEYWORDS = [
176
+ 'conspiracy', 'hoax', 'fake news', 'miracle cure',
177
+ "they don't want you to know", 'mainstream media lies',
178
+ 'deep state', 'plandemic', 'wake up sheeple',
179
+ 'big pharma cover-up', 'government conspiracy',
180
+ 'censored truth', 'what they hide'
181
+ ]
182
+
183
+ @classmethod
184
+ def load_external_reputations(cls, filepath: str) -> None:
185
+ """
186
+ Charger des réputations supplémentaires depuis un fichier JSON.
187
+
188
+ Args:
189
+ filepath: Chemin vers le fichier JSON avec format:
190
+ {"domain.com": "High", "autre.com": "Low"}
191
+ """
192
+ import json
193
+ try:
194
+ with open(filepath, 'r') as f:
195
+ external_reps = json.load(f)
196
+ cls.SOURCE_REPUTATIONS.update(external_reps)
197
+ print(f"[Config] Loaded {len(external_reps)} external reputations")
198
+ except Exception as e:
199
+ print(f"[Config] Could not load external reputations: {e}")
200
+
201
+ @classmethod
202
+ def update_weights(cls, new_weights: Dict[str, float]) -> None:
203
+ """
204
+ Mettre à jour les pondérations des scores.
205
+
206
+ Args:
207
+ new_weights: Dictionnaire avec les nouvelles pondérations
208
+ """
209
+ cls.SCORE_WEIGHTS.update(new_weights)
210
+ # Normaliser pour que la somme = 1
211
+ total = sum(cls.SCORE_WEIGHTS.values())
212
+ cls.SCORE_WEIGHTS = {k: v/total for k, v in cls.SCORE_WEIGHTS.items()}
213
+ print(f"[Config] Updated weights: {cls.SCORE_WEIGHTS}")
214
+
215
+ @classmethod
216
+ def to_dict(cls) -> Dict:
217
+ """Exporter la configuration actuelle en dictionnaire."""
218
+ return {
219
+ 'host': cls.HOST,
220
+ 'port': cls.PORT,
221
+ 'debug': cls.DEBUG,
222
+ 'google_api_configured': cls.GOOGLE_FACT_CHECK_API_KEY is not None,
223
+ 'ml_models_enabled': cls.LOAD_ML_MODELS,
224
+ 'score_weights': cls.SCORE_WEIGHTS,
225
+ 'known_sources_count': len(cls.SOURCE_REPUTATIONS),
226
+ 'ontology_base': str(cls.ONTOLOGY_BASE_PATH),
227
+ 'ontology_data': str(cls.ONTOLOGY_DATA_PATH),
228
+ }
229
+
230
+ @classmethod
231
+ def print_config(cls) -> None:
232
+ """Afficher la configuration actuelle."""
233
+ print("=" * 50)
234
+ print("SysCRED Configuration")
235
+ print("=" * 50)
236
+ for key, value in cls.to_dict().items():
237
+ print(f" {key}: {value}")
238
+ print("=" * 50)
239
+
240
+
241
+ # === Configuration par environnement ===
242
+
243
+ class DevelopmentConfig(Config):
244
+ """Configuration pour développement local."""
245
+ DEBUG = True
246
+ LOAD_ML_MODELS = True
247
+
248
+
249
+ class ProductionConfig(Config):
250
+ """Configuration pour production."""
251
+ DEBUG = False
252
+ LOAD_ML_MODELS = True
253
+ HOST = "0.0.0.0"
254
+
255
+
256
+ class TestingConfig(Config):
257
+ """Configuration pour tests."""
258
+ DEBUG = True
259
+ LOAD_ML_MODELS = False # Plus rapide pour les tests
260
+ WEB_FETCH_TIMEOUT = 5
261
+
262
+
263
+ # Sélection automatique de la configuration
264
+ def get_config() -> Config:
265
+ """
266
+ Retourne la configuration appropriée selon l'environnement.
267
+
268
+ Variable d'environnement: SYSCRED_ENV (development, production, testing)
269
+ """
270
+ env = os.getenv("SYSCRED_ENV", "development").lower()
271
+
272
+ configs = {
273
+ 'development': DevelopmentConfig,
274
+ 'production': ProductionConfig,
275
+ 'testing': TestingConfig,
276
+ }
277
+
278
+ return configs.get(env, DevelopmentConfig)
279
+
280
+
281
+ # Instance par défaut
282
+ config = get_config()
283
+
284
+
285
+ if __name__ == "__main__":
286
+ # Test de la configuration
287
+ config.print_config()
288
+
289
+ print("\n=== Source Reputations Sample ===")
290
+ for domain, rep in list(config.SOURCE_REPUTATIONS.items())[:10]:
291
+ print(f" {domain}: {rep}")
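The renormalization performed by `Config.update_weights` can be sketched standalone; the weight names and values below are hypothetical, not the shipped defaults:

```python
# Mimic Config.update_weights: merge new weights, then rescale so they sum to 1.
weights = {'source_reputation': 0.25, 'fact_check': 0.20, 'coherence': 0.15}
weights.update({'fact_check': 0.40})  # raise one component
total = sum(weights.values())         # 0.25 + 0.40 + 0.15 = 0.80
weights = {k: v / total for k, v in weights.items()}
print(weights)  # each value divided by 0.80; the dict now sums to 1.0
```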
syscred/database.py ADDED
@@ -0,0 +1,54 @@
+ # -*- coding: utf-8 -*-
+ """
+ Database Manager for SysCRED
+ ============================
+ Handles connection to Supabase (PostgreSQL) and defines models.
+ """
+
+ import os
+ from flask_sqlalchemy import SQLAlchemy
+ from datetime import datetime
+
+ # Initialize SQLAlchemy
+ db = SQLAlchemy()
+
+ class AnalysisResult(db.Model):
+     """Stores the result of a credibility analysis."""
+     __tablename__ = 'analysis_results'
+
+     id = db.Column(db.Integer, primary_key=True)
+     url = db.Column(db.String(500), nullable=False)
+     credibility_score = db.Column(db.Float, nullable=False)
+     summary = db.Column(db.Text)
+     created_at = db.Column(db.DateTime, default=datetime.utcnow)
+
+     # Metadata stored as JSON if supported, or simplified columns
+     source_reputation = db.Column(db.String(50))
+     fact_check_count = db.Column(db.Integer, default=0)
+
+     def to_dict(self):
+         return {
+             'id': self.id,
+             'url': self.url,
+             'score': self.credibility_score,
+             'summary': self.summary,
+             'created_at': self.created_at.isoformat(),
+             'source_reputation': self.source_reputation
+         }
+
+ def init_db(app):
+     """Initialize the database with the Flask app."""
+     # Fall back to SQLite for local dev if no DATABASE_URL
+     db_url = os.environ.get('DATABASE_URL')
+     if db_url and db_url.startswith("postgres://"):
+         db_url = db_url.replace("postgres://", "postgresql://", 1)
+
+     app.config['SQLALCHEMY_DATABASE_URI'] = db_url or 'sqlite:///syscred.db'
+     app.config['SQLALCHEMY_TRACK_MODIFICATIONS'] = False
+
+     db.init_app(app)
+
+     # Create tables if they don't exist (basic migration)
+     with app.app_context():
+         db.create_all()
+         print("[SysCRED-DB] Database tables initialized.")
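The URL rewrite in `init_db` exists because SQLAlchemy no longer accepts the legacy `postgres://` scheme that some hosting providers emit. A minimal standalone sketch of that fallback logic (function name and sample URLs are hypothetical):

```python
from typing import Optional

def normalize_db_url(db_url: Optional[str]) -> str:
    """Rewrite a legacy postgres:// URL for SQLAlchemy,
    falling back to a local SQLite file when none is set."""
    if db_url and db_url.startswith("postgres://"):
        db_url = db_url.replace("postgres://", "postgresql://", 1)
    return db_url or 'sqlite:///syscred.db'

print(normalize_db_url("postgres://user:pw@host:5432/db"))
# -> postgresql://user:pw@host:5432/db
print(normalize_db_url(None))
# -> sqlite:///syscred.db
```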
syscred/debug_factcheck.py ADDED
@@ -0,0 +1,43 @@
+ import os
+ import requests
+ from dotenv import load_dotenv
+
+ # Load environment variables
+ load_dotenv(dotenv_path='/Users/bk280625/documents041025/MonCode/syscred/.env')
+
+ API_KEY = os.getenv('SYSCRED_GOOGLE_API_KEY')
+ # Guard the slices: API_KEY may be None before the check below
+ print(f"Loaded API Key: {API_KEY[:5] + '...' + API_KEY[-5:] if API_KEY else 'None'}")
+
+ if not API_KEY:
+     print("❌ Error: API Key not found in .env")
+     exit(1)
+
+ query = "La terre est plate"
+ url = "https://factchecktools.googleapis.com/v1alpha1/claims:search"
+ params = {
+     'key': API_KEY,
+     'query': query,
+ }
+
+ print(f"\nSending request for query: '{query}'...")
+ try:
+     response = requests.get(url, params=params)
+     print(f"Status Code: {response.status_code}")
+
+     if response.status_code == 200:
+         data = response.json()
+         claims = data.get('claims', [])
+         print(f"✅ Success! Found {len(claims)} claims.")
+         for i, claim in enumerate(claims[:3]):
+             print(f"\n--- Result {i+1} ---")
+             print(f"Claim: {claim.get('text')}")
+             print(f"Claimant: {claim.get('claimant')}")
+             reviews = claim.get('claimReview', [])
+             if reviews:
+                 print(f"Rating: {reviews[0].get('textualRating')}")
+                 print(f"URL: {reviews[0].get('url')}")
+     else:
+         print(f"❌ API Error: {response.text}")
+
+ except Exception as e:
+     print(f"❌ Connection Error: {e}")
syscred/debug_graph_json.py ADDED
@@ -0,0 +1,58 @@
+ import sys
+ from pathlib import Path
+ import json
+
+ # Add project root to path (one level up from this script)
+ sys.path.append(str(Path(__file__).parent.parent))
+
+ from syscred.ontology_manager import OntologyManager
+ from syscred.config import config
+
+ def debug_graph():
+     print("=== Debugging Ontology Graph Extraction ===")
+
+     # Initialize manager
+     base_path = str(config.ONTOLOGY_BASE_PATH)
+     data_path = str(config.ONTOLOGY_DATA_PATH)
+
+     print(f"Loading data from: {data_path}")
+     manager = OntologyManager(base_ontology_path=base_path, data_path=data_path)
+
+     # Get stats
+     stats = manager.get_statistics()
+     print(f"Total Triples: {stats['total_triples']}")
+     print(f"Evaluations: {stats.get('total_evaluations', 'N/A')}")
+
+     # Try getting graph JSON
+     print("\nExtracting Graph JSON...")
+     graph_data = manager.get_graph_json()
+
+     nodes = graph_data.get('nodes', [])
+     links = graph_data.get('links', [])
+
+     print(f"Nodes found: {len(nodes)}")
+     print(f"Links found: {len(links)}")
+
+     if len(nodes) > 0:
+         print("\n--- Sample Nodes ---")
+         for n in nodes[:3]:
+             print(json.dumps(n, indent=2))
+     else:
+         print("\n❌ No nodes found! Checking latest report query...")
+         # Manually run the query to see what's wrong
+         query = """
+         PREFIX cred: <http://www.dic9335.uqam.ca/ontologies/credibility-verification#>
+         SELECT ?report ?timestamp WHERE {
+             ?report a cred:RapportEvaluation .
+             ?report cred:completionTimestamp ?timestamp .
+         }
+         ORDER BY DESC(?timestamp)
+         LIMIT 5
+         """
+         print(f"Running SPARQL:\n{query}")
+         results = manager.data_graph.query(query)
+         for row in results:
+             print(f"Found Report: {row.report} at {row.timestamp}")
+
+ if __name__ == "__main__":
+     debug_graph()
syscred/debug_init.py ADDED
@@ -0,0 +1,33 @@
+ import sys
+ import os
+ import traceback
+
+ # Setup path
+ sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+
+ from syscred.verification_system import CredibilityVerificationSystem
+ from syscred.config import config
+ from syscred.seo_analyzer import SEOAnalyzer
+
+ print("=== DEBUG INITIALIZATION ===")
+ try:
+     print("[1] Config check:")
+     print(f"  Base Ontology: {config.ONTOLOGY_BASE_PATH}")
+     print(f"  Data Path: {config.ONTOLOGY_DATA_PATH}")
+
+     print("\n[2] Initializing SEO Analyzer...")
+     seo = SEOAnalyzer()
+     print("  OK")
+
+     print("\n[3] Initializing Verification System...")
+     system = CredibilityVerificationSystem(  # avoid shadowing the imported sys module
+         ontology_base_path=config.ONTOLOGY_BASE_PATH,
+         ontology_data_path=config.ONTOLOGY_DATA_PATH,
+         load_ml_models=False  # Disable ML for basic init test
+     )
+     print("  OK - System initialized successfully.")
+
+ except Exception as e:
+     print(f"\n❌ FATAL ERROR: {e}")
+     traceback.print_exc()
syscred/debug_local_server.py ADDED
@@ -0,0 +1,25 @@
+ import requests
+ import json
+
+ url = "http://localhost:5001/api/verify"
+ payload = {
+     "input_data": "la terre est plate",
+     "include_seo": True
+ }
+ headers = {'Content-Type': 'application/json'}
+
+ try:
+     print(f"Sending POST to {url} with payload: {payload}")
+     response = requests.post(url, json=payload, headers=headers)
+     print(f"Status: {response.status_code}")
+
+     if response.status_code == 200:
+         data = response.json()
+         print("\n--- JSON RESPONSE PARTIAL ---")
+         facts = data.get('reglesAppliquees', {}).get('fact_checking', [])
+         print(f"Fact Checks Count: {len(facts)}")
+         print("Fact Checks Items:", json.dumps(facts, indent=2, ensure_ascii=False))
+     else:
+         print("Error:", response.text)
+ except Exception as e:
+     print(f"Connection failed: {e}")
syscred/diagnose_imports.py ADDED
@@ -0,0 +1,37 @@
+ import sys
+ import os
+ import traceback
+
+ sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+
+ print("--- DIAGNOSTIC START ---")
+ try:
+     print("[1] Importing config...")
+     from syscred.config import config
+     print("  OK")
+ except Exception:
+     traceback.print_exc()
+
+ try:
+     print("[2] Importing database...")
+     from syscred.database import init_db
+     print("  OK")
+ except Exception:
+     traceback.print_exc()
+
+ try:
+     print("[3] Importing ontology_manager...")
+     from syscred.ontology_manager import OntologyManager
+     print("  OK")
+ except Exception:
+     traceback.print_exc()
+
+ try:
+     print("[4] Importing verification_system...")
+     from syscred.verification_system import CredibilityVerificationSystem
+     print("  OK")
+ except Exception:
+     traceback.print_exc()
+
+ print("--- DIAGNOSTIC END ---")
syscred/eval_metrics.py ADDED
@@ -0,0 +1,349 @@
+ # -*- coding: utf-8 -*-
+ """
+ Evaluation Metrics Module - SysCRED
+ ====================================
+ Information Retrieval evaluation metrics for TREC-style experiments.
+
+ Metrics:
+ - MAP (Mean Average Precision)
+ - NDCG (Normalized Discounted Cumulative Gain)
+ - P@K (Precision at K)
+ - Recall@K
+ - MRR (Mean Reciprocal Rank)
+
+ Based on pytrec_eval for official TREC evaluation.
+
+ (c) Dominique S. Loyer - PhD Thesis Prototype
+ Citation Key: loyerEvaluationModelesRecherche2025
+ """
+
+ import math
+ from typing import Dict, List, Optional, Tuple, Any
+ from collections import defaultdict
+
+ # Check for pytrec_eval
+ try:
+     import pytrec_eval
+     HAS_PYTREC_EVAL = True
+ except ImportError:
+     HAS_PYTREC_EVAL = False
+     print("[EvalMetrics] pytrec_eval not installed. Using built-in metrics.")
+
+
+ class EvaluationMetrics:
+     """
+     IR evaluation metrics using pytrec_eval or built-in implementations.
+
+     Supports TREC-style evaluation with:
+     - Official pytrec_eval (if available)
+     - Fallback pure-Python implementations
+     """
+
+     def __init__(self):
+         """Initialize the metrics calculator."""
+         self.use_pytrec = HAS_PYTREC_EVAL
+
+     # --- Built-in Metric Implementations ---
+
+     @staticmethod
+     def precision_at_k(retrieved: List[str], relevant: set, k: int) -> float:
+         """
+         Calculate Precision@K.
+
+         P@K = |relevant ∩ retrieved[:k]| / k
+         """
+         if k <= 0:
+             return 0.0
+         retrieved_k = retrieved[:k]
+         relevant_retrieved = len([d for d in retrieved_k if d in relevant])
+         return relevant_retrieved / k
+
+     @staticmethod
+     def recall_at_k(retrieved: List[str], relevant: set, k: int) -> float:
+         """
+         Calculate Recall@K.
+
+         R@K = |relevant ∩ retrieved[:k]| / |relevant|
+         """
+         if not relevant:
+             return 0.0
+         retrieved_k = retrieved[:k]
+         relevant_retrieved = len([d for d in retrieved_k if d in relevant])
+         return relevant_retrieved / len(relevant)
+
+     @staticmethod
+     def average_precision(retrieved: List[str], relevant: set) -> float:
+         """
+         Calculate Average Precision for a single query.
+
+         AP = (1/|relevant|) × Σ (P@k × rel(k))
+         """
+         if not relevant:
+             return 0.0
+
+         hits = 0
+         sum_precision = 0.0
+
+         for i, doc in enumerate(retrieved):
+             if doc in relevant:
+                 hits += 1
+                 sum_precision += hits / (i + 1)
+
+         return sum_precision / len(relevant)
+
+     @staticmethod
+     def dcg_at_k(retrieved: List[str], relevance: Dict[str, int], k: int) -> float:
+         """
+         Calculate DCG@K (Discounted Cumulative Gain).
+
+         DCG@K = Σ (2^rel(i) - 1) / log2(i + 2)
+         """
+         dcg = 0.0
+         for i, doc in enumerate(retrieved[:k]):
+             rel = relevance.get(doc, 0)
+             dcg += (2 ** rel - 1) / math.log2(i + 2)
+         return dcg
+
+     @staticmethod
+     def ndcg_at_k(retrieved: List[str], relevance: Dict[str, int], k: int) -> float:
+         """
+         Calculate NDCG@K (Normalized DCG).
+
+         NDCG@K = DCG@K / IDCG@K
+         """
+         dcg = EvaluationMetrics.dcg_at_k(retrieved, relevance, k)
+
+         # Calculate IDCG (ideal DCG)
+         sorted_rels = sorted(relevance.values(), reverse=True)[:k]
+         idcg = 0.0
+         for i, rel in enumerate(sorted_rels):
+             idcg += (2 ** rel - 1) / math.log2(i + 2)
+
+         return dcg / idcg if idcg > 0 else 0.0
+
+     @staticmethod
+     def reciprocal_rank(retrieved: List[str], relevant: set) -> float:
+         """
+         Calculate Reciprocal Rank.
+
+         RR = 1 / rank of first relevant document
+         """
+         for i, doc in enumerate(retrieved):
+             if doc in relevant:
+                 return 1.0 / (i + 1)
+         return 0.0
+
+     # --- TREC-Style Evaluation ---
+
+     def evaluate_run(
+         self,
+         run: Dict[str, List[Tuple[str, float]]],
+         qrels: Dict[str, Dict[str, int]],
+         metrics: Optional[List[str]] = None
+     ) -> Dict[str, Dict[str, float]]:
+         """
+         Evaluate a run against qrels (relevance judgments).
+
+         Args:
+             run: {query_id: [(doc_id, score), ...]}
+             qrels: {query_id: {doc_id: relevance}}
+             metrics: List of metrics to compute
+                 ['map', 'ndcg', 'P_5', 'P_10', 'recall_100']
+
+         Returns:
+             {query_id: {metric: value}}
+         """
+         if metrics is None:
+             metrics = ['map', 'ndcg', 'P_5', 'P_10', 'P_20', 'recall_100', 'recip_rank']
+
+         if self.use_pytrec and HAS_PYTREC_EVAL:
+             return self._evaluate_pytrec(run, qrels, metrics)
+         else:
+             return self._evaluate_builtin(run, qrels, metrics)
+
+     def _evaluate_pytrec(
+         self,
+         run: Dict[str, List[Tuple[str, float]]],
+         qrels: Dict[str, Dict[str, int]],
+         metrics: List[str]
+     ) -> Dict[str, Dict[str, float]]:
+         """Evaluate using pytrec_eval."""
+         # Convert run format for pytrec_eval
+         pytrec_run = {}
+         for qid, docs in run.items():
+             pytrec_run[qid] = {doc_id: score for doc_id, score in docs}
+
+         # Create evaluator
+         evaluator = pytrec_eval.RelevanceEvaluator(qrels, set(metrics))
+
+         # Evaluate
+         results = evaluator.evaluate(pytrec_run)
+
+         return results
+
+     def _evaluate_builtin(
+         self,
+         run: Dict[str, List[Tuple[str, float]]],
+         qrels: Dict[str, Dict[str, int]],
+         metrics: List[str]
+     ) -> Dict[str, Dict[str, float]]:
+         """Evaluate using built-in implementations."""
+         results = {}
+
+         for qid, docs_scores in run.items():
+             if qid not in qrels:
+                 continue
+
+             q_results = {}
+             retrieved = [doc_id for doc_id, _ in docs_scores]
+             relevance = qrels[qid]
+             relevant = set(doc_id for doc_id, rel in relevance.items() if rel > 0)
+
+             for metric in metrics:
+                 if metric == 'map':
+                     q_results['map'] = self.average_precision(retrieved, relevant)
+                 elif metric == 'ndcg':
+                     q_results['ndcg'] = self.ndcg_at_k(retrieved, relevance, 1000)
+                 elif metric.startswith('ndcg_cut_'):
+                     k = int(metric.split('_')[-1])
+                     q_results[metric] = self.ndcg_at_k(retrieved, relevance, k)
+                 elif metric.startswith('P_'):
+                     k = int(metric.split('_')[-1])
+                     q_results[metric] = self.precision_at_k(retrieved, relevant, k)
+                 elif metric.startswith('recall_'):
+                     k = int(metric.split('_')[-1])
+                     q_results[metric] = self.recall_at_k(retrieved, relevant, k)
+                 elif metric == 'recip_rank':
+                     q_results['recip_rank'] = self.reciprocal_rank(retrieved, relevant)
+
+             results[qid] = q_results
+
+         return results
+
+     def compute_aggregate(
+         self,
+         results: Dict[str, Dict[str, float]]
+     ) -> Dict[str, float]:
+         """
+         Compute aggregate metrics across all queries.
+
+         Returns mean values for each metric.
+         """
+         if not results:
+             return {}
+
+         aggregated = defaultdict(list)
+         for qid, metrics in results.items():
+             for metric, value in metrics.items():
+                 aggregated[metric].append(value)
+
+         return {metric: sum(values) / len(values)
+                 for metric, values in aggregated.items()}
+
+     def format_results(
+         self,
+         results: Dict[str, Dict[str, float]],
+         include_per_query: bool = False
+     ) -> str:
+         """Format results as a readable string."""
+         lines = []
+
+         # Aggregate
+         agg = self.compute_aggregate(results)
+         lines.append("=" * 50)
+         lines.append("AGGREGATE METRICS")
+         lines.append("=" * 50)
+         for metric, value in sorted(agg.items()):
+             lines.append(f"  {metric:20s}: {value:.4f}")
+
+         # Per-query (optional)
+         if include_per_query:
+             lines.append("")
+             lines.append("=" * 50)
+             lines.append("PER-QUERY METRICS")
+             lines.append("=" * 50)
+             for qid in sorted(results.keys()):
+                 lines.append(f"\nQuery {qid}:")
+                 for metric, value in sorted(results[qid].items()):
+                     lines.append(f"  {metric:20s}: {value:.4f}")
+
+         return '\n'.join(lines)
+
+
+ def parse_qrels_file(filepath: str) -> Dict[str, Dict[str, int]]:
+     """
+     Parse a TREC qrels file.
+
+     Format: query_id 0 doc_id relevance
+     """
+     qrels = defaultdict(dict)
+     with open(filepath, 'r') as f:
+         for line in f:
+             parts = line.strip().split()
+             if len(parts) >= 4:
+                 qid, _, docid, rel = parts[:4]
+                 qrels[qid][docid] = int(rel)
+     return dict(qrels)
+
+
+ def parse_run_file(filepath: str) -> Dict[str, List[Tuple[str, float]]]:
+     """
+     Parse a TREC run file.
+
+     Format: query_id Q0 doc_id rank score run_tag
+     """
+     run = defaultdict(list)
+     with open(filepath, 'r') as f:
+         for line in f:
+             parts = line.strip().split()
+             if len(parts) >= 5:
+                 qid, _, docid, rank, score = parts[:5]
+                 run[qid].append((docid, float(score)))
+
+     # Sort by score descending
+     for qid in run:
+         run[qid].sort(key=lambda x: x[1], reverse=True)
+
+     return dict(run)
+
+
+ # --- Testing ---
+ if __name__ == "__main__":
+     print("=" * 60)
+     print("SysCRED Evaluation Metrics - Tests")
+     print("=" * 60)
+
+     metrics = EvaluationMetrics()
+     print(f"\nUsing pytrec_eval: {metrics.use_pytrec}")
+
+     # Test data
+     retrieved = ['doc1', 'doc2', 'doc3', 'doc4', 'doc5', 'doc6', 'doc7', 'doc8', 'doc9', 'doc10']
+     relevant = {'doc1', 'doc3', 'doc5', 'doc8'}
+     relevance = {'doc1': 2, 'doc3': 1, 'doc5': 2, 'doc8': 1}
+
+     print("\n--- Built-in Metrics Tests ---")
+     print(f"P@5: {metrics.precision_at_k(retrieved, relevant, 5):.4f}")
+     print(f"P@10: {metrics.precision_at_k(retrieved, relevant, 10):.4f}")
+     print(f"R@5: {metrics.recall_at_k(retrieved, relevant, 5):.4f}")
+     print(f"R@10: {metrics.recall_at_k(retrieved, relevant, 10):.4f}")
+     print(f"AP: {metrics.average_precision(retrieved, relevant):.4f}")
+     print(f"NDCG@10: {metrics.ndcg_at_k(retrieved, relevance, 10):.4f}")
+     print(f"RR: {metrics.reciprocal_rank(retrieved, relevant):.4f}")
+
+     # Test run evaluation
+     print("\n--- Run Evaluation Test ---")
+     run = {
+         'Q1': [(doc, 10 - i) for i, doc in enumerate(retrieved)],
+         'Q2': [('doc2', 10), ('doc1', 9), ('doc4', 8), ('doc3', 7)]
+     }
+     qrels = {
+         'Q1': relevance,
+         'Q2': {'doc1': 1, 'doc3': 2}
+     }
+
+     results = metrics.evaluate_run(run, qrels)
+     print(metrics.format_results(results))
+
+     print("\n" + "=" * 60)
+     print("Tests complete!")
+     print("=" * 60)
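The TREC qrels format parsed by `parse_qrels_file` (whitespace-separated `query_id 0 doc_id relevance` records) can be illustrated without a file; the query and document IDs below are hypothetical:

```python
from collections import defaultdict

# In-memory qrels lines in TREC format: query_id 0 doc_id relevance
lines = [
    "Q1 0 doc1 2",
    "Q1 0 doc3 1",
    "Q2 0 doc9 1",
]

# Same parsing logic as parse_qrels_file, applied to the lines directly
qrels = defaultdict(dict)
for line in lines:
    parts = line.strip().split()
    if len(parts) >= 4:
        qid, _, docid, rel = parts[:4]
        qrels[qid][docid] = int(rel)

print(dict(qrels))
# -> {'Q1': {'doc1': 2, 'doc3': 1}, 'Q2': {'doc9': 1}}
```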
syscred/graph_rag.py ADDED
@@ -0,0 +1,171 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ GraphRAG Module - SysCRED
4
+ =========================
5
+ Retrieves context from the Knowledge Graph to enhance verification.
6
+ Transforms "Passive" Graph into "Active" Context.
7
+
8
+ (c) Dominique S. Loyer - PhD Thesis Prototype
9
+ """
10
+
11
+ from typing import List, Dict, Any, Optional
12
+ from syscred.ontology_manager import OntologyManager
13
+
14
+ class GraphRAG:
15
+ """
16
+ Retrieval Augmented Generation using the Semantic Knowledge Graph.
17
+ """
18
+
19
+ def __init__(self, ontology_manager: OntologyManager):
20
+ self.om = ontology_manager
21
+
22
+ def get_context(self, domain: str, keywords: List[str] = []) -> Dict[str, str]:
23
+ """
24
+ Retrieve context for a specific verification task.
25
+
26
+ Args:
27
+ domain: The domain being analyzed (e.g., 'lemonde.fr')
28
+ keywords: List of keywords from the claim (not yet used in V1)
29
+
30
+ Returns:
31
+ Dictionary with natural language context strings.
32
+ """
33
+ if not self.om:
34
+ return {"graph_context": "No ontology manager available."}
35
+
36
+ context_parts = []
37
+
38
+ # 1. Source History
39
+ source_history = self._get_source_history(domain)
40
+ if source_history:
41
+ context_parts.append(source_history)
42
+
43
+ # 2. Pattern Matching (Similar Claims)
44
+ similar_uris = []
45
+ if keywords:
46
+ similar_result = self._find_similar_claims(keywords)
47
+ if similar_result["text"]:
48
+             context_parts.append(similar_result["text"])
+             similar_uris = similar_result["uris"]
+ 
+         full_context = "\n\n".join(context_parts) if context_parts else "No prior knowledge found in the graph."
+ 
+         return {
+             "full_text": full_context,
+             "source_history": source_history,
+             "similar_uris": similar_uris  # [NEW] Return URIs for linking
+         }
+ 
+     def _get_source_history(self, domain: str) -> str:
+         """
+         Query the graph for all previous evaluations of this domain.
+         """
+         if not domain:
+             return ""
+ 
+         # We reuse the usual query logic, tailored for retrieval
+         query = """
+         PREFIX cred: <https://github.com/DominiqueLoyer/systemFactChecking#>
+ 
+         SELECT ?score ?level ?timestamp
+         WHERE {
+             ?info cred:informationURL ?url .
+             ?request cred:concernsInformation ?info .
+             ?report cred:isReportOf ?request .
+             ?report cred:credibilityScoreValue ?score .
+             ?report cred:assignsCredibilityLevel ?level .
+             ?report cred:completionTimestamp ?timestamp .
+             FILTER(CONTAINS(STR(?url), "%s"))
+         }
+         ORDER BY DESC(?timestamp)
+         LIMIT 5
+         """ % domain
+ 
+         results = []
+         try:
+             combined = self.om.base_graph + self.om.data_graph
+             for row in combined.query(query):
+                 results.append({
+                     "score": float(row.score),
+                     "level": str(row.level).split('#')[-1],
+                     "date": str(row.timestamp).split('T')[0]
+                 })
+         except Exception as e:
+             print(f"[GraphRAG] Query error: {e}")
+             return ""
+ 
+         if not results:
+             return f"The graph contains no previous evaluations for {domain}."
+ 
+         # Summarize
+         count = len(results)
+         avg_score = sum(r['score'] for r in results) / count
+         last_verdict = results[0]['level']
+ 
+         summary = (
+             f"Graph Memory for '{domain}':\n"
+             f"- Analyzed {count} times previously.\n"
+             f"- Average Credibility Score: {avg_score:.2f} / 1.0\n"
+             f"- Most recent verdict ({results[0]['date']}): {last_verdict}.\n"
+         )
+ 
+         return summary
+ 
+     def _find_similar_claims(self, keywords: List[str]) -> Dict[str, Any]:
+         """
+         Find evaluation history for content containing specific keywords.
+         Returns a dict with 'text' (for the LLM) and 'uris' (for graph linking).
+         """
+         if not keywords:
+             return {"text": "", "uris": []}
+ 
+         # Build a REGEX filter for the keywords (OR logic),
+         # e.g. (fake|hoax|conspiracy)
+         clean_kws = [k for k in keywords if len(k) > 3]  # Skip short words
+         if not clean_kws:
+             return {"text": "", "uris": []}
+ 
+         regex_pattern = "|".join(clean_kws)
+ 
+         query = """
+         PREFIX cred: <https://github.com/DominiqueLoyer/systemFactChecking#>
+ 
+         SELECT ?report ?content ?score ?level ?timestamp
+         WHERE {
+             ?info cred:informationContent ?content .
+             ?request cred:concernsInformation ?info .
+             ?report cred:isReportOf ?request .
+             ?report cred:credibilityScoreValue ?score .
+             ?report cred:assignsCredibilityLevel ?level .
+             ?report cred:completionTimestamp ?timestamp .
+             FILTER(REGEX(?content, "%s", "i"))
+         }
+         ORDER BY DESC(?timestamp)
+         LIMIT 3
+         """ % regex_pattern
+ 
+         results = []
+         try:
+             combined = self.om.base_graph + self.om.data_graph
+             for row in combined.query(query):
+                 results.append({
+                     "uri": str(row.report),
+                     "content": str(row.content)[:100] + "...",
+                     "score": float(row.score),
+                     "verdict": str(row.level).split('#')[-1]
+                 })
+         except Exception as e:
+             print(f"[GraphRAG] Similar claims error: {e}")
+             return {"text": "", "uris": []}
+ 
+         if not results:
+             return {"text": "", "uris": []}
+ 
+         lines = [f"Found {len(results)} similar claims in history:"]
+         for r in results:
+             lines.append(f"- \"{r['content']}\" ({r['verdict']}, Score: {r['score']:.2f})")
+ 
+         return {
+             "text": "\n".join(lines),
+             "uris": [r['uri'] for r in results]
+         }
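Note on the `FILTER(REGEX(...))` construction above: the pattern is assembled by joining raw keywords with `|`, so a keyword containing regex metacharacters (a dot, a plus sign) can mis-match or make the pattern invalid. A small stdlib sketch of the safer construction, escaping each keyword first (the sample keywords are hypothetical; SPARQL's REGEX follows XPath syntax, but the escaping concern is the same):

```python
import re

keywords = ["hoax.", "c++11", "vaccine"]

# Same length filter as _find_similar_claims, then escape metacharacters
clean = [k for k in keywords if len(k) > 3]
pattern = "|".join(re.escape(k) for k in clean)

print(pattern)  # hoax\.|c\+\+11|vaccine
print(bool(re.search(pattern, "Another vaccine hoax.", re.IGNORECASE)))  # True
```

The escaped pattern still works as a plain OR-alternation, but literal dots and plus signs no longer act as wildcards or quantifiers.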
syscred/ir_engine.py ADDED
@@ -0,0 +1,410 @@
+ # -*- coding: utf-8 -*-
+ """
+ IR Engine Module - SysCRED
+ ===========================
+ Information Retrieval engine extracted from the TREC AP88-90 project.
+ 
+ Features:
+ - TF-IDF calculation (custom and via Pyserini)
+ - BM25 scoring (via Lucene/Pyserini)
+ - Query Likelihood Dirichlet (QLD)
+ - Pseudo-Relevance Feedback (PRF)
+ - Porter Stemming integration
+ 
+ Based on: TREC_AP88-90_5juin2025.py
+ (c) Dominique S. Loyer - PhD Thesis Prototype
+ Citation Key: loyerEvaluationModelesRecherche2025
+ """
+ 
+ import re
+ import math
+ from typing import Dict, List, Tuple, Optional, Any
+ from dataclasses import dataclass
+ from collections import Counter
+ 
+ # Check for optional dependencies
+ try:
+     import nltk
+     from nltk.corpus import stopwords
+     from nltk.stem import PorterStemmer
+     from nltk.tokenize import word_tokenize
+     HAS_NLTK = True
+ except ImportError:
+     HAS_NLTK = False
+ 
+ try:
+     from pyserini.search.lucene import LuceneSearcher
+     HAS_PYSERINI = True
+ except ImportError:
+     HAS_PYSERINI = False
+ 
+ 
+ # --- Data Classes ---
+ 
+ @dataclass
+ class SearchResult:
+     """A single search result."""
+     doc_id: str
+     score: float
+     rank: int
+     snippet: Optional[str] = None
+ 
+ 
+ @dataclass
+ class SearchResponse:
+     """Complete search response."""
+     query_id: str
+     query_text: str
+     results: List[SearchResult]
+     model: str  # 'bm25', 'qld', 'tfidf'
+     total_hits: int
+     search_time_ms: float
+ 
+ 
+ class IREngine:
+     """
+     Information Retrieval engine with multiple scoring methods.
+ 
+     Supports:
+     - Built-in TF-IDF/BM25 (no dependencies)
+     - Pyserini/Lucene BM25 and QLD (if pyserini is installed)
+     - Query expansion with Pseudo-Relevance Feedback
+     """
+ 
+     # BM25 default parameters
+     BM25_K1 = 0.9
+     BM25_B = 0.4
+ 
+     def __init__(self, index_path: Optional[str] = None, use_stemming: bool = True):
+         """
+         Initialize the IR engine.
+ 
+         Args:
+             index_path: Path to a Lucene/Pyserini index (optional)
+             use_stemming: Whether to apply Porter stemming
+         """
+         self.index_path = index_path
+         self.use_stemming = use_stemming
+         self.searcher = None
+ 
+         # Initialize NLTK components
+         if HAS_NLTK:
+             try:
+                 self.stopwords = set(stopwords.words('english'))
+                 self.stemmer = PorterStemmer() if use_stemming else None
+             except LookupError:
+                 print("[IREngine] Downloading NLTK resources...")
+                 nltk.download('stopwords', quiet=True)
+                 nltk.download('punkt', quiet=True)
+                 nltk.download('punkt_tab', quiet=True)
+                 self.stopwords = set(stopwords.words('english'))
+                 self.stemmer = PorterStemmer() if use_stemming else None
+         else:
+             # Fallback stopwords
+             self.stopwords = {
+                 'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to',
+                 'for', 'of', 'with', 'by', 'from', 'as', 'is', 'was', 'are',
+                 'were', 'been', 'be', 'have', 'has', 'had', 'do', 'does',
+                 'did', 'will', 'would', 'could', 'should', 'may', 'might',
+                 'must', 'shall', 'can', 'need', 'this', 'that', 'these',
+                 'those', 'it', 'its', 'they', 'them', 'he', 'she', 'him',
+                 'her', 'his', 'we', 'you', 'i', 'my', 'your', 'our', 'their'
+             }
+             self.stemmer = None
+ 
+         # Initialize the Pyserini searcher if available
+         if HAS_PYSERINI and index_path:
+             try:
+                 self.searcher = LuceneSearcher(index_path)
+                 print(f"[IREngine] Pyserini searcher initialized with index: {index_path}")
+             except Exception as e:
+                 print(f"[IREngine] Failed to initialize Pyserini: {e}")
+ 
+     def preprocess(self, text: str) -> str:
+         """
+         Preprocess text with tokenization, stopword removal, and optional stemming.
+ 
+         This matches the TREC preprocessing pipeline.
+         """
+         if not isinstance(text, str):
+             return ""
+ 
+         text = text.lower()
+ 
+         if HAS_NLTK:
+             try:
+                 tokens = word_tokenize(text)
+             except LookupError:
+                 # Fallback tokenization
+                 tokens = re.findall(r'\b[a-z]+\b', text)
+         else:
+             tokens = re.findall(r'\b[a-z]+\b', text)
+ 
+         # Filter stopwords and non-alphabetic tokens
+         filtered = [t for t in tokens if t.isalpha() and t not in self.stopwords]
+ 
+         # Apply stemming if enabled
+         if self.stemmer:
+             filtered = [self.stemmer.stem(t) for t in filtered]
+ 
+         return ' '.join(filtered)
+ 
+     def calculate_tf(self, tokens: List[str]) -> Dict[str, float]:
+         """Calculate term frequency."""
+         if not tokens:
+             return {}
+         counts = Counter(tokens)
+         total = len(tokens)
+         return {term: count / total for term, count in counts.items()}
+ 
+     def calculate_bm25_score(
+         self,
+         query_terms: List[str],
+         doc_terms: List[str],
+         doc_length: int,
+         avg_doc_length: float,
+         doc_freq: Dict[str, int],
+         corpus_size: int
+     ) -> float:
+         """
+         Calculate the BM25 score for a document.
+ 
+         BM25(D, Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))
+         """
+         doc_term_counts = Counter(doc_terms)
+         score = 0.0
+ 
+         for term in query_terms:
+             if term not in doc_term_counts:
+                 continue
+ 
+             tf = doc_term_counts[term]
+             df = doc_freq.get(term, 1)
+             idf = math.log((corpus_size - df + 0.5) / (df + 0.5) + 1)
+ 
+             numerator = tf * (self.BM25_K1 + 1)
+             denominator = tf + self.BM25_K1 * (1 - self.BM25_B + self.BM25_B * doc_length / avg_doc_length)
+ 
+             score += idf * (numerator / denominator)
+ 
+         return score
+ 
+     def search_pyserini(
+         self,
+         query: str,
+         model: str = 'bm25',
+         k: int = 100,
+         query_id: str = "Q1"
+     ) -> SearchResponse:
+         """
+         Search using Pyserini/Lucene.
+ 
+         Args:
+             query: Query text
+             model: 'bm25' or 'qld'
+             k: Number of results
+             query_id: Query identifier
+         """
+         import time
+         start = time.time()
+ 
+         if not self.searcher:
+             raise RuntimeError("Pyserini searcher not initialized. Provide index_path.")
+ 
+         # Configure the similarity
+         if model == 'bm25':
+             self.searcher.set_bm25(k1=self.BM25_K1, b=self.BM25_B)
+         elif model == 'qld':
+             self.searcher.set_qld()
+         else:
+             self.searcher.set_bm25()
+ 
+         # Preprocess the query
+         processed_query = self.preprocess(query)
+ 
+         # Search
+         hits = self.searcher.search(processed_query, k=k)
+ 
+         results = []
+         for i, hit in enumerate(hits):
+             results.append(SearchResult(
+                 doc_id=hit.docid,
+                 score=hit.score,
+                 rank=i + 1
+             ))
+ 
+         elapsed = (time.time() - start) * 1000
+ 
+         return SearchResponse(
+             query_id=query_id,
+             query_text=query,
+             results=results,
+             model=model,
+             total_hits=len(results),
+             search_time_ms=elapsed
+         )
+ 
+     def pseudo_relevance_feedback(
+         self,
+         query: str,
+         top_docs_texts: List[str],
+         num_expansion_terms: int = 10
+     ) -> str:
+         """
+         Expand a query using Pseudo-Relevance Feedback (PRF).
+ 
+         Uses the top-k retrieved documents to find expansion terms.
+         """
+         query_tokens = self.preprocess(query).split()
+ 
+         # Collect terms from the top documents
+         expansion_candidates = Counter()
+         for doc_text in top_docs_texts:
+             doc_tokens = self.preprocess(doc_text).split()
+             # Count terms not in the original query
+             for token in doc_tokens:
+                 if token not in query_tokens:
+                     expansion_candidates[token] += 1
+ 
+         # Get the top expansion terms
+         expansion_terms = [term for term, _ in expansion_candidates.most_common(num_expansion_terms)]
+ 
+         # Create the expanded query
+         expanded_query = query + ' ' + ' '.join(expansion_terms)
+ 
+         return expanded_query
+ 
+     def format_trec_run(
+         self,
+         responses: List[SearchResponse],
+         run_tag: str
+     ) -> str:
+         """
+         Format results in the TREC run file format.
+ 
+         Format: query_id Q0 doc_id rank score run_tag
+         """
+         lines = []
+         for response in responses:
+             for result in response.results:
+                 lines.append(
+                     f"{response.query_id} Q0 {result.doc_id} "
+                     f"{result.rank} {result.score:.6f} {run_tag}"
+                 )
+         return '\n'.join(lines)
+ 
+ 
+ # --- Kaggle/Colab Utilities ---
+ 
+ def setup_kaggle_environment():
+     """Set up the environment for Kaggle notebooks."""
+     import subprocess
+     import sys
+ 
+     print("=" * 60)
+     print("SysCRED - Kaggle Environment Setup")
+     print("=" * 60)
+ 
+     # Check for a GPU/TPU
+     try:
+         import torch
+         if torch.cuda.is_available():
+             print(f"✓ GPU available: {torch.cuda.get_device_name(0)}")
+         else:
+             print("✗ No GPU detected")
+     except ImportError:
+         print("✗ PyTorch not installed")
+ 
+     # Install required packages
+     packages = [
+         'pyserini',
+         'transformers',
+         'pytrec_eval',
+         'nltk',
+         'rdflib'
+     ]
+ 
+     print("\nInstalling packages...")
+     for pkg in packages:
+         try:
+             subprocess.run(
+                 [sys.executable, '-m', 'pip', 'install', '-q', pkg],
+                 check=True,
+                 capture_output=True
+             )
+             print(f"  ✓ {pkg}")
+         except Exception:
+             print(f"  ✗ {pkg} - install failed")
+ 
+     # Download NLTK resources
+     import nltk
+     for resource in ['stopwords', 'punkt', 'punkt_tab', 'wordnet']:
+         try:
+             nltk.download(resource, quiet=True)
+         except Exception:
+             pass
+ 
+     print("\n✓ Environment setup complete")
+ 
+ 
+ def load_kaggle_dataset(dataset_path: str) -> Optional[str]:
+     """
+     Load a Kaggle dataset.
+ 
+     Args:
+         dataset_path: Path like '/kaggle/input/trec-ap88-90'
+     """
+     import os
+ 
+     if os.path.exists(dataset_path):
+         print(f"✓ Dataset found: {dataset_path}")
+         return dataset_path
+     else:
+         print(f"✗ Dataset not found: {dataset_path}")
+         print("Make sure to add the dataset to your Kaggle notebook.")
+         return None
+ 
+ 
+ # --- Testing ---
+ if __name__ == "__main__":
+     print("=" * 60)
+     print("SysCRED IR Engine - Tests")
+     print("=" * 60)
+ 
+     engine = IREngine(use_stemming=True)
+ 
+     # Test preprocessing
+     print("\n1. Testing preprocessing...")
+     sample = "Information Retrieval systems help users find relevant documents."
+     processed = engine.preprocess(sample)
+     print(f"   Original:  {sample}")
+     print(f"   Processed: {processed}")
+ 
+     # Test BM25
+     print("\n2. Testing BM25 calculation...")
+     query_terms = engine.preprocess("information retrieval").split()
+     doc_terms = engine.preprocess(sample).split()
+ 
+     score = engine.calculate_bm25_score(
+         query_terms=query_terms,
+         doc_terms=doc_terms,
+         doc_length=len(doc_terms),
+         avg_doc_length=10,
+         doc_freq={'inform': 5, 'retriev': 3},
+         corpus_size=100
+     )
+     print(f"   BM25 Score: {score:.4f}")
+ 
+     # Test PRF
+     print("\n3. Testing Pseudo-Relevance Feedback...")
+     expanded = engine.pseudo_relevance_feedback(
+         query="information retrieval",
+         top_docs_texts=[
+             "Information retrieval is finding relevant documents in a collection.",
+             "Search engines use retrieval models like BM25 and TF-IDF.",
+             "Query expansion improves retrieval effectiveness."
+         ]
+     )
+     print(f"   Original query: information retrieval")
+     print(f"   Expanded query: {expanded}")
+ 
+     print("\n" + "=" * 60)
+     print("Tests complete!")
+     print("=" * 60)
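`calculate_bm25_score` expects corpus statistics (`doc_freq`, `avg_doc_length`, `corpus_size`) that the module leaves to the caller. A minimal stdlib sketch of building them over a toy corpus and ranking with the same formula and defaults (`k1=0.9`, `b=0.4`); the corpus and helper names are illustrative, not part of the module:

```python
import math
from collections import Counter

# Hypothetical mini-corpus; preprocessing is reduced to a lowercase split
# (the real engine also stems and removes stopwords).
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "search engines rank documents with bm25",
    "d3": "neural models complement classic retrieval",
}
tokenized = {d: t.split() for d, t in docs.items()}

# Corpus statistics expected by IREngine.calculate_bm25_score
doc_freq = Counter()
for terms in tokenized.values():
    doc_freq.update(set(terms))  # document frequency, not raw term counts
avg_len = sum(len(t) for t in tokenized.values()) / len(tokenized)

def bm25(query_terms, doc_terms, k1=0.9, b=0.4, n=len(tokenized)):
    counts = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in counts:
            continue
        tf, df = counts[term], doc_freq.get(term, 1)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score

ranked = sorted(tokenized, key=lambda d: bm25(["retrieval", "documents"], tokenized[d]),
                reverse=True)
print(ranked[0])  # d1 — the only document matching both query terms
```

Only `d1` matches both query terms, so it outscores the single-match documents regardless of their lengths.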
syscred/ontology_manager.py ADDED
@@ -0,0 +1,509 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ Ontology Manager Module - SysCRED
4
+ ==================================
5
+ Manages the RDF ontology for the credibility verification system.
6
+ Handles reading, writing, and querying of semantic triplets.
7
+
8
+ (c) Dominique S. Loyer - PhD Thesis Prototype
9
+ Citation Key: loyerModelingHybridSystem2025
10
+ """
11
+
12
+ from typing import Optional, List, Dict, Any
13
+ from datetime import datetime
14
+ from dataclasses import dataclass
15
+ import os
16
+
17
+ # RDFLib imports with fallback
18
+ try:
19
+ from rdflib import Graph, Namespace, Literal, URIRef, BNode
20
+ from rdflib.namespace import RDF, RDFS, OWL, XSD
21
+ HAS_RDFLIB = True
22
+ except ImportError:
23
+ HAS_RDFLIB = False
24
+ print("Warning: rdflib not installed. Run: pip install rdflib")
25
+
26
+
27
+ @dataclass
28
+ class EvaluationRecord:
29
+ """Represents a stored evaluation from the ontology."""
30
+ evaluation_id: str
31
+ url_or_text: str
32
+ score: float
33
+ level: str
34
+ timestamp: str
35
+ fact_checks: List[str]
36
+
37
+
38
+ class OntologyManager:
39
+ """
40
+ Manages the credibility ontology using RDFLib.
41
+
42
+ Handles:
43
+ - Loading base ontology
44
+ - Adding evaluation triplets
45
+ - Querying historical data
46
+ - Exporting enriched ontology
47
+ """
48
+
49
+ # Namespace for the credibility ontology
50
+ CRED_NS = "https://github.com/DominiqueLoyer/systemFactChecking#"
51
+
52
+ def __init__(self, base_ontology_path: Optional[str] = None, data_path: Optional[str] = None):
53
+ """
54
+ Initialize the ontology manager.
55
+
56
+ Args:
57
+ base_ontology_path: Path to the base ontology TTL file
58
+ data_path: Path to store/load accumulated data triplets
59
+ """
60
+ if not HAS_RDFLIB:
61
+ raise ImportError("rdflib is required. Install with: pip install rdflib")
62
+
63
+ self.base_path = base_ontology_path
64
+ self.data_path = data_path
65
+
66
+ # Create namespace
67
+ self.cred = Namespace(self.CRED_NS)
68
+
69
+ # Initialize graphs
70
+ self.base_graph = Graph()
71
+ self.data_graph = Graph()
72
+
73
+ # Bind prefixes for nicer serialization
74
+ self._bind_prefixes(self.base_graph)
75
+ self._bind_prefixes(self.data_graph)
76
+
77
+ # Load ontology files if they exist
78
+ if base_ontology_path and os.path.exists(base_ontology_path):
79
+ self.load_base_ontology(base_ontology_path)
80
+
81
+ if data_path and os.path.exists(data_path):
82
+ self.load_data_graph(data_path)
83
+
84
+ # Counter for generating unique IDs
85
+ self._evaluation_counter = 0
86
+
87
+ def _bind_prefixes(self, graph: Graph):
88
+ """Bind common prefixes to a graph."""
89
+ graph.bind("cred", self.cred)
90
+ graph.bind("owl", OWL)
91
+ graph.bind("rdf", RDF)
92
+ graph.bind("rdfs", RDFS)
93
+ graph.bind("xsd", XSD)
94
+
95
+ def load_base_ontology(self, path: str) -> bool:
96
+ """Load the base ontology from a TTL file."""
97
+ try:
98
+ self.base_graph.parse(path, format='turtle')
99
+ print(f"[OntologyManager] Loaded base ontology: {len(self.base_graph)} triples")
100
+ return True
101
+ except Exception as e:
102
+ print(f"[OntologyManager] Error loading base ontology: {e}")
103
+ return False
104
+
105
+ def load_data_graph(self, path: str) -> bool:
106
+ """Load accumulated data triplets."""
107
+ try:
108
+ self.data_graph.parse(path, format='turtle')
109
+ print(f"[OntologyManager] Loaded data graph: {len(self.data_graph)} triples")
110
+ return True
111
+ except Exception as e:
112
+ print(f"[OntologyManager] Error loading data graph: {e}")
113
+ return False
114
+
115
+ def add_evaluation_triplets(self, report: Dict[str, Any]) -> str:
116
+ """
117
+ Add triplets for a new credibility evaluation.
118
+
119
+ Args:
120
+ report: The evaluation report dictionary from CredibilityVerificationSystem
121
+
122
+ Returns:
123
+ The URI of the created RapportEvaluation individual
124
+ """
125
+ timestamp = datetime.now()
126
+ timestamp_str = timestamp.strftime("%Y%m%d_%H%M%S")
127
+ self._evaluation_counter += 1
128
+
129
+ # Create URIs for new individuals
130
+ report_uri = self.cred[f"Report_{timestamp_str}_{self._evaluation_counter}"]
131
+ request_uri = self.cred[f"Request_{timestamp_str}_{self._evaluation_counter}"]
132
+ info_uri = self.cred[f"Info_{timestamp_str}_{self._evaluation_counter}"]
133
+
134
+ # Get data from report
135
+ score = report.get('scoreCredibilite', 0.5)
136
+ input_data = report.get('informationEntree', '')
137
+ summary = report.get('resumeAnalyse', '')
138
+
139
+ # Determine credibility level based on score
140
+ if score >= 0.7:
141
+ level_uri = self.cred.Niveau_Haut
142
+ info_class = self.cred.InformationHauteCredibilite
143
+ elif score >= 0.4:
144
+ level_uri = self.cred.Niveau_Moyen
145
+ info_class = self.cred.InformationMoyenneCredibilite
146
+ else:
147
+ level_uri = self.cred.Niveau_Bas
148
+ info_class = self.cred.InformationFaibleCredibilite
149
+
150
+ # Add Information triplets
151
+ self.data_graph.add((info_uri, RDF.type, self.cred.InformationSoumise))
152
+ self.data_graph.add((info_uri, RDF.type, info_class))
153
+ self.data_graph.add((info_uri, self.cred.informationContent,
154
+ Literal(input_data[:500], datatype=XSD.string)))
155
+
156
+ # Check if it's a URL
157
+ if input_data.startswith('http'):
158
+ self.data_graph.add((info_uri, self.cred.informationURL,
159
+ Literal(input_data, datatype=XSD.anyURI)))
160
+
161
+ # Add Request triplets
162
+ self.data_graph.add((request_uri, RDF.type, self.cred.RequeteEvaluation))
163
+ self.data_graph.add((request_uri, self.cred.concernsInformation, info_uri))
164
+ self.data_graph.add((request_uri, self.cred.submissionTimestamp,
165
+ Literal(timestamp.isoformat(), datatype=XSD.dateTime)))
166
+ self.data_graph.add((request_uri, self.cred.requestStatus,
167
+ Literal("Completed", datatype=XSD.string)))
168
+
169
+ # Add Report triplets
170
+ self.data_graph.add((report_uri, RDF.type, self.cred.RapportEvaluation))
171
+ self.data_graph.add((report_uri, self.cred.isReportOf, request_uri))
172
+ self.data_graph.add((report_uri, self.cred.credibilityScoreValue,
173
+ Literal(float(score), datatype=XSD.float)))
174
+ self.data_graph.add((report_uri, self.cred.assignsCredibilityLevel, level_uri))
175
+ self.data_graph.add((report_uri, self.cred.completionTimestamp,
176
+ Literal(timestamp.isoformat(), datatype=XSD.dateTime)))
177
+ self.data_graph.add((report_uri, self.cred.reportSummary,
178
+ Literal(summary, datatype=XSD.string)))
179
+
180
+ # Add NLP results if available
181
+ nlp_results = report.get('analyseNLP', {})
182
+ if nlp_results:
183
+ nlp_result_uri = self.cred[f"NLPResult_{timestamp_str}_{self._evaluation_counter}"]
184
+ self.data_graph.add((nlp_result_uri, RDF.type, self.cred.ResultatNLP))
185
+ self.data_graph.add((report_uri, self.cred.includesNLPResult, nlp_result_uri))
186
+
187
+ sentiment = nlp_results.get('sentiment', {})
188
+ if sentiment:
189
+ self.data_graph.add((nlp_result_uri, self.cred.sentimentScore,
190
+ Literal(float(sentiment.get('score', 0.5)), datatype=XSD.float)))
191
+
192
+ coherence = nlp_results.get('coherence_score')
193
+ if coherence is not None:
194
+ self.data_graph.add((nlp_result_uri, self.cred.coherenceScore,
195
+ Literal(float(coherence), datatype=XSD.float)))
196
+
197
+ # Add source analysis if available
198
+ rules = report.get('reglesAppliquees', {})
199
+ source_analysis = rules.get('source_analysis', {})
200
+ if source_analysis:
201
+ source_uri = self.cred[f"SourceAnalysis_{timestamp_str}_{self._evaluation_counter}"]
202
+ self.data_graph.add((source_uri, RDF.type, self.cred.InfoSourceAnalyse))
203
+ self.data_graph.add((report_uri, self.cred.includesSourceAnalysis, source_uri))
204
+
205
+ reputation = source_analysis.get('reputation', 'Unknown')
206
+ self.data_graph.add((source_uri, self.cred.sourceAnalyzedReputation,
207
+ Literal(reputation, datatype=XSD.string)))
208
+
209
+ domain_age = source_analysis.get('domain_age_days')
210
+ if domain_age is not None:
211
+ self.data_graph.add((source_uri, self.cred.sourceMentionsCount,
212
+ Literal(int(domain_age), datatype=XSD.integer)))
213
+
214
+ # Add fact check results
215
+ fact_checks = rules.get('fact_checking', [])
216
+ for i, fc in enumerate(fact_checks):
217
+ evidence_uri = self.cred[f"Evidence_{timestamp_str}_{self._evaluation_counter}_{i}"]
218
+ self.data_graph.add((evidence_uri, RDF.type, self.cred.PreuveFactuelle))
219
+ self.data_graph.add((report_uri, self.cred.basedOnEvidence, evidence_uri))
220
+
221
+ self.data_graph.add((evidence_uri, self.cred.evidenceClaim,
222
+ Literal(fc.get('claim', ''), datatype=XSD.string)))
223
+ self.data_graph.add((evidence_uri, self.cred.evidenceVerdict,
224
+ Literal(fc.get('rating', ''), datatype=XSD.string)))
225
+ self.data_graph.add((evidence_uri, self.cred.evidenceSource,
226
+ Literal(fc.get('publisher', ''), datatype=XSD.string)))
227
+ if fc.get('url'):
228
+ self.data_graph.add((evidence_uri, self.cred.evidenceURL,
229
+ Literal(fc.get('url', ''), datatype=XSD.anyURI)))
230
+
231
+ # [NEW] Link similar claims found by GraphRAG
232
+ similar_uris = report.get('similar_claims_uris', [])
233
+ for sim_uri_str in similar_uris:
234
+ try:
235
+ sim_uri = URIRef(sim_uri_str)
236
+ self.data_graph.add((report_uri, RDFS.seeAlso, sim_uri))
237
+ except Exception as e:
238
+ print(f"[Ontology] Error linking similar URI {sim_uri_str}: {e}")
239
+
240
+ print(f"[OntologyManager] Added evaluation triplets. Report: {report_uri}")
241
+ return str(report_uri)
242
+
243
+ def query_source_history(self, url: str) -> List[EvaluationRecord]:
244
+ """
245
+ Query all previous evaluations for a URL/domain.
246
+
247
+ Args:
248
+ url: URL to search for
249
+
250
+ Returns:
251
+ List of EvaluationRecord for this source
252
+ """
253
+ results = []
254
+
255
+ # SPARQL query to find all evaluations for this URL
256
+ query = """
257
+ PREFIX cred: <http://www.dic9335.uqam.ca/ontologies/credibility-verification#>
258
+ PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
259
+
260
+ SELECT ?report ?score ?level ?timestamp ?content
261
+ WHERE {
262
+ ?info cred:informationURL ?url .
263
+ ?request cred:concernsInformation ?info .
264
+ ?report cred:isReportOf ?request .
265
+ ?report cred:credibilityScoreValue ?score .
266
+ ?report cred:assignsCredibilityLevel ?level .
267
+ ?report cred:completionTimestamp ?timestamp .
268
+ ?info cred:informationContent ?content .
269
+ FILTER(CONTAINS(STR(?url), "%s"))
270
+ }
271
+ ORDER BY DESC(?timestamp)
272
+ """ % url
273
+
274
+ try:
275
+ # Query combined graph (base + data)
276
+ combined = self.base_graph + self.data_graph
277
+ for row in combined.query(query):
278
+ results.append(EvaluationRecord(
279
+ evaluation_id=str(row.report),
280
+ url_or_text=str(row.content) if row.content else url,
281
+ score=float(row.score),
282
+ level=str(row.level).split('#')[-1],
283
+ timestamp=str(row.timestamp),
284
+ fact_checks=[]
285
+ ))
286
+ except Exception as e:
287
+ print(f"[OntologyManager] Query error: {e}")
288
+
289
+ return results
290
+
291
+ def get_statistics(self) -> Dict[str, Any]:
292
+ """Get statistics about the ontology data."""
293
+ stats = {
294
+ 'base_triples': len(self.base_graph),
295
+ 'data_triples': len(self.data_graph),
296
+ 'total_triples': len(self.base_graph) + len(self.data_graph),
297
+ }
298
+
299
+ # Count evaluations
300
+ query = """
301
+ PREFIX cred: <http://www.dic9335.uqam.ca/ontologies/credibility-verification#>
302
+ SELECT (COUNT(?report) as ?count) WHERE {
303
+ ?report a cred:RapportEvaluation .
304
+ }
305
+ """
306
+ try:
307
+ for row in self.data_graph.query(query):
308
+ stats['total_evaluations'] = int(row.count)
309
+ except:
310
+ stats['total_evaluations'] = 0
311
+
312
+ return stats
313
+
314
+ def get_graph_json(self) -> Dict[str, List]:
315
+ """
316
+ Convert ontology data into D3.js JSON format (Nodes & Links).
317
+ """
318
+ nodes = []
319
+ links = []
320
+ added_nodes = set()
321
+
322
+ # Get the latest report ID
323
+ latest_query = """
324
+ PREFIX cred: <https://github.com/DominiqueLoyer/systemFactChecking#>
325
+ SELECT ?report ?timestamp WHERE {
326
+ ?report a cred:RapportEvaluation .
327
+ ?report cred:completionTimestamp ?timestamp .
328
+ }
329
+ ORDER BY DESC(?timestamp)
330
+ LIMIT 1
331
+ """
332
+ latest_report = None
333
+ try:
334
+ for row in self.data_graph.query(latest_query):
335
+ latest_report = row.report
336
+ except:
337
+ pass
338
+
339
+ if not latest_report:
340
+ return {'nodes': [], 'links': []}
341
+
342
+ # Helper to add node if unique
343
+ def add_node(uri, label, type_class, group):
344
+ if str(uri) not in added_nodes:
345
+ nodes.append({
346
+ 'id': str(uri),
347
+ 'name': str(label),
348
+ 'group': group,
349
+ 'type': str(type_class).split('#')[-1]
350
+ })
351
+ added_nodes.add(str(uri))
352
+
353
+ # Add Central Node (Report)
354
+ add_node(latest_report, "Latest Report", "cred:RapportEvaluation", 1)
355
+
356
+ # Query triples related to this report (Level 1)
357
+ related_query = """
358
+ PREFIX cred: <https://github.com/DominiqueLoyer/systemFactChecking#>
359
+ SELECT ?p ?o ?oType ?oLabel WHERE {
360
+ <%s> ?p ?o .
361
+ OPTIONAL { ?o a ?oType } .
362
+ OPTIONAL { ?o cred:evidenceSnippet ?oLabel } .
363
+ OPTIONAL { ?o cred:sourceAnalyzedReputation ?oLabel } .
364
+ }
365
+ """ % str(latest_report)
366
+
367
+ try:
368
+ # Level 1: Report -> Components
369
+ for row in self.data_graph.query(related_query):
370
+ p = row.p
371
+ o = row.o
372
+
373
+ # Skip generic system triples like rdf:type, but allow rdfs:seeAlso
374
+ if str(p) == str(RDF.type): continue
375
+ if 'Literal' in str(type(o)): continue # Skip basic literals
376
+
377
+ # Determine Group/Color
378
+ o_type = str(row.oType) if row.oType else "Unknown"
379
+ group = 2 # Default gray
380
+ if 'High' in o_type or 'Supporting' in o_type: group = 3 # Green (Positive)
381
+ if 'Low' in o_type or 'Refuting' in o_type: group = 4 # Red (Negative)
382
+ if 'Rapport' in o_type: group = 1 # Purple (Hub)
383
+ if 'SourceAnalysis' in o_type: group = 5 # Blue (Source)
384
+ if str(p) == str(RDFS.seeAlso): group = 7 # Orange for similar claims
385
+
386
+ # Add Target Node (Level 1)
387
+ o_label = row.oLabel if row.oLabel else str(o).split('#')[-1]
388
+ add_node(o, o_label, o_type, group)
389
+
390
+ # Add Link L1
391
+ link_type = 'primary'
392
+            if str(p) == str(RDFS.seeAlso):
+                link_type = 'similar'  # Special dash style for similar claims
+
+            links.append({
+                'source': str(latest_report),
+                'target': str(o),
+                'value': 2,
+                'type': link_type
+            })
+
+            # Level 2: Component -> Details (recursive enrichment),
+            # specifically for SourceAnalysis and Evidence
+            l2_query = """
+            SELECT ?p2 ?o2 ?o2Type WHERE {
+                <%s> ?p2 ?o2 .
+                OPTIONAL { ?o2 a ?o2Type } .
+                FILTER(isURI(?o2))
+            }""" % str(o)
+
+            for row2 in self.data_graph.query(l2_query):
+                o2 = row2.o2
+                if str(row2.p2) == str(RDF.type):
+                    continue
+
+                o2_label = str(o2).split('#')[-1]
+                add_node(o2, o2_label, "Detail", 6)  # Group 6 for leaf nodes
+
+                links.append({
+                    'source': str(o),
+                    'target': str(o2),
+                    'value': 1,
+                    'type': 'secondary'
+                })
+
+        except Exception as e:
+            print(f"Graph query error: {e}")
+
+        return {'nodes': nodes, 'links': links}
+
+    def export_to_ttl(self, output_path: str, include_base: bool = False) -> bool:
+        """
+        Export the ontology to a TTL file.
+
+        Args:
+            output_path: Path to write the TTL file
+            include_base: If True, include the base ontology in the export
+
+        Returns:
+            True if successful
+        """
+        try:
+            if include_base:
+                combined = self.base_graph + self.data_graph
+                combined.serialize(destination=output_path, format='turtle')
+            else:
+                self.data_graph.serialize(destination=output_path, format='turtle')
+
+            print(f"[OntologyManager] Exported to: {output_path}")
+            return True
+        except Exception as e:
+            print(f"[OntologyManager] Export error: {e}")
+            return False
+
+    def save_data(self) -> bool:
+        """Save the data graph to its configured path."""
+        if self.data_path:
+            return self.export_to_ttl(self.data_path, include_base=False)
+        return False
+
+
+ # --- Testing ---
+ if __name__ == "__main__":
+     print("=== Testing OntologyManager ===\n")
+
+     # Test with base ontology
+     base_path = "/Users/bk280625/documents041025/MonCode/sysCRED_onto26avrtil.ttl"
+     data_path = "/Users/bk280625/documents041025/MonCode/ontology/sysCRED_data.ttl"
+
+     manager = OntologyManager(base_ontology_path=base_path, data_path=None)
+
+     # Test adding an evaluation
+     sample_report = {
+         'scoreCredibilite': 0.72,
+         'informationEntree': 'https://www.lemonde.fr/article/test',
+         'resumeAnalyse': "L'analyse suggère une crédibilité MOYENNE à ÉLEVÉE.",
+         'analyseNLP': {
+             'sentiment': {'label': 'POSITIVE', 'score': 0.85},
+             'coherence_score': 0.78
+         },
+         'reglesAppliquees': {
+             'source_analysis': {
+                 'reputation': 'High',
+                 'domain_age_days': 9000
+             },
+             'fact_checking': [
+                 {'claim': 'Article verified by fact-checkers', 'rating': 'True'}
+             ]
+         }
+     }
+
+     print("Test 1: Adding evaluation triplets...")
+     report_uri = manager.add_evaluation_triplets(sample_report)
+     print(f"  Created: {report_uri}")
+     print()
+
+     # Test statistics
+     print("Test 2: Getting statistics...")
+     stats = manager.get_statistics()
+     for key, value in stats.items():
+         print(f"  {key}: {value}")
+     print()
+
+     # Export test
+     print("Test 3: Exporting data graph...")
+     os.makedirs(os.path.dirname(data_path), exist_ok=True)
+     manager.export_to_ttl(data_path)
+     print(f"  Exported to: {data_path}")
+
+     print("\n=== Tests Complete ===")
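The graph method above returns a node-link structure of the kind D3-style force layouts consume. A minimal standalone sketch of that shape (the node and link values here are invented for illustration, not taken from the SysCRED ontology):

```python
import json

# Illustrative node-link payload mirroring the {'nodes': ..., 'links': ...}
# structure built by the graph method above
graph = {
    "nodes": [
        {"id": "report1", "label": "Report", "group": 1},
        {"id": "analysis1", "label": "SourceAnalysis", "group": 2},
    ],
    "links": [
        {"source": "report1", "target": "analysis1", "value": 2, "type": "primary"},
    ],
}

# Serialize for a frontend consumer
payload = json.dumps(graph)
```

Any frontend that accepts node-link JSON can render this directly; the `group` field maps to a color scale and `value` to link width in typical force-graph setups.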
syscred/requirements-light.txt ADDED
@@ -0,0 +1,31 @@
+ # SysCRED - Light Requirements (for Render Free Tier)
+ # Hybrid Credibility Verification System
+ # (c) Dominique S. Loyer
+ #
+ # NOTE: ML features (embeddings) are disabled due to memory constraints.
+ # For full ML support, use Railway, Fly.io, or Google Cloud Run.
+
+ # === Core Dependencies ===
+ requests>=2.28.0
+ beautifulsoup4>=4.11.0
+ python-whois>=0.8.0
+
+ # === RDF/Ontology ===
+ rdflib>=6.0.0
+
+ # === Data Processing (lightweight) ===
+ numpy>=1.24.0
+ pandas>=2.0.0
+
+ # === Web Backend ===
+ flask>=2.3.0
+ flask-cors>=4.0.0
+ python-dotenv>=1.0.0
+
+ # === Production/Database ===
+ gunicorn>=20.1.0
+ psycopg2-binary>=2.9.0
+ flask-sqlalchemy>=3.0.0
+
+ # === Development/Testing ===
+ pytest>=7.0.0
syscred/requirements.txt ADDED
@@ -0,0 +1,34 @@
+ # SysCRED - Requirements
+ # Hybrid Credibility Verification System
+ # (c) Dominique S. Loyer
+
+ # === Core Dependencies ===
+ requests>=2.28.0
+ beautifulsoup4>=4.11.0
+ python-whois>=0.8.0
+
+ # === RDF/Ontology ===
+ rdflib>=6.0.0
+
+ # === Machine Learning ===
+ transformers>=4.30.0
+ torch>=2.0.0
+ numpy>=1.24.0
+ sentence-transformers>=2.2.0
+
+ # === Explainability ===
+ lime>=0.2.0
+
+ # === Web Backend ===
+ flask>=2.3.0
+ flask-cors>=4.0.0
+ python-dotenv>=1.0.0
+ pandas>=2.0.0
+
+ # === Production/Database ===
+ gunicorn>=20.1.0
+ psycopg2-binary>=2.9.0
+ flask-sqlalchemy>=3.0.0
+
+ # === Development/Testing ===
+ pytest>=7.0.0
syscred/requirements_light.txt ADDED
@@ -0,0 +1,19 @@
+ # SysCRED - Requirements (Light Version for Render Free Tier)
+ # No ML models - heuristic mode only
+ # (c) Dominique S. Loyer
+
+ # === Core Dependencies ===
+ requests>=2.28.0
+ beautifulsoup4>=4.11.0
+ python-whois>=0.8.0
+
+ # === RDF/Ontology ===
+ rdflib>=6.0.0
+
+ # === Web Backend ===
+ flask>=2.3.0
+ flask-cors>=4.0.0
+ python-dotenv>=1.0.0
+
+ # === Production ===
+ gunicorn>=20.1.0
syscred/run_benchmark.py ADDED
@@ -0,0 +1,135 @@
+ import json
+ import time
+ import sys
+ from pathlib import Path
+ import pandas as pd
+
+ # Add project root to path (one level up from this script)
+ sys.path.append(str(Path(__file__).parent.parent))
+
+ from syscred.verification_system import CredibilityVerificationSystem
+ from syscred.config import config
+
+ def run_benchmark():
+     print("=" * 60)
+     print(" SysCRED v2.1 - Scientific Evaluation Benchmark ")
+     print("=" * 60)
+
+     # Load benchmark data
+     data_path = Path(__file__).parent / "benchmark_data.json"
+     if not data_path.exists():
+         print(f"❌ Error: {data_path} not found.")
+         return
+
+     with open(data_path, 'r') as f:
+         dataset = json.load(f)
+
+     print(f"Loaded {len(dataset)} test cases.\n")
+
+     # Initialize system with full capabilities
+     print("Initializing SysCRED (ML models + Google API)...")
+     system = CredibilityVerificationSystem(
+         ontology_base_path=str(config.ONTOLOGY_BASE_PATH),
+         ontology_data_path=str(config.ONTOLOGY_DATA_PATH),
+         load_ml_models=True,  # Use full ML for the benchmark
+         google_api_key=config.GOOGLE_FACT_CHECK_API_KEY
+     )
+     print("System ready.\n")
+
+     results = []
+
+     # Run evaluation
+     for i, item in enumerate(dataset):
+         url = item['url']
+         label = item['label']
+         print(f"[{i+1}/{len(dataset)}] Analyzing: {url} (Expected: {label})...")
+
+         start_time = time.time()
+         try:
+             # Run analysis; empty-text fallbacks are treated as a valid logic path
+             report = system.verify_information(url)
+             score = report.get('score_credibilite', 0.5)
+
+             # Determine the system verdict. This binary benchmark uses a single
+             # threshold; a three-way mapping (High >= 0.7, Medium 0.4-0.7,
+             # Low < 0.4) could be used for finer-grained evaluation.
+             sys_verdict = "High" if score >= 0.55 else "Low"
+
+             # Compare with the expected label
+             # (positive class for precision/recall = "High" credibility)
+             match = (sys_verdict == label)
+
+             results.append({
+                 "url": url,
+                 "expected": label,
+                 "score": score,
+                 "system_verdict": sys_verdict,
+                 "match": match,
+                 "time": time.time() - start_time,
+                 "error": None
+             })
+             print(f"  -> Score: {score:.2f} | Verdict: {sys_verdict} | Match: {'✅' if match else '❌'}")
+
+         except Exception as e:
+             print(f"  -> ❌ Error: {e}")
+             results.append({
+                 "url": url,
+                 "expected": label,
+                 "score": 0,
+                 "system_verdict": "Error",
+                 "match": False,
+                 "time": time.time() - start_time,
+                 "error": str(e)
+             })
+
+     # Calculate metrics
+     print("\n" + "=" * 60)
+     print("RESULTS SUMMARY")
+     print("=" * 60)
+
+     df = pd.DataFrame(results)
+
+     # Confusion-matrix counts:
+     #   TP: System=High, Expected=High
+     #   FP: System=High, Expected=Low
+     #   TN: System=Low,  Expected=Low
+     #   FN: System=Low,  Expected=High
+     tp = len(df[(df['system_verdict'] == 'High') & (df['expected'] == 'High')])
+     fp = len(df[(df['system_verdict'] == 'High') & (df['expected'] == 'Low')])
+     tn = len(df[(df['system_verdict'] == 'Low') & (df['expected'] == 'Low')])
+     fn = len(df[(df['system_verdict'] == 'Low') & (df['expected'] == 'High')])
+
+     precision = tp / (tp + fp) if (tp + fp) > 0 else 0
+     recall = tp / (tp + fn) if (tp + fn) > 0 else 0
+     accuracy = (tp + tn) / len(df) if len(df) > 0 else 0
+     f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
+
+     print(f"Total Cases: {len(df)}")
+     print(f"Accuracy:  {accuracy:.2%}")
+     print(f"Precision: {precision:.2%}")
+     print(f"Recall:    {recall:.2%}")
+     print(f"F1-Score:  {f1:.2f}")
+
+     print("\nConfusion Matrix:")
+     print("          | Pred High | Pred Low")
+     print(f"True High |     {tp}     |    {fn}")
+     print(f"True Low  |     {fp}     |    {tn}")
+
+     # Save detailed report
+     report_path = Path(__file__).parent / "benchmark_results.csv"
+     df.to_csv(report_path, index=False)
+     print(f"\nDetailed CSV saved to: {report_path}")
+
+ if __name__ == "__main__":
+     run_benchmark()
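The precision/recall/F1 arithmetic in the script above can be checked in isolation. A minimal sketch with an invented confusion matrix (the counts are illustrative only):

```python
def compute_metrics(tp: int, fp: int, tn: int, fn: int):
    """Binary-classification metrics from a confusion matrix,
    guarding against division by zero as in the benchmark script."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    accuracy = (tp + tn) / total if total > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, accuracy, f1

# Example: 4 true positives, 1 false positive, 3 true negatives, 2 false negatives
p, r, a, f1 = compute_metrics(tp=4, fp=1, tn=3, fn=2)
# p = 0.8, r = 4/6, a = 0.7
```

With these counts, F1 works out to 8/11 ≈ 0.727, between precision and recall as expected.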
syscred/run_trec_benchmark.py ADDED
@@ -0,0 +1,414 @@
+ # -*- coding: utf-8 -*-
+ """
+ TREC Benchmark Script - SysCRED
+ ================================
+ Run TREC-style evaluation on the fact-checking system.
+
+ This script:
+ 1. Loads TREC AP88-90 topics and qrels
+ 2. Runs retrieval with multiple models (BM25, QLD, TF-IDF)
+ 3. Evaluates using pytrec_eval metrics
+ 4. Generates comparison tables and visualizations
+
+ Usage:
+     python run_trec_benchmark.py --index /path/to/index --qrels /path/to/qrels
+
+ (c) Dominique S. Loyer - PhD Thesis Prototype
+ Citation Key: loyerEvaluationModelesRecherche2025
+ """
+
+ import json
+ import argparse
+ import time
+ import sys
+ from pathlib import Path
+ from typing import Dict, List, Any, Tuple
+ from collections import defaultdict
+
+ # Add parent directory to path
+ sys.path.insert(0, str(Path(__file__).parent))
+
+ from syscred.trec_retriever import TRECRetriever, RetrievalResult
+ from syscred.trec_dataset import TRECDataset, SAMPLE_TOPICS
+ from syscred.eval_metrics import EvaluationMetrics
+
+
+ class TRECBenchmark:
+     """
+     TREC-style benchmark runner for SysCRED.
+
+     Runs multiple retrieval configurations and compares performance
+     using standard IR metrics.
+     """
+
+     # Configurations to test
+     CONFIGURATIONS = [
+         {"name": "BM25", "model": "bm25", "prf": False},
+         {"name": "BM25+PRF", "model": "bm25", "prf": True},
+         {"name": "QLD", "model": "qld", "prf": False},
+         {"name": "QLD+PRF", "model": "qld", "prf": True},
+     ]
+
+     # Metrics to evaluate
+     METRICS = ["map", "ndcg", "P_10", "P_20", "recall_100", "recip_rank"]
+
+     def __init__(
+         self,
+         index_path: str = None,
+         corpus_path: str = None,
+         topics_path: str = None,
+         qrels_path: str = None,
+         output_dir: str = None
+     ):
+         """
+         Initialize the benchmark runner.
+
+         Args:
+             index_path: Path to Lucene index
+             corpus_path: Path to JSONL corpus
+             topics_path: Path to TREC topics
+             qrels_path: Path to TREC qrels
+             output_dir: Directory for output files
+         """
+         self.index_path = index_path
+         self.corpus_path = corpus_path
+         self.topics_path = topics_path
+         self.qrels_path = qrels_path
+         self.output_dir = Path(output_dir) if output_dir else Path("benchmark_results")
+
+         # Create output directory
+         self.output_dir.mkdir(parents=True, exist_ok=True)
+
+         # Initialize components
+         self.dataset = TRECDataset(
+             topics_dir=topics_path,
+             qrels_dir=qrels_path,
+             corpus_path=corpus_path
+         )
+
+         self.retriever = TRECRetriever(
+             index_path=index_path,
+             corpus_path=corpus_path,
+             use_stemming=True
+         )
+
+         self.metrics = EvaluationMetrics()
+
+         # Results storage
+         self.results: Dict[str, Dict[str, Any]] = {}
+
+     def load_data(self):
+         """Load topics and qrels."""
+         print("\n" + "=" * 60)
+         print("Loading TREC Data")
+         print("=" * 60)
+
+         # Load topics
+         if self.topics_path:
+             self.dataset.load_topics(self.topics_path)
+         else:
+             # Use sample topics
+             print("[Benchmark] Using sample topics (no topics file provided)")
+             self.dataset.topics = SAMPLE_TOPICS.copy()
+
+         # Load qrels
+         if self.qrels_path:
+             self.dataset.load_qrels(self.qrels_path)
+         else:
+             print("[Benchmark] No qrels provided - evaluation will be limited")
+
+         # Load corpus if available
+         if self.corpus_path:
+             self.dataset.load_corpus_jsonl(self.corpus_path)
+
+         stats = self.dataset.get_statistics()
+         print("\nDataset Statistics:")
+         for key, value in stats.items():
+             print(f"  {key}: {value}")
+
+     def run_configuration(
+         self,
+         config: Dict[str, Any],
+         query_type: str = "short",
+         k: int = 100
+     ) -> Tuple[str, Dict[str, Any]]:
+         """
+         Run a single retrieval configuration.
+
+         Returns:
+             (run_tag, results_dict)
+         """
+         config_name = config["name"]
+         model = config["model"]
+         use_prf = config["prf"]
+
+         run_tag = f"syscred_{config_name}_{query_type}"
+
+         print(f"\n--- Running: {run_tag} ---")
+
+         queries = self.dataset.get_topic_queries(query_type)
+
+         if not queries:
+             print("  No queries available!")
+             return run_tag, {}
+
+         # Run retrieval
+         start_time = time.time()
+
+         all_results = []
+         run_lines = []
+
+         for topic_id, query_text in queries.items():
+             result = self.retriever.retrieve_evidence(
+                 claim=query_text,
+                 k=k,
+                 model=model,
+                 use_prf=use_prf
+             )
+
+             for evidence in result.evidences:
+                 all_results.append({
+                     "topic_id": topic_id,
+                     "doc_id": evidence.doc_id,
+                     "score": evidence.score,
+                     "rank": evidence.rank
+                 })
+                 run_lines.append(
+                     f"{topic_id} Q0 {evidence.doc_id} {evidence.rank} {evidence.score:.6f} {run_tag}"
+                 )
+
+         elapsed = time.time() - start_time
+
+         # Save run file
+         run_file = self.output_dir / f"{run_tag}.run"
+         with open(run_file, 'w') as f:
+             f.write("\n".join(run_lines))
+
+         print(f"  Queries: {len(queries)}")
+         print(f"  Total results: {len(all_results)}")
+         print(f"  Time: {elapsed:.2f}s")
+         print(f"  Saved: {run_file}")
+
+         return run_tag, {
+             "config": config,
+             "query_type": query_type,
+             "results": all_results,
+             "run_file": str(run_file),
+             "elapsed_time": elapsed
+         }
+
+     def evaluate_run(self, run_tag: str, results: Dict[str, Any]) -> Dict[str, float]:
+         """
+         Evaluate a run using pytrec_eval.
+
+         Returns a dictionary of metric -> value (aggregated across queries).
+         """
+         if not self.dataset.qrels:
+             print("  [Skip evaluation - no qrels]")
+             return {}
+
+         # Convert results to pytrec format: {query_id: [(doc_id, score), ...]}
+         run = defaultdict(list)
+         for r in results["results"]:
+             run[r["topic_id"]].append((r["doc_id"], r["score"]))
+
+         # Sort each query's results by score descending
+         for qid in run:
+             run[qid].sort(key=lambda x: x[1], reverse=True)
+
+         # Convert qrels to pytrec format
+         qrels = {}
+         for topic_id, docs in self.dataset.qrels.items():
+             qrels[topic_id] = {doc_id: rel for doc_id, rel in docs.items()}
+
+         # Evaluate
+         try:
+             per_query_results = self.metrics.evaluate_run(dict(run), qrels, self.METRICS)
+             # Aggregate results across queries
+             aggregated = self.metrics.compute_aggregate(per_query_results)
+             return aggregated
+         except Exception as e:
+             print(f"  [Evaluation error: {e}]")
+             return {}
+
+     def run_full_benchmark(self, query_types: List[str] = None, k: int = 100):
+         """
+         Run the complete benchmark suite.
+
+         Args:
+             query_types: List of query types to test ("short", "long")
+             k: Number of results per query
+         """
+         if query_types is None:
+             query_types = ["short", "long"]
+
+         print("\n" + "=" * 60)
+         print("TREC Benchmark - SysCRED")
+         print("=" * 60)
+
+         # Load data
+         self.load_data()
+
+         # Run all configurations
+         print("\n" + "=" * 60)
+         print("Running Retrieval Experiments")
+         print("=" * 60)
+
+         for query_type in query_types:
+             for config in self.CONFIGURATIONS:
+                 run_tag, results = self.run_configuration(
+                     config, query_type, k
+                 )
+
+                 if results:
+                     self.results[run_tag] = results
+
+                     # Evaluate
+                     metrics = self.evaluate_run(run_tag, results)
+                     self.results[run_tag]["metrics"] = metrics
+
+         # Generate report
+         self.generate_report()
+
+         return self.results
+
+     def generate_report(self):
+         """Generate summary report."""
+         print("\n" + "=" * 60)
+         print("Benchmark Results Summary")
+         print("=" * 60)
+
+         # Table header
+         header = ["Configuration", "Query", "MAP", "NDCG", "P@10", "MRR", "Time(s)"]
+         print("\n" + " | ".join(f"{h:^12}" for h in header))
+         print("-" * 100)
+
+         # Table rows
+         for run_tag, data in self.results.items():
+             metrics = data.get("metrics", {})
+
+             row = [
+                 data["config"]["name"][:12],
+                 data["query_type"][:5],
+                 f"{metrics.get('map', 0):.4f}",
+                 f"{metrics.get('ndcg', 0):.4f}",
+                 f"{metrics.get('P_10', 0):.4f}",
+                 f"{metrics.get('recip_rank', 0):.4f}",
+                 f"{data.get('elapsed_time', 0):.2f}"
+             ]
+             print(" | ".join(f"{v:^12}" for v in row))
+
+         # Save detailed results
+         results_file = self.output_dir / "benchmark_results.json"
+
+         # Make results JSON serializable
+         serializable_results = {}
+         for run_tag, data in self.results.items():
+             serializable_results[run_tag] = {
+                 "config": data["config"],
+                 "query_type": data["query_type"],
+                 "metrics": data.get("metrics", {}),
+                 "elapsed_time": data.get("elapsed_time", 0),
+                 "num_results": len(data.get("results", []))
+             }
+
+         with open(results_file, 'w') as f:
+             json.dump(serializable_results, f, indent=2)
+
+         print(f"\nDetailed results saved to: {results_file}")
+
+         # Generate LaTeX table
+         self._generate_latex_table()
+
+     def _generate_latex_table(self):
+         """Generate LaTeX table for the paper."""
+         latex_file = self.output_dir / "results_table.tex"
+
+         lines = [
+             r"\begin{table}[ht]",
+             r"\centering",
+             r"\caption{TREC AP88-90 Retrieval Results}",
+             r"\label{tab:trec-results}",
+             r"\begin{tabular}{l|l|cccc}",
+             r"\toprule",
+             r"Model & Query & MAP & NDCG & P@10 & MRR \\",
+             r"\midrule"
+         ]
+
+         for run_tag, data in self.results.items():
+             metrics = data.get("metrics", {})
+             row = (
+                 f"{data['config']['name']} & {data['query_type']} & "
+                 f"{metrics.get('map', 0):.4f} & "
+                 f"{metrics.get('ndcg', 0):.4f} & "
+                 f"{metrics.get('P_10', 0):.4f} & "
+                 f"{metrics.get('recip_rank', 0):.4f} \\\\"
+             )
+             lines.append(row)
+
+         lines.extend([
+             r"\bottomrule",
+             r"\end{tabular}",
+             r"\end{table}"
+         ])
+
+         with open(latex_file, 'w') as f:
+             f.write("\n".join(lines))
+
+         print(f"LaTeX table saved to: {latex_file}")
+
+
+ def main():
+     """Main entry point."""
+     parser = argparse.ArgumentParser(
+         description="Run TREC benchmark for SysCRED"
+     )
+     parser.add_argument("--index", "-i", help="Path to Lucene index")
+     parser.add_argument("--corpus", "-c", help="Path to JSONL corpus")
+     parser.add_argument("--topics", "-t", help="Path to TREC topics file/directory")
+     parser.add_argument("--qrels", "-q", help="Path to TREC qrels file/directory")
+     parser.add_argument(
+         "--output", "-o",
+         default="benchmark_results",
+         help="Output directory for results"
+     )
+     parser.add_argument(
+         "--k",
+         type=int,
+         default=100,
+         help="Number of results per query"
+     )
+
+     args = parser.parse_args()
+
+     # Run benchmark
+     benchmark = TRECBenchmark(
+         index_path=args.index,
+         corpus_path=args.corpus,
+         topics_path=args.topics,
+         qrels_path=args.qrels,
+         output_dir=args.output
+     )
+
+     results = benchmark.run_full_benchmark(k=args.k)
+
+     print("\n" + "=" * 60)
+     print("Benchmark Complete!")
+     print("=" * 60)
+
+
+ if __name__ == "__main__":
+     main()
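The run files written by `run_configuration` above use the standard six-column TREC run format (`qid Q0 docid rank score tag`). A minimal sketch of formatting one line (the topic and document IDs here are invented for illustration):

```python
def trec_run_line(topic_id: str, doc_id: str, rank: int, score: float, run_tag: str) -> str:
    # Standard TREC run format: qid Q0 docid rank score tag,
    # with the score printed to six decimal places as in the script above
    return f"{topic_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}"

line = trec_run_line("401", "AP880212-0001", 1, 12.5, "syscred_BM25_short")
# -> "401 Q0 AP880212-0001 1 12.500000 syscred_BM25_short"
```

Files in this format can be scored directly with `trec_eval` or `pytrec_eval` against a qrels file.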
syscred/save_to_notes.sh ADDED
@@ -0,0 +1,121 @@
+ #!/bin/bash
+ # ============================================
+ # save_to_notes.sh
+ # Script to save documentation to Obsidian and Notion
+ #
+ # Usage: ./save_to_notes.sh [optional_file_path]
+ #
+ # Default: saves SysCRED_Documentation.md
+ # ============================================
+
+ # Configuration - ADJUST THESE PATHS TO YOUR SETUP
+ OBSIDIAN_VAULT="${OBSIDIAN_VAULT:-/Users/bk280625/documents041025/Obsidian_UQAM25_bk051225}"
+ NOTION_CLIPBOARD=true  # true = copy to the clipboard for Notion
+
+ # Colors for output
+ GREEN='\033[0;32m'
+ BLUE='\033[0;34m'
+ YELLOW='\033[1;33m'
+ NC='\033[0m'  # No Color
+
+ # Date for versioning
+ DATE=$(date +%Y%m%d)
+ DATETIME=$(date +"%Y-%m-%d %H:%M")
+
+ # Source file (argument or default)
+ if [ -n "$1" ]; then
+     DOC_SOURCE="$1"
+ else
+     DOC_SOURCE="/Users/bk280625/documents041025/MonCode/syscred/SysCRED_Documentation.md"
+ fi
+
+ # Check that the file exists
+ if [ ! -f "$DOC_SOURCE" ]; then
+     echo -e "${YELLOW}⚠️  File not found: $DOC_SOURCE${NC}"
+     exit 1
+ fi
+
+ # Filename without path
+ FILENAME=$(basename "$DOC_SOURCE" .md)
+
+ echo -e "${BLUE}📝 Saving: $DOC_SOURCE${NC}"
+ echo "   Date: $DATETIME"
+ echo ""
+
+ # ============================================
+ # 1. OBSIDIAN
+ # ============================================
+ echo -e "${BLUE}📚 OBSIDIAN${NC}"
+
+ # Create the Obsidian folder if it does not exist
+ if [ ! -d "$OBSIDIAN_VAULT" ]; then
+     echo "   ⚠️  Obsidian vault not found: $OBSIDIAN_VAULT"
+     echo "   Creating folder..."
+     mkdir -p "$OBSIDIAN_VAULT"
+ fi
+
+ # Copy the file
+ OBSIDIAN_FILE="$OBSIDIAN_VAULT/${FILENAME}.md"
+ cp "$DOC_SOURCE" "$OBSIDIAN_FILE"
+
+ if [ -f "$OBSIDIAN_FILE" ]; then
+     echo -e "   ${GREEN}✅ Copied: $OBSIDIAN_FILE${NC}"
+
+     # Open in Obsidian (macOS only)
+     if [[ "$OSTYPE" == "darwin"* ]]; then
+         # URL-encode the filename
+         ENCODED_FILE=$(echo "$FILENAME" | sed 's/ /%20/g')
+         VAULT_NAME=$(basename "$OBSIDIAN_VAULT")
+
+         # Open Obsidian at the file
+         open "obsidian://open?vault=$VAULT_NAME&file=$ENCODED_FILE" 2>/dev/null
+         echo "   📖 Opened in Obsidian"
+     fi
+ else
+     echo "   ❌ Copy failed"
+ fi
+
+ echo ""
+
+ # ============================================
+ # 2. NOTION (via clipboard)
+ # ============================================
+ echo -e "${BLUE}📋 NOTION${NC}"
+
+ if [ "$NOTION_CLIPBOARD" = true ]; then
+     # Copy the content to the clipboard
+     if [[ "$OSTYPE" == "darwin"* ]]; then
+         # macOS
+         cat "$DOC_SOURCE" | pbcopy
+         echo -e "   ${GREEN}✅ Content copied to the clipboard${NC}"
+         echo "   📝 To paste into Notion:"
+         echo "      1. Open Notion"
+         echo "      2. Create a new page"
+         echo "      3. Cmd+V to paste"
+     elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
+         # Linux with xclip
+         if command -v xclip &> /dev/null; then
+             cat "$DOC_SOURCE" | xclip -selection clipboard
+             echo -e "   ${GREEN}✅ Content copied to the clipboard${NC}"
+         else
+             echo "   ⚠️  xclip not installed (sudo apt install xclip)"
+         fi
+     fi
+ fi
+
+ echo ""
+
+ # ============================================
+ # 3. SUMMARY
+ # ============================================
+ echo -e "${GREEN}================================${NC}"
+ echo -e "${GREEN}✨ Save complete!${NC}"
+ echo -e "${GREEN}================================${NC}"
+ echo ""
+ echo "Files:"
+ echo "  • Original: $DOC_SOURCE"
+ echo "  • Obsidian: $OBSIDIAN_FILE"
+ echo "  • Notion:   📋 (clipboard)"
+ echo ""
+ echo "Size:  $(wc -c < "$DOC_SOURCE" | tr -d ' ') bytes"
+ echo "Lines: $(wc -l < "$DOC_SOURCE" | tr -d ' ')"
syscred/seo_analyzer.py ADDED
@@ -0,0 +1,610 @@
+ # -*- coding: utf-8 -*-
+ """
+ SEO Analyzer Module - SysCRED
+ ==============================
+ Provides SEO analysis and Information Retrieval metrics for credibility assessment.
+
+ Implements:
+ - TF-IDF calculation
+ - BM25 scoring
+ - PageRank estimation/explanation
+ - SEO meta tag analysis
+ - Backlink quality assessment
+
+ (c) Dominique S. Loyer - PhD Thesis Prototype
+ Citation Key: loyerModelingHybridSystem2025
+
+ Note on scalability:
+ - For large corpora, consider Cython or Rust for TF-IDF/BM25
+ - Matrix computations can benefit from optimized NumPy or C libraries
+ """
+
+ import math
+ import re
+ from typing import List, Dict, Tuple, Optional, Any
+ from dataclasses import dataclass
+ from collections import Counter
+ from urllib.parse import urlparse
+
+ try:
+     import numpy as np
+     HAS_NUMPY = True
+ except ImportError:
+     HAS_NUMPY = False
+
+
+ # --- Data Classes ---
+
+ @dataclass
+ class SEOAnalysis:
+     """Results of SEO analysis for a webpage."""
+     url: str
+     title_length: int
+     title_has_keywords: bool
+     meta_description_length: int
+     has_meta_keywords: bool
+     heading_structure: Dict[str, int]  # h1, h2, h3 counts
+     word_count: int
+     keyword_density: Dict[str, float]
+     readability_score: float
+     seo_score: float  # Overall 0-1 score
+
+
+ @dataclass
+ class PageRankExplanation:
+     """Explainable PageRank estimation."""
+     url: str
+     estimated_pr: float
+     factors: List[Dict[str, Any]]
+     explanation_text: str
+     confidence: float
+
+
+ @dataclass
+ class IRMetrics:
+     """Information Retrieval metrics for a document."""
+     tf_idf_scores: Dict[str, float]
+     bm25_score: float
+     top_terms: List[Tuple[str, float]]
+     document_length: int
+     avg_term_frequency: float
+
+
+ class SEOAnalyzer:
+     """
+     Analyze SEO factors and compute IR metrics for credibility assessment.
+
+     This module helps explain WHY a URL might rank well (or poorly) in search engines,
+     which is a factor in its credibility assessment.
+     """
+
+     # BM25 parameters (classic values)
+     BM25_K1 = 1.5  # Term frequency saturation
+     BM25_B = 0.75  # Length normalization
+
+     # Stopwords (expandable)
+     STOPWORDS = {
+         'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
+         'of', 'with', 'by', 'from', 'as', 'is', 'was', 'are', 'were', 'been',
+         'be', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
+         'could', 'should', 'may', 'might', 'must', 'shall', 'can', 'need',
+         'this', 'that', 'these', 'those', 'it', 'its', 'they', 'them',
+         'he', 'she', 'him', 'her', 'his', 'my', 'your', 'our', 'their',
+         'what', 'which', 'who', 'whom', 'when', 'where', 'why', 'how',
+         'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other',
+         'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
+         'than', 'too', 'very', 'just', 'also', 'now', 'here', 'there',
+         # French stopwords
+         'le', 'la', 'les', 'un', 'une', 'des', 'du', 'de', 'et', 'ou',
+         'mais', 'donc', 'car', 'ni', 'que', 'qui', 'quoi', 'dont', 'où',
+         'ce', 'cette', 'ces', 'mon', 'ma', 'mes', 'ton', 'ta', 'tes',
+         'son', 'sa', 'ses', 'notre', 'nos', 'votre', 'vos', 'leur', 'leurs',
+         'je', 'tu', 'il', 'elle', 'nous', 'vous', 'ils', 'elles', 'on',
+         'est', 'sont', 'être', 'avoir', 'fait', 'faire', 'dit', 'dire',
+         'plus', 'moins', 'très', 'bien', 'tout', 'tous', 'toute', 'toutes',
+         'pour', 'par', 'sur', 'sous', 'avec', 'sans', 'dans', 'en', 'au', 'aux'
+     }
+
+     def __init__(self):
+         """Initialize the SEO analyzer."""
+         # Reference corpus statistics (can be updated with real data)
+         self.avg_doc_length = 500  # Average document length in words
+         self.corpus_size = 1000  # Number of documents in reference corpus
+         # IDF values for common terms (placeholder - would be computed from a real corpus)
+         self.idf_cache = {}
+
+     def tokenize(self, text: str, remove_stopwords: bool = True) -> List[str]:
+         """
+         Tokenize text into words.
+
+         Args:
+             text: Input text
+             remove_stopwords: Whether to remove stopwords
+
+         Returns:
+             List of tokens
+         """
+         if not text:
+             return []
+
+         # Lowercase and extract words
+         text = text.lower()
+         tokens = re.findall(r'\b[a-zA-ZÀ-ÿ]{2,}\b', text)
+
+         if remove_stopwords:
+             tokens = [t for t in tokens if t not in self.STOPWORDS]
+
+         return tokens
+
+     def calculate_tf(self, tokens: List[str]) -> Dict[str, float]:
+         """
+         Calculate Term Frequency for each token.
+
+         TF(t) = (count of t in document) / (total terms in document)
+         """
+         if not tokens:
+             return {}
+
+         term_counts = Counter(tokens)
+         total_terms = len(tokens)
+
+         return {term: count / total_terms for term, count in term_counts.items()}
+
+     def calculate_idf(self, term: str, doc_frequency: int = None) -> float:
+         """
+         Calculate Inverse Document Frequency.
+
+         IDF(t) = log(N / (1 + df(t)))
+
+         Args:
+             term: The term to calculate IDF for
+             doc_frequency: Number of documents containing the term
+                            (if None, use a heuristic based on term length)
+         """
+         if term in self.idf_cache:
+             return self.idf_cache[term]
+
+         if doc_frequency is None:
+             # Heuristic: shorter common words appear in more documents
+             if len(term) <= 3:
+                 doc_frequency = self.corpus_size * 0.5
+             elif len(term) <= 5:
+                 doc_frequency = self.corpus_size * 0.3
+             elif len(term) <= 8:
+                 doc_frequency = self.corpus_size * 0.1
+             else:
+                 doc_frequency = self.corpus_size * 0.05
+
+         idf = math.log(self.corpus_size / (1 + doc_frequency))
+         self.idf_cache[term] = idf
+         return idf
+
+     def calculate_tf_idf(self, text: str) -> Dict[str, float]:
+         """
+         Calculate TF-IDF scores for all terms in a document.
+
+         TF-IDF(t,d) = TF(t,d) × IDF(t)
+
+         Args:
+             text: Document text
+
+         Returns:
+             Dictionary of term -> TF-IDF score
+         """
+         tokens = self.tokenize(text)
+         tf_scores = self.calculate_tf(tokens)
+
+         tf_idf = {}
+         for term, tf in tf_scores.items():
+             idf = self.calculate_idf(term)
+             tf_idf[term] = tf * idf
+
+         return tf_idf
+
+     def calculate_bm25(
+         self,
+         query: str,
+         document: str,
+         k1: float = None,
+         b: float = None
+     ) -> float:
+         """
+         Calculate BM25 relevance score between query and document.
+
+         BM25(D, Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))
+
+         Args:
+             query: Query string
+             document: Document text
+             k1: Term frequency saturation parameter
+             b: Length normalization parameter
+
+         Returns:
+             BM25 score
+         """
+         # Explicit None checks so that 0 can be passed as a valid parameter value
+         k1 = self.BM25_K1 if k1 is None else k1
+         b = self.BM25_B if b is None else b
+
+         query_tokens = self.tokenize(query)
+         doc_tokens = self.tokenize(document, remove_stopwords=False)
31
+ HAS_NUMPY = True
32
+ except ImportError:
33
+ HAS_NUMPY = False
34
+
35
+
36
+ # --- Data Classes ---
37
+
38
+ @dataclass
39
+ class SEOAnalysis:
40
+ """Results of SEO analysis for a webpage."""
41
+ url: str
42
+ title_length: int
43
+ title_has_keywords: bool
44
+ meta_description_length: int
45
+ has_meta_keywords: bool
46
+ heading_structure: Dict[str, int] # h1, h2, h3 counts
47
+ word_count: int
48
+ keyword_density: Dict[str, float]
49
+ readability_score: float
50
+ seo_score: float # Overall 0-1 score
51
+
52
+
53
+ @dataclass
54
+ class PageRankExplanation:
55
+ """Explainable PageRank estimation."""
56
+ url: str
57
+ estimated_pr: float
58
+ factors: List[Dict[str, Any]]
59
+ explanation_text: str
60
+ confidence: float
61
+
62
+
63
+ @dataclass
64
+ class IRMetrics:
65
+ """Information Retrieval metrics for a document."""
66
+ tf_idf_scores: Dict[str, float]
67
+ bm25_score: float
68
+ top_terms: List[Tuple[str, float]]
69
+ document_length: int
70
+ avg_term_frequency: float
71
+
72
+
73
+ class SEOAnalyzer:
74
+ """
75
+ Analyze SEO factors and compute IR metrics for credibility assessment.
76
+
77
+ This module helps explain WHY a URL might rank well (or poorly) in search engines,
78
+ which is a factor in its credibility assessment.
79
+ """
80
+
81
+ # BM25 parameters (classic values)
82
+ BM25_K1 = 1.5 # Term frequency saturation
83
+ BM25_B = 0.75 # Length normalization
84
+
85
+ # Stopwords (expandable)
86
+ STOPWORDS = {
87
+ 'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
88
+ 'of', 'with', 'by', 'from', 'as', 'is', 'was', 'are', 'were', 'been',
89
+ 'be', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would',
90
+ 'could', 'should', 'may', 'might', 'must', 'shall', 'can', 'need',
91
+ 'this', 'that', 'these', 'those', 'it', 'its', 'they', 'them',
92
+ 'he', 'she', 'him', 'her', 'his', 'my', 'your', 'our', 'their',
93
+ 'what', 'which', 'who', 'whom', 'when', 'where', 'why', 'how',
94
+ 'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other',
95
+ 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
96
+ 'than', 'too', 'very', 'just', 'also', 'now', 'here', 'there',
97
+ # French stopwords
98
+ 'le', 'la', 'les', 'un', 'une', 'des', 'du', 'de', 'et', 'ou',
99
+ 'mais', 'donc', 'car', 'ni', 'que', 'qui', 'quoi', 'dont', 'où',
100
+ 'ce', 'cette', 'ces', 'mon', 'ma', 'mes', 'ton', 'ta', 'tes',
101
+ 'son', 'sa', 'ses', 'notre', 'nos', 'votre', 'vos', 'leur', 'leurs',
102
+ 'je', 'tu', 'il', 'elle', 'nous', 'vous', 'ils', 'elles', 'on',
103
+ 'est', 'sont', 'être', 'avoir', 'fait', 'faire', 'dit', 'dire',
104
+ 'plus', 'moins', 'très', 'bien', 'tout', 'tous', 'toute', 'toutes',
105
+ 'pour', 'par', 'sur', 'sous', 'avec', 'sans', 'dans', 'en', 'au', 'aux'
106
+ }
107
+
108
+ def __init__(self):
109
+ """Initialize the SEO analyzer."""
110
+ # Reference corpus statistics (can be updated with real data)
111
+ self.avg_doc_length = 500 # Average document length in words
112
+ self.corpus_size = 1000 # Number of documents in reference corpus
113
+ # IDF values for common terms (placeholder - would be computed from real corpus)
114
+ self.idf_cache = {}
115
+
116
+ def tokenize(self, text: str, remove_stopwords: bool = True) -> List[str]:
117
+ """
118
+ Tokenize text into words.
119
+
120
+ Args:
121
+ text: Input text
122
+ remove_stopwords: Whether to remove stopwords
123
+
124
+ Returns:
125
+ List of tokens
126
+ """
127
+ if not text:
128
+ return []
129
+
130
+ # Lowercase and extract words
131
+ text = text.lower()
132
+ tokens = re.findall(r'\b[a-zA-ZÀ-ÿ]{2,}\b', text)
133
+
134
+ if remove_stopwords:
135
+ tokens = [t for t in tokens if t not in self.STOPWORDS]
136
+
137
+ return tokens
138
+
139
+ def calculate_tf(self, tokens: List[str]) -> Dict[str, float]:
140
+ """
141
+ Calculate Term Frequency for each token.
142
+
143
+ TF(t) = (count of t in document) / (total terms in document)
144
+ """
145
+ if not tokens:
146
+ return {}
147
+
148
+ term_counts = Counter(tokens)
149
+ total_terms = len(tokens)
150
+
151
+ return {term: count / total_terms for term, count in term_counts.items()}
152
+
153
+ def calculate_idf(self, term: str, doc_frequency: int = None) -> float:
154
+ """
155
+ Calculate Inverse Document Frequency.
156
+
157
+ IDF(t) = log(N / (1 + df(t)))
158
+
159
+ Args:
160
+ term: The term to calculate IDF for
161
+ doc_frequency: Number of documents containing the term
162
+ (if None, use heuristic based on term length)
163
+ """
164
+ if term in self.idf_cache:
165
+ return self.idf_cache[term]
166
+
167
+ if doc_frequency is None:
168
+ # Heuristic: shorter common words appear in more documents
169
+ if len(term) <= 3:
170
+ doc_frequency = self.corpus_size * 0.5
171
+ elif len(term) <= 5:
172
+ doc_frequency = self.corpus_size * 0.3
173
+ elif len(term) <= 8:
174
+ doc_frequency = self.corpus_size * 0.1
175
+ else:
176
+ doc_frequency = self.corpus_size * 0.05
177
+
178
+ idf = math.log(self.corpus_size / (1 + doc_frequency))
179
+ self.idf_cache[term] = idf
180
+ return idf
181
+
182
+ def calculate_tf_idf(self, text: str) -> Dict[str, float]:
183
+ """
184
+ Calculate TF-IDF scores for all terms in a document.
185
+
186
+ TF-IDF(t,d) = TF(t,d) × IDF(t)
187
+
188
+ Args:
189
+ text: Document text
190
+
191
+ Returns:
192
+ Dictionary of term -> TF-IDF score
193
+ """
194
+ tokens = self.tokenize(text)
195
+ tf_scores = self.calculate_tf(tokens)
196
+
197
+ tf_idf = {}
198
+ for term, tf in tf_scores.items():
199
+ idf = self.calculate_idf(term)
200
+ tf_idf[term] = tf * idf
201
+
202
+ return tf_idf
203
+
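As a sanity check on the TF-IDF formulas above, the sketch below reproduces them standalone, reusing the same length-based document-frequency heuristic and an assumed corpus size of 1000 (the class default). This is an illustrative toy, not the class's actual cache-backed implementation:

```python
import math
from collections import Counter

CORPUS_SIZE = 1000  # assumed reference corpus size, matching the class default

def heuristic_idf(term):
    # Same length-based df heuristic as calculate_idf: shorter words are
    # assumed to appear in a larger fraction of documents
    if len(term) <= 3:
        df = CORPUS_SIZE * 0.5
    elif len(term) <= 5:
        df = CORPUS_SIZE * 0.3
    elif len(term) <= 8:
        df = CORPUS_SIZE * 0.1
    else:
        df = CORPUS_SIZE * 0.05
    return math.log(CORPUS_SIZE / (1 + df))

def tf_idf(tokens):
    # TF(t) = count(t) / total tokens; TF-IDF(t) = TF(t) x IDF(t)
    n = len(tokens)
    tf = {t: c / n for t, c in Counter(tokens).items()}
    return {t: f * heuristic_idf(t) for t, f in tf.items()}

scores = tf_idf(["credibility", "source", "credibility"])
# "credibility" is both more frequent and longer (rarer under the heuristic),
# so it outscores "source"
```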
204
+ def calculate_bm25(
205
+ self,
206
+ query: str,
207
+ document: str,
208
+ k1: float = None,
209
+ b: float = None
210
+ ) -> float:
211
+ """
212
+ Calculate BM25 relevance score between query and document.
213
+
214
+ BM25(D, Q) = Σ IDF(qi) × (f(qi,D) × (k1 + 1)) / (f(qi,D) + k1 × (1 - b + b × |D|/avgdl))
215
+
216
+ Args:
217
+ query: Query string
218
+ document: Document text
219
+ k1: Term frequency saturation parameter
220
+ b: Length normalization parameter
221
+
222
+ Returns:
223
+ BM25 score
224
+ """
225
+ # `is None` check: `or` would wrongly discard an explicit 0.0
+ k1 = self.BM25_K1 if k1 is None else k1
226
+ b = self.BM25_B if b is None else b
227
+
228
+ query_tokens = self.tokenize(query)
229
+ doc_tokens = self.tokenize(document, remove_stopwords=False)
230
+
231
+ if not query_tokens or not doc_tokens:
232
+ return 0.0
233
+
234
+ doc_length = len(doc_tokens)
235
+ doc_term_counts = Counter(doc_tokens)
236
+
237
+ score = 0.0
238
+ for term in query_tokens:
239
+ if term not in doc_term_counts:
240
+ continue
241
+
242
+ tf = doc_term_counts[term]
243
+ idf = self.calculate_idf(term)
244
+
245
+ numerator = tf * (k1 + 1)
246
+ denominator = tf + k1 * (1 - b + b * doc_length / self.avg_doc_length)
247
+
248
+ score += idf * (numerator / denominator)
249
+
250
+ return score
251
+
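The length-normalization term in the BM25 formula above can be verified in isolation. The sketch below evaluates one summand for a single query term, with assumed constants (k1 = 1.5, b = 0.75, avgdl = 500, and a flat IDF of 1.0 for simplicity):

```python
def bm25_term(tf, doc_len, idf=1.0, k1=1.5, b=0.75, avgdl=500):
    # One summand of BM25:
    # idf * tf*(k1+1) / (tf + k1*(1 - b + b*|D|/avgdl))
    return idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avgdl))

short_doc = bm25_term(tf=2, doc_len=100)
long_doc = bm25_term(tf=2, doc_len=1000)
# Same raw term frequency, but the shorter document scores higher:
# length normalization penalizes matches diluted in long documents
```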
252
+ def analyze_seo(
253
+ self,
254
+ url: str,
255
+ title: Optional[str],
256
+ meta_description: Optional[str],
257
+ text_content: str,
258
+ headings: Dict[str, List[str]] = None
259
+ ) -> SEOAnalysis:
260
+ """
261
+ Perform comprehensive SEO analysis.
262
+
263
+ Args:
264
+ url: Page URL
265
+ title: Page title
266
+ meta_description: Meta description
267
+ text_content: Main text content
268
+ headings: Dictionary of heading levels (h1, h2, etc.) and their texts
269
+
270
+ Returns:
271
+ SEOAnalysis with all metrics
272
+ """
273
+ tokens = self.tokenize(text_content)
274
+ word_count = len(tokens)
275
+
276
+ # Title analysis
277
+ title_length = len(title) if title else 0
278
+ title_tokens = self.tokenize(title) if title else []
279
+
280
+ # Check if title contains main keywords from content
281
+ content_top_terms = Counter(tokens).most_common(10)
282
+ title_has_keywords = any(
283
+ term in title_tokens
284
+ for term, _ in content_top_terms[:5]
285
+ ) if title_tokens else False
286
+
287
+ # Meta description analysis
288
+ meta_length = len(meta_description) if meta_description else 0
289
+
290
+ # Heading structure
291
+ headings = headings or {}
292
+ heading_structure = {
293
+ 'h1': len(headings.get('h1', [])),
294
+ 'h2': len(headings.get('h2', [])),
295
+ 'h3': len(headings.get('h3', []))
296
+ }
297
+
298
+ # Keyword density (top 5 terms)
299
+ keyword_density = {}
300
+ for term, count in Counter(tokens).most_common(5):
301
+ keyword_density[term] = count / word_count if word_count > 0 else 0
302
+
303
+ # Readability score (simple metric based on average word/sentence length)
304
+ # Filter empty fragments: re.split leaves a trailing empty string after
+ # final punctuation, which would deflate the average sentence length
+ sentences = [s for s in re.split(r'[.!?]+', text_content) if s.strip()]
305
+ avg_sentence_length = word_count / len(sentences) if sentences else 0
306
+
307
+ # Convert to readability score (0-1, where 1 is optimal ~15-20 words/sentence)
308
+ if 15 <= avg_sentence_length <= 20:
309
+ readability_score = 1.0
310
+ elif 10 <= avg_sentence_length <= 25:
311
+ readability_score = 0.8
312
+ elif 5 <= avg_sentence_length <= 30:
313
+ readability_score = 0.6
314
+ else:
315
+ readability_score = 0.4
316
+
317
+ # Overall SEO score
318
+ seo_factors = []
319
+
320
+ # Title score (optimal: 50-60 chars)
321
+ if 50 <= title_length <= 60:
322
+ seo_factors.append(1.0)
323
+ elif 30 <= title_length <= 70:
324
+ seo_factors.append(0.7)
325
+ else:
326
+ seo_factors.append(0.3)
327
+
328
+ # Meta description (optimal: 150-160 chars)
329
+ if 150 <= meta_length <= 160:
330
+ seo_factors.append(1.0)
331
+ elif 100 <= meta_length <= 200:
332
+ seo_factors.append(0.7)
333
+ else:
334
+ seo_factors.append(0.3)
335
+
336
+ # Has exactly one H1
337
+ seo_factors.append(1.0 if heading_structure['h1'] == 1 else 0.5)
338
+
339
+ # Content length (minimum ~300 words; longer content scores higher)
340
+ if word_count >= 1000:
341
+ seo_factors.append(1.0)
342
+ elif word_count >= 500:
343
+ seo_factors.append(0.8)
344
+ elif word_count >= 300:
345
+ seo_factors.append(0.6)
346
+ else:
347
+ seo_factors.append(0.3)
348
+
349
+ seo_score = sum(seo_factors) / len(seo_factors) if seo_factors else 0.5
350
+
351
+ return SEOAnalysis(
352
+ url=url,
353
+ title_length=title_length,
354
+ title_has_keywords=title_has_keywords,
355
+ meta_description_length=meta_length,
356
+ has_meta_keywords=bool(keyword_density),
357
+ heading_structure=heading_structure,
358
+ word_count=word_count,
359
+ keyword_density=keyword_density,
360
+ readability_score=readability_score,
361
+ seo_score=seo_score
362
+ )
363
+
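The piecewise readability banding used in analyze_seo (optimal at 15-20 words per sentence, decaying outward) can be isolated as a small pure function, sketched here for illustration:

```python
def readability(avg_sentence_len):
    # Same banding as analyze_seo: 1.0 at the 15-20 word optimum,
    # stepping down as sentences get much shorter or much longer
    if 15 <= avg_sentence_len <= 20:
        return 1.0
    if 10 <= avg_sentence_len <= 25:
        return 0.8
    if 5 <= avg_sentence_len <= 30:
        return 0.6
    return 0.4
```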
364
+ def estimate_pagerank(
365
+ self,
366
+ url: str,
367
+ backlinks: List[Dict[str, Any]] = None,
368
+ domain_age_days: int = None,
369
+ source_reputation: str = None
370
+ ) -> PageRankExplanation:
371
+ """
372
+ Estimate and explain PageRank-like score.
373
+
374
+ This is NOT the actual Google PageRank, but an explainable approximation
375
+ based on available factors that contribute to search ranking.
376
+
377
+ PageRank Formula (simplified):
378
+ PR(A) = (1-d) + d × Σ (PR(Ti) / C(Ti))
379
+
380
+ Where:
381
+ - d = damping factor (0.85)
382
+ - Ti = pages pointing to A
383
+ - C(Ti) = number of outgoing links from Ti
384
+
385
+ Args:
386
+ url: Target URL
387
+ backlinks: List of backlink information
388
+ domain_age_days: Age of the domain in days
389
+ source_reputation: Known reputation level
390
+
391
+ Returns:
392
+ PageRankExplanation with estimated score and factors
393
+ """
394
+ d = 0.85 # Damping factor
395
+ base_pr = (1 - d) # Starting PageRank
396
+
397
+ factors = []
398
+ pr_contributions = []
399
+
400
+ # Factor 1: Domain Age
401
+ if domain_age_days is not None:
402
+ if domain_age_days > 365 * 5: # > 5 years
403
+ age_contribution = 0.3
404
+ age_description = "Domaine ancien (5+ ans) - forte confiance"
405
+ elif domain_age_days > 365 * 2: # > 2 years
406
+ age_contribution = 0.2
407
+ age_description = "Domaine établi (2-5 ans) - bonne confiance"
408
+ elif domain_age_days > 365: # > 1 year
409
+ age_contribution = 0.1
410
+ age_description = "Domaine récent (1-2 ans) - confiance modérée"
411
+ else:
412
+ age_contribution = 0.0
413
+ age_description = "Domaine très récent (<1 an) - confiance faible"
414
+
415
+ factors.append({
416
+ 'name': 'Domain Age',
417
+ 'value': f"{domain_age_days} days ({domain_age_days/365:.1f} years)",
418
+ 'contribution': age_contribution,
419
+ 'description': age_description
420
+ })
421
+ pr_contributions.append(age_contribution)
422
+
423
+ # Factor 2: Source Reputation
424
+ if source_reputation:
425
+ if source_reputation == 'High':
426
+ rep_contribution = 0.3
427
+ rep_description = "Source réputée - équivalent à beaucoup de backlinks de qualité"
428
+ elif source_reputation == 'Medium':
429
+ rep_contribution = 0.15
430
+ rep_description = "Source connue - équivalent à quelques backlinks de qualité"
431
+ else:
432
+ rep_contribution = 0.0
433
+ rep_description = "Source inconnue ou peu fiable - pas de boost de réputation"
434
+
435
+ factors.append({
436
+ 'name': 'Source Reputation',
437
+ 'value': source_reputation,
438
+ 'contribution': rep_contribution,
439
+ 'description': rep_description
440
+ })
441
+ pr_contributions.append(rep_contribution)
442
+
443
+ # Factor 3: Backlinks (if available)
444
+ backlinks = backlinks or []
445
+ if backlinks:
446
+ # Estimate backlink contribution
447
+ high_quality_count = sum(1 for bl in backlinks if bl.get('quality', 'low') == 'high')
448
+ medium_quality_count = sum(1 for bl in backlinks if bl.get('quality', 'low') == 'medium')
449
+
450
+ # Each high-quality backlink contributes more
451
+ backlink_contribution = min(0.3, high_quality_count * 0.05 + medium_quality_count * 0.02)
452
+
453
+ factors.append({
454
+ 'name': 'Backlinks',
455
+ 'value': f"{len(backlinks)} total ({high_quality_count} high quality)",
456
+ 'contribution': backlink_contribution,
457
+ 'description': f"Liens entrants détectés - contribution au classement"
458
+ })
459
+ pr_contributions.append(backlink_contribution)
460
+
461
+ # Factor 4: Domain type (TLD)
462
+ parsed = urlparse(url)
463
+ domain = parsed.netloc
464
+
465
+ if domain.endswith('.edu') or domain.endswith('.gov'):
466
+ tld_contribution = 0.2
467
+ tld_description = "Domaine .edu/.gov - haute autorité institutionnelle"
468
+ elif domain.endswith('.ac.uk') or domain.endswith('.gouv.fr'):
469
+ tld_contribution = 0.15
470
+ tld_description = "Domaine académique/gouvernemental - bonne autorité"
471
+ elif domain.endswith('.org'):
472
+ tld_contribution = 0.05
473
+ tld_description = "Domaine .org - légère autorité"
474
+ else:
475
+ tld_contribution = 0.0
476
+ tld_description = "Domaine commercial standard"
477
+
478
+ factors.append({
479
+ 'name': 'Domain Type (TLD)',
480
+ 'value': domain,
481
+ 'contribution': tld_contribution,
482
+ 'description': tld_description
483
+ })
484
+ pr_contributions.append(tld_contribution)
485
+
486
+ # Calculate final estimated PageRank
487
+ total_contribution = sum(pr_contributions)
488
+ estimated_pr = base_pr + d * total_contribution
489
+ estimated_pr = min(1.0, max(0.0, estimated_pr)) # Clamp to [0, 1]
490
+
491
+ # Generate explanation
492
+ explanation_parts = [
493
+ f"PageRank estimé: {estimated_pr:.3f}",
494
+ f"",
495
+ f"Formule: PR = (1-d) + d × Σ(contributions)",
496
+ f" PR = {base_pr:.2f} + {d:.2f} × {total_contribution:.2f}",
497
+ f"",
498
+ f"Facteurs contributifs:"
499
+ ]
500
+
501
+ for factor in factors:
502
+ explanation_parts.append(
503
+ f" • {factor['name']}: +{factor['contribution']:.2f} - {factor['description']}"
504
+ )
505
+
506
+ # Confidence based on how many factors we have data for
507
+ confidence = min(1.0, len([f for f in factors if f['contribution'] > 0]) / 4)
508
+
509
+ return PageRankExplanation(
510
+ url=url,
511
+ estimated_pr=estimated_pr,
512
+ factors=factors,
513
+ explanation_text="\n".join(explanation_parts),
514
+ confidence=confidence
515
+ )
516
+
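The damping-factor arithmetic in estimate_pagerank reduces to a one-liner once the per-factor contributions are collected. The sketch below reproduces it for a hypothetical page using the same factor weights as above (0.3 for an old domain, 0.3 for High reputation, 0.2 for a .gov TLD):

```python
def estimated_pagerank(contributions, d=0.85):
    # PR = (1 - d) + d * sum(contributions), clamped to [0, 1],
    # mirroring the simplified formula in estimate_pagerank
    pr = (1 - d) + d * sum(contributions)
    return min(1.0, max(0.0, pr))

pr = estimated_pagerank([0.3, 0.3, 0.2])  # old domain + High reputation + .gov
# 0.15 + 0.85 * 0.8 = 0.83
```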
517
+ def get_ir_metrics(self, text: str, query: str = None) -> IRMetrics:
518
+ """
519
+ Get comprehensive IR metrics for a document.
520
+
521
+ Args:
522
+ text: Document text
523
+ query: Optional query for BM25 calculation
524
+
525
+ Returns:
526
+ IRMetrics with TF-IDF, BM25, and other metrics
527
+ """
528
+ tokens = self.tokenize(text)
529
+ tf_idf = self.calculate_tf_idf(text)
530
+
531
+ # Top terms by TF-IDF
532
+ top_terms = sorted(tf_idf.items(), key=lambda x: x[1], reverse=True)[:10]
533
+
534
+ # BM25 score (if query provided)
535
+ bm25_score = 0.0
536
+ if query:
537
+ bm25_score = self.calculate_bm25(query, text)
538
+
539
+ # Average term frequency
540
+ tf = self.calculate_tf(tokens)
541
+ avg_tf = sum(tf.values()) / len(tf) if tf else 0
542
+
543
+ return IRMetrics(
544
+ tf_idf_scores=tf_idf,
545
+ bm25_score=bm25_score,
546
+ top_terms=top_terms,
547
+ document_length=len(tokens),
548
+ avg_term_frequency=avg_tf
549
+ )
550
+
551
+
552
+ # --- Testing ---
553
+ if __name__ == "__main__":
554
+ print("=" * 60)
555
+ print("SysCRED SEO Analyzer - Tests")
556
+ print("=" * 60 + "\n")
557
+
558
+ analyzer = SEOAnalyzer()
559
+
560
+ # Test 1: TF-IDF
561
+ print("1. Testing TF-IDF calculation...")
562
+ sample_text = """
563
+ The credibility of online information is crucial in today's digital age.
564
+ Fact-checking organizations help verify claims and identify misinformation.
565
+ Source reputation and domain age are important credibility factors.
566
+ """
567
+ tf_idf = analyzer.calculate_tf_idf(sample_text)
568
+ top_5 = sorted(tf_idf.items(), key=lambda x: x[1], reverse=True)[:5]
569
+ print(" Top 5 TF-IDF terms:")
570
+ for term, score in top_5:
571
+ print(f" {term}: {score:.4f}")
572
+ print()
573
+
574
+ # Test 2: BM25
575
+ print("2. Testing BM25 scoring...")
576
+ query = "credibility verification"
577
+ bm25_score = analyzer.calculate_bm25(query, sample_text)
578
+ print(f" Query: '{query}'")
579
+ print(f" BM25 Score: {bm25_score:.4f}")
580
+ print()
581
+
582
+ # Test 3: SEO Analysis
583
+ print("3. Testing SEO analysis...")
584
+ seo = analyzer.analyze_seo(
585
+ url="https://example.com/article",
586
+ title="Understanding Online Credibility - A Complete Guide",
587
+ meta_description="Learn about the key factors that determine the credibility of online information sources.",
588
+ text_content=sample_text
589
+ )
590
+ print(f" Title length: {seo.title_length} chars")
591
+ print(f" Meta description length: {seo.meta_description_length} chars")
592
+ print(f" Word count: {seo.word_count}")
593
+ print(f" SEO Score: {seo.seo_score:.2f}")
594
+ print()
595
+
596
+ # Test 4: PageRank Estimation
597
+ print("4. Testing PageRank estimation...")
598
+ pr = analyzer.estimate_pagerank(
599
+ url="https://www.lemonde.fr/article",
600
+ domain_age_days=9125, # ~25 years
601
+ source_reputation="High"
602
+ )
603
+ print(f" Estimated PageRank: {pr.estimated_pr:.3f}")
604
+ print(f" Confidence: {pr.confidence:.2f}")
605
+ print("\n Explanation:")
606
+ print(" " + pr.explanation_text.replace("\n", "\n "))
607
+
608
+ print("\n" + "=" * 60)
609
+ print("Tests complete!")
610
+ print("=" * 60)
syscred/setup.py ADDED
@@ -0,0 +1,65 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ SysCRED - Système de Vérification de Crédibilité
4
+ =================================================
5
+ PhD Thesis Prototype - Neuro-Symbolic Credibility Verification
6
+
7
+ (c) Dominique S. Loyer
8
+ Citation Key: loyerModelingHybridSystem2025
9
+ """
10
+
11
+ from setuptools import setup, find_packages
12
+
13
+ setup(
14
+ name="syscred",
15
+ version="2.0.0",
16
+ author="Dominique S. Loyer",
17
+ author_email="loyer.dominique_sebastien@courrier.uqam.ca",
18
+ description="Neuro-Symbolic Credibility Verification System",
19
+ long_description=open("README.md", encoding="utf-8").read() if __import__("os").path.exists("README.md") else "",
20
+ long_description_content_type="text/markdown",
21
+ url="https://github.com/DominiqueLoyer/syscred",
22
+ packages=find_packages(),
23
+ python_requires=">=3.9",
24
+ install_requires=[
25
+ "requests>=2.28.0",
26
+ "beautifulsoup4>=4.11.0",
27
+ "rdflib>=6.0.0",
28
+ "nltk>=3.7",
29
+ ],
30
+ extras_require={
31
+ "ml": [
32
+ "torch>=2.0.0",
33
+ "transformers>=4.30.0",
34
+ "numpy>=1.23.0,<2.0",
35
+ ],
36
+ "ir": [
37
+ "pyserini>=0.21.0",
38
+ "pytrec_eval>=0.5",
39
+ ],
40
+ "web": [
41
+ "flask>=2.0.0",
42
+ "flask-cors>=3.0.0",
43
+ ],
44
+ "full": [
45
+ "torch>=2.0.0",
46
+ "transformers>=4.30.0",
47
+ "numpy>=1.23.0,<2.0",
48
+ "pyserini>=0.21.0",
49
+ "pytrec_eval>=0.5",
50
+ "flask>=2.0.0",
51
+ "flask-cors>=3.0.0",
52
+ "lime>=0.2.0",
53
+ ],
54
+ },
55
+ classifiers=[
56
+ "Development Status :: 4 - Beta",
57
+ "Intended Audience :: Science/Research",
58
+ "Topic :: Scientific/Engineering :: Artificial Intelligence",
59
+ "License :: OSI Approved :: MIT License",
60
+ "Programming Language :: Python :: 3.9",
61
+ "Programming Language :: Python :: 3.10",
62
+ "Programming Language :: Python :: 3.11",
63
+ ],
64
+ keywords="credibility verification nlp ontology information-retrieval",
65
+ )
syscred/static/index.html ADDED
@@ -0,0 +1,850 @@
1
+ <!DOCTYPE html>
2
+ <html lang="fr">
3
+
4
+ <head>
5
+ <meta charset="UTF-8">
6
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
7
+ <title>SysCRED - Vérification de Crédibilité</title>
8
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
9
+ <script src="/static/js/d3.min.js"></script>
10
+ <style>
11
+ .graph-container {
12
+ width: 100%;
13
+ height: 500px;
14
+ min-height: 500px;
15
+ background: rgba(0, 0, 0, 0.2);
16
+ border-radius: 12px;
17
+ border: 1px solid rgba(255, 255, 255, 0.1);
18
+ position: relative;
19
+ display: block;
20
+ /* Force display */
21
+ }
22
+
23
+ * {
24
+ margin: 0;
25
+ padding: 0;
26
+ box-sizing: border-box;
27
+ }
28
+
29
+ body {
30
+ font-family: 'Inter', sans-serif;
31
+ background: linear-gradient(135deg, #0f0f23 0%, #1a1a3e 50%, #0d0d1f 100%);
32
+ min-height: 100vh;
33
+ color: #e0e0e0;
34
+ padding: 2rem;
35
+ }
36
+
37
+ .container {
38
+ max-width: 900px;
39
+ margin: 0 auto;
40
+ }
41
+
42
+ header {
43
+ text-align: center;
44
+ margin-bottom: 3rem;
45
+ }
46
+
47
+ h1 {
48
+ font-size: 2.5rem;
49
+ font-weight: 700;
50
+ background: linear-gradient(135deg, #00d4ff, #7c3aed, #f472b6);
51
+ -webkit-background-clip: text;
52
+ -webkit-text-fill-color: transparent;
53
+ background-clip: text;
54
+ margin-bottom: 0.5rem;
55
+ }
56
+
57
+ .subtitle {
58
+ color: #8b8ba7;
59
+ font-size: 1.1rem;
60
+ }
61
+
62
+ .search-box {
63
+ background: rgba(255, 255, 255, 0.03);
64
+ backdrop-filter: blur(20px);
65
+ border: 1px solid rgba(255, 255, 255, 0.08);
66
+ border-radius: 20px;
67
+ padding: 2rem;
68
+ margin-bottom: 2rem;
69
+ }
70
+
71
+ .input-group {
72
+ display: flex;
73
+ gap: 1rem;
74
+ align-items: center;
75
+ }
76
+
77
+ input[type="text"] {
78
+ flex: 1;
79
+ padding: 1rem 1.5rem;
80
+ font-size: 1rem;
81
+ border: 2px solid rgba(124, 58, 237, 0.3);
82
+ border-radius: 12px;
83
+ background: rgba(0, 0, 0, 0.3);
84
+ color: #fff;
85
+ transition: all 0.3s ease;
86
+ }
87
+
88
+ input[type="text"]:focus {
89
+ outline: none;
90
+ border-color: #7c3aed;
91
+ box-shadow: 0 0 20px rgba(124, 58, 237, 0.3);
92
+ }
93
+
94
+ input[type="text"]::placeholder {
95
+ color: #6b6b8a;
96
+ }
97
+
98
+ button {
99
+ padding: 1rem 2rem;
100
+ font-size: 1rem;
101
+ font-weight: 600;
102
+ border: none;
103
+ border-radius: 12px;
104
+ background: linear-gradient(135deg, #7c3aed, #a855f7);
105
+ color: white;
106
+ cursor: pointer;
107
+ transition: all 0.3s ease;
108
+ white-space: nowrap;
109
+ }
110
+
111
+ button:hover {
112
+ transform: translateY(-2px);
113
+ box-shadow: 0 10px 30px rgba(124, 58, 237, 0.4);
114
+ }
115
+
116
+ button:disabled {
117
+ opacity: 0.6;
118
+ cursor: not-allowed;
119
+ transform: none;
120
+ }
121
+
122
+ .results {
123
+ display: none;
124
+ }
125
+
126
+ .results.visible {
127
+ display: block;
128
+ animation: fadeIn 0.5s ease;
129
+ }
130
+
131
+ @keyframes fadeIn {
132
+ from {
133
+ opacity: 0;
134
+ transform: translateY(20px);
135
+ }
136
+
137
+ to {
138
+ opacity: 1;
139
+ transform: translateY(0);
140
+ }
141
+ }
142
+
143
+ .score-card {
144
+ background: rgba(255, 255, 255, 0.03);
145
+ backdrop-filter: blur(20px);
146
+ border: 1px solid rgba(255, 255, 255, 0.08);
147
+ border-radius: 20px;
148
+ padding: 2rem;
149
+ text-align: center;
150
+ margin-bottom: 2rem;
151
+ }
152
+
153
+ .score-value {
154
+ font-size: 4rem;
155
+ font-weight: 700;
156
+ margin: 1rem 0;
157
+ }
158
+
159
+ .score-high {
160
+ color: #22c55e;
161
+ }
162
+
163
+ .score-medium {
164
+ color: #eab308;
165
+ }
166
+
167
+ .score-low {
168
+ color: #ef4444;
169
+ }
170
+
171
+ .score-label {
172
+ font-size: 1.2rem;
173
+ color: #8b8ba7;
174
+ margin-bottom: 1rem;
175
+ }
176
+
177
+ .credibility-badge {
178
+ display: inline-block;
179
+ padding: 0.5rem 1.5rem;
180
+ border-radius: 50px;
181
+ font-weight: 600;
182
+ font-size: 0.9rem;
183
+ text-transform: uppercase;
184
+ letter-spacing: 1px;
185
+ }
186
+
187
+ .badge-high {
188
+ background: rgba(34, 197, 94, 0.2);
189
+ color: #22c55e;
190
+ border: 1px solid #22c55e;
191
+ }
192
+
193
+ .badge-medium {
194
+ background: rgba(234, 179, 8, 0.2);
195
+ color: #eab308;
196
+ border: 1px solid #eab308;
197
+ }
198
+
199
+ .badge-low {
200
+ background: rgba(239, 68, 68, 0.2);
201
+ color: #ef4444;
202
+ border: 1px solid #ef4444;
203
+ }
204
+
205
+ .details-grid {
206
+ display: grid;
207
+ grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
208
+ gap: 1rem;
209
+ margin-bottom: 2rem;
210
+ }
211
+
212
+ .detail-card {
213
+ background: rgba(255, 255, 255, 0.03);
214
+ border: 1px solid rgba(255, 255, 255, 0.08);
215
+ border-radius: 12px;
216
+ padding: 1.5rem;
217
+ }
218
+
219
+ .detail-label {
220
+ font-size: 0.85rem;
221
+ color: #8b8ba7;
222
+ margin-bottom: 0.5rem;
223
+ }
224
+
225
+ .detail-value {
226
+ font-size: 1.1rem;
227
+ font-weight: 600;
228
+ color: #fff;
229
+ }
230
+
231
+ .summary-box {
232
+ background: rgba(124, 58, 237, 0.1);
233
+ border: 1px solid rgba(124, 58, 237, 0.3);
234
+ border-radius: 12px;
235
+ padding: 1.5rem;
236
+ margin-bottom: 2rem;
237
+ }
238
+
239
+ .summary-title {
240
+ font-weight: 600;
241
+ margin-bottom: 0.5rem;
242
+ color: #a855f7;
243
+ }
244
+
245
+ .loading {
246
+ text-align: center;
247
+ padding: 3rem;
248
+ display: none;
249
+ }
250
+
251
+ .loading.visible {
252
+ display: block;
253
+ }
254
+
255
+ .spinner {
256
+ width: 50px;
257
+ height: 50px;
258
+ border: 3px solid rgba(124, 58, 237, 0.2);
259
+ border-top-color: #7c3aed;
260
+ border-radius: 50%;
261
+ animation: spin 1s linear infinite;
262
+ margin: 0 auto 1rem;
263
+ }
264
+
265
+ @keyframes spin {
266
+ to {
267
+ transform: rotate(360deg);
268
+ }
269
+ }
270
+
271
+ .error {
272
+ background: rgba(239, 68, 68, 0.1);
273
+ border: 1px solid rgba(239, 68, 68, 0.3);
274
+ border-radius: 12px;
275
+ padding: 1.5rem;
276
+ color: #ef4444;
277
+ display: none;
278
+ }
279
+
280
+ .error.visible {
281
+ display: block;
282
+ }
283
+
284
+ footer {
285
+ text-align: center;
286
+ margin-top: 3rem;
287
+ color: #6b6b8a;
288
+ font-size: 0.9rem;
289
+ }
290
+
291
+ footer a {
292
+ color: #7c3aed;
293
+ text-decoration: none;
294
+ }
295
+
296
+ /* Node Details Overlay */
297
+ .node-details-overlay {
298
+ position: absolute;
299
+ top: 20px;
300
+ right: 20px;
301
+ background: rgba(15, 15, 35, 0.95);
302
+ border: 1px solid rgba(124, 58, 237, 0.3);
303
+ border-radius: 12px;
304
+ padding: 1.5rem;
305
+ width: 300px;
306
+ display: none;
307
+ backdrop-filter: blur(10px);
308
+ z-index: 100;
309
+ box-shadow: 0 10px 30px rgba(0,0,0,0.5);
310
+ pointer-events: auto;
311
+ }
312
+ .node-details-overlay.visible {
313
+ display: block;
314
+ animation: fadeIn 0.3s ease;
315
+ }
316
+ .close-btn {
317
+ position: absolute;
318
+ top: 10px;
319
+ right: 15px;
320
+ background: none;
321
+ border: none;
322
+ color: #8b8ba7;
323
+ font-size: 1.5rem;
324
+ cursor: pointer;
325
+ padding: 0;
326
+ line-height: 1;
327
+ width: auto;
328
+ height: auto;
329
+ box-shadow: none;
330
+ }
331
+ .close-btn:hover {
332
+ color: #fff;
333
+ transform: none;
334
+ box-shadow: none;
335
+ }
336
+ </style>
337
+ </head>
338
+
339
+ <body>
340
+ <div class="container">
341
+ <header>
342
+ <h1>🔬 SysCRED</h1>
343
+ <p class="subtitle">Système Neuro-Symbolique de Vérification de Crédibilité</p>
344
+ </header>
345
+
346
+ <div class="search-box">
347
+ <div class="input-group">
348
+ <input type="text" id="urlInput" placeholder="Entrez une URL à analyser (ex: https://www.lemonde.fr)"
349
+ autofocus>
350
+ <button id="analyzeBtn" onclick="analyzeUrl()">
351
+ 🔍 Analyser
352
+ </button>
353
+ </div>
354
+ </div>
355
+
356
+ <div class="loading" id="loading">
357
+ <div class="spinner"></div>
358
+ <p>Analyse en cours...</p>
359
+ </div>
360
+
361
+ <div class="error" id="error"></div>
362
+
363
+ <div class="results" id="results">
364
+ <div class="score-card">
365
+ <div class="score-label">Score de Crédibilité</div>
366
+ <div class="score-value" id="scoreValue">0.00</div>
367
+ <div class="credibility-badge" id="credibilityBadge">-</div>
368
+ </div>
369
+
370
+ <div class="summary-box">
371
+ <div class="summary-title">📋 Résumé de l'analyse</div>
372
+ <p id="summary">-</p>
373
+ </div>
374
+
375
+ <div class="details-grid" id="detailsGrid"></div>
376
+
377
+ <div class="graph-section" style="margin-top: 3rem;">
378
+ <div class="summary-title" style="margin-bottom: 2rem; color: #60a5fa;">🕸️ Réseau Neuro-Symbolique
379
+ (Ontologie)</div>
380
+ <!-- Debug link -->
381
+ <small style="color: #666; cursor: pointer;"
382
+ onclick="alert('D3 Loaded: ' + (typeof d3 !== 'undefined'))">Debug: Vérifier D3</small>
383
+
384
+ <div id="cy" class="graph-container"></div>
385
+ </div>
386
+ </div>
387
+
388
+ <footer>
389
+ <p>SysCRED v2.0 - Prototype de recherche doctorale</p>
390
+ <p>© Dominique S. Loyer - UQAM | <a href="https://doi.org/10.5281/zenodo.17943226" target="_blank">DOI:
391
+ 10.5281/zenodo.17943226</a></p>
392
+ </footer>
393
+ </div>
394
+
395
+ <script>
396
+ // Same-origin: on HF Spaces the Flask app serves both this page and /api/verify on port 7860
+ const API_URL = '';
397
+
398
+ async function analyzeUrl() {
399
+ const urlInput = document.getElementById('urlInput');
400
+ const loading = document.getElementById('loading');
401
+ const results = document.getElementById('results');
402
+ const error = document.getElementById('error');
403
+ const btn = document.getElementById('analyzeBtn');
404
+
405
+ const inputData = urlInput.value.trim();
406
+
407
+ if (!inputData) {
408
+ alert('Veuillez entrer une URL');
409
+ return;
410
+ }
411
+
412
+ // Reset UI
413
+ results.classList.remove('visible');
414
+ error.classList.remove('visible');
415
+ loading.classList.add('visible');
416
+ btn.disabled = true;
417
+
418
+ try {
419
+ const response = await fetch(`${API_URL}/api/verify`, {
420
+ method: 'POST',
421
+ headers: {
422
+ 'Content-Type': 'application/json',
423
+ },
424
+ body: JSON.stringify({
425
+ input_data: inputData,
426
+ include_seo: true,
427
+ include_pagerank: true
428
+ })
429
+ });
430
+
431
+ const data = await response.json();
432
+
433
+ if (!response.ok) {
434
+ throw new Error(data.error || 'Erreur lors de l\'analyse');
435
+ }
436
+
437
+ displayResults(data);
438
+
439
+ } catch (err) {
440
+ error.textContent = `❌ Erreur: ${err.message}`;
441
+ error.classList.add('visible');
442
+ } finally {
443
+ loading.classList.remove('visible');
444
+ btn.disabled = false;
445
+ }
446
+ }
447
+
448
+ function displayResults(data) {
449
+ const results = document.getElementById('results');
450
+ const scoreValue = document.getElementById('scoreValue');
451
+ const credibilityBadge = document.getElementById('credibilityBadge');
452
+ const summary = document.getElementById('summary');
453
+ const detailsGrid = document.getElementById('detailsGrid');
454
+
455
+ // Score
456
+ const score = data.scoreCredibilite || 0;
457
+ scoreValue.textContent = score.toFixed(2);
458
+
459
+ // Conditional Display: Hide Score Card if TEXT input, show if URL
460
+ const isUrl = data.informationEntree && data.informationEntree.startsWith('http');
461
+ const scoreCard = document.querySelector('.score-card');
462
+
463
+ if (isUrl) {
464
+ scoreCard.style.display = 'block';
465
+ // Color based on score
466
+ scoreValue.className = 'score-value';
467
+ credibilityBadge.className = 'credibility-badge';
468
+
469
+ if (score >= 0.7) {
470
+ scoreValue.classList.add('score-high');
471
+ credibilityBadge.classList.add('badge-high');
472
+ credibilityBadge.textContent = '✓ Crédibilité Élevée';
473
+ } else if (score >= 0.4) {
474
+ scoreValue.classList.add('score-medium');
475
+ credibilityBadge.classList.add('badge-medium');
476
+ credibilityBadge.textContent = '⚠ Crédibilité Moyenne';
477
+ } else {
478
+ scoreValue.classList.add('score-low');
479
+ credibilityBadge.classList.add('badge-low');
480
+ credibilityBadge.textContent = '✗ Crédibilité Faible';
481
+ }
482
+ } else {
483
+ // Hide score card for text queries as requested
484
+ scoreCard.style.display = 'none';
485
+ }
486
+
487
+ // Summary
488
+ summary.textContent = data.resumeAnalyse || 'Aucun résumé disponible';
489
+
490
+ // Build details HTML
491
+ let detailsHTML = '';
492
+
493
+ // Source reputation from rule analysis
494
+ const ruleResults = data.reglesAppliquees || {};
495
+ const sourceAnalysis = ruleResults.source_analysis || {};
496
+
497
+ if (sourceAnalysis.reputation) {
498
+ const repColor = sourceAnalysis.reputation === 'High' ? '#22c55e' :
499
+ sourceAnalysis.reputation === 'Low' ? '#ef4444' : '#eab308';
500
+ detailsHTML += `
501
+ <div class="detail-card">
502
+ <div class="detail-label">🏛️ Réputation Source</div>
503
+ <div class="detail-value" style="color: ${repColor}">${sourceAnalysis.reputation}</div>
504
+ </div>
505
+ `;
506
+ }
507
+
508
+ if (sourceAnalysis.domain_age_days) {
509
+ const years = (sourceAnalysis.domain_age_days / 365).toFixed(1);
510
+ detailsHTML += `
511
+ <div class="detail-card">
512
+ <div class="detail-label">📅 Âge du Domaine</div>
513
+ <div class="detail-value">${years} ans</div>
514
+ </div>
515
+ `;
516
+ }
517
+
518
+ // NLP analysis
519
+ const nlpAnalysis = data.analyseNLP || {};
520
+
521
+ if (nlpAnalysis.sentiment) {
522
+ detailsHTML += `
523
+ <div class="detail-card">
524
+ <div class="detail-label">💭 Sentiment</div>
525
+ <div class="detail-value">${nlpAnalysis.sentiment.label} (${(nlpAnalysis.sentiment.score * 100).toFixed(0)}%)</div>
526
+ </div>
527
+ `;
528
+ }
529
+
530
+ if (nlpAnalysis.coherence_score !== null && nlpAnalysis.coherence_score !== undefined) {
531
+ detailsHTML += `
532
+ <div class="detail-card">
533
+ <div class="detail-label">📊 Cohérence</div>
534
+ <div class="detail-value">${(nlpAnalysis.coherence_score * 100).toFixed(0)}%</div>
535
+ </div>
536
+ `;
537
+ }
538
+
539
+ // Add PageRank if available
540
+ if (data.pageRankEstimation && data.pageRankEstimation.estimatedPR) {
541
+ detailsHTML += `
542
+ <div class="detail-card">
543
+ <div class="detail-label">📈 PageRank Estimé</div>
544
+ <div class="detail-value">${data.pageRankEstimation.estimatedPR.toFixed(3)}</div>
545
+ </div>
546
+ `;
547
+ }
548
+
549
+ // Add SEO score if available
550
+ if (data.seoAnalysis && data.seoAnalysis.seoScore) {
551
+ detailsHTML += `
552
+ <div class="detail-card">
553
+ <div class="detail-label">🔍 Score SEO</div>
554
+ <div class="detail-value">${data.seoAnalysis.seoScore}</div>
555
+ </div>
556
+ `;
557
+ }
558
+
559
+ // Fact checks
560
+ const factChecks = ruleResults.fact_checking || [];
561
+ if (factChecks.length > 0) {
562
+ // Add a header for fact checks
563
+ detailsHTML += `
564
+ <div style="grid-column: 1 / -1; margin-top: 1rem; margin-bottom: 0.5rem; font-weight: 600; color: #f472b6;">
565
+ 🕵️ Fact-Checks Trouvés (${factChecks.length})
566
+ </div>
567
+ `;
568
+
569
+ factChecks.forEach(fc => {
570
+ detailsHTML += `
571
+ <div class="detail-card" style="grid-column: 1 / -1; border-color: rgba(244, 114, 182, 0.3);">
572
+ <div class="detail-label">🔍 ${fc.publisher || 'Source inconnue'}</div>
573
+ <div class="detail-value" style="font-size: 1rem; margin-bottom: 0.5rem;">"${fc.claim}"</div>
574
+ <div style="display: flex; justify-content: space-between; align-items: center;">
575
+ <span style="color: #f472b6; font-weight: 700;">Verdict: ${fc.rating}</span>
576
+ <a href="${fc.url}" target="_blank" style="color: #a855f7; text-decoration: none; font-size: 0.9rem;">Lire le rapport →</a>
577
+ </div>
578
+ </div>
579
+ `;
580
+ });
581
+ }
582
+
583
+ detailsGrid.innerHTML = detailsHTML;
584
+
585
+ results.classList.add('visible');
586
+
587
+ // Fetch and render graph with slight delay to ensure DOM is ready
588
+ requestAnimationFrame(() => {
589
+ renderD3Graph();
590
+ });
591
+ }
592
+
593
+ async function renderD3Graph() {
594
+ logDebug("Starting renderD3Graph...");
595
+ const container = document.getElementById('cy');
596
+
597
+ // Check if D3 is loaded
598
+ if (typeof d3 === 'undefined') {
599
+ container.innerHTML = '<p class="error visible">Erreur: D3.js n\'a pas pu être chargé.</p>';
600
+ logDebug("ERROR: D3 undefined");
601
+ return;
602
+ }
603
+
604
+ try {
605
+ container.innerHTML = '<div class="spinner"></div>'; // Loading state
606
+ logDebug("Fetching graph data...");
607
+
608
+ const response = await fetch(`${API_URL}/api/ontology/graph`);
609
+ const data = await response.json();
610
+
611
+ container.innerHTML = ''; // Clear loading
612
+ logDebug(`Data received. Nodes: ${data.nodes ? data.nodes.length : 0}, Links: ${data.links ? data.links.length : 0}`);
613
+
614
+ if (!data.nodes || data.nodes.length === 0) {
615
+ container.innerHTML = '<p style="text-align:center; padding:2rem; color:#6b6b8a; width:100%; display:flex; justify-content:center; align-items:center; height:100%;">Aucune donnée ontologique disponible.</p>';
616
+ return;
617
+ }
618
+
619
+ // Get dimensions
620
+ const width = container.clientWidth || 800;
621
+ const height = container.clientHeight || 500;
622
+ logDebug(`Container size: ${width}x${height}`);
623
+
624
+ const svg = d3.select(container).append("svg")
625
+ .attr("width", "100%")
626
+ .attr("height", "100%")
627
+ .attr("viewBox", [-width / 2, -height / 2, width, height])
628
+ .style("background-color", "rgba(0,0,0,0.2)"); // Visible background
629
+
630
+ // ADDED: Overlay for details
631
+ const overlay = document.createElement('div');
632
+ overlay.id = 'nodeDetails';
633
+ overlay.className = 'node-details-overlay';
634
+ overlay.innerHTML = `
635
+ <button class="close-btn" onclick="document.getElementById('nodeDetails').classList.remove('visible')">×</button>
636
+ <h3 id="detailTitle" style="color:#fff; margin-bottom:0.5rem; font-size:1.1rem; border-bottom:1px solid rgba(255,255,255,0.1); padding-bottom:0.5rem;"></h3>
637
+ <div id="detailBody" style="font-size:0.9rem; color:#ccc; line-height:1.5;"></div>
638
+ `;
639
+ container.appendChild(overlay);
640
+
641
+ logDebug("SVG created. Starting simulation...");
642
+
643
+ // Colors: 1=Purple(Report), 2=Gray(Unknown), 3=Green(Good), 4=Red(Bad)
644
+ const color = d3.scaleOrdinal()
645
+ .domain([1, 2, 3, 4])
646
+ .range(["#8b5cf6", "#94a3b8", "#22c55e", "#ef4444"]);
647
+
648
+ const simulation = d3.forceSimulation(data.nodes)
649
+ .force("link", d3.forceLink(data.links).id(d => d.id).distance(120))
650
+ .force("charge", d3.forceManyBody().strength(-400))
651
+ .force("center", d3.forceCenter(0, 0));
652
+
653
+ // ADDED: Container click to close overlay
654
+ svg.on("click", () => {
655
+ document.getElementById('nodeDetails').classList.remove('visible');
656
+ node.attr("stroke", "#fff").attr("stroke-width", 1.5);
657
+ });
658
+
659
+ // Arrow marker
660
+ svg.append("defs").selectAll("marker")
661
+ .data(["end"])
662
+ .join("marker")
663
+ .attr("id", "arrow")
664
+ .attr("viewBox", "0 -5 10 10")
665
+ .attr("refX", 22)
666
+ .attr("refY", 0)
667
+ .attr("markerWidth", 6)
668
+ .attr("markerHeight", 6)
669
+ .attr("orient", "auto")
670
+ .append("path")
671
+ .attr("fill", "#64748b")
672
+ .attr("d", "M0,-5L10,0L0,5");
673
+
674
+ const link = svg.append("g")
675
+ .selectAll("line")
676
+ .data(data.links)
677
+ .join("line")
678
+ .attr("stroke", "#475569")
679
+ .attr("stroke-opacity", 0.6)
680
+ .attr("stroke-width", 2)
681
+ .attr("marker-end", "url(#arrow)");
682
+
683
+ const node = svg.append("g")
684
+ .selectAll("circle")
685
+ .data(data.nodes)
686
+ .join("circle")
687
+ .attr("r", d => d.group === 1 ? 18 : 8)
688
+ .attr("fill", d => color(d.group))
689
+ .attr("stroke", "#fff")
690
+ .attr("stroke-width", 1.5)
691
+ .style("cursor", "pointer")
692
+ .call(drag(simulation))
693
+ .on("click", (event, d) => {
694
+ event.stopPropagation(); // Stop background click
695
+ showNodeDetails(d);
696
+
697
+ // Highlight selected
698
+ node.attr("stroke", "#fff").attr("stroke-width", 1.5);
699
+ d3.select(event.currentTarget).attr("stroke", "#f43f5e").attr("stroke-width", 3);
700
+ });
701
+
702
+ // Labels
703
+ const text = svg.append("g")
704
+ .selectAll("text")
705
+ .data(data.nodes)
706
+ .join("text")
707
+ .text(d => d.name.length > 20 ? d.name.substring(0, 20) + "..." : d.name)
708
+ .attr("font-size", "11px")
709
+ .attr("fill", "#e0e0e0")
710
+ .attr("dx", 12)
711
+ .attr("dy", 4)
712
+ .style("pointer-events", "none")
713
+ .style("text-shadow", "0 1px 2px black");
714
+
715
+ // Tooltip
716
+ node.append("title").text(d => `${d.name}\n(${d.type})`);
717
+
718
+ simulation.on("tick", () => {
719
+ link
720
+ .attr("x1", d => d.source.x)
721
+ .attr("y1", d => d.source.y)
722
+ .attr("x2", d => d.target.x)
723
+ .attr("y2", d => d.target.y);
724
+
725
+ node
726
+ .attr("cx", d => d.x)
727
+ .attr("cy", d => d.y);
728
+
729
+ text
730
+ .attr("x", d => d.x)
731
+ .attr("y", d => d.y);
732
+ });
733
+
734
+ // Zoom
735
+ svg.call(d3.zoom().scaleExtent([0.1, 4]).on("zoom", (e) => {
736
+ svg.selectAll('g').attr('transform', e.transform);
737
+ }));
738
+
739
+ logDebug("Graph rendered successfully.");
740
+
741
+ } catch (err) {
742
+ console.error("D3 Graph error:", err);
743
+ const container = document.getElementById('cy');
744
+ if (container) container.innerHTML = `<p class="error visible">Erreur graphique: ${err.message}</p>`;
745
+ logDebug(`ERROR EXCEPTION: ${err.message}`);
746
+ }
747
+ }
748
+
749
+ function testD3() {
750
+ logDebug("Starting Static Test...");
751
+ const container = document.getElementById('cy');
752
+ container.innerHTML = '';
753
+
754
+ const width = container.clientWidth || 800;
755
+ const height = container.clientHeight || 500;
756
+
757
+ logDebug(`Container: ${width}x${height}`);
758
+
759
+ try {
760
+ const svg = d3.select(container).append("svg")
761
+ .attr("width", "100%")
762
+ .attr("height", "100%")
763
+ .attr("viewBox", [-width / 2, -height / 2, width, height])
764
+ .style("background-color", "#222");
765
+
766
+ svg.append("circle")
767
+ .attr("r", 50)
768
+ .attr("fill", "red")
769
+ .attr("cx", 0)
770
+ .attr("cy", 0);
771
+
772
+ svg.append("text")
773
+ .text("D3 WORKS")
774
+ .attr("fill", "white")
775
+ .attr("x", 0)
776
+ .attr("y", 5)
777
+ .attr("text-anchor", "middle");
778
+
779
+ logDebug("Static Test Complete. You should see a red circle.");
780
+ } catch (e) {
781
+ logDebug("Static Test ERROR: " + e.message);
782
+ alert("Static Test Failed: " + e.message);
783
+ }
784
+ }
785
+
786
+
787
+ // --- Helper Functions ---
788
+
789
+ function logDebug(msg) {
790
+ console.log(`[SysCRED Debug] ${msg}`);
791
+ }
792
+
793
+ function drag(simulation) {
794
+ function dragstarted(event) {
795
+ if (!event.active) simulation.alphaTarget(0.3).restart();
796
+ event.subject.fx = event.subject.x;
797
+ event.subject.fy = event.subject.y;
798
+ }
799
+
800
+ function dragged(event) {
801
+ event.subject.fx = event.x;
802
+ event.subject.fy = event.y;
803
+ }
804
+
805
+ function dragended(event) {
806
+ if (!event.active) simulation.alphaTarget(0);
807
+ event.subject.fx = null;
808
+ event.subject.fy = null;
809
+ }
810
+
811
+ return d3.drag()
812
+ .on("start", dragstarted)
813
+ .on("drag", dragged)
814
+ .on("end", dragended);
815
+ }
816
+
817
+ function showNodeDetails(d) {
818
+ const overlay = document.getElementById('nodeDetails');
819
+ const title = document.getElementById('detailTitle');
820
+ const body = document.getElementById('detailBody');
821
+
822
+ if(!overlay) return;
823
+
824
+ title.textContent = d.name;
825
+
826
+ let typeColor = "#94a3b8";
827
+ if(d.group === 1) typeColor = "#8b5cf6"; // Report
828
+ if(d.group === 3) typeColor = "#22c55e"; // Good
829
+ if(d.group === 4) typeColor = "#ef4444"; // Bad
830
+
831
+ body.innerHTML = `
832
+ <div style="margin-bottom:0.5rem">
833
+ <span style="background:${typeColor}; color:white; padding:2px 6px; border-radius:4px; font-size:0.75rem;">${d.type || 'Unknown Type'}</span>
834
+ </div>
835
+ <div><strong>URI:</strong> <br><span style="font-family:monospace; color:#a855f7; word-break:break-all;">${d.id}</span></div>
836
+ `;
837
+
838
+ overlay.classList.add('visible');
839
+ }
840
+
841
+ // Allow Enter key to trigger analysis
842
+ document.getElementById('urlInput').addEventListener('keypress', function (e) {
843
+ if (e.key === 'Enter') {
844
+ analyzeUrl();
845
+ }
846
+ });
847
+ </script>
848
+ </body>
849
+
850
+ </html>
syscred/static/js/d3.min.js ADDED
The diff for this file is too large to render. See raw diff
 
syscred/test_graphrag.py ADDED
@@ -0,0 +1,87 @@
+ # -*- coding: utf-8 -*-
+ """
+ Test Script for GraphRAG
+ ========================
+ Verifies that the GraphRAG module can correctly:
+ 1. Connect to an in-memory ontology.
+ 2. Retrieve context for a domain that has history.
+ """
+
+ import sys
+ import os
+
+ # Add parent directory to path to allow imports
+ sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+
+ from syscred.ontology_manager import OntologyManager
+ from syscred.graph_rag import GraphRAG
+
+ def test_graphrag_retrieval():
+ print("=== Testing GraphRAG Retrieval Logic ===\n")
+
+ # 1. Setup In-Memory Ontology
+ print("[1] Initializing in-memory Ontology...")
+ om = OntologyManager(base_ontology_path=None, data_path=None)
+
+ # 2. Add Fake History (Memory)
+ print("[2] Injecting test memory for 'lemonde.fr'...")
+ fake_report = {
+ 'scoreCredibilite': 0.95,
+ 'informationEntree': 'https://www.lemonde.fr/article/test',
+ 'resumeAnalyse': "Reliable source.",
+ 'reglesAppliquees': {
+ 'source_analysis': {'reputation': 'High', 'domain': 'lemonde.fr'}
+ }
+ }
+ # Add it 3 times to simulate history
+ om.add_evaluation_triplets(fake_report)
+ om.add_evaluation_triplets(fake_report)
+ om.add_evaluation_triplets(fake_report)
+ print(" -> Added 3 evaluation records.")
+
+ # 3. Initialize GraphRAG
+ rag = GraphRAG(om)
+
+ # 4. Query Context
+ domain = "lemonde.fr"
+ print(f"\n[3] Querying GraphRAG for domain: '{domain}'...")
+ context = rag.get_context(domain)
+
+ print("\n--- Result Context (Domain History) ---")
+ print(context['full_text'])
+ print("---------------------------------------\n")
+
+ # 5. Validation 1 (History)
+ if "Analyzed 3 times" in context['full_text']:
+ print("✅ SUCCESS: GraphRAG correctly remembered the history.")
+ else:
+ print("❌ FAILURE: GraphRAG did not return the expected history count.")
+
+ # 6. Test Similar Claims (New Feature)
+ print(f"\n[4] Testing 'Similar Claims' for keywords: ['flat', 'earth']...")
+ # The previous injection didn't use these keywords; check what it finds, or inject more if needed.
+ # Our fake_report had content: 'https://www.lemonde.fr/article/test'
+ # The new logic runs a regex search over 'informationContent'
+
+ # Add a specific, claim-like entry
+ fake_claim = {
+ 'scoreCredibilite': 0.1,
+ 'informationEntree': 'The earth is flat and fake',
+ 'resumeAnalyse': "False claim.",
+ 'reglesAppliquees': {'source_analysis': {'reputation': 'Low'}}
+ }
+ om.add_evaluation_triplets(fake_claim)
+
+ # Search for 'flat'
+ similar_context = rag.get_context("unknown.com", keywords=["flat", "earth"])
+ print("\n--- Result Context (Similar Claims) ---")
+ print(similar_context['full_text'])
+ print("---------------------------------------\n")
+
+ if "Found 1 similar claims" in similar_context['full_text'] or "The earth is flat" in similar_context['full_text']:
+ print("✅ SUCCESS: GraphRAG found similar claims by keywords.")
+ else:
+ print("❌ FAILURE: GraphRAG did not find the injected similar claim.")
+
+ if __name__ == "__main__":
+ test_graphrag_retrieval()
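The "Similar Claims" lookup exercised above reduces to a case-insensitive keyword scan over previously stored claim texts. Below is a minimal, self-contained sketch of that retrieval step; the `find_similar_claims` helper and the flat record store are illustrative assumptions, not the actual `GraphRAG` internals:

```python
import re

def find_similar_claims(records: list[dict], keywords: list[str]) -> list[dict]:
    """Return stored records whose claim text matches any keyword (hypothetical helper)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [r for r in records if pattern.search(r.get("informationEntree", ""))]

# Usage sketch mirroring the test data above
store = [
    {"informationEntree": "https://www.lemonde.fr/article/test", "scoreCredibilite": 0.95},
    {"informationEntree": "The earth is flat and fake", "scoreCredibilite": 0.1},
]
hits = find_similar_claims(store, ["flat", "earth"])  # matches only the injected claim
```

The real module queries the ontology's RDF triplets rather than a Python list; this only illustrates the keyword-containment condition the test relies on.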
syscred/test_phase1.py ADDED
@@ -0,0 +1,28 @@
+ import sys
+ import os
+
+ # Add project root to path
+ sys.path.insert(0, os.getcwd())
+
+ from syscred.api_clients import ExternalAPIClients
+
+ def test_backlinks():
+ client = ExternalAPIClients()
+
+ test_urls = [
+ "https://www.lemonde.fr", # High + Old
+ "https://www.infowars.com", # Low + Old
+ "https://example.com", # Unknown + Old
+ "https://new-suspicious-site.xyz" # Unknown + New (likely)
+ ]
+
+ print("=== Testing Backlink Estimation Heuristic ===")
+ for url in test_urls:
+ print(f"\nTesting: {url}")
+ res = client.estimate_backlinks(url)
+ print(f" Count: {res['estimated_count']}")
+ print(f" Method: {res['method']}")
+ print(f" Note: {res['note']}")
+
+ if __name__ == "__main__":
+ test_backlinks()
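The behaviour this script expects (known high-reputation domains yield larger backlink estimates than low-reputation ones, which in turn beat unknown domains) can be sketched as a simple reputation lookup. The table and base counts below are invented for illustration and are not the actual `estimate_backlinks` heuristic:

```python
from urllib.parse import urlparse

# Hypothetical reputation table; the real client maintains its own lists.
KNOWN_REPUTATION = {"lemonde.fr": "High", "infowars.com": "Low"}
BASE_COUNTS = {"High": 100_000, "Low": 1_000, "Unknown": 100}

def estimate_backlinks_sketch(url: str) -> dict:
    """Rough reputation-based backlink estimate (illustrative only)."""
    domain = urlparse(url).netloc.removeprefix("www.")
    reputation = KNOWN_REPUTATION.get(domain, "Unknown")
    return {
        "estimated_count": BASE_COUNTS[reputation],
        "method": "heuristic_sketch",
        "note": f"reputation={reputation}",
    }
```

A domain-age multiplier, as hinted by the "Old"/"New" comments in the test URLs, would layer naturally on top of this lookup.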
syscred/test_phase2.py ADDED
@@ -0,0 +1,55 @@
+ import sys
+ import os
+
+ # Add project root to path
+ sys.path.insert(0, os.path.dirname(os.getcwd())) # Assumes running from syscred/
+
+ try:
+ from syscred.verification_system import CredibilityVerificationSystem
+ except ImportError:
+ # Just in case of path issues
+ sys.path.append(os.getcwd())
+ from verification_system import CredibilityVerificationSystem
+
+ def test_nlp_fallbacks():
+ print("=== Testing NLP Hybrid Fallbacks ===")
+
+ # Initialize without loading standard ML (to test our new hybrid logic)
+ # Note: verification_system uses HAS_ML flag, but we want to test specific methods
+ syscred = CredibilityVerificationSystem(load_ml_models=False)
+
+ # Test 1: Coherence
+ print("\n[Test 1] Coherence")
+ coherent_text = "The quick brown fox jumps over the lazy dog. The dog was not amused. It barked loudly."
+ incoherent_text = "The quick brown fox. Banana republic creates clouds. Jump over the moon."
+
+ score1 = syscred._calculate_coherence(coherent_text)
+ score2 = syscred._calculate_coherence(incoherent_text)
+
+ print(f" Coherent Text Score: {score1}")
+ print(f" Incoherent Text Score: {score2}")
+
+ if score1 > score2:
+ print(" ✓ Coherence logic working (Metric discriminates)")
+ else:
+ print(" ! Coherence scores inconclusive (Might be heuristic limitations)")
+
+ # Test 2: Bias
+ print("\n[Test 2] Bias")
+ neutral_text = "The government announced a new policy today regarding taxation."
+ biased_text = "The corrupt regime stands accused of treason against the people by radical idiots."
+
+ res1 = syscred._analyze_bias(neutral_text)
+ res2 = syscred._analyze_bias(biased_text)
+
+ print(f" Neutral: {res1['label']} (Score: {res1['score']:.2f})")
+ print(f" Biased: {res2['label']} (Score: {res2['score']:.2f})")
+ print(f" Method Used: {res1.get('method', 'Unknown')}")
+
+ if res2['score'] > res1['score']:
+ print(" ✓ Bias detection working")
+ else:
+ print(" ! Bias detection inconclusive")
+
+ if __name__ == "__main__":
+ test_nlp_fallbacks()
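One way to realize the coherence fallback probed above is lexical overlap between consecutive sentences. The Jaccard-based version below is only a plausible stand-in for `_calculate_coherence`, whose real heuristic may also weigh sentence length:

```python
import re

def coherence_sketch(text: str) -> float:
    """Mean word overlap (Jaccard) between consecutive sentences, in [0, 1]."""
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    if len(sentences) < 2:
        return 1.0  # A single sentence has nothing to contradict
    scores = []
    for a, b in zip(sentences, sentences[1:]):
        wa, wb = set(a.split()), set(b.split())
        scores.append(len(wa & wb) / len(wa | wb) if wa | wb else 0.0)
    return sum(scores) / len(scores)
```

On the test's inputs, the coherent text repeats "dog" across sentences while the incoherent one shares no vocabulary between sentences, so this metric discriminates in the expected direction.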
syscred/test_suite.py ADDED
@@ -0,0 +1,64 @@
+ import unittest
+ import sys
+ import os
+
+ # Point to parent directory (MonCode) so we can import 'syscred' package
+ # Current file is in MonCode/syscred/test_suite.py
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+ from syscred.verification_system import CredibilityVerificationSystem
+ from syscred.api_clients import ExternalAPIClients
+
+ class TestSysCRED(unittest.TestCase):
+
+ @classmethod
+ def setUpClass(cls):
+ print("\n[TestSysCRED] Setting up system...")
+ cls.system = CredibilityVerificationSystem(load_ml_models=False)
+ cls.client = cls.system.api_clients
+
+ def test_backlink_estimation_heuristic(self):
+ """Test that backlink estimation respects reputation."""
+ lemonde = self.client.estimate_backlinks("https://www.lemonde.fr")
+ infowars = self.client.estimate_backlinks("https://infowars.com")
+
+ self.assertGreater(lemonde['estimated_count'], infowars['estimated_count'],
+ "High reputation should have more backlinks than Low")
+ self.assertEqual(lemonde['method'], 'heuristic_v2.1')
+
+ def test_coherence_heuristic(self):
+ """Test coherence scoring heuristic."""
+ good_text = "This is a coherent sentence. It follows logically."
+ bad_text = "This is. Random words. Banana. Cloud."
+
+ score_good = self.system._calculate_coherence(good_text)
+ score_bad = self.system._calculate_coherence(bad_text)
+
+ self.assertTrue(0 <= score_good <= 1)
+ # Note: Heuristic using sentence length variance might be sensitive
+ # bad_text has very short sentences, so average length is small -> penalty
+ # good_text has normal length
+ self.assertGreaterEqual(score_good, score_bad, "Coherent text should score >= incoherent")
+
+ def test_bias_heuristic(self):
+ """Test bias detection heuristic."""
+ neutral = "The economy grew by 2%."
+ biased = "The radical corrupt regime is destroying us!"
+
+ res_neutral = self.system._analyze_bias(neutral)
+ res_biased = self.system._analyze_bias(biased)
+
+ self.assertLess(res_neutral['score'], res_biased['score'])
+ self.assertIn("biased", res_biased['label'].lower())
+
+ def test_full_pipeline(self):
+ """Test the full verification pipeline (integration test)."""
+ input_data = "https://www.example.com"
+ result = self.system.verify_information(input_data)
+
+ self.assertIn('scoreCredibilite', result)
+ self.assertIn('resumeAnalyse', result)
+ self.assertIsNotNone(result['scoreCredibilite'])
+
+ if __name__ == '__main__':
+ unittest.main()
syscred/test_trec_integration.py ADDED
@@ -0,0 +1,271 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ Test TREC Integration - SysCRED
4
+ ================================
5
+ Integration tests for TREC AP88-90 evidence retrieval.
6
+
7
+ Tests:
8
+ 1. TRECRetriever initialization
9
+ 2. Evidence retrieval
10
+ 3. Integration with VerificationSystem
11
+ 4. Batch retrieval
12
+ 5. Metrics evaluation
13
+
14
+ (c) Dominique S. Loyer - PhD Thesis Prototype
15
+ Citation Key: loyerEvaluationModelesRecherche2025
16
+ """
17
+
18
+ import sys
19
+ import unittest
20
+ from pathlib import Path
21
+
22
+ # Add parent to path
23
+ sys.path.insert(0, str(Path(__file__).parent.parent))
24
+
25
+ from syscred.trec_retriever import TRECRetriever, Evidence, RetrievalResult
26
+ from syscred.trec_dataset import TRECDataset, TRECTopic, SAMPLE_TOPICS
27
+ from syscred.eval_metrics import EvaluationMetrics
28
+ from syscred.ir_engine import IREngine
29
+
30
+
31
+ class TestTRECRetriever(unittest.TestCase):
32
+ """Tests for TRECRetriever class."""
33
+
34
+ @classmethod
35
+ def setUpClass(cls):
36
+ """Set up retriever with sample corpus."""
37
+ cls.retriever = TRECRetriever(use_stemming=True, enable_prf=False)
38
+
39
+ # Add sample corpus for testing
40
+ cls.retriever.corpus = {
41
+ "AP880101-0001": {
42
+ "text": "Climate change is primarily caused by human activities, particularly the burning of fossil fuels.",
43
+ "title": "Climate Science Report"
44
+ },
45
+ "AP880101-0002": {
46
+ "text": "The Earth's temperature has risen significantly over the past century due to greenhouse gas emissions.",
47
+ "title": "Global Warming Study"
48
+ },
49
+ "AP880102-0001": {
50
+ "text": "Natural climate variations have occurred throughout Earth's history, including ice ages.",
51
+ "title": "Climate History"
52
+ },
53
+ "AP880102-0002": {
54
+ "text": "Renewable energy sources like solar and wind can help reduce carbon emissions significantly.",
55
+ "title": "Green Energy Solutions"
56
+ },
57
+ "AP880103-0001": {
58
+ "text": "Scientific consensus supports the theory that humans are the primary cause of recent climate change.",
59
+ "title": "IPCC Summary"
60
+ },
61
+ "AP890215-0001": {
62
+ "text": "The presidential election campaign focused on economic issues and foreign policy.",
63
+ "title": "Election Coverage"
64
+ },
65
+ "AP890216-0001": {
66
+ "text": "Stock markets rose sharply after positive economic indicators were released.",
67
+ "title": "Financial News"
68
+ },
69
+ }
70
+
71
+ def test_retriever_initialization(self):
72
+ """Test that retriever initializes correctly."""
73
+ self.assertIsNotNone(self.retriever)
74
+ self.assertIsNotNone(self.retriever.ir_engine)
75
+ self.assertEqual(len(self.retriever.corpus), 7)
76
+
77
+ def test_evidence_retrieval(self):
78
+ """Test evidence retrieval for a claim."""
79
+ result = self.retriever.retrieve_evidence(
80
+ claim="Climate change is caused by human activities",
81
+ k=3
82
+ )
83
+
84
+ self.assertIsInstance(result, RetrievalResult)
85
+ self.assertGreater(len(result.evidences), 0)
86
+ self.assertLessEqual(len(result.evidences), 3)
87
+
88
+ # Check first evidence
89
+ first = result.evidences[0]
90
+ self.assertIsInstance(first, Evidence)
91
+ self.assertTrue(first.doc_id.startswith("AP"))
92
+ self.assertGreater(first.score, 0)
93
+ self.assertEqual(first.rank, 1)
94
+
95
+ def test_batch_retrieval(self):
96
+ """Test batch evidence retrieval."""
97
+ claims = [
98
+ "Climate change is real",
99
+ "Stock markets and economy",
100
+ "Presidential election"
101
+ ]
102
+
103
+ results = self.retriever.batch_retrieve(claims, k=2)
104
+
105
+ self.assertEqual(len(results), 3)
106
+ for result in results:
107
+ self.assertIsInstance(result, RetrievalResult)
108
+
109
+ def test_statistics(self):
110
+ """Test statistics collection."""
111
+ # Run a query first
112
+ self.retriever.retrieve_evidence("test query", k=2)
113
+
114
+ stats = self.retriever.get_statistics()
115
+
116
+ self.assertIn("queries_processed", stats)
117
+ self.assertIn("corpus_size", stats)
118
+ self.assertGreater(stats["queries_processed"], 0)
119
+
120
+
121
+ class TestTRECDataset(unittest.TestCase):
122
+ """Tests for TRECDataset class."""
123
+
124
+ def test_sample_topics(self):
125
+ """Test sample topics availability."""
126
+ self.assertIsNotNone(SAMPLE_TOPICS)
127
+ self.assertGreater(len(SAMPLE_TOPICS), 0)
128
+
129
+ # Check structure
130
+ for topic_id, topic in SAMPLE_TOPICS.items():
131
+ self.assertIsInstance(topic, TRECTopic)
132
+ self.assertTrue(topic.title)
133
+
134
+ def test_dataset_initialization(self):
135
+ """Test dataset initialization."""
136
+ dataset = TRECDataset()
137
+ self.assertIsNotNone(dataset)
138
+ self.assertEqual(len(dataset.topics), 0)
139
+ self.assertEqual(len(dataset.qrels), 0)
140
+
141
+ def test_topic_query_generation(self):
142
+ """Test query generation from topics."""
143
+ dataset = TRECDataset()
144
+ dataset.topics = SAMPLE_TOPICS.copy()
145
+
146
+ short_queries = dataset.get_topic_queries(query_type="short")
147
+ long_queries = dataset.get_topic_queries(query_type="long")
148
+
149
+ self.assertEqual(len(short_queries), len(SAMPLE_TOPICS))
150
+ self.assertEqual(len(long_queries), len(SAMPLE_TOPICS))
151
+
152
+
153
+ class TestEvaluationMetrics(unittest.TestCase):
154
+ """Tests for EvaluationMetrics class."""
155
+
156
+ def setUp(self):
157
+ self.metrics = EvaluationMetrics()
158
+
159
+ def test_precision_at_k(self):
160
+ """Test P@K calculation."""
161
+ retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
162
+ relevant = {"doc1", "doc3", "doc5"}
163
+
164
+ p_at_3 = self.metrics.precision_at_k(retrieved, relevant, k=3)
165
+ self.assertAlmostEqual(p_at_3, 2/3) # doc1 and doc3 in top 3
166
+
167
+ p_at_5 = self.metrics.precision_at_k(retrieved, relevant, k=5)
168
+ self.assertAlmostEqual(p_at_5, 3/5)
169
+
170
+ def test_recall_at_k(self):
171
+ """Test R@K calculation."""
172
+ retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
173
+ relevant = {"doc1", "doc3", "doc5", "doc7"} # 4 relevant, doc7 not retrieved
174
+
175
+ r_at_5 = self.metrics.recall_at_k(retrieved, relevant, k=5)
176
+ self.assertAlmostEqual(r_at_5, 3/4) # 3 of 4 relevant docs retrieved
177
+
178
+ def test_average_precision(self):
179
+ """Test AP calculation."""
180
+ retrieved = ["doc1", "doc2", "doc3", "doc4"]
181
+ relevant = {"doc1", "doc3"}
182
+
183
+ ap = self.metrics.average_precision(retrieved, relevant)
184
+ # AP = (1/2) * (1/1 + 2/3) = 0.5 * 1.667 = 0.833
185
+ expected = (1.0 + 2/3) / 2
186
+ self.assertAlmostEqual(ap, expected, places=4)
187
+
188
+ def test_reciprocal_rank(self):
189
+ """Test MRR calculation."""
190
+ retrieved = ["doc2", "doc3", "doc1", "doc4"]
191
+ relevant = {"doc1"}
192
+
193
+ rr = self.metrics.reciprocal_rank(retrieved, relevant)
194
+ self.assertAlmostEqual(rr, 1/3) # doc1 is at rank 3
195
+
196
+
197
+ class TestIREngine(unittest.TestCase):
198
+ """Tests for IREngine class."""
199
+
200
+ def setUp(self):
201
+ self.engine = IREngine(use_stemming=True)
202
+
203
+ def test_preprocessing(self):
204
+ """Test text preprocessing."""
205
+ text = "The quick brown fox JUMPS over the lazy dog!"
206
+ processed = self.engine.preprocess(text)
207
+
208
+ # Should be lowercase, no common stopwords
209
+ self.assertNotIn("the", processed)
210
+ self.assertTrue(processed.islower())
211
+ # Should contain content words
212
+ self.assertIn("quick", processed)
213
+ self.assertIn("brown", processed)
214
+
215
+ def test_tfidf_calculation(self):
216
+ """Test TF-IDF scoring (basic)."""
217
+ # This tests the internal TF-IDF if pyserini not available
218
+ self.assertIsNotNone(self.engine)
219
+
220
+
221
+ class TestVerificationSystemIntegration(unittest.TestCase):
222
+ """Integration tests with VerificationSystem."""
223
+
224
+ @classmethod
225
+ def setUpClass(cls):
226
+ """Initialize system without ML models for speed."""
227
+ try:
228
+ from syscred.verification_system import CredibilityVerificationSystem
229
+ cls.system = CredibilityVerificationSystem(load_ml_models=False)
230
+ cls.skip = False
231
+ except Exception as e:
232
+ print(f"Skipping integration tests: {e}")
233
+ cls.skip = True
234
+
235
+ def test_system_has_retriever(self):
236
+ """Test that system has TREC retriever."""
237
+ if self.skip:
238
+ self.skipTest("VerificationSystem not available")
239
+
240
+ # Retriever might be None if no corpus configured
241
+ self.assertTrue(hasattr(self.system, 'trec_retriever'))
242
+
243
+ def test_retrieve_evidence_method(self):
244
+ """Test retrieve_evidence method."""
245
+ if self.skip:
246
+ self.skipTest("VerificationSystem not available")
247
+
248
+ # Should return empty list if no corpus
249
+ evidences = self.system.retrieve_evidence("test claim")
250
+ self.assertIsInstance(evidences, list)
251
+
252
+ def test_verify_with_evidence_method(self):
253
+ """Test verify_with_evidence method."""
254
+ if self.skip:
255
+ self.skipTest("VerificationSystem not available")
256
+
257
+ result = self.system.verify_with_evidence("Climate change is real")
258
+
259
+ self.assertIn('claim', result)
260
+ self.assertIn('evidences', result)
261
+ self.assertIn('verification_verdict', result)
262
+ self.assertIn('confidence', result)
263
+
264
+
265
+ if __name__ == "__main__":
266
+ print("=" * 60)
267
+ print("SysCRED TREC Integration Tests")
268
+ print("=" * 60)
269
+
270
+ # Run with verbosity
271
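The expected values asserted in `TestEvaluationMetrics` above follow the standard IR definitions of precision@k, recall@k, average precision, and reciprocal rank. For reference, minimal standalone implementations consistent with those assertions (they mirror the test expectations, not necessarily the actual `EvaluationMetrics` code):

```python
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def average_precision(retrieved: list, relevant: set) -> float:
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, d in enumerate(retrieved, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0
```

For the AP test case above: relevant docs appear at ranks 1 and 3, so AP = (1/1 + 2/3) / 2 ≈ 0.833, which is exactly the value the test checks with `assertAlmostEqual`.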
+ unittest.main(verbosity=2)
syscred/trec_dataset.py ADDED
@@ -0,0 +1,409 @@
+ # -*- coding: utf-8 -*-
+ """
+ TREC Dataset Module - SysCRED
+ ==============================
+ Loader and utilities for TREC AP88-90 dataset.
+
+ Handles:
+ - Topic/Query parsing
+ - Qrels (relevance judgments) loading
+ - Document corpus loading
+ - TREC run file generation
+
+ Based on: TREC_AP88-90_5juin2025.py
+ (c) Dominique S. Loyer - PhD Thesis Prototype
+ Citation Key: loyerEvaluationModelesRecherche2025
+ """
+
+ import os
+ import re
+ import json
+ import tarfile
+ from typing import Dict, List, Tuple, Optional, Set
+ from dataclasses import dataclass, field
+ from pathlib import Path
+
+
+ @dataclass
+ class TRECTopic:
+ """A TREC topic (query)."""
+ topic_id: str
+ title: str # Short query
+ description: str # Long description
+ narrative: str = "" # Full narrative (optional)
+
+ @property
+ def short_query(self) -> str:
+ return self.title
+
+ @property
+ def long_query(self) -> str:
+ return f"{self.title} {self.description}".strip()
+
+
+ @dataclass
+ class TRECQrel:
+ """A relevance judgment."""
+ topic_id: str
+ doc_id: str
+ relevance: int # 0=not relevant, 1=relevant, 2+=highly relevant
+
+
+ @dataclass
+ class TRECDocument:
+ """A document from the corpus."""
+ doc_id: str
+ text: str
+ title: str = ""
+ date: str = ""
+ source: str = ""
+
+
+ class TRECDataset:
+ """
+ TREC AP88-90 Dataset loader and manager.
+
+ Provides utilities for:
+ - Loading topics (queries)
+ - Loading qrels (relevance judgments)
+ - Loading document corpus
+ - Creating TREC-format run files
+
+ Usage:
+ dataset = TRECDataset(base_path="/path/to/trec")
+ topics = dataset.load_topics()
+ qrels = dataset.load_qrels()
+ """
+
+ # Standard TREC file patterns
+ TOPIC_PATTERN = r"topics\.\d+\.txt"
+ QREL_PATTERN = r"qrels\.\d+\.txt"
+
+ def __init__(
+ self,
+ base_path: Optional[str] = None,
+ topics_dir: Optional[str] = None,
+ qrels_dir: Optional[str] = None,
+ corpus_path: Optional[str] = None
+ ):
+ """
+ Initialize the dataset loader.
+
+ Args:
+ base_path: Base path containing TREC data
+ topics_dir: Path to topics directory (overrides base_path)
+ qrels_dir: Path to qrels directory (overrides base_path)
+ corpus_path: Path to corpus file (AP.tar or JSONL)
+ """
+ self.base_path = Path(base_path) if base_path else None
99
+ self.topics_dir = Path(topics_dir) if topics_dir else None
100
+ self.qrels_dir = Path(qrels_dir) if qrels_dir else None
101
+ self.corpus_path = Path(corpus_path) if corpus_path else None
102
+
103
+ # Loaded data
104
+ self.topics: Dict[str, TRECTopic] = {}
105
+ self.qrels: Dict[str, Dict[str, int]] = {} # topic_id -> {doc_id: relevance}
106
+ self.documents: Dict[str, TRECDocument] = {}
107
+
108
+ # Statistics
109
+ self.stats = {
110
+ "topics_loaded": 0,
111
+ "qrels_loaded": 0,
112
+ "docs_loaded": 0
113
+ }
114
+
115
+ def load_topics(self, topics_path: Optional[str] = None) -> Dict[str, TRECTopic]:
116
+ """
117
+ Load TREC topics from file(s).
118
+
119
+ Supports standard TREC topic format with <top>, <num>, <title>, <desc>, <narr> tags.
120
+ """
121
+ search_path = Path(topics_path) if topics_path else self.topics_dir or self.base_path
122
+
123
+ if not search_path or not search_path.exists():
124
+ print(f"[TRECDataset] Topics path not found: {search_path}")
125
+ return {}
126
+
127
+ topic_files = []
128
+ if search_path.is_file():
129
+ topic_files = [search_path]
130
+ else:
131
+ topic_files = list(search_path.glob("topics*.txt"))
132
+
133
+ for topic_file in topic_files:
134
+ self._parse_topic_file(topic_file)
135
+
136
+ self.stats["topics_loaded"] = len(self.topics)
137
+ print(f"[TRECDataset] Loaded {len(self.topics)} topics from {len(topic_files)} files")
138
+
139
+ return self.topics
140
+
141
+ def _parse_topic_file(self, file_path: Path):
142
+ """Parse a single TREC topic file."""
143
+ try:
144
+ with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
145
+ content = f.read()
146
+
147
+ # Find all <top>...</top> blocks
148
+ for top_match in re.finditer(r"<top>(.*?)</top>", content, re.DOTALL):
149
+ topic_content = top_match.group(1)
150
+
151
+ # Extract fields
152
+ num_match = re.search(r"<num>\s*(?:Number:)?\s*(\d+)", topic_content, re.IGNORECASE)
153
+ if not num_match:
154
+ continue
155
+
156
+ topic_id = num_match.group(1).strip()
157
+
158
+ title_match = re.search(r"<title>\s*(.*?)\s*(?=<|$)", topic_content, re.IGNORECASE | re.DOTALL)
159
+ title = title_match.group(1).strip() if title_match else ""
160
+
161
+ desc_match = re.search(r"<desc>\s*(?:Description:)?\s*(.*?)\s*(?=<narr>|<|$)", topic_content, re.IGNORECASE | re.DOTALL)
162
+ desc = desc_match.group(1).strip() if desc_match else ""
163
+
164
+ narr_match = re.search(r"<narr>\s*(?:Narrative:)?\s*(.*?)\s*(?=<|$)", topic_content, re.IGNORECASE | re.DOTALL)
165
+ narr = narr_match.group(1).strip() if narr_match else ""
166
+
167
+ if topic_id and title:
168
+ self.topics[topic_id] = TRECTopic(
169
+ topic_id=topic_id,
170
+ title=title,
171
+ description=desc,
172
+ narrative=narr
173
+ )
174
+ except Exception as e:
175
+ print(f"[TRECDataset] Error parsing {file_path}: {e}")
176
+
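The SGML-style topic format handled above can be sketched on a minimal hand-written topic, using the same regexes as `_parse_topic_file` (the topic text below is illustrative, not from the AP corpus):

```python
import re

# Minimal hand-written TREC topic (illustrative sample).
sample = """<top>
<num> Number: 051
<title> Airbus Subsidies
<desc> Description:
How much government money supports Airbus?
</top>"""

block = re.search(r"<top>(.*?)</top>", sample, re.DOTALL).group(1)
topic_id = re.search(r"<num>\s*(?:Number:)?\s*(\d+)", block, re.IGNORECASE).group(1)
title = re.search(r"<title>\s*(.*?)\s*(?=<|$)", block, re.IGNORECASE | re.DOTALL).group(1)
print(topic_id, "->", title)
```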
177
+ def load_qrels(self, qrels_path: Optional[str] = None) -> Dict[str, Dict[str, int]]:
178
+ """
179
+ Load TREC qrels (relevance judgments).
180
+
181
+ Format: topic_id 0 doc_id relevance
182
+ """
183
+ search_path = Path(qrels_path) if qrels_path else self.qrels_dir or self.base_path
184
+
185
+ if not search_path or not search_path.exists():
186
+ print(f"[TRECDataset] Qrels path not found: {search_path}")
187
+ return {}
188
+
189
+ qrel_files = []
190
+ if search_path.is_file():
191
+ qrel_files = [search_path]
192
+ else:
193
+ qrel_files = list(search_path.glob("qrels*.txt")) + list(search_path.glob("*.qrels"))
194
+
195
+ total_qrels = 0
196
+ for qrel_file in qrel_files:
197
+ count = self._parse_qrel_file(qrel_file)
198
+ total_qrels += count
199
+
200
+ self.stats["qrels_loaded"] = total_qrels
201
+ print(f"[TRECDataset] Loaded {total_qrels} qrels from {len(qrel_files)} files")
202
+
203
+ return self.qrels
204
+
205
+ def _parse_qrel_file(self, file_path: Path) -> int:
206
+ """Parse a single qrel file. Returns count of qrels loaded."""
207
+ count = 0
208
+ try:
209
+ with open(file_path, 'r', encoding='utf-8') as f:
210
+ for line in f:
211
+ parts = line.strip().split()
212
+ if len(parts) >= 4:
213
+ topic_id = parts[0]
214
+ doc_id = parts[2]
215
+ relevance = int(parts[3])
216
+
217
+ if topic_id not in self.qrels:
218
+ self.qrels[topic_id] = {}
219
+
220
+ self.qrels[topic_id][doc_id] = relevance
221
+ count += 1
222
+ except Exception as e:
223
+ print(f"[TRECDataset] Error parsing {file_path}: {e}")
224
+
225
+ return count
226
+
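Each qrels line carries the four whitespace-separated fields parsed above; a self-contained sketch of the same logic (the sample judgments are illustrative):

```python
# Parse qrels lines "topic_id 0 doc_id relevance", mirroring
# _parse_qrel_file above (sample lines are made up).
lines = [
    "51 0 AP880212-0001 1",
    "51 0 AP880301-0010 0",
]
qrels = {}
for line in lines:
    parts = line.strip().split()
    if len(parts) >= 4:
        topic_id, doc_id, relevance = parts[0], parts[2], int(parts[3])
        qrels.setdefault(topic_id, {})[doc_id] = relevance

print(qrels)
```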
227
+ def load_corpus_jsonl(self, jsonl_path: Optional[str] = None) -> Dict[str, TRECDocument]:
228
+ """
229
+ Load corpus from JSONL format.
230
+
231
+ Expected format: {"id": "...", "contents": "...", "title": "..."}
232
+ """
233
+ path = Path(jsonl_path) if jsonl_path else self.corpus_path
234
+
235
+ if not path or not path.exists():
236
+ print(f"[TRECDataset] Corpus path not found: {path}")
237
+ return {}
238
+
239
+ try:
240
+ with open(path, 'r', encoding='utf-8') as f:
241
+ for line in f:
242
+ doc = json.loads(line.strip())
243
+ doc_id = doc.get('id', doc.get('docid', ''))
244
+ text = doc.get('contents', doc.get('text', ''))
245
+ title = doc.get('title', '')
246
+
247
+ if doc_id:
248
+ self.documents[doc_id] = TRECDocument(
249
+ doc_id=doc_id,
250
+ text=text,
251
+ title=title
252
+ )
253
+
254
+ self.stats["docs_loaded"] = len(self.documents)
255
+ print(f"[TRECDataset] Loaded {len(self.documents)} documents")
256
+
257
+ except Exception as e:
258
+ print(f"[TRECDataset] Error loading corpus: {e}")
259
+
260
+ return self.documents
261
+
262
+ def get_relevant_docs(self, topic_id: str) -> Set[str]:
263
+ """Get set of relevant document IDs for a topic."""
264
+ if topic_id not in self.qrels:
265
+ return set()
266
+
267
+ return {
268
+ doc_id for doc_id, rel in self.qrels[topic_id].items()
269
+ if rel > 0
270
+ }
271
+
272
+ def get_topic_queries(self, query_type: str = "short") -> Dict[str, str]:
273
+ """
274
+ Get dictionary of topic_id -> query text.
275
+
276
+ Args:
277
+ query_type: "short" (title only) or "long" (title + description)
278
+ """
279
+ if query_type == "short":
280
+ return {tid: t.short_query for tid, t in self.topics.items()}
281
+ else:
282
+ return {tid: t.long_query for tid, t in self.topics.items()}
283
+
284
+ @staticmethod
285
+ def format_trec_run(
286
+ results: List[Tuple[str, str, float, int]], # (topic_id, doc_id, score, rank)
287
+ run_tag: str
288
+ ) -> str:
289
+ """
290
+ Format results as TREC run file.
291
+
292
+ Output format: topic_id Q0 doc_id rank score run_tag
293
+ """
294
+ lines = []
295
+ for topic_id, doc_id, score, rank in results:
296
+ lines.append(f"{topic_id} Q0 {doc_id} {rank} {score:.6f} {run_tag}")
297
+ return "\n".join(lines)
298
+
299
+ @staticmethod
300
+ def save_trec_run(
301
+ results: List[Tuple[str, str, float, int]],
302
+ run_tag: str,
303
+ output_path: str
304
+ ):
305
+ """Save results to TREC run file."""
306
+ run_content = TRECDataset.format_trec_run(results, run_tag)
307
+ with open(output_path, 'w', encoding='utf-8') as f:
308
+ f.write(run_content)
309
+ print(f"[TRECDataset] Saved run file: {output_path}")
310
+
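For reference, the six-column run format written by `save_trec_run` looks like this (the scores and run tag are made up):

```python
# Build TREC run lines "topic_id Q0 doc_id rank score run_tag"
# from (topic_id, doc_id, score, rank) tuples, as format_trec_run does.
results = [
    ("51", "AP880212-0001", 12.345678, 1),
    ("51", "AP880215-0003", 9.87, 2),
]
run = "\n".join(
    f"{topic_id} Q0 {doc_id} {rank} {score:.6f} syscred_bm25"
    for topic_id, doc_id, score, rank in results
)
print(run)
```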
311
+ def get_statistics(self) -> Dict[str, int]:
312
+ """Get dataset statistics."""
313
+ return {
314
+ "topics": len(self.topics),
315
+ "qrels_topics": len(self.qrels),
316
+ "total_qrels": sum(len(q) for q in self.qrels.values()),
317
+ "documents": len(self.documents)
318
+ }
319
+
320
+
321
+ # --- Sample Topics for Testing (AP88-90 subset) ---
322
+
323
+ SAMPLE_TOPICS = {
324
+ "51": TRECTopic(
325
+ topic_id="51",
326
+ title="Airbus Subsidies",
327
+ description="How much government money has been used to support Airbus aircraft manufacturing?",
328
+ narrative="A relevant document will contain information on subsidies or other financial support from government sources to Airbus."
329
+ ),
330
+ "52": TRECTopic(
331
+ topic_id="52",
332
+ title="Japanese Auto Sales",
333
+ description="How have Japanese automobile sales fared in the U.S.?",
334
+ narrative="A relevant document will report on sales figures, trends, or market share of Japanese automobile manufacturers in the United States."
335
+ ),
336
+ "53": TRECTopic(
337
+ topic_id="53",
338
+ title="Leveraged Buyouts",
339
+ description="What are the effects of leveraged buyouts on companies and industries?",
340
+ narrative="Relevant documents discuss the impact of LBOs on corporate structure, employment, or industry dynamics."
341
+ ),
342
+ "54": TRECTopic(
343
+ topic_id="54",
344
+ title="Satellite Launches",
345
+ description="What are the commercial applications of satellite launches?",
346
+ narrative="A relevant document will discuss commercial satellite launches and their business applications."
347
+ ),
348
+ "55": TRECTopic(
349
+ topic_id="55",
350
+ title="Insider Trading",
351
+ description="What individuals or companies have been accused or convicted of insider trading?",
352
+ narrative="A relevant document will identify specific cases of insider trading allegations or convictions."
353
+ ),
354
+ }
355
+
356
+
357
+ def create_sample_dataset() -> TRECDataset:
358
+ """Create a sample dataset for testing."""
359
+ dataset = TRECDataset()
360
+ dataset.topics = SAMPLE_TOPICS.copy()
361
+
362
+ # Add sample qrels
363
+ dataset.qrels = {
364
+ "51": {"AP880212-0001": 1, "AP880215-0003": 1, "AP880301-0010": 0},
365
+ "52": {"AP890102-0020": 1, "AP890115-0045": 1},
366
+ "53": {"AP880325-0100": 1},
367
+ }
368
+
369
+ return dataset
370
+
371
+
372
+ # --- Testing ---
373
+
374
+ if __name__ == "__main__":
375
+ print("=" * 60)
376
+ print("SysCRED TREC Dataset - Test Suite")
377
+ print("=" * 60)
378
+
379
+ # Create sample dataset
380
+ dataset = create_sample_dataset()
381
+
382
+ print(f"\n1. Sample Topics: {len(dataset.topics)}")
383
+ for tid, topic in list(dataset.topics.items())[:3]:
384
+ print(f" {tid}: {topic.title}")
385
+ print(f" Short: {topic.short_query}")
386
+ print(f" Long: {topic.long_query[:80]}...")
387
+
388
+ print(f"\n2. Sample Qrels:")
389
+ for tid, docs in dataset.qrels.items():
390
+ print(f" Topic {tid}: {len(docs)} judgments")
391
+
392
+ print(f"\n3. Query dictionaries:")
393
+ short_queries = dataset.get_topic_queries("short")
394
+ long_queries = dataset.get_topic_queries("long")
395
+ print(f" Short queries: {len(short_queries)}")
396
+ print(f" Long queries: {len(long_queries)}")
397
+
398
+ print(f"\n4. Relevant docs for topic 51:")
399
+ relevant = dataset.get_relevant_docs("51")
400
+ print(f" {relevant}")
401
+
402
+ print(f"\n5. Statistics:")
403
+ stats = dataset.get_statistics()
404
+ for key, value in stats.items():
405
+ print(f" {key}: {value}")
406
+
407
+ print("\n" + "=" * 60)
408
+ print("Tests complete!")
409
+ print("=" * 60)
syscred/trec_retriever.py ADDED
@@ -0,0 +1,446 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ TREC Retriever Module - SysCRED
4
+ ================================
5
+ Information Retrieval component based on TREC AP88-90 methodology.
6
+
7
+ This module bridges the classic IR evaluation framework (TREC)
8
+ with the neuro-symbolic credibility verification system.
9
+
10
+ Features:
11
+ - BM25, TF-IDF, QLD scoring
12
+ - Pyserini/Lucene integration (optional)
13
+ - Evidence retrieval for fact-checking
14
+ - PRF (Pseudo-Relevance Feedback) query expansion
15
+
16
+ Based on: TREC_AP88-90_5juin2025.py
17
+ (c) Dominique S. Loyer - PhD Thesis Prototype
18
+ Citation Key: loyerEvaluationModelesRecherche2025
19
+ """
20
+
21
+ import os
22
+ import json
23
+ import time
24
+ from typing import Dict, List, Tuple, Optional, Any
25
+ from dataclasses import dataclass, field
26
+ from pathlib import Path
27
+
28
+ from syscred.ir_engine import IREngine, SearchResult, SearchResponse
29
+
30
+
31
+ @dataclass
32
+ class Evidence:
33
+ """
34
+ A piece of evidence retrieved for fact-checking.
35
+
36
+ Represents a document or passage that can support or refute a claim.
37
+ """
38
+ doc_id: str
39
+ text: str
40
+ score: float
41
+ rank: int
42
+ source: str = ""
43
+ retrieval_model: str = "bm25"
44
+ metadata: Dict[str, Any] = field(default_factory=dict)
45
+
46
+ def to_dict(self) -> Dict[str, Any]:
47
+ return {
48
+ "doc_id": self.doc_id,
49
+ "text": self.text[:500] + "..." if len(self.text) > 500 else self.text,
50
+ "score": round(self.score, 4),
51
+ "rank": self.rank,
52
+ "source": self.source,
53
+ "model": self.retrieval_model
54
+ }
55
+
56
+
57
+ @dataclass
58
+ class RetrievalResult:
59
+ """Complete result from evidence retrieval."""
60
+ query: str
61
+ evidences: List[Evidence]
62
+ total_retrieved: int
63
+ search_time_ms: float
64
+ model_used: str
65
+ expanded_query: Optional[str] = None
66
+
67
+
68
+ class TRECRetriever:
69
+ """
70
+ TREC-style retriever for evidence gathering in fact-checking.
71
+
72
+ This class wraps the IREngine to provide a fact-checking oriented
73
+ interface for retrieving evidence documents.
74
+
75
+ Usage:
76
+ retriever = TRECRetriever(index_path="/path/to/lucene/index")
77
+ result = retriever.retrieve_evidence("Climate change is caused by humans", k=10)
78
+ for evidence in result.evidences:
79
+ print(f"{evidence.rank}. [{evidence.score:.4f}] {evidence.text[:100]}...")
80
+ """
81
+
82
+ # Retrieval configuration
83
+ DEFAULT_K = 10
84
+ DEFAULT_MODEL = "bm25"
85
+
86
+ # BM25 parameters (optimized on AP88-90)
87
+ BM25_K1 = 0.9
88
+ BM25_B = 0.4
89
+
90
+ def __init__(
91
+ self,
92
+ index_path: Optional[str] = None,
93
+ corpus_path: Optional[str] = None,
94
+ use_stemming: bool = True,
95
+ enable_prf: bool = True,
96
+ prf_top_docs: int = 3,
97
+ prf_expansion_terms: int = 10
98
+ ):
99
+ """
100
+ Initialize the TREC retriever.
101
+
102
+ Args:
103
+ index_path: Path to Lucene/Pyserini index (optional)
104
+ corpus_path: Path to JSONL corpus for in-memory search
105
+ use_stemming: Whether to apply Porter stemming
106
+ enable_prf: Enable Pseudo-Relevance Feedback
107
+ prf_top_docs: Number of top docs for PRF
108
+ prf_expansion_terms: Number of expansion terms from PRF
109
+ """
110
+ self.index_path = index_path
111
+ self.corpus_path = corpus_path
112
+ self.enable_prf = enable_prf
113
+ self.prf_top_docs = prf_top_docs
114
+ self.prf_expansion_terms = prf_expansion_terms
115
+
116
+ # Initialize IR engine
117
+ self.ir_engine = IREngine(
118
+ index_path=index_path,
119
+ use_stemming=use_stemming
120
+ )
121
+
122
+ # In-memory corpus (for lightweight mode)
123
+ self.corpus: Dict[str, Dict[str, str]] = {}
124
+ if corpus_path and os.path.exists(corpus_path):
125
+ self._load_corpus(corpus_path)
126
+
127
+ # Statistics
128
+ self.stats = {
129
+ "queries_processed": 0,
130
+ "total_search_time_ms": 0,
131
+ "avg_results_per_query": 0
132
+ }
133
+
134
+ print(f"[TRECRetriever] Initialized with index={index_path}, stemming={use_stemming}")
135
+
136
+ def _load_corpus(self, corpus_path: str):
137
+ """Load JSONL corpus into memory for lightweight search."""
138
+ print(f"[TRECRetriever] Loading corpus from {corpus_path}...")
139
+ try:
140
+ with open(corpus_path, 'r', encoding='utf-8') as f:
141
+ for line in f:
142
+ doc = json.loads(line.strip())
143
+ self.corpus[doc['id']] = {
144
+ 'text': doc.get('contents', doc.get('text', '')),
145
+ 'title': doc.get('title', '')
146
+ }
147
+ print(f"[TRECRetriever] Loaded {len(self.corpus)} documents")
148
+ except Exception as e:
149
+ print(f"[TRECRetriever] Failed to load corpus: {e}")
150
+
151
+ def retrieve_evidence(
152
+ self,
153
+ claim: str,
154
+ k: Optional[int] = None,
155
+ model: Optional[str] = None,
156
+ use_prf: Optional[bool] = None
157
+ ) -> RetrievalResult:
158
+ """
159
+ Retrieve evidence documents for a given claim.
160
+
161
+ This is the main method for fact-checking integration.
162
+
163
+ Args:
164
+ claim: The claim or statement to verify
165
+ k: Number of evidence documents to retrieve
166
+ model: Retrieval model ('bm25', 'qld', 'tfidf')
167
+ use_prf: Override PRF setting for this query
168
+
169
+ Returns:
170
+ RetrievalResult with list of Evidence objects
171
+ """
172
+ start_time = time.time()
173
+
174
+ k = k or self.DEFAULT_K
175
+ model = model or self.DEFAULT_MODEL
176
+ use_prf = use_prf if use_prf is not None else self.enable_prf
177
+
178
+ # Preprocess the claim
179
+ processed_claim = self.ir_engine.preprocess(claim)
180
+
181
+ # Try Pyserini first, fall back to in-memory
182
+ if self.ir_engine.searcher:
183
+ response = self._search_pyserini(processed_claim, model, k)
184
+ else:
185
+ response = self._search_in_memory(processed_claim, k)
186
+
187
+ # Apply PRF if enabled
188
+ expanded_query = None
189
+ if use_prf and len(response.results) >= self.prf_top_docs:
190
+ expanded_query = self._apply_prf(claim, response.results[:self.prf_top_docs])
191
+ if expanded_query != claim:
192
+ # Re-search with expanded query
193
+ processed_expanded = self.ir_engine.preprocess(expanded_query)
194
+ if self.ir_engine.searcher:
195
+ response = self._search_pyserini(processed_expanded, model, k)
196
+ else:
197
+ response = self._search_in_memory(processed_expanded, k)
198
+
199
+ # Convert to Evidence objects
200
+ evidences = []
201
+ for result in response.results:
202
+ doc_text = self._get_document_text(result.doc_id)
203
+ evidences.append(Evidence(
204
+ doc_id=result.doc_id,
205
+ text=doc_text,
206
+ score=result.score,
207
+ rank=result.rank,
208
+ source="TREC-AP88-90" if "AP" in result.doc_id else "Unknown",
209
+ retrieval_model=model
210
+ ))
211
+
212
+ search_time = (time.time() - start_time) * 1000
213
+
214
+ # Update statistics
215
+ self.stats["queries_processed"] += 1
216
+ self.stats["total_search_time_ms"] += search_time
217
+
218
+ return RetrievalResult(
219
+ query=claim,
220
+ evidences=evidences,
221
+ total_retrieved=len(evidences),
222
+ search_time_ms=search_time,
223
+ model_used=model,
224
+ expanded_query=expanded_query
225
+ )
226
+
227
+ def _search_pyserini(self, query: str, model: str, k: int) -> SearchResponse:
228
+ """Search using Pyserini/Lucene."""
229
+ return self.ir_engine.search_pyserini(
230
+ query=query,
231
+ model=model,
232
+ k=k
233
+ )
234
+
235
+ def _search_in_memory(self, query: str, k: int) -> SearchResponse:
236
+ """
237
+ Lightweight in-memory BM25 search.
238
+
239
+ Used when Pyserini is not available.
240
+ """
241
+ start_time = time.time()
242
+
243
+ if not self.corpus:
244
+ return SearchResponse(
245
+ query_id="Q1",
246
+ query_text=query,
247
+ results=[],
248
+ model="bm25_memory",
249
+ total_hits=0,
250
+ search_time_ms=0
251
+ )
252
+
253
+ query_terms = query.split()
254
+
255
+ # Calculate document frequencies
256
+ doc_freq = {}
257
+ for term in query_terms:
258
+ doc_freq[term] = sum(
259
+ 1 for doc in self.corpus.values()
260
+ if term in self.ir_engine.preprocess(doc['text']).split()
261
+ )
262
+
263
+ # Calculate average document length
264
+ total_length = sum(
265
+ len(self.ir_engine.preprocess(doc['text']).split())
266
+ for doc in self.corpus.values()
267
+ )
268
+ avg_doc_length = total_length / len(self.corpus) if self.corpus else 1
269
+
270
+ # Score all documents
271
+ scores = []
272
+ for doc_id, doc in self.corpus.items():
273
+ doc_text = self.ir_engine.preprocess(doc['text'])
274
+ doc_terms = doc_text.split()
275
+
276
+ score = self.ir_engine.calculate_bm25_score(
277
+ query_terms=query_terms,
278
+ doc_terms=doc_terms,
279
+ doc_length=len(doc_terms),
280
+ avg_doc_length=avg_doc_length,
281
+ doc_freq=doc_freq,
282
+ corpus_size=len(self.corpus)
283
+ )
284
+
285
+ if score > 0:
286
+ scores.append((doc_id, score))
287
+
288
+ # Sort and take top k
289
+ scores.sort(key=lambda x: x[1], reverse=True)
290
+ top_k = scores[:k]
291
+
292
+ results = [
293
+ SearchResult(doc_id=doc_id, score=score, rank=i+1)
294
+ for i, (doc_id, score) in enumerate(top_k)
295
+ ]
296
+
297
+ return SearchResponse(
298
+ query_id="Q1",
299
+ query_text=query,
300
+ results=results,
301
+ model="bm25_memory",
302
+ total_hits=len(results),
303
+ search_time_ms=(time.time() - start_time) * 1000
304
+ )
305
+
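Scoring above delegates to `IREngine.calculate_bm25_score`; as a reference point, here is a self-contained sketch of Okapi BM25 with the k1=0.9, b=0.4 defaults used in this class (standard smoothed idf is an assumption; the engine's exact formula may differ):

```python
import math

def bm25_score(query_terms, doc_terms, avg_doc_length, doc_freq, corpus_size,
               k1=0.9, b=0.4):
    """Okapi BM25 with the usual smoothed idf (a sketch, not the exact
    IREngine.calculate_bm25_score implementation)."""
    dl = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue
        idf = math.log(1 + (corpus_size - df + 0.5) / (df + 0.5))
        norm = 1 - b + b * dl / avg_doc_length
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

s = bm25_score(["climate", "change"],
               ["climate", "change", "is", "real"],
               avg_doc_length=4,
               doc_freq={"climate": 1, "change": 1},
               corpus_size=5)
print(round(s, 4))  # each matching term contributes log(4) here
```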
306
+ def _apply_prf(self, original_query: str, top_results: List[SearchResult]) -> str:
307
+ """Apply Pseudo-Relevance Feedback."""
308
+ top_docs_texts = [
309
+ self._get_document_text(r.doc_id)
310
+ for r in top_results
311
+ ]
312
+
313
+ return self.ir_engine.pseudo_relevance_feedback(
314
+ query=original_query,
315
+ top_docs_texts=top_docs_texts,
316
+ num_expansion_terms=self.prf_expansion_terms
317
+ )
318
+
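The expansion itself lives in `IREngine.pseudo_relevance_feedback`; the classic idea it implements can be sketched independently (naive term-frequency expansion over the top-ranked documents; the engine's term weighting may differ):

```python
from collections import Counter

def expand_query(query, top_docs_texts, num_expansion_terms=2):
    """Naive PRF: append the most frequent terms from the top-ranked
    documents that are not already in the query (a sketch of the idea,
    not IREngine.pseudo_relevance_feedback itself)."""
    query_terms = set(query.lower().split())
    counts = Counter(
        term
        for text in top_docs_texts
        for term in text.lower().split()
        if term not in query_terms
    )
    expansion = [term for term, _ in counts.most_common(num_expansion_terms)]
    return query + " " + " ".join(expansion)

expanded = expand_query(
    "airbus subsidies",
    ["airbus received government subsidies",
     "government aid helps airbus"],
)
print(expanded)
```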
319
+ def _get_document_text(self, doc_id: str) -> str:
320
+ """Get document text from corpus or index."""
321
+ if doc_id in self.corpus:
322
+ return self.corpus[doc_id]['text']
323
+
324
+ # Try Pyserini doc lookup
325
+ if self.ir_engine.searcher:
326
+ try:
327
+ doc = self.ir_engine.searcher.doc(doc_id)
328
+ if doc:
329
+ return doc.raw()
330
+ except Exception:
331
+ pass
332
+
333
+ return f"[Document {doc_id} text not available]"
334
+
335
+ def batch_retrieve(
336
+ self,
337
+ claims: List[str],
338
+ k: Optional[int] = None,
339
+ model: Optional[str] = None
340
+ ) -> List[RetrievalResult]:
341
+ """
342
+ Retrieve evidence for multiple claims.
343
+
344
+ Useful for benchmark evaluation.
345
+ """
346
+ results = []
347
+ for claim in claims:
348
+ result = self.retrieve_evidence(claim, k=k, model=model)
349
+ results.append(result)
350
+ return results
351
+
352
+ def get_statistics(self) -> Dict[str, Any]:
353
+ """Get retrieval statistics."""
354
+ avg_time = 0
355
+ if self.stats["queries_processed"] > 0:
356
+ avg_time = self.stats["total_search_time_ms"] / self.stats["queries_processed"]
357
+
358
+ return {
359
+ "queries_processed": self.stats["queries_processed"],
360
+ "total_search_time_ms": round(self.stats["total_search_time_ms"], 2),
361
+ "avg_search_time_ms": round(avg_time, 2),
362
+ "corpus_size": len(self.corpus),
363
+ "has_pyserini_index": self.ir_engine.searcher is not None
364
+ }
365
+
366
+
367
+ # --- Integration with VerificationSystem ---
368
+
369
+ def create_retriever_for_syscred(
370
+ config: Optional[Any] = None
371
+ ) -> TRECRetriever:
372
+ """
373
+ Factory function to create a TRECRetriever for SysCRED integration.
374
+
375
+ Uses configuration from syscred.config if available.
376
+ """
377
+ index_path = None
378
+ corpus_path = None
379
+
380
+ if config:
381
+ index_path = getattr(config, 'TREC_INDEX_PATH', None)
382
+ corpus_path = getattr(config, 'TREC_CORPUS_PATH', None)
383
+
384
+ # Try default paths
385
+ default_corpus = Path(__file__).parent.parent / "benchmarks" / "ap_corpus.jsonl"
386
+ if corpus_path is None and default_corpus.exists():
387
+ corpus_path = str(default_corpus)
388
+
389
+ return TRECRetriever(
390
+ index_path=index_path,
391
+ corpus_path=corpus_path,
392
+ use_stemming=True,
393
+ enable_prf=True
394
+ )
395
+
396
+
397
+ # --- Testing ---
398
+
399
+ if __name__ == "__main__":
400
+ print("=" * 60)
401
+ print("SysCRED TREC Retriever - Test Suite")
402
+ print("=" * 60)
403
+
404
+ # Initialize without index (in-memory mode)
405
+ retriever = TRECRetriever(use_stemming=True, enable_prf=False)
406
+
407
+ # Add some test documents to corpus
408
+ retriever.corpus = {
409
+ "DOC001": {"text": "Climate change is primarily caused by human activities, particularly the burning of fossil fuels.", "title": "Climate Science"},
410
+ "DOC002": {"text": "The Earth's temperature has risen significantly over the past century due to greenhouse gas emissions.", "title": "Global Warming"},
411
+ "DOC003": {"text": "Natural climate variations have occurred throughout Earth's history.", "title": "Climate History"},
412
+ "DOC004": {"text": "Renewable energy sources like solar and wind can help reduce carbon emissions.", "title": "Green Energy"},
413
+ "DOC005": {"text": "Scientific consensus supports anthropogenic climate change theory.", "title": "IPCC Report"},
414
+ }
415
+
416
+ print("\n1. Testing evidence retrieval...")
417
+ result = retriever.retrieve_evidence(
418
+ claim="Climate change is caused by human activities",
419
+ k=3
420
+ )
421
+
422
+ print(f"\n Query: {result.query}")
423
+ print(f" Model: {result.model_used}")
424
+ print(f" Search time: {result.search_time_ms:.2f} ms")
425
+ print(f" Results found: {result.total_retrieved}")
426
+
427
+ for evidence in result.evidences:
428
+ print(f"\n Rank {evidence.rank} [{evidence.score:.4f}]:")
429
+ print(f" {evidence.text[:100]}...")
430
+
431
+ print("\n2. Testing batch retrieval...")
432
+ claims = [
433
+ "Climate change is real",
434
+ "Renewable energy reduces emissions"
435
+ ]
436
+ batch_results = retriever.batch_retrieve(claims, k=2)
437
+ print(f" Processed {len(batch_results)} claims")
438
+
439
+ print("\n3. Statistics:")
440
+ stats = retriever.get_statistics()
441
+ for key, value in stats.items():
442
+ print(f" {key}: {value}")
443
+
444
+ print("\n" + "=" * 60)
445
+ print("Tests complete!")
446
+ print("=" * 60)
syscred/verification_system.py ADDED
@@ -0,0 +1,926 @@
1
+ # -*- coding: utf-8 -*-
2
+ """
3
+ Verification System Module - SysCRED v2.0
4
+ ==========================================
5
+ Main credibility verification system with real API integration.
6
+ Refactored from sys-cred-Python-27avril2025.py
7
+
8
+ (c) Dominique S. Loyer - PhD Thesis Prototype
9
+ Citation Key: loyerModelingHybridSystem2025
10
+ """
11
+
12
+ import re
13
+ import json
14
+ import datetime
15
+ from typing import Optional, Dict, Any, List
16
+ from urllib.parse import urlparse
17
+
18
+ # Transformers and ML
19
+ try:
20
+ from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
21
+ import numpy as np
22
+ import torch
23
+ from lime.lime_text import LimeTextExplainer
24
+ HAS_ML = True
25
+ except ImportError:
26
+ HAS_ML = False
27
+ print("Warning: ML libraries not fully installed. Run: pip install transformers torch lime numpy")
28
+
29
+ try:
30
+ from sentence_transformers import SentenceTransformer, util
31
+ HAS_SBERT = True
32
+ except ImportError:
33
+ HAS_SBERT = False
34
+ print("Warning: sentence-transformers not installed. Semantic coherence will use heuristics.")
35
+
36
+ # Local imports
37
+ from syscred.api_clients import ExternalAPIClients, WebContent, ExternalData
38
+ from syscred.ontology_manager import OntologyManager
39
+ from syscred.seo_analyzer import SEOAnalyzer
40
+ from syscred.graph_rag import GraphRAG # [NEW] GraphRAG
41
+ from syscred.trec_retriever import TRECRetriever, Evidence, RetrievalResult # [NEW] TREC Integration
42
+ from syscred import config
43
+
44
+
45
+ class CredibilityVerificationSystem:
46
+ """
47
+ Neuro-symbolic credibility verification system.
48
+
49
+ Combines:
50
+ - Rule-based analysis (symbolic, transparent)
51
+ - NLP/AI analysis (machine learning)
52
+ - OWL ontology for traceability
53
+ - External APIs for real-world data
54
+ """
+
+     def __init__(
+         self,
+         google_api_key: Optional[str] = None,
+         ontology_base_path: Optional[str] = None,
+         ontology_data_path: Optional[str] = None,
+         load_ml_models: bool = True
+     ):
+         """
+         Initialize the credibility verification system.
+
+         Args:
+             google_api_key: API key for Google Fact Check (optional)
+             ontology_base_path: Path to the base ontology TTL file
+             ontology_data_path: Path to store accumulated data
+             load_ml_models: Whether to load ML models (disable for testing)
+         """
+         print("[SysCRED] Initializing Credibility Verification System v2.0...")
+
+         # Initialize API clients
+         self.api_clients = ExternalAPIClients(google_api_key=google_api_key)
+         print("[SysCRED] API clients initialized")
+
+         # Initialize ontology manager
+         self.ontology_manager = None
+         if ontology_base_path or ontology_data_path:
+             try:
+                 self.ontology_manager = OntologyManager(
+                     base_ontology_path=ontology_base_path,
+                     data_path=ontology_data_path
+                 )
+                 self.graph_rag = GraphRAG(self.ontology_manager)  # [NEW] Init GraphRAG
+                 print("[SysCRED] Ontology manager & GraphRAG initialized")
+             except Exception as e:
+                 print(f"[SysCRED] Ontology manager disabled: {e}")
+                 self.graph_rag = None
+         else:
+             self.graph_rag = None
+
+         # [NEW] Initialize TREC Retriever for evidence gathering
+         self.trec_retriever = None
+         try:
+             self.trec_retriever = TRECRetriever(
+                 index_path=config.Config.TREC_INDEX_PATH,
+                 corpus_path=config.Config.TREC_CORPUS_PATH,
+                 use_stemming=True,
+                 enable_prf=config.Config.ENABLE_PRF,
+                 prf_top_docs=config.Config.PRF_TOP_DOCS,
+                 prf_expansion_terms=config.Config.PRF_EXPANSION_TERMS
+             )
+             print("[SysCRED] TREC Retriever initialized for evidence gathering")
+         except Exception as e:
+             print(f"[SysCRED] TREC Retriever disabled: {e}")
+
+         # Initialize ML models
+         self.sentiment_pipeline = None
+         self.ner_pipeline = None
+         self.bias_tokenizer = None
+         self.bias_model = None
+         self.coherence_model = None
+         self.explainer = None
+
+         if load_ml_models and HAS_ML:
+             self._load_ml_models()
+
+         # Weights for score calculation (loaded from Config)
+         self.weights = config.Config.SCORE_WEIGHTS
+         print(f"[SysCRED] Using weights: {self.weights}")
+
+         print("[SysCRED] System ready!")
+
+     def _load_ml_models(self):
+         """Load ML models for NLP analysis."""
+         print("[SysCRED] Loading ML models (this may take a moment)...")
+
+         try:
+             # Sentiment analysis
+             self.sentiment_pipeline = pipeline(
+                 "sentiment-analysis",
+                 model="distilbert-base-uncased-finetuned-sst-2-english"
+             )
+             print("[SysCRED] ✓ Sentiment model loaded")
+         except Exception as e:
+             print(f"[SysCRED] ✗ Sentiment model failed: {e}")
+
+         try:
+             # NER pipeline
+             self.ner_pipeline = pipeline("ner", grouped_entities=True)
+             print("[SysCRED] ✓ NER model loaded")
+         except Exception as e:
+             print(f"[SysCRED] ✗ NER model failed: {e}")
+
+         try:
+             # Bias detection - specialized model
+             # Using 'd4data/bias-detection-model', with heuristic fallback on failure
+             bias_model_name = "d4data/bias-detection-model"
+             self.bias_tokenizer = AutoTokenizer.from_pretrained(bias_model_name)
+             self.bias_model = AutoModelForSequenceClassification.from_pretrained(bias_model_name)
+             print("[SysCRED] ✓ Bias model loaded (d4data)")
+         except Exception as e:
+             print(f"[SysCRED] ✗ Bias model failed: {e}. Using heuristics.")
+
+         try:
+             # Semantic coherence
+             if HAS_SBERT:
+                 self.coherence_model = SentenceTransformer('all-MiniLM-L6-v2')
+                 print("[SysCRED] ✓ Coherence model loaded (SBERT)")
+         except Exception as e:
+             print(f"[SysCRED] ✗ Coherence model failed: {e}")
+
+         try:
+             # LIME explainer
+             self.explainer = LimeTextExplainer(class_names=['NEGATIVE', 'POSITIVE'])
+             print("[SysCRED] ✓ LIME explainer loaded")
+         except Exception as e:
+             print(f"[SysCRED] ✗ LIME explainer failed: {e}")
+
+     def is_url(self, text: str) -> bool:
+         """Check if a string is a valid URL."""
+         try:
+             result = urlparse(text)
+             return all([result.scheme, result.netloc])
+         except ValueError:
+             return False
+
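The scheme/netloc check above can be exercised on its own. A standalone sketch of the same logic (the function name here is just for illustration):

```python
from urllib.parse import urlparse

def is_url(text: str) -> bool:
    # A string counts as a URL only when urlparse finds both a scheme
    # (e.g. "https") and a network location (the host).
    try:
        result = urlparse(text)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

print(is_url("https://www.lemonde.fr"))  # True
print(is_url("just some plain text"))    # False
```

Note that a bare domain like "lemonde.fr" fails this test because it has no scheme, which is why text input and URL input follow different paths in the pipeline.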
+     def preprocess(self, text: str) -> str:
+         """Clean and normalize text for analysis."""
+         if not isinstance(text, str):
+             return ""
+
+         # Remove URLs
+         text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
+         # Normalize whitespace
+         text = re.sub(r'\s+', ' ', text)
+         # Keep basic punctuation
+         text = re.sub(r'[^\w\s\.\?,!]', '', text)
+
+         return text.lower().strip()
+
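The three regex passes above (URL removal, whitespace collapsing, punctuation filtering) can be checked in isolation; a minimal standalone sketch:

```python
import re

def preprocess(text: str) -> str:
    # Same cleanup steps as the method above: strip URLs, collapse
    # whitespace, drop punctuation outside . ? , !, then lowercase.
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'[^\w\s\.\?,!]', '', text)
    return text.lower().strip()

print(preprocess("BREAKING:  Read more at https://example.com !!"))
# breaking read more at !!
```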
+     def rule_based_analysis(self, text: str, external_data: ExternalData) -> Dict[str, Any]:
+         """
+         Perform rule-based analysis using symbolic reasoning.
+
+         Args:
+             text: Preprocessed text to analyze
+             external_data: Data from external APIs
+
+         Returns:
+             Dictionary with rule-based analysis results
+         """
+         results = {
+             'linguistic_markers': {},
+             'source_analysis': {},
+             'timeliness_flags': [],
+             'fact_checking': []
+         }
+
+         # 1. Linguistic markers
+         sensational_words = [
+             'shocking', 'revealed', 'conspiracy', 'amazing', 'secret',
+             'breakthrough', 'miracle', 'unbelievable', 'exclusive', 'urgent'
+         ]
+         certainty_words = [
+             'verified', 'authentic', 'credible', 'proven', 'fact',
+             'confirmed', 'official', 'legitimate', 'established'
+         ]
+         doubt_words = [
+             'hoax', 'false', 'fake', 'unproven', 'rumor', 'allegedly',
+             'claim', 'debunked', 'misleading', 'disputed'
+         ]
+
+         text_lower = text.lower()
+         results['linguistic_markers']['sensationalism'] = sum(
+             1 for word in sensational_words if word in text_lower
+         )
+         results['linguistic_markers']['certainty'] = sum(
+             1 for word in certainty_words if word in text_lower
+         )
+         results['linguistic_markers']['doubt'] = sum(
+             1 for word in doubt_words if word in text_lower
+         )
+
+         # 2. Source analysis from external data
+         results['source_analysis']['reputation'] = external_data.source_reputation
+         results['source_analysis']['domain_age_days'] = external_data.domain_age_days
+
+         if external_data.domain_info:
+             results['source_analysis']['registrar'] = external_data.domain_info.registrar
+             results['source_analysis']['domain'] = external_data.domain_info.domain
+
+         # 3. Timeliness flags
+         if external_data.domain_age_days is not None:
+             if external_data.domain_age_days < 180:
+                 results['timeliness_flags'].append('Source domain is relatively new (<6 months)')
+             elif external_data.domain_age_days < 365:
+                 results['timeliness_flags'].append('Source domain is less than 1 year old')
+
+         # 4. Fact checking results
+         for fc in external_data.fact_checks:
+             results['fact_checking'].append({
+                 'claim': fc.claim,
+                 'rating': fc.rating,
+                 'publisher': fc.publisher,
+                 'url': fc.url
+             })
+
+         return results
+
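The linguistic-marker step above is a plain lexicon count: one point per word from the list that appears anywhere in the lowercased text. A standalone sketch (the helper name is illustrative):

```python
SENSATIONAL_WORDS = ['shocking', 'revealed', 'conspiracy', 'amazing', 'secret',
                     'breakthrough', 'miracle', 'unbelievable', 'exclusive', 'urgent']

def count_markers(text: str, lexicon) -> int:
    # Case-insensitive substring count, one point per lexicon entry found.
    text_lower = text.lower()
    return sum(1 for word in lexicon if word in text_lower)

print(count_markers("Shocking conspiracy revealed!", SENSATIONAL_WORDS))  # 3
```

Because this is a substring test rather than a token match, a word like "secretary" would also count a hit for "secret"; the heuristic trades precision for simplicity.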
+     def nlp_analysis(self, text: str) -> Dict[str, Any]:
+         """
+         Perform NLP-based analysis using ML models.
+
+         Args:
+             text: Preprocessed text to analyze
+
+         Returns:
+             Dictionary with NLP analysis results
+         """
+         results = {
+             'sentiment': None,
+             'sentiment_explanation': None,
+             'bias_analysis': {'score': None, 'label': 'Unavailable'},
+             'named_entities': [],
+             'coherence_score': None
+         }
+
+         if not text:
+             results['sentiment'] = {'label': 'Neutral', 'score': 0.5}
+             return results
+
+         # 1. Sentiment analysis with LIME explanation
+         if self.sentiment_pipeline:
+             try:
+                 main_pred = self.sentiment_pipeline(text[:512])[0]
+                 results['sentiment'] = main_pred
+
+                 if self.explainer:
+                     def predict_proba(texts):
+                         if isinstance(texts, str):
+                             texts = [texts]
+                         predictions = self.sentiment_pipeline(list(texts))
+                         probs = []
+                         for pred in predictions:
+                             if pred['label'] == 'POSITIVE':
+                                 probs.append([1 - pred['score'], pred['score']])
+                             else:
+                                 probs.append([pred['score'], 1 - pred['score']])
+                         return np.array(probs)
+
+                     explanation = self.explainer.explain_instance(
+                         text[:512], predict_proba, num_features=6
+                     )
+                     results['sentiment_explanation'] = explanation.as_list()
+             except Exception as e:
+                 print(f"[NLP] Sentiment error: {e}")
+                 results['sentiment'] = {'label': 'Error', 'score': 0.0}
+
+         # 2. Bias analysis
+         results['bias_analysis'] = self._analyze_bias(text)
+
+         # 3. Named Entity Recognition
+         if self.ner_pipeline:
+             try:
+                 entities = self.ner_pipeline(text[:512])
+                 results['named_entities'] = entities
+             except Exception as e:
+                 print(f"[NLP] NER error: {e}")
+
+         # 4. Semantic coherence
+         results['coherence_score'] = self._calculate_coherence(text)
+
+         return results
+
+     def _analyze_bias(self, text: str) -> Dict[str, Any]:
+         """Analyze text for bias using ML or heuristics."""
+         # Method 1: ML model
+         if self.bias_model and self.bias_tokenizer:
+             try:
+                 inputs = self.bias_tokenizer(
+                     text[:512], return_tensors="pt",
+                     truncation=True, max_length=512, padding=True
+                 )
+                 with torch.no_grad():
+                     logits = self.bias_model(**inputs).logits
+                 probs = torch.softmax(logits, dim=1)[0]
+                 # Label mapping depends on the model, usually [Non-biased, Biased]
+                 bias_score = probs[1].item()
+
+                 label = "Biased" if bias_score > 0.5 else "Non-biased"
+                 return {'score': bias_score, 'label': label, 'method': 'ML (d4data)'}
+             except Exception as e:
+                 print(f"[NLP] ML Bias error: {e}")
+
+         # Method 2: Heuristics
+         biased_words = [
+             'radical', 'extremist', 'disgraceful', 'shameful', 'corrupt',
+             'insane', 'idiot', 'disaster', 'propaganda', 'dictator',
+             'puppet', 'regime', 'tyrant', 'treason', 'traitor'
+         ]
+         text_lower = text.lower()
+         count = sum(1 for w in biased_words if w in text_lower)
+         score = min(1.0, count * 0.15)
+         label = "Potentially Biased" if score > 0.3 else "Neutral"
+         return {'score': score, 'label': label, 'method': 'Heuristic'}
+
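The heuristic fallback is easy to verify by hand: each loaded term contributes 0.15, capped at 1.0, and anything above 0.3 (i.e. three or more hits) is flagged. A standalone sketch with the same constants:

```python
BIASED_WORDS = ['radical', 'extremist', 'disgraceful', 'shameful', 'corrupt',
                'insane', 'idiot', 'disaster', 'propaganda', 'dictator',
                'puppet', 'regime', 'tyrant', 'treason', 'traitor']

def heuristic_bias(text: str) -> dict:
    # Each loaded term adds 0.15 (capped at 1.0); a score above 0.3
    # flags the text as potentially biased.
    count = sum(1 for w in BIASED_WORDS if w in text.lower())
    score = min(1.0, count * 0.15)
    label = "Potentially Biased" if score > 0.3 else "Neutral"
    return {'score': score, 'label': label, 'method': 'Heuristic'}

# Three hits (corrupt, regime, propaganda) -> 0.45 > 0.3
print(heuristic_bias("The corrupt regime spreads propaganda")['label'])  # Potentially Biased
```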
+     def _calculate_coherence(self, text: str) -> float:
+         """Calculate a semantic coherence score."""
+         sentences = re.split(r'[.!?]+', text)
+         sentences = [s.strip() for s in sentences if len(s.split()) > 3]
+
+         if len(sentences) < 2:
+             return 0.7  # Default to neutral/good for short text, not a perfect 1.0
+
+         # Method 1: SBERT semantic similarity
+         if self.coherence_model and HAS_SBERT:
+             try:
+                 embeddings = self.coherence_model.encode(sentences[:10])  # Limit to 10
+                 sims = []
+                 for i in range(len(embeddings) - 1):
+                     sim = util.pytorch_cos_sim(embeddings[i], embeddings[i + 1])
+                     sims.append(sim.item())
+                 return sum(sims) / len(sims) if sims else 0.5
+             except Exception as e:
+                 print(f"[NLP] SBERT error: {e}")
+
+         # Method 2: Heuristic (sentence-length variance)
+         lengths = [len(s.split()) for s in sentences]
+         avg_len = sum(lengths) / len(lengths)
+         variance = sum((l - avg_len) ** 2 for l in lengths) / len(lengths)
+
+         # High variance usually indicates choppy, less coherent writing
+         score = 0.8
+         if variance > 100:
+             score -= 0.2
+         if avg_len < 5:
+             score -= 0.2
+
+         return max(0.0, score)
+
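The variance fallback can be isolated for testing: start at 0.8, subtract 0.2 when sentence-length variance exceeds 100, and another 0.2 when sentences average fewer than 5 words. A standalone sketch:

```python
def coherence_heuristic(sentences) -> float:
    # Fallback scoring used when SBERT is unavailable: penalize high
    # sentence-length variance and very short sentences.
    lengths = [len(s.split()) for s in sentences]
    avg_len = sum(lengths) / len(lengths)
    variance = sum((l - avg_len) ** 2 for l in lengths) / len(lengths)
    score = 0.8
    if variance > 100:
        score -= 0.2
    if avg_len < 5:
        score -= 0.2
    return max(0.0, score)

# Two well-formed sentences of equal length: variance 0, avg 7 -> 0.8
print(coherence_heuristic(["The committee reviewed all the available evidence",
                           "Its findings were published the following week"]))  # 0.8
```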
+     def calculate_overall_score(
+         self,
+         rule_results: Dict,
+         nlp_results: Dict
+     ) -> float:
+         """
+         Calculate the overall credibility score based on user-defined metrics.
+         """
+         score = 0.5  # Start neutral
+         adjustments = 0.0
+         total_weight_used = 0.0
+
+         # 1. Source reputation (25%)
+         w_rep = self.weights.get('source_reputation', 0.25)
+         reputation = rule_results['source_analysis'].get('reputation', 'Unknown')
+         if reputation != 'Unknown' and "N/A" not in reputation:
+             if reputation == 'High':
+                 adjustments += w_rep * 1.0  # Full boost
+             elif reputation == 'Low':
+                 adjustments -= w_rep * 1.0  # Full penalty
+             elif reputation == 'Medium':
+                 adjustments += w_rep * 0.2  # Slight boost
+             total_weight_used += w_rep
+
+         # 2. Domain age (10%)
+         w_age = self.weights.get('domain_age', 0.10)
+         domain_age = rule_results['source_analysis'].get('domain_age_days')
+         if domain_age is not None:
+             if domain_age > 730:  # > 2 years
+                 adjustments += w_age
+             elif domain_age < 90:  # < 3 months
+                 adjustments -= w_age
+             total_weight_used += w_age
+
+         # 3. Fact check (20%)
+         w_fc = self.weights.get('fact_check', 0.20)
+         fact_checks = rule_results.get('fact_checking', [])
+         if fact_checks:
+             fc_score = 0
+             for fc in fact_checks:
+                 rating = fc.get('rating', '').lower()
+                 if rating in ['true', 'verified', 'correct']:
+                     fc_score += 1
+                 elif rating in ['false', 'fake', 'incorrect']:
+                     fc_score -= 1
+
+             # The sign of the net rating count decides the adjustment direction
+             if fc_score > 0:
+                 adjustments += w_fc
+             elif fc_score < 0:
+                 adjustments -= w_fc
+             total_weight_used += w_fc
+
+         # 4. Sentiment neutrality (15%)
+         # Extreme sentiment = lower score
+         w_sent = self.weights.get('sentiment_neutrality', 0.15)
+         sentiment = nlp_results.get('sentiment', {})
+         if sentiment:
+             s_score = sentiment.get('score', 0.5)
+             # If extremely positive or negative (>0.9), penalize
+             if s_score > 0.9:
+                 adjustments -= w_sent * 0.5  # Penalty for extremism
+             else:
+                 adjustments += w_sent * 0.2  # Slight boost for moderation
+             total_weight_used += w_sent
+
+         # 5. Entity presence (15%)
+         # Presence of named entities (PER, ORG, LOC) suggests verifiability
+         w_ent = self.weights.get('entity_presence', 0.15)
+         entities = nlp_results.get('named_entities', [])
+         if len(entities) > 0:
+             # More entities = better (capped)
+             boost = min(1.0, len(entities) * 0.2)
+             adjustments += w_ent * boost
+             total_weight_used += w_ent
+
+         # 6. Text coherence (15%)
+         w_coh = self.weights.get('coherence', 0.15)
+         coherence = nlp_results.get('coherence_score')
+         if coherence is not None:
+             # Coherence is usually 0.0 to 1.0
+             # Center around 0.5: >0.5 improves, <0.5 penalizes
+             adjustments += (coherence - 0.5) * w_coh
+             total_weight_used += w_coh
+
+         # Final calculation: base 0.5 plus the sum of weighted adjustments,
+         # each bounded to the range [-weight, +weight]
+         final_score = score + adjustments
+
+         return max(0.0, min(1.0, final_score))
+
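The weighting scheme above can be checked by hand. A minimal sketch covering three of the six factors (the weights here mirror the documented defaults; the real values come from config.Config.SCORE_WEIGHTS, and the function name is illustrative):

```python
WEIGHTS = {'source_reputation': 0.25, 'domain_age': 0.10, 'fact_check': 0.20}

def sketch_score(reputation, domain_age_days, fact_check_net) -> float:
    # Base score 0.5, shifted by weighted adjustments, clamped to [0, 1].
    adjustments = 0.0
    if reputation == 'High':
        adjustments += WEIGHTS['source_reputation']
    elif reputation == 'Low':
        adjustments -= WEIGHTS['source_reputation']
    if domain_age_days is not None:
        if domain_age_days > 730:
            adjustments += WEIGHTS['domain_age']
        elif domain_age_days < 90:
            adjustments -= WEIGHTS['domain_age']
    if fact_check_net > 0:      # net count of 'true' minus 'false' ratings
        adjustments += WEIGHTS['fact_check']
    elif fact_check_net < 0:
        adjustments -= WEIGHTS['fact_check']
    return max(0.0, min(1.0, 0.5 + adjustments))

# High-reputation, 5-year-old domain, one 'true' fact check:
# 0.5 + 0.25 + 0.10 + 0.20 = 1.05 -> clamped to 1.0
print(sketch_score('High', 5 * 365, 1))  # 1.0
```

The symmetric worst case (Low reputation, 30-day-old domain, a 'false' rating) gives 0.5 − 0.25 − 0.10 − 0.20 = −0.05, clamped to 0.0.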
+     # --- [NEW] TREC Evidence Retrieval Methods ---
+
+     def retrieve_evidence(
+         self,
+         claim: str,
+         k: int = 10,
+         model: str = "bm25"
+     ) -> List[Dict[str, Any]]:
+         """
+         Retrieve evidence documents for a given claim using TREC methodology.
+
+         This integrates the classic IR evaluation framework (TREC AP88-90)
+         with the neuro-symbolic credibility verification system.
+
+         Args:
+             claim: The claim or statement to verify
+             k: Number of evidence documents to retrieve
+             model: Retrieval model ('bm25', 'qld', 'tfidf')
+
+         Returns:
+             List of evidence dictionaries with doc_id, text, score, rank
+         """
+         if not self.trec_retriever:
+             return []
+
+         try:
+             result = self.trec_retriever.retrieve_evidence(
+                 claim=claim,
+                 k=k,
+                 model=model
+             )
+
+             # Convert Evidence objects to dictionaries
+             evidences = [e.to_dict() for e in result.evidences]
+
+             # Add to ontology if available
+             if self.ontology_manager:
+                 for e in result.evidences[:3]:  # Top 3 only
+                     self.ontology_manager.add_evidence(
+                         evidence_id=e.doc_id,
+                         source=e.source or "trec_corpus",
+                         content=e.text[:500],
+                         score=e.score
+                     )
+
+             return evidences
+
+         except Exception as ex:
+             print(f"[SysCRED] Evidence retrieval error: {ex}")
+             return []
+
+     def verify_with_evidence(
+         self,
+         claim: str,
+         k: int = 5
+     ) -> Dict[str, Any]:
+         """
+         Complete fact-checking pipeline with evidence retrieval.
+
+         Combines:
+         1. TREC-style evidence retrieval
+         2. NLP analysis of the claim
+         3. Evidence-claim comparison
+         4. Credibility scoring
+
+         Args:
+             claim: The claim to verify
+             k: Number of evidence documents
+
+         Returns:
+             Verification result with evidence, analysis, and score
+         """
+         result = {
+             'claim': claim,
+             'evidences': [],
+             'nlp_analysis': {},
+             'evidence_support_score': 0.0,
+             'verification_verdict': 'UNKNOWN',
+             'confidence': 0.0
+         }
+
+         # 1. Retrieve evidence
+         evidences = self.retrieve_evidence(claim, k=k)
+         result['evidences'] = evidences
+
+         # 2. NLP analysis of the claim
+         cleaned_claim = self.preprocess(claim)
+         result['nlp_analysis'] = self.nlp_analysis(cleaned_claim)
+
+         # 3. Calculate evidence support score
+         if evidences:
+             # Use semantic similarity if SBERT is available
+             if self.coherence_model:
+                 try:
+                     claim_embedding = self.coherence_model.encode(claim)
+                     evidence_texts = [e.get('text', '') for e in evidences]
+                     evidence_embeddings = self.coherence_model.encode(evidence_texts)
+
+                     similarities = util.pytorch_cos_sim(claim_embedding, evidence_embeddings)[0]
+                     avg_similarity = similarities.mean().item()
+                     max_similarity = similarities.max().item()
+
+                     # Evidence support based on similarity
+                     result['evidence_support_score'] = round(max_similarity, 4)
+                     result['average_evidence_similarity'] = round(avg_similarity, 4)
+                 except Exception as e:
+                     print(f"[SysCRED] Similarity error: {e}")
+                     # Fallback: use retrieval scores
+                     result['evidence_support_score'] = evidences[0].get('score', 0)
+             else:
+                 # Fallback: use retrieval scores
+                 result['evidence_support_score'] = evidences[0].get('score', 0)
+
+         # 4. Determine verdict
+         support_score = result['evidence_support_score']
+         if support_score > 0.7:
+             result['verification_verdict'] = 'SUPPORTED'
+             result['confidence'] = support_score
+         elif support_score > 0.5:
+             result['verification_verdict'] = 'PARTIALLY_SUPPORTED'
+             result['confidence'] = support_score
+         elif support_score > 0.3:
+             result['verification_verdict'] = 'INSUFFICIENT_EVIDENCE'
+             result['confidence'] = 0.5
+         else:
+             result['verification_verdict'] = 'NOT_SUPPORTED'
+             result['confidence'] = 1 - support_score
+
+         return result
+
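Step 4 maps the support score onto four verdict bands. A standalone sketch of just that mapping (the function name is illustrative):

```python
def verdict_from_support(support_score: float):
    # Threshold bands used to turn evidence similarity into a verdict.
    if support_score > 0.7:
        return 'SUPPORTED', support_score
    if support_score > 0.5:
        return 'PARTIALLY_SUPPORTED', support_score
    if support_score > 0.3:
        return 'INSUFFICIENT_EVIDENCE', 0.5
    return 'NOT_SUPPORTED', 1 - support_score

print(verdict_from_support(0.82))  # ('SUPPORTED', 0.82)
print(verdict_from_support(0.40))  # ('INSUFFICIENT_EVIDENCE', 0.5)
```

Note that confidence is asymmetric: a very low support score yields high confidence in NOT_SUPPORTED, while the middle band pins confidence at 0.5.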
+     # --- End TREC Evidence Methods ---
+
+     def generate_report(
+         self,
+         input_data: str,
+         cleaned_text: str,
+         rule_results: Dict,
+         nlp_results: Dict,
+         external_data: ExternalData,
+         overall_score: float,
+         web_content: Optional[WebContent] = None,
+         graph_context: str = "",  # [NEW]
+         evidences: Optional[List[Dict[str, Any]]] = None  # [NEW] TREC evidences
+     ) -> Dict[str, Any]:
+         """Generate the final evaluation report."""
+
+         report = {
+             'idRapport': f"report_{int(datetime.datetime.now().timestamp())}",
+             'informationEntree': input_data,
+             'dateGeneration': datetime.datetime.now().isoformat(),
+             'scoreCredibilite': round(overall_score, 2),
+             'resumeAnalyse': "",
+             'detailsScore': {
+                 'base': 0.5,
+                 'weights': self.weights,
+                 'factors': self._get_score_factors(rule_results, nlp_results)
+             },
+             'sourcesUtilisees': [],
+             'reglesAppliquees': rule_results,
+             'analyseNLP': {
+                 'sentiment': nlp_results.get('sentiment'),
+                 'bias_analysis': nlp_results.get('bias_analysis'),
+                 'named_entities_count': len(nlp_results.get('named_entities', [])),
+                 'coherence_score': nlp_results.get('coherence_score'),
+                 'sentiment_explanation_preview': (nlp_results.get('sentiment_explanation') or [])[:3]
+             },
+             # [NEW] TREC evidence section
+             'evidences': evidences or [],
+             'metadonnees': {}
+         }
+
+         # Add web content metadata if available
+         if web_content:
+             if web_content.success:
+                 report['metadonnees']['page_title'] = web_content.title
+                 report['metadonnees']['meta_description'] = web_content.meta_description
+                 report['metadonnees']['links_count'] = len(web_content.links)
+             else:
+                 report['metadonnees']['warning'] = f"Content scrape failed: {web_content.error}"
+
+         # Generate summary
+         summary_parts = []
+
+         if web_content and not web_content.success:
+             summary_parts.append(f"⚠️ ATTENTION: Impossible de lire le texte de la page ({web_content.error}). Analyse basée uniquement sur la réputation du domaine.")
+
+         if overall_score > 0.75:
+             summary_parts.append("L'analyse suggère une crédibilité ÉLEVÉE.")
+         elif overall_score > 0.55:
+             summary_parts.append("L'analyse suggère une crédibilité MOYENNE à ÉLEVÉE.")
+         elif overall_score > 0.45:
+             summary_parts.append("L'analyse suggère une crédibilité MOYENNE.")
+         elif overall_score > 0.25:
+             summary_parts.append("L'analyse suggère une crédibilité FAIBLE à MOYENNE.")
+         else:
+             summary_parts.append("L'analyse suggère une crédibilité FAIBLE.")
+
+         if external_data.source_reputation != 'Unknown':
+             summary_parts.append(f"Réputation source : {external_data.source_reputation}.")
+
+         if external_data.domain_age_days:
+             years = external_data.domain_age_days / 365
+             summary_parts.append(f"Âge du domaine : {years:.1f} ans.")
+
+         if external_data.fact_checks:
+             summary_parts.append(f"{len(external_data.fact_checks)} vérification(s) de faits trouvée(s).")
+
+         report['resumeAnalyse'] = " ".join(summary_parts)
+
+         # List sources used
+         if self.is_url(input_data):
+             report['sourcesUtilisees'].append({
+                 'type': 'Primary URL',
+                 'url': input_data
+             })
+         report['sourcesUtilisees'].append({
+             'type': 'WHOIS Lookup',
+             'status': 'Success' if (external_data.domain_info and external_data.domain_info.success) else 'Failed/N/A'
+         })
+         report['sourcesUtilisees'].append({
+             'type': 'Fact Check API',
+             'results_count': len(external_data.fact_checks)
+         })
+         # [NEW] Add TREC evidence source
+         if evidences:
+             report['sourcesUtilisees'].append({
+                 'type': 'TREC Evidence Retrieval',
+                 'method': 'BM25/TF-IDF',
+                 'corpus': 'AP88-90',
+                 'results_count': len(evidences)
+             })
+
+         return report
+
718
+
719
+ def _get_score_factors(self, rule_results: Dict, nlp_results: Dict) -> List[Dict]:
720
+ """Get list of factors that influenced the score (For UI)."""
721
+ factors = []
722
+
723
+ # 1. Reputation
724
+ rep = rule_results['source_analysis'].get('reputation')
725
+ if rep and "N/A" not in rep:
726
+ factors.append({
727
+ 'factor': 'Source Reputation',
728
+ 'value': rep,
729
+ 'weight': f"{int(self.weights.get('source_reputation',0)*100)}%",
730
+ 'impact': '+' if rep == 'High' else ('-' if rep == 'Low' else '0')
731
+ })
732
+
733
+ # 2. Fact Checks
734
+ if rule_results.get('fact_checking'):
735
+ factors.append({
736
+ 'factor': 'Fact Checks',
737
+ 'value': f"{len(rule_results['fact_checking'])} found",
738
+ 'weight': f"{int(self.weights.get('fact_check',0)*100)}%",
739
+ 'impact': 'Variable'
740
+ })
741
+
742
+ # 3. Entities
743
+ n_ent = len(nlp_results.get('named_entities', []))
744
+ if n_ent > 0:
745
+ factors.append({
746
+ 'factor': 'Entity Presence',
747
+ 'value': f"{n_ent} entities",
748
+ 'weight': f"{int(self.weights.get('entity_presence',0)*100)}%",
749
+ 'impact': '+'
750
+ })
751
+
752
+ # 4. Sentiment
753
+ sent = nlp_results.get('sentiment', {})
754
+ if sent:
755
+ factors.append({
756
+ 'factor': 'Sentiment Neutrality',
757
+ 'value': f"{sent.get('label')} ({sent.get('score',0):.2f})",
758
+ 'weight': f"{int(self.weights.get('sentiment_neutrality',0)*100)}%",
759
+ 'impact': '-' if sent.get('score', 0) > 0.9 else '0'
760
+ })
761
+
762
+ return factors
+
+     def verify_information(self, input_data: str) -> Dict[str, Any]:
+         """
+         Main pipeline to verify the credibility of input data.
+
+         Args:
+             input_data: URL or text to verify
+
+         Returns:
+             Complete evaluation report
+         """
+         if not isinstance(input_data, str) or not input_data.strip():
+             return {"error": "L'entrée doit être une chaîne non vide."}
+
+         print(f"\n[SysCRED] === Vérification: {input_data[:100]}... ===")
+
+         # 1. Determine input type and fetch content
+         text_to_analyze = ""
+         web_content = None
+         is_url = self.is_url(input_data)
+
+         if is_url:
+             print("[SysCRED] Fetching web content...")
+             web_content = self.api_clients.fetch_web_content(input_data)
+
+             if web_content.success:
+                 text_to_analyze = web_content.text_content
+                 print(f"[SysCRED] ✓ Content fetched: {len(text_to_analyze)} chars")
+             else:
+                 print(f"[SysCRED] ⚠ Fetch failed: {web_content.error}")
+                 print("[SysCRED] Proceeding with Domain/Metadata analysis only.")
+                 text_to_analyze = ""
+                 # No error is returned here; analysis proceeds on metadata alone
+         else:
+             text_to_analyze = input_data
+
+         # 2. Preprocess text
+         cleaned_text = self.preprocess(text_to_analyze)
+
+         # Only error on empty text if it wasn't a failed web fetch;
+         # after a failed fetch, proceed with empty text for metadata-only analysis
+         if not cleaned_text and not (is_url and web_content and not web_content.success):
+             return {"error": "Le texte est vide après prétraitement."}
+         print(f"[SysCRED] Preprocessed text: {len(cleaned_text)} chars")
+
+         # Determine the best query for fact checking
+         fact_check_query = input_data
+         if text_to_analyze and len(text_to_analyze) > 10:
+             # Use the start of the text if available
+             fact_check_query = text_to_analyze[:200]
+         elif is_url and web_content and web_content.title:
+             # Fall back to the page title if text is missing (e.g. 403)
+             fact_check_query = web_content.title
+
+         # 3. Fetch external data
+         print(f"[SysCRED] Fetching external data (Query: {fact_check_query[:50]}...)...")
+         external_data = self.api_clients.fetch_external_data(input_data, fc_query=fact_check_query)
+
+         # [FIX] Handle text-only input reputation
+         if not is_url:
+             external_data.source_reputation = "N/A (User Input)"
+
+         print(f"[SysCRED] ✓ Reputation: {external_data.source_reputation}, Age: {external_data.domain_age_days} days")
+
+         # 4. Rule-based analysis
+         print("[SysCRED] Running rule-based analysis...")
+         rule_results = self.rule_based_analysis(cleaned_text, external_data)
+
+         # 5. NLP analysis
+         print("[SysCRED] Running NLP analysis...")
+         nlp_results = self.nlp_analysis(cleaned_text)
+
+         # 6. Calculate score
+         overall_score = self.calculate_overall_score(rule_results, nlp_results)
+         print(f"[SysCRED] ✓ Credibility score: {overall_score:.2f}")
+
+         # 7. [NEW] GraphRAG context retrieval
+         graph_context = ""
+         similar_uris = []
+         if self.graph_rag and 'source_analysis' in rule_results:
+             domain = rule_results['source_analysis'].get('domain', '')
+             # Pass keywords for text search if the domain is empty or generic
+             keywords = []
+             if not domain and cleaned_text:
+                 keywords = cleaned_text.split()[:5]  # Simple keyword extraction
+
+             context = self.graph_rag.get_context(domain, keywords=keywords)
+             graph_context = context.get('full_text', '')
+             similar_uris = context.get('similar_uris', [])
+
+             if "Graph Memory" in graph_context:
+                 print(f"[SysCRED] GraphRAG Context Found: {graph_context.splitlines()[1]}")
+
+         # 8. Generate report (updated to include context)
+         report = self.generate_report(
+             input_data, cleaned_text, rule_results,
+             nlp_results, external_data, overall_score, web_content,
+             graph_context=graph_context
+         )
+
+         # Add similar URIs to the report for ontology linking
+         if similar_uris:
+             report['similar_claims_uris'] = similar_uris
+
+         # 9. Save to ontology
+         if self.ontology_manager:
+             try:
+                 report_uri = self.ontology_manager.add_evaluation_triplets(report)
+                 report['ontology_uri'] = report_uri
+                 self.ontology_manager.save_data()
+             except Exception as e:
+                 print(f"[SysCRED] Ontology save failed: {e}")
+
+         print("[SysCRED] === Vérification terminée ===\n")
+         return report
+
+
+ # --- Main / Testing ---
+ if __name__ == "__main__":
+     print("=" * 60)
+     print("SysCRED v2.0 - Système de Vérification de Crédibilité")
+     print("(c) Dominique S. Loyer - PhD Thesis Prototype")
+     print("=" * 60 + "\n")
+
+     # Initialize system (without ML models for quick testing)
+     system = CredibilityVerificationSystem(
+         ontology_base_path="/Users/bk280625/documents041025/MonCode/sysCRED_onto26avrtil.ttl",
+         ontology_data_path="/Users/bk280625/documents041025/MonCode/ontology/sysCRED_data.ttl",
+         load_ml_models=False  # Set to True for full analysis
+     )
+
+     # Test cases
+     test_cases = {
+         "Test URL Crédible": "https://www.lemonde.fr",
+         "Test URL Inconnu": "https://example.com/article",
+         "Test Texte Simple": "This is a verified and authentic news report.",
+         "Test Texte Suspect": "Shocking conspiracy revealed! They don't want you to know this secret!",
+     }
+
+     results = {}
+     for name, test_input in test_cases.items():
+         print(f"\n{'=' * 50}")
+         print(f"Test: {name}")
+         print('=' * 50)
+
+         result = system.verify_information(test_input)
+         results[name] = result
+
+         if 'error' not in result:
+             print(f"\nScore: {result['scoreCredibilite']}")
+             print(f"Résumé: {result['resumeAnalyse']}")
+         else:
+             print(f"Erreur: {result['error']}")
+
+     print("\n" + "=" * 60)
+     print("Résumé des tests:")
+     print("=" * 60)
+     for name, result in results.items():
+         if 'error' not in result:
+             print(f"  {name}: Score = {result['scoreCredibilite']:.2f}")
+         else:
+             print(f"  {name}: ERREUR")