.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
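Each pattern added above routes matching files through Git LFS (`filter=lfs diff=lfs merge=lfs`) and marks them binary (`-text`). As a rough illustration of which paths the rules capture, here is a small sketch using Python's `fnmatch` as a stand-in for Git's wildmatch (an approximation: `fnmatch` does not treat `**` specially, so patterns like `saved_model/**/*` are omitted):

```python
from fnmatch import fnmatch

# A few of the patterns from the .gitattributes above; fnmatch is only an
# approximation of Git's pattern matching, used here for illustration.
LFS_PATTERNS = ["*.7z", "*.safetensors", "*.parquet", "*tfevents*"]

def tracked_by_lfs(path: str) -> bool:
    return any(fnmatch(path, pattern) for pattern in LFS_PATTERNS)

print(tracked_by_lfs("model.safetensors"))             # True
print(tracked_by_lfs("runs/events.out.tfevents.123"))  # True
print(tracked_by_lfs("app.py"))                        # False
```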
.gitignore DELETED
@@ -1,49 +0,0 @@
- # Python
- __pycache__/
- *.py[cod]
- *$py.class
- *.so
- .Python
- build/
- develop-eggs/
- dist/
- downloads/
- eggs/
- .eggs/
- lib/
- lib64/
- parts/
- sdist/
- var/
- wheels/
- *.egg-info/
- .installed.cfg
- *.egg
-
- # Virtual environments
- venv/
- env/
- ENV/
-
- # IDE
- .vscode/
- .idea/
- *.swp
- *.swo
-
- # OS
- .DS_Store
- Thumbs.db
-
- # Project specific
- cache/
- logs/
- results/
- models/
- index/
- data/
- *.log
-
- # Temporary files
- *.tmp
- *.temp
PROJECT_INFO.md DELETED
@@ -1,141 +0,0 @@
- # SafeRAG Project Information
-
- ## 📁 Project Structure
-
- ```
- safe_rag/
- ├── app.py                    # Gradio demo app
- ├── requirements.txt          # Python dependencies
- ├── config.yaml               # Configuration file
- ├── README.md                 # Project overview (HF Spaces config)
- ├── simple_e2e_test.py        # End-to-end tests
- ├── simple_test.py            # Basic functionality tests
- ├── data_processing/          # Data processing module
- │   ├── __init__.py
- │   ├── data_loader.py        # Data loaders
- │   └── preprocessor.py       # Text preprocessor
- ├── retriever/                # Retrieval module
- │   ├── __init__.py
- │   ├── embedder.py           # Embedding generator
- │   ├── faiss_index.py        # FAISS index
- │   ├── retriever.py          # Retriever
- │   └── reranker.py           # Reranker
- ├── generator/                # Generation module
- │   ├── __init__.py
- │   ├── vllm_server.py        # vLLM server
- │   ├── prompt_templates.py   # Prompt templates
- │   └── safe_generate.py      # Safe generator
- ├── calibration/              # Calibration module
- │   ├── __init__.py
- │   ├── features.py           # Feature extraction
- │   ├── calibration_head.py   # Calibration head
- │   └── trainer.py            # Trainer
- └── eval/                     # Evaluation module
-     ├── __init__.py
-     ├── eval_qa.py            # QA evaluation
-     ├── eval_attr.py          # Attribution evaluation
-     ├── eval_calib.py         # Calibration evaluation
-     └── eval_system.py        # System evaluation
- ```
-
- ## 🚀 Core Features
-
- ### 1. Data Processing (`data_processing/`)
- - **DataLoader**: Loads HF Datasets (HotpotQA, TriviaQA, Wikipedia)
- - **Preprocessor**: Text cleaning, sentence splitting, tokenization
-
- ### 2. Retrieval System (`retriever/`)
- - **Embedder**: Generates embeddings with BGE/E5
- - **FAISSIndex**: Builds and searches FAISS indexes
- - **Retriever**: Batched retrieval of relevant documents
- - **Reranker**: Reranking to improve retrieval quality
-
- ### 3. Generation System (`generator/`)
- - **VLLMServer**: vLLM inference server
- - **SafeGenerator**: Risk-aware answer generation
- - **PromptTemplates**: Prompt template management
-
- ### 4. Risk Calibration (`calibration/`)
- - **RiskFeatureExtractor**: Extracts 16-dimensional risk features
- - **CalibrationHead**: LogReg/MLP calibration head
- - **Trainer**: Calibration-head training
-
- ### 5. Evaluation System (`eval/`)
- - **QAEvaluator**: EM/F1 evaluation
- - **AttributionEvaluator**: Citation attribution evaluation
- - **CalibrationEvaluator**: Calibration quality evaluation
- - **SystemEvaluator**: System performance evaluation
-
- ## 🎯 Risk Calibration Strategy
-
- ### Risk Features (16-dimensional)
- 1. **Retrieval statistics**: similarity scores, variance, diversity
- 2. **Coverage features**: token/entity overlap between question and answer
- 3. **Consistency features**: semantic similarity between passages
- 4. **Diversity features**: topic variance, passage diversity
-
- ### Adaptive Strategy
- - **Low risk (r < 0.3)**: normal generation
- - **Medium risk (0.3 ≤ r < 0.7)**: conservative generation + mandatory citations
- - **High risk (r ≥ 0.7)**: very conservative generation, or refuse to answer
-
- ## 📊 Performance Targets
-
- - **QA accuracy**: EM/F1 gains over vanilla RAG
- - **Attribution quality**: +8-12pt citation precision/recall
- - **Calibration quality**: 30-40% ECE reduction
- - **System throughput**: 2-3.5x speedup from vLLM
-
- ## 🧪 Testing and Validation
-
- ### End-to-End Tests (`simple_e2e_test.py`)
- - ✅ 8/8 tests passing
- - ✅ Full RAG pipeline verified
- - ✅ All core features working
-
- ### Basic Tests (`simple_test.py`)
- - ✅ Module import tests
- - ✅ Basic functionality checks
- - ✅ Configuration checks
-
- ## 🚀 Deploying to Hugging Face Spaces
-
- ### 1. Upload Files
- - Upload the entire `safe_rag` directory to HF Spaces
- - Make sure `app.py` is in the root directory
-
- ### 2. Configure the Space
- - SDK: Gradio
- - Hardware: GPU (A10G or A100 recommended)
- - Environment: Python 3.8+
-
- ### 3. Automatic Deployment
- - HF Spaces installs dependencies automatically
- - `app.py` is started automatically
- - A public access link is provided
-
- ## 📝 Usage
-
- ### Run Locally
- ```bash
- # Install dependencies
- pip install -r requirements.txt
-
- # Run tests
- python3 simple_e2e_test.py
-
- # Start the demo
- python3 app.py
- ```
-
- ### Online Demo
- Visit the Hugging Face Spaces link to try the interactive RAG system.
-
- ## 🎉 Project Status
-
- ✅ **Done**: all core modules implemented
- ✅ **Tested**: end-to-end tests passing
- ✅ **Simplified**: unnecessary files removed
- ✅ **Ready**: deployable to HF Spaces
-
- SafeRAG is ready to deploy and use!
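The adaptive strategy documented in the deleted file maps a calibrated risk score r to a generation mode. A minimal sketch of that mapping (function and mode names are hypothetical; the 0.3/0.7 thresholds are the tau1/tau2 values from the project files):

```python
# Hypothetical sketch of the adaptive strategy described in PROJECT_INFO.md:
# map a calibrated risk score r in [0, 1] to a generation mode.
TAU1, TAU2 = 0.3, 0.7  # thresholds documented in the deleted config

def select_strategy(risk_score: float) -> str:
    if risk_score < TAU1:
        return "normal"        # low risk: answer directly
    if risk_score < TAU2:
        return "conservative"  # medium risk: hedge and force citations
    return "refuse"            # high risk: very conservative or abstain

print(select_strategy(0.1))  # normal
```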
README.md CHANGED
@@ -1,108 +1,16 @@
  ---
- title: SafeRAG Demo
- emoji: 🤖
- colorFrom: blue
  colorTo: purple
  sdk: gradio
- sdk_version: 4.0.0
  app_file: app.py
  pinned: false
- license: apache-2.0
  ---
 
- # SafeRAG: High-Performance Calibrated RAG
-
- A production-ready Retrieval-Augmented Generation (RAG) system with risk calibration, built on the Hugging Face ecosystem.
-
- ## 🚀 Key Features
-
- - **Risk Calibration**: Multi-layer risk assessment with adaptive strategies
- - **High Performance**: Optimized for 2-3.5x throughput improvement
- - **Hugging Face Native**: Built on HF Datasets, Models, and Spaces
- - **Production Ready**: Complete pipeline with error handling and monitoring
-
- ## 🏗️ Architecture
-
- ```
- HF Datasets → Embedding (BGE/E5) → FAISS Index
- Query → Batched Retrieval → Evidence Selector → Generator (vLLM + gpt-oss-20b)
- → Risk Calibration → Adaptive Strategy → Output (Answer + Citations + Risk Score)
- ```
-
- ## 📊 Performance Targets
-
- - **QA Accuracy**: EM/F1 improvements over vanilla RAG
- - **Attribution**: +8-12pt improvement in citation precision/recall
- - **Calibration**: 30-40% reduction in ECE (Expected Calibration Error)
- - **Throughput**: 2-3.5x improvement with vLLM
-
- ## 🛠️ Quick Start
-
- ### Run Tests
- ```bash
- python3 simple_e2e_test.py
- ```
-
- ### Start Demo
- ```bash
- python3 app.py
- ```
-
- ## 📈 Evaluation
-
- The system has been tested with comprehensive end-to-end tests:
-
- - ✅ Text processing and sentence extraction
- - ✅ Embedding creation and similarity calculation
- - ✅ Passage retrieval and reranking
- - ✅ Risk feature extraction and prediction
- - ✅ Risk-aware answer generation
- - ✅ Evaluation metrics (EM, F1, ROUGE)
- - ✅ Complete end-to-end RAG pipeline
-
- ## 🔧 Configuration
-
- Key parameters in `config.yaml`:
-
- - **Risk Thresholds**: τ₁ = 0.3, τ₂ = 0.7
- - **Retrieval**: k = 20, rerank_k = 10
- - **Generation**: max_tokens = 512, temperature = 0.7
- - **Calibration**: 16 features, logistic regression
-
- ## 🎯 Risk Calibration
-
- ### Risk Features (16-dimensional)
- 1. **Retrieval Statistics**: Similarity scores, variance, diversity
- 2. **Coverage Features**: Token/entity overlap between Q&A
- 3. **Consistency Features**: Semantic similarity between passages
- 4. **Diversity Features**: Topic variance, passage diversity
-
- ### Adaptive Strategies
- - **Low Risk (r < τ₁)**: Normal generation
- - **Medium Risk (τ₁ ≤ r < τ₂)**: Conservative generation + citations
- - **High Risk (r ≥ τ₂)**: Very conservative or refuse
-
- ## 📚 Datasets
-
- - **HotpotQA**: Multi-hop reasoning with supporting facts
- - **TriviaQA**: Open-domain QA for general knowledge
- - **Wikipedia**: Knowledge base via HF Datasets
-
- ## 📄 Citation
-
- ```bibtex
- @article{safrag2024,
-   title={SafeRAG: High-Performance Calibrated RAG with Risk Assessment},
-   author={Your Name},
-   journal={arXiv preprint},
-   year={2024}
- }
- ```
-
- ## 📝 License
-
- Apache 2.0 License - see LICENSE file for details.
-
- ---
-
- **SafeRAG**: A production-ready RAG system with risk calibration, built on Hugging Face ecosystem.

  ---
+ title: Safe Rag
+ emoji: 💬
+ colorFrom: yellow
  colorTo: purple
  sdk: gradio
+ sdk_version: 5.42.0
  app_file: app.py
  pinned: false
+ hf_oauth: true
+ hf_oauth_scopes:
+   - inference-api
+ short_description: A High-Performance and Risk-Calibrated RAG system
  ---
 
+ An example chatbot using [Gradio](https://gradio.app), [`huggingface_hub`](https://huggingface.co/docs/huggingface_hub/v0.22.2/en/index), and the [Hugging Face Inference API](https://huggingface.co/docs/api-inference/index).
calibration/__init__.py DELETED
@@ -1,5 +0,0 @@
- from .features import RiskFeatureExtractor
- from .calibration_head import CalibrationHead
- from .trainer import CalibrationTrainer
-
- __all__ = ['RiskFeatureExtractor', 'CalibrationHead', 'CalibrationTrainer']
calibration/calibration_head.py DELETED
@@ -1,210 +0,0 @@
- import torch
- import torch.nn as nn
- import numpy as np
- from sklearn.linear_model import LogisticRegression
- from sklearn.ensemble import RandomForestClassifier
- from sklearn.metrics import accuracy_score, roc_auc_score
- from typing import Dict, Any, List, Tuple
- import logging
- import joblib
- import os
-
- logger = logging.getLogger(__name__)
-
- class CalibrationHead:
-     def __init__(self, model_type: str = "logistic", input_dim: int = 16):
-         self.model_type = model_type
-         self.input_dim = input_dim
-         self.model = None
-         self.is_trained = False
-
-     def _create_model(self):
-         """Create the calibration model"""
-         if self.model_type == "logistic":
-             self.model = LogisticRegression(
-                 random_state=42,
-                 max_iter=1000,
-                 class_weight='balanced'
-             )
-         elif self.model_type == "random_forest":
-             self.model = RandomForestClassifier(
-                 n_estimators=100,
-                 random_state=42,
-                 class_weight='balanced'
-             )
-         elif self.model_type == "mlp":
-             self.model = MLPCalibrationHead(self.input_dim)
-         else:
-             raise ValueError(f"Unknown model type: {self.model_type}")
-
-     def train(self, X: np.ndarray, y: np.ndarray) -> Dict[str, float]:
-         """Train the calibration model"""
-         if self.model is None:
-             self._create_model()
-
-         if self.model_type in ["logistic", "random_forest"]:
-             # Sklearn models
-             self.model.fit(X, y)
-
-             # Get predictions and metrics
-             y_pred = self.model.predict(X)
-             y_proba = self.model.predict_proba(X)[:, 1] if hasattr(self.model, 'predict_proba') else y_pred
-
-             metrics = {
-                 'accuracy': accuracy_score(y, y_pred),
-                 'auc': roc_auc_score(y, y_proba) if len(np.unique(y)) > 1 else 0.0
-             }
-         else:
-             # PyTorch models
-             metrics = self._train_pytorch_model(X, y)
-
-         self.is_trained = True
-         logger.info(f"Trained {self.model_type} model with metrics: {metrics}")
-         return metrics
-
-     def predict_risk(self, features: Dict[str, Any]) -> float:
-         """Predict risk score from features"""
-         if not self.is_trained:
-             logger.warning("Model not trained, returning default risk score")
-             return 0.5
-
-         # Convert features to array
-         X = self._features_to_array(features)
-
-         if self.model_type in ["logistic", "random_forest"]:
-             if hasattr(self.model, 'predict_proba'):
-                 risk_score = self.model.predict_proba(X.reshape(1, -1))[0, 1]
-             else:
-                 risk_score = float(self.model.predict(X.reshape(1, -1))[0])
-         else:
-             # PyTorch models
-             with torch.no_grad():
-                 X_tensor = torch.FloatTensor(X.reshape(1, -1))
-                 risk_score = torch.sigmoid(self.model(X_tensor)).item()
-
-         return float(risk_score)
-
-     def predict_batch(self, features_list: List[Dict[str, Any]]) -> List[float]:
-         """Predict risk scores for multiple feature sets"""
-         if not features_list:
-             return []
-
-         # Convert all features to arrays
-         X = np.array([self._features_to_array(f) for f in features_list])
-
-         if self.model_type in ["logistic", "random_forest"]:
-             if hasattr(self.model, 'predict_proba'):
-                 risk_scores = self.model.predict_proba(X)[:, 1]
-             else:
-                 risk_scores = self.model.predict(X)
-         else:
-             # PyTorch models
-             with torch.no_grad():
-                 X_tensor = torch.FloatTensor(X)
-                 risk_scores = torch.sigmoid(self.model(X_tensor)).numpy()
-
-         return risk_scores.tolist()
-
-     def _features_to_array(self, features: Dict[str, Any]) -> np.ndarray:
-         """Convert features dictionary to numpy array"""
-         # Define feature order (must match training)
-         feature_order = [
-             'num_passages', 'avg_similarity', 'std_similarity', 'max_similarity',
-             'min_similarity', 'score_variance', 'avg_token_overlap', 'max_token_overlap',
-             'avg_entity_overlap', 'max_entity_overlap', 'passage_consistency',
-             'passage_consistency_std', 'min_passage_similarity', 'diversity',
-             'topic_variance'
-         ]
-
-         # Extract features in order
-         feature_array = []
-         for feature_name in feature_order:
-             value = features.get(feature_name, 0.0)
-             feature_array.append(float(value))
-
-         return np.array(feature_array)
-
-     def _train_pytorch_model(self, X: np.ndarray, y: np.ndarray) -> Dict[str, float]:
-         """Train PyTorch model"""
-         # Convert to tensors
-         X_tensor = torch.FloatTensor(X)
-         y_tensor = torch.FloatTensor(y)
-
-         # Training setup
-         optimizer = torch.optim.Adam(self.model.parameters(), lr=0.001)
-         criterion = nn.BCEWithLogitsLoss()
-
-         # Training loop
-         self.model.train()
-         for epoch in range(100):
-             optimizer.zero_grad()
-             outputs = self.model(X_tensor)
-             loss = criterion(outputs.squeeze(), y_tensor)
-             loss.backward()
-             optimizer.step()
-
-         # Evaluation
-         self.model.eval()
-         with torch.no_grad():
-             outputs = self.model(X_tensor)
-             predictions = torch.sigmoid(outputs).squeeze().numpy()
-             binary_preds = (predictions > 0.5).astype(int)
-
-         metrics = {
-             'accuracy': accuracy_score(y, binary_preds),
-             'auc': roc_auc_score(y, predictions) if len(np.unique(y)) > 1 else 0.0
-         }
-
-         return metrics
-
-     def save(self, path: str) -> None:
-         """Save the trained model"""
-         os.makedirs(os.path.dirname(path), exist_ok=True)
-
-         if self.model_type in ["logistic", "random_forest"]:
-             joblib.dump(self.model, f"{path}.joblib")
-         else:
-             torch.save(self.model.state_dict(), f"{path}.pth")
-
-         # Save metadata
-         metadata = {
-             'model_type': self.model_type,
-             'input_dim': self.input_dim,
-             'is_trained': self.is_trained
-         }
-         joblib.dump(metadata, f"{path}_metadata.joblib")
-
-         logger.info(f"Saved model to {path}")
-
-     def load(self, path: str) -> None:
-         """Load a trained model"""
-         # Load metadata
-         metadata = joblib.load(f"{path}_metadata.joblib")
-         self.model_type = metadata['model_type']
-         self.input_dim = metadata['input_dim']
-         self.is_trained = metadata['is_trained']
-
-         # Load model
-         if self.model_type in ["logistic", "random_forest"]:
-             self.model = joblib.load(f"{path}.joblib")
-         else:
-             self.model = MLPCalibrationHead(self.input_dim)
-             self.model.load_state_dict(torch.load(f"{path}.pth"))
-
-         logger.info(f"Loaded model from {path}")
-
- class MLPCalibrationHead(nn.Module):
-     def __init__(self, input_dim: int, hidden_dim: int = 64):
-         super().__init__()
-         self.layers = nn.Sequential(
-             nn.Linear(input_dim, hidden_dim),
-             nn.ReLU(),
-             nn.Dropout(0.2),
-             nn.Linear(hidden_dim, hidden_dim // 2),
-             nn.ReLU(),
-             nn.Dropout(0.2),
-             nn.Linear(hidden_dim // 2, 1)
-         )
-
-     def forward(self, x):
-         return self.layers(x)
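One detail worth flagging in the deleted head: `CalibrationHead` defaults to `input_dim=16` (and the project docs speak of 16-dimensional risk features), yet `_features_to_array` enumerates only 15 feature names. A quick count reproduces the mismatch:

```python
# Feature names copied verbatim from the deleted _features_to_array.
FEATURE_ORDER = [
    'num_passages', 'avg_similarity', 'std_similarity', 'max_similarity',
    'min_similarity', 'score_variance', 'avg_token_overlap', 'max_token_overlap',
    'avg_entity_overlap', 'max_entity_overlap', 'passage_consistency',
    'passage_consistency_std', 'min_passage_similarity', 'diversity',
    'topic_variance',
]
print(len(FEATURE_ORDER))  # 15, one short of the declared input_dim of 16
```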
 
calibration/features.py DELETED
@@ -1,173 +0,0 @@
- from typing import List, Dict, Any
- import numpy as np
- from sentence_transformers import SentenceTransformer
- import logging
- from sklearn.feature_extraction.text import TfidfVectorizer
- from sklearn.metrics.pairwise import cosine_similarity
- import re
-
- logger = logging.getLogger(__name__)
-
- class RiskFeatureExtractor:
-     def __init__(self, embedding_model: str = "BAAI/bge-large-en-v1.5"):
-         self.embedding_model = SentenceTransformer(embedding_model)
-         self.tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
-
-     def extract_features(self, question: str, retrieved_passages: List[Dict[str, Any]]) -> Dict[str, Any]:
-         """Extract risk assessment features"""
-         if not retrieved_passages:
-             return self._get_empty_features()
-
-         features = {}
-
-         # Retrieval statistics
-         features.update(self._extract_retrieval_stats(retrieved_passages))
-
-         # Coverage features
-         features.update(self._extract_coverage_features(question, retrieved_passages))
-
-         # Consistency features
-         features.update(self._extract_consistency_features(question, retrieved_passages))
-
-         # Diversity features
-         features.update(self._extract_diversity_features(retrieved_passages))
-
-         return features
-
-     def _extract_retrieval_stats(self, passages: List[Dict[str, Any]]) -> Dict[str, Any]:
-         """Extract retrieval statistics"""
-         if not passages:
-             return {}
-
-         scores = [p.get('score', 0.0) for p in passages]
-
-         return {
-             'num_passages': len(passages),
-             'avg_similarity': np.mean(scores),
-             'std_similarity': np.std(scores),
-             'max_similarity': np.max(scores),
-             'min_similarity': np.min(scores),
-             'score_variance': np.var(scores)
-         }
-
-     def _extract_coverage_features(self, question: str, passages: List[Dict[str, Any]]) -> Dict[str, Any]:
-         """Extract coverage features between question and passages"""
-         if not passages:
-             return {}
-
-         # Token overlap
-         question_tokens = set(question.lower().split())
-         passage_texts = [p.get('text', '') for p in passages]
-
-         overlaps = []
-         for passage_text in passage_texts:
-             passage_tokens = set(passage_text.lower().split())
-             overlap = len(question_tokens.intersection(passage_tokens))
-             overlaps.append(overlap / len(question_tokens) if question_tokens else 0)
-
-         # Entity overlap (simplified)
-         question_entities = self._extract_entities(question)
-         entity_overlaps = []
-
-         for passage_text in passage_texts:
-             passage_entities = self._extract_entities(passage_text)
-             overlap = len(question_entities.intersection(passage_entities))
-             entity_overlaps.append(overlap / len(question_entities) if question_entities else 0)
-
-         return {
-             'avg_token_overlap': np.mean(overlaps),
-             'max_token_overlap': np.max(overlaps),
-             'avg_entity_overlap': np.mean(entity_overlaps),
-             'max_entity_overlap': np.max(entity_overlaps)
-         }
-
-     def _extract_consistency_features(self, question: str, passages: List[Dict[str, Any]]) -> Dict[str, Any]:
-         """Extract consistency features between passages"""
-         if len(passages) < 2:
-             return {'passage_consistency': 1.0}
-
-         # Semantic similarity between passages
-         passage_texts = [p.get('text', '') for p in passages]
-         embeddings = self.embedding_model.encode(passage_texts)
-
-         # Compute pairwise similarities
-         similarities = cosine_similarity(embeddings)
-
-         # Get upper triangle (excluding diagonal)
-         upper_triangle = similarities[np.triu_indices_from(similarities, k=1)]
-
-         return {
-             'passage_consistency': np.mean(upper_triangle),
-             'passage_consistency_std': np.std(upper_triangle),
-             'min_passage_similarity': np.min(upper_triangle)
-         }
-
-     def _extract_diversity_features(self, passages: List[Dict[str, Any]]) -> Dict[str, Any]:
-         """Extract diversity features"""
-         if len(passages) < 2:
-             return {'diversity': 1.0}
-
-         # Topic diversity using TF-IDF
-         passage_texts = [p.get('text', '') for p in passages]
-
-         try:
-             tfidf_matrix = self.tfidf_vectorizer.fit_transform(passage_texts)
-             similarities = cosine_similarity(tfidf_matrix)
-
-             # Diversity as 1 - average similarity
-             upper_triangle = similarities[np.triu_indices_from(similarities, k=1)]
-             diversity = 1.0 - np.mean(upper_triangle)
-
-             return {
-                 'diversity': diversity,
-                 'topic_variance': np.var(upper_triangle)
-             }
-         except:
-             return {'diversity': 0.5, 'topic_variance': 0.0}
-
-     def _extract_entities(self, text: str) -> set:
-         """Extract entities from text (simplified)"""
-         # Simple entity extraction - in practice use NER
-         # Look for capitalized words and common entity patterns
-         entities = set()
-
-         # Capitalized words (potential entities)
-         capitalized = re.findall(r'\b[A-Z][a-z]+\b', text)
-         entities.update(capitalized)
-
-         # Numbers and dates
-         numbers = re.findall(r'\b\d+\b', text)
-         entities.update(numbers)
-
-         return entities
-
-     def _get_empty_features(self) -> Dict[str, Any]:
-         """Return empty features when no passages available"""
-         return {
-             'num_passages': 0,
-             'avg_similarity': 0.0,
-             'std_similarity': 0.0,
-             'max_similarity': 0.0,
-             'min_similarity': 0.0,
-             'score_variance': 0.0,
-             'avg_token_overlap': 0.0,
-             'max_token_overlap': 0.0,
-             'avg_entity_overlap': 0.0,
-             'max_entity_overlap': 0.0,
-             'passage_consistency': 0.0,
-             'passage_consistency_std': 0.0,
-             'min_passage_similarity': 0.0,
-             'diversity': 0.0,
-             'topic_variance': 0.0
-         }
-
-     def extract_batch_features(self, questions: List[str],
-                                passages_list: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
-         """Extract features for multiple question-passage pairs"""
-         features_list = []
-
-         for question, passages in zip(questions, passages_list):
-             features = self.extract_features(question, passages)
-             features_list.append(features)
-
-         return features_list
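The coverage features in the deleted extractor reduce to a token-overlap fraction between the question and each passage. A self-contained restatement in pure Python (mirroring `_extract_coverage_features`; the helper name is ours):

```python
def token_overlap(question: str, passage: str) -> float:
    # Fraction of the question's tokens that also occur in the passage,
    # as computed by the deleted _extract_coverage_features.
    q_tokens = set(question.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0

print(round(token_overlap("Who wrote Hamlet", "Hamlet was written by Shakespeare"), 3))  # 0.333
```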
 
calibration/trainer.py DELETED
@@ -1,171 +0,0 @@
- from typing import List, Dict, Any, Tuple
- import numpy as np
- from sklearn.model_selection import train_test_split
- from sklearn.metrics import accuracy_score, precision_recall_fscore_support, roc_auc_score
- import logging
- from .features import RiskFeatureExtractor
- from .calibration_head import CalibrationHead
-
- logger = logging.getLogger(__name__)
-
- class CalibrationTrainer:
-     def __init__(self, feature_extractor: RiskFeatureExtractor,
-                  calibration_head: CalibrationHead):
-         self.feature_extractor = feature_extractor
-         self.calibration_head = calibration_head
-
-     def prepare_training_data(self, qa_data: List[Dict[str, Any]],
-                               retrieved_passages_list: List[List[Dict[str, Any]]],
-                               labels: List[int]) -> Tuple[np.ndarray, np.ndarray]:
-         """Prepare training data from QA samples and retrieved passages"""
-
-         # Extract features
-         features_list = self.feature_extractor.extract_batch_features(
-             [item['question'] for item in qa_data],
-             retrieved_passages_list
-         )
-
-         # Convert features to arrays
-         X = np.array([self.feature_extractor._features_to_array(f) for f in features_list])
-         y = np.array(labels)
-
-         logger.info(f"Prepared training data: {X.shape[0]} samples, {X.shape[1]} features")
-         return X, y
-
-     def train(self, X: np.ndarray, y: np.ndarray,
-               test_size: float = 0.2, random_state: int = 42) -> Dict[str, Any]:
-         """Train the calibration model"""
-
-         # Split data
-         X_train, X_test, y_train, y_test = train_test_split(
-             X, y, test_size=test_size, random_state=random_state, stratify=y
-         )
-
-         # Train model
-         train_metrics = self.calibration_head.train(X_train, y_train)
-
-         # Evaluate on test set
-         test_metrics = self.evaluate(X_test, y_test)
-
-         # Combine metrics
-         all_metrics = {
-             'train': train_metrics,
-             'test': test_metrics,
-             'train_size': len(X_train),
-             'test_size': len(X_test)
-         }
-
-         logger.info(f"Training completed. Test metrics: {test_metrics}")
-         return all_metrics
-
-     def evaluate(self, X: np.ndarray, y: np.ndarray) -> Dict[str, float]:
-         """Evaluate the calibration model"""
-         if not self.calibration_head.is_trained:
-             raise ValueError("Model not trained yet")
-
-         # Get predictions
-         if hasattr(self.calibration_head.model, 'predict_proba'):
-             y_proba = self.calibration_head.model.predict_proba(X)[:, 1]
-             y_pred = (y_proba > 0.5).astype(int)
-         else:
-             y_pred = self.calibration_head.model.predict(X)
-             y_proba = y_pred
-
-         # Calculate metrics
-         accuracy = accuracy_score(y, y_pred)
-         precision, recall, f1, _ = precision_recall_fscore_support(y, y_pred, average='binary')
-
-         try:
-             auc = roc_auc_score(y, y_proba)
-         except:
-             auc = 0.0
-
-         return {
-             'accuracy': accuracy,
-             'precision': precision,
-             'recall': recall,
-             'f1': f1,
-             'auc': auc
-         }
-
-     def create_synthetic_labels(self, qa_data: List[Dict[str, Any]],
-                                 retrieved_passages_list: List[List[Dict[str, Any]]]) -> List[int]:
-         """Create synthetic risk labels for training (placeholder implementation)"""
-         labels = []
-
-         for qa_item, passages in zip(qa_data, retrieved_passages_list):
-             # Simple heuristic for risk labeling
-             # In practice, this would be based on human annotations or automated evaluation
-
-             question = qa_item['question']
-             answer = qa_item['answer']
-
-             # Risk factors
-             risk_score = 0.0
-
-             # Low similarity scores = high risk
-             if passages:
-                 avg_similarity = np.mean([p.get('score', 0.0) for p in passages])
-                 if avg_similarity < 0.3:
-                     risk_score += 0.3
-
-             # Few passages = high risk
-             if len(passages) < 3:
-                 risk_score += 0.2
-
-             # Question complexity (length, question words)
-             if len(question.split()) > 20:
-                 risk_score += 0.1
-
-             if any(word in question.lower() for word in ['why', 'how', 'explain', 'compare']):
-                 risk_score += 0.1
-
-             # Answer length (very short or very long answers might be risky)
-             if len(answer.split()) < 5 or len(answer.split()) > 100:
-                 risk_score += 0.1
-
-             # Convert to binary label
-             label = 1 if risk_score > 0.3 else 0
-             labels.append(label)
-
-         logger.info(f"Created {sum(labels)} high-risk labels out of {len(labels)} total")
-         return labels
-
-     def cross_validate(self, X: np.ndarray, y: np.ndarray,
-                        cv_folds: int = 5) -> Dict[str, List[float]]:
-         """Perform cross-validation"""
-         from sklearn.model_selection import StratifiedKFold
-
-         skf = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=42)
-
-         fold_metrics = {
-             'accuracy': [],
-             'precision': [],
-             'recall': [],
-             'f1': [],
-             'auc': []
-         }
-
-         for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
-             logger.info(f"Training fold {fold + 1}/{cv_folds}")
-
-             X_train, X_val = X[train_idx], X[val_idx]
-             y_train, y_val = y[train_idx], y[val_idx]
-
-             # Train on fold
-             self.calibration_head.train(X_train, y_train)
-
-             # Evaluate on validation set
-             val_metrics = self.evaluate(X_val, y_val)
-
-             for metric, value in val_metrics.items():
-                 fold_metrics[metric].append(value)
-
-         # Calculate mean and std
-         cv_results = {}
-         for metric, values in fold_metrics.items():
-             cv_results[f'{metric}_mean'] = np.mean(values)
-             cv_results[f'{metric}_std'] = np.std(values)
-
-         logger.info(f"Cross-validation results: {cv_results}")
-         return cv_results
 
config.yaml DELETED
@@ -1,189 +0,0 @@
- # SafeRAG Configuration File
-
- # Model Configuration
- models:
-   embedding:
-     name: "BAAI/bge-large-en-v1.5"
-     device: "cuda"
-     batch_size: 32
-
-   reranker:
-     name: "cross-encoder/ms-marco-MiniLM-L-6-v2"
-     device: "cuda"
-     batch_size: 32
-
-   generator:
-     name: "openai/gpt-oss-20b"
-     tensor_parallel_size: 1
-     gpu_memory_utilization: 0.9
-     max_tokens: 512
-     temperature: 0.7
-     top_p: 0.9
-
-   calibration:
-     type: "logistic"  # logistic, random_forest, mlp
-     input_dim: 16
-     hidden_dim: 64
-
- # Data Configuration
- data:
-   datasets:
-     - "hotpotqa"
-     - "triviaqa"
-     - "nq_open"
-
-   knowledge_base:
-     name: "wikipedia"
-     language: "en"
-     date: "20231101"
-
-   preprocessing:
-     max_sentence_length: 512
-     min_sentence_length: 20
-     cache_dir: "./cache"
-
- # Index Configuration
- index:
-   type: "ivf"  # flat, ivf
-   dimension: 1024
-   nlist: 4096
-   save_path: "./index/safrag"
-
- # Retrieval Configuration
- retrieval:
-   k: 20
-   rerank_k: 10
-   batch_size: 32
-   similarity_threshold: 0.3
-
- # Risk Calibration Configuration
- calibration:
-   tau1: 0.3  # Low risk threshold
-   tau2: 0.7  # High risk threshold
-
-   features:
-     - "num_passages"
-     - "avg_similarity"
-     - "std_similarity"
-     - "max_similarity"
-     - "min_similarity"
-     - "score_variance"
-     - "avg_token_overlap"
-     - "max_token_overlap"
-     - "avg_entity_overlap"
-     - "max_entity_overlap"
-     - "passage_consistency"
-     - "passage_consistency_std"
-     - "min_passage_similarity"
-     - "diversity"
-     - "topic_variance"
-
- # Evaluation Configuration
- evaluation:
-   metrics:
-     qa:
-       - "exact_match"
-       - "f1"
-       - "rouge1"
-       - "rouge2"
-       - "rougeL"
-
-     attribution:
-       - "precision"
-       - "recall"
-       - "f1"
-       - "citation_coverage"
-       - "citation_accuracy"
-
-     calibration:
-       - "ece"
-       - "mce"
-       - "auroc"
-       - "auprc"
-
-     system:
-       - "throughput"
-       - "latency"
-       - "gpu_utilization"
-       - "memory_usage"
-
-   test_size: 0.2
-   random_state: 42
-   cv_folds: 5
-
- # System Configuration
- system:
-   device: "cuda"
-   num_workers: 4
-   batch_size: 32
-   max_memory_gb: 16
-
-   monitoring:
-     enabled: true
-     interval: 1  # seconds
-     metrics:
-       - "cpu"
-       - "memory"
-       - "gpu"
-       - "disk"
-
- # Output Configuration
- output:
-   results_dir: "./results"
-   logs_dir: "./logs"
-   models_dir: "./models"
-   plots_dir: "./plots"
-
-   formats:
-     - "json"
-     - "csv"
-     - "html"
-
-   save_predictions: true
-   save_features: true
-   save_plots: true
-
- # Logging Configuration
- logging:
-   level: "INFO"
-   format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
-   file: "./logs/safrag.log"
-   max_size: "10MB"
-   backup_count: 5
-
- # Hugging Face Configuration
- huggingface:
-   cache_dir: "./cache"
-   token: null  # Set your HF token here
-   hub_url: "https://huggingface.co"
-
-   spaces:
-     app_name: "safrag-demo"
-     hardware: "cpu"  # cpu, gpu, cpu-basic, gpu-basic
-     visibility: "public"
-
- # Experiment Configuration
- experiments:
-   baseline:
-     enabled: true
-     output_dir: "./results/baseline"
-
-   safrag:
-     enabled: true
-     output_dir: "./results/safrag"
-
-   ablation:
-     enabled: true
-     output_dir: "./results/ablation"
-
-     studies:
-       - "no_reranking"
-       - "no_calibration"
-       - "different_embeddings"
-       - "different_thresholds"
-       - "different_calibration_models"
-       - "different_retrieval_k"
-
-   comprehensive:
-     enabled: true
-     output_dir: "./results/comprehensive"
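The `tau1`/`tau2` thresholds in the calibration section split a calibrated risk score into three bands. A minimal illustration of that routing logic (the tier names and actions in the comments are assumptions for illustration, not taken from the config):

```python
def risk_tier(risk: float, tau1: float = 0.3, tau2: float = 0.7) -> str:
    """Map a calibrated risk score in [0, 1] to a tier using the config thresholds."""
    if risk <= tau1:
        return "low"     # e.g. answer directly
    if risk <= tau2:
        return "medium"  # e.g. answer with citations / hedging
    return "high"        # e.g. abstain or escalate

print(risk_tier(0.1), risk_tier(0.5), risk_tier(0.9))  # low medium high
```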
 
data_processing/__init__.py DELETED
@@ -1,4 +0,0 @@
- from .data_loader import DataLoader
- from .preprocessor import Preprocessor
-
- __all__ = ['DataLoader', 'Preprocessor']
 
data_processing/data_loader.py DELETED
@@ -1,29 +0,0 @@
- import logging
- from datasets import load_dataset
-
- logger = logging.getLogger(__name__)
-
- class DataLoader:
-     def __init__(self, cache_dir: str = "./cache"):
-         self.cache_dir = cache_dir
-
-     def load_msmarco_passage(self, split: str = "train"):
-         """Load the MS MARCO Passage Ranking dataset (v2.1) from Hugging Face"""
-         try:
-             logger.info(f"Downloading MS MARCO Passage Ranking {split} (v2.1) from Hugging Face")
-             ds = load_dataset("ms_marco", "v2.1", split=split, cache_dir=self.cache_dir)
-             return ds
-         except Exception as e:
-             logger.error(f"Failed to load MS MARCO Passage Ranking: {e}")
-             raise
-
-     def get_passage_dataset(self, split: str = "train"):
-         """Convenience wrapper around load_msmarco_passage"""
-         ds = self.load_msmarco_passage(split)
-         logger.info("MS MARCO Passage Ranking loaded successfully")
-         return ds
 
data_processing/preprocessor.py DELETED
@@ -1,112 +0,0 @@
- from typing import List, Dict, Any
- import re
- import logging
-
- logger = logging.getLogger(__name__)
-
- class Preprocessor:
-     def __init__(self):
-         """Initialize preprocessor without external dependencies"""
-         pass
-
-     def clean_text(self, text: str) -> str:
-         """Clean and normalize text"""
-         if not text:
-             return ""
-
-         # Collapse extra whitespace
-         text = text.strip()
-         text = re.sub(r'\s+', ' ', text)
-
-         # Remove special characters but keep basic punctuation
-         text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)]', '', text)
-
-         return text.strip()
-
-     def extract_sentences(self, text: str) -> List[str]:
-         """Extract sentences from text (simplified version without NLTK)"""
-         if not text:
-             return []
-
-         # Simple sentence splitting on terminal punctuation
-         sentences = re.split(r'[.!?]+', text)
-         sentences = [s.strip() for s in sentences if s.strip()]
-
-         return sentences
-
-     def tokenize(self, text: str) -> List[str]:
-         """Tokenize text into lowercase words (simplified version)"""
-         if not text:
-             return []
-
-         # Simple word tokenization
-         return re.findall(r'\b\w+\b', text.lower())
-
-     def preprocess_passages(self, passages: List[str]) -> List[Dict[str, Any]]:
-         """Preprocess a list of passages"""
-         processed = []
-
-         for i, passage in enumerate(passages):
-             if not passage:
-                 continue
-
-             cleaned = self.clean_text(passage)
-             sentences = self.extract_sentences(cleaned)
-             tokens = self.tokenize(cleaned)
-
-             processed.append({
-                 'id': i,
-                 'text': cleaned,
-                 'sentences': sentences,
-                 'tokens': tokens,
-                 'length': len(tokens)
-             })
-
-         return processed
-
-     def preprocess_qa_data(self, data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
-         """Preprocess QA data, automatically converting dict/list fields to strings"""
-         processed = []
-
-         def to_str(val):
-             if isinstance(val, dict):
-                 # Concatenate all values
-                 return " ".join(to_str(v) for v in val.values())
-             elif isinstance(val, list):
-                 return " ".join(to_str(v) for v in val)
-             elif val is None:
-                 return ""
-             return str(val)
-
-         for item in data:
-             if not isinstance(item, dict):
-                 continue
-             question = to_str(item.get('question', ''))
-             answer = to_str(item.get('answer', ''))
-             context = to_str(item.get('context', ''))
-
-             processed.append({
-                 'question': self.clean_text(question),
-                 'answer': self.clean_text(answer),
-                 'context': self.clean_text(context),
-                 'question_tokens': self.tokenize(question),
-                 'answer_tokens': self.tokenize(answer),
-                 'context_tokens': self.tokenize(context)
-             })
-         return processed
-
-     def create_chunks(self, text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
-         """Create overlapping text chunks"""
-         if not text:
-             return []
-
-         tokens = self.tokenize(text)
-         chunks = []
-
-         # Step by chunk_size - overlap so consecutive chunks share `overlap` tokens
-         for i in range(0, len(tokens), chunk_size - overlap):
-             chunks.append(' '.join(tokens[i:i + chunk_size]))
-
-         return chunks
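The window in `create_chunks` advances by `chunk_size - overlap` tokens, so each chunk shares its last `overlap` tokens with the next one. A standalone sketch of the same stride arithmetic (a re-implementation for illustration, not an import of the class above, with small hypothetical sizes):

```python
def chunk_tokens(tokens, chunk_size=8, overlap=2):
    """Slide a window of chunk_size tokens, stepping chunk_size - overlap each time."""
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

toks = [str(i) for i in range(12)]
chunks = chunk_tokens(toks)
print(len(chunks))    # 2
print(chunks[1][:2])  # ['6', '7'] -- the tail of chunk 0 repeated
```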
 
 
eval/__init__.py DELETED
@@ -1,6 +0,0 @@
- from .eval_qa import QAEvaluator
- from .eval_attr import AttributionEvaluator
- from .eval_calib import CalibrationEvaluator
- from .eval_system import SystemEvaluator
-
- __all__ = ['QAEvaluator', 'AttributionEvaluator', 'CalibrationEvaluator', 'SystemEvaluator']
 
 
eval/eval_attr.py DELETED
@@ -1,275 +0,0 @@
- from typing import List, Dict, Any
- import re
- import numpy as np
- from sentence_transformers import SentenceTransformer
- from sklearn.metrics.pairwise import cosine_similarity
- import logging
-
- logger = logging.getLogger(__name__)
-
- class AttributionEvaluator:
-     def __init__(self, embedding_model: str = "BAAI/bge-large-en-v1.5"):
-         self.embedding_model = SentenceTransformer(embedding_model)
-
-     def evaluate_attribution(self, answers: List[str],
-                              retrieved_passages: List[List[Dict[str, Any]]],
-                              supporting_facts: List[List[str]] = None) -> Dict[str, float]:
-         """Evaluate attribution quality"""
-         if not answers or not retrieved_passages:
-             return {'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
-
-         precisions = []
-         recalls = []
-         f1_scores = []
-
-         for answer, passages, facts in zip(answers, retrieved_passages,
-                                            supporting_facts or [[]] * len(answers)):
-             if not passages:
-                 precisions.append(0.0)
-                 recalls.append(0.0)
-                 f1_scores.append(0.0)
-                 continue
-
-             # Extract passage texts
-             passage_texts = [p.get('text', '') for p in passages]
-
-             if facts:
-                 # Use provided supporting facts
-                 precision, recall, f1 = self._calculate_attribution_metrics(
-                     answer, passage_texts, facts)
-             else:
-                 # Use semantic similarity as a proxy
-                 precision, recall, f1 = self._calculate_semantic_attribution(
-                     answer, passage_texts)
-
-             precisions.append(precision)
-             recalls.append(recall)
-             f1_scores.append(f1)
-
-         return {
-             'precision': np.mean(precisions),
-             'recall': np.mean(recalls),
-             'f1': np.mean(f1_scores),
-             'precision_std': np.std(precisions),
-             'recall_std': np.std(recalls),
-             'f1_std': np.std(f1_scores)
-         }
-
-     def _calculate_attribution_metrics(self, answer: str, passages: List[str],
-                                        supporting_facts: List[str]) -> tuple:
-         """Calculate attribution metrics using supporting facts"""
-         if not passages:
-             return 0.0, 0.0, 0.0
-
-         # Find which passages contain supporting facts, and which facts are covered
-         relevant_passages = set()
-         covered_facts = 0
-         for fact in supporting_facts:
-             found = False
-             for i, passage in enumerate(passages):
-                 if self._passage_contains_fact(passage, fact):
-                     relevant_passages.add(i)
-                     found = True
-             if found:
-                 covered_facts += 1
-
-         # Precision: relevant retrieved passages / total retrieved passages
-         precision = len(relevant_passages) / len(passages)
-
-         # Recall: covered supporting facts / total supporting facts
-         recall = covered_facts / len(supporting_facts) if supporting_facts else 0.0
-
-         # F1 score
-         f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
-
-         return precision, recall, f1
-
-     def _calculate_semantic_attribution(self, answer: str, passages: List[str]) -> tuple:
-         """Calculate attribution using semantic similarity"""
-         if not passages:
-             return 0.0, 0.0, 0.0
-
-         # Encode answer and passages
-         answer_embedding = self.embedding_model.encode([answer])
-         passage_embeddings = self.embedding_model.encode(passages)
-
-         # Calculate similarities
-         similarities = cosine_similarity(answer_embedding, passage_embeddings)[0]
-
-         # Use a threshold to determine relevant passages
-         threshold = 0.3
-         relevant_count = np.sum(similarities >= threshold)
-
-         total_passages = len(passages)
-         precision = relevant_count / total_passages
-         recall = relevant_count / total_passages  # Simplified for the semantic proxy
-         f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
-
-         return precision, recall, f1
-
-     def _passage_contains_fact(self, passage: str, fact: str) -> bool:
-         """Check if a passage contains a supporting fact"""
-         # Simple word-overlap containment check
-         fact_words = set(fact.lower().split())
-         passage_words = set(passage.lower().split())
-
-         # Require that most fact words appear in the passage
-         overlap = len(fact_words & passage_words)
-         return overlap >= len(fact_words) * 0.7
-
-     def evaluate_citation_quality(self, answers: List[str],
-                                   citations: List[List[Dict[str, Any]]]) -> Dict[str, float]:
-         """Evaluate citation quality in answers"""
-         if not answers or not citations:
-             return {'citation_coverage': 0.0, 'citation_accuracy': 0.0}
-
-         coverage_scores = []
-         accuracy_scores = []
-
-         for answer, answer_citations in zip(answers, citations):
-             # Citation coverage: fraction of the answer that is cited
-             coverage_scores.append(self._calculate_citation_coverage(answer, answer_citations))
-
-             # Citation accuracy: fraction of citations that are relevant
-             accuracy_scores.append(self._calculate_citation_accuracy(answer, answer_citations))
-
-         return {
-             'citation_coverage': np.mean(coverage_scores),
-             'citation_accuracy': np.mean(accuracy_scores),
-             'citation_coverage_std': np.std(coverage_scores),
-             'citation_accuracy_std': np.std(accuracy_scores)
-         }
-
-     def _calculate_citation_coverage(self, answer: str, citations: List[Dict[str, Any]]) -> float:
-         """Estimate what fraction of the answer is covered by citations"""
-         if not citations:
-             return 0.0
-
-         # Simple heuristic: look for [n] citation markers in the answer
-         citation_markers = re.findall(r'\[\d+\]', answer)
-         if not citation_markers:
-             return 0.0
-
-         # Estimate coverage from citation density
-         answer_length = len(answer.split())
-         citation_density = len(citation_markers) / answer_length if answer_length > 0 else 0
-
-         return min(1.0, citation_density * 10)  # Scale factor
-
-     def _calculate_citation_accuracy(self, answer: str, citations: List[Dict[str, Any]]) -> float:
-         """Calculate accuracy of citations"""
-         if not citations:
-             return 0.0
-
-         # Simple heuristic: check whether cited passages are relevant to the answer
-         answer_words = set(answer.lower().split())
-         relevant_citations = 0
-
-         for citation in citations:
-             citation_text = citation.get('text', '')
-             citation_words = set(citation_text.lower().split())
-
-             # Check word overlap
-             overlap = len(answer_words & citation_words)
-             if overlap >= 3:  # Threshold for relevance
-                 relevant_citations += 1
-
-         return relevant_citations / len(citations)
-
-     def evaluate_retrieval_quality(self, queries: List[str],
-                                    retrieved_passages: List[List[Dict[str, Any]]],
-                                    relevant_passages: List[List[str]] = None) -> Dict[str, float]:
-         """Evaluate retrieval quality"""
-         if not queries or not retrieved_passages:
-             return {'retrieval_precision': 0.0, 'retrieval_recall': 0.0, 'retrieval_f1': 0.0}
-
-         precisions = []
-         recalls = []
-         f1_scores = []
-
-         for query, passages, relevant in zip(queries, retrieved_passages,
-                                              relevant_passages or [[]] * len(queries)):
-             if not passages:
-                 precisions.append(0.0)
-                 recalls.append(0.0)
-                 f1_scores.append(0.0)
-                 continue
-
-             if relevant:
-                 precision, recall, f1 = self._calculate_retrieval_metrics(passages, relevant)
-             else:
-                 # Use semantic similarity as a proxy
-                 precision, recall, f1 = self._calculate_semantic_retrieval(query, passages)
-
-             precisions.append(precision)
-             recalls.append(recall)
-             f1_scores.append(f1)
-
-         return {
-             'retrieval_precision': np.mean(precisions),
-             'retrieval_recall': np.mean(recalls),
-             'retrieval_f1': np.mean(f1_scores),
-             'retrieval_precision_std': np.std(precisions),
-             'retrieval_recall_std': np.std(recalls),
-             'retrieval_f1_std': np.std(f1_scores)
-         }
-
-     def _calculate_retrieval_metrics(self, passages: List[Dict[str, Any]],
-                                      relevant_passages: List[str]) -> tuple:
-         """Calculate retrieval metrics using ground truth"""
-         retrieved_texts = [p.get('text', '') for p in passages]
-
-         # Count retrieved passages that contain at least one relevant passage
-         relevant_retrieved = 0
-         for retrieved in retrieved_texts:
-             for relevant in relevant_passages:
-                 if self._passage_contains_fact(retrieved, relevant):
-                     relevant_retrieved += 1
-                     break
-
-         total_retrieved = len(passages)
-         total_relevant = len(relevant_passages)
-
-         precision = relevant_retrieved / total_retrieved if total_retrieved > 0 else 0.0
-         recall = relevant_retrieved / total_relevant if total_relevant > 0 else 0.0
-         f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
-
-         return precision, recall, f1
-
-     def _calculate_semantic_retrieval(self, query: str, passages: List[Dict[str, Any]]) -> tuple:
-         """Calculate retrieval metrics using semantic similarity"""
-         if not passages:
-             return 0.0, 0.0, 0.0
-
-         # Encode query and passages
-         query_embedding = self.embedding_model.encode([query])
-         passage_embeddings = self.embedding_model.encode([p.get('text', '') for p in passages])
-
-         # Calculate similarities
-         similarities = cosine_similarity(query_embedding, passage_embeddings)[0]
-
-         # Use a threshold to determine relevant passages
-         threshold = 0.3
-         relevant_count = np.sum(similarities >= threshold)
-
-         total_retrieved = len(passages)
-         precision = relevant_count / total_retrieved
-         recall = relevant_count / total_retrieved  # Simplified for the semantic proxy
-         f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
-
-         return precision, recall, f1
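The word-overlap heuristic in `_passage_contains_fact` drives both the fact-based attribution and retrieval metrics above. A standalone sketch of the same 70%-overlap rule (a re-implementation for illustration, with an added guard for an empty fact):

```python
def passage_contains_fact(passage: str, fact: str, min_overlap: float = 0.7) -> bool:
    """True if at least min_overlap of the fact's (lowercased) words appear in the passage."""
    fact_words = set(fact.lower().split())
    passage_words = set(passage.lower().split())
    if not fact_words:
        return False  # guard: an empty fact matches nothing
    return len(fact_words & passage_words) >= len(fact_words) * min_overlap

print(passage_contains_fact("Paris is the capital of France", "capital of France"))  # True
print(passage_contains_fact("Berlin is in Germany", "capital of France"))            # False
```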
 
eval/eval_calib.py DELETED
@@ -1,269 +0,0 @@
- from typing import List, Dict, Any
- import numpy as np
- from sklearn.metrics import roc_auc_score, average_precision_score
- import matplotlib.pyplot as plt
- import logging
-
- logger = logging.getLogger(__name__)
-
- class CalibrationEvaluator:
-     def __init__(self):
-         pass
-
-     def expected_calibration_error(self, predictions: List[float],
-                                    labels: List[int], n_bins: int = 10) -> float:
-         """Calculate Expected Calibration Error (ECE)"""
-         if not predictions or not labels:
-             return 0.0
-
-         predictions = np.array(predictions)
-         labels = np.array(labels)
-
-         # Create equal-width bins over [0, 1]
-         bin_boundaries = np.linspace(0, 1, n_bins + 1)
-         bin_lowers = bin_boundaries[:-1]
-         bin_uppers = bin_boundaries[1:]
-
-         ece = 0.0
-         for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
-             # Find predictions in this bin
-             in_bin = (predictions > bin_lower) & (predictions <= bin_upper)
-             prop_in_bin = in_bin.mean()
-
-             if prop_in_bin > 0:
-                 # Accuracy and confidence in this bin
-                 accuracy_in_bin = labels[in_bin].mean()
-                 avg_confidence_in_bin = predictions[in_bin].mean()
-
-                 # Weighted contribution to ECE
-                 ece += np.abs(avg_confidence_in_bin - accuracy_in_bin) * prop_in_bin
-
-         return ece
-
-     def maximum_calibration_error(self, predictions: List[float],
-                                   labels: List[int], n_bins: int = 10) -> float:
-         """Calculate Maximum Calibration Error (MCE)"""
-         if not predictions or not labels:
-             return 0.0
-
-         predictions = np.array(predictions)
-         labels = np.array(labels)
-
-         bin_boundaries = np.linspace(0, 1, n_bins + 1)
-         bin_lowers = bin_boundaries[:-1]
-         bin_uppers = bin_boundaries[1:]
-
-         mce = 0.0
-         for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
-             in_bin = (predictions > bin_lower) & (predictions <= bin_upper)
-
-             if in_bin.sum() > 0:
-                 accuracy_in_bin = labels[in_bin].mean()
-                 avg_confidence_in_bin = predictions[in_bin].mean()
-
-                 # MCE is the worst per-bin gap
-                 mce = max(mce, np.abs(avg_confidence_in_bin - accuracy_in_bin))
-
-         return mce
-
-     def reliability_diagram(self, predictions: List[float], labels: List[int],
-                             n_bins: int = 10, save_path: str = None) -> Dict[str, Any]:
-         """Create a reliability diagram"""
-         if not predictions or not labels:
-             return {}
-
-         predictions = np.array(predictions)
-         labels = np.array(labels)
-
-         bin_boundaries = np.linspace(0, 1, n_bins + 1)
-         bin_lowers = bin_boundaries[:-1]
-         bin_uppers = bin_boundaries[1:]
-
-         bin_centers = []
-         accuracies = []
-         confidences = []
-         counts = []
-
-         for bin_lower, bin_upper in zip(bin_lowers, bin_uppers):
-             in_bin = (predictions > bin_lower) & (predictions <= bin_upper)
-             count = in_bin.sum()
-
-             if count > 0:
-                 bin_centers.append((bin_lower + bin_upper) / 2)
-                 accuracies.append(labels[in_bin].mean())
-                 confidences.append(predictions[in_bin].mean())
-                 counts.append(count)
-
-         # Create plot
-         plt.figure(figsize=(8, 6))
-         plt.bar(bin_centers, accuracies, width=0.1, alpha=0.7, label='Accuracy')
-         plt.plot([0, 1], [0, 1], 'r--', label='Perfect Calibration')
-         plt.xlabel('Confidence')
-         plt.ylabel('Accuracy')
-         plt.title('Reliability Diagram')
-         plt.legend()
-         plt.grid(True, alpha=0.3)
-
-         if save_path:
-             plt.savefig(save_path, dpi=300, bbox_inches='tight')
-
-         plt.close()
-
-         return {
-             'bin_centers': bin_centers,
-             'accuracies': accuracies,
-             'confidences': confidences,
-             'counts': counts
-         }
-
-     def auroc(self, predictions: List[float], labels: List[int]) -> float:
-         """Calculate Area Under the ROC Curve"""
-         if not predictions or not labels:
-             return 0.0
-
-         try:
-             return roc_auc_score(labels, predictions)
-         except ValueError:  # e.g. only one class present
-             return 0.0
-
-     def auprc(self, predictions: List[float], labels: List[int]) -> float:
-         """Calculate Area Under the Precision-Recall Curve"""
-         if not predictions or not labels:
-             return 0.0
-
-         try:
-             return average_precision_score(labels, predictions)
-         except ValueError:
-             return 0.0
-
-     def risk_coverage_curve(self, predictions: List[float], labels: List[int],
-                             risk_thresholds: List[float] = None) -> Dict[str, Any]:
-         """Calculate the risk-coverage curve"""
-         if not predictions or not labels:
-             return {'thresholds': [], 'coverage': [], 'accuracy': []}
-
-         predictions = np.array(predictions)
-         labels = np.array(labels)
-
-         if risk_thresholds is None:
-             risk_thresholds = np.linspace(0, 1, 21)
-
-         coverages = []
-         accuracies = []
-
-         for threshold in risk_thresholds:
-             # Select predictions with risk <= threshold
-             selected = predictions <= threshold
-
-             if selected.sum() > 0:
-                 coverage = selected.mean()
-                 accuracy = labels[selected].mean()
-             else:
-                 coverage = 0.0
-                 accuracy = 0.0
-
-             coverages.append(coverage)
-             accuracies.append(accuracy)
-
-         return {
-             'thresholds': list(risk_thresholds),
-             'coverage': coverages,
-             'accuracy': accuracies
-         }
-
-     def evaluate_calibration(self, predictions: List[float], labels: List[int]) -> Dict[str, float]:
-         """Comprehensive calibration evaluation"""
-         if not predictions or not labels:
-             return {
-                 'ece': 0.0,
-                 'mce': 0.0,
-                 'auroc': 0.0,
-                 'auprc': 0.0
-             }
-
-         metrics = {
-             'ece': self.expected_calibration_error(predictions, labels),
-             'mce': self.maximum_calibration_error(predictions, labels),
-             'auroc': self.auroc(predictions, labels),
-             'auprc': self.auprc(predictions, labels)
-         }
-
-         # Risk-coverage analysis
-         metrics['risk_coverage'] = self.risk_coverage_curve(predictions, labels)
-
-         return metrics
-
-     def plot_calibration_curves(self, predictions: List[float], labels: List[int],
-                                 save_path: str = None) -> None:
-         """Plot calibration curves"""
-         if not predictions or not labels:
-             return
-
-         fig, axes = plt.subplots(2, 2, figsize=(12, 10))
-
-         # Reliability diagram
-         reliability_data = self.reliability_diagram(predictions, labels)
-         if reliability_data:
-             axes[0, 0].bar(reliability_data['bin_centers'], reliability_data['accuracies'],
-                            width=0.1, alpha=0.7)
-             axes[0, 0].plot([0, 1], [0, 1], 'r--')
-             axes[0, 0].set_xlabel('Confidence')
-             axes[0, 0].set_ylabel('Accuracy')
-             axes[0, 0].set_title('Reliability Diagram')
-             axes[0, 0].grid(True, alpha=0.3)
-
-         # Risk-coverage curve
-         risk_coverage = self.risk_coverage_curve(predictions, labels)
-         if risk_coverage['thresholds']:
-             axes[0, 1].plot(risk_coverage['coverage'], risk_coverage['accuracy'], 'b-')
-             axes[0, 1].set_xlabel('Coverage')
-             axes[0, 1].set_ylabel('Accuracy')
-             axes[0, 1].set_title('Risk-Coverage Curve')
-             axes[0, 1].grid(True, alpha=0.3)
-
-         # Confidence distribution
-         axes[1, 0].hist(predictions, bins=20, alpha=0.7, edgecolor='black')
-         axes[1, 0].set_xlabel('Confidence')
-         axes[1, 0].set_ylabel('Count')
-         axes[1, 0].set_title('Confidence Distribution')
-         axes[1, 0].grid(True, alpha=0.3)
-
-         # Accuracy vs Confidence
-         preds = np.array(predictions)
-         labs = np.array(labels)
-         bin_centers = np.linspace(0, 1, 11)
-         accuracies = []
-         for i in range(len(bin_centers) - 1):
-             mask = (preds >= bin_centers[i]) & (preds < bin_centers[i + 1])
-             accuracies.append(labs[mask].mean() if mask.sum() > 0 else 0)
-
-         axes[1, 1].plot(bin_centers[:-1], accuracies, 'bo-')
-         axes[1, 1].plot([0, 1], [0, 1], 'r--')
-         axes[1, 1].set_xlabel('Confidence')
-         axes[1, 1].set_ylabel('Accuracy')
-         axes[1, 1].set_title('Accuracy vs Confidence')
-         axes[1, 1].grid(True, alpha=0.3)
-
-         plt.tight_layout()
-
-         if save_path:
-             plt.savefig(save_path, dpi=300, bbox_inches='tight')
-
-         plt.close()
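The binning in `expected_calibration_error` can be checked on a toy case. A pure-Python sketch of the same computation (equal-width, left-open bins, as in the method; a re-implementation for illustration, not an import):

```python
def ece(preds, labels, n_bins=10):
    """Expected Calibration Error with equal-width, left-open bins (lo, hi]."""
    total = len(preds)
    err = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(preds) if lo < p <= hi]
        if idx:
            acc = sum(labels[i] for i in idx) / len(idx)    # empirical accuracy in bin
            conf = sum(preds[i] for i in idx) / len(idx)    # mean confidence in bin
            err += abs(conf - acc) * len(idx) / total       # weighted per-bin gap
    return err

# Confidence 0.75 with 3/4 correct: confidence matches accuracy, so ECE is zero
print(ece([0.75, 0.75, 0.75, 0.75], [1, 1, 1, 0]))  # 0.0
```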
 
eval/eval_qa.py DELETED
@@ -1,137 +0,0 @@
1
- import re
2
- from typing import List, Dict, Any
3
- import numpy as np
4
- from evaluate import load
5
- import logging
6
-
7
- logger = logging.getLogger(__name__)
8
-
9
- class QAEvaluator:
10
- def __init__(self):
11
- self.squad_metric = load("squad")
12
- self.rouge_metric = load("rouge")
13
-
14
- def exact_match(self, predictions: List[str], references: List[str]) -> float:
15
- """Calculate exact match score"""
16
- matches = 0
17
- for pred, ref in zip(predictions, references):
18
- if self._normalize_answer(pred) == self._normalize_answer(ref):
19
- matches += 1
20
- return matches / len(predictions) if predictions else 0.0
21
-
22
- def f1_score(self, predictions: List[str], references: List[str]) -> float:
23
- """Calculate F1 score"""
24
- f1_scores = []
25
- for pred, ref in zip(predictions, references):
26
- f1 = self._calculate_f1(pred, ref)
27
- f1_scores.append(f1)
28
- return np.mean(f1_scores) if f1_scores else 0.0
29
-
30
- def rouge_score(self, predictions: List[str], references: List[str]) -> Dict[str, float]:
31
- """Calculate ROUGE scores"""
32
- if not predictions or not references:
33
- return {'rouge1': 0.0, 'rouge2': 0.0, 'rougeL': 0.0}
34
-
35
- results = self.rouge_metric.compute(
36
- predictions=predictions,
37
- references=references
38
- )
39
-
40
- return {
41
- 'rouge1': results['rouge1'],
42
- 'rouge2': results['rouge2'],
43
- 'rougeL': results['rougeL']
44
- }
45
-
46
- def squad_metrics(self, predictions: List[str], references: List[str]) -> Dict[str, float]:
47
- """Calculate SQuAD-style metrics"""
48
- if not predictions or not references:
49
- return {'exact_match': 0.0, 'f1': 0.0}
50
-
51
- # Format for SQuAD metric
52
- formatted_predictions = [{"prediction_text": pred, "id": str(i)}
53
- for i, pred in enumerate(predictions)]
54
- formatted_references = [{"answers": {"text": [ref], "answer_start": [0]}, "id": str(i)}
55
- for i, ref in enumerate(references)]
56
-
57
- results = self.squad_metric.compute(
58
- predictions=formatted_predictions,
59
- references=formatted_references
60
- )
61
-
62
- return {
63
-            'exact_match': results['exact_match'],
-            'f1': results['f1']
-        }
-
-    def evaluate_batch(self, predictions: List[str], references: List[str]) -> Dict[str, float]:
-        """Evaluate a batch of predictions"""
-        metrics = {}
-
-        # Basic metrics
-        metrics['exact_match'] = self.exact_match(predictions, references)
-        metrics['f1'] = self.f1_score(predictions, references)
-
-        # ROUGE metrics
-        rouge_scores = self.rouge_score(predictions, references)
-        metrics.update(rouge_scores)
-
-        # SQuAD metrics
-        squad_scores = self.squad_metrics(predictions, references)
-        metrics.update(squad_scores)
-
-        return metrics
-
-    def _normalize_answer(self, answer: str) -> str:
-        """Normalize answer for comparison"""
-        def remove_articles(text):
-            return re.sub(r'\b(a|an|the)\b', ' ', text)
-
-        def white_space_fix(text):
-            return ' '.join(text.split())
-
-        def remove_punc(text):
-            exclude = set(string.punctuation)
-            return ''.join(ch for ch in text if ch not in exclude)
-
-        def lower(text):
-            return text.lower()
-
-        return white_space_fix(remove_articles(remove_punc(lower(answer))))
-
-    def _calculate_f1(self, prediction: str, reference: str) -> float:
-        """Calculate F1 score between prediction and reference"""
-        pred_tokens = self._normalize_answer(prediction).split()
-        ref_tokens = self._normalize_answer(reference).split()
-
-        if len(ref_tokens) == 0:
-            return 1.0 if len(pred_tokens) == 0 else 0.0
-
-        common = set(pred_tokens) & set(ref_tokens)
-
-        if len(common) == 0:
-            return 0.0
-
-        precision = len(common) / len(pred_tokens)
-        recall = len(common) / len(ref_tokens)
-
-        f1 = 2 * precision * recall / (precision + recall)
-        return f1
-
-    def evaluate_with_context(self, predictions: List[str], references: List[str],
-                              contexts: List[str]) -> Dict[str, float]:
-        """Evaluate with context awareness"""
-        metrics = self.evaluate_batch(predictions, references)
-
-        # Context-based metrics
-        context_scores = []
-        for pred, context in zip(predictions, contexts):
-            # Check if prediction is supported by context
-            pred_words = set(pred.lower().split())
-            context_words = set(context.lower().split())
-            overlap = len(pred_words & context_words) / len(pred_words) if pred_words else 0
-            context_scores.append(overlap)
-
-        metrics['context_support'] = np.mean(context_scores)
-
-        return metrics
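Note that the set-based intersection in `_calculate_f1` ignores token multiplicity. The official SQuAD evaluation counts overlapping tokens with a multiset; a standalone sketch of that variant (function names here are illustrative, not from the deleted file):

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = ''.join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r'\b(a|an|the)\b', ' ', s)
    return ' '.join(s.split())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 using multiset overlap, as in the SQuAD evaluation script."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty -> perfect match; only one empty -> no overlap possible
        return float(pred_tokens == ref_tokens)
    # Counter intersection: a repeated token counts as many times as it co-occurs
    common = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("The cat sat", "a cat sat down")` gives precision 1.0 and recall 2/3, so F1 = 0.8.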
eval/eval_system.py DELETED
@@ -1,297 +0,0 @@
- import time
- import psutil
- import GPUtil
- from typing import List, Dict, Any, Optional
- import numpy as np
- import logging
- import threading
- from concurrent.futures import ThreadPoolExecutor, as_completed
-
- logger = logging.getLogger(__name__)
-
- class SystemEvaluator:
-     def __init__(self):
-         self.monitoring = False
-         self.metrics = []
-         self.monitor_thread = None
-
-     def start_monitoring(self):
-         """Start system monitoring"""
-         self.monitoring = True
-         self.metrics = []
-         self.monitor_thread = threading.Thread(target=self._monitor_system)
-         self.monitor_thread.start()
-         logger.info("Started system monitoring")
-
-     def stop_monitoring(self):
-         """Stop system monitoring"""
-         self.monitoring = False
-         if self.monitor_thread:
-             self.monitor_thread.join()
-         logger.info("Stopped system monitoring")
-
-     def _monitor_system(self):
-         """Monitor system resources"""
-         while self.monitoring:
-             try:
-                 # CPU usage (cpu_percent itself blocks for the 1s interval)
-                 cpu_percent = psutil.cpu_percent(interval=1)
-
-                 # Memory usage
-                 memory = psutil.virtual_memory()
-                 memory_percent = memory.percent
-                 memory_used_gb = memory.used / (1024**3)
-
-                 # GPU usage (if available)
-                 gpu_metrics = self._get_gpu_metrics()
-
-                 # Disk usage
-                 disk = psutil.disk_usage('/')
-                 disk_percent = disk.percent
-
-                 metric = {
-                     'timestamp': time.time(),
-                     'cpu_percent': cpu_percent,
-                     'memory_percent': memory_percent,
-                     'memory_used_gb': memory_used_gb,
-                     'disk_percent': disk_percent,
-                     **gpu_metrics
-                 }
-
-                 self.metrics.append(metric)
-
-             except Exception as e:
-                 logger.error(f"Error monitoring system: {e}")
-
-             time.sleep(1)  # Throttle the sampling loop
-
-     def _get_gpu_metrics(self) -> Dict[str, Any]:
-         """Get GPU metrics"""
-         try:
-             gpus = GPUtil.getGPUs()
-             if gpus:
-                 gpu = gpus[0]  # Use first GPU
-                 return {
-                     'gpu_utilization': gpu.load * 100,
-                     'gpu_memory_used': gpu.memoryUsed,
-                     'gpu_memory_total': gpu.memoryTotal,
-                     'gpu_memory_percent': (gpu.memoryUsed / gpu.memoryTotal) * 100,
-                     'gpu_temperature': gpu.temperature
-                 }
-         except Exception:
-             pass
-
-         return {
-             'gpu_utilization': 0,
-             'gpu_memory_used': 0,
-             'gpu_memory_total': 0,
-             'gpu_memory_percent': 0,
-             'gpu_temperature': 0
-         }
-
-     def measure_throughput(self, func, args_list: List[tuple],
-                            max_workers: int = 4) -> Dict[str, Any]:
-         """Measure throughput of a function"""
-
-         start_time = time.time()
-
-         # Execute the function concurrently across a thread pool
-         results = []
-         with ThreadPoolExecutor(max_workers=max_workers) as executor:
-             futures = [executor.submit(func, *args) for args in args_list]
-
-             for future in as_completed(futures):
-                 try:
-                     result = future.result()
-                     results.append(result)
-                 except Exception as e:
-                     logger.error(f"Error in throughput measurement: {e}")
-
-         end_time = time.time()
-
-         total_time = end_time - start_time
-         throughput = len(results) / total_time  # queries per second
-
-         return {
-             'total_queries': len(args_list),
-             'successful_queries': len(results),
-             'total_time': total_time,
-             'throughput_qps': throughput,
-             'avg_time_per_query': total_time / len(args_list) if args_list else 0
-         }
-
-     def measure_latency(self, func, args: tuple, num_runs: int = 10) -> Dict[str, Any]:
-         """Measure latency of a function"""
-
-         latencies = []
-
-         for _ in range(num_runs):
-             start_time = time.time()
-             try:
-                 result = func(*args)
-                 end_time = time.time()
-                 latency = end_time - start_time
-                 latencies.append(latency)
-             except Exception as e:
-                 logger.error(f"Error in latency measurement: {e}")
-                 latencies.append(float('inf'))
-
-         # Drop infinite latencies (failed runs)
-         latencies = [l for l in latencies if l != float('inf')]
-
-         if not latencies:
-             return {
-                 'avg_latency': 0,
-                 'p50_latency': 0,
-                 'p95_latency': 0,
-                 'p99_latency': 0,
-                 'min_latency': 0,
-                 'max_latency': 0,
-                 'std_latency': 0
-             }
-
-         latencies = np.array(latencies)
-
-         return {
-             'avg_latency': np.mean(latencies),
-             'p50_latency': np.percentile(latencies, 50),
-             'p95_latency': np.percentile(latencies, 95),
-             'p99_latency': np.percentile(latencies, 99),
-             'min_latency': np.min(latencies),
-             'max_latency': np.max(latencies),
-             'std_latency': np.std(latencies)
-         }
-
-     def measure_batch_latency(self, func, args_list: List[tuple],
-                               batch_sizes: List[int] = [1, 4, 8, 16]) -> Dict[str, Any]:
-         """Measure latency for different batch sizes"""
-
-         results = {}
-
-         for batch_size in batch_sizes:
-             batch_latencies = []
-
-             # Process in batches
-             for i in range(0, len(args_list), batch_size):
-                 batch_args = args_list[i:i + batch_size]
-
-                 start_time = time.time()
-                 try:
-                     batch_results = [func(*args) for args in batch_args]
-                     end_time = time.time()
-
-                     batch_latency = end_time - start_time
-                     batch_latencies.append(batch_latency)
-
-                 except Exception as e:
-                     logger.error(f"Error in batch latency measurement: {e}")
-
-             if batch_latencies:
-                 results[f'batch_size_{batch_size}'] = {
-                     'avg_latency': np.mean(batch_latencies),
-                     'p95_latency': np.percentile(batch_latencies, 95),
-                     'throughput': batch_size / np.mean(batch_latencies)
-                 }
-
-         return results
-
-     def get_system_stats(self) -> Dict[str, Any]:
-         """Get current system statistics"""
-
-         if not self.metrics:
-             return {}
-
-         # Calculate statistics from monitoring data
-         cpu_values = [m['cpu_percent'] for m in self.metrics]
-         memory_values = [m['memory_percent'] for m in self.metrics]
-         gpu_values = [m.get('gpu_utilization', 0) for m in self.metrics]
-
-         return {
-             'monitoring_duration': len(self.metrics),
-             'cpu': {
-                 'avg': np.mean(cpu_values),
-                 'max': np.max(cpu_values),
-                 'min': np.min(cpu_values),
-                 'std': np.std(cpu_values)
-             },
-             'memory': {
-                 'avg': np.mean(memory_values),
-                 'max': np.max(memory_values),
-                 'min': np.min(memory_values),
-                 'std': np.std(memory_values)
-             },
-             'gpu': {
-                 'avg': np.mean(gpu_values),
-                 'max': np.max(gpu_values),
-                 'min': np.min(gpu_values),
-                 'std': np.std(gpu_values)
-             }
-         }
-
-     def evaluate_retrieval_performance(self, retriever, queries: List[str],
-                                        k: int = 10) -> Dict[str, Any]:
-         """Evaluate retrieval performance"""
-
-         # Measure latency
-         latency_stats = self.measure_latency(
-             retriever.retrieve_single,
-             (queries[0], k),
-             num_runs=5
-         )
-
-         # Measure throughput
-         throughput_stats = self.measure_throughput(
-             retriever.retrieve_single,
-             [(query, k) for query in queries[:10]],  # Limit for throughput test
-             max_workers=4
-         )
-
-         return {
-             'latency': latency_stats,
-             'throughput': throughput_stats
-         }
-
-     def evaluate_generation_performance(self, generator, questions: List[str],
-                                         passages_list: List[List[Dict[str, Any]]]) -> Dict[str, Any]:
-         """Evaluate generation performance"""
-
-         # Measure latency
-         latency_stats = self.measure_latency(
-             generator.generate_with_strategy,
-             (questions[0], passages_list[0]),
-             num_runs=5
-         )
-
-         # Measure throughput
-         throughput_stats = self.measure_throughput(
-             generator.generate_with_strategy,
-             list(zip(questions[:5], passages_list[:5])),  # Limit for throughput test
-             max_workers=2
-         )
-
-         return {
-             'latency': latency_stats,
-             'throughput': throughput_stats
-         }
-
-     def evaluate_end_to_end_performance(self, rag_system, queries: List[str]) -> Dict[str, Any]:
-         """Evaluate end-to-end RAG performance"""
-
-         # Measure latency
-         latency_stats = self.measure_latency(
-             rag_system.query,
-             (queries[0],),
-             num_runs=5
-         )
-
-         # Measure throughput
-         throughput_stats = self.measure_throughput(
-             rag_system.query,
-             [(query,) for query in queries[:10]],  # Limit for throughput test
-             max_workers=2
-         )
-
-         return {
-             'latency': latency_stats,
-             'throughput': throughput_stats
-         }
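The percentile summary that `measure_latency` builds with numpy can also be produced with the stdlib alone; a minimal sketch of the same p50/p95/p99 summary (the timing loop and key names mirror the deleted file, the `statistics`-based implementation is an assumption of this sketch):

```python
import time
import statistics

def measure_latency(func, args=(), num_runs=10):
    """Time repeated calls and summarize latency percentiles (seconds)."""
    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()  # monotonic, higher resolution than time.time()
        func(*args)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    # quantiles(n=100) yields 99 cut points; index k-1 approximates the k-th percentile
    q = statistics.quantiles(latencies, n=100)
    return {
        'avg_latency': statistics.fmean(latencies),
        'p50_latency': q[49],
        'p95_latency': q[94],
        'p99_latency': q[98],
        'min_latency': latencies[0],
        'max_latency': latencies[-1],
    }
```

`time.perf_counter()` is preferable to `time.time()` for interval timing because it is monotonic and unaffected by system clock adjustments.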
exp_pipeline/pipeline.py DELETED
@@ -1,56 +0,0 @@
- """
- End-to-end pipeline for dataset download, preprocessing, embedding, and indexing.
- """
- import logging
- from data_processing.data_loader import DataLoader
- from data_processing.preprocessor import Preprocessor
- from retriever.embedder import Embedder
- from retriever.faiss_index import build_faiss_index
-
- logger = logging.getLogger(__name__)
-
-
- def run_pipeline(split: str = "train"):
-     # 1. Download the MS MARCO Passage Ranking dataset
-     data_loader = DataLoader()
-     raw_data = data_loader.get_passage_dataset(split)
-     logger.info(f"Loaded {len(raw_data)} samples from MS MARCO Passage Ranking [{split}]")
-     print("data_loader\n")
-
-     # 2. Preprocess the data
-     preprocessor = Preprocessor()
-     # Convert a HuggingFace datasets object (dict of columns) into a list of row dicts
-     if hasattr(raw_data, "to_dict"):
-         raw_data = raw_data.to_dict()
-         raw_data = [dict(zip(raw_data.keys(), v)) for v in zip(*raw_data.values())]
-     print("raw_data\n")
-
-     # MS MARCO Passage v2.1: use the passages["passage_text"] field
-     passages = []
-     for item in raw_data:
-         if "passages" in item and "passage_text" in item["passages"]:
-             passages.extend(item["passages"]["passage_text"])
-     processed = preprocessor.preprocess_passages(passages)
-     texts = [p["text"] for p in processed]
-     print("texts\n")
-
-     logger.info(f"Processed {len(texts)} passages")
-
-     # 3. Generate embeddings
-     embedder = Embedder(device="cuda")
-     embeddings = embedder.encode(texts)
-     print(f"Embedding shape: {getattr(embeddings, 'shape', None)}")
-     print(f"Texts count: {len(texts)}")
-     if embeddings is None or not hasattr(embeddings, 'shape') or len(embeddings.shape) != 2 or embeddings.shape[0] == 0:
-         raise ValueError("Embeddings is empty or not a 2D array. Check input texts and embedding model.")
-
-     # 4. Build the FAISS index
-     index = build_faiss_index(embeddings, texts, index_type="HNSW")
-     logger.info("FAISS index built successfully")
-     # Persist the index to the ../index directory
-     index.save("../index/msmarco_hnsw")
-     logger.info("FAISS index saved to ../index/msmarco_hnsw")
-     return index
-
- if __name__ == "__main__":
-     run_pipeline("train")
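The `zip(*raw_data.values())` expression in `run_pipeline` transposes a columnar dict, as returned by `Dataset.to_dict()`, into a list of row dicts. Isolated, the trick looks like this (the sample data is made up for illustration):

```python
def columns_to_rows(columns: dict) -> list:
    """Turn {'col': [v1, v2, ...], ...} into [{'col': v1, ...}, {'col': v2, ...}, ...]."""
    # zip(*values) pairs up the i-th element of every column; dict(zip(keys, row))
    # reattaches the column names to each transposed row
    return [dict(zip(columns.keys(), row)) for row in zip(*columns.values())]

table = {'query': ['q1', 'q2'], 'answer': ['a1', 'a2']}
rows = columns_to_rows(table)
# rows == [{'query': 'q1', 'answer': 'a1'}, {'query': 'q2', 'answer': 'a2'}]
```

Note that `zip` stops at the shortest column, so ragged columns are silently truncated.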
generator/__init__.py DELETED
@@ -1,5 +0,0 @@
- from .vllm_server import VLLMServer
- from .safe_generate import SafeGenerator
- from .prompt_templates import PromptTemplates
-
- __all__ = ['VLLMServer', 'SafeGenerator', 'PromptTemplates']
generator/prompt_templates.py DELETED
@@ -1,113 +0,0 @@
- from typing import List, Dict, Any
- from dataclasses import dataclass
-
- @dataclass
- class PromptTemplate:
-     name: str
-     template: str
-     system_prompt: str = ""
-
- class PromptTemplates:
-     def __init__(self):
-         self.templates = {
-             'rag': PromptTemplate(
-                 name='rag',
-                 system_prompt="You are a helpful assistant that answers questions based on provided context. Always cite your sources when possible.",
-                 template="""Context:
- {context}
-
- Question: {question}
-
- Answer:"""
-             ),
-
-             'rag_with_citations': PromptTemplate(
-                 name='rag_with_citations',
-                 system_prompt="You are a helpful assistant that answers questions based on provided context. Always provide citations in the format [1], [2], etc.",
-                 template="""Context:
- {context}
-
- Question: {question}
-
- Answer (with citations):"""
-             ),
-
-             'rag_safe': PromptTemplate(
-                 name='rag_safe',
-                 system_prompt="You are a helpful assistant that answers questions based on provided context. If you're uncertain, say so. Always cite your sources.",
-                 template="""Context:
- {context}
-
- Question: {question}
-
- Instructions:
- - Answer based on the provided context
- - If uncertain, express your uncertainty
- - Always provide citations
- - If the context doesn't contain enough information, say so
-
- Answer:"""
-             ),
-
-             'rag_uncertain': PromptTemplate(
-                 name='rag_uncertain',
-                 system_prompt="You are a helpful assistant. Express uncertainty when appropriate and always cite sources.",
-                 template="""Context:
- {context}
-
- Question: {question}
-
- Answer (express uncertainty if appropriate):"""
-             )
-         }
-
-     def get_template(self, name: str) -> PromptTemplate:
-         """Get a prompt template by name"""
-         if name not in self.templates:
-             raise ValueError(f"Unknown template: {name}")
-         return self.templates[name]
-
-     def format_prompt(self, template_name: str, **kwargs) -> str:
-         """Format a prompt using a template"""
-         template = self.get_template(template_name)
-
-         # Format the main template
-         formatted = template.template.format(**kwargs)
-
-         # Add system prompt if available
-         if template.system_prompt:
-             formatted = f"{template.system_prompt}\n\n{formatted}"
-
-         return formatted
-
-     def format_context(self, retrieved_passages: List[Dict[str, Any]],
-                        max_length: int = 2000) -> str:
-         """Format retrieved passages as context"""
-         context_parts = []
-         current_length = 0
-
-         for i, passage in enumerate(retrieved_passages):
-             text = passage.get('text', '')
-             if current_length + len(text) > max_length:
-                 break
-
-             context_parts.append(f"[{i+1}] {text}")
-             current_length += len(text)
-
-         return "\n\n".join(context_parts)
-
-     def create_rag_prompt(self, question: str, retrieved_passages: List[Dict[str, Any]],
-                           template_name: str = 'rag', max_context_length: int = 2000) -> str:
-         """Create a RAG prompt"""
-         context = self.format_context(retrieved_passages, max_context_length)
-         return self.format_prompt(template_name, question=question, context=context)
-
-     def create_batch_prompts(self, questions: List[str],
-                              retrieved_passages_list: List[List[Dict[str, Any]]],
-                              template_name: str = 'rag') -> List[str]:
-         """Create multiple RAG prompts"""
-         prompts = []
-         for question, passages in zip(questions, retrieved_passages_list):
-             prompt = self.create_rag_prompt(question, passages, template_name)
-             prompts.append(prompt)
-         return prompts
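`format_context` above implements a simple greedy character budget: passages are numbered `[1]`, `[2]`, ... and appended until the next one would exceed the limit. The same behavior as a standalone function (names mirror the deleted method):

```python
def format_context(passages, max_length=2000):
    """Number passages [1], [2], ... stopping before max_length characters of passage text."""
    parts, used = [], 0
    for i, passage in enumerate(passages):
        text = passage.get('text', '')
        # Greedy cutoff: stop at the first passage that would blow the budget
        if used + len(text) > max_length:
            break
        parts.append(f"[{i+1}] {text}")
        used += len(text)
    return "\n\n".join(parts)
```

One design note: the budget counts only the passage text, not the `[i] ` markers or the `\n\n` separators, so the final string can be slightly longer than `max_length`.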
generator/safe_generate.py DELETED
@@ -1,170 +0,0 @@
- import re
- import logging
- from typing import List, Dict, Any, Optional, Tuple
- from .vllm_server import VLLMServer
- from .prompt_templates import PromptTemplates
- from ..calibration.features import RiskFeatureExtractor
-
- logger = logging.getLogger(__name__)
-
- class SafeGenerator:
-     def __init__(self, vllm_server: VLLMServer,
-                  risk_extractor: RiskFeatureExtractor,
-                  tau1: float = 0.3, tau2: float = 0.7):
-         self.vllm_server = vllm_server
-         self.risk_extractor = risk_extractor
-         self.prompt_templates = PromptTemplates()
-         self.tau1 = tau1  # Low risk threshold
-         self.tau2 = tau2  # High risk threshold
-
-     def generate_with_strategy(self, question: str,
-                                retrieved_passages: List[Dict[str, Any]],
-                                force_citation: bool = False) -> Dict[str, Any]:
-         """Generate answer with adaptive strategy based on risk assessment"""
-
-         # Extract risk features
-         risk_features = self.risk_extractor.extract_features(
-             question, retrieved_passages
-         )
-
-         # Get risk score (placeholder - will be implemented in calibration module)
-         risk_score = self._estimate_risk_score(risk_features)
-
-         # Determine strategy based on risk score
-         if risk_score < self.tau1:
-             # Low risk: normal generation
-             strategy = "normal"
-             temperature = 0.7
-             template_name = "rag"
-         elif risk_score < self.tau2:
-             # Medium risk: conservative generation with citations
-             strategy = "conservative"
-             temperature = 0.5
-             template_name = "rag_with_citations"
-             force_citation = True
-         else:
-             # High risk: very conservative or refuse
-             strategy = "conservative_or_refuse"
-             temperature = 0.3
-             template_name = "rag_safe"
-             force_citation = True
-
-         # Generate prompt
-         prompt = self.prompt_templates.create_rag_prompt(
-             question, retrieved_passages, template_name
-         )
-
-         # Generate answer
-         try:
-             result = self.vllm_server.generate_single(
-                 prompt,
-                 max_tokens=512,
-                 temperature=temperature
-             )
-
-             # Post-process for citations if needed
-             if force_citation:
-                 result = self._add_citations(result, retrieved_passages)
-
-             return {
-                 'answer': result,
-                 'risk_score': risk_score,
-                 'strategy': strategy,
-                 'temperature': temperature,
-                 'features': risk_features,
-                 'citations': self._extract_citations(result, retrieved_passages)
-             }
-
-         except Exception as e:
-             logger.error(f"Generation failed: {e}")
-             return {
-                 'answer': "I apologize, but I encountered an error while generating a response.",
-                 'risk_score': 1.0,
-                 'strategy': 'error',
-                 'temperature': 0.0,
-                 'features': risk_features,
-                 'citations': []
-             }
-
-     def generate_batch(self, questions: List[str],
-                        retrieved_passages_list: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
-         """Generate answers for multiple questions"""
-         results = []
-
-         for question, passages in zip(questions, retrieved_passages_list):
-             result = self.generate_with_strategy(question, passages)
-             results.append(result)
-
-         return results
-
-     def _estimate_risk_score(self, features: Dict[str, Any]) -> float:
-         """Estimate risk score from features (placeholder implementation)"""
-         # This is a simplified risk estimation
-         # In practice, this would use a trained calibration model
-
-         # Higher similarity scores = lower risk
-         avg_similarity = features.get('avg_similarity', 0.5)
-
-         # More diverse passages = lower risk
-         diversity = features.get('diversity', 0.5)
-
-         # More passages = lower risk (up to a point)
-         num_passages = min(features.get('num_passages', 1), 10)
-         passage_score = 1.0 - (num_passages / 10.0)
-
-         # Combine factors
-         risk_score = 1.0 - (avg_similarity * 0.4 + diversity * 0.3 + (1.0 - passage_score) * 0.3)
-
-         return max(0.0, min(1.0, risk_score))
-
-     def _add_citations(self, answer: str, passages: List[Dict[str, Any]]) -> str:
-         """Add citations to answer if not present"""
-         if '[' in answer and ']' in answer:
-             return answer  # Already has citations
-
-         # Simple citation addition (in practice, use more sophisticated methods)
-         cited_answer = answer
-         for i, passage in enumerate(passages[:3]):  # Limit to first 3 passages
-             if any(word in answer.lower() for word in passage['text'].lower().split()[:5]):
-                 cited_answer += f" [{i+1}]"
-
-         return cited_answer
-
-     def _extract_citations(self, answer: str, passages: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
-         """Extract citations from answer"""
-         citations = []
-
-         # Find citation markers like [1], [2], etc.
-         citation_matches = re.findall(r'\[(\d+)\]', answer)
-
-         for match in citation_matches:
-             idx = int(match) - 1
-             if 0 <= idx < len(passages):
-                 citations.append({
-                     'id': idx,
-                     'text': passages[idx]['text'],
-                     'metadata': passages[idx].get('metadata', {})
-                 })
-
-         return citations
-
-     def get_generation_stats(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
-         """Get statistics from generation results"""
-         if not results:
-             return {}
-
-         risk_scores = [r['risk_score'] for r in results]
-         strategies = [r['strategy'] for r in results]
-
-         strategy_counts = {}
-         for strategy in strategies:
-             strategy_counts[strategy] = strategy_counts.get(strategy, 0) + 1
-
-         return {
-             'num_queries': len(results),
-             'avg_risk_score': sum(risk_scores) / len(risk_scores),
-             'min_risk_score': min(risk_scores),
-             'max_risk_score': max(risk_scores),
-             'strategy_distribution': strategy_counts,
-             'avg_citations_per_answer': sum(len(r.get('citations', [])) for r in results) / len(results)
-         }
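The tau1/tau2 branching in `generate_with_strategy` is a three-way threshold policy on the risk score. Stripped to its essentials (threshold defaults, temperatures, and template names mirror the deleted file):

```python
def pick_strategy(risk_score: float, tau1: float = 0.3, tau2: float = 0.7) -> dict:
    """Map a risk score in [0, 1] to a generation strategy, temperature, and template."""
    if risk_score < tau1:
        # Low risk: normal generation
        return {'strategy': 'normal', 'temperature': 0.7, 'template': 'rag'}
    if risk_score < tau2:
        # Medium risk: conservative generation with forced citations
        return {'strategy': 'conservative', 'temperature': 0.5,
                'template': 'rag_with_citations'}
    # High risk: very conservative, may refuse
    return {'strategy': 'conservative_or_refuse', 'temperature': 0.3,
            'template': 'rag_safe'}
```

Both comparisons are strict, so a score exactly equal to a threshold falls into the more conservative bucket.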
generator/vllm_server.py DELETED
@@ -1,102 +0,0 @@
- import asyncio
- import functools
- import logging
- from concurrent.futures import ThreadPoolExecutor
- from typing import List, Dict, Any, Optional
- from vllm import LLM, SamplingParams
-
- logger = logging.getLogger(__name__)
-
- class VLLMServer:
-     def __init__(self, model_name: str = "openai/gpt-oss-20b",
-                  tensor_parallel_size: int = 1, gpu_memory_utilization: float = 0.9):
-         self.model_name = model_name
-         self.tensor_parallel_size = tensor_parallel_size
-         self.gpu_memory_utilization = gpu_memory_utilization
-         self.llm = None
-         self.executor = ThreadPoolExecutor(max_workers=4)
-
-     def initialize(self):
-         """Initialize the vLLM model"""
-         try:
-             self.llm = LLM(
-                 model=self.model_name,
-                 tensor_parallel_size=self.tensor_parallel_size,
-                 gpu_memory_utilization=self.gpu_memory_utilization,
-                 trust_remote_code=True
-             )
-             logger.info(f"Initialized vLLM with model: {self.model_name}")
-         except Exception as e:
-             logger.error(f"Failed to initialize vLLM: {e}")
-             raise
-
-     def generate(self, prompts: List[str],
-                  max_tokens: int = 512,
-                  temperature: float = 0.7,
-                  top_p: float = 0.9,
-                  stop: Optional[List[str]] = None) -> List[Dict[str, Any]]:
-         """Generate text for prompts"""
-         if self.llm is None:
-             self.initialize()
-
-         sampling_params = SamplingParams(
-             max_tokens=max_tokens,
-             temperature=temperature,
-             top_p=top_p,
-             stop=stop
-         )
-
-         try:
-             outputs = self.llm.generate(prompts, sampling_params)
-
-             results = []
-             for output in outputs:
-                 results.append({
-                     'text': output.outputs[0].text,
-                     'prompt': output.prompt,
-                     'finish_reason': output.outputs[0].finish_reason,
-                     'token_ids': output.outputs[0].token_ids,
-                     'logprobs': getattr(output.outputs[0], 'logprobs', None)
-                 })
-
-             return results
-         except Exception as e:
-             logger.error(f"Generation failed: {e}")
-             raise
-
-     def generate_single(self, prompt: str, **kwargs) -> str:
-         """Generate text for a single prompt"""
-         results = self.generate([prompt], **kwargs)
-         return results[0]['text'] if results else ""
-
-     def generate_batch(self, prompts: List[str], batch_size: int = 8, **kwargs) -> List[str]:
-         """Generate text for multiple prompts in batches"""
-         all_results = []
-
-         for i in range(0, len(prompts), batch_size):
-             batch_prompts = prompts[i:i + batch_size]
-             batch_results = self.generate(batch_prompts, **kwargs)
-             all_results.extend([r['text'] for r in batch_results])
-
-         return all_results
-
-     async def generate_async(self, prompts: List[str], **kwargs) -> List[Dict[str, Any]]:
-         """Async generation (run_in_executor does not forward kwargs, so bind them first)"""
-         loop = asyncio.get_event_loop()
-         call = functools.partial(self.generate, prompts, **kwargs)
-         return await loop.run_in_executor(self.executor, call)
-
-     def get_model_info(self) -> Dict[str, Any]:
-         """Get model information"""
-         if self.llm is None:
-             return {}
-
-         return {
-             'model_name': self.model_name,
-             'tensor_parallel_size': self.tensor_parallel_size,
-             'gpu_memory_utilization': self.gpu_memory_utilization,
-             'is_initialized': self.llm is not None
-         }
-
-     def cleanup(self):
-         """Cleanup resources"""
-         if self.executor:
-             self.executor.shutdown(wait=True)
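`generate_batch` above chunks the prompt list with a fixed stride before handing each slice to the model. The same chunking in isolation, with a generator instead of the inline loop:

```python
def chunked(items, batch_size):
    """Yield successive slices of at most batch_size items, preserving order."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

prompts = [f"prompt-{i}" for i in range(5)]
batches = list(chunked(prompts, 2))
# Three batches of sizes 2, 2, 1; the last batch is simply shorter
```

The final slice may be smaller than `batch_size`; the slicing never raises on overshoot, so no special-casing of the remainder is needed.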
real_embedding_test.py DELETED
@@ -1,269 +0,0 @@
- #!/usr/bin/env python3
- # -*- coding: utf-8 -*-
- """
- SafeRAG Real Embedding Test
- Load data -> Generate real embeddings using sentence-transformers -> Build index -> Retrieve
- """
-
- import sys
- import os
- import time
- import numpy as np
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
-
- def test_real_embedding_pipeline():
-     """Test the complete pipeline with real embeddings"""
-     print("SafeRAG Real Embedding Pipeline Test")
-     print("=" * 50)
-
-     try:
-         # Step 1: Load data
-         print("\n1. Loading data...")
-         from data_processing import DataLoader, Preprocessor
-
-         loader = DataLoader()
-         preprocessor = Preprocessor()
-
-         # Load knowledge base
-         kb_passages = loader.get_knowledge_base()
-         print(f"   ✓ Loaded {len(kb_passages)} knowledge base passages")
-
-         # Show sample passages
-         for i, passage in enumerate(kb_passages):
-             print(f"   [{i+1}] {passage}")
-
-         # Preprocess passages
-         processed_passages = preprocessor.preprocess_passages(kb_passages)
-         print(f"   ✓ Preprocessed {len(processed_passages)} passages")
-
-         # Step 2: Generate real embeddings
-         print("\n2. Generating real embeddings with sentence-transformers...")
-         from retriever import Embedder
-
-         # Use a smaller model for faster testing
-         embedder = Embedder(model_name="all-MiniLM-L6-v2", device="cpu")
-         print(f"   ✓ Loaded embedding model: {embedder.model_name}")
-         print(f"   ✓ Embedding dimension: {embedder.get_dimension()}")
-
-         # Extract text from processed passages
-         passage_texts = [p['text'] for p in processed_passages]
-
-         # Generate embeddings
-         start_time = time.time()
-         embeddings = embedder.encode_passages(passage_texts)
-         embedding_time = time.time() - start_time
-
-         print(f"   ✓ Generated {embeddings.shape[0]} embeddings in {embedding_time:.3f}s")
-         print(f"   ✓ Embedding shape: {embeddings.shape}")
-         print(f"   ✓ Embedding type: {type(embeddings)}")
-
-         # Show embedding statistics
-         print(f"   ✓ Embedding stats:")
-         print(f"     - Mean: {np.mean(embeddings):.4f}")
-         print(f"     - Std: {np.std(embeddings):.4f}")
-         print(f"     - Min: {np.min(embeddings):.4f}")
-         print(f"     - Max: {np.max(embeddings):.4f}")
-
-         # Step 3: Build FAISS index
-         print("\n3. Building FAISS index...")
-         from retriever import FAISSIndex
-
-         index = FAISSIndex(embedder.get_dimension())
-         start_time = time.time()
-         index.build_index(embeddings, passage_texts)
-         build_time = time.time() - start_time
-
-         print(f"   ✓ Built FAISS index in {build_time:.3f}s")
-         print(f"   ✓ Index contains {index.index.ntotal} vectors")
-
-         # Step 4: Test retrieval
-         print("\n4. Testing retrieval...")
-         from retriever import Retriever
-
-         retriever = Retriever(embedder, index, None)  # No reranker for simplicity
-
-         test_queries = [
-             "What is machine learning?",
-             "Tell me about the capital of France",
-             "How does Python work?",
-             "What is artificial intelligence?"
-         ]
-
-         for query in test_queries:
-             print(f"\n   Query: '{query}'")
-             start_time = time.time()
-             results = retriever.retrieve_single(query, k=3)
-             retrieval_time = time.time() - start_time
-
-             print(f"   ✓ Retrieved {len(results)} passages in {retrieval_time:.3f}s")
-             for i, result in enumerate(results):
-                 print(f"     [{i+1}] Score: {result['score']:.4f}")
-                 print(f"         Text: {result['text'][:100]}...")
-
-         # Step 5: Test similarity calculation
-         print("\n5. Testing similarity calculation...")
-
-         # Test query-passage similarity
-         query = "What is machine learning?"
-         query_embedding = embedder.encode_queries([query])[0]
-
-         print(f"   Query: '{query}'")
-         print(f"   Query embedding shape: {query_embedding.shape}")
-
-         # Calculate similarities with all passages
-         similarities = []
-         for i, passage_embedding in enumerate(embeddings):
-             # Cosine similarity
-             similarity = np.dot(query_embedding, passage_embedding) / (
-                 np.linalg.norm(query_embedding) * np.linalg.norm(passage_embedding)
-             )
-             similarities.append((i, similarity, passage_texts[i]))
-
-         # Sort by similarity
-         similarities.sort(key=lambda x: x[1], reverse=True)
-
-         print(f"   ✓ Calculated similarities with {len(similarities)} passages")
-         print(f"   Top 3 most similar passages:")
-         for i, (idx, sim, text) in enumerate(similarities[:3]):
-             print(f"     [{i+1}] Similarity: {sim:.4f}")
-             print(f"         Text: {text[:80]}...")
-
-         # Step 6: Test generation
-         print("\n6. Testing generation...")
-         from generator import SafeGenerator, PromptTemplates
-
-         templates = PromptTemplates()
-         generator = SafeGenerator(None, None, 0.3, 0.7)  # Simplified version
-
-         test_query = "What is machine learning?"
-         retrieved_passages = retriever.retrieve_single(test_query, k=3)
-
-         print(f"   Query: '{test_query}'")
-         print(f"   Retrieved {len(retrieved_passages)} passages")
-
-         # Generate answer
-         start_time = time.time()
-         result = generator.generate_with_strategy(test_query, retrieved_passages)
-         generation_time = time.time() - start_time
-
-         print(f"   ✓ Generated answer in {generation_time:.3f}s")
-         print(f"   Answer: {result['answer'][:200]}...")
-         print(f"   Risk Score: {result['risk_score']:.3f}")
-         print(f"   Strategy: {result['strategy']}")
-
-         print("\n" + "=" * 50)
-         print("🎉 Real embedding pipeline test completed successfully!")
-         print("\nPipeline Summary:")
-         print(f"- Data Loading: {len(kb_passages)} passages")
-         print(f"- Real Embedding Generation: {embeddings.shape[0]} vectors ({embeddings.shape[1]}D)")
-         print(f"- Index Building: {index.index.ntotal} indexed vectors")
-         print(f"- Retrieval: {len(test_queries)} test queries")
-         print(f"- Similarity Calculation: Cosine similarity with all passages")
-         print(f"- Generation: Risk-aware answer generation")
-
-         return True
-
-     except Exception as e:
-         print(f"\n❌ Pipeline test failed: {e}")
-         import traceback
-         traceback.print_exc()
-         return False
-
- def test_embedding_quality():
-     """Test embedding quality and properties"""
-     print("\n" + "=" * 50)
-     print("Testing Embedding Quality")
-     print("=" * 50)
-
-     try:
-         from retriever import Embedder
-
-         # Initialize embedder
-         embedder = Embedder(model_name="all-MiniLM-L6-v2", device="cpu")
-
-         # Test texts
-         test_texts = [
-             "Machine learning is a subset of artificial intelligence",
-             "The capital of France is Paris",
-             "Python is a programming language",
-             "Machine learning algorithms learn from data",  # Similar to first
-             "Paris is the capital city of France",  # Similar to second
-         ]
-
-         print("1. Generating embeddings for test texts...")
-         embeddings = embedder.encode(test_texts)
-         print(f"   ✓ Generated {embeddings.shape[0]} embeddings")
-
-         print("\n2. Testing similarity between related texts...")
-
-         # Test similarity between related texts
-         pairs = [
-             (0, 3, "Machine learning texts"),
-             (1, 4, "France/Paris texts"),
-         ]
-
-         for i, j, description in pairs:
-             sim = np.dot(embeddings[i], embeddings[j]) / (
-                 np.linalg.norm(embeddings[i]) * np.linalg.norm(embeddings[j])
-             )
-             print(f"   {description}: {sim:.4f}")
210
- print(f" Text 1: {test_texts[i]}")
211
- print(f" Text 2: {test_texts[j]}")
212
-
213
- print("\n3. Testing embedding properties...")
214
-
215
- # Check if embeddings are normalized
216
- norms = [np.linalg.norm(emb) for emb in embeddings]
217
- print(f" ✓ Embedding norms: {[f'{n:.4f}' for n in norms]}")
218
-
219
- # Check embedding statistics
220
- all_embeddings = embeddings.flatten()
221
- print(f" ✓ All embedding values:")
222
- print(f" - Mean: {np.mean(all_embeddings):.4f}")
223
- print(f" - Std: {np.std(all_embeddings):.4f}")
224
- print(f" - Min: {np.min(all_embeddings):.4f}")
225
- print(f" - Max: {np.max(all_embeddings):.4f}")
226
-
227
- print("\n✅ Embedding quality test completed!")
228
- return True
229
-
230
- except Exception as e:
231
- print(f"\n❌ Embedding quality test failed: {e}")
232
- import traceback
233
- traceback.print_exc()
234
- return False
235
-
236
- def main():
237
- """Run all tests"""
238
- print("SafeRAG Real Embedding Test Suite")
239
- print("=" * 60)
240
-
241
- success = True
242
-
243
- # Test embedding quality
244
- if not test_embedding_quality():
245
- success = False
246
-
247
- # Test real embedding pipeline
248
- if not test_real_embedding_pipeline():
249
- success = False
250
-
251
- print("\n" + "=" * 60)
252
- if success:
253
- print("🎉 All real embedding tests passed!")
254
- print("\nThe system can now:")
255
- print("1. ✅ Load data from knowledge base")
256
- print("2. ✅ Generate real embeddings using sentence-transformers")
257
- print("3. ✅ Build FAISS index with real embeddings")
258
- print("4. ✅ Retrieve relevant passages using real similarity")
259
- print("5. ✅ Calculate cosine similarity between queries and passages")
260
- print("6. ✅ Generate answers based on retrieved passages")
261
- print("7. ✅ Assess embedding quality and properties")
262
- else:
263
- print("❌ Some tests failed. Please check the errors above.")
264
-
265
- return success
266
-
267
- if __name__ == "__main__":
268
- success = main()
269
- sys.exit(0 if success else 1)
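The deleted test above computes query–passage cosine similarity by hand with `np.dot` and `np.linalg.norm`. A self-contained sketch of that exact calculation, on toy vectors so no model download is needed:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, matching the formula used in the deleted test."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 1.0])
p_close = np.array([1.0, 0.1, 1.0])   # nearly parallel to q
p_far = np.array([-1.0, 0.0, -1.0])   # exactly opposite direction

assert abs(cosine_similarity(q, q) - 1.0) < 1e-9        # identical vectors -> 1
assert cosine_similarity(q, p_close) > 0.99             # near-parallel -> close to 1
assert abs(cosine_similarity(q, p_far) + 1.0) < 1e-9    # opposite -> -1
```

Note the test later sorts these scores descending, so the range matters: for raw model embeddings the value lies in [-1, 1], not [0, 1].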
 
requirements.txt DELETED
@@ -1,19 +0,0 @@
- torch>=2.0.0
- transformers>=4.35.0
- datasets>=2.14.0
- vllm>=0.2.0
- faiss-cpu>=1.7.4
- sentence-transformers>=2.2.2
- scikit-learn>=1.3.0
- numpy>=1.24.0
- pandas>=2.0.0
- tqdm>=4.65.0
- gradio>=4.0.0
- accelerate>=0.24.0
- evaluate>=0.4.0
- rouge-score>=0.1.2
- nltk>=3.8.0
- spacy>=3.7.0
- matplotlib>=3.7.0
- seaborn>=0.12.0
- wandb>=0.15.0
 
retriever/__init__.py DELETED
@@ -1,6 +0,0 @@
- from .embedder import Embedder
- from .faiss_index import FAISSIndex
- from .retriever import Retriever
- from .reranker import Reranker
-
- __all__ = ['Embedder', 'FAISSIndex', 'Retriever', 'Reranker']
 
retriever/embedder.py DELETED
@@ -1,49 +0,0 @@
- from sentence_transformers import SentenceTransformer
- from typing import List, Union
- import numpy as np
- import logging
-
- logger = logging.getLogger(__name__)
-
- class Embedder:
-     def __init__(self, model_name: str = "BAAI/bge-large-en-v1.5", device: str = "cuda"):
-         self.model_name = model_name
-         self.device = device
-         self.model = SentenceTransformer(model_name, device=device)
-         logger.info(f"Loaded embedding model: {model_name}")
-
-     def encode(self, texts: Union[str, List[str]], batch_size: int = 16) -> np.ndarray:
-         """Encode texts to embeddings"""
-         if isinstance(texts, str):
-             texts = [texts]
-
-         embeddings = self.model.encode(
-             texts,
-             batch_size=batch_size,
-             convert_to_numpy=True,
-             show_progress_bar=len(texts) > 100
-         )
-
-         return embeddings
-
-     def encode_queries(self, queries: List[str], batch_size: int = 16) -> np.ndarray:
-         """Encode queries with the BGE query instruction prefix"""
-         if not queries:
-             return np.array([])
-
-         # BGE models expect this instruction on queries only
-         prefixed_queries = [f"Represent this sentence for searching relevant passages: {q}" for q in queries]
-         return self.encode(prefixed_queries, batch_size)
-
-     def encode_passages(self, passages: List[str], batch_size: int = 16) -> np.ndarray:
-         """Encode passages (BGE models take passages without an instruction prefix)"""
-         if not passages:
-             return np.array([])
-
-         return self.encode(passages, batch_size)
-
-     def get_dimension(self) -> int:
-         """Get embedding dimension"""
-         return self.model.get_sentence_embedding_dimension()
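Not part of the repository, but useful for reasoning about the `Embedder`'s output shape: a toy numpy stand-in for what a sentence encoder does conceptually (mean-pool token vectors, then L2-normalize). The 384-dimension choice here is only an assumption, matching MiniLM-sized models.

```python
import numpy as np

def mean_pool_and_normalize(token_vectors: np.ndarray) -> np.ndarray:
    """Toy sentence encoder: average the token vectors, then L2-normalize."""
    pooled = token_vectors.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(7, 384))        # 7 "tokens", 384-dim (MiniLM-sized)
sentence_vec = mean_pool_and_normalize(tokens)

assert sentence_vec.shape == (384,)                       # one vector per sentence
assert abs(np.linalg.norm(sentence_vec) - 1.0) < 1e-9     # unit length
```

Unit-length output is what lets the index below treat inner product as cosine similarity.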
 
retriever/faiss_index.py DELETED
@@ -1,131 +0,0 @@
- import faiss
- import numpy as np
- import pickle
- import os
- from typing import List, Dict, Any, Tuple
- import logging
-
- logger = logging.getLogger(__name__)
-
- def build_faiss_index(embeddings, texts, metadata=None, index_type="HNSW"):
-     """Factory function, called by the pipeline"""
-     if embeddings is None or not hasattr(embeddings, 'shape') or len(embeddings.shape) != 2 or embeddings.shape[0] == 0:
-         raise ValueError(f"Embeddings is empty or not a 2D array. Got shape: {getattr(embeddings, 'shape', None)}")
-     dimension = embeddings.shape[1]
-     index = FAISSIndex(dimension, index_type=index_type)
-     index.build_index(embeddings, texts, metadata)
-     return index
-
- class FAISSIndex:
-     def __init__(self, dimension: int, index_type: str = "HNSW"):
-         self.dimension = dimension
-         self.index_type = index_type
-         self.index = None
-         self.id_to_text = {}
-         self.id_to_metadata = {}
-         self.next_id = 0
-
-     def build_index(self, embeddings: np.ndarray, texts: List[str],
-                     metadata: List[Dict[str, Any]] = None) -> None:
-         """Build FAISS index from embeddings"""
-         if embeddings.shape[1] != self.dimension:
-             raise ValueError(f"Embedding dimension {embeddings.shape[1]} != {self.dimension}")
-         # Normalize embeddings so inner product equals cosine similarity
-         faiss.normalize_L2(embeddings)
-         if self.index_type == "HNSW":
-             # HNSW index for fast approximate search (inner-product metric to match normalization)
-             self.index = faiss.IndexHNSWFlat(self.dimension, 32, faiss.METRIC_INNER_PRODUCT)  # M=32
-             self.index.hnsw.efConstruction = 200
-             self.index.add(embeddings)
-         elif self.index_type == "IVF":
-             nlist = max(1, min(4096, len(embeddings) // 100))  # at least one list for small corpora
-             quantizer = faiss.IndexFlatIP(self.dimension)
-             self.index = faiss.IndexIVFFlat(quantizer, self.dimension, nlist, faiss.METRIC_INNER_PRODUCT)
-             self.index.train(embeddings)
-             self.index.add(embeddings)
-         else:
-             self.index = faiss.IndexFlatIP(self.dimension)
-             self.index.add(embeddings)
-         # Store text and metadata
-         for i, text in enumerate(texts):
-             self.id_to_text[i] = text
-             if metadata and i < len(metadata):
-                 self.id_to_metadata[i] = metadata[i]
-         logger.info(f"Built FAISS {self.index_type} index with {len(embeddings)} vectors")
-
-     def search(self, query_embeddings: np.ndarray, k: int = 10) -> Tuple[np.ndarray, np.ndarray]:
-         """Search for similar vectors"""
-         if self.index is None:
-             raise ValueError("Index not built yet")
-
-         # Normalize query embeddings
-         faiss.normalize_L2(query_embeddings)
-
-         # Search
-         scores, indices = self.index.search(query_embeddings, k)
-
-         return scores, indices
-
-     def get_texts(self, indices: np.ndarray) -> List[str]:
-         """Get texts by indices"""
-         texts = []
-         for idx in indices.flatten():
-             if idx in self.id_to_text:
-                 texts.append(self.id_to_text[idx])
-             else:
-                 texts.append("")
-         return texts
-
-     def get_metadata(self, indices: np.ndarray) -> List[Dict[str, Any]]:
-         """Get metadata by indices"""
-         metadata = []
-         for idx in indices.flatten():
-             if idx in self.id_to_metadata:
-                 metadata.append(self.id_to_metadata[idx])
-             else:
-                 metadata.append({})
-         return metadata
-
-     def save(self, path: str) -> None:
-         """Save index to disk"""
-         os.makedirs(os.path.dirname(path), exist_ok=True)
-
-         # Save FAISS index
-         faiss.write_index(self.index, f"{path}.faiss")
-
-         # Save metadata
-         with open(f"{path}.pkl", "wb") as f:
-             pickle.dump({
-                 'id_to_text': self.id_to_text,
-                 'id_to_metadata': self.id_to_metadata,
-                 'dimension': self.dimension,
-                 'index_type': self.index_type
-             }, f)
-
-         logger.info(f"Saved index to {path}")
-
-     def load(self, path: str) -> None:
-         """Load index from disk"""
-         # Load FAISS index
-         self.index = faiss.read_index(f"{path}.faiss")
-
-         # Load metadata
-         with open(f"{path}.pkl", "rb") as f:
-             data = pickle.load(f)
-             self.id_to_text = data['id_to_text']
-             self.id_to_metadata = data['id_to_metadata']
-             self.dimension = data['dimension']
-             self.index_type = data['index_type']
-
-         logger.info(f"Loaded index from {path}")
-
-     def get_stats(self) -> Dict[str, Any]:
-         """Get index statistics"""
-         if self.index is None:
-             return {}
-
-         return {
-             'num_vectors': self.index.ntotal,
-             'dimension': self.dimension,
-             'index_type': self.index_type,
-             'is_trained': self.index.is_trained if hasattr(self.index, 'is_trained') else True
-         }
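The fallback `IndexFlatIP` branch above is exhaustive inner-product search; since all vectors are L2-normalized first, the scores are cosine similarities. A minimal numpy sketch of that same computation (not using faiss itself, just the math it performs):

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize each row to unit length, as faiss.normalize_L2 does in place."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def flat_ip_search(corpus: np.ndarray, queries: np.ndarray, k: int):
    """Brute-force inner-product top-k, the computation IndexFlatIP performs."""
    scores = queries @ corpus.T                      # (nq, n) similarity matrix
    indices = np.argsort(-scores, axis=1)[:, :k]     # top-k ids per query, best first
    top_scores = np.take_along_axis(scores, indices, axis=1)
    return top_scores, indices

corpus = l2_normalize(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
query = l2_normalize(np.array([[1.0, 0.9]]))
scores, ids = flat_ip_search(corpus, query, k=2)
assert ids[0, 0] == 2   # the [1, 1] direction is closest to [1, 0.9]
```

Because the rows are unit vectors, `queries @ corpus.T` is exactly the cosine-similarity matrix; higher score means more similar, matching the descending sort in the retriever.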
 
retriever/reranker.py DELETED
@@ -1,46 +0,0 @@
- from sentence_transformers import CrossEncoder
- from typing import List
- import numpy as np
- import logging
-
- logger = logging.getLogger(__name__)
-
- class Reranker:
-     def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2", device: str = "cuda"):
-         self.model_name = model_name
-         self.device = device
-         self.model = CrossEncoder(model_name, device=device)
-         logger.info(f"Loaded reranker model: {model_name}")
-
-     def rerank(self, query: str, passages: List[str], batch_size: int = 32) -> List[float]:
-         """Rerank passages for a query"""
-         if not passages:
-             return []
-
-         # Create query-passage pairs
-         pairs = [(query, passage) for passage in passages]
-
-         # Get relevance scores
-         scores = self.model.predict(pairs, batch_size=batch_size)
-
-         return scores.tolist()
-
-     def rerank_batch(self, queries: List[str], passages_list: List[List[str]],
-                      batch_size: int = 32) -> List[List[float]]:
-         """Rerank passages for multiple queries"""
-         all_scores = []
-
-         for query, passages in zip(queries, passages_list):
-             scores = self.rerank(query, passages, batch_size)
-             all_scores.append(scores)
-
-         return all_scores
-
-     def get_top_k(self, query: str, passages: List[str], k: int = 5) -> List[tuple]:
-         """Get top-k passages with scores"""
-         scores = self.rerank(query, passages)
-
-         # Sort by score
-         ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
-
-         return ranked[:k]
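Once the cross-encoder has produced scores, `get_top_k` is a plain sort over (passage, score) pairs. A dependency-free sketch of that ordering with hand-assigned scores (real scores would come from `CrossEncoder.predict`):

```python
def top_k_by_score(passages, scores, k=2):
    """Mirror of Reranker.get_top_k's ordering: highest score first."""
    ranked = sorted(zip(passages, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]

passages = ["about cats", "about dogs", "about birds"]
scores = [0.1, 0.9, 0.5]          # pretend cross-encoder outputs
top = top_k_by_score(passages, scores, k=2)
assert top == [("about dogs", 0.9), ("about birds", 0.5)]
```

`sorted` is stable, so passages with equal scores keep their original retrieval order.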
 
retriever/retriever.py DELETED
@@ -1,104 +0,0 @@
- from typing import List, Dict, Any, Tuple
- import numpy as np
- from .embedder import Embedder
- from .faiss_index import FAISSIndex
- from .reranker import Reranker
- import logging
-
- logger = logging.getLogger(__name__)
-
- class Retriever:
-     def __init__(self, embedder: Embedder, index: FAISSIndex, reranker: Reranker = None):
-         self.embedder = embedder
-         self.index = index
-         self.reranker = reranker
-
-     def retrieve(self, queries: List[str], k: int = 20,
-                  rerank_k: int = 10) -> List[List[Dict[str, Any]]]:
-         """Retrieve and rerank passages for queries"""
-         if not queries:
-             return []
-
-         # Encode queries
-         query_embeddings = self.embedder.encode_queries(queries)
-
-         # Search index
-         scores, indices = self.index.search(query_embeddings, k)
-
-         # Format results
-         results = []
-         for i, query in enumerate(queries):
-             query_results = []
-             for j, (score, idx) in enumerate(zip(scores[i], indices[i])):
-                 if idx == -1:  # Invalid index
-                     continue
-
-                 text = self.index.id_to_text.get(idx, "")
-                 metadata = self.index.id_to_metadata.get(idx, {})
-
-                 query_results.append({
-                     'text': text,
-                     'score': float(score),
-                     'rank': j + 1,
-                     'metadata': metadata,
-                     'id': idx
-                 })
-
-             results.append(query_results)
-
-         # Rerank if reranker is available
-         if self.reranker and rerank_k < k:
-             reranked_results = []
-             for i, query in enumerate(queries):
-                 passages = [r['text'] for r in results[i][:k]]
-                 rerank_scores = self.reranker.rerank(query, passages)
-
-                 # Reorder results based on rerank scores
-                 reranked = sorted(
-                     zip(results[i][:k], rerank_scores),
-                     key=lambda x: x[1],
-                     reverse=True
-                 )
-
-                 reranked_results.append([
-                     {**result, 'rerank_score': score, 'rank': j + 1}
-                     for j, (result, score) in enumerate(reranked[:rerank_k])
-                 ])
-
-             results = reranked_results
-
-         return results
-
-     def retrieve_single(self, query: str, k: int = 10) -> List[Dict[str, Any]]:
-         """Retrieve for a single query"""
-         results = self.retrieve([query], k)
-         return results[0] if results else []
-
-     def batch_retrieve(self, queries: List[str], batch_size: int = 32,
-                        k: int = 10) -> List[List[Dict[str, Any]]]:
-         """Retrieve for multiple queries in batches"""
-         all_results = []
-
-         for i in range(0, len(queries), batch_size):
-             batch_queries = queries[i:i + batch_size]
-             batch_results = self.retrieve(batch_queries, k)
-             all_results.extend(batch_results)
-
-         return all_results
-
-     def get_retrieval_stats(self, queries: List[str], k: int = 10) -> Dict[str, Any]:
-         """Get retrieval statistics"""
-         results = self.retrieve(queries, k)
-
-         scores = []
-         for query_results in results:
-             scores.extend([r['score'] for r in query_results])
-
-         return {
-             'num_queries': len(queries),
-             'avg_scores': np.mean(scores) if scores else 0,
-             'std_scores': np.std(scores) if scores else 0,
-             'min_scores': np.min(scores) if scores else 0,
-             'max_scores': np.max(scores) if scores else 0,
-             'avg_results_per_query': np.mean([len(r) for r in results])
-         }
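`batch_retrieve` chunks the query list with the standard slice-by-stride pattern. Isolated, the pattern looks like this (a sketch; Python slicing handles the short final batch automatically):

```python
def batches(items, batch_size):
    """The slicing pattern batch_retrieve uses to chunk queries."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

queries = ["q1", "q2", "q3", "q4", "q5"]
assert batches(queries, 2) == [["q1", "q2"], ["q3", "q4"], ["q5"]]
assert batches([], 3) == []   # empty input yields no batches
```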
 
simple_e2e_test.py DELETED
@@ -1,518 +0,0 @@
- #!/usr/bin/env python3
- # -*- coding: utf-8 -*-
- """
- SafeRAG Simple End-to-End Test
- Complete workflow test without external dependencies
- """
-
- import sys
- import os
- import time
- import random
- import math
-
- # Add project root to path
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
-
- def test_basic_functionality():
-     """Test basic Python functionality"""
-     print("Testing basic functionality...")
-
-     try:
-         # Test basic operations
-         assert 1 + 1 == 2, "Basic math failed"
-         assert "hello" + " " + "world" == "hello world", "String concatenation failed"
-         assert len([1, 2, 3]) == 3, "List length failed"
-         print("+ Basic Python operations work")
-
-         # Test random number generation
-         random.seed(42)
-         rand_num = random.random()
-         assert 0 <= rand_num <= 1, "Random number out of range"
-         print("+ Random number generation works")
-
-         return True
-     except Exception as e:
-         print("✗ Basic functionality test failed:", e)
-         return False
-
- def test_text_processing():
-     """Test text processing functionality"""
-     print("\nTesting text processing...")
-
-     try:
-         # Simple text cleaning
-         def clean_text(text):
-             if not text:
-                 return ""
-             # Remove extra whitespace
-             import re
-             text = re.sub(r'\s+', ' ', text)
-             # Remove special characters but keep punctuation
-             text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)]', '', text)
-             return text.strip()
-
-         # Test text cleaning
-         test_text = "   This is a   test text!!!   "
-         cleaned = clean_text(test_text)
-         expected = "This is a test text!!!"
-         assert cleaned == expected, "Text cleaning failed: got '{}', expected '{}'".format(cleaned, expected)
-         print("+ Text cleaning works")
-
-         # Test sentence extraction
-         def extract_sentences(text):
-             sentences = text.split('.')
-             return [clean_text(s) for s in sentences if s.strip()]
-
-         test_text = "First sentence. Second sentence. Third sentence."
-         sentences = extract_sentences(test_text)
-         assert len(sentences) == 3, "Sentence extraction failed: got {} sentences, expected 3".format(len(sentences))
-         print("+ Sentence extraction works")
-
-         return True
-     except Exception as e:
-         print("✗ Text processing test failed:", e)
-         return False
-
- def test_simple_embeddings():
-     """Test simple embedding simulation"""
-     print("\nTesting simple embeddings...")
-
-     try:
-         # Simple embedding simulation using random numbers
-         def create_simple_embeddings(texts, dim=10):
-             """Create simple random embeddings for testing"""
-             random.seed(42)  # For reproducibility
-             embeddings = []
-             for text in texts:
-                 embedding = [random.random() for _ in range(dim)]
-                 # Simple normalization
-                 norm = math.sqrt(sum(x*x for x in embedding))
-                 if norm > 0:
-                     embedding = [x/norm for x in embedding]
-                 embeddings.append(embedding)
-             return embeddings
-
-         # Test embedding creation
-         texts = ["This is a test", "Another test sentence"]
-         embeddings = create_simple_embeddings(texts)
-         assert len(embeddings) == 2, "Wrong number of embeddings"
-         assert len(embeddings[0]) == 10, "Wrong embedding dimension"
-         print("+ Simple embedding creation works")
-
-         # Test similarity calculation
-         def cosine_similarity(a, b):
-             dot_product = sum(x * y for x, y in zip(a, b))
-             norm_a = math.sqrt(sum(x*x for x in a))
-             norm_b = math.sqrt(sum(x*x for x in b))
-             if norm_a == 0 or norm_b == 0:
-                 return 0
-             return dot_product / (norm_a * norm_b)
-
-         sim = cosine_similarity(embeddings[0], embeddings[1])
-         assert 0 <= sim <= 1, "Similarity score out of range: {}".format(sim)
-         print("+ Similarity calculation works")
-
-         return True
-     except Exception as e:
-         print("✗ Simple embeddings test failed:", e)
-         return False
-
- def test_simple_retrieval():
-     """Test simple retrieval functionality"""
-     print("\nTesting simple retrieval...")
-
-     try:
-         # Simple retrieval simulation
-         class SimpleRetriever:
-             def __init__(self, passages, embeddings):
-                 self.passages = passages
-                 self.embeddings = embeddings
-
-             def search(self, query_embedding, k=5):
-                 # Calculate similarities
-                 similarities = []
-                 for embedding in self.embeddings:
-                     sim = sum(x * y for x, y in zip(embedding, query_embedding))
-                     similarities.append(sim)
-
-                 # Get top-k indices
-                 indexed_sims = [(i, sim) for i, sim in enumerate(similarities)]
-                 indexed_sims.sort(key=lambda x: x[1], reverse=True)
-                 top_indices = [i for i, _ in indexed_sims[:k]]
-
-                 # Return results
-                 results = []
-                 for i, idx in enumerate(top_indices):
-                     results.append({
-                         'text': self.passages[idx],
-                         'score': similarities[idx],
-                         'rank': i + 1
-                     })
-                 return results
-
-         # Create test data
-         passages = [
-             "Machine learning is a subset of artificial intelligence.",
-             "Deep learning uses neural networks with multiple layers.",
-             "Natural language processing deals with text and speech.",
-             "Computer vision focuses on image and video analysis."
-         ]
-
-         # Create simple embeddings
-         def create_simple_embeddings(texts, dim=10):
-             random.seed(42)
-             embeddings = []
-             for text in texts:
-                 embedding = [random.random() for _ in range(dim)]
-                 norm = math.sqrt(sum(x*x for x in embedding))
-                 if norm > 0:
-                     embedding = [x/norm for x in embedding]
-                 embeddings.append(embedding)
-             return embeddings
-
-         embeddings = create_simple_embeddings(passages)
-
-         # Test retrieval
-         retriever = SimpleRetriever(passages, embeddings)
-         query_embedding = [random.random() for _ in range(10)]
-         norm = math.sqrt(sum(x*x for x in query_embedding))
-         if norm > 0:
-             query_embedding = [x/norm for x in query_embedding]
-
-         results = retriever.search(query_embedding, k=3)
-         assert len(results) == 3, "Retrieval returned wrong number of results: {}".format(len(results))
-         assert all('text' in r and 'score' in r for r in results), "Retrieval results missing fields"
-         print("+ Simple retrieval works")
-
-         return True
-     except Exception as e:
-         print("✗ Simple retrieval test failed:", e)
-         return False
-
- def test_risk_calibration():
-     """Test risk calibration functionality"""
-     print("\nTesting risk calibration...")
-
-     try:
-         # Simple risk feature extraction
-         def extract_risk_features(question, retrieved_passages):
-             features = {}
-
-             if not retrieved_passages:
-                 return {'num_passages': 0, 'avg_similarity': 0.0, 'diversity': 0.0}
-
-             # Basic features
-             features['num_passages'] = len(retrieved_passages)
-             scores = [p['score'] for p in retrieved_passages]
-             features['avg_similarity'] = sum(scores) / len(scores)
-             features['max_similarity'] = max(scores)
-             features['min_similarity'] = min(scores)
-
-             # Simple diversity calculation
-             if len(scores) > 1:
-                 mean_score = features['avg_similarity']
-                 variance = sum((x - mean_score) ** 2 for x in scores) / len(scores)
-                 features['diversity'] = 1.0 - math.sqrt(variance)
-             else:
-                 features['diversity'] = 1.0
-
-             return features
-
-         # Simple risk prediction
-         def predict_risk(features):
-             # Simple heuristic for risk scoring
-             risk_score = 0.0
-
-             # Few passages = higher risk
-             if features['num_passages'] < 3:
-                 risk_score += 0.3
-
-             # Low similarity = higher risk
-             if features['avg_similarity'] < 0.5:
-                 risk_score += 0.2
-
-             # Low diversity = higher risk
-             if features['diversity'] < 0.3:
-                 risk_score += 0.2
-
-             return min(1.0, risk_score)
-
-         # Test risk feature extraction
-         question = "What is machine learning?"
-         passages = [
-             {'text': 'ML is AI subset', 'score': 0.8},
-             {'text': 'Neural networks are used', 'score': 0.7},
-             {'text': 'Deep learning is popular', 'score': 0.6}
-         ]
-
-         features = extract_risk_features(question, passages)
-         assert 'num_passages' in features, "Missing num_passages feature"
-         assert features['num_passages'] == 3, "Wrong number of passages: {}".format(features['num_passages'])
-         print("+ Risk feature extraction works")
-
-         # Test risk prediction
-         risk_score = predict_risk(features)
-         assert 0 <= risk_score <= 1, "Risk score out of range: {}".format(risk_score)
-         print("+ Risk prediction works")
-
-         return True
-     except Exception as e:
-         print("✗ Risk calibration test failed:", e)
-         return False
-
- def test_generation():
-     """Test generation functionality"""
-     print("\nTesting generation...")
-
-     try:
-         # Simple generation simulation
-         def generate_answer(question, retrieved_passages, risk_score):
-             # Simple template-based generation
-             context = " ".join([p['text'] for p in retrieved_passages[:3]])
-
-             if risk_score < 0.3:
-                 # Low risk: confident answer
-                 answer = "Based on the information: {}. The answer is: {}.".format(
-                     context, "This is a confident answer."
-                 )
-             elif risk_score < 0.7:
-                 # Medium risk: cautious answer
-                 answer = "Based on the available information: {}. The answer might be: {}.".format(
-                     context, "This is a cautious answer."
-                 )
-             else:
-                 # High risk: uncertain answer
-                 answer = "The available information: {} is limited. I'm not certain, but it might be: {}.".format(
-                     context, "This is an uncertain answer."
-                 )
-
-             return answer
-
-         # Test generation
-         question = "What is machine learning?"
-         passages = [
-             {'text': 'Machine learning is AI subset', 'score': 0.8},
-             {'text': 'It uses algorithms', 'score': 0.7}
-         ]
-
-         # Test different risk levels
-         for risk_score in [0.2, 0.5, 0.8]:
-             answer = generate_answer(question, passages, risk_score)
-             assert len(answer) > 0, "Empty answer generated"
-             assert "machine learning" in answer.lower() or "ai" in answer.lower(), "Answer doesn't address question"
-
-         print("+ Generation works")
-
-         return True
-     except Exception as e:
-         print("✗ Generation test failed:", e)
-         return False
-
- def test_evaluation():
-     """Test evaluation functionality"""
-     print("\nTesting evaluation...")
-
-     try:
-         # Simple evaluation metrics
-         def exact_match(prediction, reference):
-             return prediction.lower().strip() == reference.lower().strip()
-
-         def f1_score(prediction, reference):
-             pred_words = set(prediction.lower().split())
-             ref_words = set(reference.lower().split())
-
-             if len(ref_words) == 0:
-                 return 1.0 if len(pred_words) == 0 else 0.0
-
-             common = pred_words & ref_words
-             precision = len(common) / len(pred_words) if pred_words else 0.0
-             recall = len(common) / len(ref_words)
-
-             if precision + recall == 0:
-                 return 0.0
-
-             return 2 * precision * recall / (precision + recall)
-
-         # Test evaluation
-         predictions = ["Machine learning is AI", "Deep learning uses neural networks"]
-         references = ["Machine learning is AI", "Deep learning uses neural networks"]
-
-         # Test exact match
-         em_scores = [exact_match(p, r) for p, r in zip(predictions, references)]
-         assert all(em_scores), "Exact match failed"
-         print("+ Exact match evaluation works")
-
-         # Test F1 score
-         f1_scores = [f1_score(p, r) for p, r in zip(predictions, references)]
-         assert all(0 <= score <= 1 for score in f1_scores), "F1 scores out of range"
-         print("+ F1 score evaluation works")
-
-         return True
-     except Exception as e:
-         print("✗ Evaluation test failed:", e)
-         return False
-
- def test_end_to_end_workflow():
-     """Test complete end-to-end workflow"""
-     print("\nTesting end-to-end workflow...")
-
-     try:
-         # Simulate complete RAG pipeline
-         def rag_pipeline(question):
-             # Step 1: Create simple embeddings
-             passages = [
-                 "Machine learning is a subset of artificial intelligence.",
-                 "Deep learning uses neural networks with multiple layers.",
-                 "Natural language processing deals with text and speech.",
-                 "Computer vision focuses on image and video analysis."
-             ]
-
-             # Simulate embeddings
-             random.seed(42)
-             embeddings = []
-             for passage in passages:
-                 embedding = [random.random() for _ in range(10)]
-                 norm = math.sqrt(sum(x*x for x in embedding))
-                 if norm > 0:
-                     embedding = [x/norm for x in embedding]
-                 embeddings.append(embedding)
-
-             # Step 2: Retrieve relevant passages
-             query_embedding = [random.random() for _ in range(10)]
-             norm = math.sqrt(sum(x*x for x in query_embedding))
-             if norm > 0:
-                 query_embedding = [x/norm for x in query_embedding]
-
-             similarities = []
-             for embedding in embeddings:
-                 sim = sum(x * y for x, y in zip(embedding, query_embedding))
-                 similarities.append(sim)
-
-             indexed_sims = [(i, sim) for i, sim in enumerate(similarities)]
-             indexed_sims.sort(key=lambda x: x[1], reverse=True)
-             top_indices = [i for i, _ in indexed_sims[:3]]
-
-             retrieved_passages = []
-             for i, idx in enumerate(top_indices):
-                 retrieved_passages.append({
-                     'text': passages[idx],
-                     'score': similarities[idx],
-                     'rank': i + 1
-                 })
-
-             # Step 3: Extract risk features
-             scores = [p['score'] for p in retrieved_passages]
-             features = {
-                 'num_passages': len(retrieved_passages),
-                 'avg_similarity': sum(scores) / len(scores) if scores else 0.0,
-                 'diversity': 1.0 - math.sqrt(sum((x - sum(scores)/len(scores))**2 for x in scores) / len(scores)) if len(scores) > 1 else 1.0
-             }
-
-             # Step 4: Predict risk
-             risk_score = 0.0
-             if features['num_passages'] < 3:
-                 risk_score += 0.3
-             if features['avg_similarity'] < 0.5:
-                 risk_score += 0.2
-             if features['diversity'] < 0.3:
-                 risk_score += 0.2
-             risk_score = min(1.0, risk_score)
-
-             # Step 5: Generate answer
-             context = " ".join([p['text'] for p in retrieved_passages[:3]])
-             if risk_score < 0.3:
-                 answer = "Based on the information: {}. The answer is: Machine learning is a subset of AI.".format(context)
-             elif risk_score < 0.7:
-                 answer = "Based on the available information: {}. The answer might be: Machine learning is likely a subset of AI.".format(context)
-             else:
-                 answer = "The available information: {} is limited. I'm not certain, but it might be: Machine learning could be related to AI.".format(context)
-
-             return {
-                 'question': question,
-                 'answer': answer,
-                 'retrieved_passages': retrieved_passages,
-                 'risk_score': risk_score,
-                 'features': features
-             }
-
-         # Test complete pipeline
-         question = "What is machine learning?"
-         result = rag_pipeline(question)
-
-         # Validate result
-         assert 'question' in result, "Missing question in result"
-         assert 'answer' in result, "Missing answer in result"
-         assert 'retrieved_passages' in result, "Missing retrieved passages"
-         assert 'risk_score' in result, "Missing risk score"
-         assert 'features' in result, "Missing features"
-
-         assert result['question'] == question, "Question not preserved"
-         assert len(result['answer']) > 0, "Empty answer"
-         assert len(result['retrieved_passages']) > 0, "No retrieved passages"
-         assert 0 <= result['risk_score'] <= 1, "Risk score out of range: {}".format(result['risk_score'])
-
-         print("+ End-to-end workflow works")
-         print("  Question: {}".format(result['question']))
-         print("  Answer: {}".format(result['answer'][:100] + "..."))
-         print("  Risk Score: {:.3f}".format(result['risk_score']))
-         print("  Retrieved Passages: {}".format(len(result['retrieved_passages'])))
-
-         return True
-     except Exception as e:
-         print("✗ End-to-end workflow test failed:", e)
-         return False
-
- def main():
-     """Run all end-to-end tests"""
-     print("SafeRAG Simple End-to-End Test Suite")
-     print("=" * 50)
-
-     start_time = time.time()
-
-     tests = [
-         test_basic_functionality,
-         test_text_processing,
-         test_simple_embeddings,
477
- test_simple_retrieval,
478
- test_risk_calibration,
479
- test_generation,
480
- test_evaluation,
481
- test_end_to_end_workflow
482
- ]
483
-
484
- passed = 0
485
- total = len(tests)
486
-
487
- for test in tests:
488
- try:
489
- if test():
490
- passed += 1
491
- except Exception as e:
492
- print("✗ Test {} failed with exception: {}".format(test.__name__, e))
493
-
494
- end_time = time.time()
495
-
496
- print("\n" + "=" * 50)
497
- print("Test Results:")
498
- print("Passed: {}/{}".format(passed, total))
499
- print("Time: {:.2f} seconds".format(end_time - start_time))
500
-
501
- if passed == total:
502
- print("✓ All tests passed! SafeRAG end-to-end workflow is working.")
503
- print("\nThe system can:")
504
- print("- Process text and extract sentences")
505
- print("- Create simple embeddings and calculate similarities")
506
- print("- Retrieve relevant passages based on similarity")
507
- print("- Extract risk features and predict risk scores")
508
- print("- Generate answers with different risk-aware strategies")
509
- print("- Evaluate answers using standard metrics")
510
- print("- Run complete end-to-end RAG pipeline")
511
- return True
512
- else:
513
- print("✗ Some tests failed. Please check the errors above.")
514
- return False
515
-
516
- if __name__ == "__main__":
517
- success = main()
518
- sys.exit(0 if success else 1)
simple_test.py DELETED
@@ -1,167 +0,0 @@
- #!/usr/bin/env python3
- # -*- coding: utf-8 -*-
- """
- Simple SafeRAG Test
- Basic functionality test without complex dependencies
- """
-
- import sys
- import os
- sys.path.append(os.path.dirname(os.path.abspath(__file__)))
-
- def test_imports():
-     """Test that all modules can be imported"""
-     print("Testing imports...")
-
-     try:
-         from data_processing import DataLoader, Preprocessor
-         print("+ DataLoader and Preprocessor imported successfully")
-     except Exception as e:
-         print("✗ Failed to import DataLoader/Preprocessor:", e)
-         return False
-
-     try:
-         from retriever import Embedder, FAISSIndex, Retriever, Reranker
-         print("+ Retriever modules imported successfully")
-     except Exception as e:
-         print("✗ Failed to import retriever modules:", e)
-         return False
-
-     try:
-         from generator import VLLMServer, SafeGenerator, PromptTemplates
-         print("+ Generator modules imported successfully")
-     except Exception as e:
-         print("✗ Failed to import generator modules:", e)
-         return False
-
-     try:
-         from calibration import RiskFeatureExtractor, CalibrationHead
-         print("+ Calibration modules imported successfully")
-     except Exception as e:
-         print("✗ Failed to import calibration modules:", e)
-         return False
-
-     try:
-         from eval import QAEvaluator, AttributionEvaluator, CalibrationEvaluator
-         print("+ Evaluation modules imported successfully")
-     except Exception as e:
-         print("✗ Failed to import evaluation modules:", e)
-         return False
-
-     return True
-
- def test_basic_functionality():
-     """Test basic functionality without heavy dependencies"""
-     print("\nTesting basic functionality...")
-
-     try:
-         # Test Preprocessor
-         from data_processing.preprocessor import Preprocessor
-         preprocessor = Preprocessor()
-
-         # Test text cleaning
-         text = "  This is a test text.  "
-         cleaned = preprocessor.clean_text(text)
-         assert cleaned == "This is a test text.", "Expected 'This is a test text.', got '{}'".format(cleaned)
-         print("+ Text cleaning works")
-
-         # Test sentence extraction
-         text = "First sentence. Second sentence. Third sentence."
-         sentences = preprocessor.extract_sentences(text)
-         assert len(sentences) == 3, "Expected 3 sentences, got {}".format(len(sentences))
-         print("+ Sentence extraction works")
-
-     except Exception as e:
-         print("✗ Preprocessor test failed:", e)
-         return False
-
-     try:
-         # Test PromptTemplates
-         from generator.prompt_templates import PromptTemplates
-         templates = PromptTemplates()
-
-         # Test prompt formatting
-         prompt = templates.format_prompt(
-             'rag',
-             question="What is AI?",
-             context="AI is artificial intelligence."
-         )
-         assert "What is AI?" in prompt, "Question not found in prompt"
-         assert "AI is artificial intelligence." in prompt, "Context not found in prompt"
-         print("+ Prompt templates work")
-
-     except Exception as e:
-         print("✗ PromptTemplates test failed:", e)
-         return False
-
-     try:
-         # Test QAEvaluator
-         from eval.eval_qa import QAEvaluator
-         evaluator = QAEvaluator()
-
-         # Test exact match
-         predictions = ["Paris", "Paris"]
-         references = ["Paris", "London"]
-         em = evaluator.exact_match(predictions, references)
-         assert em == 0.5, "Expected 0.5, got {}".format(em)
-         print("+ QA evaluation works")
-
-     except Exception as e:
-         print("✗ QAEvaluator test failed:", e)
-         return False
-
-     return True
-
- def test_config():
-     """Test configuration loading"""
-     print("\nTesting configuration...")
-
-     try:
-         import yaml
-         with open('config.yaml', 'r') as f:
-             config = yaml.safe_load(f)
-
-         # Check required sections
-         required_sections = ['models', 'data', 'index', 'retrieval', 'calibration', 'evaluation']
-         for section in required_sections:
-             assert section in config, "Missing config section: {}".format(section)
-
-         print("+ Configuration file is valid")
-         return True
-
-     except Exception as e:
-         print("✗ Configuration test failed:", e)
-         return False
-
- def main():
-     """Run all tests"""
-     print("SafeRAG Simple Test Suite")
-     print("=" * 40)
-
-     all_passed = True
-
-     # Test imports
-     if not test_imports():
-         all_passed = False
-
-     # Test basic functionality
-     if not test_basic_functionality():
-         all_passed = False
-
-     # Test configuration
-     if not test_config():
-         all_passed = False
-
-     print("\n" + "=" * 40)
-     if all_passed:
-         print("+ All tests passed!")
-         print("SafeRAG is ready to use.")
-     else:
-         print("✗ Some tests failed.")
-         print("Please check the errors above.")
-
-     return all_passed
-
- if __name__ == "__main__":
-     success = main()
-     sys.exit(0 if success else 1)