# Evaluation Specification: ScamShield AI

## Metrics, Computation Methods, and Testing Framework

**Version:** 1.0
**Date:** January 26, 2026
**Owner:** QA & Evaluation Team
**Related Documents:** FRD.md, DATA_SPEC.md, API_CONTRACT.md

---

## TABLE OF CONTENTS

1. [Evaluation Overview](#evaluation-overview)
2. [Detection Metrics](#detection-metrics)
3. [Extraction Metrics](#extraction-metrics)
4. [Engagement Metrics](#engagement-metrics)
5. [Performance Metrics](#performance-metrics)
6. [Computation Methods](#computation-methods)
7. [Testing Framework](#testing-framework)
8. [Competition Scoring (Predicted)](#competition-scoring-predicted)

---

## EVALUATION OVERVIEW

### Evaluation Objectives

1. **Functional Correctness:** System meets FRD requirements
2. **Performance:** Response time and throughput within SLAs
3. **Quality:** Detection accuracy, extraction precision/recall
4. **Robustness:** Handles edge cases and adversarial inputs
5. **Competition Readiness:** Meets judging criteria

### Evaluation Phases

| Phase | Timeline | Focus | Pass Criteria |
|-------|----------|-------|---------------|
| **Unit Testing** | Days 3-9 | Individual components | >80% code coverage |
| **Integration Testing** | Day 8 | End-to-end flows | All API endpoints functional |
| **Performance Testing** | Day 9 | Load, latency | <2s p95 latency, 100 req/min |
| **Acceptance Testing** | Day 10 | Requirements validation | All FRD acceptance criteria met |
| **Red Team Testing** | Day 10 | Adversarial scenarios | >80% red team tests passed |
| **Pre-Submission** | Day 11 | Final validation | >90% detection accuracy |

---

## DETECTION METRICS

### Metric 1: Scam Detection Accuracy

**Definition:** Proportion of messages correctly classified as scam or legitimate.
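For intuition, a quick worked example using hypothetical confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a 100-message test batch
tp, tn, fp, fn = 45, 42, 5, 8  # 87 of 100 messages classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 87.00%
```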
**Formula:**

```
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:
- TP (True Positives): Scams correctly identified
- TN (True Negatives): Legitimate messages correctly identified
- FP (False Positives): Legitimate messages incorrectly flagged as scams
- FN (False Negatives): Scams missed
```

**Target:** ≥90%

**Computation:**

```python
from typing import List


def compute_detection_accuracy(predictions: List[dict],
                               ground_truth: List[dict]) -> float:
    """
    Compute scam detection accuracy.

    Args:
        predictions: List of {"id": str, "scam_detected": bool}
        ground_truth: List of {"id": str, "label": "scam"|"legitimate"}

    Returns:
        Accuracy score (0.0-1.0)
    """
    assert len(predictions) == len(ground_truth), "Mismatched lengths"

    # Align by ID
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}

    correct = sum(1 for id in pred_map if pred_map[id] == gt_map[id])
    total = len(pred_map)

    return correct / total if total > 0 else 0.0


# Example usage
predictions = [
    {"id": "test_001", "scam_detected": True},
    {"id": "test_002", "scam_detected": False},
    {"id": "test_003", "scam_detected": True}
]
ground_truth = [
    {"id": "test_001", "label": "scam"},
    {"id": "test_002", "label": "legitimate"},
    {"id": "test_003", "label": "scam"}
]

accuracy = compute_detection_accuracy(predictions, ground_truth)
print(f"Accuracy: {accuracy:.2%}")  # Expected: 100%
```

---

### Metric 2: Precision

**Definition:** Of all messages flagged as scams, what proportion are actual scams?

**Formula:**

```
Precision = TP / (TP + FP)
```

**Target:** ≥85%

**Significance:** High precision minimizes false alarms (legitimate messages flagged as scams).
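To make that concrete, a toy batch with hypothetical counts shows how a detector can score well on precision while still missing scams (which recall, the next metric, captures):

```python
# Hypothetical: 50 messages flagged, 45 are real scams, 10 real scams missed
tp, fp, fn = 45, 5, 10

precision = tp / (tp + fp)  # 0.9  — only 1 in 10 flags is a false alarm
recall = tp / (tp + fn)     # ~0.82 — yet nearly 1 in 5 scams slips through
print(round(precision, 2), round(recall, 2))  # 0.9 0.82
```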
**Computation:**

```python
def compute_precision(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute precision for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}

    tp = sum(1 for id in pred_map if pred_map[id] and gt_map[id])
    fp = sum(1 for id in pred_map if pred_map[id] and not gt_map[id])

    return tp / (tp + fp) if (tp + fp) > 0 else 0.0
```

---

### Metric 3: Recall (Sensitivity)

**Definition:** Of all actual scams, what proportion are detected?

**Formula:**

```
Recall = TP / (TP + FN)
```

**Target:** ≥90%

**Significance:** High recall ensures few scams are missed.

**Computation:**

```python
def compute_recall(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute recall for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}

    tp = sum(1 for id in pred_map if pred_map[id] and gt_map[id])
    fn = sum(1 for id in pred_map if not pred_map[id] and gt_map[id])

    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```

---

### Metric 4: F1-Score

**Definition:** Harmonic mean of precision and recall.

**Formula:**

```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

**Target:** ≥87%

**Computation:**

```python
def compute_f1_score(precision: float, recall: float) -> float:
    """Compute F1-score from precision and recall"""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
```

---

### Metric 5: Confidence Calibration

**Definition:** How well do confidence scores correlate with actual accuracy?
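Since ECE reduces to a weighted sum over confidence bins, a two-bin hand computation (hypothetical counts) illustrates the idea:

```python
# Hypothetical: 10 predictions split into two confidence bins
# Bin A: 6 predictions, avg confidence 0.90, 5 correct (accuracy ≈ 0.833)
# Bin B: 4 predictions, avg confidence 0.60, 3 correct (accuracy = 0.750)
ece = abs(5 / 6 - 0.90) * (6 / 10) + abs(3 / 4 - 0.60) * (4 / 10)
print(round(ece, 2))  # 0.1 — exactly at the calibration threshold
```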
**Formula:**

```
Expected Calibration Error (ECE) = Σ |accuracy_bin - avg_confidence_bin| × (bin_size / total)

For bins: [0-0.1], [0.1-0.2], ..., [0.9-1.0]
```

**Target:** ECE <0.1 (well-calibrated)

**Computation:**

```python
def compute_ece(predictions: List[dict], ground_truth: List[dict],
                n_bins: int = 10) -> float:
    """
    Compute Expected Calibration Error.

    Predictions must include a "confidence" field.
    """
    pred_map = {p['id']: (p['scam_detected'], p['confidence']) for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}

    bins = [[] for _ in range(n_bins)]

    for id in pred_map:
        pred, conf = pred_map[id]
        actual = gt_map[id]
        correct = (pred == actual)

        bin_idx = min(int(conf * n_bins), n_bins - 1)
        bins[bin_idx].append((conf, correct))

    ece = 0.0
    total = len(pred_map)

    for bin_samples in bins:
        if len(bin_samples) == 0:
            continue

        avg_conf = sum(conf for conf, _ in bin_samples) / len(bin_samples)
        accuracy = sum(1 for _, correct in bin_samples if correct) / len(bin_samples)

        ece += abs(accuracy - avg_conf) * (len(bin_samples) / total)

    return ece
```

---

### Metric 6: Language-Specific Accuracy

**Definition:** Detection accuracy broken down by language.
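For example (hypothetical per-language accuracies), the fairness criterion is simply the spread between the best- and worst-performing languages:

```python
# Hypothetical per-language accuracies
lang_acc = {'en': 0.92, 'hi': 0.89, 'hinglish': 0.88}

gap = max(lang_acc.values()) - min(lang_acc.values())
print(round(gap, 2))  # 0.04 — within a 5% fairness threshold
```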
**Target:**

- English: ≥92%
- Hindi: ≥88%
- Hinglish: ≥85%
- Fairness: <5% difference between languages

**Computation:**

```python
from collections import defaultdict


def compute_language_specific_accuracy(predictions: List[dict],
                                       ground_truth: List[dict]) -> dict:
    """Compute accuracy per language"""
    lang_correct = defaultdict(int)
    lang_total = defaultdict(int)

    pred_map = {p['id']: p for p in predictions}
    gt_map = {g['id']: g for g in ground_truth}

    for id in pred_map:
        lang = gt_map[id]['language']
        pred_scam = pred_map[id]['scam_detected']
        actual_scam = (gt_map[id]['label'] == 'scam')

        lang_total[lang] += 1
        if pred_scam == actual_scam:
            lang_correct[lang] += 1

    return {
        lang: lang_correct[lang] / lang_total[lang] if lang_total[lang] > 0 else 0.0
        for lang in lang_total
    }


# Check fairness
def check_language_fairness(lang_accuracies: dict, threshold: float = 0.05) -> bool:
    """Ensure accuracy difference between languages is within threshold"""
    accuracies = list(lang_accuracies.values())
    max_diff = max(accuracies) - min(accuracies)
    return max_diff < threshold
```

---

## EXTRACTION METRICS

### Metric 7: Extraction Precision (per entity type)

**Definition:** Of all extracted entities, what proportion are correct?

**Formula:**

```
Precision_entity = |Extracted ∩ Ground_Truth| / |Extracted|

For entity types: upi_ids, bank_accounts, ifsc_codes, phone_numbers, phishing_links
```

**Target:**

- UPI IDs: ≥90%
- Bank Accounts: ≥85%
- IFSC Codes: ≥95%
- Phone Numbers: ≥90%
- Phishing Links: ≥95%

**Computation:**

```python
def compute_extraction_precision(extracted: dict, ground_truth: dict) -> dict:
    """
    Compute precision for each entity type.

    Args:
        extracted: {"upi_ids": [...], "bank_accounts": [...], ...}
        ground_truth: Same structure

    Returns:
        {"upi_ids": precision, "bank_accounts": precision, ...}
    """
    precisions = {}

    for entity_type in ['upi_ids', 'bank_accounts', 'ifsc_codes',
                        'phone_numbers', 'phishing_links']:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))

        if len(extracted_set) == 0:
            precisions[entity_type] = 1.0 if len(gt_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            precisions[entity_type] = correct / len(extracted_set)

    return precisions
```

---

### Metric 8: Extraction Recall (per entity type)

**Definition:** Of all actual entities, what proportion are extracted?

**Formula:**

```
Recall_entity = |Extracted ∩ Ground_Truth| / |Ground_Truth|
```

**Target:**

- UPI IDs: ≥85%
- Bank Accounts: ≥80%
- IFSC Codes: ≥90%
- Phone Numbers: ≥85%
- Phishing Links: ≥90%

**Computation:**

```python
def compute_extraction_recall(extracted: dict, ground_truth: dict) -> dict:
    """Compute recall for each entity type"""
    recalls = {}

    for entity_type in ['upi_ids', 'bank_accounts', 'ifsc_codes',
                        'phone_numbers', 'phishing_links']:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))

        if len(gt_set) == 0:
            recalls[entity_type] = 1.0 if len(extracted_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            recalls[entity_type] = correct / len(gt_set)

    return recalls
```

---

### Metric 9: Overall Extraction F1-Score

**Definition:** Weighted average F1-score across all entity types.
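As a quick arithmetic check, using illustrative per-entity F1 values (the weights mirror the ENTITY_WEIGHTS defined in this section):

```python
# Illustrative per-entity F1 scores (hypothetical values)
f1 = {'upi_ids': 0.90, 'bank_accounts': 0.85, 'ifsc_codes': 0.95,
      'phone_numbers': 0.88, 'phishing_links': 0.92}
weights = {'upi_ids': 0.30, 'bank_accounts': 0.30, 'ifsc_codes': 0.20,
           'phone_numbers': 0.10, 'phishing_links': 0.10}

overall = sum(f1[e] * weights[e] for e in weights)
print(round(overall, 3))  # 0.895
```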
**Weights:**

```python
ENTITY_WEIGHTS = {
    'upi_ids': 0.30,
    'bank_accounts': 0.30,
    'ifsc_codes': 0.20,
    'phone_numbers': 0.10,
    'phishing_links': 0.10
}
```

**Target:** ≥85%

**Computation:**

```python
def compute_overall_extraction_f1(precisions: dict, recalls: dict,
                                  weights: dict = ENTITY_WEIGHTS) -> float:
    """Compute weighted F1-score across entity types"""
    f1_scores = {}

    for entity_type in weights:
        p = precisions.get(entity_type, 0.0)
        r = recalls.get(entity_type, 0.0)

        if p + r == 0:
            f1_scores[entity_type] = 0.0
        else:
            f1_scores[entity_type] = 2 * (p * r) / (p + r)

    weighted_f1 = sum(f1_scores[entity] * weights[entity] for entity in weights)
    return weighted_f1
```

---

### Metric 10: Extraction Confidence Accuracy

**Definition:** Correlation between the extraction_confidence score and actual precision.

**Target:** Pearson correlation >0.7

**Computation:**

```python
from scipy.stats import pearsonr


def evaluate_extraction_confidence(test_results: List[dict]) -> float:
    """
    Evaluate extraction confidence calibration.

    test_results: [
        {"extraction_confidence": 0.85, "actual_precision": 0.90},
        ...
    ]
    """
    confidences = [r['extraction_confidence'] for r in test_results]
    precisions = [r['actual_precision'] for r in test_results]

    correlation, p_value = pearsonr(confidences, precisions)
    return correlation
```

---

## ENGAGEMENT METRICS

### Metric 11: Average Conversation Length

**Definition:** Mean number of turns per conversation.

**Target:** ≥10 turns (demonstrates sustained engagement)

**Computation:**

```python
def compute_avg_conversation_length(conversations: List[dict]) -> float:
    """
    conversations: [{"session_id": str, "turn_count": int}, ...]
    """
    if len(conversations) == 0:
        return 0.0

    total_turns = sum(conv['turn_count'] for conv in conversations)
    return total_turns / len(conversations)
```

---

### Metric 12: Intelligence Extraction Rate

**Definition:** Proportion of conversations that extract at least one intelligence entity.
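For example (hypothetical session results), the rate is just the share of conversations that yielded at least one entity:

```python
# Hypothetical: entity counts from 10 completed test conversations
entities_per_conv = [2, 0, 1, 3, 0, 1, 1, 0, 2, 1]

rate = sum(1 for n in entities_per_conv if n > 0) / len(entities_per_conv)
print(rate)  # 0.7
```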
**Target:** ≥70%

**Computation:**

```python
def compute_extraction_rate(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "extracted_intelligence": {"upi_ids": [...], ...}
        },
        ...
    ]
    """
    if len(conversations) == 0:
        return 0.0

    extracted_count = 0
    for conv in conversations:
        intel = conv['extracted_intelligence']
        has_intel = any(
            len(intel.get(entity_type, [])) > 0
            for entity_type in ['upi_ids', 'bank_accounts', 'ifsc_codes',
                                'phone_numbers', 'phishing_links']
        )
        if has_intel:
            extracted_count += 1

    return extracted_count / len(conversations)
```

---

### Metric 13: Persona Consistency

**Definition:** Proportion of conversations where the persona remains consistent across all turns.

**Target:** ≥95%

**Computation:**

```python
def compute_persona_consistency(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "messages": [
                {"turn": 1, "sender": "agent", "persona": "elderly"},
                {"turn": 2, "sender": "agent", "persona": "elderly"},
                ...
            ]
        },
        ...
    ]
    """
    consistent_count = 0

    for conv in conversations:
        agent_messages = [msg for msg in conv['messages'] if msg['sender'] == 'agent']
        if len(agent_messages) == 0:
            continue

        personas = [msg.get('persona') for msg in agent_messages]
        if len(set(personas)) == 1:  # All same persona
            consistent_count += 1

    return consistent_count / len(conversations) if len(conversations) > 0 else 0.0
```

---

### Metric 14: Engagement Quality Score

**Definition:** Composite score measuring naturalness and effectiveness of engagement.

**Components:**

1. Average turns (weight: 0.4)
2. Extraction rate (weight: 0.4)
3. Persona consistency (weight: 0.2)

**Target:** ≥0.8

**Computation:**

```python
def compute_engagement_quality(avg_turns: float, extraction_rate: float,
                               persona_consistency: float) -> float:
    """
    Normalize and weight engagement metrics.

    Args:
        avg_turns: Actual average turns
        extraction_rate: 0.0-1.0
        persona_consistency: 0.0-1.0
    """
    # Normalize avg_turns (max 20)
    normalized_turns = min(avg_turns / 20, 1.0)

    quality_score = (
        0.4 * normalized_turns +
        0.4 * extraction_rate +
        0.2 * persona_consistency
    )
    return quality_score
```

---

## PERFORMANCE METRICS

### Metric 15: API Response Time

**Definition:** Time from request received to response sent.

**Targets:**

- P50 (Median): <1 second
- P95: <2 seconds
- P99: <3 seconds

**Computation:**

```python
import numpy as np


def compute_response_time_percentiles(response_times: List[float]) -> dict:
    """
    response_times: List of times in seconds
    """
    return {
        'p50': np.percentile(response_times, 50),
        'p95': np.percentile(response_times, 95),
        'p99': np.percentile(response_times, 99),
        'mean': np.mean(response_times),
        'max': np.max(response_times)
    }
```

---

### Metric 16: Throughput

**Definition:** Number of requests processed per minute.

**Target:** ≥100 requests/minute (sustained)

**Computation:**

```python
def compute_throughput(total_requests: int, time_window_seconds: float) -> float:
    """Returns requests per minute"""
    return (total_requests / time_window_seconds) * 60
```

---

### Metric 17: Error Rate

**Definition:** Proportion of requests that result in errors (4xx, 5xx).

**Target:** <1%

**Computation:**

```python
def compute_error_rate(total_requests: int, error_count: int) -> float:
    """Returns error rate as a proportion (0.0-1.0)"""
    return error_count / total_requests if total_requests > 0 else 0.0
```

---

### Metric 18: Uptime

**Definition:** Percentage of time the service is available and healthy.
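For scale (hypothetical downtime figures): even 15 minutes of downtime during a 24-hour test window already breaches a 99% target:

```python
# Hypothetical: 15 minutes of downtime in a 24-hour test window
total_s = 24 * 3600  # 86400 seconds
down_s = 15 * 60     # 900 seconds

uptime_pct = (total_s - down_s) / total_s * 100
print(round(uptime_pct, 2))  # 98.96
```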
**Target:** ≥99% during competition testing window

**Computation:**

```python
def compute_uptime(total_time_seconds: float, downtime_seconds: float) -> float:
    """Returns uptime as a percentage"""
    return ((total_time_seconds - downtime_seconds) / total_time_seconds) * 100
```

---

## COMPUTATION METHODS

### Complete Evaluation Pipeline

```python
import json
import time
from datetime import datetime
from typing import List, Dict, Tuple

import numpy as np
import requests


class ScamShieldEvaluator:
    """Complete evaluation framework for ScamShield AI"""

    def __init__(self, api_endpoint: str):
        self.api_endpoint = api_endpoint
        self.results = {
            'detection': {},
            'extraction': {},
            'engagement': {},
            'performance': {}
        }

    def evaluate_detection(self, test_file: str) -> dict:
        """
        Evaluate scam detection on test dataset.

        Args:
            test_file: Path to JSONL test file

        Returns:
            Detection metrics dictionary
        """
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]

        predictions = []
        ground_truth = []
        response_times = []

        for item in test_data:
            start_time = time.time()
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['message'], "language": item['language']}
            )
            response_times.append(time.time() - start_time)

            result = response.json()
            predictions.append({
                'id': item['id'],
                'scam_detected': result['scam_detected'],
                'confidence': result['confidence']
            })
            ground_truth.append({
                'id': item['id'],
                'label': item['ground_truth']['label'],
                'language': item['language']
            })

        # Compute metrics
        accuracy = compute_detection_accuracy(predictions, ground_truth)
        precision = compute_precision(predictions, ground_truth)
        recall = compute_recall(predictions, ground_truth)
        f1 = compute_f1_score(precision, recall)
        ece = compute_ece(predictions, ground_truth)
        lang_acc = compute_language_specific_accuracy(predictions, ground_truth)

        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'ece': ece,
            'language_accuracy': lang_acc,
            'avg_response_time': np.mean(response_times),
            'total_samples': len(test_data)
        }

    def evaluate_extraction(self, test_file: str) -> dict:
        """Evaluate intelligence extraction on test dataset"""
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]

        entity_types = ['upi_ids', 'bank_accounts', 'ifsc_codes',
                        'phone_numbers', 'phishing_links']
        all_precisions = {entity: [] for entity in entity_types}
        all_recalls = {entity: [] for entity in entity_types}

        for item in test_data:
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['text'], "language": item['language']}
            )
            result = response.json()

            extracted = result['extracted_intelligence']
            ground_truth = item['ground_truth']

            precisions = compute_extraction_precision(extracted, ground_truth)
            recalls = compute_extraction_recall(extracted, ground_truth)

            for entity in all_precisions:
                all_precisions[entity].append(precisions[entity])
                all_recalls[entity].append(recalls[entity])

        # Average across all samples
        avg_precisions = {entity: np.mean(all_precisions[entity])
                          for entity in all_precisions}
        avg_recalls = {entity: np.mean(all_recalls[entity])
                       for entity in all_recalls}

        overall_f1 = compute_overall_extraction_f1(avg_precisions, avg_recalls)

        return {
            'precisions': avg_precisions,
            'recalls': avg_recalls,
            'overall_f1': overall_f1,
            'total_samples': len(test_data)
        }

    def evaluate_engagement(self, conversation_file: str) -> dict:
        """Evaluate multi-turn engagement quality"""
        with open(conversation_file, 'r') as f:
            conversations = [json.loads(line) for line in f]

        completed_conversations = []

        for conv in conversations:
            session_id = None
            turn_count = 0
            extracted_intel = {}

            for turn in conv['turns']:
                if turn['sender'] == 'scammer':
                    response = requests.post(
                        f"{self.api_endpoint}/honeypot/engage",
                        json={
                            "message": turn['message'],
                            "session_id": session_id,
                            "language": conv['language']
                        }
                    )
                    result = response.json()

                    if session_id is None:
                        session_id = result['session_id']

                    turn_count = result['engagement']['turn_count']
                    extracted_intel = result['extracted_intelligence']

                    # Check termination
                    if result['engagement']['max_turns_reached']:
                        break

            completed_conversations.append({
                'session_id': session_id,
                'turn_count': turn_count,
                'extracted_intelligence': extracted_intel
            })

        avg_turns = compute_avg_conversation_length(completed_conversations)
        extraction_rate = compute_extraction_rate(completed_conversations)

        return {
            'avg_conversation_length': avg_turns,
            'intelligence_extraction_rate': extraction_rate,
            'total_conversations': len(completed_conversations)
        }

    def evaluate_performance(self, duration_seconds: int = 60,
                             target_rps: int = 10) -> dict:
        """Load test performance metrics"""
        import concurrent.futures

        test_message = "You won 10 lakh rupees! Send OTP to claim."
        response_times = []
        errors = 0

        def make_request():
            try:
                start = time.time()
                response = requests.post(
                    f"{self.api_endpoint}/honeypot/engage",
                    json={"message": test_message},
                    timeout=5
                )
                latency = time.time() - start
                if response.status_code != 200:
                    return None, 1
                return latency, 0
            except Exception:
                return None, 1

        start_time = time.time()
        total_requests = 0

        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
            while time.time() - start_time < duration_seconds:
                future = executor.submit(make_request)
                latency, error = future.result()

                total_requests += 1
                if latency is not None:
                    response_times.append(latency)
                errors += error

                # Rate limiting
                time.sleep(1.0 / target_rps)

        elapsed_time = time.time() - start_time

        percentiles = compute_response_time_percentiles(response_times)
        throughput = compute_throughput(total_requests, elapsed_time)
        error_rate = compute_error_rate(total_requests, errors)

        return {
            'response_time_percentiles': percentiles,
            'throughput_rpm': throughput,
            'error_rate': error_rate,
            'total_requests': total_requests,
            'duration_seconds': elapsed_time
        }

    def run_full_evaluation(self) -> dict:
        """Run complete evaluation suite"""
        print("Running detection evaluation...")
        self.results['detection'] = self.evaluate_detection(
            'data/scam_detection_test.jsonl')

        print("Running extraction evaluation...")
        self.results['extraction'] = self.evaluate_extraction(
            'data/intelligence_extraction_test.jsonl')

        print("Running engagement evaluation...")
        self.results['engagement'] = self.evaluate_engagement(
            'data/conversation_simulation_test.jsonl')

        print("Running performance evaluation...")
        self.results['performance'] = self.evaluate_performance(
            duration_seconds=60, target_rps=10)

        return self.results

    def generate_report(self, output_file: str = 'evaluation_report.json'):
        """Generate comprehensive evaluation report"""
        report = {
            'timestamp': datetime.utcnow().isoformat(),
            'api_endpoint': self.api_endpoint,
            'results': self.results,
            'pass_criteria': {
                'detection_accuracy':
                    self.results['detection']['accuracy'] >= 0.90,
                'extraction_f1':
                    self.results['extraction']['overall_f1'] >= 0.85,
                'avg_conversation_length':
                    self.results['engagement']['avg_conversation_length'] >= 10,
                'response_time_p95':
                    self.results['performance']['response_time_percentiles']['p95'] < 2.0,
                'error_rate':
                    self.results['performance']['error_rate'] < 0.01
            }
        }

        with open(output_file, 'w') as f:
            json.dump(report, f, indent=2)

        print(f"Evaluation report saved to {output_file}")
        return report
```

---

## TESTING FRAMEWORK

### Test Suite Organization

```
tests/
├── unit/
│   ├── test_detection.py
│   ├── test_extraction.py
│   ├── test_persona.py
│   └── test_utils.py
├── integration/
│   ├── test_api_endpoints.py
│   ├── test_database.py
│   └── test_llm_integration.py
├── performance/
│   ├── test_load.py
│   └── test_latency.py
├── acceptance/
│   ├── test_requirements.py
│   └── test_red_team.py
└── conftest.py
```

### Sample Unit Test

```python
# tests/unit/test_detection.py
import pytest

from app.models.detector import ScamDetector


@pytest.fixture
def detector():
    return ScamDetector()


def test_english_scam_detection(detector):
    """Test English scam message detection"""
    message = "You won 10 lakh rupees! Send OTP immediately."
    result = detector.detect(message)

    assert result['scam_detected'] is True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'en'


def test_hindi_scam_detection(detector):
    """Test Hindi scam message detection"""
    # "You will be arrested. Send money."
    message = "आप गिरफ्तार हो जाएंगे। पैसे भेजें।"
    result = detector.detect(message)

    assert result['scam_detected'] is True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'hi'


def test_legitimate_message(detector):
    """Test legitimate message classification"""
    message = "Hi, how are you? Let's meet for coffee."
    result = detector.detect(message)

    assert result['scam_detected'] is False
    assert result['confidence'] <= 0.3
```

### Sample Integration Test

```python
# tests/integration/test_api_endpoints.py
import pytest
import requests


@pytest.fixture
def api_url():
    return "http://localhost:8000/api/v1"


def test_engage_endpoint_scam(api_url):
    """Test /honeypot/engage with scam message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "You won 10 lakh rupees! Send OTP.",
            "language": "auto"
        }
    )

    assert response.status_code == 200
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] is True
    assert 'agent_response' in data['engagement']
    assert data['engagement']['turn_count'] == 1


def test_engage_endpoint_legitimate(api_url):
    """Test /honeypot/engage with legitimate message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "Hi, how are you?",
            "language": "auto"
        }
    )

    assert response.status_code == 200
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] is False
```

---

## COMPETITION SCORING (PREDICTED)

### Predicted Judging Rubric

Based on Challenge 2 requirements, we predict the following scoring:

| Category | Weight | Metrics | Our Target | Competitive Advantage |
|----------|--------|---------|------------|----------------------|
| **Scam Detection** | 25% | Accuracy, Precision, Recall | 92% accuracy | IndicBERT + hybrid approach |
| **Engagement Quality** | 25% | Avg turns, Naturalness | 12 turns avg | Multi-turn agentic AI |
| **Intelligence Extraction** | 30% | Precision, Recall, Coverage | 88% F1 | Hybrid NER + regex |
| **Response Time** | 10% | P95 latency | <1.8s | Optimized inference |
| **System Robustness** | 10% | Uptime, Error rate | 99.5% uptime | Production architecture |

### Expected Score Calculation

```python
def calculate_competition_score(metrics: dict) -> float:
    """
    Calculate predicted competition score.

    Args:
        metrics: Dictionary with all evaluation metrics

    Returns:
        Estimated score (0-100)
    """
    weights = {
        'detection': 0.25,
        'engagement': 0.25,
        'extraction': 0.30,
        'performance': 0.10,
        'robustness': 0.10
    }

    # Normalize each category to 0-1
    detection_score = min(metrics['detection']['accuracy'] / 0.90, 1.0)
    engagement_score = min(
        metrics['engagement']['avg_conversation_length'] / 10, 1.0)
    extraction_score = min(metrics['extraction']['overall_f1'] / 0.85, 1.0)
    performance_score = 1.0 - min(
        metrics['performance']['response_time_percentiles']['p95'] / 2.0, 1.0)
    robustness_score = 1.0 - metrics['performance']['error_rate']

    total_score = (
        weights['detection'] * detection_score +
        weights['engagement'] * engagement_score +
        weights['extraction'] * extraction_score +
        weights['performance'] * performance_score +
        weights['robustness'] * robustness_score
    ) * 100

    return total_score
```

---

## CONTINUOUS MONITORING

### Production Metrics Dashboard

```python
from prometheus_client import Counter, Histogram, Gauge, Summary

# Define metrics
scam_detection_total = Counter(
    'scamshield_scam_detection_total',
    'Total number of scam detections',
    ['language', 'result']
)

intelligence_extracted_total = Counter(
    'scamshield_intelligence_extracted_total',
    'Total pieces of intelligence extracted',
    ['type']
)

api_response_time = Histogram(
    'scamshield_api_response_time_seconds',
    'API response time in seconds',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

active_sessions = Gauge(
    'scamshield_active_sessions',
    'Number of active honeypot sessions'
)

detection_accuracy = Summary(
    'scamshield_detection_accuracy',
    'Detection accuracy over sliding window'
)
```

---

**Document Status:** Production Ready
**Next Steps:** Implement evaluation framework, run tests, generate baseline metrics
**Update Frequency:** Daily during development, hourly during competition testing