# Evaluation Specification: ScamShield AI

## Metrics, Computation Methods, and Testing Framework

**Version:** 1.0
**Date:** January 26, 2026
**Owner:** QA & Evaluation Team
**Related Documents:** FRD.md, DATA_SPEC.md, API_CONTRACT.md

---

## TABLE OF CONTENTS

1. [Evaluation Overview](#evaluation-overview)
2. [Detection Metrics](#detection-metrics)
3. [Extraction Metrics](#extraction-metrics)
4. [Engagement Metrics](#engagement-metrics)
5. [Performance Metrics](#performance-metrics)
6. [Computation Methods](#computation-methods)
7. [Testing Framework](#testing-framework)
8. [Competition Scoring (Predicted)](#competition-scoring-predicted)
9. [Continuous Monitoring](#continuous-monitoring)

---
## EVALUATION OVERVIEW

### Evaluation Objectives

1. **Functional Correctness:** System meets FRD requirements
2. **Performance:** Response time and throughput within SLAs
3. **Quality:** Detection accuracy, extraction precision/recall
4. **Robustness:** Handles edge cases and adversarial inputs
5. **Competition Readiness:** Meets judging criteria

### Evaluation Phases

| Phase | Timeline | Focus | Pass Criteria |
|-------|----------|-------|---------------|
| **Unit Testing** | Days 3-9 | Individual components | >80% code coverage |
| **Integration Testing** | Day 8 | End-to-end flows | All API endpoints functional |
| **Performance Testing** | Day 9 | Load, latency | <2s p95 latency, 100 req/min |
| **Acceptance Testing** | Day 10 | Requirements validation | All FRD acceptance criteria met |
| **Red Team Testing** | Day 10 | Adversarial scenarios | >80% red team tests passed |
| **Pre-Submission** | Day 11 | Final validation | >90% detection accuracy |
---

## DETECTION METRICS

### Metric 1: Scam Detection Accuracy

**Definition:** Proportion of messages correctly classified as scam or legitimate.

**Formula:**

```
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:
- TP (True Positives): Scams correctly identified
- TN (True Negatives): Legitimate messages correctly identified
- FP (False Positives): Legitimate messages incorrectly flagged as scams
- FN (False Negatives): Scams missed
```

**Target:** ≥90%

**Computation:**
```python
from typing import List

def compute_detection_accuracy(predictions: List[dict], ground_truth: List[dict]) -> float:
    """
    Compute scam detection accuracy.

    Args:
        predictions: List of {"id": str, "scam_detected": bool}
        ground_truth: List of {"id": str, "label": "scam"|"legitimate"}

    Returns:
        Accuracy score (0.0-1.0)
    """
    assert len(predictions) == len(ground_truth), "Mismatched lengths"
    # Align predictions and labels by message ID
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    correct = sum(1 for msg_id in pred_map if pred_map[msg_id] == gt_map[msg_id])
    total = len(pred_map)
    return correct / total if total > 0 else 0.0

# Example usage
predictions = [
    {"id": "test_001", "scam_detected": True},
    {"id": "test_002", "scam_detected": False},
    {"id": "test_003", "scam_detected": True}
]
ground_truth = [
    {"id": "test_001", "label": "scam"},
    {"id": "test_002", "label": "legitimate"},
    {"id": "test_003", "label": "scam"}
]
accuracy = compute_detection_accuracy(predictions, ground_truth)
print(f"Accuracy: {accuracy:.2%}")  # Expected: 100.00%
```
---

### Metric 2: Precision

**Definition:** Of all messages flagged as scams, what proportion are actual scams?

**Formula:**

```
Precision = TP / (TP + FP)
```

**Target:** ≥85%

**Significance:** High precision minimizes false alarms (legitimate messages flagged as scams).

**Computation:**
```python
def compute_precision(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute precision for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    tp = sum(1 for msg_id in pred_map if pred_map[msg_id] and gt_map[msg_id])
    fp = sum(1 for msg_id in pred_map if pred_map[msg_id] and not gt_map[msg_id])
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0
```
---

### Metric 3: Recall (Sensitivity)

**Definition:** Of all actual scams, what proportion are detected?

**Formula:**

```
Recall = TP / (TP + FN)
```

**Target:** ≥90%

**Significance:** High recall ensures few scams are missed.

**Computation:**
```python
def compute_recall(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute recall for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    tp = sum(1 for msg_id in pred_map if pred_map[msg_id] and gt_map[msg_id])
    fn = sum(1 for msg_id in pred_map if not pred_map[msg_id] and gt_map[msg_id])
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```
---

### Metric 4: F1-Score

**Definition:** Harmonic mean of precision and recall.

**Formula:**

```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

**Target:** ≥87%

**Computation:**
| def compute_f1_score(precision: float, recall: float) -> float: | |
| """Compute F1-score from precision and recall""" | |
| if precision + recall == 0: | |
| return 0.0 | |
| return 2 * (precision * recall) / (precision + recall) | |
| ``` | |
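
As a quick sanity check, plugging the precision and recall targets from Metrics 2 and 3 into this formula confirms the F1 target is consistent with them:

```python
# Worked check: the precision (>=0.85) and recall (>=0.90) targets imply
# F1 = 2 * (0.85 * 0.90) / (0.85 + 0.90) ≈ 0.874, consistent with the ≥87% target.
print(f"{compute_f1_score(0.85, 0.90):.3f}")  # 0.874
```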
---

### Metric 5: Confidence Calibration

**Definition:** How well do confidence scores correlate with actual accuracy?

**Formula:**

```
Expected Calibration Error (ECE) = Σ |accuracy_bin - avg_confidence_bin| × (bin_size / total)

For bins: [0-0.1], [0.1-0.2], ..., [0.9-1.0]
```

**Target:** ECE <0.1 (well-calibrated)

**Computation:**
```python
def compute_ece(predictions: List[dict], ground_truth: List[dict], n_bins: int = 10) -> float:
    """
    Compute Expected Calibration Error.
    Predictions must include a "confidence" field.
    """
    pred_map = {p['id']: (p['scam_detected'], p['confidence']) for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    bins = [[] for _ in range(n_bins)]
    for msg_id in pred_map:
        pred, conf = pred_map[msg_id]
        actual = gt_map[msg_id]
        correct = (pred == actual)
        # Clamp confidence 1.0 into the top bin
        bin_idx = min(int(conf * n_bins), n_bins - 1)
        bins[bin_idx].append((conf, correct))
    ece = 0.0
    total = len(pred_map)
    for bin_samples in bins:
        if len(bin_samples) == 0:
            continue
        avg_conf = sum(conf for conf, _ in bin_samples) / len(bin_samples)
        accuracy = sum(1 for _, correct in bin_samples if correct) / len(bin_samples)
        ece += abs(accuracy - avg_conf) * (len(bin_samples) / total)
    return ece
```
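
A toy example (hypothetical IDs and confidences, not real test data) shows how a single overconfident miss inflates ECE:

```python
# Hypothetical toy data: two correct predictions and one confident miss
preds = [
    {"id": "a", "scam_detected": True,  "confidence": 0.95},
    {"id": "b", "scam_detected": True,  "confidence": 0.85},
    {"id": "c", "scam_detected": False, "confidence": 0.90},  # wrong
]
gt = [
    {"id": "a", "label": "scam"},
    {"id": "b", "label": "scam"},
    {"id": "c", "label": "scam"},
]
# Bin [0.8-0.9): |1.0 - 0.85| * (1/3) = 0.05
# Bin [0.9-1.0]: |0.5 - 0.925| * (2/3) ≈ 0.283
print(f"ECE = {compute_ece(preds, gt):.2f}")  # 0.33, well above the <0.1 target
```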
---

### Metric 6: Language-Specific Accuracy

**Definition:** Detection accuracy broken down by language.

**Target:**

- English: ≥92%
- Hindi: ≥88%
- Hinglish: ≥85%
- Fairness: <5% difference between languages

**Computation:**
```python
from collections import defaultdict

def compute_language_specific_accuracy(predictions: List[dict], ground_truth: List[dict]) -> dict:
    """Compute accuracy per language"""
    lang_correct = defaultdict(int)
    lang_total = defaultdict(int)
    pred_map = {p['id']: p for p in predictions}
    gt_map = {g['id']: g for g in ground_truth}
    for msg_id in pred_map:
        lang = gt_map[msg_id]['language']
        pred_scam = pred_map[msg_id]['scam_detected']
        actual_scam = (gt_map[msg_id]['label'] == 'scam')
        lang_total[lang] += 1
        if pred_scam == actual_scam:
            lang_correct[lang] += 1
    return {
        lang: lang_correct[lang] / lang_total[lang] if lang_total[lang] > 0 else 0.0
        for lang in lang_total
    }

# Check fairness
def check_language_fairness(lang_accuracies: dict, threshold: float = 0.05) -> bool:
    """Ensure the accuracy difference between languages is within the threshold"""
    accuracies = list(lang_accuracies.values())
    max_diff = max(accuracies) - min(accuracies)
    return max_diff < threshold
```
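
A short usage sketch (the per-language numbers are illustrative, not measured) ties the two functions together:

```python
# Illustrative per-language accuracies, not measured results
lang_acc = {"en": 0.93, "hi": 0.89, "hinglish": 0.86}
print(check_language_fairness(lang_acc))  # False: 0.93 - 0.86 = 0.07 >= 0.05
```

Note that hitting the individual language targets exactly (92% / 88% / 85%) would itself violate the 5% fairness bound (92% - 85% = 7%), so the fairness check is the binding constraint.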
---

## EXTRACTION METRICS

### Metric 7: Extraction Precision (per entity type)

**Definition:** Of all extracted entities, what proportion are correct?

**Formula:**

```
Precision_entity = |Extracted ∩ Ground_Truth| / |Extracted|

For entity types: upi_ids, bank_accounts, ifsc_codes, phone_numbers, phishing_links
```

**Target:**

- UPI IDs: ≥90%
- Bank Accounts: ≥85%
- IFSC Codes: ≥95%
- Phone Numbers: ≥90%
- Phishing Links: ≥95%

**Computation:**
```python
# Entity types shared by the extraction metrics and the evaluation pipeline
ENTITY_TYPES = ['upi_ids', 'bank_accounts', 'ifsc_codes', 'phone_numbers', 'phishing_links']

def compute_extraction_precision(extracted: dict, ground_truth: dict) -> dict:
    """
    Compute precision for each entity type.

    Args:
        extracted: {"upi_ids": [...], "bank_accounts": [...], ...}
        ground_truth: Same structure

    Returns:
        {"upi_ids": precision, "bank_accounts": precision, ...}
    """
    precisions = {}
    for entity_type in ENTITY_TYPES:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))
        if len(extracted_set) == 0:
            # Nothing extracted: perfect only if nothing was there to extract
            precisions[entity_type] = 1.0 if len(gt_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            precisions[entity_type] = correct / len(extracted_set)
    return precisions
```
---

### Metric 8: Extraction Recall (per entity type)

**Definition:** Of all actual entities, what proportion are extracted?

**Formula:**

```
Recall_entity = |Extracted ∩ Ground_Truth| / |Ground_Truth|
```

**Target:**

- UPI IDs: ≥85%
- Bank Accounts: ≥80%
- IFSC Codes: ≥90%
- Phone Numbers: ≥85%
- Phishing Links: ≥90%

**Computation:**
```python
def compute_extraction_recall(extracted: dict, ground_truth: dict) -> dict:
    """Compute recall for each entity type"""
    recalls = {}
    for entity_type in ENTITY_TYPES:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))
        if len(gt_set) == 0:
            # No true entities: perfect only if nothing was extracted
            recalls[entity_type] = 1.0 if len(extracted_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            recalls[entity_type] = correct / len(gt_set)
    return recalls
```
---

### Metric 9: Overall Extraction F1-Score

**Definition:** Weighted average F1-score across all entity types.

**Weights:**

```python
ENTITY_WEIGHTS = {
    'upi_ids': 0.30,
    'bank_accounts': 0.30,
    'ifsc_codes': 0.20,
    'phone_numbers': 0.10,
    'phishing_links': 0.10
}
```

**Target:** ≥85%

**Computation:**
```python
def compute_overall_extraction_f1(precisions: dict, recalls: dict, weights: dict = ENTITY_WEIGHTS) -> float:
    """Compute weighted F1-score across entity types"""
    f1_scores = {}
    for entity_type in weights:
        p = precisions.get(entity_type, 0.0)
        r = recalls.get(entity_type, 0.0)
        if p + r == 0:
            f1_scores[entity_type] = 0.0
        else:
            f1_scores[entity_type] = 2 * (p * r) / (p + r)
    weighted_f1 = sum(f1_scores[entity] * weights[entity] for entity in weights)
    return weighted_f1
```
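
A worked example (the per-entity scores are illustrative) shows how the weights combine the per-entity F1 values:

```python
# Illustrative per-entity scores, not measured results
precisions = {'upi_ids': 0.92, 'bank_accounts': 0.86, 'ifsc_codes': 0.96,
              'phone_numbers': 0.91, 'phishing_links': 0.95}
recalls = {'upi_ids': 0.87, 'bank_accounts': 0.81, 'ifsc_codes': 0.91,
           'phone_numbers': 0.86, 'phishing_links': 0.92}
print(f"{compute_overall_extraction_f1(precisions, recalls):.3f}")  # 0.887, above the ≥85% target
```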
---

### Metric 10: Extraction Confidence Accuracy

**Definition:** Correlation between the extraction_confidence score and actual precision.

**Target:** Pearson correlation >0.7

**Computation:**

```python
from scipy.stats import pearsonr

def evaluate_extraction_confidence(test_results: List[dict]) -> float:
    """
    Evaluate extraction confidence calibration.

    test_results: [
        {
            "extraction_confidence": 0.85,
            "actual_precision": 0.90
        },
        ...
    ]
    """
    confidences = [r['extraction_confidence'] for r in test_results]
    precisions = [r['actual_precision'] for r in test_results]
    correlation, _ = pearsonr(confidences, precisions)
    return correlation
```
---

## ENGAGEMENT METRICS

### Metric 11: Average Conversation Length

**Definition:** Mean number of turns per conversation.

**Target:** ≥10 turns (demonstrates sustained engagement)

**Computation:**

```python
def compute_avg_conversation_length(conversations: List[dict]) -> float:
    """
    conversations: [{"session_id": str, "turn_count": int}, ...]
    """
    if len(conversations) == 0:
        return 0.0
    total_turns = sum(conv['turn_count'] for conv in conversations)
    return total_turns / len(conversations)
```
---

### Metric 12: Intelligence Extraction Rate

**Definition:** Proportion of conversations that extract at least one intelligence entity.

**Target:** ≥70%

**Computation:**

```python
def compute_extraction_rate(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "extracted_intelligence": {
                "upi_ids": [...],
                ...
            }
        },
        ...
    ]
    """
    if len(conversations) == 0:
        return 0.0
    extracted_count = 0
    for conv in conversations:
        intel = conv['extracted_intelligence']
        has_intel = any(
            len(intel.get(entity_type, [])) > 0
            for entity_type in ENTITY_TYPES
        )
        if has_intel:
            extracted_count += 1
    return extracted_count / len(conversations)
```
---

### Metric 13: Persona Consistency

**Definition:** Proportion of conversations where the persona remains consistent across all turns.

**Target:** ≥95%

**Computation:**

```python
def compute_persona_consistency(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "messages": [
                {"turn": 1, "sender": "agent", "persona": "elderly"},
                {"turn": 2, "sender": "agent", "persona": "elderly"},
                ...
            ]
        },
        ...
    ]
    """
    consistent_count = 0
    evaluated_count = 0
    for conv in conversations:
        agent_messages = [msg for msg in conv['messages'] if msg['sender'] == 'agent']
        if len(agent_messages) == 0:
            continue  # No agent turns, so nothing to evaluate
        evaluated_count += 1
        personas = [msg.get('persona') for msg in agent_messages]
        if len(set(personas)) == 1:  # Same persona on every turn
            consistent_count += 1
    # Denominator counts only conversations that actually had agent turns
    return consistent_count / evaluated_count if evaluated_count > 0 else 0.0
```
---

### Metric 14: Engagement Quality Score

**Definition:** Composite score measuring naturalness and effectiveness of engagement.

**Components:**

1. Average turns (weight: 0.4)
2. Extraction rate (weight: 0.4)
3. Persona consistency (weight: 0.2)

**Target:** ≥0.8

**Computation:**

```python
def compute_engagement_quality(avg_turns: float, extraction_rate: float, persona_consistency: float) -> float:
    """
    Normalize and weight engagement metrics.

    Args:
        avg_turns: Actual average turns
        extraction_rate: 0.0-1.0
        persona_consistency: 0.0-1.0
    """
    # Normalize avg_turns against a 20-turn ceiling
    normalized_turns = min(avg_turns / 20, 1.0)
    quality_score = (
        0.4 * normalized_turns +
        0.4 * extraction_rate +
        0.2 * persona_consistency
    )
    return quality_score
```
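
Two worked points (illustrative inputs) show how the normalization interacts with the targets: meeting every per-metric target exactly is not enough to reach the 0.8 composite, because turns are normalized against the 20-turn ceiling:

```python
# Meeting each per-metric target exactly falls short of the 0.8 composite:
# 0.4 * (10/20) + 0.4 * 0.70 + 0.2 * 0.95 = 0.67
print(f"{compute_engagement_quality(10, 0.70, 0.95):.2f}")  # 0.67
# Roughly 16 turns and an 80% extraction rate clear the bar:
# 0.4 * (16/20) + 0.4 * 0.80 + 0.2 * 0.95 = 0.83
print(f"{compute_engagement_quality(16, 0.80, 0.95):.2f}")  # 0.83
```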
---

## PERFORMANCE METRICS

### Metric 15: API Response Time

**Definition:** Time from request received to response sent.

**Targets:**

- P50 (Median): <1 second
- P95: <2 seconds
- P99: <3 seconds

**Computation:**

```python
import numpy as np

def compute_response_time_percentiles(response_times: List[float]) -> dict:
    """
    response_times: List of times in seconds
    """
    return {
        'p50': np.percentile(response_times, 50),
        'p95': np.percentile(response_times, 95),
        'p99': np.percentile(response_times, 99),
        'mean': np.mean(response_times),
        'max': np.max(response_times)
    }
```
---

### Metric 16: Throughput

**Definition:** Number of requests processed per minute.

**Target:** ≥100 requests/minute (sustained)

**Computation:**

```python
def compute_throughput(total_requests: int, time_window_seconds: float) -> float:
    """Returns requests per minute"""
    return (total_requests / time_window_seconds) * 60
```

---

### Metric 17: Error Rate

**Definition:** Proportion of requests that result in errors (4xx, 5xx).

**Target:** <1%

**Computation:**

```python
def compute_error_rate(total_requests: int, error_count: int) -> float:
    """Returns error rate as a proportion (0.0-1.0)"""
    return error_count / total_requests if total_requests > 0 else 0.0
```

---

### Metric 18: Uptime

**Definition:** Percentage of time the service is available and healthy.

**Target:** ≥99% during the competition testing window

**Computation:**

```python
def compute_uptime(total_time_seconds: float, downtime_seconds: float) -> float:
    """Returns uptime as a percentage"""
    return ((total_time_seconds - downtime_seconds) / total_time_seconds) * 100
```
---

## COMPUTATION METHODS

### Complete Evaluation Pipeline

```python
import concurrent.futures
import json
import time
from datetime import datetime

import numpy as np
import requests

# Assumes the metric functions and ENTITY_TYPES defined above are in scope.

class ScamShieldEvaluator:
    """Complete evaluation framework for ScamShield AI"""

    def __init__(self, api_endpoint: str):
        self.api_endpoint = api_endpoint
        self.results = {
            'detection': {},
            'extraction': {},
            'engagement': {},
            'performance': {}
        }

    def evaluate_detection(self, test_file: str) -> dict:
        """
        Evaluate scam detection on the test dataset.

        Args:
            test_file: Path to JSONL test file

        Returns:
            Detection metrics dictionary
        """
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]
        predictions = []
        ground_truth = []
        response_times = []
        for item in test_data:
            start_time = time.time()
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['message'], "language": item['language']}
            )
            response_times.append(time.time() - start_time)
            result = response.json()
            predictions.append({
                'id': item['id'],
                'scam_detected': result['scam_detected'],
                'confidence': result['confidence']
            })
            ground_truth.append({
                'id': item['id'],
                'label': item['ground_truth']['label'],
                'language': item['language']
            })
        # Compute metrics
        accuracy = compute_detection_accuracy(predictions, ground_truth)
        precision = compute_precision(predictions, ground_truth)
        recall = compute_recall(predictions, ground_truth)
        f1 = compute_f1_score(precision, recall)
        ece = compute_ece(predictions, ground_truth)
        lang_acc = compute_language_specific_accuracy(predictions, ground_truth)
        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'ece': ece,
            'language_accuracy': lang_acc,
            'avg_response_time': np.mean(response_times),
            'total_samples': len(test_data)
        }

    def evaluate_extraction(self, test_file: str) -> dict:
        """Evaluate intelligence extraction on the test dataset"""
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]
        all_precisions = {entity: [] for entity in ENTITY_TYPES}
        all_recalls = {entity: [] for entity in ENTITY_TYPES}
        for item in test_data:
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['text'], "language": item['language']}
            )
            result = response.json()
            extracted = result['extracted_intelligence']
            ground_truth = item['ground_truth']
            precisions = compute_extraction_precision(extracted, ground_truth)
            recalls = compute_extraction_recall(extracted, ground_truth)
            for entity in all_precisions:
                all_precisions[entity].append(precisions[entity])
                all_recalls[entity].append(recalls[entity])
        # Average across all samples
        avg_precisions = {entity: np.mean(all_precisions[entity]) for entity in all_precisions}
        avg_recalls = {entity: np.mean(all_recalls[entity]) for entity in all_recalls}
        overall_f1 = compute_overall_extraction_f1(avg_precisions, avg_recalls)
        return {
            'precisions': avg_precisions,
            'recalls': avg_recalls,
            'overall_f1': overall_f1,
            'total_samples': len(test_data)
        }

    def evaluate_engagement(self, conversation_file: str) -> dict:
        """Evaluate multi-turn engagement quality"""
        with open(conversation_file, 'r') as f:
            conversations = [json.loads(line) for line in f]
        completed_conversations = []
        for conv in conversations:
            session_id = None
            turn_count = 0
            extracted_intel = {}
            for turn in conv['turns']:
                if turn['sender'] != 'scammer':
                    continue  # Replay only the scammer side of the script
                response = requests.post(
                    f"{self.api_endpoint}/honeypot/engage",
                    json={
                        "message": turn['message'],
                        "session_id": session_id,
                        "language": conv['language']
                    }
                )
                result = response.json()
                if session_id is None:
                    session_id = result['session_id']
                turn_count = result['engagement']['turn_count']
                extracted_intel = result['extracted_intelligence']
                # Stop once the agent terminates the session
                if result['engagement']['max_turns_reached']:
                    break
            completed_conversations.append({
                'session_id': session_id,
                'turn_count': turn_count,
                'extracted_intelligence': extracted_intel
            })
        avg_turns = compute_avg_conversation_length(completed_conversations)
        extraction_rate = compute_extraction_rate(completed_conversations)
        return {
            'avg_conversation_length': avg_turns,
            'intelligence_extraction_rate': extraction_rate,
            'total_conversations': len(completed_conversations)
        }

    def evaluate_performance(self, duration_seconds: int = 60, target_rps: int = 10) -> dict:
        """Load-test performance metrics"""
        test_message = "You won 10 lakh rupees! Send OTP to claim."
        response_times = []
        errors = 0

        def make_request():
            try:
                start = time.time()
                response = requests.post(
                    f"{self.api_endpoint}/honeypot/engage",
                    json={"message": test_message},
                    timeout=5
                )
                latency = time.time() - start
                if response.status_code != 200:
                    return None, 1
                return latency, 0
            except Exception:
                return None, 1

        start_time = time.time()
        futures = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
            # Submit at the target rate; workers execute requests concurrently
            while time.time() - start_time < duration_seconds:
                futures.append(executor.submit(make_request))
                time.sleep(1.0 / target_rps)
            # Drain all in-flight requests
            for future in futures:
                latency, error = future.result()
                if latency is not None:
                    response_times.append(latency)
                errors += error
        total_requests = len(futures)
        elapsed_time = time.time() - start_time
        percentiles = compute_response_time_percentiles(response_times)
        throughput = compute_throughput(total_requests, elapsed_time)
        error_rate = compute_error_rate(total_requests, errors)
        return {
            'response_time_percentiles': percentiles,
            'throughput_rpm': throughput,
            'error_rate': error_rate,
            'total_requests': total_requests,
            'duration_seconds': elapsed_time
        }

    def run_full_evaluation(self) -> dict:
        """Run the complete evaluation suite"""
        print("Running detection evaluation...")
        self.results['detection'] = self.evaluate_detection('data/scam_detection_test.jsonl')
        print("Running extraction evaluation...")
        self.results['extraction'] = self.evaluate_extraction('data/intelligence_extraction_test.jsonl')
        print("Running engagement evaluation...")
        self.results['engagement'] = self.evaluate_engagement('data/conversation_simulation_test.jsonl')
        print("Running performance evaluation...")
        self.results['performance'] = self.evaluate_performance(duration_seconds=60, target_rps=10)
        return self.results

    def generate_report(self, output_file: str = 'evaluation_report.json'):
        """Generate a comprehensive evaluation report"""
        report = {
            'timestamp': datetime.utcnow().isoformat(),
            'api_endpoint': self.api_endpoint,
            'results': self.results,
            'pass_criteria': {
                'detection_accuracy': self.results['detection']['accuracy'] >= 0.90,
                'extraction_f1': self.results['extraction']['overall_f1'] >= 0.85,
                'avg_conversation_length': self.results['engagement']['avg_conversation_length'] >= 10,
                'response_time_p95': self.results['performance']['response_time_percentiles']['p95'] < 2.0,
                'error_rate': self.results['performance']['error_rate'] < 0.01
            }
        }
        with open(output_file, 'w') as f:
            # default=float converts NumPy scalars so the report serializes
            json.dump(report, f, indent=2, default=float)
        print(f"Evaluation report saved to {output_file}")
        return report
```
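
A minimal driver for the framework might look like this (the endpoint URL is an assumption, borrowed from the integration-test fixture below):

```python
# Hypothetical driver; the endpoint value mirrors the integration-test fixture
evaluator = ScamShieldEvaluator(api_endpoint="http://localhost:8000/api/v1")
evaluator.run_full_evaluation()
report = evaluator.generate_report("evaluation_report.json")
print(report['pass_criteria'])
```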
---

## TESTING FRAMEWORK

### Test Suite Organization

```
tests/
├── unit/
│   ├── test_detection.py
│   ├── test_extraction.py
│   ├── test_persona.py
│   └── test_utils.py
├── integration/
│   ├── test_api_endpoints.py
│   ├── test_database.py
│   └── test_llm_integration.py
├── performance/
│   ├── test_load.py
│   └── test_latency.py
├── acceptance/
│   ├── test_requirements.py
│   └── test_red_team.py
└── conftest.py
```
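
This spec does not spell out `conftest.py`; a minimal sketch (fixture names and scopes are assumptions) would centralize the fixtures the suites share:

```python
# tests/conftest.py: a minimal sketch; fixture names and scopes are assumptions
import pytest

@pytest.fixture(scope="session")
def api_url() -> str:
    """Base URL shared by the integration and acceptance suites."""
    return "http://localhost:8000/api/v1"

@pytest.fixture
def scam_message() -> str:
    """Canonical scam sample reused across detection tests."""
    return "You won 10 lakh rupees! Send OTP to claim."
```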
### Sample Unit Test

```python
# tests/unit/test_detection.py
import pytest

from app.models.detector import ScamDetector

@pytest.fixture
def detector():
    return ScamDetector()

def test_english_scam_detection(detector):
    """Test English scam message detection"""
    message = "You won 10 lakh rupees! Send OTP immediately."
    result = detector.detect(message)
    assert result['scam_detected'] is True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'en'

def test_hindi_scam_detection(detector):
    """Test Hindi scam message detection"""
    # "You will be arrested. Send money."
    message = "आप गिरफ्तार हो जाएंगे। पैसे भेजें।"
    result = detector.detect(message)
    assert result['scam_detected'] is True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'hi'

def test_legitimate_message(detector):
    """Test legitimate message classification"""
    message = "Hi, how are you? Let's meet for coffee."
    result = detector.detect(message)
    assert result['scam_detected'] is False
    assert result['confidence'] <= 0.3
```
### Sample Integration Test

```python
# tests/integration/test_api_endpoints.py
import pytest
import requests

@pytest.fixture
def api_url():
    return "http://localhost:8000/api/v1"

def test_engage_endpoint_scam(api_url):
    """Test /honeypot/engage with a scam message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "You won 10 lakh rupees! Send OTP.",
            "language": "auto"
        }
    )
    assert response.status_code == 200
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] is True
    assert 'agent_response' in data['engagement']
    assert data['engagement']['turn_count'] == 1

def test_engage_endpoint_legitimate(api_url):
    """Test /honeypot/engage with a legitimate message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "Hi, how are you?",
            "language": "auto"
        }
    )
    assert response.status_code == 200
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] is False
```
---

## COMPETITION SCORING (PREDICTED)

### Predicted Judging Rubric

Based on Challenge 2 requirements, we predict the following scoring:

| Category | Weight | Metrics | Our Target | Competitive Advantage |
|----------|--------|---------|------------|----------------------|
| **Scam Detection** | 25% | Accuracy, Precision, Recall | 92% accuracy | IndicBERT + hybrid approach |
| **Engagement Quality** | 25% | Avg turns, Naturalness | 12 turns avg | Multi-turn agentic AI |
| **Intelligence Extraction** | 30% | Precision, Recall, Coverage | 88% F1 | Hybrid NER + regex |
| **Response Time** | 10% | P95 latency | <1.8s | Optimized inference |
| **System Robustness** | 10% | Uptime, Error rate | 99.5% uptime | Production architecture |
### Expected Score Calculation

```python
def calculate_competition_score(metrics: dict) -> float:
    """
    Calculate the predicted competition score.

    Args:
        metrics: Dictionary with all evaluation metrics

    Returns:
        Estimated score (0-100)
    """
    weights = {
        'detection': 0.25,
        'engagement': 0.25,
        'extraction': 0.30,
        'performance': 0.10,
        'robustness': 0.10
    }
    # Normalize each category to 0-1: detection/engagement/extraction cap at
    # their targets, and performance scores higher for lower p95 latency
    detection_score = min(metrics['detection']['accuracy'] / 0.90, 1.0)
    engagement_score = min(metrics['engagement']['avg_conversation_length'] / 10, 1.0)
    extraction_score = min(metrics['extraction']['overall_f1'] / 0.85, 1.0)
    performance_score = 1.0 - min(metrics['performance']['response_time_percentiles']['p95'] / 2.0, 1.0)
    robustness_score = 1.0 - metrics['performance']['error_rate']
    total_score = (
        weights['detection'] * detection_score +
        weights['engagement'] * engagement_score +
        weights['extraction'] * extraction_score +
        weights['performance'] * performance_score +
        weights['robustness'] * robustness_score
    ) * 100
    return total_score
```
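
Plugging the rubric targets into this model (illustrative inputs, not measured results) shows where the points come from:

```python
# Illustrative inputs taken from the rubric targets above
metrics = {
    'detection': {'accuracy': 0.92},
    'engagement': {'avg_conversation_length': 12},
    'extraction': {'overall_f1': 0.88},
    'performance': {
        'response_time_percentiles': {'p95': 1.8},
        'error_rate': 0.005,
    },
}
# 0.25*1.0 + 0.25*1.0 + 0.30*1.0 + 0.10*0.1 + 0.10*0.995 = 0.9095
print(f"{calculate_competition_score(metrics):.2f}")  # 90.95
```

Under this model a 1.8s p95 earns only 10% of the performance category, so driving p95 well below 2s is the cheapest remaining source of points.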
---

## CONTINUOUS MONITORING

### Production Metrics Dashboard

```python
from prometheus_client import Counter, Histogram, Gauge, Summary

# Define metrics
scam_detection_total = Counter(
    'scamshield_scam_detection_total',
    'Total number of scam detections',
    ['language', 'result']
)
intelligence_extracted_total = Counter(
    'scamshield_intelligence_extracted_total',
    'Total pieces of intelligence extracted',
    ['type']
)
api_response_time = Histogram(
    'scamshield_api_response_time_seconds',
    'API response time in seconds',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)
active_sessions = Gauge(
    'scamshield_active_sessions',
    'Number of active honeypot sessions'
)
detection_accuracy = Summary(
    'scamshield_detection_accuracy',
    'Detection accuracy over a sliding window'
)
```
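
A sketch of how a request handler might update these metrics (the helper function and its call shape are assumptions, not a defined interface):

```python
# Illustrative instrumentation sketch; the handler shape is an assumption
def record_engagement_result(result: dict, language: str, latency_seconds: float) -> None:
    """Update Prometheus metrics after one /honeypot/engage call."""
    api_response_time.observe(latency_seconds)
    outcome = 'scam' if result['scam_detected'] else 'legitimate'
    scam_detection_total.labels(language=language, result=outcome).inc()
    for entity_type, values in result.get('extracted_intelligence', {}).items():
        if values:
            intelligence_extracted_total.labels(type=entity_type).inc(len(values))
```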
---

**Document Status:** Production Ready
**Next Steps:** Implement the evaluation framework, run tests, generate baseline metrics
**Update Frequency:** Daily during development, hourly during competition testing