# Evaluation Specification: ScamShield AI

## Metrics, Computation Methods, and Testing Framework

**Version:** 1.0
**Date:** January 26, 2026
**Owner:** QA & Evaluation Team
**Related Documents:** FRD.md, DATA_SPEC.md, API_CONTRACT.md

---

## TABLE OF CONTENTS

1. [Evaluation Overview](#evaluation-overview)
2. [Detection Metrics](#detection-metrics)
3. [Extraction Metrics](#extraction-metrics)
4. [Engagement Metrics](#engagement-metrics)
5. [Performance Metrics](#performance-metrics)
6. [Computation Methods](#computation-methods)
7. [Testing Framework](#testing-framework)
8. [Competition Scoring (Predicted)](#competition-scoring-predicted)
9. [Continuous Monitoring](#continuous-monitoring)

---
## EVALUATION OVERVIEW

### Evaluation Objectives

1. **Functional Correctness:** System meets FRD requirements
2. **Performance:** Response time and throughput within SLAs
3. **Quality:** Detection accuracy, extraction precision/recall
4. **Robustness:** Handles edge cases and adversarial inputs
5. **Competition Readiness:** Meets judging criteria

### Evaluation Phases

| Phase | Timeline | Focus | Pass Criteria |
|-------|----------|-------|---------------|
| **Unit Testing** | Days 3-9 | Individual components | >80% code coverage |
| **Integration Testing** | Day 8 | End-to-end flows | All API endpoints functional |
| **Performance Testing** | Day 9 | Load, latency | <2s p95 latency, 100 req/min |
| **Acceptance Testing** | Day 10 | Requirements validation | All FRD acceptance criteria met |
| **Red Team Testing** | Day 10 | Adversarial scenarios | >80% red team tests passed |
| **Pre-Submission** | Day 11 | Final validation | >90% detection accuracy |
---

## DETECTION METRICS

### Metric 1: Scam Detection Accuracy

**Definition:** Proportion of messages correctly classified as scam or legitimate.

**Formula:**

```
Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:
- TP (True Positives): Scams correctly identified
- TN (True Negatives): Legitimate messages correctly identified
- FP (False Positives): Legitimate messages incorrectly flagged as scams
- FN (False Negatives): Scams missed
```

**Target:** ≥90%

**Computation:**
```python
from typing import List

def compute_detection_accuracy(predictions: List[dict], ground_truth: List[dict]) -> float:
    """
    Compute scam detection accuracy.

    Args:
        predictions: List of {"id": str, "scam_detected": bool}
        ground_truth: List of {"id": str, "label": "scam"|"legitimate"}

    Returns:
        Accuracy score (0.0-1.0)
    """
    assert len(predictions) == len(ground_truth), "Mismatched lengths"
    # Align predictions and labels by message ID
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    correct = sum(1 for msg_id in pred_map if pred_map[msg_id] == gt_map[msg_id])
    total = len(pred_map)
    return correct / total if total > 0 else 0.0

# Example usage
predictions = [
    {"id": "test_001", "scam_detected": True},
    {"id": "test_002", "scam_detected": False},
    {"id": "test_003", "scam_detected": True}
]
ground_truth = [
    {"id": "test_001", "label": "scam"},
    {"id": "test_002", "label": "legitimate"},
    {"id": "test_003", "label": "scam"}
]
accuracy = compute_detection_accuracy(predictions, ground_truth)
print(f"Accuracy: {accuracy:.2%}")  # Expected: 100.00%
```
---

### Metric 2: Precision

**Definition:** Of all messages flagged as scams, what proportion are actual scams?

**Formula:**

```
Precision = TP / (TP + FP)
```

**Target:** ≥85%

**Significance:** High precision minimizes false alarms (legitimate messages flagged as scams).

**Computation:**
```python
def compute_precision(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute precision for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    tp = sum(1 for msg_id in pred_map if pred_map[msg_id] and gt_map[msg_id])
    fp = sum(1 for msg_id in pred_map if pred_map[msg_id] and not gt_map[msg_id])
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0
```
---

### Metric 3: Recall (Sensitivity)

**Definition:** Of all actual scams, what proportion are detected?

**Formula:**

```
Recall = TP / (TP + FN)
```

**Target:** ≥90%

**Significance:** High recall ensures few scams are missed.

**Computation:**
```python
def compute_recall(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute recall for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    tp = sum(1 for msg_id in pred_map if pred_map[msg_id] and gt_map[msg_id])
    fn = sum(1 for msg_id in pred_map if not pred_map[msg_id] and gt_map[msg_id])
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0
```
---

### Metric 4: F1-Score

**Definition:** Harmonic mean of precision and recall.

**Formula:**

```
F1 = 2 * (Precision * Recall) / (Precision + Recall)
```

**Target:** ≥87%

**Computation:**
| def compute_f1_score(precision: float, recall: float) -> float: | |
| """Compute F1-score from precision and recall""" | |
| if precision + recall == 0: | |
| return 0.0 | |
| return 2 * (precision * recall) / (precision + recall) | |
| ``` | |
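
As a quick sanity check, plugging the precision and recall targets from Metrics 2 and 3 into this formula confirms the F1 target is consistent with them:

```python
# Worked check: the precision (>=0.85) and recall (>=0.90) targets imply
# F1 = 2 * (0.85 * 0.90) / (0.85 + 0.90) ≈ 0.874, consistent with the ≥87% target.
print(f"{compute_f1_score(0.85, 0.90):.3f}")  # 0.874
```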
---

### Metric 5: Confidence Calibration

**Definition:** How well do confidence scores correlate with actual accuracy?

**Formula:**

```
Expected Calibration Error (ECE) = Σ |accuracy_bin - avg_confidence_bin| × (bin_size / total)

For bins: [0-0.1], [0.1-0.2], ..., [0.9-1.0]
```

**Target:** ECE <0.1 (well-calibrated)

**Computation:**
```python
def compute_ece(predictions: List[dict], ground_truth: List[dict], n_bins: int = 10) -> float:
    """
    Compute Expected Calibration Error.
    Predictions must include a "confidence" field.
    """
    pred_map = {p['id']: (p['scam_detected'], p['confidence']) for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    bins = [[] for _ in range(n_bins)]
    for msg_id in pred_map:
        pred, conf = pred_map[msg_id]
        actual = gt_map[msg_id]
        correct = (pred == actual)
        # Clamp confidence 1.0 into the top bin
        bin_idx = min(int(conf * n_bins), n_bins - 1)
        bins[bin_idx].append((conf, correct))
    ece = 0.0
    total = len(pred_map)
    for bin_samples in bins:
        if len(bin_samples) == 0:
            continue
        avg_conf = sum(conf for conf, _ in bin_samples) / len(bin_samples)
        accuracy = sum(1 for _, correct in bin_samples if correct) / len(bin_samples)
        ece += abs(accuracy - avg_conf) * (len(bin_samples) / total)
    return ece
```
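
A toy example (hypothetical IDs and confidences, not real test data) shows how a single overconfident miss inflates ECE:

```python
# Hypothetical toy data: two correct predictions and one confident miss
preds = [
    {"id": "a", "scam_detected": True,  "confidence": 0.95},
    {"id": "b", "scam_detected": True,  "confidence": 0.85},
    {"id": "c", "scam_detected": False, "confidence": 0.90},  # wrong
]
gt = [
    {"id": "a", "label": "scam"},
    {"id": "b", "label": "scam"},
    {"id": "c", "label": "scam"},
]
# Bin [0.8-0.9): |1.0 - 0.85| * (1/3) = 0.05
# Bin [0.9-1.0]: |0.5 - 0.925| * (2/3) ≈ 0.283
print(f"ECE = {compute_ece(preds, gt):.2f}")  # 0.33, well above the <0.1 target
```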
---

### Metric 6: Language-Specific Accuracy

**Definition:** Detection accuracy broken down by language.

**Target:**

- English: ≥92%
- Hindi: ≥88%
- Hinglish: ≥85%
- Fairness: <5% difference between languages

**Computation:**
```python
from collections import defaultdict

def compute_language_specific_accuracy(predictions: List[dict], ground_truth: List[dict]) -> dict:
    """Compute accuracy per language"""
    lang_correct = defaultdict(int)
    lang_total = defaultdict(int)
    pred_map = {p['id']: p for p in predictions}
    gt_map = {g['id']: g for g in ground_truth}
    for msg_id in pred_map:
        lang = gt_map[msg_id]['language']
        pred_scam = pred_map[msg_id]['scam_detected']
        actual_scam = (gt_map[msg_id]['label'] == 'scam')
        lang_total[lang] += 1
        if pred_scam == actual_scam:
            lang_correct[lang] += 1
    return {
        lang: lang_correct[lang] / lang_total[lang] if lang_total[lang] > 0 else 0.0
        for lang in lang_total
    }

# Check fairness
def check_language_fairness(lang_accuracies: dict, threshold: float = 0.05) -> bool:
    """Ensure the accuracy difference between languages is within the threshold"""
    accuracies = list(lang_accuracies.values())
    max_diff = max(accuracies) - min(accuracies)
    return max_diff < threshold
```
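
A short usage sketch (the per-language numbers are illustrative, not measured) ties the two functions together:

```python
# Illustrative per-language accuracies, not measured results
lang_acc = {"en": 0.93, "hi": 0.89, "hinglish": 0.86}
print(check_language_fairness(lang_acc))  # False: 0.93 - 0.86 = 0.07 >= 0.05
```

Note that hitting the individual language targets exactly (92% / 88% / 85%) would itself violate the 5% fairness bound (92% - 85% = 7%), so the fairness check is the binding constraint.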
---

## EXTRACTION METRICS

### Metric 7: Extraction Precision (per entity type)

**Definition:** Of all extracted entities, what proportion are correct?

**Formula:**

```
Precision_entity = |Extracted ∩ Ground_Truth| / |Extracted|

For entity types: upi_ids, bank_accounts, ifsc_codes, phone_numbers, phishing_links
```

**Target:**

- UPI IDs: ≥90%
- Bank Accounts: ≥85%
- IFSC Codes: ≥95%
- Phone Numbers: ≥90%
- Phishing Links: ≥95%

**Computation:**
```python
# Entity types shared by the extraction metrics and the evaluation pipeline
ENTITY_TYPES = ['upi_ids', 'bank_accounts', 'ifsc_codes', 'phone_numbers', 'phishing_links']

def compute_extraction_precision(extracted: dict, ground_truth: dict) -> dict:
    """
    Compute precision for each entity type.

    Args:
        extracted: {"upi_ids": [...], "bank_accounts": [...], ...}
        ground_truth: Same structure

    Returns:
        {"upi_ids": precision, "bank_accounts": precision, ...}
    """
    precisions = {}
    for entity_type in ENTITY_TYPES:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))
        if len(extracted_set) == 0:
            # Nothing extracted: perfect only if nothing was there to extract
            precisions[entity_type] = 1.0 if len(gt_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            precisions[entity_type] = correct / len(extracted_set)
    return precisions
```
---

### Metric 8: Extraction Recall (per entity type)

**Definition:** Of all actual entities, what proportion are extracted?

**Formula:**

```
Recall_entity = |Extracted ∩ Ground_Truth| / |Ground_Truth|
```

**Target:**

- UPI IDs: ≥85%
- Bank Accounts: ≥80%
- IFSC Codes: ≥90%
- Phone Numbers: ≥85%
- Phishing Links: ≥90%

**Computation:**
```python
def compute_extraction_recall(extracted: dict, ground_truth: dict) -> dict:
    """Compute recall for each entity type"""
    recalls = {}
    for entity_type in ENTITY_TYPES:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))
        if len(gt_set) == 0:
            # No true entities: perfect only if nothing was extracted
            recalls[entity_type] = 1.0 if len(extracted_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            recalls[entity_type] = correct / len(gt_set)
    return recalls
```
---

### Metric 9: Overall Extraction F1-Score

**Definition:** Weighted average F1-score across all entity types.

**Weights:**

```python
ENTITY_WEIGHTS = {
    'upi_ids': 0.30,
    'bank_accounts': 0.30,
    'ifsc_codes': 0.20,
    'phone_numbers': 0.10,
    'phishing_links': 0.10
}
```

**Target:** ≥85%

**Computation:**
```python
def compute_overall_extraction_f1(precisions: dict, recalls: dict, weights: dict = ENTITY_WEIGHTS) -> float:
    """Compute weighted F1-score across entity types"""
    f1_scores = {}
    for entity_type in weights:
        p = precisions.get(entity_type, 0.0)
        r = recalls.get(entity_type, 0.0)
        if p + r == 0:
            f1_scores[entity_type] = 0.0
        else:
            f1_scores[entity_type] = 2 * (p * r) / (p + r)
    weighted_f1 = sum(f1_scores[entity] * weights[entity] for entity in weights)
    return weighted_f1
```
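
A worked example (the per-entity scores are illustrative) shows how the weights combine the per-entity F1 values:

```python
# Illustrative per-entity scores, not measured results
precisions = {'upi_ids': 0.92, 'bank_accounts': 0.86, 'ifsc_codes': 0.96,
              'phone_numbers': 0.91, 'phishing_links': 0.95}
recalls = {'upi_ids': 0.87, 'bank_accounts': 0.81, 'ifsc_codes': 0.91,
           'phone_numbers': 0.86, 'phishing_links': 0.92}
print(f"{compute_overall_extraction_f1(precisions, recalls):.3f}")  # 0.887, above the ≥85% target
```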
---

### Metric 10: Extraction Confidence Accuracy

**Definition:** Correlation between the extraction_confidence score and actual precision.

**Target:** Pearson correlation >0.7

**Computation:**

```python
from scipy.stats import pearsonr

def evaluate_extraction_confidence(test_results: List[dict]) -> float:
    """
    Evaluate extraction confidence calibration.

    test_results: [
        {
            "extraction_confidence": 0.85,
            "actual_precision": 0.90
        },
        ...
    ]
    """
    confidences = [r['extraction_confidence'] for r in test_results]
    precisions = [r['actual_precision'] for r in test_results]
    correlation, _ = pearsonr(confidences, precisions)
    return correlation
```
---

## ENGAGEMENT METRICS

### Metric 11: Average Conversation Length

**Definition:** Mean number of turns per conversation.

**Target:** ≥10 turns (demonstrates sustained engagement)

**Computation:**

```python
def compute_avg_conversation_length(conversations: List[dict]) -> float:
    """
    conversations: [{"session_id": str, "turn_count": int}, ...]
    """
    if len(conversations) == 0:
        return 0.0
    total_turns = sum(conv['turn_count'] for conv in conversations)
    return total_turns / len(conversations)
```
---

### Metric 12: Intelligence Extraction Rate

**Definition:** Proportion of conversations that extract at least one intelligence entity.

**Target:** ≥70%

**Computation:**

```python
def compute_extraction_rate(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "extracted_intelligence": {
                "upi_ids": [...],
                ...
            }
        },
        ...
    ]
    """
    if len(conversations) == 0:
        return 0.0
    extracted_count = 0
    for conv in conversations:
        intel = conv['extracted_intelligence']
        has_intel = any(
            len(intel.get(entity_type, [])) > 0
            for entity_type in ENTITY_TYPES
        )
        if has_intel:
            extracted_count += 1
    return extracted_count / len(conversations)
```
---

### Metric 13: Persona Consistency

**Definition:** Proportion of conversations where the persona remains consistent across all turns.

**Target:** ≥95%

**Computation:**

```python
def compute_persona_consistency(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "messages": [
                {"turn": 1, "sender": "agent", "persona": "elderly"},
                {"turn": 2, "sender": "agent", "persona": "elderly"},
                ...
            ]
        },
        ...
    ]
    """
    consistent_count = 0
    evaluated_count = 0
    for conv in conversations:
        agent_messages = [msg for msg in conv['messages'] if msg['sender'] == 'agent']
        if len(agent_messages) == 0:
            continue  # No agent turns, so nothing to evaluate
        evaluated_count += 1
        personas = [msg.get('persona') for msg in agent_messages]
        if len(set(personas)) == 1:  # Same persona on every turn
            consistent_count += 1
    # Denominator counts only conversations that actually had agent turns
    return consistent_count / evaluated_count if evaluated_count > 0 else 0.0
```
---

### Metric 14: Engagement Quality Score

**Definition:** Composite score measuring naturalness and effectiveness of engagement.

**Components:**

1. Average turns (weight: 0.4)
2. Extraction rate (weight: 0.4)
3. Persona consistency (weight: 0.2)

**Target:** ≥0.8

**Computation:**

```python
def compute_engagement_quality(avg_turns: float, extraction_rate: float, persona_consistency: float) -> float:
    """
    Normalize and weight engagement metrics.

    Args:
        avg_turns: Actual average turns
        extraction_rate: 0.0-1.0
        persona_consistency: 0.0-1.0
    """
    # Normalize avg_turns against a 20-turn ceiling
    normalized_turns = min(avg_turns / 20, 1.0)
    quality_score = (
        0.4 * normalized_turns +
        0.4 * extraction_rate +
        0.2 * persona_consistency
    )
    return quality_score
```
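
Two worked points (illustrative inputs) show how the normalization interacts with the targets: meeting every per-metric target exactly is not enough to reach the 0.8 composite, because turns are normalized against the 20-turn ceiling:

```python
# Meeting each per-metric target exactly falls short of the 0.8 composite:
# 0.4 * (10/20) + 0.4 * 0.70 + 0.2 * 0.95 = 0.67
print(f"{compute_engagement_quality(10, 0.70, 0.95):.2f}")  # 0.67
# Roughly 16 turns and an 80% extraction rate clear the bar:
# 0.4 * (16/20) + 0.4 * 0.80 + 0.2 * 0.95 = 0.83
print(f"{compute_engagement_quality(16, 0.80, 0.95):.2f}")  # 0.83
```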
---

## PERFORMANCE METRICS

### Metric 15: API Response Time

**Definition:** Time from request received to response sent.

**Targets:**

- P50 (Median): <1 second
- P95: <2 seconds
- P99: <3 seconds

**Computation:**

```python
import numpy as np

def compute_response_time_percentiles(response_times: List[float]) -> dict:
    """
    response_times: List of times in seconds
    """
    return {
        'p50': np.percentile(response_times, 50),
        'p95': np.percentile(response_times, 95),
        'p99': np.percentile(response_times, 99),
        'mean': np.mean(response_times),
        'max': np.max(response_times)
    }
```
---

### Metric 16: Throughput

**Definition:** Number of requests processed per minute.

**Target:** ≥100 requests/minute (sustained)

**Computation:**

```python
def compute_throughput(total_requests: int, time_window_seconds: float) -> float:
    """Returns requests per minute"""
    return (total_requests / time_window_seconds) * 60
```

---

### Metric 17: Error Rate

**Definition:** Proportion of requests that result in errors (4xx, 5xx).

**Target:** <1%

**Computation:**

```python
def compute_error_rate(total_requests: int, error_count: int) -> float:
    """Returns error rate as a proportion (0.0-1.0)"""
    return error_count / total_requests if total_requests > 0 else 0.0
```

---

### Metric 18: Uptime

**Definition:** Percentage of time the service is available and healthy.

**Target:** ≥99% during the competition testing window

**Computation:**

```python
def compute_uptime(total_time_seconds: float, downtime_seconds: float) -> float:
    """Returns uptime as a percentage"""
    return ((total_time_seconds - downtime_seconds) / total_time_seconds) * 100
```
---

## COMPUTATION METHODS

### Complete Evaluation Pipeline

```python
import concurrent.futures
import json
import time
from datetime import datetime

import numpy as np
import requests

# Assumes the metric functions and ENTITY_TYPES defined above are in scope.

class ScamShieldEvaluator:
    """Complete evaluation framework for ScamShield AI"""

    def __init__(self, api_endpoint: str):
        self.api_endpoint = api_endpoint
        self.results = {
            'detection': {},
            'extraction': {},
            'engagement': {},
            'performance': {}
        }

    def evaluate_detection(self, test_file: str) -> dict:
        """
        Evaluate scam detection on the test dataset.

        Args:
            test_file: Path to JSONL test file

        Returns:
            Detection metrics dictionary
        """
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]
        predictions = []
        ground_truth = []
        response_times = []
        for item in test_data:
            start_time = time.time()
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['message'], "language": item['language']}
            )
            response_times.append(time.time() - start_time)
            result = response.json()
            predictions.append({
                'id': item['id'],
                'scam_detected': result['scam_detected'],
                'confidence': result['confidence']
            })
            ground_truth.append({
                'id': item['id'],
                'label': item['ground_truth']['label'],
                'language': item['language']
            })
        # Compute metrics
        accuracy = compute_detection_accuracy(predictions, ground_truth)
        precision = compute_precision(predictions, ground_truth)
        recall = compute_recall(predictions, ground_truth)
        f1 = compute_f1_score(precision, recall)
        ece = compute_ece(predictions, ground_truth)
        lang_acc = compute_language_specific_accuracy(predictions, ground_truth)
        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'ece': ece,
            'language_accuracy': lang_acc,
            'avg_response_time': np.mean(response_times),
            'total_samples': len(test_data)
        }

    def evaluate_extraction(self, test_file: str) -> dict:
        """Evaluate intelligence extraction on the test dataset"""
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]
        all_precisions = {entity: [] for entity in ENTITY_TYPES}
        all_recalls = {entity: [] for entity in ENTITY_TYPES}
        for item in test_data:
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['text'], "language": item['language']}
            )
            result = response.json()
            extracted = result['extracted_intelligence']
            ground_truth = item['ground_truth']
            precisions = compute_extraction_precision(extracted, ground_truth)
            recalls = compute_extraction_recall(extracted, ground_truth)
            for entity in all_precisions:
                all_precisions[entity].append(precisions[entity])
                all_recalls[entity].append(recalls[entity])
        # Average across all samples
        avg_precisions = {entity: np.mean(all_precisions[entity]) for entity in all_precisions}
        avg_recalls = {entity: np.mean(all_recalls[entity]) for entity in all_recalls}
        overall_f1 = compute_overall_extraction_f1(avg_precisions, avg_recalls)
        return {
            'precisions': avg_precisions,
            'recalls': avg_recalls,
            'overall_f1': overall_f1,
            'total_samples': len(test_data)
        }

    def evaluate_engagement(self, conversation_file: str) -> dict:
        """Evaluate multi-turn engagement quality"""
        with open(conversation_file, 'r') as f:
            conversations = [json.loads(line) for line in f]
        completed_conversations = []
        for conv in conversations:
            session_id = None
            turn_count = 0
            extracted_intel = {}
            for turn in conv['turns']:
                if turn['sender'] != 'scammer':
                    continue  # Replay only the scammer side of the script
                response = requests.post(
                    f"{self.api_endpoint}/honeypot/engage",
                    json={
                        "message": turn['message'],
                        "session_id": session_id,
                        "language": conv['language']
                    }
                )
                result = response.json()
                if session_id is None:
                    session_id = result['session_id']
                turn_count = result['engagement']['turn_count']
                extracted_intel = result['extracted_intelligence']
                # Stop once the agent terminates the session
                if result['engagement']['max_turns_reached']:
                    break
            completed_conversations.append({
                'session_id': session_id,
                'turn_count': turn_count,
                'extracted_intelligence': extracted_intel
            })
        avg_turns = compute_avg_conversation_length(completed_conversations)
        extraction_rate = compute_extraction_rate(completed_conversations)
        return {
            'avg_conversation_length': avg_turns,
            'intelligence_extraction_rate': extraction_rate,
            'total_conversations': len(completed_conversations)
        }

    def evaluate_performance(self, duration_seconds: int = 60, target_rps: int = 10) -> dict:
        """Load-test performance metrics"""
        test_message = "You won 10 lakh rupees! Send OTP to claim."
        response_times = []
        errors = 0

        def make_request():
            try:
                start = time.time()
                response = requests.post(
                    f"{self.api_endpoint}/honeypot/engage",
                    json={"message": test_message},
                    timeout=5
                )
                latency = time.time() - start
                if response.status_code != 200:
                    return None, 1
                return latency, 0
            except Exception:
                return None, 1

        start_time = time.time()
        futures = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
            # Submit at the target rate; workers execute requests concurrently
            while time.time() - start_time < duration_seconds:
                futures.append(executor.submit(make_request))
                time.sleep(1.0 / target_rps)
            # Drain all in-flight requests
            for future in futures:
                latency, error = future.result()
                if latency is not None:
                    response_times.append(latency)
                errors += error
        total_requests = len(futures)
        elapsed_time = time.time() - start_time
        percentiles = compute_response_time_percentiles(response_times)
        throughput = compute_throughput(total_requests, elapsed_time)
        error_rate = compute_error_rate(total_requests, errors)
        return {
            'response_time_percentiles': percentiles,
            'throughput_rpm': throughput,
            'error_rate': error_rate,
            'total_requests': total_requests,
            'duration_seconds': elapsed_time
        }

    def run_full_evaluation(self) -> dict:
        """Run the complete evaluation suite"""
        print("Running detection evaluation...")
        self.results['detection'] = self.evaluate_detection('data/scam_detection_test.jsonl')
        print("Running extraction evaluation...")
        self.results['extraction'] = self.evaluate_extraction('data/intelligence_extraction_test.jsonl')
        print("Running engagement evaluation...")
        self.results['engagement'] = self.evaluate_engagement('data/conversation_simulation_test.jsonl')
        print("Running performance evaluation...")
        self.results['performance'] = self.evaluate_performance(duration_seconds=60, target_rps=10)
        return self.results

    def generate_report(self, output_file: str = 'evaluation_report.json'):
        """Generate a comprehensive evaluation report"""
        report = {
            'timestamp': datetime.utcnow().isoformat(),
            'api_endpoint': self.api_endpoint,
            'results': self.results,
            'pass_criteria': {
                'detection_accuracy': self.results['detection']['accuracy'] >= 0.90,
                'extraction_f1': self.results['extraction']['overall_f1'] >= 0.85,
                'avg_conversation_length': self.results['engagement']['avg_conversation_length'] >= 10,
                'response_time_p95': self.results['performance']['response_time_percentiles']['p95'] < 2.0,
                'error_rate': self.results['performance']['error_rate'] < 0.01
            }
        }
        with open(output_file, 'w') as f:
            # default=float converts NumPy scalars so the report serializes
            json.dump(report, f, indent=2, default=float)
        print(f"Evaluation report saved to {output_file}")
        return report
```
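
A minimal driver for the framework might look like this (the endpoint URL is an assumption, borrowed from the integration-test fixture below):

```python
# Hypothetical driver; the endpoint value mirrors the integration-test fixture
evaluator = ScamShieldEvaluator(api_endpoint="http://localhost:8000/api/v1")
evaluator.run_full_evaluation()
report = evaluator.generate_report("evaluation_report.json")
print(report['pass_criteria'])
```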
---

## TESTING FRAMEWORK

### Test Suite Organization

```
tests/
├── unit/
│   ├── test_detection.py
│   ├── test_extraction.py
│   ├── test_persona.py
│   └── test_utils.py
├── integration/
│   ├── test_api_endpoints.py
│   ├── test_database.py
│   └── test_llm_integration.py
├── performance/
│   ├── test_load.py
│   └── test_latency.py
├── acceptance/
│   ├── test_requirements.py
│   └── test_red_team.py
└── conftest.py
```
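
This spec does not spell out `conftest.py`; a minimal sketch (fixture names and scopes are assumptions) would centralize the fixtures the suites share:

```python
# tests/conftest.py: a minimal sketch; fixture names and scopes are assumptions
import pytest

@pytest.fixture(scope="session")
def api_url() -> str:
    """Base URL shared by the integration and acceptance suites."""
    return "http://localhost:8000/api/v1"

@pytest.fixture
def scam_message() -> str:
    """Canonical scam sample reused across detection tests."""
    return "You won 10 lakh rupees! Send OTP to claim."
```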
### Sample Unit Test

```python
# tests/unit/test_detection.py
import pytest

from app.models.detector import ScamDetector

@pytest.fixture
def detector():
    return ScamDetector()

def test_english_scam_detection(detector):
    """Test English scam message detection"""
    message = "You won 10 lakh rupees! Send OTP immediately."
    result = detector.detect(message)
    assert result['scam_detected'] is True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'en'

def test_hindi_scam_detection(detector):
    """Test Hindi scam message detection"""
    # "You will be arrested. Send money."
    message = "आप गिरफ्तार हो जाएंगे। पैसे भेजें।"
    result = detector.detect(message)
    assert result['scam_detected'] is True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'hi'

def test_legitimate_message(detector):
    """Test legitimate message classification"""
    message = "Hi, how are you? Let's meet for coffee."
    result = detector.detect(message)
    assert result['scam_detected'] is False
    assert result['confidence'] <= 0.3
```
### Sample Integration Test

```python
# tests/integration/test_api_endpoints.py
import pytest
import requests

@pytest.fixture
def api_url():
    return "http://localhost:8000/api/v1"

def test_engage_endpoint_scam(api_url):
    """Test /honeypot/engage with a scam message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "You won 10 lakh rupees! Send OTP.",
            "language": "auto"
        }
    )
    assert response.status_code == 200
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] is True
    assert 'agent_response' in data['engagement']
    assert data['engagement']['turn_count'] == 1

def test_engage_endpoint_legitimate(api_url):
    """Test /honeypot/engage with a legitimate message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "Hi, how are you?",
            "language": "auto"
        }
    )
    assert response.status_code == 200
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] is False
```
---

## COMPETITION SCORING (PREDICTED)

### Predicted Judging Rubric

Based on Challenge 2 requirements, we predict the following scoring:

| Category | Weight | Metrics | Our Target | Competitive Advantage |
|----------|--------|---------|------------|----------------------|
| **Scam Detection** | 25% | Accuracy, Precision, Recall | 92% accuracy | IndicBERT + hybrid approach |
| **Engagement Quality** | 25% | Avg turns, Naturalness | 12 turns avg | Multi-turn agentic AI |
| **Intelligence Extraction** | 30% | Precision, Recall, Coverage | 88% F1 | Hybrid NER + regex |
| **Response Time** | 10% | P95 latency | <1.8s | Optimized inference |
| **System Robustness** | 10% | Uptime, Error rate | 99.5% uptime | Production architecture |
### Expected Score Calculation

```python
def calculate_competition_score(metrics: dict) -> float:
    """
    Calculate the predicted competition score.

    Args:
        metrics: Dictionary with all evaluation metrics

    Returns:
        Estimated score (0-100)
    """
    weights = {
        'detection': 0.25,
        'engagement': 0.25,
        'extraction': 0.30,
        'performance': 0.10,
        'robustness': 0.10
    }
    # Normalize each category to 0-1: detection/engagement/extraction cap at
    # their targets, and performance scores higher for lower p95 latency
    detection_score = min(metrics['detection']['accuracy'] / 0.90, 1.0)
    engagement_score = min(metrics['engagement']['avg_conversation_length'] / 10, 1.0)
    extraction_score = min(metrics['extraction']['overall_f1'] / 0.85, 1.0)
    performance_score = 1.0 - min(metrics['performance']['response_time_percentiles']['p95'] / 2.0, 1.0)
    robustness_score = 1.0 - metrics['performance']['error_rate']
    total_score = (
        weights['detection'] * detection_score +
        weights['engagement'] * engagement_score +
        weights['extraction'] * extraction_score +
        weights['performance'] * performance_score +
        weights['robustness'] * robustness_score
    ) * 100
    return total_score
```
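
Plugging the rubric targets into this model (illustrative inputs, not measured results) shows where the points come from:

```python
# Illustrative inputs taken from the rubric targets above
metrics = {
    'detection': {'accuracy': 0.92},
    'engagement': {'avg_conversation_length': 12},
    'extraction': {'overall_f1': 0.88},
    'performance': {
        'response_time_percentiles': {'p95': 1.8},
        'error_rate': 0.005,
    },
}
# 0.25*1.0 + 0.25*1.0 + 0.30*1.0 + 0.10*0.1 + 0.10*0.995 = 0.9095
print(f"{calculate_competition_score(metrics):.2f}")  # 90.95
```

Under this model a 1.8s p95 earns only 10% of the performance category, so driving p95 well below 2s is the cheapest remaining source of points.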
---

## CONTINUOUS MONITORING

### Production Metrics Dashboard

```python
from prometheus_client import Counter, Histogram, Gauge, Summary

# Define metrics
scam_detection_total = Counter(
    'scamshield_scam_detection_total',
    'Total number of scam detections',
    ['language', 'result']
)
intelligence_extracted_total = Counter(
    'scamshield_intelligence_extracted_total',
    'Total pieces of intelligence extracted',
    ['type']
)
api_response_time = Histogram(
    'scamshield_api_response_time_seconds',
    'API response time in seconds',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)
active_sessions = Gauge(
    'scamshield_active_sessions',
    'Number of active honeypot sessions'
)
detection_accuracy = Summary(
    'scamshield_detection_accuracy',
    'Detection accuracy over a sliding window'
)
```
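
A sketch of how a request handler might update these metrics (the helper function and its call shape are assumptions, not a defined interface):

```python
# Illustrative instrumentation sketch; the handler shape is an assumption
def record_engagement_result(result: dict, language: str, latency_seconds: float) -> None:
    """Update Prometheus metrics after one /honeypot/engage call."""
    api_response_time.observe(latency_seconds)
    outcome = 'scam' if result['scam_detected'] else 'legitimate'
    scam_detection_total.labels(language=language, result=outcome).inc()
    for entity_type, values in result.get('extracted_intelligence', {}).items():
        if values:
            intelligence_extracted_total.labels(type=entity_type).inc(len(values))
```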
---

**Document Status:** Production Ready
**Next Steps:** Implement the evaluation framework, run tests, generate baseline metrics
**Update Frequency:** Daily during development, hourly during competition testing