Phase	Timeline	Focus	Pass Criteria
Unit Testing	Days 3-9	Individual components	>80% code coverage
Integration Testing	Day 8	End-to-end flows	All API endpoints functional
Performance Testing	Day 9	Load, latency	<2s p95 latency, 100 req/min
Acceptance Testing	Day 10	Requirements validation	All FRD acceptance criteria met
Red Team Testing	Day 10	Adversarial scenarios	>80% red team tests passed
Pre-Submission	Day 11	Final validation	>90% detection accuracy

DETECTION METRICS

Metric 1: Scam Detection Accuracy

Definition: Proportion of messages correctly classified as scam or legitimate.

Formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where:
- TP (True Positives): Scams correctly identified
- TN (True Negatives): Legitimate messages correctly identified
- FP (False Positives): Legitimate messages incorrectly flagged as scams
- FN (False Negatives): Scams missed

Target: ≥90%

Computation:

from typing import List

def compute_detection_accuracy(predictions: List[dict], ground_truth: List[dict]) -> float:
    """
    Compute scam detection accuracy.
    
    Args:
        predictions: List of {"id": str, "scam_detected": bool}
        ground_truth: List of {"id": str, "label": "scam"|"legitimate"}
    
    Returns:
        Accuracy score (0.0-1.0)
    """
    assert len(predictions) == len(ground_truth), "Mismatched lengths"
    
    # Align by ID
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    
    correct = sum(1 for id in pred_map if pred_map[id] == gt_map[id])
    total = len(pred_map)
    
    return correct / total if total > 0 else 0.0

# Example usage
predictions = [
    {"id": "test_001", "scam_detected": True},
    {"id": "test_002", "scam_detected": False},
    {"id": "test_003", "scam_detected": True}
]

ground_truth = [
    {"id": "test_001", "label": "scam"},
    {"id": "test_002", "label": "legitimate"},
    {"id": "test_003", "label": "scam"}
]

accuracy = compute_detection_accuracy(predictions, ground_truth)
print(f"Accuracy: {accuracy:.2%}")  # Prints: Accuracy: 100.00%

Metric 2: Precision

Definition: Of all messages flagged as scams, what proportion are actual scams?

Formula:

Precision = TP / (TP + FP)

Target: ≥85%

Significance: High precision minimizes false alarms (legitimate messages flagged as scams).

Computation:

def compute_precision(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute precision for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    
    tp = sum(1 for id in pred_map if pred_map[id] and gt_map[id])
    fp = sum(1 for id in pred_map if pred_map[id] and not gt_map[id])
    
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

Metric 3: Recall (Sensitivity)

Definition: Of all actual scams, what proportion are detected?

Formula:

Recall = TP / (TP + FN)

Target: ≥90%

Significance: High recall ensures few scams are missed.

Computation:

def compute_recall(predictions: List[dict], ground_truth: List[dict]) -> float:
    """Compute recall for scam detection"""
    pred_map = {p['id']: p['scam_detected'] for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    
    tp = sum(1 for id in pred_map if pred_map[id] and gt_map[id])
    fn = sum(1 for id in pred_map if not pred_map[id] and gt_map[id])
    
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

Metric 4: F1-Score

Definition: Harmonic mean of precision and recall.

Formula:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Target: ≥87%

Computation:

def compute_f1_score(precision: float, recall: float) -> float:
    """Compute F1-score from precision and recall"""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)
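
Metrics 1 through 4 all derive from the same four confusion counts, so it can help to compute the counts once and derive every score from them. The sketch below is one way to do that on a four-sample toy dataset. The helper names `confusion_counts` and `scores_from_counts` are illustrative and not part of the framework above, and skipping predictions that have no label is an assumption.

```python
from typing import Dict, List, Tuple


def confusion_counts(predictions: List[dict], ground_truth: List[dict]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for samples that appear in both lists."""
    gt_map = {g["id"]: g["label"] == "scam" for g in ground_truth}
    tp = tn = fp = fn = 0
    for p in predictions:
        if p["id"] not in gt_map:
            continue  # assumption: ignore predictions that have no label
        predicted, actual = p["scam_detected"], gt_map[p["id"]]
        if predicted and actual:
            tp += 1
        elif not predicted and not actual:
            tn += 1
        elif predicted and not actual:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def scores_from_counts(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Derive accuracy, precision, recall and F1 from the four counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy data: one sample in each confusion cell
predictions = [
    {"id": "a", "scam_detected": True},
    {"id": "b", "scam_detected": True},
    {"id": "c", "scam_detected": False},
    {"id": "d", "scam_detected": False},
]
ground_truth = [
    {"id": "a", "label": "scam"},
    {"id": "b", "label": "legitimate"},
    {"id": "c", "label": "scam"},
    {"id": "d", "label": "legitimate"},
]

counts = confusion_counts(predictions, ground_truth)
scores = scores_from_counts(*counts)
print(counts, scores)
```

With one sample in each cell, all four scores come out to 0.5, which makes the toy case easy to check by hand.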

Metric 5: Confidence Calibration

Definition: How well do confidence scores correlate with actual accuracy?

Formula:

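Expected Calibration Error (ECE) = Σ_b (n_b / N) × |accuracy_b - avg_confidence_b|

Where b ranges over the confidence bins [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0], n_b is the number of samples in bin b, and N is the total number of samples.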

Target: ECE <0.1 (well-calibrated)

Computation:

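def compute_ece(predictions: List[dict], ground_truth: List[dict], n_bins: int = 10) -> float:
    """
    Compute Expected Calibration Error.
    
    Predictions must include a "confidence" field. The computation below
    compares it against whether the predicted label was correct, so it is
    treated as confidence in the predicted label, not as P(scam).
    """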
    pred_map = {p['id']: (p['scam_detected'], p['confidence']) for p in predictions}
    gt_map = {g['id']: (g['label'] == 'scam') for g in ground_truth}
    
    bins = [[] for _ in range(n_bins)]
    
    for id in pred_map:
        pred, conf = pred_map[id]
        actual = gt_map[id]
        correct = (pred == actual)
        
        bin_idx = min(int(conf * n_bins), n_bins - 1)
        bins[bin_idx].append((conf, correct))
    
    ece = 0.0
    total = len(pred_map)
    
    for bin_samples in bins:
        if len(bin_samples) == 0:
            continue
        
        avg_conf = sum(conf for conf, _ in bin_samples) / len(bin_samples)
        accuracy = sum(1 for _, correct in bin_samples if correct) / len(bin_samples)
        
        ece += abs(accuracy - avg_conf) * (len(bin_samples) / total)
    
    return ece
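
A small worked case can make the ECE arithmetic easier to check. The sketch below applies the same binning to four hand-picked (confidence, correct) pairs. The function name `ece_from_pairs` and the sample values are illustrative assumptions, and confidence is taken to mean confidence in the predicted label.

```python
from typing import List, Tuple


def ece_from_pairs(samples: List[Tuple[float, bool]], n_bins: int = 10) -> float:
    """ECE over (confidence, prediction_was_correct) pairs with equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in samples:
        # A confidence of exactly 1.0 is clamped into the last bin
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))

    ece = 0.0
    for bin_samples in bins:
        if not bin_samples:
            continue
        avg_conf = sum(c for c, _ in bin_samples) / len(bin_samples)
        accuracy = sum(1 for _, ok in bin_samples if ok) / len(bin_samples)
        # Weight each bin's calibration gap by its share of all samples
        ece += abs(accuracy - avg_conf) * len(bin_samples) / len(samples)
    return ece


# Bin 0.9-1.0: confidence 0.95, accuracy 1.0 -> gap 0.05, weight 0.5
# Bin 0.6-0.7: confidence 0.65, accuracy 0.5 -> gap 0.15, weight 0.5
samples = [(0.95, True), (0.95, True), (0.65, True), (0.65, False)]
ece = ece_from_pairs(samples)
print(f"ECE = {ece:.3f}")
```

Here the two bins contribute 0.025 and 0.075, so the result sits at about 0.10, right on the boundary of the stated target.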

Metric 6: Language-Specific Accuracy

Definition: Detection accuracy broken down by language.

Target:

English: ≥92%
Hindi: ≥88%
Hinglish: ≥85%
Fairness: <5% difference between languages

Computation:

def compute_language_specific_accuracy(predictions: List[dict], ground_truth: List[dict]) -> dict:
    """Compute accuracy per language"""
    from collections import defaultdict
    
    lang_correct = defaultdict(int)
    lang_total = defaultdict(int)
    
    pred_map = {p['id']: p for p in predictions}
    gt_map = {g['id']: g for g in ground_truth}
    
    for id in pred_map:
        lang = gt_map[id]['language']
        pred_scam = pred_map[id]['scam_detected']
        actual_scam = (gt_map[id]['label'] == 'scam')
        
        lang_total[lang] += 1
        if pred_scam == actual_scam:
            lang_correct[lang] += 1
    
    return {
        lang: lang_correct[lang] / lang_total[lang] if lang_total[lang] > 0 else 0.0
        for lang in lang_total
    }

# Check fairness
def check_language_fairness(lang_accuracies: dict, threshold: float = 0.05) -> bool:
    """Ensure accuracy difference between languages is within threshold"""
    accuracies = list(lang_accuracies.values())
    max_diff = max(accuracies) - min(accuracies)
    return max_diff < threshold
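
The per-language targets and the fairness threshold interact, which a quick calculation can show. The sketch below plugs the three minimum targets from Metric 6 into the same max-minus-min gap. The helper name `max_accuracy_gap` and the language keys are assumptions for illustration only.

```python
def max_accuracy_gap(lang_accuracies: dict) -> float:
    """Largest accuracy difference between any two languages (0.0 if empty)."""
    if not lang_accuracies:
        return 0.0
    values = list(lang_accuracies.values())
    return max(values) - min(values)


# Each language sitting exactly at its minimum target from Metric 6
at_minimum_targets = {"english": 0.92, "hindi": 0.88, "hinglish": 0.85}

gap = max_accuracy_gap(at_minimum_targets)
passes_fairness = gap < 0.05
print(f"gap={gap:.2f}, passes_fairness={passes_fairness}")
```

Under these inputs the gap is about 0.07, so a system that only just meets each per-language minimum would still fail the <5% fairness check. Hinglish would need to reach roughly 87% or more when English is at 92%.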

EXTRACTION METRICS

Metric 7: Extraction Precision (per entity type)

Definition: Of all extracted entities, what proportion are correct?

Formula:

Precision_entity = |Extracted ∩ Ground_Truth| / |Extracted|

For entity types: upi_ids, bank_accounts, ifsc_codes, phone_numbers, phishing_links

Target:

UPI IDs: ≥90%
Bank Accounts: ≥85%
IFSC Codes: ≥95%
Phone Numbers: ≥90%
Phishing Links: ≥95%

Computation:

def compute_extraction_precision(extracted: dict, ground_truth: dict) -> dict:
    """
    Compute precision for each entity type.
    
    Args:
        extracted: {"upi_ids": [...], "bank_accounts": [...], ...}
        ground_truth: Same structure
    
    Returns:
        {"upi_ids": precision, "bank_accounts": precision, ...}
    """
    precisions = {}
    
    for entity_type in ['upi_ids', 'bank_accounts', 'ifsc_codes', 'phone_numbers', 'phishing_links']:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))
        
        if len(extracted_set) == 0:
            precisions[entity_type] = 1.0 if len(gt_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            precisions[entity_type] = correct / len(extracted_set)
    
    return precisions

Metric 8: Extraction Recall (per entity type)

Definition: Of all actual entities, what proportion are extracted?

Formula:

Recall_entity = |Extracted ∩ Ground_Truth| / |Ground_Truth|

Target:

UPI IDs: ≥85%
Bank Accounts: ≥80%
IFSC Codes: ≥90%
Phone Numbers: ≥85%
Phishing Links: ≥90%

Computation:

def compute_extraction_recall(extracted: dict, ground_truth: dict) -> dict:
    """Compute recall for each entity type"""
    recalls = {}
    
    for entity_type in ['upi_ids', 'bank_accounts', 'ifsc_codes', 'phone_numbers', 'phishing_links']:
        extracted_set = set(extracted.get(entity_type, []))
        gt_set = set(ground_truth.get(entity_type, []))
        
        if len(gt_set) == 0:
            recalls[entity_type] = 1.0 if len(extracted_set) == 0 else 0.0
        else:
            correct = len(extracted_set & gt_set)
            recalls[entity_type] = correct / len(gt_set)
    
    return recalls
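
Because both extraction metrics rely on set intersection, an entity only counts as correct when the extracted string matches the ground-truth string exactly. The sketch below illustrates this for a single entity type. The helper name `precision_recall` and all sample values are made up for illustration.

```python
from typing import List, Tuple


def precision_recall(extracted: List[str], expected: List[str]) -> Tuple[float, float]:
    """Set-based precision and recall for one entity type."""
    extracted_set, expected_set = set(extracted), set(expected)
    if not extracted_set and not expected_set:
        return 1.0, 1.0  # nothing to find and nothing reported
    correct = len(extracted_set & expected_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(expected_set) if expected_set else 0.0
    return precision, recall


# Two of three extracted UPI IDs are right; two of four real ones are found
upi_p, upi_r = precision_recall(
    ["scammer@paytm", "fraud@ybl", "x@upi"],
    ["scammer@paytm", "fraud@ybl", "other@oksbi", "more@okaxis"],
)

# Same phone number, different formatting: exact matching treats it as a miss
phone_p, phone_r = precision_recall(["+91 98765 43210"], ["+919876543210"])

print(f"UPI precision={upi_p:.2f} recall={upi_r:.2f}")
print(f"Phone precision={phone_p:.2f} recall={phone_r:.2f}")
```

The phone case suggests that extracted values and ground truth should share one canonical format before comparison. Whether the pipeline above normalizes values is not stated.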

Metric 9: Overall Extraction F1-Score

Definition: Weighted average F1-score across all entity types.

Weights:

ENTITY_WEIGHTS = {
    'upi_ids': 0.30,
    'bank_accounts': 0.30,
    'ifsc_codes': 0.20,
    'phone_numbers': 0.10,
    'phishing_links': 0.10
}

Target: ≥85%

Computation:

def compute_overall_extraction_f1(precisions: dict, recalls: dict, weights: dict = ENTITY_WEIGHTS) -> float:
    """Compute weighted F1-score across entity types"""
    f1_scores = {}
    
    for entity_type in weights:
        p = precisions.get(entity_type, 0.0)
        r = recalls.get(entity_type, 0.0)
        
        if p + r == 0:
            f1_scores[entity_type] = 0.0
        else:
            f1_scores[entity_type] = 2 * (p * r) / (p + r)
    
    weighted_f1 = sum(f1_scores[entity] * weights[entity] for entity in weights)
    return weighted_f1
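
To see how the weights combine, the sketch below evaluates the weighted F1 when every entity type sits exactly at its minimum precision target (Metric 7) and recall target (Metric 8). It restates the weights so that it runs on its own; the helper name `f1` is illustrative.

```python
ENTITY_WEIGHTS = {
    "upi_ids": 0.30,
    "bank_accounts": 0.30,
    "ifsc_codes": 0.20,
    "phone_numbers": 0.10,
    "phishing_links": 0.10,
}


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall, 0.0 when both are zero."""
    return 2 * p * r / (p + r) if p + r else 0.0


# Minimum per-entity targets taken from Metrics 7 and 8
precisions = {"upi_ids": 0.90, "bank_accounts": 0.85, "ifsc_codes": 0.95,
              "phone_numbers": 0.90, "phishing_links": 0.95}
recalls = {"upi_ids": 0.85, "bank_accounts": 0.80, "ifsc_codes": 0.90,
           "phone_numbers": 0.85, "phishing_links": 0.90}

per_entity_f1 = {e: f1(precisions[e], recalls[e]) for e in ENTITY_WEIGHTS}
weighted_f1 = sum(per_entity_f1[e] * w for e, w in ENTITY_WEIGHTS.items())

for entity, score in per_entity_f1.items():
    print(f"{entity}: F1={score:.3f}")
print(f"weighted F1={weighted_f1:.3f}")
```

At these inputs the weighted F1 is roughly 0.874, so meeting every per-entity minimum also clears the 85% overall target. Bank accounts are the weakest term at about 0.824 and carry 30% of the weight.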

Metric 10: Extraction Confidence Accuracy

Definition: Correlation between extraction_confidence score and actual precision.

Target: Pearson correlation >0.7

Computation:

from scipy.stats import pearsonr

def evaluate_extraction_confidence(test_results: List[dict]) -> float:
    """
    Evaluate extraction confidence calibration.
    
    test_results: [
        {
            "extraction_confidence": 0.85,
            "actual_precision": 0.90
        },
        ...
    ]
    """
    confidences = [r['extraction_confidence'] for r in test_results]
    precisions = [r['actual_precision'] for r in test_results]
    
    correlation, p_value = pearsonr(confidences, precisions)
    return correlation

ENGAGEMENT METRICS

Metric 11: Average Conversation Length

Definition: Mean number of turns per conversation.

Target: ≥10 turns (demonstrates sustained engagement)

Computation:

def compute_avg_conversation_length(conversations: List[dict]) -> float:
    """
    conversations: [{"session_id": str, "turn_count": int}, ...]
    """
    if len(conversations) == 0:
        return 0.0
    
    total_turns = sum(conv['turn_count'] for conv in conversations)
    return total_turns / len(conversations)

Metric 12: Intelligence Extraction Rate

Definition: Proportion of conversations that extract at least one intelligence entity.

Target: ≥70%

Computation:

def compute_extraction_rate(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "extracted_intelligence": {
                "upi_ids": [...],
                ...
            }
        },
        ...
    ]
    """
    if len(conversations) == 0:
        return 0.0
    
    extracted_count = 0
    for conv in conversations:
        intel = conv['extracted_intelligence']
        has_intel = any(
            len(intel.get(entity_type, [])) > 0
            for entity_type in ['upi_ids', 'bank_accounts', 'ifsc_codes', 'phone_numbers', 'phishing_links']
        )
        if has_intel:
            extracted_count += 1
    
    return extracted_count / len(conversations)

Metric 13: Persona Consistency

Definition: Proportion of conversations where persona remains consistent across all turns.

Target: ≥95%

Computation:

def compute_persona_consistency(conversations: List[dict]) -> float:
    """
    conversations: [
        {
            "session_id": str,
            "messages": [
                {"turn": 1, "sender": "agent", "persona": "elderly"},
                {"turn": 2, "sender": "agent", "persona": "elderly"},
                ...
            ]
        },
        ...
    ]
    """
    consistent_count = 0
    evaluated_count = 0
    
    for conv in conversations:
        agent_messages = [msg for msg in conv['messages'] if msg['sender'] == 'agent']
        if len(agent_messages) == 0:
            continue  # No agent turns, so there is no persona to evaluate
        
        evaluated_count += 1
        personas = [msg.get('persona') for msg in agent_messages]
        if len(set(personas)) == 1:  # All same persona
            consistent_count += 1
    
    # Divide by evaluated conversations so skipped ones do not count as inconsistent
    return consistent_count / evaluated_count if evaluated_count > 0 else 0.0

Metric 14: Engagement Quality Score

Definition: Composite score measuring naturalness and effectiveness of engagement.

Components:

Average turns (weight: 0.4)
Extraction rate (weight: 0.4)
Persona consistency (weight: 0.2)

Target: ≥0.8

Computation:

def compute_engagement_quality(avg_turns: float, extraction_rate: float, persona_consistency: float) -> float:
    """
    Normalize and weight engagement metrics.
    
    Args:
        avg_turns: Actual average turns
        extraction_rate: 0.0-1.0
        persona_consistency: 0.0-1.0
    """
    # Normalize avg_turns (max 20)
    normalized_turns = min(avg_turns / 20, 1.0)
    
    quality_score = (
        0.4 * normalized_turns +
        0.4 * extraction_rate +
        0.2 * persona_consistency
    )
    
    return quality_score
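
A worked calculation shows how the 20-turn normalization affects the composite. The sketch below restates the formula compactly and evaluates it at the individual targets from Metrics 11 to 13. The function name `engagement_quality` and the `max_turns` parameter are illustrative.

```python
def engagement_quality(avg_turns: float, extraction_rate: float,
                       persona_consistency: float, max_turns: float = 20) -> float:
    """Weighted composite: 0.4 turns (capped at max_turns), 0.4 extraction, 0.2 persona."""
    normalized_turns = min(avg_turns / max_turns, 1.0)
    return 0.4 * normalized_turns + 0.4 * extraction_rate + 0.2 * persona_consistency


# Individual targets: 10 turns, 70% extraction rate, 95% persona consistency
at_minimum_targets = engagement_quality(10, 0.70, 0.95)

# Same rates, but conversations reach the 20-turn normalization cap
at_turn_cap = engagement_quality(20, 0.70, 0.95)

print(f"at minimum targets: {at_minimum_targets:.2f}")
print(f"at turn cap:        {at_turn_cap:.2f}")
```

Meeting each component target exactly yields about 0.67, below the 0.8 composite target. With extraction rate and persona consistency held at their targets, roughly 16.5 average turns would be needed to reach 0.8.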

PERFORMANCE METRICS

Metric 15: API Response Time

Definition: Time from request received to response sent.

Targets:

P50 (Median): <1 second
P95: <2 seconds
P99: <3 seconds

Computation:

import numpy as np

def compute_response_time_percentiles(response_times: List[float]) -> dict:
    """
    response_times: List of times in seconds
    """
    return {
        'p50': np.percentile(response_times, 50),
        'p95': np.percentile(response_times, 95),
        'p99': np.percentile(response_times, 99),
        'mean': np.mean(response_times),
        'max': np.max(response_times)
    }

Metric 16: Throughput

Definition: Number of requests processed per minute.

Target: ≥100 requests/minute (sustained)

Computation:

def compute_throughput(total_requests: int, time_window_seconds: float) -> float:
    """
    Returns requests per minute
    """
    return (total_requests / time_window_seconds) * 60 if time_window_seconds > 0 else 0.0

Metric 17: Error Rate

Definition: Proportion of requests that result in errors (4xx, 5xx).

Target: <1%

Computation:

def compute_error_rate(total_requests: int, error_count: int) -> float:
    """Returns error rate as proportion (0.0-1.0)"""
    return error_count / total_requests if total_requests > 0 else 0.0

Metric 18: Uptime

Definition: Percentage of time service is available and healthy.

Target: ≥99% during competition testing window
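
A percentage target is often easier to act on as a downtime budget. The sketch below converts the ≥99% target into allowed seconds of downtime. The 24-hour window and the helper name `downtime_budget_seconds` are assumptions for illustration, since the length of the competition testing window is not stated here.

```python
def downtime_budget_seconds(window_seconds: float, uptime_target_pct: float) -> float:
    """Maximum downtime that still meets the uptime target over the window."""
    return window_seconds * (1 - uptime_target_pct / 100)


# Assumed 24-hour testing window
window = 24 * 3600
budget = downtime_budget_seconds(window, 99.0)
print(f"Allowed downtime: {budget:.0f} s ({budget / 60:.1f} min)")
```

Under that assumed window, the budget works out to about 864 seconds, or a little over 14 minutes.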

Computation:

def compute_uptime(total_time_seconds: float, downtime_seconds: float) -> float:
    """Returns uptime as percentage"""
    return ((total_time_seconds - downtime_seconds) / total_time_seconds) * 100 if total_time_seconds > 0 else 0.0

COMPUTATION METHODS

Complete Evaluation Pipeline

import json
import time
from datetime import datetime
from typing import List, Dict, Tuple

import numpy as np
import requests

class ScamShieldEvaluator:
    """Complete evaluation framework for ScamShield AI"""
    
    def __init__(self, api_endpoint: str):
        self.api_endpoint = api_endpoint
        self.results = {
            'detection': {},
            'extraction': {},
            'engagement': {},
            'performance': {}
        }
    
    def evaluate_detection(self, test_file: str) -> dict:
        """
        Evaluate scam detection on test dataset.
        
        Args:
            test_file: Path to JSONL test file
        
        Returns:
            Detection metrics dictionary
        """
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]
        
        predictions = []
        ground_truth = []
        response_times = []
        
        for item in test_data:
            import time
            import requests
            
            start_time = time.time()
            
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['message'], "language": item['language']}
            )
            
            response_time = time.time() - start_time
            response_times.append(response_time)
            
            result = response.json()
            
            predictions.append({
                'id': item['id'],
                'scam_detected': result['scam_detected'],
                'confidence': result['confidence']
            })
            
            ground_truth.append({
                'id': item['id'],
                'label': item['ground_truth']['label'],
                'language': item['language']
            })
        
        # Compute metrics
        accuracy = compute_detection_accuracy(predictions, ground_truth)
        precision = compute_precision(predictions, ground_truth)
        recall = compute_recall(predictions, ground_truth)
        f1 = compute_f1_score(precision, recall)
        ece = compute_ece(predictions, ground_truth)
        lang_acc = compute_language_specific_accuracy(predictions, ground_truth)
        
        return {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'ece': ece,
            'language_accuracy': lang_acc,
            'avg_response_time': np.mean(response_times),
            'total_samples': len(test_data)
        }
    
    def evaluate_extraction(self, test_file: str) -> dict:
        """Evaluate intelligence extraction on test dataset"""
        with open(test_file, 'r') as f:
            test_data = [json.loads(line) for line in f]
        
        all_precisions = {entity: [] for entity in ['upi_ids', 'bank_accounts', 'ifsc_codes', 'phone_numbers', 'phishing_links']}
        all_recalls = {entity: [] for entity in ['upi_ids', 'bank_accounts', 'ifsc_codes', 'phone_numbers', 'phishing_links']}
        
        for item in test_data:
            response = requests.post(
                f"{self.api_endpoint}/honeypot/engage",
                json={"message": item['text'], "language": item['language']}
            )
            
            result = response.json()
            extracted = result['extracted_intelligence']
            ground_truth = item['ground_truth']
            
            precisions = compute_extraction_precision(extracted, ground_truth)
            recalls = compute_extraction_recall(extracted, ground_truth)
            
            for entity in all_precisions:
                all_precisions[entity].append(precisions[entity])
                all_recalls[entity].append(recalls[entity])
        
        # Average across all samples
        avg_precisions = {entity: np.mean(all_precisions[entity]) for entity in all_precisions}
        avg_recalls = {entity: np.mean(all_recalls[entity]) for entity in all_recalls}
        
        overall_f1 = compute_overall_extraction_f1(avg_precisions, avg_recalls)
        
        return {
            'precisions': avg_precisions,
            'recalls': avg_recalls,
            'overall_f1': overall_f1,
            'total_samples': len(test_data)
        }
    
    def evaluate_engagement(self, conversation_file: str) -> dict:
        """Evaluate multi-turn engagement quality"""
        with open(conversation_file, 'r') as f:
            conversations = [json.loads(line) for line in f]
        
        completed_conversations = []
        
        for conv in conversations:
            session_id = None
            turn_count = 0
            extracted_intel = {}
            
            for turn in conv['turns']:
                if turn['sender'] == 'scammer':
                    response = requests.post(
                        f"{self.api_endpoint}/honeypot/engage",
                        json={
                            "message": turn['message'],
                            "session_id": session_id,
                            "language": conv['language']
                        }
                    )
                    
                    result = response.json()
                    
                    if session_id is None:
                        session_id = result['session_id']
                    
                    turn_count = result['engagement']['turn_count']
                    extracted_intel = result['extracted_intelligence']
                    
                    # Check termination
                    if result['engagement']['max_turns_reached']:
                        break
            
            completed_conversations.append({
                'session_id': session_id,
                'turn_count': turn_count,
                'extracted_intelligence': extracted_intel
            })
        
        avg_turns = compute_avg_conversation_length(completed_conversations)
        extraction_rate = compute_extraction_rate(completed_conversations)
        
        return {
            'avg_conversation_length': avg_turns,
            'intelligence_extraction_rate': extraction_rate,
            'total_conversations': len(completed_conversations)
        }
    
    def evaluate_performance(self, duration_seconds: int = 60, target_rps: int = 10) -> dict:
        """Load test performance metrics"""
        import concurrent.futures
        import time
        
        test_message = "You won 10 lakh rupees! Send OTP to claim."
        response_times = []
        errors = 0
        
        def make_request():
            try:
                start = time.time()
                response = requests.post(
                    f"{self.api_endpoint}/honeypot/engage",
                    json={"message": test_message},
                    timeout=5
                )
                latency = time.time() - start
                
                if response.status_code != 200:
                    return None, 1
                return latency, 0
            except Exception:
                return None, 1
        
        start_time = time.time()
        total_requests = 0
        
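        futures = []
        with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
            # Submit at the target rate without waiting on each response,
            # so slow requests overlap instead of serializing the test
            while time.time() - start_time < duration_seconds:
                futures.append(executor.submit(make_request))
                time.sleep(1.0 / target_rps)
            
            # Collect results once all requests have been submitted
            for future in concurrent.futures.as_completed(futures):
                latency, error = future.result()
                
                total_requests += 1
                if latency is not None:
                    response_times.append(latency)
                errors += error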
        
        elapsed_time = time.time() - start_time
        
        percentiles = compute_response_time_percentiles(response_times)
        throughput = compute_throughput(total_requests, elapsed_time)
        error_rate = compute_error_rate(total_requests, errors)
        
        return {
            'response_time_percentiles': percentiles,
            'throughput_rpm': throughput,
            'error_rate': error_rate,
            'total_requests': total_requests,
            'duration_seconds': elapsed_time
        }
    
    def run_full_evaluation(self) -> dict:
        """Run complete evaluation suite"""
        print("Running detection evaluation...")
        self.results['detection'] = self.evaluate_detection('data/scam_detection_test.jsonl')
        
        print("Running extraction evaluation...")
        self.results['extraction'] = self.evaluate_extraction('data/intelligence_extraction_test.jsonl')
        
        print("Running engagement evaluation...")
        self.results['engagement'] = self.evaluate_engagement('data/conversation_simulation_test.jsonl')
        
        print("Running performance evaluation...")
        self.results['performance'] = self.evaluate_performance(duration_seconds=60, target_rps=10)
        
        return self.results
    
    def generate_report(self, output_file: str = 'evaluation_report.json'):
        """Generate comprehensive evaluation report"""
        report = {
            'timestamp': datetime.utcnow().isoformat(),
            'api_endpoint': self.api_endpoint,
            'results': self.results,
            # bool() converts NumPy booleans, which json.dump cannot serialize
            'pass_criteria': {
                'detection_accuracy': bool(self.results['detection']['accuracy'] >= 0.90),
                'extraction_f1': bool(self.results['extraction']['overall_f1'] >= 0.85),
                'avg_conversation_length': bool(self.results['engagement']['avg_conversation_length'] >= 10),
                'response_time_p95': bool(self.results['performance']['response_time_percentiles']['p95'] < 2.0),
                'error_rate': bool(self.results['performance']['error_rate'] < 0.01)
            }
        }
        
        with open(output_file, 'w') as f:
            json.dump(report, f, indent=2)
        
        print(f"Evaluation report saved to {output_file}")
        return report
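
The detection evaluation reads one JSON object per line and accesses `id`, `message`, `language`, and `ground_truth.label` on each record. The sketch below writes two such records to a temporary file and validates them before use. The loader name `load_detection_cases`, the sample IDs, and the language codes are illustrative assumptions, not part of the evaluator above.

```python
import json
import os
import tempfile

# Top-level fields that evaluate_detection reads from each record
REQUIRED_FIELDS = ("id", "message", "language", "ground_truth")


def load_detection_cases(path: str) -> list:
    """Read a JSONL file and check that each record has the expected fields."""
    cases = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            item = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in item]
            if missing or "label" not in item["ground_truth"]:
                raise ValueError(f"line {line_no}: missing {missing or ['ground_truth.label']}")
            cases.append(item)
    return cases


sample_records = [
    {"id": "test_001", "message": "You won 10 lakh rupees! Send OTP to claim.",
     "language": "en", "ground_truth": {"label": "scam"}},
    {"id": "test_002", "message": "Hi, how are you? Let's meet for coffee.",
     "language": "en", "ground_truth": {"label": "legitimate"}},
]

# Write the records as JSONL, load them back, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    for record in sample_records:
        tmp.write(json.dumps(record, ensure_ascii=False) + "\n")
    tmp_path = tmp.name

cases = load_detection_cases(tmp_path)
os.remove(tmp_path)
print(f"Loaded {len(cases)} valid cases")
```

Validating the file up front means a malformed record fails before any API calls are made, not partway through an evaluation run.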

TESTING FRAMEWORK

Test Suite Organization

tests/
├── unit/
│   ├── test_detection.py
│   ├── test_extraction.py
│   ├── test_persona.py
│   └── test_utils.py
├── integration/
│   ├── test_api_endpoints.py
│   ├── test_database.py
│   └── test_llm_integration.py
├── performance/
│   ├── test_load.py
│   └── test_latency.py
├── acceptance/
│   ├── test_requirements.py
│   └── test_red_team.py
└── conftest.py

Sample Unit Test

# tests/unit/test_detection.py
import pytest
from app.models.detector import ScamDetector

@pytest.fixture
def detector():
    return ScamDetector()

def test_english_scam_detection(detector):
    """Test English scam message detection"""
    message = "You won 10 lakh rupees! Send OTP immediately."
    
    result = detector.detect(message)
    
    assert result['scam_detected'] == True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'en'

def test_hindi_scam_detection(detector):
    """Test Hindi scam message detection"""
    message = "आप गिरफ्तार हो जाएंगे। पैसे भेजें।"
    
    result = detector.detect(message)
    
    assert result['scam_detected'] == True
    assert result['confidence'] >= 0.85
    assert result['language'] == 'hi'

def test_legitimate_message(detector):
    """Test legitimate message classification"""
    message = "Hi, how are you? Let's meet for coffee."
    
    result = detector.detect(message)
    
    assert result['scam_detected'] == False
    assert result['confidence'] <= 0.3

Sample Integration Test

# tests/integration/test_api_endpoints.py
import pytest
import requests

@pytest.fixture
def api_url():
    return "http://localhost:8000/api/v1"

def test_engage_endpoint_scam(api_url):
    """Test /honeypot/engage with scam message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "You won 10 lakh rupees! Send OTP.",
            "language": "auto"
        }
    )
    
    assert response.status_code == 200
    
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] == True
    assert 'agent_response' in data['engagement']
    assert data['engagement']['turn_count'] == 1

def test_engage_endpoint_legitimate(api_url):
    """Test /honeypot/engage with legitimate message"""
    response = requests.post(
        f"{api_url}/honeypot/engage",
        json={
            "message": "Hi, how are you?",
            "language": "auto"
        }
    )
    
    assert response.status_code == 200
    
    data = response.json()
    assert data['status'] == 'success'
    assert data['scam_detected'] == False

COMPETITION SCORING (PREDICTED)

Predicted Judging Rubric

Based on Challenge 2 requirements, we predict the following scoring:

Category	Weight	Metrics	Our Target	Competitive Advantage
Scam Detection	25%	Accuracy, Precision, Recall	92% accuracy	IndicBERT + hybrid approach
Engagement Quality	25%	Avg turns, Naturalness	12 turns avg	Multi-turn agentic AI
Intelligence Extraction	30%	Precision, Recall, Coverage	88% F1	Hybrid NER + regex
Response Time	10%	P95 latency	<1.8s	Optimized inference
System Robustness	10%	Uptime, Error rate	99.5% uptime	Production architecture

Expected Score Calculation

def calculate_competition_score(metrics: dict) -> float:
    """
    Calculate predicted competition score.
    
    Args:
        metrics: Dictionary with all evaluation metrics
    
    Returns:
        Estimated score (0-100)
    """
    weights = {
        'detection': 0.25,
        'engagement': 0.25,
        'extraction': 0.30,
        'performance': 0.10,
        'robustness': 0.10
    }
    
    # Normalize each category to 0-1
    detection_score = min(metrics['detection']['accuracy'] / 0.90, 1.0)
    engagement_score = min(metrics['engagement']['avg_conversation_length'] / 10, 1.0)
    extraction_score = min(metrics['extraction']['overall_f1'] / 0.85, 1.0)
    performance_score = 1.0 - min(metrics['performance']['response_time_percentiles']['p95'] / 2.0, 1.0)
    robustness_score = 1.0 - metrics['performance']['error_rate']
    
    total_score = (
        weights['detection'] * detection_score +
        weights['engagement'] * engagement_score +
        weights['extraction'] * extraction_score +
        weights['performance'] * performance_score +
        weights['robustness'] * robustness_score
    ) * 100
    
    return total_score
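
Plugging the "Our Target" column into the formula shows how each term contributes. The sketch below restates the calculation compactly so that it runs on its own. The function name `predicted_score` is illustrative, and the 0.5% error rate is an assumed input because the rubric table gives only an uptime target.

```python
def predicted_score(accuracy: float, avg_turns: float, extraction_f1: float,
                    p95_seconds: float, error_rate: float) -> float:
    """Weighted 0-100 score using the predicted rubric weights."""
    detection = min(accuracy / 0.90, 1.0)
    engagement = min(avg_turns / 10, 1.0)
    extraction = min(extraction_f1 / 0.85, 1.0)
    performance = 1.0 - min(p95_seconds / 2.0, 1.0)  # shrinks linearly to 0 at 2 s
    robustness = 1.0 - error_rate
    return 100 * (0.25 * detection + 0.25 * engagement + 0.30 * extraction
                  + 0.10 * performance + 0.10 * robustness)


# Targets from the rubric table; the error rate is an assumption
score = predicted_score(accuracy=0.92, avg_turns=12, extraction_f1=0.88,
                        p95_seconds=1.8, error_rate=0.005)
print(f"Predicted score: {score:.2f}")
```

With these inputs the score is about 90.95. The first three categories saturate at full marks, while a 1.8 s p95 earns only one tenth of the response-time weight, because that term reaches zero at 2 s.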

CONTINUOUS MONITORING

Production Metrics Dashboard

from prometheus_client import Counter, Histogram, Gauge, Summary

# Define metrics
scam_detection_total = Counter(
    'scamshield_scam_detection_total',
    'Total number of scam detections',
    ['language', 'result']
)

intelligence_extracted_total = Counter(
    'scamshield_intelligence_extracted_total',
    'Total pieces of intelligence extracted',
    ['type']
)

api_response_time = Histogram(
    'scamshield_api_response_time_seconds',
    'API response time in seconds',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

active_sessions = Gauge(
    'scamshield_active_sessions',
    'Number of active honeypot sessions'
)

detection_accuracy = Summary(
    'scamshield_detection_accuracy',
    'Detection accuracy over sliding window'
)

Document Status: Production Ready
Next Steps: Implement evaluation framework, run tests, generate baseline metrics
Update Frequency: Daily during development, hourly during competition testing

Evaluation Specification: ScamShield AI

Metrics, Computation Methods, and Testing Framework
