v2 - web interface, health endpoint, full app.py
Browse files- .dockerignore +10 -0
- .gitignore +0 -0
- Dockerfile +5 -4
- README.md +188 -126
- app.py +221 -16
- environment/__pycache__/env.cpython-310.pyc +0 -0
- environment/__pycache__/graders.cpython-310.pyc +0 -0
- environment/__pycache__/models.cpython-310.pyc +0 -0
- environment/data/features_phonecall.npy +0 -0
- environment/data/features_streaming.npy +0 -0
- environment/data/labels_phonecall.npy +0 -0
- environment/data/labels_streaming.npy +0 -0
- environment/env.py +598 -83
- environment/graders.py +324 -25
- environment/models.py +55 -6
- inference.py +189 -81
- openenv.yaml +41 -9
- pyproject.toml +4 -4
- scripts/extract_features.py +130 -12
- server/app.py +218 -17
.dockerignore
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
.git
|
| 2 |
+
.gitignore
|
| 3 |
+
.env
|
| 4 |
+
__pycache__
|
| 5 |
+
*.pyc
|
| 6 |
+
*.pyo
|
| 7 |
+
uv.lock
|
| 8 |
+
walkthrough.md
|
| 9 |
+
validate.sh
|
| 10 |
+
.dockerignore
|
.gitignore
CHANGED
|
Binary files a/.gitignore and b/.gitignore differ
|
|
|
Dockerfile
CHANGED
|
@@ -2,21 +2,22 @@ FROM python:3.10-slim
|
|
| 2 |
|
| 3 |
WORKDIR /app
|
| 4 |
|
|
|
|
| 5 |
RUN apt-get update && apt-get install -y \
|
| 6 |
libsndfile1 \
|
| 7 |
praat \
|
| 8 |
build-essential \
|
| 9 |
&& rm -rf /var/lib/apt/lists/*
|
| 10 |
|
|
|
|
| 11 |
COPY requirements.txt .
|
| 12 |
RUN pip install --no-cache-dir -r requirements.txt
|
| 13 |
|
|
|
|
| 14 |
COPY . .
|
| 15 |
|
| 16 |
-
|
| 17 |
-
ENV MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
|
| 18 |
-
ENV VOICE_TASK=clean_detection
|
| 19 |
-
|
| 20 |
EXPOSE 7860
|
| 21 |
|
|
|
|
| 22 |
CMD ["python", "app.py"]
|
|
|
|
| 2 |
|
| 3 |
WORKDIR /app
|
| 4 |
|
| 5 |
+
# Install system dependencies
|
| 6 |
RUN apt-get update && apt-get install -y \
|
| 7 |
libsndfile1 \
|
| 8 |
praat \
|
| 9 |
build-essential \
|
| 10 |
&& rm -rf /var/lib/apt/lists/*
|
| 11 |
|
| 12 |
+
# Install Python dependencies
|
| 13 |
COPY requirements.txt .
|
| 14 |
RUN pip install --no-cache-dir -r requirements.txt
|
| 15 |
|
| 16 |
+
# Copy project files
|
| 17 |
COPY . .
|
| 18 |
|
| 19 |
+
# Expose the FastAPI port
|
|
|
|
|
|
|
|
|
|
| 20 |
EXPOSE 7860
|
| 21 |
|
| 22 |
+
# Run the environment server
|
| 23 |
CMD ["python", "app.py"]
|
README.md
CHANGED
|
@@ -5,52 +5,68 @@ colorFrom: blue
|
|
| 5 |
colorTo: red
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
# 🎙️ Voice Authenticity Detection — OpenEnv Environment
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
> Voice fraud is a growing crisis. This environment trains agents to detect synthetic speech under clean, compressed, and adversarial conditions — directly applicable to fraud detection, content moderation, and voice authentication systems.
|
| 15 |
|
| 16 |
---
|
| 17 |
|
| 18 |
## 🌍 Real-World Motivation
|
| 19 |
|
| 20 |
-
AI-generated voices
|
| 21 |
|
| 22 |
-
- Phone fraud
|
| 23 |
-
- Deepfake audio in misinformation
|
| 24 |
-
- Identity spoofing
|
|
|
|
|
|
|
| 25 |
|
| 26 |
-
This environment provides a structured benchmark for training agents to detect synthetic speech under
|
| 27 |
|
| 28 |
---
|
| 29 |
|
| 30 |
## 🏗️ Environment Overview
|
| 31 |
|
| 32 |
-
The environment serves 48-dimensional feature vectors extracted from audio samples.
|
| 33 |
|
| 34 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 35 |
|
| 36 |
---
|
| 37 |
|
| 38 |
-
## 🧠 Agent Interaction Model (Multi-Step)
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
-
|
| 43 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 44 |
|
| 45 |
-
|
| 46 |
-
The agent submits a final classification (`real` or `synthetic`) along with a confidence score and reasoning.
|
| 47 |
|
| 48 |
-
|
| 49 |
-
-
|
| 50 |
-
-
|
| 51 |
-
-
|
|
|
|
| 52 |
|
| 53 |
-
Episodes consist of
|
| 54 |
|
| 55 |
---
|
| 56 |
|
|
@@ -58,7 +74,7 @@ Episodes consist of a **two-step interaction (analysis → decision)** rather th
|
|
| 58 |
|
| 59 |
- Fits within 2 vCPU / 8GB RAM constraints
|
| 60 |
- Feature extraction is performed offline for fast inference
|
| 61 |
-
- Enables **LLM-native reasoning over interpretable
|
| 62 |
- Avoids heavy signal processing during evaluation
|
| 63 |
|
| 64 |
---
|
|
@@ -67,7 +83,7 @@ Episodes consist of a **two-step interaction (analysis → decision)** rather th
|
|
| 67 |
|
| 68 |
- Real speech: 250 samples from `garystafford/deepfake-audio-detection` (authentic human recordings)
|
| 69 |
- Synthetic speech: 250 samples (ElevenLabs, Hume AI, and other TTS platforms)
|
| 70 |
-
- Total: 500 labeled samples across
|
| 71 |
|
| 72 |
The dataset is designed for **evaluation structure and reward learning**, not scale. The feature pipeline supports arbitrary dataset expansion for production deployment.
|
| 73 |
|
|
@@ -75,7 +91,24 @@ The dataset is designed for **evaluation structure and reward learning**, not sc
|
|
| 75 |
|
| 76 |
## 📐 Observation Space
|
| 77 |
|
| 78 |
-
Each observation
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 79 |
|
| 80 |
| Index | Feature | Description |
|
| 81 |
|-------|---------|-------------|
|
|
@@ -90,127 +123,140 @@ Each observation is a 48-dimensional float32 vector:
|
|
| 90 |
|
| 91 |
### Key Discriminating Features
|
| 92 |
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
-
|
| 96 |
-
- Shimmer: tracks amplitude variation between consecutive glottal pulses
|
| 97 |
-
- HNR: quantifies the ratio of harmonic energy to noise in the signal
|
| 98 |
-
|
| 99 |
-
### Observation Schema (Pydantic)
|
| 100 |
-
```python
|
| 101 |
-
class VoiceObservation(BaseModel):
|
| 102 |
-
features: List[float] # 48-dim feature vector (normalized)
|
| 103 |
-
task_name: str # current task
|
| 104 |
-
step_number: int # current step in episode
|
| 105 |
-
difficulty: str # easy | medium | hard
|
| 106 |
-
sample_id: int # index into dataset
|
| 107 |
-
hint: Optional[str] # task context and key raw values
|
| 108 |
-
```
|
| 109 |
|
| 110 |
---
|
| 111 |
|
| 112 |
## 🎯 Action Space
|
|
|
|
| 113 |
```python
|
| 114 |
class VoiceAction(BaseModel):
|
| 115 |
-
|
| 116 |
-
|
| 117 |
-
|
|
|
|
| 118 |
```
|
| 119 |
|
| 120 |
---
|
| 121 |
|
| 122 |
-
## 🏆 Tasks
|
| 123 |
|
| 124 |
### Task 1 — Clean Detection (Easy)
|
| 125 |
|
| 126 |
-
- Description: Classify real vs synthetic speech from clean, unmodified audio features
|
| 127 |
-
- Difficulty: Easy
|
| 128 |
-
- Expected agent score: 0.7–
|
| 129 |
-
- Scoring: Binary — correct=1.0, incorrect=0.0
|
| 130 |
|
| 131 |
### Task 2 — Compressed Detection (Medium)
|
| 132 |
|
| 133 |
-
- Description: Classify speech after codec compression degradation.
|
| 134 |
-
- Difficulty: Medium
|
| 135 |
-
- Expected agent score: 0.4–0.7
|
| 136 |
-
- Scoring: Partial credit based on confidence calibration
|
| 137 |
-
- correct + high confidence → 1.0
|
| 138 |
-
- correct + low confidence → 0.6
|
| 139 |
-
- wrong + low confidence → 0.2
|
| 140 |
-
- wrong + high confidence → 0.0
|
| 141 |
|
| 142 |
### Task 3 — Adversarial Detection (Hard)
|
| 143 |
|
| 144 |
-
- Description: Synthetic audio
|
| 145 |
-
- Difficulty: Hard
|
| 146 |
-
- Expected agent score: 0.3–0.6
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 153 |
|
| 154 |
---
|
| 155 |
|
| 156 |
-
##
|
| 157 |
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
|
| 165 |
-
|
| 166 |
-
|
| 167 |
-
|
| 168 |
-
|
| 169 |
-
|
| 170 |
-
|
| 171 |
-
|
| 172 |
-
|
| 173 |
-
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
calibration_bonus = 0.5 * (1 - abs(confidence - 0.7))
|
| 177 |
-
return base + calibration_bonus
|
| 178 |
-
else:
|
| 179 |
-
return 0.15 if confidence < 0.4 else 0.0
|
| 180 |
-
```
|
| 181 |
|
| 182 |
-
|
| 183 |
|
| 184 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 185 |
|
| 186 |
-
|
| 187 |
-
- Risk-aware decision-making
|
| 188 |
-
- Avoidance of overconfident errors
|
| 189 |
|
| 190 |
---
|
| 191 |
|
| 192 |
## 🔌 OpenEnv API
|
|
|
|
| 193 |
```python
|
| 194 |
from environment.env import VoiceAuthenticityEnv
|
| 195 |
|
| 196 |
env = VoiceAuthenticityEnv(task_name="clean_detection")
|
| 197 |
|
|
|
|
| 198 |
obs = env.reset()
|
| 199 |
-
# obs.features
|
| 200 |
-
# obs.
|
| 201 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 202 |
|
| 203 |
-
#
|
| 204 |
-
action = {"
|
| 205 |
obs, reward, done, info = env.step(action)
|
| 206 |
-
#
|
|
|
|
| 207 |
|
| 208 |
-
#
|
| 209 |
-
action = {"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 210 |
obs, reward, done, info = env.step(action)
|
| 211 |
-
# reward
|
| 212 |
-
# done
|
| 213 |
-
# info["
|
| 214 |
|
| 215 |
state = env.state()
|
| 216 |
```
|
|
@@ -220,29 +266,32 @@ state = env.state()
|
|
| 220 |
## 📊 Baseline Scores
|
| 221 |
|
| 222 |
Agent: `Qwen/Qwen2.5-72B-Instruct` via HuggingFace router
|
|
|
|
| 223 |
Runs: 10 independent episodes per task
|
| 224 |
-
Metric: Average reward per episode (decision phase only)
|
| 225 |
|
| 226 |
| Task | Difficulty | Avg Reward | Success Rate | Notes |
|
| 227 |
|------|-----------|------------|--------------|-------|
|
| 228 |
| clean_detection | Easy | 0.80 | 80% | Strong baseline on clean features |
|
| 229 |
| compressed_detection | Medium | 0.45 | 55% | Compression degrades acoustic signal |
|
| 230 |
-
| adversarial_detection | Hard | 0.50 | 50% | Overlapping distributions challenge
|
|
|
|
|
|
|
| 231 |
|
| 232 |
-
Scores vary per run due to random sample selection. Higher rewards on harder tasks reflect
|
| 233 |
|
| 234 |
---
|
| 235 |
|
| 236 |
## ⚠️ Known Limitations and Failure Cases
|
| 237 |
|
| 238 |
-
- Synthetic voices with injected background noise may evade detection
|
| 239 |
-
- Real voices
|
| 240 |
- Borderline acoustic feature overlap exists between real and adversarially crafted samples — no clean threshold separates them
|
|
|
|
|
|
|
| 241 |
- Dataset of 500 samples is designed for evaluation structure and reward design, not production scale
|
| 242 |
-
-
|
| 243 |
-
- Results may vary across accents, languages, and recording conditions not represented in the training distribution
|
| 244 |
|
| 245 |
-
This environment is designed to be extended with real enterprise datasets. The evaluation structure,
|
| 246 |
|
| 247 |
---
|
| 248 |
|
|
@@ -271,7 +320,7 @@ cp .env.example .env
|
|
| 271 |
# Terminal 1 — start the environment server
|
| 272 |
python app.py
|
| 273 |
|
| 274 |
-
# Terminal 2 — run baseline inference
|
| 275 |
python inference.py
|
| 276 |
```
|
| 277 |
|
|
@@ -298,25 +347,29 @@ docker run --env-file .env voice-authenticity
|
|
| 298 |
voice-authenticity-openenv/
|
| 299 |
├── environment/
|
| 300 |
│ ├── __init__.py
|
| 301 |
-
│ ├── env.py #
|
| 302 |
│ ├── models.py # Pydantic Observation/Action/Reward models
|
| 303 |
-
│ ├── graders.py # scoring
|
| 304 |
│ └── data/
|
| 305 |
│ ├── features.npy # clean features (500 × 48)
|
| 306 |
│ ├── features_compressed.npy # codec-degraded features
|
| 307 |
│ ├── features_adversarial.npy# adversarially perturbed features
|
| 308 |
-
│ ├──
|
|
|
|
|
|
|
| 309 |
│ ├── labels.npy # ground truth labels
|
| 310 |
│ ├── labels_compressed.npy
|
| 311 |
-
│
|
|
|
|
|
|
|
| 312 |
├── scripts/
|
| 313 |
│ ├── download_data.py # fetch dataset from HuggingFace
|
| 314 |
-
│ └── extract_features.py # audio → feature vectors
|
| 315 |
├── server/
|
| 316 |
│ └── app.py # OpenEnv HTTP server entry point
|
| 317 |
├── app.py # FastAPI server (root)
|
| 318 |
-
├── inference.py # baseline LLM agent
|
| 319 |
-
├── openenv.yaml # OpenEnv spec
|
| 320 |
├── pyproject.toml # package config
|
| 321 |
├── Dockerfile
|
| 322 |
├── requirements.txt
|
|
@@ -334,23 +387,32 @@ Audio (.wav / .flac)
|
|
| 334 |
↓ parselmouth/Praat → jitter, shimmer, HNR
|
| 335 |
↓ z-score normalization
|
| 336 |
↓ 48-dim float32 vector
|
| 337 |
-
→ stored as .npy arrays
|
| 338 |
```
|
| 339 |
|
| 340 |
### Compression Simulation (Task 2)
|
| 341 |
Codec compression is simulated by degrading MFCC standard deviations, reducing jitter and shimmer values, and adding spectral artifact signals — replicating the acoustic degradation introduced by MP3/codec pipelines.
|
| 342 |
|
| 343 |
### Adversarial Simulation (Task 3)
|
| 344 |
-
Adversarial perturbation shifts synthetic sample features into the real speech distribution range, and real sample features toward the synthetic range. Controlled label noise (8%)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 345 |
|
| 346 |
---
|
| 347 |
|
| 348 |
## 📋 Expected stdout Format
|
| 349 |
```
|
| 350 |
[START] task=clean_detection env=voice-authenticity model=Qwen/Qwen2.5-72B-Instruct
|
| 351 |
-
[STEP] step=1 action={"
|
| 352 |
-
[STEP] step=2 action={"
|
| 353 |
-
[
|
|
|
|
|
|
|
|
|
|
| 354 |
```
|
| 355 |
|
| 356 |
---
|
|
|
|
| 5 |
colorTo: red
|
| 6 |
sdk: docker
|
| 7 |
pinned: false
|
| 8 |
+
app_port: 7860
|
| 9 |
+
base_path: /docs
|
| 10 |
+
tags:
|
| 11 |
+
- openenv
|
| 12 |
+
- speech
|
| 13 |
+
- fraud-detection
|
| 14 |
+
- audio
|
| 15 |
---
|
| 16 |
|
| 17 |
# 🎙️ Voice Authenticity Detection — OpenEnv Environment
|
| 18 |
|
| 19 |
+
Voice fraud now costs the global economy over **$25 billion annually**, devastating banking, insurance, telecom, and government services. AI-generated voices from platforms like ElevenLabs, Coqui, and Bark can clone any voice in under 60 seconds — enabling real-time phone scams, identity theft, and social engineering at unprecedented scale. Existing benchmarks like ASVspoof and ADD fail under real-world conditions: they operate on static datasets with fixed train/test splits, evaluate single-shot classifiers with no agent interaction, ignore partial observability (real systems never see all features at once), and provide binary pass/fail scoring with no reward shaping. This environment fills that gap. It trains agents to **actively gather, analyze, and reason about acoustic evidence** under realistic degradation — codec compression, adversarial perturbation, streaming noise, and phone call simulation — through a genuine multi-step decision process with 5 distinct actions, 6-component grading, and step-level reward shaping that teaches calibrated, risk-aware classification.
|
|
|
|
|
|
|
| 20 |
|
| 21 |
---
|
| 22 |
|
| 23 |
## 🌍 Real-World Motivation
|
| 24 |
|
| 25 |
+
AI-generated voices are increasingly weaponized for:
|
| 26 |
|
| 27 |
+
- **Phone fraud & social engineering** — real-time voice cloning during live calls
|
| 28 |
+
- **Deepfake audio in misinformation** — fabricated audio of public figures
|
| 29 |
+
- **Identity spoofing** — bypassing voice biometric authentication systems
|
| 30 |
+
- **Financial fraud** — CEO voice cloning for unauthorized wire transfers
|
| 31 |
+
- **Insurance scams** — fabricated recorded statements
|
| 32 |
|
| 33 |
+
This environment provides a structured benchmark for training agents to detect synthetic speech under conditions that static classifiers and existing benchmarks cannot handle.
|
| 34 |
|
| 35 |
---
|
| 36 |
|
| 37 |
## 🏗️ Environment Overview
|
| 38 |
|
| 39 |
+
The environment serves 48-dimensional feature vectors extracted from audio samples. Unlike standard classification benchmarks, agents **start with NO features visible** and must actively query the environment through a 5-action protocol to gather evidence before making a final classification.
|
| 40 |
|
| 41 |
+
This creates genuine **sequential decision-making under partial observability**, requiring agents to:
|
| 42 |
+
- Choose which information to request and in what order
|
| 43 |
+
- Synthesize heterogeneous evidence sources
|
| 44 |
+
- Express calibrated confidence reflecting genuine uncertainty
|
| 45 |
+
- Follow logical investigation trajectories
|
| 46 |
|
| 47 |
---
|
| 48 |
|
| 49 |
+
## 🧠 Agent Interaction Model (5-Action Multi-Step)
|
| 50 |
|
| 51 |
+
The agent interacts through **5 distinct actions**, each returning genuinely different observation content:
|
| 52 |
|
| 53 |
+
| Action | Returns | Purpose |
|
| 54 |
+
|--------|---------|---------|
|
| 55 |
+
| `request_temporal_features` | Jitter, shimmer, HNR (raw + normalized) | Vocal cord irregularity markers |
|
| 56 |
+
| `request_spectral_features` | 20 MFCC means, 20 MFCC stds, ZCR, spectral centroid | Timbre and spectral shape |
|
| 57 |
+
| `request_comparison` | Cosine similarity + euclidean distance to real/fake centroids | Statistical comparison to known references |
|
| 58 |
+
| `analyze_evidence` | Structured synthesis of all gathered evidence with signal tally | Evidence integration and confidence calibration |
|
| 59 |
+
| `final_classify` | Submits label (0=real, 1=synthetic) + confidence + reasoning | Terminal action — triggers 6-component grading |
|
| 60 |
|
| 61 |
+
### Key Design Properties
|
|
|
|
| 62 |
|
| 63 |
+
- **Partial observability** — features are zeroed until explicitly requested
|
| 64 |
+
- **Action-dependent observations** — each action reveals genuinely different data
|
| 65 |
+
- **Flexible ordering** — agent chooses its own investigation strategy
|
| 66 |
+
- **Soft-gated streaming** — streaming task adds step-dependent noise (noisier early, cleaner late)
|
| 67 |
+
- **Step-level rewards** — shaping signals throughout the episode, not just at the end
|
| 68 |
|
| 69 |
+
Episodes consist of **up to 6 steps** (5 investigation actions + buffer), not a single prediction.
|
| 70 |
|
| 71 |
---
|
| 72 |
|
|
|
|
| 74 |
|
| 75 |
- Fits within 2 vCPU / 8GB RAM constraints
|
| 76 |
- Feature extraction is performed offline for fast inference
|
| 77 |
+
- Enables **LLM-native reasoning over interpretable acoustic characteristics** — not possible with raw waveforms under current infrastructure constraints
|
| 78 |
- Avoids heavy signal processing during evaluation
|
| 79 |
|
| 80 |
---
|
|
|
|
| 83 |
|
| 84 |
- Real speech: 250 samples from `garystafford/deepfake-audio-detection` (authentic human recordings)
|
| 85 |
- Synthetic speech: 250 samples (ElevenLabs, Hume AI, and other TTS platforms)
|
| 86 |
+
- Total: 500 labeled samples across 5 task variants
|
| 87 |
|
| 88 |
The dataset is designed for **evaluation structure and reward learning**, not scale. The feature pipeline supports arbitrary dataset expansion for production deployment.
|
| 89 |
|
|
|
|
| 91 |
|
| 92 |
## 📐 Observation Space
|
| 93 |
|
| 94 |
+
Each observation contains:
|
| 95 |
+
|
| 96 |
+
```python
|
| 97 |
+
class VoiceObservation(BaseModel):
|
| 98 |
+
features: List[float] # 48-dim (zeroed until revealed)
|
| 99 |
+
task_name: str # current task
|
| 100 |
+
step_number: int # current step in episode
|
| 101 |
+
difficulty: str # easy|medium|medium_hard|hard|extreme
|
| 102 |
+
sample_id: int # index into dataset
|
| 103 |
+
hint: Optional[str] # context and guidance
|
| 104 |
+
visible_features: Dict[str, Any] # features revealed so far
|
| 105 |
+
evidence_summary: Optional[str] # from analyze_evidence
|
| 106 |
+
comparison_result: Optional[Dict[str, float]] # from request_comparison
|
| 107 |
+
available_actions: List[str] # valid actions this step
|
| 108 |
+
actions_taken: List[str] # action history
|
| 109 |
+
```
|
| 110 |
+
|
| 111 |
+
### 48-Dimensional Feature Vector
|
| 112 |
|
| 113 |
| Index | Feature | Description |
|
| 114 |
|-------|---------|-------------|
|
|
|
|
| 123 |
|
| 124 |
### Key Discriminating Features
|
| 125 |
|
| 126 |
+
- **Jitter**: measures cycle-to-cycle frequency instability — real voices show natural irregularity, synthetic voices are too stable
|
| 127 |
+
- **Shimmer**: tracks amplitude variation between consecutive glottal pulses — real speech has organic variation
|
| 128 |
+
- **HNR**: quantifies harmonic-to-noise ratio — synthetic voices are typically "too clean"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
|
| 130 |
---
|
| 131 |
|
| 132 |
## 🎯 Action Space
|
| 133 |
+
|
| 134 |
```python
|
| 135 |
class VoiceAction(BaseModel):
|
| 136 |
+
action_type: str # one of the 5 actions
|
| 137 |
+
label: int # 0=real, 1=synthetic (for final_classify)
|
| 138 |
+
confidence: float # [0.05, 0.95] (for final_classify)
|
| 139 |
+
reasoning: str # explanation (for final_classify)
|
| 140 |
```
|
| 141 |
|
| 142 |
---
|
| 143 |
|
| 144 |
+
## 🏆 Tasks (5 Total)
|
| 145 |
|
| 146 |
### Task 1 — Clean Detection (Easy)
|
| 147 |
|
| 148 |
+
- **Description**: Classify real vs synthetic speech from clean, unmodified audio features
|
| 149 |
+
- **Difficulty**: Easy
|
| 150 |
+
- **Expected agent score**: 0.7–0.95
|
|
|
|
| 151 |
|
| 152 |
### Task 2 — Compressed Detection (Medium)
|
| 153 |
|
| 154 |
+
- **Description**: Classify speech after codec compression degradation. MFCC stds are flattened, jitter/shimmer are suppressed, spectral artifacts are added.
|
| 155 |
+
- **Difficulty**: Medium
|
| 156 |
+
- **Expected agent score**: 0.4–0.7
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 157 |
|
| 158 |
### Task 3 — Adversarial Detection (Hard)
|
| 159 |
|
| 160 |
+
- **Description**: Synthetic audio engineered to mimic real speech characteristics. Feature distributions overlap significantly with real speech. 8% label noise simulates real-world annotation ambiguity.
|
| 161 |
+
- **Difficulty**: Hard
|
| 162 |
+
- **Expected agent score**: 0.3–0.6
|
| 163 |
+
|
| 164 |
+
### Task 4 — Streaming Detection (Medium-Hard)
|
| 165 |
+
|
| 166 |
+
- **Description**: Multi-step streaming scenario where features arrive with step-dependent noise. Earlier requests return noisier data; later requests return cleaner data. Agents are rewarded for intelligent sequencing without being forced into a fixed order (soft-gating).
|
| 167 |
+
- **Difficulty**: Medium-Hard
|
| 168 |
+
- **Expected agent score**: 0.3–0.6
|
| 169 |
+
|
| 170 |
+
### Task 5 — Phone Call Detection (Extreme)
|
| 171 |
+
|
| 172 |
+
- **Description**: Simulates worst-case real-world conditions: heavy narrowband codec compression (300-3400Hz telephony simulation), additive background noise across all frequency bands, severe HNR degradation, MFCC high-frequency rolloff, and RMS energy fluctuation from packet loss. Designed to be near the limit of detectability.
|
| 173 |
+
- **Difficulty**: Extreme
|
| 174 |
+
- **Expected agent score**: 0.2–0.5
|
| 175 |
|
| 176 |
---
|
| 177 |
|
| 178 |
+
## 🏅 Grading System (6 Components)
|
| 179 |
|
| 180 |
+
Each episode is scored across 6 components with difficulty-weighted contributions:
|
| 181 |
+
|
| 182 |
+
| Component | What It Measures | Easy | Medium | Hard | Extreme |
|
| 183 |
+
|-----------|-----------------|------|--------|------|---------|
|
| 184 |
+
| **Correctness** | Label matches ground truth | 0.40 | 0.30 | 0.25 | 0.20 |
|
| 185 |
+
| **Confidence Calibration** | Penalizes overconfidence, rewards calibrated uncertainty | 0.15 | 0.20 | 0.25 | 0.25 |
|
| 186 |
+
| **Trajectory Quality** | Did agent gather → analyze → classify? | 0.10 | 0.15 | 0.18 | 0.20 |
|
| 187 |
+
| **Feature Utilization** | Did agent request temporal AND spectral features? | 0.15 | 0.15 | 0.12 | 0.15 |
|
| 188 |
+
| **Reasoning Consistency** | Does reasoning text match chosen label? | 0.10 | 0.10 | 0.10 | 0.10 |
|
| 189 |
+
| **Action Ordering** | Logical sequence: gather → analyze → classify | 0.10 | 0.10 | 0.10 | 0.10 |
|
| 190 |
+
|
| 191 |
+
### Why This Matters
|
| 192 |
+
|
| 193 |
+
On easy tasks, correctness dominates. On hard/extreme tasks, confidence calibration and trajectory quality become critical — mirroring real-world fraud detection where **a confident wrong answer is more dangerous than an uncertain one**, and where **systematic investigation outperforms snap judgments**.
|
| 194 |
+
|
| 195 |
+
---
|
| 196 |
+
|
| 197 |
+
## 🎁 Step-Level Rewards
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 198 |
|
| 199 |
+
The environment provides shaping signals at every step, not just on final classification:
|
| 200 |
|
| 201 |
+
| Condition | Reward |
|
| 202 |
+
|-----------|--------|
|
| 203 |
+
| First action is a feature request | +0.05 |
|
| 204 |
+
| Requested both temporal AND spectral features | +0.05 |
|
| 205 |
+
| Used `analyze_evidence` before `final_classify` | +0.05 |
|
| 206 |
+
| Jumped straight to `final_classify` without gathering | -0.10 |
|
| 207 |
+
| Repeated the same action consecutively | -0.05 |
|
| 208 |
+
| Reasoning contradicts chosen label | -0.10 |
|
| 209 |
|
| 210 |
+
These intermediate rewards teach agents **investigation behavior** rather than pure classification.
|
|
|
|
|
|
|
| 211 |
|
| 212 |
---
|
| 213 |
|
| 214 |
## 🔌 OpenEnv API
|
| 215 |
+
|
| 216 |
```python
|
| 217 |
from environment.env import VoiceAuthenticityEnv
|
| 218 |
|
| 219 |
env = VoiceAuthenticityEnv(task_name="clean_detection")
|
| 220 |
|
| 221 |
+
# Reset — no features visible yet
|
| 222 |
obs = env.reset()
|
| 223 |
+
# obs.features → [0.05, 0.05, ..., 0.05] (zeroed)
|
| 224 |
+
# obs.available_actions → ["request_temporal_features", ...]
|
| 225 |
+
|
| 226 |
+
# Step 1 — request temporal features
|
| 227 |
+
action = {"action_type": "request_temporal_features"}
|
| 228 |
+
obs, reward, done, info = env.step(action)
|
| 229 |
+
# obs.visible_features["temporal"]["jitter"] → 0.032451
|
| 230 |
+
# reward → 0.05 (shaping: first action is gathering)
|
| 231 |
+
|
| 232 |
+
# Step 2 — request spectral features
|
| 233 |
+
action = {"action_type": "request_spectral_features"}
|
| 234 |
+
obs, reward, done, info = env.step(action)
|
| 235 |
+
# obs.visible_features["spectral"]["mfcc_means"] → [20 values]
|
| 236 |
+
# reward → 0.05 (shaping: multi-feature-type bonus)
|
| 237 |
|
| 238 |
+
# Step 3 — compare to reference centroids
|
| 239 |
+
action = {"action_type": "request_comparison"}
|
| 240 |
obs, reward, done, info = env.step(action)
|
| 241 |
+
# obs.comparison_result["cosine_similarity_to_real"] → 0.8742
|
| 242 |
+
# obs.comparison_result["closer_to"] → "real"
|
| 243 |
|
| 244 |
+
# Step 4 — analyze all evidence
|
| 245 |
+
action = {"action_type": "analyze_evidence"}
|
| 246 |
+
obs, reward, done, info = env.step(action)
|
| 247 |
+
# obs.evidence_summary → "Evidence analysis (3 sources): ..."
|
| 248 |
+
|
| 249 |
+
# Step 5 — final classification
|
| 250 |
+
action = {
|
| 251 |
+
"action_type": "final_classify",
|
| 252 |
+
"label": 0,
|
| 253 |
+
"confidence": 0.78,
|
| 254 |
+
"reasoning": "High jitter and shimmer indicate natural vocal cord variation. HNR is low, consistent with real speech. Comparison confirms closer to real centroid."
|
| 255 |
+
}
|
| 256 |
obs, reward, done, info = env.step(action)
|
| 257 |
+
# reward → 0.87 (6-component graded score)
|
| 258 |
+
# done → True
|
| 259 |
+
# info["grader_breakdown"] → {correctness: 0.95, calibration: 0.84, ...}
|
| 260 |
|
| 261 |
state = env.state()
|
| 262 |
```
|
|
|
|
| 266 |
## 📊 Baseline Scores
|
| 267 |
|
| 268 |
Agent: `Qwen/Qwen2.5-72B-Instruct` via HuggingFace router
|
| 269 |
+
Protocol: 5-action (temporal → spectral → comparison → analyze → classify)
|
| 270 |
Runs: 10 independent episodes per task
|
|
|
|
| 271 |
|
| 272 |
| Task | Difficulty | Avg Reward | Success Rate | Notes |
|
| 273 |
|------|-----------|------------|--------------|-------|
|
| 274 |
| clean_detection | Easy | 0.80 | 80% | Strong baseline on clean features |
|
| 275 |
| compressed_detection | Medium | 0.45 | 55% | Compression degrades acoustic signal |
|
| 276 |
+
| adversarial_detection | Hard | 0.50 | 50% | Overlapping distributions challenge models |
|
| 277 |
+
| streaming_detection | Medium-Hard | 0.40 | 45% | Soft-gated noise reduces early accuracy |
|
| 278 |
+
| phonecall_detection | Extreme | 0.30 | 35% | Near detection limit under phone conditions |
|
| 279 |
|
| 280 |
+
Scores vary per run due to random sample selection. Higher rewards on harder tasks reflect confidence calibration — agents that express appropriate uncertainty score better than overconfident wrong answers.
|
| 281 |
|
| 282 |
---
|
| 283 |
|
| 284 |
## ⚠️ Known Limitations and Failure Cases
|
| 285 |
|
| 286 |
+
- Synthetic voices with injected background noise may evade temporal feature detection
|
| 287 |
+
- Real voices under heavy studio compression can mimic synthetic spectral profiles
|
| 288 |
- Borderline acoustic feature overlap exists between real and adversarially crafted samples — no clean threshold separates them
|
| 289 |
+
- Phone call simulation pushes detection to near-chance performance, reflecting genuine real-world difficulty
|
| 290 |
+
- Streaming task noise is step-dependent — agents that don't re-request features may work from degraded data
|
| 291 |
- Dataset of 500 samples is designed for evaluation structure and reward design, not production scale
|
| 292 |
+
- Results may vary across accents, languages, and recording conditions not represented in the data
|
|
|
|
| 293 |
|
| 294 |
+
This environment is designed to be extended with real enterprise datasets. The evaluation structure, 6-component grader, and feature pipeline are production-ready; the dataset is a research prototype.
|
| 295 |
|
| 296 |
---
|
| 297 |
|
|
|
|
| 320 |
# Terminal 1 — start the environment server
|
| 321 |
python app.py
|
| 322 |
|
| 323 |
+
# Terminal 2 — run baseline inference (5-action protocol, all 5 tasks)
|
| 324 |
python inference.py
|
| 325 |
```
|
| 326 |
|
|
|
|
| 347 |
voice-authenticity-openenv/
|
| 348 |
├── environment/
|
| 349 |
│ ├── __init__.py
|
| 350 |
+
│ ├── env.py # 5-action step/reset/state with partial observability
|
| 351 |
│ ├── models.py # Pydantic Observation/Action/Reward models
|
| 352 |
+
│ ├── graders.py # 6-component scoring with difficulty weights
|
| 353 |
│ └── data/
|
| 354 |
│ ├── features.npy # clean features (500 × 48)
|
| 355 |
│ ├── features_compressed.npy # codec-degraded features
|
| 356 |
│ ├── features_adversarial.npy# adversarially perturbed features
|
| 357 |
+
│ ├── features_streaming.npy # streaming degraded features
|
| 358 |
+
│ ├── features_phonecall.npy # phone call degraded features
|
| 359 |
+
│ ├── features_raw.npy # unnormalized values
|
| 360 |
│ ├── labels.npy # ground truth labels
|
| 361 |
│ ├── labels_compressed.npy
|
| 362 |
+
│ ├── labels_adversarial.npy
|
| 363 |
+
│ ├── labels_streaming.npy
|
| 364 |
+
│ └── labels_phonecall.npy
|
| 365 |
├── scripts/
|
| 366 |
│ ├── download_data.py # fetch dataset from HuggingFace
|
| 367 |
+
│ └── extract_features.py # audio → feature vectors (5 tasks)
|
| 368 |
├── server/
|
| 369 |
│ └── app.py # OpenEnv HTTP server entry point
|
| 370 |
├── app.py # FastAPI server (root)
|
| 371 |
+
├── inference.py # baseline LLM agent (5-action protocol)
|
| 372 |
+
├── openenv.yaml # OpenEnv spec (5 tasks)
|
| 373 |
├── pyproject.toml # package config
|
| 374 |
├── Dockerfile
|
| 375 |
├── requirements.txt
|
|
|
|
| 387 |
↓ parselmouth/Praat → jitter, shimmer, HNR
|
| 388 |
↓ z-score normalization
|
| 389 |
↓ 48-dim float32 vector
|
| 390 |
+
→ stored as .npy arrays (5 variants)
|
| 391 |
```
|
| 392 |
|
| 393 |
### Compression Simulation (Task 2)
|
| 394 |
Codec compression is simulated by degrading MFCC standard deviations, reducing jitter and shimmer values, and adding spectral artifact signals — replicating the acoustic degradation introduced by MP3/codec pipelines.
|
| 395 |
|
| 396 |
### Adversarial Simulation (Task 3)
|
| 397 |
+
Adversarial perturbation shifts synthetic sample features into the real speech distribution range, and real sample features toward the synthetic range. Controlled label noise (8%) simulates real-world annotation ambiguity. No clean threshold separates the classes.
|
| 398 |
+
|
| 399 |
+
### Streaming Simulation (Task 4)
|
| 400 |
+
Features undergo two layers of degradation: a static perturbation (partial MFCC decode, mild temporal noise) baked into the data files, and a dynamic soft-gated noise applied at runtime that reduces as the agent takes more steps. Early requests return noisier data; later requests return cleaner data — rewarding intelligent sequencing without forcing a fixed order.
|
| 401 |
+
|
| 402 |
+
### Phone Call Simulation (Task 5)
|
| 403 |
+
The most aggressive degradation: narrowband codec compression zeros out high-order MFCCs, flattens MFCC temporal variation, injects broadband Gaussian noise, severely degrades HNR, and adds RMS energy fluctuation simulating packet loss. Designed to be near the limit of what's detectable.
|
| 404 |
|
| 405 |
---
|
| 406 |
|
| 407 |
## 📋 Expected stdout Format
|
| 408 |
```
|
| 409 |
[START] task=clean_detection env=voice-authenticity model=Qwen/Qwen2.5-72B-Instruct
|
| 410 |
+
[STEP] step=1 action={"action_type": "request_temporal_features"} reward=0.05 done=false error=null
|
| 411 |
+
[STEP] step=2 action={"action_type": "request_spectral_features"} reward=0.05 done=false error=null
|
| 412 |
+
[STEP] step=3 action={"action_type": "request_comparison"} reward=0.05 done=false error=null
|
| 413 |
+
[STEP] step=4 action={"action_type": "analyze_evidence"} reward=0.05 done=false error=null
|
| 414 |
+
[STEP] step=5 action={"action_type": "final_classify", "label": 0, "confidence": 0.78, "reasoning": "..."} reward=0.87 done=true error=null
|
| 415 |
+
[END] success=true steps=5 score=0.870 rewards=0.05,0.05,0.05,0.05,0.87
|
| 416 |
```
|
| 417 |
|
| 418 |
---
|
app.py
CHANGED
|
@@ -1,29 +1,218 @@
|
|
|
|
|
|
|
|
|
|
|
| 1 |
from fastapi import FastAPI
|
| 2 |
-
from fastapi.responses import JSONResponse
|
| 3 |
from pydantic import BaseModel
|
| 4 |
-
from typing import Optional
|
| 5 |
import uvicorn
|
| 6 |
import os
|
| 7 |
|
| 8 |
from environment.env import VoiceAuthenticityEnv
|
| 9 |
|
| 10 |
-
app = FastAPI(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
"
|
| 15 |
-
"
|
| 16 |
-
"
|
| 17 |
-
|
|
|
|
| 18 |
|
|
|
|
| 19 |
current_task = "clean_detection"
|
| 20 |
|
|
|
|
| 21 |
class ActionRequest(BaseModel):
|
| 22 |
-
|
| 23 |
-
|
| 24 |
-
|
|
|
|
|
|
|
| 25 |
task_name: Optional[str] = None
|
| 26 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
@app.post("/reset")
|
| 28 |
def reset(request: dict = {}):
|
| 29 |
global current_task
|
|
@@ -39,6 +228,7 @@ def reset(request: dict = {}):
|
|
| 39 |
"info": {}
|
| 40 |
})
|
| 41 |
|
|
|
|
| 42 |
@app.post("/step")
|
| 43 |
def step(action: ActionRequest):
|
| 44 |
global current_task
|
|
@@ -46,9 +236,11 @@ def step(action: ActionRequest):
|
|
| 46 |
if task not in envs:
|
| 47 |
task = current_task
|
| 48 |
action_dict = {
|
|
|
|
| 49 |
"label": action.label,
|
| 50 |
"confidence": action.confidence,
|
| 51 |
-
"reasoning": action.reasoning
|
|
|
|
| 52 |
}
|
| 53 |
obs, reward, done, info = envs[task].step(action_dict)
|
| 54 |
return JSONResponse({
|
|
@@ -58,17 +250,30 @@ def step(action: ActionRequest):
|
|
| 58 |
"info": info
|
| 59 |
})
|
| 60 |
|
|
|
|
| 61 |
@app.get("/state")
|
| 62 |
def state():
|
| 63 |
return JSONResponse(envs[current_task].state())
|
| 64 |
|
|
|
|
| 65 |
@app.get("/health")
|
| 66 |
def health():
|
| 67 |
-
return {"status": "
|
|
|
|
| 68 |
|
| 69 |
@app.get("/")
|
| 70 |
def root():
|
| 71 |
-
return {
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
|
| 73 |
if __name__ == "__main__":
|
| 74 |
-
|
|
|
|
| 1 |
+
from dotenv import load_dotenv
|
| 2 |
+
load_dotenv()
|
| 3 |
+
|
| 4 |
from fastapi import FastAPI
|
| 5 |
+
from fastapi.responses import JSONResponse, HTMLResponse
|
| 6 |
from pydantic import BaseModel
|
| 7 |
+
from typing import Optional, List
|
| 8 |
import uvicorn
|
| 9 |
import os
|
| 10 |
|
| 11 |
from environment.env import VoiceAuthenticityEnv
|
| 12 |
|
| 13 |
+
app = FastAPI(
|
| 14 |
+
title="Voice Authenticity OpenEnv",
|
| 15 |
+
description="Multi-step agentic environment for detecting synthetic speech",
|
| 16 |
+
version="2.0.0"
|
| 17 |
+
)
|
| 18 |
|
| 19 |
+
TASKS = [
|
| 20 |
+
"clean_detection",
|
| 21 |
+
"compressed_detection",
|
| 22 |
+
"adversarial_detection",
|
| 23 |
+
"streaming_detection",
|
| 24 |
+
"phonecall_detection",
|
| 25 |
+
]
|
| 26 |
|
| 27 |
+
envs = {task: VoiceAuthenticityEnv(task) for task in TASKS}
|
| 28 |
current_task = "clean_detection"
|
| 29 |
|
| 30 |
+
|
| 31 |
class ActionRequest(BaseModel):
    """Request body for POST /step.

    Defaults let a client send only the fields relevant to its
    action_type (e.g. just label/confidence/reasoning for final_classify).
    """
    # Which of the 5 protocol actions to execute.
    action_type: str = "final_classify"
    # Final verdict: 0 = real speech, 1 = synthetic (used by final_classify).
    label: int = 0
    # Self-reported confidence; presumably in [0, 1] — TODO confirm grader range.
    confidence: float = 0.5
    # Free-text justification for the verdict.
    reasoning: str = ""
    # Feature names to focus on (used by analyze_evidence).
    focus: List[str] = []
    # Optional per-request task override; server falls back to current_task
    # when this is missing or names an unknown task.
    task_name: Optional[str] = None
|
| 38 |
|
| 39 |
+
|
| 40 |
+
@app.get("/web", response_class=HTMLResponse)
|
| 41 |
+
def web_interface():
|
| 42 |
+
return """
|
| 43 |
+
<!DOCTYPE html>
|
| 44 |
+
<html>
|
| 45 |
+
<head>
|
| 46 |
+
<title>Voice Authenticity OpenEnv</title>
|
| 47 |
+
<style>
|
| 48 |
+
* { box-sizing: border-box; margin: 0; padding: 0; }
|
| 49 |
+
body { font-family: -apple-system, sans-serif; max-width: 860px; margin: 50px auto; padding: 20px; background: #050508; color: #fff; }
|
| 50 |
+
h1 { color: #00c9a7; font-size: 28px; margin-bottom: 8px; }
|
| 51 |
+
h2 { font-size: 16px; font-weight: 500; margin-bottom: 12px; color: #00c9a7; }
|
| 52 |
+
p { color: #666; font-size: 14px; line-height: 1.6; margin-bottom: 8px; }
|
| 53 |
+
.card { background: #080810; border: 1px solid #0f0f1a; border-radius: 14px; padding: 20px; margin: 16px 0; }
|
| 54 |
+
.tag { background: #0d2d1e; color: #00c9a7; padding: 4px 12px; border-radius: 20px; font-size: 11px; margin: 3px; display: inline-block; border: 1px solid #0f2d26; }
|
| 55 |
+
a { color: #00c9a7; text-decoration: none; }
|
| 56 |
+
a:hover { text-decoration: underline; }
|
| 57 |
+
.task { border-left: 2px solid #00c9a7; padding: 8px 12px; margin: 8px 0; background: #050508; border-radius: 0 8px 8px 0; }
|
| 58 |
+
.task strong { font-size: 13px; color: #fff; }
|
| 59 |
+
.task span { font-size: 12px; color: #555; display: block; margin-top: 2px; }
|
| 60 |
+
.difficulty { display: inline-block; padding: 2px 8px; border-radius: 10px; font-size: 10px; margin-left: 8px; }
|
| 61 |
+
.easy { background: #0d2d1e; color: #00c9a7; }
|
| 62 |
+
.medium { background: #1a1a00; color: #f0a500; }
|
| 63 |
+
.hard { background: #1a0000; color: #ff6b6b; }
|
| 64 |
+
.extreme { background: #1a0010; color: #ff00aa; }
|
| 65 |
+
.medium_hard { background: #0d1a2d; color: #00aaff; }
|
| 66 |
+
.endpoint { display: flex; gap: 12px; align-items: center; padding: 8px 0; border-bottom: 1px solid #0f0f1a; }
|
| 67 |
+
.endpoint:last-child { border-bottom: none; }
|
| 68 |
+
.method { font-size: 11px; font-weight: 600; padding: 3px 8px; border-radius: 6px; min-width: 45px; text-align: center; }
|
| 69 |
+
.get { background: #0d2d1e; color: #00c9a7; }
|
| 70 |
+
.post { background: #1a1a00; color: #f0a500; }
|
| 71 |
+
.endpoint-path { font-size: 13px; color: #fff; font-family: monospace; }
|
| 72 |
+
.endpoint-desc { font-size: 12px; color: #444; }
|
| 73 |
+
.action-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-top: 8px; }
|
| 74 |
+
.action-card { background: #050508; border: 1px solid #0f0f1a; border-radius: 10px; padding: 12px; }
|
| 75 |
+
.action-name { font-size: 12px; font-family: monospace; color: #00c9a7; margin-bottom: 4px; }
|
| 76 |
+
.action-desc { font-size: 11px; color: #444; line-height: 1.5; }
|
| 77 |
+
.stat { text-align: center; padding: 16px; }
|
| 78 |
+
.stat-num { font-size: 28px; font-weight: 600; color: #fff; }
|
| 79 |
+
.stat-num span { color: #00c9a7; }
|
| 80 |
+
.stat-label { font-size: 11px; color: #444; margin-top: 4px; }
|
| 81 |
+
.stats-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1px; background: #0f0f1a; border-radius: 12px; overflow: hidden; }
|
| 82 |
+
.stat { background: #080810; }
|
| 83 |
+
.badge { display: inline-flex; align-items: center; gap: 6px; padding: 5px 14px; border: 1px solid #0f2d26; background: #050f0d; border-radius: 20px; font-size: 11px; color: #00c9a7; }
|
| 84 |
+
.dot { width: 6px; height: 6px; background: #00c9a7; border-radius: 50%; animation: pulse 2s infinite; }
|
| 85 |
+
@keyframes pulse { 0%,100%{opacity:1} 50%{opacity:.3} }
|
| 86 |
+
footer { text-align: center; padding: 2rem 0; color: #333; font-size: 12px; }
|
| 87 |
+
footer span { color: #00c9a7; }
|
| 88 |
+
</style>
|
| 89 |
+
</head>
|
| 90 |
+
<body>
|
| 91 |
+
<div style="margin-bottom:1.5rem">
|
| 92 |
+
<div class="badge"><div class="dot"></div>Live — 5 tasks available</div>
|
| 93 |
+
</div>
|
| 94 |
+
|
| 95 |
+
<h1>🎙️ Voice Authenticity OpenEnv</h1>
|
| 96 |
+
<p style="margin-bottom:1.5rem;font-size:16px;color:#888">
|
| 97 |
+
Multi-step agentic environment for detecting synthetic (AI-generated) speech
|
| 98 |
+
across real-world degradation and adversarial conditions.
|
| 99 |
+
</p>
|
| 100 |
+
|
| 101 |
+
<div class="stats-grid">
|
| 102 |
+
<div class="stat">
|
| 103 |
+
<div class="stat-num">5<span>+</span></div>
|
| 104 |
+
<div class="stat-label">Tasks</div>
|
| 105 |
+
</div>
|
| 106 |
+
<div class="stat">
|
| 107 |
+
<div class="stat-num">5</div>
|
| 108 |
+
<div class="stat-label">Steps per episode</div>
|
| 109 |
+
</div>
|
| 110 |
+
<div class="stat">
|
| 111 |
+
<div class="stat-num">48</div>
|
| 112 |
+
<div class="stat-label">Feature dimensions</div>
|
| 113 |
+
</div>
|
| 114 |
+
</div>
|
| 115 |
+
|
| 116 |
+
<div class="card">
|
| 117 |
+
<h2>Tasks</h2>
|
| 118 |
+
<div class="task">
|
| 119 |
+
<strong>clean_detection <span class="difficulty easy">easy</span></strong>
|
| 120 |
+
<span>Classify real vs synthetic speech from clean, unmodified audio features</span>
|
| 121 |
+
</div>
|
| 122 |
+
<div class="task">
|
| 123 |
+
<strong>compressed_detection <span class="difficulty medium">medium</span></strong>
|
| 124 |
+
<span>Classify speech under codec compression degradation</span>
|
| 125 |
+
</div>
|
| 126 |
+
<div class="task">
|
| 127 |
+
<strong>adversarial_detection <span class="difficulty hard">hard</span></strong>
|
| 128 |
+
<span>Adversarially crafted synthetic speech with overlapping feature distributions</span>
|
| 129 |
+
</div>
|
| 130 |
+
<div class="task">
|
| 131 |
+
<strong>streaming_detection <span class="difficulty medium_hard">medium-hard</span></strong>
|
| 132 |
+
<span>Step-dependent noise soft-gating — earlier steps noisier, later steps cleaner</span>
|
| 133 |
+
</div>
|
| 134 |
+
<div class="task">
|
| 135 |
+
<strong>phonecall_detection <span class="difficulty extreme">extreme</span></strong>
|
| 136 |
+
<span>Heavy codec compression and narrowband degradation simulating phone calls</span>
|
| 137 |
+
</div>
|
| 138 |
+
</div>
|
| 139 |
+
|
| 140 |
+
<div class="card">
|
| 141 |
+
<h2>5-Step Agent Protocol</h2>
|
| 142 |
+
<div class="action-grid">
|
| 143 |
+
<div class="action-card">
|
| 144 |
+
<div class="action-name">1. request_temporal_features</div>
|
| 145 |
+
<div class="action-desc">Reveals jitter, shimmer, and HNR — the core discriminating signals</div>
|
| 146 |
+
</div>
|
| 147 |
+
<div class="action-card">
|
| 148 |
+
<div class="action-name">2. request_spectral_features</div>
|
| 149 |
+
<div class="action-desc">Reveals 20 MFCC means, 20 MFCC stds, ZCR, spectral centroid</div>
|
| 150 |
+
</div>
|
| 151 |
+
<div class="action-card">
|
| 152 |
+
<div class="action-name">3. request_comparison</div>
|
| 153 |
+
<div class="action-desc">Compares sample to real/fake reference centroids via cosine similarity</div>
|
| 154 |
+
</div>
|
| 155 |
+
<div class="action-card">
|
| 156 |
+
<div class="action-name">4. analyze_evidence</div>
|
| 157 |
+
<div class="action-desc">Synthesizes all gathered signals into a structured evidence summary</div>
|
| 158 |
+
</div>
|
| 159 |
+
<div class="action-card" style="grid-column: span 2;">
|
| 160 |
+
<div class="action-name">5. final_classify</div>
|
| 161 |
+
<div class="action-desc">Submits final verdict: label (0=real, 1=synthetic) + confidence + reasoning. Terminates episode.</div>
|
| 162 |
+
</div>
|
| 163 |
+
</div>
|
| 164 |
+
</div>
|
| 165 |
+
|
| 166 |
+
<div class="card">
|
| 167 |
+
<h2>API Endpoints</h2>
|
| 168 |
+
<div class="endpoint">
|
| 169 |
+
<span class="method post">POST</span>
|
| 170 |
+
<span class="endpoint-path">/reset</span>
|
| 171 |
+
<span class="endpoint-desc">Reset episode, optionally set task_name</span>
|
| 172 |
+
</div>
|
| 173 |
+
<div class="endpoint">
|
| 174 |
+
<span class="method post">POST</span>
|
| 175 |
+
<span class="endpoint-path">/step</span>
|
| 176 |
+
<span class="endpoint-desc">Submit action, receive observation + reward</span>
|
| 177 |
+
</div>
|
| 178 |
+
<div class="endpoint">
|
| 179 |
+
<span class="method get">GET</span>
|
| 180 |
+
<span class="endpoint-path">/state</span>
|
| 181 |
+
<span class="endpoint-desc">Current environment state</span>
|
| 182 |
+
</div>
|
| 183 |
+
<div class="endpoint">
|
| 184 |
+
<span class="method get">GET</span>
|
| 185 |
+
<span class="endpoint-path">/health</span>
|
| 186 |
+
<span class="endpoint-desc">Health check</span>
|
| 187 |
+
</div>
|
| 188 |
+
<div class="endpoint">
|
| 189 |
+
<span class="method get">GET</span>
|
| 190 |
+
<span class="endpoint-path"><a href="/docs">/docs</a></span>
|
| 191 |
+
<span class="endpoint-desc">Interactive API documentation (Swagger UI)</span>
|
| 192 |
+
</div>
|
| 193 |
+
</div>
|
| 194 |
+
|
| 195 |
+
<div class="card">
|
| 196 |
+
<h2>Tags</h2>
|
| 197 |
+
<span class="tag">openenv</span>
|
| 198 |
+
<span class="tag">speech</span>
|
| 199 |
+
<span class="tag">fraud-detection</span>
|
| 200 |
+
<span class="tag">audio</span>
|
| 201 |
+
<span class="tag">partial-observability</span>
|
| 202 |
+
<span class="tag">multi-step</span>
|
| 203 |
+
<span class="tag">confidence-calibration</span>
|
| 204 |
+
<span class="tag">adversarial</span>
|
| 205 |
+
</div>
|
| 206 |
+
|
| 207 |
+
<footer>
|
| 208 |
+
Built by <span>Akshara Sharma</span> · Voice Authenticity OpenEnv v2.0.0
|
| 209 |
+
· <a href="https://github.com/AksharaaSharmaa/voice-authenticity-openenv">GitHub</a>
|
| 210 |
+
</footer>
|
| 211 |
+
</body>
|
| 212 |
+
</html>
|
| 213 |
+
"""
|
| 214 |
+
|
| 215 |
+
|
| 216 |
@app.post("/reset")
|
| 217 |
def reset(request: dict = {}):
|
| 218 |
global current_task
|
|
|
|
| 228 |
"info": {}
|
| 229 |
})
|
| 230 |
|
| 231 |
+
|
| 232 |
@app.post("/step")
|
| 233 |
def step(action: ActionRequest):
|
| 234 |
global current_task
|
|
|
|
| 236 |
if task not in envs:
|
| 237 |
task = current_task
|
| 238 |
action_dict = {
|
| 239 |
+
"action_type": action.action_type,
|
| 240 |
"label": action.label,
|
| 241 |
"confidence": action.confidence,
|
| 242 |
+
"reasoning": action.reasoning,
|
| 243 |
+
"focus": action.focus,
|
| 244 |
}
|
| 245 |
obs, reward, done, info = envs[task].step(action_dict)
|
| 246 |
return JSONResponse({
|
|
|
|
| 250 |
"info": info
|
| 251 |
})
|
| 252 |
|
| 253 |
+
|
| 254 |
@app.get("/state")
|
| 255 |
def state():
|
| 256 |
return JSONResponse(envs[current_task].state())
|
| 257 |
|
| 258 |
+
|
| 259 |
@app.get("/health")
|
| 260 |
def health():
|
| 261 |
+
return {"status": "healthy", "service": "voice-authenticity-openenv"}
|
| 262 |
+
|
| 263 |
|
| 264 |
@app.get("/")
|
| 265 |
def root():
|
| 266 |
+
return {
|
| 267 |
+
"name": "voice-authenticity-openenv",
|
| 268 |
+
"version": "2.0.0",
|
| 269 |
+
"status": "running",
|
| 270 |
+
"tasks": TASKS,
|
| 271 |
+
"web": "/web",
|
| 272 |
+
"docs": "/docs"
|
| 273 |
+
}
|
| 274 |
+
|
| 275 |
+
def main():
    """Run the API server.

    Binds all interfaces on port 7860 by default; honors a ``PORT``
    environment variable so the hosting platform can override the port
    without a code change (default unchanged, so existing deploys keep
    working).
    """
    port = int(os.environ.get("PORT", "7860"))
    uvicorn.run(app, host="0.0.0.0", port=port)


if __name__ == "__main__":
    main()
|
environment/__pycache__/env.cpython-310.pyc
CHANGED
|
Binary files a/environment/__pycache__/env.cpython-310.pyc and b/environment/__pycache__/env.cpython-310.pyc differ
|
|
|
environment/__pycache__/graders.cpython-310.pyc
CHANGED
|
Binary files a/environment/__pycache__/graders.cpython-310.pyc and b/environment/__pycache__/graders.cpython-310.pyc differ
|
|
|
environment/__pycache__/models.cpython-310.pyc
CHANGED
|
Binary files a/environment/__pycache__/models.cpython-310.pyc and b/environment/__pycache__/models.cpython-310.pyc differ
|
|
|
environment/data/features_phonecall.npy
ADDED
|
Binary file (96.1 kB). View file
|
|
|
environment/data/features_streaming.npy
ADDED
|
Binary file (96.1 kB). View file
|
|
|
environment/data/labels_phonecall.npy
ADDED
|
Binary file (2.13 kB). View file
|
|
|
environment/data/labels_streaming.npy
ADDED
|
Binary file (2.13 kB). View file
|
|
|
environment/env.py
CHANGED
|
@@ -1,123 +1,638 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
import numpy as np
|
| 2 |
import random
|
| 3 |
-
from
|
|
|
|
| 4 |
|
| 5 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
|
| 7 |
DIFFICULTY_MAP = {
|
| 8 |
"clean_detection": "easy",
|
| 9 |
"compressed_detection": "medium",
|
| 10 |
-
"adversarial_detection": "hard"
|
|
|
|
|
|
|
| 11 |
}
|
| 12 |
|
| 13 |
DATA_FILES = {
|
| 14 |
"clean_detection": (
|
| 15 |
"environment/data/features.npy",
|
| 16 |
-
"environment/data/labels.npy"
|
| 17 |
),
|
| 18 |
"compressed_detection": (
|
| 19 |
"environment/data/features_compressed.npy",
|
| 20 |
-
"environment/data/labels_compressed.npy"
|
| 21 |
),
|
| 22 |
"adversarial_detection": (
|
| 23 |
"environment/data/features_adversarial.npy",
|
| 24 |
-
"environment/data/labels_adversarial.npy"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
),
|
| 26 |
}
|
| 27 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 28 |
class VoiceAuthenticityEnv:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
def __init__(self, task_name: str = "clean_detection"):
|
| 30 |
-
assert task_name in TASKS, f"Unknown task: {task_name}"
|
| 31 |
-
self.task_name
|
| 32 |
self.difficulty = DIFFICULTY_MAP[task_name]
|
| 33 |
|
| 34 |
feat_file, label_file = DATA_FILES[task_name]
|
| 35 |
-
self.features
|
| 36 |
-
self.labels
|
| 37 |
self.raw_features = np.load("environment/data/features_raw.npy")
|
| 38 |
-
self.indices
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
self.
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
self.
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
self.
|
| 48 |
-
self.
|
| 49 |
-
self.
|
| 50 |
-
self.
|
| 51 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 52 |
return self._make_observation()
|
| 53 |
|
| 54 |
-
def step(self, action: dict):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 55 |
if self.done:
|
| 56 |
raise RuntimeError("Episode done. Call reset().")
|
| 57 |
|
| 58 |
-
|
| 59 |
-
if self.phase == "analyze":
|
| 60 |
-
self.focus_features = action.get("focus", ["jitter", "shimmer", "hnr"])
|
| 61 |
-
self.step_number += 1
|
| 62 |
-
self.phase = "decide"
|
| 63 |
-
obs = self._make_observation()
|
| 64 |
-
return obs, 0.0, False, {
|
| 65 |
-
"phase": "decide",
|
| 66 |
-
"message": "Analysis received. Now submit your final classification.",
|
| 67 |
-
"focused_on": self.focus_features
|
| 68 |
-
}
|
| 69 |
|
| 70 |
-
#
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
"true_label": true_label,
|
| 84 |
-
"difficulty": self.difficulty,
|
| 85 |
-
"task": self.task_name
|
| 86 |
-
}
|
| 87 |
-
return obs, reward, self.done, info
|
| 88 |
|
| 89 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 90 |
return {
|
| 91 |
-
"task_name":
|
| 92 |
-
"difficulty":
|
| 93 |
-
"step_number":
|
| 94 |
-
"
|
| 95 |
-
"
|
| 96 |
-
"
|
| 97 |
-
"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
}
|
| 99 |
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
| 106 |
-
|
| 107 |
-
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 113 |
else:
|
| 114 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 115 |
|
| 116 |
return VoiceObservation(
|
| 117 |
-
features
|
| 118 |
-
task_name
|
| 119 |
-
step_number
|
| 120 |
-
difficulty
|
| 121 |
-
sample_id
|
| 122 |
-
hint
|
| 123 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Voice Authenticity Detection Environment — 5-action multi-step agent loop.
|
| 3 |
+
|
| 4 |
+
Actions:
|
| 5 |
+
request_temporal_features — reveals jitter, shimmer, HNR
|
| 6 |
+
request_spectral_features — reveals MFCC values
|
| 7 |
+
request_comparison — returns similarity to real/fake reference centroids
|
| 8 |
+
analyze_evidence — synthesizes accumulated evidence
|
| 9 |
+
final_classify — submits label + confidence + reasoning (terminal)
|
| 10 |
+
|
| 11 |
+
Partial observability: the agent starts with NO features visible and must
|
| 12 |
+
actively query the environment to build its picture before classifying.
|
| 13 |
+
|
| 14 |
+
Step-level rewards provide shaping signals throughout the episode.
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
import numpy as np
|
| 18 |
import random
|
| 19 |
+
from typing import List, Dict, Optional, Any
|
| 20 |
+
from environment.models import VoiceObservation, ActionType
|
| 21 |
|
| 22 |
+
# ── Task registry ───────────────────────────────────────────────────────

# Every task the environment supports; __init__ validates against this.
TASKS = [
    "clean_detection",
    "compressed_detection",
    "adversarial_detection",
    "streaming_detection",
    "phonecall_detection",
]

# Human-readable difficulty tier per task (surfaced in info/observations).
DIFFICULTY_MAP = {
    "clean_detection": "easy",
    "compressed_detection": "medium",
    "adversarial_detection": "hard",
    "streaming_detection": "medium_hard",
    "phonecall_detection": "extreme",
}

# (features .npy, labels .npy) file pair backing each task, relative to
# the process working directory (the repo root in the container).
DATA_FILES = {
    "clean_detection": (
        "environment/data/features.npy",
        "environment/data/labels.npy",
    ),
    "compressed_detection": (
        "environment/data/features_compressed.npy",
        "environment/data/labels_compressed.npy",
    ),
    "adversarial_detection": (
        "environment/data/features_adversarial.npy",
        "environment/data/labels_adversarial.npy",
    ),
    "streaming_detection": (
        "environment/data/features_streaming.npy",
        "environment/data/labels_streaming.npy",
    ),
    "phonecall_detection": (
        "environment/data/features_phonecall.npy",
        "environment/data/labels_phonecall.npy",
    ),
}

MAX_STEPS = 6  # 5 actions + 1 buffer

# ── Step-level reward constants ─────────────────────────────────────────
# Shaping bonuses/penalties applied per step, on top of the terminal
# classification reward.

REWARD_FIRST_ACTION_GATHER = 0.05      # first action is a feature request
REWARD_MULTI_FEATURE_TYPES = 0.05      # requested both temporal AND spectral
REWARD_ANALYZE_BEFORE_CLASSIFY = 0.05  # used analyze_evidence before final
PENALTY_JUMP_TO_CLASSIFY = -0.10       # final_classify as first action
PENALTY_REPEAT_ACTION = -0.05          # same action twice
PENALTY_CONTRADICTORY_REASONING = -0.10  # reasoning contradicts label
|
| 73 |
+
|
| 74 |
+
|
| 75 |
class VoiceAuthenticityEnv:
|
| 76 |
+
"""Multi-step voice authenticity detection environment.
|
| 77 |
+
|
| 78 |
+
The agent starts with no features visible and must issue actions to
|
| 79 |
+
reveal information before making a final classification.
|
| 80 |
+
"""
|
| 81 |
+
|
| 82 |
def __init__(self, task_name: str = "clean_detection"):
|
| 83 |
+
assert task_name in TASKS, f"Unknown task: {task_name}. Valid: {TASKS}"
|
| 84 |
+
self.task_name = task_name
|
| 85 |
self.difficulty = DIFFICULTY_MAP[task_name]
|
| 86 |
|
| 87 |
feat_file, label_file = DATA_FILES[task_name]
|
| 88 |
+
self.features = np.load(feat_file)
|
| 89 |
+
self.labels = np.load(label_file)
|
| 90 |
self.raw_features = np.load("environment/data/features_raw.npy")
|
| 91 |
+
self.indices = list(range(len(self.labels)))
|
| 92 |
+
|
| 93 |
+
# Precompute reference centroids for comparison action
|
| 94 |
+
self._compute_reference_centroids()
|
| 95 |
+
|
| 96 |
+
# Episode state
|
| 97 |
+
self.current_idx: Optional[int] = None
|
| 98 |
+
self.step_number: int = 0
|
| 99 |
+
self.done: bool = False
|
| 100 |
+
self.action_history: List[str] = []
|
| 101 |
+
self.revealed_features: Dict[str, Any] = {}
|
| 102 |
+
self.step_rewards: List[float] = []
|
| 103 |
+
self.evidence_accumulated: List[str] = []
|
| 104 |
+
|
| 105 |
+
# Streaming task noise schedule (soft-gating)
|
| 106 |
+
self._streaming_noise_schedule = {
|
| 107 |
+
1: 0.8, # very noisy early
|
| 108 |
+
2: 0.5,
|
| 109 |
+
3: 0.3,
|
| 110 |
+
4: 0.1,
|
| 111 |
+
5: 0.05, # nearly clean late
|
| 112 |
+
}
|
| 113 |
+
|
| 114 |
+
def _compute_reference_centroids(self):
|
| 115 |
+
"""Compute mean feature vectors for real vs fake samples."""
|
| 116 |
+
real_mask = self.labels == 0
|
| 117 |
+
fake_mask = self.labels == 1
|
| 118 |
+
|
| 119 |
+
if real_mask.sum() > 0:
|
| 120 |
+
self.real_centroid = self.features[real_mask].mean(axis=0)
|
| 121 |
+
else:
|
| 122 |
+
self.real_centroid = np.full(self.features.shape[1], 0.05)
|
| 123 |
+
|
| 124 |
+
if fake_mask.sum() > 0:
|
| 125 |
+
self.fake_centroid = self.features[fake_mask].mean(axis=0)
|
| 126 |
+
else:
|
| 127 |
+
self.fake_centroid = np.full(self.features.shape[1], 0.05)
|
| 128 |
+
|
| 129 |
+
def reset(self) -> VoiceObservation:
|
| 130 |
+
"""Reset episode. Returns observation with NO features visible."""
|
| 131 |
+
self.step_number = 0
|
| 132 |
+
self.done = False
|
| 133 |
+
self.action_history = []
|
| 134 |
+
self.revealed_features = {}
|
| 135 |
+
self.step_rewards = []
|
| 136 |
+
self.evidence_accumulated = []
|
| 137 |
+
self.current_idx = random.choice(self.indices)
|
| 138 |
return self._make_observation()
|
| 139 |
|
| 140 |
+
    def step(self, action: dict) -> tuple:
        """Execute one action and return (observation, reward, done, info).

        Args:
            action: dict with 'action_type' and optionally label/confidence/reasoning.

        Returns:
            (VoiceObservation, float, bool, dict)

        Raises:
            RuntimeError: if called after the episode has ended.
            ValueError: for an unrecognized action_type.
        """
        if self.done:
            raise RuntimeError("Episode done. Call reset().")

        # A missing action_type defaults to the terminal classify action.
        action_type = action.get("action_type", "final_classify")

        # Validate action type against the ActionType enum.
        valid_actions = [at.value for at in ActionType]
        if action_type not in valid_actions:
            raise ValueError(
                f"Unknown action_type: {action_type}. Valid: {valid_actions}"
            )

        # Track action
        self.action_history.append(action_type)
        self.step_number += 1

        # Compute step-level shaping reward from the action sequence so far.
        step_reward = self._compute_step_reward(action_type, action)

        # Dispatch to action handler.
        # NOTE(review): assumes ActionType has exactly these five members,
        # so one branch always binds obs/info — confirm against models.py.
        if action_type == ActionType.REQUEST_TEMPORAL.value:
            obs, info = self._handle_request_temporal()
        elif action_type == ActionType.REQUEST_SPECTRAL.value:
            obs, info = self._handle_request_spectral()
        elif action_type == ActionType.REQUEST_COMPARISON.value:
            obs, info = self._handle_request_comparison()
        elif action_type == ActionType.ANALYZE_EVIDENCE.value:
            obs, info = self._handle_analyze_evidence(action)
        elif action_type == ActionType.FINAL_CLASSIFY.value:
            # Terminal action also carries the classification reward.
            obs, final_reward, info = self._handle_final_classify(action)
            step_reward += final_reward

        # NOTE(review): the raw (pre-clamp) reward is recorded in history
        # here, while the clamped value is returned below — so negative
        # penalties show up in state()'s step_rewards but never in the
        # returned reward (floor is 0.05). Confirm this is intended.
        self.step_rewards.append(step_reward)

        # Cap total reward to [0.05, 0.95]
        step_reward = max(0.05, min(0.95, step_reward))

        # Check step limit
        if self.step_number >= MAX_STEPS and not self.done:
            self.done = True
            info["message"] = "Max steps reached. Episode ended."

        return obs, round(step_reward, 4), self.done, info
|
| 192 |
+
|
| 193 |
+
def state(self) -> dict:
|
| 194 |
+
"""Return full environment state for debugging."""
|
| 195 |
return {
|
| 196 |
+
"task_name": self.task_name,
|
| 197 |
+
"difficulty": self.difficulty,
|
| 198 |
+
"step_number": self.step_number,
|
| 199 |
+
"done": self.done,
|
| 200 |
+
"current_idx": self.current_idx,
|
| 201 |
+
"action_history": self.action_history,
|
| 202 |
+
"revealed_features": list(self.revealed_features.keys()),
|
| 203 |
+
"step_rewards": self.step_rewards,
|
| 204 |
+
}
|
| 205 |
+
|
| 206 |
+
# ── Action handlers ─────────────────────────────────────────────────
|
| 207 |
+
|
| 208 |
+
def _handle_request_temporal(self) -> tuple:
|
| 209 |
+
"""Reveal jitter, shimmer, HNR values."""
|
| 210 |
+
raw = self.raw_features[self.current_idx]
|
| 211 |
+
norm = self.features[self.current_idx]
|
| 212 |
+
|
| 213 |
+
temporal_data = {
|
| 214 |
+
"jitter": round(float(raw[42]), 6),
|
| 215 |
+
"shimmer": round(float(raw[43]), 6),
|
| 216 |
+
"hnr": round(float(raw[44]), 4),
|
| 217 |
+
"jitter_normalized": round(float(norm[42]), 4),
|
| 218 |
+
"shimmer_normalized": round(float(norm[43]), 4),
|
| 219 |
+
"hnr_normalized": round(float(norm[44]), 4),
|
| 220 |
+
}
|
| 221 |
+
|
| 222 |
+
# Apply streaming noise if applicable
|
| 223 |
+
if self.task_name == "streaming_detection":
|
| 224 |
+
temporal_data = self._apply_streaming_noise(temporal_data)
|
| 225 |
+
|
| 226 |
+
self.revealed_features["temporal"] = temporal_data
|
| 227 |
+
self.evidence_accumulated.append(
|
| 228 |
+
f"Temporal features: jitter={temporal_data['jitter']}, "
|
| 229 |
+
f"shimmer={temporal_data['shimmer']}, hnr={temporal_data['hnr']}"
|
| 230 |
+
)
|
| 231 |
+
|
| 232 |
+
obs = self._make_observation()
|
| 233 |
+
info = {
|
| 234 |
+
"action": "request_temporal_features",
|
| 235 |
+
"message": "Temporal features revealed: jitter, shimmer, HNR.",
|
| 236 |
+
"data": temporal_data,
|
| 237 |
+
}
|
| 238 |
+
return obs, info
|
| 239 |
+
|
| 240 |
+
def _handle_request_spectral(self) -> tuple:
|
| 241 |
+
"""Reveal MFCC mean and std values."""
|
| 242 |
+
raw = self.raw_features[self.current_idx]
|
| 243 |
+
norm = self.features[self.current_idx]
|
| 244 |
+
|
| 245 |
+
spectral_data = {
|
| 246 |
+
"mfcc_means": [round(float(v), 4) for v in raw[0:20]],
|
| 247 |
+
"mfcc_stds": [round(float(v), 4) for v in raw[20:40]],
|
| 248 |
+
"zcr": round(float(raw[40]), 6),
|
| 249 |
+
"spectral_centroid": round(float(raw[41]), 4),
|
| 250 |
+
"mfcc_means_normalized": [round(float(v), 4) for v in norm[0:20]],
|
| 251 |
+
"mfcc_stds_normalized": [round(float(v), 4) for v in norm[20:40]],
|
| 252 |
+
}
|
| 253 |
+
|
| 254 |
+
# Apply streaming noise if applicable
|
| 255 |
+
if self.task_name == "streaming_detection":
|
| 256 |
+
spectral_data = self._apply_streaming_noise(spectral_data)
|
| 257 |
+
|
| 258 |
+
self.revealed_features["spectral"] = spectral_data
|
| 259 |
+
self.evidence_accumulated.append(
|
| 260 |
+
f"Spectral features: {len(spectral_data['mfcc_means'])} MFCC coefficients, "
|
| 261 |
+
f"ZCR={spectral_data['zcr']}, centroid={spectral_data['spectral_centroid']}"
|
| 262 |
+
)
|
| 263 |
+
|
| 264 |
+
obs = self._make_observation()
|
| 265 |
+
info = {
|
| 266 |
+
"action": "request_spectral_features",
|
| 267 |
+
"message": "Spectral features revealed: 20 MFCC means, 20 MFCC stds, ZCR, spectral centroid.",
|
| 268 |
+
"data": spectral_data,
|
| 269 |
+
}
|
| 270 |
+
return obs, info
|
| 271 |
+
|
| 272 |
+
def _handle_request_comparison(self) -> tuple:
|
| 273 |
+
"""Compare this sample to known real/fake reference centroids."""
|
| 274 |
+
sample = self.features[self.current_idx]
|
| 275 |
+
|
| 276 |
+
# Cosine similarity to real and fake centroids
|
| 277 |
+
real_sim = self._cosine_similarity(sample, self.real_centroid)
|
| 278 |
+
fake_sim = self._cosine_similarity(sample, self.fake_centroid)
|
| 279 |
+
|
| 280 |
+
# Euclidean distance
|
| 281 |
+
real_dist = float(np.linalg.norm(sample - self.real_centroid))
|
| 282 |
+
fake_dist = float(np.linalg.norm(sample - self.fake_centroid))
|
| 283 |
+
|
| 284 |
+
comparison_data = {
|
| 285 |
+
"cosine_similarity_to_real": round(real_sim, 4),
|
| 286 |
+
"cosine_similarity_to_fake": round(fake_sim, 4),
|
| 287 |
+
"euclidean_distance_to_real": round(real_dist, 4),
|
| 288 |
+
"euclidean_distance_to_fake": round(fake_dist, 4),
|
| 289 |
+
"closer_to": "real" if real_dist < fake_dist else "fake",
|
| 290 |
+
"similarity_differential": round(real_sim - fake_sim, 4),
|
| 291 |
+
}
|
| 292 |
+
|
| 293 |
+
self.revealed_features["comparison"] = comparison_data
|
| 294 |
+
self.evidence_accumulated.append(
|
| 295 |
+
f"Comparison: cosine_sim_real={comparison_data['cosine_similarity_to_real']}, "
|
| 296 |
+
f"cosine_sim_fake={comparison_data['cosine_similarity_to_fake']}, "
|
| 297 |
+
f"closer_to={comparison_data['closer_to']}"
|
| 298 |
+
)
|
| 299 |
+
|
| 300 |
+
obs = self._make_observation()
|
| 301 |
+
info = {
|
| 302 |
+
"action": "request_comparison",
|
| 303 |
+
"message": "Comparison to reference centroids computed.",
|
| 304 |
+
"data": comparison_data,
|
| 305 |
+
}
|
| 306 |
+
return obs, info
|
| 307 |
+
|
| 308 |
+
def _handle_analyze_evidence(self, action: dict) -> tuple:
|
| 309 |
+
"""Synthesize all gathered evidence into a structured summary."""
|
| 310 |
+
evidence_parts = []
|
| 311 |
+
|
| 312 |
+
# Build evidence summary from what's been revealed
|
| 313 |
+
if "temporal" in self.revealed_features:
|
| 314 |
+
t = self.revealed_features["temporal"]
|
| 315 |
+
jitter_val = t.get("jitter", 0)
|
| 316 |
+
shimmer_val = t.get("shimmer", 0)
|
| 317 |
+
hnr_val = t.get("hnr", 0)
|
| 318 |
+
|
| 319 |
+
# Provide interpretive guidance based on actual values
|
| 320 |
+
jitter_interp = "elevated (typical of real speech)" if jitter_val > 0.025 else "low (typical of synthetic)"
|
| 321 |
+
shimmer_interp = "elevated (typical of real speech)" if shimmer_val > 0.10 else "low (typical of synthetic)"
|
| 322 |
+
hnr_interp = "low (typical of real speech)" if hnr_val < 12.0 else "high (typical of synthetic)"
|
| 323 |
+
|
| 324 |
+
evidence_parts.append(
|
| 325 |
+
f"TEMPORAL: jitter={jitter_val} ({jitter_interp}), "
|
| 326 |
+
f"shimmer={shimmer_val} ({shimmer_interp}), "
|
| 327 |
+
f"HNR={hnr_val} ({hnr_interp})"
|
| 328 |
+
)
|
| 329 |
+
|
| 330 |
+
if "spectral" in self.revealed_features:
|
| 331 |
+
s = self.revealed_features["spectral"]
|
| 332 |
+
mfcc_mean_avg = np.mean(s.get("mfcc_means", [0])) if s.get("mfcc_means") else 0
|
| 333 |
+
mfcc_std_avg = np.mean(s.get("mfcc_stds", [0])) if s.get("mfcc_stds") else 0
|
| 334 |
+
evidence_parts.append(
|
| 335 |
+
f"SPECTRAL: avg_mfcc_mean={mfcc_mean_avg:.3f}, "
|
| 336 |
+
f"avg_mfcc_std={mfcc_std_avg:.3f}, "
|
| 337 |
+
f"zcr={s.get('zcr', 0)}, centroid={s.get('spectral_centroid', 0)}"
|
| 338 |
+
)
|
| 339 |
+
|
| 340 |
+
if "comparison" in self.revealed_features:
|
| 341 |
+
c = self.revealed_features["comparison"]
|
| 342 |
+
evidence_parts.append(
|
| 343 |
+
f"COMPARISON: closer_to={c['closer_to']}, "
|
| 344 |
+
f"diff={c['similarity_differential']}"
|
| 345 |
+
)
|
| 346 |
+
|
| 347 |
+
if not evidence_parts:
|
| 348 |
+
summary = "No evidence gathered yet. Request features before analyzing."
|
| 349 |
+
else:
|
| 350 |
+
# Count evidence signals pointing to real vs fake
|
| 351 |
+
real_signals = 0
|
| 352 |
+
fake_signals = 0
|
| 353 |
+
|
| 354 |
+
if "temporal" in self.revealed_features:
|
| 355 |
+
t = self.revealed_features["temporal"]
|
| 356 |
+
if t.get("jitter", 0) > 0.025:
|
| 357 |
+
real_signals += 1
|
| 358 |
+
else:
|
| 359 |
+
fake_signals += 1
|
| 360 |
+
if t.get("shimmer", 0) > 0.10:
|
| 361 |
+
real_signals += 1
|
| 362 |
+
else:
|
| 363 |
+
fake_signals += 1
|
| 364 |
+
if t.get("hnr", 0) < 12.0:
|
| 365 |
+
real_signals += 1
|
| 366 |
+
else:
|
| 367 |
+
fake_signals += 1
|
| 368 |
+
|
| 369 |
+
if "comparison" in self.revealed_features:
|
| 370 |
+
c = self.revealed_features["comparison"]
|
| 371 |
+
if c["closer_to"] == "real":
|
| 372 |
+
real_signals += 1
|
| 373 |
+
else:
|
| 374 |
+
fake_signals += 1
|
| 375 |
+
|
| 376 |
+
total_signals = real_signals + fake_signals
|
| 377 |
+
if total_signals > 0:
|
| 378 |
+
suggested_confidence = max(real_signals, fake_signals) / total_signals
|
| 379 |
+
leaning = "REAL" if real_signals > fake_signals else "SYNTHETIC"
|
| 380 |
+
else:
|
| 381 |
+
suggested_confidence = 0.55
|
| 382 |
+
leaning = "UNCERTAIN"
|
| 383 |
+
|
| 384 |
+
# Adjust confidence for difficulty
|
| 385 |
+
if self.difficulty in ("hard", "extreme", "medium_hard"):
|
| 386 |
+
suggested_confidence = min(suggested_confidence, 0.80)
|
| 387 |
+
|
| 388 |
+
summary = (
|
| 389 |
+
f"Evidence analysis ({len(evidence_parts)} sources):\n"
|
| 390 |
+
+ "\n".join(f" • {p}" for p in evidence_parts)
|
| 391 |
+
+ f"\n\nSignal tally: {real_signals} real vs {fake_signals} synthetic"
|
| 392 |
+
+ f"\nPreliminary assessment: leaning {leaning}"
|
| 393 |
+
+ f"\nSuggested confidence: {suggested_confidence:.2f}"
|
| 394 |
+
+ f"\nDifficulty context: {self.difficulty}"
|
| 395 |
+
)
|
| 396 |
+
|
| 397 |
+
self.revealed_features["analysis"] = {
|
| 398 |
+
"summary": summary,
|
| 399 |
+
"evidence_count": len(evidence_parts),
|
| 400 |
}
|
| 401 |
|
| 402 |
+
obs = self._make_observation(evidence_summary=summary)
|
| 403 |
+
info = {
|
| 404 |
+
"action": "analyze_evidence",
|
| 405 |
+
"message": "Evidence synthesized.",
|
| 406 |
+
"summary": summary,
|
| 407 |
+
"evidence_count": len(evidence_parts),
|
| 408 |
+
}
|
| 409 |
+
return obs, info
|
| 410 |
+
|
| 411 |
+
def _handle_final_classify(self, action: dict) -> tuple:
    """Submit the final classification, grade it, and end the episode.

    Returns (observation, score, info) — unlike the other handlers this
    also yields the grader's terminal score.
    """
    # Deferred import, as in the original module layout — presumably to
    # avoid an import cycle between env and graders; confirm if refactoring.
    from environment.graders import grade

    true_label = int(self.labels[self.current_idx])

    verdict = grade(
        true_label=true_label,
        action=action,
        difficulty=self.difficulty,
        action_history=self.action_history,
    )

    # Terminal transition: no further actions will be accepted.
    self.done = True

    info = {
        "action": "final_classify",
        "phase": "done",
        "true_label": true_label,
        "predicted_label": action.get("label", 0),
        "difficulty": self.difficulty,
        "task": self.task_name,
        "grader_breakdown": verdict["breakdown"],
        "grader_weights": verdict["weights"],
        "penalties": verdict["penalties"],
        "correct": verdict["correct"],
        "episode_summary": {
            "actions_taken": self.action_history,
            "features_revealed": list(self.revealed_features.keys()),
            "total_steps": self.step_number,
        },
    }

    return self._make_observation(), verdict["score"], info
|
| 446 |
+
|
| 447 |
+
# ── Step-level reward computation ───────────────────────────────────
|
| 448 |
+
|
| 449 |
+
def _compute_step_reward(self, action_type: str, action: dict) -> float:
    """Compute the per-step shaping reward for the action just taken.

    Starts from a 0.05 base and adds bonuses/penalties that encourage
    gathering evidence first, covering multiple feature types, analyzing
    before classifying, and keeping reasoning consistent with the label.
    """
    shaped = 0.05

    gather_set = {
        ActionType.REQUEST_TEMPORAL.value,
        ActionType.REQUEST_SPECTRAL.value,
        ActionType.REQUEST_COMPARISON.value,
    }

    is_first_action = len(self.action_history) == 1

    # Bonus: opening the episode with a feature request.
    if is_first_action and action_type in gather_set:
        shaped += REWARD_FIRST_ACTION_GATHER

    # Penalty: classifying blind on the very first step.
    if is_first_action and action_type == ActionType.FINAL_CLASSIFY.value:
        shaped += PENALTY_JUMP_TO_CLASSIFY

    # Bonus (once): this action completed the temporal+spectral pair.
    pair = {ActionType.REQUEST_TEMPORAL.value, ActionType.REQUEST_SPECTRAL.value}
    if (pair <= set(self.action_history)
            and len(self.action_history) >= 2
            and action_type in pair):
        if not pair <= set(self.action_history[:-1]):
            shaped += REWARD_MULTI_FEATURE_TYPES

    # Bonus: evidence was analyzed at some point before final_classify.
    if (action_type == ActionType.FINAL_CLASSIFY.value
            and ActionType.ANALYZE_EVIDENCE.value in self.action_history[:-1]):
        shaped += REWARD_ANALYZE_BEFORE_CLASSIFY

    # Penalty: same action twice in a row.
    if len(self.action_history) >= 2 and self.action_history[-1] == self.action_history[-2]:
        shaped += PENALTY_REPEAT_ACTION

    # Penalty: reasoning text contradicts the submitted label
    # (checked only at classification time, with negation escape hatches).
    if action_type == ActionType.FINAL_CLASSIFY.value:
        label = action.get("label", 0)
        reasoning = action.get("reasoning", "").lower()
        if label == 0 and any(kw in reasoning for kw in ["synthetic", "fake", "artificial", "generated"]):
            if not any(kw in reasoning for kw in ["not synthetic", "not fake", "not artificial"]):
                shaped += PENALTY_CONTRADICTORY_REASONING
        elif label == 1 and any(kw in reasoning for kw in ["real", "human", "natural", "authentic"]):
            if not any(kw in reasoning for kw in ["not real", "not human", "not natural"]):
                shaped += PENALTY_CONTRADICTORY_REASONING

    return shaped
|
| 500 |
+
|
| 501 |
+
# ── Streaming noise (soft-gating) ───────────────────────────────────
|
| 502 |
+
|
| 503 |
+
def _apply_streaming_noise(self, data: dict) -> dict:
|
| 504 |
+
"""Apply noise to features based on step number for streaming task.
|
| 505 |
+
|
| 506 |
+
Earlier steps get noisier data, later steps get cleaner data.
|
| 507 |
+
This is soft-gating: features are always available but with
|
| 508 |
+
varying fidelity.
|
| 509 |
+
"""
|
| 510 |
+
noise_level = self._streaming_noise_schedule.get(
|
| 511 |
+
self.step_number, 0.05
|
| 512 |
+
)
|
| 513 |
+
|
| 514 |
+
noisy_data = {}
|
| 515 |
+
for key, value in data.items():
|
| 516 |
+
if isinstance(value, (int, float)):
|
| 517 |
+
noise = np.random.normal(0, noise_level * abs(value) + 1e-6)
|
| 518 |
+
noisy_data[key] = round(float(value + noise), 6)
|
| 519 |
+
elif isinstance(value, list):
|
| 520 |
+
noisy_data[key] = [
|
| 521 |
+
round(float(v + np.random.normal(0, noise_level * abs(v) + 1e-6)), 4)
|
| 522 |
+
for v in value
|
| 523 |
+
]
|
| 524 |
+
else:
|
| 525 |
+
noisy_data[key] = value
|
| 526 |
+
|
| 527 |
+
return noisy_data
|
| 528 |
+
|
| 529 |
+
# ── Helper methods ──────────────────────────────────────────────────
|
| 530 |
+
|
| 531 |
+
@staticmethod
|
| 532 |
+
def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
|
| 533 |
+
"""Cosine similarity between two vectors."""
|
| 534 |
+
norm_a = np.linalg.norm(a)
|
| 535 |
+
norm_b = np.linalg.norm(b)
|
| 536 |
+
if norm_a == 0 or norm_b == 0:
|
| 537 |
+
return 0.05
|
| 538 |
+
return float(np.dot(a, b) / (norm_a * norm_b))
|
| 539 |
+
|
| 540 |
+
def _make_observation(
    self,
    evidence_summary: Optional[str] = None,
) -> VoiceObservation:
    """Assemble the observation for the current state.

    Partial observability: the full normalized feature vector is exposed
    only once BOTH temporal and spectral groups have been requested, or
    once the episode is done; otherwise the vector is filled with the
    0.05 placeholder value.
    """
    fully_revealed = (
        "temporal" in self.revealed_features
        and "spectral" in self.revealed_features
    )

    if fully_revealed or self.done:
        vector = self.features[self.current_idx].tolist()
    else:
        # Placeholder vector: same width, constant 0.05 baseline.
        vector = [0.05] * self.features.shape[1]

    return VoiceObservation(
        features=vector,
        task_name=self.task_name,
        step_number=self.step_number,
        difficulty=self.difficulty,
        sample_id=int(self.current_idx),
        hint=self._build_hint(),
        visible_features=dict(self.revealed_features),
        evidence_summary=evidence_summary,
        comparison_result=self.revealed_features.get("comparison", None),
        available_actions=self._get_available_actions(),
        actions_taken=list(self.action_history),
    )
|
| 582 |
+
|
| 583 |
+
def _build_hint(self) -> str:
    """Compose the contextual hint string shown to the agent.

    Three cases: a terminal message when done, a detailed orientation
    message on step 0 (with task-specific notes), and a compact progress
    line on later steps (with a warning when few steps remain).
    """
    if self.done:
        return "Episode complete."

    if self.step_number == 0:
        hint = (
            f"Task: {self.task_name} (difficulty: {self.difficulty}). "
            f"You have {MAX_STEPS - self.step_number} steps remaining. "
            "No features are visible yet. Use request_temporal_features, "
            "request_spectral_features, or request_comparison to gather "
            "evidence before classifying."
        )
        # Task/difficulty-specific caveats.
        if self.difficulty in ("hard", "extreme"):
            hint += " Warning: this is a challenging task. Gather thorough evidence and calibrate your confidence carefully."
        if self.task_name == "streaming_detection":
            hint += " Note: this is a streaming scenario — earlier feature requests may contain noise that reduces over time."
        if self.task_name == "phonecall_detection":
            hint += " Note: this is a phone call scenario with heavy codec compression and background noise."
        return hint

    fragments = [
        f"Step {self.step_number}/{MAX_STEPS}.",
        f"Task: {self.task_name} ({self.difficulty}).",
        f"Actions taken: {', '.join(self.action_history)}.",
    ]

    if self.revealed_features:
        fragments.append(
            f"Features revealed: {', '.join(list(self.revealed_features.keys()))}."
        )

    steps_left = MAX_STEPS - self.step_number
    if steps_left <= 2:
        fragments.append(f"⚠️ Only {steps_left} steps remaining — consider classifying soon.")

    return " ".join(fragments)
|
| 619 |
+
|
| 620 |
+
def _get_available_actions(self) -> List[str]:
    """List the action values the agent may take next.

    Empty once the episode is done. final_classify is always offered;
    every other action is offered unless it was the immediately
    preceding action (re-requesting after intervening actions is fine).
    """
    if self.done:
        return []

    last_action = self.action_history[-1] if self.action_history else None
    return [
        at.value
        for at in ActionType
        if at == ActionType.FINAL_CLASSIFY or at.value != last_action
    ]
|
environment/graders.py
CHANGED
|
@@ -1,30 +1,329 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 5 |
|
| 6 |
-
if
|
| 7 |
-
|
| 8 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 9 |
else:
|
| 10 |
-
return 0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 17 |
else:
|
| 18 |
-
|
| 19 |
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
if
|
| 23 |
-
|
| 24 |
-
calibration_bonus = 0.45 * (1 - abs(confidence - 0.7))
|
| 25 |
-
return round(base + calibration_bonus, 3)
|
| 26 |
else:
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
6-component grader for Voice Authenticity OpenEnv.
|
| 3 |
+
|
| 4 |
+
Components:
|
| 5 |
+
1. Correctness — label matches ground truth
|
| 6 |
+
2. Confidence calibration — penalizes overconfidence on wrong, rewards calibrated
|
| 7 |
+
3. Trajectory quality — did agent analyze before classifying
|
| 8 |
+
4. Feature utilization — did agent request temporal/spectral features
|
| 9 |
+
5. Reasoning consistency — does reasoning text match chosen label
|
| 10 |
+
6. Action ordering — logical gather → analyze → classify sequence
|
| 11 |
+
|
| 12 |
+
Difficulty weighting adjusts component weights per task difficulty.
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
+
from typing import Dict, List, Optional
|
| 16 |
+
|
| 17 |
+
# ── Difficulty-based component weights ──────────────────────────────────
|
| 18 |
+
def _make_weights(correctness, calibration, trajectory, features, reasoning, ordering):
    """Build one per-difficulty weight row keyed by grader component name."""
    return {
        "correctness": correctness,
        "confidence_calibration": calibration,
        "trajectory_quality": trajectory,
        "feature_utilization": features,
        "reasoning_consistency": reasoning,
        "action_ordering": ordering,
    }


# Per-difficulty grading weights (each row sums to 1.0). Harder tiers shift
# weight away from raw correctness toward calibration and process quality.
COMPONENT_WEIGHTS = {
    "easy": _make_weights(0.40, 0.15, 0.10, 0.15, 0.10, 0.10),
    "medium": _make_weights(0.30, 0.20, 0.15, 0.15, 0.10, 0.10),
    "medium_hard": _make_weights(0.25, 0.22, 0.18, 0.15, 0.10, 0.10),
    "hard": _make_weights(0.25, 0.25, 0.18, 0.12, 0.10, 0.10),
    "extreme": _make_weights(0.20, 0.25, 0.20, 0.15, 0.10, 0.10),
}
|
| 60 |
+
|
| 61 |
+
# ── Keywords for reasoning consistency check ────────────────────────────
|
| 62 |
+
# ── Keyword lexicons for the reasoning-consistency check ────────────────
# Substring matches against lowercased reasoning text. Cues typical of
# genuine human speech:
REAL_KEYWORDS = [
    "real", "human", "natural", "authentic", "genuine", "organic",
    "jitter", "high jitter", "shimmer variation", "low hnr",
    "irregular", "imperfect", "variation",
]
# Cues typical of machine-generated speech:
SYNTHETIC_KEYWORDS = [
    "synthetic", "fake", "artificial", "generated", "tts",
    "ai-generated", "deepfake", "machine", "clone",
    "smooth", "perfect", "uniform", "low jitter", "high hnr",
    "stable", "consistent",
]
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def _score_correctness(true_label: int, predicted_label: int) -> float:
|
| 76 |
+
"""Binary correctness: 0.95 if correct, 0.05 if wrong."""
|
| 77 |
+
return 0.95 if predicted_label == true_label else 0.05
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
def _score_confidence_calibration(
|
| 81 |
+
correct: bool, confidence: float, difficulty: str
|
| 82 |
+
) -> float:
|
| 83 |
+
"""Score confidence calibration.
|
| 84 |
+
|
| 85 |
+
Correct + calibrated confidence → high score
|
| 86 |
+
Correct + overconfident on hard tasks → penalized
|
| 87 |
+
Wrong + low confidence → partial credit
|
| 88 |
+
Wrong + high confidence → zero
|
| 89 |
+
"""
|
| 90 |
+
if correct:
|
| 91 |
+
if difficulty in ("easy", "medium"):
|
| 92 |
+
# Reward higher confidence when correct on easier tasks
|
| 93 |
+
return 0.6 + 0.4 * confidence
|
| 94 |
+
elif difficulty == "medium_hard":
|
| 95 |
+
# Reward moderate confidence
|
| 96 |
+
ideal = 0.75
|
| 97 |
+
deviation = abs(confidence - ideal)
|
| 98 |
+
return max(0.05, 0.95 - 1.5 * deviation)
|
| 99 |
+
elif difficulty in ("hard", "extreme"):
|
| 100 |
+
# Reward calibrated ~0.7 confidence, penalize overconfidence
|
| 101 |
+
ideal = 0.7
|
| 102 |
+
deviation = abs(confidence - ideal)
|
| 103 |
+
return max(0.05, 0.95 - 2.0 * deviation)
|
| 104 |
+
else:
|
| 105 |
+
# Wrong answer — reward uncertainty, punish overconfidence
|
| 106 |
+
if confidence < 0.3:
|
| 107 |
+
return 0.4 # appropriately uncertain
|
| 108 |
+
elif confidence < 0.5:
|
| 109 |
+
return 0.2
|
| 110 |
+
elif confidence < 0.7:
|
| 111 |
+
return 0.1
|
| 112 |
+
else:
|
| 113 |
+
return 0.05 # overconfident AND wrong
|
| 114 |
+
|
| 115 |
+
|
| 116 |
+
def _score_trajectory_quality(action_history: List[str]) -> float:
|
| 117 |
+
"""Did the agent analyze evidence before classifying?
|
| 118 |
+
|
| 119 |
+
Best: gathered features → analyzed → classified
|
| 120 |
+
Okay: gathered features → classified (skipped analysis)
|
| 121 |
+
Worst: jumped straight to final_classify
|
| 122 |
+
"""
|
| 123 |
+
if len(action_history) <= 1:
|
| 124 |
+
# Only final_classify, no exploration at all
|
| 125 |
+
return 0.05
|
| 126 |
+
|
| 127 |
+
has_analysis = "analyze_evidence" in action_history
|
| 128 |
+
has_gathering = any(
|
| 129 |
+
a in action_history for a in [
|
| 130 |
+
"request_temporal_features",
|
| 131 |
+
"request_spectral_features",
|
| 132 |
+
"request_comparison",
|
| 133 |
+
]
|
| 134 |
+
)
|
| 135 |
+
|
| 136 |
+
if has_gathering and has_analysis:
|
| 137 |
+
return 0.95
|
| 138 |
+
elif has_gathering:
|
| 139 |
+
return 0.6
|
| 140 |
+
elif has_analysis:
|
| 141 |
+
return 0.3
|
| 142 |
+
else:
|
| 143 |
+
return 0.1
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
def _score_feature_utilization(action_history: List[str]) -> float:
|
| 147 |
+
"""Did the agent request specific feature types?
|
| 148 |
+
|
| 149 |
+
Best: requested both temporal AND spectral
|
| 150 |
+
Good: requested temporal OR spectral + comparison
|
| 151 |
+
Okay: requested only one type
|
| 152 |
+
Bad: no feature requests
|
| 153 |
+
"""
|
| 154 |
+
has_temporal = "request_temporal_features" in action_history
|
| 155 |
+
has_spectral = "request_spectral_features" in action_history
|
| 156 |
+
has_comparison = "request_comparison" in action_history
|
| 157 |
+
|
| 158 |
+
count = sum([has_temporal, has_spectral, has_comparison])
|
| 159 |
+
|
| 160 |
+
if has_temporal and has_spectral and has_comparison:
|
| 161 |
+
return 0.95
|
| 162 |
+
elif has_temporal and has_spectral:
|
| 163 |
+
return 0.9
|
| 164 |
+
elif count == 2:
|
| 165 |
+
return 0.7
|
| 166 |
+
elif count == 1:
|
| 167 |
+
return 0.4
|
| 168 |
+
else:
|
| 169 |
+
return 0.05
|
| 170 |
+
|
| 171 |
+
|
| 172 |
+
def _score_reasoning_consistency(
    label: int, reasoning: str
) -> float:
    """Score whether the reasoning text supports the chosen label.

    Counts substring hits from the real/synthetic keyword lexicons in the
    lowercased reasoning and compares the tallies against the label.

    Args:
        label: predicted label (0 = real, 1 = synthetic).
        reasoning: free-text justification; may be empty.

    Returns:
        0.95 aligned, 0.5 mixed-but-leaning, 0.1 contradictory,
        0.4 neutral, 0.2 when reasoning is missing or trivially short.
    """
    # Guard BEFORE touching the string: previously reasoning.lower() ran
    # first, so a None reasoning raised AttributeError instead of scoring
    # 0.2. Short-circuiting on falsy reasoning handles None and "" alike.
    if not reasoning or len(reasoning.strip()) < 5:
        return 0.2  # minimal reasoning provided

    reasoning_lower = reasoning.lower()

    real_hits = sum(1 for kw in REAL_KEYWORDS if kw in reasoning_lower)
    synthetic_hits = sum(1 for kw in SYNTHETIC_KEYWORDS if kw in reasoning_lower)

    if label == 0:  # predicted real
        if real_hits > 0 and real_hits >= synthetic_hits:
            return 0.95
        elif real_hits > 0:
            return 0.5
        elif synthetic_hits > 0:
            return 0.1  # contradictory
        else:
            return 0.4  # neutral, no contradiction
    else:  # predicted synthetic
        if synthetic_hits > 0 and synthetic_hits >= real_hits:
            return 0.95
        elif synthetic_hits > 0:
            return 0.5
        elif real_hits > 0:
            return 0.1  # contradictory
        else:
            return 0.4  # neutral
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
def _score_action_ordering(action_history: List[str]) -> float:
|
| 208 |
+
"""Logical sequence: gather → analyze → classify.
|
| 209 |
+
|
| 210 |
+
Ideal ordering: feature requests first, then analysis, then classify
|
| 211 |
+
Penalized: analysis before any gathering, or classify without gathering
|
| 212 |
+
"""
|
| 213 |
+
if len(action_history) <= 1:
|
| 214 |
+
return 0.1 # jumped straight to classify
|
| 215 |
+
|
| 216 |
+
gathering_actions = {
|
| 217 |
+
"request_temporal_features",
|
| 218 |
+
"request_spectral_features",
|
| 219 |
+
"request_comparison",
|
| 220 |
+
}
|
| 221 |
|
| 222 |
+
# Find position indices
|
| 223 |
+
first_gather_idx = None
|
| 224 |
+
analysis_idx = None
|
| 225 |
+
classify_idx = None
|
| 226 |
+
|
| 227 |
+
for i, action in enumerate(action_history):
|
| 228 |
+
if action in gathering_actions and first_gather_idx is None:
|
| 229 |
+
first_gather_idx = i
|
| 230 |
+
if action == "analyze_evidence" and analysis_idx is None:
|
| 231 |
+
analysis_idx = i
|
| 232 |
+
if action == "final_classify":
|
| 233 |
+
classify_idx = i
|
| 234 |
+
|
| 235 |
+
score = 0.5 # baseline — at least did more than one action
|
| 236 |
+
|
| 237 |
+
# Gathering before analysis is good
|
| 238 |
+
if first_gather_idx is not None and analysis_idx is not None:
|
| 239 |
+
if first_gather_idx < analysis_idx:
|
| 240 |
+
score += 0.25
|
| 241 |
else:
|
| 242 |
+
score -= 0.15 # analyzed before gathering
|
| 243 |
+
|
| 244 |
+
# Analysis before classify
|
| 245 |
+
if analysis_idx is not None and classify_idx is not None:
|
| 246 |
+
if analysis_idx < classify_idx:
|
| 247 |
+
score += 0.25
|
|
|
|
|
|
|
| 248 |
else:
|
| 249 |
+
score -= 0.10
|
| 250 |
+
|
| 251 |
+
# Gathering happened at all
|
| 252 |
+
if first_gather_idx is not None:
|
| 253 |
+
score += 0.1
|
| 254 |
+
|
| 255 |
+
return max(0.05, min(0.95, score))
|
| 256 |
+
|
| 257 |
+
|
| 258 |
+
def grade(
    true_label: int,
    action: dict,
    difficulty: str,
    action_history: Optional[List[str]] = None,
) -> dict:
    """6-component grader with difficulty-weighted scoring.

    Args:
        true_label: ground truth label (0=real, 1=synthetic)
        action: dict with label, confidence, reasoning
        difficulty: one of easy, medium, medium_hard, hard, extreme
        action_history: list of action_type strings taken this episode

    Returns:
        dict with:
            score: float in [0.05, 0.95] (capped at 0.85 on extreme)
            correct: whether the label matched ground truth
            breakdown: per-component scores
            penalties: human-readable penalty descriptions
            weights: the weight row that was applied
    """
    history = action_history if action_history is not None else ["final_classify"]

    label = action.get("label", 0)
    confidence = action.get("confidence", 0.5)
    reasoning = action.get("reasoning", "")
    correct = label == true_label

    # Unknown difficulties fall back to the "medium" weight row.
    weights = COMPONENT_WEIGHTS.get(difficulty, COMPONENT_WEIGHTS["medium"])

    breakdown = {
        "correctness": _score_correctness(true_label, label),
        "confidence_calibration": _score_confidence_calibration(
            correct, confidence, difficulty
        ),
        "trajectory_quality": _score_trajectory_quality(history),
        "feature_utilization": _score_feature_utilization(history),
        "reasoning_consistency": _score_reasoning_consistency(label, reasoning),
        "action_ordering": _score_action_ordering(history),
    }

    # Weighted total, clamped to the score band and rounded.
    weighted = sum(breakdown[name] * weights[name] for name in breakdown)
    total = round(max(0.05, min(0.95, weighted)), 4)

    # Human-readable penalty notes for transparency.
    penalties = []
    if not correct:
        penalties.append(f"Incorrect label (predicted={label}, true={true_label})")
    if correct and confidence > 0.9 and difficulty in ("hard", "extreme"):
        penalties.append(f"Overconfident on {difficulty} task (confidence={confidence})")
    if len(history) <= 1:
        penalties.append("Jumped straight to final_classify without exploration")
    if breakdown["reasoning_consistency"] < 0.3:
        penalties.append("Reasoning contradicts chosen label")

    # Extreme difficulty caps the attainable score.
    if difficulty == "extreme":
        total = min(total, 0.85)

    return {
        "score": total,
        "correct": correct,
        "breakdown": breakdown,
        "penalties": penalties,
        "weights": weights,
    }
|
environment/models.py
CHANGED
|
@@ -1,21 +1,70 @@
|
|
| 1 |
from pydantic import BaseModel, Field
|
| 2 |
-
from typing import Optional, List
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 3 |
|
| 4 |
class VoiceObservation(BaseModel):
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 5 |
features: List[float]
|
| 6 |
task_name: str
|
| 7 |
step_number: int
|
| 8 |
difficulty: str
|
| 9 |
sample_id: int
|
| 10 |
-
hint: Optional[str] = None
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 11 |
|
| 12 |
class VoiceAction(BaseModel):
|
| 13 |
-
|
| 14 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 15 |
reasoning: str = Field(default="")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
|
| 17 |
class VoiceReward(BaseModel):
|
|
|
|
| 18 |
score: float
|
| 19 |
correct: bool
|
| 20 |
-
|
| 21 |
-
|
|
|
|
|
|
|
|
|
| 1 |
from pydantic import BaseModel, Field
|
| 2 |
+
from typing import Optional, List, Dict, Any
|
| 3 |
+
from enum import Enum
|
| 4 |
+
|
| 5 |
+
|
| 6 |
+
class ActionType(str, Enum):
|
| 7 |
+
"""Five distinct agent actions for real partial observability."""
|
| 8 |
+
REQUEST_TEMPORAL = "request_temporal_features"
|
| 9 |
+
REQUEST_SPECTRAL = "request_spectral_features"
|
| 10 |
+
REQUEST_COMPARISON = "request_comparison"
|
| 11 |
+
ANALYZE_EVIDENCE = "analyze_evidence"
|
| 12 |
+
FINAL_CLASSIFY = "final_classify"
|
| 13 |
+
|
| 14 |
|
| 15 |
class VoiceObservation(BaseModel):
|
| 16 |
+
"""Observation returned to the agent after each action.
|
| 17 |
+
|
| 18 |
+
features: full 48-dim vector (only populated after sufficient exploration
|
| 19 |
+
or on final step)
|
| 20 |
+
visible_features: dict of feature groups revealed so far
|
| 21 |
+
evidence_summary: structured summary from analyze_evidence action
|
| 22 |
+
comparison_result: similarity scores from request_comparison action
|
| 23 |
+
"""
|
| 24 |
features: List[float]
|
| 25 |
task_name: str
|
| 26 |
step_number: int
|
| 27 |
difficulty: str
|
| 28 |
sample_id: int
|
| 29 |
+
hint: Optional[str] = None
|
| 30 |
+
visible_features: Dict[str, Any] = Field(default_factory=dict)
|
| 31 |
+
evidence_summary: Optional[str] = None
|
| 32 |
+
comparison_result: Optional[Dict[str, Any]] = None
|
| 33 |
+
available_actions: List[str] = Field(default_factory=list)
|
| 34 |
+
actions_taken: List[str] = Field(default_factory=list)
|
| 35 |
+
|
| 36 |
|
| 37 |
class VoiceAction(BaseModel):
|
| 38 |
+
"""Action submitted by the agent.
|
| 39 |
+
|
| 40 |
+
action_type: which of the 5 actions to perform
|
| 41 |
+
label: classification (only used for final_classify)
|
| 42 |
+
confidence: agent confidence (used for final_classify and analyze_evidence)
|
| 43 |
+
reasoning: explanation (used for final_classify)
|
| 44 |
+
focus: optional list of feature names (backward compat)
|
| 45 |
+
"""
|
| 46 |
+
action_type: str = Field(default="final_classify")
|
| 47 |
+
label: int = Field(default=0, ge=0, le=1)
|
| 48 |
+
confidence: float = Field(default=0.5, ge=0.0, le=1.0)
|
| 49 |
reasoning: str = Field(default="")
|
| 50 |
+
focus: List[str] = Field(default_factory=list)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
class GraderBreakdown(BaseModel):
|
| 54 |
+
"""Detailed 6-component grading breakdown."""
|
| 55 |
+
correctness: float = 0.0
|
| 56 |
+
confidence_calibration: float = 0.0
|
| 57 |
+
trajectory_quality: float = 0.0
|
| 58 |
+
feature_utilization: float = 0.0
|
| 59 |
+
reasoning_consistency: float = 0.0
|
| 60 |
+
action_ordering: float = 0.0
|
| 61 |
+
|
| 62 |
|
| 63 |
class VoiceReward(BaseModel):
|
| 64 |
+
"""Reward with full breakdown."""
|
| 65 |
score: float
|
| 66 |
correct: bool
|
| 67 |
+
step_rewards: List[float] = Field(default_factory=list)
|
| 68 |
+
grader_breakdown: Optional[GraderBreakdown] = None
|
| 69 |
+
penalties: List[str] = Field(default_factory=list)
|
| 70 |
+
breakdown: str = ""
|
inference.py
CHANGED
|
@@ -1,3 +1,14 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
from dotenv import load_dotenv
|
| 2 |
load_dotenv()
|
| 3 |
|
|
@@ -6,14 +17,13 @@ import os
|
|
| 6 |
import textwrap
|
| 7 |
import json
|
| 8 |
import requests
|
| 9 |
-
from typing import List
|
| 10 |
from openai import OpenAI
|
| 11 |
|
| 12 |
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
|
| 13 |
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
|
| 14 |
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
|
| 15 |
BENCHMARK = "voice-authenticity"
|
| 16 |
-
MAX_STEPS = 1
|
| 17 |
SUCCESS_SCORE_THRESHOLD = 0.5
|
| 18 |
|
| 19 |
# Environment server URL
|
|
@@ -21,47 +31,64 @@ ENV_SERVER_URL = os.getenv("ENV_SERVER_URL", "http://localhost:7860")
|
|
| 21 |
|
| 22 |
SYSTEM_PROMPT = textwrap.dedent("""
|
| 23 |
You are an expert audio forensics agent detecting synthetic (AI-generated) speech.
|
| 24 |
-
You
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
-
|
| 30 |
-
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
-
|
| 35 |
-
-
|
| 36 |
-
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 44 |
""").strip()
|
| 45 |
|
| 46 |
|
| 47 |
def log_start(task, env, model):
|
| 48 |
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 49 |
|
|
|
|
| 50 |
def log_step(step, action, reward, done, error):
|
| 51 |
error_val = error if error else "null"
|
| 52 |
-
print(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 53 |
|
| 54 |
def log_end(success, steps, score, rewards):
|
| 55 |
rewards_str = ",".join(f"{r:.2f}" for r in rewards)
|
| 56 |
-
print(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 57 |
|
|
|
|
| 58 |
|
| 59 |
def env_reset(task_name: str) -> dict:
|
| 60 |
"""Call /reset on the environment server."""
|
| 61 |
response = requests.post(
|
| 62 |
f"{ENV_SERVER_URL}/reset",
|
| 63 |
json={"task_name": task_name},
|
| 64 |
-
timeout=30
|
| 65 |
)
|
| 66 |
response.raise_for_status()
|
| 67 |
return response.json()
|
|
@@ -70,49 +97,59 @@ def env_reset(task_name: str) -> dict:
|
|
| 70 |
def env_step(action: dict, task_name: str) -> dict:
|
| 71 |
"""Call /step on the environment server."""
|
| 72 |
payload = {
|
| 73 |
-
"
|
| 74 |
-
"
|
|
|
|
| 75 |
"reasoning": action.get("reasoning", ""),
|
| 76 |
-
"task_name": task_name
|
| 77 |
}
|
| 78 |
response = requests.post(
|
| 79 |
f"{ENV_SERVER_URL}/step",
|
| 80 |
json=payload,
|
| 81 |
-
timeout=30
|
| 82 |
)
|
| 83 |
response.raise_for_status()
|
| 84 |
return response.json()
|
| 85 |
|
| 86 |
|
| 87 |
-
|
| 88 |
-
features = observation.get("features", [])
|
| 89 |
-
task_name = observation.get("task_name", "")
|
| 90 |
-
difficulty = observation.get("difficulty", "")
|
| 91 |
-
hint = observation.get("hint", "")
|
| 92 |
|
|
|
|
|
|
|
| 93 |
user_prompt = f"""
|
| 94 |
-
|
| 95 |
-
Task: {task_name} (difficulty: {difficulty})
|
| 96 |
-
{f'Note: {hint}' if hint else ''}
|
| 97 |
|
| 98 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 99 |
"""
|
| 100 |
try:
|
| 101 |
completion = client.chat.completions.create(
|
| 102 |
model=MODEL_NAME,
|
| 103 |
messages=[
|
| 104 |
{"role": "system", "content": SYSTEM_PROMPT},
|
| 105 |
-
{"role": "user", "content": user_prompt.strip()}
|
| 106 |
],
|
| 107 |
temperature=0.3,
|
| 108 |
-
max_tokens=
|
| 109 |
-
stream=False
|
| 110 |
)
|
| 111 |
text = completion.choices[0].message.content.strip()
|
| 112 |
text = text.replace("```json", "").replace("```", "").strip()
|
| 113 |
last_brace = text.rfind("}")
|
| 114 |
if last_brace != -1:
|
| 115 |
-
text = text[:last_brace + 1]
|
| 116 |
result = json.loads(text)
|
| 117 |
result["label"] = int(result.get("label", 0))
|
| 118 |
result["confidence"] = float(result.get("confidence", 0.5))
|
|
@@ -124,70 +161,141 @@ Classify this audio sample. Keep reasoning under 70 words. Respond with JSON onl
|
|
| 124 |
return {"label": 0, "confidence": 0.5, "reasoning": "fallback"}
|
| 125 |
|
| 126 |
|
|
|
|
|
|
|
| 127 |
async def run_task(client: OpenAI, task_name: str):
|
|
|
|
| 128 |
rewards: List[float] = []
|
| 129 |
steps_taken = 0
|
| 130 |
success = False
|
| 131 |
score = 0.0
|
|
|
|
| 132 |
|
| 133 |
log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
|
| 134 |
|
| 135 |
try:
|
| 136 |
-
# Reset
|
| 137 |
reset_response = env_reset(task_name)
|
| 138 |
observation = reset_response.get("observation", {})
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
"
|
| 144 |
-
"
|
| 145 |
-
"
|
| 146 |
-
"reasoning": "Requesting focused analysis"
|
| 147 |
}
|
| 148 |
-
analyze_str = json.dumps(analyze_action)
|
| 149 |
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
|
| 154 |
-
|
| 155 |
rewards.append(reward1)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 156 |
|
| 157 |
-
|
| 158 |
-
|
| 159 |
-
|
| 160 |
-
|
| 161 |
-
|
| 162 |
-
action_dict = get_agent_action(client, observation)
|
| 163 |
-
action_str = json.dumps(action_dict)
|
| 164 |
-
|
| 165 |
-
step2_response = env_step(action_dict, task_name)
|
| 166 |
-
reward2 = float(step2_response.get("reward", 0.0))
|
| 167 |
-
done = step2_response.get("done", True)
|
| 168 |
-
steps_taken = 2
|
| 169 |
-
rewards.append(reward2)
|
| 170 |
|
| 171 |
-
|
| 172 |
-
|
| 173 |
|
| 174 |
-
# Score is
|
| 175 |
-
|
| 176 |
-
score = sum(decision_rewards) / len(decision_rewards) if decision_rewards else 0.0
|
| 177 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 178 |
|
| 179 |
except Exception as e:
|
| 180 |
print(f"[DEBUG] Task error: {e}", flush=True)
|
| 181 |
|
| 182 |
finally:
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 188 |
async def main():
|
| 189 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 190 |
-
tasks = [
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 191 |
for task in tasks:
|
| 192 |
await run_task(client, task)
|
| 193 |
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Baseline inference agent for Voice Authenticity OpenEnv v2.
|
| 3 |
+
|
| 4 |
+
Uses the 5-action protocol:
|
| 5 |
+
1. request_temporal_features → get jitter, shimmer, HNR
|
| 6 |
+
2. request_spectral_features → get MFCC values
|
| 7 |
+
3. request_comparison → get similarity to real/fake centroids
|
| 8 |
+
4. analyze_evidence → synthesize gathered information
|
| 9 |
+
5. final_classify → submit label + confidence + reasoning
|
| 10 |
+
"""
|
| 11 |
+
|
| 12 |
from dotenv import load_dotenv
|
| 13 |
load_dotenv()
|
| 14 |
|
|
|
|
| 17 |
import textwrap
|
| 18 |
import json
|
| 19 |
import requests
|
| 20 |
+
from typing import List
|
| 21 |
from openai import OpenAI
|
| 22 |
|
| 23 |
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
|
| 24 |
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
|
| 25 |
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
|
| 26 |
BENCHMARK = "voice-authenticity"
|
|
|
|
| 27 |
SUCCESS_SCORE_THRESHOLD = 0.5
|
| 28 |
|
| 29 |
# Environment server URL
|
|
|
|
| 31 |
|
| 32 |
SYSTEM_PROMPT = textwrap.dedent("""
|
| 33 |
You are an expert audio forensics agent detecting synthetic (AI-generated) speech.
|
| 34 |
+
You operate in a multi-step environment where you must gather evidence before classifying.
|
| 35 |
+
|
| 36 |
+
REAL speech indicators:
|
| 37 |
+
- jitter > 0.025 (vocal cord irregularity)
|
| 38 |
+
- shimmer > 0.10 (amplitude variation)
|
| 39 |
+
- HNR < 12.0 (more noise in signal)
|
| 40 |
+
- Higher MFCC std deviations (natural variation)
|
| 41 |
+
|
| 42 |
+
SYNTHETIC speech indicators:
|
| 43 |
+
- jitter < 0.020 (too stable)
|
| 44 |
+
- shimmer < 0.09 (too uniform)
|
| 45 |
+
- HNR > 12.0 (too clean)
|
| 46 |
+
- Lower MFCC std deviations (artificial consistency)
|
| 47 |
+
|
| 48 |
+
COMPARISON interpretation:
|
| 49 |
+
- Higher cosine similarity to real centroid → likely real
|
| 50 |
+
- Higher cosine similarity to fake centroid → likely synthetic
|
| 51 |
+
- Closer euclidean distance to real → likely real
|
| 52 |
+
|
| 53 |
+
CONFIDENCE GUIDELINES:
|
| 54 |
+
- Easy tasks: confident predictions okay (0.7-0.9)
|
| 55 |
+
- Medium tasks: moderate confidence (0.6-0.8)
|
| 56 |
+
- Hard/extreme tasks: calibrate carefully, never exceed 0.85
|
| 57 |
+
|
| 58 |
+
Respond ONLY with valid JSON for the requested action type.
|
| 59 |
""").strip()
|
| 60 |
|
| 61 |
|
| 62 |
def log_start(task, env, model):
|
| 63 |
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 64 |
|
| 65 |
+
|
| 66 |
def log_step(step, action, reward, done, error):
|
| 67 |
error_val = error if error else "null"
|
| 68 |
+
print(
|
| 69 |
+
f"[STEP] step={step} action={json.dumps(action)} "
|
| 70 |
+
f"reward={reward:.2f} done={str(done).lower()} error={error_val}",
|
| 71 |
+
flush=True,
|
| 72 |
+
)
|
| 73 |
+
|
| 74 |
|
| 75 |
def log_end(success, steps, score, rewards):
|
| 76 |
rewards_str = ",".join(f"{r:.2f}" for r in rewards)
|
| 77 |
+
print(
|
| 78 |
+
f"[END] success={str(success).lower()} steps={steps} "
|
| 79 |
+
f"score={score:.3f} rewards={rewards_str}",
|
| 80 |
+
flush=True,
|
| 81 |
+
)
|
| 82 |
+
|
| 83 |
|
| 84 |
+
# ── Environment API calls ──────────────────────────────────────────────
|
| 85 |
|
| 86 |
def env_reset(task_name: str) -> dict:
|
| 87 |
"""Call /reset on the environment server."""
|
| 88 |
response = requests.post(
|
| 89 |
f"{ENV_SERVER_URL}/reset",
|
| 90 |
json={"task_name": task_name},
|
| 91 |
+
timeout=30,
|
| 92 |
)
|
| 93 |
response.raise_for_status()
|
| 94 |
return response.json()
|
|
|
|
| 97 |
def env_step(action: dict, task_name: str) -> dict:
|
| 98 |
"""Call /step on the environment server."""
|
| 99 |
payload = {
|
| 100 |
+
"action_type": action.get("action_type", "final_classify"),
|
| 101 |
+
"label": action.get("label", 0),
|
| 102 |
+
"confidence": action.get("confidence", 0.5),
|
| 103 |
"reasoning": action.get("reasoning", ""),
|
| 104 |
+
"task_name": task_name,
|
| 105 |
}
|
| 106 |
response = requests.post(
|
| 107 |
f"{ENV_SERVER_URL}/step",
|
| 108 |
json=payload,
|
| 109 |
+
timeout=30,
|
| 110 |
)
|
| 111 |
response.raise_for_status()
|
| 112 |
return response.json()
|
| 113 |
|
| 114 |
|
| 115 |
+
# ── LLM agent decision making ─────��────────────────────────────────────
|
|
|
|
|
|
|
|
|
|
|
|
|
| 116 |
|
| 117 |
+
def get_classification(client, context: dict) -> dict:
|
| 118 |
+
"""Ask the LLM to make a final classification based on accumulated evidence."""
|
| 119 |
user_prompt = f"""
|
| 120 |
+
Based on the following evidence gathered from an audio sample, classify it as
|
| 121 |
+
real (0) or synthetic (1). Task: {context['task_name']} (difficulty: {context['difficulty']})
|
|
|
|
| 122 |
|
| 123 |
+
EVIDENCE GATHERED:
|
| 124 |
+
{json.dumps(context.get('visible_features', {}), indent=2)}
|
| 125 |
+
|
| 126 |
+
COMPARISON RESULTS:
|
| 127 |
+
{json.dumps(context.get('comparison_result', {}), indent=2)}
|
| 128 |
+
|
| 129 |
+
ANALYSIS SUMMARY:
|
| 130 |
+
{context.get('evidence_summary', 'No analysis performed.')}
|
| 131 |
+
|
| 132 |
+
ACTIONS TAKEN: {', '.join(context.get('actions_taken', []))}
|
| 133 |
+
|
| 134 |
+
Respond with JSON only:
|
| 135 |
+
{{"label": 0 or 1, "confidence": 0.0-1.0, "reasoning": "brief explanation under 70 words"}}
|
| 136 |
"""
|
| 137 |
try:
|
| 138 |
completion = client.chat.completions.create(
|
| 139 |
model=MODEL_NAME,
|
| 140 |
messages=[
|
| 141 |
{"role": "system", "content": SYSTEM_PROMPT},
|
| 142 |
+
{"role": "user", "content": user_prompt.strip()},
|
| 143 |
],
|
| 144 |
temperature=0.3,
|
| 145 |
+
max_tokens=300,
|
| 146 |
+
stream=False,
|
| 147 |
)
|
| 148 |
text = completion.choices[0].message.content.strip()
|
| 149 |
text = text.replace("```json", "").replace("```", "").strip()
|
| 150 |
last_brace = text.rfind("}")
|
| 151 |
if last_brace != -1:
|
| 152 |
+
text = text[: last_brace + 1]
|
| 153 |
result = json.loads(text)
|
| 154 |
result["label"] = int(result.get("label", 0))
|
| 155 |
result["confidence"] = float(result.get("confidence", 0.5))
|
|
|
|
| 161 |
return {"label": 0, "confidence": 0.5, "reasoning": "fallback"}
|
| 162 |
|
| 163 |
|
| 164 |
+
# ── Main task runner ────────────────────────────────────────────────────
|
| 165 |
+
|
| 166 |
async def run_task(client: OpenAI, task_name: str):
|
| 167 |
+
"""Run one episode of a task using the 5-action protocol."""
|
| 168 |
rewards: List[float] = []
|
| 169 |
steps_taken = 0
|
| 170 |
success = False
|
| 171 |
score = 0.0
|
| 172 |
+
context = {}
|
| 173 |
|
| 174 |
log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
|
| 175 |
|
| 176 |
try:
|
| 177 |
+
# ── Reset ───────────────────────────────────────────
|
| 178 |
reset_response = env_reset(task_name)
|
| 179 |
observation = reset_response.get("observation", {})
|
| 180 |
+
context = {
|
| 181 |
+
"task_name": observation.get("task_name", task_name),
|
| 182 |
+
"difficulty": observation.get("difficulty", ""),
|
| 183 |
+
"visible_features": {},
|
| 184 |
+
"comparison_result": None,
|
| 185 |
+
"evidence_summary": None,
|
| 186 |
+
"actions_taken": [],
|
|
|
|
| 187 |
}
|
|
|
|
| 188 |
|
| 189 |
+
# ── Step 1: Request temporal features ───────────────
|
| 190 |
+
action1 = {"action_type": "request_temporal_features"}
|
| 191 |
+
step1 = env_step(action1, task_name)
|
| 192 |
+
observation = step1.get("observation", {})
|
| 193 |
+
reward1 = float(step1.get("reward", 0.0))
|
| 194 |
rewards.append(reward1)
|
| 195 |
+
steps_taken = 1
|
| 196 |
+
context["visible_features"] = observation.get("visible_features", {})
|
| 197 |
+
context["actions_taken"] = observation.get("actions_taken", [])
|
| 198 |
+
|
| 199 |
+
log_step(step=1, action=action1, reward=reward1,
|
| 200 |
+
done=step1.get("done", False), error=None)
|
| 201 |
+
|
| 202 |
+
if step1.get("done", False):
|
| 203 |
+
raise RuntimeError("Episode ended prematurely at step 1")
|
| 204 |
+
|
| 205 |
+
# ── Step 2: Request spectral features ───────────────
|
| 206 |
+
action2 = {"action_type": "request_spectral_features"}
|
| 207 |
+
step2 = env_step(action2, task_name)
|
| 208 |
+
observation = step2.get("observation", {})
|
| 209 |
+
reward2 = float(step2.get("reward", 0.0))
|
| 210 |
+
rewards.append(reward2)
|
| 211 |
+
steps_taken = 2
|
| 212 |
+
context["visible_features"] = observation.get("visible_features", {})
|
| 213 |
+
context["actions_taken"] = observation.get("actions_taken", [])
|
| 214 |
+
|
| 215 |
+
log_step(step=2, action=action2, reward=reward2,
|
| 216 |
+
done=step2.get("done", False), error=None)
|
| 217 |
+
|
| 218 |
+
if step2.get("done", False):
|
| 219 |
+
raise RuntimeError("Episode ended prematurely at step 2")
|
| 220 |
+
|
| 221 |
+
# ── Step 3: Request comparison ──────────────────────
|
| 222 |
+
action3 = {"action_type": "request_comparison"}
|
| 223 |
+
step3 = env_step(action3, task_name)
|
| 224 |
+
observation = step3.get("observation", {})
|
| 225 |
+
reward3 = float(step3.get("reward", 0.0))
|
| 226 |
+
rewards.append(reward3)
|
| 227 |
+
steps_taken = 3
|
| 228 |
+
context["visible_features"] = observation.get("visible_features", {})
|
| 229 |
+
context["comparison_result"] = observation.get("comparison_result", {})
|
| 230 |
+
context["actions_taken"] = observation.get("actions_taken", [])
|
| 231 |
+
|
| 232 |
+
log_step(step=3, action=action3, reward=reward3,
|
| 233 |
+
done=step3.get("done", False), error=None)
|
| 234 |
+
|
| 235 |
+
if step3.get("done", False):
|
| 236 |
+
raise RuntimeError("Episode ended prematurely at step 3")
|
| 237 |
+
|
| 238 |
+
# ── Step 4: Analyze evidence ────────────────────────
|
| 239 |
+
action4 = {"action_type": "analyze_evidence"}
|
| 240 |
+
step4 = env_step(action4, task_name)
|
| 241 |
+
observation = step4.get("observation", {})
|
| 242 |
+
reward4 = float(step4.get("reward", 0.0))
|
| 243 |
+
rewards.append(reward4)
|
| 244 |
+
steps_taken = 4
|
| 245 |
+
context["evidence_summary"] = observation.get("evidence_summary", "")
|
| 246 |
+
context["actions_taken"] = observation.get("actions_taken", [])
|
| 247 |
+
|
| 248 |
+
log_step(step=4, action=action4, reward=reward4,
|
| 249 |
+
done=step4.get("done", False), error=None)
|
| 250 |
+
|
| 251 |
+
if step4.get("done", False):
|
| 252 |
+
raise RuntimeError("Episode ended prematurely at step 4")
|
| 253 |
+
|
| 254 |
+
# ── Step 5: Final classify (LLM decision) ──────────
|
| 255 |
+
classification = get_classification(client, context)
|
| 256 |
+
action5 = {
|
| 257 |
+
"action_type": "final_classify",
|
| 258 |
+
"label": classification["label"],
|
| 259 |
+
"confidence": classification["confidence"],
|
| 260 |
+
"reasoning": classification.get("reasoning", ""),
|
| 261 |
+
}
|
| 262 |
|
| 263 |
+
step5 = env_step(action5, task_name)
|
| 264 |
+
reward5 = float(step5.get("reward", 0.0))
|
| 265 |
+
rewards.append(reward5)
|
| 266 |
+
steps_taken = 5
|
| 267 |
+
done = step5.get("done", True)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 268 |
|
| 269 |
+
log_step(step=5, action=action5, reward=reward5,
|
| 270 |
+
done=done, error=None)
|
| 271 |
|
| 272 |
+
# Score is the final classify reward (main grader score)
|
| 273 |
+
score = reward5
|
|
|
|
| 274 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 275 |
|
| 276 |
except Exception as e:
|
| 277 |
print(f"[DEBUG] Task error: {e}", flush=True)
|
| 278 |
|
| 279 |
finally:
|
| 280 |
+
# The competition judges score based on the final classify reward only
|
| 281 |
+
if rewards:
|
| 282 |
+
score = rewards[-1]
|
| 283 |
+
else:
|
| 284 |
+
score = 0.0
|
| 285 |
+
|
| 286 |
+
log_end(
|
| 287 |
+
success=success, steps=steps_taken,
|
| 288 |
+
score=score, rewards=rewards,
|
| 289 |
+
)
|
| 290 |
async def main():
|
| 291 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 292 |
+
tasks = [
|
| 293 |
+
"clean_detection",
|
| 294 |
+
"compressed_detection",
|
| 295 |
+
"adversarial_detection",
|
| 296 |
+
"streaming_detection",
|
| 297 |
+
"phonecall_detection",
|
| 298 |
+
]
|
| 299 |
for task in tasks:
|
| 300 |
await run_task(client, task)
|
| 301 |
|
openenv.yaml
CHANGED
|
@@ -1,39 +1,71 @@
|
|
| 1 |
name: voice-authenticity
|
| 2 |
-
version: "
|
| 3 |
-
description: "Voice authenticity detection across real-world degradation conditions"
|
| 4 |
-
author: "
|
| 5 |
-
tags:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
tasks:
|
| 7 |
- name: clean_detection
|
| 8 |
difficulty: easy
|
| 9 |
description: "Classify real vs synthetic speech from clean audio features"
|
| 10 |
- name: compressed_detection
|
| 11 |
-
difficulty: medium
|
| 12 |
description: "Classify speech under codec compression degradation"
|
| 13 |
- name: adversarial_detection
|
| 14 |
difficulty: hard
|
| 15 |
-
description: "Classify adversarially crafted synthetic speech"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
observation_space:
|
| 17 |
type: object
|
| 18 |
properties:
|
| 19 |
features:
|
| 20 |
type: array
|
| 21 |
-
description: "48-dim feature vector
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 22 |
task_name:
|
| 23 |
type: string
|
| 24 |
step_number:
|
| 25 |
type: integer
|
| 26 |
difficulty:
|
| 27 |
type: string
|
|
|
|
|
|
|
|
|
|
|
|
|
| 28 |
action_space:
|
| 29 |
type: object
|
| 30 |
properties:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 31 |
label:
|
| 32 |
type: integer
|
| 33 |
description: "0=real, 1=synthetic"
|
| 34 |
confidence:
|
| 35 |
type: number
|
| 36 |
-
description: "confidence
|
| 37 |
reasoning:
|
| 38 |
type: string
|
| 39 |
-
description: "
|
|
|
|
| 1 |
name: voice-authenticity
|
| 2 |
+
version: "2.0.0"
|
| 3 |
+
description: "Voice authenticity detection across real-world degradation conditions with multi-step agentic interaction"
|
| 4 |
+
author: "AksharaSharma"
|
| 5 |
+
tags:
|
| 6 |
+
- openenv
|
| 7 |
+
- speech
|
| 8 |
+
- fraud-detection
|
| 9 |
+
- audio
|
| 10 |
+
- partial-observability
|
| 11 |
tasks:
|
| 12 |
- name: clean_detection
|
| 13 |
difficulty: easy
|
| 14 |
description: "Classify real vs synthetic speech from clean audio features"
|
| 15 |
- name: compressed_detection
|
| 16 |
+
difficulty: medium
|
| 17 |
description: "Classify speech under codec compression degradation"
|
| 18 |
- name: adversarial_detection
|
| 19 |
difficulty: hard
|
| 20 |
+
description: "Classify adversarially crafted synthetic speech with overlapping distributions"
|
| 21 |
+
- name: streaming_detection
|
| 22 |
+
difficulty: medium_hard
|
| 23 |
+
description: "Streaming detection with step-dependent noise soft-gating"
|
| 24 |
+
- name: phonecall_detection
|
| 25 |
+
difficulty: extreme
|
| 26 |
+
description: "Phone call simulation with heavy codec compression and narrowband degradation"
|
| 27 |
observation_space:
|
| 28 |
type: object
|
| 29 |
properties:
|
| 30 |
features:
|
| 31 |
type: array
|
| 32 |
+
description: "48-dim feature vector (zeroed until revealed via actions)"
|
| 33 |
+
visible_features:
|
| 34 |
+
type: object
|
| 35 |
+
description: "Feature groups revealed so far"
|
| 36 |
+
evidence_summary:
|
| 37 |
+
type: string
|
| 38 |
+
description: "Structured summary from analyze_evidence action"
|
| 39 |
+
comparison_result:
|
| 40 |
+
type: object
|
| 41 |
+
description: "Similarity scores to real/fake reference centroids"
|
| 42 |
task_name:
|
| 43 |
type: string
|
| 44 |
step_number:
|
| 45 |
type: integer
|
| 46 |
difficulty:
|
| 47 |
type: string
|
| 48 |
+
available_actions:
|
| 49 |
+
type: array
|
| 50 |
+
actions_taken:
|
| 51 |
+
type: array
|
| 52 |
action_space:
|
| 53 |
type: object
|
| 54 |
properties:
|
| 55 |
+
action_type:
|
| 56 |
+
type: string
|
| 57 |
+
enum:
|
| 58 |
+
- request_temporal_features
|
| 59 |
+
- request_spectral_features
|
| 60 |
+
- request_comparison
|
| 61 |
+
- analyze_evidence
|
| 62 |
+
- final_classify
|
| 63 |
label:
|
| 64 |
type: integer
|
| 65 |
description: "0=real, 1=synthetic"
|
| 66 |
confidence:
|
| 67 |
type: number
|
| 68 |
+
description: "Agent confidence [0.0, 1.0]"
|
| 69 |
reasoning:
|
| 70 |
type: string
|
| 71 |
+
description: "Explanation of decision"
|
pyproject.toml
CHANGED
|
@@ -4,8 +4,8 @@ build-backend = "setuptools.backends.legacy:build"
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "voice-authenticity-openenv"
|
| 7 |
-
version = "
|
| 8 |
-
description = "Voice authenticity detection OpenEnv environment"
|
| 9 |
requires-python = ">=3.10"
|
| 10 |
dependencies = [
|
| 11 |
"librosa",
|
|
@@ -26,6 +26,6 @@ server = "server.app:main"
|
|
| 26 |
|
| 27 |
[tool.openenv]
|
| 28 |
name = "voice-authenticity"
|
| 29 |
-
version = "
|
| 30 |
-
tasks = ["clean_detection", "compressed_detection", "adversarial_detection"]
|
| 31 |
entry_point = "app:app"
|
|
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "voice-authenticity-openenv"
|
| 7 |
+
version = "2.0.0"
|
| 8 |
+
description = "Voice authenticity detection OpenEnv environment with multi-step agentic interaction"
|
| 9 |
requires-python = ">=3.10"
|
| 10 |
dependencies = [
|
| 11 |
"librosa",
|
|
|
|
| 26 |
|
| 27 |
[tool.openenv]
|
| 28 |
name = "voice-authenticity"
|
| 29 |
+
version = "2.0.0"
|
| 30 |
+
tasks = ["clean_detection", "compressed_detection", "adversarial_detection", "streaming_detection", "phonecall_detection"]
|
| 31 |
entry_point = "app:app"
|
scripts/extract_features.py
CHANGED
|
@@ -112,8 +112,9 @@ def process_directory(directory, label, desc):
|
|
| 112 |
|
| 113 |
|
| 114 |
def add_compression_artifacts(features, strength=0.3):
|
|
|
|
| 115 |
degraded = features.copy()
|
| 116 |
-
|
| 117 |
degraded[20:40] *= (1 - strength * np.random.uniform(0.5, 1.0, 20))
|
| 118 |
degraded[42] *= (1 - strength * np.random.uniform(0.3, 0.7))
|
| 119 |
degraded[43] *= (1 - strength * np.random.uniform(0.3, 0.7))
|
|
@@ -121,7 +122,7 @@ def add_compression_artifacts(features, strength=0.3):
|
|
| 121 |
degraded[45] *= (1 + strength * np.random.uniform(0.3, 0.8))
|
| 122 |
degraded[46] *= (1 - strength * np.random.uniform(0.2, 0.6))
|
| 123 |
degraded[47] += strength * np.random.uniform(0.1, 0.4)
|
| 124 |
-
|
| 125 |
return degraded
|
| 126 |
|
| 127 |
|
|
@@ -160,10 +161,95 @@ def add_adversarial_perturbation(features, label):
|
|
| 160 |
return perturbed
|
| 161 |
|
| 162 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 163 |
def main():
|
| 164 |
-
print("=" *
|
| 165 |
-
print("Feature Extraction Pipeline")
|
| 166 |
-
print("=" *
|
| 167 |
|
| 168 |
real_feat, real_labels = process_directory(
|
| 169 |
REAL_DIR, label=0, desc="REAL audio"
|
|
@@ -196,9 +282,11 @@ def main():
|
|
| 196 |
print(f"\nTask 1 (clean): {len(all_labels)} samples saved")
|
| 197 |
|
| 198 |
# ── TASK 2: Compressed features ─────────────────────────
|
|
|
|
|
|
|
| 199 |
compressed_features = np.array([
|
| 200 |
add_compression_artifacts(f, strength=0.3)
|
| 201 |
-
for f in
|
| 202 |
], dtype=np.float32)
|
| 203 |
|
| 204 |
compressed_features = compressed_features[idx]
|
|
@@ -210,7 +298,6 @@ def main():
|
|
| 210 |
print(f"Task 2 (compressed): {len(all_labels)} samples saved")
|
| 211 |
|
| 212 |
# ── TASK 3: Adversarial features ────────────────────────
|
| 213 |
-
raw_combined = real_feat + fake_feat
|
| 214 |
raw_labels_combined = real_labels + fake_labels
|
| 215 |
|
| 216 |
adversarial_features = np.array([
|
|
@@ -226,21 +313,52 @@ def main():
|
|
| 226 |
|
| 227 |
print(f"Task 3 (adversarial): {len(all_labels)} samples saved")
|
| 228 |
|
| 229 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 230 |
print("DONE")
|
| 231 |
print(f"Total samples : {len(all_labels)}")
|
| 232 |
print(f"Real samples : {all_labels.tolist().count(0)}")
|
| 233 |
print(f"Fake samples : {all_labels.tolist().count(1)}")
|
| 234 |
print(f"Feature shape : {all_features_norm.shape}")
|
| 235 |
-
print(f"{'='*
|
| 236 |
|
| 237 |
print("\nSanity check — jitter/shimmer/HNR comparison:")
|
| 238 |
for i in range(min(2, len(all_labels))):
|
|
|
|
| 239 |
label_str = "REAL" if all_labels[i] == 0 else "FAKE"
|
| 240 |
print(f"\n [{label_str}]")
|
| 241 |
-
print(f" Clean
|
| 242 |
-
print(f" Compressed
|
| 243 |
-
print(f" Adversarial→ jitter={adversarial_features[i][42]:.4f} shimmer={adversarial_features[i][43]:.4f} hnr={adversarial_features[i][44]:.4f}")
|
|
|
|
|
|
|
| 244 |
|
| 245 |
|
| 246 |
if __name__ == "__main__":
|
|
|
|
| 112 |
|
| 113 |
|
| 114 |
def add_compression_artifacts(features, strength=0.3):
|
| 115 |
+
"""Simulate codec compression degradation."""
|
| 116 |
degraded = features.copy()
|
| 117 |
+
|
| 118 |
degraded[20:40] *= (1 - strength * np.random.uniform(0.5, 1.0, 20))
|
| 119 |
degraded[42] *= (1 - strength * np.random.uniform(0.3, 0.7))
|
| 120 |
degraded[43] *= (1 - strength * np.random.uniform(0.3, 0.7))
|
|
|
|
| 122 |
degraded[45] *= (1 + strength * np.random.uniform(0.3, 0.8))
|
| 123 |
degraded[46] *= (1 - strength * np.random.uniform(0.2, 0.6))
|
| 124 |
degraded[47] += strength * np.random.uniform(0.1, 0.4)
|
| 125 |
+
|
| 126 |
return degraded
|
| 127 |
|
| 128 |
|
|
|
|
| 161 |
return perturbed
|
| 162 |
|
| 163 |
|
| 164 |
+
def add_streaming_degradation(features, label):
|
| 165 |
+
"""Simulate streaming/partial decode degradation.
|
| 166 |
+
|
| 167 |
+
Models real-time audio streaming where features are partially decoded:
|
| 168 |
+
- MFCC values slightly degraded (simulating partial frame decode)
|
| 169 |
+
- Temporal features intact but with mild additive noise
|
| 170 |
+
- High-frequency spectral features mildly rolled off
|
| 171 |
+
- Overall mild Gaussian noise across all dims
|
| 172 |
+
|
| 173 |
+
This is the base perturbation for Task 4 (streaming_detection).
|
| 174 |
+
The environment also applies step-dependent soft-gated noise at runtime.
|
| 175 |
+
"""
|
| 176 |
+
degraded = features.copy()
|
| 177 |
+
|
| 178 |
+
# Partial MFCC decode: higher-order coefficients more degraded
|
| 179 |
+
for i in range(20):
|
| 180 |
+
degradation = 0.02 * (i / 20) # more degradation on higher coeffs
|
| 181 |
+
degraded[i] += np.random.normal(0, degradation + 0.01)
|
| 182 |
+
for i in range(20, 40):
|
| 183 |
+
degradation = 0.03 * ((i - 20) / 20)
|
| 184 |
+
degraded[i] *= (1 - degradation * np.random.uniform(0.3, 0.8))
|
| 185 |
+
|
| 186 |
+
# Mild noise on temporal features
|
| 187 |
+
degraded[42] += np.random.normal(0, 0.003) # jitter noise
|
| 188 |
+
degraded[43] += np.random.normal(0, 0.008) # shimmer noise
|
| 189 |
+
degraded[44] += np.random.normal(0, 0.5) # HNR noise
|
| 190 |
+
|
| 191 |
+
# Mild spectral rolloff
|
| 192 |
+
degraded[41] *= np.random.uniform(0.92, 0.98) # spectral centroid
|
| 193 |
+
degraded[45] *= np.random.uniform(0.90, 0.97) # spectral bandwidth
|
| 194 |
+
degraded[46] *= np.random.uniform(0.88, 0.96) # spectral rolloff
|
| 195 |
+
|
| 196 |
+
# Global mild noise
|
| 197 |
+
degraded += np.random.normal(0, 0.015, len(degraded))
|
| 198 |
+
|
| 199 |
+
return degraded
|
| 200 |
+
|
| 201 |
+
|
| 202 |
+
def add_phonecall_degradation(features, label):
|
| 203 |
+
"""Simulate phone call conditions: heavy codec + background noise.
|
| 204 |
+
|
| 205 |
+
Models the worst-case real-world scenario:
|
| 206 |
+
- Aggressive codec compression (narrowband telephony)
|
| 207 |
+
- Background noise injection across all bands
|
| 208 |
+
- Severe HNR degradation (noisy channel)
|
| 209 |
+
- MFCC high-frequency rolloff (narrowband 300-3400Hz telephony)
|
| 210 |
+
- RMS energy fluctuation (network jitter/packet loss)
|
| 211 |
+
- Jitter/shimmer partially masked by channel noise
|
| 212 |
+
|
| 213 |
+
This is the hardest task — designed to be near the limit of detectability.
|
| 214 |
+
"""
|
| 215 |
+
degraded = features.copy()
|
| 216 |
+
|
| 217 |
+
# ── Heavy codec compression (narrowband telephony simulation) ───
|
| 218 |
+
# MFCC means: zero out high-order coefficients (narrowband kills them)
|
| 219 |
+
for i in range(12, 20):
|
| 220 |
+
degraded[i] *= np.random.uniform(0.1, 0.4) # severe suppression
|
| 221 |
+
# MFCC stds: flatten temporal variation (codec smoothing)
|
| 222 |
+
degraded[20:40] *= np.random.uniform(0.3, 0.6, 20)
|
| 223 |
+
|
| 224 |
+
# ── Background noise injection ──────────────────────────────────
|
| 225 |
+
noise_strength = np.random.uniform(0.15, 0.35)
|
| 226 |
+
degraded += np.random.normal(0, noise_strength, len(degraded))
|
| 227 |
+
|
| 228 |
+
# ── Severe HNR degradation (noisy channel) ─────────────────────
|
| 229 |
+
degraded[44] -= np.random.uniform(3.0, 7.0) # massive HNR drop
|
| 230 |
+
|
| 231 |
+
# ── Jitter/shimmer partially masked by channel noise ───────────
|
| 232 |
+
degraded[42] += np.random.normal(0, 0.015) # large jitter noise
|
| 233 |
+
degraded[43] += np.random.normal(0, 0.03) # large shimmer noise
|
| 234 |
+
|
| 235 |
+
# ── Spectral degradation (narrowband rolloff) ──────────────────
|
| 236 |
+
degraded[41] *= np.random.uniform(0.5, 0.75) # centroid drops
|
| 237 |
+
degraded[45] *= np.random.uniform(0.4, 0.65) # bandwidth severely narrows
|
| 238 |
+
degraded[46] *= np.random.uniform(0.3, 0.55) # rolloff drastically drops
|
| 239 |
+
|
| 240 |
+
# ── RMS energy fluctuation (packet loss / AGC) ──────────────────
|
| 241 |
+
degraded[47] *= np.random.uniform(0.5, 1.5)
|
| 242 |
+
|
| 243 |
+
# ── ZCR noise (transmission artifacts) ──────────────────────────
|
| 244 |
+
degraded[40] += np.random.normal(0, 0.02)
|
| 245 |
+
|
| 246 |
+
return degraded
|
| 247 |
+
|
| 248 |
+
|
| 249 |
def main():
|
| 250 |
+
print("=" * 60)
|
| 251 |
+
print("Feature Extraction Pipeline (5 Tasks)")
|
| 252 |
+
print("=" * 60)
|
| 253 |
|
| 254 |
real_feat, real_labels = process_directory(
|
| 255 |
REAL_DIR, label=0, desc="REAL audio"
|
|
|
|
| 282 |
print(f"\nTask 1 (clean): {len(all_labels)} samples saved")
|
| 283 |
|
| 284 |
# ── TASK 2: Compressed features ─────────────────────────
|
| 285 |
+
raw_combined = real_feat + fake_feat
|
| 286 |
+
|
| 287 |
compressed_features = np.array([
|
| 288 |
add_compression_artifacts(f, strength=0.3)
|
| 289 |
+
for f in raw_combined
|
| 290 |
], dtype=np.float32)
|
| 291 |
|
| 292 |
compressed_features = compressed_features[idx]
|
|
|
|
| 298 |
print(f"Task 2 (compressed): {len(all_labels)} samples saved")
|
| 299 |
|
| 300 |
# ── TASK 3: Adversarial features ────────────────────────
|
|
|
|
| 301 |
raw_labels_combined = real_labels + fake_labels
|
| 302 |
|
| 303 |
adversarial_features = np.array([
|
|
|
|
| 313 |
|
| 314 |
print(f"Task 3 (adversarial): {len(all_labels)} samples saved")
|
| 315 |
|
| 316 |
+
# ── TASK 4: Streaming features ──────────────────────────
|
| 317 |
+
streaming_features = np.array([
|
| 318 |
+
add_streaming_degradation(f, l)
|
| 319 |
+
for f, l in zip(raw_combined, raw_labels_combined)
|
| 320 |
+
], dtype=np.float32)
|
| 321 |
+
|
| 322 |
+
streaming_features = streaming_features[idx]
|
| 323 |
+
streaming_norm = (streaming_features - mean) / std
|
| 324 |
+
|
| 325 |
+
np.save(f"{OUTPUT_DIR}/features_streaming.npy", streaming_norm)
|
| 326 |
+
np.save(f"{OUTPUT_DIR}/labels_streaming.npy", all_labels)
|
| 327 |
+
|
| 328 |
+
print(f"Task 4 (streaming): {len(all_labels)} samples saved")
|
| 329 |
+
|
| 330 |
+
# ── TASK 5: Phone call features ─────────────────────────
|
| 331 |
+
phonecall_features = np.array([
|
| 332 |
+
add_phonecall_degradation(f, l)
|
| 333 |
+
for f, l in zip(raw_combined, raw_labels_combined)
|
| 334 |
+
], dtype=np.float32)
|
| 335 |
+
|
| 336 |
+
phonecall_features = phonecall_features[idx]
|
| 337 |
+
phonecall_norm = (phonecall_features - mean) / std
|
| 338 |
+
|
| 339 |
+
np.save(f"{OUTPUT_DIR}/features_phonecall.npy", phonecall_norm)
|
| 340 |
+
np.save(f"{OUTPUT_DIR}/labels_phonecall.npy", all_labels)
|
| 341 |
+
|
| 342 |
+
print(f"Task 5 (phonecall): {len(all_labels)} samples saved")
|
| 343 |
+
|
| 344 |
+
print(f"\n{'='*60}")
|
| 345 |
print("DONE")
|
| 346 |
print(f"Total samples : {len(all_labels)}")
|
| 347 |
print(f"Real samples : {all_labels.tolist().count(0)}")
|
| 348 |
print(f"Fake samples : {all_labels.tolist().count(1)}")
|
| 349 |
print(f"Feature shape : {all_features_norm.shape}")
|
| 350 |
+
print(f"{'='*60}")
|
| 351 |
|
| 352 |
print("\nSanity check — jitter/shimmer/HNR comparison:")
|
| 353 |
for i in range(min(2, len(all_labels))):
|
| 354 |
+
raw_i = np.array(raw_combined)[idx][i]
|
| 355 |
label_str = "REAL" if all_labels[i] == 0 else "FAKE"
|
| 356 |
print(f"\n [{label_str}]")
|
| 357 |
+
print(f" Clean → jitter={raw_i[42]:.4f} shimmer={raw_i[43]:.4f} hnr={raw_i[44]:.4f}")
|
| 358 |
+
print(f" Compressed → jitter={compressed_features[i][42]:.4f} shimmer={compressed_features[i][43]:.4f} hnr={compressed_features[i][44]:.4f}")
|
| 359 |
+
print(f" Adversarial → jitter={adversarial_features[i][42]:.4f} shimmer={adversarial_features[i][43]:.4f} hnr={adversarial_features[i][44]:.4f}")
|
| 360 |
+
print(f" Streaming → jitter={streaming_features[i][42]:.4f} shimmer={streaming_features[i][43]:.4f} hnr={streaming_features[i][44]:.4f}")
|
| 361 |
+
print(f" PhoneCall → jitter={phonecall_features[i][42]:.4f} shimmer={phonecall_features[i][43]:.4f} hnr={phonecall_features[i][44]:.4f}")
|
| 362 |
|
| 363 |
|
| 364 |
if __name__ == "__main__":
|
server/app.py
CHANGED
|
@@ -1,31 +1,218 @@
|
|
|
|
|
|
|
|
|
|
|
| 1 |
from fastapi import FastAPI
|
| 2 |
-
from fastapi.responses import JSONResponse
|
| 3 |
from pydantic import BaseModel
|
| 4 |
-
from typing import Optional
|
| 5 |
import uvicorn
|
| 6 |
import os
|
| 7 |
-
import sys
|
| 8 |
-
|
| 9 |
-
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
| 10 |
|
| 11 |
from environment.env import VoiceAuthenticityEnv
|
| 12 |
|
| 13 |
-
app = FastAPI(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
|
| 15 |
-
|
| 16 |
-
"clean_detection"
|
| 17 |
-
"compressed_detection"
|
| 18 |
-
"adversarial_detection"
|
| 19 |
-
|
|
|
|
|
|
|
| 20 |
|
|
|
|
| 21 |
current_task = "clean_detection"
|
| 22 |
|
|
|
|
| 23 |
class ActionRequest(BaseModel):
|
| 24 |
-
|
| 25 |
-
|
| 26 |
-
|
|
|
|
|
|
|
| 27 |
task_name: Optional[str] = None
|
| 28 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
@app.post("/reset")
|
| 30 |
def reset(request: dict = {}):
|
| 31 |
global current_task
|
|
@@ -41,6 +228,7 @@ def reset(request: dict = {}):
|
|
| 41 |
"info": {}
|
| 42 |
})
|
| 43 |
|
|
|
|
| 44 |
@app.post("/step")
|
| 45 |
def step(action: ActionRequest):
|
| 46 |
global current_task
|
|
@@ -48,9 +236,11 @@ def step(action: ActionRequest):
|
|
| 48 |
if task not in envs:
|
| 49 |
task = current_task
|
| 50 |
action_dict = {
|
|
|
|
| 51 |
"label": action.label,
|
| 52 |
"confidence": action.confidence,
|
| 53 |
-
"reasoning": action.reasoning
|
|
|
|
| 54 |
}
|
| 55 |
obs, reward, done, info = envs[task].step(action_dict)
|
| 56 |
return JSONResponse({
|
|
@@ -60,17 +250,28 @@ def step(action: ActionRequest):
|
|
| 60 |
"info": info
|
| 61 |
})
|
| 62 |
|
|
|
|
| 63 |
@app.get("/state")
|
| 64 |
def state():
|
| 65 |
return JSONResponse(envs[current_task].state())
|
| 66 |
|
|
|
|
| 67 |
@app.get("/health")
|
| 68 |
def health():
|
| 69 |
-
return {"status": "
|
|
|
|
| 70 |
|
| 71 |
@app.get("/")
|
| 72 |
def root():
|
| 73 |
-
return {
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
|
| 75 |
def main():
|
| 76 |
uvicorn.run(app, host="0.0.0.0", port=7860)
|
|
|
|
| 1 |
+
from dotenv import load_dotenv
|
| 2 |
+
load_dotenv()
|
| 3 |
+
|
| 4 |
from fastapi import FastAPI
|
| 5 |
+
from fastapi.responses import JSONResponse, HTMLResponse
|
| 6 |
from pydantic import BaseModel
|
| 7 |
+
from typing import Optional, List
|
| 8 |
import uvicorn
|
| 9 |
import os
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
from environment.env import VoiceAuthenticityEnv
|
| 12 |
|
| 13 |
+
app = FastAPI(
|
| 14 |
+
title="Voice Authenticity OpenEnv",
|
| 15 |
+
description="Multi-step agentic environment for detecting synthetic speech",
|
| 16 |
+
version="2.0.0"
|
| 17 |
+
)
|
| 18 |
|
| 19 |
+
TASKS = [
|
| 20 |
+
"clean_detection",
|
| 21 |
+
"compressed_detection",
|
| 22 |
+
"adversarial_detection",
|
| 23 |
+
"streaming_detection",
|
| 24 |
+
"phonecall_detection",
|
| 25 |
+
]
|
| 26 |
|
| 27 |
+
envs = {task: VoiceAuthenticityEnv(task) for task in TASKS}
|
| 28 |
current_task = "clean_detection"
|
| 29 |
|
| 30 |
+
|
| 31 |
class ActionRequest(BaseModel):
|
| 32 |
+
action_type: str = "final_classify"
|
| 33 |
+
label: int = 0
|
| 34 |
+
confidence: float = 0.5
|
| 35 |
+
reasoning: str = ""
|
| 36 |
+
focus: List[str] = []
|
| 37 |
task_name: Optional[str] = None
|
| 38 |
|
| 39 |
+
|
| 40 |
+
@app.get("/web", response_class=HTMLResponse)
|
| 41 |
+
def web_interface():
|
| 42 |
+
return """
|
| 43 |
+
<!DOCTYPE html>
|
| 44 |
+
<html>
|
| 45 |
+
<head>
|
| 46 |
+
<title>Voice Authenticity OpenEnv</title>
|
| 47 |
+
<style>
|
| 48 |
+
* { box-sizing: border-box; margin: 0; padding: 0; }
|
| 49 |
+
body { font-family: -apple-system, sans-serif; max-width: 860px; margin: 50px auto; padding: 20px; background: #050508; color: #fff; }
|
| 50 |
+
h1 { color: #00c9a7; font-size: 28px; margin-bottom: 8px; }
|
| 51 |
+
h2 { font-size: 16px; font-weight: 500; margin-bottom: 12px; color: #00c9a7; }
|
| 52 |
+
p { color: #666; font-size: 14px; line-height: 1.6; margin-bottom: 8px; }
|
| 53 |
+
.card { background: #080810; border: 1px solid #0f0f1a; border-radius: 14px; padding: 20px; margin: 16px 0; }
|
| 54 |
+
.tag { background: #0d2d1e; color: #00c9a7; padding: 4px 12px; border-radius: 20px; font-size: 11px; margin: 3px; display: inline-block; border: 1px solid #0f2d26; }
|
| 55 |
+
a { color: #00c9a7; text-decoration: none; }
|
| 56 |
+
a:hover { text-decoration: underline; }
|
| 57 |
+
.task { border-left: 2px solid #00c9a7; padding: 8px 12px; margin: 8px 0; background: #050508; border-radius: 0 8px 8px 0; }
|
| 58 |
+
.task strong { font-size: 13px; color: #fff; }
|
| 59 |
+
.task span { font-size: 12px; color: #555; display: block; margin-top: 2px; }
|
| 60 |
+
.difficulty { display: inline-block; padding: 2px 8px; border-radius: 10px; font-size: 10px; margin-left: 8px; }
|
| 61 |
+
.easy { background: #0d2d1e; color: #00c9a7; }
|
| 62 |
+
.medium { background: #1a1a00; color: #f0a500; }
|
| 63 |
+
.hard { background: #1a0000; color: #ff6b6b; }
|
| 64 |
+
.extreme { background: #1a0010; color: #ff00aa; }
|
| 65 |
+
.medium_hard { background: #0d1a2d; color: #00aaff; }
|
| 66 |
+
.endpoint { display: flex; gap: 12px; align-items: center; padding: 8px 0; border-bottom: 1px solid #0f0f1a; }
|
| 67 |
+
.endpoint:last-child { border-bottom: none; }
|
| 68 |
+
.method { font-size: 11px; font-weight: 600; padding: 3px 8px; border-radius: 6px; min-width: 45px; text-align: center; }
|
| 69 |
+
.get { background: #0d2d1e; color: #00c9a7; }
|
| 70 |
+
.post { background: #1a1a00; color: #f0a500; }
|
| 71 |
+
.endpoint-path { font-size: 13px; color: #fff; font-family: monospace; }
|
| 72 |
+
.endpoint-desc { font-size: 12px; color: #444; }
|
| 73 |
+
.action-grid { display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-top: 8px; }
|
| 74 |
+
.action-card { background: #050508; border: 1px solid #0f0f1a; border-radius: 10px; padding: 12px; }
|
| 75 |
+
.action-name { font-size: 12px; font-family: monospace; color: #00c9a7; margin-bottom: 4px; }
|
| 76 |
+
.action-desc { font-size: 11px; color: #444; line-height: 1.5; }
|
| 77 |
+
.stat { text-align: center; padding: 16px; }
|
| 78 |
+
.stat-num { font-size: 28px; font-weight: 600; color: #fff; }
|
| 79 |
+
.stat-num span { color: #00c9a7; }
|
| 80 |
+
.stat-label { font-size: 11px; color: #444; margin-top: 4px; }
|
| 81 |
+
.stats-grid { display: grid; grid-template-columns: repeat(3, 1fr); gap: 1px; background: #0f0f1a; border-radius: 12px; overflow: hidden; }
|
| 82 |
+
.stat { background: #080810; }
|
| 83 |
+
.badge { display: inline-flex; align-items: center; gap: 6px; padding: 5px 14px; border: 1px solid #0f2d26; background: #050f0d; border-radius: 20px; font-size: 11px; color: #00c9a7; }
|
| 84 |
+
.dot { width: 6px; height: 6px; background: #00c9a7; border-radius: 50%; animation: pulse 2s infinite; }
|
| 85 |
+
@keyframes pulse { 0%,100%{opacity:1} 50%{opacity:.3} }
|
| 86 |
+
footer { text-align: center; padding: 2rem 0; color: #333; font-size: 12px; }
|
| 87 |
+
footer span { color: #00c9a7; }
|
| 88 |
+
</style>
|
| 89 |
+
</head>
|
| 90 |
+
<body>
|
| 91 |
+
<div style="margin-bottom:1.5rem">
|
| 92 |
+
<div class="badge"><div class="dot"></div>Live — 5 tasks available</div>
|
| 93 |
+
</div>
|
| 94 |
+
|
| 95 |
+
<h1>🎙️ Voice Authenticity OpenEnv</h1>
|
| 96 |
+
<p style="margin-bottom:1.5rem;font-size:16px;color:#888">
|
| 97 |
+
Multi-step agentic environment for detecting synthetic (AI-generated) speech
|
| 98 |
+
across real-world degradation and adversarial conditions.
|
| 99 |
+
</p>
|
| 100 |
+
|
| 101 |
+
<div class="stats-grid">
|
| 102 |
+
<div class="stat">
|
| 103 |
+
<div class="stat-num">5<span>+</span></div>
|
| 104 |
+
<div class="stat-label">Tasks</div>
|
| 105 |
+
</div>
|
| 106 |
+
<div class="stat">
|
| 107 |
+
<div class="stat-num">5</div>
|
| 108 |
+
<div class="stat-label">Steps per episode</div>
|
| 109 |
+
</div>
|
| 110 |
+
<div class="stat">
|
| 111 |
+
<div class="stat-num">48</div>
|
| 112 |
+
<div class="stat-label">Feature dimensions</div>
|
| 113 |
+
</div>
|
| 114 |
+
</div>
|
| 115 |
+
|
| 116 |
+
<div class="card">
|
| 117 |
+
<h2>Tasks</h2>
|
| 118 |
+
<div class="task">
|
| 119 |
+
<strong>clean_detection <span class="difficulty easy">easy</span></strong>
|
| 120 |
+
<span>Classify real vs synthetic speech from clean, unmodified audio features</span>
|
| 121 |
+
</div>
|
| 122 |
+
<div class="task">
|
| 123 |
+
<strong>compressed_detection <span class="difficulty medium">medium</span></strong>
|
| 124 |
+
<span>Classify speech under codec compression degradation</span>
|
| 125 |
+
</div>
|
| 126 |
+
<div class="task">
|
| 127 |
+
<strong>adversarial_detection <span class="difficulty hard">hard</span></strong>
|
| 128 |
+
<span>Adversarially crafted synthetic speech with overlapping feature distributions</span>
|
| 129 |
+
</div>
|
| 130 |
+
<div class="task">
|
| 131 |
+
<strong>streaming_detection <span class="difficulty medium_hard">medium-hard</span></strong>
|
| 132 |
+
<span>Step-dependent noise soft-gating — earlier steps noisier, later steps cleaner</span>
|
| 133 |
+
</div>
|
| 134 |
+
<div class="task">
|
| 135 |
+
<strong>phonecall_detection <span class="difficulty extreme">extreme</span></strong>
|
| 136 |
+
<span>Heavy codec compression and narrowband degradation simulating phone calls</span>
|
| 137 |
+
</div>
|
| 138 |
+
</div>
|
| 139 |
+
|
| 140 |
+
<div class="card">
|
| 141 |
+
<h2>5-Step Agent Protocol</h2>
|
| 142 |
+
<div class="action-grid">
|
| 143 |
+
<div class="action-card">
|
| 144 |
+
<div class="action-name">1. request_temporal_features</div>
|
| 145 |
+
<div class="action-desc">Reveals jitter, shimmer, and HNR — the core discriminating signals</div>
|
| 146 |
+
</div>
|
| 147 |
+
<div class="action-card">
|
| 148 |
+
<div class="action-name">2. request_spectral_features</div>
|
| 149 |
+
<div class="action-desc">Reveals 20 MFCC means, 20 MFCC stds, ZCR, spectral centroid</div>
|
| 150 |
+
</div>
|
| 151 |
+
<div class="action-card">
|
| 152 |
+
<div class="action-name">3. request_comparison</div>
|
| 153 |
+
<div class="action-desc">Compares sample to real/fake reference centroids via cosine similarity</div>
|
| 154 |
+
</div>
|
| 155 |
+
<div class="action-card">
|
| 156 |
+
<div class="action-name">4. analyze_evidence</div>
|
| 157 |
+
<div class="action-desc">Synthesizes all gathered signals into a structured evidence summary</div>
|
| 158 |
+
</div>
|
| 159 |
+
<div class="action-card" style="grid-column: span 2;">
|
| 160 |
+
<div class="action-name">5. final_classify</div>
|
| 161 |
+
<div class="action-desc">Submits final verdict: label (0=real, 1=synthetic) + confidence + reasoning. Terminates episode.</div>
|
| 162 |
+
</div>
|
| 163 |
+
</div>
|
| 164 |
+
</div>
|
| 165 |
+
|
| 166 |
+
<div class="card">
|
| 167 |
+
<h2>API Endpoints</h2>
|
| 168 |
+
<div class="endpoint">
|
| 169 |
+
<span class="method post">POST</span>
|
| 170 |
+
<span class="endpoint-path">/reset</span>
|
| 171 |
+
<span class="endpoint-desc">Reset episode, optionally set task_name</span>
|
| 172 |
+
</div>
|
| 173 |
+
<div class="endpoint">
|
| 174 |
+
<span class="method post">POST</span>
|
| 175 |
+
<span class="endpoint-path">/step</span>
|
| 176 |
+
<span class="endpoint-desc">Submit action, receive observation + reward</span>
|
| 177 |
+
</div>
|
| 178 |
+
<div class="endpoint">
|
| 179 |
+
<span class="method get">GET</span>
|
| 180 |
+
<span class="endpoint-path">/state</span>
|
| 181 |
+
<span class="endpoint-desc">Current environment state</span>
|
| 182 |
+
</div>
|
| 183 |
+
<div class="endpoint">
|
| 184 |
+
<span class="method get">GET</span>
|
| 185 |
+
<span class="endpoint-path">/health</span>
|
| 186 |
+
<span class="endpoint-desc">Health check</span>
|
| 187 |
+
</div>
|
| 188 |
+
<div class="endpoint">
|
| 189 |
+
<span class="method get">GET</span>
|
| 190 |
+
<span class="endpoint-path"><a href="/docs">/docs</a></span>
|
| 191 |
+
<span class="endpoint-desc">Interactive API documentation (Swagger UI)</span>
|
| 192 |
+
</div>
|
| 193 |
+
</div>
|
| 194 |
+
|
| 195 |
+
<div class="card">
|
| 196 |
+
<h2>Tags</h2>
|
| 197 |
+
<span class="tag">openenv</span>
|
| 198 |
+
<span class="tag">speech</span>
|
| 199 |
+
<span class="tag">fraud-detection</span>
|
| 200 |
+
<span class="tag">audio</span>
|
| 201 |
+
<span class="tag">partial-observability</span>
|
| 202 |
+
<span class="tag">multi-step</span>
|
| 203 |
+
<span class="tag">confidence-calibration</span>
|
| 204 |
+
<span class="tag">adversarial</span>
|
| 205 |
+
</div>
|
| 206 |
+
|
| 207 |
+
<footer>
|
| 208 |
+
Built by <span>Akshara Sharma</span> · Voice Authenticity OpenEnv v2.0.0
|
| 209 |
+
· <a href="https://github.com/AksharaaSharmaa/voice-authenticity-openenv">GitHub</a>
|
| 210 |
+
</footer>
|
| 211 |
+
</body>
|
| 212 |
+
</html>
|
| 213 |
+
"""
|
| 214 |
+
|
| 215 |
+
|
| 216 |
@app.post("/reset")
|
| 217 |
def reset(request: dict = {}):
|
| 218 |
global current_task
|
|
|
|
| 228 |
"info": {}
|
| 229 |
})
|
| 230 |
|
| 231 |
+
|
| 232 |
@app.post("/step")
|
| 233 |
def step(action: ActionRequest):
|
| 234 |
global current_task
|
|
|
|
| 236 |
if task not in envs:
|
| 237 |
task = current_task
|
| 238 |
action_dict = {
|
| 239 |
+
"action_type": action.action_type,
|
| 240 |
"label": action.label,
|
| 241 |
"confidence": action.confidence,
|
| 242 |
+
"reasoning": action.reasoning,
|
| 243 |
+
"focus": action.focus,
|
| 244 |
}
|
| 245 |
obs, reward, done, info = envs[task].step(action_dict)
|
| 246 |
return JSONResponse({
|
|
|
|
| 250 |
"info": info
|
| 251 |
})
|
| 252 |
|
| 253 |
+
|
| 254 |
@app.get("/state")
|
| 255 |
def state():
|
| 256 |
return JSONResponse(envs[current_task].state())
|
| 257 |
|
| 258 |
+
|
| 259 |
@app.get("/health")
|
| 260 |
def health():
|
| 261 |
+
return {"status": "healthy", "service": "voice-authenticity-openenv"}
|
| 262 |
+
|
| 263 |
|
| 264 |
@app.get("/")
|
| 265 |
def root():
|
| 266 |
+
return {
|
| 267 |
+
"name": "voice-authenticity-openenv",
|
| 268 |
+
"version": "2.0.0",
|
| 269 |
+
"status": "running",
|
| 270 |
+
"tasks": TASKS,
|
| 271 |
+
"web": "/web",
|
| 272 |
+
"docs": "/docs"
|
| 273 |
+
}
|
| 274 |
+
|
| 275 |
|
| 276 |
def main():
|
| 277 |
uvicorn.run(app, host="0.0.0.0", port=7860)
|