Streamline CLAUDE.md: remove session history, add sync workflow info
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

HallucinationGuard-Env is an OpenEnv RL environment for training LLMs to avoid hallucinations. It runs as a FastAPI server on HuggingFace Spaces with 3 graded tasks (beginner → advanced) and a 9-component reward system.

## Key Commands

```bash
# Install dependencies
pip install -r server/requirements.txt

# Run server locally (port 7860)
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

# Run heuristic baseline (no API key needed)
python inference.py --heuristic --env-url http://localhost:7860

# Run tests
pytest tests/                                         # All tests
pytest tests/test_grader.py -v                        # Specific test file
pytest tests/test_grader.py::TestGraderScoreRange -v  # Specific test class

# Lint (CI uses this)
ruff check . --ignore E501,F401,F403

# Docker build
docker build -t hallucination-guard-env .
docker run -p 7860:7860 hallucination-guard-env
```

## Running with LLM APIs

### Groq (cloud)

```bash
export API_BASE_URL=https://api.groq.com/openai/v1
export MODEL_NAME=llama-3.3-70b-versatile
export HF_TOKEN=gsk_your_key_here
python inference.py --env-url http://localhost:7860 --episodes 3 --steps 5 --seed 42
```

### Ollama (local)

```bash
ollama pull qwen2.5:7b
export API_BASE_URL=http://localhost:11434/v1
export MODEL_NAME=qwen2.5:7b
export HF_TOKEN=ollama  # Any non-empty value triggers LLM mode
python inference.py --env-url http://localhost:7860 --episodes 3 --steps 5 --seed 42
```

## Critical Dependencies

- **NumPy must be <2.0.0** — pre-compiled packages (sentence-transformers, bert-score) crash with NumPy 2.x. Pinned in requirements.
- **Protobuf required** — a BERTScore dependency; explicitly listed in requirements.

## Architecture

```
server/
├── app.py             # FastAPI endpoints
├── environment.py     # HallucinationEnvironment class (OpenEnv step/reset/state)
├── grader.py          # 9-component reward calculation + refusal handling + explanations
├── dataset_loader.py  # Loads 38 datasets from HF cache
└── tasks.py           # Task registry with difficulty-weighted graders

models.py              # Pydantic models: HallucinationAction, HallucinationObservation, HallucinationState
inference.py           # Hackathon submission script (OpenAI-compatible client)
```

### Data Flow

1. **reset()** → samples a question from dataset_loader and returns a HallucinationObservation
2. **step(HallucinationAction)** → grades the answer via grader.py and returns reward + feedback
3. **grader.calculate_reward()** → combines 9 components (see Reward System below)
4. **tasks.compute_task_score()** → aggregates per-step rewards into a 0.0–1.0 task score

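The reset → step loop can be driven from any HTTP client. A minimal sketch, assuming the server is running locally; the payload field names (`answer`, `confidence`, `source_quote`) mirror `HallucinationAction` but should be checked against the `/tasks` action schema:

```python
# Minimal client sketch for the reset → step loop described above.
# Field names are assumed from models.py; /tasks is the authoritative schema.
import json
from urllib import request

def post_json(url: str, payload: dict) -> dict:
    req = request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

def make_action(answer: str, confidence: float, source_quote: str = "") -> dict:
    # One step's submission: the answer, a stated confidence in [0, 1],
    # and an optional verbatim quote for the citation_accuracy component.
    return {"answer": answer, "confidence": confidence, "source_quote": source_quote}

# Usage (needs the server on localhost:7860):
#   obs = post_json("http://localhost:7860/reset", {})
#   result = post_json("http://localhost:7860/step", make_action("Paris", 0.9))
```
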
### API Endpoints

| Category | Method | Endpoint | Description |
|----------|--------|----------|-------------|
| Environment | POST | `/reset` | Start new episode |
| Environment | POST | `/step` | Submit answer |
| Environment | GET | `/state` | Get episode state |
| Batch | POST | `/batch/evaluate` | Evaluate multiple Q&A pairs |
| Batch | POST | `/batch/stream` | Streaming batch (NDJSON) |
| Metrics | GET | `/metrics/timing` | Time-per-step latency stats |
| Leaderboard | GET | `/leaderboard/viz` | Chart data (bar, scatter, tiers) |
| OpenEnv | GET | `/tasks` | List tasks + action schema |
| OpenEnv | POST | `/grader` | Score completed episode |
| OpenEnv | POST | `/baseline` | Run heuristic baseline |

### Dataset Loading

Datasets load from the `SamSankar/hallucination-guard-cache` HF Dataset repo. Core datasets load synchronously on startup; extended datasets load in a background thread. Data is cached locally at `/tmp/halluguard_cache/`.

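The two-phase startup (blocking core load, background extended load) can be sketched as follows; `load_core`/`load_extended` are hypothetical stand-ins, not the actual `dataset_loader` API:

```python
# Sketch of the startup order described above: core datasets load
# synchronously, extended datasets load on a background thread.
# load_core/load_extended are hypothetical stand-ins.
import threading

LOADED: dict[str, bool] = {}

def load_core() -> None:
    LOADED["core"] = True        # stand-in for the blocking core-dataset load

def load_extended() -> None:
    LOADED["extended"] = True    # stand-in for the slow extended-dataset load

def startup() -> threading.Thread:
    load_core()                  # server is usable as soon as this returns
    worker = threading.Thread(target=load_extended, daemon=True)
    worker.start()               # extended datasets fill in while serving
    return worker
```
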
### Model Preloading

ML models (sentence-transformers, CrossEncoder/NLI, ROUGE, BERTScore) preload at server startup in `lifespan()` to avoid 30–60s cold-start delays. The environment variable `HF_HOME=/tmp/hf_cache` replaces the deprecated `TRANSFORMERS_CACHE`.

## Reward System (grader.py)

The reward is a weighted combination of 9 components:

| Component | Weight | Description |
|-----------|--------|-------------|
| factual_correctness | 0.35 | Exact/fuzzy match + semantic similarity to ground truth |
| source_grounding | 0.20 | Answer supported by context (reduced for wrong answers) |
| citation_accuracy | 0.10 | source_quote found verbatim in context |
| confidence_calibration | 0.10 | ECE between stated confidence and correctness |
| semantic_consistency | 0.10 | NLI entailment score (DeBERTa-v3) |
| hallucination_penalty | 0.10 | Penalizes detected hallucinations |
| rouge_score | 0.02 | ROUGE-1/2/L overlap with reference |
| bertscore | 0.02 | Token-level semantic similarity |
| alignscore | 0.01 | Faithfulness to context (RoBERTa) |

**Key behavior:**
- Wrong answers are capped at ~0.4 reward regardless of grounding
- The grounding contribution is reduced for incorrect answers
- Difficulty multiplier: beginner ×0.9, intermediate ×1.0, advanced ×1.1, expert ×1.2

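The weighted sum, wrong-answer cap, and difficulty multiplier can be sketched like this (illustrative only; `calculate_reward()` in grader.py is the authoritative implementation, and the real cap/penalty interactions are more nuanced):

```python
# Illustrative recombination of the nine weighted components from the table
# above, plus the wrong-answer cap and difficulty multiplier.
# grader.py's calculate_reward() is the authoritative implementation.
WEIGHTS = {
    "factual_correctness": 0.35, "source_grounding": 0.20,
    "citation_accuracy": 0.10, "confidence_calibration": 0.10,
    "semantic_consistency": 0.10, "hallucination_penalty": 0.10,
    "rouge_score": 0.02, "bertscore": 0.02, "alignscore": 0.01,
}
DIFFICULTY_MULT = {"beginner": 0.9, "intermediate": 1.0, "advanced": 1.1, "expert": 1.2}

def combine(components: dict[str, float], *, correct: bool,
            difficulty: str = "intermediate") -> float:
    reward = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    if not correct:
        reward = min(reward, 0.4)  # wrong answers capped at ~0.4
    return reward * DIFFICULTY_MULT[difficulty]
```

With every component at 1.0 the weights sum to 1.0 for a correct answer, while the same component scores on a wrong answer are capped at 0.4.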
## Refusal Handling

The grader detects when models appropriately refuse to answer unanswerable questions:

| Scenario | Reward | Behavior |
|----------|--------|----------|
| Proper refusal on unanswerable | 0.65–0.80 | Rewarded for honesty |
| Refusal with low confidence | 0.50 | Partial credit |
| Underconfident refusal (answer exists) | 0.30 | Penalized for not trying |

Detected phrases: "I cannot answer", "not in the context", "I don't know", "cannot determine", "insufficient information". See `is_refusal_answer()` in grader.py.

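A simplified version of the phrase check, using only the phrases listed above (the real `is_refusal_answer()` in grader.py may normalize differently or use a longer list):

```python
# Simplified sketch of refusal detection via the phrases listed above.
# grader.py's is_refusal_answer() is the authoritative version.
REFUSAL_PHRASES = (
    "i cannot answer",
    "not in the context",
    "i don't know",
    "cannot determine",
    "insufficient information",
)

def looks_like_refusal(answer: str) -> bool:
    text = answer.lower()
    return any(phrase in text for phrase in REFUSAL_PHRASES)
```
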
## Pydantic Models

All models inherit from `openenv.core.env_server.Action`, `Observation`, and `State` (Pydantic BaseModel, not dataclass). When modifying:

- Use `Field(default_factory=...)`, not `field(default_factory=...)`
- Use `str` for enum-like values in model fields (e.g., `difficulty: str = "intermediate"`)
- Serialization goes through `_safe_dict()` in app.py, which handles Pydantic models via `model_dump()`

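A minimal model following those rules might look like this; it is illustrative only, inheriting from `BaseModel` directly to stay self-contained, whereas the real classes in models.py inherit from the OpenEnv base classes:

```python
# Illustrative model following the rules above. Real classes in models.py
# inherit from openenv.core.env_server base classes; this sketch inherits
# from BaseModel directly to stay self-contained.
from pydantic import BaseModel, Field

class ExampleObservation(BaseModel):
    question: str = ""
    difficulty: str = "intermediate"                       # plain str, not an Enum
    context_docs: list[str] = Field(default_factory=list)  # Field, not dataclasses.field

obs = ExampleObservation(question="Who wrote Hamlet?")
payload = obs.model_dump()   # the dict form _safe_dict() relies on
```
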
## Test Structure

```
tests/
├── test_grader.py          # 20 tests: reward calculation, refusal handling, hallucination detection
├── test_adversarial.py     # 18 tests: HaluEval, TruthfulQA edge cases
├── test_endpoints.py       # 15 tests: batch eval, metrics, leaderboard endpoints
├── test_environment.py     # 13 tests: reset/step behavior
└── test_dataset_loader.py  # 14 tests: dataset loading, caching
```

Run with `pytest tests/ -v`. CI runs automatically via `.github/workflows/test.yml`.

## Repositories

- **GitHub:** https://github.com/SS-360/hallucination-guard-env
- **HuggingFace Space:** https://huggingface.co/spaces/SamSankar/hallucination-guard-env

Changes pushed to GitHub automatically sync to HuggingFace Spaces via `.github/workflows/sync-to-hf.yml`. This requires an `HF_TOKEN` secret with write permission configured in the GitHub repository settings.

## Baseline Scores

Heuristic agent (seed=42, 3 episodes × 5 steps):

- task_1_factual_grounding: 0.29 (±0.15)
- task_2_multi_hop_synthesis: 0.25 (±0.14)
- task_3_adversarial_resistance: 0.22 (±0.16)
- Overall: 0.25