Spaces: EQUES/Response-Quality-Assessment

Ryoya Awano committed 22 days ago

Commit

19fc84f

0 Parent(s):

deploy: fix MedLFQA Marginal mode sample matching

This view is limited to 50 files because it contains too many changes.

Files changed (50)

.DS_Store +0 -0
.env.example +16 -0
.gitignore +11 -0
CLAUDE.md +57 -0
README.md +95 -0
conf/config.yaml +40 -0
conf/dataset_config.yaml +17 -0
conf/path_config.yaml +9 -0
data/.DS_Store +0 -0
data/processed/FactScore/fact_score_queries.json +552 -0
data/processed/HotpotQA/hotpot_qa_queries.json +0 -0
data/processed/PopQA/pop_qa_queries.json +0 -0
data/processed/base_schema.json +16 -0
data/processed/document_schema.json +12 -0
data/processed/wiki_schema.json +26 -0
demo/__init__.py +0 -0
demo/app.py +362 -0
demo/constants.py +23 -0
demo/data/samples.json +322 -0
demo/data/thresholds.csv +145 -0
demo/inference_api.py +302 -0
demo/precompute.py +311 -0
docs/context/01_original_architecture.md +404 -0
docs/context/02_design_patterns_analysis.md +264 -0
docs/context/03_demo_app_design.md +327 -0
docs/context/04_environment_setup.md +373 -0
docs/context/05_medrag_corpus_integration.md +196 -0
docs/context/06_test_strategy.md +180 -0
docs/context/07_huggingface_spaces_deployment.md +259 -0
main.py +376 -0
requirements-dev.txt +13 -0
requirements.txt +13 -0
scripts/build_sif.sh +15 -0
scripts/run_demo.sh +28 -0
scripts/run_main.sh +22 -0
scripts/run_precompute.sh +25 -0
scripts/run_tests.sh +37 -0
singularity/response_quality.def +45 -0
src/calibration/__init__.py +0 -0
src/calibration/base_calibration.py +19 -0
src/calibration/conditional_conformal.py +278 -0
src/calibration/conformal.py +422 -0
src/calibration/utils.py +161 -0
src/common/__init__.py +0 -0
src/common/chunker.py +41 -0
src/common/config_manager.py +171 -0
src/common/faiss_manager.py +309 -0
src/common/file_manager.py +128 -0
src/common/llm/__init__.py +0 -0
src/common/llm/llm_agent.py +14 -0

.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

.env.example ADDED Viewed

	@@ -0,0 +1,16 @@

+# Copy this file to .env and fill in the values for your environment.
+# .env is listed in .gitignore and must never be committed.
+# Absolute path to the data root on the host.
+# Heavy files (FAISS indices, raw data, main.py output) are stored here
+# and bind-mounted into the container.
+DATA_ROOT=/mnt/data/your_username/ResponseQualityAssessment
+# Directory where the Singularity .sif file is stored.
+SIF_DIR=/mnt/data/your_username/sif
+# HuggingFace model cache directory.
+HF_HOME=/mnt/data/your_username/hf_cache
+# OpenAI API key (required for live inference in the demo).
+OPENAI_API_KEY=sk-...

.gitignore ADDED Viewed

	@@ -0,0 +1,11 @@

+data/raw/WikiDB/enwiki-20230401.db
+logs/
+index_store/
+data/
+# demo/data/ is generated by precompute.py but committed to the repo
+!demo/data/
+!demo/data/*
+*.pyc
+__pycache__/
+launch.json
+.env

CLAUDE.md ADDED Viewed

	@@ -0,0 +1,57 @@

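+# Development Guidelines
+## Project Purpose
+A response quality assessment system for RAG. It uses conformal prediction to score each subclaim of an LLM response with statistical guarantees. A demo app (`demo/`) that can be operated interactively is being developed in an agile manner, building on the existing batch-processing logic (`src/` and `main.py`).
+---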
+## Directory Structure
+```
+src/
+├── calibration/         # Conformal prediction calibration
+├── common/              # Shared components (config, file, and LLM management)
+├── data_processor/      # Dataset processing pipeline
+├── dataloader/          # HuggingFace data loaders
+├── rag/                 # RAG (SQLite-based document DB)
+├── subclaim_processor/  # Core pipeline for response quality assessment
+└── utils/               # General-purpose helper functions
+demo/
+├── app.py               # Streamlit entry point
+├── inference_api.py     # Single-query inference wrapper (calls into src/)
+└── precompute.py        # Threshold precomputation script (run offline)
+```
+---
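+## Context and Design
+See `docs/context/` for details.
+| Document | Contents |
+|------------|------|
+| `01_original_architecture.md` | Overall architecture, data flow, and class design |
+| `02_design_patterns_analysis.md` | Evaluation of design patterns and proposed improvements |
+| `03_demo_app_design.md` | Demo app technology choices, UI design, and development steps |
+| `04_environment_setup.md` | Environment setup and run procedures (Singularity, `.env` settings, list of scripts) |
+| `06_test_strategy.md` | Test plan and execution flow (unit tests through UI tests) |
+When a logic change significantly alters the architecture, update the documents in `docs/context/` as well.
+---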
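+## Coding Rules
+### src/ (core logic)
+- Modifying and refactoring the core logic under `src/` is permitted.
+- However, either keep backward compatibility in mind so that batch processing with the existing `main.py` does not break, or report the scope of impact.
+### demo/ (demo app)
+- Isolate demo-specific UI logic and endpoints in the `demo/` directory, and do not couple them tightly to the core logic in `src/`.
+- `demo/` should only `import` modules from `src/`; do not modify the core logic.
+- Precompute the calibration threshold ($\hat{q}$) with `demo/precompute.py` and keep it as static JSON/CSV. Do not run the calibration loop when the demo runs.
+### Git
+- Propose a `git commit` at natural stopping points, one per unit of functionality.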
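As a rough illustration of the precomputed threshold $\hat{q}$ that the coding rules refer to, the sketch below computes a standard split-conformal quantile and then applies the same keep rule that `demo/app.py` uses when rendering (`score >= q_hat`). This is a generic textbook construction, not the code in `src/calibration/`; the function names `conformal_quantile` and `keep_subclaims` are hypothetical, and it assumes one conformity score per calibration query.

```python
import math


def conformal_quantile(calibration_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    Assumption: `calibration_scores` holds one conformity score per
    calibration query. How the repository derives these is not shown here.
    """
    n = len(calibration_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def keep_subclaims(subclaims, q_hat):
    """Keep subclaims whose score reaches the threshold (same rule as the demo UI)."""
    return [sc for sc in subclaims if sc["score"] >= q_hat]


if __name__ == "__main__":
    q_hat = conformal_quantile([0.2, 0.4, 0.5, 0.7, 0.9], alpha=0.2)
    claims = [{"text": "claim A", "score": 0.95}, {"text": "claim B", "score": 0.3}]
    print(q_hat, keep_subclaims(claims, q_hat))
```

Because the quantile depends only on the calibration set and `alpha`, it can be computed once offline and stored, which is what the rule about static JSON/CSV asks for.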

README.md ADDED Viewed

	@@ -0,0 +1,95 @@

+---
+title: Conformal RAG Quality Assessment Demo
+emoji: 📊
+colorFrom: blue
+colorTo: green
+sdk: streamlit
+sdk_version: 1.43.2
+app_file: demo/app.py
+pinned: false
+---
+# Response Quality Assessment for Retrieval-Augmented Generation via Conditional Conformal Factuality
+This repository contains production-ready code and resources for our paper "Response Quality Assessment for Retrieval-Augmented Generation via Conditional Conformal Factuality", accepted at SIGIR 2025.
+arXiv version: [https://www.arxiv.org/abs/2506.20978](https://www.arxiv.org/abs/2506.20978)
+Poster: https://drive.google.com/file/d/1k7fSa4k07dPJUCvsjU0hE0e486T5UNoy/view?usp=sharing
+## Table of Contents
+- [Structure](#structure)
+- [Data](#data)
+  - [Query Data](#query-data)
+  - [Wikipedia Extraction](#wikipedia-extraction)
+- [Usage](#usage)
+- [References](#references)
+- [More Information](#more-information)
+## Structure
+```bash
+.
+├── conf/                   # Configuration file location
+├── data/
+│   ├── out/               # Final subclaims with scores (follows `subclaims_schema`)
+│   ├── processed/         # Standardized test data (follows `base_schema`)
+│   └── raw/               # Original raw data from source (unstructured)
+├── index_store/           # Chunked documents and embeddings
+├── logs/                  # Config and logs in format `run_{data}_{run_id}`
+├── src/
+│   ├── calibration/       # Conformal prediction calibration logic
+│   ├── common/            # Reusable components (e.g., config manager, FAISS vector DB manager)
+│   ├── data_processor/    # Converts raw QA data to standardized format (see `data/processed`)
+│   ├── dataloader/        # Loads data from source datasets (e.g., AkariASAI/PopQA, KILT benchmark)
+│   ├── rag/               # RAG system components for document retrieval
+│   ├── subclaim_processor/# Generates, scores, and annotates subclaims for different datasets
+│   └── utils/             # Miscellaneous utilities
+```
+## Data
+### Query Data
+This repository includes the following query datasets:
+- [FactScore](https://github.com/shmsw25/FActScore)
+- [PopQA](https://huggingface.co/datasets/akariasai/PopQA)
+- [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa)
+- [MedLFQA](https://github.com/dmis-lab/OLAPH/tree/main/MedLFQA) or [MedLFQAv2](https://github.com/jjcherian/conformal-safety/tree/main/data/MedLFQAv2)
+### Wikipedia Extraction
+We utilize Wikipedia dumps for knowledge retrieval:
+- [enwiki-20230401.db](https://drive.google.com/file/d/1mekls6OGOKLmt7gYtHs0WGf5oTamTNat/view?usp=drive_link)
+This file is not included in this GitHub repository. Download it from Ming's Google Drive link above (source: https://github.com/shmsw25/FActScore) and put it under the
+`data/raw` folder in order to generate reference documents for the Wikipedia-based queries (PopQA and HotpotQA).
+## Usage
+The project is built on Python 3.11.
+First, set up the project environment using [requirements.txt](requirements.txt).
+To run the pipeline:
+```bash
+python main.py --config conf/config.yaml --dataset fact_score --query_size 500
+```
+Only one dataset can be processed at a time, in a single thread.
+The currently available datasets are:
+["fact_score", "hotpot_qa", "pop_qa", "medlf_qa"]
+## Conditional Conformal
+This repo supports conditional conformal prediction only for the medlf_qa dataset. By default, in `conf/dataset_config.yaml`,
+`medlf_qa.is_grouped` is `true`, while it is `false` for the other datasets.
+The factuality results are written to separate CSV files, one per predefined group name, under the `result/${datetime}_${run_id}` folder.
+## Start with metadata
+To avoid any OpenAI token consumption, you can use the already-produced metadata to verify only the conformal prediction part. The results will be stable.
+You can get the required metadata here: https://drive.google.com/drive/folders/1aLbHxS6V1ipMH8FpVCxKmr8oMYfqmRgb?usp=drive_link
+## More Information
+For further details, please refer to our paper (linked at the top).
+The baseline group-conditional conformal (https://arxiv.org/abs/2406.09714) result
+for MedLFQAv2 was produced with its authors' code (github.com/jjcherian/conformal-safety)
+and is not part of this repo.
+## License
+This project is licensed under the [MIT License](https://opensource.org/license/mit).

conf/config.yaml ADDED Viewed

	@@ -0,0 +1,40 @@

+dataset:
+  name: "pop_qa"  # Options: fact_score, hotpot_qa, pop_qa, medlf_qa
+  query_size: 500
+  wiki_db_file: "enwiki-20230401.db"
+# Index configuration
+index:
+  delete_existing: false
+  embedding_model: "text-embedding-3-large"
+  # Truncation strategies by dataset
+  truncation_config:  # TODO
+    strategy: "fixed_length"  # false
+    truncate_by: null  # "\n"
+    chunk_size: 2000
+    chunk_overlap: 25
+# RAG configuration
+rag:
+  retrival_topk: 10
+  retrival_threshold: 0.3
+  response_model: "gpt-4.1-mini"
+  response_temperature: 0.7
+  fact_generation_model: "gpt-4.1-mini"
+# Prediction configuration
+conformal_prediction:
+  aggregation_strategy: "mean"
+  scoring_strategy: "product"
+  claim_verification_model: "gpt-4.1-mini"  # "gpt-4o-mini"
+  frequency_score_model: "gpt-4.1-mini"
+  split_conformal: true
+  conformal_alphas:
+    start: 0.05
+    end: 0.45
+    step: 0.05
+  a_value: 1.0
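The `conformal_alphas` block describes a grid of miscoverage levels. The sketch below shows one way such a `start`/`end`/`step` triple could be expanded into a list and mapped to the factuality percentages that `demo/app.py` builds from `ALPHA_LEVELS`. It assumes the end point is inclusive and that two-decimal rounding is acceptable; the helper names are hypothetical, and the repository's own expansion logic is not shown in this diff.

```python
def alpha_grid(start: float, end: float, step: float) -> list[float]:
    """Return alpha values from start to end (assumed inclusive), rounded to 2 decimals."""
    # Count steps with round() so float drift in (end - start) / step cannot drop a value.
    count = int(round((end - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def factuality_options(alphas: list[float]) -> list[int]:
    """Map each alpha to a target factuality percentage, 100 * (1 - alpha), ascending."""
    return sorted(int(round((1 - a) * 100)) for a in alphas)


if __name__ == "__main__":
    alphas = alpha_grid(0.05, 0.45, 0.05)
    print(alphas)
    print(factuality_options(alphas))
```

Indexing by an integer step count, rather than repeatedly adding `step`, keeps the values stable enough to be compared against the `alpha` column of a precomputed table.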

conf/dataset_config.yaml ADDED Viewed

	@@ -0,0 +1,17 @@

+datasets:
+  fact_score:
+    name: "FactScore"
+    index_store: "${DATA_ROOT}/index_store/FactScore"
+    is_grouped: false
+  hotpot_qa:
+    name: "HotpotQA"
+    index_store: "${DATA_ROOT}/index_store/HotpotQA"
+    is_grouped: false
+  pop_qa:
+    name: "PopQA"
+    index_store: "${DATA_ROOT}/index_store/PopQA"
+    is_grouped: false
+  medlf_qa:
+    name: "MedLFQA"
+    index_store: "${DATA_ROOT}/index_store/MedLFQA"
+    is_grouped: true
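The `is_grouped` flag is what separates MedLFQA from the other datasets. A minimal sketch of how a caller might normalise a requested mode and group against this flag follows; it mirrors the sidebar logic in `demo/app.py`, where non-conditional runs fall back to `"marginal"` and the group `"default"`. The dictionary is a hand-copied stand-in for the parsed YAML and `resolve_mode_and_group` is a hypothetical helper.

```python
# Stand-in for the parsed conf/dataset_config.yaml (only the flag used here).
DATASETS = {
    "fact_score": {"is_grouped": False},
    "hotpot_qa": {"is_grouped": False},
    "pop_qa": {"is_grouped": False},
    "medlf_qa": {"is_grouped": True},
}


def resolve_mode_and_group(dataset: str, requested_mode: str, requested_group: str) -> tuple[str, str]:
    """Return the (mode, group) pair that is valid for this dataset."""
    if DATASETS[dataset]["is_grouped"] and requested_mode == "conditional":
        return "conditional", requested_group
    # Ungrouped datasets, or a marginal request, share one overall threshold.
    return "marginal", "default"


if __name__ == "__main__":
    print(resolve_mode_and_group("pop_qa", "conditional", "live_qa"))
    print(resolve_mode_and_group("medlf_qa", "conditional", "live_qa"))
```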

conf/path_config.yaml ADDED Viewed

	@@ -0,0 +1,9 @@

+# Path configuration
+# Paths are relative to DATA_ROOT (set in .env).
+# The loader must call os.path.expandvars() after yaml.safe_load().
+paths:
+  raw_data_dir: "${DATA_ROOT}/data/raw"
+  processed_data_dir: "${DATA_ROOT}/data/processed"
+  response_dir: "${DATA_ROOT}/data/out"
+  wiki_db_dir: "${DATA_ROOT}/data/raw/WikiDB"
+  result_dir: "${DATA_ROOT}/data/result"
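The header comment requires the loader to call `os.path.expandvars()` after `yaml.safe_load()`. Since `expandvars` works on a single string, one possible approach is to walk the parsed tree and expand every string leaf. The sketch below uses only the standard library and a dictionary literal in place of the `yaml.safe_load()` result; `expand_env` is a hypothetical helper, not the function in `src/common/config_manager.py`.

```python
import os


def expand_env(node):
    """Recursively expand ${VAR} references in every string of a parsed YAML tree."""
    if isinstance(node, dict):
        return {key: expand_env(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_env(item) for item in node]
    if isinstance(node, str):
        # Unset variables are left untouched by os.path.expandvars.
        return os.path.expandvars(node)
    return node


if __name__ == "__main__":
    # Stand-in for yaml.safe_load() applied to conf/path_config.yaml.
    raw_config = {
        "paths": {
            "raw_data_dir": "${DATA_ROOT}/data/raw",
            "result_dir": "${DATA_ROOT}/data/result",
        }
    }
    os.environ.setdefault("DATA_ROOT", "/mnt/data/example")  # placeholder value
    print(expand_env(raw_config)["paths"])
```

Because unset variables pass through unchanged, a caller may want to check for a leftover `${` in the result and fail early when `DATA_ROOT` is missing.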

data/.DS_Store ADDED Viewed

Binary file (6.15 kB). View file

data/processed/FactScore/fact_score_queries.json ADDED Viewed

	@@ -0,0 +1,552 @@

+[
+    {
+        "input": "What is Lanny Flaherty's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Lanny Flaherty"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Marianne McAndrew's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Marianne McAndrew"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Chaim Malinowitz's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Chaim Malinowitz"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Doug Sheehan's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Doug Sheehan"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Gerhard Fischer (inventor)'s occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Gerhard Fischer (inventor)"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Chief Jones's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Chief Jones"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Focus...'s occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Focus..."
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Joey D. Vieira's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Joey D. Vieira"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Taral Hicks's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Taral Hicks"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Lees Knowles's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Lees Knowles"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Carolina Portesi Peroni's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Carolina Portesi Peroni"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Quintus Sosius Senecio's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Quintus Sosius Senecio"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Joeri Adams's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Joeri Adams"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Dominic King's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Dominic King"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Vincenzo Tusa's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Vincenzo Tusa"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Jean Daull\u00e9's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Jean Daull\u00e9"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Richard Burkewood Welbourn's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Richard Burkewood Welbourn"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Damir Memovi\u0107's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Damir Memovi\u0107"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Mar\u00eda Elena Medina-Mora Icaza's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Mar\u00eda Elena Medina-Mora Icaza"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Hesham Nazih's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Hesham Nazih"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Luis Guillermo Rivera's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Luis Guillermo Rivera"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Desmond Luke's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Desmond Luke"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Miguel D\u00edaz (baseball)'s occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Miguel D\u00edaz (baseball)"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Roxana D\u00edaz (athlete)'s occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Roxana D\u00edaz (athlete)"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Fernando (footballer, born 1984)'s occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Fernando (footballer, born 1984)"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Gerardo Fernandez Fe's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Gerardo Fernandez Fe"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Augusto Mart\u00ednez S\u00e1nchez's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Augusto Mart\u00ednez S\u00e1nchez"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Sharad Kumar (athlete)'s occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Sharad Kumar (athlete)"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Ko Itakura's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Ko Itakura"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Besant Ravi's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Besant Ravi"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Thomas Houghton (rugby league)'s occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Thomas Houghton (rugby league)"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Katsunosuke Hori's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Katsunosuke Hori"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Ra Jong-yil's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Ra Jong-yil"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Song Kang's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Song Kang"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Rin Iwanaga's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Rin Iwanaga"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Keita Kadokura's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Keita Kadokura"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Lily Branscombe's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Lily Branscombe"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Luis N. Rivera-Pag\u00e1n's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Luis N. Rivera-Pag\u00e1n"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Jesse Foppert's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Jesse Foppert"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Mike Trivisonno's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Mike Trivisonno"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Zelma Wilson's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Zelma Wilson"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Michael Valpy's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Michael Valpy"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Prince Rivers's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Prince Rivers"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is John Estes's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "John Estes"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Eric Hacker's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Eric Hacker"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Alma Katsu's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Alma Katsu"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Carlos Alfonso's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Carlos Alfonso"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Rennie Fritchie, Baroness Fritchie's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Rennie Fritchie, Baroness Fritchie"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Daniel Charles's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Daniel Charles"
+                }
+            ]
+        }
+    },
+    {
+        "input": "What is Zamfir Arbore's occupation?",
+        "output": {
+            "answer": "",
+            "provenance": [
+                {
+                    "title": "Zamfir Arbore"
+                }
+            ]
+        }
+    }
+]
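Each record above pairs a question in `input` with an empty `answer` and a `provenance` list naming the Wikipedia page to retrieve. A small sketch of reading that shape is shown below, using an inline one-record sample instead of the real file; `query_titles` is a hypothetical helper.

```python
import json

# One record copied from the file above, inlined so the example is self-contained.
SAMPLE = """
[
    {
        "input": "What is Lanny Flaherty's occupation?",
        "output": {"answer": "", "provenance": [{"title": "Lanny Flaherty"}]}
    }
]
"""


def query_titles(records):
    """Return (question, [page titles]) pairs from base-schema style records."""
    pairs = []
    for record in records:
        provenance = record["output"].get("provenance", [])
        pairs.append((record["input"], [entry["title"] for entry in provenance]))
    return pairs


if __name__ == "__main__":
    for question, titles in query_titles(json.loads(SAMPLE)):
        print(question, titles)
```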

data/processed/HotpotQA/hotpot_qa_queries.json ADDED Viewed

The diff for this file is too large to render. See raw diff

data/processed/PopQA/pop_qa_queries.json ADDED Viewed

The diff for this file is too large to render. See raw diff

data/processed/base_schema.json ADDED Viewed

	@@ -0,0 +1,16 @@

+{
+    "$schema": "https://json-schema.org/draft-07/schema",
+    "type": "object",
+    "properties": {
+      "input": { "type": "string" },
+      "output": {
+        "type": "object",
+        "properties": {
+          "answer": { "type": "string" }
+        },
+        "required": ["answer"]
+      }
+    },
+    "required": ["input", "output"]
+  }

data/processed/document_schema.json ADDED Viewed

	@@ -0,0 +1,12 @@

+{
+    "$schema": "https://json-schema.org/draft-07/schema",
+    "type": "object",
+    "additionalProperties": {
+      "type": "array",
+      "minItems": 1,
+      "items": {
+        "type": "string"
+      }
+    }
+  }

data/processed/wiki_schema.json ADDED Viewed

	@@ -0,0 +1,26 @@

+{
+    "$schema": "https://json-schema.org/draft-07/schema",
+    "allOf": [
+      { "$ref": "base_schema.json" },
+      {
+        "properties": {
+          "output": {
+            "properties": {
+              "provenance": {
+                "type": "array",
+                "items": {
+                  "type": "object",
+                  "properties": {
+                    "wikipedia_id": { "type": "integer" },
+                    "title": { "type": ["string", "null"] }
+                  },
+                  "required": ["title"]
+                }
+              }
+            }
+          }
+        }
+      }
+    ]
+  }
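Taken together, `base_schema.json` and `wiki_schema.json` require a string `input`, an `output` object with a string `answer`, and, for Wikipedia-backed records, `provenance` entries that each carry a `title`. The hand-written check below mirrors those constraints with the standard library only. It is an illustration of what the schemas ask for, not a JSON Schema validator, and `check_wiki_record` is a hypothetical name.

```python
def check_wiki_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed these checks."""
    problems = []
    if not isinstance(record.get("input"), str):
        problems.append("input must be a string")
    output = record.get("output")
    if not isinstance(output, dict):
        problems.append("output must be an object")
        return problems
    if not isinstance(output.get("answer"), str):
        problems.append("output.answer must be a string")
    for i, entry in enumerate(output.get("provenance", [])):
        if not isinstance(entry, dict) or "title" not in entry:
            problems.append(f"provenance[{i}] must be an object with a title")
        elif entry["title"] is not None and not isinstance(entry["title"], str):
            problems.append(f"provenance[{i}].title must be a string or null")
    return problems


if __name__ == "__main__":
    good = {"input": "q", "output": {"answer": "", "provenance": [{"title": "Page"}]}}
    bad = {"input": "q", "output": {"provenance": [{"wikipedia_id": 1}]}}
    print(check_wiki_record(good))
    print(check_wiki_record(bad))
```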

demo/__init__.py ADDED Viewed

File without changes

demo/app.py ADDED Viewed

	@@ -0,0 +1,362 @@

+# demo/app.py
+# Streamlit entry point.  Run with: streamlit run demo/app.py
+#
+# Prerequisites:
+#   1. FAISS indices built by main.py
+#   2. demo/data/thresholds.csv and demo/data/samples.json built by precompute.py
+import json
+import os
+import re
+import time
+import pandas as pd
+import streamlit as st
+# When SPACES_DEMO=1, live inference (FAISS + LLM pipeline) is disabled.
+# Only precomputed sample queries are available.  Set this in HF Spaces Secrets.
+SPACES_DEMO: bool = os.getenv("SPACES_DEMO", "0") == "1"
+from demo.constants import ALPHA_LEVELS, DEFAULT_SCORING_METHOD
+from demo.inference_api import (
+    SubclaimResult,
+    FilteredResult,
+    build_faiss_manager,
+    build_scorer,
+    process_query,
+    apply_threshold,
+    reintegrate_subclaims,
+)
+# ── Constants ─────────────────────────────────────────────────────────────────
+DATASET_LABELS: dict[str, str] = {
+    "fact_score": "FactScore",
+    "hotpot_qa": "HotpotQA",
+    "pop_qa": "PopQA",
+    "medlf_qa": "MedLFQA",
+}
+MEDLF_GROUPS: list[str] = [
+    "healthsearch_qa",
+    "medication_qa",
+    "kqa_silver_wogold",
+    "kqa_golden",
+    "live_qa",
+]
+# Slider options in ascending order (95 → highest factuality at the right)
+FACTUALITY_OPTIONS: list[int] = sorted(
+    int(round((1 - a) * 100)) for a in ALPHA_LEVELS
+)
+THRESHOLDS_PATH = "demo/data/thresholds.csv"
+SAMPLES_PATH = "demo/data/samples.json"
+# ── Cached resources ──────────────────────────────────────────────────────────
+@st.cache_resource
+def get_faiss_manager(dataset: str):
+    return build_faiss_manager(dataset)
+@st.cache_resource
+def get_scorer(dataset: str):
+    return build_scorer(dataset)
+@st.cache_data
+def load_thresholds() -> pd.DataFrame:
+    return pd.read_csv(THRESHOLDS_PATH)
+@st.cache_data
+def load_samples() -> list[dict]:
+    with open(SAMPLES_PATH, encoding="utf-8") as f:
+        return json.load(f)["samples"]
+# ── Helpers ───────────────────────────────────────────────────────────────────
+def _lookup_q_hat(
+    df: pd.DataFrame, dataset: str, mode: str, group: str, alpha: float
+) -> float:
+    alpha = round(alpha, 2)
+    row = df.query(
+        "dataset == @dataset and mode == @mode and "
+        "scoring_method == @DEFAULT_SCORING_METHOD and "
+        "group == @group and alpha == @alpha"
+    )
+    if row.empty:
+        raise ValueError(
+            f"Threshold not found for "
+            f"({dataset}, {mode}, {DEFAULT_SCORING_METHOD}, {group}, α={alpha}). "
+            "Re-run precompute.py."
+        )
+    return float(row["q_hat"].iloc[0])
+def _render_subclaims(result: SubclaimResult, q_hat: float) -> str:
+    """Return HTML with keep/remove highlighting for each subclaim."""
+    parts = []
+    for sc in result["subclaims"]:
+        text = sc["text"]
+        if sc["score"] >= q_hat:
+            parts.append(
+                f'<span style="padding:2px 5px; display:inline-block;">'
+                f"{text}</span>"
+            )
+        else:
+            parts.append(
+                f'<span style="text-decoration:line-through; color:#dc3545; '
+                f'padding:2px 5px; display:inline-block;">'
+                f"{text}</span>"
+            )
+    return " ".join(parts)
+def _extract_page_content(doc: str) -> str:
+    """Extract page_content text from FAISSIndexManager search result string.
+    Strips the source query prefix that is prepended in the format
+    "source_text: document_text".
+    """
+    m = re.search(r"page_content='(.*?)'\s+metadata=\{", doc, re.DOTALL)
+    if not m:
+        return doc
+    content = m.group(1)
+    source_m = re.search(r"'source':\s*'(.*?)'", doc)
+    if source_m:
+        prefix = source_m.group(1) + ": "
+        if content.startswith(prefix):
+            content = content[len(prefix):]
+    return content
+def _params_changed(dataset: str, mode: str, group: str) -> bool:
+    prev = st.session_state.get("prev_params")
+    return prev != (dataset, mode, group)
+def _save_params(dataset: str, mode: str, group: str) -> None:
+    st.session_state.prev_params = (dataset, mode, group)
+# ── App ───────────────────────────────────────────────────────────────────────
+def main() -> None:
+    st.set_page_config(page_title="Conformal RAG Demo", layout="wide")
+    st.title("Conformal RAG: Response Quality Assessment Demo")
+    # Session state defaults
+    st.session_state.setdefault("query_processed", False)
+    st.session_state.setdefault("result", None)
+    st.session_state.setdefault("elapsed_sec", None)
+    st.session_state.setdefault("prev_params", None)
+    st.session_state.setdefault("integrated_answer", None)
+    st.session_state.setdefault("integrated_alpha", None)
+    # Load static data (cached after first call)
+    try:
+        thresholds_df = load_thresholds()
+        samples = load_samples()
+    except FileNotFoundError as e:
+        st.error(
+            f"Data file not found: {e}\n\n"
+            "Run `python -m demo.precompute`."
+        )
+        return
+    # ── Sidebar ──────────────────────────────────────────────────────────────
+    with st.sidebar:
+        st.header("Settings")
+        # Dataset
+        dataset: str = st.selectbox(
+            "Dataset",
+            options=list(DATASET_LABELS.keys()),
+            format_func=lambda k: DATASET_LABELS[k],
+        )
+        is_medlf = dataset == "medlf_qa"
+        # Mode (MedLFQA only)
+        mode_raw: str = st.radio(
+            "Mode",
+            options=["marginal", "conditional"],
+            format_func=lambda m: (
+                "Marginal (overall)" if m == "marginal" else "Conditional (per group)"
+            ),
+            disabled=not is_medlf,
+        )
+        mode: str = mode_raw if is_medlf else "marginal"
+        # Group (MedLFQA + Conditional only)
+        is_conditional = is_medlf and mode == "conditional"
+        if is_conditional:
+            group: str = st.selectbox("グループ", options=MEDLF_GROUPS)
+        else:
+            group = "default"
+        # Clear cached result when key params change
+        if _params_changed(dataset, mode, group):
+            st.session_state.query_processed = False
+            st.session_state.result = None
+            st.session_state.elapsed_sec = None
+            st.session_state.integrated_answer = None
+            st.session_state.integrated_alpha = None
+            _save_params(dataset, mode, group)
+        st.divider()
+        # Query input
+        dataset_samples = [
+            s for s in samples
+            if s["dataset"] == dataset
+            and (not is_conditional or s.get("group") == group)
+        ]
+        sample_map: dict[str, dict] = {s["query"]: s for s in dataset_samples}
+        query_input: str = st.selectbox(
+            "サンプルクエリ", options=list(sample_map.keys())
+        )
+        st.divider()
+        run_btn = st.button("推論実行", use_container_width=True, type="primary")
+        # Factuality slider (active only after inference)
+        target_factuality_pct: int = st.select_slider(
+            "Target Factuality (1−α)",
+            options=FACTUALITY_OPTIONS,
+            value=90,
+            format_func=lambda v: f"{v}%",
+            disabled=not st.session_state.query_processed,
+        )
+    alpha = round((100 - target_factuality_pct) / 100, 2)
+    # ── Heavy processing (button click only) ─────────────────────────────────
+    if run_btn and query_input:
+        st.session_state.integrated_answer = None
+        st.session_state.integrated_alpha = None
+        # Use precomputed result if available for this exact (mode, group)
+        precomputed = sample_map.get(query_input)
+        if (
+            precomputed is not None
+            and (
+                mode == "marginal"
+                or (precomputed["mode"] == mode and precomputed["group"] == group)
+            )
+        ):
+            st.session_state.result = precomputed
+            st.session_state.elapsed_sec = None
+            st.session_state.query_processed = True
+            st.rerun()
+        elif SPACES_DEMO:
+            st.error("このデモではサンプルクエリのみ対応しています。")
+        else:
+            t0 = time.perf_counter()
+            with st.spinner("推論中… (30〜60 秒ほどかかります)"):
+                faiss_manager = get_faiss_manager(dataset)
+                scorer = get_scorer(dataset)
+                st.session_state.result = process_query(
+                    query=query_input,
+                    dataset=dataset,
+                    mode=mode,
+                    group=group,
+                    faiss_manager=faiss_manager,
+                    scorer=scorer,
+                )
+            st.session_state.elapsed_sec = time.perf_counter() - t0
+            st.session_state.query_processed = True
+            st.rerun()
+    # ── Results panel ─────────────────────────────────────────────────────────
+    if not st.session_state.query_processed or st.session_state.result is None:
+        st.info("サイドバーからクエリを選択して「推論実行」を押してください。")
+        return
+    result: SubclaimResult = st.session_state.result
+    try:
+        q_hat = _lookup_q_hat(thresholds_df, dataset, mode, group, alpha)
+    except ValueError as e:
+        st.error(str(e))
+        return
+    filtered: FilteredResult = apply_threshold(result, q_hat)
+    # Query
+    st.subheader("クエリ")
+    st.markdown(result["query"])
+    st.divider()
+    # Original answer (left) | Re-integrated answer (right)
+    col_orig, col_integ = st.columns(2)
+    with col_orig:
+        st.subheader("元の RAG 回答")
+        st.markdown(result["rag_answer"])
+    with col_integ:
+        st.subheader("フィルタ後の回答")
+        if st.session_state.integrated_answer is not None:
+            if st.session_state.integrated_alpha != alpha:
+                st.warning(
+                    f"α={st.session_state.integrated_alpha:.2f} 時点の結果です。"
+                    "「回答を生成」を再度押して更新してください。"
+                )
+            st.markdown(st.session_state.integrated_answer)
+        if st.button("回答を生成", type="secondary"):
+            removed = [sc["text"].strip() for sc in result["subclaims"] if sc["score"] < q_hat]
+            with st.spinner("回答を生成中…"):
+                st.session_state.integrated_answer = reintegrate_subclaims(
+                    rag_answer=result["rag_answer"],
+                    removed_subclaims=removed,
+                )
+            st.session_state.integrated_alpha = alpha
+            st.rerun()
+    st.divider()
+    # Subclaims with keep/remove highlighting
+    st.subheader("Conformal-RAG による回答")
+    html = _render_subclaims(result, q_hat)
+    st.markdown(html, unsafe_allow_html=True)
+    st.divider()
+    # Retrieved documents
+    with st.expander(
+        f"検索されたナレッジ（{len(result['retrieved_docs'])} 件）",
+        expanded=False,
+    ):
+        idx = 1
+        for doc in result["retrieved_docs"]:
+            content = _extract_page_content(doc)
+            if len(content) < 15:
+                continue
+            st.markdown(f"**[{idx}]** {content}")
+            idx += 1
+    # Metrics
+    st.subheader("分析メトリクス")
+    cols = st.columns(4 if st.session_state.elapsed_sec is None else 5)
+    cols[0].metric("閾値 q̂", f"{q_hat:.4f}")
+    cols[1].metric("サブクレーム数", len(result["subclaims"]))
+    cols[2].metric("Keep", filtered["keep_count"])
+    cols[3].metric(
+        "削除率",
+        f"{filtered['remove_count'] / max(len(result['subclaims']), 1) * 100:.0f}%",
+    )
+    if st.session_state.elapsed_sec is not None:
+        cols[4].metric("推論時間", f"{st.session_state.elapsed_sec:.1f} 秒")
+if __name__ == "__main__":
+    main()
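
The results panel above removes every subclaim whose score falls below `q_hat` and reports the removal rate. The definition of `apply_threshold` is not shown in this part of the diff, so the following is only a minimal sketch of that keep/remove split under the same `score < q_hat` rule used for `removed`. The names `Subclaim`, `split_by_threshold` and `removal_rate_pct` are hypothetical and do not come from the repository.

```python
from typing import TypedDict


class Subclaim(TypedDict):
    """Shape of one entry in result["subclaims"], as seen in samples.json."""
    id: int
    text: str
    score: float


def split_by_threshold(
    subclaims: list[Subclaim], q_hat: float
) -> tuple[list[Subclaim], list[Subclaim]]:
    """Hypothetical helper: partition subclaims into (keep, removed).

    A subclaim is removed when its score is strictly below q_hat,
    mirroring the `sc["score"] < q_hat` test in app.py.
    """
    keep = [sc for sc in subclaims if sc["score"] >= q_hat]
    removed = [sc for sc in subclaims if sc["score"] < q_hat]
    return keep, removed


def removal_rate_pct(subclaims: list[Subclaim], removed: list[Subclaim]) -> float:
    """Removal rate in percent; max(..., 1) guards against an empty subclaim list."""
    return len(removed) / max(len(subclaims), 1) * 100
```

A larger `q_hat` removes more subclaims. A score exactly equal to `q_hat` is kept, because the comparison for removal is strict.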

demo/constants.py ADDED

	@@ -0,0 +1,23 @@

+# demo/constants.py — shared constants; precompute.py and app.py must import from here.
+# ── Scoring ──────────────────────────────────────────────────────────────────
+# To switch scoring methods, change this line and re-run precompute.py.
+DEFAULT_SCORING_METHOD: str = "relevance"
+# Maps clean scoring method names to the keys used in subclaim["scores"]
+# in data files produced by main.py.
+# "relavance" is a typo in the original codebase — preserved for compatibility.
+SCORING_METHOD_TO_DATA_KEY: dict[str, str] = {
+    "relevance": "relavance",
+    "cosine_similarity": "query_claim_cosine_similarity",
+    "min_log_prob": "min_log_prob",
+}
+# ── Conformal prediction ─────────────────────────────────────────────────────
+# alpha = 1 - target_factuality.
+# precompute.py computes thresholds for exactly these values;
+# app.py passes this list to st.select_slider.
+# → Target Factuality: 95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%
+ALPHA_LEVELS: list[float] = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
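
The comments above tie each alpha level to a target factuality of `1 - alpha`, and `app.py` converts the slider percentage back with `round((100 - pct) / 100, 2)`. How `FACTUALITY_OPTIONS` is actually built is not visible in this part of the diff, so the sketch below shows one plausible derivation and checks that the round trip lands on the precomputed alpha values. The helper names `factuality_options` and `alpha_from_pct` are hypothetical.

```python
ALPHA_LEVELS: list[float] = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]


def factuality_options(alpha_levels: list[float]) -> list[int]:
    """Hypothetical helper: convert alpha levels to integer percentages, ascending."""
    return sorted(round((1 - a) * 100) for a in alpha_levels)


def alpha_from_pct(target_factuality_pct: int) -> float:
    """Same conversion app.py applies to the slider value."""
    return round((100 - target_factuality_pct) / 100, 2)


options = factuality_options(ALPHA_LEVELS)

# Every slider option should map back to an alpha that has a precomputed threshold.
missing = [p for p in options if alpha_from_pct(p) not in ALPHA_LEVELS]
```

Rounding to two decimals in `alpha_from_pct` matters here: the threshold lookup compares against the alpha values stored by `precompute.py`, so the converted value has to match one of them.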

demo/data/samples.json ADDED

	@@ -0,0 +1,322 @@

+{
+  "metadata": {
+    "generated_at": "2026-03-29T17:28:52.797800+00:00",
+    "model": "gpt-4.1-mini",
+    "scoring_method": "relevance"
+  },
+  "samples": [
+    {
+      "query": "Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?",
+      "dataset": "medlf_qa",
+      "mode": "conditional",
+      "group": "live_qa",
+      "scoring_method": "relevance",
+      "rag_answer": "Yes, you can stop using the menopausal hormone therapy patch after 4.5 months, but it is generally recommended that attempts to taper or discontinue the medication be made at intervals of 3 to 6 months. It is important to consider your symptoms and discuss with your healthcare provider the best approach for stopping the therapy safely.",
+      "retrieved_docs": [
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: For the treatment of vasomotor symptoms, treatment should be initiated with the 6' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=21 fileposition=21 score=0.8116",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: Attempts to taper or discontinue the medication should be made at 3- to 6-month intervals' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=26 fileposition=26 score=0.7919",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: Six (6' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=16 fileposition=16 score=0.7711",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: The dose should be adjusted as necessary to control symptoms' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=24 fileposition=24 score=0.7696",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: 5, 9' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=17 fileposition=17 score=0.7678",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: This system is applied to the skin once weekly for the treatment of vasomotor symptoms related to menopause.' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=28 fileposition=28 score=0.7627",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: 375, 12' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=18 fileposition=18 score=0.7588",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: 5 cm2 (0' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=22 fileposition=22 score=0.7534",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: 5, 15, 18' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=19 fileposition=19 score=0.7519",
+        "page_content='Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?: Clinical responses (relief of symptoms) at the lowest effective dose should be the guide for establishing administration of the Climara system, especially in women with an intact uterus' metadata={'source': 'Can I stop using the patch (menopausal hormone therapy) after only 4.5 months?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=25 fileposition=25 score=0.7484"
+      ],
+      "subclaims": [
+        {
+          "id": 0,
+          "text": "You can stop using the menopausal hormone therapy patch after 4.5 months",
+          "score": 0.5595795678160917
+        },
+        {
+          "id": 1,
+          "text": " It is generally recommended to attempt tapering or discontinuing the medication at intervals of 3 to 6 months",
+          "score": 0.31861125529509804
+        },
+        {
+          "id": 2,
+          "text": " It is important to consider your symptoms when stopping menopausal hormone therapy",
+          "score": 0.3732023054183166
+        },
+        {
+          "id": 3,
+          "text": " You should discuss with your healthcare provider the best approach for safely stopping the therapy",
+          "score": 0.2703604207878165
+        }
+      ]
+    },
+    {
+      "query": "Would you recommend something a bit stronger than meloxicam?",
+      "dataset": "medlf_qa",
+      "mode": "conditional",
+      "group": "kqa_golden",
+      "scoring_method": "relevance",
+      "rag_answer": "If you are looking for something stronger than meloxicam, options include combining drugs from different classes or considering opioid medications. Opioids such as tramadol (Conzip) and oxycodone (Oxyado, Oxycontin) are more potent pain relievers but come with significant risks, including addiction, toxicity, and adverse cognitive and respiratory effects. It's important to consult with a physician before changing your treatment regimen to carefully weigh the benefits and risks.",
+      "retrieved_docs": [
+        "page_content='Would you recommend something a bit stronger than meloxicam?: For a stronger pain relief effect, combining two drugs from different classes or medications from the opioid group is possible' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=41 fileposition=41 score=0.7034",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: NSAIDs are generally more potent but carry a risk of harming the kidneys and other side effects' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=38 fileposition=38 score=0.6924",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: In case of acute and severe pain, NSAIDs like Meloxicam are often recommended' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=40 fileposition=40 score=0.6871",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: Opioids such as Tramadol (Conzip) and Oxycodone (Oxyado, Oxycontin) are the most potent but carry a significant risk of addiction, toxicity, and cognitive and respiratory adverse effects' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=39 fileposition=39 score=0.6863",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: It is essential to consult with a physician before changing treatment regimens due to the risk of addiction and potential side effects' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=42 fileposition=42 score=0.6812",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: There are generally three main types of drugs used for pain management: acetaminophen (paracetamol), NSAIDs (like Meloxicam), and opioids' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=36 fileposition=36 score=0.6810",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: NSAIDs like Meloxicam carry a risk of harming the kidneys and other side effects' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=46 fileposition=46 score=0.6799",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: Opioids such as Tramadol (Conzip) and Oxycodone (Oxyado, Oxycontin) are the most potent pain management drugs' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=47 fileposition=47 score=0.6730",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: NSAIDs like Meloxicam are generally more potent than acetaminophen' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=45 fileposition=45 score=0.6688",
+        "page_content='Would you recommend something a bit stronger than meloxicam?: Acetaminophen (paracetamol) is the least potent but has the safest side effect profile' metadata={'source': 'Would you recommend something a bit stronger than meloxicam?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=37 fileposition=37 score=0.6409"
+      ],
+      "subclaims": [
+        {
+          "id": 0,
+          "text": "If you are looking for something stronger than meloxicam, options include combining drugs from different classes",
+          "score": 0.38080484418115046
+        },
+        {
+          "id": 1,
+          "text": " Opioids such as tramadol (Conzip) and oxycodone (Oxyado, Oxycontin) are more potent pain relievers",
+          "score": 0.2872582910028152
+        },
+        {
+          "id": 2,
+          "text": " Opioids come with significant risks including addiction, toxicity, and adverse cognitive effects",
+          "score": 0.19291751764168302
+        },
+        {
+          "id": 3,
+          "text": " Opioids also have adverse respiratory effects",
+          "score": 0.1868863743184725
+        },
+        {
+          "id": 4,
+          "text": " It is important to consult with a physician before changing your treatment regimen",
+          "score": 0.1655506771407161
+        },
+        {
+          "id": 5,
+          "text": " Consulting a physician helps to carefully weigh the benefits and risks of the treatment",
+          "score": 0.17235453684696078
+        }
+      ]
+    },
+    {
+      "query": "how soon does losartan afffect blood pressure",
+      "dataset": "medlf_qa",
+      "mode": "conditional",
+      "group": "medication_qa",
+      "scoring_method": "relevance",
+      "rag_answer": "The clinical effect of losartan on blood pressure typically becomes apparent within two weeks. Peak effects are observed around 6 hours after taking the medication, with these peak effects being moderately larger than the trough (lowest) effects.",
+      "retrieved_docs": [
+        "page_content='how soon does losartan afffect blood pressure: Hypertension … Peak (6 hour) effects were uniformly, but moderately, larger than trough effects,' metadata={'source': 'how soon does losartan afffect blood pressure', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=51 fileposition=51 score=0.7362",
+        "page_content='how soon does losartan afffect blood pressure: The peak effects of losartan were uniformly, but moderately, larger than trough effects, with the trough-to-peak ratio for systolic and diastolic responses between 50-95%.' metadata={'source': 'how soon does losartan afffect blood pressure', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=52 fileposition=52 score=0.7016",
+        "page_content='how soon does losartan afffect blood pressure: Clinical effect of losartan typically becomes apparent within two weeks.' metadata={'source': 'how soon does losartan afffect blood pressure', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=53 fileposition=53 score=0.6849"
+      ],
+      "subclaims": [
+        {
+          "id": 0,
+          "text": "The clinical effect of losartan on blood pressure typically becomes apparent within two weeks",
+          "score": 0.46106826938943024
+        },
+        {
+          "id": 1,
+          "text": " Peak effects of losartan are observed around 6 hours after taking the medication",
+          "score": 0.42004030890343347
+        },
+        {
+          "id": 2,
+          "text": " The peak effects of losartan are moderately larger than the trough (lowest) effects",
+          "score": 0.4264968324332668
+        }
+      ]
+    },
+    {
+      "query": "Are there any signs I should look out for if Lyme develops and I should follow up?",
+      "dataset": "medlf_qa",
+      "mode": "conditional",
+      "group": "kqa_silver_wogold",
+      "scoring_method": "relevance",
+      "rag_answer": "Yes, there are several signs to watch for if Lyme disease develops after a tick bite, and you should follow up with your healthcare provider if you notice any of these symptoms:\n\n- Early symptoms (within a week to a few weeks after the bite) may include:\n  - Fever\n  - Fatigue\n  - Headache\n  - Muscle and joint aches\n  - Swollen lymph nodes\n  - A characteristic skin rash called erythema migrans, which often looks like a bull’s-eye on the skin\n\n- If left untreated, symptoms can progress to more severe issues such as:\n  - Severe fatigue\n  - Stiff, aching neck\n  - Severe headaches\n  - Irregularities in heartbeat\n  - Changes in vision\n  - Inflammatory arthritis, often affecting the knee\n\nIf you experience any of these signs or symptoms after suspected or known exposure to ticks, it’s important to schedule an appointment with your healthcare provider for evaluation and possible treatment. Regular check-ins can be helpful to monitor for any developments.",
+      "retrieved_docs": [
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: Regular check-ins with a healthcare provider are beneficial for managing and observing the potential developments of Lyme disease following suspected or known tick exposure.' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=68 fileposition=68 score=0.7716",
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: If you exhibit any of these signs or symptoms after suspected or known exposure to ticks, schedule an appointment with your healthcare provider' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=66 fileposition=66 score=0.7560",
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: If such an infection develops, early signs of Lyme disease may occur within a week to a few weeks of a tick bite' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=62 fileposition=62 score=0.7273",
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: If left untreated, symptoms can progress to severe fatigue, a stiff, aching neck, and severe headaches' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=64 fileposition=64 score=0.7108",
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: One may experience irregularities in heartbeat, changes in vision, or an inflammatory arthritis often affecting the knee as well' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=65 fileposition=65 score=0.6972",
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: These early symptoms can include: fever, fatigue, headache, muscle and joint aches, swollen lymph nodes, and a characteristic skin rash called erythema migrans, which often looks like a bull’s-eye on the skin' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=63 fileposition=63 score=0.6907",
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: Lyme disease is a bacterial infection that can be transmitted to humans through the bite of infected black-legged ticks' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=61 fileposition=61 score=0.6902",
+        "page_content='Are there any signs I should look out for if Lyme develops and I should follow up?: Lyme disease is a bacterial infection that is transmitted to humans through the bite of infected black-legged ticks.' metadata={'source': 'Are there any signs I should look out for if Lyme develops and I should follow up?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=67 fileposition=67 score=0.6879"
+      ],
+      "subclaims": [
+        {
+          "id": 0,
+          "text": "Lyme disease can develop after a tick bite",
+          "score": 0.36037872147634314
+        },
+        {
+          "id": 1,
+          "text": " Early symptoms of Lyme disease may appear within a week to a few weeks after the tick bite",
+          "score": 0.3590103639552498
+        },
+        {
+          "id": 2,
+          "text": " Early symptoms of Lyme disease include fever",
+          "score": 0.330641627741826
+        },
+        {
+          "id": 3,
+          "text": " Early symptoms of Lyme disease include fatigue",
+          "score": 0.3326410987371023
+        },
+        {
+          "id": 4,
+          "text": " Early symptoms of Lyme disease include headache",
+          "score": 0.3409522381113656
+        },
+        {
+          "id": 5,
+          "text": " Early symptoms of Lyme disease include muscle and joint aches",
+          "score": 0.31654988182829147
+        },
+        {
+          "id": 6,
+          "text": " Early symptoms of Lyme disease include swollen lymph nodes",
+          "score": 0.3134207503110557
+        },
+        {
+          "id": 7,
+          "text": " Early symptoms of Lyme disease include a characteristic skin rash called erythema migrans",
+          "score": 0.31260172708098244
+        },
+        {
+          "id": 8,
+          "text": " Erythema migrans often looks like a bull’s-eye on the skin",
+          "score": 0.2383296001897222
+        },
+        {
+          "id": 9,
+          "text": " If Lyme disease is left untreated, symptoms can progress to more severe issues",
+          "score": 0.3353740959350072
+        },
+        {
+          "id": 10,
+          "text": " Severe issues from untreated Lyme disease include severe fatigue",
+          "score": 0.3008999281508631
+        },
+        {
+          "id": 11,
+          "text": " Severe issues from untreated Lyme disease include stiff, aching neck",
+          "score": 0.32575460132098455
+        },
+        {
+          "id": 12,
+          "text": " Severe issues from untreated Lyme disease include severe headaches",
+          "score": 0.31177354749866565
+        },
+        {
+          "id": 13,
+          "text": " Severe issues from untreated Lyme disease include irregularities in heartbeat",
+          "score": 0.31990244889782615
+        },
+        {
+          "id": 14,
+          "text": " Severe issues from untreated Lyme disease include changes in vision",
+          "score": 0.30950684605158113
+        },
+        {
+          "id": 15,
+          "text": " Severe issues from untreated Lyme disease include inflammatory arthritis",
+          "score": 0.3049160752333055
+        },
+        {
+          "id": 16,
+          "text": " Inflammatory arthritis from Lyme disease often affects the knee",
+          "score": 0.2874386307287703
+        },
+        {
+          "id": 17,
+          "text": " If you experience signs or symptoms after tick exposure, you should schedule an appointment with your healthcare provider",
+          "score": 0.3763904855114536
+        },
+        {
+          "id": 18,
+          "text": " Regular check-ins with a healthcare provider can help monitor for developments in Lyme disease",
+          "score": 0.3957411848526053
+        }
+      ]
+    },
+    {
+      "query": "How do adults usually get botulism?",
+      "dataset": "medlf_qa",
+      "mode": "conditional",
+      "group": "healthsearch_qa",
+      "scoring_method": "relevance",
+      "rag_answer": "Adults usually get botulism in one of three main ways: \n\n1. **Foodborne botulism**: This is the most common form in adults and occurs when they eat foods contaminated with the botulinum toxin. This often happens with home-canned, home-preserved, or fermented foods prepared under conditions that encourage bacterial growth.\n\n2. **Wound botulism**: This occurs when the bacteria enter a wound, which is more common in people who inject drugs.\n\n3. **Inhalation botulism**: This is very rare and can happen if the pure toxin is released into the air, such as during laboratory accidents or if used as a biological weapon.\n\nOverall, foodborne botulism is the most frequent way adults contract the illness.",
+      "retrieved_docs": [
+        "page_content='How do adults usually get botulism?: In adults, botulism can be contracted in a few different ways: through food, wound botulism, and inhalation botulism' metadata={'source': 'How do adults usually get botulism?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=78 fileposition=78 score=0.6993",
+        "page_content='How do adults usually get botulism?: Foodborne botulism is the most common form in adults and occurs when adults eat foods that are contaminated with the botulism toxin' metadata={'source': 'How do adults usually get botulism?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=79 fileposition=79 score=0.6891",
+        "page_content='How do adults usually get botulism?: Wound botulism happens when the bacteria enter a wound, usually in people who inject drugs' metadata={'source': 'How do adults usually get botulism?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=81 fileposition=81 score=0.6884",
+        "page_content='How do adults usually get botulism?: Botulism is a rare but serious illness caused by a toxin that attacks the body's nerves' metadata={'source': 'How do adults usually get botulism?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=77 fileposition=77 score=0.6836",
+        "page_content='How do adults usually get botulism?: This typically happens with home-canned, home-preserved, or fermented foods that have been prepared in conditions that encourage bacterial growth' metadata={'source': 'How do adults usually get botulism?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=80 fileposition=80 score=0.6832",
+        "page_content='How do adults usually get botulism?: In very rare instances, inhalation botulism can occur when the pure toxin is released into the air, such as during laboratory accidents or if used as a biological weapon.' metadata={'source': 'How do adults usually get botulism?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=83 fileposition=83 score=0.6773",
+        "page_content='How do adults usually get botulism?: Inhalation botulism is very rare and can occur when the pure toxin is released into the air as it can in laboratory accidents or if used as a biological weapon' metadata={'source': 'How do adults usually get botulism?', 'file_path': '/home/ryoya.awano/ResponseQualityAssessment/data/processed/MedLFQA/sampled_10_medlf_qa_documents.txt'} indice=82 fileposition=82 score=0.6543"
+      ],
+      "subclaims": [
+        {
+          "id": 0,
+          "text": "Adults usually get botulism in one of three main ways",
+          "score": 0.42031683919779217
+        },
+        {
+          "id": 1,
+          "text": " Foodborne botulism is the most common form in adults",
+          "score": 0.39594970883087005
+        },
+        {
+          "id": 2,
+          "text": " Foodborne botulism occurs when adults eat foods contaminated with the botulinum toxin",
+          "score": 0.3942247114653798
+        },
+        {
+          "id": 3,
+          "text": " Foodborne botulism often happens with home-canned, home-preserved, or fermented foods prepared under conditions that encourage bacterial growth",
+          "score": 0.3146562102397841
+        },
+        {
+          "id": 4,
+          "text": " Wound botulism occurs when the bacteria enter a wound",
+          "score": 0.2937099827086362
+        },
+        {
+          "id": 5,
+          "text": " Wound botulism is more common in people who inject drugs",
+          "score": 0.29189200145131067
+        },
+        {
+          "id": 6,
+          "text": " Inhalation botulism is very rare",
+          "score": 0.3344457211496001
+        },
+        {
+          "id": 7,
+          "text": " Inhalation botulism can happen if the pure toxin is released into the air",
+          "score": 0.3194226107069425
+        },
+        {
+          "id": 8,
+          "text": " Inhalation botulism can occur during laboratory accidents",
+          "score": 0.3357236089778982
+        },
+        {
+          "id": 9,
+          "text": " Inhalation botulism can occur if used as a biological weapon",
+          "score": 0.31917884005382047
+        },
+        {
+          "id": 10,
+          "text": " Foodborne botulism is the most frequent way adults contract botulism",
+          "score": 0.3974304925095534
+        }
+      ]
+    }
+  ]
+}
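
Each `retrieved_docs` entry in the sample above is a single flat string, not a nested object. The following is a minimal sketch of recovering the trailing `indice`, `fileposition`, and `score` fields, assuming every entry ends with `indice=<int> fileposition=<int> score=<float>` as the entries shown here do. `parse_doc_tail` is a hypothetical helper for illustration and is not part of the repository; the example string is abbreviated.

```python
import re

# Matches the trailing "indice=78 fileposition=78 score=0.6993" fields.
_TAIL = re.compile(r"indice=(\d+) fileposition=(\d+) score=([0-9.]+)$")


def parse_doc_tail(doc: str) -> dict:
    """Extract indice, fileposition and score from one retrieved_docs string."""
    m = _TAIL.search(doc)
    if m is None:
        raise ValueError("string does not end with the expected fields")
    return {
        "indice": int(m.group(1)),
        "fileposition": int(m.group(2)),
        "score": float(m.group(3)),
    }


# Abbreviated stand-in for one entry; only the tail matters to the regex.
doc = "page_content='(abbreviated)' metadata={} indice=78 fileposition=78 score=0.6993"
parsed = parse_doc_tail(doc)
print(parsed)
```

Anchoring the pattern at the end of the string avoids matching any `score=` text that might appear inside `page_content` itself.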

demo/data/thresholds.csv ADDED

	@@ -0,0 +1,145 @@

+dataset,mode,scoring_method,group,alpha,q_hat
+medlf_qa,marginal,relevance,default,0.05,0.26235458225908953
+medlf_qa,marginal,relevance,default,0.1,0.26235458225908953
+medlf_qa,marginal,relevance,default,0.15,0.26235458225908953
+medlf_qa,marginal,relevance,default,0.2,0.26235458225908953
+medlf_qa,marginal,relevance,default,0.25,0.26235458225908953
+medlf_qa,marginal,relevance,default,0.3,0.26235458225908953
+medlf_qa,marginal,relevance,default,0.35,-1
+medlf_qa,marginal,relevance,default,0.4,-1
+medlf_qa,conditional,relevance,live_qa,0.05,0.26235458225908953
+medlf_qa,conditional,relevance,live_qa,0.1,0.26235458225908953
+medlf_qa,conditional,relevance,live_qa,0.15,0.26235458225908953
+medlf_qa,conditional,relevance,live_qa,0.2,0.26235458225908953
+medlf_qa,conditional,relevance,live_qa,0.25,0.26235458225908953
+medlf_qa,conditional,relevance,live_qa,0.3,0.26235458225908953
+medlf_qa,conditional,relevance,live_qa,0.35,0.26235458225908953
+medlf_qa,conditional,relevance,live_qa,0.4,0.26235458225908953
+medlf_qa,conditional,relevance,kqa_golden,0.05,-1
+medlf_qa,conditional,relevance,kqa_golden,0.1,-1
+medlf_qa,conditional,relevance,kqa_golden,0.15,-1
+medlf_qa,conditional,relevance,kqa_golden,0.2,-1
+medlf_qa,conditional,relevance,kqa_golden,0.25,-1
+medlf_qa,conditional,relevance,kqa_golden,0.3,-1
+medlf_qa,conditional,relevance,kqa_golden,0.35,-1
+medlf_qa,conditional,relevance,kqa_golden,0.4,-1
+medlf_qa,conditional,relevance,medication_qa,0.05,-1
+medlf_qa,conditional,relevance,medication_qa,0.1,-1
+medlf_qa,conditional,relevance,medication_qa,0.15,-1
+medlf_qa,conditional,relevance,medication_qa,0.2,-1
+medlf_qa,conditional,relevance,medication_qa,0.25,-1
+medlf_qa,conditional,relevance,medication_qa,0.3,-1
+medlf_qa,conditional,relevance,medication_qa,0.35,-1
+medlf_qa,conditional,relevance,medication_qa,0.4,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.05,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.1,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.15,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.2,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.25,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.3,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.35,-1
+medlf_qa,conditional,relevance,kqa_silver_wogold,0.4,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.05,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.1,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.15,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.2,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.25,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.3,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.35,-1
+medlf_qa,conditional,relevance,healthsearch_qa,0.4,-1
+medlf_qa,marginal,cosine_similarity,default,0.05,0.3992308077305861
+medlf_qa,marginal,cosine_similarity,default,0.1,0.3992308077305861
+medlf_qa,marginal,cosine_similarity,default,0.15,0.3992308077305861
+medlf_qa,marginal,cosine_similarity,default,0.2,0.3992308077305861
+medlf_qa,marginal,cosine_similarity,default,0.25,0.3992308077305861
+medlf_qa,marginal,cosine_similarity,default,0.3,0.3992308077305861
+medlf_qa,marginal,cosine_similarity,default,0.35,-1
+medlf_qa,marginal,cosine_similarity,default,0.4,-1
+medlf_qa,conditional,cosine_similarity,live_qa,0.05,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,live_qa,0.1,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,live_qa,0.15,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,live_qa,0.2,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,live_qa,0.25,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,live_qa,0.3,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,live_qa,0.35,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,live_qa,0.4,0.3992308077305861
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.05,-1
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.1,-1
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.15,-1
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.2,-1
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.25,-1
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.3,-1
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.35,-1
+medlf_qa,conditional,cosine_similarity,kqa_golden,0.4,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.05,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.1,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.15,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.2,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.25,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.3,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.35,-1
+medlf_qa,conditional,cosine_similarity,medication_qa,0.4,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.05,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.1,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.15,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.2,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.25,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.3,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.35,-1
+medlf_qa,conditional,cosine_similarity,kqa_silver_wogold,0.4,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.05,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.1,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.15,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.2,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.25,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.3,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.35,-1
+medlf_qa,conditional,cosine_similarity,healthsearch_qa,0.4,-1
+medlf_qa,marginal,min_log_prob,default,0.05,1.002567365930604
+medlf_qa,marginal,min_log_prob,default,0.1,1.002567365930604
+medlf_qa,marginal,min_log_prob,default,0.15,1.002567365930604
+medlf_qa,marginal,min_log_prob,default,0.2,1.002567365930604
+medlf_qa,marginal,min_log_prob,default,0.25,1.002567365930604
+medlf_qa,marginal,min_log_prob,default,0.3,1.002567365930604
+medlf_qa,marginal,min_log_prob,default,0.35,-1
+medlf_qa,marginal,min_log_prob,default,0.4,-1
+medlf_qa,conditional,min_log_prob,live_qa,0.05,1.002567365930604
+medlf_qa,conditional,min_log_prob,live_qa,0.1,1.002567365930604
+medlf_qa,conditional,min_log_prob,live_qa,0.15,1.002567365930604
+medlf_qa,conditional,min_log_prob,live_qa,0.2,1.002567365930604
+medlf_qa,conditional,min_log_prob,live_qa,0.25,1.002567365930604
+medlf_qa,conditional,min_log_prob,live_qa,0.3,1.002567365930604
+medlf_qa,conditional,min_log_prob,live_qa,0.35,1.002567365930604
+medlf_qa,conditional,min_log_prob,live_qa,0.4,1.002567365930604
+medlf_qa,conditional,min_log_prob,kqa_golden,0.05,-1
+medlf_qa,conditional,min_log_prob,kqa_golden,0.1,-1
+medlf_qa,conditional,min_log_prob,kqa_golden,0.15,-1
+medlf_qa,conditional,min_log_prob,kqa_golden,0.2,-1
+medlf_qa,conditional,min_log_prob,kqa_golden,0.25,-1
+medlf_qa,conditional,min_log_prob,kqa_golden,0.3,-1
+medlf_qa,conditional,min_log_prob,kqa_golden,0.35,-1
+medlf_qa,conditional,min_log_prob,kqa_golden,0.4,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.05,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.1,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.15,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.2,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.25,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.3,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.35,-1
+medlf_qa,conditional,min_log_prob,medication_qa,0.4,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.05,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.1,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.15,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.2,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.25,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.3,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.35,-1
+medlf_qa,conditional,min_log_prob,kqa_silver_wogold,0.4,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.05,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.1,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.15,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.2,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.25,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.3,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.35,-1
+medlf_qa,conditional,min_log_prob,healthsearch_qa,0.4,-1
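
The CSV above is keyed by `(dataset, mode, scoring_method, group, alpha)`. The following is a minimal sketch of how a caller might look up `q_hat` for one key, using only the standard library. `lookup_q_hat` is a hypothetical helper, not the demo's actual lookup code, and the embedded text holds three rows copied from the file rather than all 144 data rows.

```python
import csv
import io
import math

# Three rows copied from thresholds.csv; the real file has 144 data rows.
CSV_TEXT = """dataset,mode,scoring_method,group,alpha,q_hat
medlf_qa,marginal,relevance,default,0.05,0.26235458225908953
medlf_qa,marginal,relevance,default,0.35,-1
medlf_qa,conditional,relevance,healthsearch_qa,0.05,-1
"""


def lookup_q_hat(rows, dataset, mode, scoring_method, group, alpha):
    """Return q_hat for the first row matching all five key columns."""
    key = (dataset, mode, scoring_method, group)
    for row in rows:
        row_key = (row["dataset"], row["mode"], row["scoring_method"], row["group"])
        # Compare alpha with a tolerance because it is parsed from text.
        if row_key == key and math.isclose(float(row["alpha"]), alpha):
            return float(row["q_hat"])
    raise KeyError(key + (alpha,))


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
q_hat = lookup_q_hat(rows, "medlf_qa", "marginal", "relevance", "default", 0.05)
print(q_hat)
```

The file itself does not say what a `q_hat` of `-1` means. Under the `score >= q_hat` rule used by `apply_threshold()`, such a row keeps every subclaim whose score is at least -1.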

demo/inference_api.py ADDED

	@@ -0,0 +1,302 @@

+# demo/inference_api.py
+# Single-query inference wrapper around src/ core logic.
+# Does NOT modify src/ — imports only.
+#
+# Heavy resources (FAISSIndexManager, SubclaimScorer) are constructed here
+# via build_faiss_manager() / build_scorer() and cached in app.py with
+# @st.cache_resource, so they survive Streamlit re-runs.
+import os
+import re
+import glob
+import yaml
+from typing import TYPE_CHECKING, TypedDict
+if TYPE_CHECKING:
+    # FAISSIndexManager and SubclaimScorer transitively import torch/transformers
+    # via langchain_text_splitters (file_manager.py).  To avoid loading these
+    # heavy packages at module import time (e.g. in HF Spaces where FAISS is
+    # unused), we guard them here and do the real import inside each build_*()
+    # function instead.  String annotations ("FAISSIndexManager") ensure the
+    # type hints remain valid at runtime without triggering the import.
+    from src.common.faiss_manager import FAISSIndexManager
+    from src.subclaim_processor.scorer.subclaim_scorer import SubclaimScorer
+def _expand_env_vars(obj):
+    """Recursively expand environment variables in string values."""
+    if isinstance(obj, dict):
+        return {k: _expand_env_vars(v) for k, v in obj.items()}
+    if isinstance(obj, list):
+        return [_expand_env_vars(v) for v in obj]
+    if isinstance(obj, str):
+        return os.path.expandvars(obj)
+    return obj
+from src.common.llm.openai_rag_agent import OpenAIRAGAgent
+from src.common.llm.openai_atomicfact_generator import OpenAIAtomicFactGenerator
+from src.common.llm.openai_llm_agent import OpenAILLMAgent
+from demo.constants import DEFAULT_SCORING_METHOD
+# ── Config loading (module-level cache) ──────────────────────────────────────
+_main_config: dict | None = None
+_dataset_config: dict | None = None
+def _load_main_config() -> dict:
+    global _main_config
+    if _main_config is None:
+        with open("conf/config.yaml") as f:
+            _main_config = yaml.safe_load(f)
+    return _main_config
+def _load_dataset_config() -> dict:
+    global _dataset_config
+    if _dataset_config is None:
+        with open("conf/dataset_config.yaml") as f:
+            _dataset_config = _expand_env_vars(yaml.safe_load(f))
+    return _dataset_config
+# ── FAISS index path resolution ───────────────────────────────────────────────
+def _get_index_paths(dataset: str) -> tuple[str, str]:
+    """Resolve index_path and indice2fm_path for a dataset.
+    Scans the index_store directory for files created by main.py
+    (index_{query_size}.faiss) and picks the one with the largest query_size.
+    """
+    dataset_cfg = _load_dataset_config()
+    if dataset not in dataset_cfg["datasets"]:
+        raise ValueError(f"Unknown dataset: '{dataset}'. "
+                         f"Available: {list(dataset_cfg['datasets'].keys())}")
+    index_store = dataset_cfg["datasets"][dataset]["index_store"]
+    faiss_files = glob.glob(f"{index_store}/index_*.faiss")
+    if not faiss_files:
+        raise FileNotFoundError(
+            f"No FAISS index found in '{index_store}'. "
+            "Run main.py first to build the index."
+        )
+    def _query_size(path: str) -> int:
+        m = re.search(r"index_(\d+)\.faiss$", path)
+        return int(m.group(1)) if m else 0
+    index_path = max(faiss_files, key=_query_size)
+    indice2fm_path = re.sub(r"index_(\d+)\.faiss$", r"indice2fm_\1.json", index_path)
+    return index_path, indice2fm_path
+# ── Resource builders (call from app.py wrapped with @st.cache_resource) ─────
+def build_faiss_manager(dataset: str) -> "FAISSIndexManager":
+    """Build a FAISSIndexManager for the given dataset.
+    Intended to be wrapped with @st.cache_resource in app.py.
+    FAISSIndexManager is imported here (not at module level) to avoid pulling
+    torch/transformers via langchain_text_splitters at startup.
+    """
+    from src.common.faiss_manager import FAISSIndexManager  # noqa: PLC0415
+    config = _load_main_config()
+    index_path, indice2fm_path = _get_index_paths(dataset)
+    return FAISSIndexManager(
+        index_truncation_config=config["index"]["truncation_config"],
+        index_path=index_path,
+        indice2fm_path=indice2fm_path,
+    )
+def build_scorer(dataset: str) -> "SubclaimScorer":
+    """Build a SubclaimScorer for the given dataset.
+    SubclaimScorer loads its own FAISS index internally, so wrap this with
+    @st.cache_resource in app.py. Only needed for "relevance" and
+    "cosine_similarity" scoring methods.
+    SubclaimScorer is imported here (not at module level) for the same reason
+    as FAISSIndexManager — to avoid pulling torch/transformers at startup.
+    """
+    from src.subclaim_processor.scorer.subclaim_scorer import SubclaimScorer  # noqa: PLC0415
+    config = _load_main_config()
+    index_path, indice2fm_path = _get_index_paths(dataset)
+    return SubclaimScorer(
+        index_truncation_config=config["index"]["truncation_config"],
+        embedding_model=config["index"]["embedding_model"],
+        index_path=index_path,
+        indice2fm_path=indice2fm_path,
+        frequency_score_model=config["conformal_prediction"]["frequency_score_model"],
+    )
+# ── TypedDict definitions ─────────────────────────────────────────────────────
+class Subclaim(TypedDict):
+    id: int
+    text: str
+    score: float  # semantics depend on SubclaimResult.scoring_method
+class SubclaimResult(TypedDict):
+    query: str
+    dataset: str
+    mode: str           # "marginal" or "conditional"
+    group: str          # group name, or "default" for marginal / non-grouped datasets
+    scoring_method: str  # e.g. "relevance"; used as key for threshold lookup in thresholds.csv
+    rag_answer: str
+    retrieved_docs: list[str]
+    subclaims: list[Subclaim]
+class FilteredResult(TypedDict):
+    subclaims: list[Subclaim]
+    q_hat: float
+    keep_count: int
+    remove_count: int
+# ── Score computation ─────────────────────────────────────────────────────────
+def _compute_score(
+    scoring_method: str,
+    subclaim_text: str,
+    subclaim_probs: list[tuple[str, float]],  # (token, probability) — probability = exp(log_prob)
+    query: str,
+    retrieved_docs: list[str],
+    scorer: "SubclaimScorer | None",
+) -> float:
+    if scoring_method == "relevance":
+        if scorer is None:
+            raise ValueError("scorer is required for scoring_method='relevance'")
+        config = _load_main_config()
+        cp_config = config["conformal_prediction"]
+        return float(scorer.score(
+            claim=subclaim_text,
+            retrieved_docs=retrieved_docs,
+            aggregation_strategy=cp_config["aggregation_strategy"],
+            scoring_strategy=cp_config["scoring_strategy"],
+        ))
+    elif scoring_method == "cosine_similarity":
+        if scorer is None:
+            raise ValueError("scorer is required for scoring_method='cosine_similarity'")
+        return float(scorer.cosine_similarity(subclaim_text, query))
+    elif scoring_method == "min_log_prob":
+        # min(probability) = least confident token; higher value = more confident subclaim
+        probs = [p for _, p in subclaim_probs]
+        return min(probs) if probs else 0.0
+    else:
+        raise ValueError(
+            f"Unknown scoring method: '{scoring_method}'. "
+            "Supported: 'relevance', 'cosine_similarity', 'min_log_prob'"
+        )
+# ── Main inference functions ──────────────────────────────────────────────────
+def process_query(
+    query: str,
+    dataset: str,
+    mode: str,
+    group: str,
+    faiss_manager: "FAISSIndexManager",
+    scorer: "SubclaimScorer | None" = None,
+    scoring_method: str = DEFAULT_SCORING_METHOD,
+) -> SubclaimResult:
+    """Run the full inference pipeline for a single query.
+    Heavy: makes OpenAI API calls. Trigger only on button click and cache the
+    result in st.session_state.result.
+    scorer is required when scoring_method is "relevance" or "cosine_similarity".
+    Pass a @st.cache_resource-wrapped instance from app.py.
+    """
+    config = _load_main_config()
+    rag_cfg = config["rag"]
+    # 1. FAISS search
+    retrieved_docs: list[str] = faiss_manager.search_faiss_index(
+        query,
+        top_k=rag_cfg["retrival_topk"],
+        threshold=rag_cfg["retrival_threshold"],
+    )
+    # 2. RAG answer generation
+    rag_agent = OpenAIRAGAgent(faiss_manager, model=rag_cfg["response_model"])
+    response = rag_agent.answer(
+        query,
+        retrieved_docs,
+        temperature=rag_cfg["response_temperature"],
+        n_samples=1,
+    )
+    rag_answer: str = response.choices[0].message.content
+    # 3. Subclaim decomposition (returns zip iterator of (text, [(token, prob)]))
+    generator = OpenAIAtomicFactGenerator(model=rag_cfg["fact_generation_model"])
+    subclaims_with_probs = list(generator.get_facts_from_text(rag_answer))
+    # 4. Score each subclaim
+    subclaims: list[Subclaim] = []
+    for i, (text, token_probs) in enumerate(subclaims_with_probs):
+        score = _compute_score(
+            scoring_method=scoring_method,
+            subclaim_text=text,
+            subclaim_probs=token_probs,
+            query=query,
+            retrieved_docs=retrieved_docs,
+            scorer=scorer,
+        )
+        subclaims.append(Subclaim(id=i, text=text, score=score))
+    return SubclaimResult(
+        query=query,
+        dataset=dataset,
+        mode=mode,
+        group=group,
+        scoring_method=scoring_method,
+        rag_answer=rag_answer,
+        retrieved_docs=retrieved_docs,
+        subclaims=subclaims,
+    )
+_REINTEGRATE_INSTRUCTION = (
+    "You are given an original answer and a list of statements to remove from it. "
+    "Rewrite the answer by removing information corresponding to the listed statements "
+    "while keeping the remaining text fluent and natural. "
+    "Return only the revised answer without any explanation."
+)
+def reintegrate_subclaims(
+    rag_answer: str,
+    removed_subclaims: list[str],
+    model: str = "gpt-4o-mini",
+) -> str:
+    """Rewrite rag_answer by removing the specified subclaims via LLM."""
+    if not removed_subclaims:
+        return rag_answer
+    removed_text = "\n".join(f"- {t}" for t in removed_subclaims)
+    question = f"Original answer:\n{rag_answer}\n\nStatements to remove:\n{removed_text}"
+    agent = OpenAILLMAgent(instruction=_REINTEGRATE_INSTRUCTION, model=model)
+    return agent.answer(question)
+def apply_threshold(result: SubclaimResult, q_hat: float) -> FilteredResult:
+    """Apply a conformal threshold to a cached SubclaimResult.
+    Light: no API calls. Safe to run on every slider move.
+    Subclaims with score >= q_hat are kept.
+    """
+    subclaims = result["subclaims"]
+    keep_count = sum(1 for sc in subclaims if sc["score"] >= q_hat)
+    remove_count = len(subclaims) - keep_count
+    return FilteredResult(
+        subclaims=subclaims,
+        q_hat=q_hat,
+        keep_count=keep_count,
+        remove_count=remove_count,
+    )
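
`apply_threshold()` returns only counts, while `reintegrate_subclaims()` expects the texts of the removed subclaims. The following is a minimal sketch of the same `score >= q_hat` rule that returns the two partitions instead. `split_by_threshold` is a hypothetical helper, the threshold of 0.3 is illustrative rather than taken from thresholds.csv, and the scores are rounded from the botulism sample in samples.json.

```python
def split_by_threshold(subclaims, q_hat):
    """Partition subclaims into (kept, removed) using score >= q_hat."""
    kept = [sc for sc in subclaims if sc["score"] >= q_hat]
    removed = [sc for sc in subclaims if sc["score"] < q_hat]
    return kept, removed


# Three subclaims from the botulism sample (scores rounded to 4 places).
subclaims = [
    {"id": 0, "text": "Adults usually get botulism in one of three main ways", "score": 0.4203},
    {"id": 4, "text": "Wound botulism occurs when the bacteria enter a wound", "score": 0.2937},
    {"id": 5, "text": "Wound botulism is more common in people who inject drugs", "score": 0.2919},
]

kept, removed = split_by_threshold(subclaims, q_hat=0.3)
removed_texts = [sc["text"] for sc in removed]
print([sc["id"] for sc in kept], [sc["id"] for sc in removed])
```

The `removed_texts` list has the shape that `reintegrate_subclaims()` takes as its `removed_subclaims` argument.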

demo/precompute.py ADDED

	@@ -0,0 +1,311 @@

+# demo/precompute.py
+# Offline script — run once before launching the demo.
+#
+# Produces:
+#   demo/data/thresholds.csv  — conformal thresholds for every
+#                               (dataset, mode, scoring_method, group, alpha)
+#   demo/data/samples.json    — pre-computed SubclaimResult objects for
+#                               sample queries (no API calls at demo runtime)
+#
+# Usage:
+#   python -m demo.precompute
+#
+# Prerequisites:
+#   Run main.py for each dataset first to generate FAISS indices and the
+#   subclaim data files in data/out/.
+import csv
+import glob
+import json
+import os
+import yaml
+from collections import defaultdict
+from datetime import datetime, timezone
+def _expand_env_vars(obj):
+    """Recursively expand environment variables in string values."""
+    if isinstance(obj, dict):
+        return {k: _expand_env_vars(v) for k, v in obj.items()}
+    if isinstance(obj, list):
+        return [_expand_env_vars(v) for v in obj]
+    if isinstance(obj, str):
+        return os.path.expandvars(obj)
+    return obj
+from src.calibration.utils import compute_threshold, load_subclaim_data
+from demo.constants import (
+    ALPHA_LEVELS,
+    DEFAULT_SCORING_METHOD,
+    SCORING_METHOD_TO_DATA_KEY,
+)
+# ── Config ────────────────────────────────────────────────────────────────────
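+def _load_yaml(path: str):
+    """Parse a YAML file, closing the file handle once parsing is done."""
+    with open(path) as f:
+        return yaml.safe_load(f)
+_main_cfg = _load_yaml("conf/config.yaml")
+_path_cfg = _expand_env_vars(_load_yaml("conf/path_config.yaml"))
+_dataset_cfg = _expand_env_vars(_load_yaml("conf/dataset_config.yaml"))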
+A_VALUE: float = _main_cfg["conformal_prediction"]["a_value"]
+RESPONSE_MODEL: str = _main_cfg["rag"]["response_model"]
+OUTPUT_THRESHOLDS = "demo/data/thresholds.csv"
+OUTPUT_SAMPLES = "demo/data/samples.json"
+# Number of entries held out per group (or per dataset for non-grouped) that
+# are reserved exclusively for samples.json and excluded from threshold
+# calibration.  Conformal guarantees require calibration and evaluation data
+# to be disjoint, so these entries must never appear in compute_thresholds().
+HOLDOUT_PER_GROUP: int = 1
+# ── Helpers ───────────────────────────────────────────────────────────────────
+def _split_calibration_holdout(
+    data: list[dict], is_grouped: bool
+) -> tuple[list[dict], list[dict]]:
+    """Split entries into (calibration, holdout) sets.
+    Holdout entries are reserved for samples.json and excluded from threshold
+    calibration to prevent data leakage.  When a group has only HOLDOUT_PER_GROUP
+    or fewer entries it is kept entirely in calibration and the holdout for that
+    group is empty (a warning is printed).
+    Returns (calib_data, holdout_data).
+    """
+    if not is_grouped:
+        if len(data) > HOLDOUT_PER_GROUP:
+            return data[:-HOLDOUT_PER_GROUP], data[-HOLDOUT_PER_GROUP:]
+        print(f"  WARNING: only {len(data)} entries — too few to split; all used for calibration, holdout is empty")
+        return data, []
+    grouped = _group_data(data)
+    calib: list[dict] = []
+    holdout: list[dict] = []
+    for grp, grp_entries in grouped.items():
+        if len(grp_entries) > HOLDOUT_PER_GROUP:
+            calib.extend(grp_entries[:-HOLDOUT_PER_GROUP])
+            holdout.extend(grp_entries[-HOLDOUT_PER_GROUP:])
+        else:
+            print(f"  WARNING: group '{grp}' has only {len(grp_entries)} entries — too few to split; all used for calibration")
+            calib.extend(grp_entries)
+    return calib, holdout
+def _find_calibration_file(dataset: str) -> str:
+    """Return the path to the latest subclaim data file for a dataset.
+    Looks for files matching the naming convention used by main.py:
+    data/out/{DatasetName}/{dataset}_{query_size}_subclaims_with_scores_{model}.json
+    """
+    full_name = _dataset_cfg["datasets"][dataset]["name"]
+    response_dir = os.path.join(_path_cfg["paths"]["response_dir"], full_name)
+    pattern = os.path.join(response_dir, f"{dataset}_*_subclaims_with_scores_*.json")
+    files = glob.glob(pattern)
+    if not files:
+        raise FileNotFoundError(
+            f"No subclaim data file found for dataset '{dataset}' "
+            f"(searched: {pattern}). Run main.py first."
+        )
+    return max(files, key=os.path.getmtime)
+def _group_data(data: list[dict]) -> dict[str, list[dict]]:
+    """Group entries by their first group label."""
+    groups: dict[str, list[dict]] = defaultdict(list)
+    for entry in data:
+        grp = entry["groups"][0] if entry.get("groups") else "default"
+        groups[grp].append(entry)
+    return dict(groups)
+def _has_score(entry: dict, data_key: str) -> bool:
+    return all(
+        data_key in sc.get("scores", {})
+        for sc in entry.get("subclaims", [])
+    )
+# ── A. Threshold computation ──────────────────────────────────────────────────
+def compute_thresholds() -> None:
+    """Compute conformal thresholds and write demo/data/thresholds.csv.
+    Iterates over every dataset in dataset_config.yaml and every scoring method
+    and alpha value defined in constants.py.  Each row holds the q_hat for one
+    (dataset, mode, scoring_method, group, alpha) key; that q_hat is what
+    apply_threshold() receives at demo runtime.
+    CSV schema: dataset, mode, scoring_method, group, alpha, q_hat
+    """
+    os.makedirs(os.path.dirname(OUTPUT_THRESHOLDS), exist_ok=True)
+    rows: list[dict] = []
+    for dataset, ds_info in _dataset_cfg["datasets"].items():
+        is_grouped = ds_info.get("is_grouped", False)
+        print(f"\n[{dataset}] Loading calibration data…")
+        try:
+            calib_path = _find_calibration_file(dataset)
+        except FileNotFoundError as e:
+            print(f"  SKIP: {e}")
+            continue
+        data = load_subclaim_data(calib_path)
+        calib_data, holdout_data = _split_calibration_holdout(data, is_grouped)
+        print(f"  Loaded {len(data)} entries from {calib_path} (calib={len(calib_data)}, holdout={len(holdout_data)})")
+        for method_name, data_key in SCORING_METHOD_TO_DATA_KEY.items():
+            valid = [e for e in calib_data if _has_score(e, data_key)]
+            if not valid:
+                print(f"  SKIP scoring_method='{method_name}': no entries with key '{data_key}'")
+                continue
+            # ── Marginal mode ────────────────────────────────────────────────
+            for alpha in ALPHA_LEVELS:
+                q_hat = compute_threshold(alpha, valid, A_VALUE, data_key)
+                rows.append({
+                    "dataset": dataset,
+                    "mode": "marginal",
+                    "scoring_method": method_name,
+                    "group": "default",
+                    "alpha": alpha,
+                    "q_hat": q_hat,
+                })
+            print(f"  marginal / {method_name}: {len(ALPHA_LEVELS)} thresholds computed")
+            # ── Conditional mode (grouped datasets only) ─────────────────────
+            if is_grouped:
+                grouped = _group_data(valid)
+                for grp, grp_data in grouped.items():
+                    for alpha in ALPHA_LEVELS:
+                        q_hat = compute_threshold(alpha, grp_data, A_VALUE, data_key)
+                        rows.append({
+                            "dataset": dataset,
+                            "mode": "conditional",
+                            "scoring_method": method_name,
+                            "group": grp,
+                            "alpha": alpha,
+                            "q_hat": q_hat,
+                        })
+                print(f"  conditional / {method_name}: {len(grouped)} groups × {len(ALPHA_LEVELS)} thresholds computed")
+    # Write CSV
+    fieldnames = ["dataset", "mode", "scoring_method", "group", "alpha", "q_hat"]
+    with open(OUTPUT_THRESHOLDS, "w", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=fieldnames)
+        writer.writeheader()
+        writer.writerows(rows)
+    print(f"\nThresholds saved → {OUTPUT_THRESHOLDS}  ({len(rows)} rows)")
+# ── B. Sample query precomputation ────────────────────────────────────────────
+def _entry_to_subclaim_result(entry: dict, dataset: str, is_grouped: bool) -> dict:
+    """Convert an existing subclaim data entry to SubclaimResult format.
+    Uses DEFAULT_SCORING_METHOD and the corresponding data key so that the
+    stored score is directly comparable with the thresholds in thresholds.csv.
+    """
+    data_key = SCORING_METHOD_TO_DATA_KEY[DEFAULT_SCORING_METHOD]
+    group = entry["groups"][0] if is_grouped and entry.get("groups") else "default"
+    mode = "conditional" if is_grouped else "marginal"
+    subclaims = [
+        {
+            "id": i,
+            "text": sc["subclaim"],
+            # Add noise so the score is in the same space as q_hat.
+            # q_hat is calibrated using (score + noise) in calibration/utils.py.
+            "score": float(sc["scores"].get(data_key, 0.0))
+                   + float(sc["scores"].get("noise", 0.0)),
+        }
+        for i, sc in enumerate(entry.get("subclaims", []))
+    ]
+    return {
+        "query": entry["query"],
+        "dataset": dataset,
+        "mode": mode,
+        "group": group,
+        "scoring_method": DEFAULT_SCORING_METHOD,
+        "rag_answer": entry["response"],
+        "retrieved_docs": entry["retrieved_docs"],
+        "subclaims": subclaims,
+    }
+def _pick_entries(data: list[dict], is_grouped: bool) -> list[dict]:
+    """Pick entries with non-empty subclaims from the holdout set.
+    data is expected to be the holdout portion returned by
+    _split_calibration_holdout(), so at most HOLDOUT_PER_GROUP entries per
+    group are available.
+    """
+    has_subclaims = [e for e in data if e.get("subclaims")]
+    if not is_grouped:
+        return has_subclaims[:HOLDOUT_PER_GROUP]
+    grouped = _group_data(has_subclaims)
+    picked = []
+    for grp_entries in grouped.values():
+        picked.extend(grp_entries[:HOLDOUT_PER_GROUP])
+    return picked
+def compute_samples() -> None:
+    """Precompute SubclaimResult objects for sample queries.
+    Converts existing subclaim data (produced by main.py) to SubclaimResult
+    format without making any API calls.  Results are saved with metadata so
+    it is easy to detect when re-generation is needed (e.g. after a model
+    change).
+    """
+    os.makedirs(os.path.dirname(OUTPUT_SAMPLES), exist_ok=True)
+    samples: list[dict] = []
+    for dataset, ds_info in _dataset_cfg["datasets"].items():
+        is_grouped = ds_info.get("is_grouped", False)
+        print(f"\n[{dataset}] Picking sample queries…")
+        try:
+            calib_path = _find_calibration_file(dataset)
+        except FileNotFoundError as e:
+            print(f"  SKIP: {e}")
+            continue
+        data = load_subclaim_data(calib_path)
+        _, holdout_data = _split_calibration_holdout(data, is_grouped)
+        picked = _pick_entries(holdout_data, is_grouped)
+        for entry in picked:
+            result = _entry_to_subclaim_result(entry, dataset, is_grouped)
+            samples.append(result)
+            print(f"  + query={result['query'][:60]!r}  group={result['group']}")
+    output = {
+        "metadata": {
+            "generated_at": datetime.now(timezone.utc).isoformat(),
+            "model": RESPONSE_MODEL,
+            "scoring_method": DEFAULT_SCORING_METHOD,
+        },
+        "samples": samples,
+    }
+    with open(OUTPUT_SAMPLES, "w", encoding="utf-8") as f:
+        json.dump(output, f, indent=2, ensure_ascii=False)
+    print(f"\nSamples saved → {OUTPUT_SAMPLES}  ({len(samples)} entries)")
+# ── Entry point ───────────────────────────────────────────────────────────────
+if __name__ == "__main__":
+    print("=== Step A: Computing conformal thresholds ===")
+    compute_thresholds()
+    print("\n=== Step B: Precomputing sample queries ===")
+    compute_samples()
+    print("\nDone. Run `streamlit run demo/app.py` to launch the demo.")

docs/context/01_original_architecture.md ADDED

	@@ -0,0 +1,404 @@

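+# ResponseQualityAssessment Repository Analysis
+> Target code: `ResponseQualityAssessment/` (the conformal-rag-demo submodule)
+> Paper: "Response Quality Assessment for Retrieval-Augmented Generation via Conditional Conformal Factuality" (presumably SIGIR 2025)
+---
+## 1. System Overview
+An implementation of a pipeline that evaluates the answer quality of RAG (Retrieval-Augmented Generation) systems. Using **Conformal Prediction**, it decomposes an LLM answer into subclaims, then computes and calibrates a factuality score for each subclaim, yielding a quality assessment with statistical guarantees.
+### Main Research Contributions
+- **Split Conformal Prediction**: compute thresholds from calibration data and apply them to test data
+- **Group Conditional Conformal**: compute thresholds independently per group (e.g., medical category)
+- A cross-method comparison of multiple scoring methods (similarity, frequency, log probability, etc.)
+---
+## 2. Directory Structure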
+```
+ResponseQualityAssessment/
+├── conf/
+│   ├── config.yaml              # Global run settings
+│   ├── dataset_config.yaml      # Per-dataset settings
+│   └── path_config.yaml         # File path settings
+├── data/
+│   ├── raw/                     # Raw data (FactScore, HotpotQA, PopQA, MedLFQA, WikiDB)
+│   ├── processed/               # Standardized query and document data
+│   ├── out/                     # Subclaim + score outputs
+│   └── result/                  # Final results and visualizations
+├── index_store/                 # FAISS indexes and mappings
+├── logs/                        # Run logs
+├── src/
+│   ├── calibration/             # Conformal prediction calibration
+│   ├── common/                  # Shared components (config, file, and LLM management)
+│   ├── data_processor/          # Dataset processing pipeline
+│   ├── dataloader/              # HuggingFace data loader
+│   ├── rag/                     # RAG (SQLite-based document DB)
+│   ├── subclaim_processor/      # Core answer-quality evaluation pipeline
+│   └── utils/                   # General-purpose utilities
+├── main.py                      # Entry point
+└── requirements.txt             # Dependencies
+```
+---
+## 3. Libraries Used
+| Library | Purpose |
+|---|---|
+| `openai` | LLM API (response generation, subclaim extraction, and annotation with GPT-4o-mini), embedding models |
+| `faiss-cpu` | Vector nearest-neighbor search (FAISS `IndexFlatIP`; inner product after L2 normalization = cosine similarity) |
+| `sentence-transformers` / `transformers` | Embedding models (auxiliary) |
+| `datasets` | Fetching QA datasets from HuggingFace |
+| `langchain` | LLM orchestration |
+| `numpy` | Numerical computation (conformal quantile computation, score arithmetic) |
+| `torch` | Backend for Transformer models |
+| `PyPDF2` | Text extraction from PDF documents |
+| `matplotlib` | Visualization of calibration results |
+| `jsonschema` | Schema validation of each phase's output data |
+| `python-dotenv` | Loading API keys from `.env` |
+| `flask` / `flask-cors` | Web API (present in the code but currently unused) |
+**Python version**: 3.11
+---
+## 4. Architecture Design
+### 4.1. Design Patterns
+| Pattern | Where applied |
+|---|---|
+| **Strategy pattern** | Scoring strategy (`ProductScoreStrategy`), aggregation strategies (`MeanAggregation` / `MaxAggregation`), chunking strategy |
+| **Template Method pattern** | `RawDataProcessor` (abstract base) ← per-dataset implementations; `ICalibration` ← conformal prediction implementations |
+| **Factory / Dispatcher pattern** | `QueryProcessor` delegates to each processor based on the dataset name |
+| **Manager pattern** | Lifecycle management of each resource via `ConfigManager`, `FAISSIndexManager`, `FileManager`, and `OpenAIManager` |
+| **Pipeline architecture** | A sequential pipeline in which `main.py` calls each stage in order |
+| **Dependency Injection** | Components injected through constructors (`faiss_manager`, `scorer`, etc.) |
+### 4.2. Component Relationship Diagram
+```
+main.py
+  ├─ ConfigManager        … loads YAML, configures logging
+  ├─ DataLoader           … HuggingFace / Wikipedia SQLite DB
+  ├─ QueryProcessor       … standardizes per dataset
+  ├─ FAISSIndexManager    … builds and searches the vector index
+  │   └─ OpenAIManager    … generates embeddings with text-embedding-3-large
+  ├─ process_subclaims()  … subclaim processing orchestrator
+  │   └─ SubclaimProcessor
+  │       ├─ OpenAIRAGAgent               … RAG response generation
+  │       ├─ OpenAIAtomicFactGenerator    … subclaim extraction (with logprobs)
+  │       ├─ OpenAIClaimVerification      … factuality annotation (S/I/U/N)
+  │       └─ SubclaimScorer              … computes 7 kinds of scores
+  └─ SplitConformalCalibration / GroupConditionalConformal
+      └─ calibration/utils … r_a score and quantile threshold computation
+```
+---
+## 5. Processing Pipeline in Detail
+### Step 1: Data Loading and Standardization
+| Task | Implementing class |
+|---|---|
+| Fetch datasets from HuggingFace | `DataLoader.load_qa_data()` |
+| Build the Wikipedia SQLite DB | `DataLoader.create_wiki_db()` |
+| Per-dataset normalization | `FactScoreProcessor`, `HotpotQAProcessor`, `PopQAProcessor`, `MedLFQAProcessor` |
+#### External Corpus per Dataset
+The external corpus used to build and search the FAISS index differs by dataset.
+| Dataset | External corpus | Source / format |
+|---|---|---|
+| **FActScore** | Wikipedia (2023-04-01 dump) | bz2 dump → SQLite DB (`enwiki-20230401.db`) |
+| **HotpotQA** | Wikipedia (2023-04-01 dump) | Same as above (combined with HuggingFace `kilt_tasks/hotpotqa`) |
+| **PopQA** | Wikipedia (2023-04-01 dump) | Same as above (combined with HuggingFace `akariasai/PopQA`) |
+| **MedLFQA** | The MedLFQAv2 QA dataset itself | JSONL files from GitHub `jjcherian/conformal-safety` (`healthsearch_qa`, `kqa_golden`, `kqa_silver_wogold`, `live_qa`, `medication_qa`) |
+FActScore, HotpotQA, and PopQA retrieve documents from a shared Wikipedia SQLite DB (`DocDB`). Only MedLFQA does not use Wikipedia; it uses the documents stored in the `retrieved_passages` field of each JSONL file directly as its corpus.
+Output format (JSON):
+```json
+{"input": "query string", "output": {"answer": "gold answer", "provenance": [...]}}
+```
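+### Step 2: Building the FAISS Index
+#### Build Flow
+For every dataset, the build is split into the following two phases.
+**Phase A: Document collection (per dataset)**
+Each processor differs in where it gets its documents from; the results are ultimately saved in the same JSON format.
+| Dataset | Document source | Documents per query |
+|---|---|---|
+| **FActScore** | Wikipedia SQLite DB (title lookup with a single provenance title) | 1 article |
+| **PopQA** | Wikipedia SQLite DB (a single `s_wiki_title`, deduplicated) | 1 article |
+| **HotpotQA** | Wikipedia SQLite DB (multiple provenance titles, deduplicated) | 2–3 articles |
+| **MedLFQA** | The JSONL `Free_form_answer` split into sentences + the `Nice_to_have` list | 10–20 sentences (short) |
+The three Wikipedia datasets do not put all of Wikipedia into FAISS; they look up and save only the SQLite articles that match each query's provenance titles. MedLFQA does not use Wikipedia and directly uses the documents already stored in the JSONL files.
+**Phase B: Embedding and indexing (common to all datasets, [main.py:211-275](main.py#L211-L275))**
+1. Split documents with the **Fixed-Length chunker** (default: 2000 words, 25-word overlap)
+2. Generate embedding vectors with OpenAI `text-embedding-3-large` (dimension: 3072)
+3. L2-normalize → add to `IndexFlatIP` (inner product ≈ cosine similarity)
+4. Save the index binary and the `indice2fm` mapping (FAISS index ID → file position)
+Each dataset is saved as an independent index under `index_store/{Dataset}/`. The indexes are excluded from the repository via `.gitignore`, so **a rebuild is required on every run**.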
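Step 3 of Phase B relies on the fact that the inner product of two L2-normalized vectors equals their cosine similarity. The following is a minimal pure-Python sketch of that identity; the helper names and the toy 2-dimensional vectors are illustrative assumptions, not code from the repository (which uses FAISS and 3072-dimensional embeddings).

```python
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def inner_product(a: list[float], b: list[float]) -> float:
    """Plain dot product, the metric an inner-product index ranks by."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Textbook cosine similarity, computed without normalizing first."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Toy stand-ins for a query embedding and a chunk embedding.
query = [3.0, 4.0]
chunk = [4.0, 3.0]

# Normalizing first and then taking the inner product gives the cosine.
ip_after_normalization = inner_product(l2_normalize(query), l2_normalize(chunk))
```

Because the normalization happens once at indexing time, the index only needs the cheaper inner product at search time.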
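+#### Estimated Build Cost with text-embedding-3-large (`query_size: 500`, $0.13 / 1M tokens)
+| Dataset | Approx. unique documents | Chunks / document | Total tokens | Estimated cost |
+|---|---|---|---|---|
+| **FActScore** | ~500 articles | ~2 | ~2.7M | ~$0.35 |
+| **PopQA** | ~500 articles | ~2 | ~2.7M | ~$0.35 |
+| **HotpotQA** | ~1,000 articles (2–3 provenance × deduplication) | ~2 | ~5.4M | ~$0.70 |
+| **MedLFQA** | ~500 queries × ~15 sentences | No chunking (short sentences) | ~0.2M | ~$0.03 |
+| **Total (4 datasets)** | | | ~11M | **~$1.4** |
+This is a **one-time index build cost**. Query and subclaim embeddings at inference time amount to only a few thousand tokens and are negligible. Because Wikipedia article lengths vary, the estimate may be off by a factor of about 2 in either direction.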
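The totals in the table can be reproduced with a few lines of arithmetic. This sketch only re-derives the numbers quoted above (token counts in millions and the $0.13 per 1M tokens price); the variable and function names are illustrative.

```python
# Price quoted in the heading above (USD per one million tokens).
PRICE_PER_MILLION_TOKENS = 0.13

# Approximate total tokens per dataset, in millions, taken from the table.
TOKENS_M = {
    "FActScore": 2.7,
    "PopQA": 2.7,
    "HotpotQA": 5.4,
    "MedLFQA": 0.2,
}


def embedding_cost(tokens_millions: float) -> float:
    """Estimated embedding cost in USD for the given token volume."""
    return tokens_millions * PRICE_PER_MILLION_TOKENS


costs = {name: embedding_cost(tokens) for name, tokens in TOKENS_M.items()}
total_tokens_m = sum(TOKENS_M.values())
total_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name:10s} ~{TOKENS_M[name]:.1f}M tokens  ~${cost:.2f}")
print(f"{'Total':10s} ~{total_tokens_m:.1f}M tokens  ~${total_cost:.2f}")
```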
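+### Step 3: Response Generation
+`SubclaimProcessor.generate_responses()`:
+1. For each query, retrieve the top-k documents (default: 10) from FAISS (similarity threshold: 0.3 or higher)
+2. Pass the retrieved documents as context to GPT-4o-mini to generate the initial answer
+### Step 4: Subclaim Extraction
+`OpenAIAtomicFactGenerator.get_facts_from_text()`:
+- Passes the answer text to the LLM and decomposes it into semicolon-separated atomic facts
+- Obtains per-token log probabilities with `logprobs=True, top_logprobs=1`
+- Keeps the list of log probabilities for each subclaim
+### Step 5: Subclaim Scoring
+`SubclaimProcessor.score_subclaim()` computes 7 kinds of scores:
+| Score name | How it is computed | Meaning |
+|---|---|---|
+| `relavance` (relevance) | Product of the FAISS retrieval score × subclaim-document cosine similarity, aggregated across documents | Overall relevance to the retrieved documents |
+| `query_claim_cosine_similarity` | Cosine similarity between the query embedding and the subclaim embedding | Semantic consistency with the query |
+| `doc_claim_cosine_similarity` | Maximum cosine similarity between the subclaim and all retrieved documents | Semantic consistency with the documents |
+| `frequency` | Fraction of 5 samples drawn at temperature 1.0 in which a subclaim with the same content appears | Consistency of the LLM itself (self-confidence) |
+| `min_log_prob` | Minimum token log probability within the subclaim | Generation confidence of the LLM |
+| `random` | Uniform(0, 1) | Baseline |
+| `ordinal` | `i / number of subclaims` | Order of appearance within the response (baseline) |
+> Gaussian noise N(0, 0.001) is added to every score (for stabilization)
+Aggregation strategies (aggregation across documents):
+- `MeanAggregation`: mean of the document scores
+- `MaxAggregation`: maximum of the document scores
+Scoring strategies:
+- `ProductScoreStrategy`: product of the FAISS score × cosine similarity (currently the only implementation)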
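To make the `relavance` score concrete, here is a small sketch of the product scoring followed by mean or max aggregation across documents. It is a simplified stand-in under stated assumptions: the function names and the toy numbers are hypothetical, and the real `ProductScoreStrategy` / `MeanAggregation` / `MaxAggregation` classes may differ in interface and edge-case handling.

```python
from statistics import mean


def product_score(faiss_score: float, doc_claim_cosine: float) -> float:
    """Per-document score: retrieval score times subclaim-document cosine."""
    return faiss_score * doc_claim_cosine


def aggregate(doc_scores: list[float], strategy: str = "mean") -> float:
    """Collapse per-document scores into one subclaim score."""
    if not doc_scores:
        return 0.0  # assumption of this sketch: no documents means score 0
    if strategy == "mean":
        return mean(doc_scores)
    if strategy == "max":
        return max(doc_scores)
    raise ValueError(f"unknown aggregation strategy: {strategy!r}")


# (FAISS retrieval score, subclaim-document cosine) for three retrieved docs.
retrieved = [(0.8, 0.5), (0.6, 0.5), (0.4, 0.25)]
doc_scores = [product_score(f, c) for f, c in retrieved]

mean_relevance = aggregate(doc_scores, "mean")
max_relevance = aggregate(doc_scores, "max")
```

Mean aggregation rewards subclaims supported by many retrieved documents, while max aggregation only needs one strongly matching document.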
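+### Step 6: Annotation
+`OpenAIClaimVerification.annotate()`:
+- Checks each subclaim against the retrieved documents and the gold answer, and GPT-4o-mini assigns one of four labels
+| Label | Meaning |
+|---|---|
+| `S` (Supported) | Supported as fact |
+| `I` (Irrelevant) | Irrelevant to the question |
+| `U` (Unverifiable) | Cannot be verified |
+| `N` (Nonfactual) | Contradicts the facts |
+### Step 7: Conformal Calibration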
+#### Split Conformal Prediction
+```
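+For each alpha in [0.05, 0.10, ..., 0.40]:
+  Repeat 1000 runs:
+    1. Shuffle the data and split it 50/50 into calibration and test sets
+    2. On the calibration data, compute the r_a score of each subclaim:
+       r_a(x) = "the smallest threshold at which, when subclaims are removed
+                 in ascending order of score, the accuracy of the remaining
+                 subclaims first reaches a or higher"
+    3. Threshold q̂ = the ceil((n+1)*(1-alpha))/n quantile (the computation that carries the conformal guarantee)
+    4. Apply the threshold q̂ to the test data and compute the removal rate and accuracy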
+```
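The quantile in step 3 can be sketched in a few lines. This is a minimal illustration of the `ceil((n+1)*(1-alpha))/n` rule, not the repository's `compute_threshold()`; in particular, returning infinity when there are too few calibration points is an assumption of this sketch, and the toy scores are made up.

```python
import math


def conformal_threshold(r_scores: list[float], alpha: float) -> float:
    """Return the ceil((n+1)*(1-alpha))-th smallest calibration score."""
    n = len(r_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # round() guards against float noise (e.g. 8.000000000000002) before ceil().
    k = math.ceil(round((n + 1) * (1 - alpha), 9))
    if k > n:
        # Too few calibration points for this alpha. Assumption of this sketch:
        # fall back to +inf, the most conservative threshold (removes everything).
        return math.inf
    return sorted(r_scores)[k - 1]


# Toy r_a scores from nine calibration entries (illustrative values only).
calibration_r_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
q_hat = conformal_threshold(calibration_r_scores, alpha=0.2)  # k = ceil(10 * 0.8) = 8

# Apply the threshold to a test response: keep subclaims scoring at least q_hat.
subclaim_scores = [0.95, 0.85, 0.42]
kept = [s for s in subclaim_scores if s >= q_hat]
```

The `(n+1)` factor makes the quantile slightly more conservative than the empirical one, which is what allows the finite-sample guarantee to hold on held-out data.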
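+**Amount of calibration data**
+The calibration data is a 50/50 split of all queries processed under `query_size`; no dedicated calibration set is prepared separately. With the default (`query_size: 500`), each run uses 250 calibration and 250 test entries. Relative to each dataset's total size (HotpotQA ~5,600, PopQA ~14,267, etc.), 500 is a sampling cap; increasing `query_size` improves calibration precision, but LLM call costs grow linearly as well. Repeating the random shuffle 1000 times absorbs statistical variability even with a small sample size.
+#### Group Conditional Conformal
+`GroupConditionalConformal`: runs calibration independently for each group (such as the MedLFQA categories) and computes per-group thresholds.
+### Step 8: Result Output
+- **CSV**: removal rate and accuracy for each alpha × confidence_method
+- **PNG plots**: conformal removal curve (removal rate vs. factuality guarantee) and factual accuracy curve
+- **JSON**: all subclaims (including scores and annotations)
+- **config/**: a YAML copy of the run configuration, for reproducibility
+---
+## 6. Data Flow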
+```
+Raw data (HuggingFace / bz2 Wikipedia)
+  ↓ DataLoader
+SQLite Wikipedia DB + standardized query JSON
+  ↓ FileManager + OpenAIManager
+Chunk embedding matrix (N × 3072)
+  ↓ FAISSIndexManager
+FAISS IndexFlatIP + indice2fm mapping
+  ↓ SubclaimProcessor.generate_responses()
+{query, gld_ans, retrieved_docs, response, groups}
+  ↓ OpenAIAtomicFactGenerator
+{subclaim_text, log_prob_list} (subclaims + log probabilities)
+  ↓ SubclaimScorer (7 kinds of scores)
+{scores: {relavance, query_claim_cosine, ..., ordinal}}
+  ↓ OpenAIClaimVerification
+{annotations: {gpt: "S"|"I"|"U"|"N"}}
+  ↓ SplitConformalCalibration / GroupConditionalConformal
+Removal-rate and accuracy statistics (1000 runs × alpha × method)
+  ↓
+CSV / PNG / config YAML
+```
+---
+## 7. Configuration Files
+### `conf/config.yaml` (main parameters)
+```yaml
+dataset:
+  name: "pop_qa"          # fact_score | hotpot_qa | pop_qa | medlf_qa
+  query_size: 500          # number of queries to process (-1 = all)
+index:
+  embedding_model: "text-embedding-3-large"
+  truncation_config:
+    strategy: "fixed_length"
+    chunk_size: 2000        # words per chunk
+    chunk_overlap: 25
+rag:
+  retrival_topk: 10         # number of top results to retrieve
+  retrival_threshold: 0.3   # lower bound on cosine similarity
+  response_model: "gpt-4o-mini"
+conformal_prediction:
+  aggregation_strategy: "mean"   # mean | max
+  scoring_strategy: "product"
+  split_conformal: true
+  conformal_alphas:
+    start: 0.05
+    end: 0.45
+    step: 0.05
+  a_value: 1.0              # target accuracy (0 to 1)
+```
+### `conf/dataset_config.yaml`
+```yaml
+datasets:
+  fact_score:
+    is_grouped: false        # disable group-conditional conformal
+  medlf_qa:
+    is_grouped: true         # enable group-conditional conformal
+```
+---
+## 8. Main Classes and Functions
+### Configuration and Management
+| Class / function | Role |
+|---|---|
+| `ConfigManager` | Loads, saves, and updates YAML settings; configures logging |
+| `FileManager` | Document (PDF/text) processing, chunking, embedding cache |
+| `FAISSIndexManager` | Creates, loads, and searches FAISS indexes; file position mapping |
+| `OpenAIManager` | OpenAI API wrapper (embeddings, assistants, threads) |
+### Data Processing
+| Class / function | Role |
+|---|---|
+| `DataLoader.load_qa_data()` | Fetches the 4 datasets from HuggingFace |
+| `DataLoader.create_wiki_db()` | Builds the SQLite DB from the Wikipedia bz2 dump |
+| `QueryProcessor.get_queries()` | Conversion to the standard schema and sampling |
+| `DocDB` | SQLite-based document retrieval (modeled on the FActScore implementation) |
+### RAG and Scoring
+| Class / function | Role |
+|---|---|
+| `OpenAIRAGAgent.answer()` | Query + retrieved documents → LLM answer generation |
+| `OpenAIAtomicFactGenerator.get_facts_from_text()` | Text → list of atomic facts (with logprobs) |
+| `OpenAIClaimVerification.annotate()` | S/I/U/N labeling of subclaims |
+| `SubclaimScorer.score()` | Cosine-similarity-based relevance score computation |
+| `SubclaimScorer.frequency_score()` | Consistency score from multiple LLM samples |
+### Calibration
+| Class / function | Role |
+|---|---|
+| `get_r_score(entry, method, a)` | Computes the `r_a` score (the smallest threshold at which the accuracy of the remaining subclaims reaches `a`) |
+| `compute_threshold(alpha, data, a, method)` | Computes the conformal quantile threshold `q̂` |
+| `SplitConformalCalibration.plot_conformal_removal()` | Generates the removal rate vs. factuality guarantee curve |
+| `GroupConditionalConformal` | Conditional conformal prediction with per-group thresholds |
+---
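+## 9. Testing and Quality Assurance
+There is no dedicated test suite. Quality assurance relies on the following:
+- **JSON schema validation**: structural validation of each stage's output with `jsonschema` (`base_schema.json`, `wiki_schema.json`, `subclaims_schema.json`)
+- **Index consistency check**: `FAISSIndexManager.is_indice_align()` verifies that the index and the mapping are in sync
+- **Idempotent batch processing**: subclaim processing skips entries that have already been processed (safe to re-run)
+- **Configuration logging**: the runtime configuration YAML is copied into the `config/` directory to ensure reproducibility
+---
+## 10. Known Limitations and Extension Points
+| Item | Current state | Direction for extension |
+|---|---|---|
+| Scoring strategy | Only `ProductScoreStrategy` is implemented | Easy to add more thanks to the Strategy pattern |
+| Text chunking | Fixed-Length only (Recursive is not implemented) | Can be added with the same interface as `FixedLengthChunker` |
+| Types of conformal prediction | Two: Split and Group Conditional | Room to extend to online conformal prediction and others |
+| Parallelism | None (fully sequential) | Can be sped up with the batch API or asynchronous processing |
+| Web API | `flask` is among the dependencies but unused | Could be used when turning the demo into an API server |
+| FAISS corpus scope | Only documents matching each query's provenance are indexed (assumes the gold documents are known in advance) | Supporting arbitrary queries outside the datasets requires indexing the whole corpus, such as all of Wikipedia |
+| Accuracy when searching the full corpus | `IndexFlatIP` is an exhaustive search, so the mathematical top-k accuracy does not degrade. However, as more irrelevant documents enter the embedding space, truly relevant documents may be pushed out of the top-k. Because the FAISS score is also used for scoring, this directly affects the final quality-assessment accuracy | Switching to a domain-specific fine-tuned embedding model, or hybrid retrieval with BM25, are effective countermeasures |
+---
+## 11. How to Run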
+```bash
+# Basic run
+python main.py --config conf/config.yaml
+# Override the dataset and the number of queries
+python main.py --dataset hotpot_qa --query_size 200
+# Attach a custom run ID
+python main.py --run_id my_experiment_01
+```
+Environment variables (`.env`):
+```
+OPENAI_API_KEY=sk-...
+```

docs/context/02_design_patterns_analysis.md ADDED

	@@ -0,0 +1,264 @@

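+# Detailed Analysis of Design Patterns
+## Overview
+This document explains in detail, based on the codebase, the design patterns applied in the current `src/`, and records their problems and proposed improvements.
+---
+## 1. Strategy Pattern
+### Purpose
+Make algorithms swappable as objects.
+### Where Applied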
+```
+src/subclaim_processor/strategies/
+├── aggregation/
+│   ├── base.py        → AggregationStrategy (ABC)
+│   ├── mean.py        → MeanAggregation
+│   └── max.py         → MaxAggregation
+└── scoring/
+    ├── base.py        → ScoringStrategy (ABC)
+    └── product.py     → ProductScoreStrategy
+```
+### Actual Behavior (`subclaim_scorer.py:46-68`)
+```python
+AGGREGATION_STRATEGIES: Dict[str, Callable] = {
+    "max": MaxAggregation,
+    "mean": MeanAggregation,
+}
+SCORING_STRATEGIES: Dict[str, Callable] = {
+    "product": ProductScoreStrategy,
+}
+def score(self, claim, retrieved_docs, aggregation_strategy, scoring_strategy):
+    agg_func = AGGREGATION_STRATEGIES[aggregation_strategy]()   # "mean" or "max"
+    scoring_func = SCORING_STRATEGIES[scoring_strategy]()       # "product"
+    for doc in retrieved_docs:
+        score = scoring_func.compute_score(...)  # swappable
+    return agg_func.aggregate(doc_scores)        # swappable
+```
+### Problem
+Strategies are selected through a dictionary with string keys, so there is no type safety. A typo in `"product"` goes unnoticed until runtime.
+### Proposed Improvement
+```python
+# Make it type-safe with an Enum
+from enum import Enum
+class AggregationMethod(Enum):
+    MEAN = "mean"
+    MAX = "max"
+# Or use a Protocol (Python 3.8+)
+from typing import Protocol
+class AggregationStrategy(Protocol):
+    def aggregate(self, scores: list[float]) -> float: ...
+```
+---
+## 2. Template Method Pattern
+### Purpose
+Define the skeleton of a process in the parent class and leave the details to child classes.
+### Where Applied
+src/data_processor/raw_data_processor.py  → IRawDataProcessor, DatasetProcessor (ABC)
+src/calibration/base_calibration.py       → ICalibration (ABC)
+src/common/llm/llm_agent.py               → LLMAgent (ABC)
+```
+### Current Implementation (`raw_data_processor.py`)
+```python
+class IRawDataProcessor(ABC):
+    @abstractmethod
+    def get_queries(self, input_file, output_file): pass
+    @abstractmethod
+    def get_documents(self, query_file, output_file): pass
+class DatasetProcessor(ABC):
+    @abstractmethod
+    def process_queries(self, input_file, **kwargs) -> list: pass
+    @abstractmethod
+    def process_documents(self, query_file, db, **kwargs) -> dict: pass
+```
+### Problem
+There are two abstract classes (`IRawDataProcessor` and `DatasetProcessor`).
+| Class | Role |
+|--------|------|
+| `IRawDataProcessor` | Interface that includes file I/O |
+| `DatasetProcessor` | Interface for pure processing logic |
+`QueryProcessor` inherits from `IRawDataProcessor` while using `DatasetProcessor` internally, which results in a double-layered structure.
+In a true Template Method pattern, the parent class would implement the "skeleton method", but in the current implementation `QueryProcessor.get_queries()` plays that role.
+---
+## 3. Factory / Dispatcher Pattern
+### Purpose
+Create, or delegate to, the appropriate object based on the input.
+### Actual Code (`query_processor.py:26-31`)
+```python
+self.processors = {
+    "fact_score": FactScoreProcessor(),
+    "hotpot_qa": HotpotQAProcessor(),
+    "pop_qa": PopQAProcessor(),
+    "medlf_qa": MedLFQAProcessor(),
+}
+# At use time
+processor = self.processors.get(dataset)
+```
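+### Problem
+This is closer to a "Registry pattern" than a "Factory pattern".
+| Pattern | Characteristic |
+|---------|------|
+| Factory (proper) | Creates objects when they are needed |
+| Registry (current) | Creates all objects at startup and holds them in a dictionary |
+As soon as `QueryProcessor()` is created, the processors for all 4 datasets are instantiated. The practical harm is small, but it deviates from the design intent.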
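One way to move from the registry toward a factory is to store the classes (or any zero-argument callables) in the dictionary and instantiate only the one that is requested. The sketch below uses empty stand-in classes with the same names as the real processors; `PROCESSOR_FACTORIES` and `create_processor` are hypothetical names introduced for illustration.

```python
from typing import Callable


# Empty stand-ins so the sketch runs on its own; the real classes live in src/.
class FactScoreProcessor: ...
class HotpotQAProcessor: ...
class PopQAProcessor: ...
class MedLFQAProcessor: ...


# The dictionary holds classes, not instances, so nothing is built up front.
PROCESSOR_FACTORIES: dict[str, Callable[[], object]] = {
    "fact_score": FactScoreProcessor,
    "hotpot_qa": HotpotQAProcessor,
    "pop_qa": PopQAProcessor,
    "medlf_qa": MedLFQAProcessor,
}


def create_processor(dataset: str) -> object:
    """Instantiate only the processor for the requested dataset."""
    try:
        factory = PROCESSOR_FACTORIES[dataset]
    except KeyError:
        known = ", ".join(sorted(PROCESSOR_FACTORIES))
        raise ValueError(f"unknown dataset {dataset!r}; expected one of: {known}") from None
    return factory()


processor = create_processor("pop_qa")
```

Raising on an unknown name also avoids the silent `None` that `dict.get()` returns in the current code.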
+---
+## 4. Manager Pattern
+### Purpose
+Centrally manage the lifecycle of a resource (creation, use, disposal).
+### Evaluation of Each Manager
+| Manager | Managed resource | Actual responsibilities | Rating |
+|---------|---------|------------|------|
+| `ConfigManager` | YAML files | Loading settings + logging setup | △ 2 responsibilities |
+| `FAISSIndexManager` | FAISS index | Index management + search + response generation | ✗ 3 responsibilities |
+| `FileManager` | Documents | File loading + chunking | △ 2 responsibilities |
+| `OpenAIManager` | Embeddings API | API calls only | ✓ Appropriate |
+### Biggest Problem: `FAISSIndexManager` (`faiss_manager.py:245-286`)
+```python
+class FAISSIndexManager:
+    def upsert_file_to_faiss(...)              # index management ✓
+    def search_faiss_index(...)                # search ✓
+    def generate_response_from_context(...)    # ← LLM response generation (unrelated)
+    def parse_result(...)                      # ← string parsing (miscellaneous)
+```
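+`generate_response_from_context` has nothing to do with index management, yet it lives in `FAISSIndexManager`. This responsibility should move to `OpenAIRAGAgent`.
+---
+## 5. Pipeline Architecture
+### Purpose
+Split processing into stages and run them in order.
+### Current Pipeline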
+```
+main.py
+  1. DataLoader         → fetch raw data
+  2. QueryProcessor     → normalize data
+  3. FAISSIndexManager  → build index
+  4. SubclaimProcessor  → response generation → subclaim extraction → scoring → annotation
+  5. SplitConformalCalibration → statistical calibration
+```
+### Problem: String Serialization Between Stages (`faiss_manager.py:196-216`)
+```python
+# Return search results as strings
+results.append(
+    f"{text} indice={idx} fileposition={relative_idx} score={dist:.4f}"
+    # TODO reformat this  ← the author's own comment
+)
+```
+`parse_result()` then parses them back with a regular expression:
+```python
+pattern = re.compile(
+    r"page_content='(.*?)'\smetadata=(\{.*?\})\sindice=(\d+)\sfileposition=(\d+)\sscore=([\d.]+)",
+    re.DOTALL,
+)
+```
+### Proposed Improvement
+Defining the data types passed between stages as dataclasses makes `parse_result()` unnecessary:
+```python
+@dataclass
+class SearchResult:
+    text: str
+    indice: int
+    fileposition: int
+    score: float
+```
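A short sketch of how such a dataclass could flow between stages: the search stage builds typed objects, and downstream code reads attributes instead of re-parsing strings. `to_search_results` and `filter_by_score` are hypothetical helpers, and the parallel-list input shape is an assumption made to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    text: str
    indice: int
    fileposition: int
    score: float


def to_search_results(
    texts: list[str],
    indices: list[int],
    positions: list[int],
    scores: list[float],
) -> list[SearchResult]:
    """Zip parallel search outputs into typed records (no string formatting)."""
    return [
        SearchResult(text=t, indice=int(i), fileposition=int(p), score=float(s))
        for t, i, p, s in zip(texts, indices, positions, scores)
    ]


def filter_by_score(results: list[SearchResult], threshold: float) -> list[SearchResult]:
    """Downstream stages use attributes directly; no regex is involved."""
    return [r for r in results if r.score >= threshold]


results = to_search_results(
    texts=["chunk about aspirin", "chunk about ibuprofen"],
    indices=[12, 40],
    positions=[0, 3],
    scores=[0.82, 0.27],
)
relevant = filter_by_score(results, threshold=0.3)
```

Typed records also survive text that contains quotes or the substring `score=`, which a regex over a formatted string can mis-parse.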
+---
+## 6. Dependency Injection (DI)
+### Purpose
+Inject dependencies from outside to improve testability.
+### Current Implementation (`subclaim_scorer.py:29-44`)
+```python
+class SubclaimScorer(IDocumentScorer):
+    def __init__(self, index_truncation_config, embedding_model, index_path, ...):
+        self.faiss_manager = FAISSIndexManager(...)      # ← created directly inside
+        self.gen = OpenAIAtomicFactGenerator()            # ← created directly inside
+        self.openai_client = OpenAI()                     # ← created directly inside
+```
+### Problem
+This is not DI. Real DI receives its dependencies from outside:
+```python
+# DI (proper)
+class SubclaimScorer:
+    def __init__(
+        self,
+        faiss_manager: FAISSIndexManager,
+        fact_generator: OpenAIAtomicFactGenerator,
+        openai_client: OpenAI,
+    ):
+        self.faiss_manager = faiss_manager
+```
+With the current design, unit-testing `SubclaimScorer` requires the real OpenAI API and a real FAISS index.
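The testability benefit can be shown with a deliberately tiny example: once the dependency arrives through the constructor, a test can pass in a fake that needs neither the network nor an index. Every name here (`Searcher`, `InjectedScorer`, `FakeSearcher`) is hypothetical and much simpler than the real `SubclaimScorer`.

```python
from typing import Protocol


class Searcher(Protocol):
    """The only capability the scorer needs from its dependency."""

    def search(self, query: str, top_k: int) -> list[tuple[str, float]]: ...


class InjectedScorer:
    """Receives its searcher from outside instead of constructing it."""

    def __init__(self, searcher: Searcher) -> None:
        self.searcher = searcher

    def best_score(self, claim: str, top_k: int = 3) -> float:
        hits = self.searcher.search(claim, top_k)
        return max((score for _, score in hits), default=0.0)


class FakeSearcher:
    """Test double: returns canned hits and records how it was called."""

    def __init__(self, hits: list[tuple[str, float]]) -> None:
        self.hits = hits
        self.calls: list[tuple[str, int]] = []

    def search(self, query: str, top_k: int) -> list[tuple[str, float]]:
        self.calls.append((query, top_k))
        return self.hits[:top_k]


fake = FakeSearcher([("doc a", 0.2), ("doc b", 0.7)])
scorer = InjectedScorer(fake)
score = scorer.best_score("aspirin reduces fever")
```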
+---
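+## Overall Design Evaluation Summary
+| Pattern | Accuracy of intent | Implementation quality | Improvement priority |
+|---------|------------|----------|----------|
+| Strategy | ✓ Correct | △ String keys, not type-safe | Low (it works) |
+| Template Method | △ Some confusion | △ Two abstract classes coexist | Medium |
+| Factory/Dispatcher | △ Closer to Registry | △ Everything created at startup | Low |
+| Manager | △ Too many responsibilities | ✗ The FAISS manager handles LLM response generation | High |
+| Pipeline | ✓ Correct | ✗ String serialization + regex parsing | High |
+| Dependency Injection | ✗ In name only | ✗ Created internally, cannot be injected | Medium |
+### High-Priority Improvements (no structural change required)
+1. **Move `FAISSIndexManager.generate_response_from_context`**
+   - Destination: `OpenAIRAGAgent` (already exists)
+   - Effect: restores the single-responsibility principle for the Manager
+2. **Structure the data types passed between pipeline stages**
+   - Introduce the `SearchResult` dataclass
+   - Retire the regex parsing in `parse_result()`
+   - Effect: fewer places for bugs to occur, easier debugging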

docs/context/03_demo_app_design.md ADDED

	@@ -0,0 +1,327 @@

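+# Demo Application Design
+## 1. Technology Selection
+Optimize for demo use while leveraging the existing backend assets (Python 3.11, OpenAI, FAISS).
+| Layer | Chosen technology | Reason |
+|----|---------|------|
+| Frontend / UI | **Streamlit** | Python-only, and the existing inference code can be imported directly. No need to stand up a separate API server with Flask/FastAPI. Text highlighting (Markdown/HTML rendering) and sliders are supported out of the box |
+| Backend / inference | Reuse the existing `src/` modules | - |
+| LLM API | `gpt-4o-mini` | Good balance between demo response speed and cost |
+| Vector search | FAISS + `text-embedding-3-large` | Used as is |
+| Data store | Precomputed thresholds kept statically as CSV | Running the calibration loop at demo time is impractical. Because alpha is stored as a float column, no type-conversion issues arise |
+---
+## 2. UI / UX Design
+The layout lets users experience which parts of the generated answer get trimmed when they adjust the Target Factuality ($1-\alpha$).
+### Screen Layout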
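+#### Sidebar (Control Panel)
+| Control | Content | Notes |
+|------------|------|------|
+| Dataset selection | Choose from FactScore / HotpotQA / PopQA / MedLFQA | - |
+| Mode selection | Marginal (overall) or Conditional (per group) | **Enabled only when MedLFQA is selected**. For other datasets, disable it (`disabled=True`) or hide it |
+| Group selection | Choose from healthsearch_qa / medication_qa / kqa_silver_wogold / kqa_golden / live_qa | **Enabled only when MedLFQA + Conditional mode is selected**. Hidden otherwise |
+| Question input | Dropdown selection of sample queries only | The FAISS index consists only of documents matching each query's provenance, so retrieval and scoring accuracy are not guaranteed for arbitrary free-text input. The demo is limited to sample queries |
+| Run Inference button | `st.button("推論実行")` | A click starts the heavy processing |
+| Factuality slider | Adjusts the Target Factuality ($1-\alpha$) in the range 60% to 95% | Enabled only when the inference result is cached |
+#### Main Panel (Result Display)
+**Query**
+- Displays the full text of the selected sample query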
+---
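+**Original RAG Answer | Filtered Answer** (2 columns)
+Left column: original RAG answer
+- Displays the answer generated by `OpenAIRAGAgent` as is
+Right column: filtered answer
+- Pressing the "Generate Answer" button (`st.button`) makes the LLM regenerate the answer (calls `reintegrate_subclaims()`)
+- The generated answer is displayed above the button
+- If a stale result remains after the slider ($\alpha$) is changed, a warning is shown
+---
+**Answer by Conformal-RAG (interactive display)**
+- Fetches the threshold ($\hat{q}$) for the slider value from the precomputed CSV, compares it against the cached scores, and re-renders immediately
+- Subclaims whose score ($R(c)$) is at or above the threshold ($\hat{q}$) → no decoration (displayed as is)
+- Subclaims whose score is below the threshold → displayed with **strikethrough in red**
+---
+**Retrieved Knowledge (Retrieved Documents)**
+- Accordion display with `st.expander` (collapsed by default)
+- Extracts the `page_content` of the FAISS search results (`_extract_page_content()` strips the source prefix)
+**Analysis Metrics**
+- The computed threshold ($\hat{q}$)
+- Original number of subclaims
+- Keep count
+- Removal rate
+- Inference time (shown only for live inference; hidden when a sample query is used)
+---
+## 3. Backend Design Changes (Modifications for the Demo)
+The current repository is designed for batch evaluation (the paper's experiments), so the following refactoring is needed for an interactive demo.
+> **Policy (from CLAUDE.md)**: Isolate demo-specific UI logic and endpoints inside the `demo/` directory and do not couple them tightly to the core logic in `src/`. Maintain backward compatibility so that batch processing via `main.py` does not break.
+### Directory Structure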
+```
+demo/
+├── app.py               # Streamlit entry point
+├── inference_api.py     # Single-query inference wrapper (calls src/)
+├── precompute.py        # Precomputes calibration thresholds and sample queries (run offline)
+├── constants.py         # Constants shared within demo/, such as ALPHA_LEVELS
+└── data/
+    ├── thresholds.csv   # Master table of precomputed calibration thresholds
+    └── samples.json     # Precomputed inference results for sample queries
+```
+The core logic in `src/` is left unchanged; `demo/` uses it via `import`.
+### A. Precomputation (Offline) Phase: `precompute.py`
+`precompute.py` has the following **two responsibilities**. Both run offline, so that the demo only has to load the results at startup.
+#### A-1. Computing Calibration Thresholds
+Computes the threshold ($\hat{q}$) for each dataset, scoring method, $\alpha$ value, and group, and saves the results to a static file.
+> **Data split (holdout)**: The output data of `main.py` is split into a calibration part and a sample part (controlled by the `HOLDOUT_PER_GROUP` constant in `precompute.py`). The theoretical coverage guarantee of Conformal Prediction holds only for data not used in calibration, so the same data must not be used for both. `compute_thresholds()` reads only the calibration side, and `compute_samples()` reads only the holdout side.
+- Target datasets: `fact_score` (FactScore), `hotpot_qa` (HotpotQA), `pop_qa` (PopQA), `medlf_qa` (MedLFQA)
+- Groups for MedLFQA's Conditional mode: **HealthsearchQA / MedicationQA / K-QA_Silver / K-QA_Golden / LiveQA** (all groups must be covered)
+- Scoring method: `relevance` by default. Method names are managed as constants so they can be changed
+- The storage format is **CSV (long format)**; output path: `demo/data/thresholds.csv`
+- Schema:
+| Column | Type | Description | Possible values |
+|------|----|------|-----------|
+| `dataset` | `str` | Dataset identifier | `"fact_score"`, `"hotpot_qa"`, `"pop_qa"`, `"medlf_qa"` |
+| `mode` | `str` | Calibration mode | `"marginal"`, `"conditional"` |
+| `scoring_method` | `str` | Scoring method name | `"relevance"`, `"cosine_similarity"`, `"min_log_prob"`, etc. To add a method, just re-run `precompute.py` to add rows (no schema change needed) |
+| `group` | `str` | Group name. `"default"` for Marginal mode or datasets without groups | `"HealthsearchQA"`, `"MedicationQA"`, `"K-QA_Silver"`, `"K-QA_Golden"`, `"LiveQA"`, `"default"` |
+| `alpha` | `float` | Error rate ($\alpha = 1 - \text{target factuality}$) | Must match `ALPHA_LEVELS` (described later) exactly |
+| `q_hat` | `float` | Conformal prediction threshold | - |
+- To avoid floating-point error (e.g., `1 - 0.9 = 0.09999...`), apply `round(alpha, 2)` to `alpha` both when writing the CSV and at lookup time
+- Lookup example: `df.query("dataset==@d and mode==@m and scoring_method==@s and group==@g and alpha==@a")["q_hat"].iloc[0]`
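The rounding rule matters because a slider value derived as `1 - target_factuality` rarely equals the literal stored in the CSV. The sketch below shows the failure and the fix using only the standard library; it swaps the pandas `df.query` lookup for a plain dictionary, and the helper names and the `q_hat` values in the inline CSV are made up for illustration.

```python
import csv
import io

# Inline stand-in for demo/data/thresholds.csv; q_hat values are invented.
CSV_TEXT = """dataset,mode,scoring_method,group,alpha,q_hat
pop_qa,marginal,relevance,default,0.1,0.31
pop_qa,marginal,relevance,default,0.2,0.27
"""

Key = tuple[str, str, str, str, float]


def load_thresholds(text: str) -> dict[Key, float]:
    """Index rows by (dataset, mode, scoring_method, group, rounded alpha)."""
    table: dict[Key, float] = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (
            row["dataset"],
            row["mode"],
            row["scoring_method"],
            row["group"],
            round(float(row["alpha"]), 2),
        )
        table[key] = float(row["q_hat"])
    return table


def lookup_q_hat(
    table: dict[Key, float],
    dataset: str,
    mode: str,
    scoring_method: str,
    group: str,
    alpha: float,
) -> float:
    """Round alpha the same way as at load time so the keys line up."""
    return table[(dataset, mode, scoring_method, group, round(alpha, 2))]


table = load_thresholds(CSV_TEXT)
alpha = 1 - 0.9  # not exactly 0.1 in binary floating point
q_hat = lookup_q_hat(table, "pop_qa", "marginal", "relevance", "default", alpha)
```

Without the `round(alpha, 2)` call in `lookup_q_hat`, the key built from `1 - 0.9` would not match the `0.1` row and the lookup would raise `KeyError`.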
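+#### A-2. Precomputing Inference Results for Sample Queries
+For the sample queries selectable in the UI dropdown, store as a static file the `SubclaimResult` already produced by `inference_api.process_query()`. This allows results to be displayed instantly on dropdown selection, without any API calls.
+- The saved sample-query results file carries **metadata (generation timestamp, model name, scoring method)** so that one can judge whether regeneration is needed after a model or prompt change
+- The MedLFQA sample queries must include at least 1 from every group (HealthsearchQA / MedicationQA / K-QA_Silver / K-QA_Golden / LiveQA)
+Storage format (`demo/data/samples.json`):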
+```json
+{
+  "metadata": {
+    "generated_at": "2026-03-21T12:00:00Z",
+    "model": "gpt-4o-mini",
+    "scoring_method": "relevance"
+  },
+  "samples": [
+    {
+      "query": "What is the first-line treatment for hypertension?",
+      "dataset": "medlf_qa",
+      "mode": "conditional",
+      "group": "MedicationQA",
+      "scoring_method": "relevance",
+      "rag_answer": "...",
+      "retrieved_docs": ["...", "..."],
+      "subclaims": [
+        {"id": 0, "text": "...", "score": 0.85},
+        {"id": 1, "text": "...", "score": 0.42}
+      ]
+    }
+  ]
+}
+```
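+- Each element of the `samples` array has the same schema as `SubclaimResult`
+- Dropdown display labels are generated from the `query` field
+### B. State Management (`st.session_state`)
+Streamlit re-runs the whole script on every widget interaction, so the results of heavy processing are kept in `st.session_state` to avoid recomputation.
+#### B-1. Session Keys
+| Key | Type | Content |
+|------|----|------|
+| `query_processed` | `bool` | Whether the query has been processed |
+| `result` | `SubclaimResult \| None` | Return value of `process_query()` (holds subclaims, scores, search results, and the RAG answer together) |
+| `elapsed_sec` | `float \| None` | Duration of live inference. `None` when a sample query is used (used to control display of the inference-time metric) |
+| `prev_params` | `tuple \| None` | The previous `(dataset, mode, group)` tuple. Used to detect changes and reset the session |
+| `integrated_answer` | `str \| None` | The reintegrated answer generated by `reintegrate_subclaims()`. Reset to `None` on parameter change or new inference |
+| `integrated_alpha` | `float \| None` | The $\alpha$ value at the time `integrated_answer` was generated. Kept so a warning can be shown if it differs from the current slider value |
+#### B-2. Data Type Definitions
+Types defined in `demo/inference_api.py`. They are declared as `TypedDict` to make the boundary with `app.py` explicit.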
+```python
+from typing import TypedDict
+from demo.constants import DEFAULT_SCORING_METHOD  # constants.py is the single source of truth
+class Subclaim(TypedDict):
+    id: int
+    text: str       # subclaim text
+    score: float    # computed score value (see SubclaimResult.scoring_method for the method name)
+class SubclaimResult(TypedDict):
+    query: str
+    dataset: str            # e.g. "medlf_qa"
+    mode: str               # e.g. "marginal" / "conditional"; corresponds to the mode column of thresholds.csv
+    group: str              # e.g. "MedicationQA"; "default" for Marginal mode or non-grouped datasets
+    scoring_method: str     # e.g. "relevance"; used as the threshold lookup key in apply_threshold()
+    rag_answer: str
+    retrieved_docs: list[str]
+    subclaims: list[Subclaim]
+class FilteredResult(TypedDict):
+    subclaims: list[Subclaim]
+    q_hat: float
+    keep_count: int
+    remove_count: int
+```
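+- `Subclaim.score` is a generic name regardless of method; which score it holds is indicated by `SubclaimResult.scoring_method`
+- `SubclaimResult.group` corresponds to the `group` column of the CSV. For Marginal mode and non-grouped datasets, use `"default"`, never `None`
+- `status` (Keep/Remove) is determined dynamically on the UI side by comparing the score with `q_hat`, so the backend does not return it
+#### Shared Constants: `demo/constants.py`
+Constants referenced by both `precompute.py` (CSV generation) and `app.py` (slider) are managed centrally here. If the slider's selectable values and the CSV's `alpha` column do not match, the lookup fails, so both must always `import` this file.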
+```python
+# demo/constants.py
+# Default scoring method. To change it, edit only this line and re-run precompute.py
+DEFAULT_SCORING_METHOD = "relevance"
+# Mapping from the clean method name → the key name in main.py's output data
+# "relavance" is a typo in the original codebase; kept for compatibility with the data files
+SCORING_METHOD_TO_DATA_KEY: dict[str, str] = {
+    "relevance": "relavance",
+    "cosine_similarity": "query_claim_cosine_similarity",
+    "min_log_prob": "min_log_prob",
+}
+# Steps of the Target Factuality slider (alpha = 1 - target_factuality)
+# precompute.py computes only the alpha values in this list
+# st.select_slider in app.py uses this list as is
+ALPHA_LEVELS: list[float] = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
+# → Target Factuality: 95%, 90%, 85%, 80%, 75%, 70%, 65%, 60%
+```
+---
+### C. Inference Pipeline: Separating Heavy and Light Processing
+**Heavy processing ①: triggered only when the "Run Inference" (推論実行) button is clicked**
+```
+1. Vectorize the query & search FAISS                ← src/common/faiss_manager.py
+2. Generate the initial answer (ŷ) with RAG          ← src/common/llm/openai_rag_agent.py
+3. Decompose into subclaims via the OpenAI API       ← src/common/llm/openai_atomicfact_generator.py
+4. Compute the score of each subclaim                ← src/subclaim_processor/scorer/
+5. Store the result in st.session_state → st.rerun() enables the slider immediately
+```
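+- Use `st.spinner` to make the wait (30–60 seconds) explicit to the user
+- Annotation (S/I/U/N labeling) is not needed for the demo display, so it is **skipped**
+- The frequency score (`frequency_score()` in `src/subclaim_processor/scorer/subclaim_scorer.py`) calls the LLM 5 times per subclaim, so it is **excluded from the demo**. The scores shown are limited to relevance, cosine similarity, and min_log_prob
+**Heavy processing ②: triggered only when the "Generate Answer" button is clicked**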
+```
+1. 閾値未満のサブクレームを removed_subclaims として収集
+2. reintegrate_subclaims(rag_answer, removed_subclaims) を呼び出し
+   → OpenAILLMAgent 経由で元の回答から除去対象サブク��ームを削除した自然文を生成
+3. 結果を st.session_state.integrated_answer に格納 → st.rerun()
+```
+**軽い処理（リアルタイム実行）— スライダー操作のたびに実行**
+```
+1. スライダーの alpha 値に対応する q̂ を事前計算済み CSV から取得
+2. st.session_state.result のスコアと q̂ を比較
+3. Keep/Remove の HTML タグを生成して st.markdown で再描画
+   <span style="padding:2px 5px;">サブクレーム</span>                                    ← Keep（装飾なし）
+   <span style="text-decoration:line-through; color:#dc3545;">サブクレーム</span>  ← Remove
+```
+API 呼び出しは一切行わない。
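The light path above can be sketched in plain Python. This is a minimal illustration, not the code in `demo/app.py`: the CSV columns (`alpha`, `q_hat`), the sample values, and the helper names `load_q_hats` and `render_subclaims` are assumptions made for the example.

```python
import csv
import html
import io

# Hypothetical CSV contents; the real thresholds.csv may carry more columns.
THRESHOLDS_CSV = """alpha,q_hat
0.05,0.62
0.10,0.48
"""


def load_q_hats(csv_text: str) -> dict[str, float]:
    """Map the alpha column (kept as text to avoid float-equality issues) to q_hat."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["alpha"]: float(row["q_hat"]) for row in reader}


def render_subclaims(subclaims: list[tuple[str, float]], q_hat: float) -> str:
    """Return HTML in which subclaims scoring below q_hat are struck through."""
    parts = []
    for text, score in subclaims:
        safe = html.escape(text)  # subclaim text comes from an LLM, so escape it
        if score >= q_hat:
            parts.append(f'<span style="padding:2px 5px;">{safe}</span>')
        else:
            parts.append(
                '<span style="text-decoration:line-through; color:#dc3545;">'
                f"{safe}</span>"
            )
    return " ".join(parts)


q_hats = load_q_hats(THRESHOLDS_CSV)
markup = render_subclaims([("A < B", 0.9), ("C", 0.1)], q_hats["0.10"])
```

In a Streamlit app the resulting string would typically be passed to `st.markdown(markup, unsafe_allow_html=True)`; nothing in this path touches the network.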
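+### D. Resource caching (`@st.cache_resource`)
+The FAISS index is several hundred MB per dataset, so reloading it on every dataset switch takes tens of seconds. `@st.cache_resource` caches it keyed by dataset name, so it is loaded only the first time.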
+```python
+@st.cache_resource
+def get_faiss_manager(dataset: str) -> FAISSIndexManager:
+    return build_faiss_manager(dataset)
+@st.cache_resource
+def get_scorer(dataset: str) -> SubclaimScorer:
+    return build_scorer(dataset)
+```
+The static files `thresholds.csv` and `samples.json` are cached with `@st.cache_data` (`@st.cache_resource` is for resource objects; DataFrames and dictionaries belong in `@st.cache_data`).
+> **Implementation note (HF Spaces support)**: `FAISSIndexManager` / `SubclaimScorer` pull in
+> heavy dependencies such as `torch` / `transformers` via `langchain_text_splitters`.
+> To avoid that startup cost in environments that do not need FAISS, such as HF Spaces, these classes are
+> imported lazily inside the `build_faiss_manager()` / `build_scorer()` functions rather than at
+> module level in `inference_api.py`. Type hints are preserved with a `TYPE_CHECKING` guard.
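The lazy-import pattern described in the note can be illustrated as follows. This is a sketch under stated assumptions: `fractions.Fraction` stands in for a heavy class such as `FAISSIndexManager`, and `build_fraction` is a hypothetical name, not a function in `inference_api.py`.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers; never executed at runtime,
    # so importing this module does not load the dependency.
    from fractions import Fraction  # stand-in for a heavy class


def build_fraction(text: str) -> "Fraction":
    """Build the object, importing its module only when first called."""
    # Deferred import: the cost is paid on the first call, not at module import.
    from fractions import Fraction

    return Fraction(text)


value = build_fraction("3/4")
```

The quoted return annotation keeps the type hint visible to tools without requiring the name to exist at runtime.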
+---
+## 4. Development steps
+### ① `demo/constants.py` (shared constants)
+Create this first because other modules depend on it. It defines `ALPHA_LEVELS`, `DEFAULT_SCORING_METHOD`, and `SCORING_METHOD_TO_DATA_KEY`.
+### ② `demo/inference_api.py` (inference wrapper)
+Implement this before `precompute.py`, which calls it.
+- It only `import`s the `src/` modules and does not change the core logic.
+- Interface:
+  - `process_query(query: str, dataset: str, mode: str, group: str) -> SubclaimResult` (heavy processing ①)
+  - `reintegrate_subclaims(rag_answer: str, removed_subclaims: list[str], model: str) -> str` (heavy processing ②)
+  - `apply_threshold(result: SubclaimResult, q_hat: float) -> FilteredResult` (light processing)
+- The scoring method in use is controlled by the `DEFAULT_SCORING_METHOD` constant and is not hard-coded.
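As an illustration of the light-processing interface, the sketch below counts kept and removed subclaims for a given `q_hat`. It is not the repository's `apply_threshold`: it takes a bare list of subclaims instead of a full `SubclaimResult`, the `text` field name is an assumption, and returning all subclaims unchanged in `FilteredResult.subclaims` is a guess at the intended behavior.

```python
from typing import TypedDict


class Subclaim(TypedDict):
    text: str  # assumed field name for the subclaim sentence
    score: float


class FilteredResult(TypedDict):
    subclaims: list[Subclaim]
    q_hat: float
    keep_count: int
    remove_count: int


def apply_threshold_sketch(subclaims: list[Subclaim], q_hat: float) -> FilteredResult:
    """Count subclaims whose score is at or above q_hat (Keep) versus below (Remove)."""
    keep_count = sum(1 for claim in subclaims if claim["score"] >= q_hat)
    return FilteredResult(
        subclaims=subclaims,
        q_hat=q_hat,
        keep_count=keep_count,
        remove_count=len(subclaims) - keep_count,
    )


claims: list[Subclaim] = [
    {"text": "claim A", "score": 0.9},
    {"text": "claim B", "score": 0.5},
    {"text": "claim C", "score": 0.2},
]
filtered = apply_threshold_sketch(claims, q_hat=0.5)
```

Because the function is pure and needs no API or FAISS access, it can run on every slider change.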
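+### ③ `demo/precompute.py` + data generation (offline)
+Run this after `inference_api` is complete.
+- Run the existing `main.py` to build the FAISS index for the demo datasets (if not already built).
+- Implement and run `demo/precompute.py` to generate:
+  1. The master table of calibration thresholds per $\alpha$ (CSV: `demo/data/thresholds.csv`)
+  2. A cache of `SubclaimResult` for the sample queries (`demo/data/samples.json`, with metadata for generation timestamp and model name)
+### ④ `demo/app.py` (Streamlit UI)
+Implement this once the three pieces above are in place.
+- Wire together the buttons, the slider, and `st.session_state`.
+- When a dataset other than MedLFQA is selected, disable Conditional mode and group selection.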

docs/context/04_environment_setup.md ADDED Viewed

	@@ -0,0 +1,373 @@

+# Environment setup and execution procedure
+Two execution environments are supported: **via a Singularity container** and **directly on the host**.
+---
+## 1. Prerequisites
+| Tool | Version requirement | Purpose |
+|--------|-------------|------|
+| Git | any | Cloning the repository |
+| Singularity (SingularityCE) | 3.x or later | Running the container and building the SIF (container route) |
+| `fakeroot` | - | Required when building the SIF (`singularity build --fakeroot`) |
+| Python 3.11 + `requirements-dev.txt` | - | Direct host execution only |
+> **Recommended**: use the Singularity container where managing a Python environment is difficult, such as on HPC systems. For local development, direct execution also works.
+---
+## 2. Host-side directory layout
+Create the following structure in any directory on the host (the parent of `DATA_ROOT`).
+All heavy files (FAISS indexes, raw data, HuggingFace cache) are gathered here and mounted into the container with Singularity's `--bind`.
+```
+/mnt/data/<username>/
+├── sif/
+│   └── response_quality.sif        ← SIF_DIR
+├── hf_cache/                        ← HF_HOME
+└── ResponseQualityAssessment/       ← DATA_ROOT
+    ├── data/
+    │   ├── raw/
+    │   │   └── WikiDB/              ← Wikipedia SQLite DB
+    │   ├── processed/
+    │   ├── out/                     ← main.py output (subclaim JSON)
+    │   └── result/
+    └── index_store/
+        ├── FactScore/               ← FAISS indexes
+        ├── HotpotQA/
+        ├── PopQA/
+        └── MedLFQA/
+```
+> `demo/data/` (thresholds.csv, samples.json) is part of the repository, so it does not need to be placed under or bound from `DATA_ROOT`.
+---
+## 3. Configuring `.env`
+Copy `.env.example` to `.env` and set the values for your environment.
+```bash
+cp .env.example .env
+```
+Contents of `.env` (covered by `.gitignore`):
+```bash
+# Machine-specific absolute paths
+DATA_ROOT=/mnt/data/<username>/ResponseQualityAssessment
+SIF_DIR=/mnt/data/<username>/sif
+HF_HOME=/mnt/data/<username>/hf_cache
+# API keys
+OPENAI_API_KEY=sk-...
+```
+The paths in `conf/path_config.yaml` and `conf/dataset_config.yaml` are defined relative to `${DATA_ROOT}`, and this environment variable is expanded at runtime.
+> - **Direct host execution**: the `DATA_ROOT` from `.env` (e.g. `/mnt/data/<username>/ResponseQualityAssessment`) is used as-is.
+> - **Container execution**: the `run_*.sh` scripts bind-mount the host's `data/` and `index_store/` under `REPO_ROOT`, so inside the container `DATA_ROOT` is overridden to `REPO_ROOT`. The `--env DATA_ROOT="${REPO_ROOT}"` option in the scripts takes care of this.
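The runtime expansion of `${DATA_ROOT}` can be pictured with the standard library alone. This is a sketch, not the repository's `config_manager.py`: the config keys, the example path, and the helper name `expand_paths` are hypothetical, and a plain dict stands in for the parsed YAML.

```python
import os


def expand_paths(config: dict) -> dict:
    """Recursively expand ${VAR} references in every string value of a config dict."""
    expanded = {}
    for key, value in config.items():
        if isinstance(value, dict):
            expanded[key] = expand_paths(value)
        elif isinstance(value, str):
            expanded[key] = os.path.expandvars(value)
        else:
            expanded[key] = value
    return expanded


# Example value only; in practice DATA_ROOT comes from .env or the --env option.
os.environ["DATA_ROOT"] = "/mnt/data/alice/ResponseQualityAssessment"

raw_config = {
    "index_store": "${DATA_ROOT}/index_store",
    "data": {"out": "${DATA_ROOT}/data/out"},
}
resolved = expand_paths(raw_config)
```

Because expansion happens when the config is read, the same YAML works on the host and inside the container; only the value of `DATA_ROOT` differs.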
+---
+## 4. Building the Singularity container (first time only)
+```bash
+bash scripts/build_sif.sh
+```
+- Builds the SIF from `singularity/response_quality.def` (based on `docker://python:3.11-slim`).
+- Output: `${SIF_DIR}/response_quality.sif`
+- The build takes several minutes and requires `fakeroot`.
+- Inside the container, `torch` is installed from the CPU-only wheel (`--index-url https://download.pytorch.org/whl/cpu`). When installing directly from `requirements-dev.txt` for local development, specify the same option manually.
+**Layout of the requirements files:**
+| File | Purpose |
+|----------|------|
+| `requirements.txt` | Minimal set for HF Spaces (excludes heavy dependencies such as `torch`). Spaces reads this file automatically |
+| `requirements-dev.txt` | Full dependencies for local and Singularity use (inherits the common part via `-r requirements.txt`) |
+Installation for direct host execution:
+```bash
+pip install -r requirements-dev.txt
+```
+---
+## 5. Data preparation (first time only)
+| Method | Purpose | API usage |
+|--------|------|---------|
+| **A. Use the author-provided data** | Reuse the raw data and FAISS indexes to reduce API calls | No Embeddings calls for index construction (FAISS build skipped); Chat required |
+| **B. Run the full pipeline yourself** | Reproduce results with a new model or dataset | Yes (Chat + Embeddings) |
+---
+### A. Use the author-provided data (skip FAISS construction)
+The author-provided Google Drive folder contains the following (**note**: precomputed subclaim JSON is not included):
+- `data.zip`: raw query data (`HotpotQA/raw_hotpot_qa.json`, `PopQA/raw_pop_qa.json`, `FactScore/factscore_names.txt`, `MedLFQA/*.jsonl`)
+- `index_store.zip`: prebuilt FAISS indexes (`.faiss` and `indice2fm_*.json` for each dataset and each query_size)
+If the FAISS indexes are in place, the Embeddings API calls for index construction are skipped. However, RAG answer generation, subclaim extraction, scoring, and annotation still require the Chat API.
+**1. Download and place the Wikipedia DB**
+Required for all datasets (QueryProcessor always references it).
+> [enwiki-20230401.db (Google Drive)](https://drive.google.com/file/d/1mekls6OGOKLmt7gYtHs0WGf5oTamTNat/view?usp=drive_link)
+```
+DATA_ROOT/data/raw/WikiDB/enwiki-20230401.db
+```
+**2. If you use `medlf_qa`, fetch the raw data manually**
+Not needed for `fact_score` / `hotpot_qa` / `pop_qa`, because the DataLoader fetches them from HuggingFace automatically.
+```bash
+source .env
+# Use either one of the following
+git clone https://github.com/dmis-lab/OLAPH.git /tmp/olaph
+cp -r /tmp/olaph/MedLFQA ${DATA_ROOT}/data/raw/MedLFQA
+# or
+git clone https://github.com/jjcherian/conformal-safety.git /tmp/cs
+cp -r /tmp/cs/data/MedLFQAv2 ${DATA_ROOT}/data/raw/MedLFQA
+```
+**3. Download and extract the author-provided data**
+> https://drive.google.com/drive/folders/1aLbHxS6V1ipMH8FpVCxKmr8oMYfqmRgb
+```bash
+source .env
+# Download the folder with gdown (it is saved under the name "conformal prediction with RAG/")
+gdown --folder https://drive.google.com/drive/folders/1aLbHxS6V1ipMH8FpVCxKmr8oMYfqmRgb \
+  -O "${DATA_ROOT}/data/out/"
+# Extract the raw data into data/raw/
+unzip "${DATA_ROOT}/data/out/conformal prediction with RAG/data.zip" \
+  -d "${DATA_ROOT}/data/raw/"
+# Extract the FAISS indexes directly under DATA_ROOT (this creates index_store/)
+unzip "${DATA_ROOT}/data/out/conformal prediction with RAG/index_store.zip" \
+  -d "${DATA_ROOT}/"
+```
+Structure after extraction:
+```
+DATA_ROOT/
+├── data/
+│   └── raw/
+│       ├── HotpotQA/raw_hotpot_qa.json
+│       ├── PopQA/raw_pop_qa.json
+│       ├── FactScore/factscore_names.txt
+│       └── MedLFQA/*.jsonl
+└── index_store/
+    ├── FactScore/index_500.faiss, indice2fm_500.json, ...
+    ├── HotpotQA/index_500.faiss, indice2fm_500.json
+    └── MedLFQA/index_10.faiss, index_500.faiss, indice2fm_*.json, ...
+```
+**4. Run**
+If the FAISS indexes exist, index construction (Embeddings API) is skipped. RAG inference, scoring, and annotation require the Chat API.
+```bash
+# Via the container
+bash scripts/run_main.sh --dataset medlf_qa
+# Directly on the host
+source .env && python main.py --dataset medlf_qa
+```
+---
+### B. Run the full pipeline yourself (uses the API)
+**1. Download and place the Wikipedia DB**
+Same as step 1 of A above.
+**2. If you use `medlf_qa`, fetch the raw data manually**
+Same as step 2 of A above. Not needed for `fact_score` / `hotpot_qa` / `pop_qa`.
+**3. Run**
+Run one dataset at a time (they cannot be run in parallel).
+```bash
+# Via the container
+bash scripts/run_main.sh --dataset medlf_qa
+# Available: fact_score, hotpot_qa, pop_qa, medlf_qa
+# Directly on the host
+source .env && python main.py --dataset medlf_qa
+```
+---
+Files generated after the run (under `DATA_ROOT`):
+| Path | Contents |
+|------|------|
+| `data/out/{DatasetName}/` | JSON with subclaim scores |
+| `index_store/{DatasetName}/` | FAISS indexes (`index_*.faiss`, `indice2fm_*.json`) |
+---
+### CLI options
+| Option | Default | Description |
+|---|---|---|
+| `--dataset` | see config | Dataset name (`fact_score` / `hotpot_qa` / `pop_qa` / `medlf_qa`) |
+| `--query_size` | 500 | Number of queries to process |
+| `--run_id` | timestamp | Identifier for the log and result directories |
+| `--lite` | false | Skips the `frequency` and `doc_claim_cosine_similarity` scores and conformal plot generation |
+| `--log_level` | `INFO` | Log level (`DEBUG` / `INFO` / `WARNING` / `ERROR`) |
+### Checking API call logs
+With `--log_level DEBUG`, individual OpenAI API calls (purpose, model, token counts) are logged with the `[API:Chat]` / `[API:Embeddings]` prefixes.
+```bash
+# Run with API logging enabled (via the container)
+bash scripts/run_main.sh --dataset medlf_qa --query_size 10 --lite --log_level DEBUG
+# Extract only the API calls from the log
+grep "\[API:" logs/<run_id>/run_<run_id>.log
+# Tally by purpose
+grep "\[API:" logs/<run_id>/run_<run_id>.log \
+  | sed 's/.*\[API:\([^]]*\)\] purpose=\([^ ]*\).*/\1 \2/' \
+  | sort | uniq -c | sort -rn
+```
+Example output:
+```
+    103 Embeddings cosine_similarity(claim)
+    103 Embeddings cosine_similarity(query)
+    103 Embeddings relevance_score
+    103 Chat claim_verification
+     10 Chat subclaim_extraction
+     10 Chat rag_response
+```
+---
+## 6. Precomputation (after data updates or model changes)
+(Re)generates `demo/data/thresholds.csv` and `demo/data/samples.json`.
+Run this after changing the model or adding a dataset.
+**Via the container:**
+```bash
+bash scripts/run_precompute.sh
+```
+**Directly on the host:**
+```bash
+source .env
+python -m demo.precompute
+```
+The output is written to `demo/data/` inside the repository.
+After generation, commit it with `git add demo/data/ && git commit`.
+---
+## 7. Launching the demo
+**Via the container:**
+```bash
+bash scripts/run_demo.sh            # default: port 8502
+bash scripts/run_demo.sh --port 8503  # to specify a port
+```
+**Directly on the host:**
+```bash
+source .env
+streamlit run demo/app.py --server.port 8502
+```
+- Default port: **8502** (8501 is in use by another app on the server).
+- If the port is in use, specify another with the `--port` option.
+**Singularity bind configuration (inside the scripts):**
+```bash
+singularity run \
+  --bind "${DATA_ROOT}/data:${REPO_ROOT}/data" \
+  --bind "${DATA_ROOT}/index_store:${REPO_ROOT}/index_store" \
+  --bind "${HF_HOME}:${HF_HOME}" \
+  --env HF_HOME="${HF_HOME}" \
+  --env OPENAI_API_KEY="${OPENAI_API_KEY}" \
+  --env DATA_ROOT="${REPO_ROOT}" \
+  --env PYTHONPATH="${REPO_ROOT}" \
+  --pwd "${REPO_ROOT}" \
+  "${SIF_DIR}/response_quality.sif" \
+  streamlit run demo/app.py --server.port 8502
+```
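+(`REPO_ROOT` is resolved automatically from the script's own location. Inside the container, `DATA_ROOT` is overridden to `REPO_ROOT` so that it matches the bind-mount target paths.)
+`demo/data/` lives inside the repository, so no bind is needed.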
+---
+## 8. Troubleshooting
+### `.env` is not loaded
+Each script automatically finds and loads `.env` in the repository root (`REPO_ROOT` is derived from the script's location, so it works from any directory). Check that the `.env` file itself exists.
+```bash
+ls /path/to/ResponseQualityAssessment/.env
+```
+### Directories under `DATA_ROOT` do not exist
+Output under `data/out/` and `index_store/` is created automatically when `main.py` runs, but
+`DATA_ROOT` itself, `DATA_ROOT/data/`, and `DATA_ROOT/index_store/` must be created beforehand.
+```bash
+source .env
+mkdir -p ${DATA_ROOT}/data/raw/WikiDB \
+         ${DATA_ROOT}/data/processed \
+         ${DATA_ROOT}/data/out \
+         ${DATA_ROOT}/data/result \
+         ${DATA_ROOT}/index_store \
+         ${HF_HOME} \
+         ${SIF_DIR}
+```
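+### FAISS index not found (error when launching the demo)
+`scripts/run_main.sh` has not been run, or the target dataset has not been processed.
+Follow §5 "Data preparation".
+### `demo/data/thresholds.csv` not found
+`scripts/run_precompute.sh` has not been run. Follow §6 "Precomputation".
+Alternatively, check whether the committed files exist in the repository's `demo/data/` (update with `git pull`).
+### `subclaims_schema.json` not found (container execution)
+During container execution, `DATA_ROOT/data/` is bind-mounted onto `REPO_ROOT/data/`, which hides the repository's `data/out/subclaims_schema.json`. Copy it manually before the first run.
+```bash
+source .env
+cp data/out/subclaims_schema.json ${DATA_ROOT}/data/out/subclaims_schema.json
+```
+### OpenAI API errors
+Check that `OPENAI_API_KEY` in `.env` is set correctly.
+Lines starting with `#` are ignored as comments, so write the key without a leading `#`, as in `OPENAI_API_KEY=sk-...`.
+The API is called only when the "Run inference" button is pressed. Sample queries are selected from the dropdown without pressing the button, so they need no API.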

docs/context/05_medrag_corpus_integration.md ADDED Viewed

	@@ -0,0 +1,196 @@

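+# MedRAG corpus integration and system extension requirements
+## 1. Project overview
+### 1.1. Purpose
+The current RAG answer quality assessment system (Conformal-RAG) builds its retrieval index for MedLFQA evaluation from the reference documents (JSONL) shipped with the dataset. This project moves beyond that closed evaluation setting by integrating an external large-scale medical corpus (MedRAG), building a RAG factuality evaluation platform under noise conditions closer to real-world operation.
+### 1.2. Challenges and approach
+Introducing a large corpus sharply increases the number of retrieval distractors (noise documents that are semantically similar but irrelevant), with the risk that conformal prediction scoring breaks down. To address this, the following three points are defined as system requirements.
+- Switching to a medical-domain embedding model
+- Introducing hybrid search and a reranker
+- Optimizing the chunking strategy
+---
+## 2. System architecture and data pipeline requirements
+### 2.1. Building the corpus database
+- **Requirement**: ingest the raw data provided by MedRAG (PubMed abstracts, StatPearls, etc.) into the system.
+- **MedRAG corpus data format**: MedRAG publishes the following corpora on HuggingFace (`MedRAG/pubmed`, etc.) as pre-chunked datasets. All corpora share the same 4-field schema.
+  | Corpus | Snippets | Avg. tokens | Chunking strategy | Notes |
+  |---------|------------|------------|------------|------|
+  | PubMed  | approx. 23.9M | **296 tokens** | 1 paper = 1 snippet (title + abstract) | Distributed on HuggingFace |
+  | StatPearls | approx. 300K | 119 tokens | Per paragraph | Content not published because of the Privacy Policy; must be processed locally |
+  | Textbooks | approx. 130K | 182 tokens | `RecursiveCharacterTextSplitter` (max 1,000 characters) | Distributed on HuggingFace |
+  ```json
+  {"id": "pubmed23n0001_0", "title": "...", "content": "...", "contents": "concatenation of title + content (for BM25)"}
+  ```
+- **Implementation approach**: load the Parquet/JSONL data distributed on HuggingFace and store the `title` and `content` fields in SQLite with `DocDB.build_db()` (`src/rag/retrieval.py`). The same mechanism as `DataLoader.create_wiki_db()` can be reused.
+### 2.2. Chunking strategy
+- **Current issue**: `FixedLengthChunker` (`src/common/chunker.py`) defaults to a large 2000 words (25-word overlap), which tends to blur information boundaries. The setting is managed in `index.truncation_config` in `conf/config.yaml`.
+- **Per-corpus approach**:
+  - **PubMed**: use the pre-chunked data (296 tokens on average) as-is. On average it fits within the 512-token maximum input length of the MedCPT Article Encoder, so **no additional chunking is needed**. When using MedCPT, `[title, content]` must be passed as a list (string concatenation is not allowed).
+  - **StatPearls / Textbooks**: if local chunking is needed, add a `RecursiveCharacterChunker` to `FileManager`, targeting roughly 256 to 512 BPE tokens. Make it selectable via the `strategy` key in `conf/config.yaml`.
+---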
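+## 3. Retrieval algorithm requirements (Retrieval & Embedding)
+### 3.1. Introducing a medical-domain embedding model
+- **Current issue**: `index.embedding_model` in `conf/config.yaml` is fixed to `text-embedding-3-large` (a general-purpose model), and embeddings are generated through `OpenAIManager.create_openai_embeddings()`.
+- **Requirement**: map medical terminology and abbreviations accurately into the vector space.
+- **Implementation approach**: add a new embedding class for local models that does not depend on `OpenAIManager` (e.g. `LocalEmbeddingManager`), able to load the following models and generate embeddings. Design it to be switchable via the `embedding_model` key in `conf/config.yaml`.
+  - **`MedCPT`** (`ncbi/MedCPT-Query-Encoder` / `ncbi/MedCPT-Article-Encoder`): a bi-encoder that NCBI trained contrastively on 255M PubMed search logs; it can be applied directly to dense retrieval for RAG. It also performs well on the MedRAG benchmark and is the first choice.
+  - **`MedEmbed`** (`abhinand/MedEmbed` series): a general-purpose model trained contrastively on medical data. An alternative with competitive performance on the medical tasks in MTEB.
+  - > **Note**: `PubMedBERT` and `BioLinkBERT` are MLMs that have not undergone contrastive training; using them as bi-encoders without fine-tuning is not recommended.
+### 3.2. Implementing hybrid search
+- **Current issue**: only dense vector retrieval via `FAISSIndexManager` (`src/common/faiss_manager.py`) is implemented, which can miss exact matches of named entities such as drug and disease names.
+- **Requirement**: a retrieval mechanism that does not miss exact matches of named entities.
+- **Implementation approach**: in addition to `FAISSIndexManager`, introduce a sparse retrieval engine and fuse the two rankings with Reciprocal Rank Fusion (RRF) to obtain the top documents.
+  - **Sparse retrieval options**: BM25 is easy to implement and a stable baseline. However, vocabulary mismatch on technical terms is common in medical and scientific documents, so **SPLADE**, which supports semantic term expansion, is also considered as an alternative.
+  - **Score fusion**: RRF is a robust method that needs no training data and is the first choice. If labeled data is available, weighted RRF or a convex combination may improve accuracy further.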
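For reference, unweighted RRF scores each document as the sum over rankings of `1 / (k + rank)`. The following is a minimal sketch with made-up document IDs; it is not code from the repository, and the function name `reciprocal_rank_fusion` is chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs; ranks are 1-based."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_ranking = ["d1", "d2", "d3"]   # e.g. from the FAISS (dense) retriever
sparse_ranking = ["d3", "d1", "d4"]  # e.g. from BM25
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

Only ranks enter the formula, which is why RRF is insensitive to the differing score scales of dense and sparse retrievers.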
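+### 3.3. Introducing a reranker
+- **Requirement**: eliminate the noise documents (distractors) characteristic of large corpora.
+- **Implementation approach**: in `SubclaimProcessor.generate_responses()` (`src/subclaim_processor/subclaim_processor.py`), add a step that coarsely retrieves about the top 100 documents with hybrid search, recomputes relevance precisely with a cross-encoder, and narrows the result down to the final top 10.
+  - **Recommended models (self-hosted)**: `BAAI/bge-reranker-v2-m3` (multilingual, general-purpose) or `BAAI/bge-reranker-v2.5-gemma2-lightweight` (SOTA as of 2025, with token compression)
+  - **Recommended model (commercial API)**: Cohere Rerank (currently v4). Accuracy is high but it cannot be self-hosted, so choose with operating cost and portability in mind.
+---
+## 4. Evaluation and scoring requirements
+### 4.1. Revising the scoring strategy
+- **Current issue**: `ProductScoreStrategy` (`src/subclaim_processor/strategies/scoring/product.py`) computes the score as the product of the FAISS retrieval score and the subclaim's cosine similarity. Introducing hybrid search or a reranker changes the scale of the base score, so this formula breaks down.
+- **Requirement**: define a universal subclaim relevance score that does not depend on the retrieval method.
+- **Implementation approach**: add a new strategy class inheriting from `ScoringStrategy` (`src/subclaim_processor/strategies/scoring/base.py`) that directly adopts, as the score, the logit the cross-encoder outputs for "query + subclaim" against "retrieved document". Make it switchable via the `conformal_prediction.scoring_strategy` key in `conf/config.yaml`.
+---
+## 5. Non-functional requirements (performance and extensibility)
+### 5.1. Introducing asynchronous and batch processing
+- **Current issue**: from RAG response generation through subclaim extraction and annotation, the whole pipeline is implemented sequentially, so throughput is low when evaluating many queries. OpenAI API rate limits (429 errors) can also become a problem.
+- **Requirement**: substantially shorten execution time for large corpora and large query sets, and avoid API rate limits.
+- **Implementation approach**: adopt one or a combination of the following.
+  - Rewrite as an asynchronous pipeline using `asyncio` (when responsiveness is required)
+  - Process annotation in bulk with the OpenAI Batch API (when cost matters; latency is up to 24 hours)
+### 5.2. Hardening index management
+- **Current issue**: `FAISSIndexManager` uses `IndexFlatIP` (inner product) and keeps in memory the mapping `indice2fm` (`{file_path: [indices]}`, saved to `index_store/indice2fm.json`) that links file paths to lists of FAISS index numbers. With millions of documents, it may run out of memory.
+- **Requirement**: memory efficiency for large indexes and ease of index updates.
+- **Implementation approach**: build the index in batches. With a possible migration to a disk-based vector DB (Milvus, Qdrant, etc.) in view, abstract the `FAISSIndexManager` interface.
+---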
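+## 6. Hyperparameter tuning and experiment plan
+### 6.1. Hyperparameters to tune
+Hyperparameters are organized by the four stages of the pipeline. Current values are based on `conf/config.yaml`.
+#### Stage A: Chunking (StatPearls / Textbooks only)
+| Parameter | Current value | Search range | Notes |
+|-----------|------|--------|------|
+| `chunk_size` | 2000 words | 128 / 256 / **512** / 1024 tokens | PubMed is excluded because it ships pre-chunked (296 tokens on average) |
+| `chunk_overlap` | 25 words | 0 / 32 / **64** / 128 tokens | Rule of thumb: 10 to 20% of the chunk size |
+#### Stage B: Retrieval
+| Parameter | Current value | Search range | Notes |
+|-----------|------|--------|------|
+| `retrival_topk` | 10 | 10 / 20 / **50** / 100 | Enlarged to serve as the candidate pool once a reranker is introduced |
+| `retrival_threshold` | 0.3 | 0.1 / **0.2** / 0.3 | Lower bound on the dense score. Too low increases noise |
+| BM25 `k1` | (not implemented) | 0.5 / **1.2** / 2.0 | Saturation rate of term frequency. Around 1.2 is standard for medical documents |
+| BM25 `b` | (not implemented) | 0.25 / **0.75** | Document length normalization. 0.75 is recommended for long abstracts |
+| RRF `k` | (not implemented) | 10 / **60** | Standard value is 60. A robust setting usable without training data |
+| RRF dense weight | (not implemented) | 0.3 / **0.5** / 0.7 | The higher the dense weight, the more semantic search is prioritized |
+#### Stage C: Reranking
+| Parameter | Current value | Search range | Notes |
+|-----------|------|--------|------|
+| Reranker candidate pool size | (not implemented) | 50 / **100** / 200 | Larger means higher recall and higher latency |
+| Post-rerank `topk` | (not implemented) | 5 / **10** / 20 | Number of documents finally passed to the LLM |
+#### Stage D: Scoring and aggregation (Conformal Prediction)
+| Parameter | Current value | Search range | Notes |
+|-----------|------|--------|------|
+| `scoring_strategy` | `product` | `product` / `cross_encoder` | Consider switching to `cross_encoder` after a reranker is introduced |
+| `aggregation_strategy` | `mean` | `mean` / `max` | `max` adopts the best-supported evidence |
+| `a_value` | 1.0 | 0.8 / 0.9 / **1.0** | Threshold on the fraction of retained subclaims that must be correct for an answer to count as "factual" |
+---
+### 6.2. Evaluation metrics
+The experiments use the following metrics. `SplitConformalCalibration` (`src/calibration/conformal.py`) already implements the following calculations.
+| Metric | Definition | Stage |
+|-----|------|------------|
+| **Recall@k** | Fraction of queries whose correct document is within the top k | B, C |
+| **MRR@k** | Mean Reciprocal Rank (mean of the reciprocal rank of the correct document) | B, C |
+| **Coverage rate** | Fraction of queries whose filtered answer is factual (empirical factuality); the guarantee holds when this is ≥ 1−α | D |
+| **Efficiency** | Mean subclaim removal rate at each α (higher means noise is being removed more efficiently) | D |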
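The two retrieval metrics in the table can be computed as below. This is an illustrative sketch with toy data, assuming each query has a set of gold document IDs; the function names are chosen for the example and do not refer to existing code in `src/`.

```python
def recall_at_k(ranked: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one gold document in the top k."""
    hits = sum(1 for docs, g in zip(ranked, gold) if g & set(docs[:k]))
    return hits / len(ranked)


def mrr_at_k(ranked: list[list[str]], gold: list[set[str]], k: int) -> float:
    """Mean of 1/rank of the first gold document within the top k (0 if absent)."""
    total = 0.0
    for docs, g in zip(ranked, gold):
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in g:
                total += 1.0 / rank
                break
    return total / len(ranked)


# Toy data: three queries, each with one gold document.
ranked_lists = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
gold_sets = [{"b"}, {"z"}, {"m"}]

recall_2 = recall_at_k(ranked_lists, gold_sets, k=2)  # only the first query hits
mrr_3 = mrr_at_k(ranked_lists, gold_sets, k=3)        # (1/2 + 1/3 + 0) / 3
```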
+---
+### 6.3. Experiment flow
+Adopt a sequential **greedy search**: fix the best setting of each stage before exploring the next. Because the stages depend on one another, parallel exploration happens only within a stage.
+```
+Phase 1: Chunking search (StatPearls / Textbooks only)
+  └─ Metric: Recall@10 on MedLFQA
+  └─ Fixed: Embedding=MedCPT, dense-only retrieval
+  └─ Search: grid over chunk_size × chunk_overlap (8 conditions)
+Phase 2: Hybrid search
+  └─ Metrics: Recall@10, MRR@10
+  └─ Fixed: best chunk setting from Phase 1
+  └─ Search steps:
+       2a. Dense-only baseline (current)
+       2b. Grid over BM25 k1 × b (6 conditions), RRF fusion with dense
+       2c. SPLADE vs BM25 (only if a GPU is available)
+       2d. RRF dense weight (0.3 / 0.5 / 0.7)
+Phase 3: Reranker
+  └─ Metrics: Recall@10 after rerank, NDCG@10
+  └─ Fixed: best hybrid setting from Phase 2
+  └─ Search: grid over candidate pool size × final topk (6 conditions)
+  └─ Model comparison: bge-reranker-v2-m3 vs bge-reranker-v2.5-gemma2-lightweight
+Phase 4: Conformal Prediction
+  └─ Metrics: Coverage rate, Efficiency (fraction_removed)
+  └─ Fixed: best retrieval setting from Phase 3
+  └─ Search: scoring_strategy × aggregation_strategy × a_value (12 conditions)
+  └─ Sweep alpha over 0.05 to 0.45 at the default step size (uses the existing implementation in conformal.py)
+```
+---
+### 6.4. Notes on experiment management
+- **PubMed chunk settings are fixed**: PubMed snippets are distributed at 296 tokens on average, which fits within MedCPT's 512-token limit. Phase 1 explores StatPearls / Textbooks only.
+- **MedCPT input format**: `[title, content]` must be passed as list elements, which differs from the input format of other models. Handle this when implementing `LocalEmbeddingManager`.
+- **Using the cache**: `conformal.py` caches score computation results in `.npy` files. After changing chunking or retrieval settings, delete the cache before rerunning.
+- **Dependencies between phases**: reflect each phase's best setting in `conf/config.yaml` before running the next phase. Tagging each phase with `git tag` is recommended for version control.

docs/context/06_test_strategy.md ADDED Viewed

	@@ -0,0 +1,180 @@

+# Test plan and execution flow
+The steps are separated by their degree of dependence on the OpenAI API, so that the core logic can be verified early even without an API key.
+## Overview
+| Step | Purpose | API use | Data used | Target components |
+| :--- | :--- | :---: | :--- | :--- |
+| **1. Unit tests** | Unit tests of core logic and pure functions | **No** | Synthetic data (hand-made) | `src/calibration/`, `demo/inference_api.py` |
+| **2. Smoke test** | Generate real data and check the pipeline end to end | **Yes** | Real data (tiny scale) | `main.py` (data acquisition through score computation) |
+| **3. Integration test** | Check the precomputation logic | **No** | JSON output of step 2 | `demo/precompute.py` |
+| **4. UI tests** | Check the Streamlit UI and state management | **No** | CSV/JSON output of step 3 | `demo/app.py`, `st.session_state` |
+---
+## 1. Unit tests (no API)
+The logic that computes the "Conformal Guarantee" (the statistical guarantee of factuality) at the core of the paper, and the demo's display-switching logic, are tested with top priority. They run immediately with `pytest`.
+Implementation files:
+- `tests/test_calibration.py`: calibration functions
+- `tests/test_precompute.py`: data conversion and sampling functions of `demo/precompute.py`
+**Data to prepare:** a dummy subclaim list (a few dozen Python dictionaries with, e.g., score `relevance: 0.8` and label `S`)
+**Test targets and checks:**
+1. **`src/calibration/utils.py`: `get_r_score`**
+   - Given a set of dummy subclaims, is the minimum score $r_a$ needed to meet the target accuracy $a$ computed correctly?
+   - Is `-1` (always safe) returned when all subclaims are correct?
+2. **`src/calibration/utils.py`: `compute_threshold`**
+   - Given a specific $\alpha$ (e.g. `0.10`), is the threshold $\hat{q}$ based on the `ceil((n+1)*(1-alpha))` quantile computed correctly?
+3. **`src/calibration/utils.py`: `split_group`**
+   - Is calibration/test split per group at the specified ratio (default 50/50)?
+   - Is there no data contamination across groups (overlap between calibration and test)?
+4. **`src/calibration/conformal.py`: `_evaluate_conformal_correctness`**
+   - When a threshold is applied, are only subclaims with a score at or above the threshold retained, and are `correctly_retained` and `fraction_removed` computed correctly?
+   - When all subclaims are removed (`retained_cnt == 0`), does the fallback `correctly_retained_percentage = 1` take effect?
+5. **`demo/inference_api.py`: `apply_threshold`**
+   - Given a mock `SubclaimResult` and an arbitrary $\hat{q}$, are subclaims with a score at or above $\hat{q}$ correctly counted in `keep_count`, and is the intended `FilteredResult` returned?
+   - > **About noise**: `get_r_score` in `calibration/utils.py` aggregates scores in the `score + noise` space, and `q_hat` is computed in that space too. `noise` is a tiny random value drawn from `N(0, 0.001)` as a tie-breaker for Conformal Prediction (`subclaim_processor.py:150`). `precompute.py` stores the sample queries' scores as `score + noise`, so the `score >= q_hat` comparison in `apply_threshold` is theoretically consistent. Note that live inference (`process_query`) does not generate noise, so the same guarantee does not hold there.
+6. **`demo/precompute.py`: `_entry_to_subclaim_result`** (regression test)
+   - Is `subclaims[i]["score"]` the `score + noise` value (guards against regression of the noise addition)?
+   - Does it avoid crashing on old data that lacks the `noise` key (default `0.0`)?
+   - When `is_grouped=False`, are `mode="marginal"` and `group="default"` set?
+   - When `is_grouped=True`, is `mode="conditional"` set and `group` taken from `entry["groups"][0]`?
+7. **`demo/precompute.py`: `_group_data`, `_has_score`, `_pick_entries`**
+   - Are entries classified correctly by group?
+   - Does `_has_score` return `False` when a subclaim lacks the specified score key?
+   - Does `_pick_entries` skip entries with empty subclaims and return the specified number (`N_SAMPLES_PER_DATASET` / `N_SAMPLES_PER_GROUP`)?
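The quantile rule named in item 2 can be sketched as follows. This illustrates the standard split-conformal threshold and is not the repository's `compute_threshold`: the function name, the handling of too-small calibration sets (returning infinity), and the sample scores are assumptions.

```python
import math


def conformal_threshold(r_scores: list[float], alpha: float) -> float:
    """Return the ceil((n+1)*(1-alpha))-th smallest calibration score."""
    n = len(r_scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-based rank into the sorted scores
    if rank > n:
        # Too few calibration points for this alpha: no finite threshold exists.
        return math.inf
    return sorted(r_scores)[rank - 1]


# Seven per-query calibration scores (toy values).
calibration_scores = [0.3, 0.1, 0.7, 0.5, 0.2, 0.6, 0.4]

q_hat_25 = conformal_threshold(calibration_scores, alpha=0.25)  # rank ceil(8*0.75) = 6
q_hat_05 = conformal_threshold(calibration_scores, alpha=0.05)  # rank ceil(8*0.95) = 8 > n
```

A unit test along these lines pins down both the off-by-one in the `(n + 1)` correction and the small-sample edge case.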
+---
+## 2. Smoke test (API required)
+The later tests need real data structures, so run the existing batch process at minimum scale. This also confirms that the existing `src/` logic is not broken.
+**Procedure:**
+```bash
+# Full run (all scores)
+python main.py --dataset medlf_qa --query_size 10
+# Recommended for the smoke test (skips frequency and doc_claim_cosine_similarity)
+python main.py --dataset medlf_qa --query_size 10 --lite
+```
+- Use the `medlf_qa` dataset (the most complex case, since it has group structure).
+- `--query_size 10` overrides the setting from the command line, so `conf/config.yaml` need not be edited.
+- With `2-3` queries there is too little data for the 50/50 split (`runs=1000`), and calibration becomes meaningless. About 10 queries at minimum are needed to exercise the integration test.
+**Approximate API call counts (`--query_size 10`, ~5 subclaims per query on average → ~50 subclaims in total):**
+| Step | API | Calls |
+|---------|-----|---------|
+| FAISS index construction | Embeddings | **0** (skipped if `index_10.faiss` exists) |
+| RAG answer generation | Chat (gpt-4.1-mini) | **10** |
+| Subclaim extraction | Chat (gpt-4.1-mini) | **10** |
+| `relavance` score (claim embedding) | Embeddings | **~50** |
+| `query_claim_cosine_similarity` | Embeddings | **~100** (2 calls × 50) |
+| `doc_claim_cosine_similarity` | Embeddings | **~1,000** (2 × top_k=10 × 50) |
+| `frequency` alternative answer generation (n=5 batch) | Chat (gpt-4.1-mini) | **~50 batches** |
+| `frequency` counting prompt (× 5) | Chat (gpt-4.1-mini) | **~250** |
+| Annotation (claim verification) | Chat (gpt-4.1-mini) | **~50** |
+| **Total Chat** | | **~370 calls** |
+| **Total Embeddings** | | **~1,150 calls** |
+`index_store.zip` (author-provided) contains `MedLFQA/index_10.faiss`, so placing it lets you skip index construction via the Embeddings API.
+The `--lite` flag skips the following scores and greatly reduces API cost.
+| Skipped score | Reason |
+|---|---|
+| `frequency` | ~300 Chat calls (~80% of Chat calls); unused in the demo |
+| `doc_claim_cosine_similarity` | ~1,000 Embeddings calls (~87% of Embeddings calls); unused in the demo |
+Calibration and plot generation (split conformal / group conditional conformal) are also skipped.
+**Checks:**
+1. **End-to-end pipeline**: do data acquisition, FAISS index generation, LLM inference, and score computation complete without errors?
+2. **Output file validation**: are files with the correct JSON schema generated in `data/out/` (do keys such as `relavance` and `noise` exist in `subclaims[].scores`)? With `--lite`, the `frequency` and `doc_claim_cosine_similarity` keys are not generated, while `relavance`, `noise`, `query_claim_cosine_similarity`, and `min_log_prob` are always computed.
+---
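+## 3. Integration test (no API)
+Using the small amount of real data from step 2 (the JSON in `data/out/`) as input, test that the demo precomputation script works correctly.
+**Procedure:**
+```bash
+python -m demo.precompute
+```
+**Checks:**
+1. **Generation of `demo/data/thresholds.csv`**
+   - Are rows generated for every value in `ALPHA_LEVELS` (e.g. `0.05, 0.10, ...`)?
+   - Is there no floating-point rounding error (is the `alpha` column `0.1` rather than `0.09999...`)?
+   - Is calibration done only on data excluding the holdout portion (check `calib=N, holdout=M` in the log)?
+   - > **Note**: with a small dataset equivalent to `query_size: 10`, the $\hat{q}$ values themselves carry little statistical meaning. The goal here is to confirm the schema and the generation flow. For production-quality thresholds, rerun with `query_size: 500`.
+2. **Generation of `demo/data/samples.json`**
+   - Is the metadata (generation timestamp, model name, scoring method) attached correctly?
+   - For MedLFQA, is at least one sample included from every group (HealthsearchQA / MedicationQA / K-QA_Silver / K-QA_Golden / LiveQA)?
+   - Does the `subclaims[].score` field match the lookup key used for `FilteredResult` (is the `SCORING_METHOD_TO_DATA_KEY` mapping in `constants.py` correct)?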
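The floating-point check in item 1 matters because alpha values produced by arithmetic may not equal the literals in `ALPHA_LEVELS`. The sketch below shows the effect and one possible remedy (rounding to two decimals); whether `precompute.py` uses this exact remedy is an assumption.

```python
# Literals as they appear in demo/constants.py.
ALPHA_LEVELS = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]

# Generating the same grid by multiplication introduces representation error
# for some entries, e.g. 0.05 * 3 is not exactly the float written as 0.15.
alphas_computed = [0.05 * i for i in range(1, 9)]

# Rounding to the grid's precision restores exact equality with the literals.
alphas_rounded = [round(0.05 * i, 2) for i in range(1, 9)]

mismatch_without_rounding = alphas_computed != ALPHA_LEVELS
match_with_rounding = alphas_rounded == ALPHA_LEVELS
```

An exact-match lookup such as `df[df["alpha"] == alpha]` only works reliably when both sides come from the same rounded grid.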
+---
+## 4. UI tests (no API)
+Using the generated static files (CSV/JSON), test the UI components and state transitions in the browser. Streamlit's `AppTest` framework also allows automated tests.
+**Implementation file:** `tests/test_app_ui.py` (16 tests, no API)
+**Procedure:**
+```bash
+# Via the container
+bash scripts/run_tests.sh tests/test_app_ui.py -v
+# Directly on the host
+source .env && python -m pytest tests/test_app_ui.py -v
+```
+**Prerequisite:** `demo/data/thresholds.csv` and `demo/data/samples.json` exist (step 3 has been run).
+**Checks:**
+1. **Conditional UI rendering** (`TestConditionalUIRendering`, 4 tests)
+   - Is the "Marginal / Conditional" radio button enabled only when "MedLFQA" is selected?
+   - Is the group dropdown shown only when "Conditional" is selected?
+   - How verified: `at.radio[0].disabled`, and the count `len(at.selectbox)` (Marginal=2, Conditional=3).
+2. **Session reset** (`TestSessionStateReset`, 4 tests)
+   - When the dataset, mode, or group is switched, is `st.session_state` reset through change detection on `prev_params`, so that the previous inference result does not stay on screen?
+   - On a rerun with no changes, are `query_processed` and `result` preserved?
+3. **Cache behavior (`@st.cache_resource`)** (`TestCacheBehavior`, 2 tests)
+   - Statically verify from the signatures that `get_faiss_manager` / `get_scorer` take a `dataset` argument, so that the cache is separated per dataset.
+   - > **Note**: verifying actual cache behavior is difficult in the AppTest environment, so the tests only confirm the function signature that Streamlit uses for the cache key.
+4. **Slider interactivity (most important)** (`TestSliderInteractivity`, 6 tests)
+   - When the Target Factuality slider moves, is the heavy "Run inference" processing not re-triggered, with only the highlight re-render (light processing) that compares the scores in `st.session_state` against $\hat{q}$ from the CSV running immediately?
+   - Is the slider disabled before inference and enabled after?
+   - Do the Keep / removal-rate metric values match the result of `apply_threshold`?
+   - How verified: replace `process_query` with a `MagicMock` and check with `assert_not_called()` after moving the slider.
+**Test design notes:**
+- `samples.json` contains only `medlf_qa` (conditional) samples, so all result-panel rendering tests run with `medlf_qa / marginal / default` and inject session state manually.
+- `st.metric` values come back as strings in AppTest, so compare against numbers as strings, e.g. `== "1"`.
+- `get_faiss_manager` / `get_scorer` / `process_query` are replaced with `unittest.mock.patch`, mocking out the API, FAISS, and heavy inference entirely.
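The `assert_not_called()` pattern from item 4 can be shown without Streamlit. In this sketch, `on_slider_change` is a hypothetical stand-in for the app's light path and the state dictionary imitates `st.session_state`; the real tests in `tests/test_app_ui.py` drive the app through `AppTest` instead.

```python
from unittest.mock import MagicMock


def on_slider_change(state: dict, q_hat: float, process_query) -> int:
    """Light path: recount kept subclaims from cached scores only.

    `process_query` is accepted so the test can prove it is never invoked.
    """
    return sum(1 for score in state["scores"] if score >= q_hat)


# Replace the heavy function with a mock, as the UI tests do.
process_query_mock = MagicMock(name="process_query")
session_state = {"scores": [0.9, 0.4, 0.7]}

kept = on_slider_change(session_state, q_hat=0.5, process_query=process_query_mock)

# Raises AssertionError if the heavy path was triggered by the slider change.
process_query_mock.assert_not_called()
```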

docs/context/07_huggingface_spaces_deployment.md ADDED Viewed

	@@ -0,0 +1,259 @@

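+# Hugging Face Spaces deployment design
+## 1. Goals and scope
+Run the demo app (`demo/app.py`) on Hugging Face Spaces (public).
+**Scope: sample-query-only mode**
+- Only the queries precomputed in `demo/data/samples.json` work.
+- The "Run inference" (推論実行) button shows precomputed results immediately, without using FAISS or the Wikipedia DB.
+- The "Generate answer" (回答を生成) button (LLM reintegration) calls the OpenAI API, so an API key is required.
+- Live inference (arbitrary query input) is not performed.
+**Spaces constraints**
+- Singularity cannot be used (Docker / Python environment).
+- Large binary files (FAISS indexes, SQLite DB) cannot be placed in the repository.
+- API keys can be set through environment variables (Secrets).
+---
+## 2. Portability analysis of the current code
+| Processing | Behavior on Spaces | Approach |
+|------|---------------|---------|
+| Loading `samples.json` / `thresholds.csv` | ○ Works as-is (in the repository) | No change needed |
+| "Run inference" → display of precomputed results | ○ No API calls | No change needed |
+| "Generate answer" (`reintegrate_subclaims`) | ○ OpenAI API only | Set `OPENAI_API_KEY` in Secrets |
+| `build_faiss_manager()` / `build_scorer()` | △ Errors because the files do not exist | Not called unless the live inference path is reached (see below) |
+| Resource initialization with `@st.cache_resource` | △ Errors when called | Disabled by an environment flag |
+| Loading `.env` | △ `.env` files are not used on Spaces | Spaces Secrets → injected automatically as environment variables |
+### When the live inference path is called
+In the current code, `build_faiss_manager()` / `build_scorer()` are called only inside the live inference path.
+```python
+# Live inference branch in app.py (only for queries not in sample_map)
+faiss_manager = get_faiss_manager(dataset)   # this is where the error occurs
+scorer = get_scorer(dataset)
+```
+The sample-query dropdown shows only the queries in `samples.json`, so
+in normal operation **this path is never reached**.
+However, to allow for fallbacks on unexpected errors and for future feature additions,
+the live inference path is explicitly disabled with the **environment variable flag `SPACES_DEMO=1`**.
+---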
+## 3. Required changes
+### 3-1. Changes to `app.py` (minimal)
+When `SPACES_DEMO=1` is set, block the live inference path.
+```python
+import os
+SPACES_DEMO = os.getenv("SPACES_DEMO", "0") == "1"
+```
+Inside the "Run inference" button handler:
+```python
+if run_btn and query_input:
+    precomputed = sample_map.get(query_input)
+    if precomputed is not None and ...:
+        # Precomputed: display as-is (unchanged)
+        ...
+    elif SPACES_DEMO:
+        # UI message: "This demo supports sample queries only."
+        st.error("このデモではサンプルクエリのみ対応しています。")
+    else:
+        # Live inference (local environment only)
+        ...
+```
+### 3-2. Spaces configuration file
+Spaces declares its configuration in the **repository's `README.md` (frontmatter)**.
+```yaml
+---
+title: Response Quality Assessment Demo
+emoji: 📊
+colorFrom: blue
+colorTo: green
+sdk: streamlit
+sdk_version: 1.43.2
+app_file: demo/app.py
+pinned: false
+---
+```
+Pointing `app_file` at `demo/app.py` avoids changing the repository structure.
+### 3-3. Organizing `requirements.txt` (implemented)
+Spaces automatically runs `pip install` on the `requirements.txt` in the repository root.
+The repository uses the following layout:
+| File | Purpose |
+|----------|------|
+| `requirements.txt` | Minimal set for HF Spaces. Spaces reads this file automatically |
+| `requirements-dev.txt` | Full dependencies for local and Singularity use. Inherits the common part via `-r requirements.txt` |
+**`requirements.txt` (minimal set for Spaces, based on the findings in §4-1):**
+```
+openai>=2.0
+python-dotenv>=1.0
+numpy>=1.24
+pandas>=2.0
+pyyaml>=6.0
+streamlit>=1.43
+```
+`faiss-cpu` / `torch` / `transformers` / `sentence-transformers` / `langchain-core` /
+`scikit-learn` are all unnecessary (lazy imports keep them from loading in Spaces mode).
+For local development, use `pip install -r requirements-dev.txt`.
+### 3-4. Configuring Secrets
+Add the following under **Settings > Variables and secrets** of the Space:
+| Key | Value | Purpose |
+|------|----|------|
+| `OPENAI_API_KEY` | `sk-...` | "Generate answer" button (`reintegrate_subclaims`) |
+| `SPACES_DEMO` | `1` | Disables the live inference path |
+The `.env` file is not used on Spaces (`load_dotenv()` from `python-dotenv` does not
+overwrite environment variables that are already set, so compatibility with local use is preserved).
+---
+## 4. Preliminary investigation results
+### 4-1. Impact of the `src/` import chain (done)
+We compared `sys.modules` before and after the import to check whether heavy dependencies were pulled in.
+**Cause**: `src.common.faiss_manager` → `src.common.file_manager` → `langchain_text_splitters`
+transitively pulls in `torch` / `transformers` / `sentence_transformers` and others.
+The same applies to `src.subclaim_processor.scorer.subclaim_scorer`.
+**Fix**: moved the imports of `FAISSIndexManager` / `SubclaimScorer` into the `build_faiss_manager()` /
+`build_scorer()` functions (lazy import). Type hints are kept via a `TYPE_CHECKING` guard.
+As a result, in sample-query-only mode the heavy dependencies are not loaded at import time.
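The lazy-import pattern with a `TYPE_CHECKING` guard looks roughly like the sketch below. It is not the project's code: `decimal.Decimal` stands in for a heavy class such as `FAISSIndexManager`, and `build_manager` is a hypothetical counterpart of `build_faiss_manager()`.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers only; this import never runs at runtime.
    from decimal import Decimal  # stand-in for a heavy class


def build_manager(value: str) -> "Decimal":
    """Import the (stand-in) heavy class only when the builder is called."""
    from decimal import Decimal  # lazy import, executed on first call

    return Decimal(value)


print(TYPE_CHECKING)          # False at runtime, so the guarded import is skipped
print(build_manager("1.5"))   # the import happens here
```

The string annotation `"Decimal"` keeps the type hint valid even though the name is not bound at module level at runtime.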
+**Minimal package set for Spaces** (assuming `build_*()` is never called):
+```
+openai
+python-dotenv
+numpy
+pandas
+pyyaml
+langchain-core
+streamlit>=1.43
+```
+`faiss-cpu` / `torch` / `transformers` / `sentence-transformers` / `scipy` /
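+`scikit-learn` are not needed.
+### 4-2. Resolving `DATA_ROOT` / config paths (done)
+`_load_main_config()` / `_load_dataset_config()` run only when they are called
+(not at module level), so the import succeeds even when `DATA_ROOT` is unset.
+These functions are never called in sample-query-only mode, so this is not a problem.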
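A minimal sketch of why call-time resolution keeps imports safe. It assumes the loaders read `DATA_ROOT` from the environment; `resolve_config_path` and the path layout are hypothetical, not the actual `_load_main_config()` implementation.

```python
import os


def resolve_config_path() -> str:
    """Hypothetical stand-in for a loader such as _load_main_config()."""
    # DATA_ROOT is read here, at call time, so a missing variable cannot
    # break the import of the module that defines this function.
    data_root = os.environ["DATA_ROOT"]
    return f"{data_root}/conf/config.yaml"


# Defining the function works even when DATA_ROOT is unset.
os.environ.pop("DATA_ROOT", None)

# Only a call needs the variable.
os.environ["DATA_ROOT"] = "/tmp/example_root"
print(resolve_config_path())
```

Had the lookup been done at module level, importing the module on Spaces (where `DATA_ROOT` is not set) would raise `KeyError` before the app could start.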
+---
+## 5. Deployment steps
+1. **Create the HF Space**
+   - Create `huggingface.co/spaces/<username>/<space-name>` (SDK: Streamlit)
+2. **Add the `spaces` remote**
+   ```bash
+   git remote add spaces https://huggingface.co/spaces/EQUES/Response-Quality-Assessment
+   ```
+   Authentication uses a token embedded in the remote URL (`.git/config` is not covered by `.gitignore`, but
+   it is a file managed by `git` itself and is never pushed to GitHub):
+   ```bash
+   git remote set-url spaces https://<username>:<hf_token>@huggingface.co/spaces/EQUES/Response-Quality-Assessment
+   ```
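+3. **Push from an orphan branch**
+   HF Spaces scans the **entire commit history** on push and rejects files larger than 10 MiB.
+   Ancestor commits of `feature/hf-spaces` contain large data files (such as `data/out/Medlfqav2/`),
+   so create a temporary orphan branch with no history and push from it: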
+   ```bash
+   git checkout --orphan spaces-deploy
+   git add -A
+   git rm --cached data/out/ data/raw/ -r --ignore-unmatch
+   git commit -m "deploy: initial push to HF Spaces"
+   git push --force spaces spaces-deploy:main
+   git checkout feature/hf-spaces
+   git branch -D spaces-deploy
+   ```
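+   > **Why an orphan branch**: a normal push sends the past commits as well. An orphan branch is a
+   > "starting point" with no parent commit, so only a single commit with the current snapshot reaches HF Spaces.
+   > Local and GitHub history are not affected at all.
+4. **Configure secrets**
+   - Add the following under **Settings > Variables and secrets** of the Space:
+   | Key | Value |
+   |-----|-------|
+   | `OPENAI_API_KEY` | `sk-...` |
+   | `SPACES_DEMO` | `1` |
+5. **Verify behavior**
+   - Displaying sample queries and operating the slider
+   - The 「回答を生成」 ("Generate answer") button (OpenAI API call)
+### Redeploying (after code changes)
+After changing the code, repeat the same orphan-branch procedure. The `spaces` remote is already configured, so step 2 can be skipped.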
+---
+## 6. Known issues and caveats
+### Sample matching in MedLFQA Marginal mode
+`precompute.py` generates every sample of a grouped dataset (medlf_qa) with `mode="conditional"`,
+so `samples.json` contains no `mode="marginal"` entries.
+To avoid samples not being found in Marginal mode because of a mode mismatch, `app.py` ignores
+the sample's `mode`/`group` fields when matching in Marginal mode:
+```python
+if (
+    precomputed is not None
+    and (
+        mode == "marginal"
+        or (precomputed["mode"] == mode and precomputed["group"] == group)
+    )
+):
+```
+The subclaim scores themselves do not depend on the mode, so the comparison against the threshold
+(`_lookup_q_hat` looks up the marginal q_hat) is performed correctly.
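The matching rule can be exercised in isolation. The sketch below wraps the condition shown above in a hypothetical helper, `sample_matches`; the sample dictionary and the group name `"group_a"` are made up for illustration.

```python
from typing import Optional


def sample_matches(precomputed: Optional[dict], mode: str, group: Optional[str]) -> bool:
    """Return True when a stored sample may be shown for the selected mode/group."""
    if precomputed is None:
        return False
    if mode == "marginal":
        # Marginal mode ignores the sample's stored mode/group fields.
        return True
    return precomputed["mode"] == mode and precomputed["group"] == group


# Hypothetical stored sample, generated with mode="conditional".
sample = {"mode": "conditional", "group": "group_a"}

print(sample_matches(sample, "marginal", None))          # True
print(sample_matches(sample, "conditional", "group_a"))  # True
print(sample_matches(sample, "conditional", "group_b"))  # False
print(sample_matches(None, "marginal", None))            # False
```

Conditional mode still requires an exact `mode` and `group` match, so the relaxation only affects the Marginal path.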
+---
+## 7. Development steps
+| Step | Status | Files |
+|------|--------|-------|
+| ① Import impact investigation | Done | - |
+| ② requirements cleanup | Done | `requirements.txt`, `requirements-dev.txt` |
+| ③ Add flag to `app.py` | Done | `demo/app.py` |
+| ④ Create `README.md` | Done | `README.md` |
+| ⑤ Push to Spaces | Done | - |
+| ⑥ Marginal mode bug fix | Done | `demo/app.py` |

main.py ADDED

	@@ -0,0 +1,376 @@

+import os
+import argparse
+import numpy as np
+import logging
+import yaml
+from pathlib import Path
+from src.common.config_manager import ConfigManager
+from src.dataloader.dataloader import DataLoader
+from src.data_processor.query_processor import QueryProcessor
+from src.common.file_manager import FileManager
+from src.common.faiss_manager import FAISSIndexManager
+from src.subclaim_processor.scorer.subclaim_scorer import SubclaimScorer
+from src.subclaim_processor.subclaim_processor import process_subclaims
+from src.calibration.conformal import SplitConformalCalibration
+from src.calibration.conditional_conformal import GroupConditionalConformal
+def parse_args(dataset_aliases):
+    """Parse command line arguments"""
+    parser = argparse.ArgumentParser()
+    parser.add_argument(
+        "--config",
+        type=str,
+        default="conf/config.yaml",
+        help="Path to configuration file",
+    )
+    parser.add_argument(
+        "--dataset",
+        type=str,
+        help="Override dataset name from config",
+        choices=dataset_aliases,
+    )
+    parser.add_argument(
+        "--query_size", type=int, default=500, help="Query size; always overrides the config value (default: 500)"
+    )
+    parser.add_argument("--run_id", type=str, help="Custom run identifier")
+    parser.add_argument(
+        "--lite",
+        action="store_true",
+        help="Skip frequency and doc_claim_cosine_similarity scoring (demo-unused) and conformal prediction plots",
+    )
+    parser.add_argument(
+        "--log_level",
+        type=str,
+        default="INFO",
+        choices=["DEBUG", "INFO", "WARNING", "ERROR"],
+        help="Logging level. Use DEBUG to record individual API calls (default: INFO)",
+    )
+    return parser.parse_args()
+def main():
+    available_datasets = []
+    with open("conf/dataset_config.yaml", "r") as f:
+        dataset_config = yaml.safe_load(f)
+        available_datasets = list(dataset_config["datasets"].keys())
+    # Parse arguments
+    args = parse_args(available_datasets)
+    # Initialize config manager
+    config_manager = ConfigManager(
+        config_path=args.config,
+        path_config_path="conf/path_config.yaml",
+        dataset_config_path="conf/dataset_config.yaml",
+        run_id=args.run_id,
+    )
+    dataset_aliases = list(dataset_config["datasets"].keys())
+    # Setup logging
+    log_file, run_id = config_manager.setup_logging(
+        log_level=getattr(logging, args.log_level)
+    )
+    # Update config with command line arguments if provided
+    if args.dataset or args.query_size:
+        updates = {"dataset": {}}
+        if args.dataset:
+            updates["dataset"]["name"] = args.dataset
+        if args.query_size:
+            updates["dataset"]["query_size"] = args.query_size
+        config_manager.update_config(updates)
+    lite = args.lite
+    # Save updated config
+    config_file = config_manager.save_config()
+    logging.info(f"Configuration saved to: {config_file}")
+    # Log important config values
+    config_manager.log_config()
+    # Get the config
+    config = config_manager.config
+    path_config = config_manager.path_config
+    dataset_config = config_manager.dataset_config
+    ####################################### Data and Folder Set up ############################################
+    dataset_name = config["dataset"]["name"]
+    query_size = config["dataset"]["query_size"]
+    wiki_db_file = config["dataset"]["wiki_db_file"]
+    delete_existing_index = config["index"]["delete_existing"]
+    embedding_model = config["index"]["embedding_model"]
+    index_truncation_config = config["index"]["truncation_config"]
+    truncation_strategy = index_truncation_config["strategy"]
+    truncate_by = index_truncation_config["truncate_by"]
+    response_model = config["rag"]["response_model"]
+    frequency_score_model = config["conformal_prediction"]["frequency_score_model"]
+    alpha_config = config["conformal_prediction"]["conformal_alphas"]
+    conformal_alphas = np.arange(
+        alpha_config["start"], alpha_config["end"], alpha_config["step"]
+    )
+    a_value = config["conformal_prediction"]["a_value"]
+    dataset_custom_config = dataset_config["datasets"].get(dataset_name)
+    if not dataset_custom_config:
+        raise ValueError(f"Unknown dataset: {dataset_name}")
+    full_dataset_name = dataset_custom_config["name"]
+    index_store_dir = dataset_custom_config["index_store"]
+    group_conditional_conformal = dataset_custom_config.get("is_grouped", False)
+    raw_data_dir = os.path.join(path_config["paths"]["raw_data_dir"], full_dataset_name)
+    processed_data_dir = os.path.join(
+        path_config["paths"]["processed_data_dir"], full_dataset_name
+    )
+    response_dir = os.path.join(path_config["paths"]["response_dir"], full_dataset_name)
+    wiki_db_path = os.path.join(path_config["paths"]["wiki_db_dir"], wiki_db_file)
+    result_dir = os.path.join(
+        path_config["paths"]["result_dir"], full_dataset_name, run_id
+    )
+    # set up directories
+    for dir_path in [raw_data_dir, processed_data_dir, response_dir, result_dir]:
+        os.makedirs(dir_path, exist_ok=True)
+        logging.info(f"Directory ensured: {dir_path}")
+    # Determine raw data file path
+    if dataset_name == "medlf_qa":
+        input_file = os.path.join(path_config["paths"]["raw_data_dir"], "MedLFQA")
+        raw_data_path = input_file
+    else:
+        raw_data_file = f"raw_{dataset_name}.json"
+        raw_data_path = os.path.join(raw_data_dir, raw_data_file)
+    logging.info(f"Raw data path: {raw_data_path}")
+    # Load data if needed
+    if not os.path.exists(raw_data_path):
+        logging.info(f"Raw data not found. Loading data for {dataset_name}")
+        data_loader = DataLoader(dataset_name)
+        data_loader.load_qa_data(output_path=raw_data_path)
+        logging.info(f"Data loaded and saved to {raw_data_path}")
+    # create wiki db if needed
+    if not os.path.exists(wiki_db_path) or not os.path.isfile(wiki_db_path):
+        wiki_source = os.path.join(
+            path_config["paths"]["wiki_db_dir"],
+            "enwiki-20171001-pages-meta-current-withlinks-abstracts",
+        )
+        if not os.path.exists(wiki_source):
+            raise FileNotFoundError(f"Wiki source data not found at {wiki_source}")
+        logging.info(f"Wiki DB not found. Creating from source {wiki_source}")
+        data_loader = DataLoader(dataset_name)
+        data_loader.create_wiki_db(source_path=wiki_source, output_path=wiki_db_path)
+        logging.info(f"Wiki DB created at {wiki_db_path}")
+    # Process queries and documents
+    input_file = raw_data_path
+    if dataset_name == "medlf_qa":
+        input_file = os.path.join(path_config["paths"]["raw_data_dir"], "MedLFQA")
+    query_output_file = f"{dataset_name}_queries.json"
+    document_output_file = f"{dataset_name}_documents.txt"
+    subclaims_path = os.path.join(
+        response_dir,
+        f"{dataset_name}_{query_size}_subclaims_with_scores_{response_model}.json",
+    )
+    CP_result_fig_path = os.path.join(
+        result_dir, f"{dataset_name}_{query_size}_a={a_value:.2f}_CP_removal.png"
+    )
+    GCP_result_fig_path = os.path.join(
+        result_dir, f"{dataset_name}_{query_size}_a={a_value:.2f}_GCP_removal.png"
+    )
+    factual_result_fig_path = os.path.join(
+        result_dir,
+        f"{dataset_name}_{query_size}_a={a_value:.2f}_factual_correctness.png",
+    )
+    group_factual_result_fig_path = os.path.join(
+        result_dir,
+        f"group_{dataset_name}_{query_size}_a={a_value:.2f}_factual_correctness.png",
+    )
+    result_path = os.path.join(
+        result_dir, f"{dataset_name}_{query_size}_a={a_value:.2f}.csv"
+    )
+    group_result_path = os.path.join(
+        result_dir, f"group_{dataset_name}_{query_size}_a={a_value:.2f}.csv"
+    )
+    ####################################### End of Data and Folder Set up ######################################
+    # Create QueryProcessor
+    logging.info("Initializing QueryProcessor")
+    query_processor = QueryProcessor(db_path=wiki_db_path, query_size=query_size)
+    # Create queries data
+    logging.info("Processing queries")
+    queries, query_path = query_processor.get_queries(
+        dataset=dataset_name,
+        input_file=input_file,
+        output_dir=processed_data_dir,
+        output_file=query_output_file,
+    )
+    logging.info(f"Query size: {len(queries)}")
+    # Create documents data
+    logging.info("Processing documents")
+    document_path = query_processor.get_documents(
+        query_dir=query_path,
+        output_dir=processed_data_dir,
+        output_file=document_output_file,
+    )
+    logging.info(f"Documents saved to {document_path}")
+    # Index creation and retrieval
+    os.makedirs(index_store_dir, exist_ok=True)
+    index_file_path = os.path.join(index_store_dir, f"index_{query_size}.faiss")
+    indice2fm_path = os.path.join(index_store_dir, f"indice2fm_{query_size}.json")
+    logging.info("Setting up FAISS index manager")
+    faiss_manager = FAISSIndexManager(
+        index_truncation_config=index_truncation_config,
+        index_path=index_file_path,
+        indice2fm_path=indice2fm_path,
+    )
+    if delete_existing_index:
+        logging.info("Deleting existing index as requested")
+        faiss_manager.delete_index()
+    # Create index if it does not exist
+    document_file = FileManager(
+        document_path, index_truncation_config=index_truncation_config
+    )
+    logging.info(
+        f"Using truncation strategy: {truncation_strategy}, truncate_by: {truncate_by}"
+    )
+    # If Index doesn't exist yet
+    if not os.path.exists(index_file_path):
+        try:
+            logging.info(f"Creating new index with document '{document_path}'")
+            faiss_manager.upsert_file_to_faiss(
+                document_file,
+                truncation_strategy=truncation_strategy,
+                truncate_by=truncate_by,
+            )
+            logging.info("Index created successfully")
+        except Exception as e:
+            error_msg = f"Failed to create new index: {str(e)}"
+            logging.error(error_msg)
+            raise RuntimeError(error_msg) from e
+    # If Index exists but current document isn't indexed
+    elif document_path not in faiss_manager.indice2fm:
+        # Verify index integrity
+        logging.info("Checking index integrity")
+        if not faiss_manager.is_indice_align():
+            error_msg = "Index corruption detected: index and indice2fm are not aligned"
+            logging.error(error_msg)
+            raise ValueError(error_msg)
+        try:
+            logging.info(f"Adding document '{document_path}' to existing index")
+            faiss_manager.upsert_file_to_faiss(
+                document_file,
+                truncation_strategy=truncation_strategy,
+                truncate_by=truncate_by,
+            )
+            logging.info("Document added to index successfully")
+        except Exception as e:
+            error_msg = f"Failed to add document to index: {str(e)}"
+            logging.error(error_msg)
+            raise RuntimeError(error_msg) from e
+    # Case 3: Document is already indexed
+    else:
+        logging.info(f"Document '{document_path}' is already indexed")
+    # generate subclaims with scores
+    logging.info(f"Initializing SubclaimScorer with embedding model {embedding_model}")
+    scorer = SubclaimScorer(
+        index_truncation_config=index_truncation_config,
+        embedding_model=embedding_model,
+        index_path=index_file_path,
+        indice2fm_path=indice2fm_path,
+        frequency_score_model=frequency_score_model,
+    )
+    logging.info("Processing subclaims and generating scores")
+    subclaim_with_annotation_data = process_subclaims(
+        query_path=query_path,
+        subclaims_path=subclaims_path,
+        faiss_manager=faiss_manager,
+        scorer=scorer,
+        config=config,
+        lite=lite,
+    )
+    logging.info(f"Subclaims processed and saved to {subclaims_path}")
+    # calibration and conformal prediction results
+    if not lite and config["conformal_prediction"]["split_conformal"]:
+        logging.info("Running split conformal prediction")
+        conformal = SplitConformalCalibration(dataset_name=dataset_name)
+        logging.info(
+            f"Plotting conformal removal with alphas: {conformal_alphas}, a={a_value}"
+        )
+        conformal.plot_conformal_removal(
+            data=subclaim_with_annotation_data,
+            alphas=conformal_alphas,
+            a=a_value,
+            fig_filename=CP_result_fig_path,
+            csv_filename=result_path,
+        )
+        logging.info(f"CP removal plot saved to {CP_result_fig_path}")
+        logging.info("Plotting factual removal")
+        conformal.plot_factual_removal(
+            data=subclaim_with_annotation_data,
+            alphas=conformal_alphas,
+            a=a_value,
+            fig_filename=factual_result_fig_path,
+            csv_filename=result_path,
+        )
+        logging.info(f"Factual removal plot saved to {factual_result_fig_path}")
+        logging.info(f"Results saved to {result_path}")
+    if not lite and group_conditional_conformal:
+        logging.info("Running group conditional conformal prediction")
+        conformal = GroupConditionalConformal(dataset_name=dataset_name, result_dir=result_dir)
+        logging.info(
+            f"Plotting conformal removal with alphas: {conformal_alphas}, a={a_value}"
+        )
+        conformal.plot_conformal_removal(
+            data=subclaim_with_annotation_data,
+            alphas=conformal_alphas,
+            a=a_value,
+            fig_filename=GCP_result_fig_path,
+            csv_filename=group_result_path,
+        )
+        logging.info(f"CP removal plot saved to {GCP_result_fig_path}")
+        logging.info("Plotting factual removal")
+        conformal.plot_factual_removal(
+            data=subclaim_with_annotation_data,
+            alphas=conformal_alphas,
+            a=a_value,
+            fig_filename=group_factual_result_fig_path,
+            csv_filename=group_result_path,
+        )
+        logging.info(f"Factual removal plot saved to {group_factual_result_fig_path}")
+        logging.info(f"Results saved to {group_result_path}")
+    # Copy config and log files to result directory for reproducibility
+    result_run_dir = config_manager.copy_run_artifacts(result_dir)
+    logging.info(
+        f"Run completed successfully. Results and logs saved to {result_run_dir}"
+    )
+if __name__ == "__main__":
+    main()

requirements-dev.txt ADDED

	@@ -0,0 +1,13 @@

+-r requirements.txt
+faiss-cpu
+torch
+transformers
+sentence-transformers
+huggingface-hub
+datasets
+langchain
+scikit-learn
+PyPDF2
+jsonschema
+matplotlib
+tqdm

requirements.txt ADDED

	@@ -0,0 +1,13 @@

+# Minimal requirements for HF Spaces (SPACES_DEMO=1, sample queries only).
+# FAISS / torch / transformers are NOT needed because FAISSIndexManager and
+# SubclaimScorer are lazily imported and never called in this mode.
+#
+# Usage: Spaces installs this file automatically from the repository root;
+# for local development use requirements-dev.txt, which includes it via -r.
+openai>=2.0
+python-dotenv>=1.0
+numpy>=1.24
+pandas>=2.0
+pyyaml>=6.0
+streamlit>=1.43

scripts/build_sif.sh ADDED

	@@ -0,0 +1,15 @@

+#!/usr/bin/env bash
+# Build the Singularity SIF image from singularity/response_quality.def.
+# Usage: bash scripts/build_sif.sh
+set -euo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+source "${REPO_ROOT}/.env"
+mkdir -p "${SIF_DIR}"
+singularity build --fakeroot --force \
+    "${SIF_DIR}/response_quality.sif" \
+    "${REPO_ROOT}/singularity/response_quality.def"
+echo "Built: ${SIF_DIR}/response_quality.sif"

scripts/run_demo.sh ADDED

	@@ -0,0 +1,28 @@

+#!/usr/bin/env bash
+# Launch the Streamlit demo inside the Singularity container.
+# Usage: bash scripts/run_demo.sh [--port PORT]
+# Access: http://localhost:<PORT>  (default: 8502)
+set -euo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+source "${REPO_ROOT}/.env"
+PORT=8502
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        --port) PORT="${2:?--port requires a value}"; shift 2 ;;
+        *) echo "Unknown option: $1" >&2; exit 1 ;;
+    esac
+done
+singularity run \
+    --bind "${DATA_ROOT}/data:${REPO_ROOT}/data" \
+    --bind "${DATA_ROOT}/index_store:${REPO_ROOT}/index_store" \
+    --bind "${HF_HOME}:${HF_HOME}" \
+    --env HF_HOME="${HF_HOME}" \
+    --env OPENAI_API_KEY="${OPENAI_API_KEY}" \
+    --env DATA_ROOT="${REPO_ROOT}" \
+    --env PYTHONPATH="${REPO_ROOT}" \
+    --pwd "${REPO_ROOT}" \
+    "${SIF_DIR}/response_quality.sif" \
+    streamlit run demo/app.py --server.port "${PORT}"

scripts/run_main.sh ADDED

	@@ -0,0 +1,22 @@

+#!/usr/bin/env bash
+# Run main.py inside the Singularity container.
+# All arguments are forwarded to main.py as-is.
+#
+# Usage:
+#   bash scripts/run_main.sh --dataset hotpot_qa
+#   bash scripts/run_main.sh --dataset hotpot_qa --query_size 200
+set -euo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+source "${REPO_ROOT}/.env"
+singularity exec \
+    --bind "${DATA_ROOT}/data:${REPO_ROOT}/data" \
+    --bind "${DATA_ROOT}/index_store:${REPO_ROOT}/index_store" \
+    --bind "${HF_HOME}:${HF_HOME}" \
+    --env HF_HOME="${HF_HOME}" \
+    --env OPENAI_API_KEY="${OPENAI_API_KEY}" \
+    --env DATA_ROOT="${REPO_ROOT}" \
+    --pwd "${REPO_ROOT}" \
+    "${SIF_DIR}/response_quality.sif" \
+    python main.py "$@"

scripts/run_precompute.sh ADDED

	@@ -0,0 +1,25 @@

+#!/usr/bin/env bash
+# Run demo/precompute.py inside the Singularity container.
+# Usage: bash scripts/run_precompute.sh
+set -euo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+source "${REPO_ROOT}/.env"
+singularity exec \
+    --bind "${DATA_ROOT}/data:${REPO_ROOT}/data" \
+    --bind "${DATA_ROOT}/index_store:${REPO_ROOT}/index_store" \
+    --bind "${HF_HOME}:${HF_HOME}" \
+    --env HF_HOME="${HF_HOME}" \
+    --env OPENAI_API_KEY="${OPENAI_API_KEY}" \
+    --env DATA_ROOT="${REPO_ROOT}" \
+    --pwd "${REPO_ROOT}" \
+    "${SIF_DIR}/response_quality.sif" \
+    python -m demo.precompute
+echo ""
+echo "Generated:"
+echo "  ${REPO_ROOT}/demo/data/thresholds.csv"
+echo "  ${REPO_ROOT}/demo/data/samples.json"
+echo ""
+echo "Commit these files: git add demo/data/ && git commit"

scripts/run_tests.sh ADDED

	@@ -0,0 +1,37 @@

+#!/usr/bin/env bash
+# Run pytest inside the Singularity container.
+# All arguments are forwarded to pytest as-is.
+#
+# Prerequisites: .env must define SIF_DIR (and optionally DATA_ROOT, HF_HOME).
+# See docs/context/04_environment_setup.md and .env.example.
+#
+# Usage:
+#   bash scripts/run_tests.sh                          # run all tests
+#   bash scripts/run_tests.sh tests/test_calibration.py -v
+set -euo pipefail
+REPO_ROOT="$(cd "$(dirname "$0")/.." && pwd)"
+source "${REPO_ROOT}/.env"
+if [[ -z "${SIF_DIR:-}" ]]; then
+    echo "ERROR: SIF_DIR is not set. Fill in .env (see .env.example)." >&2
+    exit 1
+fi
+# Build the bind list only for paths that are set.
+BINDS=()
+if [[ -n "${DATA_ROOT:-}" ]]; then
+    BINDS+=(--bind "${DATA_ROOT}/data:${REPO_ROOT}/data")
+    BINDS+=(--bind "${DATA_ROOT}/index_store:${REPO_ROOT}/index_store")
+fi
+if [[ -n "${HF_HOME:-}" ]]; then
+    BINDS+=(--bind "${HF_HOME}:${HF_HOME}")
+fi
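+# "${BINDS[@]+...}" expands to nothing when BINDS is empty, which keeps
+# `set -u` from failing on bash versions older than 4.4.
+singularity exec \
+    ${BINDS[@]+"${BINDS[@]}"} \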
+    --env HF_HOME="${HF_HOME:-}" \
+    --env DATA_ROOT="${REPO_ROOT}" \
+    --pwd "${REPO_ROOT}" \
+    "${SIF_DIR}/response_quality.sif" \
+    python -m pytest "$@"

singularity/response_quality.def ADDED

	@@ -0,0 +1,45 @@

+Bootstrap: docker
+From: python:3.11-slim
+%post
+    apt-get update && apt-get install -y --no-install-recommends \
+        build-essential \
+        git \
+        && rm -rf /var/lib/apt/lists/*
+    # Install torch (CPU-only) explicitly before other packages to avoid
+    # pulling in the full CUDA build as a transitive dependency.
+    pip install --no-cache-dir \
+        torch --index-url https://download.pytorch.org/whl/cpu
+    pip install --no-cache-dir \
+        streamlit \
+        openai \
+        faiss-cpu \
+        numpy \
+        pandas \
+        pyyaml \
+        python-dotenv \
+        transformers \
+        sentence-transformers \
+        huggingface-hub \
+        datasets \
+        langchain \
+        scikit-learn \
+        PyPDF2 \
+        jsonschema \
+        matplotlib \
+        tqdm \
+        langchain-text-splitters \
+        pytest
+%environment
+    export PYTHONUNBUFFERED=1
+    export PYTHONDONTWRITEBYTECODE=1
+%runscript
+    exec "$@"
+%labels
+    Author ryoya.awano
+    Description "Response Quality Assessment demo (Conformal RAG)"

src/calibration/__init__.py ADDED

File without changes

src/calibration/base_calibration.py ADDED

	@@ -0,0 +1,19 @@

+from abc import ABC, abstractmethod
+class ICalibration(ABC):
+    """
+    Interface for calibration methods.
+    """
+    @abstractmethod
+    def plot_conformal_removal(
+        self, data, alphas, a, fig_filename, csv_filename, plot_group_results=False
+    ):
+        pass
+    @abstractmethod
+    def plot_factual_removal(
+        self, data, alphas, a, fig_filename, csv_filename, plot_group_results=False
+    ):
+        pass

src/calibration/conditional_conformal.py ADDED

	@@ -0,0 +1,278 @@

+import os
+import csv
+import random
+import numpy as np
+import matplotlib.pyplot as plt
+from tqdm import tqdm
+from collections import defaultdict
+from src.calibration.conformal import SplitConformalCalibration
+from src.calibration.utils import compute_threshold
+from src.calibration.utils import append_result_to_csv
+from src.calibration.utils import split_group
+CORRECT_ANNOTATIONS = ["S"]
+class GroupConditionalConformal(SplitConformalCalibration):
+    def __init__(self, dataset_name: str, result_dir: str, runs: int = 1000):
+        super().__init__(dataset_name, runs)
+        self.result_dir = result_dir
+    def compute_conformal_results(
+        self, data: list, alphas: np.ndarray, a: float, plot_group_results: bool = False
+    ):
+        results = {}
+        for confidence_method in self.confidence_method:
+            results[confidence_method] = {}
+            for alpha in tqdm(
+                alphas, desc=f"Computing conformal results for {confidence_method}"
+            ):
+                # TODO add grouping
+                groups = None
+                thresholds_result = []
+                correctness_list = []
+                fraction_removed_list = []
+                test_data = []
+                for _ in range(self.runs):
+                    random.shuffle(data)
+                    calibration_data, test_data = split_group(data)
+                    groups = list(calibration_data.keys())
+                    assert (
+                        len(calibration_data) != 0
+                    ), "Calibration data should not be empty"
+                    assert len(test_data) != 0, "Test data should not be empty"
+                    thresholds = self._compute_threshold_by_group(
+                        alpha, calibration_data, a, confidence_method, groups=groups
+                    )
+                    correctness, fraction_removed = (
+                        self._evaluate_conformal_correctness(
+                            test_data, thresholds, a, confidence_method
+                        )
+                    )
+                    thresholds_result.append(thresholds)
+                    correctness_list.append(correctness)
+                    fraction_removed_list.append(fraction_removed)
+                results[confidence_method][alpha] = {
+                    "threshold": thresholds_result,
+                    "correctness": correctness_list,
+                    "fraction_removed": fraction_removed_list,
+                }
+        return results
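+    def _compute_threshold_by_group(
+        self,
+        alpha: float,
+        calibration_data: dict,
+        a: float,
+        confidence_method: str,
+        groups: list | None = None,
+    ):
+        # Default to every group present in the calibration data.
+        if groups is None:
+            groups = list(calibration_data.keys())
+        thresholds = {}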
+        for group in groups:
+            group_data = calibration_data[group]
+            thresholds[group] = compute_threshold(
+                alpha, group_data, a, confidence_method
+            )
+        return thresholds
+    def _evaluate_conformal_correctness(
+        self, data: list, thresholds: dict, a: float, confidence_method: str
+    ):
+        """
+        Evaluates the performance of a conformal prediction model on test data.
+        Parameters:
+        data (list): A list of dictionaries, where each dictionary represents an entry with "groups" and "subclaims".
+        thresholds (dict): Mapping from group name to the score threshold at or above which a subclaim is retained.
+        a (float): The minimum fraction of retained subclaims that must be correct for an entry to count as correctly retained.
+        confidence_method (str): Key of the score in subclaim["scores"] that is compared against the threshold.
+        Returns:
+        tuple: A tuple containing two floats:
+            - correctly_retained (float): Fraction of entries that are correctly retained.
+            - fraction_removed (float): Mean fraction of subclaims removed per entry.
+        """
+        correctly_retained = []
+        fraction_removed = []
+        for entry in data:
+            removal_count = 0
+            retained_cnt = 0
+            correctly_retained_count = 0
+            threshold = thresholds[entry["groups"][0]] # Get threshold for the group
+            if threshold is None:
+                raise ValueError(
+                    f"Threshold for group {entry['groups'][0]} is None. Check your calibration data."
+                )
+            for subclaim in entry["subclaims"]:
+                # Find similarity score
+                score = subclaim["scores"][confidence_method]
+                noise = subclaim["scores"]["noise"]
+                if score + noise >= threshold:
+                    retained_cnt += 1
+                    if (
+                        subclaim.get("annotations", {}).get("gpt", "")
+                        in CORRECT_ANNOTATIONS
+                    ):
+                        correctly_retained_count += 1
+                else:
+                    removal_count += 1
+            total_subclaims = len(entry["subclaims"])
+            # Calculate fraction of removed subclaims
+            entry_removal_rate = (
+                0 if total_subclaims == 0 else removal_count / total_subclaims
+            )
+            fraction_removed.append(
+                entry_removal_rate
+            )  # e.g. fraction_removed = [0.2, 0.5, 0.6, 0.2, 0.7] - one element per data entry
+            # Calculate correctly retained rate
+            correctly_retained_percentage = (
+                correctly_retained_count / retained_cnt if retained_cnt > 0 else 1
+            )
+            correctly_retained.append(correctly_retained_percentage >= a)
+        return np.mean(correctly_retained), np.mean(fraction_removed)
+    def compute_factual_results(self, data, alphas, a, calibrate_range=0.5):
+        overall_results = {}
+        per_group_results = defaultdict(lambda: defaultdict(dict))
+        for method in self.confidence_method:
+            overall_results[method] = {}
+            for alpha in tqdm(
+                alphas, desc=f"Computing factual results for {method}"
+            ):
+                # trackers for this (method, alpha)
+                overall_correctness = []
+                thresholds_per_group = defaultdict(list)
+                correctness_per_group = defaultdict(list)
+                for _ in range(self.runs):
+                    random.shuffle(data)
+                    calibration_data, test_data = split_group(data, calibrate_range)
+                    groups = list(calibration_data.keys())
+                    # assert on nonempty
+                    assert calibration_data, "No calibration groups"
+                    assert len(test_data) != 0, "Test data should not be empty"
+                    # compute all thresholds at once
+                    thresholds = self._compute_threshold_by_group(
+                        alpha, calibration_data, a, method, groups=groups
+                    )
+                    fraction_correct = self._evaluate_factual_correctness(
+                        test_data, thresholds, a, method
+                    )
+                    for group in groups:
+                        thresholds_per_group[group].append(thresholds[group])
+                        correctness_per_group[group].append(
+                            fraction_correct[group]
+                        )
+                    overall_correctness.append(fraction_correct["overall"])
+                # package overall
+                overall_results[method][alpha] = {
+                    "threshold": thresholds_per_group,
+                    "correctness": overall_correctness,
+                    "factuality": 1 - alpha,
+                }
+                # package per‐group
+                for grp in thresholds_per_group:
+                    per_group_results[grp][method][alpha] = {
+                        "threshold": thresholds_per_group[grp],
+                        "correctness": correctness_per_group[grp],
+                        "factuality": 1 - alpha,
+                    }
+        # now write your CSVs
+        for grp, grp_results in per_group_results.items():
+            csv_name = os.path.join(self.result_dir, f"{self.dataset_name}_{grp}_factual_correctness.csv")
+            self._write_csv_header(csv_name, alphas)
+            for method, res in grp_results.items():
+                lvl, corr, err = self.process_factual_correctness_results(res)
+                append_result_to_csv(
+                    csv_filename=csv_name,
+                    label=f"{method}_factual_correctness",
+                    y=corr,
+                    yerr=err,
+                )
+        return overall_results
+    def _evaluate_factual_correctness(
+        self,
+        data: list,
+        thresholds: dict,
+        a: float,
+        confidence_method: str,
+    ):
+        """
+        Evaluates the factual correctness of subclaims within the provided data,
+        using a per‑group threshold but computing overall accuracy over all entries.
+        Args:
+            data (list): A list of dicts, each with "groups" and "subclaims".
+            thresholds (dict): Mapping from group_name -> threshold float.
+            a (float): The accuracy level to compare the correctly retained percentage against.
+            confidence_method (str): Which score key to use for similarity.
+        Returns:
+            dict: { "overall": float, "<group1>": float, "<group2>": float, … }
+        """
+        per_group_percentages = defaultdict(list)
+        total_pass = 0
+        total_entries = 0
+        for entry in data:
+            group_name = entry["groups"][0]
+            retained_cnt = 0
+            correctly_retained_cnt = 0
+            # count retained & correctly retained
+            for sub in entry["subclaims"]:
+                score = sub["scores"][confidence_method]
+                noise = sub["scores"]["noise"]
+                if score + noise >= thresholds[group_name]:
+                    retained_cnt += 1
+                    if sub.get("annotations", {}).get("gpt", "") in CORRECT_ANNOTATIONS:
+                        correctly_retained_cnt += 1
+            # pct for this entry (1.0 if nothing retained)
+            pct = (correctly_retained_cnt / retained_cnt) if retained_cnt > 0 else 1.0
+            per_group_percentages[group_name].append(pct)
+            # update global pass/fail
+            total_entries += 1
+            if pct >= a:
+                total_pass += 1
+        # compute per‑group correctness
+        per_group_correctness = {
+            grp: sum(1 for pct in pct_list if pct >= a) / len(pct_list)
+            for grp, pct_list in per_group_percentages.items()
+        }
+        # compute overall exactly as “fraction of all entries passing”
+        overall = total_pass / total_entries if total_entries > 0 else 0.0
+        # assemble result
+        result = {"overall": overall}
+        result.update(per_group_correctness)
+        return result
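
The per-group evaluation above relies on `split_group` (defined in `src/calibration/utils.py`) to split every group at the same ratio. The following is a minimal sketch of that stratified split on toy data. The function name `stratified_split`, the `id` field, and the group labels are hypothetical and only serve the illustration.

```python
from collections import defaultdict
from math import ceil


def stratified_split(data, calibrate_fraction=0.5):
    """Split entries per group so each group keeps the same calibration ratio."""
    by_group = defaultdict(list)
    for entry in data:
        # As in the source, only the first listed group is used.
        by_group[entry["groups"][0]].append(entry)

    calibration, test = {}, []
    for group, entries in by_group.items():
        cut = ceil(len(entries) * calibrate_fraction)
        calibration[group] = entries[:cut]  # dict keyed by group name
        test.extend(entries[cut:])          # flat list across all groups
    return calibration, test


# Hypothetical toy data: 3 "oncology" entries and 6 "cardiology" entries.
toy = [
    {"id": i, "groups": ["oncology" if i % 3 == 0 else "cardiology"]}
    for i in range(9)
]
calibration, test = stratified_split(toy)
print({group: len(entries) for group, entries in calibration.items()}, len(test))
```

Because the cut point uses `ceil`, a group with a single entry lands entirely in the calibration split and contributes no test entries. Code that looks up per-group test metrics for every calibration group may need to handle that case.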

src/calibration/conformal.py ADDED

	@@ -0,0 +1,422 @@

+import os
+import csv
+import random
+import numpy as np
+import matplotlib.pyplot as plt
+from tqdm import tqdm
+from src.calibration.base_calibration import ICalibration
+from src.calibration.utils import compute_threshold
+from src.calibration.utils import append_result_to_csv
+CORRECT_ANNOTATIONS = ["S"]
+class SplitConformalCalibration(ICalibration):
+    """
+    Implementation of standard conformal calibration.
+    """
+    def __init__(self, dataset_name: str, runs: int = 1000):
+        self.dataset_name = dataset_name
+        self.confidence_method = [
+            "relavance",
+            "frequency",
+            "query_claim_cosine_similarity",
+            "doc_claim_cosine_similarity",
+            "min_log_prob",
+            "random",
+            "ordinal",
+        ]
+        self.runs = runs
+    def plot_conformal_removal(
+        self, data, alphas, a, fig_filename, csv_filename
+    ):
+        # compute the correctness and fraction removed for each alpha
+        cache_filename = f"{os.path.splitext(os.path.abspath(csv_filename))[0]}_conformal_removal_cache.npy"
+        if not os.path.exists(cache_filename):
+            results = self.compute_conformal_results(
+                data, alphas, a
+            )
+            print(f"Caching results to {cache_filename}")
+            np.save(cache_filename, results)
+        else:
+            print(f"Loading cached results from {cache_filename}")
+            results = np.load(cache_filename, allow_pickle=True).item()
+        ax = None
+        for confidence_method, result in results.items():
+            correctness, fraction_removed, yerr = (
+                self.process_conformal_removal_results(result)
+            )
+            # write the results to csv file
+            self._write_csv_header(csv_filename, alphas)
+            append_result_to_csv(
+                csv_filename=csv_filename,
+                label=f"{confidence_method}_conformal_removal_rate",
+                y=fraction_removed,
+                yerr=yerr,
+            )
+            # plot the results
+            print(f"Producing conformal plot for {confidence_method}")
+            ax = self.plot_conformal_removal_rate_by_alpha(
+                correctness,
+                fraction_removed,
+                yerr,
+                a,
+                confidence_method,
+                fig_filename,
+                ax,
+            )
+            print(f"Conformal plot saved to {fig_filename}")
+    def compute_conformal_results(
+        self, data: list, alphas: np.ndarray, a: float
+    ):
+        results = {}
+        for confidence_method in self.confidence_method:
+            results[confidence_method] = {}
+            for alpha in tqdm(
+                alphas, desc=f"Computing conformal results for {confidence_method}"
+            ):
+                thresholds = []
+                correctness_list = []
+                fraction_removed_list = []
+                for _ in range(self.runs):
+                    random.shuffle(data)
+                    split_index = len(data) // 2
+                    calibration_data = data[:split_index]
+                    test_data = data[split_index:]
+                    assert (
+                        len(calibration_data) != 0
+                    ), "Calibration data should not be empty"
+                    assert len(test_data) != 0, "Test data should not be empty"
+                    threshold = compute_threshold(alpha, calibration_data, a, confidence_method)
+                    correctness, fraction_removed = (
+                        self._evaluate_conformal_correctness(
+                            test_data, threshold, a, confidence_method
+                        )
+                    )
+                    thresholds.append(threshold)
+                    correctness_list.append(correctness)
+                    fraction_removed_list.append(fraction_removed)
+                results[confidence_method][alpha] = {
+                    "threshold": thresholds,
+                    "correctness": correctness_list,
+                    "fraction_removed": fraction_removed_list,
+                }
+        return results
+    def process_conformal_removal_results(self, results: dict):
+        """
+        x: list of average correctness, one value per alpha
+        y: list of average fraction removed, one value per alpha
+        yerr: list of 95% confidence half-widths (1.96 * standard error) of fraction removed
+        """
+        x, y, yerr = [], [], []
+        for alpha, results_for_alpha in results.items():
+            x_per_alpha = np.mean(
+                results_for_alpha["correctness"]
+            )  # fraction of entries correctly retained at this alpha, averaged over all shuffled runs
+            y_per_alpha = np.mean(
+                results_for_alpha["fraction_removed"]
+            )  # fraction of subclaims removed at this alpha, averaged over all shuffled runs
+            x.append(x_per_alpha)
+            y.append(y_per_alpha)
+            yerr.append(
+                (
+                    np.std(results_for_alpha["fraction_removed"])
+                    * 1.96
+                    / np.sqrt(len(results_for_alpha["fraction_removed"]))
+                )
+            )
+        return x, y, yerr
+    def plot_conformal_removal_rate_by_alpha(
+        self, x, y, yerr, a, confidence_method, fig_filename, ax=None
+    ):
+        if not ax:
+            fig, ax = plt.subplots(figsize=(8, 6), dpi=800)
+            ax.set_title(
+                f"Conformal Plots for {self.dataset_name} Datasets (a={a})", fontsize=20
+            )
+            x_label = (
+                f"Fraction achieving avg factuality >= {a}"
+                if a != 1
+                else "Fraction of factual outputs"
+            )
+            ax.set_xlabel(x_label, fontsize=16)
+            ax.set_ylabel("Average percent removed", fontsize=16)
+        else:
+            fig = ax.figure
+        # Plot the data
+        ax.errorbar(x, y, yerr=yerr, label=confidence_method, linewidth=2)
+        # set the legend
+        ax.legend(loc="upper left", bbox_to_anchor=(0.02, 0.98), fontsize=10)
+        # Save the figure
+        fig.savefig(fig_filename, bbox_inches="tight")
+        return ax  # Return the ax for further modifications if needed
+    def _write_csv_header(self, csv_filename, alphas):
+        target_factuality = [f"{(1-x):.2f}" for x in alphas][::-1]
+        header = ["target_factuality"] + target_factuality
+        # Ensure the directory exists
+        os.makedirs(os.path.dirname(csv_filename), exist_ok=True)
+        if not os.path.exists(csv_filename):
+            with open(csv_filename, mode="w", newline="") as file:
+                csv.writer(file).writerow(header)
+    def _evaluate_conformal_correctness(
+        self, data: list, threshold: float, a: float, confidence_method: str
+    ):
+        """
+        Evaluates the performance of a conformal prediction model on test data.
+        Parameters:
+        data (list): A list of dictionaries, where each dictionary represents an entry with subclaims.
+        threshold (float): The score threshold at or above which a subclaim is retained.
+        a (float): The minimum fraction of retained subclaims that must be correct for an entry to count as correctly retained.
+        confidence_method (str): Which score key to use when scoring subclaims.
+        Returns:
+        tuple: A pair of floats:
+            - correctly_retained (float): Fraction of entries whose retained subclaims are at least `a` correct.
+            - fraction_removed (float): Fraction of subclaims removed, averaged over entries.
+        """
+        correctly_retained = []
+        fraction_removed = []
+        for entry in data:
+            removal_count = 0
+            retained_cnt = 0
+            correctly_retained_count = 0
+            for subclaim in entry["subclaims"]:
+                # Find similarity score
+                score = subclaim["scores"][confidence_method]
+                noise = subclaim["scores"]["noise"]
+                if score + noise >= threshold:
+                    retained_cnt += 1
+                    if (
+                        subclaim.get("annotations", {}).get("gpt", "")
+                        in CORRECT_ANNOTATIONS
+                    ):
+                        correctly_retained_count += 1
+                else:
+                    removal_count += 1
+            total_subclaims = len(entry["subclaims"])
+            # Calculate fraction of removed subclaims
+            entry_removal_rate = (
+                0 if total_subclaims == 0 else removal_count / total_subclaims
+            )
+            fraction_removed.append(
+                entry_removal_rate
+            )  # e.g. fraction_removed = [0.2, 0.5, 0.6, 0.2, 0.7] - one element per data entry
+            # Calculate correctly retained rate
+            correctly_retained_percentage = (
+                correctly_retained_count / retained_cnt if retained_cnt > 0 else 1
+            )
+            correctly_retained.append(correctly_retained_percentage >= a)
+        return np.mean(correctly_retained), np.mean(fraction_removed)
+    def plot_factual_removal(
+        self, data, alphas, a, fig_filename, csv_filename, plot_group_results=False
+    ):
+        x_values = np.linspace(1 - alphas[-1] - 0.05, 1 - alphas[0] + 0.03, 100)
+        fig, ax = plt.subplots(figsize=(8, 6), dpi=800)
+        ax.plot(
+            x_values,
+            x_values,
+            "--",
+            color="gray",
+            linewidth=2,
+            label="Conformal guarantee lower bounds",
+        )
+        cache_filename = f"{os.path.splitext(os.path.abspath(csv_filename))[0]}_factual_correctness_cache.npy"
+        if not os.path.exists(cache_filename):
+            results = self.compute_factual_results(data, alphas, a)
+            print(f"Caching results to {cache_filename}")
+            np.save(cache_filename, results)
+        else:
+            print(f"Loading cached results from {cache_filename}")
+            results = np.load(cache_filename, allow_pickle=True).item()
+        for confidence_method, result in results.items():
+            conf_level, correctness, yerr = self.process_factual_correctness_results(
+                result
+            )
+            append_result_to_csv(
+                csv_filename=csv_filename,
+                label=f"{confidence_method}_factual_correctness",
+                y=correctness,
+                yerr=yerr,
+            )
+            print(
+                f"Producing factual removal plot for {confidence_method}: {fig_filename}"
+            )
+            ax = self.plot_factual_removal_rate_by_alpha(
+                conf_level, correctness, a, confidence_method, fig_filename, ax
+            )
+            print(f"Conformal plot saved to {fig_filename}")
+            if plot_group_results:
+                # self.plot_factual_group_results(results, csv_filename, x)
+                raise NotImplementedError("Not implemented")
+    def compute_factual_results(self, data, alphas, a):
+        results = {}
+        for confidence_method in self.confidence_method:
+            results[confidence_method] = {}
+            for alpha in tqdm(
+                alphas, desc=f"Computing factual results for {confidence_method}"
+            ):
+                thresholds = []
+                correctness = []
+                for _ in range(self.runs):
+                    random.shuffle(data)
+                    split_index = len(data) // 2
+                    calibration_data = data[:split_index]
+                    test_data = data[split_index:]
+                    assert (
+                        len(calibration_data) != 0
+                    ), "Calibration data should not be empty"
+                    assert len(test_data) != 0, "Test data should not be empty"
+                    threshold = compute_threshold(alpha, calibration_data, a, confidence_method)
+                    fraction_correct = self._evaluate_factual_correctness(
+                        test_data, threshold, a, confidence_method
+                    )
+                    thresholds.append(threshold)
+                    correctness.append(fraction_correct)
+                results[confidence_method][alpha] = {
+                    "threshold": thresholds,
+                    "correctness": correctness,
+                    "factuality": 1 - alpha,
+                }
+        return results
+    def process_factual_correctness_results(self, results: dict):
+        """
+        x: list of target factuality levels (1 - alpha)
+        y: list of average factual correctness
+        yerr: list of 95% confidence half-widths (1.96 * standard error) of factual correctness
+        """
+        x, y, yerr = [], [], []
+        for alpha, results_for_alpha in results.items():
+            x.append(1 - alpha)
+            y.append(np.mean(results_for_alpha["correctness"]))
+            yerr.append(
+                (
+                    np.std(results_for_alpha["correctness"])
+                    * 1.96
+                    / np.sqrt(len(results_for_alpha["correctness"]))
+                )
+            )
+        return x, y, yerr
+    def plot_factual_removal_rate_by_alpha(
+        self, x, y, a, confidence_method, fig_filename, ax=None
+    ):
+        if not ax:
+            fig, ax = plt.subplots(figsize=(8, 6), dpi=800)
+        else:
+            fig = ax.figure  # Get the figure from the provided ax
+        ax.set_xlabel(f"Target factuality (1 - {chr(945)})", fontsize=16)
+        ax.set_ylabel("Empirical factuality", fontsize=16)
+        ax.set_title(
+            f"Factual correctness for {self.dataset_name} Datasets (a={a})", fontsize=20
+        )
+        # Plot the data
+        ax.plot(x, y, label=confidence_method, linewidth=2)
+        # Set legend
+        ax.legend(loc="upper left", bbox_to_anchor=(0.02, 0.98), fontsize=10)
+        # Save the figure
+        fig.savefig(fig_filename, bbox_inches="tight", dpi=800)
+        return ax  # Return the ax for further modifications if needed
+    def _evaluate_factual_correctness(
+        self, data: list, threshold: float, a: float, confidence_method: str
+    ):
+        """
+        Evaluates the factual correctness of subclaims within the provided data.
+        This function processes a list of data entries, each containing subclaims with similarity scores.
+        It calculates the percentage of correctly retained subclaims based on a given threshold and
+        compares it to a specified accuracy level `a`.
+        Args:
+            data (list): A list of dictionaries, where each dictionary represents an entry containing subclaims.
+            threshold (float): The similarity score threshold above which subclaims are considered retained.
+            a (float): The accuracy level to compare the correctly retained percentage against.
+        Returns:
+            float: The percentage of entries in the data that satisfy the correct level of accuracy `a`.
+        """
+        correctly_retained = []
+        # Process each item in the list
+        for entry in data:
+            # Extract subclaims from each item
+            retained_cnt = 0
+            correctly_retained_count = 0
+            for subclaim in entry["subclaims"]:
+                # Extract the score and noise
+                score = subclaim["scores"][confidence_method]
+                noise = subclaim["scores"]["noise"]
+                # Add the subclaim to the collection if similarity score is above threshold
+                if score + noise >= threshold:
+                    retained_cnt += 1
+                    if (
+                        subclaim.get("annotations", {}).get("gpt", "")
+                        in CORRECT_ANNOTATIONS
+                    ):
+                        correctly_retained_count += 1
+            # Calculate correctly retained rate
+            correctly_retained_percentage = (
+                correctly_retained_count / retained_cnt if retained_cnt > 0 else 1
+            )
+            correctly_retained.append(correctly_retained_percentage)
+        correctness_list = [
+            correctly_retained_percentage >= a
+            for correctly_retained_percentage in correctly_retained
+        ]
+        # percentage of test data satisfying correct level of a
+        return sum(correctness_list) / len(correctness_list)
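
To make the pass/fail rule in `_evaluate_factual_correctness` concrete, here is a small sketch on toy data. It assumes the entry layout visible in the code above (`subclaims`, `scores`, `annotations.gpt`). The helper names `entry_passes` and `make_sub`, the label `"N"` for an incorrect subclaim, and the toy scores are hypothetical.

```python
CORRECT = {"S"}  # assumption: mirrors CORRECT_ANNOTATIONS in conformal.py


def entry_passes(entry, threshold, a, method):
    """True if at least a fraction `a` of the retained subclaims is correct."""
    retained = [
        sub
        for sub in entry["subclaims"]
        if sub["scores"][method] + sub["scores"]["noise"] >= threshold
    ]
    if not retained:
        return True  # nothing retained counts as a pass, as in the source
    correct = sum(
        sub.get("annotations", {}).get("gpt", "") in CORRECT for sub in retained
    )
    return correct / len(retained) >= a


def make_sub(score, label):
    """Build one toy subclaim with zero noise (hypothetical helper)."""
    return {"scores": {"frequency": score, "noise": 0.0}, "annotations": {"gpt": label}}


toy_data = [
    {"subclaims": [make_sub(0.9, "S"), make_sub(0.2, "N")]},  # wrong claim is filtered out
    {"subclaims": [make_sub(0.8, "N"), make_sub(0.7, "S")]},  # wrong claim survives
    {"subclaims": [make_sub(0.1, "N")]},                      # everything is removed
]

passes = [entry_passes(e, threshold=0.5, a=1.0, method="frequency") for e in toy_data]
pass_rate = sum(passes) / len(passes)
print(passes, pass_rate)
```

The third entry shows a consequence of the rule: an entry whose subclaims are all removed still counts as factual, so a very high threshold can raise empirical factuality while removing most content. That trade-off is what the removal-rate plot tracks.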

src/calibration/utils.py ADDED

	@@ -0,0 +1,161 @@

+import json
+import csv
+import numpy as np
+from math import ceil
+from collections import defaultdict
+CORRECT_ANNOTATIONS = ["Y", "S"]
+def load_subclaim_data(file_path):
+    """Load calibration data from a JSON file"""
+    with open(file_path, "r", encoding="utf-8") as file:
+        return json.load(file)
+def append_result_to_csv(csv_filename, label, y, yerr):
+    """Append calibration results to CSV file"""
+    formatted_results = [f"{y:.4f} ± {yerr:.4f}" for y, yerr in zip(y, yerr)]
+    formatted_results.reverse()
+    row = [label] + formatted_results
+    with open(csv_filename, mode="a", newline="") as file:
+        writer = csv.writer(file)
+        writer.writerow(row)
+def _get_accepted_subclaims(entry, threshold, confidence_method):
+    """Helper function to get accepted subclaims based on threshold"""
+    return [
+        subclaim
+        for subclaim in entry["subclaims"]
+        if subclaim["scores"][confidence_method] + subclaim["scores"]["noise"]
+        >= threshold
+    ]
+def _calculate_entailed_fraction(subclaims):
+    """Helper function to calculate fraction of entailed/correct subclaims"""
+    if not subclaims:
+        return 1.0
+    return np.mean(
+        [
+            subclaim["annotations"]["gpt"] in CORRECT_ANNOTATIONS
+            for subclaim in subclaims
+        ]
+    )
+def get_r_score(entry: dict, confidence_method: str, a: float):
+    """
+    Compute the r_a score for a single data entry when confidence_method is used as the sub-claim scoring function.
+    Scanning candidate thresholds from the highest score downward, this function returns the first
+    (i.e. highest) threshold at which the fraction of correct subclaims among those retained
+    falls below the required level 'a'. The r_a score is thus the confidence score
+    at which the entry's retained output stops meeting the acceptable level.
+    The algorithm works by:
+    1. First checking if the score was already calculated and cached
+    2. Sorting all subclaim scores in descending order
+    3. Testing each score as a potential threshold
+    4. For each threshold, accepting only subclaims with scores >= threshold
+    5. Calculating the fraction of correct subclaims among the accepted ones
+    6. Returning the first threshold where this fraction falls below 'a'
+    7. Returning -1 if every candidate threshold keeps the fraction at or above 'a'
+    Args:
+        entry: Dictionary containing claims data
+        confidence_method: Method used for scoring subclaims
+        a: Required fraction correct threshold
+    Returns:
+        float: r_a score for the entry
+    """
+    r_score_key = f"r_score_{a}_{confidence_method}"
+    if r_score_key in entry:
+        return entry[r_score_key]
+    # Not cached yet: compute the r_score and store it on the entry for reuse
+    scores = [
+        subclaim["scores"][confidence_method] + subclaim["scores"]["noise"]
+        for subclaim in entry["subclaims"]
+    ]
+    threshold_set = sorted(scores, reverse=True)
+    for threshold in threshold_set:
+        accepted_subclaims = _get_accepted_subclaims(
+            entry, threshold, confidence_method
+        )
+        entailed_fraction = _calculate_entailed_fraction(accepted_subclaims)
+        if entailed_fraction < a:
+            entry[r_score_key] = threshold
+            return threshold
+    entry[r_score_key] = -1
+    return -1
+def compute_threshold(alpha, calibration_data, a, confidence_method):
+    """
+    Computes the quantile/threshold from conformal prediction.
+    # alpha: float in (0, 1)
+    # calibration_data: calibration data
+    # a: as in paper, required fraction correct, section 4.1
+    # confidence_method: string
+    """
+    # Compute r score for each example.
+    r_scores = [get_r_score(entry, confidence_method, a) for entry in calibration_data]
+    # Compute the threshold for conformal prediction. The quantile level is ceil((n+1)*(1-alpha))/n;
+    # we map it to a list index by dropping the division by n and subtracting one (zero-based indexing).
+    quantile_target_index = min(ceil((len(r_scores) + 1) * (1 - alpha)), len(r_scores))
+    threshold = sorted(r_scores)[quantile_target_index - 1]
+    return threshold
+# Split each group separately so the calibrate_range ratio holds within every group, not just overall.
+# Calibration data is returned as a dict keyed by group name; test data is returned as a flat list.
+def split_group(data, calibrate_range=0.5):
+    group_data = defaultdict(list)
+    calibration_data = defaultdict(list)
+    test_data = []
+    for entry in data:
+        group = entry["groups"][0]  # Use first group as default
+        group_data[group].append(entry)
+    for group, group_entries in group_data.items():
+        split_index = ceil(len(group_entries) * calibrate_range)
+        calibration_data[group].extend(group_entries[:split_index])
+        test_data.extend(group_entries[split_index:])
+    return calibration_data, test_data
+# Analyze Functions #
+def percentage_highest_not_S(data, key="relavance"):
+    count_total = 0
+    count_not_S = 0
+    for item in data:
+        subclaims = item.get("subclaims", [])
+        if not subclaims:
+            continue
+        # Sort subclaims by (score[key] + score[noise]), descending
+        subclaims_sorted = sorted(
+            subclaims,
+            key=lambda sc: sc["scores"].get(key, 0) + sc["scores"].get("noise", 0),
+            reverse=True
+        )
+        top_annotation = subclaims_sorted[0].get("annotations", {}).get("gpt", None)
+        count_total += 1
+        if top_annotation != "S":
+            count_not_S += 1
+    if count_total == 0:
+        return 0.0  # Avoid division by zero
+    return (count_not_S / count_total) * 100
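
The index arithmetic in `compute_threshold` is easy to misread, so here is a worked sketch of the same quantile rule on a toy list of r-scores. The function name `conformal_threshold` and the sample values are hypothetical. The rank formula and the cap at `n` follow the code above.

```python
from math import ceil


def conformal_threshold(r_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest r-score, capped at the largest."""
    n = len(r_scores)
    if n == 0:
        raise ValueError("r_scores must not be empty")
    # 1-based rank of the order statistic; capping at n keeps a small alpha in range.
    rank = min(ceil((n + 1) * (1 - alpha)), n)
    return sorted(r_scores)[rank - 1]


# Nine toy r-scores; -1 marks an entry that meets level `a` at every threshold.
toy_r_scores = [0.9, 0.1, 0.4, -1, 0.7, 0.3, 0.5, 0.2, 0.8]

# n = 9, alpha = 0.25: rank = ceil(10 * 0.75) = 8, i.e. the 8th smallest value.
print(conformal_threshold(toy_r_scores, alpha=0.25))
# alpha = 0.0: rank would be 10, which is capped at n = 9 (the maximum).
print(conformal_threshold(toy_r_scores, alpha=0.0))
```

The cap means that for very small `alpha` the rule falls back to the largest calibration r-score rather than an unbounded threshold, so the nominal coverage level may not be attainable with few calibration entries.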

src/common/__init__.py ADDED

File without changes

src/common/chunker.py ADDED

	@@ -0,0 +1,41 @@

+from abc import ABC, abstractmethod
+from typing import List, Any
+class BaseChunker(ABC):
+    """
+    Abstract base class for all chunking strategies.
+    """
+    def __init__(self, document, chunk_size: int, overlap_size: int = 0):
+        self.document = document
+        self.chunk_size = chunk_size
+        self.overlap_size = overlap_size
+    @abstractmethod
+    def create_chunks(self) -> list[dict[str, Any]]:
+        """
+        Abstract method to be implemented by subclasses for chunking text.
+        """
+        pass
+class FixedLengthChunker(BaseChunker):
+    """
+    Chunker that splits text into overlapping fixed-size chunks of words.
+    """
+    def create_chunks(self) -> tuple[list[str], int]:
+        chunks: list[str] = []
+        text = self.document
+        words = text.split()
+        start = 0
+        chunk_num = 0
+        while start < len(words):
+            end = start + self.chunk_size
+            chunks.append(" ".join(words[start:end]))
+            start += self.chunk_size - self.overlap_size
+            chunk_num += 1
+        return chunks, len(words)
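
A standalone sketch of the overlapping word-window chunking used by `FixedLengthChunker` may help when checking chunk boundaries. The function `chunk_words` is hypothetical, and the argument validation is an addition that is not part of the class above: without it, an `overlap_size` equal to or larger than `chunk_size` gives a stride of zero or less, so the loop never advances.

```python
def chunk_words(text, chunk_size, overlap_size=0):
    """Split text into word windows of chunk_size that overlap by overlap_size words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_size < chunk_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < chunk_size")

    words = text.split()
    stride = chunk_size - overlap_size  # how far each window start advances
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), stride)
    ]


chunks = chunk_words("a b c d e f g", chunk_size=3, overlap_size=1)
print(chunks)
```

With seven words, a window of three, and an overlap of one, the window starts are 0, 2, 4, and 6, so the last chunk holds only the final word. Callers that want to avoid short trailing chunks would need to merge or drop it themselves.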

src/common/config_manager.py ADDED

	@@ -0,0 +1,171 @@

+import os
+import yaml
+import logging
+import datetime
+import json
+import shutil
+from pathlib import Path
+class ConfigManager:
+    """Utility class to manage configuration loading, saving and logging"""
+    def __init__(self, config_path=None, path_config_path=None, dataset_config_path=None, run_id=None):
+        """
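+        Initialize the ConfigManager from optional YAML config files
+        Args:
+            config_path (str): Path to the main YAML config file
+            path_config_path (str): Optional path to the YAML file with path settings
+            dataset_config_path (str): Optional path to the YAML file with dataset settings
+            run_id (str): Optional identifier for the run; defaults to a timestamp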
+        """
+        self.config = {}
+        self.run_id = run_id or datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
+        self.log_dir = f"logs/{self.run_id}"
+        if config_path:
+            self.config = self.load_config(config_path)
+        if path_config_path:
+            self.path_config = self.load_config(path_config_path)
+        if dataset_config_path:
+            self.dataset_config = self.load_config(dataset_config_path)
+    def load_config(self, config_path):
+        """
+        Load configuration from a YAML file
+        Args:
+            config_path (str): Path to the YAML config file
+        Returns:
+            dict: The loaded configuration
+        """
+        with open(config_path, 'r') as f:
+            return self._expand_env_vars(yaml.safe_load(f))
+    def _expand_env_vars(self, obj):
+        """Recursively expand environment variables in string values."""
+        if isinstance(obj, dict):
+            return {k: self._expand_env_vars(v) for k, v in obj.items()}
+        if isinstance(obj, list):
+            return [self._expand_env_vars(v) for v in obj]
+        if isinstance(obj, str):
+            return os.path.expandvars(obj)
+        return obj
+    def save_config(self, output_path=None):
+        """
+        Save the current configuration to a YAML file
+        Args:
+            output_path (str): Path to save the config file, defaults to log directory
+        Returns:
+            str: Path to the saved config file
+        """
+        if output_path is None:
+            os.makedirs(self.log_dir, exist_ok=True)
+            output_path = os.path.join(self.log_dir, f"config_{self.run_id}.yaml")
+        # Ensure directory exists (dirname is empty when output_path is a bare filename)
+        output_dir = os.path.dirname(output_path)
+        if output_dir:
+            os.makedirs(output_dir, exist_ok=True)
+        with open(output_path, 'w') as f:
+            yaml.dump(self.config, f, default_flow_style=False)
+        return output_path
+    def setup_logging(self, log_level=logging.INFO):
+        """
+        Setup logging configuration
+        Args:
+            log_level: Logging level
+        Returns:
+            tuple: (path to the log file, run ID)
+        """
+        os.makedirs(self.log_dir, exist_ok=True)
+        log_file = os.path.join(self.log_dir, f"run_{self.run_id}.log")
+        logging.basicConfig(
+            level=log_level,
+            format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
+            handlers=[
+                logging.FileHandler(log_file),
+                logging.StreamHandler()
+            ]
+        )
+        # Disable httpx logs
+        logging.getLogger("httpx").setLevel(logging.WARNING)
+        # Log some initial information
+        logging.info(f"Starting run with ID: {self.run_id}")
+        logging.info(f"Log file: {log_file}")
+        return log_file, self.run_id
+    def log_config(self):
+        """Log the important parts of the configuration"""
+        if not self.config:
+            logging.warning("No configuration loaded to log")
+            return
+        logging.info("=== Run Configuration ===")
+        # Log dataset info
+        if 'dataset' in self.config:
+            logging.info(f"Dataset: {self.config['dataset']['name']}")
+            logging.info(f"Query size: {self.config['dataset']['query_size']}")
+        # Log index info
+        if 'index' in self.config:
+            logging.info(f"Embedding model: {self.config['index']['embedding_model']}")
+            logging.info(f"Delete existing index: {self.config['index']['delete_existing']}")
+        logging.info("========================")
+    def update_config(self, updates):
+        """
+        Update the configuration with new values
+        Args:
+            updates (dict): Dictionary containing updates to apply
+        Returns:
+            dict: The updated configuration
+        """
+        # Simple implementation: dict values are merged one level deep; any other value is overwritten
+        for key, value in updates.items():
+            if isinstance(value, dict) and key in self.config and isinstance(self.config[key], dict):
+                self.config[key].update(value)
+            else:
+                self.config[key] = value
+        return self.config
+    def copy_run_artifacts(self, result_dir):
+        """
+        Copy config and logs to a results directory for reproducibility
+        Args:
+            result_dir (str): Path to the results directory
+        Returns:
+            str: Path to the result run directory
+        """
+        result_run_dir = os.path.join(result_dir, "config")
+        os.makedirs(result_run_dir, exist_ok=True)
+        # Get the latest config file (copying the log file is currently disabled)
+        config_files = sorted(Path(self.log_dir).glob("config_*.yaml"))
+        # log_files = sorted(Path(self.log_dir).glob("run_*.log"))
+        if config_files:
+            latest_config = str(config_files[-1])
+            shutil.copy2(latest_config, os.path.join(result_run_dir, "config.yaml"))
+        # if log_files:
+        #     latest_log = str(log_files[-1])
+        #     shutil.copy2(latest_log, os.path.join(result_run_dir, "run.log"))
+        return result_run_dir
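
The merge rule in `update_config` can be illustrated in isolation. The following is a minimal sketch, not the repository's code: `merge_one_level` is a hypothetical helper, the sample values are made up, and unlike the method above it works on a copy instead of mutating the config in place.

```python
import copy


def merge_one_level(config, updates):
    """Hypothetical helper mirroring the one-level merge rule of update_config."""
    merged = copy.deepcopy(config)  # assumption: leave the caller's dict untouched
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: update the nested dict key by key
            merged[key].update(value)
        else:
            # Anything else replaces the existing entry outright
            merged[key] = value
    return merged


# Illustrative values only
config = {
    "dataset": {"name": "PopQA", "query_size": 100},
    "index": {"embedding_model": "text-embedding-3-large", "delete_existing": False},
}
updates = {"dataset": {"query_size": 10}, "index": {"delete_existing": True}}
merged = merge_one_level(config, updates)
print(merged["dataset"])
```

Keys that are absent from `updates`, such as `dataset.name` here, survive the merge. Dicts nested more than one level deep would be replaced wholesale by `dict.update`, which is the limitation the comment in `update_config` refers to.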

src/common/faiss_manager.py ADDED

	@@ -0,0 +1,309 @@

+import os
+import json
+import re
+import ast
+import faiss
+from typing import Union, Optional
+from dotenv import load_dotenv
+import numpy as np
+from sklearn.preprocessing import normalize
+from src.common.file_manager import FileManager
+from src.common.llm.openai_manager import OpenAIManager
+class FAISSIndexManager:
+    def __init__(
+        self,
+        index_truncation_config,
+        dimension=3072,
+        index_path="index_store/index.faiss",
+        indice2fm_path="index_store/indice2fm.json",
+    ):
+        dotenv_path = os.path.join(os.getcwd(), ".env")
+        load_dotenv(dotenv_path)
+        self.openaiManager = OpenAIManager()
+        self.dimension = dimension
+        self.index = faiss.IndexFlatIP(dimension)
+        self.file_managers = []
+        # Maps each file_path to the list of FAISS row ids holding its chunks; ids are kept in ascending order
+        self.indice2fm = {}
+        self.index_path = index_path
+        self.indice2fm_path = indice2fm_path
+        # initialize index and indice2fm from saved files
+        if os.path.exists(index_path):
+            self.index = faiss.read_index(index_path)
+            print(f"Loaded FAISS index from {index_path}")
+        if os.path.exists(indice2fm_path):
+            with open(indice2fm_path, "r") as file:
+                self.indice2fm = json.load(file)
+            for file_path, _ in self.indice2fm.items():
+                self.file_managers.append(
+                    FileManager(
+                        file_path=file_path,
+                        index_truncation_config=index_truncation_config,
+                    )
+                )
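+    def is_indice_align(self):
+        """Check that the last FAISS row id equals the largest id recorded in indice2fm."""
+        mapped = [max(values) for values in self.indice2fm.values() if values]
+        # With nothing mapped, a bare max() would raise; the index is aligned only if it is empty too
+        return self.index.ntotal - 1 == max(mapped, default=-1)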
+    def save_index(self, index_path, indice2fm_path):
+        if self.index is not None:
+            # os.makedirs("") raises, so only create a directory when the path contains one
+            index_dir = os.path.dirname(index_path)
+            if index_dir:
+                os.makedirs(index_dir, exist_ok=True)
+            faiss.write_index(self.index, index_path)
+            # Also save the file_path -> indices mapping; self.indice2fm must be updated before calling this function
+            with open(indice2fm_path, mode="w") as file:
+                json.dump(self.indice2fm, file, indent=4)
+    def delete_index(self):
+        self.index.reset()
+        self.indice2fm = {}
+        if os.path.exists(self.index_path):
+            os.remove(self.index_path)
+        if os.path.exists(self.indice2fm_path):
+            os.remove(self.indice2fm_path)
+        print("FAISS index deleted.")
+    def upsert_file_to_faiss(
+        self,
+        file_manager,
+        model="text-embedding-3-large",
+        truncation_strategy: Optional[Union[str, bool]] = "fixed_length",
+        truncate_by: Optional[str] = "\n",
+    ):
+        if file_manager.file_path not in [fm.file_path for fm in self.file_managers]:
+            self.file_managers.append(file_manager)
+        else:
+            print(f"File '{file_manager.file_path}' already exists in the FAISS index.")
+            return
+        # Process the file if necessary
+        # TODO: check if file_manager.texts will in any case be empty, if not, remove the below block
+        if not file_manager.texts:
+            print("Processing documents...")
+            file_manager.process_document(
+                truncation_strategy=truncation_strategy, truncate_by=truncate_by
+            )
+            print("Documents processing done.")
+        # Generate embeddings and append to index if not already present
+        if file_manager.file_path not in self.indice2fm:
+            print("Creating embedding for the document...")
+            embeddings = self.openaiManager.create_openai_embeddings(
+                file_manager.texts, model=model
+            )
+            # Normalize embeddings
+            embeddings_np = self.normalize_embeddings(embeddings)
+            start_index = self.index.ntotal
+            # Add embeddings to FAISS index
+            self.index.add(embeddings_np)
+            end_index = self.index.ntotal
+            added_indices = list(range(start_index, end_index))
+            # Update the self.indice2fm dictionary
+            self.indice2fm[file_manager.file_path] = added_indices
+            self.save_index(
+                index_path=self.index_path, indice2fm_path=self.indice2fm_path
+            )
+            print(
+                f"Embeddings from file '{file_manager.file_path}' added to FAISS index at indices [{start_index}, {end_index})."
+            )
+        else:
+            print(f"File '{file_manager.file_path}' already exists in the FAISS index.")
+    def normalize_embeddings(self, embeddings):
+        if np.isnan(embeddings).any() or np.isinf(embeddings).any():
+            raise ValueError("Embeddings contain NaNs or Infs.")
+        embeddings_np = np.array(embeddings).astype("float32")
+        # faiss.normalize_L2 hit a segmentation fault on some HotpotQA edge cases, so scikit-learn's normalize is used instead
+        # faiss.normalize_L2(embeddings_np)
+        embeddings_normalized = normalize(embeddings_np, norm='l2', axis=1)
+        return embeddings_normalized
+    def search_faiss_index(
+        self,
+        query,
+        top_k=10,
+        threshold=0.5,
+        truncation_strategy: Optional[Union[str, bool]] = "fixed_length",
+        truncate_by: Optional[str] = "\n",
+    ):
+        if self.index.ntotal == 0:
+            return []
+        # Create a normalized embedding for the query
+        query_embedding = self.normalize_embeddings(
+            [
+                self.openaiManager.client.embeddings.create(
+                    input=[query], model="text-embedding-3-large"
+                )
+                .data[0]
+                .embedding
+            ]
+        )[0].reshape(1, -1)
+        # Perform the search
+        similarity, indices = self.index.search(query_embedding, top_k)
+        filtered_results = [
+            (idx, similar)
+            for idx, similar in zip(indices[0], similarity[0])
+            if similar >= threshold
+        ]
+        results = []
+        # Reverse map indices to file paths and text
+        for idx, dist in filtered_results:
+            file_path_found = None
+            relative_idx = None
+            # Find the file_path and relative index using self.indice2fm
+            for file_path, indice_list in self.indice2fm.items():
+                if idx in indice_list:
+                    file_path_found = file_path
+                    relative_idx = indice_list.index(idx)
+                    break
+            if file_path_found is not None and relative_idx is not None:
+                # Find the corresponding file_manager
+                file_manager = next(
+                    (
+                        fm
+                        for fm in self.file_managers
+                        if fm.file_path == file_path_found
+                    ),
+                    None,
+                )
+                if file_manager:
+                    # Process the file if necessary
+                    file_manager.process_document(
+                        truncation_strategy=truncation_strategy, truncate_by=truncate_by
+                    )
+                    try:
+                        # Get the text from the file_manager
+                        text = file_manager.texts[relative_idx][
+                            1
+                        ]  # Assuming (index, text) tuples in file_manager.texts
+                        results.append(
+                            f"{text} indice={idx} fileposition={relative_idx} score={dist:.4f}"
+                            # TODO reformat this
+                            # {
+                            #     "text": text,
+                            #     "indice": idx,
+                            #     "fileposition": relative_idx,
+                            #     "score": round(dist, 4),
+                            # }
+                        )
+                    except (IndexError, KeyError, TypeError):
+                        print(
+                            f"Error while retrieving id={relative_idx} from file manager. Skipping id={relative_idx}."
+                        )
+                else:
+                    results.append(
+                        f"File manager not found for '{file_path_found}' score={dist:.4f}"
+                    )
+            else:
+                # TODO reformat this
+                results.append(f"Index not mapped, score={dist:.4f}")
+        return results
+    def parse_result(self, result):
+        """
+        Parse the result from the search and return the page content, metadata, indice, and score.
+        """
+        # Parse the input
+        parsed_item = None
+        pattern = re.compile(
+            r"page_content='(.*?)'\smetadata=(\{.*?\})\sindice=(\d+)\sfileposition=(\d+)\sscore=([\d.]+)",
+            re.DOTALL,
+        )
+        matches = pattern.findall(result)
+        # Assumes a single matching row is passed in per call; if several rows match, only the last one is kept
+        for match in matches:
+            page_content, metadata, indice, fileposition, score = match
+            # Convert metadata string to a dictionary
+            metadata_dict = ast.literal_eval(metadata)
+            parsed_item = {
+                "page_content": page_content.strip(),
+                "metadata": metadata_dict,
+                "indice": int(indice),
+                "fileposition": int(fileposition),
+                "score": float(score),
+            }
+        return parsed_item
+    def generate_response_from_context(self, query, retrieved_docs, model="gpt-4o"):
+        if not retrieved_docs:
+            return "No relevant documents found in the FAISS index."
+        # Process retrieved documents into a clean context
+        formatted_docs = []
+        for doc in retrieved_docs:
+            try:
+                # Split the document string into page_content and metadata
+                doc_parts = doc.split("metadata=")
+                page_content = doc_parts[0].replace("page_content=", "").strip()
+                metadata = (
+                    doc_parts[1].strip() if len(doc_parts) > 1 else "Unknown source"
+                )
+                # Format each document clearly
+                formatted_doc = f"Content: {page_content}\nSource: {metadata}"
+                formatted_docs.append(formatted_doc)
+            except Exception as e:
+                formatted_docs.append(f"Error processing document: {e}")
+        # Combine the formatted documents into a single context
+        context = "\n\n---\n\n".join(formatted_docs)
+        # Construct the prompt for the OpenAI API
+        messages = [
+            {
+                "role": "system",
+                "content": "You are a helpful assistant that answers questions based on provided context.",
+            },
+            {"role": "user", "content": query},
+            {
+                "role": "assistant",
+                "content": f"The following context was retrieved from the database:\n\n{context}",
+            },
+        ]
+        # Generate response using OpenAI Chat API
+        response = self.openaiManager.client.chat.completions.create(
+            model=model, messages=messages, max_tokens=4096, temperature=0.7
+        )
+        return response.choices[0].message.content
+def main():
+    # Example usage. The chunking values are placeholders; real runs read them from the project config.
+    index_truncation_config = {"chunk_size": 2000, "chunk_overlap": 25}
+    manager = FAISSIndexManager(index_truncation_config, dimension=3072)
+    for file_name in ("2024_Corrective_RAGv2.pdf", "2023_Iterative_RGen.pdf"):
+        file_path = os.path.join(os.getcwd(), "documents", file_name)
+        file_manager = FileManager(file_path, index_truncation_config)
+        if not file_manager.texts:
+            # process_document() expects JSON input, so chunk the PDF explicitly first
+            file_manager.process_pdf()
+        manager.upsert_file_to_faiss(file_manager)
+    query = "tell me about corrective rag system."
+    retrieved_docs = manager.search_faiss_index(query, top_k=10, threshold=0.1)
+    print(retrieved_docs)
+    response = manager.generate_response_from_context(query, retrieved_docs)
+    print(response)
+if __name__ == "__main__":
+    print("Running faiss_manager.py")
+    main()
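
The strings returned by `search_faiss_index` embed the `str()` form of a LangChain `Document` followed by `indice=`, `fileposition=` and `score=` fields, which is what `parse_result` picks apart. The sketch below reuses the same regular expression on a made-up result string to show the shape of the parsed output. The sample text and the standalone function are illustrative assumptions, and this variant returns the first match, whereas the method above keeps the last.

```python
import ast
import re

# Same pattern as in parse_result above
RESULT_PATTERN = re.compile(
    r"page_content='(.*?)'\smetadata=(\{.*?\})\sindice=(\d+)\sfileposition=(\d+)\sscore=([\d.]+)",
    re.DOTALL,
)


def parse_result(result):
    """Return a dict for the first matching row, or None if nothing matches."""
    match = RESULT_PATTERN.search(result)
    if match is None:
        return None
    page_content, metadata, indice, fileposition, score = match.groups()
    return {
        "page_content": page_content.strip(),
        "metadata": ast.literal_eval(metadata),  # metadata is a Python dict literal
        "indice": int(indice),
        "fileposition": int(fileposition),
        "score": float(score),
    }


# Made-up example of one search result string
sample = (
    "page_content='Paris: Paris is the capital of France.' "
    "metadata={'source': 'Paris', 'file_path': 'docs/wiki.json'} "
    "indice=3 fileposition=1 score=0.8123"
)
parsed = parse_result(sample)
print(parsed)
```

Fallback strings such as `Index not mapped, score=...` do not match the pattern and yield `None`, so callers need to handle that case.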

src/common/file_manager.py ADDED

	@@ -0,0 +1,128 @@

+import os
+import json
+from typing import Union, Optional
+from PyPDF2 import PdfReader
+from langchain_core.documents import Document
+from langchain_text_splitters import RecursiveCharacterTextSplitter
+from src.common.chunker import FixedLengthChunker
+class FileManager:
+    def __init__(self, file_path: str, index_truncation_config: dict):
+        self.file_path = file_path
+        self.chunk_size = index_truncation_config["chunk_size"]
+        self.chunk_overlap = index_truncation_config["chunk_overlap"]
+        self.texts = []
+        directory = os.path.dirname(file_path)
+        base_name = os.path.splitext(os.path.basename(file_path))[0]
+        self.texts_file = os.path.join(directory, f"{base_name}_texts.json")
+        self.text_splitter = RecursiveCharacterTextSplitter(
+            chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap
+        )  # TODO
+        # Load texts from file if it exists
+        if os.path.exists(self.texts_file):
+            with open(self.texts_file, "r", encoding="utf-8-sig") as f:
+                self.texts = json.load(f)
+            print(f"Loaded texts from file: {self.texts_file}")
+    def load_pdf_document(self):
+        pdf_reader = PdfReader(self.file_path)
+        documents = []
+        for page_num, page in enumerate(pdf_reader.pages):
+            page_text = page.extract_text()
+            if page_text:  # Ensure the page has text
+                document = Document(
+                    metadata={"source": self.file_path, "page": page_num},
+                    page_content=page_text,
+                )
+                documents.append(document)
+        return documents
+    def dump_documents(self, texts):
+        if not texts:
+            raise ValueError(f"No texts to save for {self.file_path}.")
+        if os.path.exists(self.texts_file):
+            raise FileExistsError(
+                f"File {self.texts_file} already exists. Please remove it before saving."
+            )
+        with open(self.texts_file, "w", encoding="utf-8") as f:
+            json.dump(texts, f)
+        print(f"Associated texts saved to file: {self.texts_file}")
+    def process_pdf(self):
+        data = self.load_pdf_document()
+        documents = self.text_splitter.split_documents(data)
+        self.texts = [(i, str(doc)) for i, doc in enumerate(documents)]
+        self.dump_documents(self.texts)
+    def process_document(
+        self,
+        truncation_strategy: Optional[Union[str, bool]] = "fixed_length",
+        chunk_size: int = 2000,
+        overlap_size: int = 25,
+        truncate_by: Optional[str] = "\n",
+    ):
+        """
+        Process the document according to the specified strategy.
+        At least one of truncation_strategy or truncate_by must be provided.
+        truncate_by is used only when truncation_strategy is neither "fixed_length" nor "recursive".
+        """
+        if truncation_strategy is None and truncate_by is None:
+            raise ValueError(
+                "Either truncation_strategy or truncate_by must be provided"
+            )
+        if self.texts:
+            return
+        chunks = []
+        with open(self.file_path, "r", encoding="utf-8") as f:
+            data = json.load(f)
+        for title, texts in data.items():
+            if not truncation_strategy and not truncate_by:
+                chunks.append(self.create_document(title, texts, self.file_path))
+                print(f"{title} - No text splitting. Chunk size: {len(texts)}")
+            elif truncation_strategy == "fixed_length":
+                chunk_list = []
+                total_word_cnt = 0
+                # Accept a single string as well as a list of strings
+                for text in [texts] if isinstance(texts, str) else texts:
+                    fixed_length_chunks, texts_word_cnt = FixedLengthChunker(
+                        text, chunk_size, overlap_size
+                    ).create_chunks()
+                    chunk_list.extend(fixed_length_chunks)
+                    total_word_cnt += texts_word_cnt
+                print(
+                    f"Document '{title}' was split into {len(chunk_list)} chunk(s) of up to {chunk_size} words. Initial text size: {total_word_cnt}."
+                )
+                for text in chunk_list:
+                    if text.strip():
+                        chunks.append(self.create_document(title, text, self.file_path))
+            elif truncation_strategy == "recursive":
+                raise NotImplementedError(
+                    "Recursive truncation is currently not supported"
+                )
+            else:
+                # Split on the truncate_by separator; lists are assumed to be pre-split
+                if isinstance(texts, str):
+                    split_texts = texts.split(truncate_by) if truncate_by else [texts]
+                elif isinstance(texts, list):
+                    split_texts = texts
+                else:
+                    raise TypeError(f"Unsupported text type for '{title}': {type(texts).__name__}")
+                for text in split_texts:
+                    if text.strip():
+                        chunks.append(self.create_document(title, text, self.file_path))
+        self.texts = [(i, str(doc)) for i, doc in enumerate(chunks)]
+        self.dump_documents(self.texts)
+    def create_document(self, title, text, file_path):
+        """Create a document with the given title and text."""
+        return Document(
+            page_content=f"{title}: {text}",
+            metadata={"source": title, "file_path": file_path},
+        )
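
`FixedLengthChunker` lives in `src/common/chunker.py`, which is not shown in this part of the diff. As a rough illustration of what word-based fixed-length chunking with overlap typically looks like, here is a minimal sketch. `chunk_words` is a hypothetical function and is not the repository's implementation; it only mirrors the `(chunks, word_count)` return shape that `process_document` unpacks.

```python
def chunk_words(text, chunk_size, overlap_size):
    """Hypothetical sketch: split text into chunks of at most chunk_size words,
    where consecutive chunks share overlap_size words."""
    if chunk_size <= 0 or not 0 <= overlap_size < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap_size < chunk_size")
    words = text.split()
    step = chunk_size - overlap_size  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks, len(words)


chunks, word_count = chunk_words("a b c d e f g h i j", chunk_size=4, overlap_size=1)
print(chunks, word_count)
```

The early `break` avoids emitting a trailing chunk that is fully contained in the previous one.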

src/common/llm/__init__.py ADDED

File without changes

src/common/llm/llm_agent.py ADDED

	@@ -0,0 +1,14 @@

+from abc import ABC, abstractmethod
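+class LLMAgent(ABC):
+    """Abstract interface for LLM agents: preprocess a query, answer it, and postprocess the response."""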
+    @abstractmethod
+    def answer(self, question) -> str:
+        pass
+    @abstractmethod
+    def preProcess(self, query):
+        pass
+    @abstractmethod
+    def postProcess(self, response):
+        pass
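
To show how the abstract interface is meant to be filled in, here is a minimal sketch of a concrete subclass. `EchoAgent` is hypothetical and calls no model. The `LLMAgent` base class is repeated only so the snippet runs on its own, and the order in which `answer` calls the two hooks is an assumption.

```python
from abc import ABC, abstractmethod


class LLMAgent(ABC):
    # Repeated from the diff above so the example is self-contained
    @abstractmethod
    def answer(self, question) -> str:
        pass

    @abstractmethod
    def preProcess(self, query):
        pass

    @abstractmethod
    def postProcess(self, response):
        pass


class EchoAgent(LLMAgent):
    """Hypothetical agent that echoes the question instead of calling an LLM."""

    def preProcess(self, query):
        # Normalize surrounding whitespace before building the prompt
        return query.strip()

    def postProcess(self, response):
        return response.strip()

    def answer(self, question) -> str:
        # Assumed flow: preprocess, produce a raw response, postprocess
        prompt = self.preProcess(question)
        raw_response = f"You asked: {prompt} "
        return self.postProcess(raw_response)


reply = EchoAgent().answer("  What is FAISS? ")
print(reply)
```

Because all three methods are marked `@abstractmethod`, instantiating `LLMAgent` directly, or a subclass that omits any of them, raises `TypeError`.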