First_agent_template

mathidot committed 8 days ago

Commit

4a8fc49

1 Parent(s): 884eda5

Strengthen PDF extraction and add a RAG evaluation module

Files changed (14)

eval/README.md +182 -0
eval/__init__.py +1 -0
eval/data/hf/open_ragbench/README.md +185 -0
eval/rag_eval.py +630 -0
eval/reports/beir_fiqa_retrieval_eval.md +124 -0
eval/run_eval_suite.py +173 -0
hf_cache/sentence_transformers/models--BAAI--bge-small-en-v1.5/snapshots/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/README.md +1 -0
load_docs.py +0 -216
pyproject.toml +1 -0
rag_pdf_optimization_notes.md +282 -0
requirements.txt +1 -0
tools/query_knowledge.py +1196 -0
tools/todo.md +5 -0
uv.lock +18 -0

eval/README.md ADDED

	@@ -0,0 +1,182 @@

+# RAG Evaluation Module
+This folder contains a lightweight retrieval-evaluation harness for the project.
+## Supported Steps
+1. `beir/scifact`
+2. `beir/fiqa`
+3. `open-ragbench`
+4. `t2-ragbench`
+5. `local-options`
+Each run builds a temporary Chroma index under `eval/indexes/` and writes reports under `eval/reports/`.
+## Smoke Tests
+```bash
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset beir/scifact --max-corpus-docs 200 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset beir/fiqa --max-corpus-docs 500 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset open-ragbench --max-corpus-docs 50 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset t2-ragbench --max-corpus-docs 50 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset local-options --max-queries 3 --rebuild
+```
+## Run The Whole Suite
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --rebuild
+```
+By default, the suite runs:
+- `beir/scifact`
+- `beir/fiqa`
+- `open-ragbench`
+- `local-options`
+Useful options:
+```bash
+# Accurate run after changing PDF parsing, chunking, embedding, retrieval code, or sampling parameters.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --rebuild
+# Faster run that reuses existing indexes.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite
+# Run only selected datasets.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --datasets local-options,beir/fiqa
+# Override shared parameters for all selected datasets.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --top-k 10 --max-queries 20 --max-corpus-docs 1000
+# Save a stable suite-level report name.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --output-name latest_rag_eval
+```
+The suite writes per-dataset reports and one aggregate report under `eval/reports/`.
+## Common Commands
+Run the fastest local check while developing PDF parsing or chunking:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --max-queries 3 \
+  --top-k 5 \
+  --rebuild
+```
+Run only the standard public retrieval smoke tests:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/scifact,beir/fiqa \
+  --rebuild
+```
+Run the financial benchmark only:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/fiqa \
+  --max-corpus-docs 1000 \
+  --max-queries 50 \
+  --top-k 5 \
+  --rebuild
+```
+Run the PDF-like benchmark only:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets open-ragbench \
+  --max-corpus-docs 100 \
+  --max-queries 20 \
+  --top-k 5 \
+  --rebuild
+```
+Compare different `top-k` values:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --top-k 3 \
+  --output-name local_options_top3 \
+  --rebuild
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --top-k 10 \
+  --output-name local_options_top10 \
+  --rebuild
+```
+Compare different chunk settings:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --chunk-size 384 \
+  --chunk-overlap 64 \
+  --output-name local_options_chunk384 \
+  --rebuild
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --chunk-size 768 \
+  --chunk-overlap 128 \
+  --output-name local_options_chunk768 \
+  --rebuild
+```
+Run a larger, slower evaluation before reporting results:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/scifact,beir/fiqa,open-ragbench,local-options \
+  --max-corpus-docs 2000 \
+  --max-queries 100 \
+  --top-k 5 \
+  --output-name full_rag_eval \
+  --rebuild
+```
+Stop immediately when one dataset fails:
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/scifact,beir/fiqa,open-ragbench,local-options \
+  --fail-fast \
+  --rebuild
+```
+Run a single dataset directly without the suite wrapper:
+```bash
+uv --cache-dir .uv-cache run python -m eval.rag_eval \
+  --dataset local-options \
+  --max-queries 3 \
+  --top-k 5 \
+  --rebuild
+```
+## Suggested Workflow
+1. During development, run `local-options` with a small query count.
+2. After changing PDF extraction, chunking, embeddings, or retrieval code, add `--rebuild`.
+3. Before comparing two versions, use the same `--datasets`, `--max-queries`, `--max-corpus-docs`, `--top-k`, `--chunk-size`, and `--chunk-overlap`.
+4. Use `--output-name` to save stable report names for before/after comparison.
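
Steps 3 and 4 amount to holding every shared flag fixed and varying only the setting under test. A minimal Python sketch of that idea follows. `build_suite_command` is a hypothetical helper, not part of the `eval` package; the flags it emits are the ones documented in this README, and the default of `--top-k 5` is an assumption made for the example.

```python
from __future__ import annotations


def build_suite_command(
    datasets: list[str],
    output_name: str,
    *,
    top_k: int = 5,
    max_queries: int | None = None,
    chunk_size: int | None = None,
    chunk_overlap: int | None = None,
) -> list[str]:
    """Assemble an argv list for eval.run_eval_suite (hypothetical helper)."""
    command = [
        "uv", "--cache-dir", ".uv-cache", "run",
        "python", "-m", "eval.run_eval_suite",
        "--datasets", ",".join(datasets),
        "--top-k", str(top_k),
        "--output-name", output_name,
        "--rebuild",  # always rebuild so both runs index with their own settings
    ]
    # Optional flags are appended only when a value is given.
    optional = {
        "--max-queries": max_queries,
        "--chunk-size": chunk_size,
        "--chunk-overlap": chunk_overlap,
    }
    for flag, value in optional.items():
        if value is not None:
            command += [flag, str(value)]
    return command


# Two runs that differ only in chunk settings and report name.
before = build_suite_command(
    ["local-options"], "local_options_chunk384",
    max_queries=3, chunk_size=384, chunk_overlap=64,
)
after = build_suite_command(
    ["local-options"], "local_options_chunk768",
    max_queries=3, chunk_size=768, chunk_overlap=128,
)
```

Each list can be passed to `subprocess.run(command, check=True)` from the project root. Building both commands from one function makes it harder to change a shared parameter by accident between the two runs.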
+## Metrics
+- `hit_at_1`
+- `hit_at_3`
+- `hit_at_5`
+- `hit_at_k`
+- `mrr`
+- `ndcg_at_k`
+The public benchmarks test whether the eval pipeline works on standard datasets. The `local-options` benchmark is the project-specific check for PDF parsing, formula extraction, and section-aware chunking.
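
The metric names above follow common retrieval-evaluation conventions. As a rough illustration only (the harness's own computation lives in `eval/rag_eval.py` and may differ in detail), the per-query quantities can be sketched in Python under the assumption of binary relevance and a ranked list of unique document IDs. The function names here are hypothetical.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant document is in the first k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # The ideal ranking places all relevant documents first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# One query whose only relevant document is retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
scores = {
    "hit_at_1": hit_at_k(ranked, relevant, 1),
    "hit_at_3": hit_at_k(ranked, relevant, 3),
    "rr": reciprocal_rank(ranked, relevant),
    "ndcg_at_3": ndcg_at_k(ranked, relevant, 3),
}
```

Under these definitions, `mrr` is the mean of `reciprocal_rank` over all queries, and each `hit_at_*` value is the mean of the per-query hits.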

eval/__init__.py ADDED

	@@ -0,0 +1 @@


+"""RAG evaluation helpers."""

eval/data/hf/open_ragbench/README.md ADDED

	@@ -0,0 +1,185 @@

+---
+license: cc-by-nc-4.0
+---
+# Open RAG Benchmark
+The Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. Unlike other datasets, Open RAG Benchmark emphasizes **pure PDF content**, meticulously extracting and generating queries on diverse modalities including **text, tables, and images**, even when they are intricately interwoven within a document.
+This dataset is purpose-built to power Vectara's [Open RAG Evaluation project](https://github.com/vectara/open-rag-eval), facilitating a holistic, end-to-end evaluation of RAG systems by offering:
+  - **Richer Multimodal Content:** A corpus derived exclusively from PDF documents, ensuring fidelity to real-world data and encompassing a wide spectrum of text, tabular, and visual information, often with intermodal crossovers.
+  - **Tailored for Open RAG Evaluation:** Designed to support the unique and comprehensive evaluation metrics adopted by the Open RAG Evaluation project, enabling a deeper understanding of RAG performance beyond traditional metrics.
+  - **High-Quality Retrieval Queries & Answers:** Each piece of extracted content is paired with expertly crafted retrieval queries and corresponding answers, optimized for robust RAG training and evaluation.
+  - **Diverse Knowledge Domains:** Content spanning various scientific and technical domains from arXiv, ensuring broad applicability and challenging RAG systems across different knowledge areas.
+The current draft version of the arXiv dataset, as the first step in this multimodal RAG dataset collection, includes:
+  - **Documents:** 1000 PDF papers evenly distributed across all arXiv categories.
+      - 400 positive documents (each serving as the golden document for some queries).
+      - 600 hard negative documents (completely irrelevant to all queries).
+  - **Multimodal Content:** Extracted text, tables, and images from research papers.
+  - **QA Pairs:** 3045 valid question-answer pairs.
+      - **Based on query types:**
+          - 1793 abstractive queries (requiring generating a summary or rephrased response using understanding and synthesis).
+          - 1252 extractive queries (seeking concise, fact-based answers directly extracted from a given text).
+      - **Based on generation sources:**
+          - 1914 text-only queries
+          - 763 text-image queries
+          - 148 text-table queries
+          - 220 text-table-image queries
+## Dataset Structure
+The dataset is organized similarly to the [BEIR dataset](https://github.com/beir-cellar/beir) format within the `official/pdf/arxiv/` directory.
+```
+official/
+└── pdf
+    └── arxiv
+        ├── answers.json
+        ├── corpus
+        │   ├── {PAPER_ID_1}.json
+        │   ├── {PAPER_ID_2}.json
+        │   └── ...
+        ├── pdf_urls.json
+        ├── qrels.json
+        └── queries.json
+```
+Each file's format is detailed below:
+### `pdf_urls.json`
+This file provides the original PDF links to the papers in this dataset for downloading purposes.
+```json
+{
+    "Paper ID": "Paper URL",
+    ...
+}
+```
+### `corpus/`
+This folder contains all processed papers in JSON format.
+```json
+{
+    "title": "Paper Title",
+    "sections": [
+        {
+            "text": "Section text content with placeholders for tables/images",
+            "tables": {"table_id1": "markdown_table_string", ...},
+            "images": {"image_id1": "base64_encoded_string", ...},
+        },
+        ...
+    ],
+    "id": "Paper ID",
+    "authors": ["Author1", "Author2", ...],
+    "categories": ["Category1", "Category2", ...],
+    "abstract": "Abstract text",
+    "updated": "Updated date",
+    "published": "Published date"
+}
+```
+### `queries.json`
+This file contains all generated queries.
+```json
+{
+    "Query UUID": {
+        "query": "Query text",
+        "type": "Query type (abstractive/extractive)",
+        "source": "Generation source (text/text-image/text-table/text-table-image)"
+    },
+    ...
+}
+```
+### `qrels.json`
+This file contains the query-document-section relevance labels.
+```json
+{
+    "Query UUID": {
+        "doc_id": "Paper ID",
+        "section_id": Section Index
+    },
+    ...
+}
+```
+### `answers.json`
+This file contains the answers for the generated queries.
+```json
+{
+    "Query UUID": "Answer text",
+    ...
+}
+```
+## Dataset Creation
+The Open RAG Benchmark dataset is created through a systematic process involving document collection, processing, content segmentation, query generation, and quality filtering.
+1.  **Document Collection:** Gathering documents from sources like arXiv.
+2.  **Document Processing:** Parsing PDFs via OCR into text, Markdown tables, and base64 encoded images.
+3.  **Content Segmentation:** Dividing documents into sections based on structural elements.
+4.  **Query Generation:** Using LLMs (currently `gpt-4o-mini`) to generate retrieval queries for each section, handling multimodal content such as tables and images.
+5.  **Quality Filtering:** Removing non-retrieval queries and ensuring quality through post-processing via a set of encoders for retrieval filtering and `gpt-4o-mini` for query quality filtering.
+6.  **Hard-Negative Document Mining (Optional):** Mining hard negative documents that are entirely irrelevant to any existing query, relying on agreement across multiple embedding models for accuracy.
+The code for reproducing and customizing the dataset generation process is available in the [Open RAG Benchmark GitHub repository](https://github.com/vectara/Open-RAG-Benchmark).
+## Limitations and Challenges
+Several challenges are inherent in the current dataset development process:
+  - **OCR Performance:** Mistral OCR, while performing well for structured documents, struggles with unstructured PDFs, impacting the quality of extracted content.
+  - **Multimodal Integration:** Ensuring proper extraction and seamless integration of tables and images with corresponding text remains a complex challenge.
+## Future Enhancements
+The project aims for continuous improvement and expansion of the dataset, with key next steps including:
+### Enhanced Dataset Structure and Usability:
+  - **Dataset Format and Content Enhancements:**
+      - **Rich Metadata:** Adding comprehensive document metadata (authors, publication date, categories, etc.) to enable better filtering and contextualization.
+      - **Flexible Chunking:** Providing multiple content granularity levels (sections, paragraphs, sentences) to accommodate different retrieval strategies.
+      - **Query Metadata:** Classifying queries by type (factual, conceptual, analytical), difficulty level, and whether they require multimodal understanding.
+  - **Advanced Multimodal Representation:**
+      - **Improved Image Integration:** Replacing basic placeholders with structured image objects including captions, alt text, and direct access URLs.
+      - **Structured Table Format:** Providing both markdown and programmatically accessible structured formats for tables (headers/rows).
+      - **Positional Context:** Maintaining clear positional relationships between text and visual elements.
+  - **Sophisticated Query Generation:**
+      - **Multi-stage Generation Pipeline:** Implementing targeted generation for different query types (factual, conceptual, multimodal).
+      - **Diversity Controls:** Ensuring coverage of different difficulty levels and reasoning requirements.
+      - **Specialized Multimodal Queries:** Generating queries specifically designed to test table and image understanding.
+  - **Practitioner-Focused Tools:**
+      - **Framework Integration Examples:** Providing code samples showing dataset integration with popular RAG frameworks (LangChain, LlamaIndex, etc.).
+      - **Evaluation Utilities:** Developing standardized tools to benchmark RAG system performance using this dataset.
+      - **Interactive Explorer:** Creating a simple visualization tool to browse and understand dataset contents.
+### Dataset Expansion:
+  - Implementing alternative solutions for PDF table & image extraction.
+  - Enhancing OCR capabilities for unstructured documents.
+  - Broadening scope beyond academic papers to include other document types.
+  - Potentially adding multilingual support.
+## Acknowledgments
+The Open RAG Benchmark project uses OpenAI's GPT models (specifically `gpt-4o-mini`) for query generation and evaluation. For post-filtering and retrieval filtering, the following embedding models, recognized for their outstanding performance on the [MTEB Benchmark](https://huggingface.co/spaces/mteb/leaderboard), were utilized:
+  - [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)
+  - [dunzhang/stella\_en\_1.5B\_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5)
+  - [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)
+  - [infly/inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1)
+  - [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)
+  - [openai/text-embedding-3-large](https://platform.openai.com/docs/models/text-embedding-3-large)
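
The `queries.json`, `qrels.json`, and `answers.json` layouts described under Dataset Structure all use the query UUID as their key, so they can be joined directly. The following Python sketch assumes the files have already been downloaded into a local `official/pdf/arxiv/` directory. `join_query_records` and `load_records` are hypothetical helpers, not part of the dataset tooling.

```python
import json
from pathlib import Path


def join_query_records(queries, qrels, answers):
    """Merge the three query-keyed mappings into one record per query UUID."""
    records = []
    for query_id, query in queries.items():
        qrel = qrels.get(query_id)
        if qrel is None:
            continue  # no relevance label for this query, so skip it
        records.append(
            {
                "query_id": query_id,
                "query": query["query"],
                "type": query.get("type"),
                "source": query.get("source"),
                "doc_id": qrel["doc_id"],
                "section_id": qrel["section_id"],
                "answer": answers.get(query_id),
            }
        )
    return records


def load_records(root):
    """Read the three JSON files under `root` and join them by query UUID."""
    root = Path(root)

    def read(name):
        return json.loads((root / name).read_text(encoding="utf-8"))

    return join_query_records(read("queries.json"), read("qrels.json"), read("answers.json"))


# In-memory example using placeholder values in the documented shapes.
example = join_query_records(
    queries={"q-1": {"query": "What is shown?", "type": "extractive", "source": "text"}},
    qrels={"q-1": {"doc_id": "paper-1", "section_id": 2}},
    answers={"q-1": "An example answer."},
)
```

Calling `load_records("official/pdf/arxiv")` would return one dictionary per labelled query; the golden section can then be looked up as `sections[section_id]` in the matching `corpus/{doc_id}.json` file.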

eval/rag_eval.py ADDED

	@@ -0,0 +1,630 @@

+from __future__ import annotations
+import argparse
+import csv
+import json
+import math
+import shutil
+import zipfile
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Iterable
+import chromadb
+import requests
+from llama_index.core import StorageContext, VectorStoreIndex
+from llama_index.core.node_parser import SentenceSplitter
+from llama_index.core.schema import Document
+from llama_index.vector_stores.chroma import ChromaVectorStore
+from tools.query_knowledge import configure_model_cache, resolve_embed_model_name
+PROJECT_ROOT = Path(__file__).resolve().parents[1]
+EVAL_DIR = PROJECT_ROOT / "eval"
+DATA_DIR = EVAL_DIR / "data"
+INDEX_DIR = EVAL_DIR / "indexes"
+REPORT_DIR = EVAL_DIR / "reports"
+BEIR_URLS = {
+    "scifact": "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip",
+    "fiqa": "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip",
+}
+DATASET_ALIASES = {
+    "beir/scifact": "scifact",
+    "beir/fiqa": "fiqa",
+    "open-ragbench": "open_ragbench",
+    "open_ragbench": "open_ragbench",
+    "t2-ragbench": "t2_ragbench",
+    "t2_ragbench": "t2_ragbench",
+    "local-options": "local_options",
+    "local_options": "local_options",
+}
+@dataclass
+class EvalCorpus:
+    name: str
+    documents: list[dict[str, Any]]
+    queries: list[dict[str, Any]]
+    qrels: dict[str, set[str]]
+def ensure_dirs() -> None:
+    DATA_DIR.mkdir(parents=True, exist_ok=True)
+    INDEX_DIR.mkdir(parents=True, exist_ok=True)
+    REPORT_DIR.mkdir(parents=True, exist_ok=True)
+def download_file(url: str, destination: Path) -> None:
+    destination.parent.mkdir(parents=True, exist_ok=True)
+    with requests.get(url, stream=True, timeout=60) as response:
+        response.raise_for_status()
+        with destination.open("wb") as file:
+            for chunk in response.iter_content(chunk_size=1024 * 1024):
+                if chunk:
+                    file.write(chunk)
+def read_jsonl(path: Path) -> Iterable[dict[str, Any]]:
+    with path.open("r", encoding="utf-8") as file:
+        for line in file:
+            line = line.strip()
+            if line:
+                yield json.loads(line)
+def prepare_beir_dataset(dataset_name: str) -> Path:
+    ensure_dirs()
+    if dataset_name not in BEIR_URLS:
+        raise ValueError(f"Unsupported BEIR dataset: {dataset_name}")
+    target_dir = DATA_DIR / "beir" / dataset_name
+    corpus_path = target_dir / "corpus.jsonl"
+    if corpus_path.exists():
+        return target_dir
+    zip_path = DATA_DIR / "downloads" / f"{dataset_name}.zip"
+    if not zip_path.exists():
+        download_file(BEIR_URLS[dataset_name], zip_path)
+    extract_root = DATA_DIR / "beir"
+    extract_root.mkdir(parents=True, exist_ok=True)
+    with zipfile.ZipFile(zip_path) as archive:
+        archive.extractall(extract_root)
+    if not corpus_path.exists():
+        raise FileNotFoundError(f"BEIR extraction did not create {corpus_path}")
+    return target_dir
+def load_beir_dataset(
+    dataset_name: str,
+    split: str,
+    max_corpus_docs: int | None,
+    max_queries: int | None,
+) -> EvalCorpus:
+    dataset_dir = prepare_beir_dataset(dataset_name)
+    all_queries = {
+        str(row["_id"]): row.get("text", "")
+        for row in read_jsonl(dataset_dir / "queries.jsonl")
+    }
+    qrels_path = dataset_dir / "qrels" / f"{split}.tsv"
+    if not qrels_path.exists():
+        candidates = sorted((dataset_dir / "qrels").glob("*.tsv"))
+        if not candidates:
+            raise FileNotFoundError(f"No qrels found under {dataset_dir / 'qrels'}")
+        qrels_path = candidates[0]
+    all_qrels: dict[str, set[str]] = {}
+    with qrels_path.open("r", encoding="utf-8") as file:
+        reader = csv.DictReader(file, delimiter="\t")
+        for row in reader:
+            query_id = str(row.get("query-id") or row.get("query_id"))
+            corpus_id = str(row.get("corpus-id") or row.get("corpus_id"))
+            score = int(row.get("score", 1))
+            if score <= 0:
+                continue
+            all_qrels.setdefault(query_id, set()).add(corpus_id)
+    queries = []
+    required_doc_ids = set()
+    for query_id, relevant_docs in all_qrels.items():
+        if query_id not in all_queries:
+            continue
+        # Skip queries whose relevant documents would push the corpus past the cap.
+        if max_corpus_docs and len(required_doc_ids | relevant_docs) > max_corpus_docs:
+            continue
+        required_doc_ids.update(relevant_docs)
+        queries.append(
+            {
+                "query_id": query_id,
+                "question": all_queries[query_id],
+                "relevant_doc_ids": sorted(relevant_docs),
+            }
+        )
+        if max_queries and len(queries) >= max_queries:
+            break
+    documents = []
+    seen_doc_ids = set()
+    for row in read_jsonl(dataset_dir / "corpus.jsonl"):
+        doc_id = str(row["_id"])
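+        if required_doc_ids and doc_id not in required_doc_ids:
+            # Filler (non-relevant) document: skip it once the cap is reached, or when
+            # adding it would take a slot still needed by a relevant document not yet seen.
+            if max_corpus_docs and len(documents) >= max_corpus_docs:
+                continue
+            if max_corpus_docs and len(documents) + len(required_doc_ids - seen_doc_ids) >= max_corpus_docs:
+                continue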
+        title = row.get("title") or ""
+        text = row.get("text") or ""
+        documents.append(
+            {
+                "doc_id": doc_id,
+                "title": title,
+                "text": f"{title}\n{text}".strip(),
+                "metadata": {"source_dataset": f"beir/{dataset_name}"},
+            }
+        )
+        seen_doc_ids.add(doc_id)
+        if max_corpus_docs and len(documents) >= max_corpus_docs and required_doc_ids.issubset(seen_doc_ids):
+            break
+    if not documents or not queries:
+        raise ValueError(
+            f"Dataset beir/{dataset_name} has no evaluable documents/queries. "
+            "Increase --max-corpus-docs or use a larger sample."
+        )
+    return EvalCorpus(
+        name=f"beir_{dataset_name}",
+        documents=documents,
+        queries=queries,
+        qrels={query["query_id"]: set(query["relevant_doc_ids"]) for query in queries},
+    )
+def snapshot_hf_dataset(repo_id: str, local_name: str) -> Path:
+    from huggingface_hub import snapshot_download
+    ensure_dirs()
+    target_dir = DATA_DIR / "hf" / local_name
+    # Reuse any existing directory as-is; delete it manually to force a re-download.
+    if target_dir.exists():
+        return target_dir
+    snapshot_download(
+        repo_id=repo_id,
+        repo_type="dataset",
+        local_dir=str(target_dir),
+        local_dir_use_symlinks=False,
+    )
+    return target_dir
+def flatten_open_ragbench_section(section: dict[str, Any]) -> str:
+    parts = [section.get("text") or ""]
+    tables = section.get("tables") or {}
+    if isinstance(tables, dict):
+        parts.extend(str(value) for value in tables.values())
+    return "\n".join(part for part in parts if part)
+def load_open_ragbench(
+    max_corpus_docs: int | None,
+    max_queries: int | None,
+) -> EvalCorpus:
+    dataset_dir = snapshot_hf_dataset("vectara/open_ragbench", "open_ragbench")
+    root = dataset_dir / "pdf" / "arxiv"
+    if not root.exists():
+        root = dataset_dir / "official" / "pdf" / "arxiv"
+    if not root.exists():
+        raise FileNotFoundError(f"Open RAGBench root not found: {root}")
+    queries_data = json.loads((root / "queries.json").read_text(encoding="utf-8"))
+    qrels_data = json.loads((root / "qrels.json").read_text(encoding="utf-8"))
+    documents = []
+    qrels: dict[str, set[str]] = {}
+    required_doc_ids = set()
+    selected_query_ids = []
+    for query_id, qrel in qrels_data.items():
+        doc_id = str(qrel.get("doc_id"))
+        if not doc_id or doc_id == "None":
+            continue
+        selected_query_ids.append(str(query_id))
+        required_doc_ids.add(doc_id)
+        if max_queries and len(selected_query_ids) >= max_queries:
+            break
+    allowed_doc_ids = set()
+    corpus_files = sorted((root / "corpus").glob("*.json"))
+    for corpus_file in corpus_files:
+        paper = json.loads(corpus_file.read_text(encoding="utf-8"))
+        paper_id = str(paper.get("id") or corpus_file.stem)
+        is_required = paper_id in required_doc_ids
+        if max_corpus_docs and not is_required:
+            # Reserve room for required papers that have not been added yet.
+            missing_required_count = len(required_doc_ids - allowed_doc_ids)
+            if len(documents) + missing_required_count >= max_corpus_docs:
+                continue
+        allowed_doc_ids.add(paper_id)
+        section_texts = []
+        for section_index, section in enumerate(paper.get("sections") or []):
+            section_text = flatten_open_ragbench_section(section)
+            if section_text:
+                section_texts.append(f"[section {section_index}]\n{section_text}")
+        text = "\n\n".join(
+            part
+            for part in [paper.get("title") or "", paper.get("abstract") or "", *section_texts]
+            if part
+        )
+        documents.append(
+            {
+                "doc_id": paper_id,
+                "title": paper.get("title") or paper_id,
+                "text": text,
+                "metadata": {
+                    "source_dataset": "open_ragbench",
+                    "categories": ",".join(paper.get("categories") or []),
+                },
+            }
+        )
+        if max_corpus_docs and len(documents) >= max_corpus_docs:
+            break
+    queries = []
+    for query_id in selected_query_ids:
+        qrel = qrels_data[query_id]
+        doc_id = str(qrel.get("doc_id"))
+        if doc_id not in allowed_doc_ids:
+            continue
+        query_payload = queries_data.get(query_id) or {}
+        question = query_payload.get("query") if isinstance(query_payload, dict) else str(query_payload)
+        qrels[str(query_id)] = {doc_id}
+        queries.append(
+            {
+                "query_id": str(query_id),
+                "question": question,
+                "relevant_doc_ids": [doc_id],
+            }
+        )
+        if max_queries and len(queries) >= max_queries:
+            break
+    if not documents or not queries:
+        raise ValueError("Open RAGBench produced no evaluable sample.")
+    return EvalCorpus("open_ragbench", documents, queries, qrels)
+def load_t2_ragbench(
+    max_corpus_docs: int | None,
+    max_queries: int | None,
+) -> EvalCorpus:
+    dataset_dir = snapshot_hf_dataset("G4KMU/t2-ragbench", "t2_ragbench")
+    parquet_files = sorted(dataset_dir.rglob("*.parquet"))
+    jsonl_files = sorted(dataset_dir.rglob("*.jsonl"))
+    if not parquet_files and not jsonl_files:
+        raise FileNotFoundError(f"No parquet/jsonl files found in {dataset_dir}")
+    rows: list[dict[str, Any]] = []
+    if parquet_files:
+        import pandas as pd
+        for parquet_file in parquet_files:
+            frame = pd.read_parquet(parquet_file)
+            rows.extend(frame.to_dict(orient="records"))
+            if max_queries and len(rows) >= max_queries * 5:
+                break
+    else:
+        for jsonl_file in jsonl_files:
+            rows.extend(read_jsonl(jsonl_file))
+            if max_queries and len(rows) >= max_queries * 5:
+                break
+    documents_by_id: dict[str, dict[str, Any]] = {}
+    queries = []
+    qrels: dict[str, set[str]] = {}
+    for index, row in enumerate(rows):
+        question = first_present(row, ["question", "query", "Question"])
+        answer = first_present(row, ["answer", "Answer", "response"])
+        context = first_present(row, ["context", "evidence", "gold_context", "text", "document"])
+        table = first_present(row, ["table", "Table", "markdown_table"])
+        doc_id = str(first_present(row, ["doc_id", "document_id", "filename", "pdf_path", "source"]) or f"row-{index}")
+        if not question or not context:
+            continue
+        text = "\n".join(part for part in [str(context), str(table or "")] if part)
+        if doc_id not in documents_by_id:
+            documents_by_id[doc_id] = {
+                "doc_id": doc_id,
+                "title": str(first_present(row, ["company", "ticker", "title", "Title"]) or doc_id),
+                "text": text,
+                "metadata": {"source_dataset": "t2_ragbench", "answer": str(answer or "")},
+            }
+        queries.append(
+            {
+                "query_id": str(first_present(row, ["qid", "query_id", "id"]) or f"q-{index}"),
+                "question": str(question),
+                "relevant_doc_ids": [doc_id],
+            }
+        )
+        qrels[queries[-1]["query_id"]] = {doc_id}
+        if max_queries and len(queries) >= max_queries:
+            break
+    documents = list(documents_by_id.values())
+    if max_corpus_docs:
+        documents = documents[:max_corpus_docs]
+        allowed = {document["doc_id"] for document in documents}
+        queries = [query for query in queries if query["relevant_doc_ids"][0] in allowed]
+        qrels = {query["query_id"]: set(query["relevant_doc_ids"]) for query in queries}
+    if not documents or not queries:
+        raise ValueError("T2-RAGBench produced no evaluable sample.")
+    return EvalCorpus("t2_ragbench", documents, queries, qrels)
+def first_present(row: dict[str, Any], keys: list[str]) -> Any:
+    for key in keys:
+        value = row.get(key)
+        if value is not None and value != "":
+            return value
+    return None
+def load_local_options_eval(max_queries: int | None) -> EvalCorpus:
+    cases_path = EVAL_DIR / "local_options_eval.jsonl"
+    if not cases_path.exists():
+        raise FileNotFoundError(
+            f"Local options eval set not found: {cases_path}. "
+            "Create JSONL cases with question, expected_pages, expected_keywords."
+        )
+    from tools.query_knowledge import load_pdf_file
+    pdf_files = sorted((PROJECT_ROOT / "tools" / "knowledge_base" / "raw").rglob("*.pdf"))
+    documents = []
+    for pdf_file in pdf_files:
+        for doc_index, document in enumerate(load_pdf_file(pdf_file)):
+            documents.append(
+                {
+                    "doc_id": f"{pdf_file.name}:{document.metadata.get('page_number')}:{doc_index}",
+                    "title": document.metadata.get("section_path") or pdf_file.name,
+                    "text": document.text,
+                    "metadata": document.metadata,
+                }
+            )
+    queries = []
+    qrels: dict[str, set[str]] = {}
+    for case_index, case in enumerate(read_jsonl(cases_path)):
+        query_id = str(case.get("id") or f"local-{case_index}")
+        relevant_ids = []
+        expected_pages = set(case.get("expected_pages") or [])
+        expected_keywords = case.get("expected_keywords") or []
+        for document in documents:
+            metadata = document.get("metadata") or {}
+            page_hit = metadata.get("page_number") in expected_pages
+            keyword_hit = any(keyword in document["text"] for keyword in expected_keywords)
+            if page_hit or keyword_hit:
+                relevant_ids.append(document["doc_id"])
+        queries.append(
+            {
+                "query_id": query_id,
+                "question": case["question"],
+                "relevant_doc_ids": relevant_ids,
+            }
+        )
+        qrels[query_id] = set(relevant_ids)
+        if max_queries and len(queries) >= max_queries:
+            break
+    if not documents or not queries:
+        raise ValueError("Local options eval set produced no evaluable sample.")
+    return EvalCorpus("local_options", documents, queries, qrels)
+def load_eval_corpus(args: argparse.Namespace) -> EvalCorpus:
+    dataset = DATASET_ALIASES.get(args.dataset, args.dataset)
+    if dataset in {"scifact", "fiqa"}:
+        return load_beir_dataset(dataset, args.split, args.max_corpus_docs, args.max_queries)
+    if dataset == "open_ragbench":
+        return load_open_ragbench(args.max_corpus_docs, args.max_queries)
+    if dataset == "t2_ragbench":
+        return load_t2_ragbench(args.max_corpus_docs, args.max_queries)
+    if dataset == "local_options":
+        return load_local_options_eval(args.max_queries)
+    raise ValueError(f"Unknown dataset: {args.dataset}")
+def build_index(corpus: EvalCorpus, chunk_size: int, chunk_overlap: int, rebuild: bool) -> VectorStoreIndex:
+    configure_model_cache()
+    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+    index_path = INDEX_DIR / corpus.name
+    if rebuild and index_path.exists():
+        shutil.rmtree(index_path)
+    index_path.mkdir(parents=True, exist_ok=True)
+    db = chromadb.PersistentClient(path=str(index_path))
+    collection_name = f"{corpus.name}_eval"
+    if rebuild:
+        try:
+            db.delete_collection(collection_name)
+        except Exception:
+            pass
+    collection = db.get_or_create_collection(collection_name)
+    vector_store = ChromaVectorStore(chroma_collection=collection)
+    storage_context = StorageContext.from_defaults(vector_store=vector_store)
+    embed_model = HuggingFaceEmbedding(
+        model_name=resolve_embed_model_name(),
+        cache_folder=str(PROJECT_ROOT / "tools" / "hf_cache" / "sentence_transformers"),
+    )
+    if collection.count() == 0:
+        documents = [
+            Document(
+                text=document["text"],
+                metadata={
+                    "doc_id": document["doc_id"],
+                    "title": document.get("title", ""),
+                    **(document.get("metadata") or {}),
+                },
+            )
+            for document in corpus.documents
+        ]
+        splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+        nodes = splitter.get_nodes_from_documents(documents)
+        VectorStoreIndex(
+            nodes,
+            storage_context=storage_context,
+            embed_model=embed_model,
+            show_progress=True,
+        )
+    return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
+def evaluate_retrieval(corpus: EvalCorpus, index: VectorStoreIndex, top_k: int) -> dict[str, Any]:
+    retriever = index.as_retriever(similarity_top_k=max(top_k * 5, top_k))
+    cases = []
+    hit_counts = {1: 0, 3: 0, 5: 0, top_k: 0}
+    reciprocal_ranks = []
+    ndcg_scores = []
+    for query in corpus.queries:
+        relevant_doc_ids = corpus.qrels.get(query["query_id"], set())
+        results = retriever.retrieve(query["question"])
+        retrieved = []
+        seen_doc_ids = set()
+        first_hit_rank = None
+        dcg = 0.0
+        for result in results:
+            metadata = result.node.metadata
+            doc_id = str(metadata.get("doc_id", ""))
+            if doc_id in seen_doc_ids:
+                continue
+            seen_doc_ids.add(doc_id)
+            rank = len(retrieved) + 1
+            hit = doc_id in relevant_doc_ids
+            if hit and first_hit_rank is None:
+                first_hit_rank = rank
+            if hit:
+                dcg += 1 / math.log2(rank + 1)
+            retrieved.append(
+                {
+                    "rank": rank,
+                    "doc_id": doc_id,
+                    "score": result.score,
+                    "hit": hit,
+                    "title": metadata.get("title", ""),
+                }
+            )
+            if len(retrieved) >= top_k:
+                break
+        ideal_hits = min(len(relevant_doc_ids), top_k)
+        idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
+        ndcg = dcg / idcg if idcg else 0.0
+        ndcg_scores.append(ndcg)
+        reciprocal_ranks.append(1 / first_hit_rank if first_hit_rank else 0.0)
+        for k in hit_counts:
+            if any(item["hit"] for item in retrieved[:k]):
+                hit_counts[k] += 1
+        cases.append(
+            {
+                "query_id": query["query_id"],
+                "question": query["question"],
+                "relevant_doc_ids": sorted(relevant_doc_ids),
+                "first_hit_rank": first_hit_rank,
+                "retrieved": retrieved,
+            }
+        )
+    total = len(corpus.queries)
+    metrics = {
+        "queries": total,
+        "documents": len(corpus.documents),
+        "top_k": top_k,
+        "mrr": sum(reciprocal_ranks) / total if total else 0.0,
+        "ndcg_at_k": sum(ndcg_scores) / total if total else 0.0,
+    }
+    for k, count in sorted(hit_counts.items()):
+        metrics[f"hit_at_{k}"] = count / total if total else 0.0
+    return {"dataset": corpus.name, "metrics": metrics, "cases": cases}
+def write_reports(report: dict[str, Any]) -> tuple[Path, Path]:
+    ensure_dirs()
+    dataset_name = report["dataset"]
+    json_path = REPORT_DIR / f"{dataset_name}_retrieval_eval.json"
+    md_path = REPORT_DIR / f"{dataset_name}_retrieval_eval.md"
+    json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
+    metrics = report["metrics"]
+    lines = [
+        f"# Retrieval Eval: {dataset_name}",
+        "",
+        "## Metrics",
+        "",
+    ]
+    for key, value in metrics.items():
+        lines.append(f"- `{key}`: {value:.4f}" if isinstance(value, float) else f"- `{key}`: {value}")
+    lines.extend(["", "## Sample Cases", ""])
+    for case in report["cases"][:10]:
+        lines.append(f"### {case['query_id']}")
+        lines.append("")
+        lines.append(case["question"])
+        lines.append("")
+        lines.append(f"- first_hit_rank: `{case['first_hit_rank']}`")
+        for item in case["retrieved"][:5]:
+            lines.append(
+                f"- rank {item['rank']}: hit={item['hit']} doc_id=`{item['doc_id']}` score={item['score']}"
+            )
+        lines.append("")
+    md_path.write_text("\n".join(lines), encoding="utf-8")
+    return json_path, md_path
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Run retrieval eval for RAG datasets.")
+    parser.add_argument(
+        "--dataset",
+        required=True,
+        help="beir/scifact, beir/fiqa, open-ragbench, t2-ragbench, or local-options",
+    )
+    parser.add_argument("--split", default="test")
+    parser.add_argument("--top-k", type=int, default=5)
+    parser.add_argument("--chunk-size", type=int, default=512)
+    parser.add_argument("--chunk-overlap", type=int, default=64)
+    parser.add_argument("--max-corpus-docs", type=int, default=None)
+    parser.add_argument("--max-queries", type=int, default=None)
+    parser.add_argument("--rebuild", action="store_true")
+    return parser.parse_args()
+def main() -> None:
+    args = parse_args()
+    corpus = load_eval_corpus(args)
+    index = build_index(corpus, args.chunk_size, args.chunk_overlap, args.rebuild)
+    report = evaluate_retrieval(corpus, index, args.top_k)
+    json_path, md_path = write_reports(report)
+    print(json.dumps(report["metrics"], ensure_ascii=False, indent=2))
+    print(f"JSON report: {json_path}")
+    print(f"Markdown report: {md_path}")
+if __name__ == "__main__":
+    main()
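The scoring in `evaluate_retrieval` can be replayed on a plain list of chunk-level document IDs, which may help when checking a report by hand. The sketch below is an illustration under assumptions, not part of the commit: the function name `score_ranking` and the sample IDs are hypothetical, and it mirrors only the deduplication, reciprocal-rank, and binary-relevance nDCG steps shown above.

```python
import math


def score_ranking(chunk_doc_ids, relevant_doc_ids, top_k):
    """Collapse chunk hits to unique documents, then score the first top_k ranks."""
    ranked = []
    for doc_id in chunk_doc_ids:
        if doc_id not in ranked:  # keep only the first (best-scoring) chunk per document
            ranked.append(doc_id)
        if len(ranked) >= top_k:
            break

    hits = [doc_id in relevant_doc_ids for doc_id in ranked]
    first_hit_rank = hits.index(True) + 1 if True in hits else None

    # Binary-relevance DCG: a hit at rank r contributes 1 / log2(r + 1).
    dcg = sum(1 / math.log2(rank + 1) for rank, hit in enumerate(hits, start=1) if hit)
    ideal_hits = min(len(relevant_doc_ids), top_k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return {
        "first_hit_rank": first_hit_rank,
        "reciprocal_rank": 1 / first_hit_rank if first_hit_rank else 0.0,
        "ndcg": dcg / idcg if idcg else 0.0,
    }


# Two chunks of document "a" come back first; only the first one occupies a rank.
result = score_ranking(["a", "a", "b", "c"], {"b"}, top_k=3)
```

Because several chunks of one document share a rank, the retriever has to fetch more than `top_k` chunks, which is presumably why the code above asks for `top_k * 5` results.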

eval/reports/beir_fiqa_retrieval_eval.md ADDED Viewed

	@@ -0,0 +1,124 @@

+# Retrieval Eval: beir_fiqa
+## Metrics
+- `queries`: 10
+- `documents`: 500
+- `top_k`: 5
+- `mrr`: 0.8000
+- `ndcg_at_k`: 0.6582
+- `hit_at_1`: 0.8000
+- `hit_at_3`: 0.8000
+- `hit_at_5`: 0.8000
+## Sample Cases
+### 8
+How to deposit a cheque issued to an associate in my business into my business account?
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`65404` score=0.6844510955827177
+- rank 2: hit=False doc_id=`508754` score=0.6415634192002271
+- rank 3: hit=False doc_id=`1873` score=0.6244133153886419
+- rank 4: hit=False doc_id=`590102` score=0.6106401478322256
+- rank 5: hit=False doc_id=`1066` score=0.5854493569389293
+### 15
+Can I send a money order from USPS as a business?
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`325273` score=0.6860931820873509
+- rank 2: hit=False doc_id=`3714` score=0.5383410844537323
+- rank 3: hit=False doc_id=`508754` score=0.5295326644960427
+- rank 4: hit=False doc_id=`1873` score=0.5219679418951554
+- rank 5: hit=False doc_id=`4457` score=0.5122406473020094
+### 18
+1 EIN doing business under multiple business names
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`88124` score=0.5926237160250162
+- rank 2: hit=False doc_id=`1873` score=0.5421392202098603
+- rank 3: hit=False doc_id=`248624` score=0.5355707959162649
+- rank 4: hit=False doc_id=`590102` score=0.5349105669189491
+- rank 5: hit=False doc_id=`1173` score=0.5304232255229728
+### 26
+Applying for and receiving business credit
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`350819` score=0.6130084948278423
+- rank 2: hit=False doc_id=`2064` score=0.5484836878784439
+- rank 3: hit=False doc_id=`5019` score=0.545421752024407
+- rank 4: hit=False doc_id=`1873` score=0.5288677740902044
+- rank 5: hit=False doc_id=`1766` score=0.5277730439438229
+### 34
+401k Transfer After Business Closure
+- first_hit_rank: `None`
+- rank 1: hit=False doc_id=`19183` score=0.5697281829712297
+- rank 2: hit=False doc_id=`1506` score=0.5606544069043923
+- rank 3: hit=False doc_id=`1134` score=0.5594801072658324
+- rank 4: hit=False doc_id=`3481` score=0.5580692841866827
+- rank 5: hit=False doc_id=`3059` score=0.5470931591486823
+### 42
+What are the ins/outs of writing equipment purchases off as business expenses in a home based business?
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`272709` score=0.6108084707046366
+- rank 2: hit=False doc_id=`2528` score=0.5915589749452431
+- rank 3: hit=True doc_id=`331981` score=0.5819601957870557
+- rank 4: hit=False doc_id=`1873` score=0.5679211375564418
+- rank 5: hit=True doc_id=`327263` score=0.5609058973658579
+### 56
+Can a entrepreneur hire a self-employed business owner?
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`572690` score=0.5928112761756716
+- rank 2: hit=False doc_id=`1873` score=0.5329399371121925
+- rank 3: hit=False doc_id=`350819` score=0.49122764843847383
+- rank 4: hit=False doc_id=`288` score=0.48281883887294536
+- rank 5: hit=False doc_id=`599545` score=0.4825679577769018
+### 68
+Intentions of Deductible Amount for Small Business
+- first_hit_rank: `None`
+- rank 1: hit=False doc_id=`599545` score=0.5484593654392641
+- rank 2: hit=False doc_id=`350819` score=0.545089604374947
+- rank 3: hit=False doc_id=`327263` score=0.5425303284932907
+- rank 4: hit=False doc_id=`272709` score=0.5367760755311749
+- rank 5: hit=False doc_id=`1873` score=0.5341962558469263
+### 89
+How can I deposit a check made out to my business into my personal account?
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`508754` score=0.678210930846752
+- rank 2: hit=False doc_id=`3336` score=0.6219187366693569
+- rank 3: hit=False doc_id=`1066` score=0.6102283456272309
+- rank 4: hit=False doc_id=`65404` score=0.6070578770706204
+- rank 5: hit=True doc_id=`413229` score=0.5974145307840032
+### 90
+Filing personal with 1099s versus business s-corp?
+- first_hit_rank: `1`
+- rank 1: hit=True doc_id=`31793` score=0.6463855238248295
+- rank 2: hit=False doc_id=`4992` score=0.575164246858743
+- rank 3: hit=False doc_id=`1873` score=0.567805853646443
+- rank 4: hit=False doc_id=`2020` score=0.5629015874196683
+- rank 5: hit=False doc_id=`350819` score=0.5607360854843948
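The headline numbers in this report can be cross-checked from the `first_hit_rank` values listed in the sample cases. The snippet below is a small sanity check rather than project code; the dictionary is transcribed by hand from the ten cases above.

```python
# first_hit_rank per query ID, transcribed from the sample cases in this report
first_hit_ranks = {
    "8": 1, "15": 1, "18": 1, "26": 1, "34": None,
    "42": 1, "56": 1, "68": None, "89": 1, "90": 1,
}

total = len(first_hit_ranks)

# MRR averages 1 / first_hit_rank, counting a miss as 0.
mrr = sum(1 / rank for rank in first_hit_ranks.values() if rank) / total


def hit_at(k):
    """Share of queries whose first relevant document appears within rank k."""
    return sum(1 for rank in first_hit_ranks.values() if rank is not None and rank <= k) / total


hit_at_1, hit_at_3, hit_at_5 = hit_at(1), hit_at(3), hit_at(5)
print(f"mrr={mrr:.4f} hit@1={hit_at_1:.4f} hit@3={hit_at_3:.4f} hit@5={hit_at_5:.4f}")
```

Every query that hit did so at rank 1, which is why `hit_at_1`, `hit_at_3`, `hit_at_5`, and `mrr` all coincide. `ndcg_at_k` cannot be recomputed from this Markdown file alone, because the ideal DCG depends on how many relevant documents each query has; that list is only written to the JSON report.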

eval/run_eval_suite.py ADDED Viewed

	@@ -0,0 +1,173 @@

+from __future__ import annotations
+import argparse
+import json
+import traceback
+from dataclasses import dataclass
+from datetime import datetime
+from pathlib import Path
+from types import SimpleNamespace
+from typing import Any
+from eval.rag_eval import (
+    REPORT_DIR,
+    build_index,
+    ensure_dirs,
+    evaluate_retrieval,
+    load_eval_corpus,
+    write_reports,
+)
+DEFAULT_DATASETS = ["beir/scifact", "beir/fiqa", "open-ragbench", "local-options"]
+SMOKE_DEFAULTS = {
+    "beir/scifact": {"max_corpus_docs": 200, "max_queries": 10},
+    "beir/fiqa": {"max_corpus_docs": 500, "max_queries": 10},
+    "open-ragbench": {"max_corpus_docs": 20, "max_queries": 5},
+    "t2-ragbench": {"max_corpus_docs": 20, "max_queries": 5},
+    "local-options": {"max_corpus_docs": None, "max_queries": 3},
+}
+@dataclass
+class DatasetRun:
+    dataset: str
+    status: str
+    metrics: dict[str, Any] | None
+    json_report: str | None
+    markdown_report: str | None
+    error: str | None = None
+def parse_dataset_list(value: str) -> list[str]:
+    datasets = [item.strip() for item in value.split(",") if item.strip()]
+    return datasets or DEFAULT_DATASETS
+def build_dataset_args(args: argparse.Namespace, dataset: str) -> SimpleNamespace:
+    defaults = SMOKE_DEFAULTS.get(dataset, {"max_corpus_docs": None, "max_queries": None})
+    return SimpleNamespace(
+        dataset=dataset,
+        split=args.split,
+        top_k=args.top_k,
+        chunk_size=args.chunk_size,
+        chunk_overlap=args.chunk_overlap,
+        max_corpus_docs=args.max_corpus_docs
+        if args.max_corpus_docs is not None
+        else defaults["max_corpus_docs"],
+        max_queries=args.max_queries if args.max_queries is not None else defaults["max_queries"],
+        rebuild=args.rebuild,
+    )
+def run_one(dataset: str, args: argparse.Namespace) -> DatasetRun:
+    dataset_args = build_dataset_args(args, dataset)
+    print(
+        f"\n=== Running {dataset} "
+        f"(top_k={dataset_args.top_k}, max_corpus_docs={dataset_args.max_corpus_docs}, "
+        f"max_queries={dataset_args.max_queries}, rebuild={dataset_args.rebuild}) ==="
+    )
+    corpus = load_eval_corpus(dataset_args)
+    index = build_index(
+        corpus,
+        chunk_size=dataset_args.chunk_size,
+        chunk_overlap=dataset_args.chunk_overlap,
+        rebuild=dataset_args.rebuild,
+    )
+    report = evaluate_retrieval(corpus, index, dataset_args.top_k)
+    json_path, md_path = write_reports(report)
+    print(json.dumps(report["metrics"], ensure_ascii=False, indent=2))
+    return DatasetRun(
+        dataset=dataset,
+        status="passed",
+        metrics=report["metrics"],
+        json_report=str(json_path),
+        markdown_report=str(md_path),
+    )
+def write_suite_report(runs: list[DatasetRun], output_name: str | None) -> tuple[Path, Path]:
+    ensure_dirs()
+    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+    stem = output_name or f"rag_eval_suite_{timestamp}"
+    json_path = REPORT_DIR / f"{stem}.json"
+    md_path = REPORT_DIR / f"{stem}.md"
+    payload = {
+        "created_at": datetime.now().isoformat(timespec="seconds"),
+        "runs": [run.__dict__ for run in runs],
+    }
+    json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
+    lines = ["# RAG Eval Suite", ""]
+    for run in runs:
+        lines.append(f"## {run.dataset}")
+        lines.append("")
+        lines.append(f"- status: `{run.status}`")
+        if run.error:
+            lines.append(f"- error: `{run.error}`")
+        if run.metrics:
+            for key, value in run.metrics.items():
+                lines.append(f"- `{key}`: {value:.4f}" if isinstance(value, float) else f"- `{key}`: {value}")
+        if run.markdown_report:
+            lines.append(f"- report: `{run.markdown_report}`")
+        lines.append("")
+    md_path.write_text("\n".join(lines), encoding="utf-8")
+    return json_path, md_path
+def parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Run a RAG retrieval eval suite.")
+    parser.add_argument(
+        "--datasets",
+        default=",".join(DEFAULT_DATASETS),
+        help="Comma-separated datasets: beir/scifact, beir/fiqa, open-ragbench, t2-ragbench, local-options",
+    )
+    parser.add_argument("--split", default="test")
+    parser.add_argument("--top-k", type=int, default=5)
+    parser.add_argument("--chunk-size", type=int, default=512)
+    parser.add_argument("--chunk-overlap", type=int, default=64)
+    parser.add_argument("--max-corpus-docs", type=int, default=None)
+    parser.add_argument("--max-queries", type=int, default=None)
+    parser.add_argument("--rebuild", action="store_true")
+    parser.add_argument("--fail-fast", action="store_true")
+    parser.add_argument("--output-name", default=None, help="Suite report filename stem under eval/reports.")
+    return parser.parse_args()
+def main() -> None:
+    args = parse_args()
+    runs: list[DatasetRun] = []
+    for dataset in parse_dataset_list(args.datasets):
+        try:
+            runs.append(run_one(dataset, args))
+        except Exception as exc:
+            error = f"{type(exc).__name__}: {exc}"
+            print(f"\n*** {dataset} failed: {error}")
+            if args.fail_fast:
+                raise
+            traceback.print_exc()
+            runs.append(
+                DatasetRun(
+                    dataset=dataset,
+                    status="failed",
+                    metrics=None,
+                    json_report=None,
+                    markdown_report=None,
+                    error=error,
+                )
+            )
+    json_path, md_path = write_suite_report(runs, args.output_name)
+    print(f"\nSuite JSON report: {json_path}")
+    print(f"Suite Markdown report: {md_path}")
+    if any(run.status == "failed" for run in runs):
+        raise SystemExit(1)
+if __name__ == "__main__":
+    main()

hf_cache/sentence_transformers/models--BAAI--bge-small-en-v1.5/snapshots/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/README.md ADDED Viewed

	@@ -0,0 +1 @@


+../../blobs/8b8567d75ffa619486d9590cb0eb76d66ad46c49

load_docs.py DELETED Viewed

@@ -1,216 +0,0 @@
-import asyncio
-import hashlib
-import os
-from pathlib import Path
-from typing import Iterable, List
-from dotenv import load_dotenv
-import chromadb
-from chromadb.errors import NotFoundError
-from pypdf import PdfReader
-from llama_index.core import StorageContext, VectorStoreIndex
-from llama_index.core.schema import Document, BaseNode
-from llama_index.core.node_parser import SentenceSplitter
-from llama_index.vector_stores.chroma import ChromaVectorStore
-BASE_DIR = Path(__file__).resolve().parent
-KNOWLEDGE_BASE_DIR = BASE_DIR / "knowledge_base"
-RAW_DIR = KNOWLEDGE_BASE_DIR / "raw"
-CHROMA_DB_DIR = KNOWLEDGE_BASE_DIR / "chroma_db"
-HF_CACHE_DIR = BASE_DIR / "hf_cache"
-COLLECTION_NAME = "options_knowledge"
-EMBED_MODEL_NAME = "BAAI/bge-small-en-v1.5"
-CHUNK_SIZE = 1000
-CHUNK_OVERLAP = 150
-REQUIRED_METADATA = [
-    "source_file",
-    "file_name",
-    "file_type",
-    "document_title",
-    "file_hash",
-    "chunk_id",
-    "chunk_index",
-]
-def configure_model_cache() -> None:
-    HF_CACHE_DIR.mkdir(parents=True, exist_ok=True)
-    os.environ.setdefault("HF_HOME", str(HF_CACHE_DIR))
-    os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", str(HF_CACHE_DIR / "sentence_transformers"))
-    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
-def file_sha256(path: Path) -> str:
-    digest = hashlib.sha256()
-    with path.open("rb") as file:
-        for block in iter(lambda: file.read(1024 * 1024), b""):
-            digest.update(block)
-    return digest.hexdigest()
-def load_md_file(path: Path) -> Document:
-    text = path.read_text(encoding="utf-8")
-    return Document(
-        text=text,
-        metadata={
-            "source_file": str(path.resolve()),
-            "file_name": path.name,
-            "file_type": "md",
-            "document_title": path.stem,
-            "file_hash": file_sha256(path),
-        },
-    )
-def load_pdf_file(path: Path) -> List[Document]:
-    reader = PdfReader(str(path))
-    documents = []
-    for page_index, page in enumerate(reader.pages, start=1):
-        text = page.extract_text() or ""
-        if not text.strip():
-            continue
-        documents.append(
-            Document(
-                text=text,
-                metadata={
-                    "source_file": str(path.resolve()),
-                    "file_name": path.name,
-                    "file_type": "pdf",
-                    "document_title": path.stem,
-                    "file_hash": file_sha256(path),
-                    "page_number": page_index,
-                },
-            )
-        )
-    return documents
-def iter_source_files(raw_dir: Path) -> Iterable[Path]:
-    supported_suffixes = {".md", ".markdown", ".pdf"}
-    for path in sorted(raw_dir.rglob("*")):
-        if path.is_file() and path.suffix.lower() in supported_suffixes:
-            yield path
-def load_docs(raw_dir: Path = RAW_DIR) -> List[Document]:
-    documents: List[Document] = []
-    for path in iter_source_files(raw_dir):
-        suffix = path.suffix.lower()
-        if suffix in {".md", ".markdown"}:
-            documents.append(load_md_file(path))
-        elif suffix == ".pdf":
-            documents.extend(load_pdf_file(path))
-    if not documents:
-        raise ValueError(f"No supported documents found under {raw_dir}")
-    return documents
-def add_chunk_metadata(nodes: List[BaseNode]) -> List[BaseNode]:
-    counters: dict[str, int] = {}
-    for node in nodes:
-        source_file = node.metadata["source_file"]
-        chunk_index = counters.get(source_file, 0)
-        counters[source_file] = chunk_index + 1
-        file_hash = node.metadata["file_hash"][:12]
-        page_number = node.metadata.get("page_number", "na")
-        chunk_id = f"{Path(source_file).stem}-{file_hash}-p{page_number}-c{chunk_index}"
-        node.metadata["chunk_id"] = chunk_id
-        node.metadata["chunk_index"] = chunk_index
-        node.id_ = chunk_id
-    return nodes
-def validate_nodes(nodes: List[BaseNode]) -> None:
-    if not nodes:
-        raise ValueError("No chunks were created from the source documents.")
-    for node in nodes:
-        missing = [key for key in REQUIRED_METADATA if key not in node.metadata]
-        if missing:
-            raise ValueError(f"Node {node.node_id} is missing metadata fields: {missing}")
-        if node.metadata["file_type"] == "pdf" and "page_number" not in node.metadata:
-            raise ValueError(f"PDF node {node.node_id} is missing page_number metadata.")
-def build_nodes(raw_dir: Path = RAW_DIR) -> List[BaseNode]:
-    documents = load_docs(raw_dir)
-    splitter = SentenceSplitter(
-        chunk_size=CHUNK_SIZE,
-        chunk_overlap=CHUNK_OVERLAP,
-    )
-    nodes = splitter.get_nodes_from_documents(documents)
-    add_chunk_metadata(nodes)
-    validate_nodes(nodes)
-    return nodes
-async def build_index(raw_dir: Path = RAW_DIR, rebuild: bool = False) -> VectorStoreIndex:
-    configure_model_cache()
-    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
-    load_dotenv()
-    CHROMA_DB_DIR.mkdir(parents=True, exist_ok=True)
-    db = chromadb.PersistentClient(path=str(CHROMA_DB_DIR))
-    if rebuild:
-        try:
-            db.delete_collection(COLLECTION_NAME)
-        except (NotFoundError, ValueError):
-            pass
-    chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
-    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
-    storage_context = StorageContext.from_defaults(vector_store=vector_store)
-    embed_model = HuggingFaceEmbedding(
-        model_name=EMBED_MODEL_NAME,
-        cache_folder=str(HF_CACHE_DIR / "sentence_transformers"),
-    )
-    if rebuild or chroma_collection.count() == 0:
-        nodes = build_nodes(raw_dir)
-        index = VectorStoreIndex(
-            nodes,
-            storage_context=storage_context,
-            embed_model=embed_model,
-            show_progress=True,
-        )
-        print(f"Indexed {len(nodes)} chunks into collection '{COLLECTION_NAME}'.")
-        return index
-    print(f"Loaded existing collection '{COLLECTION_NAME}' with {chroma_collection.count()} chunks.")
-    return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
-if __name__ == "__main__":
-    index = asyncio.run(build_index(rebuild=True))
-    retriever = index.as_retriever(similarity_top_k=5)
-    results = retriever.retrieve("What is volatility smile?")
-    print("\nTop retrieved chunks:")
-    for result in results:
-        metadata = result.node.metadata
-        source = metadata.get("file_name", "unknown")
-        page = metadata.get("page_number", "n/a")
-        score = result.score
-        print(f"- {source}, page {page}, score={score:.4f}")
-        print(result.node.get_content()[:500].replace("\n", " "))
-        print()

pyproject.toml CHANGED Viewed

@@ -17,6 +17,7 @@ dependencies = [
     "pypdf>=6.0.0",
     "tokenizers>=0.22.0,<=0.23.0",
     "transformers<5",
+    "pymupdf>=1.27.2.3",
 ]
 [build-system]

rag_pdf_optimization_notes.md ADDED Viewed

	@@ -0,0 +1,282 @@

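+# RAG PDF Extraction and Chunking Optimization Summary
+The goal of this optimization was to improve how the current RAG system parses financial PDFs, especially PDFs with many mathematical formulas, section headings, and figures. The original implementation handled basic vector retrieval, but PDF extraction, formula preservation, chunking, and metadata management were all crude, which made retrieval results unstable.
+## I. Background
+The project builds a local knowledge base with `LlamaIndex + Chroma + HuggingFaceEmbedding`. The source PDF is a book on options and volatility. The initial pipeline was roughly:
+```text
+pypdf extracts text page by page
+  -> SentenceSplitter fixed-length chunking
+  -> HuggingFace embedding
+  -> Chroma vector store
+  -> QueryKnowledgeTool retrieves and returns chunks
+```
+This pipeline is adequate for plain text, but a finance textbook PDF exposes many problems: formulas get broken apart, section boundaries are lost, headers and footers pollute the text, figure text leaks into the body, and mathematical symbols come out in the wrong order.
+## II. Main Problems
+### 1. Weak basic PDF text extraction
+Initially only this call was used:
+```python
+page.extract_text()
+```
+The problems:
+- Headers, page numbers, and copyright notices get mixed into the body text.
+- Line breaks and word breaks are severe; for example, words are split by PDF line wrapping.
+- Text order is easily scrambled around multi-column layouts, figures, and formulas.
+- Formulas are often flattened into a single line, or their symbols come out in the wrong order.
+Solution:
+- Add `pypdf`'s `layout` mode as a candidate.
+- Add coordinate-level extraction: use `visitor_text` to get the `x/y` coordinates of the text and regroup it into visual lines.
+- Add text-cleaning logic:
+  - Remove blank lines, page numbers, and repeated headers and footers.
+  - Repair words split by hyphenation.
+  - Handle common ligatures such as `ﬁ` and `ﬂ`.
+  - Preserve line breaks on formula lines instead of forcing formulas into ordinary paragraphs.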
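The cleaning steps listed above can be sketched in a few lines. This is a simplified illustration, not the code in `tools/query_knowledge.py`: the function name `clean_pages`, the ligature table, and the regular expressions are assumptions. The default of three pages for a repeated line echoes `PDF_REPEATED_LINE_MIN_PAGES`, and the formula-line rule is left out.

```python
import re
from collections import Counter

# Assumed ligature table; extend as needed for a given PDF.
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "ﬃ": "ffi", "ﬄ": "ffl"}


def clean_pages(pages, min_repeat_pages=3):
    """Return cleaned text for each raw page string in `pages`."""
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]

    # A line that shows up on many pages is treated as a running header or footer.
    counts = Counter(line for lines in page_lines for line in set(lines) if line)
    repeated = {line for line, n in counts.items() if n >= min_repeat_pages}

    cleaned = []
    for lines in page_lines:
        kept = [
            line
            for line in lines
            # Drop blanks, repeated headers/footers, and bare page numbers.
            if line and line not in repeated and not re.fullmatch(r"\d{1,4}", line)
        ]
        text = "\n".join(kept)
        for ligature, plain in LIGATURES.items():
            text = text.replace(ligature, plain)
        # Rejoin a word split by a hyphen at a line break: "vola-\ntility" -> "volatility".
        text = re.sub(r"([A-Za-z])-\n([a-z])", r"\1\2", text)
        cleaned.append(text)
    return cleaned
```

The hyphen rule only fires when the next line starts with a lowercase letter, so a genuine compound such as `Black-` followed by `Scholes` is left alone. That is a heuristic and will still misjoin some lowercase compounds.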
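+### 2. Poor extraction of mathematical formulas
+Finance textbooks contain many formulas with:
+- Greek letters such as `𝜎`, `𝜇`, `𝜌`
+- Superscripts and subscripts
+- Fractions
+- Integrals, sums, and square roots
+- Equation numbers such as `(21.23)`
+Ordinary PDF text extraction struggles to recover these structures. For example:
+```text
+d𝜎 = a𝜎 dt + b𝜎 dZ
+```
+may come out with symbols stuck together, in the wrong order, or merged with the surrounding body text.
+Solution:
+- Start with math-aware tuning on top of `pypdf`:
+  - Detect formula lines.
+  - Preserve line breaks for short formula lines, bracket lines, and square-root lines.
+  - Try to mark superscripts and subscripts based on font size and vertical offset.
+`pypdf` still turned out to be insufficient, so `PyMuPDF` was added.
+### 3. Too many false-positive formulas after first integrating PyMuPDF
+With `PyMuPDF`, calling:
+```python
+page.get_text("dict", sort=True)
+```
+returns block, line, span, bbox, and font information, which is better suited than `pypdf` for locating formula regions.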
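To make the block, line, and span nesting concrete, the sketch below walks a dictionary shaped like the output of `page.get_text("dict")`. The helper name `iter_lines` and the sample dictionary are hypothetical: the sample is written by hand so the snippet runs without PyMuPDF installed, and a real page dictionary carries more keys than shown here.

```python
def iter_lines(page_dict):
    """Yield (text, bbox, max_font_size) for each text line in a PyMuPDF-style page dict."""
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # type 0 is a text block; image blocks are skipped
            continue
        for line in block.get("lines", []):
            spans = line.get("spans", [])
            text = "".join(span.get("text", "") for span in spans).strip()
            if not text:
                continue
            size = max(span.get("size", 0.0) for span in spans)
            yield text, tuple(line["bbox"]), size


# Hand-written stand-in for page.get_text("dict", sort=True).
sample_page = {
    "blocks": [
        {
            "type": 0,
            "bbox": (72.0, 100.0, 300.0, 112.0),
            "lines": [
                {
                    "bbox": (72.0, 100.0, 300.0, 112.0),
                    "spans": [
                        {"text": "C = S N(d1) ", "size": 10.0, "font": "ExampleFont"},
                        {"text": "(1.1)", "size": 9.0, "font": "ExampleFont"},
                    ],
                }
            ],
        },
        {"type": 1, "bbox": (72.0, 120.0, 300.0, 240.0)},  # image block, ignored
    ]
}

lines = list(iter_lines(sample_page))
```

Working at line level gives each candidate formula its own bounding box and font size, which is the information a line-level detector needs.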
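+The first version of formula detection, however, produced too many false positives.
+For example:
+- Phone numbers on the copyright page.
+- `Black-Scholes-Merton` in ordinary body text.
+- A lone `𝜎` or `F=ma` inside an ordinary paragraph.
+- Numbers on figure axes.
+All of these could be misidentified as formulas.
+Solution:
+- Switch from block-level to line-level formula detection.
+- Stop treating ordinary italic fonts as math fonts.
+- Tighten the formula trigger conditions:
+  - A lone Greek letter does not count as a formula.
+  - Plain `-` and `/` are not strong math signals, which avoids flagging English hyphens as formulas.
+  - Focus on strong signals such as `=`, `∫`, `∑`, `√`, `≤`, `≥`, `∕`, and equation numbers.
+- Add `is_useful_formula_text()` to filter out fragments that are too short, too fragmented, or lack a core formula structure.
+- Merge formula continuation lines so that square roots, denominators, and brackets are not split into isolated formula chunks.
+The end result:
+```text
+body chunk
+formula chunk: content_type=formula
+formula position: formula_bbox
+formula number: formula_id
+```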
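A tightened line-level rule of the kind described above might look like the following. This is a rough sketch, not the project's detector: the trigger set is copied from `PDF_FORMULA_TRIGGER_SYMBOLS` in `tools/query_knowledge.py`, while the function name `looks_like_formula`, the equation-number pattern, and the prose-word threshold are assumptions made for illustration.

```python
import re

# Copied from PDF_FORMULA_TRIGGER_SYMBOLS; note that "-" and "/" are absent.
TRIGGER_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_∕<>")
# Assumed pattern for an equation number such as "(21.23)" at the end of a line.
EQUATION_NUMBER = re.compile(r"\(\d+\.\d+\)\s*$")
# Assumed proxy for prose: runs of four or more ASCII letters.
PROSE_WORD = re.compile(r"[A-Za-z]{4,}")


def looks_like_formula(line, min_chars=5, max_prose_words=3):
    """Heuristically decide whether a single text line is a display formula."""
    text = line.strip()
    if len(text) < min_chars:
        return False  # too short to be useful as a standalone formula chunk
    has_signal = any(ch in TRIGGER_SYMBOLS for ch in text) or bool(EQUATION_NUMBER.search(text))
    if not has_signal:
        return False  # Greek letters or hyphens alone are not enough
    # A line dominated by ordinary words is prose that happens to contain "=".
    return len(PROSE_WORD.findall(text)) <= max_prose_words


examples = [
    "d𝜎 = a𝜎 dt + b𝜎 dZ (21.23)",
    "Black-Scholes-Merton",
    "the volatility 𝜎 is constant",
    "F=ma",
]
flags = [looks_like_formula(example) for example in examples]
```

Only the first example is flagged. The other three fail because they lack a trigger symbol or are too short.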
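+### 4. Missing chapter and heading-aware chunking
+The original system used fixed-length chunking only:
+```python
+SentenceSplitter(chunk_size=1000, chunk_overlap=150)
+```
+The problems:
+- A chunk may span chapters.
+- A section's heading may be separated from its body.
+- Retrieval results do not say which chapter or section they come from.
+- Citations in answers are not clear enough.
+Solution:
+Add a chapter/heading-aware segmentation pass before `SentenceSplitter`:
+- Detect `CHAPTER ...`
+- Detect `APPENDIX ...`
+- Detect all-caps headings
+- Detect title-case section names
+- Filter out figure captions, axis labels, short formula lines, footnotes, and ordinary explanatory sentences
+and write the following metadata:
+```python
+chapter_title
+section_title
+section_path
+page_number
+content_type
+formula_id
+```
+Retrieval results can then return:
+```text
+source: The_volatility_Smile_Wiley.pdf
+page: 379
+section: WITH ZERO CORRELATION
+content_type: formula
+formula_id: formula-378-3
+```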
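A heading-aware pre-segmentation pass could be sketched as below. It is an illustration under assumptions rather than the project's implementation: the names `classify_heading` and `assign_sections`, the 80-character limit, and the sample lines are hypothetical, and title-case detection and figure-caption filtering are left out for brevity.

```python
import re

# Assumed pattern: "CHAPTER 3", "APPENDIX A", and similar short lines.
CHAPTER_RE = re.compile(r"^(CHAPTER|APPENDIX)\s+[0-9A-Z]+\b")


def classify_heading(line):
    """Return "chapter", "section", or None for a single text line."""
    text = line.strip()
    if not text or len(text) > 80:
        return None
    if CHAPTER_RE.match(text):
        return "chapter"
    letters = [ch for ch in text if ch.isalpha()]
    # All-caps lines with enough letters and no sentence punctuation or "=" read as section titles.
    if (
        len(letters) >= 4
        and text == text.upper()
        and not text.endswith((".", ",", ":"))
        and "=" not in text
    ):
        return "section"
    return None


def assign_sections(lines):
    """Attach chapter/section metadata to every non-heading line."""
    chapter, section = "", ""
    tagged = []
    for line in lines:
        kind = classify_heading(line)
        if kind == "chapter":
            chapter, section = line.strip(), ""
        elif kind == "section":
            section = line.strip()
        else:
            tagged.append(
                {
                    "text": line,
                    "chapter_title": chapter,
                    "section_title": section,
                    "section_path": " > ".join(part for part in (chapter, section) if part),
                }
            )
    return tagged


tagged = assign_sections(["CHAPTER 1", "WITH ZERO CORRELATION", "The smile is symmetric.", "d = 2"])
```

Each tagged line can then be grouped by `section_path` before the fixed-length splitter runs, so a chunk never mixes text from two sections.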
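+### 5. Overlong metadata triggers a LlamaIndex error
+After formula bboxes were added, the bbox of every line was initially stored in metadata, which made the metadata too long.
+The error looked like:
+```text
+Metadata length is longer than chunk size.
+Consider increasing the chunk size or decreasing metadata size.
+```
+The cause is that `SentenceSplitter` counts metadata length toward the chunk length.
+Solution:
+- Stop storing the bbox of every line.
+- Merge the bboxes into a single bounding rectangle:
+```text
+x0,y0,x1,y1
+```
+This keeps the formula position while preventing overlong metadata.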
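Merging per-line boxes into one enclosing rectangle is a min/max over the corners. The sketch below shows the idea; the helper names `merge_bboxes` and `bbox_to_metadata` and the one-decimal rounding are assumptions, not names taken from the project.

```python
def merge_bboxes(bboxes):
    """Return the smallest rectangle (x0, y0, x1, y1) that covers every input box."""
    if not bboxes:
        return None
    x0 = min(box[0] for box in bboxes)
    y0 = min(box[1] for box in bboxes)
    x1 = max(box[2] for box in bboxes)
    y1 = max(box[3] for box in bboxes)
    return (x0, y0, x1, y1)


def bbox_to_metadata(bbox, digits=1):
    """Serialize a box as a short "x0,y0,x1,y1" string to keep metadata compact."""
    return ",".join(f"{value:.{digits}f}" for value in bbox)


# Two lines of one formula: the merged box spans both.
merged = merge_bboxes([(72.0, 100.0, 300.0, 112.0), (90.0, 114.0, 280.5, 126.0)])
formula_bbox = bbox_to_metadata(merged)
```

A single short string per formula keeps the metadata size roughly constant no matter how many lines the formula spans.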
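+### 6. Hugging Face model loading keeps hitting the network
+The embedding model is already cached locally, but `sentence-transformers` still tries to reach Hugging Face for HEAD checks. In a network-restricted environment it retries repeatedly, and index building hangs.
+Solution:
+- Check whether a local snapshot exists.
+- If it does, pass the local snapshot path directly to the embedding model.
+- Set the offline environment variables:
+```python
+HF_HUB_OFFLINE=1
+TRANSFORMERS_OFFLINE=1
+```
+Index building can then rely on the local cache.
+### 7. Stale indexes are not refreshed automatically
+After the PDF extraction logic is upgraded, RAG does not actually improve if Chroma still holds text from the old version.
+Solution:
+- Add a `PDF_EXTRACTION_METHOD` version string.
+- The current version is:
+```python
+pymupdf_formula_blocks_v5
+```
+- On startup, check the `extraction_method` stored in the Chroma metadata.
+- If the version does not match, rebuild the index automatically.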
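The version check reduces to comparing the stored `extraction_method` values with the current constant. The sketch below is a pure-Python illustration: the function name `needs_rebuild` is hypothetical, and `stored_metadatas` stands for a sample of chunk metadata dictionaries from the PDF chunks, however they are read back from the vector store.

```python
CURRENT_METHOD = "pymupdf_formula_blocks_v5"  # value quoted in the notes


def needs_rebuild(stored_metadatas, current_method=CURRENT_METHOD):
    """Return True unless every stored chunk was produced by current_method."""
    if not stored_metadatas:
        return True  # empty collection: nothing to reuse
    methods = {(metadata or {}).get("extraction_method") for metadata in stored_metadatas}
    return methods != {current_method}


fresh = needs_rebuild([{"extraction_method": "pymupdf_formula_blocks_v5"}])
stale = needs_rebuild([{"extraction_method": "older_method"}])
untagged = needs_rebuild([{"file_name": "example.pdf"}])
```

Chunks without the key are treated as stale, so an index built before the version string existed is rebuilt as well. The sketch assumes only PDF chunks are passed in; non-PDF chunks that never carry the key would otherwise force a rebuild every time.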
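+## III. Final Design
+The final PDF RAG pipeline is:
+```text
+PyMuPDF extracts block / line / span / bbox / font
+  -> detect formula lines
+  -> merge formula continuation lines
+  -> emit standalone formula documents with content_type=formula
+  -> keep [FORMULA id=...] references in the body text
+  -> clean headers, footers, and noise
+  -> pre-segment by chapter/heading
+  -> second-pass chunking with SentenceSplitter
+  -> write to Chroma
+  -> return page / section / content_type / formula_id at retrieval time
+```
+Key benefits:
+- Formulas can be retrieved as standalone units.
+- The body text still keeps the formula context.
+- Chunks no longer depend entirely on fixed length.
+- Retrieval results state the source page, section, and content type.
+- The index version is controlled, so stale data does not pollute results.
+## IV. How to Explain This in an Interview
+A possible summary:
+> Our RAG initially just used `pypdf` to extract text page by page and then applied fixed-length chunking. That works for ordinary documents but not for finance textbooks, which are full of mathematical formulas, figures, and chapter structure. The main problems were scrambled formula order, lost superscripts and subscripts, headers and footers leaking into the text, and chunks spanning chapters.
+Then the solution:
+> I started with basic cleaning: deduplicating headers and footers, repairing broken words, and preserving line breaks on formula lines. `pypdf` turned out to be limited at locating formula regions, so I brought in `PyMuPDF` and used the block, line, span, bbox, and font information it returns to detect formula regions separately. Formulas are stored as standalone chunks with `content_type=formula`, while the body text keeps `[FORMULA id=...]`, so both formula retrieval and context retrieval are covered.
+Then the engineering trade-offs:
+> Formula detection cannot fire just because a Greek letter appears, or ordinary body text would be misclassified in bulk. So I tightened the rules to strong math signals such as equals signs, integrals, sums, square roots, equation numbers, and comparison operators, and filtered out fragments that are too short. The bboxes of all lines cannot go straight into metadata either, because LlamaIndex counts metadata toward chunk length, so I merged them into a single bounding rectangle.
+Finally the outcome:
+> After the optimization, the index changed from plain-text chunks to a mix of body chunks and formula chunks. Every retrieval result carries metadata such as page, section, content_type, and formula_id, which makes sources easier to locate when answering and suits questions like "what does this formula mean".
+## V. Possible Follow-ups
+PyMuPDF is integrated, but this is not yet full OCR/LaTeX formula recognition. Possible next steps:
+1. Crop the `formula_bbox` region as an image.
+2. Plug in a formula OCR model, such as LaTeX OCR.
+3. Convert the formula image to LaTeX.
+4. Store the following together in metadata:
+```python
+formula_text_raw
+formula_latex
+formula_bbox
+page_number
+section_path
+```
+5. Weight formula queries separately at retrieval time, or use hybrid search.
+6. Add a reranker to improve ranking for formula-related questions.
+## VI. One-Sentence Summary
+The core of this optimization is not simply swapping the PDF parser; it is upgrading PDF parsing from "extract plain text page by page" to "structurally parse body text, chapters, and formula regions", so that RAG chunks sit closer to the semantic boundaries a human reader would perceive.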

requirements.txt CHANGED Viewed

@@ -4,6 +4,7 @@ requests
 duckduckgo_search
 pandas
 pypdf
+PyMuPDF
 chromadb
 llama-index-core
 llama-index-embeddings-huggingface

tools/query_knowledge.py ADDED Viewed

	@@ -0,0 +1,1196 @@

+from smolagents.tools import Tool
+import asyncio
+from collections import Counter
+import hashlib
+import logging
+import os
+from pathlib import Path
+from typing import Iterable, List, Optional
+import re
+from dotenv import load_dotenv
+import chromadb
+from chromadb.errors import NotFoundError
+from pypdf import PdfReader
+from llama_index.core import StorageContext, VectorStoreIndex
+from llama_index.core.schema import Document, BaseNode
+from llama_index.core.node_parser import SentenceSplitter
+from llama_index.vector_stores.chroma import ChromaVectorStore
+BASE_DIR = Path(__file__).resolve().parent
+KNOWLEDGE_BASE_DIR = BASE_DIR / "knowledge_base"
+RAW_DIR = KNOWLEDGE_BASE_DIR / "raw"
+CHROMA_DB_DIR = KNOWLEDGE_BASE_DIR / "chroma_db"
+HF_CACHE_DIR = BASE_DIR / "hf_cache"
+COLLECTION_NAME = "options_knowledge"
+EMBED_MODEL_NAME = "BAAI/bge-small-en-v1.5"
+CHUNK_SIZE = 1000
+CHUNK_OVERLAP = 150
+PDF_REPEATED_LINE_MIN_PAGES = 3
+PDF_BOUNDARY_LINE_COUNT = 4
+PDF_EXTRACTION_METHOD = "pymupdf_formula_blocks_v5"
+PDF_LINE_Y_TOLERANCE = 3.0
+PDF_MIN_SECTION_CHARS = 240
+PDF_STRONG_MATH_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_σΣΔδθΘλΛμρπΠφΦτν𝜎𝜇𝜌𝜃𝜕")
+PDF_WEAK_MATH_SYMBOLS = set("+-−*/∕<>")
+PDF_MATH_SYMBOLS = PDF_STRONG_MATH_SYMBOLS | PDF_WEAK_MATH_SYMBOLS
+PDF_OPERATOR_MATH_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_+-−*/∕<>")
+PDF_FORMULA_TRIGGER_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_∕<>")
+logging.getLogger("pypdf").setLevel(logging.ERROR)
+def load_pymupdf():
+    try:
+        import fitz
+    except ImportError:
+        return None
+    return fitz
+REQUIRED_METADATA = [
+    "source_file",
+    "file_name",
+    "file_type",
+    "document_title",
+    "file_hash",
+    "chunk_id",
+    "chunk_index",
+]
+def cached_embed_model_dir() -> Path:
+    # Hugging Face hub cache layout: models--<org>--<name>/snapshots/<revision>/
+    return (
+        HF_CACHE_DIR
+        / "sentence_transformers"
+        / f"models--{EMBED_MODEL_NAME.replace('/', '--')}"
+    )
+def configure_model_cache() -> None:
+    HF_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    os.environ.setdefault("HF_HOME", str(HF_CACHE_DIR))
+    os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", str(
+        HF_CACHE_DIR / "sentence_transformers"))
+    os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
+    # Only force offline mode once the embedding model is already cached locally.
+    if cached_embed_model_dir().exists():
+        os.environ.setdefault("HF_HUB_OFFLINE", "1")
+        os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
+def resolve_embed_model_name() -> str:
+    snapshots_dir = cached_embed_model_dir() / "snapshots"
+    if snapshots_dir.exists():
+        # Snapshot directories are named by revision hash, so this picks the
+        # lexicographically last one, not necessarily the most recent download.
+        snapshots = sorted(path for path in snapshots_dir.iterdir() if path.is_dir())
+        if snapshots:
+            return str(snapshots[-1])
+    return EMBED_MODEL_NAME
+def file_sha256(path: Path) -> str:
+    digest = hashlib.sha256()
+    with path.open("rb") as file:
+        for block in iter(lambda: file.read(1024 * 1024), b""):
+            digest.update(block)
+    return digest.hexdigest()
+def load_md_file(path: Path) -> Document:
+    text = path.read_text(encoding="utf-8")
+    return Document(
+        text=text,
+        metadata={
+            "source_file": str(path.resolve()),
+            "file_name": path.name,
+            "file_type": "md",
+            "document_title": path.stem,
+            "file_hash": file_sha256(path),
+        },
+    )
+def append_visual_fragment(line_parts: List[str], text: str, baseline_y: float, item: dict) -> None:
+    """Append a text run; small runs offset from the line baseline are wrapped as ^{...} or _{...}."""
+    if not text:
+        return
+    stripped = text.strip()
+    if not stripped:
+        return
+    font_size = item["font_size"]
+    y_offset = item["y"] - baseline_y
+    is_small = font_size < item["line_font_size"] * 0.82
+    if is_small and y_offset > max(1.5, item["line_font_size"] * 0.18):
+        line_parts.append(f"^{{{stripped}}}")
+    elif is_small and y_offset < -max(1.5, item["line_font_size"] * 0.18):
+        line_parts.append(f"_{{{stripped}}}")
+    else:
+        line_parts.append(stripped)
+def join_visual_line(items: List[dict]) -> str:
+    if not items:
+        return ""
+    items = sorted(items, key=lambda value: value["x"])
+    baseline_y = sorted(item["y"] for item in items)[len(items) // 2]
+    line_font_size = max(item["font_size"] for item in items)
+    previous_right = None
+    line_parts: List[str] = []
+    for item in items:
+        item["line_font_size"] = line_font_size
+        if previous_right is not None:
+            gap = item["x"] - previous_right
+            if gap > max(2.5, line_font_size * 0.28):
+                line_parts.append(" ")
+        append_visual_fragment(line_parts, item["text"], baseline_y, item)
+        previous_right = max(previous_right or item["x"], item["x"] + item["width"])
+    return normalize_pdf_line("".join(line_parts))
+def extract_pdf_text_by_position(page) -> str:
+    fragments: List[dict] = []
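+    def visitor_text(text, cm, tm, font_dict, font_size):
+        if not text or not text.strip():
+            return
+        # Guard against a missing or zero font size before any arithmetic.
+        size = float(font_size or 1.0)
+        # tm is the text matrix; tm[4] and tm[5] are the x/y translation.
+        x = float(tm[4])
+        y = float(tm[5])
+        # Glyph widths are not passed to the visitor, so estimate the run width.
+        width = max(len(text.strip()) * size * 0.45, size)
+        fragments.append(
+            {
+                "text": text,
+                "x": x,
+                "y": y,
+                "width": width,
+                "font_size": size,
+            }
+        )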
+    try:
+        page.extract_text(visitor_text=visitor_text)
+    except Exception:
+        return ""
+    if not fragments:
+        return ""
+    lines: List[List[dict]] = []
+    for fragment in sorted(fragments, key=lambda value: (-value["y"], value["x"])):
+        for line in lines:
+            if abs(line[0]["y"] - fragment["y"]) <= PDF_LINE_Y_TOLERANCE:
+                line.append(fragment)
+                break
+        else:
+            lines.append([fragment])
+    return "\n".join(join_visual_line(line) for line in lines)
+def math_text_score(text: str) -> float:
+    """Score extracted text; higher means more text and more preserved math structure."""
+    if not text.strip():
+        return 0.0
+    lines = [line for line in text.splitlines() if line.strip()]
+    compact_length = len(re.sub(r"\s+", "", text))
+    math_symbol_count = sum(1 for char in text if char in PDF_MATH_SYMBOLS)
+    # Counts both superscript ("^{") and subscript ("_{") markers.
+    script_markers = text.count("^{") + text.count("_{")
+    formula_flags = [is_formula_like(line) for line in lines]
+    multiline_bonus = sum(formula_flags) * 8
+    # Extra credit for formula lines that sit next to another formula line.
+    equation_block_bonus = sum(
+        1
+        for index, flagged in enumerate(formula_flags)
+        if flagged
+        and (
+            (index > 0 and formula_flags[index - 1])
+            or (index + 1 < len(formula_flags) and formula_flags[index + 1])
+        )
+    ) * 12
+    return (
+        compact_length
+        + math_symbol_count * 12
+        + script_markers * 20
+        + multiline_bonus
+        + equation_block_bonus
+    )
+def extract_pdf_text(page) -> str:
+    """pypdf fallback: try positioned, layout, and plain extraction, and keep the highest-scoring text."""
+    positioned_text = extract_pdf_text_by_position(page)
+    try:
+        layout_text = page.extract_text(extraction_mode="layout") or ""
+    except Exception:
+        layout_text = ""
+    try:
+        plain_text = page.extract_text() or ""
+    except Exception:
+        plain_text = ""
+    candidates = [positioned_text, layout_text, plain_text]
+    candidates = [candidate for candidate in candidates if candidate.strip()]
+    if not candidates:
+        return ""
+    return max(candidates, key=math_text_score)
+def pymupdf_span_text(span: dict) -> str:
+    return normalize_pdf_line(span.get("text", ""))
+def pymupdf_line_text(line: dict) -> str:
+    return normalize_pdf_line("".join(pymupdf_span_text(span) for span in line.get("spans", [])))
+def pymupdf_block_text(block: dict) -> str:
+    lines = [
+        pymupdf_line_text(line)
+        for line in block.get("lines", [])
+    ]
+    return "\n".join(line for line in lines if line)
+def pymupdf_span_has_math_font(span: dict) -> bool:
+    font_name = span.get("font", "").lower()
+    return any(
+        marker in font_name
+        for marker in ("math", "symbol", "cmmi", "cmsy", "cmex", "stix")
+    )
+def is_formula_block_line(line: str) -> bool:
+    stripped = line.strip()
+    if not stripped:
+        return False
+    trigger_math_count = sum(1 for char in stripped if char in PDF_FORMULA_TRIGGER_SYMBOLS)
+    digit_count = sum(1 for char in stripped if char.isdigit())
+    alpha_count = sum(1 for char in stripped if char.isalpha())
+    alpha_words = [
+        word
+        for word in re.findall(r"[A-Za-z]+", stripped)
+        if word.lower() not in {"and", "or", "the", "where", "then", "with", "for"}
+    ]
+    compact_length = len(re.sub(r"\s+", "", stripped))
+    if compact_length < 3:
+        return False
+    if re.fullmatch(r"\(?\d+(\.\d+)?\)?", stripped):
+        return False
+    if re.search(r"\(\d+(\.\d+)+[a-z]?\)$", stripped) and compact_length <= 240:
+        return True
+    if "=" in stripped and compact_length <= 260 and len(alpha_words) <= 12:
+        return True
+    if any(char in stripped for char in "∂∫∑∏√∞≈≠≤≥±×÷") and compact_length <= 220 and len(alpha_words) <= 10:
+        return True
+    if trigger_math_count >= 2 and compact_length <= 120 and len(alpha_words) <= 6:
+        return True
+    if trigger_math_count >= 1 and digit_count >= 1 and alpha_count <= 18 and compact_length <= 100:
+        return True
+    return False
+def is_formula_block(block: dict) -> bool:
+    text = pymupdf_block_text(block)
+    if not text:
+        return False
+    lines = [line for line in text.splitlines() if line.strip()]
+    if any(is_formula_block_line(line) for line in lines):
+        return True
+    spans = [
+        span
+        for line in block.get("lines", [])
+        for span in line.get("spans", [])
+        if pymupdf_span_text(span)
+    ]
+    if not spans:
+        return False
+    math_font_count = sum(1 for span in spans if pymupdf_span_has_math_font(span))
+    strong_math_count = sum(1 for char in text if char in PDF_STRONG_MATH_SYMBOLS)
+    alpha_count = sum(1 for char in text if char.isalpha())
+    digit_count = sum(1 for char in text if char.isdigit())
+    compact_length = len(re.sub(r"\s+", "", text))
+    if math_font_count >= 2 and compact_length <= 220:
+        return True
+    if strong_math_count >= 3 and compact_length <= 260:
+        return True
+    if strong_math_count >= 1 and digit_count >= 1 and alpha_count <= 20 and compact_length <= 160:
+        return True
+    return False
+def bbox_string(item: dict) -> str:
+    # Serialize a PyMuPDF bbox (x0, y0, x1, y1) as "x0,y0,x1,y1" with two decimals.
+    bbox = item.get("bbox") or []
+    if len(bbox) != 4:
+        return ""
+    return ",".join(f"{float(value):.2f}" for value in bbox)
+def block_bbox_string(block: dict) -> str:
+    return bbox_string(block)
+def line_bbox_string(line: dict) -> str:
+    return bbox_string(line)
+def pymupdf_line_has_math_font(line: dict) -> bool:
+    return any(
+        pymupdf_span_has_math_font(span)
+        for span in line.get("spans", [])
+        if pymupdf_span_text(span)
+    )
+def should_extract_formula_line(line: dict) -> bool:
+    text = pymupdf_line_text(line)
+    if not text:
+        return False
+    if is_formula_block_line(text):
+        return True
+    compact_length = len(re.sub(r"\s+", "", text))
+    trigger_math_count = sum(1 for char in text if char in PDF_FORMULA_TRIGGER_SYMBOLS)
+    alpha_words = re.findall(r"[A-Za-z]+", text)
+    if (
+        pymupdf_line_has_math_font(line)
+        and trigger_math_count >= 1
+        and compact_length <= 180
+        and len(alpha_words) <= 6
+    ):
+        return True
+    return False
+def is_formula_continuation_line(text: str) -> bool:
+    stripped = text.strip()
+    if not stripped:
+        return False
+    compact = re.sub(r"\s+", "", stripped)
+    if len(compact) > 90:
+        return False
+    if compact in {"(", ")", "[", "]", "{", "}", "√"}:
+        return True
+    alpha_words = re.findall(r"[A-Za-z]+", stripped)
+    math_count = sum(1 for char in stripped if char in PDF_MATH_SYMBOLS)
+    digit_count = sum(1 for char in stripped if char.isdigit())
+    if len(alpha_words) <= 4 and (math_count >= 1 or digit_count >= 1):
+        return True
+    return False
+def append_formula_block(
+    formula_blocks: List[dict],
+    body_blocks: List[str],
+    page_number: int,
+    formula_index: int,
+    formula_lines: List[str],
+    formula_bboxes: List[str],
+) -> int:
+    formula_text = clean_formula_text("\n".join(formula_lines))
+    if not is_useful_formula_text(formula_text):
+        return formula_index
+    formula_id = f"formula-{page_number}-{formula_index}"
+    formula_bbox = merge_bbox_strings(formula_bboxes)
+    formula_blocks.append(
+        {
+            "id": formula_id,
+            "text": formula_text,
+            "bbox": formula_bbox,
+        }
+    )
+    body_blocks.append(f"[FORMULA id={formula_id}]\n{formula_text}\n[/FORMULA]")
+    return formula_index + 1
+def merge_bbox_strings(bbox_strings: List[str]) -> str:
+    boxes = []
+    for bbox_string in bbox_strings:
+        if not bbox_string:
+            continue
+        values = bbox_string.split(",")
+        if len(values) != 4:
+            continue
+        try:
+            boxes.append([float(value) for value in values])
+        except ValueError:
+            continue
+    if not boxes:
+        return ""
+    x0 = min(box[0] for box in boxes)
+    y0 = min(box[1] for box in boxes)
+    x1 = max(box[2] for box in boxes)
+    y1 = max(box[3] for box in boxes)
+    return f"{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f}"
+def is_useful_formula_text(text: str) -> bool:
+    stripped = text.strip()
+    if not stripped:
+        return False
+    compact_length = len(re.sub(r"\s+", "", stripped))
+    if compact_length < 6:
+        return False
+    lines = [line.strip() for line in stripped.splitlines() if line.strip()]
+    if re.search(r"\(\d+(\.\d+)+[a-z]?\)", stripped):
+        return True
+    if any(char in stripped for char in "∂∫∑∏∞≈≠≤≥±×÷"):
+        alpha_words = re.findall(r"[A-Za-z]+", stripped)
+        return len(alpha_words) <= 12 or "=" in stripped
+    for line in lines:
+        if "=" not in line:
+            continue
+        alpha_words = [
+            word
+            for word in re.findall(r"[A-Za-z]+", line)
+            if word.lower() not in {"and", "or", "the", "where", "then", "with", "for"}
+        ]
+        if len(alpha_words) <= 12 and len(line) <= 260:
+            return True
+    return False
+def extract_pymupdf_page(page) -> dict:
+    """Split a PyMuPDF page into body text and formula blocks; formulas are also inlined as [FORMULA] markers."""
+    page_dict = page.get_text("dict", sort=True)
+    body_blocks: List[str] = []
+    formula_blocks: List[dict] = []
+    formula_lines: List[str] = []
+    formula_bboxes: List[str] = []
+    formula_index = 0
+    page_number = page.number + 1
+    for block in page_dict.get("blocks", []):
+        if block.get("type") != 0:
+            continue
+        normal_lines: List[str] = []
+        for line in block.get("lines", []):
+            line_text = pymupdf_line_text(line)
+            if not line_text:
+                continue
+            if should_extract_formula_line(line) or (
+                formula_lines and is_formula_continuation_line(line_text)
+            ):
+                if normal_lines:
+                    body_blocks.append("\n".join(normal_lines))
+                    normal_lines = []
+                formula_lines.append(line_text)
+                formula_bboxes.append(line_bbox_string(line))
+            else:
+                if formula_lines:
+                    formula_index = append_formula_block(
+                        formula_blocks=formula_blocks,
+                        body_blocks=body_blocks,
+                        page_number=page_number,
+                        formula_index=formula_index,
+                        formula_lines=formula_lines,
+                        formula_bboxes=formula_bboxes,
+                    )
+                    formula_lines = []
+                    formula_bboxes = []
+                normal_lines.append(line_text)
+        if normal_lines:
+            body_blocks.append("\n".join(normal_lines))
+    if formula_lines:
+        append_formula_block(
+            formula_blocks=formula_blocks,
+            body_blocks=body_blocks,
+            page_number=page_number,
+            formula_index=formula_index,
+            formula_lines=formula_lines,
+            formula_bboxes=formula_bboxes,
+        )
+    return {
+        "text": "\n".join(body_blocks),
+        "formula_blocks": formula_blocks,
+        "backend": "pymupdf",
+    }
+def extract_pdf_pages_with_pymupdf(path: Path) -> Optional[List[dict]]:
+    fitz = load_pymupdf()
+    if fitz is None:
+        return None
+    try:
+        document = fitz.open(str(path))
+    except Exception:
+        return None
+    try:
+        return [extract_pymupdf_page(page) for page in document]
+    finally:
+        document.close()
+def clean_formula_text(text: str) -> str:
+    lines = page_lines(text)
+    if not lines:
+        return ""
+    text = "\n".join(lines)
+    text = re.sub(r"[ \t]+", " ", text)
+    text = re.sub(r"\n{3,}", "\n\n", text)
+    return text.strip()
+def normalize_pdf_line(line: str) -> str:
+    line = line.replace("\x00", " ")
+    line = line.replace("\ufb00", "ff")
+    line = line.replace("\ufb01", "fi")
+    line = line.replace("\ufb02", "fl")
+    line = line.replace("\ufb03", "ffi")
+    line = line.replace("\ufb04", "ffl")
+    line = re.sub(r"[ \t]+", " ", line)
+    return line.strip()
+def is_noise_line(line: str) -> bool:
+    if not line:
+        return True
+    if re.fullmatch(r"\d+", line):
+        return True
+    if re.fullmatch(r"page\s+\d+(\s+of\s+\d+)?", line, flags=re.IGNORECASE):
+        return True
+    if re.fullmatch(r"[-_=\s]{3,}", line):
+        return True
+    return False
+def is_formula_like(line: str) -> bool:
+    stripped = line.strip()
+    if not stripped:
+        return False
+    strong_math_count = sum(1 for char in stripped if char in PDF_STRONG_MATH_SYMBOLS)
+    weak_math_count = sum(1 for char in stripped if char in PDF_WEAK_MATH_SYMBOLS)
+    alpha_count = sum(1 for char in stripped if char.isalpha())
+    digit_count = sum(1 for char in stripped if char.isdigit())
+    compact = stripped.replace(" ", "")
+    if "={" in compact or "^{" in compact or "_{" in compact:
+        return True
+    if compact in {"(", ")", "[", "]", "{", "}"}:
+        return True
+    if len(compact) <= 40 and any(char in compact for char in PDF_MATH_SYMBOLS):
+        return True
+    if strong_math_count >= 2 and len(stripped) <= 180:
+        return True
+    if strong_math_count >= 1 and weak_math_count >= 1 and len(stripped) <= 180:
+        return True
+    if "=" in stripped and (alpha_count + digit_count) >= 2 and len(stripped) <= 220:
+        return True
+    if re.search(r"\b(d|D|exp|ln|sqrt|max|min|var|cov)\s*[\(\[]", stripped):
+        return True
+    if alpha_count <= 4 and (strong_math_count + weak_math_count) >= 1 and digit_count >= 1:
+        return True
+    return False
+def normalized_line_key(line: str) -> str:
+    return re.sub(r"\d+", "#", line.lower()).strip()
+def page_lines(text: str) -> List[str]:
+    lines = []
+    for line in text.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
+        normalized = normalize_pdf_line(line)
+        if not is_noise_line(normalized):
+            lines.append(normalized)
+    return lines
+def find_repeated_boundary_lines(raw_pages: List[str]) -> set[str]:
+    """Find header/footer candidates: lines near page tops or bottoms that recur across pages (digits masked)."""
+    counter: Counter[str] = Counter()
+    for raw_text in raw_pages:
+        lines = page_lines(raw_text)
+        boundary_lines = lines[:PDF_BOUNDARY_LINE_COUNT] + lines[-PDF_BOUNDARY_LINE_COUNT:]
+        counter.update(
+            normalized_line_key(line)
+            for line in boundary_lines
+            if 3 <= len(line) <= 140
+        )
+    min_count = min(
+        PDF_REPEATED_LINE_MIN_PAGES,
+        max(2, len(raw_pages) // 3),
+    )
+    return {line for line, count in counter.items() if count >= min_count}
+def clean_pdf_text(text: str, repeated_boundary_lines: set[str]) -> str:
+    lines = page_lines(text)
+    cleaned_lines = []
+    for index, line in enumerate(lines):
+        is_boundary = (
+            index < PDF_BOUNDARY_LINE_COUNT
+            or index >= len(lines) - PDF_BOUNDARY_LINE_COUNT
+        )
+        if is_boundary and normalized_line_key(line) in repeated_boundary_lines:
+            continue
+        cleaned_lines.append(line)
+    merged_lines = []
+    for line in cleaned_lines:
+        if merged_lines and merged_lines[-1].endswith("-") and line[:1].islower():
+            merged_lines[-1] = merged_lines[-1][:-1] + line
+        else:
+            merged_lines.append(line)
+    text = "\n".join(merged_lines)
+    text = preserve_math_line_breaks(text)
+    text = re.sub(r"[ \t]+", " ", text)
+    text = re.sub(r"\n{3,}", "\n\n", text)
+    return text.strip()
+def preserve_math_line_breaks(text: str) -> str:
+    """Join wrapped prose lines into paragraphs while keeping formula-like lines on their own lines."""
+    lines = text.split("\n")
+    if not lines:
+        return ""
+    output = [lines[0]]
+    in_formula_block = is_formula_like(lines[0])
+    for line in lines[1:]:
+        previous = output[-1]
+        line_is_formula = is_formula_like(line)
+        previous_is_formula = is_formula_like(previous)
+        if previous_is_formula or line_is_formula or in_formula_block:
+            output.append(line)
+            in_formula_block = line_is_formula or (
+                in_formula_block
+                and not line.endswith((".", ";", ":", "?", "!"))
+            )
+        elif previous.endswith((".", ":", ";", "?", "!", ")")):
+            output.append(line)
+            in_formula_block = False
+        else:
+            output[-1] = f"{previous} {line}"
+            in_formula_block = False
+    return "\n".join(output)
+def is_chapter_heading(line: str) -> bool:
+    return bool(re.fullmatch(
+        r"(chapter|appendix)\s+([0-9]+|[ivxlcdm]+|[a-z])",
+        line.strip(),
+        flags=re.IGNORECASE,
+    ))
+def titlecase_word_ratio(words: List[str]) -> float:
+    candidate_words = [
+        word.strip("()[]{}:;,.")
+        for word in words
+        if any(char.isalpha() for char in word)
+    ]
+    if not candidate_words:
+        return 0.0
+    titlecase_words = [
+        word
+        for word in candidate_words
+        if word[:1].isupper()
+        or word.lower() in {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to", "with"}
+    ]
+    return len(titlecase_words) / len(candidate_words)
+def uppercase_letter_ratio(text: str) -> float:
+    letters = [char for char in text if char.isalpha()]
+    if not letters:
+        return 0.0
+    return sum(1 for char in letters if char.isupper()) / len(letters)
+def is_section_heading(line: str) -> bool:
+    stripped = line.strip()
+    if not 4 <= len(stripped) <= 150:
+        return False
+    letters = [char for char in stripped if char.isalpha()]
+    digit_count = sum(1 for char in stripped if char.isdigit())
+    alpha_words = [
+        word.strip("()[]{}:;,.")
+        for word in stripped.split()
+        if any(char.isalpha() for char in word)
+    ]
+    if len(letters) < 6 or len(alpha_words) < 2:
+        return False
+    if digit_count > max(4, len(letters)):
+        return False
+    if "%" in stripped and digit_count >= len(letters) / 2:
+        return False
+    numbered_heading = bool(re.match(r"^\d+(\.\d+)+\s+", stripped))
+    if stripped[:1].isdigit() and not numbered_heading:
+        return False
+    if re.match(
+        r"^(in|from|where|thus|then|now|let|because|while|figure|table|for)\b",
+        stripped,
+        flags=re.IGNORECASE,
+    ):
+        return False
+    if is_formula_like(stripped):
+        return False
+    if stripped.endswith((".", ",", ";")):
+        return False
+    if numbered_heading:
+        return True
+    words = stripped.split()
+    if len(words) > 16:
+        return False
+    if uppercase_letter_ratio(stripped) >= 0.72 and len(words) >= 2:
+        return True
+    if len(words) >= 4 and titlecase_word_ratio(words) >= 0.68:
+        return True
+    return False
+def make_section_path(chapter_title: str, section_title: str) -> str:
+    if chapter_title and section_title and section_title != chapter_title:
+        return f"{chapter_title} > {section_title}"
+    return section_title or chapter_title
+def split_pdf_page_into_sections(
+    path: Path,
+    page_index: int,
+    text: str,
+    file_hash: str,
+    section_state: dict,
+    extraction_backend: str,
+    formula_count: int,
+) -> List[Document]:
+    """Split one page of cleaned text into section Documents; section_state carries headings across pages."""
+    documents = []
+    lines = text.splitlines()
+    pending_lines: List[str] = []
+    pending_metadata = {
+        "chapter_title": section_state.get("chapter_title", ""),
+        "section_title": section_state.get("section_title", ""),
+    }
+    def flush_pending() -> None:
+        nonlocal pending_lines, pending_metadata
+        section_text = "\n".join(line for line in pending_lines if line.strip()).strip()
+        if not section_text:
+            pending_lines = []
+            return
+        chapter_title = pending_metadata.get("chapter_title", "")
+        section_title = pending_metadata.get("section_title", "")
+        documents.append(
+            Document(
+                text=section_text,
+                metadata={
+                    "source_file": str(path.resolve()),
+                    "file_name": path.name,
+                    "file_type": "pdf",
+                    "document_title": path.stem,
+                    "file_hash": file_hash,
+                    "page_number": page_index,
+                    "extraction_method": PDF_EXTRACTION_METHOD,
+                    "extraction_backend": extraction_backend,
+                    "char_count": len(section_text),
+                    "formula_count": formula_count,
+                    "content_type": "text",
+                    "chapter_title": chapter_title,
+                    "section_title": section_title,
+                    "section_path": make_section_path(chapter_title, section_title),
+                },
+            )
+        )
+        pending_lines = []
+    for line in lines:
+        stripped = line.strip()
+        if not stripped:
+            continue
+        if is_chapter_heading(stripped):
+            if len("\n".join(pending_lines)) >= PDF_MIN_SECTION_CHARS:
+                flush_pending()
+            section_state["pending_chapter_label"] = stripped.title()
+            section_state["chapter_title"] = stripped.title()
+            section_state["section_title"] = stripped.title()
+            pending_metadata = {
+                "chapter_title": section_state["chapter_title"],
+                "section_title": section_state["section_title"],
+            }
+            pending_lines.append(stripped)
+            continue
+        if section_state.get("pending_chapter_label") and is_section_heading(stripped):
+            if pending_lines == [section_state["pending_chapter_label"]]:
+                pending_lines[0] = f"{section_state['pending_chapter_label']}: {stripped}"
+            else:
+                pending_lines.append(stripped)
+            section_state["chapter_title"] = pending_lines[-1]
+            section_state["section_title"] = pending_lines[-1]
+            section_state["pending_chapter_label"] = ""
+            pending_metadata = {
+                "chapter_title": section_state["chapter_title"],
+                "section_title": section_state["section_title"],
+            }
+            continue
+        if is_section_heading(stripped):
+            if len("\n".join(pending_lines)) >= PDF_MIN_SECTION_CHARS:
+                flush_pending()
+            section_state["section_title"] = stripped
+            section_state["pending_chapter_label"] = ""
+            pending_metadata = {
+                "chapter_title": section_state.get("chapter_title", ""),
+                "section_title": section_state["section_title"],
+            }
+        pending_lines.append(stripped)
+    flush_pending()
+    return documents
+def make_formula_documents(
+    path: Path,
+    page_index: int,
+    formula_blocks: List[dict],
+    file_hash: str,
+    extraction_backend: str,
+) -> List[Document]:
+    documents = []
+    for formula_index, formula in enumerate(formula_blocks):
+        formula_text = formula.get("text", "").strip()
+        if not formula_text:
+            continue
+        documents.append(
+            Document(
+                text=f"[FORMULA]\n{formula_text}\n[/FORMULA]",
+                metadata={
+                    "source_file": str(path.resolve()),
+                    "file_name": path.name,
+                    "file_type": "pdf",
+                    "document_title": path.stem,
+                    "file_hash": file_hash,
+                    "page_number": page_index,
+                    "extraction_method": PDF_EXTRACTION_METHOD,
+                    "extraction_backend": extraction_backend,
+                    "char_count": len(formula_text),
+                    "content_type": "formula",
+                    "formula_id": formula.get("id", f"formula-{page_index}-{formula_index}"),
+                    "formula_index": formula_index,
+                    "formula_bbox": formula.get("bbox", ""),
+                    "formula_count": 1,
+                    "chapter_title": "",
+                    "section_title": "",
+                    "section_path": "",
+                },
+            )
+        )
+    return documents
+def load_pdf_file(path: Path) -> List[Document]:
+    documents = []
+    pymupdf_pages = extract_pdf_pages_with_pymupdf(path)
+    if pymupdf_pages:
+        page_payloads = pymupdf_pages
+    else:
+        # Fall back to pypdf only when PyMuPDF is unavailable or returned no pages,
+        # so a pypdf parsing error cannot block a file PyMuPDF can read.
+        reader = PdfReader(str(path))
+        page_payloads = [
+            {
+                "text": extract_pdf_text(page),
+                "formula_blocks": [],
+                "backend": "pypdf",
+            }
+            for page in reader.pages
+        ]
+    raw_pages = [payload["text"] for payload in page_payloads]
+    repeated_boundary_lines = find_repeated_boundary_lines(raw_pages)
+    file_hash = file_sha256(path)
+    section_state: dict = {
+        "chapter_title": "",
+        "section_title": "",
+        "pending_chapter_label": "",
+    }
+    for page_index, payload in enumerate(page_payloads, start=1):
+        raw_text = payload["text"]
+        text = clean_pdf_text(raw_text, repeated_boundary_lines)
+        formula_blocks = payload.get("formula_blocks", [])
+        extraction_backend = payload.get("backend", "pypdf")
+        if not text.strip():
+            documents.extend(
+                make_formula_documents(
+                    path=path,
+                    page_index=page_index,
+                    formula_blocks=formula_blocks,
+                    file_hash=file_hash,
+                    extraction_backend=extraction_backend,
+                )
+            )
+            continue
+        documents.extend(
+            split_pdf_page_into_sections(
+                path=path,
+                page_index=page_index,
+                text=text,
+                file_hash=file_hash,
+                section_state=section_state,
+                extraction_backend=extraction_backend,
+                formula_count=len(formula_blocks),
+            )
+        )
+        documents.extend(
+            make_formula_documents(
+                path=path,
+                page_index=page_index,
+                formula_blocks=formula_blocks,
+                file_hash=file_hash,
+                extraction_backend=extraction_backend,
+            )
+        )
+    return documents
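+def load_txt_file(path: Path) -> List[Document]:
+    # TODO: implement plain-text loading. Until then this returns no documents;
+    # note that iter_source_files() does not yield ".txt" files yet either.
+    return []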
+def iter_source_files(raw_dir: Path) -> Iterable[Path]:
+    supported_suffixes = {".md", ".markdown", ".pdf"}
+    for path in sorted(raw_dir.rglob("*")):
+        if path.is_file() and path.suffix.lower() in supported_suffixes:
+            yield path
+def load_docs(raw_dir: Path = RAW_DIR) -> List[Document]:
+    documents: List[Document] = []
+    for path in iter_source_files(raw_dir):
+        suffix = path.suffix.lower()
+        if suffix in {".md", ".markdown"}:
+            documents.append(load_md_file(path))
+        elif suffix == ".pdf":
+            documents.extend(load_pdf_file(path))
+        elif suffix == ".txt":
+            documents.extend(load_txt_file(path))
+    if not documents:
+        raise ValueError(f"No supported documents found under {raw_dir}")
+    return documents
+def add_chunk_metadata(nodes: List[BaseNode]) -> List[BaseNode]:
+    """Assign each node a per-file chunk_index and a deterministic chunk_id, also used as the node ID."""
+    counters: dict[str, int] = {}
+    for node in nodes:
+        source_file = node.metadata["source_file"]
+        chunk_index = counters.get(source_file, 0)
+        counters[source_file] = chunk_index + 1
+        file_hash = node.metadata["file_hash"][:12]
+        page_number = node.metadata.get("page_number", "na")
+        chunk_id = f"{Path(source_file).stem}-{file_hash}-p{page_number}-c{chunk_index}"
+        node.metadata["chunk_id"] = chunk_id
+        node.metadata["chunk_index"] = chunk_index
+        node.id_ = chunk_id
+    return nodes
+def validate_nodes(nodes: List[BaseNode]) -> None:
+    if not nodes:
+        raise ValueError("No chunks were created from the source documents.")
+    for node in nodes:
+        missing = [key for key in REQUIRED_METADATA if key not in node.metadata]
+        if missing:
+            raise ValueError(
+                f"Node {node.node_id} is missing metadata fields: {missing}")
+        if node.metadata["file_type"] == "pdf" and "page_number" not in node.metadata:
+            raise ValueError(
+                f"PDF node {node.node_id} is missing page_number metadata.")
+def build_nodes(raw_dir: Path = RAW_DIR) -> List[BaseNode]:
+    documents = load_docs(raw_dir)
+    splitter = SentenceSplitter(
+        chunk_size=CHUNK_SIZE,
+        chunk_overlap=CHUNK_OVERLAP,
+    )
+    nodes = splitter.get_nodes_from_documents(documents)
+    add_chunk_metadata(nodes)
+    validate_nodes(nodes)
+    return nodes
+def collection_needs_pdf_rebuild(chroma_collection) -> bool:
+    if chroma_collection.count() == 0:
+        return True
+    try:
+        sample = chroma_collection.peek(limit=min(chroma_collection.count(), 20))
+    except Exception:
+        return False
+    for metadata in sample.get("metadatas") or []:
+        if metadata.get("file_type") == "pdf":
+            return metadata.get("extraction_method") != PDF_EXTRACTION_METHOD
+    return False
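The staleness rule in `collection_needs_pdf_rebuild` reduces to three cases: an empty collection is rebuilt, the first PDF entry in the sample decides the outcome, and a sample with no PDF entries is kept. The block below is a rough sketch of that rule over a plain list of metadata dictionaries. `needs_pdf_rebuild` is a hypothetical helper, and `"pymupdf-v1"` is an assumed placeholder, since the real value of `PDF_EXTRACTION_METHOD` is not shown in this part of the diff. The `None` guard is an addition that the original loop does not have.

```python
from typing import Optional

# Assumed placeholder; the actual PDF_EXTRACTION_METHOD value is not visible here.
CURRENT_METHOD = "pymupdf-v1"


def needs_pdf_rebuild(
    metadatas: list[Optional[dict]],
    current_method: str = CURRENT_METHOD,
) -> bool:
    """Return True when the sampled metadata suggests the PDF chunks are stale."""
    if not metadatas:
        # Nothing indexed yet, so the index has to be built.
        return True
    for metadata in metadatas:
        # Skip entries stored without metadata (guard added in this sketch).
        if metadata and metadata.get("file_type") == "pdf":
            # Only the first PDF entry in the sample is inspected.
            return metadata.get("extraction_method") != current_method
    # No PDF entries in the sample: keep the existing collection.
    return False
```

Since the original function samples at most 20 entries, a collection whose sampled entries are all non-PDF is treated as current even if stale PDF chunks exist further in.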
+async def build_index(raw_dir: Path = RAW_DIR, rebuild: bool = False) -> VectorStoreIndex:
+    configure_model_cache()
+    from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+    load_dotenv()
+    CHROMA_DB_DIR.mkdir(parents=True, exist_ok=True)
+    db = chromadb.PersistentClient(path=str(CHROMA_DB_DIR))
+    if rebuild:
+        try:
+            db.delete_collection(COLLECTION_NAME)
+        except (NotFoundError, ValueError):
+            pass
+    chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
+    if not rebuild and collection_needs_pdf_rebuild(chroma_collection):
+        db.delete_collection(COLLECTION_NAME)
+        chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
+    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
+    storage_context = StorageContext.from_defaults(vector_store=vector_store)
+    embed_model = HuggingFaceEmbedding(
+        model_name=resolve_embed_model_name(),
+        cache_folder=str(HF_CACHE_DIR / "sentence_transformers"),
+    )
+    if rebuild or chroma_collection.count() == 0:
+        nodes = build_nodes(raw_dir)
+        index = VectorStoreIndex(
+            nodes,
+            storage_context=storage_context,
+            embed_model=embed_model,
+            show_progress=True,
+        )
+        print(
+            f"Indexed {len(nodes)} chunks into collection '{COLLECTION_NAME}'")
+        return index
+    print(
+        f"Loaded existing collection '{COLLECTION_NAME}' with {chroma_collection.count()} chunks.")
+    return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
+class QueryKnowledgeTool(Tool):
+    name = "query_knowledge"
+    description = "Performs a search of related information based on your query"
+    inputs = {'query': {'type': 'string',
+                        'description': 'The search query to perform.'}}
+    output_type = "string"
+    @staticmethod
+    def format_results(results):
+        output = []
+        for result in results:
+            metadata = result.node.metadata
+            source = metadata.get("file_name", "unknown")
+            page = metadata.get("page_number", "n/a")
+            section = metadata.get("section_path") or metadata.get("section_title") or "n/a"
+            content_type = metadata.get("content_type", "text")
+            formula_id = metadata.get("formula_id", "")
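+            # NodeWithScore.score is Optional; fall back to 0.0 so the format spec below cannot fail.
+            score = result.score if result.score is not None else 0.0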
+            text = result.node.get_content()
+            output.append(
+                f"source：{source}\n"
+                f"page：{page}\n"
+                f"section：{section}\n"
+                f"content_type：{content_type}\n"
+                f"formula_id：{formula_id or 'n/a'}\n"
+                f"score：{score:.4f}\n"
+                f"content：{text}"
+            )
+        return "\n\n---\n\n".join(output)
+    def __init__(self, max_results=10, top_k=5, **kwargs):
+        super().__init__()
+        self.max_results = max_results
+        index = asyncio.run(build_index(rebuild=False))
+        self.retriever = index.as_retriever(similarity_top_k=top_k)
+    def forward(self, query: str) -> str:
+        results = self.retriever.retrieve(query)
+        return QueryKnowledgeTool.format_results(results)
+if __name__ == "__main__":
+    query_tool = QueryKnowledgeTool()
+    res: str = query_tool.forward("What is option?")
+    print(res)

tools/todo.md ADDED

	@@ -0,0 +1,5 @@

+1. Add a reranker
+2. Change the embedding model
+3. The chunking strategy is crude; split by chapter, heading, and similar structure instead
+4. Improve PDF extraction
+5. Finish load_txt
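
Item 3 proposes splitting on document structure rather than fixed-size windows. The block below is a minimal sketch of that idea for Markdown input, using only the standard library. `split_by_headings` is a hypothetical helper that is not part of this commit, and the `section_path` key is chosen to match the metadata field that `format_results` already reads.

```python
import re

# ATX headings only ("# Title" through "###### Title").
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown_text: str) -> list[dict]:
    """Split Markdown into one chunk per heading section, tracking the heading path."""
    sections: list[dict] = []
    path: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current heading path, if any.
        text = "\n".join(buffer).strip()
        if text:
            sections.append({"section_path": " > ".join(path) or "n/a", "text": text})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then descend.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            buffer.append(line)
    flush()
    return sections


doc = (
    "# Options\n"
    "Intro text.\n"
    "## Calls\n"
    "A call gives the right to buy.\n"
    "## Puts\n"
    "A put gives the right to sell.\n"
)
sections = split_by_headings(doc)
for section in sections:
    print(section["section_path"], "|", section["text"])
```

This sketch does not skip fenced code blocks, so a `#` comment inside a fence would be read as a heading. Long sections would still need a second pass through a size-based splitter such as the `SentenceSplitter` used in `build_nodes`.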

uv.lock CHANGED

@@ -660,6 +660,7 @@ dependencies = [
     { name = "llama-index-core" },
     { name = "llama-index-embeddings-huggingface" },
     { name = "llama-index-vector-stores-chroma" },
+    { name = "pymupdf" },
     { name = "pypdf" },
     { name = "tokenizers" },
     { name = "transformers" },
@@ -673,6 +674,7 @@ requires-dist = [
     { name = "llama-index-core", specifier = ">=0.14.0" },
     { name = "llama-index-embeddings-huggingface", specifier = ">=0.6.0" },
     { name = "llama-index-vector-stores-chroma", specifier = ">=0.5.0" },
+    { name = "pymupdf", specifier = ">=1.27.2.3" },
     { name = "pypdf", specifier = ">=6.0.0" },
     { name = "tokenizers", specifier = ">=0.22.0,<=0.23.0" },
     { name = "transformers", specifier = "<5" },
@@ -2570,6 +2572,22 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/f4/7e/a72dd26f3b0f4f2bf1dd8923c85f7ceb43172af56d63c7383eb62b332364/pygments-2.20.0-py3-none-any.whl", hash = "sha256:81a9e26dd42fd28a23a2d169d86d7ac03b46e2f8b59ed4698fb4785f946d0176", size = 1231151, upload-time = "2026-03-29T13:29:30.038Z" },
 ]
+[[package]]
+name = "pymupdf"
+version = "1.27.2.3"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/22/32/708bedc9dde7b328d45abbc076091769d44f2f24ad151ad92d56a6ec142b/pymupdf-1.27.2.3.tar.gz", hash = "sha256:7a92faa25129e8bbec5e50eeb9214f187665428c31b05c4ef6e36c58c0b1c6d2", size = 85759618, upload-time = "2026-04-24T14:13:14.42Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/dc/09/ddbdfa7ee91fbabd6f63d7d744884cbdfe3e7ff9b8604749fb38bddf5c5d/pymupdf-1.27.2.3-cp310-abi3-macosx_10_9_x86_64.whl", hash = "sha256:fc1bc3cae6e9e150b0dbb0a9221bdfd411d65f0db2fe359eaa22467d7cc2a05f", size = 24002636, upload-time = "2026-04-24T14:09:17.459Z" },
+    { url = "https://files.pythonhosted.org/packages/01/89/3f8edd6c4f50ca370e2a2f2a3011face36f3760728ffe76dffec91c0fca0/pymupdf-1.27.2.3-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:660d93cb6da5bbddf11d3982ae27745dd3a9902d9f24cdb69adab83962294b5a", size = 23278238, upload-time = "2026-04-24T14:09:32.882Z" },
+    { url = "https://files.pythonhosted.org/packages/c3/26/b7e5a70eb83bd189f8b5df87ec442746b992f2f632662839b288170d357d/pymupdf-1.27.2.3-cp310-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:1dd460a3ae4597a755f00a3bd9771f5ebf1531dc111f6a36bf05dd00a6b84425", size = 24333923, upload-time = "2026-04-24T14:09:47.341Z" },
+    { url = "https://files.pythonhosted.org/packages/e4/a0/aa1ee2240f29481a04a827c313333b4ecd8a14d6ac3e15d3f41a30574781/pymupdf-1.27.2.3-cp310-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:857842b4888827bd6155a1131341b2822a7ebe9a8c15a975fd7d490d7a64a30c", size = 24963198, upload-time = "2026-04-24T14:10:07.408Z" },
+    { url = "https://files.pythonhosted.org/packages/69/49/4f742451f980840829fc00ba158bebb25d389c846d8f4f8c65936ee55de8/pymupdf-1.27.2.3-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:580983849c64a08d08344ca3d1580e87c01f046a8392421797bc850efd72a5b6", size = 25184609, upload-time = "2026-04-24T14:10:22.911Z" },
+    { url = "https://files.pythonhosted.org/packages/f6/3f/3853d6608f394faf6eec2bd4e8ea9f6a00beea329b071abdb29f4164cc3d/pymupdf-1.27.2.3-cp310-abi3-win32.whl", hash = "sha256:a5c1088a87189891a4946ab314a14b7934ac4c5b6077f7e74ebee956f8906d0e", size = 18019286, upload-time = "2026-04-24T14:10:34.239Z" },
+    { url = "https://files.pythonhosted.org/packages/44/47/5fb10fe73f96b31253a41647c362ea9e0380920bddf16028414a051247fc/pymupdf-1.27.2.3-cp310-abi3-win_amd64.whl", hash = "sha256:d20f68ef15195e073071dbc4ae7455257c7889af7584e39df490c0a92728526e", size = 19249102, upload-time = "2026-04-24T14:10:46.72Z" },
+    { url = "https://files.pythonhosted.org/packages/53/a4/b9e91aac82293f9c954654c85581ee8212b5b05efadc534b581141241e6f/pymupdf-1.27.2.3-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:77691604c5d1d0233827139bbcdea61fd57879c84712b8e49b1f45520f7ab9c2", size = 25000393, upload-time = "2026-04-24T14:11:01.669Z" },
+]
 [[package]]
 name = "pypdf"
 version = "6.12.0"