Commit 4a8fc49 by mathidot · 1 parent: 884eda5

Strengthen PDF extraction and add a RAG evaluation module

eval/README.md ADDED
@@ -0,0 +1,182 @@
+ # RAG Evaluation Module
+
+ This folder contains a lightweight retrieval-evaluation harness for the project.
+
+ ## Supported Datasets
+
+ 1. `beir/scifact`
+ 2. `beir/fiqa`
+ 3. `open-ragbench`
+ 4. `t2-ragbench`
+ 5. `local-options`
+
+ Each run builds a Chroma index under `eval/indexes/` and writes reports under `eval/reports/`. An existing index is reused unless `--rebuild` is passed.
+
+ ## Smoke Tests
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset beir/scifact --max-corpus-docs 200 --max-queries 10 --rebuild
+ uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset beir/fiqa --max-corpus-docs 500 --max-queries 10 --rebuild
+ uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset open-ragbench --max-corpus-docs 50 --max-queries 10 --rebuild
+ uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset t2-ragbench --max-corpus-docs 50 --max-queries 10 --rebuild
+ uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset local-options --max-queries 3 --rebuild
+ ```
+
+ ## Run The Whole Suite
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite --rebuild
+ ```
+
+ By default, the suite runs:
+
+ - `beir/scifact`
+ - `beir/fiqa`
+ - `open-ragbench`
+ - `local-options`
+
+ Useful options:
+
+ ```bash
+ # Accurate run after changing PDF parsing, chunking, embedding, retrieval code, or sampling parameters.
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite --rebuild
+
+ # Faster run that reuses existing indexes.
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite
+
+ # Run only selected datasets.
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite --datasets local-options,beir/fiqa
+
+ # Override shared parameters for all selected datasets.
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite --top-k 10 --max-queries 20 --max-corpus-docs 1000
+
+ # Save a stable suite-level report name.
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite --output-name latest_rag_eval
+ ```
+
+ The suite writes per-dataset reports and one aggregate report under `eval/reports/`.
+
+ ## Common Commands
+
+ Run the fastest local check while developing PDF parsing or chunking:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets local-options \
+   --max-queries 3 \
+   --top-k 5 \
+   --rebuild
+ ```
+
+ Run only the standard public retrieval smoke tests:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets beir/scifact,beir/fiqa \
+   --rebuild
+ ```
+
+ Run the financial benchmark only:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets beir/fiqa \
+   --max-corpus-docs 1000 \
+   --max-queries 50 \
+   --top-k 5 \
+   --rebuild
+ ```
+
+ Run the PDF-like benchmark only:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets open-ragbench \
+   --max-corpus-docs 100 \
+   --max-queries 20 \
+   --top-k 5 \
+   --rebuild
+ ```
+
+ Compare different `top-k` values:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets local-options \
+   --top-k 3 \
+   --output-name local_options_top3 \
+   --rebuild
+
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets local-options \
+   --top-k 10 \
+   --output-name local_options_top10 \
+   --rebuild
+ ```
+
+ Compare different chunk settings:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets local-options \
+   --chunk-size 384 \
+   --chunk-overlap 64 \
+   --output-name local_options_chunk384 \
+   --rebuild
+
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets local-options \
+   --chunk-size 768 \
+   --chunk-overlap 128 \
+   --output-name local_options_chunk768 \
+   --rebuild
+ ```
+
+ Run a larger, slower evaluation before reporting results:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets beir/scifact,beir/fiqa,open-ragbench,local-options \
+   --max-corpus-docs 2000 \
+   --max-queries 100 \
+   --top-k 5 \
+   --output-name full_rag_eval \
+   --rebuild
+ ```
+
+ Stop immediately when one dataset fails:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+   --datasets beir/scifact,beir/fiqa,open-ragbench,local-options \
+   --fail-fast \
+   --rebuild
+ ```
+
+ Run a single dataset directly without the suite wrapper:
+
+ ```bash
+ uv --cache-dir .uv-cache run python -m eval.rag_eval \
+   --dataset local-options \
+   --max-queries 3 \
+   --top-k 5 \
+   --rebuild
+ ```
+
+ ## Suggested Workflow
+
+ 1. During development, run `local-options` with a small query count.
+ 2. After changing PDF extraction, chunking, embeddings, or retrieval code, add `--rebuild`.
+ 3. Before comparing two versions, use the same `--datasets`, `--max-queries`, `--max-corpus-docs`, `--top-k`, `--chunk-size`, and `--chunk-overlap`.
+ 4. Use `--output-name` to save stable report names for before/after comparison.
+
+ ## Metrics
+
+ - `hit_at_1`
+ - `hit_at_3`
+ - `hit_at_5`
+ - `hit_at_k`
+ - `mrr`
+ - `ndcg_at_k`
+
+ The public benchmarks test whether the eval pipeline works on standard datasets. The `local-options` benchmark is the project-specific check for PDF parsing, formula extraction, and section-aware chunking.
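The metric names in the README follow common retrieval-evaluation conventions. The sketch below shows how such metrics are typically computed for a single query, given a ranked list of retrieved document IDs and the set of relevant IDs. It assumes binary relevance and a deduplicated ranking; the function names are hypothetical and this is not the implementation in `eval/rag_eval.py`.

```python
import math


def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any relevant document appears in the top k results."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranked = ["d3", "d1", "d7"]
relevant = {"d1"}
print(hit_at_k(ranked, relevant, 1))      # 0.0
print(hit_at_k(ranked, relevant, 3))      # 1.0
print(reciprocal_rank(ranked, relevant))  # 0.5
print(ndcg_at_k(ranked, relevant, 3))     # 1 / log2(3)
```

Averaging `reciprocal_rank` over all queries gives MRR; the other metrics are averaged over queries in the same way.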
eval/__init__.py ADDED
@@ -0,0 +1 @@
+ """RAG evaluation helpers."""
eval/data/hf/open_ragbench/README.md ADDED
@@ -0,0 +1,185 @@
+ ---
+ license: cc-by-nc-4.0
+ ---
+ # Open RAG Benchmark
+
+ The Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. Unlike other datasets, Open RAG Benchmark emphasizes **pure PDF content**, meticulously extracting and generating queries on diverse modalities including **text, tables, and images**, even when they are intricately interwoven within a document.
+
+ This dataset is purpose-built to power the company's [Open RAG Evaluation project](https://github.com/vectara/open-rag-eval), facilitating a holistic, end-to-end evaluation of RAG systems by offering:
+
+ - **Richer Multimodal Content:** A corpus derived exclusively from PDF documents, ensuring fidelity to real-world data and encompassing a wide spectrum of text, tabular, and visual information, often with intermodal crossovers.
+ - **Tailored for Open RAG Evaluation:** Designed to support the unique and comprehensive evaluation metrics adopted by the Open RAG Evaluation project, enabling a deeper understanding of RAG performance beyond traditional metrics.
+ - **High-Quality Retrieval Queries & Answers:** Each piece of extracted content is paired with expertly crafted retrieval queries and corresponding answers, optimized for robust RAG training and evaluation.
+ - **Diverse Knowledge Domains:** Content spanning various scientific and technical domains from arXiv, ensuring broad applicability and challenging RAG systems across different knowledge areas.
+
+ The current draft version of the arXiv dataset, as the first step in this multimodal RAG dataset collection, includes:
+
+ - **Documents:** 1000 PDF papers evenly distributed across all arXiv categories.
+   - 400 positive documents (each serving as the golden document for some queries).
+   - 600 hard negative documents (completely irrelevant to all queries).
+ - **Multimodal Content:** Extracted text, tables, and images from research papers.
+ - **QA Pairs:** 3045 valid question-answer pairs.
+   - **Based on query types:**
+     - 1793 abstractive queries (requiring generating a summary or rephrased response using understanding and synthesis).
+     - 1252 extractive queries (seeking concise, fact-based answers directly extracted from a given text).
+   - **Based on generation sources:**
+     - 1914 text-only queries
+     - 763 text-image queries
+     - 148 text-table queries
+     - 220 text-table-image queries
+
+ ## Dataset Structure
+
+ The dataset is organized similarly to the [BEIR dataset](https://github.com/beir-cellar/beir) format within the `official/pdf/arxiv/` directory.
+
+ ```
+ official/
+ └── pdf
+     └── arxiv
+         ├── answers.json
+         ├── corpus
+         │   ├── {PAPER_ID_1}.json
+         │   ├── {PAPER_ID_2}.json
+         │   └── ...
+         ├── pdf_urls.json
+         ├── qrels.json
+         └── queries.json
+ ```
+
+ Each file's format is detailed below:
+
+ ### `pdf_urls.json`
+
+ This file provides the original PDF links to the papers in this dataset for downloading purposes.
+
+ ```json
+ {
+     "Paper ID": "Paper URL",
+     ...
+ }
+ ```
+
+ ### `corpus/`
+
+ This folder contains all processed papers in JSON format.
+
+ ```json
+ {
+     "title": "Paper Title",
+     "sections": [
+         {
+             "text": "Section text content with placeholders for tables/images",
+             "tables": {"table_id1": "markdown_table_string", ...},
+             "images": {"image_id1": "base64_encoded_string", ...}
+         },
+         ...
+     ],
+     "id": "Paper ID",
+     "authors": ["Author1", "Author2", ...],
+     "categories": ["Category1", "Category2", ...],
+     "abstract": "Abstract text",
+     "updated": "Updated date",
+     "published": "Published date"
+ }
+ ```
+
+ ### `queries.json`
+
+ This file contains all generated queries.
+
+ ```json
+ {
+     "Query UUID": {
+         "query": "Query text",
+         "type": "Query type (abstractive/extractive)",
+         "source": "Generation source (text/text-image/text-table/text-table-image)"
+     },
+     ...
+ }
+ ```
+
+ ### `qrels.json`
+
+ This file contains the query-document-section relevance labels.
+
+ ```json
+ {
+     "Query UUID": {
+         "doc_id": "Paper ID",
+         "section_id": Section Index
+     },
+     ...
+ }
+ ```
+
+ ### `answers.json`
+
+ This file contains the answers for the generated queries.
+
+ ```json
+ {
+     "Query UUID": "Answer text",
+     ...
+ }
+ ```
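Because `queries.json`, `qrels.json`, and `answers.json` are all keyed by the query UUID, they can be joined into one record per labeled query. The sketch below is a minimal illustration under that assumption; the function names are hypothetical and are not part of the dataset's tooling.

```python
import json
from pathlib import Path


def join_open_ragbench(queries, qrels, answers):
    """Merge the three query-keyed mappings into one record per labeled query."""
    records = []
    for query_id, qrel in qrels.items():
        query = queries.get(query_id)
        if query is None:
            continue  # skip relevance labels that have no matching query
        records.append(
            {
                "query_id": query_id,
                "query": query["query"],
                "type": query.get("type"),
                "source": query.get("source"),
                "doc_id": qrel["doc_id"],
                "section_id": qrel["section_id"],
                "answer": answers.get(query_id),
            }
        )
    return records


def load_open_ragbench_records(root):
    """Read queries.json, qrels.json, and answers.json from `root` and join them."""
    root = Path(root)
    loaded = {
        name: json.loads((root / f"{name}.json").read_text(encoding="utf-8"))
        for name in ("queries", "qrels", "answers")
    }
    return join_open_ragbench(loaded["queries"], loaded["qrels"], loaded["answers"])
```

With a local copy of the dataset, `load_open_ragbench_records("official/pdf/arxiv")` would return a list of dictionaries that can be filtered by `type` or `source`.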
+
+ ## Dataset Creation
+
+ The Open RAG Benchmark dataset is created through a systematic process involving document collection, processing, content segmentation, query generation, and quality filtering.
+
+ 1. **Document Collection:** Gathering documents from sources like arXiv.
+ 2. **Document Processing:** Parsing PDFs via OCR into text, Markdown tables, and base64 encoded images.
+ 3. **Content Segmentation:** Dividing documents into sections based on structural elements.
+ 4. **Query Generation:** Using LLMs (currently `gpt-4o-mini`) to generate retrieval queries for each section, handling multimodal content such as tables and images.
+ 5. **Quality Filtering:** Removing non-retrieval queries and ensuring quality through post-processing via a set of encoders for retrieval filtering and `gpt-4o-mini` for query quality filtering.
+ 6. **Hard-Negative Document Mining (Optional):** Mining hard negative documents that are entirely irrelevant to any existing query, relying on agreement across multiple embedding models for accuracy.
+
+ The code for reproducing and customizing the dataset generation process is available in the [Open RAG Benchmark GitHub repository](https://github.com/vectara/Open-RAG-Benchmark).
+
+ ## Limitations and Challenges
+
+ Several challenges are inherent in the current dataset development process:
+
+ - **OCR Performance:** Mistral OCR, while performing well for structured documents, struggles with unstructured PDFs, impacting the quality of extracted content.
+ - **Multimodal Integration:** Ensuring proper extraction and seamless integration of tables and images with corresponding text remains a complex challenge.
+
+ ## Future Enhancements
+
+ The project aims for continuous improvement and expansion of the dataset, with key next steps including:
+
+ ### Enhanced Dataset Structure and Usability:
+
+ - **Dataset Format and Content Enhancements:**
+   - **Rich Metadata:** Adding comprehensive document metadata (authors, publication date, categories, etc.) to enable better filtering and contextualization.
+   - **Flexible Chunking:** Providing multiple content granularity levels (sections, paragraphs, sentences) to accommodate different retrieval strategies.
+   - **Query Metadata:** Classifying queries by type (factual, conceptual, analytical), difficulty level, and whether they require multimodal understanding.
+ - **Advanced Multimodal Representation:**
+   - **Improved Image Integration:** Replacing basic placeholders with structured image objects including captions, alt text, and direct access URLs.
+   - **Structured Table Format:** Providing both markdown and programmatically accessible structured formats for tables (headers/rows).
+   - **Positional Context:** Maintaining clear positional relationships between text and visual elements.
+ - **Sophisticated Query Generation:**
+   - **Multi-stage Generation Pipeline:** Implementing targeted generation for different query types (factual, conceptual, multimodal).
+   - **Diversity Controls:** Ensuring coverage of different difficulty levels and reasoning requirements.
+   - **Specialized Multimodal Queries:** Generating queries specifically designed to test table and image understanding.
+ - **Practitioner-Focused Tools:**
+   - **Framework Integration Examples:** Providing code samples showing dataset integration with popular RAG frameworks (LangChain, LlamaIndex, etc.).
+   - **Evaluation Utilities:** Developing standardized tools to benchmark RAG system performance using this dataset.
+   - **Interactive Explorer:** Creating a simple visualization tool to browse and understand dataset contents.
+
+ ### Dataset Expansion:
+
+ - Implementing alternative solutions for PDF table & image extraction.
+ - Enhancing OCR capabilities for unstructured documents.
+ - Broadening scope beyond academic papers to include other document types.
+ - Potentially adding multilingual support.
+
+ ## Acknowledgments
+
+ The Open RAG Benchmark project uses OpenAI's GPT models (specifically `gpt-4o-mini`) for query generation and evaluation. For post-filtering and retrieval filtering, the following embedding models, recognized for their outstanding performance on the [MTEB Benchmark](https://huggingface.co/spaces/mteb/leaderboard), were utilized:
+
+ - [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)
+ - [dunzhang/stella\_en\_1.5B\_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5)
+ - [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)
+ - [infly/inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1)
+ - [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)
+ - [openai/text-embedding-3-large](https://platform.openai.com/docs/models/text-embedding-3-large)
eval/rag_eval.py ADDED
@@ -0,0 +1,630 @@
+ from __future__ import annotations
+
+ import argparse
+ import csv
+ import json
+ import math
+ import shutil
+ import zipfile
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Any, Iterable
+
+ import chromadb
+ import requests
+ from llama_index.core import StorageContext, VectorStoreIndex
+ from llama_index.core.node_parser import SentenceSplitter
+ from llama_index.core.schema import Document
+ from llama_index.vector_stores.chroma import ChromaVectorStore
+
+ from tools.query_knowledge import configure_model_cache, resolve_embed_model_name
+
+
+ PROJECT_ROOT = Path(__file__).resolve().parents[1]
+ EVAL_DIR = PROJECT_ROOT / "eval"
+ DATA_DIR = EVAL_DIR / "data"
+ INDEX_DIR = EVAL_DIR / "indexes"
+ REPORT_DIR = EVAL_DIR / "reports"
+
+ BEIR_URLS = {
+     "scifact": "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip",
+     "fiqa": "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip",
+ }
+
+ DATASET_ALIASES = {
+     "beir/scifact": "scifact",
+     "beir/fiqa": "fiqa",
+     "open-ragbench": "open_ragbench",
+     "open_ragbench": "open_ragbench",
+     "t2-ragbench": "t2_ragbench",
+     "t2_ragbench": "t2_ragbench",
+     "local-options": "local_options",
+     "local_options": "local_options",
+ }
+
+
+ @dataclass
+ class EvalCorpus:
+     name: str
+     documents: list[dict[str, Any]]
+     queries: list[dict[str, Any]]
+     qrels: dict[str, set[str]]
+
+
+ def ensure_dirs() -> None:
+     DATA_DIR.mkdir(parents=True, exist_ok=True)
+     INDEX_DIR.mkdir(parents=True, exist_ok=True)
+     REPORT_DIR.mkdir(parents=True, exist_ok=True)
+
+
+ def download_file(url: str, destination: Path) -> None:
+     destination.parent.mkdir(parents=True, exist_ok=True)
+     with requests.get(url, stream=True, timeout=60) as response:
+         response.raise_for_status()
+         with destination.open("wb") as file:
+             for chunk in response.iter_content(chunk_size=1024 * 1024):
+                 if chunk:
+                     file.write(chunk)
+
+
+ def read_jsonl(path: Path) -> Iterable[dict[str, Any]]:
+     with path.open("r", encoding="utf-8") as file:
+         for line in file:
+             line = line.strip()
+             if line:
+                 yield json.loads(line)
+
+
+ def prepare_beir_dataset(dataset_name: str) -> Path:
+     ensure_dirs()
+     if dataset_name not in BEIR_URLS:
+         raise ValueError(f"Unsupported BEIR dataset: {dataset_name}")
+
+     target_dir = DATA_DIR / "beir" / dataset_name
+     corpus_path = target_dir / "corpus.jsonl"
+     if corpus_path.exists():
+         return target_dir
+
+     zip_path = DATA_DIR / "downloads" / f"{dataset_name}.zip"
+     if not zip_path.exists():
+         download_file(BEIR_URLS[dataset_name], zip_path)
+
+     extract_root = DATA_DIR / "beir"
+     extract_root.mkdir(parents=True, exist_ok=True)
+     with zipfile.ZipFile(zip_path) as archive:
+         archive.extractall(extract_root)
+
+     if not corpus_path.exists():
+         raise FileNotFoundError(f"BEIR extraction did not create {corpus_path}")
+
+     return target_dir
+
+
+ def load_beir_dataset(
+     dataset_name: str,
+     split: str,
+     max_corpus_docs: int | None,
+     max_queries: int | None,
+ ) -> EvalCorpus:
+     dataset_dir = prepare_beir_dataset(dataset_name)
+
+     all_queries = {
+         str(row["_id"]): row.get("text", "")
+         for row in read_jsonl(dataset_dir / "queries.jsonl")
+     }
+
+     qrels_path = dataset_dir / "qrels" / f"{split}.tsv"
+     if not qrels_path.exists():
+         candidates = sorted((dataset_dir / "qrels").glob("*.tsv"))
+         if not candidates:
+             raise FileNotFoundError(f"No qrels found under {dataset_dir / 'qrels'}")
+         qrels_path = candidates[0]
+
+     all_qrels: dict[str, set[str]] = {}
+     with qrels_path.open("r", encoding="utf-8") as file:
+         reader = csv.DictReader(file, delimiter="\t")
+         for row in reader:
+             query_id = str(row.get("query-id") or row.get("query_id"))
+             corpus_id = str(row.get("corpus-id") or row.get("corpus_id"))
+             score = int(row.get("score", 1))
+             if score <= 0:
+                 continue
+             all_qrels.setdefault(query_id, set()).add(corpus_id)
+
+     queries = []
+     required_doc_ids = set()
+     for query_id, relevant_docs in all_qrels.items():
+         if query_id not in all_queries:
+             continue
+         if max_corpus_docs and len(required_doc_ids | relevant_docs) > max_corpus_docs:
+             continue
+         required_doc_ids.update(relevant_docs)
+         queries.append(
+             {
+                 "query_id": query_id,
+                 "question": all_queries[query_id],
+                 "relevant_doc_ids": sorted(relevant_docs),
+             }
+         )
+         if max_queries and len(queries) >= max_queries:
+             break
+
+     documents = []
+     seen_doc_ids = set()
+     for row in read_jsonl(dataset_dir / "corpus.jsonl"):
+         doc_id = str(row["_id"])
+         if required_doc_ids and doc_id not in required_doc_ids:
+             if max_corpus_docs and len(documents) >= max_corpus_docs:
+                 continue
+             if max_corpus_docs and len(documents) + len(required_doc_ids - seen_doc_ids) >= max_corpus_docs:
+                 continue
+         title = row.get("title") or ""
+         text = row.get("text") or ""
+         documents.append(
+             {
+                 "doc_id": doc_id,
+                 "title": title,
+                 "text": f"{title}\n{text}".strip(),
+                 "metadata": {"source_dataset": f"beir/{dataset_name}"},
+             }
+         )
+         seen_doc_ids.add(doc_id)
+         if max_corpus_docs and len(documents) >= max_corpus_docs and required_doc_ids.issubset(seen_doc_ids):
+             break
+
+     if not documents or not queries:
+         raise ValueError(
+             f"Dataset beir/{dataset_name} has no evaluable documents/queries. "
+             "Increase --max-corpus-docs or use a larger sample."
+         )
+
+     return EvalCorpus(
+         name=f"beir_{dataset_name}",
+         documents=documents,
+         queries=queries,
+         qrels={query["query_id"]: set(query["relevant_doc_ids"]) for query in queries},
+     )
+
+
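The qrels step in `load_beir_dataset` reads a tab-separated file and keeps only rows with a positive score. The standalone sketch below isolates that step so it can be tried on an in-memory string. It assumes the standard BEIR header (`query-id`, `corpus-id`, `score`); `parse_qrels` is a hypothetical helper, not a function in this module.

```python
import csv
import io


def parse_qrels(tsv_text):
    """Map each query ID to the set of corpus IDs labeled with a positive score."""
    qrels = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if int(row["score"]) <= 0:
            continue  # a score of 0 marks a judged but non-relevant document
        qrels.setdefault(row["query-id"], set()).add(row["corpus-id"])
    return qrels


sample = "query-id\tcorpus-id\tscore\nq1\td1\t1\nq1\td2\t0\nq2\td3\t2\n"
print(parse_qrels(sample))
```

Graded scores above 1 are collapsed to plain membership here, which matches the binary treatment of relevance in the loader.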
+ def snapshot_hf_dataset(repo_id: str, local_name: str) -> Path:
+     from huggingface_hub import snapshot_download
+
+     ensure_dirs()
+     target_dir = DATA_DIR / "hf" / local_name
+     if target_dir.exists():
+         return target_dir
+
+     snapshot_download(
+         repo_id=repo_id,
+         repo_type="dataset",
+         local_dir=str(target_dir),
+         local_dir_use_symlinks=False,
+     )
+     return target_dir
+
+
+ def flatten_open_ragbench_section(section: dict[str, Any]) -> str:
+     parts = [section.get("text") or ""]
+     tables = section.get("tables") or {}
+     if isinstance(tables, dict):
+         parts.extend(str(value) for value in tables.values())
+     return "\n".join(part for part in parts if part)
+
+
+ def load_open_ragbench(
+     max_corpus_docs: int | None,
+     max_queries: int | None,
+ ) -> EvalCorpus:
+     dataset_dir = snapshot_hf_dataset("vectara/open_ragbench", "open_ragbench")
+     root = dataset_dir / "pdf" / "arxiv"
+     if not root.exists():
+         root = dataset_dir / "official" / "pdf" / "arxiv"
+     if not root.exists():
+         raise FileNotFoundError(f"Open RAGBench root not found: {root}")
+
+     queries_data = json.loads((root / "queries.json").read_text(encoding="utf-8"))
+     qrels_data = json.loads((root / "qrels.json").read_text(encoding="utf-8"))
+
+     documents = []
+     qrels: dict[str, set[str]] = {}
+     required_doc_ids = set()
+     selected_query_ids = []
+     for query_id, qrel in qrels_data.items():
+         doc_id = str(qrel.get("doc_id"))
+         if not doc_id or doc_id == "None":
+             continue
+         selected_query_ids.append(str(query_id))
+         required_doc_ids.add(doc_id)
+         if max_queries and len(selected_query_ids) >= max_queries:
+             break
+
+     allowed_doc_ids = set()
+     corpus_files = sorted((root / "corpus").glob("*.json"))
+
+     for corpus_file in corpus_files:
+         paper = json.loads(corpus_file.read_text(encoding="utf-8"))
+         paper_id = str(paper.get("id") or corpus_file.stem)
+         is_required = paper_id in required_doc_ids
+         if max_corpus_docs and not is_required:
+             missing_required_count = len(required_doc_ids - allowed_doc_ids)
+             if len(documents) + missing_required_count >= max_corpus_docs:
+                 continue
+         allowed_doc_ids.add(paper_id)
+         section_texts = []
+         for section_index, section in enumerate(paper.get("sections") or []):
+             section_text = flatten_open_ragbench_section(section)
+             if section_text:
+                 section_texts.append(f"[section {section_index}]\n{section_text}")
+         text = "\n\n".join(
+             part
+             for part in [paper.get("title") or "", paper.get("abstract") or "", *section_texts]
+             if part
+         )
+         documents.append(
+             {
+                 "doc_id": paper_id,
+                 "title": paper.get("title") or paper_id,
+                 "text": text,
+                 "metadata": {
+                     "source_dataset": "open_ragbench",
+                     "categories": ",".join(paper.get("categories") or []),
+                 },
+             }
+         )
+         if max_corpus_docs and len(documents) >= max_corpus_docs:
+             break
+
+     queries = []
+     for query_id in selected_query_ids:
+         qrel = qrels_data[query_id]
+         doc_id = str(qrel.get("doc_id"))
+         if doc_id not in allowed_doc_ids:
+             continue
+         query_payload = queries_data.get(query_id) or {}
+         question = query_payload.get("query") if isinstance(query_payload, dict) else str(query_payload)
+         qrels[str(query_id)] = {doc_id}
+         queries.append(
+             {
+                 "query_id": str(query_id),
+                 "question": question,
+                 "relevant_doc_ids": [doc_id],
+             }
+         )
+         if max_queries and len(queries) >= max_queries:
+             break
+
+     if not documents or not queries:
+         raise ValueError("Open RAGBench produced no evaluable sample.")
+
+     return EvalCorpus("open_ragbench", documents, queries, qrels)
+
+
+ def load_t2_ragbench(
+     max_corpus_docs: int | None,
+     max_queries: int | None,
+ ) -> EvalCorpus:
+     dataset_dir = snapshot_hf_dataset("G4KMU/t2-ragbench", "t2_ragbench")
+     parquet_files = sorted(dataset_dir.rglob("*.parquet"))
+     jsonl_files = sorted(dataset_dir.rglob("*.jsonl"))
+     if not parquet_files and not jsonl_files:
+         raise FileNotFoundError(f"No parquet/jsonl files found in {dataset_dir}")
+
+     rows: list[dict[str, Any]] = []
+     if parquet_files:
+         import pandas as pd
+
+         for parquet_file in parquet_files:
+             frame = pd.read_parquet(parquet_file)
+             rows.extend(frame.to_dict(orient="records"))
+             if max_queries and len(rows) >= max_queries * 5:
+                 break
+     else:
+         for jsonl_file in jsonl_files:
+             rows.extend(read_jsonl(jsonl_file))
+             if max_queries and len(rows) >= max_queries * 5:
+                 break
+
+     documents_by_id: dict[str, dict[str, Any]] = {}
+     queries = []
+     qrels: dict[str, set[str]] = {}
+
+     for index, row in enumerate(rows):
+         question = first_present(row, ["question", "query", "Question"])
+         answer = first_present(row, ["answer", "Answer", "response"])
+         context = first_present(row, ["context", "evidence", "gold_context", "text", "document"])
+         table = first_present(row, ["table", "Table", "markdown_table"])
+         doc_id = str(first_present(row, ["doc_id", "document_id", "filename", "pdf_path", "source"]) or f"row-{index}")
+         if not question or not context:
+             continue
+
+         text = "\n".join(part for part in [str(context), str(table or "")] if part)
+         if doc_id not in documents_by_id:
+             documents_by_id[doc_id] = {
+                 "doc_id": doc_id,
+                 "title": str(first_present(row, ["company", "ticker", "title", "Title"]) or doc_id),
+                 "text": text,
+                 "metadata": {"source_dataset": "t2_ragbench", "answer": str(answer or "")},
+             }
+         queries.append(
+             {
+                 "query_id": str(first_present(row, ["qid", "query_id", "id"]) or f"q-{index}"),
+                 "question": str(question),
+                 "relevant_doc_ids": [doc_id],
+             }
+         )
+         qrels[queries[-1]["query_id"]] = {doc_id}
+         if max_queries and len(queries) >= max_queries:
+             break
+
+     documents = list(documents_by_id.values())
+     if max_corpus_docs:
+         documents = documents[:max_corpus_docs]
+         allowed = {document["doc_id"] for document in documents}
+         queries = [query for query in queries if query["relevant_doc_ids"][0] in allowed]
+         qrels = {query["query_id"]: set(query["relevant_doc_ids"]) for query in queries}
+
+     if not documents or not queries:
+         raise ValueError("T2-RAGBench produced no evaluable sample.")
+
+     return EvalCorpus("t2_ragbench", documents, queries, qrels)
+
+
+ def first_present(row: dict[str, Any], keys: list[str]) -> Any:
+     for key in keys:
+         value = row.get(key)
+         if value is not None and value != "":
+             return value
+     return None
+
+
+ def load_local_options_eval(max_queries: int | None) -> EvalCorpus:
+     cases_path = EVAL_DIR / "local_options_eval.jsonl"
+     if not cases_path.exists():
+         raise FileNotFoundError(
+             f"Local options eval set not found: {cases_path}. "
+             "Create JSONL cases with question, expected_pages, expected_keywords."
+         )
+
+     from tools.query_knowledge import load_pdf_file
+
+     pdf_files = sorted((PROJECT_ROOT / "tools" / "knowledge_base" / "raw").rglob("*.pdf"))
+     documents = []
+     for pdf_file in pdf_files:
+         for doc_index, document in enumerate(load_pdf_file(pdf_file)):
+             documents.append(
+                 {
+                     "doc_id": f"{pdf_file.name}:{document.metadata.get('page_number')}:{doc_index}",
+                     "title": document.metadata.get("section_path") or pdf_file.name,
+                     "text": document.text,
+                     "metadata": document.metadata,
+                 }
+             )
+
+     queries = []
+     qrels: dict[str, set[str]] = {}
+     for case_index, case in enumerate(read_jsonl(cases_path)):
+         query_id = str(case.get("id") or f"local-{case_index}")
+         relevant_ids = []
+         expected_pages = set(case.get("expected_pages") or [])
+         expected_keywords = case.get("expected_keywords") or []
+         for document in documents:
+             metadata = document.get("metadata") or {}
+             page_hit = metadata.get("page_number") in expected_pages
+             keyword_hit = any(keyword in document["text"] for keyword in expected_keywords)
+             if page_hit or keyword_hit:
+                 relevant_ids.append(document["doc_id"])
+         queries.append(
+             {
+                 "query_id": query_id,
+                 "question": case["question"],
+                 "relevant_doc_ids": relevant_ids,
+             }
+         )
+         qrels[query_id] = set(relevant_ids)
+         if max_queries and len(queries) >= max_queries:
+             break
+
+     if not documents or not queries:
+         raise ValueError("Local options eval set produced no evaluable sample.")
+
+     return EvalCorpus("local_options", documents, queries, qrels)
+
+
+ def load_eval_corpus(args: argparse.Namespace) -> EvalCorpus:
+     dataset = DATASET_ALIASES.get(args.dataset, args.dataset)
+     if dataset in {"scifact", "fiqa"}:
+         return load_beir_dataset(dataset, args.split, args.max_corpus_docs, args.max_queries)
+     if dataset == "open_ragbench":
+         return load_open_ragbench(args.max_corpus_docs, args.max_queries)
+     if dataset == "t2_ragbench":
+         return load_t2_ragbench(args.max_corpus_docs, args.max_queries)
+     if dataset == "local_options":
+         return load_local_options_eval(args.max_queries)
+     raise ValueError(f"Unknown dataset: {args.dataset}")
+
+
+ def build_index(corpus: EvalCorpus, chunk_size: int, chunk_overlap: int, rebuild: bool) -> VectorStoreIndex:
+     configure_model_cache()
+     from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+     index_path = INDEX_DIR / corpus.name
+     if rebuild and index_path.exists():
+         shutil.rmtree(index_path)
+     index_path.mkdir(parents=True, exist_ok=True)
+
+     db = chromadb.PersistentClient(path=str(index_path))
+     collection_name = f"{corpus.name}_eval"
+     if rebuild:
+         try:
+             db.delete_collection(collection_name)
+         except Exception:
+             pass
+     collection = db.get_or_create_collection(collection_name)
+     vector_store = ChromaVectorStore(chroma_collection=collection)
+     storage_context = StorageContext.from_defaults(vector_store=vector_store)
+     embed_model = HuggingFaceEmbedding(
+         model_name=resolve_embed_model_name(),
+         cache_folder=str(PROJECT_ROOT / "tools" / "hf_cache" / "sentence_transformers"),
+     )
+
+     if collection.count() == 0:
+         documents = [
+             Document(
+                 text=document["text"],
+                 metadata={
+                     "doc_id": document["doc_id"],
+                     "title": document.get("title", ""),
+                     **(document.get("metadata") or {}),
+                 },
+             )
+             for document in corpus.documents
+         ]
+         splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
+         nodes = splitter.get_nodes_from_documents(documents)
+         VectorStoreIndex(
+             nodes,
+             storage_context=storage_context,
+             embed_model=embed_model,
+             show_progress=True,
+         )
+
+     return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
+
+
+ def evaluate_retrieval(corpus: EvalCorpus, index: VectorStoreIndex, top_k: int) -> dict[str, Any]:
+     retriever = index.as_retriever(similarity_top_k=max(top_k * 5, top_k))
+     cases = []
+     hit_counts = {1: 0, 3: 0, 5: 0, top_k: 0}
+     reciprocal_ranks = []
+     ndcg_scores = []
+
+     for query in corpus.queries:
+         relevant_doc_ids = corpus.qrels.get(query["query_id"], set())
+         results = retriever.retrieve(query["question"])
+         retrieved = []
+         seen_doc_ids = set()
+         first_hit_rank = None
+         dcg = 0.0
+
+         for result in results:
+             metadata = result.node.metadata
+             doc_id = str(metadata.get("doc_id", ""))
+             if doc_id in seen_doc_ids:
+                 continue
+             seen_doc_ids.add(doc_id)
+             rank = len(retrieved) + 1
+             hit = doc_id in relevant_doc_ids
+             if hit and first_hit_rank is None:
+                 first_hit_rank = rank
+             if hit:
+                 dcg += 1 / math.log2(rank + 1)
+             retrieved.append(
+                 {
+                     "rank": rank,
+                     "doc_id": doc_id,
+                     "score": result.score,
+                     "hit": hit,
+                     "title": metadata.get("title", ""),
+                 }
+             )
+             if len(retrieved) >= top_k:
+                 break
+
+         ideal_hits = min(len(relevant_doc_ids), top_k)
+         idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
+         ndcg = dcg / idcg if idcg else 0.0
+         ndcg_scores.append(ndcg)
+         reciprocal_ranks.append(1 / first_hit_rank if first_hit_rank else 0.0)
+
+         for k in hit_counts:
+             if any(item["hit"] for item in retrieved[:k]):
+                 hit_counts[k] += 1
+
+         cases.append(
+             {
+                 "query_id": query["query_id"],
+                 "question": query["question"],
+                 "relevant_doc_ids": sorted(relevant_doc_ids),
+                 "first_hit_rank": first_hit_rank,
+                 "retrieved": retrieved,
+             }
+         )
+
+     total = len(corpus.queries)
+     metrics = {
+         "queries": total,
+         "documents": len(corpus.documents),
+         "top_k": top_k,
+         "mrr": sum(reciprocal_ranks) / total if total else 0.0,
+         "ndcg_at_k": sum(ndcg_scores) / total if total else 0.0,
+     }
+     for k, count in sorted(hit_counts.items()):
+         metrics[f"hit_at_{k}"] = count / total if total else 0.0
+
+     return {"dataset": corpus.name, "metrics": metrics, "cases": cases}
+
+
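As a standalone illustration of the binary-relevance nDCG computed per query in `evaluate_retrieval`, the sketch below reproduces the same arithmetic on a plain list of hit flags. The helper name `binary_ndcg_at_k` and the example ranking are hypothetical and are not part of this commit.

```python
import math


def binary_ndcg_at_k(hits: list[bool], num_relevant: int, k: int) -> float:
    """nDCG@k with binary gains: each hit at rank r contributes 1 / log2(r + 1)."""
    dcg = sum(
        1 / math.log2(rank + 1)
        for rank, hit in enumerate(hits[:k], start=1)
        if hit
    )
    # The ideal ranking places every relevant document (up to k) at the top.
    ideal_hits = min(num_relevant, k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical ranking: two relevant documents exist, retrieved at ranks 1 and 3.
example = binary_ndcg_at_k([True, False, True, False, False], num_relevant=2, k=5)
print(f"nDCG@5 = {example:.4f}")
```

A query with no relevant documents yields an ideal DCG of zero, so the helper returns 0.0 for it, which matches the `if idcg else 0.0` guard in the function above.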
+ def write_reports(report: dict[str, Any]) -> tuple[Path, Path]:
+     ensure_dirs()
+     dataset_name = report["dataset"]
+     json_path = REPORT_DIR / f"{dataset_name}_retrieval_eval.json"
+     md_path = REPORT_DIR / f"{dataset_name}_retrieval_eval.md"
+     json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
+
+     metrics = report["metrics"]
+     lines = [
+         f"# Retrieval Eval: {dataset_name}",
+         "",
+         "## Metrics",
+         "",
+     ]
+     for key, value in metrics.items():
+         lines.append(f"- `{key}`: {value:.4f}" if isinstance(value, float) else f"- `{key}`: {value}")
+
+     lines.extend(["", "## Sample Cases", ""])
+     for case in report["cases"][:10]:
+         lines.append(f"### {case['query_id']}")
+         lines.append("")
+         lines.append(case["question"])
+         lines.append("")
+         lines.append(f"- first_hit_rank: `{case['first_hit_rank']}`")
+         for item in case["retrieved"][:5]:
+             lines.append(
+                 f"- rank {item['rank']}: hit={item['hit']} doc_id=`{item['doc_id']}` score={item['score']}"
+             )
+         lines.append("")
+
+     md_path.write_text("\n".join(lines), encoding="utf-8")
+     return json_path, md_path
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="Run retrieval eval for RAG datasets.")
+     parser.add_argument(
+         "--dataset",
+         required=True,
+         help="beir/scifact, beir/fiqa, open-ragbench, t2-ragbench, or local-options",
+     )
+     parser.add_argument("--split", default="test")
+     parser.add_argument("--top-k", type=int, default=5)
+     parser.add_argument("--chunk-size", type=int, default=512)
+     parser.add_argument("--chunk-overlap", type=int, default=64)
+     parser.add_argument("--max-corpus-docs", type=int, default=None)
+     parser.add_argument("--max-queries", type=int, default=None)
+     parser.add_argument("--rebuild", action="store_true")
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     corpus = load_eval_corpus(args)
+     index = build_index(corpus, args.chunk_size, args.chunk_overlap, args.rebuild)
+     report = evaluate_retrieval(corpus, index, args.top_k)
+     json_path, md_path = write_reports(report)
+     print(json.dumps(report["metrics"], ensure_ascii=False, indent=2))
+     print(f"JSON report: {json_path}")
+     print(f"Markdown report: {md_path}")
+
+
+ if __name__ == "__main__":
+     main()
eval/reports/beir_fiqa_retrieval_eval.md ADDED
@@ -0,0 +1,124 @@
+ # Retrieval Eval: beir_fiqa
+
+ ## Metrics
+
+ - `queries`: 10
+ - `documents`: 500
+ - `top_k`: 5
+ - `mrr`: 0.8000
+ - `ndcg_at_k`: 0.6582
+ - `hit_at_1`: 0.8000
+ - `hit_at_3`: 0.8000
+ - `hit_at_5`: 0.8000
+
+ ## Sample Cases
+
+ ### 8
+
+ How to deposit a cheque issued to an associate in my business into my business account?
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`65404` score=0.6844510955827177
+ - rank 2: hit=False doc_id=`508754` score=0.6415634192002271
+ - rank 3: hit=False doc_id=`1873` score=0.6244133153886419
+ - rank 4: hit=False doc_id=`590102` score=0.6106401478322256
+ - rank 5: hit=False doc_id=`1066` score=0.5854493569389293
+
+ ### 15
+
+ Can I send a money order from USPS as a business?
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`325273` score=0.6860931820873509
+ - rank 2: hit=False doc_id=`3714` score=0.5383410844537323
+ - rank 3: hit=False doc_id=`508754` score=0.5295326644960427
+ - rank 4: hit=False doc_id=`1873` score=0.5219679418951554
+ - rank 5: hit=False doc_id=`4457` score=0.5122406473020094
+
+ ### 18
+
+ 1 EIN doing business under multiple business names
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`88124` score=0.5926237160250162
+ - rank 2: hit=False doc_id=`1873` score=0.5421392202098603
+ - rank 3: hit=False doc_id=`248624` score=0.5355707959162649
+ - rank 4: hit=False doc_id=`590102` score=0.5349105669189491
+ - rank 5: hit=False doc_id=`1173` score=0.5304232255229728
+
+ ### 26
+
+ Applying for and receiving business credit
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`350819` score=0.6130084948278423
+ - rank 2: hit=False doc_id=`2064` score=0.5484836878784439
+ - rank 3: hit=False doc_id=`5019` score=0.545421752024407
+ - rank 4: hit=False doc_id=`1873` score=0.5288677740902044
+ - rank 5: hit=False doc_id=`1766` score=0.5277730439438229
+
+ ### 34
+
+ 401k Transfer After Business Closure
+
+ - first_hit_rank: `None`
+ - rank 1: hit=False doc_id=`19183` score=0.5697281829712297
+ - rank 2: hit=False doc_id=`1506` score=0.5606544069043923
+ - rank 3: hit=False doc_id=`1134` score=0.5594801072658324
+ - rank 4: hit=False doc_id=`3481` score=0.5580692841866827
+ - rank 5: hit=False doc_id=`3059` score=0.5470931591486823
+
+ ### 42
+
+ What are the ins/outs of writing equipment purchases off as business expenses in a home based business?
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`272709` score=0.6108084707046366
+ - rank 2: hit=False doc_id=`2528` score=0.5915589749452431
+ - rank 3: hit=True doc_id=`331981` score=0.5819601957870557
+ - rank 4: hit=False doc_id=`1873` score=0.5679211375564418
+ - rank 5: hit=True doc_id=`327263` score=0.5609058973658579
+
+ ### 56
+
+ Can a entrepreneur hire a self-employed business owner?
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`572690` score=0.5928112761756716
+ - rank 2: hit=False doc_id=`1873` score=0.5329399371121925
+ - rank 3: hit=False doc_id=`350819` score=0.49122764843847383
+ - rank 4: hit=False doc_id=`288` score=0.48281883887294536
+ - rank 5: hit=False doc_id=`599545` score=0.4825679577769018
+
+ ### 68
+
+ Intentions of Deductible Amount for Small Business
+
+ - first_hit_rank: `None`
+ - rank 1: hit=False doc_id=`599545` score=0.5484593654392641
+ - rank 2: hit=False doc_id=`350819` score=0.545089604374947
+ - rank 3: hit=False doc_id=`327263` score=0.5425303284932907
+ - rank 4: hit=False doc_id=`272709` score=0.5367760755311749
+ - rank 5: hit=False doc_id=`1873` score=0.5341962558469263
+
+ ### 89
+
+ How can I deposit a check made out to my business into my personal account?
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`508754` score=0.678210930846752
+ - rank 2: hit=False doc_id=`3336` score=0.6219187366693569
+ - rank 3: hit=False doc_id=`1066` score=0.6102283456272309
+ - rank 4: hit=False doc_id=`65404` score=0.6070578770706204
+ - rank 5: hit=True doc_id=`413229` score=0.5974145307840032
+
+ ### 90
+
+ Filing personal with 1099s versus business s-corp?
+
+ - first_hit_rank: `1`
+ - rank 1: hit=True doc_id=`31793` score=0.6463855238248295
+ - rank 2: hit=False doc_id=`4992` score=0.575164246858743
+ - rank 3: hit=False doc_id=`1873` score=0.567805853646443
+ - rank 4: hit=False doc_id=`2020` score=0.5629015874196683
+ - rank 5: hit=False doc_id=`350819` score=0.5607360854843948
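The `mrr` and `hit_at_k` values in this report can be cross-checked from the `first_hit_rank` entries of the ten sample cases. The sketch below is a minimal recomputation under that assumption; the helper names are illustrative and do not come from `eval/rag_eval.py`.

```python
# first_hit_rank per query, in report order: 8, 15, 18, 26, 34, 42, 56, 68, 89, 90.
first_hit_ranks = [1, 1, 1, 1, None, 1, 1, None, 1, 1]


def mean_reciprocal_rank(ranks: list) -> float:
    """Average of 1/rank over all queries; a query with no hit contributes 0."""
    return sum(1 / rank for rank in ranks if rank is not None) / len(ranks)


def hit_rate_at(ranks: list, k: int) -> float:
    """Share of queries whose first relevant document appears within the top k."""
    return sum(1 for rank in ranks if rank is not None and rank <= k) / len(ranks)


print(f"mrr      = {mean_reciprocal_rank(first_hit_ranks):.4f}")
for k in (1, 3, 5):
    print(f"hit_at_{k} = {hit_rate_at(first_hit_ranks, k):.4f}")
```

Because every query that has a hit has it at rank 1, MRR and all three hit rates coincide at 0.8. `ndcg_at_k` cannot be recomputed the same way, since it also depends on how many relevant documents each query has, which the Markdown report does not list.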
eval/run_eval_suite.py ADDED
@@ -0,0 +1,173 @@
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import traceback
+ from dataclasses import dataclass
+ from datetime import datetime
+ from pathlib import Path
+ from types import SimpleNamespace
+ from typing import Any
+
+ from eval.rag_eval import (
+     REPORT_DIR,
+     build_index,
+     ensure_dirs,
+     evaluate_retrieval,
+     load_eval_corpus,
+     write_reports,
+ )
+
+
+ DEFAULT_DATASETS = ["beir/scifact", "beir/fiqa", "open-ragbench", "local-options"]
+ SMOKE_DEFAULTS = {
+     "beir/scifact": {"max_corpus_docs": 200, "max_queries": 10},
+     "beir/fiqa": {"max_corpus_docs": 500, "max_queries": 10},
+     "open-ragbench": {"max_corpus_docs": 20, "max_queries": 5},
+     "t2-ragbench": {"max_corpus_docs": 20, "max_queries": 5},
+     "local-options": {"max_corpus_docs": None, "max_queries": 3},
+ }
+
+
+ @dataclass
+ class DatasetRun:
+     dataset: str
+     status: str
+     metrics: dict[str, Any] | None
+     json_report: str | None
+     markdown_report: str | None
+     error: str | None = None
+
+
+ def parse_dataset_list(value: str) -> list[str]:
+     datasets = [item.strip() for item in value.split(",") if item.strip()]
+     return datasets or DEFAULT_DATASETS
+
+
+ def build_dataset_args(args: argparse.Namespace, dataset: str) -> SimpleNamespace:
+     defaults = SMOKE_DEFAULTS.get(dataset, {"max_corpus_docs": None, "max_queries": None})
+     return SimpleNamespace(
+         dataset=dataset,
+         split=args.split,
+         top_k=args.top_k,
+         chunk_size=args.chunk_size,
+         chunk_overlap=args.chunk_overlap,
+         max_corpus_docs=args.max_corpus_docs
+         if args.max_corpus_docs is not None
+         else defaults["max_corpus_docs"],
+         max_queries=args.max_queries if args.max_queries is not None else defaults["max_queries"],
+         rebuild=args.rebuild,
+     )
+
+
+ def run_one(dataset: str, args: argparse.Namespace) -> DatasetRun:
+     dataset_args = build_dataset_args(args, dataset)
+     print(
+         f"\n=== Running {dataset} "
+         f"(top_k={dataset_args.top_k}, max_corpus_docs={dataset_args.max_corpus_docs}, "
+         f"max_queries={dataset_args.max_queries}, rebuild={dataset_args.rebuild}) ==="
+     )
+
+     corpus = load_eval_corpus(dataset_args)
+     index = build_index(
+         corpus,
+         chunk_size=dataset_args.chunk_size,
+         chunk_overlap=dataset_args.chunk_overlap,
+         rebuild=dataset_args.rebuild,
+     )
+     report = evaluate_retrieval(corpus, index, dataset_args.top_k)
+     json_path, md_path = write_reports(report)
+     print(json.dumps(report["metrics"], ensure_ascii=False, indent=2))
+
+     return DatasetRun(
+         dataset=dataset,
+         status="passed",
+         metrics=report["metrics"],
+         json_report=str(json_path),
+         markdown_report=str(md_path),
+     )
+
+
+
91
+ def write_suite_report(runs: list[DatasetRun], output_name: str | None) -> tuple[Path, Path]:
92
+ ensure_dirs()
93
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
94
+ stem = output_name or f"rag_eval_suite_{timestamp}"
95
+ json_path = REPORT_DIR / f"{stem}.json"
96
+ md_path = REPORT_DIR / f"{stem}.md"
97
+
98
+ payload = {
99
+ "created_at": datetime.now().isoformat(timespec="seconds"),
100
+ "runs": [run.__dict__ for run in runs],
101
+ }
102
+ json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
103
+
104
+ lines = ["# RAG Eval Suite", ""]
105
+ for run in runs:
106
+ lines.append(f"## {run.dataset}")
107
+ lines.append("")
108
+ lines.append(f"- status: `{run.status}`")
109
+ if run.error:
110
+ lines.append(f"- error: `{run.error}`")
111
+ if run.metrics:
112
+ for key, value in run.metrics.items():
113
+ lines.append(f"- `{key}`: {value:.4f}" if isinstance(value, float) else f"- `{key}`: {value}")
114
+ if run.markdown_report:
115
+ lines.append(f"- report: `{run.markdown_report}`")
116
+ lines.append("")
117
+ md_path.write_text("\n".join(lines), encoding="utf-8")
118
+ return json_path, md_path
119
+
120
+
121
+ def parse_args() -> argparse.Namespace:
122
+ parser = argparse.ArgumentParser(description="Run a RAG retrieval eval suite.")
123
+ parser.add_argument(
124
+ "--datasets",
125
+ default=",".join(DEFAULT_DATASETS),
126
+ help="Comma-separated datasets: beir/scifact, beir/fiqa, open-ragbench, t2-ragbench, local-options",
127
+ )
128
+ parser.add_argument("--split", default="test")
129
+ parser.add_argument("--top-k", type=int, default=5)
130
+ parser.add_argument("--chunk-size", type=int, default=512)
131
+ parser.add_argument("--chunk-overlap", type=int, default=64)
132
+ parser.add_argument("--max-corpus-docs", type=int, default=None)
133
+ parser.add_argument("--max-queries", type=int, default=None)
134
+ parser.add_argument("--rebuild", action="store_true")
135
+ parser.add_argument("--fail-fast", action="store_true")
136
+ parser.add_argument("--output-name", default=None, help="Suite report filename stem under eval/reports.")
137
+ return parser.parse_args()
138
+
139
+
140
+ def main() -> None:
141
+ args = parse_args()
142
+ runs: list[DatasetRun] = []
143
+
144
+ for dataset in parse_dataset_list(args.datasets):
145
+ try:
146
+ runs.append(run_one(dataset, args))
147
+ except Exception as exc:
148
+ error = f"{type(exc).__name__}: {exc}"
149
+ print(f"\n*** {dataset} failed: {error}")
150
+ if args.fail_fast:
151
+ raise
152
+ traceback.print_exc()
153
+ runs.append(
154
+ DatasetRun(
155
+ dataset=dataset,
156
+ status="failed",
157
+ metrics=None,
158
+ json_report=None,
159
+ markdown_report=None,
160
+ error=error,
161
+ )
162
+ )
163
+
164
+ json_path, md_path = write_suite_report(runs, args.output_name)
165
+ print(f"\nSuite JSON report: {json_path}")
166
+ print(f"Suite Markdown report: {md_path}")
167
+
168
+ if any(run.status == "failed" for run in runs):
169
+ raise SystemExit(1)
170
+
171
+
172
+ if __name__ == "__main__":
173
+ main()
hf_cache/sentence_transformers/models--BAAI--bge-small-en-v1.5/snapshots/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/README.md ADDED
@@ -0,0 +1 @@
+ ../../blobs/8b8567d75ffa619486d9590cb0eb76d66ad46c49
load_docs.py DELETED
@@ -1,216 +0,0 @@
- import asyncio
- import hashlib
- import os
- from pathlib import Path
- from typing import Iterable, List
- from dotenv import load_dotenv
- import chromadb
- from chromadb.errors import NotFoundError
- from pypdf import PdfReader
-
- from llama_index.core import StorageContext, VectorStoreIndex
- from llama_index.core.schema import Document, BaseNode
- from llama_index.core.node_parser import SentenceSplitter
- from llama_index.vector_stores.chroma import ChromaVectorStore
-
-
- BASE_DIR = Path(__file__).resolve().parent
- KNOWLEDGE_BASE_DIR = BASE_DIR / "knowledge_base"
- RAW_DIR = KNOWLEDGE_BASE_DIR / "raw"
- CHROMA_DB_DIR = KNOWLEDGE_BASE_DIR / "chroma_db"
- HF_CACHE_DIR = BASE_DIR / "hf_cache"
- COLLECTION_NAME = "options_knowledge"
-
- EMBED_MODEL_NAME = "BAAI/bge-small-en-v1.5"
- CHUNK_SIZE = 1000
- CHUNK_OVERLAP = 150
-
- REQUIRED_METADATA = [
-     "source_file",
-     "file_name",
-     "file_type",
-     "document_title",
-     "file_hash",
-     "chunk_id",
-     "chunk_index",
- ]
-
-
- def configure_model_cache() -> None:
-     HF_CACHE_DIR.mkdir(parents=True, exist_ok=True)
-     os.environ.setdefault("HF_HOME", str(HF_CACHE_DIR))
-     os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", str(HF_CACHE_DIR / "sentence_transformers"))
-     os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
-
-
- def file_sha256(path: Path) -> str:
-     digest = hashlib.sha256()
-     with path.open("rb") as file:
-         for block in iter(lambda: file.read(1024 * 1024), b""):
-             digest.update(block)
-     return digest.hexdigest()
-
-
- def load_md_file(path: Path) -> Document:
-     text = path.read_text(encoding="utf-8")
-
-     return Document(
-         text=text,
-         metadata={
-             "source_file": str(path.resolve()),
-             "file_name": path.name,
-             "file_type": "md",
-             "document_title": path.stem,
-             "file_hash": file_sha256(path),
-         },
-     )
-
-
- def load_pdf_file(path: Path) -> List[Document]:
-     reader = PdfReader(str(path))
-     documents = []
-
-     for page_index, page in enumerate(reader.pages, start=1):
-         text = page.extract_text() or ""
-
-         if not text.strip():
-             continue
-
-         documents.append(
-             Document(
-                 text=text,
-                 metadata={
-                     "source_file": str(path.resolve()),
-                     "file_name": path.name,
-                     "file_type": "pdf",
-                     "document_title": path.stem,
-                     "file_hash": file_sha256(path),
-                     "page_number": page_index,
-                 },
-             )
-         )
-
-     return documents
-
-
- def iter_source_files(raw_dir: Path) -> Iterable[Path]:
-     supported_suffixes = {".md", ".markdown", ".pdf"}
-     for path in sorted(raw_dir.rglob("*")):
-         if path.is_file() and path.suffix.lower() in supported_suffixes:
-             yield path
-
-
- def load_docs(raw_dir: Path = RAW_DIR) -> List[Document]:
-     documents: List[Document] = []
-
-     for path in iter_source_files(raw_dir):
-         suffix = path.suffix.lower()
-
-         if suffix in {".md", ".markdown"}:
-             documents.append(load_md_file(path))
-         elif suffix == ".pdf":
-             documents.extend(load_pdf_file(path))
-
-     if not documents:
-         raise ValueError(f"No supported documents found under {raw_dir}")
-
-     return documents
-
-
- def add_chunk_metadata(nodes: List[BaseNode]) -> List[BaseNode]:
-     counters: dict[str, int] = {}
-
-     for node in nodes:
-         source_file = node.metadata["source_file"]
-         chunk_index = counters.get(source_file, 0)
-         counters[source_file] = chunk_index + 1
-
-         file_hash = node.metadata["file_hash"][:12]
-         page_number = node.metadata.get("page_number", "na")
-         chunk_id = f"{Path(source_file).stem}-{file_hash}-p{page_number}-c{chunk_index}"
-
-         node.metadata["chunk_id"] = chunk_id
-         node.metadata["chunk_index"] = chunk_index
-         node.id_ = chunk_id
-
-     return nodes
-
-
- def validate_nodes(nodes: List[BaseNode]) -> None:
-     if not nodes:
-         raise ValueError("No chunks were created from the source documents.")
-
-     for node in nodes:
-         missing = [key for key in REQUIRED_METADATA if key not in node.metadata]
-         if missing:
-             raise ValueError(f"Node {node.node_id} is missing metadata fields: {missing}")
-
-         if node.metadata["file_type"] == "pdf" and "page_number" not in node.metadata:
-             raise ValueError(f"PDF node {node.node_id} is missing page_number metadata.")
-
-
- def build_nodes(raw_dir: Path = RAW_DIR) -> List[BaseNode]:
-     documents = load_docs(raw_dir)
-     splitter = SentenceSplitter(
-         chunk_size=CHUNK_SIZE,
-         chunk_overlap=CHUNK_OVERLAP,
-     )
-     nodes = splitter.get_nodes_from_documents(documents)
-     add_chunk_metadata(nodes)
-     validate_nodes(nodes)
-     return nodes
-
-
- async def build_index(raw_dir: Path = RAW_DIR, rebuild: bool = False) -> VectorStoreIndex:
-     configure_model_cache()
-
-     from llama_index.embeddings.huggingface import HuggingFaceEmbedding
-
-     load_dotenv()
-     CHROMA_DB_DIR.mkdir(parents=True, exist_ok=True)
-
-     db = chromadb.PersistentClient(path=str(CHROMA_DB_DIR))
-
-     if rebuild:
-         try:
-             db.delete_collection(COLLECTION_NAME)
-         except (NotFoundError, ValueError):
-             pass
-
-     chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
-     vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
-     storage_context = StorageContext.from_defaults(vector_store=vector_store)
-     embed_model = HuggingFaceEmbedding(
-         model_name=EMBED_MODEL_NAME,
-         cache_folder=str(HF_CACHE_DIR / "sentence_transformers"),
-     )
-
-     if rebuild or chroma_collection.count() == 0:
-         nodes = build_nodes(raw_dir)
-         index = VectorStoreIndex(
-             nodes,
-             storage_context=storage_context,
-             embed_model=embed_model,
-             show_progress=True,
-         )
-         print(f"Indexed {len(nodes)} chunks into collection '{COLLECTION_NAME}'.")
-         return index
-
-     print(f"Loaded existing collection '{COLLECTION_NAME}' with {chroma_collection.count()} chunks.")
-     return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
-
-
- if __name__ == "__main__":
-     index = asyncio.run(build_index(rebuild=True))
-     retriever = index.as_retriever(similarity_top_k=5)
-     results = retriever.retrieve("What is volatility smile?")
-
-     print("\nTop retrieved chunks:")
-     for result in results:
-         metadata = result.node.metadata
-         source = metadata.get("file_name", "unknown")
-         page = metadata.get("page_number", "n/a")
-         score = result.score
-         print(f"- {source}, page {page}, score={score:.4f}")
-         print(result.node.get_content()[:500].replace("\n", " "))
-         print()
pyproject.toml CHANGED
@@ -17,6 +17,7 @@ dependencies = [
      "pypdf>=6.0.0",
      "tokenizers>=0.22.0,<=0.23.0",
      "transformers<5",
+     "pymupdf>=1.27.2.3",
  ]

  [build-system]
rag_pdf_optimization_notes.md ADDED
@@ -0,0 +1,282 @@
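+ # RAG PDF Extraction and Chunking Optimization Summary
+
+ This round of optimization aimed to improve how the current RAG system parses financial PDFs, especially PDFs that contain many mathematical formulas, chapter headings, and figures. The original implementation handled basic vector retrieval, but its PDF extraction, formula preservation, chunking, and metadata management were all fairly crude, which made retrieval results unstable.
+
+ ## 1. Background
+
+ The project uses `LlamaIndex + Chroma + HuggingFaceEmbedding` to build a local knowledge base. The source PDF is a book on options and volatility. The initial pipeline was roughly:
+
+ ```text
+ pypdf extracts text page by page
+ -> SentenceSplitter fixed-length chunking
+ -> HuggingFace embedding
+ -> Chroma vector store
+ -> QueryKnowledgeTool retrieves and returns snippets
+ ```
+
+ This pipeline is adequate for ordinary plain text, but financial textbook PDFs expose many problems: formulas get broken apart, chapter boundaries are lost, headers and footers interfere, figure text leaks into the body, and mathematical symbols come out in the wrong order.
+
+ ## 2. Main Problems Encountered
+
+ ### 1. Weak basic PDF text extraction
+
+ Initially, only the following was used:
+
+ ```python
+ page.extract_text()
+ ```
+
+ The problems:
+
+ - Headers, page numbers, and copyright notices leak into the body text.
+ - Line breaks and word breaks are severe; for example, words get split by PDF line wrapping.
+ - Text order is easily scrambled near multi-column layouts, figures, and formulas.
+ - Mathematical formulas are often squashed onto one line, or their symbols come out in the wrong order.
+
+ Solutions:
+
+ - Add `pypdf`'s `layout` mode as a candidate.
+ - Add coordinate-level extraction: use `visitor_text` to get the `x/y` coordinates of the text and reassemble it by visual line.
+ - Add text-cleaning logic:
+   - Remove blank lines, page numbers, and repeated headers and footers.
+   - Repair hyphenated word breaks.
+   - Handle common ligatures such as `fi` and `fl`.
+   - Preserve line breaks in formula lines rather than force-merging formulas into ordinary paragraphs.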
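The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual implementation: the function names, the ligature table, and the page-number pattern are all assumptions, and repeated header/footer removal is left out because it needs statistics across pages.

```python
import re

# Assumed mapping from common typographic ligature code points to plain ASCII.
LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi", "\ufb04": "ffl"}


def normalize_ligatures(text: str) -> str:
    """Replace ligature characters with their plain-letter equivalents."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def repair_hyphenation(text: str) -> str:
    """Join words split across a line break, e.g. 'volatil-' + 'ity'.

    Only lowercase-to-lowercase breaks are joined, so a break inside
    'Black-' + 'Scholes' is left untouched.
    """
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


def clean_page_text(text: str) -> str:
    """Normalize ligatures, repair hyphenation, drop blank lines and bare page numbers."""
    text = repair_hyphenation(normalize_ligatures(text))
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or re.fullmatch(r"\d{1,4}", stripped):
            continue
        kept.append(line.rstrip())
    return "\n".join(kept)


sample = "The volatil-\nity smile\n\n379\nis de\ufb01ned below"
print(clean_page_text(sample))
```

One known weakness of this simple rule: a genuinely hyphenated compound that happens to break at the hyphen, such as `risk-` followed by `neutral`, would be joined into one word, so a dictionary check may be needed in practice.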
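+
+ ### 2. Poor mathematical formula extraction
+
+ Many formulas in financial textbooks contain:
+
+ - Greek letters such as `𝜎`, `𝜇`, `𝜌`
+ - Superscripts and subscripts
+ - Fraction structures
+ - Integrals, summations, and square roots
+ - Equation numbers such as `(21.23)`
+
+ Ordinary PDF text extraction struggles to recover these structures. For example:
+
+ ```text
+ d𝜎 = a𝜎 dt + b𝜎 dZ
+ ```
+
+ may be extracted with symbols run together, in the wrong order, or mixed into the surrounding body text.
+
+ Solutions:
+
+ - First, make the `pypdf` extraction math-aware:
+   - Identify formula lines.
+   - Preserve line breaks for short formula lines, bracket lines, and square-root lines.
+   - Try to mark superscripts and subscripts based on font size and vertical offset.
+
+ `pypdf` later proved insufficient, so `PyMuPDF` was integrated as well.
+
+ ### 3. Too many formula false positives after the initial PyMuPDF integration
+
+ With `PyMuPDF` integrated, calling:
+
+ ```python
+ page.get_text("dict", sort=True)
+ ```
+
+ returns block, line, span, bbox, and font information. This is better suited than `pypdf` for locating formula regions.
+
+ However, the first version of formula detection had one problem: too many false positives.
+
+ For example:
+
+ - Phone numbers on the copyright page.
+ - `Black-Scholes-Merton` in ordinary body text.
+ - A single `𝜎` or `F=ma` appearing in an ordinary paragraph.
+ - Numbers on figure axes.
+
+ could all be misidentified as formulas.
+
+ Solutions:
+
+ - Switch from block-level to line-level formula detection.
+ - No longer treat ordinary italic fonts as math fonts.
+ - Tighten the formula trigger conditions:
+   - A lone Greek letter does not count as a formula.
+   - Plain `-` and `/` are not treated as strong math signals, which avoids misclassifying English hyphens as formulas.
+   - Focus on strong signals such as `=`, `∫`, `∑`, `√`, `≤`, `≥`, `∕`, and equation numbers.
+ - Add `is_useful_formula_text()` to filter out fragments that are too short, too fragmented, or lack a core formula structure.
+ - Merge formula continuation lines so that square roots, denominators, and brackets are not split into multiple isolated formula chunks.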
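The tightened trigger conditions could look something like the sketch below. It is an assumption-laden illustration, not the commit's `is_useful_formula_text()`: the function name `looks_like_formula_line`, the minimum length, and the "at most three ordinary words" prose guard are all invented here to show the idea of requiring a strong math signal.

```python
import re

# Strong math signals named in the notes; plain '-' and '/' are deliberately excluded.
STRONG_MATH_SIGNALS = set("=∫∑√≤≥∕")
# Trailing equation number such as "(21.23)".
EQUATION_NUMBER = re.compile(r"\(\d+\.\d+\)\s*$")


def looks_like_formula_line(line: str, min_chars: int = 5, max_words: int = 3) -> bool:
    """Heuristic: a formula line needs a strong signal and must not read like prose."""
    stripped = line.strip()
    if len(stripped) < min_chars:
        return False  # too short to be a useful formula fragment
    has_signal = any(char in STRONG_MATH_SIGNALS for char in stripped)
    has_number = EQUATION_NUMBER.search(stripped) is not None
    if not (has_signal or has_number):
        return False
    # Prose guard (assumed threshold): many ordinary ASCII words suggest a sentence.
    ordinary_words = re.findall(r"[A-Za-z]{4,}", stripped)
    return len(ordinary_words) <= max_words


for candidate in [
    "d𝜎 = a𝜎 dt + b𝜎 dZ (21.23)",
    "Black-Scholes-Merton",
    "𝜎",
    "Tel: (800) 762-2974",
]:
    print(f"{candidate!r}: {looks_like_formula_line(candidate)}")
```

Under these assumptions only the first candidate is accepted; the hyphenated name, the lone Greek letter, and the phone number are all rejected, which matches the false-positive cases listed above.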
+
+ The final result:
+
+ ```text
+ body-text chunk
+ formula chunk: content_type=formula
+ formula position: formula_bbox
+ formula ID: formula_id
+ ```
+
+ ### 4. No chapter- or heading-aware splitting
+
+ The original system used only fixed-length splitting:
+
+ ```python
+ SentenceSplitter(chunk_size=1000, chunk_overlap=150)
+ ```
+
+ The problems:
+
+ - A chunk may span chapters.
+ - A subsection's heading and its body may be separated.
+ - Retrieval results do not indicate which chapter or section they come from.
+ - Citations in answers are not clear enough.
+
+ Solutions:
+
+ Add a chapter- and heading-aware segmentation layer before `SentenceSplitter`:
+
+ - Recognize `CHAPTER ...`
+ - Recognize `APPENDIX ...`
+ - Recognize all-uppercase headings
+ - Recognize title-case subsection names
+ - Filter out figure captions, axis labels, short formula lines, footnotes, and ordinary explanatory sentences
+
+ and write the following to metadata:
+
+ ```python
+ chapter_title
+ section_title
+ section_path
+ page_number
+ content_type
+ formula_id
+ ```
+
+ Retrieval results can then return:
+
+ ```text
+ source: The_volatility_Smile_Wiley.pdf
+ page: 379
+ section: WITH ZERO CORRELATION
+ content_type: formula
+ formula_id: formula-378-3
+ ```
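A heading classifier in the spirit of the rules above might be sketched as follows. The function name, the length limit, and the exact patterns are assumptions made for illustration; title-case subsection detection and footnote filtering are omitted for brevity.

```python
import re

CHAPTER_PATTERN = re.compile(r"^(CHAPTER|APPENDIX)\b")
CAPTION_PATTERN = re.compile(r"^(FIGURE|TABLE)\b")
MATH_SIGNALS = set("=∫∑√≤≥")


def classify_heading(line: str, max_length: int = 80):
    """Return 'chapter', 'section', or None for a single line of page text."""
    stripped = line.strip()
    if not stripped or len(stripped) > max_length:
        return None
    if CAPTION_PATTERN.match(stripped):
        return None  # figure and table captions are not section headings
    if CHAPTER_PATTERN.match(stripped):
        return "chapter"
    if any(char in MATH_SIGNALS for char in stripped):
        return None  # short formula lines often look like headings
    letters = [char for char in stripped if char.isalpha()]
    is_all_caps = len(letters) >= 4 and stripped == stripped.upper()
    if is_all_caps and not stripped.endswith("."):
        return "section"
    return None


for candidate in ["CHAPTER 21", "WITH ZERO CORRELATION", "FIGURE 21.3 IMPLIED VOLATILITY", "The smile is steep."]:
    print(f"{candidate!r}: {classify_heading(candidate)}")
```

A caller would track the most recent chapter and section while scanning lines and attach them to each segment as `chapter_title`, `section_title`, and `section_path`.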
+
+ ### 5. Overlong metadata causes a LlamaIndex error
+
+ After formula bboxes were added, the bbox of every line was initially stored in metadata, which made the metadata too long.
+
+ The error looks like:
+
+ ```text
+ Metadata length is longer than chunk size.
+ Consider increasing the chunk size or decreasing metadata size.
+ ```
+
+ The cause is that `SentenceSplitter` counts metadata length toward the chunk length.
+
+ Solutions:
+
+ - No longer store the bbox of every line.
+ - Merge the multiple bboxes into a single bounding rectangle:
+
+ ```text
+ x0,y0,x1,y1
+ ```
+
+ This preserves the formula's position while keeping the metadata short.
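Merging per-line bboxes into one bounding rectangle is a small min/max reduction. The sketch below assumes each bbox is an `(x0, y0, x1, y1)` tuple, which is the order PyMuPDF uses; the helper names and the one-decimal formatting are illustrative choices, not taken from the commit.

```python
from typing import Iterable, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def merge_bboxes(bboxes: Iterable[BBox]) -> BBox:
    """Return the smallest rectangle that encloses every input bbox."""
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("merge_bboxes needs at least one bbox")
    return (
        min(box[0] for box in boxes),
        min(box[1] for box in boxes),
        max(box[2] for box in boxes),
        max(box[3] for box in boxes),
    )


def format_bbox(bbox: BBox) -> str:
    """Serialize as a compact 'x0,y0,x1,y1' string to keep metadata short."""
    return ",".join(f"{value:.1f}" for value in bbox)


# Hypothetical two-line formula: each line has its own bbox on the page.
line_boxes = [(72.0, 100.0, 300.0, 112.0), (90.0, 114.0, 280.5, 126.0)]
print(format_bbox(merge_bboxes(line_boxes)))
```

Storing a single short string keeps the metadata size roughly constant no matter how many lines a formula spans.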
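+
+ ### 6. Hugging Face model loading repeatedly hits the network
+
+ The embedding model is already cached locally, but `sentence-transformers` still tries to reach Hugging Face for HEAD checks. In a network-restricted environment it retries repeatedly, which stalls index building.
+
+ Solutions:
+
+ - Check whether a local snapshot exists.
+ - If it does, pass the local snapshot path directly to the embedding model.
+ - Set the offline environment variables:
+
+ ```python
+ HF_HUB_OFFLINE=1
+ TRANSFORMERS_OFFLINE=1
+ ```
+
+ Index building can then reliably use the local cache.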
200
+
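A small sketch of the snapshot check, assuming the standard Hugging Face cache layout `models--<org>--<name>/snapshots/<revision>`. `find_local_snapshot` and `resolve_model` are hypothetical names, and the `env` parameter exists only so the behaviour can be exercised without touching the real process environment.

```python
import os
from pathlib import Path
from typing import MutableMapping, Optional


def find_local_snapshot(cache_dir: Path, model_name: str) -> Optional[Path]:
    """Return the last snapshot directory (sorted by name) for the model, if any."""
    snapshots_dir = cache_dir / f"models--{model_name.replace('/', '--')}" / "snapshots"
    if not snapshots_dir.is_dir():
        return None
    snapshots = sorted(path for path in snapshots_dir.iterdir() if path.is_dir())
    return snapshots[-1] if snapshots else None


def resolve_model(
    cache_dir: Path,
    model_name: str,
    env: MutableMapping[str, str] = os.environ,
) -> str:
    """Prefer a local snapshot path and switch the Hugging Face libraries to offline mode."""
    snapshot = find_local_snapshot(cache_dir, model_name)
    if snapshot is None:
        return model_name  # nothing cached: fall back to the hub name
    env.setdefault("HF_HUB_OFFLINE", "1")
    env.setdefault("TRANSFORMERS_OFFLINE", "1")
    return str(snapshot)
```

Using `setdefault` leaves any value the user has already exported untouched, so online mode can still be forced from the shell.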
+ ### 7. Old indexes are not updated automatically
+
+ After the PDF extraction logic is upgraded, RAG does not actually improve if Chroma still holds text from the old version.
+
+ Solution:
+
+ - Add a `PDF_EXTRACTION_METHOD` version string.
+ - The current version is:
+
+ ```python
+ pymupdf_formula_blocks_v5
+ ```
+
+ - At startup, check the `extraction_method` in the Chroma metadata.
+ - If the version does not match, rebuild the index automatically.
+
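The version check can be sketched as a pure function over stored metadata. `index_is_stale` is a hypothetical helper; in practice the metadata list might come from something like `collection.get(include=["metadatas"])["metadatas"]` on a Chroma collection, which is an assumption about how the caller is wired rather than something shown here.

```python
from typing import Iterable, Mapping, Optional

PDF_EXTRACTION_METHOD = "pymupdf_formula_blocks_v5"


def index_is_stale(
    metadatas: Iterable[Optional[Mapping[str, object]]],
    current_method: str = PDF_EXTRACTION_METHOD,
) -> bool:
    """Return True if any stored PDF chunk was produced by a different extraction version."""
    for metadata in metadatas:
        if not metadata or metadata.get("file_type") != "pdf":
            continue  # only PDF chunks carry an extraction version
        if metadata.get("extraction_method") != current_method:
            return True
    return False
```

When this returns `True`, the caller would drop the collection and re-ingest, so a bumped version string is enough to invalidate old chunks.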
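+ ## III. Final Design
+
+ The final PDF RAG pipeline becomes:
+
+ ```text
+ PyMuPDF extracts block / line / span / bbox / font
+ -> identify formula lines
+ -> merge formula continuation lines
+ -> emit standalone formula documents, content_type=formula
+ -> keep [FORMULA id=...] references in the body text
+ -> clean headers, footers, and noise
+ -> pre-segment by chapter/heading
+ -> second-pass split with SentenceSplitter
+ -> write to Chroma
+ -> return page / section / content_type / formula_id at retrieval time
+ ```
+
+ Key benefits:
+
+ - Formulas can serve as independent retrieval units.
+ - The body text still keeps the formula's context.
+ - Chunks no longer depend entirely on fixed length.
+ - Retrieval results can state the source page, section, and content type.
+ - The index version is controlled, which avoids contamination by stale data.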
+
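The "standalone formula document plus in-body reference" step can be sketched like this. `split_body_and_formulas` is a hypothetical helper, and the input format (a list of `(kind, text)` pairs) is an assumption made to keep the example self-contained; the `formula-<page>-<index>` id scheme and the `[FORMULA id=...]` tag follow the notes above.

```python
from typing import Dict, List, Tuple


def split_body_and_formulas(
    page_number: int,
    blocks: List[Tuple[str, str]],
) -> Tuple[str, List[Dict[str, object]]]:
    """Return body text with formula tags, plus one record per formula."""
    body_parts: List[str] = []
    formulas: List[Dict[str, object]] = []
    for kind, text in blocks:
        if kind == "formula":
            formula_id = f"formula-{page_number}-{len(formulas)}"
            formulas.append({
                "formula_id": formula_id,
                "content_type": "formula",
                "page_number": page_number,
                "text": text,
            })
            # Keep the formula inline so the surrounding prose retains its context.
            body_parts.append(f"[FORMULA id={formula_id}]\n{text}\n[/FORMULA]")
        else:
            body_parts.append(text)
    return "\n".join(body_parts), formulas
```

Because the same id appears in both outputs, a hit on a formula chunk can be traced back to the body chunk that discusses it.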
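+ ## IV. How to Answer in an Interview
+
+ A possible summary:
+
+ > Our RAG initially just used `pypdf` to extract text page by page and then split it by fixed length. That works for ordinary documents but not for finance textbooks, which contain many mathematical formulas, figures, and chapter structure. The main problems were scrambled formula order, lost superscripts and subscripts, headers and footers leaking into the text, and chunks that span chapters.
+
+ Then describe the fix:
+
+ > I started with basic cleaning: de-duplicating headers and footers, repairing hyphenated words, and preserving line breaks on formula lines. I then found that `pypdf` has limited ability to locate formula regions, so I brought in `PyMuPDF` and used the block, line, span, bbox, and font information it returns to identify formula regions separately. Formulas are indexed as standalone chunks with `content_type=formula`, and the body text keeps `[FORMULA id=...]`, so both formula retrieval and context retrieval are covered.
+
+ Then the engineering trade-offs:
+
+ > Formula detection cannot simply treat any Greek letter as a formula, or ordinary body text gets misclassified heavily. So I tightened the rules to strong mathematical signals such as equals signs, integrals, sums, radicals, equation numbers, and comparison operators, and filtered out fragments that are too short. The bboxes also cannot all be written to metadata line by line, because LlamaIndex counts metadata toward the chunk length, so I merged them into a single bounding rectangle.
+
+ Finally the results:
+
+ > After the optimization, the index changed from plain-text chunks to a mix of body chunks and formula chunks. Every retrieval result carries metadata such as page, section, content_type, and formula_id, which makes sources easier to locate when answering and suits questions like "what does this formula mean".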
+
+ ## V. Possible Further Improvements
+
+ PyMuPDF is now integrated, but this is not yet full OCR/LaTeX formula recognition. Possible next steps:
+
+ 1. Crop the `formula_bbox` region to an image.
+ 2. Integrate a formula OCR model, such as LaTeX OCR.
+ 3. Convert the formula image to LaTeX.
+ 4. Store the following together in metadata:
+
+ ```python
+ formula_text_raw
+ formula_latex
+ formula_bbox
+ page_number
+ section_path
+ ```
+
+ 5. Weight formula queries separately at retrieval time, or use hybrid search.
+ 6. Add a reranker to improve ranking quality for formula-related questions.
+
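+ ## VI. One-Sentence Summary
+
+ The core of this optimization is not simply swapping in another PDF parser, but upgrading PDF parsing from "extract plain text page by page" to "structurally parse body text, chapters, and formula regions", so that RAG chunks sit closer to the semantic boundaries a human reader perceives.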
requirements.txt CHANGED
@@ -4,6 +4,7 @@ requests
  duckduckgo_search
  pandas
  pypdf
+ PyMuPDF
  chromadb
  llama-index-core
  llama-index-embeddings-huggingface
tools/query_knowledge.py ADDED
@@ -0,0 +1,1196 @@
+ from smolagents.tools import Tool
+ import asyncio
+ from collections import Counter
+ import hashlib
+ import logging
+ import os
+ from pathlib import Path
+ from typing import Iterable, List, Optional
+ import re
+ from dotenv import load_dotenv
+ import chromadb
+ from chromadb.errors import NotFoundError
+ from pypdf import PdfReader
+
+ from llama_index.core import StorageContext, VectorStoreIndex
+ from llama_index.core.schema import Document, BaseNode
+ from llama_index.core.node_parser import SentenceSplitter
+ from llama_index.vector_stores.chroma import ChromaVectorStore
+
+
+ BASE_DIR = Path(__file__).resolve().parent
+ KNOWLEDGE_BASE_DIR = BASE_DIR / "knowledge_base"
+ RAW_DIR = KNOWLEDGE_BASE_DIR / "raw"
+ CHROMA_DB_DIR = KNOWLEDGE_BASE_DIR / "chroma_db"
+ HF_CACHE_DIR = BASE_DIR / "hf_cache"
+ COLLECTION_NAME = "options_knowledge"
+
+ EMBED_MODEL_NAME = "BAAI/bge-small-en-v1.5"
+ CHUNK_SIZE = 1000
+ CHUNK_OVERLAP = 150
+ PDF_REPEATED_LINE_MIN_PAGES = 3
+ PDF_BOUNDARY_LINE_COUNT = 4
+ PDF_EXTRACTION_METHOD = "pymupdf_formula_blocks_v5"
+ PDF_LINE_Y_TOLERANCE = 3.0
+ PDF_MIN_SECTION_CHARS = 240
+ PDF_STRONG_MATH_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_σΣΔδθΘλΛμρπΠφΦτν𝜎𝜇𝜌𝜃𝜕")
+ PDF_WEAK_MATH_SYMBOLS = set("+-−*/∕<>")
+ PDF_MATH_SYMBOLS = PDF_STRONG_MATH_SYMBOLS | PDF_WEAK_MATH_SYMBOLS
+ PDF_OPERATOR_MATH_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_+-−*/∕<>")
+ PDF_FORMULA_TRIGGER_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_∕<>")
+
+ logging.getLogger("pypdf").setLevel(logging.ERROR)
+
+
+ def load_pymupdf():
+     try:
+         import fitz
+     except ImportError:
+         return None
+
+     return fitz
+
+
+ REQUIRED_METADATA = [
+     "source_file",
+     "file_name",
+     "file_type",
+     "document_title",
+     "file_hash",
+     "chunk_id",
+     "chunk_index",
+ ]
+
+
+ def configure_model_cache() -> None:
+     HF_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+     os.environ.setdefault("HF_HOME", str(HF_CACHE_DIR))
+     os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", str(
+         HF_CACHE_DIR / "sentence_transformers"))
+     os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
+     cached_model_dir = (
+         HF_CACHE_DIR
+         / "sentence_transformers"
+         / f"models--{EMBED_MODEL_NAME.replace('/', '--')}"
+     )
+     if cached_model_dir.exists():
+         os.environ.setdefault("HF_HUB_OFFLINE", "1")
+         os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
+
+
+ def resolve_embed_model_name() -> str:
+     cached_model_dir = (
+         HF_CACHE_DIR
+         / "sentence_transformers"
+         / f"models--{EMBED_MODEL_NAME.replace('/', '--')}"
+     )
+     snapshots_dir = cached_model_dir / "snapshots"
+     if snapshots_dir.exists():
+         snapshots = sorted(path for path in snapshots_dir.iterdir() if path.is_dir())
+         if snapshots:
+             return str(snapshots[-1])
+
+     return EMBED_MODEL_NAME
+
+
+ def file_sha256(path: Path) -> str:
+     digest = hashlib.sha256()
+     with path.open("rb") as file:
+         for block in iter(lambda: file.read(1024 * 1024), b""):
+             digest.update(block)
+     return digest.hexdigest()
+
+
+ def load_md_file(path: Path) -> Document:
+     text = path.read_text(encoding="utf-8")
+
+     return Document(
+         text=text,
+         metadata={
+             "source_file": str(path.resolve()),
+             "file_name": path.name,
+             "file_type": "md",
+             "document_title": path.stem,
+             "file_hash": file_sha256(path),
+         },
+     )
+
+
+ def append_visual_fragment(line_parts: List[str], text: str, baseline_y: float, item: dict) -> None:
+     if not text:
+         return
+
+     stripped = text.strip()
+     if not stripped:
+         return
+
+     font_size = item["font_size"]
+     y_offset = item["y"] - baseline_y
+     is_small = font_size < item["line_font_size"] * 0.82
+
+     if is_small and y_offset > max(1.5, item["line_font_size"] * 0.18):
+         line_parts.append(f"^{{{stripped}}}")
+     elif is_small and y_offset < -max(1.5, item["line_font_size"] * 0.18):
+         line_parts.append(f"_{{{stripped}}}")
+     else:
+         line_parts.append(stripped)
+
+
+ def join_visual_line(items: List[dict]) -> str:
+     if not items:
+         return ""
+
+     items = sorted(items, key=lambda value: value["x"])
+     baseline_y = sorted(item["y"] for item in items)[len(items) // 2]
+     line_font_size = max(item["font_size"] for item in items)
+     previous_right = None
+     line_parts: List[str] = []
+
+     for item in items:
+         item["line_font_size"] = line_font_size
+         if previous_right is not None:
+             gap = item["x"] - previous_right
+             if gap > max(2.5, line_font_size * 0.28):
+                 line_parts.append(" ")
+
+         append_visual_fragment(line_parts, item["text"], baseline_y, item)
+         previous_right = max(previous_right or item["x"], item["x"] + item["width"])
+
+     return normalize_pdf_line("".join(line_parts))
+
+
+ def extract_pdf_text_by_position(page) -> str:
+     fragments: List[dict] = []
+
+     def visitor_text(text, cm, tm, font_dict, font_size):
+         if not text or not text.strip():
+             return
+
+         x = float(tm[4])
+         y = float(tm[5])
+         width = max(len(text.strip()) * float(font_size) * 0.45, float(font_size))
+         fragments.append(
+             {
+                 "text": text,
+                 "x": x,
+                 "y": y,
+                 "width": width,
+                 "font_size": float(font_size or 1.0),
+             }
+         )
+
+     try:
+         page.extract_text(visitor_text=visitor_text)
+     except Exception:
+         return ""
+
+     if not fragments:
+         return ""
+
+     lines: List[List[dict]] = []
+     for fragment in sorted(fragments, key=lambda value: (-value["y"], value["x"])):
+         for line in lines:
+             if abs(line[0]["y"] - fragment["y"]) <= PDF_LINE_Y_TOLERANCE:
+                 line.append(fragment)
+                 break
+         else:
+             lines.append([fragment])
+
+     return "\n".join(join_visual_line(line) for line in lines)
+
+
+ def math_text_score(text: str) -> float:
+     if not text.strip():
+         return 0.0
+
+     lines = [line for line in text.splitlines() if line.strip()]
+     compact_length = len(re.sub(r"\s+", "", text))
+     math_symbol_count = sum(1 for char in text if char in PDF_MATH_SYMBOLS)
+     superscript_markers = text.count("^{") + text.count("_{")
+     multiline_bonus = sum(1 for line in lines if is_formula_like(line)) * 8
+     equation_block_bonus = sum(
+         1
+         for index, line in enumerate(lines)
+         if is_formula_like(line)
+         and (
+             index > 0
+             and is_formula_like(lines[index - 1])
+             or index + 1 < len(lines)
+             and is_formula_like(lines[index + 1])
+         )
+     ) * 12
+     return (
+         compact_length
+         + math_symbol_count * 12
+         + superscript_markers * 20
+         + multiline_bonus
+         + equation_block_bonus
+     )
+
+
+ def extract_pdf_text(page) -> str:
+     positioned_text = extract_pdf_text_by_position(page)
+
+     try:
+         layout_text = page.extract_text(extraction_mode="layout") or ""
+     except Exception:
+         layout_text = ""
+
+     try:
+         plain_text = page.extract_text() or ""
+     except Exception:
+         plain_text = ""
+
+     candidates = [positioned_text, layout_text, plain_text]
+     candidates = [candidate for candidate in candidates if candidate.strip()]
+     if not candidates:
+         return ""
+
+     return max(candidates, key=math_text_score)
+
+
+ def pymupdf_span_text(span: dict) -> str:
+     return normalize_pdf_line(span.get("text", ""))
+
+
+ def pymupdf_line_text(line: dict) -> str:
+     return normalize_pdf_line("".join(pymupdf_span_text(span) for span in line.get("spans", [])))
+
+
+ def pymupdf_block_text(block: dict) -> str:
+     lines = [
+         pymupdf_line_text(line)
+         for line in block.get("lines", [])
+     ]
+     return "\n".join(line for line in lines if line)
+
+
+ def pymupdf_span_has_math_font(span: dict) -> bool:
+     font_name = span.get("font", "").lower()
+     return any(
+         marker in font_name
+         for marker in ("math", "symbol", "cmmi", "cmsy", "cmex", "stix")
+     )
+
+
+ def is_formula_block_line(line: str) -> bool:
+     stripped = line.strip()
+     if not stripped:
+         return False
+
+     trigger_math_count = sum(1 for char in stripped if char in PDF_FORMULA_TRIGGER_SYMBOLS)
+     digit_count = sum(1 for char in stripped if char.isdigit())
+     alpha_count = sum(1 for char in stripped if char.isalpha())
+     alpha_words = [
+         word
+         for word in re.findall(r"[A-Za-z]+", stripped)
+         if word.lower() not in {"and", "or", "the", "where", "then", "with", "for"}
+     ]
+     compact_length = len(re.sub(r"\s+", "", stripped))
+
+     if compact_length < 3:
+         return False
+     if re.fullmatch(r"\(?\d+(\.\d+)?\)?", stripped):
+         return False
+     if re.search(r"\(\d+(\.\d+)+[a-z]?\)$", stripped) and compact_length <= 240:
+         return True
+     if "=" in stripped and compact_length <= 260 and len(alpha_words) <= 12:
+         return True
+     if any(char in stripped for char in "∂∫∑∏√∞≈≠≤≥±×÷") and compact_length <= 220 and len(alpha_words) <= 10:
+         return True
+     if trigger_math_count >= 2 and compact_length <= 120 and len(alpha_words) <= 6:
+         return True
+     if trigger_math_count >= 1 and digit_count >= 1 and alpha_count <= 18 and compact_length <= 100:
+         return True
+
+     return False
+
+
+ def is_formula_block(block: dict) -> bool:
+     text = pymupdf_block_text(block)
+     if not text:
+         return False
+
+     lines = [line for line in text.splitlines() if line.strip()]
+     if any(is_formula_block_line(line) for line in lines):
+         return True
+
+     spans = [
+         span
+         for line in block.get("lines", [])
+         for span in line.get("spans", [])
+         if pymupdf_span_text(span)
+     ]
+     if not spans:
+         return False
+
+     math_font_count = sum(1 for span in spans if pymupdf_span_has_math_font(span))
+     strong_math_count = sum(1 for char in text if char in PDF_STRONG_MATH_SYMBOLS)
+     alpha_count = sum(1 for char in text if char.isalpha())
+     digit_count = sum(1 for char in text if char.isdigit())
+     compact_length = len(re.sub(r"\s+", "", text))
+
+     if math_font_count >= 2 and compact_length <= 220:
+         return True
+     if strong_math_count >= 3 and compact_length <= 260:
+         return True
+     if strong_math_count >= 1 and digit_count >= 1 and alpha_count <= 20 and compact_length <= 160:
+         return True
+
+     return False
+
+
+ def block_bbox_string(block: dict) -> str:
+     bbox = block.get("bbox") or []
+     if len(bbox) != 4:
+         return ""
+     return ",".join(f"{float(value):.2f}" for value in bbox)
+
+
+ def line_bbox_string(line: dict) -> str:
+     bbox = line.get("bbox") or []
+     if len(bbox) != 4:
+         return ""
+     return ",".join(f"{float(value):.2f}" for value in bbox)
+
+
+ def pymupdf_line_has_math_font(line: dict) -> bool:
+     return any(
+         pymupdf_span_has_math_font(span)
+         for span in line.get("spans", [])
+         if pymupdf_span_text(span)
+     )
+
+
+ def should_extract_formula_line(line: dict) -> bool:
+     text = pymupdf_line_text(line)
+     if not text:
+         return False
+
+     if is_formula_block_line(text):
+         return True
+
+     compact_length = len(re.sub(r"\s+", "", text))
+     trigger_math_count = sum(1 for char in text if char in PDF_FORMULA_TRIGGER_SYMBOLS)
+     alpha_words = re.findall(r"[A-Za-z]+", text)
+     if (
+         pymupdf_line_has_math_font(line)
+         and trigger_math_count >= 1
+         and compact_length <= 180
+         and len(alpha_words) <= 6
+     ):
+         return True
+
+     return False
+
+
+ def is_formula_continuation_line(text: str) -> bool:
+     stripped = text.strip()
+     if not stripped:
+         return False
+
+     compact = re.sub(r"\s+", "", stripped)
+     if len(compact) > 90:
+         return False
+     if compact in {"(", ")", "[", "]", "{", "}", "√"}:
+         return True
+
+     alpha_words = re.findall(r"[A-Za-z]+", stripped)
+     math_count = sum(1 for char in stripped if char in PDF_MATH_SYMBOLS)
+     digit_count = sum(1 for char in stripped if char.isdigit())
+
+     if len(alpha_words) <= 4 and (math_count >= 1 or digit_count >= 1):
+         return True
+
+     return False
+
+
+ def append_formula_block(
+     formula_blocks: List[dict],
+     body_blocks: List[str],
+     page_number: int,
+     formula_index: int,
+     formula_lines: List[str],
+     formula_bboxes: List[str],
+ ) -> int:
+     formula_text = clean_formula_text("\n".join(formula_lines))
+     if not is_useful_formula_text(formula_text):
+         return formula_index
+
+     formula_id = f"formula-{page_number}-{formula_index}"
+     formula_bbox = merge_bbox_strings(formula_bboxes)
+     formula_blocks.append(
+         {
+             "id": formula_id,
+             "text": formula_text,
+             "bbox": formula_bbox,
+         }
+     )
+     body_blocks.append(f"[FORMULA id={formula_id}]\n{formula_text}\n[/FORMULA]")
+     return formula_index + 1
+
+
+ def merge_bbox_strings(bbox_strings: List[str]) -> str:
+     boxes = []
+     for bbox_string in bbox_strings:
+         if not bbox_string:
+             continue
+         values = bbox_string.split(",")
+         if len(values) != 4:
+             continue
+         try:
+             boxes.append([float(value) for value in values])
+         except ValueError:
+             continue
+
+     if not boxes:
+         return ""
+
+     x0 = min(box[0] for box in boxes)
+     y0 = min(box[1] for box in boxes)
+     x1 = max(box[2] for box in boxes)
+     y1 = max(box[3] for box in boxes)
+     return f"{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f}"
+
+
+ def is_useful_formula_text(text: str) -> bool:
+     stripped = text.strip()
+     if not stripped:
+         return False
+
+     compact_length = len(re.sub(r"\s+", "", stripped))
+     if compact_length < 6:
+         return False
+
+     lines = [line.strip() for line in stripped.splitlines() if line.strip()]
+     if re.search(r"\(\d+(\.\d+)+[a-z]?\)", stripped):
+         return True
+     if any(char in stripped for char in "∂∫∑∏∞≈≠≤≥±×÷"):
+         alpha_words = re.findall(r"[A-Za-z]+", stripped)
+         return len(alpha_words) <= 12 or "=" in stripped
+
+     for line in lines:
+         if "=" not in line:
+             continue
+
+         alpha_words = [
+             word
+             for word in re.findall(r"[A-Za-z]+", line)
+             if word.lower() not in {"and", "or", "the", "where", "then", "with", "for"}
+         ]
+         if len(alpha_words) <= 12 and len(line) <= 260:
+             return True
+
+     return False
+
+
+ def extract_pymupdf_page(page) -> dict:
+     page_dict = page.get_text("dict", sort=True)
+     body_blocks: List[str] = []
+     formula_blocks: List[dict] = []
+     formula_lines: List[str] = []
+     formula_bboxes: List[str] = []
+     formula_index = 0
+     page_number = page.number + 1
+
+     for block in page_dict.get("blocks", []):
+         if block.get("type") != 0:
+             continue
+
+         normal_lines: List[str] = []
+
+         for line in block.get("lines", []):
+             line_text = pymupdf_line_text(line)
+             if not line_text:
+                 continue
+
+             if should_extract_formula_line(line) or (
+                 formula_lines and is_formula_continuation_line(line_text)
+             ):
+                 if normal_lines:
+                     body_blocks.append("\n".join(normal_lines))
+                     normal_lines = []
+                 formula_lines.append(line_text)
+                 formula_bboxes.append(line_bbox_string(line))
+             else:
+                 if formula_lines:
+                     formula_index = append_formula_block(
+                         formula_blocks=formula_blocks,
+                         body_blocks=body_blocks,
+                         page_number=page_number,
+                         formula_index=formula_index,
+                         formula_lines=formula_lines,
+                         formula_bboxes=formula_bboxes,
+                     )
+                     formula_lines = []
+                     formula_bboxes = []
+                 normal_lines.append(line_text)
+
+         if normal_lines:
+             body_blocks.append("\n".join(normal_lines))
+
+     if formula_lines:
+         append_formula_block(
+             formula_blocks=formula_blocks,
+             body_blocks=body_blocks,
+             page_number=page_number,
+             formula_index=formula_index,
+             formula_lines=formula_lines,
+             formula_bboxes=formula_bboxes,
+         )
+
+     return {
+         "text": "\n".join(body_blocks),
+         "formula_blocks": formula_blocks,
+         "backend": "pymupdf",
+     }
+
+
+ def extract_pdf_pages_with_pymupdf(path: Path) -> Optional[List[dict]]:
+     fitz = load_pymupdf()
+     if fitz is None:
+         return None
+
+     try:
+         document = fitz.open(str(path))
+     except Exception:
+         return None
+
+     try:
+         return [extract_pymupdf_page(page) for page in document]
+     finally:
+         document.close()
+
+
+ def clean_formula_text(text: str) -> str:
+     lines = page_lines(text)
+     if not lines:
+         return ""
+
+     text = "\n".join(lines)
+     text = re.sub(r"[ \t]+", " ", text)
+     text = re.sub(r"\n{3,}", "\n\n", text)
+     return text.strip()
+
+
+ def normalize_pdf_line(line: str) -> str:
+     line = line.replace("\x00", " ")
+     line = line.replace("\ufb00", "ff")
+     line = line.replace("\ufb01", "fi")
+     line = line.replace("\ufb02", "fl")
+     line = line.replace("\ufb03", "ffi")
+     line = line.replace("\ufb04", "ffl")
+     line = re.sub(r"[ \t]+", " ", line)
+     return line.strip()
+
+
+ def is_noise_line(line: str) -> bool:
+     if not line:
+         return True
+     if re.fullmatch(r"\d+", line):
+         return True
+     if re.fullmatch(r"page\s+\d+(\s+of\s+\d+)?", line, flags=re.IGNORECASE):
+         return True
+     if re.fullmatch(r"[-_=\s]{3,}", line):
+         return True
+     return False
+
+
+ def is_formula_like(line: str) -> bool:
+     stripped = line.strip()
+     if not stripped:
+         return False
+
+     strong_math_count = sum(1 for char in stripped if char in PDF_STRONG_MATH_SYMBOLS)
+     weak_math_count = sum(1 for char in stripped if char in PDF_WEAK_MATH_SYMBOLS)
+     alpha_count = sum(1 for char in stripped if char.isalpha())
+     digit_count = sum(1 for char in stripped if char.isdigit())
+     compact = stripped.replace(" ", "")
+
+     if "={" in compact or "^{" in compact or "_{" in compact:
+         return True
+     if compact in {"(", ")", "[", "]", "{", "}"}:
+         return True
+     if len(compact) <= 40 and any(char in compact for char in PDF_MATH_SYMBOLS):
+         return True
+     if strong_math_count >= 2 and len(stripped) <= 180:
+         return True
+     if strong_math_count >= 1 and weak_math_count >= 1 and len(stripped) <= 180:
+         return True
+     if "=" in stripped and (alpha_count + digit_count) >= 2 and len(stripped) <= 220:
+         return True
+     if re.search(r"\b(d|D|exp|ln|sqrt|max|min|var|cov)\s*[\(\[]", stripped):
+         return True
+     if alpha_count <= 4 and (strong_math_count + weak_math_count) >= 1 and digit_count >= 1:
+         return True
+
+     return False
+
+
+ def normalized_line_key(line: str) -> str:
+     return re.sub(r"\d+", "#", line.lower()).strip()
+
+
+ def page_lines(text: str) -> List[str]:
+     lines = []
+     for line in text.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
+         normalized = normalize_pdf_line(line)
+         if not is_noise_line(normalized):
+             lines.append(normalized)
+     return lines
+
+
+ def find_repeated_boundary_lines(raw_pages: List[str]) -> set[str]:
+     counter: Counter[str] = Counter()
+
+     for raw_text in raw_pages:
+         lines = page_lines(raw_text)
+         boundary_lines = lines[:PDF_BOUNDARY_LINE_COUNT] + lines[-PDF_BOUNDARY_LINE_COUNT:]
+         counter.update(
+             normalized_line_key(line)
+             for line in boundary_lines
+             if 3 <= len(line) <= 140
+         )
+
+     min_count = min(
+         PDF_REPEATED_LINE_MIN_PAGES,
+         max(2, len(raw_pages) // 3),
+     )
+     return {line for line, count in counter.items() if count >= min_count}
+
+
+ def clean_pdf_text(text: str, repeated_boundary_lines: set[str]) -> str:
+     lines = page_lines(text)
+     cleaned_lines = []
+
+     for index, line in enumerate(lines):
+         is_boundary = (
+             index < PDF_BOUNDARY_LINE_COUNT
+             or index >= len(lines) - PDF_BOUNDARY_LINE_COUNT
+         )
+         if is_boundary and normalized_line_key(line) in repeated_boundary_lines:
+             continue
+         cleaned_lines.append(line)
+
+     merged_lines = []
+     for line in cleaned_lines:
+         if merged_lines and merged_lines[-1].endswith("-") and line[:1].islower():
+             merged_lines[-1] = merged_lines[-1][:-1] + line
+         else:
+             merged_lines.append(line)
+
+     text = "\n".join(merged_lines)
+     text = preserve_math_line_breaks(text)
+     text = re.sub(r"[ \t]+", " ", text)
+     text = re.sub(r"\n{3,}", "\n\n", text)
+     return text.strip()
+
+
+ def preserve_math_line_breaks(text: str) -> str:
+     lines = text.split("\n")
+     if not lines:
+         return ""
+
+     output = [lines[0]]
+     in_formula_block = is_formula_like(lines[0])
+     for line in lines[1:]:
+         previous = output[-1]
+         line_is_formula = is_formula_like(line)
+         previous_is_formula = is_formula_like(previous)
+
+         if previous_is_formula or line_is_formula or in_formula_block:
+             output.append(line)
+             in_formula_block = line_is_formula or (
+                 in_formula_block
+                 and not line.endswith((".", ";", ":", "?", "!"))
+             )
+         elif previous.endswith((".", ":", ";", "?", "!", ")")):
+             output.append(line)
+             in_formula_block = False
+         else:
+             output[-1] = f"{previous} {line}"
+             in_formula_block = False
+
+     return "\n".join(output)
+
+
+ def is_chapter_heading(line: str) -> bool:
+     return bool(re.fullmatch(
+         r"(chapter|appendix)\s+([0-9]+|[ivxlcdm]+|[a-z])",
+         line.strip(),
+         flags=re.IGNORECASE,
+     ))
+
+
+ def titlecase_word_ratio(words: List[str]) -> float:
+     candidate_words = [
+         word.strip("()[]{}:;,.")
+         for word in words
+         if any(char.isalpha() for char in word)
+     ]
+     if not candidate_words:
+         return 0.0
+
+     titlecase_words = [
+         word
+         for word in candidate_words
+         if word[:1].isupper()
+         or word.lower() in {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to", "with"}
+     ]
+     return len(titlecase_words) / len(candidate_words)
+
+
+ def uppercase_letter_ratio(text: str) -> float:
+     letters = [char for char in text if char.isalpha()]
+     if not letters:
+         return 0.0
+     return sum(1 for char in letters if char.isupper()) / len(letters)
+
+
+ def is_section_heading(line: str) -> bool:
+     stripped = line.strip()
+     if not 4 <= len(stripped) <= 150:
+         return False
+     letters = [char for char in stripped if char.isalpha()]
+     digit_count = sum(1 for char in stripped if char.isdigit())
+     alpha_words = [
+         word.strip("()[]{}:;,.")
+         for word in stripped.split()
+         if any(char.isalpha() for char in word)
+     ]
+     if len(letters) < 6 or len(alpha_words) < 2:
+         return False
+     if digit_count > max(4, len(letters)):
+         return False
+     if "%" in stripped and digit_count >= len(letters) / 2:
+         return False
+     numbered_heading = bool(re.match(r"^\d+(\.\d+)+\s+", stripped))
+     if stripped[:1].isdigit() and not numbered_heading:
+         return False
+     if re.match(
+         r"^(in|from|where|thus|then|now|let|because|while|figure|table|for)\b",
+         stripped,
+         flags=re.IGNORECASE,
+     ):
+         return False
+     if is_formula_like(stripped):
+         return False
+     if stripped.endswith((".", ",", ";")):
+         return False
+     if re.match(r"^(figure|table)\s+\d", stripped, flags=re.IGNORECASE):
+         return False
+     if numbered_heading:
+         return True
+
+     words = stripped.split()
+     if len(words) > 16:
+         return False
+     if uppercase_letter_ratio(stripped) >= 0.72 and len(words) >= 2:
+         return True
+     if len(words) >= 4 and titlecase_word_ratio(words) >= 0.68:
+         return True
+
+     return False
+
+
+ def make_section_path(chapter_title: str, section_title: str) -> str:
+     if chapter_title and section_title and section_title != chapter_title:
+         return f"{chapter_title} > {section_title}"
+     return section_title or chapter_title
+
+
+ def split_pdf_page_into_sections(
+     path: Path,
+     page_index: int,
+     text: str,
+     file_hash: str,
+     section_state: dict,
+     extraction_backend: str,
+     formula_count: int,
+ ) -> List[Document]:
+     documents = []
+     lines = text.splitlines()
+     pending_lines: List[str] = []
+     pending_metadata = {
+         "chapter_title": section_state.get("chapter_title", ""),
+         "section_title": section_state.get("section_title", ""),
+     }
+
+     def flush_pending() -> None:
+         nonlocal pending_lines, pending_metadata
+         section_text = "\n".join(line for line in pending_lines if line.strip()).strip()
+         if not section_text:
+             pending_lines = []
+             return
+
+         chapter_title = pending_metadata.get("chapter_title", "")
+         section_title = pending_metadata.get("section_title", "")
+         documents.append(
+             Document(
+                 text=section_text,
+                 metadata={
+                     "source_file": str(path.resolve()),
+                     "file_name": path.name,
+                     "file_type": "pdf",
+                     "document_title": path.stem,
+                     "file_hash": file_hash,
+                     "page_number": page_index,
+                     "extraction_method": PDF_EXTRACTION_METHOD,
+                     "extraction_backend": extraction_backend,
+                     "char_count": len(section_text),
+                     "formula_count": formula_count,
+                     "content_type": "text",
+                     "chapter_title": chapter_title,
+                     "section_title": section_title,
+                     "section_path": make_section_path(chapter_title, section_title),
+                 },
+             )
+         )
+         pending_lines = []
+
+     for line in lines:
+         stripped = line.strip()
+         if not stripped:
+             continue
+
+         if is_chapter_heading(stripped):
+             if len("\n".join(pending_lines)) >= PDF_MIN_SECTION_CHARS:
+                 flush_pending()
+             section_state["pending_chapter_label"] = stripped.title()
+             section_state["chapter_title"] = stripped.title()
+             section_state["section_title"] = stripped.title()
+             pending_metadata = {
+                 "chapter_title": section_state["chapter_title"],
+                 "section_title": section_state["section_title"],
+             }
+             pending_lines.append(stripped)
+             continue
+
+         if section_state.get("pending_chapter_label") and is_section_heading(stripped):
+             if pending_lines == [section_state["pending_chapter_label"]]:
+                 pending_lines[0] = f"{section_state['pending_chapter_label']}: {stripped}"
+             else:
+                 pending_lines.append(stripped)
+
+             section_state["chapter_title"] = pending_lines[-1]
+             section_state["section_title"] = pending_lines[-1]
+             section_state["pending_chapter_label"] = ""
+             pending_metadata = {
+                 "chapter_title": section_state["chapter_title"],
+                 "section_title": section_state["section_title"],
+             }
+             continue
+
+         if is_section_heading(stripped):
+             if len("\n".join(pending_lines)) >= PDF_MIN_SECTION_CHARS:
+                 flush_pending()
+             section_state["section_title"] = stripped
+             section_state["pending_chapter_label"] = ""
+             pending_metadata = {
+                 "chapter_title": section_state.get("chapter_title", ""),
+                 "section_title": section_state["section_title"],
+             }
+
+         pending_lines.append(stripped)
+
+     flush_pending()
+     return documents
+
+
900
+ def make_formula_documents(
+     path: Path,
+     page_index: int,
+     formula_blocks: List[dict],
+     file_hash: str,
+     extraction_backend: str,
+ ) -> List[Document]:
+     documents = []
+
+     for formula_index, formula in enumerate(formula_blocks):
+         formula_text = formula.get("text", "").strip()
+         if not formula_text:
+             continue
+
+         documents.append(
+             Document(
+                 text=f"[FORMULA]\n{formula_text}\n[/FORMULA]",
+                 metadata={
+                     "source_file": str(path.resolve()),
+                     "file_name": path.name,
+                     "file_type": "pdf",
+                     "document_title": path.stem,
+                     "file_hash": file_hash,
+                     "page_number": page_index,
+                     "extraction_method": PDF_EXTRACTION_METHOD,
+                     "extraction_backend": extraction_backend,
+                     "char_count": len(formula_text),
+                     "content_type": "formula",
+                     "formula_id": formula.get("id", f"formula-{page_index}-{formula_index}"),
+                     "formula_index": formula_index,
+                     "formula_bbox": formula.get("bbox", ""),
+                     "formula_count": 1,
+                     "chapter_title": "",
+                     "section_title": "",
+                     "section_path": "",
+                 },
+             )
+         )
+
+     return documents
+
+
+ def load_pdf_file(path: Path) -> List[Document]:
+     reader = PdfReader(str(path))
+     documents = []
+     pymupdf_pages = extract_pdf_pages_with_pymupdf(path)
+
+     if pymupdf_pages:
+         page_payloads = pymupdf_pages
+     else:
+         page_payloads = [
+             {
+                 "text": extract_pdf_text(page),
+                 "formula_blocks": [],
+                 "backend": "pypdf",
+             }
+             for page in reader.pages
+         ]
+
+     raw_pages = [payload["text"] for payload in page_payloads]
+     repeated_boundary_lines = find_repeated_boundary_lines(raw_pages)
+     file_hash = file_sha256(path)
+     section_state: dict = {
+         "chapter_title": "",
+         "section_title": "",
+         "pending_chapter_label": "",
+     }
+
+     for page_index, payload in enumerate(page_payloads, start=1):
+         raw_text = payload["text"]
+         text = clean_pdf_text(raw_text, repeated_boundary_lines)
+         formula_blocks = payload.get("formula_blocks", [])
+         extraction_backend = payload.get("backend", "pypdf")
+
+         if not text.strip():
+             documents.extend(
+                 make_formula_documents(
+                     path=path,
+                     page_index=page_index,
+                     formula_blocks=formula_blocks,
+                     file_hash=file_hash,
+                     extraction_backend=extraction_backend,
+                 )
+             )
+             continue
+
+         documents.extend(
+             split_pdf_page_into_sections(
+                 path=path,
+                 page_index=page_index,
+                 text=text,
+                 file_hash=file_hash,
+                 section_state=section_state,
+                 extraction_backend=extraction_backend,
+                 formula_count=len(formula_blocks),
+             )
+         )
+         documents.extend(
+             make_formula_documents(
+                 path=path,
+                 page_index=page_index,
+                 formula_blocks=formula_blocks,
+                 file_hash=file_hash,
+                 extraction_backend=extraction_backend,
+             )
+         )
+
+     return documents
+
+
+ def load_txt_file(path: Path) -> List[Document]:
+     # TODO: implement plain-text loading; until then this loader returns no documents.
+     return []
+
+
+ def iter_source_files(raw_dir: Path) -> Iterable[Path]:
+     # ".txt" is not listed here, so the ".txt" branch in load_docs is unreachable for now.
+     supported_suffixes = {".md", ".markdown", ".pdf"}
+     for path in sorted(raw_dir.rglob("*")):
+         if path.is_file() and path.suffix.lower() in supported_suffixes:
+             yield path
+
+
+ def load_docs(raw_dir: Path = RAW_DIR) -> List[Document]:
+     documents: List[Document] = []
+
+     for path in iter_source_files(raw_dir):
+         suffix = path.suffix.lower()
+
+         if suffix in {".md", ".markdown"}:
+             documents.append(load_md_file(path))
+         elif suffix == ".pdf":
+             documents.extend(load_pdf_file(path))
+         elif suffix == ".txt":
+             documents.extend(load_txt_file(path))
+
+     if not documents:
+         raise ValueError(f"No supported documents found under {raw_dir}")
+
+     return documents
+
+
+ def add_chunk_metadata(nodes: List[BaseNode]) -> List[BaseNode]:
+     counters: dict[str, int] = {}
+
+     for node in nodes:
+         source_file = node.metadata["source_file"]
+         chunk_index = counters.get(source_file, 0)
+         counters[source_file] = chunk_index + 1
+
+         file_hash = node.metadata["file_hash"][:12]
+         page_number = node.metadata.get("page_number", "na")
+         chunk_id = f"{Path(source_file).stem}-{file_hash}-p{page_number}-c{chunk_index}"
+
+         node.metadata["chunk_id"] = chunk_id
+         node.metadata["chunk_index"] = chunk_index
+         node.id_ = chunk_id
+
+     return nodes
+
+
+ def validate_nodes(nodes: List[BaseNode]) -> None:
+     if not nodes:
+         raise ValueError("No chunks were created from the source documents.")
+
+     for node in nodes:
+         missing = [key for key in REQUIRED_METADATA if key not in node.metadata]
+         if missing:
+             raise ValueError(
+                 f"Node {node.node_id} is missing metadata fields: {missing}")
+
+         if node.metadata["file_type"] == "pdf" and "page_number" not in node.metadata:
+             raise ValueError(
+                 f"PDF node {node.node_id} is missing page_number metadata.")
+
+
+ def build_nodes(raw_dir: Path = RAW_DIR) -> List[BaseNode]:
+     documents = load_docs(raw_dir)
+     splitter = SentenceSplitter(
+         chunk_size=CHUNK_SIZE,
+         chunk_overlap=CHUNK_OVERLAP,
+     )
+     nodes = splitter.get_nodes_from_documents(documents)
+     add_chunk_metadata(nodes)
+     validate_nodes(nodes)
+     return nodes
+
+
+ def collection_needs_pdf_rebuild(chroma_collection) -> bool:
+     if chroma_collection.count() == 0:
+         return True
+
+     try:
+         sample = chroma_collection.peek(limit=min(chroma_collection.count(), 20))
+     except Exception:
+         return False
+
+     for metadata in sample.get("metadatas") or []:
+         if metadata.get("file_type") == "pdf":
+             return metadata.get("extraction_method") != PDF_EXTRACTION_METHOD
+
+     return False
+
+
+ async def build_index(raw_dir: Path = RAW_DIR, rebuild: bool = False) -> VectorStoreIndex:
+     configure_model_cache()
+
+     from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+     load_dotenv()
+     CHROMA_DB_DIR.mkdir(parents=True, exist_ok=True)
+
+     db = chromadb.PersistentClient(path=str(CHROMA_DB_DIR))
+
+     if rebuild:
+         try:
+             db.delete_collection(COLLECTION_NAME)
+         except (NotFoundError, ValueError):
+             pass
+
+     chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
+     if not rebuild and collection_needs_pdf_rebuild(chroma_collection):
+         db.delete_collection(COLLECTION_NAME)
+         chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
+
+     vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
+     storage_context = StorageContext.from_defaults(vector_store=vector_store)
+     embed_model = HuggingFaceEmbedding(
+         model_name=resolve_embed_model_name(),
+         cache_folder=str(HF_CACHE_DIR / "sentence_transformers"),
+     )
+
+     if rebuild or chroma_collection.count() == 0:
+         nodes = build_nodes(raw_dir)
+         index = VectorStoreIndex(
+             nodes,
+             storage_context=storage_context,
+             embed_model=embed_model,
+             show_progress=True,
+         )
+         print(
+             f"Indexed {len(nodes)} chunks into collection '{COLLECTION_NAME}'")
+         return index
+
+     print(
+         f"Loaded existing collection '{COLLECTION_NAME}' with {chroma_collection.count()} chunks.")
+     return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
+
+
+ class QueryKnowledgeTool(Tool):
+     name = "query_knowledge"
+     description = "Performs a search of related information based on your query"
+     inputs = {'query': {'type': 'string',
+                         'description': 'The search query to perform.'}}
+     output_type = "string"
+
+     @staticmethod
+     def format_results(results):
+         output = []
+
+         for result in results:
+             metadata = result.node.metadata
+             source = metadata.get("file_name", "unknown")
+             page = metadata.get("page_number", "n/a")
+             section = metadata.get("section_path") or metadata.get("section_title") or "n/a"
+             content_type = metadata.get("content_type", "text")
+             formula_id = metadata.get("formula_id", "")
+             # NodeWithScore.score is optional, so guard before formatting it as a float.
+             score = f"{result.score:.4f}" if result.score is not None else "n/a"
+             text = result.node.get_content()
+
+             output.append(
+                 f"source:{source}\n"
+                 f"page:{page}\n"
+                 f"section:{section}\n"
+                 f"content_type:{content_type}\n"
+                 f"formula_id:{formula_id or 'n/a'}\n"
+                 f"score:{score}\n"
+                 f"content:{text}"
+             )
+
+         return "\n\n---\n\n".join(output)
+
+     def __init__(self, max_results=10, top_k=5, **kwargs):
+         super().__init__()
+         self.max_results = max_results
+         index = asyncio.run(build_index(rebuild=False))
+         self.retriever = index.as_retriever(similarity_top_k=top_k)
+
+     def forward(self, query: str) -> str:
+         results = self.retriever.retrieve(query)
+         return QueryKnowledgeTool.format_results(results)
+
+
+ if __name__ == "__main__":
+     query_tool = QueryKnowledgeTool()
+     res: str = query_tool.forward("What is an option?")
+     print(res)
tools/todo.md ADDED
@@ -0,0 +1,5 @@
+ 1. Add a reranker
+ 2. Change the embedding model
+ 3. The chunking strategy is crude; split by chapter, heading, and similar boundaries instead
+ 4. Improve PDF extraction
+ 5. Finish load_txt
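Item 5 above leaves `load_txt_file` as a stub. As a rough sketch only, and not the project's implementation, a plain-text loader could mirror the metadata keys that the PDF loader in this commit emits. The function name `load_txt_as_records`, the plain-dict return shape (standing in for `Document`), the UTF-8 default, and the use of `hashlib.sha256` in place of the project's `file_sha256` helper are all assumptions made for illustration. The contents of `REQUIRED_METADATA` are not shown in this diff, so the key set below may be incomplete.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def load_txt_as_records(path: Path, encoding: str = "utf-8") -> List[Dict]:
    """Hypothetical stand-in for load_txt_file: one record per non-empty file."""
    raw_bytes = path.read_bytes()
    # Undecodable bytes are replaced rather than raising, so one bad file
    # does not abort a whole indexing run (an assumed policy).
    text = raw_bytes.decode(encoding, errors="replace").strip()
    if not text:
        return []

    return [
        {
            "text": text,
            "metadata": {
                # Keys copied from the PDF loader's metadata; values are per-file.
                "source_file": str(path.resolve()),
                "file_name": path.name,
                "file_type": "txt",
                "document_title": path.stem,
                "file_hash": hashlib.sha256(raw_bytes).hexdigest(),
                "char_count": len(text),
                "content_type": "text",
            },
        }
    ]
```

If something along these lines were adopted, `".txt"` would also need to be added to `supported_suffixes` in `iter_source_files`, since the `.txt` branch in `load_docs` is otherwise never reached.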
uv.lock CHANGED
@@ -660,6 +660,7 @@ dependencies = [
  { name = "llama-index-core" },
  { name = "llama-index-embeddings-huggingface" },
  { name = "llama-index-vector-stores-chroma" },
+ { name = "pymupdf" },
  { name = "pypdf" },
  { name = "tokenizers" },
  { name = "transformers" },
@@ -673,6 +674,7 @@ requires-dist = [
  { name = "llama-index-core", specifier = ">=0.14.0" },
  { name = "llama-index-embeddings-huggingface", specifier = ">=0.6.0" },
  { name = "llama-index-vector-stores-chroma", specifier = ">=0.5.0" },
+ { name = "pymupdf", specifier = ">=1.27.2.3" },
  { name = "pypdf", specifier = ">=6.0.0" },
  { name = "tokenizers", specifier = ">=0.22.0,<=0.23.0" },
  { name = "transformers", specifier = "<5" },
@@ -2570,6 +2572,22 @@ wheels = [
2572
  { url = "https://files.pythonhosted.org/packages/f4/7e/a72dd26f3b0f4f2bf1dd8923c85f7ceb43172af56d63c7383eb62b332364/pygments-2.20.0-py3-none-any.whl", hash = "sha256:81a9e26dd42fd28a23a2d169d86d7ac03b46e2f8b59ed4698fb4785f946d0176", size = 1231151, upload-time = "2026-03-29T13:29:30.038Z" },
2573
  ]
2574
 
2575
+ [[package]]
2576
+ name = "pymupdf"
2577
+ version = "1.27.2.3"
2578
+ source = { registry = "https://pypi.org/simple" }
2579
+ sdist = { url = "https://files.pythonhosted.org/packages/22/32/708bedc9dde7b328d45abbc076091769d44f2f24ad151ad92d56a6ec142b/pymupdf-1.27.2.3.tar.gz", hash = "sha256:7a92faa25129e8bbec5e50eeb9214f187665428c31b05c4ef6e36c58c0b1c6d2", size = 85759618, upload-time = "2026-04-24T14:13:14.42Z" }
2580
+ wheels = [
2581
+ { url = "https://files.pythonhosted.org/packages/dc/09/ddbdfa7ee91fbabd6f63d7d744884cbdfe3e7ff9b8604749fb38bddf5c5d/pymupdf-1.27.2.3-cp310-abi3-macosx_10_9_x86_64.whl", hash = "sha256:fc1bc3cae6e9e150b0dbb0a9221bdfd411d65f0db2fe359eaa22467d7cc2a05f", size = 24002636, upload-time = "2026-04-24T14:09:17.459Z" },
2582
+ { url = "https://files.pythonhosted.org/packages/01/89/3f8edd6c4f50ca370e2a2f2a3011face36f3760728ffe76dffec91c0fca0/pymupdf-1.27.2.3-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:660d93cb6da5bbddf11d3982ae27745dd3a9902d9f24cdb69adab83962294b5a", size = 23278238, upload-time = "2026-04-24T14:09:32.882Z" },
2583
+ { url = "https://files.pythonhosted.org/packages/c3/26/b7e5a70eb83bd189f8b5df87ec442746b992f2f632662839b288170d357d/pymupdf-1.27.2.3-cp310-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:1dd460a3ae4597a755f00a3bd9771f5ebf1531dc111f6a36bf05dd00a6b84425", size = 24333923, upload-time = "2026-04-24T14:09:47.341Z" },
2584
+ { url = "https://files.pythonhosted.org/packages/e4/a0/aa1ee2240f29481a04a827c313333b4ecd8a14d6ac3e15d3f41a30574781/pymupdf-1.27.2.3-cp310-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:857842b4888827bd6155a1131341b2822a7ebe9a8c15a975fd7d490d7a64a30c", size = 24963198, upload-time = "2026-04-24T14:10:07.408Z" },
2585
+ { url = "https://files.pythonhosted.org/packages/69/49/4f742451f980840829fc00ba158bebb25d389c846d8f4f8c65936ee55de8/pymupdf-1.27.2.3-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:580983849c64a08d08344ca3d1580e87c01f046a8392421797bc850efd72a5b6", size = 25184609, upload-time = "2026-04-24T14:10:22.911Z" },
2586
+ { url = "https://files.pythonhosted.org/packages/f6/3f/3853d6608f394faf6eec2bd4e8ea9f6a00beea329b071abdb29f4164cc3d/pymupdf-1.27.2.3-cp310-abi3-win32.whl", hash = "sha256:a5c1088a87189891a4946ab314a14b7934ac4c5b6077f7e74ebee956f8906d0e", size = 18019286, upload-time = "2026-04-24T14:10:34.239Z" },
2587
+ { url = "https://files.pythonhosted.org/packages/44/47/5fb10fe73f96b31253a41647c362ea9e0380920bddf16028414a051247fc/pymupdf-1.27.2.3-cp310-abi3-win_amd64.whl", hash = "sha256:d20f68ef15195e073071dbc4ae7455257c7889af7584e39df490c0a92728526e", size = 19249102, upload-time = "2026-04-24T14:10:46.72Z" },
2588
+ { url = "https://files.pythonhosted.org/packages/53/a4/b9e91aac82293f9c954654c85581ee8212b5b05efadc534b581141241e6f/pymupdf-1.27.2.3-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:77691604c5d1d0233827139bbcdea61fd57879c84712b8e49b1f45520f7ab9c2", size = 25000393, upload-time = "2026-04-24T14:11:01.669Z" },
2589
+ ]
2590
+
2591
  [[package]]
2592
  name = "pypdf"
2593
  version = "6.12.0"