Strengthen PDF extraction and add a RAG evaluation module
- eval/README.md +182 -0
- eval/__init__.py +1 -0
- eval/data/hf/open_ragbench/README.md +185 -0
- eval/rag_eval.py +630 -0
- eval/reports/beir_fiqa_retrieval_eval.md +124 -0
- eval/run_eval_suite.py +173 -0
- hf_cache/sentence_transformers/models--BAAI--bge-small-en-v1.5/snapshots/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/README.md +1 -0
- load_docs.py +0 -216
- pyproject.toml +1 -0
- rag_pdf_optimization_notes.md +282 -0
- requirements.txt +1 -0
- tools/query_knowledge.py +1196 -0
- tools/todo.md +5 -0
- uv.lock +18 -0
eval/README.md
ADDED
@@ -0,0 +1,182 @@
+# RAG Evaluation Module
+
+This folder contains a lightweight retrieval-evaluation harness for the project.
+
+## Supported Steps
+
+1. `beir/scifact`
+2. `beir/fiqa`
+3. `open-ragbench`
+4. `t2-ragbench`
+5. `local-options`
+
+Each run builds a temporary Chroma index under `eval/indexes/` and writes reports under `eval/reports/`.
+
+## Smoke Tests
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset beir/scifact --max-corpus-docs 200 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset beir/fiqa --max-corpus-docs 500 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset open-ragbench --max-corpus-docs 50 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset t2-ragbench --max-corpus-docs 50 --max-queries 10 --rebuild
+uv --cache-dir .uv-cache run python -m eval.rag_eval --dataset local-options --max-queries 3 --rebuild
+```
+
+## Run The Whole Suite
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --rebuild
+```
+
+By default, the suite runs:
+
+- `beir/scifact`
+- `beir/fiqa`
+- `open-ragbench`
+- `local-options`
+
+Useful options:
+
+```bash
+# Accurate run after changing PDF parsing, chunking, embedding, retrieval code, or sampling parameters.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --rebuild
+
+# Faster run that reuses existing indexes.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite
+
+# Run only selected datasets.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --datasets local-options,beir/fiqa
+
+# Override shared parameters for all selected datasets.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --top-k 10 --max-queries 20 --max-corpus-docs 1000
+
+# Save a stable suite-level report name.
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite --output-name latest_rag_eval
+```
+
+The suite writes per-dataset reports and one aggregate report under `eval/reports/`.
+
+## Common Commands
+
+Run the fastest local check while developing PDF parsing or chunking:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --max-queries 3 \
+  --top-k 5 \
+  --rebuild
+```
+
+Run only the standard public retrieval smoke tests:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/scifact,beir/fiqa \
+  --rebuild
+```
+
+Run the financial benchmark only:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/fiqa \
+  --max-corpus-docs 1000 \
+  --max-queries 50 \
+  --top-k 5 \
+  --rebuild
+```
+
+Run the PDF-like benchmark only:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets open-ragbench \
+  --max-corpus-docs 100 \
+  --max-queries 20 \
+  --top-k 5 \
+  --rebuild
+```
+
+Compare different `top-k` values:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --top-k 3 \
+  --output-name local_options_top3 \
+  --rebuild
+
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --top-k 10 \
+  --output-name local_options_top10 \
+  --rebuild
+```
+
+Compare different chunk settings:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --chunk-size 384 \
+  --chunk-overlap 64 \
+  --output-name local_options_chunk384 \
+  --rebuild
+
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets local-options \
+  --chunk-size 768 \
+  --chunk-overlap 128 \
+  --output-name local_options_chunk768 \
+  --rebuild
+```
+
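Reviewer note: the `--chunk-size` and `--chunk-overlap` pairs compared above (384/64 and 768/128) are easier to reason about with a picture of what the two numbers control. The sketch below is a plain-Python sliding window over pre-split tokens. It is an illustration only and not the harness's splitter: `eval/rag_eval.py` imports LlamaIndex's `SentenceSplitter`, which also respects sentence boundaries and counts tokens with a real tokenizer. The function name `sliding_window_chunks` and the synthetic tokens are hypothetical.

```python
def sliding_window_chunks(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Synthetic 1000-token "document" (made up for illustration).
tokens = [f"t{i}" for i in range(1000)]
small = sliding_window_chunks(tokens, chunk_size=384, chunk_overlap=64)
large = sliding_window_chunks(tokens, chunk_size=768, chunk_overlap=128)
print(len(small), len(large))
```

Smaller chunks give the retriever more, narrower candidates per document, while larger chunks keep more context together. That is why both settings should be run against the same dataset and `--top-k` before comparing reports.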
+Run a larger, slower evaluation before reporting results:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/scifact,beir/fiqa,open-ragbench,local-options \
+  --max-corpus-docs 2000 \
+  --max-queries 100 \
+  --top-k 5 \
+  --output-name full_rag_eval \
+  --rebuild
+```
+
+Stop immediately when one dataset fails:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.run_eval_suite \
+  --datasets beir/scifact,beir/fiqa,open-ragbench,local-options \
+  --fail-fast \
+  --rebuild
+```
+
+Run a single dataset directly without the suite wrapper:
+
+```bash
+uv --cache-dir .uv-cache run python -m eval.rag_eval \
+  --dataset local-options \
+  --max-queries 3 \
+  --top-k 5 \
+  --rebuild
+```
+
+## Suggested Workflow
+
+1. During development, run `local-options` with a small query count.
+2. After changing PDF extraction, chunking, embeddings, or retrieval code, add `--rebuild`.
+3. Before comparing two versions, use the same `--datasets`, `--max-queries`, `--max-corpus-docs`, `--top-k`, `--chunk-size`, and `--chunk-overlap`.
+4. Use `--output-name` to save stable report names for before/after comparison.
+
+## Metrics
+
+- `hit_at_1`
+- `hit_at_3`
+- `hit_at_5`
+- `hit_at_k`
+- `mrr`
+- `ndcg_at_k`
+
+The public benchmarks test whether the eval pipeline works on standard datasets. The `local-options` benchmark is the project-specific check for PDF parsing, formula extraction, and section-aware chunking.
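Reviewer note: the README names its metrics without defining them. A minimal sketch of the usual information-retrieval definitions follows, assuming binary relevance and a de-duplicated ranked list of document IDs. Whether `eval/rag_eval.py` computes them exactly this way (for example at chunk level rather than document level) is not visible in this part of the diff, and the function names here are hypothetical.

```python
import math


def hit_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """1.0 if any relevant document appears in the top k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant document; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Made-up example: the only retrieved relevant document sits at rank 2.
ranked = ["d3", "d1", "d7"]
relevant = {"d1", "d9"}
print(hit_at_k(ranked, relevant, 1), hit_at_k(ranked, relevant, 3))
print(reciprocal_rank(ranked, relevant), ndcg_at_k(ranked, relevant, 3))
```

`mrr` would then be the mean of `reciprocal_rank` over all evaluated queries, and each `hit_at_*` value the mean of `hit_at_k` at the corresponding cutoff.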
eval/__init__.py
ADDED
@@ -0,0 +1 @@
+"""RAG evaluation helpers."""
eval/data/hf/open_ragbench/README.md
ADDED
@@ -0,0 +1,185 @@
+---
+license: cc-by-nc-4.0
+---
+# Open RAG Benchmark
+
+The Open RAG Benchmark is a unique, high-quality Retrieval-Augmented Generation (RAG) dataset constructed directly from arXiv PDF documents, specifically designed for evaluating RAG systems with a focus on multimodal PDF understanding. Unlike other datasets, Open RAG Benchmark emphasizes **pure PDF content**, meticulously extracting and generating queries on diverse modalities including **text, tables, and images**, even when they are intricately interwoven within a document.
+
+This dataset is purpose-built to power the company's [Open RAG Evaluation project](https://github.com/vectara/open-rag-eval), facilitating a holistic, end-to-end evaluation of RAG systems by offering:
+
+- **Richer Multimodal Content:** A corpus derived exclusively from PDF documents, ensuring fidelity to real-world data and encompassing a wide spectrum of text, tabular, and visual information, often with intermodal crossovers.
+- **Tailored for Open RAG Evaluation:** Designed to support the unique and comprehensive evaluation metrics adopted by the Open RAG Evaluation project, enabling a deeper understanding of RAG performance beyond traditional metrics.
+- **High-Quality Retrieval Queries & Answers:** Each piece of extracted content is paired with expertly crafted retrieval queries and corresponding answers, optimized for robust RAG training and evaluation.
+- **Diverse Knowledge Domains:** Content spanning various scientific and technical domains from arXiv, ensuring broad applicability and challenging RAG systems across different knowledge areas.
+
+The current draft version of the Arxiv dataset, as the first step in this multimodal RAG dataset collection, includes:
+
+- **Documents:** 1000 PDF papers evenly distributed across all Arxiv categories.
+  - 400 positive documents (each serving as the golden document for some queries).
+  - 600 hard negative documents (completely irrelevant to all queries).
+- **Multimodal Content:** Extracted text, tables, and images from research papers.
+- **QA Pairs:** 3045 valid question-answer pairs.
+  - **Based on query types:**
+    - 1793 abstractive queries (requiring generating a summary or rephrased response using understanding and synthesis).
+    - 1252 extractive queries (seeking concise, fact-based answers directly extracted from a given text).
+  - **Based on generation sources:**
+    - 1914 text-only queries
+    - 763 text-image queries
+    - 148 text-table queries
+    - 220 text-table-image queries
+
+## Dataset Structure
+
+The dataset is organized similar to the [BEIR dataset](https://github.com/beir-cellar/beir) format within the `official/pdf/arxiv/` directory.
+
+```
+official/
+└── pdf
+    └── arxiv
+        ├── answers.json
+        ├── corpus
+        │   ├── {PAPER_ID_1}.json
+        │   ├── {PAPER_ID_2}.json
+        │   └── ...
+        ├── pdf_urls.json
+        ├── qrels.json
+        └── queries.json
+```
+
+Each file's format is detailed below:
+
+### `pdf_urls.json`
+
+This file provides the original PDF links to the papers in this dataset for downloading purposes.
+
+```json
+{
+    "Paper ID": "Paper URL",
+    ...
+}
+```
+
+### `corpus/`
+
+This folder contains all processed papers in JSON format.
+
+```json
+{
+    "title": "Paper Title",
+    "sections": [
+        {
+            "text": "Section text content with placeholders for tables/images",
+            "tables": {"table_id1": "markdown_table_string", ...},
+            "images": {"image_id1": "base64_encoded_string", ...},
+        },
+        ...
+    ],
+    "id": "Paper ID",
+    "authors": ["Author1", "Author2", ...],
+    "categories": ["Category1", "Category2", ...],
+    "abstract": "Abstract text",
+    "updated": "Updated date",
+    "published": "Published date"
+}
+```
+
+### `queries.json`
+
+This file contains all generated queries.
+
+```json
+{
+    "Query UUID": {
+        "query": "Query text",
+        "type": "Query type (abstractive/extractive)",
+        "source": "Generation source (text/text-image/text-table/text-table-image)"
+    },
+    ...
+}
+```
+
+### `qrels.json`
+
+This file contains the query-document-section relevance labels.
+
+```json
+{
+    "Query UUID": {
+        "doc_id": "Paper ID",
+        "section_id": Section Index
+    },
+    ...
+}
+```
+
+### `answers.json`
+
+This file contains the answers for the generated queries.
+
+```json
+{
+    "Query UUID": "Answer text",
+    ...
+}
+```
+
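Reviewer note: `queries.json`, `qrels.json`, and `answers.json` as described above are all keyed by the same query UUID, so one evaluation record per query can be built with a plain join. The sketch below assumes exactly the field names shown in the dataset card. The function name `load_open_ragbench_examples` is hypothetical, and the tiny sample files are made up so the snippet runs without downloading the dataset.

```python
import json
import tempfile
from pathlib import Path


def load_open_ragbench_examples(root: Path) -> list[dict]:
    """Join queries, qrels, and answers on the shared query UUID."""
    queries = json.loads((root / "queries.json").read_text(encoding="utf-8"))
    qrels = json.loads((root / "qrels.json").read_text(encoding="utf-8"))
    answers = json.loads((root / "answers.json").read_text(encoding="utf-8"))

    examples = []
    for query_id, query in queries.items():
        qrel = qrels.get(query_id)
        if qrel is None:
            continue  # skip queries that have no relevance label
        examples.append(
            {
                "query_id": query_id,
                "question": query["query"],
                "type": query.get("type"),
                "source": query.get("source"),
                "doc_id": str(qrel["doc_id"]),
                "section_id": qrel.get("section_id"),
                "answer": answers.get(query_id),
            }
        )
    return examples


# Made-up sample data in the documented layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "queries.json").write_text(
        json.dumps(
            {
                "q-1": {"query": "What is measured?", "type": "extractive", "source": "text"},
                "q-2": {"query": "Unlabeled query", "type": "abstractive", "source": "text"},
            }
        ),
        encoding="utf-8",
    )
    (root / "qrels.json").write_text(
        json.dumps({"q-1": {"doc_id": "paper-a", "section_id": 2}}), encoding="utf-8"
    )
    (root / "answers.json").write_text(
        json.dumps({"q-1": "Retrieval quality."}), encoding="utf-8"
    )
    examples = load_open_ragbench_examples(root)

print(examples)
```

Keeping `section_id` in each record leaves room for section-level scoring later, even if a harness only checks the document ID.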
+## Dataset Creation
+
+The Open RAG Benchmark dataset is created through a systematic process involving document collection, processing, content segmentation, query generation, and quality filtering.
+
+1. **Document Collection:** Gathering documents from sources like Arxiv.
+2. **Document Processing:** Parsing PDFs via OCR into text, Markdown tables, and base64 encoded images.
+3. **Content Segmentation:** Dividing documents into sections based on structural elements.
+4. **Query Generation:** Using LLMs (currently `gpt-4o-mini`) to generate retrieval queries for each section, handling multimodal content such as tables and images.
+5. **Quality Filtering:** Removing non-retrieval queries and ensuring quality through post-processing via a set of encoders for retrieval filtering and `gpt-4o-mini` for query quality filtering.
+6. **Hard-Negative Document Mining (Optional):** Mining hard negative documents that are entirely irrelevant to any existing query, relying on agreement across multiple embedding models for accuracy.
+
+The code for reproducing and customizing the dataset generation process is available in the [Open RAG Benchmark GitHub repository](https://github.com/vectara/Open-RAG-Benchmark).
+
+## Limitations and Challenges
+
+Several challenges are inherent in the current dataset development process:
+
+- **OCR Performance:** Mistral OCR, while performing well for structured documents, struggles with unstructured PDFs, impacting the quality of extracted content.
+- **Multimodal Integration:** Ensuring proper extraction and seamless integration of tables and images with corresponding text remains a complex challenge.
+
+## Future Enhancements
+
+The project aims for continuous improvement and expansion of the dataset, with key next steps including:
+
+### Enhanced Dataset Structure and Usability:
+
+- **Dataset Format and Content Enhancements:**
+  - **Rich Metadata:** Adding comprehensive document metadata (authors, publication date, categories, etc.) to enable better filtering and contextualization.
+  - **Flexible Chunking:** Providing multiple content granularity levels (sections, paragraphs, sentences) to accommodate different retrieval strategies.
+  - **Query Metadata:** Classifying queries by type (factual, conceptual, analytical), difficulty level, and whether they require multimodal understanding.
+- **Advanced Multimodal Representation:**
+  - **Improved Image Integration:** Replacing basic placeholders with structured image objects including captions, alt text, and direct access URLs.
+  - **Structured Table Format:** Providing both markdown and programmatically accessible structured formats for tables (headers/rows).
+  - **Positional Context:** Maintaining clear positional relationships between text and visual elements.
+- **Sophisticated Query Generation:**
+  - **Multi-stage Generation Pipeline:** Implementing targeted generation for different query types (factual, conceptual, multimodal).
+  - **Diversity Controls:** Ensuring coverage of different difficulty levels and reasoning requirements.
+  - **Specialized Multimodal Queries:** Generating queries specifically designed to test table and image understanding.
+- **Practitioner-Focused Tools:**
+  - **Framework Integration Examples:** Providing code samples showing dataset integration with popular RAG frameworks (LangChain, LlamaIndex, etc.).
+  - **Evaluation Utilities:** Developing standardized tools to benchmark RAG system performance using this dataset.
+  - **Interactive Explorer:** Creating a simple visualization tool to browse and understand dataset contents.
+
+### Dataset Expansion:
+
+- Implementing alternative solutions for PDF table & image extraction.
+- Enhancing OCR capabilities for unstructured documents.
+- Broadening scope beyond academic papers to include other document types.
+- Potentially adding multilingual support.
+
+## Acknowledgments
+
+The Open RAG Benchmark project uses OpenAI's GPT models (specifically `gpt-4o-mini`) for query generation and evaluation. For post-filtering and retrieval filtering, the following embedding models, recognized for their outstanding performance on the [MTEB Benchmark](https://huggingface.co/spaces/mteb/leaderboard), were utilized:
+
+- [Linq-AI-Research/Linq-Embed-Mistral](https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral)
+- [dunzhang/stella\_en\_1.5B\_v5](https://huggingface.co/NovaSearch/stella_en_1.5B_v5)
+- [Alibaba-NLP/gte-Qwen2-7B-instruct](https://huggingface.co/Alibaba-NLP/gte-Qwen2-7B-instruct)
+- [infly/inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1)
+- [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)
+- [openai/text-embedding-3-large](https://platform.openai.com/docs/models/text-embedding-3-large)
eval/rag_eval.py
ADDED
@@ -0,0 +1,630 @@
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import argparse
|
| 4 |
+
import csv
|
| 5 |
+
import json
|
| 6 |
+
import math
|
| 7 |
+
import shutil
|
| 8 |
+
import zipfile
|
| 9 |
+
from dataclasses import dataclass
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from typing import Any, Iterable
|
| 12 |
+
|
| 13 |
+
import chromadb
|
| 14 |
+
import requests
|
| 15 |
+
from llama_index.core import StorageContext, VectorStoreIndex
|
| 16 |
+
from llama_index.core.node_parser import SentenceSplitter
|
| 17 |
+
from llama_index.core.schema import Document
|
| 18 |
+
from llama_index.vector_stores.chroma import ChromaVectorStore
|
| 19 |
+
|
| 20 |
+
from tools.query_knowledge import configure_model_cache, resolve_embed_model_name
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
PROJECT_ROOT = Path(__file__).resolve().parents[1]
|
| 24 |
+
EVAL_DIR = PROJECT_ROOT / "eval"
|
| 25 |
+
DATA_DIR = EVAL_DIR / "data"
|
| 26 |
+
INDEX_DIR = EVAL_DIR / "indexes"
|
| 27 |
+
REPORT_DIR = EVAL_DIR / "reports"
|
| 28 |
+
|
| 29 |
+
BEIR_URLS = {
|
| 30 |
+
"scifact": "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip",
|
| 31 |
+
"fiqa": "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip",
|
| 32 |
+
}
|
| 33 |
+
|
| 34 |
+
DATASET_ALIASES = {
|
| 35 |
+
"beir/scifact": "scifact",
|
| 36 |
+
"beir/fiqa": "fiqa",
|
| 37 |
+
"open-ragbench": "open_ragbench",
|
| 38 |
+
"open_ragbench": "open_ragbench",
|
| 39 |
+
"t2-ragbench": "t2_ragbench",
|
| 40 |
+
"t2_ragbench": "t2_ragbench",
|
| 41 |
+
"local-options": "local_options",
|
| 42 |
+
"local_options": "local_options",
|
| 43 |
+
}
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
@dataclass
|
| 47 |
+
class EvalCorpus:
|
| 48 |
+
name: str
|
| 49 |
+
documents: list[dict[str, Any]]
|
| 50 |
+
queries: list[dict[str, Any]]
|
| 51 |
+
qrels: dict[str, set[str]]
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def ensure_dirs() -> None:
|
| 55 |
+
DATA_DIR.mkdir(parents=True, exist_ok=True)
|
| 56 |
+
INDEX_DIR.mkdir(parents=True, exist_ok=True)
|
| 57 |
+
REPORT_DIR.mkdir(parents=True, exist_ok=True)
|
| 58 |
+
|
| 59 |
+
|
| 60 |
+
def download_file(url: str, destination: Path) -> None:
|
| 61 |
+
destination.parent.mkdir(parents=True, exist_ok=True)
|
| 62 |
+
with requests.get(url, stream=True, timeout=60) as response:
|
| 63 |
+
response.raise_for_status()
|
| 64 |
+
with destination.open("wb") as file:
|
| 65 |
+
for chunk in response.iter_content(chunk_size=1024 * 1024):
|
| 66 |
+
if chunk:
|
| 67 |
+
file.write(chunk)
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
def read_jsonl(path: Path) -> Iterable[dict[str, Any]]:
|
| 71 |
+
with path.open("r", encoding="utf-8") as file:
|
| 72 |
+
for line in file:
|
| 73 |
+
line = line.strip()
|
| 74 |
+
if line:
|
| 75 |
+
yield json.loads(line)
|
| 76 |
+
|
| 77 |
+
|
| 78 |
+
def prepare_beir_dataset(dataset_name: str) -> Path:
|
| 79 |
+
ensure_dirs()
|
| 80 |
+
if dataset_name not in BEIR_URLS:
|
| 81 |
+
raise ValueError(f"Unsupported BEIR dataset: {dataset_name}")
|
| 82 |
+
|
| 83 |
+
target_dir = DATA_DIR / "beir" / dataset_name
|
| 84 |
+
corpus_path = target_dir / "corpus.jsonl"
|
| 85 |
+
if corpus_path.exists():
|
| 86 |
+
return target_dir
|
| 87 |
+
|
| 88 |
+
zip_path = DATA_DIR / "downloads" / f"{dataset_name}.zip"
|
| 89 |
+
if not zip_path.exists():
|
| 90 |
+
download_file(BEIR_URLS[dataset_name], zip_path)
|
| 91 |
+
|
| 92 |
+
extract_root = DATA_DIR / "beir"
|
| 93 |
+
extract_root.mkdir(parents=True, exist_ok=True)
|
| 94 |
+
with zipfile.ZipFile(zip_path) as archive:
|
| 95 |
+
archive.extractall(extract_root)
|
| 96 |
+
|
| 97 |
+
if not corpus_path.exists():
|
| 98 |
+
raise FileNotFoundError(f"BEIR extraction did not create {corpus_path}")
|
| 99 |
+
|
| 100 |
+
return target_dir
|
| 101 |
+
|
| 102 |
+
|
| 103 |
+
def load_beir_dataset(
|
| 104 |
+
dataset_name: str,
|
| 105 |
+
split: str,
|
| 106 |
+
max_corpus_docs: int | None,
|
| 107 |
+
max_queries: int | None,
|
| 108 |
+
) -> EvalCorpus:
|
| 109 |
+
dataset_dir = prepare_beir_dataset(dataset_name)
|
| 110 |
+
|
| 111 |
+
all_queries = {
|
| 112 |
+
str(row["_id"]): row.get("text", "")
|
| 113 |
+
for row in read_jsonl(dataset_dir / "queries.jsonl")
|
| 114 |
+
}
|
| 115 |
+
|
| 116 |
+
qrels_path = dataset_dir / "qrels" / f"{split}.tsv"
|
| 117 |
+
if not qrels_path.exists():
|
| 118 |
+
candidates = sorted((dataset_dir / "qrels").glob("*.tsv"))
|
| 119 |
+
if not candidates:
|
| 120 |
+
raise FileNotFoundError(f"No qrels found under {dataset_dir / 'qrels'}")
|
| 121 |
+
qrels_path = candidates[0]
|
| 122 |
+
|
| 123 |
+
all_qrels: dict[str, set[str]] = {}
|
| 124 |
+
with qrels_path.open("r", encoding="utf-8") as file:
|
| 125 |
+
reader = csv.DictReader(file, delimiter="\t")
|
| 126 |
+
for row in reader:
|
| 127 |
+
query_id = str(row.get("query-id") or row.get("query_id"))
|
| 128 |
+
corpus_id = str(row.get("corpus-id") or row.get("corpus_id"))
|
| 129 |
+
score = int(row.get("score", 1))
|
| 130 |
+
if score <= 0:
|
| 131 |
+
continue
|
| 132 |
+
all_qrels.setdefault(query_id, set()).add(corpus_id)
|
| 133 |
+
|
| 134 |
+
queries = []
|
| 135 |
+
required_doc_ids = set()
|
| 136 |
+
for query_id, relevant_docs in all_qrels.items():
|
| 137 |
+
if query_id not in all_queries:
|
| 138 |
+
continue
|
| 139 |
+
if max_corpus_docs and len(required_doc_ids | relevant_docs) > max_corpus_docs:
|
| 140 |
+
continue
|
| 141 |
+
required_doc_ids.update(relevant_docs)
|
| 142 |
+
queries.append(
|
| 143 |
+
{
|
| 144 |
+
"query_id": query_id,
|
| 145 |
+
"question": all_queries[query_id],
|
| 146 |
+
"relevant_doc_ids": sorted(relevant_docs),
|
| 147 |
+
}
|
| 148 |
+
)
|
| 149 |
+
if max_queries and len(queries) >= max_queries:
|
| 150 |
+
break
|
| 151 |
+
|
| 152 |
+
documents = []
|
| 153 |
+
seen_doc_ids = set()
|
| 154 |
+
for row in read_jsonl(dataset_dir / "corpus.jsonl"):
|
| 155 |
+
doc_id = str(row["_id"])
|
| 156 |
+
if required_doc_ids and doc_id not in required_doc_ids:
|
| 157 |
+
if max_corpus_docs and len(documents) >= max_corpus_docs:
|
| 158 |
+
continue
|
| 159 |
+
if max_corpus_docs and len(documents) + len(required_doc_ids - seen_doc_ids) >= max_corpus_docs:
|
| 160 |
+
continue
|
| 161 |
+
title = row.get("title") or ""
|
| 162 |
+
text = row.get("text") or ""
|
| 163 |
+
documents.append(
|
| 164 |
+
{
|
| 165 |
+
"doc_id": doc_id,
|
| 166 |
+
"title": title,
|
| 167 |
+
"text": f"{title}\n{text}".strip(),
|
| 168 |
+
"metadata": {"source_dataset": f"beir/{dataset_name}"},
|
| 169 |
+
}
|
| 170 |
+
)
|
| 171 |
+
seen_doc_ids.add(doc_id)
|
| 172 |
+
if max_corpus_docs and len(documents) >= max_corpus_docs and required_doc_ids.issubset(seen_doc_ids):
|
| 173 |
+
break
|
| 174 |
+
|
| 175 |
+
if not documents or not queries:
|
| 176 |
+
raise ValueError(
|
| 177 |
+
f"Dataset beir/{dataset_name} has no evaluable documents/queries. "
|
| 178 |
+
"Increase --max-corpus-docs or use a larger sample."
|
| 179 |
+
)
|
| 180 |
+
|
| 181 |
+
return EvalCorpus(
|
| 182 |
+
name=f"beir_{dataset_name}",
|
| 183 |
+
documents=documents,
|
| 184 |
+
queries=queries,
|
| 185 |
+
qrels={query["query_id"]: set(query["relevant_doc_ids"]) for query in queries},
|
| 186 |
+
)


def snapshot_hf_dataset(repo_id: str, local_name: str) -> Path:
    from huggingface_hub import snapshot_download

    ensure_dirs()
    target_dir = DATA_DIR / "hf" / local_name
    if target_dir.exists():
        return target_dir

    snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=str(target_dir),
        local_dir_use_symlinks=False,
    )
    return target_dir


def flatten_open_ragbench_section(section: dict[str, Any]) -> str:
    parts = [section.get("text") or ""]
    tables = section.get("tables") or {}
    if isinstance(tables, dict):
        parts.extend(str(value) for value in tables.values())
    return "\n".join(part for part in parts if part)


def load_open_ragbench(
    max_corpus_docs: int | None,
    max_queries: int | None,
) -> EvalCorpus:
    dataset_dir = snapshot_hf_dataset("vectara/open_ragbench", "open_ragbench")
    root = dataset_dir / "pdf" / "arxiv"
    if not root.exists():
        root = dataset_dir / "official" / "pdf" / "arxiv"
        if not root.exists():
            raise FileNotFoundError(f"Open RAGBench root not found: {root}")

    queries_data = json.loads((root / "queries.json").read_text(encoding="utf-8"))
    qrels_data = json.loads((root / "qrels.json").read_text(encoding="utf-8"))

    documents = []
    qrels: dict[str, set[str]] = {}
    required_doc_ids = set()
    selected_query_ids = []
    for query_id, qrel in qrels_data.items():
        doc_id = str(qrel.get("doc_id"))
        if not doc_id or doc_id == "None":
            continue
        selected_query_ids.append(str(query_id))
        required_doc_ids.add(doc_id)
        if max_queries and len(selected_query_ids) >= max_queries:
            break

    allowed_doc_ids = set()
    corpus_files = sorted((root / "corpus").glob("*.json"))

    for corpus_file in corpus_files:
        paper = json.loads(corpus_file.read_text(encoding="utf-8"))
        paper_id = str(paper.get("id") or corpus_file.stem)
        is_required = paper_id in required_doc_ids
        if max_corpus_docs and not is_required:
            missing_required_count = len(required_doc_ids - allowed_doc_ids)
            if len(documents) + missing_required_count >= max_corpus_docs:
                continue
        allowed_doc_ids.add(paper_id)
        section_texts = []
        for section_index, section in enumerate(paper.get("sections") or []):
            section_text = flatten_open_ragbench_section(section)
            if section_text:
                section_texts.append(f"[section {section_index}]\n{section_text}")
        text = "\n\n".join(
            part
            for part in [paper.get("title") or "", paper.get("abstract") or "", *section_texts]
            if part
        )
        documents.append(
            {
                "doc_id": paper_id,
                "title": paper.get("title") or paper_id,
                "text": text,
                "metadata": {
                    "source_dataset": "open_ragbench",
                    "categories": ",".join(paper.get("categories") or []),
                },
            }
        )
        if max_corpus_docs and len(documents) >= max_corpus_docs:
            break

    queries = []
    for query_id in selected_query_ids:
        qrel = qrels_data[query_id]
        doc_id = str(qrel.get("doc_id"))
        if doc_id not in allowed_doc_ids:
            continue
        query_payload = queries_data.get(query_id) or {}
        question = query_payload.get("query") if isinstance(query_payload, dict) else str(query_payload)
        qrels[str(query_id)] = {doc_id}
        queries.append(
            {
                "query_id": str(query_id),
                "question": question,
                "relevant_doc_ids": [doc_id],
            }
        )
        if max_queries and len(queries) >= max_queries:
            break

    if not documents or not queries:
        raise ValueError("Open RAGBench produced no evaluable sample.")

    return EvalCorpus("open_ragbench", documents, queries, qrels)


def load_t2_ragbench(
    max_corpus_docs: int | None,
    max_queries: int | None,
) -> EvalCorpus:
    dataset_dir = snapshot_hf_dataset("G4KMU/t2-ragbench", "t2_ragbench")
    parquet_files = sorted(dataset_dir.rglob("*.parquet"))
    jsonl_files = sorted(dataset_dir.rglob("*.jsonl"))
    if not parquet_files and not jsonl_files:
        raise FileNotFoundError(f"No parquet/jsonl files found in {dataset_dir}")

    rows: list[dict[str, Any]] = []
    if parquet_files:
        import pandas as pd

        for parquet_file in parquet_files:
            frame = pd.read_parquet(parquet_file)
            rows.extend(frame.to_dict(orient="records"))
            if max_queries and len(rows) >= max_queries * 5:
                break
    else:
        for jsonl_file in jsonl_files:
            rows.extend(read_jsonl(jsonl_file))
            if max_queries and len(rows) >= max_queries * 5:
                break

    documents_by_id: dict[str, dict[str, Any]] = {}
    queries = []
    qrels: dict[str, set[str]] = {}

    for index, row in enumerate(rows):
        question = first_present(row, ["question", "query", "Question"])
        answer = first_present(row, ["answer", "Answer", "response"])
        context = first_present(row, ["context", "evidence", "gold_context", "text", "document"])
        table = first_present(row, ["table", "Table", "markdown_table"])
        doc_id = str(first_present(row, ["doc_id", "document_id", "filename", "pdf_path", "source"]) or f"row-{index}")
        if not question or not context:
            continue

        text = "\n".join(part for part in [str(context), str(table or "")] if part)
        if doc_id not in documents_by_id:
            documents_by_id[doc_id] = {
                "doc_id": doc_id,
                "title": str(first_present(row, ["company", "ticker", "title", "Title"]) or doc_id),
                "text": text,
                "metadata": {"source_dataset": "t2_ragbench", "answer": str(answer or "")},
            }
        queries.append(
            {
                "query_id": str(first_present(row, ["qid", "query_id", "id"]) or f"q-{index}"),
                "question": str(question),
                "relevant_doc_ids": [doc_id],
            }
        )
        qrels[queries[-1]["query_id"]] = {doc_id}
        if max_queries and len(queries) >= max_queries:
            break

    documents = list(documents_by_id.values())
    if max_corpus_docs:
        documents = documents[:max_corpus_docs]
        allowed = {document["doc_id"] for document in documents}
        queries = [query for query in queries if query["relevant_doc_ids"][0] in allowed]
        qrels = {query["query_id"]: set(query["relevant_doc_ids"]) for query in queries}

    if not documents or not queries:
        raise ValueError("T2-RAGBench produced no evaluable sample.")

    return EvalCorpus("t2_ragbench", documents, queries, qrels)


def first_present(row: dict[str, Any], keys: list[str]) -> Any:
    for key in keys:
        value = row.get(key)
        if value is not None and value != "":
            return value
    return None


def load_local_options_eval(max_queries: int | None) -> EvalCorpus:
    cases_path = EVAL_DIR / "local_options_eval.jsonl"
    if not cases_path.exists():
        raise FileNotFoundError(
            f"Local options eval set not found: {cases_path}. "
            "Create JSONL cases with question, expected_pages, expected_keywords."
        )

    from tools.query_knowledge import load_pdf_file

    pdf_files = sorted((PROJECT_ROOT / "tools" / "knowledge_base" / "raw").rglob("*.pdf"))
    documents = []
    for pdf_file in pdf_files:
        for doc_index, document in enumerate(load_pdf_file(pdf_file)):
            documents.append(
                {
                    "doc_id": f"{pdf_file.name}:{document.metadata.get('page_number')}:{doc_index}",
                    "title": document.metadata.get("section_path") or pdf_file.name,
                    "text": document.text,
                    "metadata": document.metadata,
                }
            )

    queries = []
    qrels: dict[str, set[str]] = {}
    for case_index, case in enumerate(read_jsonl(cases_path)):
        query_id = str(case.get("id") or f"local-{case_index}")
        relevant_ids = []
        expected_pages = set(case.get("expected_pages") or [])
        expected_keywords = case.get("expected_keywords") or []
        for document in documents:
            metadata = document.get("metadata") or {}
            page_hit = metadata.get("page_number") in expected_pages
            keyword_hit = any(keyword in document["text"] for keyword in expected_keywords)
            if page_hit or keyword_hit:
                relevant_ids.append(document["doc_id"])
        queries.append(
            {
                "query_id": query_id,
                "question": case["question"],
                "relevant_doc_ids": relevant_ids,
            }
        )
        qrels[query_id] = set(relevant_ids)
        if max_queries and len(queries) >= max_queries:
            break

    if not documents or not queries:
        raise ValueError("Local options eval set produced no evaluable sample.")

    return EvalCorpus("local_options", documents, queries, qrels)


def load_eval_corpus(args: argparse.Namespace) -> EvalCorpus:
    dataset = DATASET_ALIASES.get(args.dataset, args.dataset)
    if dataset in {"scifact", "fiqa"}:
        return load_beir_dataset(dataset, args.split, args.max_corpus_docs, args.max_queries)
    if dataset == "open_ragbench":
        return load_open_ragbench(args.max_corpus_docs, args.max_queries)
    if dataset == "t2_ragbench":
        return load_t2_ragbench(args.max_corpus_docs, args.max_queries)
    if dataset == "local_options":
        return load_local_options_eval(args.max_queries)
    raise ValueError(f"Unknown dataset: {args.dataset}")


def build_index(corpus: EvalCorpus, chunk_size: int, chunk_overlap: int, rebuild: bool) -> VectorStoreIndex:
    configure_model_cache()
    from llama_index.embeddings.huggingface import HuggingFaceEmbedding

    index_path = INDEX_DIR / corpus.name
    if rebuild and index_path.exists():
        shutil.rmtree(index_path)
    index_path.mkdir(parents=True, exist_ok=True)

    db = chromadb.PersistentClient(path=str(index_path))
    collection_name = f"{corpus.name}_eval"
    if rebuild:
        try:
            db.delete_collection(collection_name)
        except Exception:
            pass
    collection = db.get_or_create_collection(collection_name)
    vector_store = ChromaVectorStore(chroma_collection=collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    embed_model = HuggingFaceEmbedding(
        model_name=resolve_embed_model_name(),
        cache_folder=str(PROJECT_ROOT / "tools" / "hf_cache" / "sentence_transformers"),
    )

    if collection.count() == 0:
        documents = [
            Document(
                text=document["text"],
                metadata={
                    "doc_id": document["doc_id"],
                    "title": document.get("title", ""),
                    **(document.get("metadata") or {}),
                },
            )
            for document in corpus.documents
        ]
        splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
        nodes = splitter.get_nodes_from_documents(documents)
        VectorStoreIndex(
            nodes,
            storage_context=storage_context,
            embed_model=embed_model,
            show_progress=True,
        )

    return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)


def evaluate_retrieval(corpus: EvalCorpus, index: VectorStoreIndex, top_k: int) -> dict[str, Any]:
    retriever = index.as_retriever(similarity_top_k=max(top_k * 5, top_k))
    cases = []
    hit_counts = {1: 0, 3: 0, 5: 0, top_k: 0}
    reciprocal_ranks = []
    ndcg_scores = []

    for query in corpus.queries:
        relevant_doc_ids = corpus.qrels.get(query["query_id"], set())
        results = retriever.retrieve(query["question"])
        retrieved = []
        seen_doc_ids = set()
        first_hit_rank = None
        dcg = 0.0

        for result in results:
            metadata = result.node.metadata
            doc_id = str(metadata.get("doc_id", ""))
            if doc_id in seen_doc_ids:
                continue
            seen_doc_ids.add(doc_id)
            rank = len(retrieved) + 1
            hit = doc_id in relevant_doc_ids
            if hit and first_hit_rank is None:
                first_hit_rank = rank
            if hit:
                dcg += 1 / math.log2(rank + 1)
            retrieved.append(
                {
                    "rank": rank,
                    "doc_id": doc_id,
                    "score": result.score,
                    "hit": hit,
                    "title": metadata.get("title", ""),
                }
            )
            if len(retrieved) >= top_k:
                break

        ideal_hits = min(len(relevant_doc_ids), top_k)
        idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
        ndcg = dcg / idcg if idcg else 0.0
        ndcg_scores.append(ndcg)
        reciprocal_ranks.append(1 / first_hit_rank if first_hit_rank else 0.0)

        for k in hit_counts:
            if any(item["hit"] for item in retrieved[:k]):
                hit_counts[k] += 1

        cases.append(
            {
                "query_id": query["query_id"],
                "question": query["question"],
                "relevant_doc_ids": sorted(relevant_doc_ids),
                "first_hit_rank": first_hit_rank,
                "retrieved": retrieved,
            }
        )

    total = len(corpus.queries)
    metrics = {
        "queries": total,
        "documents": len(corpus.documents),
        "top_k": top_k,
        "mrr": sum(reciprocal_ranks) / total if total else 0.0,
        "ndcg_at_k": sum(ndcg_scores) / total if total else 0.0,
    }
    for k, count in sorted(hit_counts.items()):
        metrics[f"hit_at_{k}"] = count / total if total else 0.0

    return {"dataset": corpus.name, "metrics": metrics, "cases": cases}


def write_reports(report: dict[str, Any]) -> tuple[Path, Path]:
    ensure_dirs()
    dataset_name = report["dataset"]
    json_path = REPORT_DIR / f"{dataset_name}_retrieval_eval.json"
    md_path = REPORT_DIR / f"{dataset_name}_retrieval_eval.md"
    json_path.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")

    metrics = report["metrics"]
    lines = [
        f"# Retrieval Eval: {dataset_name}",
        "",
        "## Metrics",
        "",
    ]
    for key, value in metrics.items():
        lines.append(f"- `{key}`: {value:.4f}" if isinstance(value, float) else f"- `{key}`: {value}")

    lines.extend(["", "## Sample Cases", ""])
    for case in report["cases"][:10]:
        lines.append(f"### {case['query_id']}")
        lines.append("")
        lines.append(case["question"])
        lines.append("")
        lines.append(f"- first_hit_rank: `{case['first_hit_rank']}`")
        for item in case["retrieved"][:5]:
            lines.append(
                f"- rank {item['rank']}: hit={item['hit']} doc_id=`{item['doc_id']}` score={item['score']}"
            )
        lines.append("")

    md_path.write_text("\n".join(lines), encoding="utf-8")
    return json_path, md_path


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run retrieval eval for RAG datasets.")
    parser.add_argument(
        "--dataset",
        required=True,
        help="beir/scifact, beir/fiqa, open-ragbench, t2-ragbench, or local-options",
    )
    parser.add_argument("--split", default="test")
    parser.add_argument("--top-k", type=int, default=5)
    parser.add_argument("--chunk-size", type=int, default=512)
    parser.add_argument("--chunk-overlap", type=int, default=64)
    parser.add_argument("--max-corpus-docs", type=int, default=None)
    parser.add_argument("--max-queries", type=int, default=None)
    parser.add_argument("--rebuild", action="store_true")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    corpus = load_eval_corpus(args)
    index = build_index(corpus, args.chunk_size, args.chunk_overlap, args.rebuild)
    report = evaluate_retrieval(corpus, index, args.top_k)
    json_path, md_path = write_reports(report)
    print(json.dumps(report["metrics"], ensure_ascii=False, indent=2))
    print(f"JSON report: {json_path}")
    print(f"Markdown report: {md_path}")


if __name__ == "__main__":
    main()
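
Note on the scoring in `evaluate_retrieval`: retrieved chunks are collapsed to unique parent documents before ranks are assigned, and relevance is treated as binary. The sketch below is not part of the commit. `score_ranking` is a hypothetical helper that reproduces that logic on plain lists, assuming each retrieved chunk carries its parent `doc_id`.

```python
import math


def score_ranking(chunk_doc_ids: list[str], relevant: set[str], top_k: int) -> dict:
    """Hypothetical helper: document-level rank, reciprocal rank, and binary nDCG@k."""
    # Collapse chunk hits to the first occurrence of each document, keeping order.
    ranked: list[str] = []
    for doc_id in chunk_doc_ids:
        if doc_id in ranked:
            continue
        ranked.append(doc_id)
        if len(ranked) >= top_k:
            break

    # Rank of the first relevant document, or None when nothing relevant was retrieved.
    first_hit = next((rank for rank, d in enumerate(ranked, 1) if d in relevant), None)

    # Binary-relevance DCG, normalised by the best achievable DCG within top_k.
    dcg = sum(1 / math.log2(rank + 1) for rank, d in enumerate(ranked, 1) if d in relevant)
    ideal_hits = min(len(relevant), top_k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))

    return {
        "first_hit_rank": first_hit,
        "reciprocal_rank": 1 / first_hit if first_hit else 0.0,
        "ndcg": dcg / idcg if idcg else 0.0,
    }


# Two chunks of document "a" precede the relevant document "b",
# so "b" lands at document rank 2 rather than chunk rank 3.
example = score_ranking(["a", "a", "b", "c"], {"b"}, top_k=3)
print(example)
```

Because ranks are assigned after deduplication, a long document that contributes many chunks does not push other documents further down the ranking.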
|
eval/reports/beir_fiqa_retrieval_eval.md
ADDED
|
@@ -0,0 +1,124 @@
|
# Retrieval Eval: beir_fiqa

## Metrics

- `queries`: 10
- `documents`: 500
- `top_k`: 5
- `mrr`: 0.8000
- `ndcg_at_k`: 0.6582
- `hit_at_1`: 0.8000
- `hit_at_3`: 0.8000
- `hit_at_5`: 0.8000

## Sample Cases

### 8

How to deposit a cheque issued to an associate in my business into my business account?

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`65404` score=0.6844510955827177
- rank 2: hit=False doc_id=`508754` score=0.6415634192002271
- rank 3: hit=False doc_id=`1873` score=0.6244133153886419
- rank 4: hit=False doc_id=`590102` score=0.6106401478322256
- rank 5: hit=False doc_id=`1066` score=0.5854493569389293

### 15

Can I send a money order from USPS as a business?

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`325273` score=0.6860931820873509
- rank 2: hit=False doc_id=`3714` score=0.5383410844537323
- rank 3: hit=False doc_id=`508754` score=0.5295326644960427
- rank 4: hit=False doc_id=`1873` score=0.5219679418951554
- rank 5: hit=False doc_id=`4457` score=0.5122406473020094

### 18

1 EIN doing business under multiple business names

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`88124` score=0.5926237160250162
- rank 2: hit=False doc_id=`1873` score=0.5421392202098603
- rank 3: hit=False doc_id=`248624` score=0.5355707959162649
- rank 4: hit=False doc_id=`590102` score=0.5349105669189491
- rank 5: hit=False doc_id=`1173` score=0.5304232255229728

### 26

Applying for and receiving business credit

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`350819` score=0.6130084948278423
- rank 2: hit=False doc_id=`2064` score=0.5484836878784439
- rank 3: hit=False doc_id=`5019` score=0.545421752024407
- rank 4: hit=False doc_id=`1873` score=0.5288677740902044
- rank 5: hit=False doc_id=`1766` score=0.5277730439438229

### 34

401k Transfer After Business Closure

- first_hit_rank: `None`
- rank 1: hit=False doc_id=`19183` score=0.5697281829712297
- rank 2: hit=False doc_id=`1506` score=0.5606544069043923
- rank 3: hit=False doc_id=`1134` score=0.5594801072658324
- rank 4: hit=False doc_id=`3481` score=0.5580692841866827
- rank 5: hit=False doc_id=`3059` score=0.5470931591486823

### 42

What are the ins/outs of writing equipment purchases off as business expenses in a home based business?

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`272709` score=0.6108084707046366
- rank 2: hit=False doc_id=`2528` score=0.5915589749452431
- rank 3: hit=True doc_id=`331981` score=0.5819601957870557
- rank 4: hit=False doc_id=`1873` score=0.5679211375564418
- rank 5: hit=True doc_id=`327263` score=0.5609058973658579

### 56

Can a entrepreneur hire a self-employed business owner?

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`572690` score=0.5928112761756716
- rank 2: hit=False doc_id=`1873` score=0.5329399371121925
- rank 3: hit=False doc_id=`350819` score=0.49122764843847383
- rank 4: hit=False doc_id=`288` score=0.48281883887294536
- rank 5: hit=False doc_id=`599545` score=0.4825679577769018

### 68

Intentions of Deductible Amount for Small Business

- first_hit_rank: `None`
- rank 1: hit=False doc_id=`599545` score=0.5484593654392641
- rank 2: hit=False doc_id=`350819` score=0.545089604374947
- rank 3: hit=False doc_id=`327263` score=0.5425303284932907
- rank 4: hit=False doc_id=`272709` score=0.5367760755311749
- rank 5: hit=False doc_id=`1873` score=0.5341962558469263

### 89

How can I deposit a check made out to my business into my personal account?

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`508754` score=0.678210930846752
- rank 2: hit=False doc_id=`3336` score=0.6219187366693569
- rank 3: hit=False doc_id=`1066` score=0.6102283456272309
- rank 4: hit=False doc_id=`65404` score=0.6070578770706204
- rank 5: hit=True doc_id=`413229` score=0.5974145307840032

### 90

Filing personal with 1099s versus business s-corp?

- first_hit_rank: `1`
- rank 1: hit=True doc_id=`31793` score=0.6463855238248295
- rank 2: hit=False doc_id=`4992` score=0.575164246858743
- rank 3: hit=False doc_id=`1873` score=0.567805853646443
- rank 4: hit=False doc_id=`2020` score=0.5629015874196683
- rank 5: hit=False doc_id=`350819` score=0.5607360854843948
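
The `mrr` and `hit_at_1` values in this report can be checked by hand from the `first_hit_rank` lines of the ten sample cases. The snippet below is an illustrative check only, not part of the commit: the dictionary is transcribed from the cases listed in this report, and the helper names `mean_reciprocal_rank` and `hit_rate_at` are made up for the example. `ndcg_at_k` cannot be re-derived this way because the report does not list how many relevant documents each query has.

```python
# first_hit_rank per query id, transcribed from the sample cases in this report.
first_hit_ranks = {
    "8": 1, "15": 1, "18": 1, "26": 1, "34": None,
    "42": 1, "56": 1, "68": None, "89": 1, "90": 1,
}


def mean_reciprocal_rank(ranks: list) -> float:
    """Average of 1/rank over all queries; a query with no hit contributes 0."""
    return sum(1 / rank for rank in ranks if rank) / len(ranks)


def hit_rate_at(ranks: list, k: int) -> float:
    """Share of queries whose first relevant document appears within the top k."""
    return sum(1 for rank in ranks if rank and rank <= k) / len(ranks)


ranks = list(first_hit_ranks.values())
print(f"mrr={mean_reciprocal_rank(ranks):.4f}")      # matches the reported 0.8000
print(f"hit_at_1={hit_rate_at(ranks, 1):.4f}")       # matches the reported 0.8000
print(f"hit_at_5={hit_rate_at(ranks, 5):.4f}")       # matches the reported 0.8000
```

Every query that has a hit has it at rank 1, which is why `mrr` and all three hit rates coincide at 8 out of 10.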
|
eval/run_eval_suite.py
ADDED
|
@@ -0,0 +1,173 @@
|
from __future__ import annotations

import argparse
import json
import traceback
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from types import SimpleNamespace
from typing import Any

from eval.rag_eval import (
    REPORT_DIR,
    build_index,
    ensure_dirs,
    evaluate_retrieval,
    load_eval_corpus,
    write_reports,
)


DEFAULT_DATASETS = ["beir/scifact", "beir/fiqa", "open-ragbench", "local-options"]
SMOKE_DEFAULTS = {
    "beir/scifact": {"max_corpus_docs": 200, "max_queries": 10},
    "beir/fiqa": {"max_corpus_docs": 500, "max_queries": 10},
    "open-ragbench": {"max_corpus_docs": 20, "max_queries": 5},
    "t2-ragbench": {"max_corpus_docs": 20, "max_queries": 5},
    "local-options": {"max_corpus_docs": None, "max_queries": 3},
}


@dataclass
class DatasetRun:
    dataset: str
    status: str
    metrics: dict[str, Any] | None
    json_report: str | None
    markdown_report: str | None
    error: str | None = None


def parse_dataset_list(value: str) -> list[str]:
    datasets = [item.strip() for item in value.split(",") if item.strip()]
    return datasets or DEFAULT_DATASETS


def build_dataset_args(args: argparse.Namespace, dataset: str) -> SimpleNamespace:
    defaults = SMOKE_DEFAULTS.get(dataset, {"max_corpus_docs": None, "max_queries": None})
    return SimpleNamespace(
        dataset=dataset,
        split=args.split,
        top_k=args.top_k,
        chunk_size=args.chunk_size,
        chunk_overlap=args.chunk_overlap,
        max_corpus_docs=args.max_corpus_docs
        if args.max_corpus_docs is not None
        else defaults["max_corpus_docs"],
        max_queries=args.max_queries if args.max_queries is not None else defaults["max_queries"],
        rebuild=args.rebuild,
    )


def run_one(dataset: str, args: argparse.Namespace) -> DatasetRun:
    dataset_args = build_dataset_args(args, dataset)
    print(
        f"\n=== Running {dataset} "
        f"(top_k={dataset_args.top_k}, max_corpus_docs={dataset_args.max_corpus_docs}, "
        f"max_queries={dataset_args.max_queries}, rebuild={dataset_args.rebuild}) ==="
    )

    corpus = load_eval_corpus(dataset_args)
    index = build_index(
        corpus,
        chunk_size=dataset_args.chunk_size,
        chunk_overlap=dataset_args.chunk_overlap,
        rebuild=dataset_args.rebuild,
    )
    report = evaluate_retrieval(corpus, index, dataset_args.top_k)
    json_path, md_path = write_reports(report)
    print(json.dumps(report["metrics"], ensure_ascii=False, indent=2))

    return DatasetRun(
        dataset=dataset,
        status="passed",
        metrics=report["metrics"],
        json_report=str(json_path),
        markdown_report=str(md_path),
    )


def write_suite_report(runs: list[DatasetRun], output_name: str | None) -> tuple[Path, Path]:
    ensure_dirs()
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    stem = output_name or f"rag_eval_suite_{timestamp}"
    json_path = REPORT_DIR / f"{stem}.json"
    md_path = REPORT_DIR / f"{stem}.md"

    payload = {
        "created_at": datetime.now().isoformat(timespec="seconds"),
        "runs": [run.__dict__ for run in runs],
    }
    json_path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")

    lines = ["# RAG Eval Suite", ""]
    for run in runs:
        lines.append(f"## {run.dataset}")
        lines.append("")
        lines.append(f"- status: `{run.status}`")
        if run.error:
            lines.append(f"- error: `{run.error}`")
        if run.metrics:
            for key, value in run.metrics.items():
                lines.append(f"- `{key}`: {value:.4f}" if isinstance(value, float) else f"- `{key}`: {value}")
        if run.markdown_report:
            lines.append(f"- report: `{run.markdown_report}`")
        lines.append("")
    md_path.write_text("\n".join(lines), encoding="utf-8")
    return json_path, md_path


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run a RAG retrieval eval suite.")
    parser.add_argument(
        "--datasets",
        default=",".join(DEFAULT_DATASETS),
        help="Comma-separated datasets: beir/scifact, beir/fiqa, open-ragbench, t2-ragbench, local-options",
    )
    parser.add_argument("--split", default="test")
    parser.add_argument("--top-k", type=int, default=5)
    parser.add_argument("--chunk-size", type=int, default=512)
    parser.add_argument("--chunk-overlap", type=int, default=64)
    parser.add_argument("--max-corpus-docs", type=int, default=None)
    parser.add_argument("--max-queries", type=int, default=None)
    parser.add_argument("--rebuild", action="store_true")
    parser.add_argument("--fail-fast", action="store_true")
    parser.add_argument("--output-name", default=None, help="Suite report filename stem under eval/reports.")
    return parser.parse_args()


def main() -> None:
    args = parse_args()
    runs: list[DatasetRun] = []

    for dataset in parse_dataset_list(args.datasets):
        try:
            runs.append(run_one(dataset, args))
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
            print(f"\n*** {dataset} failed: {error}")
            if args.fail_fast:
                raise
            traceback.print_exc()
            runs.append(
                DatasetRun(
                    dataset=dataset,
                    status="failed",
                    metrics=None,
                    json_report=None,
                    markdown_report=None,
                    error=error,
                )
            )

    json_path, md_path = write_suite_report(runs, args.output_name)
    print(f"\nSuite JSON report: {json_path}")
    print(f"Suite Markdown report: {md_path}")

    if any(run.status == "failed" for run in runs):
        raise SystemExit(1)


if __name__ == "__main__":
    main()
|
hf_cache/sentence_transformers/models--BAAI--bge-small-en-v1.5/snapshots/5c38ec7c405ec4b44b94cc5a9bb96e735b38267a/README.md
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
../../blobs/8b8567d75ffa619486d9590cb0eb76d66ad46c49
|
load_docs.py
DELETED
|
@@ -1,216 +0,0 @@
|
|
| 1 |
-
import asyncio
|
| 2 |
-
import hashlib
|
| 3 |
-
import os
|
| 4 |
-
from pathlib import Path
|
| 5 |
-
from typing import Iterable, List
|
| 6 |
-
from dotenv import load_dotenv
|
| 7 |
-
import chromadb
|
| 8 |
-
from chromadb.errors import NotFoundError
|
| 9 |
-
from pypdf import PdfReader
|
| 10 |
-
|
| 11 |
-
from llama_index.core import StorageContext, VectorStoreIndex
|
| 12 |
-
from llama_index.core.schema import Document, BaseNode
|
| 13 |
-
from llama_index.core.node_parser import SentenceSplitter
|
| 14 |
-
from llama_index.vector_stores.chroma import ChromaVectorStore
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
BASE_DIR = Path(__file__).resolve().parent
|
| 18 |
-
KNOWLEDGE_BASE_DIR = BASE_DIR / "knowledge_base"
|
| 19 |
-
RAW_DIR = KNOWLEDGE_BASE_DIR / "raw"
|
| 20 |
-
CHROMA_DB_DIR = KNOWLEDGE_BASE_DIR / "chroma_db"
|
| 21 |
-
HF_CACHE_DIR = BASE_DIR / "hf_cache"
|
| 22 |
-
COLLECTION_NAME = "options_knowledge"
|
| 23 |
-
|
| 24 |
-
EMBED_MODEL_NAME = "BAAI/bge-small-en-v1.5"
|
| 25 |
-
CHUNK_SIZE = 1000
|
| 26 |
-
CHUNK_OVERLAP = 150
|
| 27 |
-
|
| 28 |
-
REQUIRED_METADATA = [
|
| 29 |
-
"source_file",
|
| 30 |
-
"file_name",
|
| 31 |
-
"file_type",
|
| 32 |
-
"document_title",
|
| 33 |
-
"file_hash",
|
| 34 |
-
"chunk_id",
|
| 35 |
-
"chunk_index",
|
| 36 |
-
]
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
def configure_model_cache() -> None:
|
| 40 |
-
HF_CACHE_DIR.mkdir(parents=True, exist_ok=True)
|
| 41 |
-
os.environ.setdefault("HF_HOME", str(HF_CACHE_DIR))
|
| 42 |
-
os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", str(HF_CACHE_DIR / "sentence_transformers"))
|
| 43 |
-
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
def file_sha256(path: Path) -> str:
|
| 47 |
-
digest = hashlib.sha256()
|
| 48 |
-
with path.open("rb") as file:
|
| 49 |
-
for block in iter(lambda: file.read(1024 * 1024), b""):
|
| 50 |
-
digest.update(block)
|
| 51 |
-
return digest.hexdigest()
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
def load_md_file(path: Path) -> Document:
|
| 55 |
-
text = path.read_text(encoding="utf-8")
|
| 56 |
-
|
| 57 |
-
return Document(
|
| 58 |
-
text=text,
|
| 59 |
-
metadata={
|
| 60 |
-
"source_file": str(path.resolve()),
|
| 61 |
-
"file_name": path.name,
|
| 62 |
-
"file_type": "md",
|
| 63 |
-
"document_title": path.stem,
|
| 64 |
-
"file_hash": file_sha256(path),
|
| 65 |
-
},
|
| 66 |
-
)
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
def load_pdf_file(path: Path) -> List[Document]:
|
| 70 |
-
reader = PdfReader(str(path))
|
| 71 |
-
documents = []
|
| 72 |
-
|
| 73 |
-
for page_index, page in enumerate(reader.pages, start=1):
|
| 74 |
-
text = page.extract_text() or ""
|
| 75 |
-
|
| 76 |
-
if not text.strip():
|
| 77 |
-
continue
|
| 78 |
-
|
| 79 |
-
documents.append(
|
| 80 |
-
Document(
|
| 81 |
-
text=text,
|
| 82 |
-
metadata={
|
| 83 |
-
"source_file": str(path.resolve()),
|
| 84 |
-
"file_name": path.name,
|
| 85 |
-
"file_type": "pdf",
|
| 86 |
-
"document_title": path.stem,
|
| 87 |
-
"file_hash": file_sha256(path),
|
| 88 |
-
"page_number": page_index,
|
| 89 |
-
},
|
| 90 |
-
)
|
| 91 |
-
)
|
| 92 |
-
|
| 93 |
-
return documents
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
def iter_source_files(raw_dir: Path) -> Iterable[Path]:
|
| 97 |
-
supported_suffixes = {".md", ".markdown", ".pdf"}
|
| 98 |
-
for path in sorted(raw_dir.rglob("*")):
|
| 99 |
-
if path.is_file() and path.suffix.lower() in supported_suffixes:
|
| 100 |
-
yield path
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
def load_docs(raw_dir: Path = RAW_DIR) -> List[Document]:
|
| 104 |
-
documents: List[Document] = []
|
| 105 |
-
|
| 106 |
-
for path in iter_source_files(raw_dir):
|
| 107 |
-
suffix = path.suffix.lower()
|
| 108 |
-
|
| 109 |
-
if suffix in {".md", ".markdown"}:
|
| 110 |
-
documents.append(load_md_file(path))
|
| 111 |
-
elif suffix == ".pdf":
|
| 112 |
-
documents.extend(load_pdf_file(path))
|
| 113 |
-
|
| 114 |
-
if not documents:
|
| 115 |
-
raise ValueError(f"No supported documents found under {raw_dir}")
|
| 116 |
-
|
| 117 |
-
return documents
|
| 118 |
-
|
| 119 |
-
|
| 120 |
-
def add_chunk_metadata(nodes: List[BaseNode]) -> List[BaseNode]:
|
| 121 |
-
counters: dict[str, int] = {}
|
| 122 |
-
|
| 123 |
-
for node in nodes:
|
| 124 |
-
source_file = node.metadata["source_file"]
|
| 125 |
-
chunk_index = counters.get(source_file, 0)
|
| 126 |
-
counters[source_file] = chunk_index + 1
|
| 127 |
-
|
| 128 |
-
file_hash = node.metadata["file_hash"][:12]
|
| 129 |
-
page_number = node.metadata.get("page_number", "na")
|
| 130 |
-
chunk_id = f"{Path(source_file).stem}-{file_hash}-p{page_number}-c{chunk_index}"
|
| 131 |
-
|
| 132 |
-
node.metadata["chunk_id"] = chunk_id
|
| 133 |
-
node.metadata["chunk_index"] = chunk_index
|
| 134 |
-
node.id_ = chunk_id
|
| 135 |
-
|
| 136 |
-
return nodes
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
def validate_nodes(nodes: List[BaseNode]) -> None:
|
| 140 |
-
if not nodes:
|
| 141 |
-
raise ValueError("No chunks were created from the source documents.")
|
| 142 |
-
|
| 143 |
-
for node in nodes:
|
| 144 |
-
missing = [key for key in REQUIRED_METADATA if key not in node.metadata]
|
| 145 |
-
if missing:
|
| 146 |
-
raise ValueError(f"Node {node.node_id} is missing metadata fields: {missing}")
|
| 147 |
-
|
| 148 |
-
if node.metadata["file_type"] == "pdf" and "page_number" not in node.metadata:
|
| 149 |
-
raise ValueError(f"PDF node {node.node_id} is missing page_number metadata.")
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
def build_nodes(raw_dir: Path = RAW_DIR) -> List[BaseNode]:
|
| 153 |
-
documents = load_docs(raw_dir)
|
| 154 |
-
splitter = SentenceSplitter(
|
| 155 |
-
chunk_size=CHUNK_SIZE,
|
| 156 |
-
chunk_overlap=CHUNK_OVERLAP,
|
| 157 |
-
)
|
| 158 |
-
nodes = splitter.get_nodes_from_documents(documents)
|
| 159 |
-
add_chunk_metadata(nodes)
|
| 160 |
-
validate_nodes(nodes)
|
| 161 |
-
return nodes
|
| 162 |
-
|
| 163 |
-
|
| 164 |
-
async def build_index(raw_dir: Path = RAW_DIR, rebuild: bool = False) -> VectorStoreIndex:
|
| 165 |
-
configure_model_cache()
|
| 166 |
-
|
| 167 |
-
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
|
| 168 |
-
|
| 169 |
-
load_dotenv()
|
| 170 |
-
CHROMA_DB_DIR.mkdir(parents=True, exist_ok=True)
|
| 171 |
-
|
| 172 |
-
db = chromadb.PersistentClient(path=str(CHROMA_DB_DIR))
|
| 173 |
-
|
| 174 |
-
if rebuild:
|
| 175 |
-
try:
|
| 176 |
-
db.delete_collection(COLLECTION_NAME)
|
| 177 |
-
except (NotFoundError, ValueError):
|
| 178 |
-
pass
|
| 179 |
-
|
| 180 |
-
chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
|
| 181 |
-
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
|
| 182 |
-
storage_context = StorageContext.from_defaults(vector_store=vector_store)
|
| 183 |
-
embed_model = HuggingFaceEmbedding(
|
| 184 |
-
model_name=EMBED_MODEL_NAME,
|
| 185 |
-
cache_folder=str(HF_CACHE_DIR / "sentence_transformers"),
|
| 186 |
-
)
|
| 187 |
-
|
| 188 |
-
if rebuild or chroma_collection.count() == 0:
|
| 189 |
-
nodes = build_nodes(raw_dir)
|
| 190 |
-
index = VectorStoreIndex(
|
| 191 |
-
nodes,
|
| 192 |
-
storage_context=storage_context,
|
| 193 |
-
embed_model=embed_model,
|
| 194 |
-
show_progress=True,
|
| 195 |
-
)
|
| 196 |
-
print(f"Indexed {len(nodes)} chunks into collection '{COLLECTION_NAME}'.")
|
| 197 |
-
return index
|
| 198 |
-
|
| 199 |
-
print(f"Loaded existing collection '{COLLECTION_NAME}' with {chroma_collection.count()} chunks.")
|
| 200 |
-
return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
|
| 201 |
-
|
| 202 |
-
|
| 203 |
-
if __name__ == "__main__":
|
| 204 |
-
index = asyncio.run(build_index(rebuild=True))
|
| 205 |
-
retriever = index.as_retriever(similarity_top_k=5)
|
| 206 |
-
results = retriever.retrieve("What is volatility smile?")
|
| 207 |
-
|
| 208 |
-
print("\nTop retrieved chunks:")
|
| 209 |
-
for result in results:
|
| 210 |
-
metadata = result.node.metadata
|
| 211 |
-
source = metadata.get("file_name", "unknown")
|
| 212 |
-
page = metadata.get("page_number", "n/a")
|
| 213 |
-
score = result.score
|
| 214 |
-
print(f"- {source}, page {page}, score={score:.4f}")
|
| 215 |
-
print(result.node.get_content()[:500].replace("\n", " "))
|
| 216 |
-
print()
|
pyproject.toml
CHANGED
|
@@ -17,6 +17,7 @@ dependencies = [
|
|
| 17 |
"pypdf>=6.0.0",
|
| 18 |
"tokenizers>=0.22.0,<=0.23.0",
|
| 19 |
"transformers<5",
|
|
|
|
| 20 |
]
|
| 21 |
|
| 22 |
[build-system]
|
|
|
|
| 17 |
"pypdf>=6.0.0",
|
| 18 |
"tokenizers>=0.22.0,<=0.23.0",
|
| 19 |
"transformers<5",
|
| 20 |
+
"pymupdf>=1.27.2.3",
|
| 21 |
]
|
| 22 |
|
| 23 |
[build-system]
|
rag_pdf_optimization_notes.md
ADDED
|
@@ -0,0 +1,282 @@
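# RAG PDF Extraction and Chunking Optimization Summary

The goal of this optimization was to improve how the current RAG system parses financial PDFs, especially PDFs that contain many mathematical formulas, chapter headings, and figures. The original implementation handled basic vector retrieval, but PDF extraction, formula preservation, chunking, and metadata management were all fairly crude, which made retrieval results unstable.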
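
## I. Background

The project uses `LlamaIndex + Chroma + HuggingFaceEmbedding` to build a local knowledge base. The source PDF is a book on options and volatility. The initial pipeline was roughly:

```text
pypdf extracts the text of each page
-> SentenceSplitter splits it into fixed-length chunks
-> HuggingFace embedding
-> Chroma vector store
-> QueryKnowledgeTool retrieves and returns passages
```

This pipeline is adequate for ordinary plain text, but financial-textbook PDFs expose many problems: formulas are broken apart, chapter boundaries are lost, headers and footers interfere, figure text leaks into the body, and mathematical symbols come out in the wrong order.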
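
## II. Main Problems Encountered

### 1. Weak basic PDF text extraction

Initially only this call was used:

```python
page.extract_text()
```

The problems:

- Headers, page numbers, and copyright notices are mixed into the body text.
- Line breaks and word breaks are severe; for example, words are split by PDF line wrapping.
- Text order is easily scrambled near multi-column layouts, figures, and formulas.
- Mathematical formulas are often flattened onto one line, or their symbols come out in the wrong order.

Solution:

- Add the `pypdf` `layout` mode as a candidate.
- Add coordinate-level extraction: use `visitor_text` to obtain the `x/y` coordinates of each text fragment and regroup fragments into visual lines.
- Add text-cleaning logic:
    - Remove blank lines, page numbers, and repeated headers and footers.
    - Repair words broken by end-of-line hyphens.
    - Handle common ligatures such as `fi` and `fl`.
    - Preserve line breaks on formula lines instead of forcing formulas into ordinary paragraphs.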
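
The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the function names (`normalize_ligatures`, `repair_hyphenation`, `find_repeated_lines`, `clean_page`) are hypothetical, and the defaults `min_pages=3` and `boundary=4` are only assumed to correspond to the `PDF_REPEATED_LINE_MIN_PAGES` and `PDF_BOUNDARY_LINE_COUNT` constants that appear in `tools/query_knowledge.py`.

```python
import re
from collections import Counter

# Unicode ligature code points and their plain-text equivalents.
LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl", "\ufb03": "ffi", "\ufb04": "ffl"}


def normalize_ligatures(text: str) -> str:
    """Replace typographic ligatures with ordinary letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def repair_hyphenation(text: str) -> str:
    """Join words split by an end-of-line hyphen, e.g. 'vola-\\ntility'."""
    return re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)


def find_repeated_lines(pages: list[str], min_pages: int = 3, boundary: int = 4) -> set[str]:
    """Collect lines that recur near the top or bottom of at least min_pages pages."""
    counts: Counter[str] = Counter()
    for page in pages:
        lines = [line.strip() for line in page.splitlines() if line.strip()]
        # A set ensures each line is counted once per page.
        counts.update(set(lines[:boundary] + lines[-boundary:]))
    return {line for line, seen in counts.items() if seen >= min_pages}


def clean_page(page: str, repeated: set[str]) -> str:
    """Drop blank lines, bare page numbers, and repeated headers/footers."""
    kept = []
    for line in page.splitlines():
        stripped = line.strip()
        if not stripped or stripped in repeated or re.fullmatch(r"\d{1,4}", stripped):
            continue
        kept.append(stripped)
    return repair_hyphenation(normalize_ligatures("\n".join(kept)))
```

One caveat of this simple hyphen rule: it also joins genuine compounds that happen to break at a line end (for example `mean-` / `reverting`), so a real pipeline would probably need a dictionary check or a similar safeguard.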

### 2. Poor extraction of mathematical formulas

Financial textbooks contain many formulas that include:

- Greek letters such as `𝜎`, `𝜇`, `𝜌`
- Superscripts and subscripts
- Fractions
- Integrals, sums, and square roots
- Equation numbers such as `(21.23)`

Ordinary PDF text extraction can rarely recover these structures. For example:

```text
d𝜎 = a𝜎 dt + b𝜎 dZ
```

may be extracted with symbols stuck together, in the wrong order, or mixed into the surrounding body text.

Solution:

- Start with math-aware tuning of `pypdf`:
    - Detect formula lines.
    - Preserve line breaks for short formula lines, bracket lines, and square-root lines.
    - Try to mark superscripts and subscripts based on font size and vertical offset.

`pypdf` later proved to be insufficient even with these changes, so `PyMuPDF` was integrated as the next step.

### 3. Too many false-positive formulas after first integrating PyMuPDF

With `PyMuPDF` integrated, the call

```python
page.get_text("dict", sort=True)
```

returns block, line, span, bbox, and font information. This is better suited to locating formula regions than `pypdf`.

However, the first version of formula detection produced too many false positives.

For example:

- Phone numbers on the copyright page.
- `Black-Scholes-Merton` in ordinary body text.
- A single `𝜎` or `F=ma` inside an ordinary paragraph.
- Numbers on figure axes.

Any of these could be misidentified as a formula.

Solution:

- Switch from block-level to line-level formula detection.
- Stop treating ordinary italic fonts as math fonts.
- Tighten the formula trigger conditions:
    - A lone Greek letter does not count as a formula.
    - Plain `-` and `/` are not strong math signals, so English hyphens are not mistaken for formulas.
    - Focus on strong signals such as `=`, `∫`, `∑`, `√`, `≤`, `≥`, `∕`, and equation numbers.
- Add `is_useful_formula_text()` to filter out fragments that are too short, too fragmented, or lack a core formula structure.
- Merge formula continuation lines so that square roots, denominators, and brackets are not split into several isolated formula chunks.

The final result:

```text
body chunk
formula chunk: content_type=formula
formula position: formula_bbox
formula number: formula_id
```
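
As a rough illustration of the tightened line-level rules described above, the sketch below flags a line as a formula only when it carries a strong signal and does not read like a sentence. It is not the project's `is_useful_formula_text()`; the name `looks_like_formula`, the signal set, and the thresholds (minimum length 3, at most 3 long words) are assumptions chosen for the example.

```python
import re

# Strong math signals named in the notes; weak ones such as "-" and "/" are deliberately excluded.
STRONG_SIGNALS = set("=∫∑√≤≥∕")

# A trailing equation number such as "(21.23)".
EQUATION_NUMBER = re.compile(r"\(\d+\.\d+\)\s*$")


def looks_like_formula(line: str) -> bool:
    """Heuristically decide whether a single visual line is a formula."""
    text = line.strip()
    if len(text) < 3:
        # Too short: a lone Greek letter or a stray symbol is not a formula.
        return False

    has_strong_signal = any(char in STRONG_SIGNALS for char in text)
    has_equation_number = bool(EQUATION_NUMBER.search(text))
    if not (has_strong_signal or has_equation_number):
        return False

    # Lines with several ordinary words are probably prose that mentions a formula.
    long_words = re.findall(r"[A-Za-z]{4,}", text)
    return len(long_words) <= 3
```

The word-count cutoff is the crudest part of this sketch: it trades recall on long annotated equations for fewer false positives in running text, which matches the direction the notes describe.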

### 4. Missing chapter and heading segmentation

The original system used fixed-length splitting only:

```python
SentenceSplitter(chunk_size=1000, chunk_overlap=150)
```

The problems:

- A chunk may span chapters.
- A section's heading and its body may be separated.
- A retrieval result does not say which chapter or section it came from.
- Citations in answers are not clear enough.

Solution:

Add a chapter- and heading-aware segmentation layer before `SentenceSplitter`:

- Detect `CHAPTER ...`
- Detect `APPENDIX ...`
- Detect all-uppercase headings
- Detect title-case section names
- Filter out figure captions, axis labels, short formula lines, footnotes, and ordinary explanatory sentences

and write the result into the metadata:

```python
chapter_title
section_title
section_path
page_number
content_type
formula_id
```

A retrieval result can then return:

```text
source: The_volatility_Smile_Wiley.pdf
page: 379
section: WITH ZERO CORRELATION
content_type: formula
formula_id: formula-378-3
```
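
A simplified version of the heading-aware pass might look like the sketch below. It only covers the `CHAPTER`/`APPENDIX` and all-uppercase rules; title-case detection and the figure, footnote, and formula filters from the notes are omitted. The names `classify_heading` and `tag_sections` and the 80-character limit are hypothetical and not taken from the project.

```python
import re
from typing import Optional

CHAPTER_RE = re.compile(r"^(CHAPTER|APPENDIX)\s+\S+")
MATH_SIGNALS = set("=∫∑√")


def classify_heading(line: str) -> Optional[str]:
    """Return 'chapter', 'section', or None for an ordinary line."""
    text = line.strip()
    if not text or len(text) > 80:
        return None
    if CHAPTER_RE.match(text):
        return "chapter"
    if text.endswith((".", ",", ";", ":")):
        # Headings rarely end with sentence punctuation.
        return None
    letters = [char for char in text if char.isalpha()]
    is_all_caps = len(letters) >= 4 and all(char.isupper() for char in letters)
    if is_all_caps and not any(char in MATH_SIGNALS for char in text):
        return "section"
    return None


def tag_sections(lines: list[str]) -> list[dict]:
    """Attach the current chapter and section titles to every body line."""
    chapter: Optional[str] = None
    section: Optional[str] = None
    tagged = []
    for line in lines:
        kind = classify_heading(line)
        if kind == "chapter":
            chapter, section = line.strip(), None  # a new chapter resets the section
        elif kind == "section":
            section = line.strip()
        else:
            tagged.append({"text": line, "chapter_title": chapter, "section_title": section})
    return tagged
```

Each tagged group could then be passed to `SentenceSplitter` separately, so that the secondary split never crosses a heading boundary.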

### 5. Overlong metadata causes a LlamaIndex error

After formula bboxes were added, the bbox of every line was initially stored in the metadata, which made the metadata too long.

The error looked like this:

```text
Metadata length is longer than chunk size.
Consider increasing the chunk size or decreasing metadata size.
```

The cause is that `SentenceSplitter` counts the metadata length toward the chunk length.

Solution:

- Stop storing the bbox of every line.
- Merge the bboxes into a single bounding rectangle:

```text
x0,y0,x1,y1
```

This keeps the formula position while avoiding overlong metadata.
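
The bounding-rectangle merge is small enough to show in full. The following is a sketch under the assumption that each bbox is an `(x0, y0, x1, y1)` tuple, as PyMuPDF reports them; the helper names `merge_bboxes` and `bbox_to_metadata` are hypothetical.

```python
from typing import Iterable, Optional, Tuple

BBox = Tuple[float, float, float, float]


def merge_bboxes(bboxes: Iterable[BBox]) -> Optional[BBox]:
    """Return the smallest rectangle that encloses every input bbox."""
    boxes = list(bboxes)
    if not boxes:
        return None
    return (
        min(box[0] for box in boxes),  # leftmost x0
        min(box[1] for box in boxes),  # topmost y0
        max(box[2] for box in boxes),  # rightmost x1
        max(box[3] for box in boxes),  # bottommost y1
    )


def bbox_to_metadata(bbox: BBox, ndigits: int = 1) -> str:
    """Serialize a bbox as a compact 'x0,y0,x1,y1' string for chunk metadata."""
    return ",".join(str(round(value, ndigits)) for value in bbox)
```

Rounding to one decimal keeps the serialized string short, so the metadata consumes a fixed, small share of the chunk budget regardless of how many lines the formula spans.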

### 6. Hugging Face model loading keeps going online

The embedding model is already cached locally, but `sentence-transformers` still tries to reach Hugging Face for a HEAD check. In a network-restricted environment it retries repeatedly, which stalls index building.

Solution:

- Check whether a local snapshot exists.
- If it does, pass the local snapshot path directly to the embedding model.
- Set the offline environment variables:

```python
HF_HUB_OFFLINE=1
TRANSFORMERS_OFFLINE=1
```

Index building can then reliably use the local cache.

### 7. Old indexes are not updated automatically

After the PDF extraction logic is upgraded, RAG does not actually improve if Chroma still holds text from the old version.

Solution:

- Add a `PDF_EXTRACTION_METHOD` version identifier.
- The current version is:

```python
pymupdf_formula_blocks_v5
```

- At startup, check the `extraction_method` field in the Chroma metadata.
- If the version does not match, rebuild the index automatically.
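
The version check can be reduced to a pure function over the stored metadata, as in the sketch below. The name `needs_rebuild` is hypothetical, and treating only `file_type == "pdf"` entries as versioned is an assumption, since Markdown chunks may not carry an `extraction_method` field. In practice the metadata list would probably come from something like `collection.get(include=["metadatas"])["metadatas"]` on the Chroma collection.

```python
from typing import Mapping, Sequence


def needs_rebuild(stored_metadatas: Sequence[Mapping], current_method: str) -> bool:
    """Decide whether the index must be rebuilt for the current extraction version."""
    if not stored_metadatas:
        # Nothing indexed yet, so a build is required.
        return True
    pdf_entries = [md for md in stored_metadatas if md.get("file_type") == "pdf"]
    # Any PDF chunk produced by a different (or missing) extraction version is stale.
    return any(md.get("extraction_method") != current_method for md in pdf_entries)
```

Bumping the version string whenever the extraction rules change is what makes this check useful; if the string is forgotten, stale chunks stay in the index silently.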

## III. Final Solution

The final PDF RAG pipeline is:

```text
PyMuPDF extracts block / line / span / bbox / font
-> detect formula lines
-> merge formula continuation lines
-> generate standalone formula documents with content_type=formula
-> keep [FORMULA id=...] references in the body text
-> clean headers, footers, and noise
-> pre-segment by chapter/heading
-> secondary split with SentenceSplitter
-> write to Chroma
-> return page / section / content_type / formula_id at retrieval time
```

Key benefits:

- Formulas can serve as standalone retrieval units.
- The body text still keeps the formula context.
- Chunks no longer depend entirely on a fixed length.
- Retrieval results can state the source page, section, and content type.
- The index version is controlled, which keeps stale data from polluting results.
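
## IV. How to Present This in an Interview

A possible summary:

> Our initial RAG simply used `pypdf` to extract text page by page and then split it into fixed-length chunks. That works for ordinary documents but not for financial textbooks, which contain many mathematical formulas, figures, and chapter structure. The main problems were scrambled formula order, lost superscripts and subscripts, headers and footers mixed into the text, and chunks spanning chapters.

Then describe the solution:

> I started with basic cleaning, including deduplicating headers and footers, repairing broken words, and preserving line breaks on formula lines. I then found that `pypdf` has limited ability to locate formula regions, so I integrated `PyMuPDF` and used the block, line, span, bbox, and font information it returns to identify formula regions separately. Formulas are indexed as standalone chunks with `content_type=formula`, while the body text keeps `[FORMULA id=...]` references, so both formula retrieval and context retrieval are covered.

Then describe the engineering trade-offs:

> Formula detection cannot simply treat any Greek letter as a formula, or ordinary body text produces many false positives. So I tightened the rules to strong math signals such as equals signs, integrals, sums, square roots, equation numbers, and comparison operators, and filtered out fragments that are too short. The bboxes of all lines also cannot be written directly into the metadata, because LlamaIndex counts metadata toward the chunk length, so I merged the bboxes into a single bounding rectangle.

Finally describe the outcome:

> After the optimization, the index changed from plain-text chunks to a mixed structure of body chunks plus formula chunks. Every retrieval result carries metadata such as page, section, content_type, and formula_id, which makes it easier to locate sources when answering and is better suited to questions like "what does this formula mean".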

## V. Possible Further Optimizations

PyMuPDF is now integrated, but this is not yet full OCR/LaTeX formula recognition. Possible next steps:

1. Crop the image region given by `formula_bbox`.
2. Integrate a formula OCR model, for example LaTeX OCR.
3. Convert formula images to LaTeX.
4. Store the following together in the metadata:

```python
formula_text_raw
formula_latex
formula_bbox
page_number
section_path
```

5. Weight formula queries separately at retrieval time, or use hybrid search.
6. Add a reranker to improve ranking quality for formula-related questions.
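
## VI. One-Sentence Summary

The core of this optimization is not simply swapping in a different PDF parser. It upgrades PDF parsing from "extracting plain text page by page" to "structurally parsing body text, chapters, and formula regions", so that RAG chunks come closer to the semantic boundaries a human reader would perceive.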
|
requirements.txt
CHANGED
|
@@ -4,6 +4,7 @@ requests
|
|
| 4 |
duckduckgo_search
|
| 5 |
pandas
|
| 6 |
pypdf
|
|
|
|
| 7 |
chromadb
|
| 8 |
llama-index-core
|
| 9 |
llama-index-embeddings-huggingface
|
|
|
|
| 4 |
duckduckgo_search
|
| 5 |
pandas
|
| 6 |
pypdf
|
| 7 |
+
PyMuPDF
|
| 8 |
chromadb
|
| 9 |
llama-index-core
|
| 10 |
llama-index-embeddings-huggingface
|
tools/query_knowledge.py
ADDED
|
@@ -0,0 +1,1196 @@
|
| 1 |
+
from smolagents.tools import Tool
|
| 2 |
+
import asyncio
|
| 3 |
+
from collections import Counter
|
| 4 |
+
import hashlib
|
| 5 |
+
import logging
|
| 6 |
+
import os
|
| 7 |
+
from pathlib import Path
|
| 8 |
+
from typing import Iterable, List, Optional
|
| 9 |
+
import re
|
| 10 |
+
from dotenv import load_dotenv
|
| 11 |
+
import chromadb
|
| 12 |
+
from chromadb.errors import NotFoundError
|
| 13 |
+
from pypdf import PdfReader
|
| 14 |
+
|
| 15 |
+
from llama_index.core import StorageContext, VectorStoreIndex
|
| 16 |
+
from llama_index.core.schema import Document, BaseNode
|
| 17 |
+
from llama_index.core.node_parser import SentenceSplitter
|
| 18 |
+
from llama_index.vector_stores.chroma import ChromaVectorStore
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
BASE_DIR = Path(__file__).resolve().parent
|
| 22 |
+
KNOWLEDGE_BASE_DIR = BASE_DIR / "knowledge_base"
|
| 23 |
+
RAW_DIR = KNOWLEDGE_BASE_DIR / "raw"
|
| 24 |
+
CHROMA_DB_DIR = KNOWLEDGE_BASE_DIR / "chroma_db"
|
| 25 |
+
HF_CACHE_DIR = BASE_DIR / "hf_cache"
|
| 26 |
+
COLLECTION_NAME = "options_knowledge"
|
| 27 |
+
|
| 28 |
+
EMBED_MODEL_NAME = "BAAI/bge-small-en-v1.5"
|
| 29 |
+
CHUNK_SIZE = 1000
|
| 30 |
+
CHUNK_OVERLAP = 150
|
| 31 |
+
PDF_REPEATED_LINE_MIN_PAGES = 3
|
| 32 |
+
PDF_BOUNDARY_LINE_COUNT = 4
|
| 33 |
+
PDF_EXTRACTION_METHOD = "pymupdf_formula_blocks_v5"
|
| 34 |
+
PDF_LINE_Y_TOLERANCE = 3.0
|
| 35 |
+
PDF_MIN_SECTION_CHARS = 240
|
| 36 |
+
PDF_STRONG_MATH_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_σΣΔδθΘλΛμρπΠφΦτν𝜎𝜇𝜌𝜃𝜕")
|
| 37 |
+
PDF_WEAK_MATH_SYMBOLS = set("+-−*/∕<>")
|
| 38 |
+
PDF_MATH_SYMBOLS = PDF_STRONG_MATH_SYMBOLS | PDF_WEAK_MATH_SYMBOLS
|
| 39 |
+
PDF_OPERATOR_MATH_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_+-−*/∕<>")
|
| 40 |
+
PDF_FORMULA_TRIGGER_SYMBOLS = set("=∂∫∑∏√∞≈≠≤≥±×÷^_∕<>")
|
| 41 |
+
|
| 42 |
+
logging.getLogger("pypdf").setLevel(logging.ERROR)
|
| 43 |
+
|
| 44 |
+
|
| 45 |
+
def load_pymupdf():
|
| 46 |
+
try:
|
| 47 |
+
import fitz
|
| 48 |
+
except ImportError:
|
| 49 |
+
return None
|
| 50 |
+
|
| 51 |
+
return fitz
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
REQUIRED_METADATA = [
|
| 55 |
+
"source_file",
|
| 56 |
+
"file_name",
|
| 57 |
+
"file_type",
|
| 58 |
+
"document_title",
|
| 59 |
+
"file_hash",
|
| 60 |
+
"chunk_id",
|
| 61 |
+
"chunk_index",
|
| 62 |
+
]
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
def configure_model_cache() -> None:
|
| 66 |
+
HF_CACHE_DIR.mkdir(parents=True, exist_ok=True)
|
| 67 |
+
os.environ.setdefault("HF_HOME", str(HF_CACHE_DIR))
|
| 68 |
+
os.environ.setdefault("SENTENCE_TRANSFORMERS_HOME", str(
|
| 69 |
+
HF_CACHE_DIR / "sentence_transformers"))
|
| 70 |
+
os.environ.setdefault("TOKENIZERS_PARALLELISM", "false")
|
| 71 |
+
cached_model_dir = (
|
| 72 |
+
HF_CACHE_DIR
|
| 73 |
+
/ "sentence_transformers"
|
| 74 |
+
/ f"models--{EMBED_MODEL_NAME.replace('/', '--')}"
|
| 75 |
+
)
|
| 76 |
+
if cached_model_dir.exists():
|
| 77 |
+
os.environ.setdefault("HF_HUB_OFFLINE", "1")
|
| 78 |
+
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
def resolve_embed_model_name() -> str:
|
| 82 |
+
cached_model_dir = (
|
| 83 |
+
HF_CACHE_DIR
|
| 84 |
+
/ "sentence_transformers"
|
| 85 |
+
/ f"models--{EMBED_MODEL_NAME.replace('/', '--')}"
|
| 86 |
+
)
|
| 87 |
+
snapshots_dir = cached_model_dir / "snapshots"
|
| 88 |
+
if snapshots_dir.exists():
|
| 89 |
+
snapshots = sorted(path for path in snapshots_dir.iterdir() if path.is_dir())
|
| 90 |
+
if snapshots:
|
| 91 |
+
return str(snapshots[-1])
|
| 92 |
+
|
| 93 |
+
return EMBED_MODEL_NAME
|
| 94 |
+
|
| 95 |
+
|
def file_sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as file:
        for block in iter(lambda: file.read(1024 * 1024), b""):
            digest.update(block)
    return digest.hexdigest()


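A small self-contained check of the chunked hashing idea used by `file_sha256`: reading in 1 MiB blocks should give the same digest as hashing the whole payload at once. The helper name `chunked_sha256`, the sample payload, and the temporary file are illustrative assumptions only.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def chunked_sha256(path: Path, block_size: int = 1024 * 1024) -> str:
    """Hash a file block by block so large PDFs never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as file:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        for block in iter(lambda: file.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Hypothetical sample data: a payload larger than one block (2.4 MB).
payload = b"rag-eval" * 300_000
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(payload)
    sample_path = Path(handle.name)

try:
    chunked = chunked_sha256(sample_path)
    whole = hashlib.sha256(payload).hexdigest()
finally:
    os.unlink(sample_path)

print(chunked == whole)
```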
def load_md_file(path: Path) -> Document:
    text = path.read_text(encoding="utf-8")

    return Document(
        text=text,
        metadata={
            "source_file": str(path.resolve()),
            "file_name": path.name,
            "file_type": "md",
            "document_title": path.stem,
            "file_hash": file_sha256(path),
        },
    )


def append_visual_fragment(line_parts: List[str], text: str, baseline_y: float, item: dict) -> None:
    if not text:
        return

    stripped = text.strip()
    if not stripped:
        return

    font_size = item["font_size"]
    y_offset = item["y"] - baseline_y
    is_small = font_size < item["line_font_size"] * 0.82

    if is_small and y_offset > max(1.5, item["line_font_size"] * 0.18):
        line_parts.append(f"^{{{stripped}}}")
    elif is_small and y_offset < -max(1.5, item["line_font_size"] * 0.18):
        line_parts.append(f"_{{{stripped}}}")
    else:
        line_parts.append(stripped)


def join_visual_line(items: List[dict]) -> str:
    if not items:
        return ""

    items = sorted(items, key=lambda value: value["x"])
    baseline_y = sorted(item["y"] for item in items)[len(items) // 2]
    line_font_size = max(item["font_size"] for item in items)
    previous_right = None
    line_parts: List[str] = []

    for item in items:
        item["line_font_size"] = line_font_size
        if previous_right is not None:
            gap = item["x"] - previous_right
            if gap > max(2.5, line_font_size * 0.28):
                line_parts.append(" ")

        append_visual_fragment(line_parts, item["text"], baseline_y, item)
        previous_right = max(previous_right or item["x"], item["x"] + item["width"])

    return normalize_pdf_line("".join(line_parts))


def extract_pdf_text_by_position(page) -> str:
    fragments: List[dict] = []

    def visitor_text(text, cm, tm, font_dict, font_size):
        if not text or not text.strip():
            return

        x = float(tm[4])
        y = float(tm[5])
        # Guard against a missing or zero font size once, so the width
        # estimate and the stored size stay consistent.
        size = float(font_size or 1.0)
        width = max(len(text.strip()) * size * 0.45, size)
        fragments.append(
            {
                "text": text,
                "x": x,
                "y": y,
                "width": width,
                "font_size": size,
            }
        )

    try:
        page.extract_text(visitor_text=visitor_text)
    except Exception:
        return ""

    if not fragments:
        return ""

    lines: List[List[dict]] = []
    for fragment in sorted(fragments, key=lambda value: (-value["y"], value["x"])):
        for line in lines:
            if abs(line[0]["y"] - fragment["y"]) <= PDF_LINE_Y_TOLERANCE:
                line.append(fragment)
                break
        else:
            lines.append([fragment])

    return "\n".join(join_visual_line(line) for line in lines)


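The line-grouping step in `extract_pdf_text_by_position` can be tried in isolation. The sketch below is a simplified restatement under assumptions: `Y_TOLERANCE` is an assumed value (the commit's `PDF_LINE_Y_TOLERANCE` is defined elsewhere), and the fragment coordinates are invented sample data.

```python
from typing import Dict, List

Y_TOLERANCE = 2.0  # assumed value, not taken from the commit


def group_into_lines(fragments: List[Dict]) -> List[List[Dict]]:
    """Greedy grouping: visit fragments top to bottom, then left to right."""
    lines: List[List[Dict]] = []
    for fragment in sorted(fragments, key=lambda f: (-f["y"], f["x"])):
        for line in lines:
            # Compare against the first fragment placed on each line.
            if abs(line[0]["y"] - fragment["y"]) <= Y_TOLERANCE:
                line.append(fragment)
                break
        else:
            # No existing line is close enough: start a new one.
            lines.append([fragment])
    return lines


# Invented sample fragments; PDF y coordinates grow upwards.
fragments = [
    {"text": "E", "x": 10.0, "y": 700.0},
    {"text": "=", "x": 20.0, "y": 700.5},
    {"text": "mc", "x": 30.0, "y": 699.5},
    {"text": "where", "x": 10.0, "y": 680.0},
]
grouped = [
    [f["text"] for f in sorted(line, key=lambda f: f["x"])]
    for line in group_into_lines(fragments)
]
print(grouped)
```

Fragments within a line are re-sorted by `x` before joining, as `join_visual_line` does, because the grouping pass orders them by `y` first.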
def math_text_score(text: str) -> float:
    if not text.strip():
        return 0.0

    lines = [line for line in text.splitlines() if line.strip()]
    compact_length = len(re.sub(r"\s+", "", text))
    math_symbol_count = sum(1 for char in text if char in PDF_MATH_SYMBOLS)
    superscript_markers = text.count("^{") + text.count("_{")
    multiline_bonus = sum(1 for line in lines if is_formula_like(line)) * 8
    equation_block_bonus = sum(
        1
        for index, line in enumerate(lines)
        if is_formula_like(line)
        and (
            index > 0
            and is_formula_like(lines[index - 1])
            or index + 1 < len(lines)
            and is_formula_like(lines[index + 1])
        )
    ) * 12
    return (
        compact_length
        + math_symbol_count * 12
        + superscript_markers * 20
        + multiline_bonus
        + equation_block_bonus
    )


def extract_pdf_text(page) -> str:
    positioned_text = extract_pdf_text_by_position(page)

    try:
        layout_text = page.extract_text(extraction_mode="layout") or ""
    except Exception:
        layout_text = ""

    try:
        plain_text = page.extract_text() or ""
    except Exception:
        plain_text = ""

    candidates = [positioned_text, layout_text, plain_text]
    candidates = [candidate for candidate in candidates if candidate.strip()]
    if not candidates:
        return ""

    return max(candidates, key=math_text_score)


def pymupdf_span_text(span: dict) -> str:
    return normalize_pdf_line(span.get("text", ""))


def pymupdf_line_text(line: dict) -> str:
    return normalize_pdf_line("".join(pymupdf_span_text(span) for span in line.get("spans", [])))


def pymupdf_block_text(block: dict) -> str:
    lines = [
        pymupdf_line_text(line)
        for line in block.get("lines", [])
    ]
    return "\n".join(line for line in lines if line)


def pymupdf_span_has_math_font(span: dict) -> bool:
    font_name = span.get("font", "").lower()
    return any(
        marker in font_name
        for marker in ("math", "symbol", "cmmi", "cmsy", "cmex", "stix")
    )


def is_formula_block_line(line: str) -> bool:
    stripped = line.strip()
    if not stripped:
        return False

    trigger_math_count = sum(1 for char in stripped if char in PDF_FORMULA_TRIGGER_SYMBOLS)
    digit_count = sum(1 for char in stripped if char.isdigit())
    alpha_count = sum(1 for char in stripped if char.isalpha())
    alpha_words = [
        word
        for word in re.findall(r"[A-Za-z]+", stripped)
        if word.lower() not in {"and", "or", "the", "where", "then", "with", "for"}
    ]
    compact_length = len(re.sub(r"\s+", "", stripped))

    if compact_length < 3:
        return False
    if re.fullmatch(r"\(?\d+(\.\d+)?\)?", stripped):
        return False
    if re.search(r"\(\d+(\.\d+)+[a-z]?\)$", stripped) and compact_length <= 240:
        return True
    if "=" in stripped and compact_length <= 260 and len(alpha_words) <= 12:
        return True
    if any(char in stripped for char in "∂∫∑∏√∞≈≠≤≥±×÷") and compact_length <= 220 and len(alpha_words) <= 10:
        return True
    if trigger_math_count >= 2 and compact_length <= 120 and len(alpha_words) <= 6:
        return True
    if trigger_math_count >= 1 and digit_count >= 1 and alpha_count <= 18 and compact_length <= 100:
        return True

    return False


def is_formula_block(block: dict) -> bool:
    text = pymupdf_block_text(block)
    if not text:
        return False

    lines = [line for line in text.splitlines() if line.strip()]
    if any(is_formula_block_line(line) for line in lines):
        return True

    spans = [
        span
        for line in block.get("lines", [])
        for span in line.get("spans", [])
        if pymupdf_span_text(span)
    ]
    if not spans:
        return False

    math_font_count = sum(1 for span in spans if pymupdf_span_has_math_font(span))
    strong_math_count = sum(1 for char in text if char in PDF_STRONG_MATH_SYMBOLS)
    alpha_count = sum(1 for char in text if char.isalpha())
    digit_count = sum(1 for char in text if char.isdigit())
    compact_length = len(re.sub(r"\s+", "", text))

    if math_font_count >= 2 and compact_length <= 220:
        return True
    if strong_math_count >= 3 and compact_length <= 260:
        return True
    if strong_math_count >= 1 and digit_count >= 1 and alpha_count <= 20 and compact_length <= 160:
        return True

    return False


def block_bbox_string(block: dict) -> str:
    bbox = block.get("bbox") or []
    if len(bbox) != 4:
        return ""
    return ",".join(f"{float(value):.2f}" for value in bbox)


def line_bbox_string(line: dict) -> str:
    bbox = line.get("bbox") or []
    if len(bbox) != 4:
        return ""
    return ",".join(f"{float(value):.2f}" for value in bbox)


def pymupdf_line_has_math_font(line: dict) -> bool:
    return any(
        pymupdf_span_has_math_font(span)
        for span in line.get("spans", [])
        if pymupdf_span_text(span)
    )


def should_extract_formula_line(line: dict) -> bool:
    text = pymupdf_line_text(line)
    if not text:
        return False

    if is_formula_block_line(text):
        return True

    compact_length = len(re.sub(r"\s+", "", text))
    trigger_math_count = sum(1 for char in text if char in PDF_FORMULA_TRIGGER_SYMBOLS)
    alpha_words = re.findall(r"[A-Za-z]+", text)
    if (
        pymupdf_line_has_math_font(line)
        and trigger_math_count >= 1
        and compact_length <= 180
        and len(alpha_words) <= 6
    ):
        return True

    return False


def is_formula_continuation_line(text: str) -> bool:
    stripped = text.strip()
    if not stripped:
        return False

    compact = re.sub(r"\s+", "", stripped)
    if len(compact) > 90:
        return False
    if compact in {"(", ")", "[", "]", "{", "}", "√"}:
        return True

    alpha_words = re.findall(r"[A-Za-z]+", stripped)
    math_count = sum(1 for char in stripped if char in PDF_MATH_SYMBOLS)
    digit_count = sum(1 for char in stripped if char.isdigit())

    if len(alpha_words) <= 4 and (math_count >= 1 or digit_count >= 1):
        return True

    return False


def append_formula_block(
    formula_blocks: List[dict],
    body_blocks: List[str],
    page_number: int,
    formula_index: int,
    formula_lines: List[str],
    formula_bboxes: List[str],
) -> int:
    formula_text = clean_formula_text("\n".join(formula_lines))
    if not is_useful_formula_text(formula_text):
        return formula_index

    formula_id = f"formula-{page_number}-{formula_index}"
    formula_bbox = merge_bbox_strings(formula_bboxes)
    formula_blocks.append(
        {
            "id": formula_id,
            "text": formula_text,
            "bbox": formula_bbox,
        }
    )
    body_blocks.append(f"[FORMULA id={formula_id}]\n{formula_text}\n[/FORMULA]")
    return formula_index + 1


def merge_bbox_strings(bbox_strings: List[str]) -> str:
    boxes = []
    for bbox_string in bbox_strings:
        if not bbox_string:
            continue
        values = bbox_string.split(",")
        if len(values) != 4:
            continue
        try:
            boxes.append([float(value) for value in values])
        except ValueError:
            continue

    if not boxes:
        return ""

    x0 = min(box[0] for box in boxes)
    y0 = min(box[1] for box in boxes)
    x1 = max(box[2] for box in boxes)
    y1 = max(box[3] for box in boxes)
    return f"{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f}"


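To make the bounding-box merge concrete, here is a condensed restatement with a worked input. The name `union_bbox` and the sample boxes are assumptions for illustration; the logic is intended to mirror `merge_bbox_strings` (skip malformed entries, then take the min of the top-left and the max of the bottom-right corners).

```python
from typing import List


def union_bbox(bbox_strings: List[str]) -> str:
    """Return the smallest box covering every well-formed "x0,y0,x1,y1" string."""
    boxes = []
    for raw in bbox_strings:
        parts = raw.split(",")
        if len(parts) != 4:
            continue  # empty or malformed entry
        try:
            boxes.append([float(part) for part in parts])
        except ValueError:
            continue  # non-numeric coordinate
    if not boxes:
        return ""
    x0 = min(box[0] for box in boxes)
    y0 = min(box[1] for box in boxes)
    x1 = max(box[2] for box in boxes)
    y1 = max(box[3] for box in boxes)
    return f"{x0:.2f},{y0:.2f},{x1:.2f},{y1:.2f}"


# Two formula lines plus two entries that should be ignored.
merged = union_bbox(["10.00,20.00,110.00,32.00", "", "12.50,34.00,95.00,46.00", "bad,box"])
print(merged)
```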
def is_useful_formula_text(text: str) -> bool:
    stripped = text.strip()
    if not stripped:
        return False

    compact_length = len(re.sub(r"\s+", "", stripped))
    if compact_length < 6:
        return False

    lines = [line.strip() for line in stripped.splitlines() if line.strip()]
    if re.search(r"\(\d+(\.\d+)+[a-z]?\)", stripped):
        return True
    if any(char in stripped for char in "∂∫∑∏∞≈≠≤≥±×÷"):
        alpha_words = re.findall(r"[A-Za-z]+", stripped)
        return len(alpha_words) <= 12 or "=" in stripped

    for line in lines:
        if "=" not in line:
            continue

        alpha_words = [
            word
            for word in re.findall(r"[A-Za-z]+", line)
            if word.lower() not in {"and", "or", "the", "where", "then", "with", "for"}
        ]
        if len(alpha_words) <= 12 and len(line) <= 260:
            return True

    return False


def extract_pymupdf_page(page) -> dict:
    page_dict = page.get_text("dict", sort=True)
    body_blocks: List[str] = []
    formula_blocks: List[dict] = []
    formula_lines: List[str] = []
    formula_bboxes: List[str] = []
    formula_index = 0
    page_number = page.number + 1

    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:
            continue

        normal_lines: List[str] = []

        for line in block.get("lines", []):
            line_text = pymupdf_line_text(line)
            if not line_text:
                continue

            if should_extract_formula_line(line) or (
                formula_lines and is_formula_continuation_line(line_text)
            ):
                if normal_lines:
                    body_blocks.append("\n".join(normal_lines))
                    normal_lines = []
                formula_lines.append(line_text)
                formula_bboxes.append(line_bbox_string(line))
            else:
                if formula_lines:
                    formula_index = append_formula_block(
                        formula_blocks=formula_blocks,
                        body_blocks=body_blocks,
                        page_number=page_number,
                        formula_index=formula_index,
                        formula_lines=formula_lines,
                        formula_bboxes=formula_bboxes,
                    )
                    formula_lines = []
                    formula_bboxes = []
                normal_lines.append(line_text)

        if normal_lines:
            body_blocks.append("\n".join(normal_lines))

    if formula_lines:
        append_formula_block(
            formula_blocks=formula_blocks,
            body_blocks=body_blocks,
            page_number=page_number,
            formula_index=formula_index,
            formula_lines=formula_lines,
            formula_bboxes=formula_bboxes,
        )

    return {
        "text": "\n".join(body_blocks),
        "formula_blocks": formula_blocks,
        "backend": "pymupdf",
    }


def extract_pdf_pages_with_pymupdf(path: Path) -> Optional[List[dict]]:
    fitz = load_pymupdf()
    if fitz is None:
        return None

    try:
        document = fitz.open(str(path))
    except Exception:
        return None

    try:
        return [extract_pymupdf_page(page) for page in document]
    finally:
        document.close()


def clean_formula_text(text: str) -> str:
    lines = page_lines(text)
    if not lines:
        return ""

    text = "\n".join(lines)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def normalize_pdf_line(line: str) -> str:
    line = line.replace("\x00", " ")
    line = line.replace("\ufb00", "ff")
    line = line.replace("\ufb01", "fi")
    line = line.replace("\ufb02", "fl")
    line = line.replace("\ufb03", "ffi")
    line = line.replace("\ufb04", "ffl")
    line = re.sub(r"[ \t]+", " ", line)
    return line.strip()


def is_noise_line(line: str) -> bool:
    if not line:
        return True
    if re.fullmatch(r"\d+", line):
        return True
    if re.fullmatch(r"page\s+\d+(\s+of\s+\d+)?", line, flags=re.IGNORECASE):
        return True
    if re.fullmatch(r"[-_=\s]{3,}", line):
        return True
    return False


def is_formula_like(line: str) -> bool:
    stripped = line.strip()
    if not stripped:
        return False

    strong_math_count = sum(1 for char in stripped if char in PDF_STRONG_MATH_SYMBOLS)
    weak_math_count = sum(1 for char in stripped if char in PDF_WEAK_MATH_SYMBOLS)
    alpha_count = sum(1 for char in stripped if char.isalpha())
    digit_count = sum(1 for char in stripped if char.isdigit())
    compact = stripped.replace(" ", "")

    if "={" in compact or "^{" in compact or "_{" in compact:
        return True
    if compact in {"(", ")", "[", "]", "{", "}"}:
        return True
    if len(compact) <= 40 and any(char in compact for char in PDF_MATH_SYMBOLS):
        return True
    if strong_math_count >= 2 and len(stripped) <= 180:
        return True
    if strong_math_count >= 1 and weak_math_count >= 1 and len(stripped) <= 180:
        return True
    if "=" in stripped and (alpha_count + digit_count) >= 2 and len(stripped) <= 220:
        return True
    if re.search(r"\b(d|D|exp|ln|sqrt|max|min|var|cov)\s*[\(\[]", stripped):
        return True
    if alpha_count <= 4 and (strong_math_count + weak_math_count) >= 1 and digit_count >= 1:
        return True

    return False


def normalized_line_key(line: str) -> str:
    return re.sub(r"\d+", "#", line.lower()).strip()


def page_lines(text: str) -> List[str]:
    lines = []
    for line in text.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        normalized = normalize_pdf_line(line)
        if not is_noise_line(normalized):
            lines.append(normalized)
    return lines


def find_repeated_boundary_lines(raw_pages: List[str]) -> set[str]:
    counter: Counter[str] = Counter()

    for raw_text in raw_pages:
        lines = page_lines(raw_text)
        boundary_lines = lines[:PDF_BOUNDARY_LINE_COUNT] + lines[-PDF_BOUNDARY_LINE_COUNT:]
        # Count each key at most once per page: on short pages the head and
        # tail slices overlap, which would otherwise double-count a line.
        counter.update(
            {
                normalized_line_key(line)
                for line in boundary_lines
                if 3 <= len(line) <= 140
            }
        )

    min_count = min(
        PDF_REPEATED_LINE_MIN_PAGES,
        max(2, len(raw_pages) // 3),
    )
    return {line for line, count in counter.items() if count >= min_count}


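The repeated header/footer detection can be illustrated with a toy document. This is a sketch under assumptions: the names `line_key` and `repeated_boundary_keys`, the `edge` and `min_pages` defaults, and the sample pages are all invented here, and the sketch counts each key once per page.

```python
import re
from collections import Counter
from typing import List, Set


def line_key(line: str) -> str:
    """Mask digit runs so 'Page 1 of 3' and 'Page 2 of 3' share one key."""
    return re.sub(r"\d+", "#", line.lower()).strip()


def repeated_boundary_keys(pages: List[List[str]], edge: int = 2, min_pages: int = 3) -> Set[str]:
    counter: Counter = Counter()
    for lines in pages:
        boundary = lines[:edge] + lines[-edge:]
        # A set per page means a key is counted once per page at most.
        counter.update({line_key(line) for line in boundary})
    return {key for key, count in counter.items() if count >= min_pages}


# Invented three-page sample with a running title and a page footer.
pages = [
    ["Option Pricing Notes 1", "Intro text", "More body", "Page 1 of 3"],
    ["Option Pricing Notes 2", "Black-Scholes", "More body", "Page 2 of 3"],
    ["Option Pricing Notes 3", "Greeks", "Closing remarks", "Page 3 of 3"],
]
keys = repeated_boundary_keys(pages)
print(sorted(keys))
```

In this sample "more body" appears on only two pages, so it stays below the threshold and would be kept as body text.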
def clean_pdf_text(text: str, repeated_boundary_lines: set[str]) -> str:
    lines = page_lines(text)
    cleaned_lines = []

    for index, line in enumerate(lines):
        is_boundary = (
            index < PDF_BOUNDARY_LINE_COUNT
            or index >= len(lines) - PDF_BOUNDARY_LINE_COUNT
        )
        if is_boundary and normalized_line_key(line) in repeated_boundary_lines:
            continue
        cleaned_lines.append(line)

    merged_lines = []
    for line in cleaned_lines:
        if merged_lines and merged_lines[-1].endswith("-") and line[:1].islower():
            merged_lines[-1] = merged_lines[-1][:-1] + line
        else:
            merged_lines.append(line)

    text = "\n".join(merged_lines)
    text = preserve_math_line_breaks(text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


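The de-hyphenation rule inside `clean_pdf_text` is easy to check on its own. The helper name `merge_hyphenated` and the sample lines below are illustrative assumptions; the rule itself (join only when the previous line ends in `-` and the next line starts with a lowercase letter) is taken from the code above.

```python
from typing import List


def merge_hyphenated(lines: List[str]) -> List[str]:
    """Join words split across a line break, e.g. 'esti-' + 'mated'."""
    merged: List[str] = []
    for line in lines:
        if merged and merged[-1].endswith("-") and line[:1].islower():
            # Drop the trailing hyphen and glue the continuation on.
            merged[-1] = merged[-1][:-1] + line
        else:
            merged.append(line)
    return merged


sample = ["The volatility is esti-", "mated from prices.", "Black-", "Scholes holds."]
result = merge_hyphenated(sample)
print(result)
```

The lowercase check is what keeps a genuine compound such as "Black-Scholes" from being fused when it happens to break at the hyphen.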
def preserve_math_line_breaks(text: str) -> str:
    lines = text.split("\n")
    if not lines:
        return ""

    output = [lines[0]]
    in_formula_block = is_formula_like(lines[0])
    for line in lines[1:]:
        previous = output[-1]
        line_is_formula = is_formula_like(line)
        previous_is_formula = is_formula_like(previous)

        if previous_is_formula or line_is_formula or in_formula_block:
            output.append(line)
            in_formula_block = line_is_formula or (
                in_formula_block
                and not line.endswith((".", ";", ":", "?", "!"))
            )
        elif previous.endswith((".", ":", ";", "?", "!", ")")):
            output.append(line)
            in_formula_block = False
        else:
            output[-1] = f"{previous} {line}"
            in_formula_block = False

    return "\n".join(output)


def is_chapter_heading(line: str) -> bool:
    return bool(re.fullmatch(
        r"(chapter|appendix)\s+([0-9]+|[ivxlcdm]+|[a-z])",
        line.strip(),
        flags=re.IGNORECASE,
    ))


def titlecase_word_ratio(words: List[str]) -> float:
    candidate_words = [
        word.strip("()[]{}:;,.")
        for word in words
        if any(char.isalpha() for char in word)
    ]
    if not candidate_words:
        return 0.0

    titlecase_words = [
        word
        for word in candidate_words
        if word[:1].isupper()
        or word.lower() in {"a", "an", "and", "for", "in", "of", "on", "or", "the", "to", "with"}
    ]
    return len(titlecase_words) / len(candidate_words)


def uppercase_letter_ratio(text: str) -> float:
    letters = [char for char in text if char.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for char in letters if char.isupper()) / len(letters)


def is_section_heading(line: str) -> bool:
    stripped = line.strip()
    if not 4 <= len(stripped) <= 150:
        return False
    letters = [char for char in stripped if char.isalpha()]
    digit_count = sum(1 for char in stripped if char.isdigit())
    alpha_words = [
        word.strip("()[]{}:;,.")
        for word in stripped.split()
        if any(char.isalpha() for char in word)
    ]
    if len(letters) < 6 or len(alpha_words) < 2:
        return False
    if digit_count > max(4, len(letters)):
        return False
    if "%" in stripped and digit_count >= len(letters) / 2:
        return False
    numbered_heading = bool(re.match(r"^\d+(\.\d+)+\s+", stripped))
    if stripped[:1].isdigit() and not numbered_heading:
        return False
    if re.match(
        r"^(in|from|where|thus|then|now|let|because|while|figure|table|for)\b",
        stripped,
        flags=re.IGNORECASE,
    ):
        return False
    if is_formula_like(stripped):
        return False
    if stripped.endswith((".", ",", ";")):
        return False
    if re.match(r"^(figure|table)\s+\d", stripped, flags=re.IGNORECASE):
        return False
    if numbered_heading:
        return True

    words = stripped.split()
    if len(words) > 16:
        return False
    if uppercase_letter_ratio(stripped) >= 0.72 and len(words) >= 2:
        return True
    if len(words) >= 4 and titlecase_word_ratio(words) >= 0.68:
        return True

    return False


def make_section_path(chapter_title: str, section_title: str) -> str:
    if chapter_title and section_title and section_title != chapter_title:
        return f"{chapter_title} > {section_title}"
    return section_title or chapter_title


def split_pdf_page_into_sections(
    path: Path,
    page_index: int,
    text: str,
    file_hash: str,
    section_state: dict,
    extraction_backend: str,
    formula_count: int,
) -> List[Document]:
    documents = []
    lines = text.splitlines()
    pending_lines: List[str] = []
    pending_metadata = {
        "chapter_title": section_state.get("chapter_title", ""),
        "section_title": section_state.get("section_title", ""),
    }

    def flush_pending() -> None:
        nonlocal pending_lines, pending_metadata
        section_text = "\n".join(line for line in pending_lines if line.strip()).strip()
        if not section_text:
            pending_lines = []
            return

        chapter_title = pending_metadata.get("chapter_title", "")
        section_title = pending_metadata.get("section_title", "")
        documents.append(
            Document(
                text=section_text,
                metadata={
                    "source_file": str(path.resolve()),
                    "file_name": path.name,
                    "file_type": "pdf",
                    "document_title": path.stem,
                    "file_hash": file_hash,
                    "page_number": page_index,
                    "extraction_method": PDF_EXTRACTION_METHOD,
                    "extraction_backend": extraction_backend,
                    "char_count": len(section_text),
                    "formula_count": formula_count,
                    "content_type": "text",
                    "chapter_title": chapter_title,
                    "section_title": section_title,
                    "section_path": make_section_path(chapter_title, section_title),
                },
            )
        )
        pending_lines = []

    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue

        if is_chapter_heading(stripped):
            if len("\n".join(pending_lines)) >= PDF_MIN_SECTION_CHARS:
                flush_pending()
            section_state["pending_chapter_label"] = stripped.title()
            section_state["chapter_title"] = stripped.title()
            section_state["section_title"] = stripped.title()
            pending_metadata = {
                "chapter_title": section_state["chapter_title"],
                "section_title": section_state["section_title"],
            }
            pending_lines.append(stripped)
            continue

        if section_state.get("pending_chapter_label") and is_section_heading(stripped):
            if pending_lines == [section_state["pending_chapter_label"]]:
                pending_lines[0] = f"{section_state['pending_chapter_label']}: {stripped}"
            else:
                pending_lines.append(stripped)

            section_state["chapter_title"] = pending_lines[-1]
            section_state["section_title"] = pending_lines[-1]
            section_state["pending_chapter_label"] = ""
            pending_metadata = {
                "chapter_title": section_state["chapter_title"],
                "section_title": section_state["section_title"],
            }
            continue

        if is_section_heading(stripped):
            if len("\n".join(pending_lines)) >= PDF_MIN_SECTION_CHARS:
                flush_pending()
            section_state["section_title"] = stripped
            section_state["pending_chapter_label"] = ""
            pending_metadata = {
                "chapter_title": section_state.get("chapter_title", ""),
                "section_title": section_state["section_title"],
            }

        pending_lines.append(stripped)

    flush_pending()
    return documents


def make_formula_documents(
    path: Path,
    page_index: int,
    formula_blocks: List[dict],
    file_hash: str,
    extraction_backend: str,
) -> List[Document]:
    documents = []

    for formula_index, formula in enumerate(formula_blocks):
        formula_text = formula.get("text", "").strip()
        if not formula_text:
            continue

        documents.append(
            Document(
                text=f"[FORMULA]\n{formula_text}\n[/FORMULA]",
                metadata={
                    "source_file": str(path.resolve()),
                    "file_name": path.name,
                    "file_type": "pdf",
                    "document_title": path.stem,
                    "file_hash": file_hash,
                    "page_number": page_index,
                    "extraction_method": PDF_EXTRACTION_METHOD,
                    "extraction_backend": extraction_backend,
                    "char_count": len(formula_text),
                    "content_type": "formula",
                    "formula_id": formula.get("id", f"formula-{page_index}-{formula_index}"),
                    "formula_index": formula_index,
                    "formula_bbox": formula.get("bbox", ""),
                    "formula_count": 1,
                    "chapter_title": "",
                    "section_title": "",
                    "section_path": "",
                },
            )
        )

    return documents


def load_pdf_file(path: Path) -> List[Document]:
    reader = PdfReader(str(path))
    documents = []
    pymupdf_pages = extract_pdf_pages_with_pymupdf(path)

    if pymupdf_pages:
        page_payloads = pymupdf_pages
    else:
        page_payloads = [
            {
                "text": extract_pdf_text(page),
                "formula_blocks": [],
                "backend": "pypdf",
            }
            for page in reader.pages
        ]

    raw_pages = [payload["text"] for payload in page_payloads]
    repeated_boundary_lines = find_repeated_boundary_lines(raw_pages)
    file_hash = file_sha256(path)
    section_state: dict = {
        "chapter_title": "",
        "section_title": "",
        "pending_chapter_label": "",
    }

    for page_index, payload in enumerate(page_payloads, start=1):
        raw_text = payload["text"]
        text = clean_pdf_text(raw_text, repeated_boundary_lines)
        formula_blocks = payload.get("formula_blocks", [])
        extraction_backend = payload.get("backend", "pypdf")

        if not text.strip():
            documents.extend(
                make_formula_documents(
                    path=path,
                    page_index=page_index,
                    formula_blocks=formula_blocks,
                    file_hash=file_hash,
                    extraction_backend=extraction_backend,
                )
            )
            continue

        documents.extend(
            split_pdf_page_into_sections(
                path=path,
                page_index=page_index,
                text=text,
                file_hash=file_hash,
                section_state=section_state,
                extraction_backend=extraction_backend,
                formula_count=len(formula_blocks),
            )
        )
        documents.extend(
            make_formula_documents(
                path=path,
                page_index=page_index,
                formula_blocks=formula_blocks,
                file_hash=file_hash,
                extraction_backend=extraction_backend,
            )
        )

    return documents


| 1010 |
+
def load_txt_file(path: Path) -> List[Document]:
|
| 1011 |
+
# TODO: implement plain-text loading; until then this returns no documents.
|
| 1012 |
+
pass
|
| 1013 |
+
return []
|
| 1014 |
+
|
| 1015 |
+
|
| 1016 |
+
def iter_source_files(raw_dir: Path) -> Iterable[Path]:
|
| 1017 |
+
supported_suffixes = {".md", ".markdown", ".pdf"}  # ".txt" is not listed, so the ".txt" branch in load_docs is never reached
|
| 1018 |
+
for path in sorted(raw_dir.rglob("*")):
|
| 1019 |
+
if path.is_file() and path.suffix.lower() in supported_suffixes:
|
| 1020 |
+
yield path
|
| 1021 |
+
|
| 1022 |
+
|
| 1023 |
+
def load_docs(raw_dir: Path = RAW_DIR) -> List[Document]:
|
| 1024 |
+
documents: List[Document] = []
|
| 1025 |
+
|
| 1026 |
+
for path in iter_source_files(raw_dir):
|
| 1027 |
+
suffix = path.suffix.lower()
|
| 1028 |
+
|
| 1029 |
+
if suffix in {".md", ".markdown"}:
|
| 1030 |
+
documents.append(load_md_file(path))
|
| 1031 |
+
elif suffix == ".pdf":
|
| 1032 |
+
documents.extend(load_pdf_file(path))
|
| 1033 |
+
elif suffix == ".txt":
|
| 1034 |
+
documents.extend(load_txt_file(path))
|
| 1035 |
+
|
| 1036 |
+
if not documents:
|
| 1037 |
+
raise ValueError(f"No supported documents found under {raw_dir}")
|
| 1038 |
+
|
| 1039 |
+
return documents
|
| 1040 |
+
|
| 1041 |
+
|
| 1042 |
+
def add_chunk_metadata(nodes: List[BaseNode]) -> List[BaseNode]:
|
| 1043 |
+
counters: dict[str, int] = {}
|
| 1044 |
+
|
| 1045 |
+
for node in nodes:
|
| 1046 |
+
source_file = node.metadata["source_file"]
|
| 1047 |
+
chunk_index = counters.get(source_file, 0)
|
| 1048 |
+
counters[source_file] = chunk_index + 1
|
| 1049 |
+
|
| 1050 |
+
file_hash = node.metadata["file_hash"][:12]
|
| 1051 |
+
page_number = node.metadata.get("page_number", "na")
|
| 1052 |
+
chunk_id = f"{Path(source_file).stem}-{file_hash}-p{page_number}-c{chunk_index}"
|
| 1053 |
+
|
| 1054 |
+
node.metadata["chunk_id"] = chunk_id
|
| 1055 |
+
node.metadata["chunk_index"] = chunk_index
|
| 1056 |
+
node.id_ = chunk_id
|
| 1057 |
+
|
| 1058 |
+
return nodes
|
| 1059 |
+
|
| 1060 |
+
|
| 1061 |
+
def validate_nodes(nodes: List[BaseNode]) -> None:
|
| 1062 |
+
if not nodes:
|
| 1063 |
+
raise ValueError("No chunks were created from the source documents.")
|
| 1064 |
+
|
| 1065 |
+
for node in nodes:
|
| 1066 |
+
missing = [key for key in REQUIRED_METADATA if key not in node.metadata]
|
| 1067 |
+
if missing:
|
| 1068 |
+
raise ValueError(
|
| 1069 |
+
f"Node {node.node_id} is missing metadata fields: {missing}")
|
| 1070 |
+
|
| 1071 |
+
if node.metadata["file_type"] == "pdf" and "page_number" not in node.metadata:
|
| 1072 |
+
raise ValueError(
|
| 1073 |
+
f"PDF node {node.node_id} is missing page_number metadata.")
|
| 1074 |
+
|
| 1075 |
+
|
| 1076 |
+
def build_nodes(raw_dir: Path = RAW_DIR) -> List[BaseNode]:
|
| 1077 |
+
documents = load_docs(raw_dir)
|
| 1078 |
+
splitter = SentenceSplitter(
|
| 1079 |
+
chunk_size=CHUNK_SIZE,
|
| 1080 |
+
chunk_overlap=CHUNK_OVERLAP,
|
| 1081 |
+
)
|
| 1082 |
+
nodes = splitter.get_nodes_from_documents(documents)
|
| 1083 |
+
add_chunk_metadata(nodes)
|
| 1084 |
+
validate_nodes(nodes)
|
| 1085 |
+
return nodes
|
| 1086 |
+
|
| 1087 |
+
|
| 1088 |
+
def collection_needs_pdf_rebuild(chroma_collection) -> bool:
|
| 1089 |
+
if chroma_collection.count() == 0:
|
| 1090 |
+
return True
|
| 1091 |
+
|
| 1092 |
+
try:
|
| 1093 |
+
sample = chroma_collection.peek(limit=min(chroma_collection.count(), 20))
|
| 1094 |
+
except Exception:
|
| 1095 |
+
return False
|
| 1096 |
+
|
| 1097 |
+
for metadata in sample.get("metadatas") or []:
|
| 1098 |
+
if metadata.get("file_type") == "pdf":
|
| 1099 |
+
return metadata.get("extraction_method") != PDF_EXTRACTION_METHOD
|
| 1100 |
+
|
| 1101 |
+
return False
|
| 1102 |
+
|
| 1103 |
+
|
| 1104 |
+
async def build_index(raw_dir: Path = RAW_DIR, rebuild: bool = False) -> VectorStoreIndex:
|
| 1105 |
+
configure_model_cache()
|
| 1106 |
+
|
| 1107 |
+
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
|
| 1108 |
+
|
| 1109 |
+
load_dotenv()
|
| 1110 |
+
CHROMA_DB_DIR.mkdir(parents=True, exist_ok=True)
|
| 1111 |
+
|
| 1112 |
+
db = chromadb.PersistentClient(path=str(CHROMA_DB_DIR))
|
| 1113 |
+
|
| 1114 |
+
if rebuild:
|
| 1115 |
+
try:
|
| 1116 |
+
db.delete_collection(COLLECTION_NAME)
|
| 1117 |
+
except (NotFoundError, ValueError):
|
| 1118 |
+
pass
|
| 1119 |
+
|
| 1120 |
+
chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
|
| 1121 |
+
if not rebuild and collection_needs_pdf_rebuild(chroma_collection):
|
| 1122 |
+
db.delete_collection(COLLECTION_NAME)
|
| 1123 |
+
chroma_collection = db.get_or_create_collection(COLLECTION_NAME)
|
| 1124 |
+
|
| 1125 |
+
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
|
| 1126 |
+
storage_context = StorageContext.from_defaults(vector_store=vector_store)
|
| 1127 |
+
embed_model = HuggingFaceEmbedding(
|
| 1128 |
+
model_name=resolve_embed_model_name(),
|
| 1129 |
+
cache_folder=str(HF_CACHE_DIR / "sentence_transformers"),
|
| 1130 |
+
)
|
| 1131 |
+
|
| 1132 |
+
if rebuild or chroma_collection.count() == 0:
|
| 1133 |
+
nodes = build_nodes(raw_dir)
|
| 1134 |
+
index = VectorStoreIndex(
|
| 1135 |
+
nodes,
|
| 1136 |
+
storage_context=storage_context,
|
| 1137 |
+
embed_model=embed_model,
|
| 1138 |
+
show_progress=True,
|
| 1139 |
+
)
|
| 1140 |
+
print(
|
| 1141 |
+
f"Indexed {len(nodes)} chunks into collection '{COLLECTION_NAME}'")
|
| 1142 |
+
return index
|
| 1143 |
+
|
| 1144 |
+
print(
|
| 1145 |
+
f"Loaded existing collection '{COLLECTION_NAME}' with {chroma_collection.count()} chunks.")
|
| 1146 |
+
return VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
|
| 1147 |
+
|
| 1148 |
+
|
| 1149 |
+
class QueryKnowledgeTool(Tool):
|
| 1150 |
+
name = "query_knowledge"
|
| 1151 |
+
description = "Performs a search of related information based on your query"
|
| 1152 |
+
inputs = {'query': {'type': 'string',
|
| 1153 |
+
'description': 'The search query to perform.'}}
|
| 1154 |
+
output_type = "string"
|
| 1155 |
+
|
| 1156 |
+
@staticmethod
|
| 1157 |
+
def format_results(results):
|
| 1158 |
+
output = []
|
| 1159 |
+
|
| 1160 |
+
for result in results:
|
| 1161 |
+
metadata = result.node.metadata
|
| 1162 |
+
source = metadata.get("file_name", "unknown")
|
| 1163 |
+
page = metadata.get("page_number", "n/a")
|
| 1164 |
+
section = metadata.get("section_path") or metadata.get("section_title") or "n/a"
|
| 1165 |
+
content_type = metadata.get("content_type", "text")
|
| 1166 |
+
formula_id = metadata.get("formula_id", "")
|
| 1167 |
+
score = result.score
|
| 1168 |
+
text = result.node.get_content()
|
| 1169 |
+
|
| 1170 |
+
output.append(
|
| 1171 |
+
f"source:{source}\n"
|
| 1172 |
+
f"page:{page}\n"
|
| 1173 |
+
f"section:{section}\n"
|
| 1174 |
+
f"content_type:{content_type}\n"
|
| 1175 |
+
f"formula_id:{formula_id or 'n/a'}\n"
|
| 1176 |
+
f"score:{score:.4f}\n"
|
| 1177 |
+
f"content:{text}"
|
| 1178 |
+
)
|
| 1179 |
+
|
| 1180 |
+
return "\n\n---\n\n".join(output)
|
| 1181 |
+
|
| 1182 |
+
def __init__(self, max_results=10, top_k=5, **kwargs):
|
| 1183 |
+
super().__init__()
|
| 1184 |
+
self.max_results = max_results
|
| 1185 |
+
index = asyncio.run(build_index(rebuild=False))
|
| 1186 |
+
self.retriever = index.as_retriever(similarity_top_k=top_k)
|
| 1187 |
+
|
| 1188 |
+
def forward(self, query: str) -> str:
|
| 1189 |
+
results = self.retriever.retrieve(query)
|
| 1190 |
+
return QueryKnowledgeTool.format_results(results)
|
| 1191 |
+
|
| 1192 |
+
|
| 1193 |
+
if __name__ == "__main__":
|
| 1194 |
+
query_tool = QueryKnowledgeTool()
|
| 1195 |
+
res: str = query_tool.forward("What is an option?")
|
| 1196 |
+
print(res)
|
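The chunk-ID scheme used by `add_chunk_metadata` above can be illustrated in isolation. The following is a minimal sketch, assuming plain dictionaries in place of LlamaIndex nodes; the helper name `assign_chunk_ids` and the sample values are hypothetical and are not part of this commit.

```python
from pathlib import Path


def assign_chunk_ids(chunks: list[dict]) -> list[dict]:
    """Give each chunk a per-file running index and a derived chunk_id.

    Hypothetical stand-in for add_chunk_metadata: each dict plays the role
    of node.metadata, and the ID format mirrors the one in the diff.
    """
    counters: dict[str, int] = {}

    for chunk in chunks:
        source_file = chunk["source_file"]
        # The counter is keyed by source file, so indices restart per file
        # and keep increasing across pages of the same file.
        chunk_index = counters.get(source_file, 0)
        counters[source_file] = chunk_index + 1

        short_hash = chunk["file_hash"][:12]
        # Non-PDF sources carry no page number; "na" is the fallback.
        page_number = chunk.get("page_number", "na")

        chunk["chunk_index"] = chunk_index
        chunk["chunk_id"] = (
            f"{Path(source_file).stem}-{short_hash}-p{page_number}-c{chunk_index}"
        )

    return chunks


if __name__ == "__main__":
    demo = [
        {"source_file": "/data/a.pdf", "file_hash": "0123456789abcdef", "page_number": 1},
        {"source_file": "/data/a.pdf", "file_hash": "0123456789abcdef", "page_number": 2},
        {"source_file": "/data/b.md", "file_hash": "fedcba9876543210"},
    ]
    for chunk in assign_chunk_ids(demo):
        print(chunk["chunk_id"])
```

Because the ID embeds a prefix of the file hash, re-indexing an edited file yields new IDs, while the per-file counter keeps IDs unique even when several chunks share a page.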
tools/todo.md
ADDED
|
@@ -0,0 +1,5 @@
|
| 1 |
+
1. Add a reranker
|
| 2 |
+
2. Change the embedding model
|
| 3 |
+
3. The chunking strategy is coarse; consider splitting by chapter, heading, etc.
|
| 4 |
+
4. Improve PDF extraction
|
| 5 |
+
5. Finish load_txt
|
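The last to-do item refers to the `load_txt_file` stub in `tools/query_knowledge.py`, which currently returns an empty list. One possible shape for it is sketched below. This is only an illustration under stated assumptions: it returns plain dictionaries rather than `Document` objects, the function name `load_txt_as_record` is hypothetical, and the hash is assumed to be a SHA-256 of the file bytes (the diff calls `file_sha256`, whose body is not shown here). The metadata keys are the subset of those used for PDF documents that also make sense for plain text.

```python
import hashlib
from pathlib import Path


def load_txt_as_record(path: Path) -> list[dict]:
    """Read a UTF-8 text file and return at most one document-like record.

    Hypothetical sketch: a real implementation would build a Document and
    include every field listed in REQUIRED_METADATA.
    """
    raw = path.read_bytes()
    # Replace undecodable bytes instead of failing on mixed encodings.
    text = raw.decode("utf-8", errors="replace").strip()
    if not text:
        # Mirror the PDF loader, which skips pages with no text.
        return []

    return [
        {
            "text": text,
            "metadata": {
                "source_file": str(path.resolve()),
                "file_name": path.name,
                "file_type": "txt",
                "document_title": path.stem,
                # Assumption: file_sha256 hashes the raw file contents.
                "file_hash": hashlib.sha256(raw).hexdigest(),
                "char_count": len(text),
            },
        }
    ]
```

For a loader like this to be reached, `.txt` would also have to appear in `supported_suffixes` inside `iter_source_files`, which currently yields only `.md`, `.markdown`, and `.pdf` files.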
uv.lock
CHANGED
|
@@ -660,6 +660,7 @@ dependencies = [
|
|
| 660 |
{ name = "llama-index-core" },
|
| 661 |
{ name = "llama-index-embeddings-huggingface" },
|
| 662 |
{ name = "llama-index-vector-stores-chroma" },
|
|
|
|
| 663 |
{ name = "pypdf" },
|
| 664 |
{ name = "tokenizers" },
|
| 665 |
{ name = "transformers" },
|
|
@@ -673,6 +674,7 @@ requires-dist = [
|
|
| 673 |
{ name = "llama-index-core", specifier = ">=0.14.0" },
|
| 674 |
{ name = "llama-index-embeddings-huggingface", specifier = ">=0.6.0" },
|
| 675 |
{ name = "llama-index-vector-stores-chroma", specifier = ">=0.5.0" },
|
|
|
|
| 676 |
{ name = "pypdf", specifier = ">=6.0.0" },
|
| 677 |
{ name = "tokenizers", specifier = ">=0.22.0,<=0.23.0" },
|
| 678 |
{ name = "transformers", specifier = "<5" },
|
|
@@ -2570,6 +2572,22 @@ wheels = [
|
|
| 2570 |
{ url = "https://files.pythonhosted.org/packages/f4/7e/a72dd26f3b0f4f2bf1dd8923c85f7ceb43172af56d63c7383eb62b332364/pygments-2.20.0-py3-none-any.whl", hash = "sha256:81a9e26dd42fd28a23a2d169d86d7ac03b46e2f8b59ed4698fb4785f946d0176", size = 1231151, upload-time = "2026-03-29T13:29:30.038Z" },
|
| 2571 |
]
|
| 2572 |
|
| 2573 |
[[package]]
|
| 2574 |
name = "pypdf"
|
| 2575 |
version = "6.12.0"
|
|
|
|
| 660 |
{ name = "llama-index-core" },
|
| 661 |
{ name = "llama-index-embeddings-huggingface" },
|
| 662 |
{ name = "llama-index-vector-stores-chroma" },
|
| 663 |
+
{ name = "pymupdf" },
|
| 664 |
{ name = "pypdf" },
|
| 665 |
{ name = "tokenizers" },
|
| 666 |
{ name = "transformers" },
|
|
|
|
| 674 |
{ name = "llama-index-core", specifier = ">=0.14.0" },
|
| 675 |
{ name = "llama-index-embeddings-huggingface", specifier = ">=0.6.0" },
|
| 676 |
{ name = "llama-index-vector-stores-chroma", specifier = ">=0.5.0" },
|
| 677 |
+
{ name = "pymupdf", specifier = ">=1.27.2.3" },
|
| 678 |
{ name = "pypdf", specifier = ">=6.0.0" },
|
| 679 |
{ name = "tokenizers", specifier = ">=0.22.0,<=0.23.0" },
|
| 680 |
{ name = "transformers", specifier = "<5" },
|
|
|
|
| 2572 |
{ url = "https://files.pythonhosted.org/packages/f4/7e/a72dd26f3b0f4f2bf1dd8923c85f7ceb43172af56d63c7383eb62b332364/pygments-2.20.0-py3-none-any.whl", hash = "sha256:81a9e26dd42fd28a23a2d169d86d7ac03b46e2f8b59ed4698fb4785f946d0176", size = 1231151, upload-time = "2026-03-29T13:29:30.038Z" },
|
| 2573 |
]
|
| 2574 |
|
| 2575 |
+
[[package]]
|
| 2576 |
+
name = "pymupdf"
|
| 2577 |
+
version = "1.27.2.3"
|
| 2578 |
+
source = { registry = "https://pypi.org/simple" }
|
| 2579 |
+
sdist = { url = "https://files.pythonhosted.org/packages/22/32/708bedc9dde7b328d45abbc076091769d44f2f24ad151ad92d56a6ec142b/pymupdf-1.27.2.3.tar.gz", hash = "sha256:7a92faa25129e8bbec5e50eeb9214f187665428c31b05c4ef6e36c58c0b1c6d2", size = 85759618, upload-time = "2026-04-24T14:13:14.42Z" }
|
| 2580 |
+
wheels = [
|
| 2581 |
+
{ url = "https://files.pythonhosted.org/packages/dc/09/ddbdfa7ee91fbabd6f63d7d744884cbdfe3e7ff9b8604749fb38bddf5c5d/pymupdf-1.27.2.3-cp310-abi3-macosx_10_9_x86_64.whl", hash = "sha256:fc1bc3cae6e9e150b0dbb0a9221bdfd411d65f0db2fe359eaa22467d7cc2a05f", size = 24002636, upload-time = "2026-04-24T14:09:17.459Z" },
|
| 2582 |
+
{ url = "https://files.pythonhosted.org/packages/01/89/3f8edd6c4f50ca370e2a2f2a3011face36f3760728ffe76dffec91c0fca0/pymupdf-1.27.2.3-cp310-abi3-macosx_11_0_arm64.whl", hash = "sha256:660d93cb6da5bbddf11d3982ae27745dd3a9902d9f24cdb69adab83962294b5a", size = 23278238, upload-time = "2026-04-24T14:09:32.882Z" },
|
| 2583 |
+
{ url = "https://files.pythonhosted.org/packages/c3/26/b7e5a70eb83bd189f8b5df87ec442746b992f2f632662839b288170d357d/pymupdf-1.27.2.3-cp310-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:1dd460a3ae4597a755f00a3bd9771f5ebf1531dc111f6a36bf05dd00a6b84425", size = 24333923, upload-time = "2026-04-24T14:09:47.341Z" },
|
| 2584 |
+
{ url = "https://files.pythonhosted.org/packages/e4/a0/aa1ee2240f29481a04a827c313333b4ecd8a14d6ac3e15d3f41a30574781/pymupdf-1.27.2.3-cp310-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:857842b4888827bd6155a1131341b2822a7ebe9a8c15a975fd7d490d7a64a30c", size = 24963198, upload-time = "2026-04-24T14:10:07.408Z" },
|
| 2585 |
+
{ url = "https://files.pythonhosted.org/packages/69/49/4f742451f980840829fc00ba158bebb25d389c846d8f4f8c65936ee55de8/pymupdf-1.27.2.3-cp310-abi3-musllinux_1_2_x86_64.whl", hash = "sha256:580983849c64a08d08344ca3d1580e87c01f046a8392421797bc850efd72a5b6", size = 25184609, upload-time = "2026-04-24T14:10:22.911Z" },
|
| 2586 |
+
{ url = "https://files.pythonhosted.org/packages/f6/3f/3853d6608f394faf6eec2bd4e8ea9f6a00beea329b071abdb29f4164cc3d/pymupdf-1.27.2.3-cp310-abi3-win32.whl", hash = "sha256:a5c1088a87189891a4946ab314a14b7934ac4c5b6077f7e74ebee956f8906d0e", size = 18019286, upload-time = "2026-04-24T14:10:34.239Z" },
|
| 2587 |
+
{ url = "https://files.pythonhosted.org/packages/44/47/5fb10fe73f96b31253a41647c362ea9e0380920bddf16028414a051247fc/pymupdf-1.27.2.3-cp310-abi3-win_amd64.whl", hash = "sha256:d20f68ef15195e073071dbc4ae7455257c7889af7584e39df490c0a92728526e", size = 19249102, upload-time = "2026-04-24T14:10:46.72Z" },
|
| 2588 |
+
{ url = "https://files.pythonhosted.org/packages/53/a4/b9e91aac82293f9c954654c85581ee8212b5b05efadc534b581141241e6f/pymupdf-1.27.2.3-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:77691604c5d1d0233827139bbcdea61fd57879c84712b8e49b1f45520f7ab9c2", size = 25000393, upload-time = "2026-04-24T14:11:01.669Z" },
|
| 2589 |
+
]
|
| 2590 |
+
|
| 2591 |
[[package]]
|
| 2592 |
name = "pypdf"
|
| 2593 |
version = "6.12.0"
|