# Raw Data Extraction
Most users start with messy HTML, TXT, or PDF dumps.
This module converts those raw files into structured JSON sessions compatible with the distill_rag indexer.
## Usage
1. Put raw files into a directory, e.g.:
   - `raw_corpus/file1.html`
   - `raw_corpus/file2.txt`
2. Run extraction:
   `node data_extraction/walk_and_extract.js raw_corpus extracted_json`
3. The output is a set of JSON files, each with:
   - `title`
   - `session_date`
   - `turns[]`
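
Before indexing, it can help to sanity-check that each extracted file carries the fields listed above. The following is a minimal sketch, not part of distill_rag: `checkSession` and `checkDirectory` are hypothetical helper names, and the type checks (a string `title`, an array `turns`) are assumptions, since the exact schema is not documented here.

```javascript
// Hypothetical sanity check for extracted session files (not part of distill_rag).
// Assumption: title is a string and turns is an array; session_date is only
// checked for presence because its format is not specified in this README.
const fs = require("node:fs");
const path = require("node:path");

// Returns a list of human-readable problems for one parsed session object.
function checkSession(session) {
  if (session === null || typeof session !== "object") {
    return ["not a JSON object"];
  }
  const problems = [];
  if (typeof session.title !== "string") {
    problems.push("title is missing or not a string");
  }
  if (!("session_date" in session)) {
    problems.push("session_date is missing");
  }
  if (!Array.isArray(session.turns)) {
    problems.push("turns is missing or not an array");
  }
  return problems;
}

// Checks every .json file in dir and returns { fileName: [problems] }
// containing only the files that have at least one problem.
function checkDirectory(dir) {
  const report = {};
  for (const name of fs.readdirSync(dir)) {
    if (!name.endsWith(".json")) continue;
    let problems;
    try {
      const raw = fs.readFileSync(path.join(dir, name), "utf8");
      problems = checkSession(JSON.parse(raw));
    } catch (err) {
      problems = [`could not read or parse file: ${err.message}`];
    }
    if (problems.length > 0) report[name] = problems;
  }
  return report;
}
```

Calling `checkDirectory("extracted_json")` returns an empty object when every file passes, so the result can be printed or used to stop a script before the indexing step.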
You can now index these with:
`JSON_DIR=extracted_json node indexing/index_distill_chunks.js`