Raw Data Extraction
Most users start with messy HTML, TXT, or PDF dumps.
This module converts those raw files into structured JSON sessions compatible with the distill_rag indexer.
Usage
Put raw files into a directory, e.g.:
raw_corpus/file1.html
raw_corpus/file2.txt
Run extraction:
node data_extraction/walk_and_extract.js raw_corpus extracted_json
The output is a set of JSON files, each containing:
- title
- session_date
- turns[]
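Before indexing, it can help to confirm that every extracted file actually carries these three fields. The following is a minimal sketch and not part of the project: `checkSession` and `checkDir` are hypothetical helper names, and treating `title` as a string is an assumption. Only the field names `title`, `session_date`, and `turns` come from the description above; the shape of each turn is not specified here, so the sketch checks only that `turns` is an array.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Hypothetical helper: returns a list of problems found in one parsed session.
// An empty list means the three documented fields are present.
function checkSession(session) {
  if (session === null || typeof session !== 'object') {
    return ['not a JSON object'];
  }
  const problems = [];
  // Assumption: title is stored as a string.
  if (typeof session.title !== 'string') {
    problems.push('missing or non-string title');
  }
  // Only presence is checked; the date format is not documented here.
  if (!('session_date' in session)) {
    problems.push('missing session_date');
  }
  // The structure of individual turns is unknown, so only the container is checked.
  if (!Array.isArray(session.turns)) {
    problems.push('turns is not an array');
  }
  return problems;
}

// Hypothetical helper: checks every .json file directly inside `dir` and
// returns an object mapping file names to their problems (clean files omitted).
function checkDir(dir) {
  const report = {};
  for (const name of fs.readdirSync(dir)) {
    if (!name.endsWith('.json')) continue;
    let problems;
    try {
      const text = fs.readFileSync(path.join(dir, name), 'utf8');
      problems = checkSession(JSON.parse(text));
    } catch (err) {
      problems = [`unreadable or invalid JSON: ${err.message}`];
    }
    if (problems.length > 0) report[name] = problems;
  }
  return report;
}
```

Calling `checkDir('extracted_json')` returns an empty object when every file passes, so the result can be printed or used to stop a script before the indexing step.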
You can now index these with:
JSON_DIR=extracted_json node indexing/index_distill_chunks.js