
Raw Data Extraction

Most users start with messy HTML/TXT/PDF dumps.

This module converts raw files into structured JSON sessions compatible with the distill_rag indexer.

Usage

  1. Put raw files into a directory, e.g.:

    raw_corpus/file1.html
    raw_corpus/file2.txt

  2. Run extraction:

    node data_extraction/walk_and_extract.js raw_corpus extracted_json

  3. The output is a set of JSON files, each containing:

    • title
    • session_date
    • turns[]
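
To confirm that an extracted file has the three fields listed above, a small check along these lines may help. This is a minimal sketch, not part of the module: `checkSession` is a hypothetical helper, the sample values are placeholders, and the shape of each entry in `turns` is not specified here, so the sketch does not inspect it.

```javascript
// Hypothetical helper: checks only the three top-level fields named above.
// The contents of each turn are not documented here, so they are not inspected.
function checkSession(session) {
  if (session === null || typeof session !== "object" || Array.isArray(session)) {
    return ["session is not a JSON object"];
  }
  const problems = [];
  if (!("title" in session)) problems.push("missing title");
  if (!("session_date" in session)) problems.push("missing session_date");
  if (!Array.isArray(session.turns)) problems.push("turns is not an array");
  return problems;
}

// Example with placeholder values (not real extractor output).
const sample = { title: "Example", session_date: "2024-01-01", turns: [] };
console.log(checkSession(sample)); // prints []
```

An empty result means all three fields are present; otherwise the array names what is missing.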

You can now index these with:

JSON_DIR=extracted_json node indexing/index_distill_chunks.js