# Raw Data Extraction

Most users start with messy HTML, TXT, or PDF dumps.

This module converts those raw files into structured JSON sessions compatible with the distill_rag indexer.

## Usage

1. Put raw files into a directory, e.g.:

   raw_corpus/file1.html
   raw_corpus/file2.txt

2. Run extraction:

   node data_extraction/walk_and_extract.js raw_corpus extracted_json

3. The output is a set of JSON files, each with these fields:

   - title
   - session_date
   - turns[]

You can now index these with:

   JSON_DIR=extracted_json node indexing/index_distill_chunks.js
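
Before indexing, it can help to confirm that each extracted file has the three fields listed above. The following is a minimal sketch, not part of distill_rag: `checkSession` and `checkDir` are hypothetical helper names, and because the README does not specify the type of `session_date` or the shape of each entry in `turns[]`, the check only tests that `session_date` is present and that `turns` is an array.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical helper (an assumption, not part of distill_rag).
// Returns a list of problems found in one parsed session object;
// an empty list means the three documented fields are present.
function checkSession(session) {
  if (typeof session !== "object" || session === null) {
    return ["session is not an object"];
  }
  const problems = [];
  if (typeof session.title !== "string") {
    problems.push("title missing or not a string");
  }
  // The type of session_date is not documented, so only presence is checked.
  if (!("session_date" in session)) {
    problems.push("session_date missing");
  }
  if (!Array.isArray(session.turns)) {
    problems.push("turns missing or not an array");
  }
  return problems;
}

// Hypothetical helper: checks every .json file directly inside `dir`
// and returns an object mapping file name to its list of problems.
function checkDir(dir) {
  const report = {};
  for (const name of fs.readdirSync(dir)) {
    if (!name.endsWith(".json")) continue;
    let problems;
    try {
      const parsed = JSON.parse(fs.readFileSync(path.join(dir, name), "utf8"));
      problems = checkSession(parsed);
    } catch (err) {
      problems = ["could not read or parse: " + err.message];
    }
    if (problems.length > 0) report[name] = problems;
  }
  return report;
}
```

Calling `checkDir("extracted_json")` after step 2 would return an empty object when every file passes, or the per-file problems otherwise.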