# Dataset Preprocessing
This directory builds the six JSONL files used by QCFuse:
- LongBench: `musique.jsonl`, `2wikimqa.jsonl`, `hotpotqa.jsonl`
- RULER: `ruler_mv.jsonl`, `ruler_mq.jsonl`, `ruler_vt.jsonl`
## Environment
```bash
pip install transformers langchain-text-splitters sentence-transformers numpy
```
For RULER preprocessing, follow the official
[NVIDIA/RULER](https://github.com/NVIDIA/RULER) installation instructions.
## LongBench
Put the official LongBench files here:
```text
data/raw_longbench/
  musique.jsonl
  2wikimqa.jsonl
  hotpotqa.jsonl
```
Build the QCFuse-format files:
```bash
python3 data/build_longbench_data.py \
  --input_dir data/raw_longbench \
  --output_dir data/final_data \
  --tokenizer_path models/qwen3-8b \
  --embedding_model models/bge-m3 \
  --chunk_size 512 \
  --chunk_overlap 50 \
  --context_topk 20 \
  --max_samples 200
```
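
To make the `--chunk_size 512` and `--chunk_overlap 50` settings concrete, here is a minimal sketch of fixed-size chunking with overlap over a token sequence. It is an illustration only: the function name `chunk_with_overlap` is hypothetical, and whether `build_longbench_data.py` splits exactly this way is an assumption (the real script may rely on `langchain-text-splitters`, which can also split on separators).

```python
def chunk_with_overlap(tokens, chunk_size=512, chunk_overlap=50):
    """Split a token list into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. This is a simplified
    stand-in for the script's splitter, not its actual implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


# Dummy token ids standing in for real tokenizer output.
tokens = list(range(1000))
chunks = chunk_with_overlap(tokens)
print([len(c) for c in chunks])  # [512, 512, 76]
```

With these settings each window advances by 462 tokens, so the last 50 tokens of one chunk reappear at the start of the next.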
## RULER
Clone the official RULER repository:
```bash
git clone https://github.com/NVIDIA/RULER.git third_party/RULER
```
Build the full files:
```bash
python3 data/build_ruler_data.py \
  --ruler_dir third_party/RULER \
  --raw_dir data/ruler_raw \
  --output_dir data/final_data \
  --tokenizer_path models/qwen3-8b \
  --num_samples 200 \
  --chunk_size 512 \
  --target_num_chunks 20 \
  --ruler_max_seq_length 11264
```
RULER outputs are trimmed to 20 chunks per sample (the value passed as
`--target_num_chunks`). The script also writes metadata files under
`data/final_data`.

After both builds finish, pass `data/final_data` as `--data_dir` to the QCFuse runner.
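
Before launching the runner, it can help to confirm that all six files exist and contain valid JSONL. The following is a minimal sketch, not part of the repository: the helper name `count_jsonl_records` is hypothetical, and it assumes nothing about the record schema beyond one JSON value per non-empty line.

```python
import json
from pathlib import Path

# The six output files named at the top of this README.
EXPECTED_FILES = [
    "musique.jsonl", "2wikimqa.jsonl", "hotpotqa.jsonl",
    "ruler_mv.jsonl", "ruler_mq.jsonl", "ruler_vt.jsonl",
]


def count_jsonl_records(data_dir="data/final_data"):
    """Return {filename: record count}, or None for a missing file."""
    counts = {}
    for name in EXPECTED_FILES:
        path = Path(data_dir) / name
        if not path.is_file():
            counts[name] = None
            continue
        n = 0
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises ValueError on a malformed line
                    n += 1
        counts[name] = n
    return counts


if __name__ == "__main__":
    for name, n in count_jsonl_records().items():
        print(f"{name}: {'missing' if n is None else n}")
```

With the commands above (`--max_samples 200` and `--num_samples 200`), each file would be expected to hold at most 200 records, assuming one record per sample.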
