
Dataset Preprocessing

The scripts in this directory build the six JSONL files used by QCFuse:

  • LongBench: musique.jsonl, 2wikimqa.jsonl, hotpotqa.jsonl
  • RULER: ruler_mv.jsonl, ruler_mq.jsonl, ruler_vt.jsonl

Environment

pip install transformers langchain-text-splitters sentence-transformers numpy

For RULER preprocessing, follow the official NVIDIA/RULER installation instructions.

LongBench

Put the official LongBench files here:

data/raw_longbench/
  musique.jsonl
  2wikimqa.jsonl
  hotpotqa.jsonl
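
Before running the build, it can help to confirm that the three files are in place and parse as JSONL. The following is a minimal sketch, not part of QCFuse: the helper name `first_record_keys` is hypothetical, and it only reports whatever field names it finds rather than assuming a schema.

```python
import json
from pathlib import Path

# The three LongBench files listed above.
LONGBENCH_FILES = ("musique.jsonl", "2wikimqa.jsonl", "hotpotqa.jsonl")


def first_record_keys(raw_dir="data/raw_longbench"):
    """Map each expected file to the sorted field names of its first record.

    Hypothetical helper: raises if a file is missing or empty.
    """
    keys = {}
    for name in LONGBENCH_FILES:
        path = Path(raw_dir) / name
        if not path.is_file():
            raise FileNotFoundError(f"missing LongBench file: {path}")
        with path.open(encoding="utf-8") as f:
            # Skip blank lines and take the first real record.
            first = next((line for line in f if line.strip()), None)
        if first is None:
            raise ValueError(f"{path} is empty")
        keys[name] = sorted(json.loads(first))
    return keys
```

Calling `first_record_keys()` from the repository root returns one entry per file, which makes a misplaced or truncated download visible before the longer build step starts.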

Build the QCFuse-format files:

python3 data/build_longbench_data.py \
  --input_dir data/raw_longbench \
  --output_dir data/final_data \
  --tokenizer_path models/qwen3-8b \
  --embedding_model models/bge-m3 \
  --chunk_size 512 \
  --chunk_overlap 50 \
  --context_topk 20 \
  --max_samples 200
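
The flag names suggest that each context is split into overlapping chunks and that the chunks most relevant to the question are kept. The sketch below illustrates that idea under stated assumptions and is not the code in build_longbench_data.py: the function names are hypothetical, tokens are any Python list rather than real tokenizer output, the embeddings are passed in as plain vectors instead of being computed by the embedding model, and returning the selected chunks in document order is an assumption.

```python
import numpy as np


def split_with_overlap(tokens, chunk_size=512, chunk_overlap=50):
    """Slide a window of chunk_size tokens, stepping by chunk_size - chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the sequence
    return chunks


def top_k_chunks(query_vec, chunk_vecs, k=20):
    """Indices of the k chunks most cosine-similar to the query.

    Assumption: indices are returned in document order, not ranked order.
    """
    q = np.asarray(query_vec, dtype=float)
    c = np.asarray(chunk_vecs, dtype=float)
    sims = (c @ q) / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-12)
    best = np.argsort(-sims)[:k]
    return sorted(best.tolist())
```

With `chunk_size=512` and `chunk_overlap=50`, consecutive windows start 462 tokens apart, so the last 50 tokens of one chunk reappear at the start of the next.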

RULER

Clone the official RULER repository:

git clone https://github.com/NVIDIA/RULER.git third_party/RULER

Build the full files:

python3 data/build_ruler_data.py \
  --ruler_dir third_party/RULER \
  --raw_dir data/ruler_raw \
  --output_dir data/final_data \
  --tokenizer_path models/qwen3-8b \
  --num_samples 200 \
  --chunk_size 512 \
  --target_num_chunks 20 \
  --ruler_max_seq_length 11264

RULER outputs are trimmed to 20 chunks per sample (--target_num_chunks 20). The script also writes metadata files under data/final_data.
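
The three size flags in the command above are related by simple arithmetic. The sketch below works it out under one assumption that is not confirmed here: that RULER chunks are non-overlapping windows of `--chunk_size` tokens (the RULER command has no overlap flag). The function name `chunk_budget` is hypothetical.

```python
import math


def chunk_budget(chunk_size=512, target_num_chunks=20, ruler_max_seq_length=11264):
    """Relate the RULER generation length to the trimmed chunk count.

    Assumes non-overlapping chunks of chunk_size tokens.
    """
    generated_chunks = math.ceil(ruler_max_seq_length / chunk_size)
    kept_tokens = chunk_size * target_num_chunks
    return {
        "generated_chunks": generated_chunks,  # chunks needed to cover the full sequence
        "kept_chunks": target_num_chunks,
        "kept_tokens": kept_tokens,
        "spare_chunks": generated_chunks - target_num_chunks,
    }
```

With the defaults shown, 11264 tokens is exactly 22 chunks of 512, and trimming to 20 chunks keeps at most 10240 tokens, leaving two chunks of headroom.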

Use data/final_data as --data_dir in the QCFuse runner.
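
Once both builds have finished, a quick record count confirms that all six files exist before they are passed to the runner. This is a hedged sketch with a hypothetical helper name, `count_final_records`; it makes no assumption about the fields inside each record.

```python
import json
from pathlib import Path

# The six output files named at the top of this README.
FINAL_FILES = (
    "musique.jsonl", "2wikimqa.jsonl", "hotpotqa.jsonl",
    "ruler_mv.jsonl", "ruler_mq.jsonl", "ruler_vt.jsonl",
)


def count_final_records(data_dir="data/final_data"):
    """Return {filename: number of JSON records}; raise if a file is missing."""
    counts = {}
    for name in FINAL_FILES:
        path = Path(data_dir) / name
        if not path.is_file():
            raise FileNotFoundError(f"missing output file: {path}")
        n = 0
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    json.loads(line)  # raises on a malformed record
                    n += 1
        counts[name] = n
    return counts
```

Given `--max_samples 200` and `--num_samples 200` in the commands above, each count would be expected not to exceed 200; a larger number suggests stale files from an earlier run.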
