466 GB
2,808 files
Updated 12 days ago
Ctrl+K
| Name | Size | Uploaded | Xet hash |
|---|---|---|---|
| final_data | 8 items | ||
| raw_longbench | 3 items | ||
| ruler_raw | 4 items | ||
| README.md | 1.49 kB xet | 02bc742d | |
| build_longbench_data.py | 8.24 kB xet | 427bc933 | |
| build_ruler_data.py | 22.1 kB xet | 93adeae2 |
Dataset Preprocessing
This directory builds the six JSONL files used by QCFuse:
- LongBench:
musique.jsonl,2wikimqa.jsonl,hotpotqa.jsonl - RULER:
ruler_mv.jsonl,ruler_mq.jsonl,ruler_vt.jsonl
Environment
pip install transformers langchain-text-splitters sentence-transformers numpy
For RULER preprocessing, follow the official NVIDIA/RULER installation instructions.
LongBench
Put the official LongBench files here:
data/raw_longbench/
musique.jsonl
2wikimqa.jsonl
hotpotqa.jsonl
Build the QCFuse-format files:
python3 data/build_longbench_data.py \
--input_dir data/raw_longbench \
--output_dir data/final_data \
--tokenizer_path models/qwen3-8b \
--embedding_model models/bge-m3 \
--chunk_size 512 \
--chunk_overlap 50 \
--context_topk 20 \
--max_samples 200
RULER
Clone the official RULER repository:
git clone https://github.com/NVIDIA/RULER.git third_party/RULER
Build the full files:
python3 data/build_ruler_data.py \
--ruler_dir third_party/RULER \
--raw_dir data/ruler_raw \
--output_dir data/final_data \
--tokenizer_path models/qwen3-8b \
--num_samples 200 \
--chunk_size 512 \
--target_num_chunks 20 \
--ruler_max_seq_length 11264
RULER outputs are trimmed to 20 chunks per sample. The script also writes
metadata files under data/final_data.
Use data/final_data as --data_dir in the QCFuse runner.
- Total size
- 466 GB
- Files
- 2,808
- Last updated
- Jun 16
- Pre-warmed CDN
- US EU US EU