ASTERIZER/LUNA-Training

ASTERIZER committed on Apr 2

Commit 94d6754 (verified)

1 Parent(s): 6462a62

Upload Base/Datasets/rag_mcp_sft/README.md with huggingface_hub

Files changed (1)

Base/Datasets/rag_mcp_sft/README.md ADDED

+---
+language:
+- en
+tags:
+- instruction-finetuning
+- rag
+- mcp
+- lora
+task_categories:
+- text-generation
+size_categories:
+- 10K<n<100K
+---
+# LUNA RAG + MCP SFT Dataset
+This folder holds a clean English supervised fine-tuning dataset focused on two current topics:
+- RAG: retrieval-augmented generation
+- MCP: Model Context Protocol
+The build pipeline generates only Alpaca-style SFT records, each with three fields:
+- instruction
+- input
+- output
+Design constraints:
+- target size: about 10 million formatted tokens
+- max formatted context per sample: 1024 tokens
+- English only
+- exact deduplication on the (instruction, input, output) triple
+- grounded in curated topic cards derived from current public documentation notes
+- balanced across direct Q&A, descriptions, comparisons, checklists, scenarios, and misconception corrections
+Build the dataset:
+```powershell
+python Base\Datasets\rag_mcp_sft\build_rag_mcp_sft_dataset.py --target-tokens 10000000
+```
+Push the dataset to Hugging Face:
+```powershell
+$env:HF_TOKEN = "<your_token>"
+python Base\Datasets\rag_mcp_sft\push_to_hf.py --repo-id ASTERIZER/LUNA-RAG-MCP-SFT-10M
+```
+Primary outputs:
+- train.json
+- val.json
+- all.json
+- BUILD_REPORT.md
+- source_manifest.json
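
The "exact triple deduplication" constraint listed in the README can be illustrated with a minimal sketch. It is not taken from `build_rag_mcp_sft_dataset.py`: the function name `dedupe_exact_triples` and the sample records are hypothetical. It assumes each record is a dictionary with the three Alpaca-style string fields, and that "exact" means byte-for-byte equality of all three.

```python
from typing import Iterable

# Assumed record shape: the three Alpaca-style fields named in the README.
Record = dict[str, str]


def dedupe_exact_triples(records: Iterable[Record]) -> list[Record]:
    """Keep the first occurrence of each exact (instruction, input, output) triple."""
    seen: set[tuple[str, str, str]] = set()
    unique: list[Record] = []
    for rec in records:
        key = (rec["instruction"], rec["input"], rec["output"])
        if key in seen:
            continue  # an identical triple was already kept
        seen.add(key)
        unique.append(rec)
    return unique


# Hypothetical sample records, for illustration only.
samples = [
    {"instruction": "Define RAG.", "input": "", "output": "Retrieval-augmented generation."},
    {"instruction": "Define RAG.", "input": "", "output": "Retrieval-augmented generation."},
    {"instruction": "Define MCP.", "input": "", "output": "Model Context Protocol."},
]
unique_samples = dedupe_exact_triples(samples)
```

Keying on the full triple means two records that share an instruction but differ in input or output both survive. Near-duplicates (for example, differing only in whitespace) are also kept under this assumption.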