ASTERIZER committed on
Commit 94d6754 · verified · 1 Parent(s): 6462a62

Upload Base/Datasets/rag_mcp_sft/README.md with huggingface_hub

Files changed (1)
  1. Base/Datasets/rag_mcp_sft/README.md +56 -0
Base/Datasets/rag_mcp_sft/README.md ADDED
@@ -0,0 +1,56 @@
+ ---
+ language:
+ - en
+ tags:
+ - instruction-finetuning
+ - rag
+ - mcp
+ - lora
+ task_categories:
+ - text-generation
+ size_categories:
+ - 10K<n<100K
+ ---
+
+ # LUNA RAG + MCP SFT Dataset
+
+ This folder holds a clean English supervised fine-tuning dataset focused on two current topics:
+
+ - RAG: retrieval-augmented generation
+ - MCP: Model Context Protocol
+
+ The build pipeline generates only Alpaca-style SFT records:
+
+ - instruction
+ - input
+ - output
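
As an illustration of the three-field record shape named above, here is a minimal sketch in Python. The field values and the `format_alpaca` helper are hypothetical: the README does not show the prompt template the build script actually uses, so the common Alpaca layout is assumed here.

```python
import json

# Hypothetical example record; the values are illustrative, not taken from the dataset.
record = {
    "instruction": "Explain what retrieval-augmented generation (RAG) is.",
    "input": "",
    "output": "RAG retrieves relevant documents and passes them to the model as context before it generates an answer.",
}


def format_alpaca(rec):
    """Render a record using the common Alpaca prompt layout (an assumption)."""
    if rec["input"]:
        # Records with extra context get a separate Input section.
        return (
            f"### Instruction:\n{rec['instruction']}\n\n"
            f"### Input:\n{rec['input']}\n\n"
            f"### Response:\n{rec['output']}"
        )
    # Records without context skip the Input section entirely.
    return f"### Instruction:\n{rec['instruction']}\n\n### Response:\n{rec['output']}"


if __name__ == "__main__":
    print(json.dumps(record, indent=2))
    print(format_alpaca(record))
```

An empty `input` string is kept in the record rather than omitted, so every record has the same three keys.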
+
+ Design constraints:
+
+ - target size: about 10 million formatted tokens
+ - max formatted context per sample: 1024 tokens
+ - pure English only
+ - exact triple deduplication
+ - grounded in curated topic cards derived from current public documentation notes
+ - balanced across direct QnA, descriptions, comparisons, checklists, scenarios, and misconception corrections
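
Two of the constraints above, exact triple deduplication and the 1024-token cap, can be sketched as a single filtering pass. This is not the build script's code: `count_tokens` and `filter_records` are hypothetical names, and a whitespace split stands in for whatever tokenizer the pipeline really uses.

```python
def count_tokens(text):
    # Assumption: a whitespace split approximates the real tokenizer.
    return len(text.split())


def filter_records(records, max_tokens=1024):
    """Drop exact (instruction, input, output) duplicates and over-long samples."""
    seen = set()
    kept = []
    for rec in records:
        # The dedup key is the full triple, so two records that differ
        # in any one field are both kept.
        key = (rec["instruction"], rec["input"], rec["output"])
        if key in seen:
            continue
        seen.add(key)
        # Measure the concatenated fields against the per-sample limit.
        if count_tokens("\n".join(key)) > max_tokens:
            continue
        kept.append(rec)
    return kept


if __name__ == "__main__":
    sample = [
        {"instruction": "Define MCP.", "input": "", "output": "Model Context Protocol."},
        {"instruction": "Define MCP.", "input": "", "output": "Model Context Protocol."},
    ]
    print(len(filter_records(sample)))
```

Matching on the whole triple only removes byte-identical records; near-duplicates with different wording would need a separate fuzzy check.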
+
+ Build the dataset:
+
+ ```powershell
+ python Base\Datasets\rag_mcp_sft\build_rag_mcp_sft_dataset.py --target-tokens 10000000
+ ```
+
+ Push the dataset to Hugging Face:
+
+ ```powershell
+ $env:HF_TOKEN = "<your_token>"
+ python Base\Datasets\rag_mcp_sft\push_to_hf.py --repo-id ASTERIZER/LUNA-RAG-MCP-SFT-10M
+ ```
+
+ Primary outputs:
+
+ - train.json
+ - val.json
+ - all.json
+ - BUILD_REPORT.md
+ - source_manifest.json
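
A quick way to sanity-check the JSON outputs after a build might look like the sketch below. It assumes each split file is a single JSON array of records with the three Alpaca fields; the README does not state the on-disk layout, and `load_split` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"instruction", "input", "output"}


def load_split(path):
    """Load one split file and check that every record has the three fields."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: the file holds one top-level JSON array of objects.
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a JSON array of records")
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - set(rec)
        if missing:
            raise ValueError(f"{path}: record {i} is missing {sorted(missing)}")
    return records


if __name__ == "__main__":
    for name in ("train.json", "val.json"):
        if Path(name).exists():
            print(name, len(load_split(name)))
```

If the files turn out to be JSON Lines rather than a single array, the loader would need to parse one object per line instead.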