Upload Base/Datasets/rag_mcp_sft/BUILD_REPORT.md with huggingface_hub
Base/Datasets/rag_mcp_sft/BUILD_REPORT.md
ADDED
@@ -0,0 +1,42 @@
# RAG + MCP SFT Build Report

- Retrieved on: 2026-04-03
- Target tokens: 10,000,000
- Realized tokens: 10,000,168
- Train samples: 60,647
- Val samples: 1,237
- Total samples: 61,884
- Average formatted tokens per sample: 161.6
- Max window enforced: 1024 tokens
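
The summary figures can be cross-checked with a few lines of arithmetic. The sketch below only re-derives numbers already listed above; the variable names are illustrative and are not taken from the build script.

```python
# Figures copied from the summary list above.
target_tokens = 10_000_000
realized_tokens = 10_000_168
train_samples = 60_647
val_samples = 1_237
total_samples = 61_884

# Train and validation splits should add up to the reported total.
assert train_samples + val_samples == total_samples

# Derived quantities.
avg_tokens = realized_tokens / total_samples   # average formatted tokens per sample
val_share = val_samples / total_samples        # fraction of samples held out
overshoot = realized_tokens - target_tokens    # tokens above the target

print(f"avg tokens/sample: {avg_tokens:.1f}")  # 161.6
print(f"validation share:  {val_share:.2%}")   # 2.00%
print(f"overshoot:         {overshoot} tokens")  # 168 tokens
```

The realized token count exceeds the target by 168 tokens, and the validation split is roughly 2% of all samples.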

## Breakdown by kind

- checklist: 9,050
- clarification: 6,162
- comparison: 9,368
- description: 12,424
- qna: 15,639
- scenario: 9,241
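
To see how the kinds are balanced, the counts can be turned into shares of the total. This is a minimal sketch using only the numbers listed above; `kind_counts` and `kind_shares` are hypothetical names chosen for illustration.

```python
# Per-kind sample counts copied from the list above.
kind_counts = {
    "checklist": 9_050,
    "clarification": 6_162,
    "comparison": 9_368,
    "description": 12_424,
    "qna": 15_639,
    "scenario": 9_241,
}

kind_total = sum(kind_counts.values())
assert kind_total == 61_884  # matches "Total samples" in the summary

# Share of each kind, largest first.
kind_shares = sorted(
    ((kind, count / kind_total) for kind, count in kind_counts.items()),
    key=lambda item: item[1],
    reverse=True,
)

for kind, share in kind_shares:
    print(f"{kind:<14} {share:6.2%}")
```

The per-kind counts sum exactly to the reported total, with `qna` as the largest kind and `clarification` as the smallest.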

## Breakdown by topic

- Bridge: 6,252
- Bridge+Bridge: 96
- Bridge+MCP: 592
- Bridge+RAG: 447
- MCP: 25,242
- MCP+Bridge: 561
- MCP+MCP: 2,087
- MCP+RAG: 1,824
- RAG: 21,022
- RAG+Bridge: 467
- RAG+MCP: 1,887
- RAG+RAG: 1,407
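
The topic labels mix single topics with `A+B` pairs, so it can help to aggregate them. The sketch below splits single from paired labels and groups counts by the first component of each label. Treating the first component as the "primary" topic is an assumption made here for illustration; the report does not say what the ordering in a pair means.

```python
from collections import Counter

# Per-topic sample counts copied from the list above.
topic_counts = {
    "Bridge": 6_252, "Bridge+Bridge": 96, "Bridge+MCP": 592, "Bridge+RAG": 447,
    "MCP": 25_242, "MCP+Bridge": 561, "MCP+MCP": 2_087, "MCP+RAG": 1_824,
    "RAG": 21_022, "RAG+Bridge": 467, "RAG+MCP": 1_887, "RAG+RAG": 1_407,
}

assert sum(topic_counts.values()) == 61_884  # matches "Total samples"

# Single-topic labels versus paired ("A+B") labels.
single_total = sum(n for label, n in topic_counts.items() if "+" not in label)
paired_total = sum(n for label, n in topic_counts.items() if "+" in label)

# Group by the first component of each label (assumed to be the primary topic).
by_first = Counter()
for label, n in topic_counts.items():
    by_first[label.split("+")[0]] += n

print(single_total, paired_total)  # 52516 9368
print(dict(by_first))              # {'Bridge': 7387, 'MCP': 29714, 'RAG': 24783}
```

The paired-label total (9,368) equals the `comparison` count in the breakdown by kind. That is consistent with paired topics being used for comparison samples, although the report does not state this.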

## Files

- train.json
- val.json
- all.json
- sample_preview.json
- source_manifest.json
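
After downloading the files, the reported sample counts can be verified locally. The sketch below rests on two assumptions that the report does not confirm: each file is a single JSON array of records, and `all.json` is the union of the train and validation splits. The `check_counts` helper and the `EXPECTED` mapping are hypothetical names.

```python
import json
from pathlib import Path

# Expected record counts, taken from the summary section.
# Assumption: all.json contains train + val combined.
EXPECTED = {"train.json": 60_647, "val.json": 1_237, "all.json": 61_884}


def check_counts(data_dir, expected=EXPECTED):
    """Return {filename: (found, expected)} for files whose record count differs."""
    mismatches = {}
    for name, want in expected.items():
        # Assumption: each file is one top-level JSON array.
        with open(Path(data_dir) / name, encoding="utf-8") as fh:
            records = json.load(fh)
        if len(records) != want:
            mismatches[name] = (len(records), want)
    return mismatches
```

Calling `check_counts("path/to/rag_mcp_sft")` returns an empty dict when every file matches its expected count. If the files turn out to use a different layout, such as JSON Lines, the loading step needs to be adapted.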