ianshank committed on
Commit
a82d1c1
·
verified ·
1 Parent(s): c635641

Upload training_data/README.md with huggingface_hub

Files changed (1)
  1. training_data/README.md +94 -0
training_data/README.md ADDED
@@ -0,0 +1,94 @@
+ # Training Data for Phi-3.5-MoE-Instruct
+
+ This repository contains the training data used to fine-tune the Phi-3.5-MoE-Instruct model.
+
+ ## Data Structure
+
+ ### 📁 processed/
+ Contains processed training data in JSONL format:
+ - **Agent-specific datasets**: Individual training files for different AI agents
+ - **Enhanced datasets**: Revised versions with higher-quality data
+ - **Realistic datasets**: Training data based on real-world scenarios
+ - **Gradient descent datasets**: Specialized training data for optimization tasks
+
+ ### 📁 raw/
+ Contains raw training data:
+ - **AWS infrastructure data**: Real-world infrastructure configurations
+ - **Test results**: Comprehensive testing data
+ - **Requirements**: System requirements and specifications
+ - **External datasets**: Third-party training data
+
+ ### 📁 arxiv/
+ Contains arXiv research paper data:
+ - **Processed papers**: Cleaned and formatted research papers
+ - **Raw papers**: Original arXiv data
+ - **Scientific content**: High-quality academic training data
+
+ ### 📁 vector_db/
+ Contains a ChromaDB vector database:
+ - **ChromaDB files**: Complete vector database with embeddings
+ - **100,678 chunks**: Processed document chunks
+ - **123 documents**: Source documents
+ - **2.1 GB database**: Full vector search capability
+
+ ## Usage
+
+ ### Loading Training Data
+ ```python
+ import json
+ from pathlib import Path
+
+ # Load processed training data
+ train_path = Path("training_data/processed/agent_name_train.jsonl")
+ with train_path.open("r", encoding="utf-8") as f:
+     for line in f:
+         data = json.loads(line)
+         # Process training example
+ ```
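The loading loop above can be wrapped in a small reusable generator. The following is a minimal sketch, not part of the repository: the helper name `iter_jsonl` is hypothetical, and it assumes each non-empty line of a file holds exactly one JSON value, as the JSONL format implies.

```python
import json
from pathlib import Path
from typing import Any, Iterator, Union


def iter_jsonl(path: Union[str, Path]) -> Iterator[Any]:
    """Yield one parsed record per non-empty line of a JSONL file.

    Hypothetical helper for illustration; it is not shipped with this repository.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, such as a trailing newline
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line failed instead of a bare decode error
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}: {exc.msg}"
                ) from exc
```

Because it is a generator, records are parsed one at a time, so a large file does not have to fit in memory. The field names inside each record are not documented here, so inspect the first record before relying on any particular key.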
+
+ ### Using Vector Database
+ ```python
+ import chromadb
+ from chromadb.config import Settings
+
+ # Load vector database
+ client = chromadb.PersistentClient(
+     path="training_data/vector_db/chroma",
+     settings=Settings(anonymized_telemetry=False)
+ )
+
+ collection = client.get_collection("rag_docs")
+ results = collection.query(query_texts=["your query"], n_results=5)
+ ```
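`collection.query` returns a dictionary of parallel lists, with one inner list per query text. The sketch below shows one way to flatten that structure into per-hit records. The helper name `flatten_query_results` is hypothetical, and the code assumes the usual ChromaDB result keys (`ids`, `documents`, `distances`), treating any field that was not requested as missing.

```python
from typing import Any, Dict, List, Optional


def _column(results: Dict[str, Any], key: str, query_index: int, length: int) -> List[Optional[Any]]:
    """Return the inner list for one query, or placeholders if the field is absent."""
    values = results.get(key)
    if values is None:
        return [None] * length
    return values[query_index]


def flatten_query_results(results: Dict[str, Any], query_index: int = 0) -> List[Dict[str, Any]]:
    """Turn a ChromaDB-style query result into a list of per-hit dictionaries.

    Hypothetical helper for illustration; it is not part of ChromaDB.
    """
    ids = results["ids"][query_index]
    documents = _column(results, "documents", query_index, len(ids))
    distances = _column(results, "distances", query_index, len(ids))
    return [
        {"id": hit_id, "document": document, "distance": distance}
        for hit_id, document, distance in zip(ids, documents, distances)
    ]
```

With the `results` object from the query above, `flatten_query_results(results)` would give one dictionary per retrieved chunk, which is easier to print or pass into a prompt than the nested lists.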
+
+ ## Statistics
+
+ - **Total Training Files**: 120+ JSONL files
+ - **Total Raw Files**: 100+ source files
+ - **Vector Database Size**: 2.1 GB
+ - **Total Chunks**: 100,678
+ - **Total Documents**: 123
+ - **Average Query Time**: 0.343 seconds
+
+ ## Model Performance
+
+ The training data has been used to achieve:
+ - **50.7% loss reduction** during training
+ - **Improved reasoning capabilities** across multiple domains
+ - **Enhanced code generation** and problem-solving
+ - **Better multilingual support**
+
+ ## License
+
+ This training data is provided under the same license as the Phi-3.5-MoE model (MIT License).
+
+ ## Citation
+
+ If you use this training data, please cite:
+ ```
+ @misc{phi35-moe-training-data,
+   title={Comprehensive Training Data for Phi-3.5-MoE},
+   author={Ian Cruickshank},
+   year={2024},
+   url={https://huggingface.co/ianshank/phi-35-moe-instruct}
+ }
+ ```