# Training Data for Phi-3.5-MoE-Instruct
This repository contains comprehensive training data used for fine-tuning the Phi-3.5-MoE model.
## Data Structure
### processed/
Contains processed training data in JSONL format:
- **Agent-specific datasets**: Individual training files for different AI agents
- **Enhanced datasets**: Improved versions with better-quality data
- **Realistic datasets**: Training data based on real-world scenarios
- **Gradient descent datasets**: Specialized training data for optimization tasks
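
A minimal sketch for discovering the JSONL files described above. It assumes the files use a `.jsonl` extension and sit somewhere under a `processed/` directory; the helper name `list_jsonl_files` is hypothetical and not part of this repository.

```python
from pathlib import Path


def list_jsonl_files(processed_dir):
    """Return every .jsonl file under processed_dir, sorted by path.

    Hypothetical helper: the directory layout is an assumption.
    """
    root = Path(processed_dir)
    if not root.is_dir():
        # A missing directory yields an empty list rather than an error
        return []
    # rglob also finds files in nested subdirectories
    return sorted(root.rglob("*.jsonl"))
```

Calling `list_jsonl_files("training_data/processed")` would then give the candidate files to load, assuming the repository is checked out at that relative path.
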
### raw/
Contains raw training data:
- **AWS infrastructure data**: Real-world infrastructure configurations
- **Test results**: Comprehensive testing data
- **Requirements**: System requirements and specifications
- **External datasets**: Third-party training data
### arxiv/
Contains arXiv research paper data:
- **Processed papers**: Cleaned and formatted research papers
- **Raw papers**: Original arXiv data
- **Scientific content**: High-quality academic training data
### vector_db/
Contains the ChromaDB vector database:
- **ChromaDB files**: Complete vector database with embeddings
- **100,678 chunks**: Processed document chunks
- **123 documents**: Source documents
- **2.1 GB database**: Full vector search capability
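
One way to confirm that a local copy of the database matches the chunk total listed above is to compare it with the collection's own count. This sketch assumes the chunks live in a single collection exposing a `count()` method (as ChromaDB collections do); the helper name `check_chunk_count` is hypothetical.

```python
def check_chunk_count(collection, expected=100_678):
    """Compare collection.count() with the documented chunk total.

    Returns (matches, actual). `collection` can be any object with a
    count() method, for example a ChromaDB collection.
    """
    actual = collection.count()
    return actual == expected, actual
```

A mismatch would suggest an incomplete download, or that the chunks are spread over more than one collection.
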
## Usage
### Loading Training Data
```python
import json
from pathlib import Path

# Example path; substitute the dataset file you want to load
train_file = Path("training_data/processed/agent_name_train.jsonl")

# JSONL stores one JSON object per line
with train_file.open("r", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        # Process training example
```
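
For larger files it can help to wrap the loop in a generator that skips blank lines and reports which line failed to parse. The following is a sketch under those assumptions, not the loader used to build this dataset; `iter_jsonl` is a hypothetical name.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-empty line of a JSONL file.

    Raises ValueError naming the file and line number on invalid JSON.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_no}: invalid JSON ({exc.msg})"
                ) from exc
```

Because it is a generator, records are read lazily, so a file does not need to fit in memory.
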
### Using the Vector Database
```python
import chromadb
from chromadb.config import Settings

# Open the persisted vector database with telemetry disabled
client = chromadb.PersistentClient(
    path="training_data/vector_db/chroma",
    settings=Settings(anonymized_telemetry=False),
)

# Fetch the collection and retrieve the 5 nearest chunks for a text query
collection = client.get_collection("rag_docs")
results = collection.query(query_texts=["your query"], n_results=5)
```
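
`collection.query` returns a dictionary of parallel lists, with one inner list per query text. The sketch below flattens the hits for a single query into one record per chunk. It assumes the default result fields (`ids`, `documents`, `distances`) and treats any field that was not requested as `None`; `flatten_results` is a hypothetical helper.

```python
def flatten_results(results, query_index=0):
    """Turn one query's hits into a list of {id, document, distance} dicts."""
    ids = results["ids"][query_index]
    docs = results.get("documents")
    dists = results.get("distances")
    # Fields excluded from the query come back as None; pad them instead
    docs = docs[query_index] if docs else [None] * len(ids)
    dists = dists[query_index] if dists else [None] * len(ids)
    return [
        {"id": i, "document": d, "distance": s}
        for i, d, s in zip(ids, docs, dists)
    ]
```

With the query above, `flatten_results(results)` would give up to five records ordered from nearest to farthest.
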
## Statistics
- **Total Training Files**: 120+ JSONL files
- **Total Raw Files**: 100+ source files
- **Vector Database Size**: 2.1 GB
- **Total Chunks**: 100,678
- **Total Documents**: 123
- **Average Query Time**: 0.343 seconds
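
The source does not say how the average query time was measured. One plausible approach, sketched here as an assumption, is to time repeated calls with `time.perf_counter` and take the mean. `average_query_time` and its parameters are hypothetical.

```python
import time
from statistics import mean


def average_query_time(run_query, queries, repeats=3):
    """Return the mean wall-clock seconds per call of run_query(query)."""
    timings = []
    for query in queries:
        for _ in range(repeats):
            start = time.perf_counter()
            run_query(query)
            timings.append(time.perf_counter() - start)
    if not timings:
        raise ValueError("at least one query and one repeat are required")
    return mean(timings)
```

For example, `run_query` could be `lambda q: collection.query(query_texts=[q], n_results=5)`. Results will vary with hardware, cache state, and the embedding model in use.
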
## Model Performance
Fine-tuning with this training data achieved:
- **50.7% loss reduction** during training
- **Improved reasoning capabilities** across multiple domains
- **Enhanced code generation** and problem-solving
- **Better multilingual support**
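
The source does not define how the loss reduction is computed. A common reading, assumed here, is the relative drop from the initial to the final training loss. The helper name is hypothetical, and the loss values in the usage note are made up purely to illustrate the arithmetic.

```python
def loss_reduction_pct(initial_loss, final_loss):
    """Relative loss reduction in percent: 100 * (initial - final) / initial."""
    if initial_loss <= 0:
        raise ValueError("initial_loss must be positive")
    return 100.0 * (initial_loss - final_loss) / initial_loss
```

Under this definition, a hypothetical run going from a loss of 2.0 to 0.986 would correspond to a reduction of about 50.7%. These numbers are illustrative and are not taken from the actual training logs.
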
## License
This training data is provided under the same license as the Phi-3.5-MoE model (MIT License).
## Citation
If you use this training data, please cite:
```bibtex
@misc{phi35-moe-training-data,
  title={Comprehensive Training Data for Phi-3.5-MoE},
  author={Ian Cruickshank},
  year={2024},
  url={https://huggingface.co/ianshank/phi-35-moe-instruct}
}
```