# Training Data for Phi-3.5-MoE-Instruct

This repository contains comprehensive training data used for fine-tuning the Phi-3.5-MoE model.

## Data Structure

### 📁 processed/

Contains processed training data in JSONL format:

- **Agent-specific datasets**: Individual training files for different AI agents
- **Enhanced datasets**: Improved versions with better quality data
- **Realistic datasets**: Real-world scenario training data
- **Gradient descent datasets**: Specialized training for optimization tasks

### 📁 raw/

Contains raw training data:

- **AWS infrastructure data**: Real-world infrastructure configurations
- **Test results**: Comprehensive testing data
- **Requirements**: System requirements and specifications
- **External datasets**: Third-party training data

### 📁 arxiv/

Contains arXiv research paper data:

- **Processed papers**: Cleaned and formatted research papers
- **Raw papers**: Original arXiv data
- **Scientific content**: High-quality academic training data

### 📁 vector_db/

Contains ChromaDB vector database:

- **ChromaDB files**: Complete vector database with embeddings
- **100,678 chunks**: Processed document chunks
- **123 documents**: Source documents
- **2.1 GB database**: Full vector search capability

## Usage

### Loading Training Data

```python
import json
from pathlib import Path

# Load processed training data
with open("training_data/processed/agent_name_train.jsonl", "r") as f:
    for line in f:
        data = json.loads(line)
        # Process training example
```

### Using Vector Database

```python
import chromadb
from chromadb.config import Settings

# Load vector database
client = chromadb.PersistentClient(
    path="training_data/vector_db/chroma",
    settings=Settings(anonymized_telemetry=False)
)

collection = client.get_collection("rag_docs")
results = collection.query(query_texts=["your query"], n_results=5)
```

## Statistics

- **Total Training Files**: 120+ JSONL files
- **Total Raw Files**: 100+ source files
- **Vector Database Size**: 2.1 GB
- **Total Chunks**: 100,678
- **Total Documents**: 123
- **Average Query Time**: 0.343 seconds

## Model Performance

The training data has been used to achieve:

- **50.7% loss reduction** during training
- **Improved reasoning capabilities** across multiple domains
- **Enhanced code generation** and problem-solving
- **Better multilingual support**

## License

This training data is provided under the same license as the Phi-3.5-MoE model (MIT License).

## Citation

If you use this training data, please cite:

```
@misc{phi35-moe-training-data,
  title={Comprehensive Training Data for Phi-3.5-MoE},
  author={Ian Cruickshank},
  year={2024},
  url={https://huggingface.co/ianshank/phi-35-moe-instruct}
}
```
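
## Example: Iterating Over All Processed Files

The loading snippet in the Usage section reads a single JSONL file. Since `processed/` holds many such files, the sketch below shows one way to walk the whole directory and count the examples in each file. It assumes the files sit directly under `training_data/processed/` with a `.jsonl` extension and that each non-blank line is one JSON value. The helper names `iter_jsonl_examples` and `count_examples` are hypothetical and not part of this repository, and the record schema is treated as opaque because it is not documented here.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Iterator, Tuple


def iter_jsonl_examples(directory: str) -> Iterator[Tuple[Path, Any]]:
    """Yield (file_path, record) for every non-blank line of every .jsonl file.

    Hypothetical helper: only files directly inside `directory` are scanned.
    """
    for path in sorted(Path(directory).glob("*.jsonl")):
        with path.open("r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                yield path, json.loads(line)


def count_examples(directory: str) -> Counter:
    """Return a Counter mapping each file name to its number of records."""
    counts: Counter = Counter()
    for path, _record in iter_jsonl_examples(directory):
        counts[path.name] += 1
    return counts


if __name__ == "__main__":
    # Assumed location, taken from the Usage section above.
    for name, n in sorted(count_examples("training_data/processed").items()):
        print(f"{name}: {n} examples")
```

Using a generator keeps memory use flat regardless of file size, which may matter if some files are large. A malformed line raises `json.JSONDecodeError`; wrap the `json.loads` call in a `try` block if skipping bad lines is preferable.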