Training Data for Phi-3.5-MoE-Instruct
This repository contains the training data used to fine-tune the Phi-3.5-MoE model.
Data Structure
processed/
Contains processed training data in JSONL format:
- Agent-specific datasets: Individual training files for different AI agents
- Enhanced datasets: Improved versions with better quality data
- Realistic datasets: Real-world scenario training data
- Gradient descent datasets: Specialized training for optimization tasks
raw/
Contains raw training data:
- AWS infrastructure data: Real-world infrastructure configurations
- Test results: Comprehensive testing data
- Requirements: System requirements and specifications
- External datasets: Third-party training data
arxiv/
Contains arXiv research paper data:
- Processed papers: Cleaned and formatted research papers
- Raw papers: Original arXiv data
- Scientific content: High-quality academic training data
vector_db/
Contains the ChromaDB vector database:
- ChromaDB files: Complete vector database with embeddings
- 100,678 chunks: Processed document chunks
- 123 documents: Source documents
- 2.1 GB database: Full vector search capability
Usage
Loading Training Data
import json
from pathlib import Path

# Load processed training data
train_path = Path("training_data/processed/agent_name_train.jsonl")
with train_path.open("r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        data = json.loads(line)
        # Process training example
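To read every processed file rather than a single one, the loop above can be wrapped in a small generator. This is a minimal sketch, not part of the repository: the helper name `iter_jsonl_records` is hypothetical, the `training_data/processed` directory is taken from the example path, and the sketch assumes each non-blank line is one standalone JSON object.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, Tuple


def iter_jsonl_records(data_dir: str, pattern: str = "*.jsonl") -> Iterator[Tuple[str, dict]]:
    """Yield (file name, parsed record) pairs for every matching file in data_dir."""
    for path in sorted(Path(data_dir).glob(pattern)):
        with path.open("r", encoding="utf-8") as f:
            for line_number, line in enumerate(f, start=1):
                if not line.strip():
                    continue  # tolerate blank lines
                try:
                    record = json.loads(line)
                except json.JSONDecodeError as exc:
                    # Report which file and line failed to parse.
                    raise ValueError(f"{path.name}:{line_number}: invalid JSON") from exc
                yield path.name, record


if __name__ == "__main__":
    # Assumed location, taken from the example path above.
    counts: Dict[str, int] = {}
    for name, _record in iter_jsonl_records("training_data/processed"):
        counts[name] = counts.get(name, 0) + 1
    for name, count in counts.items():
        print(f"{name}: {count} examples")
```

Yielding records one at a time keeps memory use flat regardless of how many files the directory holds.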
Using Vector Database
import chromadb
from chromadb.config import Settings

# Load vector database
client = chromadb.PersistentClient(
    path="training_data/vector_db/chroma",
    settings=Settings(anonymized_telemetry=False)
)
collection = client.get_collection("rag_docs")
results = collection.query(query_texts=["your query"], n_results=5)
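ChromaDB returns query results as a dictionary of parallel lists, with one inner list per query text. The sketch below shows one way to flatten the matches for a single query into records. The helper name `top_hits` is hypothetical and the sample data is made up for illustration. The sketch assumes the default `include` settings, under which `documents` and `distances` are returned, and falls back to `None` for fields that are absent.

```python
from typing import Any, Dict, List


def _column(results: Dict[str, Any], key: str, query_index: int, size: int) -> List[Any]:
    """Return the values for one query, or a list of None if the field was excluded."""
    values = results.get(key)
    return values[query_index] if values is not None else [None] * size


def top_hits(results: Dict[str, Any], query_index: int = 0) -> List[Dict[str, Any]]:
    """Flatten one query's matches into a list of {id, document, distance} dicts."""
    ids = results["ids"][query_index]
    documents = _column(results, "documents", query_index, len(ids))
    distances = _column(results, "distances", query_index, len(ids))
    return [
        {"id": doc_id, "document": document, "distance": distance}
        for doc_id, document, distance in zip(ids, documents, distances)
    ]


if __name__ == "__main__":
    # Hand-written stand-in with the same shape as a collection.query() result.
    sample = {
        "ids": [["chunk-1", "chunk-2"]],
        "documents": [["first chunk text", "second chunk text"]],
        "distances": [[0.12, 0.35]],
    }
    for hit in top_hits(sample):
        print(hit["id"], hit["distance"], hit["document"])
```

With the collection loaded as above, `top_hits(results)` would return the five matches for `"your query"`.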
Statistics
- Total Training Files: 120+ JSONL files
- Total Raw Files: 100+ source files
- Vector Database Size: 2.1 GB
- Total Chunks: 100,678
- Total Documents: 123
- Average Query Time: 0.343 seconds
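The README does not say how the average query time was measured. As a rough way to check the figure on your own hardware, the sketch below times a query callable over a list of query strings and returns the mean. The helper name `average_query_seconds` is hypothetical, and the result will vary with hardware, disk cache state, and the queries chosen.

```python
import time
from statistics import mean
from typing import Callable, List, Sequence


def average_query_seconds(run_query: Callable[[str], object], queries: Sequence[str]) -> float:
    """Return the mean wall-clock time, in seconds, of run_query over queries."""
    if not queries:
        raise ValueError("queries must not be empty")
    timings: List[float] = []
    for text in queries:
        start = time.perf_counter()
        run_query(text)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Example, assuming `collection` was loaded as shown under "Using Vector Database":
# average_query_seconds(
#     lambda q: collection.query(query_texts=[q], n_results=5),
#     ["first test query", "second test query"],
# )
```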
Model Performance
The training data has been used to achieve:
- 50.7% loss reduction during training
- Improved reasoning capabilities across multiple domains
- Enhanced code generation and problem-solving
- Better multilingual support
License
This training data is provided under the same license as the Phi-3.5-MoE model (MIT License).
Citation
If you use this training data, please cite:
@misc{phi35-moe-training-data,
  title={Comprehensive Training Data for Phi-3.5-MoE},
  author={Ian Cruickshank},
  year={2024},
  url={https://huggingface.co/ianshank/phi-35-moe-instruct}
}