# Training Data for Phi-3.5-MoE-Instruct
This repository contains comprehensive training data used for fine-tuning the Phi-3.5-MoE model.
## Data Structure
### πŸ“ processed/
Contains processed training data in JSONL format:
- **Agent-specific datasets**: Individual training files for different AI agents
- **Enhanced datasets**: Improved versions with better quality data
- **Realistic datasets**: Real-world scenario training data
- **Gradient descent datasets**: Specialized training for optimization tasks
### πŸ“ raw/
Contains raw training data:
- **AWS infrastructure data**: Real-world infrastructure configurations
- **Test results**: Comprehensive testing data
- **Requirements**: System requirements and specifications
- **External datasets**: Third-party training data
### πŸ“ arxiv/
Contains arXiv research paper data:
- **Processed papers**: Cleaned and formatted research papers
- **Raw papers**: Original arXiv data
- **Scientific content**: High-quality academic training data
### πŸ“ vector_db/
Contains ChromaDB vector database:
- **ChromaDB files**: Complete vector database with embeddings
- **100,678 chunks**: Processed document chunks
- **123 documents**: Source documents
- **2.1 GB database**: Full vector search capability
## Usage
### Loading Training Data
```python
import json
from pathlib import Path

# Load processed training data, one JSON object per line
path = Path("training_data/processed/agent_name_train.jsonl")
with path.open("r", encoding="utf-8") as f:
    for line in f:
        data = json.loads(line)
        # Process training example
```
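To read every file under `processed/` rather than a single one, a small generator along these lines may help. This is a minimal sketch, not part of the repository: the helper name `iter_jsonl_examples` is hypothetical, and it assumes each non-blank line holds one standalone JSON object. The record schema is not documented here, so no field names are assumed.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl_examples(directory: str) -> Iterator[dict]:
    """Yield one parsed record per non-blank line from every *.jsonl file in directory.

    Hypothetical helper: assumes one JSON object per line.
    """
    for path in sorted(Path(directory).glob("*.jsonl")):
        with path.open("r", encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines between records
                try:
                    yield json.loads(line)
                except json.JSONDecodeError as exc:
                    # Report the file and line so a bad record is easy to find
                    raise ValueError(f"{path}:{line_no}: invalid JSON ({exc})") from exc
```

Because it is a generator, `for example in iter_jsonl_examples("training_data/processed"):` streams records one at a time instead of loading all files into memory.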
### Using Vector Database
```python
import chromadb
from chromadb.config import Settings

# Load vector database
client = chromadb.PersistentClient(
    path="training_data/vector_db/chroma",
    settings=Settings(anonymized_telemetry=False),
)
collection = client.get_collection("rag_docs")
results = collection.query(query_texts=["your query"], n_results=5)
```
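The `results` object returned by `collection.query` is a dictionary of nested lists, with one inner list per query text. The sketch below flattens the hits for the first query into per-hit records. The helper name `flatten_query_results` is hypothetical, and the code assumes the usual `ids`, `documents`, `metadatas`, and `distances` keys, any of which other than `ids` may be missing or `None` depending on what the query included.

```python
from typing import Any


def flatten_query_results(results: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn a query result for the first query text into a list of per-hit records.

    Hypothetical helper: assumes the nested-list layout described above.
    """
    ids = results["ids"][0]
    placeholder = [[None] * len(ids)]  # used when a field was not returned
    documents = (results.get("documents") or placeholder)[0]
    metadatas = (results.get("metadatas") or placeholder)[0]
    distances = (results.get("distances") or placeholder)[0]
    return [
        {"id": hit_id, "document": doc, "metadata": meta, "distance": dist}
        for hit_id, doc, meta, dist in zip(ids, documents, metadatas, distances)
    ]
```

With the query above, `for hit in flatten_query_results(results): print(hit["distance"], hit["document"])` lists each returned chunk next to its distance.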
## Statistics
- **Total Training Files**: 120+ JSONL files
- **Total Raw Files**: 100+ source files
- **Vector Database Size**: 2.1 GB
- **Total Chunks**: 100,678
- **Total Documents**: 123
- **Average Query Time**: 0.343 seconds
## Model Performance
The training data has been used to achieve:
- **50.7% loss reduction** during training
- **Improved reasoning capabilities** across multiple domains
- **Enhanced code generation** and problem-solving
- **Better multilingual support**
## License
This training data is provided under the same license as the Phi-3.5-MoE model (MIT License).
## Citation
If you use this training data, please cite:
```
@misc{phi35-moe-training-data,
  title={Comprehensive Training Data for Phi-3.5-MoE},
  author={Ian Cruickshank},
  year={2024},
  url={https://huggingface.co/ianshank/phi-35-moe-instruct}
}
```