Training Data for Phi-3.5-MoE-Instruct

This repository contains the training data used to fine-tune the Phi-3.5-MoE-Instruct model: processed JSONL datasets, raw source files, arXiv paper data, and a ChromaDB vector database.

Data Structure

πŸ“ processed/

Contains processed training data in JSONL format:

  • Agent-specific datasets: Individual training files for different AI agents
  • Enhanced datasets: Revised versions with higher-quality data
  • Realistic datasets: Training data built from real-world scenarios
  • Gradient descent datasets: Specialized training data for optimization tasks
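
To get an overview of what processed/ holds, the sketch below counts the records in each JSONL file. It assumes the files sit directly under training_data/processed/ with a .jsonl extension and hold one JSON object per non-blank line. The helper name count_jsonl_records is hypothetical and not part of this repository.

```python
from pathlib import Path


def count_jsonl_records(directory):
    """Return {file name: number of non-blank lines} for each *.jsonl file.

    Hypothetical helper; assumes one JSON object per non-blank line.
    """
    counts = {}
    for path in sorted(Path(directory).glob("*.jsonl")):
        with path.open("r", encoding="utf-8") as f:
            counts[path.name] = sum(1 for line in f if line.strip())
    return counts


# Example call (path assumed from the layout described above):
# count_jsonl_records("training_data/processed")
```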

πŸ“ raw/

Contains raw training data:

  • AWS infrastructure data: Real-world infrastructure configurations
  • Test results: Comprehensive testing data
  • Requirements: System requirements and specifications
  • External datasets: Third-party training data

πŸ“ arxiv/

Contains arXiv research paper data:

  • Processed papers: Cleaned and formatted research papers
  • Raw papers: Original arXiv data
  • Scientific content: High-quality academic training data

πŸ“ vector_db/

Contains a ChromaDB vector database:

  • ChromaDB files: Complete vector database with embeddings
  • 100,678 chunks: Processed document chunks
  • 123 documents: Source documents
  • 2.1 GB database: Full vector search capability

Usage

Loading Training Data

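import json
from pathlib import Path

# Load processed training data; agent_name stands in for a specific agent's file
train_file = Path("training_data/processed/agent_name_train.jsonl")
with train_file.open("r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        data = json.loads(line)  # one JSON object per line
        # Process training example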
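
When a file is large or may contain a damaged record, it can help to stream the examples and report which line failed to parse. The following is a minimal sketch of that idea, not the loader used to build this dataset; the helper name iter_jsonl is hypothetical.

```python
import json
from pathlib import Path


def iter_jsonl(path):
    """Yield one parsed record per non-blank line of a JSONL file.

    Hypothetical helper; raises ValueError naming the offending line
    if a line is not valid JSON.
    """
    with Path(path).open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}: invalid JSON on line {line_number}"
                ) from exc
            yield record
```

Because the function is a generator, records are parsed one at a time rather than held in memory all at once.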

Using Vector Database

import chromadb
from chromadb.config import Settings

# Open the persisted vector database (telemetry disabled)
client = chromadb.PersistentClient(
    path="training_data/vector_db/chroma",
    settings=Settings(anonymized_telemetry=False)
)

# Fetch the "rag_docs" collection and retrieve the 5 closest chunks
collection = client.get_collection("rag_docs")
results = collection.query(query_texts=["your query"], n_results=5)
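
collection.query returns a dictionary of parallel lists, with one inner list per query text. The sketch below unpacks the hits for a single query. It assumes the default result fields ("ids", "documents", "distances") are present; the helper name top_hits and the sample values are hypothetical.

```python
def top_hits(results, query_index=0):
    """Return [(id, document, distance), ...] for one query in a ChromaDB result."""
    ids = results["ids"][query_index]
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return list(zip(ids, documents, distances))


# Shape of a query result, illustrated with made-up values:
example = {
    "ids": [["chunk-1", "chunk-2"]],
    "documents": [["first chunk text", "second chunk text"]],
    "distances": [[0.12, 0.34]],
}
for chunk_id, text, distance in top_hits(example):
    print(f"{distance:.2f}  {chunk_id}  {text}")
```

A lower distance indicates a closer match to the query text.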

Statistics

  • Total Training Files: 120+ JSONL files
  • Total Raw Files: 100+ source files
  • Vector Database Size: 2.1 GB
  • Total Chunks: 100,678
  • Total Documents: 123
  • Average Query Time: 0.343 seconds
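
The README does not describe how the average query time was measured. One simple way to take a comparable measurement on your own hardware is to time repeated calls, as in this sketch; the helper name average_seconds and the repeat count are assumptions.

```python
import time


def average_seconds(fn, repeats=10):
    """Return the mean wall-clock time of calling fn() `repeats` times."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Usage with the collection from "Using Vector Database" (not run here):
# average_seconds(lambda: collection.query(query_texts=["your query"], n_results=5))
```

Results will vary with disk speed, caching, and the embedding model used for the query text.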

Model Performance

The training data has been used to achieve:

  • 50.7% loss reduction during training
  • Improved reasoning capabilities across multiple domains
  • Enhanced code generation and problem-solving
  • Better multilingual support
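
The README does not say how the loss reduction figure was computed. One common reading is the relative drop between the initial and final training loss, which would be calculated as in the sketch below. The function name and the example values are hypothetical and are not taken from the actual training run.

```python
def loss_reduction_percent(initial_loss, final_loss):
    """Relative drop from initial_loss to final_loss, as a percentage."""
    if initial_loss <= 0:
        raise ValueError("initial_loss must be positive")
    return (initial_loss - final_loss) / initial_loss * 100.0


# Illustrative values only, not taken from the actual training run:
print(loss_reduction_percent(2.0, 1.0))  # 50.0
```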

License

This training data is provided under the same license as the Phi-3.5-MoE model (MIT License).

Citation

If you use this training data, please cite:

@misc{phi35-moe-training-data,
  title={Comprehensive Training Data for Phi-3.5-MoE},
  author={Ian Cruickshank},
  year={2024},
  url={https://huggingface.co/ianshank/phi-35-moe-instruct}
}