SmolFactory

Sleeping

App Files Files Community

SmolFactory / docs /MONITORING_INTEGRATION_GUIDE.md

Tonic

adds formatting fix

ebe598e verified 5 months ago

preview code

raw

history blame

7.04 kB

	# 🔧 Improved Monitoring Integration Guide

	## Overview

	The monitoring system has been enhanced to support Hugging Face Datasets for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments.

	## 🚀 Key Improvements

	### 1. HF Datasets Integration
	- ✅ Persistent Storage: Experiments are saved to HF Datasets repositories
	- ✅ Environment Variables: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO`
	- ✅ Fallback Support: Graceful degradation if HF Datasets unavailable
	- ✅ Automatic Backup: Local files as backup

	### 2. Enhanced Monitoring Features
	- 📊 Real-time Metrics: Training metrics logged to both Trackio and HF Datasets
	- 🔧 System Metrics: GPU memory, CPU usage, and system performance
	- 📈 Training Summaries: Comprehensive experiment summaries
	- 🛡️ Error Handling: Robust error logging and recovery

	### 3. Easy Integration
	- 🔌 Automatic Setup: Environment variables automatically detected
	- 📝 Configuration: Simple setup with environment variables
	- 🔄 Backward Compatible: Works with existing Trackio setup

	## 📋 Environment Variables

	\| Variable \| Required \| Default \| Description \|
	\|----------\|----------\|---------\|-------------\|
	\| `HF_TOKEN` \| ✅ Yes \| None \| Your Hugging Face token \|
	\| `TRACKIO_DATASET_REPO` \| ❌ No \| `tonic/trackio-experiments` \| Dataset repository \|
	\| `TRACKIO_URL` \| ❌ No \| None \| Trackio server URL \|
	\| `TRACKIO_TOKEN` \| ❌ No \| None \| Trackio authentication token \|

	## 🛠️ Setup Instructions

	### 1. Get Your HF Token
	```bash
	# Go to https://huggingface.co/settings/tokens
	# Create a new token with "Write" permissions
	# Copy the token
	```

	### 2. Set Environment Variables
	```bash
	# For HF Spaces, add these to your Space settings:
	HF_TOKEN=your_hf_token_here
	TRACKIO_DATASET_REPO=your-username/your-dataset-name

	# For local development:
	export HF_TOKEN=your_hf_token_here
	export TRACKIO_DATASET_REPO=your-username/your-dataset-name
	```

	### 3. Create Dataset Repository
	```bash
	# Run the setup script
	python setup_hf_dataset.py

	# Or manually create a dataset on HF Hub
	# Go to https://huggingface.co/datasets
	# Create a new dataset repository
	```

	### 4. Test Configuration
	```bash
	# Test your setup
	python configure_trackio.py

	# Test dataset access
	python test_hf_datasets.py
	```

	## 🚀 Usage Examples

	### Basic Training with Monitoring
	```bash
	# Train with default monitoring
	python train.py config/train_smollm3_openhermes_fr.py

	# Train with custom dataset repository
	TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py
	```

	### Advanced Training Configuration
	```bash
	# Train with custom experiment name
	python train.py config/train_smollm3_openhermes_fr.py \
	--experiment_name "smollm3_french_tuning_v2" \
	--hf_token your_token_here \
	--dataset_repo your-username/french-experiments
	```

	### Training Scripts with Monitoring
	```bash
	# All training scripts now support monitoring:
	python train.py config/train_smollm3_openhermes_fr_a100_balanced.py
	python train.py config/train_smollm3_openhermes_fr_a100_large.py
	python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py
	python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py
	```

	## 📊 What Gets Monitored

	### Training Metrics
	- Loss values (training and validation)
	- Learning rate
	- Gradient norms
	- Training steps and epochs

	### System Metrics
	- GPU memory usage
	- GPU utilization
	- CPU usage
	- Memory usage

	### Experiment Data
	- Configuration parameters
	- Model checkpoints
	- Evaluation results
	- Training summaries

	### Artifacts
	- Configuration files
	- Training logs
	- Evaluation results
	- Model checkpoints

	## 🔍 Viewing Results

	### 1. Trackio Interface
	- Visit your Trackio Space
	- Navigate to "Experiments" tab
	- View real-time metrics and plots

	### 2. HF Dataset Repository
	- Go to your dataset repository on HF Hub
	- Browse experiment data
	- Download experiment files

	### 3. Local Files
	- Check local backup files
	- Review training logs
	- Examine configuration files

	## 🛠️ Configuration Examples

	### Default Setup
	```python
	# Uses default dataset: tonic/trackio-experiments
	# Requires only HF_TOKEN
	```

	### Personal Dataset
	```bash
	export HF_TOKEN=your_token_here
	export TRACKIO_DATASET_REPO=your-username/trackio-experiments
	```

	### Team Dataset
	```bash
	export HF_TOKEN=your_token_here
	export TRACKIO_DATASET_REPO=your-org/team-experiments
	```

	### Project-Specific Dataset
	```bash
	export HF_TOKEN=your_token_here
	export TRACKIO_DATASET_REPO=your-username/smollm3-experiments
	```

	## 🔧 Troubleshooting

	### Issue: "HF_TOKEN not found"
	```bash
	# Solution: Set your HF token
	export HF_TOKEN=your_token_here
	# Or add to HF Space environment variables
	```

	### Issue: "Failed to load dataset"
	```bash
	# Solutions:
	# 1. Check token has read access
	# 2. Verify dataset repository exists
	# 3. Run setup script: python setup_hf_dataset.py
	```

	### Issue: "Failed to save experiments"
	```bash
	# Solutions:
	# 1. Check token has write permissions
	# 2. Verify dataset repository exists
	# 3. Check network connectivity
	```

	### Issue: "Monitoring not working"
	```bash
	# Solutions:
	# 1. Check environment variables
	# 2. Run configuration test: python configure_trackio.py
	# 3. Check logs for specific errors
	```

	## 📈 Benefits

	### For HF Spaces Deployment
	- ✅ Persistent Storage: Data survives Space restarts
	- ✅ No Local Storage: No dependency on ephemeral storage
	- ✅ Scalable: Works with any dataset size
	- ✅ Secure: Private dataset storage

	### For Experiment Management
	- ✅ Centralized: All experiments in one place
	- ✅ Searchable: Easy to find specific experiments
	- ✅ Versioned: Dataset versioning for experiments
	- ✅ Collaborative: Share experiments with team

	### For Development
	- ✅ Flexible: Easy to switch between datasets
	- ✅ Configurable: Environment-based configuration
	- ✅ Robust: Fallback mechanisms
	- ✅ Debuggable: Comprehensive logging

	## 🎯 Next Steps

	1. Set up your HF token and dataset repository
	2. Test the configuration with `python configure_trackio.py`
	3. Run a training experiment to verify monitoring
	4. Check your HF Dataset repository for experiment data
	5. View results in your Trackio interface

	## 📚 Related Files

	- `monitoring.py` - Enhanced monitoring with HF Datasets support
	- `train.py` - Updated training script with monitoring integration
	- `configure_trackio.py` - Configuration and testing script
	- `setup_hf_dataset.py` - Dataset repository setup
	- `test_hf_datasets.py` - Dataset access testing
	- `ENVIRONMENT_VARIABLES.md` - Environment variable reference
	- `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide

	---

	🎉 Your experiments are now persistently stored and easily accessible!