Spaces:
Sleeping
Sleeping
| # π§ Improved Monitoring Integration Guide | |
| ## Overview | |
| The monitoring system has been enhanced to support **Hugging Face Datasets** for persistent experiment storage, making it ideal for deployment on Hugging Face Spaces and other cloud environments. | |
| ## π Key Improvements | |
| ### 1. **HF Datasets Integration** | |
| - β **Persistent Storage**: Experiments are saved to HF Datasets repositories | |
| - β **Environment Variables**: Configurable via `HF_TOKEN` and `TRACKIO_DATASET_REPO` | |
| - β **Fallback Support**: Graceful degradation if HF Datasets unavailable | |
| - β **Automatic Backup**: Local files as backup | |
| ### 2. **Enhanced Monitoring Features** | |
| - π **Real-time Metrics**: Training metrics logged to both Trackio and HF Datasets | |
| - π§ **System Metrics**: GPU memory, CPU usage, and system performance | |
| - π **Training Summaries**: Comprehensive experiment summaries | |
| - π‘οΈ **Error Handling**: Robust error logging and recovery | |
| ### 3. **Easy Integration** | |
| - π **Automatic Setup**: Environment variables automatically detected | |
| - π **Configuration**: Simple setup with environment variables | |
| - π **Backward Compatible**: Works with existing Trackio setup | |
| ## π Environment Variables | |
| | Variable | Required | Default | Description | | |
| |----------|----------|---------|-------------| | |
| | `HF_TOKEN` | β Yes | None | Your Hugging Face token | | |
| | `TRACKIO_DATASET_REPO` | β No | `tonic/trackio-experiments` | Dataset repository | | |
| | `TRACKIO_URL` | β No | None | Trackio server URL | | |
| | `TRACKIO_TOKEN` | β No | None | Trackio authentication token | | |
| ## π οΈ Setup Instructions | |
| ### 1. **Get Your HF Token** | |
| ```bash | |
| # Go to https://huggingface.co/settings/tokens | |
| # Create a new token with "Write" permissions | |
| # Copy the token | |
| ``` | |
| ### 2. **Set Environment Variables** | |
| ```bash | |
| # For HF Spaces, add these to your Space settings: | |
| HF_TOKEN=your_hf_token_here | |
| TRACKIO_DATASET_REPO=your-username/your-dataset-name | |
| # For local development: | |
| export HF_TOKEN=your_hf_token_here | |
| export TRACKIO_DATASET_REPO=your-username/your-dataset-name | |
| ``` | |
| ### 3. **Create Dataset Repository** | |
| ```bash | |
| # Run the setup script | |
| python setup_hf_dataset.py | |
| # Or manually create a dataset on HF Hub | |
| # Go to https://huggingface.co/datasets | |
| # Create a new dataset repository | |
| ``` | |
| ### 4. **Test Configuration** | |
| ```bash | |
| # Test your setup | |
| python configure_trackio.py | |
| # Test dataset access | |
| python test_hf_datasets.py | |
| ``` | |
| ## π Usage Examples | |
| ### **Basic Training with Monitoring** | |
| ```bash | |
| # Train with default monitoring | |
| python train.py config/train_smollm3_openhermes_fr.py | |
| # Train with custom dataset repository | |
| TRACKIO_DATASET_REPO=your-username/smollm3-experiments python train.py config/train_smollm3_openhermes_fr.py | |
| ``` | |
| ### **Advanced Training Configuration** | |
| ```bash | |
| # Train with custom experiment name | |
| python train.py config/train_smollm3_openhermes_fr.py \ | |
| --experiment_name "smollm3_french_tuning_v2" \ | |
| --hf_token your_token_here \ | |
| --dataset_repo your-username/french-experiments | |
| ``` | |
| ### **Training Scripts with Monitoring** | |
| ```bash | |
| # All training scripts now support monitoring: | |
| python train.py config/train_smollm3_openhermes_fr_a100_balanced.py | |
| python train.py config/train_smollm3_openhermes_fr_a100_large.py | |
| python train.py config/train_smollm3_openhermes_fr_a100_max_performance.py | |
| python train.py config/train_smollm3_openhermes_fr_a100_multiple_passes.py | |
| ``` | |
| ## π What Gets Monitored | |
| ### **Training Metrics** | |
| - Loss values (training and validation) | |
| - Learning rate | |
| - Gradient norms | |
| - Training steps and epochs | |
| ### **System Metrics** | |
| - GPU memory usage | |
| - GPU utilization | |
| - CPU usage | |
| - Memory usage | |
| ### **Experiment Data** | |
| - Configuration parameters | |
| - Model checkpoints | |
| - Evaluation results | |
| - Training summaries | |
| ### **Artifacts** | |
| - Configuration files | |
| - Training logs | |
| - Evaluation results | |
| - Model checkpoints | |
| ## π Viewing Results | |
| ### **1. Trackio Interface** | |
| - Visit your Trackio Space | |
| - Navigate to "Experiments" tab | |
| - View real-time metrics and plots | |
| ### **2. HF Dataset Repository** | |
| - Go to your dataset repository on HF Hub | |
| - Browse experiment data | |
| - Download experiment files | |
| ### **3. Local Files** | |
| - Check local backup files | |
| - Review training logs | |
| - Examine configuration files | |
| ## π οΈ Configuration Examples | |
| ### **Default Setup** | |
| ```python | |
| # Uses default dataset: tonic/trackio-experiments | |
| # Requires only HF_TOKEN | |
| ``` | |
| ### **Personal Dataset** | |
| ```bash | |
| export HF_TOKEN=your_token_here | |
| export TRACKIO_DATASET_REPO=your-username/trackio-experiments | |
| ``` | |
| ### **Team Dataset** | |
| ```bash | |
| export HF_TOKEN=your_token_here | |
| export TRACKIO_DATASET_REPO=your-org/team-experiments | |
| ``` | |
| ### **Project-Specific Dataset** | |
| ```bash | |
| export HF_TOKEN=your_token_here | |
| export TRACKIO_DATASET_REPO=your-username/smollm3-experiments | |
| ``` | |
| ## π§ Troubleshooting | |
| ### **Issue: "HF_TOKEN not found"** | |
| ```bash | |
| # Solution: Set your HF token | |
| export HF_TOKEN=your_token_here | |
| # Or add to HF Space environment variables | |
| ``` | |
| ### **Issue: "Failed to load dataset"** | |
| ```bash | |
| # Solutions: | |
| # 1. Check token has read access | |
| # 2. Verify dataset repository exists | |
| # 3. Run setup script: python setup_hf_dataset.py | |
| ``` | |
| ### **Issue: "Failed to save experiments"** | |
| ```bash | |
| # Solutions: | |
| # 1. Check token has write permissions | |
| # 2. Verify dataset repository exists | |
| # 3. Check network connectivity | |
| ``` | |
| ### **Issue: "Monitoring not working"** | |
| ```bash | |
| # Solutions: | |
| # 1. Check environment variables | |
| # 2. Run configuration test: python configure_trackio.py | |
| # 3. Check logs for specific errors | |
| ``` | |
| ## π Benefits | |
| ### **For HF Spaces Deployment** | |
| - β **Persistent Storage**: Data survives Space restarts | |
| - β **No Local Storage**: No dependency on ephemeral storage | |
| - β **Scalable**: Works with any dataset size | |
| - β **Secure**: Private dataset storage | |
| ### **For Experiment Management** | |
| - β **Centralized**: All experiments in one place | |
| - β **Searchable**: Easy to find specific experiments | |
| - β **Versioned**: Dataset versioning for experiments | |
| - β **Collaborative**: Share experiments with team | |
| ### **For Development** | |
| - β **Flexible**: Easy to switch between datasets | |
| - β **Configurable**: Environment-based configuration | |
| - β **Robust**: Fallback mechanisms | |
| - β **Debuggable**: Comprehensive logging | |
| ## π― Next Steps | |
| 1. **Set up your HF token and dataset repository** | |
| 2. **Test the configuration with `python configure_trackio.py`** | |
| 3. **Run a training experiment to verify monitoring** | |
| 4. **Check your HF Dataset repository for experiment data** | |
| 5. **View results in your Trackio interface** | |
| ## π Related Files | |
| - `monitoring.py` - Enhanced monitoring with HF Datasets support | |
| - `train.py` - Updated training script with monitoring integration | |
| - `configure_trackio.py` - Configuration and testing script | |
| - `setup_hf_dataset.py` - Dataset repository setup | |
| - `test_hf_datasets.py` - Dataset access testing | |
| - `ENVIRONMENT_VARIABLES.md` - Environment variable reference | |
| - `HF_DATASETS_GUIDE.md` - Detailed HF Datasets guide | |
| --- | |
| **π Your experiments are now persistently stored and easily accessible!** |