Spaces:
Sleeping
Sleeping
| # SmolLM3 Training Pipeline Fixes Summary | |
| ## Issues Identified and Fixed | |
| ### 1. Format String Error | |
| **Issue**: `Unknown format code 'f' for object of type 'str'` | |
| **Root Cause**: The console callback was trying to format non-numeric values with f-string format specifiers | |
| **Fix**: Updated `src/trainer.py` to properly handle type conversion before formatting | |
| ```python | |
| # Before (causing error): | |
| print("Step {}: loss={:.4f}, lr={}".format(step, loss, lr)) | |
| # After (fixed): | |
| if isinstance(loss, (int, float)): | |
| loss_str = f"{loss:.4f}" | |
| else: | |
| loss_str = str(loss) | |
| if isinstance(lr, (int, float)): | |
| lr_str = f"{lr:.2e}" | |
| else: | |
| lr_str = str(lr) | |
| print(f"Step {step}: loss={loss_str}, lr={lr_str}") | |
| ``` | |
| ### 2. Callback Addition Error | |
| **Issue**: `'SmolLM3Trainer' object has no attribute 'add_callback'` | |
| **Root Cause**: The trainer was trying to add callbacks after creation, but callbacks should be passed during trainer creation | |
| **Fix**: Removed the incorrect `add_callback` call from `src/train.py` since callbacks are already handled in `SmolLM3Trainer._setup_trainer()` | |
| ### 3. Trackio Space Deployment Issues | |
| **Issue**: 404 errors when trying to create experiments via Trackio API | |
| **Root Cause**: The Trackio Space deployment was failing or the API endpoints weren't accessible | |
| **Fix**: Updated `src/monitoring.py` to gracefully handle Trackio Space failures and continue with HF Datasets integration | |
| ```python | |
| # Added graceful fallback: | |
| try: | |
| result = self.trackio_client.log_metrics(...) | |
| if "success" in result: | |
| logger.debug("Metrics logged to Trackio") | |
| else: | |
| logger.warning("Failed to log metrics to Trackio: %s", result) | |
| except Exception as e: | |
| logger.warning("Trackio logging failed: %s", e) | |
| ``` | |
| ### 4. Monitoring Integration Improvements | |
| **Enhancement**: Made monitoring more robust by: | |
| - Testing Trackio Space connectivity before attempting operations | |
| - Continuing with HF Datasets even if Trackio fails | |
| - Adding better error handling and logging | |
| - Ensuring experiments are saved to HF Datasets regardless of Trackio status | |
| ## Files Modified | |
| ### Core Training Files | |
| 1. **`src/trainer.py`** | |
| - Fixed format string error in SimpleConsoleCallback | |
| - Improved callback handling and error reporting | |
| 2. **`src/train.py`** | |
| - Removed incorrect `add_callback` call | |
| - Simplified trainer initialization | |
| 3. **`src/monitoring.py`** | |
| - Added graceful Trackio Space failure handling | |
| - Improved error logging and fallback mechanisms | |
| - Enhanced HF Datasets integration | |
| ### Test Files | |
| 4. **`tests/test_training_fix.py`** | |
| - Created comprehensive test suite | |
| - Tests imports, config loading, monitoring setup, trainer creation | |
| - Validates format string fixes | |
| ## Testing the Fixes | |
| Run the test suite to verify all fixes work: | |
| ```bash | |
| python tests/test_training_fix.py | |
| ``` | |
| Expected output: | |
| ``` | |
| π Testing SmolLM3 Training Pipeline Fixes | |
| ================================================== | |
| π Testing imports... | |
| β config.py imported successfully | |
| β model.py imported successfully | |
| β data.py imported successfully | |
| β trainer.py imported successfully | |
| β monitoring.py imported successfully | |
| π Testing configuration loading... | |
| β Configuration loaded successfully | |
| Model: HuggingFaceTB/SmolLM3-3B | |
| Dataset: legmlai/openhermes-fr | |
| Batch size: 16 | |
| Learning rate: 8e-06 | |
| π Testing monitoring setup... | |
| β Monitoring setup successful | |
| Experiment: test_experiment | |
| Tracking enabled: False | |
| HF Dataset: tonic/trackio-experiments | |
| π Testing trainer creation... | |
| β Model created successfully | |
| β Dataset created successfully | |
| β Trainer created successfully | |
| π Testing format string fix... | |
| β Format string fix works correctly | |
| π Test Results: 5/5 tests passed | |
| β All tests passed! The training pipeline should work correctly. | |
| ``` | |
| ## Running the Training Pipeline | |
| The training pipeline should now work correctly with the H100 lightweight configuration: | |
| ```bash | |
| # Run the interactive pipeline | |
| ./launch.sh | |
| # Or run training directly | |
| python src/train.py config/train_smollm3_h100_lightweight.py \ | |
| --experiment-name "smollm3_test" \ | |
| --trackio-url "https://your-space.hf.space" \ | |
| --output-dir /output-checkpoint | |
| ``` | |
| ## Key Improvements | |
| 1. **Robust Error Handling**: Training continues even if monitoring components fail | |
| 2. **Better Logging**: More informative error messages and status updates | |
| 3. **Graceful Degradation**: HF Datasets integration works even without Trackio Space | |
| 4. **Type Safety**: Proper type checking prevents format string errors | |
| 5. **Comprehensive Testing**: Test suite validates all components work correctly | |
| ## Next Steps | |
| 1. **Deploy Trackio Space**: If you want full monitoring, deploy the Trackio Space manually | |
| 2. **Test Training**: Run a short training session to verify everything works | |
| 3. **Monitor Progress**: Check HF Datasets for experiment data even if Trackio Space is unavailable | |
| The training pipeline should now work reliably for your end-to-end fine-tuning experiments! |