| # Docker Usage Guide | |
| Complete guide for using ULTRATHINK with Docker. | |
| ## π Quick Start | |
| ### Run Web Interface (Default) | |
| ```bash | |
| docker compose up app | |
| # Visit http://localhost:7860 | |
| ``` | |
| ### Run Training (CPU) | |
| ```bash | |
| docker compose --profile train up | |
| ``` | |
| ### Run Training (GPU) | |
| ```bash | |
| docker compose --profile train-gpu up | |
| ``` | |
| ## π¦ Available Services | |
| ### 1. Web Interface (`app`) | |
| **Purpose**: Interactive Gradio UI for model inference | |
| ```bash | |
| # Start the web interface | |
| docker compose up app | |
| # Run in background | |
| docker compose up -d app | |
| # View logs | |
| docker compose logs -f app | |
| ``` | |
| **Ports**: | |
| - 7860: Gradio web interface | |
| - 8000: FastAPI (if needed) | |
| **Volumes**: | |
| - `./outputs` - Model outputs | |
| - `./checkpoints` - Model checkpoints | |
| --- | |
| ### 2. CPU Training (`train`) | |
| **Purpose**: Train models on CPU (for testing/small models) | |
| ```bash | |
| # Start training with default config | |
| docker compose --profile train up | |
| # Custom training command | |
| docker compose run --rm train python train_ultrathink.py \ | |
| --dataset wikitext \ | |
| --hidden_size 512 \ | |
| --num_layers 6 \ | |
| --batch_size 2 \ | |
| --num_epochs 3 | |
| ``` | |
| **Example - WikiText Training**: | |
| ```bash | |
| docker compose run --rm train python train_advanced.py \ | |
| --config /app/configs/train_small.yaml | |
| ``` | |
| --- | |
| ### 3. GPU Training (`train-gpu`) | |
| **Purpose**: Train models with GPU acceleration | |
| **Prerequisites**: | |
| - NVIDIA GPU | |
| - NVIDIA Docker runtime | |
| - nvidia-container-toolkit | |
| ```bash | |
| # Start GPU training | |
| docker compose --profile train-gpu up | |
| # Custom GPU training | |
| docker compose run --rm train-gpu python train_ultrathink.py \ | |
| --dataset c4 --streaming \ | |
| --hidden_size 768 --num_layers 12 \ | |
| --use_amp --gradient_checkpointing \ | |
| --output_dir /app/outputs/c4_model | |
| ``` | |
| --- | |
| ### 4. MLflow Tracking (`mlflow`) | |
| **Purpose**: Experiment tracking and model registry | |
| ```bash | |
| # Start MLflow server | |
| docker compose --profile mlflow up -d | |
| # Access UI | |
| # http://localhost:5000 | |
| ``` | |
| **Train with MLflow tracking**: | |
| ```bash | |
| docker compose run --rm \ | |
| --env MLFLOW_TRACKING_URI=http://mlflow:5000 \ | |
| train python train_ultrathink.py \ | |
| --use_mlflow \ | |
| --dataset wikitext \ | |
| --num_epochs 3 | |
| ``` | |
| --- | |
| ### 5. Development Environment (`dev`) | |
| **Purpose**: Interactive development with all tools | |
| ```bash | |
| # Start dev container | |
| docker compose --profile dev run --rm dev | |
| # Inside container: | |
| pytest # Run tests | |
| python train_ultrathink.py --help | |
| jupyter notebook --ip 0.0.0.0 --port 8888 | |
| ``` | |
| **Access Jupyter**: | |
| - http://localhost:8888 | |
| --- | |
| ## π― Common Use Cases | |
| ### Use Case 1: Quick Demo | |
| ```bash | |
| # Start web interface only | |
| docker compose up app | |
| # Visit http://localhost:7860 | |
| ``` | |
| ### Use Case 2: Training + Monitoring | |
| ```bash | |
| # Start MLflow and training | |
| docker compose --profile mlflow up -d | |
| docker compose --profile train up | |
| ``` | |
| ### Use Case 3: Full Development Stack | |
| ```bash | |
| # Start all services | |
| docker compose --profile dev --profile mlflow up | |
| ``` | |
| ### Use Case 4: Production Training | |
| ```bash | |
| # GPU training with checkpointing | |
| docker compose run --rm train-gpu \ | |
| python train_advanced.py \ | |
| --config /app/configs/train_medium.yaml \ | |
| --checkpoint_frequency 1000 \ | |
| --output_dir /app/outputs/production_model | |
| ``` | |
| --- | |
| ## π§ Advanced Usage | |
| ### Building Specific Stages | |
| **Production image (minimal)**: | |
| ```bash | |
| docker build --target production -t ultrathink:prod . | |
| ``` | |
| **Development image (with tools)**: | |
| ```bash | |
| docker build --target development -t ultrathink:dev . | |
| ``` | |
| **Training image**: | |
| ```bash | |
| docker build --target training -t ultrathink:train . | |
| ``` | |
| ### Custom Environment Variables | |
| Create `.env` file: | |
| ```env | |
| WANDB_API_KEY=your_api_key_here | |
| HF_TOKEN=your_hf_token_here | |
| MLFLOW_TRACKING_URI=http://mlflow:5000 | |
| PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True | |
| ``` | |
| Run with env file: | |
| ```bash | |
| docker compose --env-file .env up | |
| ``` | |
| ### Volume Mounting for Development | |
| ```bash | |
| # Mount entire project (live editing) | |
| docker run -it --rm \ | |
| -v $(pwd):/app \ | |
| ultrathink:dev bash | |
| ``` | |
| ### Multi-GPU Training | |
| ```bash | |
| # Use specific GPUs | |
| docker compose run --rm \ | |
| --env CUDA_VISIBLE_DEVICES=0,1,2,3 \ | |
| train-gpu python train_ultrathink.py \ | |
| --distributed \ | |
| --num_gpus 4 | |
| ``` | |
| --- | |
| ## π Monitoring | |
| ### View Logs | |
| ```bash | |
| # View all logs | |
| docker compose logs | |
| # Follow specific service | |
| docker compose logs -f app | |
| # Last 100 lines | |
| docker compose logs --tail 100 train | |
| ``` | |
| ### Check Resource Usage | |
| ```bash | |
| # Container stats | |
| docker stats | |
| # Specific container | |
| docker stats ultrathink_train_gpu | |
| ``` | |
| ### Access Running Container | |
| ```bash | |
| # Execute command in running container | |
| docker compose exec app bash | |
| # Run pytest in running container | |
| docker compose exec app pytest | |
| ``` | |
| --- | |
| ## π§Ή Cleanup | |
| ### Stop Services | |
| ```bash | |
| # Stop all services | |
| docker compose down | |
| # Stop and remove volumes | |
| docker compose down -v | |
| ``` | |
| ### Remove Images | |
| ```bash | |
| # Remove project images | |
| docker rmi ultrathink:latest ultrathink:training ultrathink:dev | |
| # Prune unused images | |
| docker image prune -a | |
| ``` | |
| ### Clean Build Cache | |
| ```bash | |
| docker builder prune -a | |
| ``` | |
| --- | |
| ## π Troubleshooting | |
| ### Issue: GPU not detected | |
| **Solution**: | |
| ```bash | |
| # Check NVIDIA runtime | |
| docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi | |
| # Install nvidia-container-toolkit if needed | |
| # https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html | |
| ``` | |
| ### Issue: Out of memory | |
| **Solution**: | |
| ```bash | |
| # Reduce batch size | |
| docker compose run --rm train python train_ultrathink.py \ | |
| --batch_size 1 \ | |
| --gradient_accumulation_steps 32 | |
| # Use gradient checkpointing | |
| docker compose run --rm train python train_ultrathink.py \ | |
| --gradient_checkpointing | |
| ``` | |
| ### Issue: Slow training | |
| **Solution**: | |
| ```bash | |
| # Enable AMP and optimizations | |
| docker compose run --rm train-gpu python train_ultrathink.py \ | |
| --use_amp \ | |
| --gradient_checkpointing \ | |
| --use_flash_attention | |
| ``` | |
| ### Issue: Container fails to start | |
| **Solution**: | |
| ```bash | |
| # Check logs | |
| docker compose logs app | |
| # Rebuild image | |
| docker compose build --no-cache app | |
| # Check disk space | |
| docker system df | |
| ``` | |
| --- | |
| ## π Security Best Practices | |
| 1. **Don't hardcode secrets** - Use environment variables | |
| 2. **Use .env file** - Keep secrets out of docker-compose.yml | |
| 3. **Limit port exposure** - Only expose necessary ports | |
| 4. **Use specific tags** - Avoid `latest` in production | |
| 5. **Scan images** - Use `docker scan ultrathink:latest` | |
| 6. **Non-root user** - Run containers as non-root (future enhancement) | |
| --- | |
| ## π Additional Resources | |
| - [Docker Documentation](https://docs.docker.com/) | |
| - [Docker Compose Reference](https://docs.docker.com/compose/compose-file/) | |
| - [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker) | |
| - [Main README](README.md) | |
| - [Training Guide](TRAINING_QUICKSTART.md) | |
| --- | |
| ## π Examples | |
| ### Example 1: Complete Training Pipeline | |
| ```bash | |
| # 1. Start MLflow | |
| docker compose --profile mlflow up -d | |
| # 2. Train model | |
| docker compose run --rm train python train_ultrathink.py \ | |
| --dataset wikitext \ | |
| --use_mlflow \ | |
| --num_epochs 5 \ | |
| --output_dir /app/outputs/wikitext_model | |
| # 3. Check MLflow UI | |
| # Visit http://localhost:5000 | |
| # 4. Start inference UI | |
| docker compose up app | |
| # 5. Cleanup | |
| docker compose down | |
| ``` | |
| ### Example 2: Distributed Training | |
| ```bash | |
| # Multi-GPU training with DeepSpeed | |
| docker compose run --rm \ | |
| --env CUDA_VISIBLE_DEVICES=0,1,2,3 \ | |
| train-gpu python train_ultrathink.py \ | |
| --distributed \ | |
| --deepspeed \ | |
| --deepspeed_config /app/configs/deepspeed_z3.json \ | |
| --dataset c4 --streaming | |
| ``` | |
| ### Example 3: Development Workflow | |
| ```bash | |
| # Start dev container | |
| docker compose --profile dev run --rm dev bash | |
| # Inside container: | |
| # 1. Run tests | |
| pytest | |
| # 2. Train small model | |
| python train_ultrathink.py --dataset wikitext --num_epochs 1 | |
| # 3. Profile performance | |
| python scripts/profile_model.py --size tiny | |
| # 4. Exit | |
| exit | |
| ``` | |
| --- | |
| For more information, see the [main README](README.md) or [Advanced Training Guide](ADVANCED_TRAINING_GUIDE.md). | |