# Docker Usage Guide
Complete guide for using ULTRATHINK with Docker.
## 🚀 Quick Start
### Run Web Interface (Default)
```bash
docker compose up app
# Visit http://localhost:7860
```
### Run Training (CPU)
```bash
docker compose --profile train up
```
### Run Training (GPU)
```bash
docker compose --profile train-gpu up
```
## 📦 Available Services
### 1. Web Interface (`app`)
**Purpose**: Interactive Gradio UI for model inference
```bash
# Start the web interface
docker compose up app
# Run in background
docker compose up -d app
# View logs
docker compose logs -f app
```
**Ports**:
- 7860: Gradio web interface
- 8000: FastAPI (if needed)
**Volumes**:
- `./outputs` - Model outputs
- `./checkpoints` - Model checkpoints
---
### 2. CPU Training (`train`)
**Purpose**: Train models on CPU (for testing/small models)
```bash
# Start training with default config
docker compose --profile train up
# Custom training command
docker compose run --rm train python train_ultrathink.py \
--dataset wikitext \
--hidden_size 512 \
--num_layers 6 \
--batch_size 2 \
--num_epochs 3
```
**Example - WikiText Training**:
```bash
docker compose run --rm train python train_advanced.py \
--config /app/configs/train_small.yaml
```
---
### 3. GPU Training (`train-gpu`)
**Purpose**: Train models with GPU acceleration
**Prerequisites**:
- NVIDIA GPU with a working host driver
- NVIDIA Container Toolkit (`nvidia-container-toolkit`), which provides the NVIDIA runtime for Docker
```bash
# Start GPU training
docker compose --profile train-gpu up
# Custom GPU training
docker compose run --rm train-gpu python train_ultrathink.py \
--dataset c4 --streaming \
--hidden_size 768 --num_layers 12 \
--use_amp --gradient_checkpointing \
--output_dir /app/outputs/c4_model
```
---
### 4. MLflow Tracking (`mlflow`)
**Purpose**: Experiment tracking and model registry
```bash
# Start MLflow server
docker compose --profile mlflow up -d
# Access UI
# http://localhost:5000
```
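`docker compose up -d` returns once the container has been started, which can be before the MLflow server accepts requests. Below is a minimal sketch of a readiness check, assuming `curl` is installed on the host and that the UI address above (`http://localhost:5000`) is the one to poll. `wait_for_url` is a hypothetical helper, not part of the project.

```bash
# Hypothetical helper: poll a URL until it answers or the attempts run out.
# Usage: wait_for_url <url> [attempts]   (one attempt per second, default 30)
wait_for_url() {
  local url="$1" attempts="${2:-30}" i
  for (( i = 1; i <= attempts; i++ )); do
    # --fail makes curl return non-zero on HTTP errors as well as connection errors.
    if curl --silent --fail --output /dev/null --max-time 2 "$url"; then
      echo "ready: ${url}"
      return 0
    fi
    sleep 1
  done
  echo "timed out waiting for ${url}" >&2
  return 1
}
```

For example, `wait_for_url http://localhost:5000 && docker compose --profile train up` would hold training back until the UI responds.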
**Train with MLflow tracking**:
```bash
docker compose run --rm \
--env MLFLOW_TRACKING_URI=http://mlflow:5000 \
train python train_ultrathink.py \
--use_mlflow \
--dataset wikitext \
--num_epochs 3
```
---
### 5. Development Environment (`dev`)
**Purpose**: Interactive development with all tools
```bash
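# Start dev container (--service-ports publishes the service's ports, e.g. 8888 for Jupyter;
# `docker compose run` does not publish them by default)
docker compose --profile dev run --rm --service-ports dev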
# Inside container:
pytest # Run tests
python train_ultrathink.py --help
jupyter notebook --ip 0.0.0.0 --port 8888
```
**Access Jupyter**:
- http://localhost:8888
---
## 🎯 Common Use Cases
### Use Case 1: Quick Demo
```bash
# Start web interface only
docker compose up app
# Visit http://localhost:7860
```
### Use Case 2: Training + Monitoring
```bash
# Start MLflow and training
docker compose --profile mlflow up -d
docker compose --profile train up
```
### Use Case 3: Full Development Stack
```bash
# Start the dev and MLflow profiles together (services without a profile start as well)
docker compose --profile dev --profile mlflow up
```
### Use Case 4: Production Training
```bash
# GPU training with checkpointing
docker compose run --rm train-gpu \
python train_advanced.py \
--config /app/configs/train_medium.yaml \
--checkpoint_frequency 1000 \
--output_dir /app/outputs/production_model
```
---
## 🔧 Advanced Usage
### Building Specific Stages
**Production image (minimal)**:
```bash
docker build --target production -t ultrathink:prod .
```
**Development image (with tools)**:
```bash
docker build --target development -t ultrathink:dev .
```
**Training image**:
```bash
docker build --target training -t ultrathink:train .
```
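The three builds above can also be driven from one loop. This is a sketch that only prints the commands; `tag_for_target` is a hypothetical helper that repeats the stage-to-tag mapping used in this section.

```bash
# Hypothetical helper: map a Dockerfile stage name to the image tag used in this guide.
tag_for_target() {
  case "$1" in
    production)  echo "ultrathink:prod" ;;
    development) echo "ultrathink:dev" ;;
    training)    echo "ultrathink:train" ;;
    *) echo "unknown target: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the build command for each stage; drop the `echo` to execute.
for target in production development training; do
  echo docker build --target "$target" -t "$(tag_for_target "$target")" .
done
```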
### Custom Environment Variables
Create a `.env` file:
```env
WANDB_API_KEY=your_api_key_here
HF_TOKEN=your_hf_token_here
MLFLOW_TRACKING_URI=http://mlflow:5000
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
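Run with the env file (Compose also loads a `.env` file in the project directory automatically, so `--env-file` is only required for other names or locations):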
```bash
docker compose --env-file .env up
```
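Before launching, it can help to confirm that the env file no longer contains the placeholder values from the example. The following is a minimal sketch; `check_env_keys` is a hypothetical helper, and it assumes placeholders start with `your_` as they do above.

```bash
# Hypothetical pre-flight helper: report keys that are missing from an env file
# or still hold a placeholder value (assumed to start with "your_").
# Usage: check_env_keys <env-file> KEY [KEY...]
check_env_keys() {
  local env_file="$1"; shift
  local key value missing=0
  for key in "$@"; do
    # Use the last assignment of the key, if any; an absent key yields an empty value.
    value="$(grep -E "^${key}=" "$env_file" 2>/dev/null | tail -n 1 | cut -d= -f2- || true)"
    if [ -z "$value" ] || [[ "$value" == your_* ]]; then
      echo "missing or placeholder: ${key}"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible use is `check_env_keys .env WANDB_API_KEY HF_TOKEN && docker compose --env-file .env up`, which starts the stack only when both keys are filled in.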
### Volume Mounting for Development
```bash
# Mount entire project (live editing)
docker run -it --rm \
-v "$(pwd)":/app \
ultrathink:dev bash
```
### Multi-GPU Training
```bash
# Use specific GPUs
docker compose run --rm \
--env CUDA_VISIBLE_DEVICES=0,1,2,3 \
train-gpu python train_ultrathink.py \
--distributed \
--num_gpus 4
```
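The device list and `--num_gpus` have to agree. One way to avoid a mismatch is to derive the count from the list, as in this sketch; `count_visible_gpus` is a hypothetical helper, and the final command is printed rather than executed.

```bash
# Hypothetical helper: count the device IDs in a CUDA_VISIBLE_DEVICES-style list.
count_visible_gpus() {
  local devices="$1"
  if [ -z "$devices" ]; then
    echo 0
    return 0
  fi
  local -a ids
  IFS=',' read -r -a ids <<< "$devices"
  echo "${#ids[@]}"
}

GPUS="0,1,2,3"
NUM_GPUS="$(count_visible_gpus "$GPUS")"

# Print the resulting command; remove the leading `echo` to run it.
echo docker compose run --rm --env "CUDA_VISIBLE_DEVICES=${GPUS}" \
  train-gpu python train_ultrathink.py --distributed --num_gpus "$NUM_GPUS"
```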
---
## 📊 Monitoring
### View Logs
```bash
# View all logs
docker compose logs
# Follow specific service
docker compose logs -f app
# Last 100 lines
docker compose logs --tail 100 train
```
### Check Resource Usage
```bash
# Container stats
docker stats
# Specific container
docker stats ultrathink_train_gpu
```
### Access Running Container
```bash
# Execute command in running container
docker compose exec app bash
# Run pytest in running container
docker compose exec app pytest
```
---
## 🧹 Cleanup
### Stop Services
```bash
# Stop all services
docker compose down
# Stop and remove named/anonymous volumes (bind mounts such as ./outputs and ./checkpoints are not deleted)
docker compose down -v
```
### Remove Images
```bash
# Remove project images (run `docker images` first and adjust the tags to the ones you actually built)
docker rmi ultrathink:latest ultrathink:training ultrathink:dev
# Prune unused images
docker image prune -a
```
### Clean Build Cache
```bash
docker builder prune -a
```
---
## 🐛 Troubleshooting
### Issue: GPU not detected
**Solution**:
```bash
# Check NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# Install nvidia-container-toolkit if needed
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
```
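If the `nvidia-smi` test container fails, it helps to know which prerequisite is missing on the host. The sketch below is a hypothetical diagnostic (`check_gpu_prereqs` is not part of the project). It looks for `nvidia-smi` (shipped with the host driver) and `nvidia-ctk` (the NVIDIA Container Toolkit CLI), and treats the `docker info` check as a hint only.

```bash
# Hypothetical host-side diagnostic: report which GPU prerequisites are visible.
check_gpu_prereqs() {
  local status=0 info=""
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo "ok: nvidia-smi found (host driver is installed)"
  else
    echo "missing: nvidia-smi (install the NVIDIA driver on the host)"
    status=1
  fi
  if command -v nvidia-ctk >/dev/null 2>&1; then
    echo "ok: nvidia-ctk found (NVIDIA Container Toolkit is installed)"
  else
    echo "missing: nvidia-ctk (install nvidia-container-toolkit)"
    status=1
  fi
  # `docker info` lists configured runtimes; an absent entry is only a hint,
  # because some setups expose GPUs without registering an "nvidia" runtime.
  if command -v docker >/dev/null 2>&1; then
    info="$(docker info 2>/dev/null || true)"
  fi
  case "$info" in
    *nvidia*) echo "ok: docker info mentions an nvidia runtime" ;;
    *)        echo "note: docker info does not mention an nvidia runtime" ;;
  esac
  return "$status"
}
```

Calling `check_gpu_prereqs` prints one line per check and returns non-zero when the driver or the toolkit is not found.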
### Issue: Out of memory
**Solution**:
```bash
# Reduce batch size
docker compose run --rm train python train_ultrathink.py \
--batch_size 1 \
--gradient_accumulation_steps 32
# Use gradient checkpointing
docker compose run --rm train python train_ultrathink.py \
--gradient_checkpointing
```
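Lowering `--batch_size` while raising `--gradient_accumulation_steps` keeps the effective batch size constant, assuming the script follows the usual convention of accumulating gradients over that many steps before each optimizer update. The arithmetic for a single device can be sketched as follows; both helper names are hypothetical.

```bash
# Effective batch size = per-step batch size x gradient accumulation steps (single device).
effective_batch_size() {
  echo $(( $1 * $2 ))
}

# Hypothetical helper: accumulation steps needed to reach a target effective batch size,
# rounded up when the target is not a multiple of the per-step batch size.
accumulation_steps_for() {
  local target="$1" per_step="$2"
  echo $(( (target + per_step - 1) / per_step ))
}

effective_batch_size 1 32      # prints 32, the setting used above
accumulation_steps_for 32 2    # prints 16
```

With several GPUs in data-parallel training, the effective batch size would typically be multiplied by the number of devices as well.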
### Issue: Slow training
**Solution**:
```bash
# Enable AMP and FlashAttention; omit --gradient_checkpointing here, since it trades speed for memory
docker compose run --rm train-gpu python train_ultrathink.py \
--use_amp \
--use_flash_attention
```
### Issue: Container fails to start
**Solution**:
```bash
# Check logs
docker compose logs app
# Rebuild image
docker compose build --no-cache app
# Check disk space
docker system df
```
---
## 🔐 Security Best Practices
1. **Don't hardcode secrets** - Use environment variables
2. **Use .env file** - Keep secrets out of docker-compose.yml
3. **Limit port exposure** - Only expose necessary ports
4. **Use specific tags** - Avoid `latest` in production
5. **Scan images** - Use `docker scout cves ultrathink:latest` (Docker Scout replaces the retired `docker scan` command)
6. **Non-root user** - Run containers as non-root (future enhancement)
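
The "use specific tags" rule in the list above can be checked mechanically. This is a rough sketch, with `is_floating_tag` as a hypothetical helper: it succeeds when an image reference has no tag (Docker then defaults to `latest`) or is tagged `latest` explicitly.

```bash
# Hypothetical lint helper: succeed when an image reference is not pinned,
# i.e. it has no tag (Docker defaults to "latest") or is tagged "latest".
is_floating_tag() {
  local ref="$1"
  # A digest pins the image exactly.
  case "$ref" in *@sha256:*) return 1 ;; esac
  # Inspect only the last path component so a registry port is not mistaken for a tag.
  local name="${ref##*/}"
  case "$name" in
    *:latest) return 0 ;;
    *:*)      return 1 ;;
    *)        return 0 ;;
  esac
}

for image in ultrathink:latest ultrathink:prod ultrathink; do
  if is_floating_tag "$image"; then
    echo "unpinned: ${image}"
  else
    echo "pinned:   ${image}"
  fi
done
```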
---
## 📚 Additional Resources
- [Docker Documentation](https://docs.docker.com/)
- [Docker Compose Reference](https://docs.docker.com/compose/compose-file/)
- [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)
- [Main README](README.md)
- [Training Guide](TRAINING_QUICKSTART.md)
---
## 🎓 Examples
### Example 1: Complete Training Pipeline
```bash
# 1. Start MLflow
docker compose --profile mlflow up -d
# 2. Train model
docker compose run --rm train python train_ultrathink.py \
--dataset wikitext \
--use_mlflow \
--num_epochs 5 \
--output_dir /app/outputs/wikitext_model
# 3. Check MLflow UI
# Visit http://localhost:5000
# 4. Start inference UI
docker compose up app
# 5. Cleanup
docker compose down
```
### Example 2: Distributed Training
```bash
# Multi-GPU training with DeepSpeed
docker compose run --rm \
--env CUDA_VISIBLE_DEVICES=0,1,2,3 \
train-gpu python train_ultrathink.py \
--distributed \
--deepspeed \
--deepspeed_config /app/configs/deepspeed_z3.json \
--dataset c4 --streaming
```
### Example 3: Development Workflow
```bash
# Start dev container
docker compose --profile dev run --rm dev bash
# Inside container:
# 1. Run tests
pytest
# 2. Train small model
python train_ultrathink.py --dataset wikitext --num_epochs 1
# 3. Profile performance
python scripts/profile_model.py --size tiny
# 4. Exit
exit
```
---
For more information, see the [main README](README.md) or [Advanced Training Guide](ADVANCED_TRAINING_GUIDE.md).