
Docker Usage Guide

Complete guide for using ULTRATHINK with Docker.

πŸš€ Quick Start

Run Web Interface (Default)

docker compose up app
# Visit http://localhost:7860

Run Training (CPU)

docker compose --profile train up

Run Training (GPU)

docker compose --profile train-gpu up

πŸ“¦ Available Services

1. Web Interface (app)

Purpose: Interactive Gradio UI for model inference

# Start the web interface
docker compose up app

# Run in background
docker compose up -d app

# View logs
docker compose logs -f app

Ports:

  • 7860: Gradio web interface
  • 8000: FastAPI (optional; only used if the API server is enabled)

Volumes:

  • ./outputs - Model outputs
  • ./checkpoints - Model checkpoints
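Before the first run, it can help to create the bind-mounted directories yourself; if the Docker daemon creates them on demand they are typically owned by root, which makes later cleanup awkward. A minimal sketch:

```shell
# Pre-create the host directories that back the bind mounts, so the Docker
# daemon does not create them on demand (daemon-created dirs are root-owned).
mkdir -p outputs checkpoints
ls -d outputs checkpoints
```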

2. CPU Training (train)

Purpose: Train models on CPU (for testing/small models)

# Start training with default config
docker compose --profile train up

# Custom training command
docker compose run --rm train python train_ultrathink.py \
  --dataset wikitext \
  --hidden_size 512 \
  --num_layers 6 \
  --batch_size 2 \
  --num_epochs 3

Example - WikiText Training:

docker compose run --rm train python train_advanced.py \
  --config /app/configs/train_small.yaml

3. GPU Training (train-gpu)

Purpose: Train models with GPU acceleration

Prerequisites:

  • NVIDIA GPU with a recent driver installed on the host
  • nvidia-container-toolkit (provides the NVIDIA runtime for Docker)

# Start GPU training
docker compose --profile train-gpu up
docker compose --profile train-gpu up

# Custom GPU training
docker compose run --rm train-gpu python train_ultrathink.py \
  --dataset c4 --streaming \
  --hidden_size 768 --num_layers 12 \
  --use_amp --gradient_checkpointing \
  --output_dir /app/outputs/c4_model

4. MLflow Tracking (mlflow)

Purpose: Experiment tracking and model registry

# Start MLflow server
docker compose --profile mlflow up -d

# Access UI
# http://localhost:5000

Train with MLflow tracking:

docker compose run --rm \
  --env MLFLOW_TRACKING_URI=http://mlflow:5000 \
  train python train_ultrathink.py \
    --use_mlflow \
    --dataset wikitext \
    --num_epochs 3

5. Development Environment (dev)

Purpose: Interactive development with all tools

# Start dev container
docker compose --profile dev run --rm dev

# Inside container:
pytest                  # Run tests
python train_ultrathink.py --help
jupyter notebook --ip 0.0.0.0 --port 8888

Access Jupyter:

Open http://localhost:8888 on the host. Note that docker compose run does not publish the service's ports by default, so add --service-ports (or -p 8888:8888) to the run command above to expose Jupyter.

🎯 Common Use Cases

Use Case 1: Quick Demo

# Start web interface only
docker compose up app
# Visit http://localhost:7860

Use Case 2: Training + Monitoring

# Start MLflow and training
docker compose --profile mlflow up -d
docker compose --profile train up

Use Case 3: Full Development Stack

# Start all services
docker compose --profile dev --profile mlflow up

Use Case 4: Production Training

# GPU training with checkpointing
docker compose run --rm train-gpu \
  python train_advanced.py \
    --config /app/configs/train_medium.yaml \
    --checkpoint_frequency 1000 \
    --output_dir /app/outputs/production_model

πŸ”§ Advanced Usage

Building Specific Stages

Production image (minimal):

docker build --target production -t ultrathink:prod .

Development image (with tools):

docker build --target development -t ultrathink:dev .

Training image:

docker build --target training -t ultrathink:train .

Custom Environment Variables

Create .env file:

WANDB_API_KEY=your_api_key_here
HF_TOKEN=your_hf_token_here
MLFLOW_TRACKING_URI=http://mlflow:5000
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Run with the env file:

# Compose loads ./.env automatically; --env-file is only needed for a non-default path
docker compose --env-file .env up
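Plain docker run (unlike Compose) does not read .env. One way to reuse the same file there is to export its variables into the current shell first — a sketch, assuming the file contains simple KEY=value lines (the value below is a placeholder):

```shell
# Write a sample .env (placeholder value), then auto-export its variables
# into the current shell so they are visible to plain `docker run`.
cat > .env <<'EOF'
MLFLOW_TRACKING_URI=http://mlflow:5000
EOF
set -a        # auto-export every variable assigned while this is active
. ./.env
set +a
echo "$MLFLOW_TRACKING_URI"
```

With the variables exported, docker run --env MLFLOW_TRACKING_URI ... forwards them without repeating the values on the command line.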

Volume Mounting for Development

# Mount entire project (live editing)
docker run -it --rm \
  -v $(pwd):/app \
  ultrathink:dev bash

Multi-GPU Training

# Use specific GPUs
docker compose run --rm \
  --env CUDA_VISIBLE_DEVICES=0,1,2,3 \
  train-gpu python train_ultrathink.py \
    --distributed \
    --num_gpus 4
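On larger machines the device list can be derived from a GPU count rather than typed by hand — a small sketch (N is an assumption; set it to your actual GPU count):

```shell
# Build "0,1,...,N-1" for CUDA_VISIBLE_DEVICES from a GPU count.
N=4
DEVICES=$(seq -s, 0 $((N - 1)))
echo "$DEVICES"
# Then pass it along:
#   docker compose run --rm --env CUDA_VISIBLE_DEVICES="$DEVICES" train-gpu ...
```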

πŸ“Š Monitoring

View Logs

# View all logs
docker compose logs

# Follow specific service
docker compose logs -f app

# Last 100 lines
docker compose logs --tail 100 train

Check Resource Usage

# Container stats
docker stats

# Specific container
docker stats ultrathink_train_gpu

Access Running Container

# Execute command in running container
docker compose exec app bash

# Run pytest in running container
docker compose exec app pytest

🧹 Cleanup

Stop Services

# Stop all services
docker compose down

# Stop and remove volumes
docker compose down -v

Remove Images

# Remove project images (the tags built above)
docker rmi ultrathink:prod ultrathink:train ultrathink:dev

# Prune unused images
docker image prune -a

Clean Build Cache

docker builder prune -a

πŸ› Troubleshooting

Issue: GPU not detected

Solution:

# Check NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# Install nvidia-container-toolkit if needed
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

Issue: Out of memory

Solution:

# Reduce batch size
docker compose run --rm train python train_ultrathink.py \
  --batch_size 1 \
  --gradient_accumulation_steps 32

# Use gradient checkpointing
docker compose run --rm train python train_ultrathink.py \
  --gradient_checkpointing
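The two flags trade memory for wall-clock time while keeping the effective batch size constant: each optimizer step accumulates gradients over several small forward/backward passes. With the values above:

```shell
# Effective batch size = per-step batch_size × gradient_accumulation_steps.
BATCH_SIZE=1
ACCUM_STEPS=32
EFFECTIVE=$(( BATCH_SIZE * ACCUM_STEPS ))
echo "$EFFECTIVE"   # same effective batch as batch_size=32 with no accumulation
```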

Issue: Slow training

Solution:

# Enable AMP and optimizations
docker compose run --rm train-gpu python train_ultrathink.py \
  --use_amp \
  --gradient_checkpointing \
  --use_flash_attention

Issue: Container fails to start

Solution:

# Check logs
docker compose logs app

# Rebuild image
docker compose build --no-cache app

# Check disk space
docker system df

πŸ” Security Best Practices

  1. Don't hardcode secrets - Use environment variables
  2. Use .env file - Keep secrets out of docker-compose.yml
  3. Limit port exposure - Only expose necessary ports
  4. Use specific tags - Avoid latest in production
  5. Scan images - Use docker scout cves ultrathink:prod (docker scan is deprecated)
  6. Non-root user - Run containers as non-root (future enhancement)
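A small sketch for points 1–2: keep the secrets file out of version control and restrict who can read it (the chmod and .gitignore steps are common conventions, not project requirements):

```shell
# Create the secrets file if missing, lock down its permissions,
# and make sure git never tracks it.
touch .env
chmod 600 .env
grep -qxF '.env' .gitignore 2>/dev/null || echo '.env' >> .gitignore
```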

πŸ“š Additional Resources


πŸŽ“ Examples

Example 1: Complete Training Pipeline

# 1. Start MLflow
docker compose --profile mlflow up -d

# 2. Train model
docker compose run --rm train python train_ultrathink.py \
  --dataset wikitext \
  --use_mlflow \
  --num_epochs 5 \
  --output_dir /app/outputs/wikitext_model

# 3. Check MLflow UI
# Visit http://localhost:5000

# 4. Start inference UI
docker compose up app

# 5. Cleanup
docker compose down

Example 2: Distributed Training

# Multi-GPU training with DeepSpeed
docker compose run --rm \
  --env CUDA_VISIBLE_DEVICES=0,1,2,3 \
  train-gpu python train_ultrathink.py \
    --distributed \
    --deepspeed \
    --deepspeed_config /app/configs/deepspeed_z3.json \
    --dataset c4 --streaming

Example 3: Development Workflow

# Start dev container
docker compose --profile dev run --rm dev bash

# Inside container:
# 1. Run tests
pytest

# 2. Train small model
python train_ultrathink.py --dataset wikitext --num_epochs 1

# 3. Profile performance
python scripts/profile_model.py --size tiny

# 4. Exit
exit

For more information, see the main README or Advanced Training Guide.