# πŸ”§ Troubleshooting Guide
Common issues and solutions for ULTRATHINK training.
## Table of Contents
- [Installation Issues](#installation-issues)
- [Training Errors](#training-errors)
- [Memory Issues](#memory-issues)
- [Performance Problems](#performance-problems)
- [Data Loading Issues](#data-loading-issues)
- [Distributed Training](#distributed-training)
- [Monitoring & Logging](#monitoring--logging)
- [Docker Issues](#docker-issues)
---
## Installation Issues
### ❌ `ImportError: No module named 'flash_attn'`
**Cause**: Flash Attention 2 not installed or incompatible with your CUDA version.
**Solution**:
```bash
# Check CUDA version
nvidia-smi
# Install Flash Attention 2 (requires CUDA 11.6+)
pip install flash-attn --no-build-isolation
# If build fails, disable Flash Attention
python train_ultrathink.py --no_flash_attention
```
**Alternative**: Use PyTorch's native SDPA:
```yaml
# In your config
model:
  use_flash_attention: false
  use_sdpa: true  # PyTorch 2.0+ scaled dot-product attention
```
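To confirm SDPA works on your install before rerunning training, here is a minimal standalone check (a sketch using PyTorch's public API, not tied to the ULTRATHINK config):
```python
import torch
import torch.nn.functional as F

# PyTorch 2.0+ ships scaled_dot_product_attention, which dispatches to a
# fused kernel (Flash / memory-efficient) when one supports your hardware/dtype.
q = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 128, 64])
```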
---
### ❌ `CUDA out of memory` during installation
**Cause**: Trying to build packages that require GPU memory.
**Solution**:
```bash
# Set environment variable to reduce memory usage
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# Install with no cache
pip install --no-cache-dir -r requirements.txt
# Or install CPU-only first, then GPU packages
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
```
---
### ❌ `ModuleNotFoundError: No module named 'src'`
**Cause**: Python can't find the `src` module.
**Solution**:
```bash
# Make sure you're in the correct directory
cd UltraThinking-LLM-Training/deep
# Install in development mode
pip install -e .
# Or add to PYTHONPATH
export PYTHONPATH="${PYTHONPATH}:$(pwd)"
```
---
## Training Errors
### ❌ `RuntimeError: Expected all tensors to be on the same device`
**Cause**: Model and data are on different devices (CPU vs GPU).
**Solution**:
```python
# Ensure device consistency
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}
```
**In training script**:
```bash
# Force CPU training
python train_ultrathink.py --device cpu
# Force GPU training
python train_ultrathink.py --device cuda
```
---
### ❌ `NaN loss` or `loss becomes inf`
**Cause**: Gradient explosion, learning rate too high, or numerical instability.
**Solutions** (a combined sketch follows the list):
1. **Reduce learning rate**:
```bash
python train_ultrathink.py --learning_rate 1e-4 # Instead of 3e-4
```
2. **Enable gradient clipping**:
```bash
python train_ultrathink.py --gradient_clip_norm 1.0
```
3. **Use mixed precision carefully**:
```bash
# Try FP32 first
python train_ultrathink.py --no_amp
# Or use BF16 if supported (A100, H100)
python train_ultrathink.py --use_bf16
```
4. **Check for bad data**:
```python
import torch

# Add validation in data loading
def validate_batch(batch):
    for k, v in batch.items():
        if torch.isnan(v).any() or torch.isinf(v).any():
            raise ValueError(f"Invalid values in {k}")
    return batch
```
5. **Reduce batch size**:
```bash
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 8
```
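Combining fixes 2 and 4, a guarded training step might look like this (a sketch; `model(**batch).loss` assumes an HF-style output object, not the repo's exact API):
```python
import torch

def guarded_step(model, optimizer, batch, max_norm=1.0):
    """One step with gradient clipping and a NaN/inf guard."""
    optimizer.zero_grad(set_to_none=True)
    loss = model(**batch).loss  # assumption: output exposes a .loss field
    if not torch.isfinite(loss):
        return None  # skip the update instead of corrupting the weights
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.detach().item()
```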
---
### ❌ `ValueError: Tokenizer not found`
**Cause**: Tokenizer not downloaded or path incorrect.
**Solution**:
```bash
# Download tokenizer manually
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('gpt2')"
# Or specify tokenizer path
python train_ultrathink.py --tokenizer_name gpt2
# Use local tokenizer
python train_ultrathink.py --tokenizer_path ./my_tokenizer
```
---
### ❌ `AssertionError: hidden_size must be divisible by num_heads`
**Cause**: Invalid model configuration.
**Solution**:
```bash
# Ensure hidden_size is divisible by num_heads
# Example: hidden_size=768, num_heads=12 βœ…
# Example: hidden_size=768, num_heads=10 ❌
python train_ultrathink.py --hidden_size 768 --num_heads 12
# Common valid combinations:
# 256 / 4 = 64
# 512 / 8 = 64
# 768 / 12 = 64
# 1024 / 16 = 64
```
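The same check can run up front in Python before launching a long job; a minimal sketch (the function name is illustrative, not part of the repo):
```python
def check_attention_config(hidden_size: int, num_heads: int) -> int:
    """Return the per-head dimension, or fail fast with a clear message."""
    if hidden_size % num_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by num_heads={num_heads}"
        )
    return hidden_size // num_heads

print(check_attention_config(768, 12))  # 64
```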
---
## Memory Issues
### ❌ `CUDA out of memory` during training
**Immediate fixes** (try in order; a gradient-accumulation sketch follows the checklist):
1. **Reduce batch size**:
```bash
python train_ultrathink.py --batch_size 1 --gradient_accumulation_steps 16
```
2. **Enable gradient checkpointing**:
```bash
python train_ultrathink.py --gradient_checkpointing
```
3. **Use mixed precision**:
```bash
python train_ultrathink.py --use_amp
```
4. **Reduce sequence length**:
```bash
python train_ultrathink.py --max_seq_length 512 # Instead of 2048
```
5. **Use DeepSpeed ZeRO**:
```bash
python train_ultrathink.py --use_deepspeed --deepspeed_config deepspeed_config_zero2.json
```
**Memory optimization checklist**:
```python
# In your training script
import torch
# Clear cache before training
torch.cuda.empty_cache()
# Enable memory efficient attention
model.config.use_flash_attention = True
# Reduce optimizer memory (use 8-bit Adam)
from bitsandbytes.optim import AdamW8bit
optimizer = AdamW8bit(model.parameters())
# Enable CPU offloading (slower but uses less GPU memory)
model.enable_cpu_offload()
```
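For fix 1, gradient accumulation keeps the effective batch size while shrinking peak activation memory. A self-contained toy sketch (the tiny linear model stands in for ULTRATHINK):
```python
import torch
from torch import nn

model = nn.Linear(10, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
accumulation_steps = 16  # effective batch = micro-batch size x 16

optimizer.zero_grad(set_to_none=True)
for step in range(64):
    x, y = torch.randn(1, 10), torch.randn(1, 1)  # micro-batch of 1
    loss = nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale to average
    loss.backward()  # gradients accumulate across micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()  # one weight update per 16 micro-batches
        optimizer.zero_grad(set_to_none=True)
```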
---
### ❌ Memory leak (memory keeps increasing)
**Cause**: Accumulating gradients, keeping references to tensors, or logging too much.
**Solutions** (combined into a single loop after the list):
1. **Clear gradients properly**:
```python
optimizer.zero_grad(set_to_none=True) # More memory efficient than zero_grad()
```
2. **Detach metrics**:
```python
loss_value = loss.detach().item() # Don't keep computation graph
```
3. **Limit logging**:
```python
if step % 100 == 0:  # Log every 100 steps, not every step
    logger.log_metrics({"loss": loss.item()})
```
4. **Clear cache periodically**:
```python
if step % 1000 == 0:
    torch.cuda.empty_cache()
```
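Put together, a leak-free loop looks like this (a sketch; `model`, `optimizer`, and `loader` are your own objects, and the `.loss` field is an HF-style assumption):
```python
import torch

for step, batch in enumerate(loader):
    optimizer.zero_grad(set_to_none=True)   # fix 1: free gradient memory early
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    if step % 100 == 0:                     # fix 3: log sparsely
        print(f"step {step}: loss={loss.detach().item():.4f}")  # fix 2: detach
    if step % 1000 == 0 and torch.cuda.is_available():
        torch.cuda.empty_cache()            # fix 4: release cached blocks
```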
---
### ❌ `RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED`
**Cause**: CUDA/cuDNN version mismatch or insufficient GPU memory.
**Solution**:
```bash
# Check CUDA version
nvidia-smi
# Reinstall PyTorch with correct CUDA version
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# Or pin to a single GPU and disable cuDNN inside the training script
# (in Python: torch.backends.cudnn.enabled = False)
export CUDA_VISIBLE_DEVICES=0
python train_ultrathink.py
```
---
## Performance Problems
### 🐌 Training is very slow
**Diagnosis**:
```bash
# Add profiling to find bottleneck
python scripts/profile_model.py --size small
# Check GPU utilization
nvidia-smi -l 1 # Update every second
```
**Common causes and fixes** (a profiler sketch follows the list):
1. **CPU bottleneck (data loading)**:
```bash
# Increase data loading workers
python train_ultrathink.py --num_workers 4 --prefetch_factor 2
```
2. **Small batch size**:
```bash
# Increase effective batch size
python train_ultrathink.py --batch_size 8 --gradient_accumulation_steps 4
```
3. **Not using Flash Attention**:
```bash
pip install flash-attn --no-build-isolation
python train_ultrathink.py --use_flash_attention
```
4. **Logging too frequently**:
```bash
python train_ultrathink.py --log_interval 100 # Instead of 10
```
5. **Slow storage**:
```bash
# Use local SSD instead of network storage
# Copy dataset to local disk first
cp -r /network/dataset /local/ssd/dataset
python train_ultrathink.py --dataset_path /local/ssd/dataset
```
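To pinpoint which of these is the culprit, PyTorch's built-in profiler gives a per-op time breakdown over a few steps (a sketch; `loader` and `train_step` stand in for your own loop):
```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for step, batch in enumerate(loader):  # placeholders for your data/step
        train_step(batch)
        if step >= 10:  # a handful of steps is enough for a profile
            break

# Heavy CPU time with little CUDA time usually means a data-loading bottleneck
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```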
---
### 🐌 Low GPU utilization (<50%)
**Causes**:
- Data loading bottleneck
- Small batch size
- CPU preprocessing too slow
**Solutions**:
```bash
# Increase workers and prefetch
python train_ultrathink.py --num_workers 8 --prefetch_factor 4
# Use streaming datasets (no preprocessing)
python train_ultrathink.py --dataset c4 --streaming
# Increase batch size
python train_ultrathink.py --batch_size 16
# Pin memory for faster transfer
python train_ultrathink.py --pin_memory
```
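These flags map directly onto `torch.utils.data.DataLoader`; a minimal sketch of the equivalent settings (assuming `dataset` is your map-style dataset):
```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=8,           # parallel CPU workers for preprocessing
    prefetch_factor=4,       # batches each worker keeps ready in advance
    pin_memory=True,         # page-locked host memory -> faster GPU transfer
    persistent_workers=True, # keep workers alive between epochs
)
```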
---
## Data Loading Issues
### ❌ `ConnectionError: Couldn't reach the Hugging Face Hub`
**Cause**: Network issues or HF Hub down.
**Solution**:
```bash
# Use cached dataset
export HF_DATASETS_OFFLINE=1
python train_ultrathink.py --dataset wikitext
# Or download dataset manually
python -c "from datasets import load_dataset; load_dataset('wikitext', 'wikitext-103-v1')"
# Use local dataset
python train_ultrathink.py --dataset_path ./my_local_dataset
```
---
### ❌ `Dataset is too large for disk`
**Solution**: Use streaming mode
```bash
python train_ultrathink.py --dataset c4 --streaming
```
**Or**: Use a smaller subset
```bash
python train_ultrathink.py --dataset c4 --max_samples 100000
```
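Under the hood, streaming corresponds to `datasets.load_dataset(..., streaming=True)`, which yields examples without downloading the whole corpus. A minimal sketch (note that C4 on the Hub lives at `allenai/c4` and needs a config name such as `"en"`):
```python
from datasets import load_dataset

# Streaming returns an IterableDataset: no full download, no disk cache
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(stream):
    print(example["text"][:80])
    if i >= 2:  # peek at a few samples
        break
```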
---
### ❌ `KeyError: 'text'` or wrong dataset format
**Cause**: Dataset doesn't have expected column names.
**Solution**:
```python
# Check dataset structure
from datasets import load_dataset

dataset = load_dataset("your_dataset")
print(dataset.column_names)

# Map to correct format
def preprocess(examples):
    return {"text": examples["content"]}  # Rename column

dataset = dataset.map(preprocess)
```
**In training script**:
```bash
python train_ultrathink.py --text_column content # Instead of default "text"
```
---
## Distributed Training
### ❌ `RuntimeError: Address already in use`
**Cause**: Previous training process still running or port conflict.
**Solution**:
```bash
# Kill previous processes
pkill -f train_ultrathink.py
# Use different port
python train_ultrathink.py --master_port 29501
# Or let system choose port
python train_ultrathink.py --master_port 0
```
---
### ❌ Multi-GPU training hangs or crashes
**Diagnosis**:
```bash
# Test NCCL communication
python -m torch.distributed.run --nproc_per_node=2 scripts/test_distributed.py
# Check NCCL debug info
export NCCL_DEBUG=INFO
python train_ultrathink.py
```
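If `scripts/test_distributed.py` is missing from your checkout, a minimal equivalent all-reduce test (a sketch, not the repo's script) is:
```python
# Save as test_dist.py and run: torchrun --nproc_per_node=2 test_dist.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
tensor = torch.ones(1, device="cuda" if torch.cuda.is_available() else "cpu")
dist.all_reduce(tensor)  # sums across ranks; should equal the world size
print(f"rank {dist.get_rank()}: got {tensor.item()}, expected {dist.get_world_size()}")
dist.destroy_process_group()
```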
**Common fixes**:
1. **Set correct backend**:
```bash
export NCCL_SOCKET_IFNAME=eth0 # Or your network interface
export NCCL_IB_DISABLE=1 # Disable InfiniBand if not available
```
2. **Use correct launcher**:
```bash
# Use torchrun (recommended)
torchrun --nproc_per_node=2 train_ultrathink.py
# Or accelerate
accelerate launch --num_processes=2 train_ultrathink.py
```
3. **Increase timeout**:
```bash
export NCCL_TIMEOUT=1800 # 30 minutes
```
---
### ❌ `RuntimeError: NCCL error: unhandled system error`
**Solution**:
```bash
# Disable peer-to-peer access
export NCCL_P2P_DISABLE=1
# Route NCCL over the loopback interface (single node only)
export NCCL_SOCKET_IFNAME=lo
# Check GPU topology
nvidia-smi topo -m
```
---
## Monitoring & Logging
### ❌ MLflow UI not starting
**Solution**:
```bash
# Check if port is in use
lsof -i :5000
# Use different port
mlflow ui --port 5001
# Or specify host
mlflow ui --host 0.0.0.0 --port 5000
```
---
### ❌ Weights & Biases not logging
**Solution**:
```bash
# Login to W&B
wandb login
# Check API key
echo $WANDB_API_KEY
# Disable W&B if not needed
export WANDB_MODE=disabled
python train_ultrathink.py
```
---
### ❌ TensorBoard shows no data
**Solution**:
```bash
# Check log directory
ls -la ./runs
# Start TensorBoard with correct path
tensorboard --logdir ./runs --port 6006
# Force reload
tensorboard --logdir ./runs --reload_interval 5
```
---
## Docker Issues
### ❌ `docker: permission denied`
**Solution**:
```bash
# Add user to docker group
sudo usermod -aG docker $USER
newgrp docker
# Or use sudo
sudo docker compose up
```
---
### ❌ Container runs out of memory
**Solution**:
```bash
# Increase Docker memory limit
# Docker Desktop: Settings β†’ Resources β†’ Memory
# Or use docker run with memory limit
docker run --memory=16g --gpus all ultrathink:latest
```
---
### ❌ GPU not available in Docker
**Solution**:
```bash
# Install nvidia-docker2
sudo apt-get install nvidia-docker2
sudo systemctl restart docker
# Test GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# Run with GPU
docker run --gpus all ultrathink:latest
```
---
## Getting Help
If you can't find a solution here:
1. **Check existing issues**: [GitHub Issues](https://github.com/vediyappanm/UltraThinking-LLM-Training/issues)
2. **Search discussions**: [GitHub Discussions](https://github.com/vediyappanm/UltraThinking-LLM-Training/discussions)
3. **Enable debug logging**:
```bash
export LOG_LEVEL=DEBUG
python train_ultrathink.py --verbose
```
4. **Create a minimal reproduction**:
```bash
python train_ultrathink.py \
--hidden_size 256 --num_layers 2 \
--batch_size 1 --max_steps 10 \
--dataset wikitext --max_samples 100
```
5. **Open an issue** with:
- Error message and full traceback
- Your configuration (model size, hardware, etc.)
- Steps to reproduce
- Output of `python --version`, `python -c "import torch; print(torch.__version__)"`, and `nvidia-smi` (or `python -m torch.utils.collect_env` to gather all of this at once)
---
## Debugging Checklist
Before opening an issue, try:
- [ ] Update to latest version: `git pull && pip install -r requirements.txt`
- [ ] Clear cache: `rm -rf ~/.cache/huggingface`
- [ ] Test with minimal config: `--hidden_size 256 --num_layers 2 --batch_size 1`
- [ ] Check GPU: `nvidia-smi`
- [ ] Check disk space: `df -h`
- [ ] Check CUDA version: `nvcc --version`
- [ ] Run tests: `pytest tests/`
- [ ] Enable verbose logging: `--verbose`
---
**Last Updated**: January 2025
**Version**: 1.0.0
Found a solution not listed here? [Contribute to this guide!](CONTRIBUTING.md)