# CRAYON CUDA Testing Guide for Google Colab T4

## Quick Setup Commands

Run these cells in order in a Google Colab notebook with the T4 GPU runtime selected:

```python
# Cell 1: Check GPU
!nvidia-smi
!nvcc --version
```

```python
# Cell 2: Install PyTorch CUDA
!pip uninstall torch torchvision torchaudio -y
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# If torch was already imported in this session, restart the runtime before continuing.
import torch
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None'}")
```

```python
# Cell 3: Install CRAYON with CUDA
# Quote the requirement so the shell does not treat [cuda] as a glob pattern.
!pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ "xerv-crayon[cuda]"

# Verify installation
!python -c "import crayon; print('CRAYON installed')"
```
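
Because Cell 3 resolves packages from both TestPyPI and PyPI, it can help to confirm which versions were actually installed. The following is a minimal sketch using only the standard library; the distribution name `xerv-crayon` is taken from the install command above, and the helper name `installed_version` is illustrative rather than part of CRAYON.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this guide depends on.
print("xerv-crayon:", installed_version("xerv-crayon"))
print("torch:", installed_version("torch"))
```

A result of `None` means the distribution is not installed in the current environment.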

```python
# Cell 4: Test CUDA functionality
import logging
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')

from crayon.core.vocabulary import CrayonVocab

print("=== CRAYON CUDA Test ===")

# Auto-detection (should pick CUDA)
vocab = CrayonVocab(device="auto")
print(f"Device: {vocab.device}")

# Load profile
vocab.load_profile("lite")
print(f"Profile loaded: {len(vocab)} tokens")

# Test tokenization
text = "Hello, world! This is CUDA-accelerated tokenization."
tokens = vocab.tokenize(text)
print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Count: {len(tokens)}")
```

```python
# Cell 5: Performance benchmark
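import time

def benchmark(vocab, text, runs=5):
    """Return (average seconds per call, token count) over `runs` timed calls."""
    tokens = vocab.tokenize(text)  # warm-up call, excluded from timing
    times = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution timer
        tokens = vocab.tokenize(text)
        times.append(time.perf_counter() - start)
    avg_time = sum(times) / len(times)
    return avg_time, len(tokens)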

# Test texts
texts = [
    "Hello world",
    "Hello world! " * 10,
    "Hello world! " * 100,
    "Hello world! " * 1000,
]

# CPU comparison
vocab_cpu = CrayonVocab(device="cpu")
vocab_cpu.load_profile("lite")

print("=== Performance Comparison ===")
for i, text in enumerate(texts):
    print(f"\nTest {i+1}: {len(text)} chars")

    # CPU
    cpu_time, cpu_tokens = benchmark(vocab_cpu, text)
    print(f"  CPU:  {cpu_time:.6f}s ({cpu_tokens} tokens)")

    # CUDA
    cuda_time, cuda_tokens = benchmark(vocab, text)
    print(f"  CUDA: {cuda_time:.6f}s ({cuda_tokens} tokens)")

    # Speedup
    speedup = cpu_time / cuda_time if cuda_time > 0 else 0
    print(f"  Speedup: {speedup:.2f}x")
```
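
Wall-clock timings of GPU code can be distorted by one-time initialization and, if the library launches work asynchronously, by calls that return before the GPU has finished. The sketch below shows one way to reduce both effects: untimed warm-up calls, a median instead of a mean, and an optional synchronization hook. Whether CRAYON's `tokenize` returns before GPU work completes is not stated in this guide, so the `sync` hook is an assumption; `timed_runs` is an illustrative helper, not part of CRAYON.

```python
import statistics
import time


def timed_runs(fn, *args, warmup=2, runs=10, sync=None):
    """Return the median wall-clock seconds of fn(*args) over `runs` timed calls."""
    for _ in range(warmup):
        fn(*args)  # untimed calls absorb one-time setup costs
    if sync is not None:
        sync()  # make sure warm-up work has finished before timing starts
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()  # wait for pending device work so it is included in the sample
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs anywhere.
print(f"{timed_runs(str.split, 'Hello world! ' * 1000):.6f}s")

# Hypothetical usage in Colab, reusing `vocab` and `text` from the cells above:
#   import torch
#   median_s = timed_runs(vocab.tokenize, text, sync=torch.cuda.synchronize)
```

The median is less sensitive than the mean to a single slow run, which is common on shared Colab hardware.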

```python
# Cell 6: Batch processing test (reuses `time`, `vocab`, and `vocab_cpu` from Cells 4 and 5)
batch_texts = [
    "def fibonacci(n): return n if n <= 1 else fibonacci(n-1) + fibonacci(n-2)",
    "class NeuralNetwork(nn.Module): def __init__(self): super().__init__()",
    "import torch; model = torch.nn.Sequential(torch.nn.Linear(10, 5), torch.nn.ReLU())",
] * 50  # Large batch

print(f"Batch size: {len(batch_texts)}")

# CUDA batch
start = time.perf_counter()
batch_tokens = vocab.tokenize(batch_texts)
cuda_batch_time = time.perf_counter() - start

# CPU batch
start = time.perf_counter()
batch_tokens_cpu = vocab_cpu.tokenize(batch_texts)
cpu_batch_time = time.perf_counter() - start

print(f"CPU batch:  {cpu_batch_time:.4f}s")
print(f"CUDA batch: {cuda_batch_time:.4f}s")
# Guard against a zero denominator on very fast runs.
print(f"Speedup: {cpu_batch_time / cuda_batch_time:.2f}x" if cuda_batch_time > 0 else "Speedup: n/a")
```
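
A speedup only matters if both backends produce the same tokens, so it is worth comparing the two batch results. The sketch below assumes that `tokenize` returns one sequence of token IDs per input text when given a list; that return shape is an assumption, and `first_mismatch` is an illustrative helper rather than a CRAYON function.

```python
def first_mismatch(cuda_out, cpu_out):
    """Return the index of the first differing item, or None if the outputs agree."""
    for i, (a, b) in enumerate(zip(cuda_out, cpu_out)):
        if list(a) != list(b):
            return i
    if len(cuda_out) != len(cpu_out):
        # One output is a prefix of the other; the first extra item is the mismatch.
        return min(len(cuda_out), len(cpu_out))
    return None


# Small self-contained demonstration with made-up token IDs.
print(first_mismatch([[1, 2], [3]], [[1, 2], [3]]))
print(first_mismatch([[1, 2], [3]], [[1, 2], [4]]))

# Hypothetical usage after Cell 6:
#   idx = first_mismatch(batch_tokens, batch_tokens_cpu)
#   print("outputs match" if idx is None else f"first difference at index {idx}")
```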

## Expected Results on T4

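- **Device detection**: `CrayonVocab(device="auto")` should select "cuda" automatically
- **Hardware**: NVIDIA T4, 16 GB VRAM (about 15 GB usable), compute capability 7.5
- **Performance**: roughly 2-5x speedup over CPU on single texts and 5-10x on batches; very short inputs may show little or no gain because fixed per-call overhead dominates
- **Memory**: Efficient GPU utilization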
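
The hardware expectations above can be checked programmatically. This is a minimal sketch that compares the values PyTorch reports against the expected T4 profile; the 14 GiB memory threshold and the helper name `check_t4` are assumptions chosen for illustration.

```python
def check_t4(name, capability, total_memory_bytes):
    """Return a list of deviations from the expected T4 profile (empty if none)."""
    problems = []
    if "T4" not in name:
        problems.append(f"unexpected GPU: {name}")
    if tuple(capability) != (7, 5):
        problems.append(f"unexpected compute capability: {capability}")
    if total_memory_bytes < 14 * 1024**3:  # assumed lower bound for usable VRAM
        problems.append(f"less VRAM than expected: {total_memory_bytes / 1024**3:.1f} GiB")
    return problems


try:
    import torch
except ImportError:
    torch = None

if torch is not None and torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    issues = check_t4(props.name, (props.major, props.minor), props.total_memory)
    print(issues or "GPU matches the expected T4 profile")
else:
    print("CUDA-enabled PyTorch not available; skipping hardware check")
```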

## Troubleshooting

If CUDA doesn't work, run this diagnostic:

```python
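# Get detailed error information
from crayon.core.vocabulary import CrayonVocab

vocab = CrayonVocab(device="cpu")  # Initialize on CPU first
print(vocab._get_cuda_import_error())  # underscore-prefixed helper; may change between releases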
```

Common fixes:
1. **PyTorch is a CPU-only build**: Reinstall with the `cu121` wheels (Cell 2)
2. **CUDA_HOME**: Colab usually has this set correctly; check with `!echo $CUDA_HOME`
3. **GPU runtime**: Ensure a GPU is selected under Runtime > Change runtime type
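
The three checks above can be gathered in one cell. The sketch below uses only the standard library plus an optional PyTorch import; `diagnose` and its dictionary keys are illustrative names, not part of CRAYON.

```python
import os
import shutil


def diagnose():
    """Collect basic facts about the CUDA environment without raising."""
    report = {
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        "cuda_home": os.environ.get("CUDA_HOME"),
        "torch_installed": False,
        "torch_cuda_build": None,
        "torch_cuda_available": False,
    }
    try:
        import torch
    except ImportError:
        return report
    report["torch_installed"] = True
    report["torch_cuda_build"] = torch.version.cuda  # None on CPU-only builds
    report["torch_cuda_available"] = torch.cuda.is_available()
    return report


for key, value in diagnose().items():
    print(f"{key}: {value}")
```

If `torch_cuda_build` is `None`, the installed PyTorch is CPU-only (fix 1). If `nvidia_smi_on_path` is `False`, the runtime most likely has no GPU attached (fix 3).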

## Colab-Specific Notes

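- **Free T4 GPU**: Sessions are limited to about 12 hours and may disconnect earlier
- **Memory**: ~16GB GPU RAM; system RAM depends on the runtime type (about 12GB on the standard runtime, ~25GB or more on high-RAM runtimes)
- **CUDA**: Colab preinstalls its own CUDA toolkit (12.2 at the time of writing; confirm with `nvcc --version`). The `cu121` PyTorch wheels bundle their own CUDA 12.1 runtime, so the two versions do not have to match exactly
- **PyTorch**: Must be a CUDA-enabled build (`torch.cuda.is_available()` returns `True`)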
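
Since the amount of system RAM differs between runtime types, it can be read directly from the running session. This is a minimal sketch for Linux hosts such as Colab; it parses `/proc/meminfo`, and the helper name `parse_mem_total_gib` is illustrative.

```python
import os


def parse_mem_total_gib(meminfo_text):
    """Extract MemTotal from /proc/meminfo-style text and return it in GiB, or None."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB
            return kib / (1024 ** 2)
    return None


if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        total = parse_mem_total_gib(fh.read())
    print(f"System RAM: {total:.1f} GiB" if total is not None else "MemTotal not found")
else:
    print("/proc/meminfo not available on this platform")
```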

## Alternative: Use Development Version

```python
# Install directly from GitHub
!pip install git+https://github.com/Electroiscoding/CRAYON.git

# Alternatively, force a CUDA build if the default install does not produce one
!CRAYON_FORCE_CUDA=1 pip install git+https://github.com/Electroiscoding/CRAYON.git
```

This guide exercises the CRAYON changes that fix CUDA extension issues and improve error messages.