Spaces:

danielrosehill
/

Claude-Code-Slash-Commands-Linux-Desktop

Running

App Files Files Community

Claude-Code-Slash-Commands-Linux-Desktop / commands /ai-tools /gpu-ai-ml-assessment.md

danielrosehill

commit

279efce 6 months ago

preview code

raw

history blame contribute delete

14.3 kB

	You are assessing GPU driver status and AI/ML workload capabilities.

	## Your Task

	Evaluate the GPU's driver configuration and suitability for AI/ML workloads, including deep learning frameworks, compute capabilities, and performance optimization.

	### 1. Driver Status Assessment
	- Installed driver: Type (proprietary/open-source) and version
	- Driver source: Distribution package, vendor installer, or compiled
	- Driver status: Loaded, functioning, errors
	- Kernel module: Module name and status
	- Driver age: Release date and recency
	- Latest driver: Compare installed vs. available
	- Driver compatibility: Kernel version compatibility
	- Secure boot status: Impact on driver loading

	### 2. Compute Framework Support
	- CUDA availability: CUDA Toolkit installation status
	- CUDA version: Installed CUDA version
	- CUDA compatibility: GPU compute capability vs. CUDA requirements
	- ROCm availability: For AMD GPUs
	- ROCm version: Installed ROCm version
	- OpenCL support: OpenCL runtime and version
	- oneAPI: Intel oneAPI toolkit status
	- Framework libraries: cuDNN, cuBLAS, TensorRT, etc.

	### 3. GPU Compute Capabilities
	- Compute capability: NVIDIA CUDA compute version (e.g., 8.6, 8.9)
	- Architecture suitability: Architecture generation for AI/ML
	- Tensor cores: Presence and version (Gen 1/2/3/4)
	- RT cores: Ray tracing acceleration (less relevant for ML)
	- Memory bandwidth: Critical for ML workloads
	- VRAM capacity: Memory size for model loading
	- FP64/FP32/FP16/INT8: Precision support
	- TF32: Tensor Float 32 support (Ampere+)
	- Mixed precision: Automatic mixed precision capability

	### 4. Deep Learning Framework Compatibility
	- PyTorch: Installation status and CUDA/ROCm support
	- TensorFlow: Installation and GPU backend
	- JAX: Google JAX framework support
	- ONNX Runtime: ONNX with GPU acceleration
	- MXNet: Apache MXNet support
	- Hugging Face: Transformers library GPU support
	- Framework versions: Installed versions and compatibility

	### 5. AI/ML Library Ecosystem
	- cuDNN: NVIDIA Deep Neural Network library
	- cuBLAS: CUDA Basic Linear Algebra Subprograms
	- TensorRT: High-performance deep learning inference
	- NCCL: NVIDIA Collective Communications Library (multi-GPU)
	- MIOpen: AMD GPU-accelerated primitives
	- rocBLAS: AMD GPU BLAS library
	- oneDNN: Intel Deep Neural Network library

	### 6. Performance Characteristics
	- Memory bandwidth: GB/s for data transfer
	- Compute throughput: TFLOPS for different precisions
	- FP64 (double precision)
	- FP32 (single precision)
	- FP16 (half precision)
	- INT8 (integer quantization)
	- TF32 (Tensor Float 32)
	- Tensor core performance: Dedicated AI acceleration
	- Sparse tensor support: Structured sparsity acceleration

	### 7. Model Size Compatibility
	- VRAM capacity: Total GPU memory
	- Practical model sizes: Estimated model capacity
	- Small models: < 1B parameters
	- Medium models: 1B-7B parameters
	- Large models: 7B-70B parameters
	- Very large models: > 70B parameters
	- Batch size implications: VRAM for different batch sizes
	- Multi-GPU potential: Scaling across GPUs

	### 8. Container and Virtualization Support
	- Docker NVIDIA runtime: nvidia-docker/NVIDIA Container Toolkit
	- Docker ROCm runtime: ROCm Docker support
	- Podman GPU support: GPU passthrough capability
	- Kubernetes GPU: Device plugin support
	- GPU passthrough: VM GPU assignment capability
	- vGPU support: Virtual GPU for multi-tenancy

	### 9. Monitoring and Profiling Tools
	- nvidia-smi: Real-time monitoring (NVIDIA)
	- rocm-smi: ROCm system management (AMD)
	- Nsight Systems: NVIDIA profiling suite
	- Nsight Compute: CUDA kernel profiler
	- nvtop/radeontop: Terminal GPU monitoring
	- PyTorch profiler: Framework-level profiling
	- TensorBoard: Training visualization

	### 10. Optimization Features
	- Automatic mixed precision: AMP support
	- Gradient checkpointing: Memory optimization
	- Flash Attention: Optimized attention mechanisms
	- Quantization support: INT8, INT4 inference
	- Model compilation: TorchScript, XLA, TensorRT
	- Distributed training: Multi-GPU training support
	- CUDA graphs: Kernel launch optimization

	### 11. Workload Suitability Assessment
	- Training capability: Suitable for training workloads
	- Inference capability: Suitable for inference
	- Model type suitability:
	- Computer vision (CNNs)
	- Natural language processing (Transformers)
	- Generative AI (Diffusion models, LLMs)
	- Reinforcement learning
	- Performance tier: Consumer, Professional, Data Center

	### 12. Bottleneck and Limitation Analysis
	- Memory bottlenecks: VRAM limitations for large models
	- Compute bottlenecks: GPU power for training speed
	- PCIe bandwidth: Data transfer limitations
	- Driver limitations: Missing features or bugs
	- Power throttling: Thermal or power constraints
	- Multi-GPU scaling: Efficiency of multi-GPU setup

	## Commands to Use

	GPU and driver detection:
	- `nvidia-smi` (NVIDIA)
	- `rocm-smi` (AMD)
	- `lspci \| grep -i vga`
	- `lspci -v \| grep -A 20 VGA`

	NVIDIA driver details:
	- `nvidia-smi -q`
	- `cat /proc/driver/nvidia/version`
	- `modinfo nvidia`
	- `nvidia-smi --query-gpu=driver_version --format=csv,noheader`

	AMD driver details:
	- `modinfo amdgpu`
	- `rocminfo`
	- `/opt/rocm/bin/rocm-smi --showdriverversion`

	CUDA/ROCm installation:
	- `nvcc --version` (CUDA compiler)
	- `which nvcc`
	- `ls /usr/local/cuda*/`
	- `echo $CUDA_HOME`
	- `hipcc --version` (ROCm)
	- `ls /opt/rocm/`

	Compute capability:
	- `nvidia-smi --query-gpu=compute_cap --format=csv,noheader`
	- `nvidia-smi -q \| grep "Compute Capability"`

	Libraries check:
	- `ldconfig -p \| grep cudnn`
	- `ldconfig -p \| grep cublas`
	- `ldconfig -p \| grep tensorrt`
	- `ldconfig -p \| grep nccl`
	- `ls /usr/lib/x86_64-linux-gnu/ \| grep -i cuda`

	Python framework check:
	- `python3 -c "import torch; print(f'PyTorch: {torch.__version__}, CUDA: {torch.cuda.is_available()}, Version: {torch.version.cuda}')"`
	- `python3 -c "import tensorflow as tf; print(f'TensorFlow: {tf.__version__}, GPU: {tf.config.list_physical_devices(\"GPU\")}')"`
	- `python3 -c "import torch; print(f'Tensor Cores: {torch.cuda.get_device_capability()}')"`

	Container runtime:
	- `docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi`
	- `which nvidia-container-cli`
	- `nvidia-container-cli info`

	OpenCL:
	- `clinfo`
	- `clinfo \| grep "Device Name"`

	System libraries:
	- `dpkg -l \| grep -i cuda`
	- `dpkg -l \| grep -i nvidia`
	- `dpkg -l \| grep -i rocm`

	Performance info:
	- `nvidia-smi --query-gpu=name,memory.total,memory.free,driver_version,compute_cap --format=csv`
	- `nvidia-smi dmon -s pucvmet` (dynamic monitoring)

	## Output Format

	### Executive Summary
	```
	GPU: [model]
	Driver: [proprietary/open] v[version] ([status])
	Compute: [CUDA/ROCm] v[version] (Compute [capability])
	AI/ML Readiness: [Ready/Partial/Not Ready]
	Best For: [Training/Inference/Both]
	Recommended Frameworks: [PyTorch, TensorFlow, etc.]
	```

	### Detailed AI/ML Assessment

	Driver Status:
	- Type: [Proprietary/Open Source]
	- Version: [version number]
	- Release Date: [date]
	- Status: [Loaded/Error]
	- Kernel Module: [module] ([loaded/not loaded])
	- Latest Available: [version]
	- Update Recommended: [Yes/No]
	- Secure Boot: [Compatible/Issue]

	Compute Framework Availability:
	- CUDA Toolkit: [Installed/Not Installed] - v[version]
	- CUDA Driver API: v[version]
	- ROCm: [Installed/Not Installed] - v[version]
	- OpenCL: [Available/Not Available] - v[version]
	- Compute Capability: [X.X] ([architecture name])

	GPU Compute Specifications:
	- Architecture: [Turing/Ampere/Ada/RDNA3/Xe]
	- Tensor Cores: [Yes/No] - [Generation]
	- CUDA Cores / SPs: [count]
	- VRAM: [GB] [memory type]
	- Memory Bandwidth: [GB/s]
	- Precision Support:
	- FP64: [TFLOPS]
	- FP32: [TFLOPS]
	- FP16: [TFLOPS]
	- INT8: [TOPS]
	- TF32: [Yes/No]

	AI/ML Libraries:
	- cuDNN: [version] ([installed/missing])
	- cuBLAS: [version] ([installed/missing])
	- TensorRT: [version] ([installed/missing])
	- NCCL: [version] ([installed/missing])
	- MIOpen: [version] (AMD only)
	- rocBLAS: [version] (AMD only)

	Deep Learning Framework Support:
	- PyTorch: [version]
	- CUDA Enabled: [Yes/No]
	- CUDA Version: [version]
	- cuDNN Version: [version]
	- TensorFlow: [version]
	- GPU Support: [Yes/No]
	- CUDA Version: [version]
	- JAX: [installed/not installed]
	- ONNX Runtime: [GPU backend available]

	Container Support:
	- NVIDIA Container Toolkit: [installed/not installed]
	- Docker GPU Access: [working/not working]
	- Podman GPU Support: [available]

	Model Capacity Estimates:
	- Small Models (< 1B params): [batch size X]
	- Medium Models (1B-7B params): [batch size X]
	- Large Models (7B-13B params): [batch size X]
	- Very Large Models (13B-70B params): [requires multi-GPU or not possible]

	Example workload estimates based on [GB] VRAM:
	- LLaMA 7B: [inference only/training possible]
	- Stable Diffusion: [batch size X]
	- BERT Base: [batch size X]
	- GPT-2: [batch size X]

	Workload Suitability:
	- Training:
	- Small models: [Excellent/Good/Fair/Poor]
	- Medium models: [rating]
	- Large models: [rating]
	- Inference:
	- Real-time: [Excellent/Good/Fair/Poor]
	- Batch: [rating]
	- Low-latency: [rating]

	Use Case Recommendations:
	- Computer Vision (CNNs): [Excellent/Good/Fair/Poor]
	- NLP (Transformers): [rating]
	- Generative AI (LLMs): [rating]
	- Diffusion Models: [rating]
	- Reinforcement Learning: [rating]

	Performance Tier:
	- Category: [Consumer/Professional/Data Center]
	- Training Performance: [rating]
	- Inference Performance: [rating]
	- Multi-GPU Scaling: [available/not available]

	Optimization Features Available:
	- Automatic Mixed Precision: [Yes/No]
	- Tensor Core Utilization: [Yes/No]
	- TensorRT Optimization: [Available]
	- Flash Attention: [Supported]
	- INT8 Quantization: [Supported]
	- Multi-GPU Training: [Possible with [count] GPUs]

	Limitations and Bottlenecks:
	- VRAM Constraint: [assessment]
	- Memory Bandwidth: [adequate/limited]
	- Compute Throughput: [assessment]
	- PCIe Bottleneck: [yes/no]
	- Driver Limitations: [any known issues]
	- Power/Thermal: [throttling concerns]

	Recommendations:
	1. [Driver update/optimization suggestions]
	2. [Framework installation recommendations]
	3. [Workload optimization suggestions]
	4. [Hardware upgrade path if applicable]
	5. [Container/virtualization setup if beneficial]

	### AI/ML Readiness Scorecard

	```
	Driver Setup: [✓/✗/⚠] [details]
	CUDA/ROCm Install: [✓/✗/⚠] [details]
	Framework Support: [✓/✗/⚠] [details]
	Library Ecosystem: [✓/✗/⚠] [details]
	Container Runtime: [✓/✗/⚠] [details]
	VRAM Capacity: [✓/✗/⚠] [details]
	Compute Performance: [✓/✗/⚠] [details]

	Overall Readiness: [Ready/Needs Setup/Limited/Not Suitable]
	```

	### AI-Readable JSON

	```json
	{
	"driver": {
	"type": "proprietary\|open_source",
	"version": "",
	"status": "loaded\|error",
	"latest_available": "",
	"update_recommended": false
	},
	"compute_platform": {
	"cuda": {
	"installed": false,
	"version": "",
	"compute_capability": ""
	},
	"rocm": {
	"installed": false,
	"version": ""
	},
	"opencl": {
	"available": false,
	"version": ""
	}
	},
	"gpu_specs": {
	"architecture": "",
	"tensor_cores": false,
	"vram_gb": 0,
	"memory_bandwidth_gbs": 0,
	"fp32_tflops": 0,
	"fp16_tflops": 0,
	"int8_tops": 0,
	"tf32_support": false
	},
	"libraries": {
	"cudnn": "",
	"cublas": "",
	"tensorrt": "",
	"nccl": ""
	},
	"frameworks": {
	"pytorch": {
	"installed": false,
	"version": "",
	"cuda_available": false
	},
	"tensorflow": {
	"installed": false,
	"version": "",
	"gpu_available": false
	}
	},
	"container_support": {
	"nvidia_container_toolkit": false,
	"docker_gpu_working": false
	},
	"workload_suitability": {
	"training": {
	"small_models": "excellent\|good\|fair\|poor",
	"medium_models": "",
	"large_models": ""
	},
	"inference": {
	"real_time": "",
	"batch": ""
	}
	},
	"model_capacity": {
	"vram_gb": 0,
	"small_model_batch_size": 0,
	"llama_7b_possible": false,
	"stable_diffusion_batch": 0
	},
	"optimization_features": {
	"amp_support": false,
	"tensor_core_utilization": false,
	"tensorrt_available": false,
	"int8_quantization": false
	},
	"bottlenecks": {
	"vram_limited": false,
	"compute_limited": false,
	"pcie_bottleneck": false
	},
	"ai_ml_readiness": "ready\|needs_setup\|limited\|not_suitable"
	}
	```

	## Execution Guidelines

	1. Identify GPU vendor first: NVIDIA, AMD, or Intel
	2. Check driver installation: Verify driver is loaded and working
	3. Assess compute platform: CUDA for NVIDIA, ROCm for AMD
	4. Query compute capability: Critical for framework compatibility
	5. Check library installation: cuDNN, TensorRT, etc.
	6. Test framework access: Try importing PyTorch/TensorFlow with GPU
	7. Evaluate VRAM capacity: Estimate model sizes
	8. Check container support: Important for ML workflows
	9. Identify bottlenecks: VRAM, compute, or driver issues
	10. Provide actionable recommendations: Setup steps or optimizations

	## Important Notes

	- NVIDIA GPUs have the most mature AI/ML ecosystem
	- CUDA compute capability determines supported features
	- cuDNN is critical for deep learning performance
	- VRAM is often the primary bottleneck for large models
	- Container runtimes simplify framework management
	- AMD ROCm support is improving but less mature than CUDA
	- Intel GPUs are emerging in AI/ML space
	- Tensor cores provide significant speedup for mixed precision
	- Driver version must match CUDA toolkit requirements
	- Some features require specific GPU generations
	- Multi-GPU setups require additional configuration
	- Consumer GPUs can be effective for smaller workloads
	- Professional/datacenter GPUs offer better reliability and support

	Be thorough and practical - provide a clear assessment of AI/ML readiness and actionable next steps.