gary-boon (Claude Opus 4.5) committed
Commit: e694533
1 Parent(s): 5f122aa
Update plan: Phase 1 paused due to GB10 GPU support
Document the blocker with DGX Spark's GB10 GPU (sm_121 compute capability) not being supported by current PyTorch releases.
- Mark Phase 0 and 0.5 as complete
- Mark Phase 1 as paused (not blocked permanently)
- Document what we tried and what we learned
- Add clear restart instructions for when PyTorch 2.9.x adds sm_121
- List infrastructure already in place on Spark
The Spark deployment makes no sense on CPU when Mac Studio and
HuggingFace Spaces are available. Wait for official GPU support.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
docs/devstral-spark-plan-phased.md
CHANGED
@@ -1814,11 +1814,128 @@ Before marking each phase complete, verify:

## Current Status

-- [ ] **Phase 0**: Secure GPU HF Space + verify basic routing
-- [ ] **Phase 0.5**: Fix critical API route routing (prove GPU routing works)
-- [ ] **Phase 1**: Deploy CodeGen to DGX Spark
+- [x] **Phase 0**: Secure GPU HF Space + verify basic routing ✅ COMPLETE
+- [x] **Phase 0.5**: Fix critical API route routing (prove GPU routing works) ✅ COMPLETE
+- [ ] **Phase 1**: Deploy CodeGen to DGX Spark ⏸️ PAUSED (see blocker below)
- [ ] **Phase 2**: Add Devstral backend support
- [ ] **Phase 2b**: Frontend dynamic layer handling
- [ ] **Phase 2c**: Wire Spark into frontend backend router + Deploy Devstral to GPU HF Space
- [ ] **Phase 3**: Deploy Devstral to DGX Spark
- [ ] **Phase 4**: Future enhancements (optional)
---

## Blocker: DGX Spark GB10 GPU Not Yet Supported by PyTorch
**Date:** December 2025

**Status:** ⏸️ Phase 1 paused pending PyTorch update
### The Issue

The DGX Spark uses an NVIDIA GB10 GPU (Grace Blackwell architecture) with compute capability **sm_121**. Current PyTorch releases (including NGC containers up to 24.08) do not include pre-built CUDA kernels for sm_121.
**Error observed:**

```
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call
```
**Hardware details:**

- DGX Spark hostname: `spark-c691.local`
- GPU: NVIDIA GB10 (sm_121 compute capability)
- CUDA driver: 13.0
- Architecture: ARM64 (aarch64)
### What We Tried

1. **NGC PyTorch container 24.08-py3** - Does not include sm_121 kernels
2. **NGC PyTorch container 24.11-py3** - Python 3.12 compatibility issues with dependencies
3. **Standard PyTorch images** - No ARM64 + CUDA 13.0 support
4. **CPU fallback** - Works but defeats the purpose of using Spark
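Whether a given PyTorch build can serve the GB10 at all comes down to which kernel architectures it was compiled for, which `torch.cuda.get_arch_list()` reports as names like `sm_90`. The sketch below mimics that check against an illustrative arch list (the entries are not an actual dump from the 24.08 container):

```python
# Illustrative arch list in the style of torch.cuda.get_arch_list();
# the actual entries shipped in NGC 24.08 may differ.
arch_list = ["sm_80", "sm_86", "sm_90", "compute_90"]

def has_exact_kernels(archs: list[str], target: str) -> bool:
    # Exact-match check only; real CUDA dispatch also accepts same-major
    # cubins (e.g. sm_120 on an sm_121 device) and PTX JIT fallbacks.
    return target in archs

print(has_exact_kernels(arch_list, "sm_121"))  # False -> "no kernel image" error
```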
### What We Learned

From the [PyTorch forums](https://discuss.pytorch.org/t/nvidia-dgx-spark-support/223677/16):

1. **sm_121 is binary compatible with sm_120** - The warning/error is overly cautious
2. **A PR exists** to add sm_121 support, but it missed the PyTorch 2.9.0 release
3. **A workaround exists** - Building PyTorch from source with sm_121 support works, but requires recompiling PyTorch, torchvision, and triton
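The binary-compatibility point can be stated concretely: a cubin built for capability X.y runs on any device with the same major version X and a minor revision >= y, which is why sm_120 kernels should load on the sm_121 GB10. A toy version of that rule (my own sketch, not PyTorch's actual dispatch code):

```python
def cubin_runs_on(built_for: tuple[int, int], device: tuple[int, int]) -> bool:
    # CUDA binary compatibility: same major architecture, and the device's
    # minor revision is >= the one the cubin was built for.
    return built_for[0] == device[0] and built_for[1] <= device[1]

print(cubin_runs_on((12, 0), (12, 1)))  # True: sm_120 kernels run on GB10 (sm_121)
print(cubin_runs_on((9, 0), (12, 1)))   # False: Hopper kernels do not
```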
### Why We're Pausing (Not Working Around It)

Running CodeGen on CPU on the Spark provides no benefit over:

- Mac Studio (512GB RAM) for local development
- HuggingFace Spaces (CPU and GPU options available)

The Spark deployment only makes sense when we can leverage the GB10 GPU, and building PyTorch from source is too complex and fragile to justify as a temporary workaround.
### What's Ready on Spark

The following infrastructure is in place and ready to test once GPU support lands:

- [x] Docker infrastructure: `docker/compose.spark.yml`
- [x] Dockerfile: `docker/Dockerfile.spark` (using NGC container)
- [x] Environment template: `.env.spark.example`
- [x] SSH access configured with key-based auth
- [x] Git clone at `/srv/visualisable/backend`
- [x] Model cache directory: `/srv/models-cache/huggingface`
- [x] Backend code has DEVICE env var override (for CPU fallback if needed)
- [x] `/health`, `/ready`, `/debug/device` endpoints added
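The DEVICE override mentioned above boils down to a few lines of device selection. A minimal sketch, assuming the backend reads the variable at startup (the helper name `resolve_device` is mine, and the real code presumably hands the result to torch):

```python
import os

def resolve_device(cuda_available: bool) -> str:
    """Pick the device string, honoring a DEVICE env var override."""
    override = os.environ.get("DEVICE")
    if override:                      # e.g. DEVICE=cpu forces the CPU fallback
        return override
    return "cuda:0" if cuda_available else "cpu"

os.environ["DEVICE"] = "cpu"
print(resolve_device(cuda_available=True))  # cpu (the override wins)
```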
### Restart Instructions

When PyTorch officially supports sm_121 (expected in a PyTorch 2.9.x patch release or 2.10):
1. **Check for an updated NGC container:**
   ```bash
   # Look for NGC PyTorch containers with sm_121 support
   # https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
   ```

2. **Update Dockerfile.spark:**
   ```dockerfile
   # Update to the NGC container version with sm_121 support
   FROM nvcr.io/nvidia/pytorch:XX.XX-py3
   ```

3. **On the Spark, pull and rebuild:**
   ```bash
   ssh dgxspark@spark-c691.local
   cd /srv/visualisable/backend
   git pull

   # Remove DEVICE=cpu from .env.spark (or comment it out)
   vim .env.spark

   # Rebuild with the new NGC container
   docker compose -f docker/compose.spark.yml --env-file .env.spark up -d --build
   ```

4. **Verify the GPU is working:**
   ```bash
   # Should show cuda_available: true, model_device: cuda:0
   curl -s http://spark-c691.local:8000/debug/device | python -m json.tool

   # Test inference
   curl -X POST http://spark-c691.local:8000/analyze/research/attention \
     -H "Content-Type: application/json" \
     -d '{"prompt": "def hello():", "max_tokens": 5}'
   ```

5. **Continue with the Phase 1 validation criteria**
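To script the verification in step 4 rather than eyeball the JSON, the `/debug/device` response can be checked programmatically. The field names (`cuda_available`, `model_device`) are taken from the curl comment above; the real payload may carry additional keys:

```python
import json

def gpu_ready(raw: str) -> bool:
    # Passes only when CUDA is visible AND the model actually sits on a GPU.
    info = json.loads(raw)
    return bool(info.get("cuda_available")) and \
        str(info.get("model_device", "")).startswith("cuda")

print(gpu_ready('{"cuda_available": true, "model_device": "cuda:0"}'))  # True
print(gpu_ready('{"cuda_available": true, "model_device": "cpu"}'))    # False
```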
### Monitoring PyTorch Progress

- PyTorch GitHub: Watch for sm_121 PRs
- NGC container releases: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch
- PyTorch forums: https://discuss.pytorch.org/t/nvidia-dgx-spark-support/223677
### Pre-Devstral Tag

Before making these changes, both repos were tagged: `pre-devstral-v1`

To restore to this state if needed:
```bash
git checkout pre-devstral-v1
```