Upload inference/README.md with huggingface_hub

2720e07 verified 5 months ago

preview code

raw

history blame contribute delete

9.86 kB

DeepSeek V3.2 NVFP4 Reference Implementation

Reference CPU inference implementation for NVFP4-quantized DeepSeek V3.2 (671B parameters)

Overview

This directory contains a functional reference implementation for CPU inference of the NVFP4-quantized DeepSeek V3.2 model. NVFP4 (FP4 E2M1) provides 16x compression compared to FP32 while maintaining model functionality.

Status: FUNCTIONAL

Quantization: 30,769 weights converted, 0 errors
Model Size: 391GB (compressed from ~2.6TB FP32)
Tests: All validation tests passing
Inference: End-to-end CPU inference working

Quick Start

Prerequisites

Python 3.8+
PyTorch 2.0+ with float8 support
~400GB RAM minimum
Safetensors, transformers libraries

Installation

cd /mnt/models/deepseek-v3.2-nvfp4/inference
pip install -r requirements.txt

Running Tests

Quick validation (~30 seconds):

python test_nvfp4_kernel.py

Full validation (~10-15 minutes):

# Clear cache first (recommended)
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

# Run forward pass test
python test_forward_pass.py

Generation test (~15-20 minutes):

python test_minimal_generation.py

Interactive Inference

python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 10 \
    --temperature 0.6

Note: CPU inference is slow (approximately 2-5 minutes per token). This is a reference implementation for validation, not production deployment.

Architecture

NVFP4 Format

E2M1 Specification:

4 bits per value (16 representable values)
Values: {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}
Storage: 2 FP4 values packed per uint8 byte

Dual-Level Scaling:

Per-block scale: FP8 E4M3, 16 elements per block
Global scale: FP32 scalar
Formula: value = packed * weight_scale * weight_scale_2

Model Structure

DeepSeek V3.2 (671B parameters)
├── Embedding Layer (129,280 vocab)
├── 61 Transformer Blocks
│   ├── Multi-Head Latent Attention (MLA)
│   │   ├── Query/KV LoRA projections
│   │   ├── Sparse attention indexer
│   │   └── FP8 KV cache
│   └── Mixture of Experts (MoE)
│       ├── 256 routed experts
│       ├── 1 shared expert
│       └── Top-8 routing
└── LM Head (output projection)

Key Components

model.py: Model architecture with NVFP4 support
nvfp4_kernel.py: NVFP4 CPU dequantization kernel
generate.py: Interactive inference pipeline
kernel.py: FP8 quantization kernels
encoding_dsv32.py: DeepSeek message encoding
convert.py: Checkpoint conversion utilities

File Structure

inference/
├── README.md                      # This file
├── IMPLEMENTATION_SUMMARY.md      # Detailed implementation notes
├── requirements.txt               # Python dependencies
│
├── config_671B_nvfp4.json        # NVFP4 model configuration
├── config_671B_v3.2.json         # FP8 model configuration
│
├── model.py                       # Model architecture
├── generate.py                    # Inference pipeline
├── nvfp4_kernel.py               # NVFP4 CPU kernels
├── kernel.py                      # FP8 kernels
├── nvfp4_triton.py               # NVFP4 GPU kernels (incomplete)
├── encoding_dsv32.py             # Message encoding
├── convert.py                     # Checkpoint conversion
│
└── test_*.py                      # Test suite
    ├── test_nvfp4_kernel.py      # Unit tests
    ├── test_model_loading.py      # Loading tests
    ├── test_forward_pass.py       # Forward pass tests
    └── test_minimal_generation.py # Generation tests

Test Suite

Unit Tests (`test_nvfp4_kernel.py`)

Validates NVFP4 quantization math:

Lookup table correctness
Dequantization accuracy
Quantization roundtrip error
GEMM operation shapes
Output correctness

Expected: All 5 tests pass in under 30 seconds

Integration Tests

Model Loading (test_model_loading.py):

Config validation
Model instantiation
Weight loading from 73 shards
NVFP4 layer structure verification
Weight statistics validation

Forward Pass (test_forward_pass.py):

Single forward pass through full model
Output shape validation
NaN/Inf detection
Logits range checking
Prediction coherence

Token Generation (test_minimal_generation.py):

5-token autoregressive generation
KV cache functionality
Sampling correctness
Output decoding

Performance

Measured on CPU (Reference Implementation)

Metric	Value
Model Loading	8-10 minutes
Forward Pass	2-5 minutes
Tokens/Second	0.003-0.01
Memory Usage	~260GB
Model Size	391GB

Quantization Quality

Metric	Value
Compression	16x (vs FP32)
Bits/Parameter	4.56 (4-bit weights + scales)
Conversion Errors	0
Mean Quant Error	0.14-1.8
Relative Error	18-42%

Note: Error metrics are acceptable for aggressive 4-bit quantization.

Usage Examples

Example 1: Quick Validation

# Test NVFP4 math is correct
python test_nvfp4_kernel.py

# Expected output:
# ALL TESTS PASSED
# NVFP4 kernel functions are working correctly

Example 2: Full Model Test

# Clear cache
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

# Run forward pass
python test_forward_pass.py

# Expected:
# FORWARD PASS TEST PASSED
# Forward pass completed successfully

Example 3: Interactive Chat

# Start interactive session
python generate.py \
    --ckpt-path /mnt/models/deepseek-v3.2-nvfp4 \
    --config config_671B_nvfp4.json \
    --interactive \
    --max-new-tokens 20 \
    --temperature 0.6

# Example interaction:
# User: What is 2+2?
# Assistant: [generates response, ~2-5 min per token]

Troubleshooting

Out of Memory

Symptoms: Process killed during loading

Solutions:

# Clear system cache (Linux)
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

# Check available memory
free -h

# Ensure >400GB available

Slow Performance

Symptoms: More than 5 minutes per token

Expected Behavior: CPU inference is slow for 671B parameters

Mitigations:

Use smaller --max-new-tokens values
GPU acceleration (Triton kernels) would provide 100-1000x speedup

NaN/Inf Outputs

Symptoms: Model produces NaN or Inf

Debug:

# Check scales
print(f"Scale range: [{weight_scale.min()}, {weight_scale.max()}]")
print(f"Has zeros: {(weight_scale == 0).any()}")

Solution: Verify quantization conversion report has 0 errors

Implementation Details

NVFP4 Dequantization

Algorithm (from nvfp4_kernel.py):

def dequantize_nvfp4(packed, scale, scale_2):
    # 1. Unpack two FP4 values per byte
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    fp4_tensor = torch.stack([low, high], dim=-1)

    # 2. Lookup table dequantization
    tensor = NVFP4_LUT[fp4_tensor.long()]

    # 3. Apply dual-level scales
    tensor = tensor.reshape(M, K // 16, 16)
    tensor = tensor * scale.unsqueeze(-1) * scale_2

    return tensor

NVFP4 GEMM

CPU Fallback (from nvfp4_kernel.py):

def nvfp4_gemm_dequant(x, weight, weight_scale, weight_scale_2):
    # Dequantize NVFP4 weights to bfloat16
    weight_bf16 = dequantize_nvfp4(
        weight, weight_scale, weight_scale_2,
        dtype=torch.bfloat16
    )

    # Standard matmul
    return torch.matmul(x, weight_bf16.T)

Note: This is a simple but slow implementation. GPU-accelerated Triton kernels would be much faster.

Limitations

Current Limitations

CPU Only: GPU Triton kernels incomplete (TODOs at nvfp4_triton.py:257, 265)
Slow Inference: Approximately 2-5 minutes per token (expected for CPU)
Memory Intensive: Requires approximately 400GB RAM
No Batch Support: Single-sample inference only

Not Included

GPU acceleration (Triton kernels incomplete)
Batch inference support
Streaming generation
Quantization-aware training
Model conversion pipeline (see /mnt/git/fp8_quant/)

Future Work

Priority 1: GPU Acceleration

Complete Triton NVFP4 kernel implementation
Enable TMA (Tensor Memory Accelerator) support
Add dimension padding for non-aligned tensors
Expected speedup: 100-1000x vs CPU

Priority 2: Optimization

Implement mixed-precision inference (FP8 + NVFP4)
Add batch inference support
Optimize memory usage during loading
Streaming generation support

Priority 3: Validation

Benchmark against FP8/FP16 baselines
Measure perplexity on standard datasets
Test across diverse tasks
Quality analysis

References

Documentation

Implementation Summary: IMPLEMENTATION_SUMMARY.md
Quantization Script: See conversion tools documentation
Original Model: DeepSeek V3.2 base model
Conversion Report: conversion_report.json (in model directory)

External Resources

License

See DeepSeek V3 model license.

Support

For issues or questions:

Check IMPLEMENTATION_SUMMARY.md for detailed implementation notes
Review test logs in test_*.log files
Verify conversion report in conversion_report.json

Status: Functional reference CPU inference (December 2025)

Model: DeepSeek V3.2 (671B parameters)

Format: NVFP4 E2M1 (4-bit quantization)

Compression: 16x vs FP32

Quality: Validated through comprehensive testing