NextCoder-32B-FP8 / README.md

Update README.md

5c2eb57 verified 2 months ago

8.38 kB

	---
	license: mit
	base_model: microsoft/NextCoder-32B
	tags:
	- code
	- fp8
	- quantized
	- nextcoder
	- microsoft
	library_name: transformers
	pipeline_tag: text-generation
	---

	# NextCoder-32B-FP8

	High-quality FP8 quantization of Microsoft's NextCoder-32B, optimized for production inference

	This is an FP8 (E4M3) quantized version of [microsoft/NextCoder-32B](https://huggingface.co/microsoft/NextCoder-32B) using compressed_tensors format. Quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware with 2048 calibration samples.

	## 🎯 Recommended Usage: vLLM (Required)

	For 32B models, vLLM is essential for practical deployment. FP8 quantization makes this flagship model accessible on high-end consumer GPUs.

	### Quick Start with vLLM

	```bash
	pip install vllm
	```

	Python API:

	```python
	from vllm import LLM, SamplingParams
	from transformers import AutoTokenizer

	# vLLM auto-detects FP8 from model config
	llm = LLM(model="TevunahAi/NextCoder-32B-FP8", dtype="auto")

	# Prepare prompt with chat template
	tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-32B-FP8")
	messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	# Generate
	outputs = llm.generate(prompt, SamplingParams(temperature=0.7, max_tokens=512))
	print(outputs[0].outputs[0].text)
	```

	OpenAI-Compatible API Server:

	```bash
	vllm serve TevunahAi/NextCoder-32B-FP8 \
	--dtype auto \
	--max-model-len 4096
	```

	Then use with OpenAI client:

	```python
	from openai import OpenAI

	client = OpenAI(
	base_url="http://localhost:8000/v1",
	api_key="token-abc123", # dummy key
	)

	response = client.chat.completions.create(
	model="TevunahAi/NextCoder-32B-FP8",
	messages=[
	{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
	],
	temperature=0.7,
	max_tokens=512,
	)

	print(response.choices[0].message.content)
	```

	### vLLM Benefits

	- ✅ Weights, activations, and KV cache in FP8
	- ✅ ~32GB VRAM (50% reduction vs BF16's ~64GB)
	- ✅ Single high-end GPU deployment (H100, RTX 6000 Ada, A100 80GB)
	- ✅ Native FP8 tensor core acceleration
	- ✅ Production-grade performance

	## ⚠️ Transformers: Not Practical

	At 32B parameters, transformers will decompress to ~64GB+ VRAM, requiring multi-GPU setups or data center GPUs. This is not recommended for deployment.

	<details>
	<summary>Transformers Example (Multi-GPU Required - Click to expand)</summary>

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Requires multi-GPU or 80GB+ single GPU
	model = AutoModelForCausalLM.from_pretrained(
	"TevunahAi/NextCoder-32B-FP8",
	device_map="auto", # Will distribute across GPUs
	torch_dtype="auto",
	low_cpu_mem_usage=True,
	)
	tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-32B-FP8")

	# Generate code
	messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer([text], return_tensors="pt").to(model.device)

	outputs = model.generate(
	**inputs,
	max_new_tokens=512,
	temperature=0.7,
	do_sample=True
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	Requirements:
	```bash
	pip install torch>=2.1.0 transformers>=4.40.0 accelerate compressed-tensors
	```

	System Requirements:
	- ~64GB+ VRAM (decompressed to BF16)
	- Multi-GPU setup or A100 80GB / H100 80GB
	- Not practical for most deployments

	⚠️ Critical: Use vLLM instead. Transformers is only viable for research/testing with multi-GPU setups.

	</details>

	## 📊 Quantization Details

	\| Property \| Value \|
	\|----------\|-------\|
	\| Base Model \| [microsoft/NextCoder-32B](https://huggingface.co/microsoft/NextCoder-32B) \|
	\| Quantization Method \| FP8 E4M3 weight-only \|
	\| Framework \| llm-compressor + compressed_tensors \|
	\| Storage Size \| ~32GB (sharded safetensors) \|
	\| VRAM (vLLM) \| ~32GB \|
	\| VRAM (Transformers) \| ~64GB+ (decompressed to BF16) \|
	\| Target Hardware \| NVIDIA H100, A100 80GB, RTX 6000 Ada \|
	\| Quantization Date \| November 23, 2025 \|
	\| Quantization Time \| 213.8 minutes \|

	### Quantization Infrastructure

	Professional hardware ensures consistent, high-quality quantization:

	- CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
	- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
	- Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
	- Software Stack: Ubuntu 25.10 \| Python 3.12 \| PyTorch 2.8 \| CUDA 13.0 \| llm-compressor

	## 🔧 Why FP8 for 32B Models?

	### With vLLM/TensorRT-LLM:
	- ✅ Enables single-GPU deployment (~32GB vs ~64GB BF16)
	- ✅ 50% memory reduction across weights, activations, and KV cache
	- ✅ Faster inference via native FP8 tensor cores
	- ✅ Makes flagship model accessible on high-end consumer/prosumer GPUs
	- ✅ Minimal quality loss (sub-1% perplexity increase)

	### Without FP8:
	- ❌ BF16 requires ~64GB VRAM (H100 80GB or multi-GPU)
	- ❌ Limited deployment options
	- ❌ Higher infrastructure costs

	FP8 quantization transforms 32B from "data center only" to "high-end workstation deployable".

	## 💾 Model Files

	This model is sharded into multiple safetensors files (all required for inference). The compressed format enables efficient storage and faster downloads.

	## 🚀 Performance Comparison

	The 32B model represents the flagship tier:

	\| Model \| VRAM (vLLM) \| Quality \| Use Case \|
	\|-------\|-------------\|---------\|----------\|
	\| 7B-FP8 \| ~7GB \| Good \| General coding, fast iteration \|
	\| 14B-FP8 \| ~14GB \| Better \| Complex tasks, better reasoning \|
	\| 32B-FP8 \| ~32GB \| Best \| Flagship performance, production \|

	32B Benefits:
	- ✅ State-of-the-art code quality for Microsoft NextCoder family
	- ✅ Superior reasoning and complex problem solving
	- ✅ Enterprise-grade completions for mission-critical applications
	- ✅ Best context understanding across the model family


	## 📚 Original Model

	This quantization is based on [microsoft/NextCoder-32B](https://huggingface.co/microsoft/NextCoder-32B) by Microsoft.

	For comprehensive information about:
	- Model architecture and training methodology
	- Capabilities, use cases, and limitations
	- Evaluation benchmarks and results
	- Ethical considerations and responsible AI guidelines

	Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-32B).

	## 🔧 Hardware Requirements

	### Minimum (vLLM):
	- GPU: NVIDIA A100 40GB or RTX 6000 Ada (48GB)
	- VRAM: 32GB minimum, 40GB+ recommended
	- CUDA: 11.8 or newer

	### Recommended (vLLM):
	- GPU: NVIDIA H100 (80GB) / A100 80GB / RTX 6000 Ada (48GB)
	- VRAM: 40GB+
	- CUDA: 12.0+

	### Transformers:
	- GPU: Multi-GPU setup (2x A100 40GB) or single A100/H100 80GB
	- VRAM: 64GB+ total
	- Not recommended - use vLLM instead

	## 📖 Additional Resources

	- vLLM Documentation: [docs.vllm.ai](https://docs.vllm.ai/)
	- TensorRT-LLM: [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
	- TevunahAi Models: [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
	- llm-compressor: [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)

	## 📄 License

	This model inherits the MIT License from the original NextCoder-32B model.

	## 🙏 Acknowledgments

	- Original Model: Microsoft NextCoder team
	- Quantization Framework: Neural Magic's llm-compressor
	- Quantized by: [TevunahAi](https://huggingface.co/TevunahAi)

	## 📝 Citation

	If you use this model, please cite the original NextCoder work:

	```bibtex
	@misc{nextcoder2024,
	title={NextCoder: Next-Generation Code LLM},
	author={Microsoft},
	year={2024},
	url={https://huggingface.co/microsoft/NextCoder-32B}
	}
	```

	---

	<div align="center">

	Professional AI Model Quantization by TevunahAi

	Making flagship models accessible through enterprise-grade quantization

	[View all models](https://huggingface.co/TevunahAi) \| [Contact for custom quantization](https://huggingface.co/TevunahAi)

	</div>