	---
	license: mit
	base_model: microsoft/NextCoder-14B
	tags:
	- code
	- fp8
	- quantized
	- nextcoder
	- microsoft
	library_name: transformers
	pipeline_tag: text-generation
	---

	# NextCoder-14B-FP8

	High-quality FP8 quantization of Microsoft's NextCoder-14B, optimized for production inference.

	This is an FP8 (E4M3) quantized version of [microsoft/NextCoder-14B](https://huggingface.co/microsoft/NextCoder-14B), stored in the compressed_tensors format. Quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware with 2048 calibration samples.

	## 🎯 Recommended Usage: vLLM

	For optimal performance with full FP8 benefits (2x memory savings + faster inference), use vLLM or TensorRT-LLM:

	### Quick Start with vLLM

	```bash
	pip install vllm
	```

	Python API:

	```python
	from vllm import LLM, SamplingParams
	from transformers import AutoTokenizer

	# vLLM auto-detects FP8 from model config
	llm = LLM(model="TevunahAi/NextCoder-14B-FP8", dtype="auto")

	# Prepare prompt with chat template
	tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-14B-FP8")
	messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	# Generate
	outputs = llm.generate(prompt, SamplingParams(temperature=0.7, max_tokens=512))
	print(outputs[0].outputs[0].text)
	```

	OpenAI-Compatible API Server:

	```bash
	vllm serve TevunahAi/NextCoder-14B-FP8 \
	    --dtype auto \
	    --max-model-len 4096
	```

	Then use with OpenAI client:

	```python
	from openai import OpenAI

	client = OpenAI(
	    base_url="http://localhost:8000/v1",
	    api_key="token-abc123",  # dummy key
	)

	response = client.chat.completions.create(
	    model="TevunahAi/NextCoder-14B-FP8",
	    messages=[
	        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
	    ],
	    temperature=0.7,
	    max_tokens=512,
	)

	print(response.choices[0].message.content)
	```
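
If installing the `openai` package is not an option, the same request can be sent with the Python standard library alone. The following is a minimal sketch, assuming the server started by the `vllm serve` command above is listening on `localhost:8000`. The helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: address of the server started above
MODEL = "TevunahAi/NextCoder-14B-FP8"


def build_chat_request(prompt, temperature=0.7, max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat_request(build_chat_request("Write a Python function to calculate fibonacci numbers")))` should print a reply comparable to the client example.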

	### vLLM Benefits

	- ✅ Weights, activations, and KV cache in FP8
	- ✅ ~14GB VRAM (50% reduction vs BF16)
	- ✅ Native FP8 tensor core acceleration on Ada/Hopper GPUs
	- ✅ Faster inference with optimized CUDA kernels
	- ✅ Single GPU deployment on RTX 5000 Ada, RTX 4090, or H100
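
The ~14GB and 50% figures follow from bytes per parameter. As a rough sketch, assuming a nominal 14 billion parameters (the exact count differs slightly) and counting weights only:

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


NOMINAL_PARAMS = 14e9  # assumption: nominal "14B"; the exact count differs slightly

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 2)  # BF16 stores 2 bytes per parameter
fp8_gb = weight_memory_gb(NOMINAL_PARAMS, 1)   # FP8 stores 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"Reduction:    {1 - fp8_gb / bf16_gb:.0%}")
```

This estimate excludes the KV cache and runtime buffers, which add to the total at inference time.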

	## ⚙️ Alternative: Transformers (Not Recommended)

	This model can be loaded with `transformers`, but the weights are decompressed from FP8 to BF16 during inference, which requires ~28GB+ VRAM. For 14B models, vLLM is strongly recommended for practical single-GPU deployment.

	<details>
	<summary>Transformers Example (Click to expand)</summary>

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Loads FP8 weights but decompresses to BF16 during compute
	model = AutoModelForCausalLM.from_pretrained(
	    "TevunahAi/NextCoder-14B-FP8",
	    device_map="auto",
	    torch_dtype="auto",
	    low_cpu_mem_usage=True,
	)
	tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-14B-FP8")

	# Generate code
	messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer([text], return_tensors="pt").to(model.device)

	outputs = model.generate(
	    **inputs,
	    max_new_tokens=512,
	    temperature=0.7,
	    do_sample=True,
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
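
For decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so the decode call above prints the prompt as well as the completion. A small sketch of trimming the prompt before decoding; `completion_only` is a hypothetical helper, not a `transformers` function:

```python
def completion_only(output_ids, prompt_length):
    """Drop the first `prompt_length` token ids, keeping only the generated ones.

    Works on any sliceable sequence, including a 1-D tensor such as outputs[0].
    """
    return output_ids[prompt_length:]


# Intended use with the example above (not executed here):
#   prompt_length = inputs["input_ids"].shape[1]
#   new_ids = completion_only(outputs[0], prompt_length)
#   print(tokenizer.decode(new_ids, skip_special_tokens=True))
```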

	Requirements:
	```bash
	pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
	```

	System Requirements:
	- ~28GB+ VRAM (decompressed to BF16) - requires multi-GPU or high-end single GPU
	- CUDA 11.8 or newer
	- PyTorch 2.1+ with CUDA support

	⚠️ Warning: Most consumer GPUs will struggle with transformers inference at this size. Use vLLM for practical deployment.

	</details>

	## 📊 Quantization Details

	| Property | Value |
	|----------|-------|
	| Base Model | [microsoft/NextCoder-14B](https://huggingface.co/microsoft/NextCoder-14B) |
	| Quantization Method | FP8 E4M3 weight-only |
	| Framework | llm-compressor + compressed_tensors |
	| Storage Size | ~14GB (sharded safetensors) |
	| VRAM (vLLM) | ~14GB |
	| VRAM (Transformers) | ~28GB+ (decompressed to BF16) |
	| Target Hardware | NVIDIA Ada (RTX 4000/5000) or Hopper (H100/GH200) |
	| Quantization Date | November 22, 2025 |
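
To make the E4M3 format concrete: each value has 1 sign bit, 4 exponent bits, and 3 mantissa bits, with a largest finite magnitude of 448. The sketch below simulates rounding to that grid in pure Python. It illustrates the number format only and is not the llm-compressor implementation. The per-tensor scaling in `quantize_dequantize` is an assumed scheme chosen for simplicity; the scaling actually used by this checkpoint is defined in its quantization config.

```python
import math

E4M3_MAX = 448.0          # largest finite E4M3 magnitude
E4M3_MIN_EXPONENT = -6    # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round a float to the nearest value representable in FP8 E4M3."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), E4M3_MAX)  # saturate instead of overflowing
    exponent = max(math.floor(math.log2(magnitude)), E4M3_MIN_EXPONENT)
    step = 2.0 ** (exponent - MANTISSA_BITS)  # spacing between neighbours at this exponent
    return math.copysign(round(magnitude / step) * step, x)


def quantize_dequantize(weights):
    """Assumed per-tensor scheme: scale the largest magnitude to E4M3_MAX, round, rescale."""
    scale = max(abs(w) for w in weights) / E4M3_MAX or 1.0  # avoid dividing by zero
    return [round_to_e4m3(w / scale) * scale for w in weights]
```

With only 3 mantissa bits, each power-of-two interval holds 8 representable values, which is why the scale factor matters: it places the weights where the grid is dense enough to keep rounding error small.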

	### Quantization Infrastructure

	Quantization was performed on the following system:

	- CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
	- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
	- Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
	- Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor

	## 🔧 Why FP8?

	### With vLLM/TensorRT-LLM:
	- ✅ 50% memory reduction vs BF16 (weights + activations + KV cache)
	- ✅ Faster inference via native FP8 tensor cores
	- ✅ Single GPU deployment on 24GB+ cards
	- ✅ Better throughput with optimized kernels
	- ✅ Minimal quality loss (sub-1% perplexity increase)

	### With Transformers:
	- ✅ Smaller download size (~14GB vs ~28GB BF16)
	- ✅ Compatible with standard transformers workflow
	- ⚠️ Decompresses to BF16 during inference (no runtime memory benefit)
	- ❌ Requires 28GB+ VRAM - impractical for most setups

	For 14B models, vLLM is essential for practical deployment.

	## 💾 Model Files

	This model is sharded into multiple safetensors files (all required for inference). The compressed format enables efficient storage and faster downloads.
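
Because every shard is required, it can help to verify a local download before loading. The sketch below assumes the repository follows the usual Hugging Face layout, in which an index file named `model.safetensors.index.json` maps each tensor to its shard. Both the filename and the helper `missing_shards` are assumptions, not something this card specifies.

```python
import json
from pathlib import Path

INDEX_NAME = "model.safetensors.index.json"  # assumed: the usual Hugging Face index filename


def missing_shards(model_dir):
    """Return shard filenames listed in the index that are absent from model_dir."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / INDEX_NAME).read_text())
    # "weight_map" maps each tensor name to the shard file that stores it.
    required = sorted(set(index["weight_map"].values()))
    return [name for name in required if not (model_dir / name).is_file()]
```

An empty list means every shard named in the index is present in the directory.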

	## 🚀 Performance vs 7B

	The 14B model offers significant improvements over 7B:

	- ✅ Superior code quality and more accurate completions
	- ✅ Enhanced understanding of complex programming concepts
	- ✅ Better reasoning for difficult coding tasks
	- ✅ Improved context handling for larger codebases
	- ⚠️ Trade-off: 2x VRAM requirement (14GB vs 7GB with vLLM)

	With vLLM, the 14B model fits comfortably on a single RTX 4090 (24GB) or RTX 5000 Ada (32GB).
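
A back-of-the-envelope check of that claim adds the KV cache to the weight footprint. The architecture values below (48 layers, 8 KV heads, head dimension 128) are assumptions that should be confirmed against the model's `config.json`, and the estimate ignores activation buffers and runtime overhead.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def fits_in_vram(vram_gb, weights_gb, context_tokens, bytes_per_token):
    """Rough check: weights plus KV cache for `context_tokens` tokens vs. total VRAM."""
    kv_gb = context_tokens * bytes_per_token / 1e9
    return weights_gb + kv_gb <= vram_gb


# Assumed architecture values; confirm them against the model's config.json.
per_token = kv_cache_bytes_per_token(
    num_layers=48, num_kv_heads=8, head_dim=128, bytes_per_value=2  # 2 bytes = 16-bit cache
)
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print("Fits on 24 GB at 4096 tokens:", fits_in_vram(24, 14, 4096, per_token))
```

Under these assumptions a single 4096-token sequence adds under 1GB on top of the ~14GB of weights, which leaves the remaining memory on a 24GB card for batching and longer contexts.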



	## 📚 Original Model

	This quantization is based on [microsoft/NextCoder-14B](https://huggingface.co/microsoft/NextCoder-14B) by Microsoft.

	For comprehensive information about:
	- Model architecture and training methodology
	- Capabilities, use cases, and limitations
	- Evaluation benchmarks and results
	- Ethical considerations and responsible AI guidelines

	Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-14B).

	## 🔧 Hardware Requirements

	### Minimum (vLLM):
	- GPU: NVIDIA RTX 4090 (24GB) or RTX 5000 Ada (32GB)
	- VRAM: 16GB minimum (the weights alone take ~14GB, leaving little room for KV cache), 24GB+ recommended
	- CUDA: 11.8 or newer

	### Recommended (vLLM):
	- GPU: NVIDIA RTX 5000 Ada (32GB) / H100 (80GB)
	- VRAM: 24GB+
	- CUDA: 12.0+

	### Transformers:
	- GPU: Multi-GPU setup or A100 (40GB+)
	- VRAM: 28GB+ (single GPU) or distributed across multiple GPUs
	- Not recommended for practical deployment

	## 📖 Additional Resources

	- vLLM Documentation: [docs.vllm.ai](https://docs.vllm.ai/)
	- TensorRT-LLM: [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
	- TevunahAi Models: [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
	- llm-compressor: [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)

	## 📄 License

	This model inherits the MIT License from the original NextCoder-14B model.

	## 🙏 Acknowledgments

	- Original Model: Microsoft NextCoder team
	- Quantization Framework: Neural Magic's llm-compressor
	- Quantized by: [TevunahAi](https://huggingface.co/TevunahAi)

	## 📝 Citation

	If you use this model, please cite the original NextCoder work:

	```bibtex
	@misc{nextcoder2024,
	  title={NextCoder: Next-Generation Code LLM},
	  author={Microsoft},
	  year={2024},
	  url={https://huggingface.co/microsoft/NextCoder-14B}
	}
	```

	---

	<div align="center">

	Professional AI Model Quantization by TevunahAi

	Enterprise-grade quantization on specialized hardware

	[View all models](https://huggingface.co/TevunahAi) | [Contact for custom quantization](https://huggingface.co/TevunahAi)

	</div>