	---
	license: mit
	base_model: microsoft/NextCoder-7B
	tags:
	- code
	- fp8
	- quantized
	- nextcoder
	- microsoft
	library_name: transformers
	pipeline_tag: text-generation
	---

	# NextCoder-7B-FP8

	High-quality FP8 quantization of Microsoft's NextCoder-7B, optimized for production inference.

	This is an FP8 (E4M3) quantized version of [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B), stored in the compressed_tensors format. Quantized by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware with 2048 calibration samples.

	## 🎯 Recommended Usage: vLLM

	For the full FP8 benefits (roughly half the memory of BF16, plus faster inference on supported GPUs), use vLLM or TensorRT-LLM:

	### Quick Start with vLLM

	```bash
	pip install vllm
	```

	Python API:

	```python
	from vllm import LLM, SamplingParams
	from transformers import AutoTokenizer

	# vLLM auto-detects FP8 from model config
	llm = LLM(model="TevunahAi/NextCoder-7B-FP8", dtype="auto")

	# Prepare prompt with chat template
	tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")
	messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

	# Generate
	outputs = llm.generate(prompt, SamplingParams(temperature=0.7, max_tokens=512))
	print(outputs[0].outputs[0].text)
	```

	OpenAI-Compatible API Server:

	```bash
	vllm serve TevunahAi/NextCoder-7B-FP8 \
	    --dtype auto \
	    --max-model-len 4096
	```

	Then use with OpenAI client:

	```python
	from openai import OpenAI

	client = OpenAI(
	    base_url="http://localhost:8000/v1",
	    api_key="token-abc123",  # dummy key
	)

	response = client.chat.completions.create(
	    model="TevunahAi/NextCoder-7B-FP8",
	    messages=[
	        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
	    ],
	    temperature=0.7,
	    max_tokens=512,
	)

	print(response.choices[0].message.content)
	```

	### vLLM Benefits

	- ✅ Weights, activations, and KV cache in FP8
	- ✅ ~7GB VRAM (50% reduction vs BF16)
	- ✅ Native FP8 tensor core acceleration on Ada/Hopper GPUs
	- ✅ Faster inference with optimized CUDA kernels
	- ✅ Production-grade performance
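
To put the KV cache point in numbers, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension below are assumptions (they are not stated in this README), so read the real values from the model's `config.json` before relying on the result. It also assumes the KV cache is actually stored in FP8, which in vLLM is controlled by the `kv_cache_dtype` option.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    # Factor 2: one key vector and one value vector per layer and per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Assumed architecture values; check config.json for the real ones.
ASSUMED = dict(num_layers=28, num_kv_heads=4, head_dim=128)

bf16 = kv_cache_bytes(4096, bytes_per_value=2, **ASSUMED)
fp8 = kv_cache_bytes(4096, bytes_per_value=1, **ASSUMED)
print(f"BF16: {bf16 / 2**20:.0f} MiB, FP8: {fp8 / 2**20:.0f} MiB per 4096-token sequence")
```

Under these assumptions the per-sequence cache halves when values shrink from 2 bytes to 1 byte, which is where the extra batch capacity comes from.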

	## ⚙️ Alternative: Transformers

	This model can also be loaded with `transformers`. Note: Transformers will decompress FP8 → BF16 during inference, losing the memory benefit. However, at 7B parameters, this is manageable (~14GB VRAM).

	<details>
	<summary>Transformers Example (Click to expand)</summary>

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	# Loads FP8 weights but decompresses to BF16 during compute
	model = AutoModelForCausalLM.from_pretrained(
	    "TevunahAi/NextCoder-7B-FP8",
	    device_map="auto",
	    torch_dtype="auto",
	    low_cpu_mem_usage=True,
	)
	tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")

	# Generate code
	messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer([text], return_tensors="pt").to(model.device)

	outputs = model.generate(
	    **inputs,
	    max_new_tokens=512,
	    temperature=0.7,
	    do_sample=True,
	)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```

	Requirements:
	```bash
	pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
	```

	System Requirements:
	- ~14GB VRAM (decompressed to BF16)
	- CUDA 11.8 or newer
	- PyTorch 2.1+ with CUDA support

	</details>

	## 📊 Quantization Details

	| Property | Value |
	|----------|-------|
	| Base Model | [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) |
	| Quantization Method | FP8 E4M3 weight-only |
	| Framework | llm-compressor + compressed_tensors |
	| Storage Size | ~7GB (3 sharded safetensors) |
	| VRAM (vLLM) | ~7GB |
	| VRAM (Transformers) | ~14GB (decompressed to BF16) |
	| Target Hardware | NVIDIA Ada (RTX 4000/5000) or Hopper (H100/GH200) |
	| Quantization Date | November 22, 2025 |
	| Quantization Time | 47 minutes |
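
For readers unfamiliar with E4M3, the sketch below decodes every 8-bit pattern in pure Python to show the range and granularity of the format. It assumes the common OCP-style layout with no infinities (the one PyTorch exposes as `torch.float8_e4m3fn`); `decode_e4m3` is an illustrative helper, not part of llm-compressor or compressed_tensors.

```python
import math


def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte (variant without infinities) to a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F  # 4 exponent bits, bias 7
    mantissa = byte & 0x07         # 3 mantissa bits
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan            # the only NaN pattern; there is no infinity
    if exponent == 0:
        # Subnormal numbers have no implicit leading 1.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - 7)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


values = [decode_e4m3(b) for b in range(256)]
finite = [v for v in values if not math.isnan(v)]
print("largest finite value:", max(finite))
print("smallest positive value:", min(v for v in finite if v > 0))
```

With only 3 mantissa bits, each power-of-two interval holds 8 representable values, which is why FP8 checkpoints store scale factors alongside the weights to map them into this range.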

	### Quantization Infrastructure

	Professional hardware ensures consistent, high-quality quantization:

	- CPUs: Dual Intel Xeon Max 9480 (112 cores / 224 threads, 128GB HBM2e)
	- GPU: NVIDIA RTX 5000 Ada Generation (32GB VRAM, native FP8 support)
	- Memory: 256GB DDR5 + 128GB HBM2e = 384GB total system memory
	- Software Stack: Ubuntu 25.10 | Python 3.12 | PyTorch 2.8 | CUDA 13.0 | llm-compressor

	## 🔧 Why FP8?

	### With vLLM/TensorRT-LLM:
	- ✅ 50% memory reduction vs BF16 (weights + activations + KV cache)
	- ✅ Faster inference via native FP8 tensor cores
	- ✅ Minimal quality loss (sub-1% perplexity increase)
	- ✅ Better throughput with optimized kernels

	### With Transformers:
	- ✅ Smaller download size (~7GB vs ~14GB BF16)
	- ✅ Compatible with standard transformers workflow
	- ⚠️ Decompresses to BF16 during inference (no runtime memory benefit)

	For production inference, use vLLM to realize the full FP8 benefits.
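
The ~7GB and ~14GB figures follow from simple arithmetic on bytes per parameter. The sketch below uses a nominal 7 billion parameters, which is an assumption taken from the model name rather than an exact count, and it ignores scale factors, activations, and the KV cache.

```python
GB = 1e9  # decimal gigabytes, matching the rounded figures in this README


def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in decimal GB."""
    return num_params * bytes_per_param / GB


NOMINAL_PARAMS = 7e9  # assumption: the "7B" in the model name, not an exact count
for name, width in [("BF16", 2), ("FP8", 1)]:
    print(f"{name}: ~{weight_memory_gb(NOMINAL_PARAMS, width):.0f} GB of weights")
```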

	## 💾 Model Files

	This model is sharded into 3 safetensors files (all required for inference):

	- `model-00001-of-00003.safetensors`
	- `model-00002-of-00003.safetensors`
	- `model-00003-of-00003.safetensors`
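
Because every shard is required, a quick completeness check after downloading can save a confusing load error. The following is a minimal sketch; `missing_shards` is a hypothetical helper that only inspects filenames matching the naming pattern above.

```python
import re

SHARD_RE = re.compile(r"^model-(\d{5})-of-(\d{5})\.safetensors$")


def missing_shards(filenames):
    """Return the sorted shard indices that are absent from `filenames`."""
    found, total = set(), None
    for name in filenames:
        match = SHARD_RE.match(name)
        if match:
            found.add(int(match.group(1)))
            total = int(match.group(2))
    if total is None:
        raise ValueError("no sharded safetensors files found")
    return sorted(set(range(1, total + 1)) - found)


example = [
    "config.json",
    "model-00001-of-00003.safetensors",
    "model-00003-of-00003.safetensors",
]
print(missing_shards(example))
```

In practice the list of names could come from `os.listdir()` on the local snapshot directory, or from `huggingface_hub.list_repo_files("TevunahAi/NextCoder-7B-FP8")` to see what the repository contains.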

	## 📚 Original Model

	This quantization is based on [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) by Microsoft.

	For comprehensive information about:
	- Model architecture and training methodology
	- Capabilities, use cases, and limitations
	- Evaluation benchmarks and results
	- Ethical considerations and responsible AI guidelines

	Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-7B).

	## 🔧 Hardware Requirements

	### Minimum (vLLM):
	- GPU: NVIDIA RTX 4060 Ti (16GB) or better
	- VRAM: 8GB minimum, 16GB recommended
	- CUDA: 11.8 or newer

	### Recommended (vLLM):
	- GPU: NVIDIA RTX 4090 / RTX 5000 Ada / H100
	- VRAM: 16GB+
	- CUDA: 12.0+

	### Transformers:
	- GPU: Any CUDA-capable GPU
	- VRAM: 16GB+ (due to BF16 decompression)
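
To check whether a given GPU has native FP8 tensor cores, one option is to compare its CUDA compute capability against 8.9 (Ada) and 9.0 (Hopper). This is a sketch with illustrative helper names; it treats anything at or above 8.9 as FP8-capable, and it does not check VRAM or driver versions.

```python
def has_native_fp8(major, minor):
    """True for compute capability 8.9 (Ada), 9.0 (Hopper), or newer."""
    return (major, minor) >= (8, 9)


def check_current_gpu():
    """Query the first CUDA device through PyTorch; None if that is not possible."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return has_native_fp8(*torch.cuda.get_device_capability(0))


print(has_native_fp8(8, 6))  # an Ampere consumer GPU such as the RTX 3090
print(has_native_fp8(8, 9))  # an Ada GPU such as the RTX 4090
```

Calling `check_current_gpu()` on the target machine returns `True`, `False`, or `None` when PyTorch or a CUDA device is unavailable.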

	## 📖 Additional Resources

	- vLLM Documentation: [docs.vllm.ai](https://docs.vllm.ai/)
	- TensorRT-LLM: [github.com/NVIDIA/TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
	- TevunahAi Models: [huggingface.co/TevunahAi](https://huggingface.co/TevunahAi)
	- llm-compressor: [github.com/vllm-project/llm-compressor](https://github.com/vllm-project/llm-compressor)

	## 📄 License

	This model inherits the MIT License from the original NextCoder-7B model.

	## 🙏 Acknowledgments

	- Original Model: Microsoft NextCoder team
	- Quantization Framework: Neural Magic's llm-compressor
	- Quantized by: [TevunahAi](https://huggingface.co/TevunahAi)

	## 📝 Citation

	If you use this model, please cite the original NextCoder work:

	```bibtex
	@misc{nextcoder2024,
	  title={NextCoder: Next-Generation Code LLM},
	  author={Microsoft},
	  year={2024},
	  url={https://huggingface.co/microsoft/NextCoder-7B}
	}
	```

	---

	<div align="center">

	Professional AI Model Quantization by TevunahAi

	Enterprise-grade quantization on specialized hardware

	[View all models](https://huggingface.co/TevunahAi) | [Contact for custom quantization](https://huggingface.co/TevunahAi)

	</div>