---
license: mit
base_model: microsoft/NextCoder-7B
tags:
- code
- fp8
- quantized
- nextcoder
- microsoft
library_name: transformers
pipeline_tag: text-generation
---
# NextCoder-7B-FP8
**High-quality FP8 quantization of Microsoft's NextCoder-7B, optimized for production inference**
This is an FP8 (E4M3) quantization of [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) in the compressed-tensors format, produced by [TevunahAi](https://huggingface.co/TevunahAi) on enterprise-grade hardware using 2048 calibration samples.
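You can verify the quantization scheme without downloading the full weights by inspecting the checkpoint's config. A minimal sketch (the exact keys under `quantization_config` depend on the compressed-tensors version used at export time):
```python
import json
from huggingface_hub import hf_hub_download

# Fetch only config.json, not the weight shards
config_path = hf_hub_download("TevunahAi/NextCoder-7B-FP8", "config.json")
with open(config_path) as f:
    config = json.load(f)

# For compressed-tensors checkpoints, the FP8 scheme (format, scales, targets)
# lives under this key
print(json.dumps(config.get("quantization_config", {}), indent=2))
```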
## 🎯 Recommended Usage: vLLM
For optimal performance with **full FP8 benefits** (2x memory savings + faster inference), use **vLLM** or **TensorRT-LLM**:
### Quick Start with vLLM
```bash
pip install vllm
```
**Python API:**
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
# vLLM auto-detects FP8 from model config
llm = LLM(model="TevunahAi/NextCoder-7B-FP8", dtype="auto")
# Prepare prompt with chat template
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Generate
outputs = llm.generate(prompt, SamplingParams(temperature=0.7, max_tokens=512))
print(outputs[0].outputs[0].text)
```
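Because vLLM batches requests automatically, throughput scales well when you submit several prompts in one call. A minimal sketch continuing from the snippet above (reusing the `llm` and `tokenizer` objects):
```python
# Build one chat-formatted prompt per question; vLLM batches them in a single call
questions = [
    "Reverse a linked list in Python",
    "Explain Python decorators with a short example",
]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in questions
]
outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
for output in outputs:
    print(output.outputs[0].text)
```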
**OpenAI-Compatible API Server:**
```bash
vllm serve TevunahAi/NextCoder-7B-FP8 \
    --dtype auto \
    --max-model-len 4096
```
Then use it with the OpenAI client:
```python
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # dummy key; vLLM does not require auth by default
)
response = client.chat.completions.create(
    model="TevunahAi/NextCoder-7B-FP8",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
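The same endpoint also supports streaming, which is useful for interactive coding assistants. A minimal sketch using the standard `stream=True` flag of the OpenAI client:
```python
# Stream tokens as they are generated instead of waiting for the full completion
stream = client.chat.completions.create(
    model="TevunahAi/NextCoder-7B-FP8",
    messages=[
        {"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}
    ],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```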
### vLLM Benefits
- ✅ **Weights and activations in FP8**, with an optional FP8 KV cache (see the sketch after this list)
- ✅ **~7GB VRAM** (50% reduction vs BF16)
- ✅ **Native FP8 tensor core acceleration** on Ada/Hopper GPUs
- ✅ **Faster inference** with optimized CUDA kernels
- ✅ **Production-grade performance**
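The checkpoint itself quantizes weights and activations; depending on your vLLM version, the FP8 KV cache may need to be enabled explicitly. A minimal sketch (verify the option name against your installed vLLM release):
```python
from vllm import LLM

# Store the KV cache in FP8 as well, roughly halving its memory footprint;
# pass --kv-cache-dtype fp8 to `vllm serve` for the server equivalent
llm = LLM(
    model="TevunahAi/NextCoder-7B-FP8",
    dtype="auto",
    kv_cache_dtype="fp8",
)
```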
## ⚙️ Alternative: Transformers
This model can also be loaded with `transformers`. **Note:** Transformers will decompress FP8 → BF16 during inference, losing the memory benefit. However, at 7B parameters, this is manageable (~14GB VRAM).
**Transformers Example:**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Loads FP8 weights but decompresses to BF16 during compute
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-7B-FP8",
    device_map="auto",
    torch_dtype="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")
# Generate code
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
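To confirm how much memory the decompressed weights actually use on your hardware, you can check PyTorch's peak allocation after generation. A short sketch continuing from the example above:
```python
# Peak VRAM should land near BF16-sized usage (~14GB for 7B parameters),
# since transformers decompresses the FP8 weights for compute
if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory: {peak_gb:.1f} GB")
```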
**Requirements:**
```bash
# Quote the version specifiers so the shell does not treat >= as a redirect
pip install "torch>=2.1.0" "transformers>=4.40.0" accelerate compressed-tensors
```
**System Requirements:**
- ~14GB VRAM (decompressed to BF16)
- CUDA 11.8 or newer
- PyTorch 2.1+ with CUDA support
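To check up front whether your GPU has native FP8 tensor cores (compute capability 8.9 for Ada, 9.0 for Hopper), a quick sketch:
```python
import torch

# Older GPUs can still load the model, but without hardware FP8 acceleration
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    print("Native FP8 tensor cores:", (major, minor) >= (8, 9))
else:
    print("No CUDA device detected")
```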