---
license: mit
base_model: microsoft/NextCoder-7B
tags:
- code
- fp8
- quantized
- nextcoder
- microsoft
library_name: transformers
pipeline_tag: text-generation
---

# NextCoder-7B-FP8

This is an FP8 quantized version of [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B).

## Model Description

FP8 (8-bit floating point) quantization of NextCoder-7B for efficient inference on NVIDIA Ada Lovelace and newer GPUs.

### Quantization Details

| Property | Value |
|----------|-------|
| Original Model | [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) |
| Quantization Method | FP8 (E4M3) via [llm-compressor](https://github.com/vllm-project/llm-compressor) |
| Target Hardware | NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.) |
| Quantization Date | 2025-11-22 |
| Quantization Time | 47.0 minutes |
| Hardware Used | NVIDIA RTX 5000 Ada Generation (31.5 GB) |

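The E4M3 layout packs each weight into one byte: 1 sign bit, 4 exponent bits (bias 7), and 3 mantissa bits, with no infinities and a single NaN encoding. As an illustration of the format (an explanatory sketch, not code from the quantization pipeline), a minimal decoder looks like this:

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3FN byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    mant = byte & 0x7
    if exp == 0xF and mant == 0x7:   # the only NaN encoding; E4M3FN has no infinities
        return float("nan")
    if exp == 0:                     # subnormal: 2^-6 * (mant / 8)
        return sign * (mant / 8.0) * 2.0 ** -6
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)  # normal value

print(decode_e4m3(0x38))  # 1.0
print(decode_e4m3(0x7E))  # 448.0, the largest finite E4M3 value
```

The small mantissa is why FP8 relies on per-tensor or per-channel scales computed during quantization: the scale maps each tensor's dynamic range onto the narrow set of representable values.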
## Usage

### Loading the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The FP8 quantization config is stored in the checkpoint, so
# torch_dtype="auto" picks up the compressed weights automatically.
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-7B-FP8",
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")

# Generate code
messages = [{"role": "user", "content": "Write a function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Requirements
```bash
pip install "torch>=2.1.0"         # FP8 support requires PyTorch 2.1+
pip install "transformers>=4.40.0"
pip install accelerate compressed-tensors
```

**Note**: FP8 inference requires:
- PyTorch 2.1 or newer with CUDA support
- An NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.)
- CUDA 11.8 or newer

## Benefits of FP8

- **~50% memory reduction** compared to FP16/BF16
- **Faster inference** on Ada Lovelace and newer GPUs with native FP8 support
- **Minimal quality loss** relative to the FP16 original, typically with less degradation than INT8 or INT4 quantization

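The memory figure is simple arithmetic: each weight takes 2 bytes in FP16/BF16 versus 1 byte in FP8. A quick back-of-the-envelope check for a nominal 7B-parameter model (weights only, ignoring activations and KV cache):

```python
params = 7_000_000_000  # nominal parameter count

fp16_gib = params * 2 / 2**30  # 2 bytes per weight
fp8_gib = params * 1 / 2**30   # 1 byte per weight

print(f"FP16 weights: {fp16_gib:.1f} GiB")  # ~13.0 GiB
print(f"FP8 weights:  {fp8_gib:.1f} GiB")   # ~6.5 GiB
```

Actual on-disk and in-memory sizes differ slightly because embeddings, norms, and any layers excluded from quantization remain in higher precision.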
## Original Model

This quantization is based on microsoft/NextCoder-7B by Microsoft.
Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-7B) for:
- Training details
- Intended use cases
- Limitations and biases
- License information

## License

This model inherits the MIT license from the original NextCoder model.