---
license: mit
base_model: microsoft/NextCoder-7B
tags:
- code
- fp8
- quantized
- nextcoder
- microsoft
library_name: transformers
pipeline_tag: text-generation
---

# NextCoder-7B-FP8

This is an FP8 quantized version of [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B).

## Model Description

This model is an FP8 (8-bit floating point, E4M3) quantization of NextCoder-7B, intended for efficient inference on NVIDIA GPUs with native FP8 support (Ada Lovelace and newer).

### Quantization Details

| Property | Value |
|----------|-------|
| Original Model | [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) |
| Quantization Method | FP8 (E4M3) via llm-compressor |
| Target Hardware | NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.) |
| Quantization Date | 2025-11-22 |
| Quantization Time | 47.0 minutes |
| Hardware Used | NVIDIA RTX 5000 Ada Generation (31.5 GB) |

## Usage

### Loading the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the FP8-quantized checkpoint.
# torch_dtype="auto" uses the dtype recorded in the checkpoint config;
# passing torch.float8_e4m3fn here would try to cast every parameter to FP8,
# which is not how quantized checkpoints are meant to be loaded.
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-7B-FP8",
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
)
tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")

# Build a chat-formatted prompt
messages = [{"role": "user", "content": "Write a function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate code with sampling
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)

# The decoded text contains the prompt followed by the completion
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Requirements

```bash
# Quote the version specifiers so the shell does not treat ">=" as a redirect
pip install "torch>=2.1.0"  # FP8 support requires PyTorch 2.1+
pip install "transformers>=4.40.0"
pip install accelerate
```

**Note**: FP8 inference requires:
- PyTorch 2.1 or newer with CUDA support
- An NVIDIA GPU with native FP8 support (Ada Lovelace, Hopper, or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.)
- CUDA 11.8 or newer

## Benefits of FP8

- **~50% smaller weights** than FP16/BF16 (1 byte per quantized weight instead of 2)
- **Faster inference** on GPUs with native FP8 support, such as Ada Lovelace
- **Minimal quality loss**, typically less than with INT8 or INT4 quantization

## Original Model

This quantization is based on microsoft/NextCoder-7B by Microsoft. Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-7B) for:
- Training details
- Intended use cases
- Limitations and biases
- License information

## License

This model inherits the MIT license from the original NextCoder model.
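
## Checking FP8 Support

The hardware note under Requirements and the memory claim under Benefits can be checked on a given machine with a short script. The sketch below is not part of the original model card: the helper names (`supports_native_fp8`, `weight_memory_gib`, `detect_capability`) are hypothetical, the 8.9 threshold assumes that Ada Lovelace's compute capability is the minimum for native FP8, and the 7-billion parameter count is a round illustrative figure rather than the model's exact size.

```python
from typing import Optional, Tuple

# Assumption: Ada Lovelace GPUs report CUDA compute capability 8.9 and
# Hopper reports 9.0, so 8.9 is treated as the minimum for native FP8.
MIN_FP8_CAPABILITY = (8, 9)


def supports_native_fp8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= MIN_FP8_CAPABILITY


def weight_memory_gib(num_params: int, bytes_per_param: float) -> float:
    """Estimate weight storage only (no activations or KV cache), in GiB."""
    return num_params * bytes_per_param / 1024**3


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the first CUDA device's compute capability, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible to PyTorch; cannot check FP8 support.")
    else:
        print(f"Compute capability {cap[0]}.{cap[1]}: "
              f"native FP8 = {supports_native_fp8(cap)}")

    # Hypothetical round figure of 7e9 parameters, for illustration only.
    params = 7_000_000_000
    print(f"BF16 weights: ~{weight_memory_gib(params, 2):.1f} GiB")
    print(f"FP8 weights:  ~{weight_memory_gib(params, 1):.1f} GiB")
```

The estimate covers weights only. Activations, the KV cache, and any layers left unquantized add to the real footprint, so the observed saving may be somewhat below 50%.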