rockylynnstein committed on
Commit
32fafeb
·
verified ·
1 Parent(s): c9aab89

Update README.md

  1. README.md +123 -3
README.md CHANGED
@@ -1,3 +1,123 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ base_model: microsoft/NextCoder-32B
+ tags:
+ - code
+ - fp8
+ - quantized
+ - nextcoder
+ - microsoft
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # NextCoder-32B-FP8
+
+ This is an FP8 quantized version of [microsoft/NextCoder-32B](https://huggingface.co/microsoft/NextCoder-32B) for efficient inference on NVIDIA Ada Lovelace and newer GPUs.
+
+ ## Model Description
+
+ FP8 (8-bit floating point) quantization of NextCoder-32B, optimized for fast code generation with minimal quality loss.
+
+ ### Quantization Details
+
+ | Property | Value |
+ |----------|-------|
+ | Original Model | [microsoft/NextCoder-32B](https://huggingface.co/microsoft/NextCoder-32B) |
+ | Quantization Method | FP8 (E4M3) via llm-compressor |
+ | Model Size | ~64GB (sharded safetensors files) |
+ | Target Hardware | NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.) |
+ | Quantization Date | 2025-11-23 |
+ | Quantization Time | 213.8 minutes |
+ | Hardware Used | NVIDIA RTX 5000 Ada Generation (31.5 GB) |
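
To make the "FP8 (E4M3)" entry above concrete, here is a minimal sketch of how a single E4M3 byte maps to a real number. It assumes the `e4m3fn` variant (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, one NaN pattern per sign), which is the variant the loading code on this card names. The function name `decode_e4m3fn` is hypothetical and used only for illustration; it is not part of llm-compressor or PyTorch.

```python
import math


def decode_e4m3fn(byte: int) -> float:
    """Decode one FP8 E4M3FN bit pattern (0..255) into a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")

    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF   # 4 exponent bits, bias 7
    mantissa = byte & 0x7          # 3 mantissa bits

    # E4M3FN has no infinities; exponent and mantissa all ones means NaN.
    if exponent == 0xF and mantissa == 0x7:
        return math.nan

    if exponent == 0:
        # Subnormal numbers: no implicit leading 1, fixed exponent of -6.
        return sign * (mantissa / 8.0) * 2.0 ** -6

    # Normal numbers: implicit leading 1.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


if __name__ == "__main__":
    print("largest finite value:", decode_e4m3fn(0x7E))
    print("smallest positive value:", decode_e4m3fn(0x01))
    print("one:", decode_e4m3fn(0x38))
```

The narrow range and 3-bit mantissa are why FP8 checkpoints usually store per-tensor or per-channel scale factors alongside the weights; whether this checkpoint does so is recorded in its quantization config, not assumed here.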
+
+ ## Usage
+
+ ### Loading the Model
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the FP8-quantized checkpoint
+ model = AutoModelForCausalLM.from_pretrained(
+     "TevunahAi/NextCoder-32B-FP8",
+     torch_dtype="auto",  # defer to the dtype recorded in the checkpoint config
+     device_map="auto",
+     low_cpu_mem_usage=True,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-32B-FP8")
+
+ # Generate code
+ messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+     temperature=0.7,
+     do_sample=True,
+ )
+
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ### Requirements
+ ```bash
+ # Quote the version specifiers so the shell does not treat ">" as a redirect
+ pip install "torch>=2.1.0"  # FP8 support requires PyTorch 2.1+
+ pip install "transformers>=4.40.0"
+ pip install accelerate
+ ```
+
+ **System Requirements:**
+ - PyTorch 2.1 or newer with CUDA support
+ - NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.)
+ - CUDA 11.8 or newer
+ - ~64GB VRAM for inference (or a multi-GPU setup with `device_map="auto"`)
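
The GPU requirement above can be checked programmatically. The sketch below assumes that native FP8 Tensor Cores start at CUDA compute capability 8.9 (Ada Lovelace), with Hopper reporting 9.0; the helper names `supports_native_fp8` and `check_current_gpu` are hypothetical and not part of any library.

```python
from typing import Tuple

# Assumption: compute capability 8.9 (Ada Lovelace) is the first with FP8 Tensor Cores.
MIN_FP8_CAPABILITY = (8, 9)


def supports_native_fp8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets the FP8 minimum."""
    return tuple(capability) >= MIN_FP8_CAPABILITY


def check_current_gpu() -> bool:
    """Check the first visible CUDA device; returns False when no GPU is available."""
    import torch  # imported lazily so the pure check above works without PyTorch

    if not torch.cuda.is_available():
        return False
    return supports_native_fp8(torch.cuda.get_device_capability(0))
```

Calling `check_current_gpu()` before downloading the shards gives an early signal; on older GPUs the model may still load, but without the native FP8 speedup.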
+
+ ## Benefits of FP8
+
+ - **~50% memory reduction** for weights compared to FP16/BF16
+ - **Faster inference** on Ada Lovelace and Hopper GPUs with native FP8 Tensor Cores
+ - **Lower quality loss** than is typical for INT8 or INT4 quantization
+ - **Native hardware acceleration** on modern NVIDIA GPUs
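
The ~50% figure follows directly from bytes per parameter. A back-of-the-envelope sketch, assuming a nominal 32 billion parameters and counting weights only (the helper name `weight_memory_gb` is hypothetical):

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 32e9  # assumption: nominal parameter count implied by the "32B" name

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # FP16/BF16 use 2 bytes per parameter
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)   # FP8 uses 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB ({1 - fp8_gb / bf16_gb:.0%} smaller)")
```

Actual checkpoint size and VRAM use also depend on which layers remain in higher precision and on activation and KV-cache memory, so these figures are estimates for weights only.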
+
+ ## Model Files
+
+ This model is sharded into multiple safetensors files. All files are required for inference.
+
+ ## Original Model
+
+ This quantization is based on [microsoft/NextCoder-32B](https://huggingface.co/microsoft/NextCoder-32B) by Microsoft. Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-32B) for:
+ - Training details
+ - Intended use cases
+ - Capabilities and limitations
+ - Evaluation results
+ - Ethical considerations
+
+ ## Quantization Recipe
+
+ This model was quantized using llm-compressor with the FP8 E4M3 format. The quantization recipe is included in `recipe.yaml`.
+
+ ## License
+
+ This model inherits the MIT license from the original NextCoder-32B model.
+
+ ## Citation
+
+ If you use this model, please cite the original NextCoder work:
+ ```bibtex
+ @misc{nextcoder2024,
+   title={NextCoder: Next-Generation Code LLM},
+   author={Microsoft},
+   year={2024},
+   url={https://huggingface.co/microsoft/NextCoder-32B}
+ }
+ ```
+
+ ## Acknowledgments
+
+ - Original model by Microsoft
+ - Quantization performed using Neural Magic's llm-compressor
+ - Quantized by TevunahAi