rockylynnstein committed on
Commit
d848a26
·
verified ·
1 Parent(s): e27d197

Update README.md

  1. README.md +130 -3
README.md CHANGED
@@ -1,3 +1,130 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ base_model: microsoft/NextCoder-7B
4
+ tags:
5
+ - code
6
+ - fp8
7
+ - quantized
8
+ - nextcoder
9
+ - microsoft
10
+ library_name: transformers
11
+ pipeline_tag: text-generation
12
+ ---
13
+
14
+ # NextCoder-7B-FP8
15
+
16
+ This is an FP8-quantized version of [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) for efficient inference on NVIDIA Ada Lovelace and newer GPUs.
17
+
18
+ ## Model Description
19
+
20
+ This repository contains an FP8 (8-bit floating point) quantization of NextCoder-7B, intended for fast code generation with minimal quality loss.
21
+
22
+ ### Quantization Details
23
+
24
+ | Property | Value |
25
+ |----------|-------|
26
+ | Original Model | [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) |
27
+ | Quantization Method | FP8 (E4M3) via llm-compressor |
28
+ | Model Size | ~14GB (3 sharded safetensors files) |
29
+ | Target Hardware | NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.) |
30
+ | Quantization Date | 2025-11-22 |
31
+ | Quantization Time | 47.0 minutes |
32
+ | Hardware Used | NVIDIA RTX 5000 Ada Generation (31.5 GB) |
33
+
34
+ ## Usage
35
+
36
+ ### Loading the Model
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "TevunahAi/NextCoder-7B-FP8"
+
+ # Load the FP8 checkpoint. torch_dtype="auto" defers to the dtype and
+ # quantization settings saved in the checkpoint's config; passing
+ # torch.float8_e4m3fn here would try to cast the weights instead.
+ # (Assumption: transformers needs the compressed-tensors package to read
+ # llm-compressor checkpoints.)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype="auto",
+     device_map="auto",
+     low_cpu_mem_usage=True,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Build a chat-formatted prompt
+ messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Sample up to 512 new tokens
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+     temperature=0.7,
+     do_sample=True,
+ )
+
+ # Decode the full sequence (prompt plus completion)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
65
+
66
+ ### Requirements
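+ ```bash
+ # Quote the version specifiers so the shell does not treat ">" as a redirect
+ pip install "torch>=2.1.0"  # FP8 support requires PyTorch 2.1+
+ pip install "transformers>=4.40.0"
+ pip install accelerate
+ pip install compressed-tensors  # assumed requirement for loading llm-compressor checkpoints
+ ```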
72
+
73
+ **System Requirements:**
74
+ - PyTorch 2.1 or newer with CUDA support
75
+ - NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.)
76
+ - CUDA 11.8 or newer
77
+ - ~14GB VRAM for inference
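The GPU requirement above can be checked before downloading the weights. The following is a minimal sketch, not part of the original model card: it assumes that native FP8 support starts at compute capability 8.9 (the value Ada Lovelace reports; Hopper reports 9.0), and the helper names `supports_native_fp8` and `describe_current_gpu` are hypothetical.

```python
from typing import Tuple

# Ada Lovelace reports compute capability 8.9 and Hopper 9.0; treating 8.9 as
# the minimum for native FP8 is an assumption based on those two architectures.
MIN_FP8_CAPABILITY = (8, 9)


def supports_native_fp8(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= MIN_FP8_CAPABILITY


def describe_current_gpu() -> str:
    """Report FP8 support for GPU 0, or explain why it cannot be checked."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "No CUDA device visible"
    capability = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "supports" if supports_native_fp8(capability) else "lacks"
    return f"{name} (sm_{capability[0]}{capability[1]}) {verdict} native FP8"


if __name__ == "__main__":
    print(describe_current_gpu())
```

The comparison is kept in a separate pure function so it can be exercised on a machine without a GPU.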
78
+
79
+ ## Benefits of FP8
80
+
81
+ - **~50% memory reduction** compared to FP16/BF16
82
+ - **Faster inference** on Ada Lovelace and Hopper GPUs with native FP8 Tensor Cores
83
+ - **Minimal quality loss**, typically smaller than with INT8 or INT4 quantization
84
+ - **Native hardware acceleration** on modern NVIDIA GPUs
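The memory figure in the list above follows from bytes per weight. As a rough weights-only sketch, the arithmetic below uses a hypothetical round count of 7 billion parameters; the real parameter count, and therefore the real sizes, will differ.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Hypothetical round figure for a "7B" model; the real parameter count differs.
NUM_PARAMS = 7e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 16-bit: 2 bytes per weight
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)   # 8-bit: 1 byte per weight
reduction = 1 - fp8_gb / bf16_gb

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB")
print(f"Reduction:    {reduction:.0%}")
```

Any layers kept in higher precision, plus activations and the KV cache, add to the weights-only estimate, so measured VRAM use will be higher.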
85
+
86
+ ## Model Files
87
+
88
+ This model is sharded into 3 safetensors files:
89
+ - `model-00001-of-00003.safetensors`
90
+ - `model-00002-of-00003.safetensors`
91
+ - `model-00003-of-00003.safetensors`
92
+
93
+ All files are required for inference.
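Because every shard is needed, a quick check of a local download can catch an incomplete copy before loading. This is a small sketch using only the standard library; the function name `missing_shards` and the directory path are hypothetical, and the file names follow the pattern listed above.

```python
from pathlib import Path
from typing import List


def missing_shards(model_dir: str, total: int = 3) -> List[str]:
    """Return the expected shard file names that are absent from model_dir."""
    expected = [
        f"model-{index:05d}-of-{total:05d}.safetensors"
        for index in range(1, total + 1)
    ]
    directory = Path(model_dir)
    return [name for name in expected if not (directory / name).is_file()]


if __name__ == "__main__":
    # "./NextCoder-7B-FP8" is a hypothetical local download path.
    missing = missing_shards("./NextCoder-7B-FP8")
    print("All shards present" if not missing else f"Missing: {missing}")
```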
94
+
95
+ ## Original Model
96
+
97
+ This quantization is based on [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) by Microsoft.
98
+
99
+ Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-7B) for:
100
+ - Training details
101
+ - Intended use cases
102
+ - Capabilities and limitations
103
+ - Evaluation results
104
+ - Ethical considerations
105
+
106
+ ## Quantization Recipe
107
+
108
+ This model was quantized using llm-compressor with the FP8 E4M3 format. The quantization recipe is included in `recipe.yaml`.
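To give a feel for what the E4M3 format does to individual values, here is a pure-Python sketch of rounding to the E4M3 grid (4 exponent bits, 3 mantissa bits, maximum finite value 448). It illustrates the number format only and is not llm-compressor's implementation; the per-tensor scaling in `quantize_per_tensor` is an assumption made for the example, and the scheme actually used is the one recorded in `recipe.yaml`.

```python
import math

E4M3_MAX = 448.0      # largest finite value in the E4M3FN variant
MANTISSA_BITS = 3
MIN_NORMAL_EXP = -6   # smallest normal exponent; smaller values are subnormal


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest E4M3 value, saturating at +/-448."""
    if value == 0.0 or math.isnan(value):
        return value
    magnitude = min(abs(value), E4M3_MAX)
    # frexp gives magnitude = m * 2**exp with 0.5 <= m < 1, so the binary
    # exponent of the leading bit is exp - 1.
    exponent = max(math.frexp(magnitude)[1] - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exponent - MANTISSA_BITS)  # spacing between neighbours
    rounded = min(round(magnitude / step) * step, E4M3_MAX)
    return math.copysign(rounded, value)


def quantize_per_tensor(values):
    """Scale values so the largest magnitude maps to 448, then round."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    quantized = [round_to_e4m3(v / scale) for v in values]
    return quantized, scale


if __name__ == "__main__":
    quantized, scale = quantize_per_tensor([0.5, -2.0, 1.0])
    # Multiplying by the scale recovers approximate original values.
    print([q * scale for q in quantized])
```

With only three mantissa bits, neighbouring values are about 6 to 12 percent apart, which is why a scale factor is stored alongside the weights to keep them in the well-resolved part of the range.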
109
+
110
+ ## License
111
+
112
+ This model inherits the MIT license from the original NextCoder-7B model.
113
+
114
+ ## Citation
115
+
116
+ If you use this model, please cite the original NextCoder work:
+ ```bibtex
+ @misc{nextcoder2024,
+   title={NextCoder: Next-Generation Code LLM},
+   author={Microsoft},
+   year={2024},
+   url={https://huggingface.co/microsoft/NextCoder-7B}
+ }
+ ```
125
+
126
+ ## Acknowledgments
127
+
128
+ - Original model by Microsoft
129
+ - Quantization performed using Neural Magic's llm-compressor
130
+ - Quantized by TevunahAi