---
license: mit
base_model: microsoft/NextCoder-14B
tags:
- code
- fp8
- quantized
- nextcoder
- microsoft
library_name: transformers
pipeline_tag: text-generation
---

# NextCoder-14B-FP8

This is an FP8 quantized version of [microsoft/NextCoder-14B](https://huggingface.co/microsoft/NextCoder-14B) for efficient inference on NVIDIA Ada Lovelace and newer GPUs.

## Model Description

FP8 (8-bit floating point) quantization of NextCoder-14B, optimized for fast code generation with minimal quality loss.

### Quantization Details

| Property | Value |
|----------|-------|
| Original Model | [microsoft/NextCoder-14B](https://huggingface.co/microsoft/NextCoder-14B) |
| Quantization Method | FP8 (E4M3) via llm-compressor |
| Model Size | ~28GB (sharded safetensors files) |
| Target Hardware | NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.) |
| Quantization Date | 2025-11-22 |
| Hardware Used | NVIDIA RTX 5000 Ada Generation (31.5 GB) |

## Usage

### Loading the Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The FP8 quantization config stored with the checkpoint is applied
# automatically; torch_dtype="auto" selects the matching dtypes.
# (Passing torch.float8_e4m3fn directly is not supported for inference.)
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/NextCoder-14B-FP8",
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
)

tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-14B-FP8")

# Generate code
messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Requirements
```bash
pip install "torch>=2.1.0"  # FP8 support requires PyTorch 2.1+
pip install "transformers>=4.40.0"
pip install accelerate
```

**System Requirements:**
- PyTorch 2.1 or newer with CUDA support
- NVIDIA GPU with FP8 support (Ada Lovelace or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.)
- CUDA 11.8 or newer
- ~28GB VRAM for inference (or use multi-GPU with `device_map="auto"`)
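
Hardware FP8 support corresponds to CUDA compute capability 8.9 (Ada Lovelace) or newer. A quick sketch for checking your GPU; the `supports_fp8` helper is illustrative, and the PyTorch query is optional:

```python
def supports_fp8(major: int, minor: int) -> bool:
    # FP8 Tensor Cores require compute capability 8.9 (Ada Lovelace)
    # or newer; Hopper (H100) is 9.0.
    return (major, minor) >= (8, 9)

try:
    import torch  # optional: query the local GPU if PyTorch is installed
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        print(f"sm_{major}{minor}: FP8 supported = {supports_fp8(major, minor)}")
except ImportError:
    pass
```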

## Benefits of FP8

- **~50% memory reduction** compared to FP16/BF16
- **Faster inference** on Ada Lovelace and Hopper GPUs with native FP8 Tensor Cores
- **Minimal quality loss** compared to INT8 or INT4 quantization
- **Native hardware acceleration** on modern NVIDIA GPUs
- **Larger models accessible** on consumer GPUs (fits on RTX 5000 Ada with 32GB VRAM)
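
The ~50% figure follows directly from bytes per parameter. A back-of-the-envelope sketch (the 14B parameter count is approximate, and weight-only math excludes KV cache, activations, and any tensors kept in higher precision, so on-disk size can differ):

```python
def weight_gigabytes(n_params: float, bytes_per_param: float) -> float:
    """Rough weight-only footprint; excludes KV cache and activations."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 14e9  # approximate parameter count for a 14B model

fp16_gb = weight_gigabytes(N_PARAMS, 2.0)  # FP16/BF16: 2 bytes per param
fp8_gb = weight_gigabytes(N_PARAMS, 1.0)   # FP8 (E4M3): 1 byte per param

print(f"FP16: ~{fp16_gb:.0f} GB, FP8: ~{fp8_gb:.0f} GB "
      f"({fp8_gb / fp16_gb:.0%} of FP16)")
```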

## Model Files

This model is sharded into multiple safetensors files for efficient loading and distribution. All files are required for inference.
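
Sharded checkpoints ship a `model.safetensors.index.json` that maps every tensor to its shard, which is how loaders know which files are required. A sketch of recovering the shard list from such an index (the weight map below is illustrative, not this repo's actual index):

```python
# Illustrative stand-in for the "weight_map" in a
# model.safetensors.index.json; the real file lists every tensor.
index = {
    "metadata": {"total_size": 28_000_000_000},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
        "model.layers.24.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
        "lm_head.weight": "model-00003-of-00003.safetensors",
    },
}

# Every distinct shard referenced by the weight map is required.
shards = sorted(set(index["weight_map"].values()))
print(f"{len(shards)} shards required:")
for shard in shards:
    print(" ", shard)
```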

## Original Model

This quantization is based on [microsoft/NextCoder-14B](https://huggingface.co/microsoft/NextCoder-14B) by Microsoft.

Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-14B) for:
- Training details
- Intended use cases
- Capabilities and limitations
- Evaluation results
- Ethical considerations

## Performance vs 7B

The 14B model offers:
- **Better code quality** and more accurate completions
- **Improved understanding** of complex programming concepts
- **Enhanced reasoning** for difficult coding tasks
- **Trade-off**: Requires 2x VRAM (28GB vs 14GB)

## Quantization Recipe

This model was quantized using llm-compressor with the FP8 E4M3 format. The quantization recipe is included in `recipe.yaml`.
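
For orientation, an llm-compressor FP8 recipe typically has the shape sketched below; this is an illustrative assumption about the layout, and the `recipe.yaml` shipped in this repository is the authoritative version:

```yaml
# Illustrative sketch only -- see recipe.yaml in this repo for the
# exact configuration used for this model.
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]
      scheme: "FP8"
      ignore: ["lm_head"]
```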

## License

This model inherits the MIT license from the original NextCoder-14B model.

## Citation

If you use this model, please cite the original NextCoder work:
```bibtex
@misc{nextcoder2024,
  title={NextCoder: Next-Generation Code LLM},
  author={Microsoft},
  year={2024},
  url={https://huggingface.co/microsoft/NextCoder-14B}
}
```

## Acknowledgments

- Original model by Microsoft
- Quantization performed using Neural Magic's llm-compressor
- Quantized by TevunahAi