Files changed (1)
  1. README.md +95 -3
README.md CHANGED
@@ -1,3 +1,95 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ base_model: microsoft/NextCoder-7B
+ tags:
+ - code
+ - fp8
+ - quantized
+ - nextcoder
+ - microsoft
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # NextCoder-7B-FP8
+
+ This is an FP8 quantized version of [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B).
+
+ ## Model Description
+
+ FP8 (8-bit floating point) quantization of NextCoder-7B for efficient inference on NVIDIA Ada Lovelace and newer GPUs.
+
+ ### Quantization Details
+
+ | Property | Value |
+ |----------|-------|
+ | Original Model | [microsoft/NextCoder-7B](https://huggingface.co/microsoft/NextCoder-7B) |
+ | Quantization Method | FP8 (E4M3) via llm-compressor |
+ | Target Hardware | NVIDIA Ada Lovelace (RTX 40xx, RTX 5000 Ada, etc.) |
+ | Quantization Date | 2025-11-22 |
+ | Quantization Time | 47.0 minutes |
+ | Hardware Used | NVIDIA RTX 5000 Ada Generation (31.5 GB) |
+
+ ## Usage
+
+ ### Loading the Model
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load the FP8 checkpoint. torch_dtype="auto" keeps the dtypes recorded in the
+ # checkpoint instead of casting every tensor to a single FP8 dtype.
+ model = AutoModelForCausalLM.from_pretrained(
+     "TevunahAi/NextCoder-7B-FP8",
+     torch_dtype="auto",
+     device_map="auto",
+     low_cpu_mem_usage=True,
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("TevunahAi/NextCoder-7B-FP8")
+
+ # Build a chat-formatted prompt and generate code
+ messages = [{"role": "user", "content": "Write a function to calculate fibonacci numbers"}]
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=512,
+     temperature=0.7,
+     do_sample=True,
+ )
+
+ # Decode the full sequence (prompt plus completion)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ ### Requirements
+ ```bash
+ # Quote the version specifiers so the shell does not treat ">" as output redirection
+ pip install "torch>=2.1.0"  # FP8 support requires PyTorch 2.1+
+ pip install "transformers>=4.40.0"
+ pip install accelerate
+ pip install compressed-tensors  # assumed to be needed to load llm-compressor checkpoints
+ ```
+
+ **Note**: FP8 inference requires:
+ - PyTorch 2.1 or newer with CUDA support
+ - NVIDIA GPU with FP8 support (Ada Lovelace, Hopper, or newer: RTX 40xx series, RTX 5000 Ada, H100, etc.)
+ - CUDA 11.8 or newer
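The GPU requirement in the note above can be checked before downloading the weights. The following is a minimal sketch, not part of the original card: it assumes that native FP8 support begins at CUDA compute capability 8.9 (Ada Lovelace), and the helper names `supports_native_fp8` and `check_current_gpu` are hypothetical.

```python
# Assumption: native FP8 tensor cores start at compute capability 8.9
# (Ada Lovelace); Hopper GPUs such as the H100 report 9.0.
FP8_MIN_CAPABILITY = (8, 9)


def supports_native_fp8(capability):
    """Return True if a (major, minor) compute capability meets the FP8 threshold."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


def check_current_gpu():
    """Describe whether the first visible CUDA device meets the FP8 threshold."""
    import torch  # imported lazily so the helper above works without PyTorch

    if not torch.cuda.is_available():
        return "No CUDA device visible to PyTorch."
    capability = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    verdict = "meets" if supports_native_fp8(capability) else "is below"
    return f"{name} (sm_{capability[0]}{capability[1]}) {verdict} the FP8 threshold."
```

On a machine with PyTorch installed, `print(check_current_gpu())` reports the result for device 0. A GPU below the threshold may still be able to load the checkpoint, but it would likely not see the speed benefit of native FP8 kernels.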
+
+ ## Benefits of FP8
+
+ - **~50% memory reduction** for model weights compared to FP16/BF16
+ - **Faster inference** on Ada Lovelace GPUs with native FP8 support
+ - **Lower quality loss** than is typical for INT8 or INT4 quantization
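The memory figure follows from bytes per parameter: FP16/BF16 stores each weight in 2 bytes, FP8 in 1. A rough sketch of that arithmetic is below. The parameter count is an assumed nominal 7 billion taken from the model name, not a measured value, and the estimate covers weights only (no activations or KV cache). If some layers are kept in higher precision, the real reduction would be somewhat smaller.

```python
GIB = 1024 ** 3

# Assumption: nominal parameter count implied by the "7B" in the model name.
NUM_PARAMS = 7_000_000_000


def weight_memory_gib(num_params, bytes_per_param):
    """Estimate weight storage in GiB for a given per-parameter width."""
    return num_params * bytes_per_param / GIB


bf16_gib = weight_memory_gib(NUM_PARAMS, 2)  # FP16/BF16: 2 bytes per weight
fp8_gib = weight_memory_gib(NUM_PARAMS, 1)   # FP8: 1 byte per weight
reduction = 1 - fp8_gib / bf16_gib

print(f"BF16 weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:  {fp8_gib:.1f} GiB")
print(f"Reduction:    {reduction:.0%}")
```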
+
+ ## Original Model
+
+ This quantization is based on microsoft/NextCoder-7B by Microsoft.
+ Please refer to the [original model card](https://huggingface.co/microsoft/NextCoder-7B) for:
+ - Training details
+ - Intended use cases
+ - Limitations and biases
+ - License information
+
+ ## License
+
+ This model inherits the MIT license from the original NextCoder model.