---
library_name: transformers
pipeline_tag: text-generation
base_model: openai/gpt-oss-20b
license: apache-2.0
tags:
- text-generation
- causal-lm
- gpt_oss
- moe
- fp8
- conversational
- english
---
# Model Card for `KJML/gpt-oss-20b-FP8-Dynamic`
This repository provides an FP8-dynamic quantized variant of **OpenAI’s `gpt-oss-20b`** model.
It is intended for users who want the reasoning capabilities of gpt-oss-20b with a **smaller memory footprint and faster inference** on modern GPUs that support FP8 inference.
> ⚠️ This model is **not** trained or fine-tuned further; it is a **post-training quantization** of the original `openai/gpt-oss-20b` weights.
---
## Model Details
### Model Description
- **Base model:** `openai/gpt-oss-20b`
- **Architecture:** Mixture-of-Experts (MoE) Transformer language model (≈21B total params, ≈3.6B active per token, inherited from base)
- **Quantization:** FP8 dynamic (weights + activations) for inference
- **Context length:** Same as base `gpt-oss-20b` (up to 128k tokens per the upstream documentation); prompts use the Harmony chat format
- **Language(s):** Primarily English; inherits multilingual capability from base model
- **License:** Apache 2.0 (inherits from base model)
- **Model type:** Causal language model for text / chat generation
- **Developer of this variant:** KJML
- **Finetuned from model:** `openai/gpt-oss-20b` (no additional training; quantization only)
The original `gpt-oss-20b` is an open-weight reasoning model from OpenAI, designed for agentic workflows, tool use, and configurable reasoning effort. This FP8-dynamic variant preserves those capabilities while targeting **more efficient deployment**.
### Model Sources
- **Base model repository:** <https://huggingface.co/openai/gpt-oss-20b>
- **Upstream project / docs:** <https://github.com/openai/gpt-oss>
- **This quantized model:** <https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic> (this repo)
---
## Uses
### Direct Use
Typical direct-use scenarios (without additional fine-tuning):
- General chat and assistant-style dialogue (English-first)
- Reasoning and analysis (step-by-step / chain-of-thought) for:
  - Technical explanations
  - Brainstorming and ideation
  - Code reasoning and pseudo-code (light coding assistance)
- Agentic / tool-using setups:
  - Function calling and structured outputs
  - Retrieval-augmented generation (RAG) backends
- Local “AI PC” / workstation deployments where FP8 is supported
**Note:** The base model was trained on OpenAI’s Harmony response format. For best results, build prompts with a chat template that applies the Harmony format (e.g. `tokenizer.apply_chat_template` in Transformers) rather than concatenating strings by hand.
### Downstream Use
The FP8-dynamic variant can be used as a drop-in replacement for `openai/gpt-oss-20b` in:
- Custom backends with vLLM / TGI / custom inference servers
- Local desktop apps (LM Studio, Ollama-style setups, etc.) that support FP8
- RAG systems where latency and VRAM usage are important
- Multi-agent frameworks where many concurrent contexts are needed
If you fine-tune or adapt this model further, treat it as you would the base `gpt-oss-20b` model, but keep in mind that **quantization can slightly change numeric behavior**, especially for very long generations.
### Out-of-Scope Use
The model (and this quantized variant) is **not recommended** for:
- High-stakes decision making without human review, e.g.:
  - Medical, legal, or financial advice
  - Safety-critical environments (autonomous driving, industrial control, etc.)
- Generating content that violates laws or platform policies
- Acting as the sole decision-maker in any context where errors could cause **harm to people or property**
Users should always keep a human in the loop for sensitive or impactful applications.
---
## Bias, Risks, and Limitations
This model inherits all **biases, risks, and limitations** of the base `gpt-oss-20b` model. As a large language model trained on internet-scale data, it may:
- Produce **biased or stereotypical content**, including along axes such as gender, race, nationality, or religion.
- Hallucinate facts, references, or citations.
- Overstate its own certainty.
- Generate unsafe or undesirable content if prompted adversarially or without proper safety layers.
The FP8-dynamic quantization may also:
- Introduce small degradations in quality vs. BF16 / MXFP4 versions, particularly for:
  - Very long generations
  - Edge cases that are numerically sensitive
- Behave slightly differently from the base model, even with identical prompts.
### Recommendations
- **Do not** rely on this model as a single source of truth.
- Add **safety filters** and/or a moderation layer around generations.
- Use **human review** for any high-impact or user-facing deployment.
- Evaluate the FP8-dynamic variant on your own tasks and data before using in production.
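
One way to act on the last recommendation is to run the same prompts through the base model and this variant and count how often the outputs agree. The sketch below is only an illustration: `agreement_rate`, `fake_base`, and `fake_fp8` are hypothetical names, and the stand-in generators exist so the example runs without a GPU. Exact-match comparison is only meaningful with deterministic (greedy) decoding.

```python
from typing import Callable, Iterable


def agreement_rate(
    prompts: Iterable[str],
    generate_base: Callable[[str], str],
    generate_fp8: Callable[[str], str],
) -> float:
    """Fraction of prompts for which both models return the same text."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one prompt")
    matches = sum(
        generate_base(p).strip() == generate_fp8(p).strip() for p in prompts
    )
    return matches / len(prompts)


# Stand-in generators (assumptions); replace them with real model calls.
def fake_base(prompt: str) -> str:
    return prompt.upper()


def fake_fp8(prompt: str) -> str:
    # Pretend the quantized model diverges on longer prompts.
    return prompt.upper() if len(prompt) < 10 else prompt.upper() + "!"


rate = agreement_rate(["hi", "short", "a much longer prompt"], fake_base, fake_fp8)
print(f"agreement: {rate:.2f}")
```

For open-ended tasks, a task-specific metric (accuracy on a benchmark, pass rate on unit tests) is usually more informative than exact match.
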
---
## How to Get Started with the Model
Basic usage with 🤗 Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "KJML/gpt-oss-20b-FP8-Dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the dtype recorded in the checkpoint config
    device_map="auto",   # place layers on the available GPU(s); requires `accelerate`
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain what FP8 dynamic quantization is in simple terms."},
]

# The chat template renders the messages in the Harmony format.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```
Make sure you are using a recent version of **Transformers** and a PyTorch build that supports FP8 where applicable.
---
## Training Details
### Training Data
No new training data is introduced in this repository.
* **This model is not trained from scratch.**
* It is derived directly from the released weights of `openai/gpt-oss-20b`; the underlying training data is whatever OpenAI used for the base model.
* For full details on the original training data and methodology, see the official gpt-oss model card and paper.
### Training Procedure
No additional gradient-based training was performed. The steps were:
1. Start from base `openai/gpt-oss-20b` weights.
2. Apply FP8-dynamic post-training quantization (weights and activations) for inference.
3. Export quantized weights to `safetensors` format for deployment.
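
To give a feel for what step 2 does numerically, here is a minimal pure-Python sketch of dynamic FP8 (E4M3) quantization of a single row of values. It is an illustration under stated assumptions, not the tooling used to produce this checkpoint (the card does not name it): `round_to_e4m3`, `quantize_dynamic`, and `dequantize` are hypothetical helpers, and the scale is derived per row from the row's own maximum.

```python
import math

E4M3_MAX = 448.0     # largest finite value in the FP8 E4M3 (fn) format
E4M3_MIN_EXP = -6    # exponent of the smallest normal number
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(mag)             # mag = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, E4M3_MIN_EXP)       # clamp so tiny values use the subnormal grid
    step = 2.0 ** (e - MANTISSA_BITS)    # spacing between representable values
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_dynamic(row: list[float]) -> tuple[list[float], float]:
    """Quantize a row with a scale computed at runtime from its largest magnitude."""
    amax = max(abs(v) for v in row)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in row], scale


def dequantize(q: list[float], scale: float) -> list[float]:
    return [v * scale for v in q]


row = [0.02, -1.5, 3.0, 0.7]
q, scale = quantize_dynamic(row)
restored = dequantize(q, scale)
for original, approx in zip(row, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Because the scale maps the largest magnitude in the row onto the top of the FP8 range, no calibration data is needed; the trade-off is a small amount of extra work at inference time to compute the scale.
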
#### Preprocessing
No extra data preprocessing was done beyond what OpenAI used for the base model.
#### Training Hyperparameters
* **Training regime for this repo:** *None* (no fine-tuning; quantization only)
* **Original base model:** Trained by OpenAI using high-precision training and post-training MXFP4 quantization of MoE weights (see upstream model card / paper for specifics).
#### Speeds, Sizes, Times
Exact performance depends on your hardware and FP8 support, but in general:
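* **VRAM usage:** Lower than a BF16 copy of the weights, since each quantized parameter takes 8 bits instead of 16, which leaves room for more concurrent contexts or larger batch sizes. The upstream checkpoint stores its MoE weights in MXFP4 (roughly 4 bits per parameter), so this variant is not expected to be smaller than that format.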
* **Throughput:** Higher tokens/sec on FP8-capable hardware compared to running BF16 weights, especially at batch size >1.
You should benchmark on your own GPU(s) for precise numbers.
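
As a starting point, the sketch below estimates weight storage at different bit widths and times a generation callable. It rests on loud assumptions: the estimate covers weights only (no KV cache or activations) and pretends every parameter uses the same width, which real checkpoints do not; `weight_gib`, `tokens_per_second`, and `fake_generate` are hypothetical names, and the stand-in workload exists only so the example runs anywhere.

```python
import time
from typing import Callable


def weight_gib(num_params: float, bits_per_param: float) -> float:
    """Weight storage in GiB if every parameter used the same bit width."""
    return num_params * bits_per_param / 8 / 2**30


def tokens_per_second(generate: Callable[[], int], warmup: int = 1, runs: int = 3) -> float:
    """Average tokens/sec; `generate()` runs one generation and returns the new-token count."""
    for _ in range(warmup):
        generate()  # warm-up runs are excluded from timing
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(runs):
        total_tokens += generate()
    return total_tokens / (time.perf_counter() - start)


TOTAL_PARAMS = 21e9  # approximate total from the model description above

estimates = {
    name: weight_gib(TOTAL_PARAMS, bits)
    for name, bits in [("bf16", 16), ("fp8", 8), ("4.25-bit block format", 4.25)]
}
for name, gib in estimates.items():
    print(f"{name:>22}: {gib:5.1f} GiB")


# Stand-in workload (assumption); replace with a call that wraps model.generate.
def fake_generate() -> int:
    time.sleep(0.01)
    return 64


tps = tokens_per_second(fake_generate)
print(f"{tps:.0f} tokens/sec")
```

When timing a real GPU workload, synchronize the device (for example with `torch.cuda.synchronize()`) before reading the clock, since CUDA kernels launch asynchronously.
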
---
## Evaluation
No separate benchmark suite has been run specifically for the FP8-dynamic variant at this time.
### Testing Data, Factors & Metrics
* **Testing data:** Not re-evaluated independently here.
* It is reasonable to expect **similar qualitative behavior** to `openai/gpt-oss-20b`, with minor differences due to quantization.
### Results
If you run your own evals (e.g. on reasoning or coding benchmarks), please share the results through an issue, PR, or discussion in this repository so others can reference them.
#### Summary
* Use this model when you want **gpt-oss-20b-level reasoning** with **lower memory usage and better throughput**.
* Expect small quality differences vs. the original due to FP8 quantization.
---
## Model Examination
No additional interpretability or probing analysis has been carried out on this quantized variant.
For deeper analysis and interpretability work, refer to:
* The official gpt-oss paper / model card.
* Independent community evaluations of `gpt-oss-20b`.
---
## Environmental Impact
This repository does **not** involve training a new model.
* The main compute cost is a **one-time quantization pass** over the base weights.
* Carbon footprint is therefore negligible compared to the original model training.
For estimates of training-time emissions, please consult the original gpt-oss model card and related publications.
---
## Technical Specifications
### Model Architecture and Objective
* **Architecture:** Mixture-of-Experts Transformer language model (same as `gpt-oss-20b`)
* **Objective:** Next-token prediction / causal language modeling
* **Quantization:**
  * FP8 dynamic for weights and activations at inference time
  * Intended for GPUs / accelerators that support efficient FP8 matmul
The quantization is applied in a way that preserves the original architecture and I/O behavior.
### Compute Infrastructure
Quantization was performed on a single modern GPU (exact details may vary; see repository description or commits if you need exact hardware).
#### Hardware
* Single GPU with FP8 support (for quantization and testing)
* Standard CPU + RAM sufficient to host original and quantized weights
#### Software
* PyTorch (FP8-capable build)
* Hugging Face Transformers
* Supporting libraries for FP8 quantization and `safetensors` export
---
## Citation
If you use this model in academic or commercial work, please cite at least the original gpt-oss paper/model card from OpenAI:
**BibTeX:**
```bibtex
@misc{openai2025gptoss120bgptoss20bmodel,
title={gpt-oss-120b & gpt-oss-20b Model Card},
author={OpenAI},
year={2025},
eprint={2508.10925},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.10925}
}
```
You may also optionally reference this quantized variant as:
```bibtex
@misc{kjml2025gptoss20bfp8dynamic,
title={KJML/gpt-oss-20b-FP8-Dynamic: FP8-dynamic Quantized Variant of gpt-oss-20b},
author={KJML},
year={2025},
howpublished={Hugging Face model repository},
url={https://huggingface.co/KJML/gpt-oss-20b-FP8-Dynamic}
}
```
---
## Glossary
* **MoE (Mixture-of-Experts):** Architecture where only a subset of “experts” (parameter blocks) are active per token, reducing compute vs. dense models.
* **FP8 dynamic:** 8-bit floating point representation in which scale factors are computed from the data at runtime rather than fixed by a calibration pass; it reduces memory and bandwidth while aiming to preserve model quality.
* **Harmony format:** OpenAI’s chat / response formatting used for training gpt-oss models; must be respected for best performance.
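
The MoE entry above can be made concrete with a toy top-k router. This is a sketch of one common routing variant (softmax over the selected experts' logits), not a description of the gpt-oss implementation; the expert count, `k`, and the names `route_top_k` and `moe_layer` are all illustrative assumptions.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and normalize their weights to sum to 1."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x: float, experts, router_logits: list[float], k: int) -> float:
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy setup: 4 scalar "experts", 2 active per token (illustrative sizes only).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]

selected = route_top_k(logits, k=2)
y = moe_layer(10.0, experts, logits, k=2)
print(selected, y)
```

Only the selected experts are evaluated for a given token, which is why the active parameter count per token is much smaller than the total parameter count.
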
---
## More Information
* Base model details, prompts, and advanced usage examples: see `openai/gpt-oss-20b` on Hugging Face and the official gpt-oss GitHub repository.
* For questions, issues, or suggestions around this FP8-dynamic variant, please open an issue or discussion in this repository.
---
## Model Card Authors
* **Author:** KJML
* **Contact:** Kongphop.ct.ja.sc@gmail.com