| --- |
| base_model: unsloth/gpt-oss-20b-unsloth-bnb-4bit |
| tags: |
| - text-generation-inference |
| - transformers |
| - unsloth |
| - gpt_oss |
| license: apache-2.0 |
| language: |
| - en |
| --- |
| ## Model Card |
| ### We release the open-weight, early experimental Codeforce metatune-gpt20b, a fine-tuned version of OpenAI's gpt-oss-20b model and one of the first publicly released recursive self-improving AI models. |
| - Generates new Codeforce-Cot data for itself, |
| - Evaluates its performance, and |
| - Adjusts its own hyperparameters based on improvement metrics. |
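
The third step above can be illustrated with a toy rule. This is only a minimal sketch, not the training code used for this model: the function name `adjust_learning_rate`, its thresholds, and the metric values are all hypothetical, chosen to show one way a hyperparameter could react to an improvement metric.

```python
def adjust_learning_rate(lr, prev_score, new_score,
                         grow=1.1, shrink=0.5, min_gain=0.01):
    """Hypothetical rule: raise lr slightly while the metric keeps
    improving by at least min_gain, otherwise cut it."""
    if new_score - prev_score >= min_gain:
        return lr * grow
    return lr * shrink


# Made-up evaluation scores from four consecutive self-training rounds.
scores = [0.70, 0.75, 0.751, 0.80]

lr = 2e-4
schedule = [lr]
for prev, new in zip(scores, scores[1:]):
    lr = adjust_learning_rate(lr, prev, new)
    schedule.append(lr)

print(schedule)
```

The learning rate rises after the first round (clear gain), is halved after the second (gain below `min_gain`), and rises again after the third.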
|
|
| ## Use cases: |
| - Coding |
|
|
| ## Guardrails: |
| - As a general rule, set reasoning = "high"; this usually helps prevent jailbreaking and prompt injection. |
| - Use gpt-oss-safeguard-20b as a guardrail in front of this model: [openai/gpt-oss-safeguard-20b](https://huggingface.co/openai/gpt-oss-safeguard-20b) |
|
|
| # Inference examples |
|
|
| ## Transformers |
|
|
| You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you call `model.generate` directly, you need to apply the harmony format yourself, either through the chat template or with the [openai-harmony](https://github.com/openai/harmony) package. |
|
|
| We recommend sampling with temperature=1.0 and top_p=1.0. |
|
|
| To get started, install the necessary dependencies to set up your environment: |
| ``` |
| pip install -U transformers kernels torch |
| ``` |
| |
| For Google Colab (free/Pro): |
| ``` |
| !pip install -q --upgrade torch |
| |
| !pip install -q transformers triton==3.4 kernels |
| |
| !pip uninstall -q torchvision torchaudio -y |
| ``` |
| |
| Once set up, you can run the model with the snippet below: |
| |
| ```py |
| from transformers import pipeline |
| import torch |
|
| model_id = "EpistemeAI/Codeforce-metatune-gpt20b" |
|
| # Load the model; dtype and device placement are chosen automatically. |
| pipe = pipeline( |
|     "text-generation", |
|     model=model_id, |
|     torch_dtype="auto", |
|     device_map="auto", |
| ) |
|
| messages = [ |
|     {"role": "user", "content": "Derive the Euler–Lagrange equation from the principle of stationary action."}, |
| ] |
|
| outputs = pipe( |
|     messages, |
|     max_new_tokens=3000, |
| ) |
| # The last entry of generated_text is the assistant's reply. |
| print(outputs[0]["generated_text"][-1]) |
| ``` |
| # Reasoning levels |
| |
| You can choose among three reasoning levels to suit your task: |
|
|
| * **Low:** Fast responses for general dialogue. |
| * **Medium:** Balanced speed and detail. |
| * **High:** Deep and detailed analysis. |
|
|
| The reasoning level can be set in the system prompt, e.g., "Reasoning: high". |
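
As a minimal sketch of that setting, the helper below builds a chat message list whose system prompt carries the reasoning level. The name `build_messages` and the example prompt are illustrative assumptions, not part of any library.

```python
def build_messages(prompt, reasoning="high"):
    """Return chat messages whose system prompt selects the reasoning level.

    build_messages is a hypothetical helper written for this example.
    """
    if reasoning not in {"low", "medium", "high"}:
        raise ValueError(f"unknown reasoning level: {reasoning!r}")
    return [
        {"role": "system", "content": f"Reasoning: {reasoning}"},
        {"role": "user", "content": prompt},
    ]


messages = build_messages(
    "Write a function that reverses a linked list.", reasoning="high"
)
print(messages[0]["content"])
```

The resulting `messages` list can be passed to the `pipe(...)` call from the Transformers snippet in place of the single user message.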
|
|
| # Tool use |
|
|
| The gpt-oss models are excellent for: |
| * Web browsing (using built-in browsing tools) |
| * Function calling with defined schemas |
| * Agentic operations like browser tasks |
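
To make "function calling with defined schemas" concrete, here is a small sketch under stated assumptions. The `get_weather` tool, its JSON schema, and the shape of the tool-call JSON are all hypothetical; the format the model actually emits is governed by the harmony response format and the chat template, which this example does not reproduce.

```python
import json

# Hypothetical tool definition in the common JSON-schema style.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def get_weather(city):
    """Stub implementation; a real tool would query a weather service."""
    return {"city": city, "forecast": "sunny"}


REGISTRY = {"get_weather": get_weather}


def dispatch(tool_call_json, tools=(GET_WEATHER_TOOL,)):
    """Check a JSON tool call against the schema's required fields, then run it."""
    call = json.loads(tool_call_json)
    schema = next(
        t["function"] for t in tools if t["function"]["name"] == call["name"]
    )
    missing = [
        key for key in schema["parameters"]["required"]
        if key not in call["arguments"]
    ]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return REGISTRY[call["name"]](**call["arguments"])


result = dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(result)
```

Validating the arguments before dispatching keeps malformed calls from reaching the tool itself.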
|
|
| # Fine-tuning |
|
|
| Both gpt-oss models can be fine-tuned for a variety of specialized use cases. |
|
|
| This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas the larger [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) can be fine-tuned on a single H100 node. |
|
|
|
|
| # Benchmark |
| ```py |
| #humaneval |
| !lm_eval --model hf --model_args pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16 --tasks humaneval --trust_remote_code --confirm_run_unsafe_code --num_fewshot 0 --gen_kwargs temperature=0.9,top_p=0.9,max_new_tokens=1024 --batch_size auto:4 --limit 10 --device cuda:0 --output_path ./eval_harness/gpt-oss-20b3 |
| ``` |
|
|
| hf (pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (temperature=0.9,top_p=0.9,max_new_tokens=1024), limit: 10.0, num_fewshot: 0, batch_size: auto:4 |
| | Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr| |
| |---------|------:|-----------|-----:|---------|---|----:|---|-----:| |
| |humaneval| 1|create_test| 0|pass@1 | | 0.9|± | 0.1| |
| |
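
The ± 0.1 in the table reflects the small sample: the run uses `--limit 10`, so each problem moves the score by 10 points. The sketch below reproduces the reported numbers under two assumptions: that 9 of the 10 problems passed, and that the standard error is the sample standard deviation divided by the square root of the sample size.

```python
import math
import statistics

# Assumption: 9 of the 10 sampled problems passed (matches pass@1 = 0.9).
outcomes = [1] * 9 + [0]

pass_at_1 = statistics.mean(outcomes)
# Standard error of the mean: sample standard deviation / sqrt(n).
stderr = statistics.stdev(outcomes) / math.sqrt(len(outcomes))

print(f"pass@1 = {pass_at_1:.1f} ± {stderr:.1f}")
```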
| # 🧠 Model Benchmark Comparison |
| |
| This table presents HumanEval benchmark scores across several large language models. |
| |
| | Model | HumanEval | |
| |------------------------|------------| |
| | Codeforce-GPT-oss-20b | **90** | |
| | Qwen 3 235B | 80 | |
| | DeepSeek-R1 70B | 88 | |
| | Phi-4 Reasoning | 88 | |
| | Llama 4 Scout | 78 | |
| | Llama 3.3 70B | 83 | |
| | Gemma 3 27B | 76 | |
| | GPT-OSS 20B | 73 | |
| | GPT-OSS 120B | 71 | |
| |
| --- |
| |
| ### 📊 Notes |
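| - **HumanEval** measures the functional correctness of generated code: a problem counts as solved when the model's program passes its unit tests. |
| - Scores are normalized to a 0-100 scale for consistency across models; the Codeforce-GPT-oss-20b score is the pass@1 of 0.9 (± 0.1) reported above, measured on a 10-problem subset (`--limit 10`). |
| - The top score is shown in **bold**. |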
| |
| --- |
| |
| ### 🔍 Summary |
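| Codeforce-GPT-oss-20b posts the highest score in this table, ahead of larger models such as Qwen 3 235B and DeepSeek-R1 70B. Because its score comes from a 10-problem subset with a standard error of 0.1, the ranking should be read as preliminary. The result suggests that an optimized training strategy, rather than scale alone, accounts for its reasoning and code synthesis performance. |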
| |
| -------------------------------------- |
| |
| - **Developed by:** EpistemeAI |
| - **License:** apache-2.0 |
| - **Finetuned from model:** unsloth/gpt-oss-20b-unsloth-bnb-4bit |
|
| This gpt_oss model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library. |
|
|
| [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth) |
|
|
| # Citation |
|
|
| ```bibtex |
| |
| @misc{bi2025gptossgoodcomprehensiveevaluation, |
| title={Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models}, |
| author={Ziqian Bi and Keyu Chen and Chiung-Yi Tseng and Danyang Zhang and Tianyang Wang and Hongying Luo and Lu Chen and Junming Huang and Jibin Guan and Junfeng Hao and Junhao Song}, |
| year={2025}, |
| eprint={2508.12461}, |
| archivePrefix={arXiv}, |
| primaryClass={cs.CL}, |
| url={https://arxiv.org/abs/2508.12461}, |
| } |
| ``` |