	---
	base_model: unsloth/gpt-oss-20b-unsloth-bnb-4bit
	tags:
	- text-generation-inference
	- transformers
	- unsloth
	- gpt_oss
	license: apache-2.0
	language:
	- en
	---
	## Model Card
	### We release Codeforce metatune-gpt20b, an early experimental open-weight model fine-tuned from OpenAI's gpt-oss-20b. It is one of the first publicly released recursive self-improving models. The model:
	- Generates new Codeforce-Cot data for itself,
	- Evaluates its performance, and
	- Adjusts its own hyperparameters based on improvement metrics.
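
The three steps above can be pictured as a simple loop. The sketch below is only an illustration under stated assumptions: the `Hyperparams` fields, the `adjust` rule (halve the learning rate when the score does not improve), and the `generate` and `train_and_eval` callables are all hypothetical and are not taken from the actual training code.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Hyperparams:
    # Hypothetical hyperparameters, chosen only for illustration.
    learning_rate: float
    num_samples: int


def adjust(hp: Hyperparams, best_score: float, new_score: float) -> Hyperparams:
    """Hypothetical rule: halve the learning rate when the score did not improve."""
    if new_score <= best_score:
        return replace(hp, learning_rate=hp.learning_rate / 2)
    return hp


def metatune(
    hp: Hyperparams,
    generate: Callable[[int], list],
    train_and_eval: Callable[[list, Hyperparams], float],
    rounds: int,
) -> List[Tuple[float, Hyperparams]]:
    """Run generate -> evaluate -> adjust for a fixed number of rounds."""
    history = []
    best = float("-inf")
    for _ in range(rounds):
        data = generate(hp.num_samples)      # step 1: generate new data
        score = train_and_eval(data, hp)     # step 2: evaluate performance
        hp = adjust(hp, best, score)         # step 3: adjust hyperparameters
        best = max(best, score)
        history.append((score, hp))
    return history
```

Passing the data generator and the evaluator in as callables keeps the loop itself independent of any particular model or dataset.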

	## Use cases:
	- Coding

	## Guardrails:
	- In general, set the reasoning level to "high" (`Reasoning: high`); this usually helps prevent jailbreaking and prompt injection.
	- Run [openai/gpt-oss-safeguard-20b](https://huggingface.co/openai/gpt-oss-safeguard-20b) as a guardrail in front of this model.
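
One way to place a guardrail in front of the model is to gate every prompt through a classifier before generation. The sketch below assumes two hypothetical callables: `is_allowed`, which would wrap a call to the safeguard model and turn its output into a boolean, and `generate`, which would wrap this model. How the safeguard output is parsed is not specified here and is left to the caller.

```python
from typing import Callable

# Hypothetical refusal text; replace with whatever fits your application.
DEFAULT_REFUSAL = "Sorry, I can't help with that request."


def guarded_generate(
    prompt: str,
    is_allowed: Callable[[str], bool],
    generate: Callable[[str], str],
    refusal: str = DEFAULT_REFUSAL,
) -> str:
    """Only call the main model when the guardrail accepts the prompt."""
    if not is_allowed(prompt):
        return refusal
    return generate(prompt)
```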

	# Inference examples

	## Transformers

	You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our [openai-harmony](https://github.com/openai/harmony) package.
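
If you prefer `model.generate` over the pipeline, a minimal sketch could look like the following. It applies the chat template through the tokenizer, so the harmony format is handled by the template. The function name and its default `max_new_tokens` are assumptions made for illustration; nothing is downloaded until the function is called.

```python
def generate_with_chat_template(model_id, messages, max_new_tokens=256):
    """Format messages with the chat template, generate, and decode the reply."""
    # Imported lazily so that defining the function does not load the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # The chat template converts the message list into model-ready token ids.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the tokens produced after the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens)
```

It would be called with a model id such as `"EpistemeAI/Codeforce-metatune-gpt20b"` and a list of `{"role": ..., "content": ...}` messages.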

	We recommend sampling with `temperature=1.0` and `top_p=1.0`.

	To get started, install the necessary dependencies to set up your environment:
	```
	pip install -U transformers kernels torch
	```

	For Google Colab (free/Pro):
	```
	!pip install -q --upgrade torch

	!pip install -q transformers triton==3.4 kernels

	!pip uninstall -q torchvision torchaudio -y
	```

	Once set up, you can run the model with the snippet below:

	```py
	from transformers import pipeline

	model_id = "EpistemeAI/Codeforce-metatune-gpt20b"

	# Load the model; dtype and device placement are chosen automatically.
	pipe = pipeline(
	    "text-generation",
	    model=model_id,
	    torch_dtype="auto",
	    device_map="auto",
	)

	messages = [
	    {"role": "user", "content": "Derive the Euler–Lagrange equation from the principle of stationary action."},
	]

	outputs = pipe(
	    messages,
	    max_new_tokens=3000,
	)
	# The last entry of generated_text is the assistant's reply.
	print(outputs[0]["generated_text"][-1])
	```
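
To apply the recommended sampling settings (`temperature=1.0`, `top_p=1.0`), the generation arguments can be passed through the pipeline call. This is a minimal sketch: the `SAMPLING_KWARGS` name and the `run` helper are assumptions for illustration, and `do_sample=True` is added because temperature and top-p only take effect when sampling is enabled.

```python
# Sampling settings recommended in this card; do_sample enables sampling.
SAMPLING_KWARGS = {"do_sample": True, "temperature": 1.0, "top_p": 1.0}


def run(pipe, messages, max_new_tokens=3000):
    """Call a text-generation pipeline with the recommended sampling settings."""
    outputs = pipe(messages, max_new_tokens=max_new_tokens, **SAMPLING_KWARGS)
    # The last entry of generated_text is the assistant's reply.
    return outputs[0]["generated_text"][-1]
```

Here `pipe` is expected to be a pipeline object like the one created in the snippet above.
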
	# Reasoning levels

	You can adjust the reasoning level to suit your task. There are three levels:

	* Low: Fast responses for general dialogue.
	* Medium: Balanced speed and detail.
	* High: Deep and detailed analysis.

	The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
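
As a small illustration, the system prompt can be prepended to the message list before it is passed to the pipeline. The `build_messages` helper is a hypothetical name introduced here; only the `Reasoning: <level>` text comes from this card.

```python
VALID_LEVELS = ("low", "medium", "high")


def build_messages(user_prompt, reasoning="high"):
    """Return a chat message list whose system prompt sets the reasoning level."""
    if reasoning not in VALID_LEVELS:
        raise ValueError(f"reasoning must be one of {VALID_LEVELS}")
    return [
        {"role": "system", "content": f"Reasoning: {reasoning}"},
        {"role": "user", "content": user_prompt},
    ]
```

The returned list can be used in place of `messages` in the Transformers example.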

	# Tool use

	The gpt-oss models are excellent for:
	* Web browsing (using built-in browsing tools)
	* Function calling with defined schemas
	* Agentic operations like browser tasks
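
For function calling, a tool is usually described by a JSON schema and the model's tool call is dispatched to a local function. The sketch below is an assumption-laden example: `get_current_weather`, `WEATHER_TOOL`, and `dispatch` are hypothetical names, and the shape of the incoming tool call (a name plus JSON arguments) is assumed rather than taken from this card.

```python
import json


def get_current_weather(city: str) -> str:
    """Hypothetical tool; a real version would query a weather service."""
    return f"Weather lookup for {city} is not implemented in this sketch"


# JSON-schema description of the tool. Transformers chat templates accept
# such definitions through the `tools` argument of `apply_chat_template`.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

TOOLS = {"get_current_weather": get_current_weather}


def dispatch(tool_call: dict) -> str:
    """Run a tool call given as {"name": ..., "arguments": dict or JSON string}."""
    args = tool_call["arguments"]
    if isinstance(args, str):
        args = json.loads(args)
    return TOOLS[tool_call["name"]](**args)
```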

	# Fine-tuning

	Both gpt-oss models can be fine-tuned for a variety of specialized use cases.

	This smaller model `gpt-oss-20b` can be fine-tuned on consumer hardware, whereas the larger [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) can be fine-tuned on a single H100 node.


	# Benchmark
	```py
	#humaneval
	!lm_eval --model hf --model_args pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16 --tasks humaneval --trust_remote_code --confirm_run_unsafe_code --num_fewshot 0 --gen_kwargs temperature=0.9,top_p=0.9,max_new_tokens=1024 --batch_size auto:4 --limit 10 --device cuda:0 --output_path ./eval_harness/gpt-oss-20b3
	```

	hf (pretrained=EpistemeAI/Codeforce-metatune-gpt20b,parallelize=True,dtype=bfloat16,trust_remote_code=True), gen_kwargs: (temperature=0.9,top_p=0.9,max_new_tokens=1024), limit: 10.0, num_fewshot: 0, batch_size: auto:4

	| Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr|
	|---------|------:|-----------|-----:|---------|---|----:|---|-----:|
	|humaneval| 1|create_test| 0|pass@1 | | 0.9|± | 0.1|
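
The ± 0.1 figure can be reproduced by hand. With `--limit 10`, a pass@1 of 0.9 corresponds to 9 of 10 problems passing. The sketch below assumes the harness reports the standard error of the mean computed with the sample standard deviation (dividing by n - 1); that assumption matches the numbers in the table but is not confirmed by this card.

```python
import math


def pass_at_1_with_stderr(results):
    """Return the mean pass rate and its standard error for 0/1 outcomes."""
    n = len(results)
    if n < 2:
        raise ValueError("need at least two results to estimate a standard error")
    p = sum(results) / n
    # Sample variance (n - 1 in the denominator), then standard error of the mean.
    sample_var = sum((r - p) ** 2 for r in results) / (n - 1)
    return p, math.sqrt(sample_var / n)


# Assumed outcome of the run above: 9 passes and 1 failure.
results = [1] * 9 + [0]
score, stderr = pass_at_1_with_stderr(results)
print(f"pass@1 = {score:.2f} ± {stderr:.2f}")  # prints pass@1 = 0.90 ± 0.10
```

With only 10 problems, a single additional failure would move the score by 10 points, which is why the standard error is so large.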

	# 🧠 Model Benchmark Comparison

	This table presents HumanEval benchmark scores across several large language models.

	| Model                  | HumanEval |
	|------------------------|-----------|
	| Codeforce-GPT-oss-20b  | 90        |
	| Qwen 3 235B            | 80        |
	| DeepSeek-R1 70B        | 88        |
	| Phi-4 Reasoning        | 88        |
	| Llama 4 Scout          | 78        |
	| Llama 3.3 70B          | 83        |
	| Gemma 3 27B            | 76        |
	| GPT-OSS 20B            | 73        |
	| GPT-OSS 120B           | 71        |

	---

	### 📊 Notes
	- HumanEval measures coding problem-solving and reasoning ability.
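	- Scores are normalized to a 0-100 scale for consistency across models.
	- The Codeforce-GPT-oss-20b score comes from the run above, which was limited to 10 problems (`--limit 10`) and reports pass@1 = 0.9 ± 0.1, so it should be read as preliminary.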

	---

	### 🔍 Summary
	Codeforce-GPT-oss-20b has the highest score in the table above, ahead of larger models such as Qwen 3 235B and DeepSeek-R1 70B. This suggests that its reasoning and code-synthesis gains come from the training strategy rather than from scale alone.

	--------------------------------------

	- Developed by: EpistemeAI
	- License: apache-2.0
	- Fine-tuned from model: unsloth/gpt-oss-20b-unsloth-bnb-4bit

	This gpt_oss model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

	[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

	# Citation

	```bibtex

	@misc{bi2025gptossgoodcomprehensiveevaluation,
	title={Is GPT-OSS Good? A Comprehensive Evaluation of OpenAI's Latest Open Source Models},
	author={Ziqian Bi and Keyu Chen and Chiung-Yi Tseng and Danyang Zhang and Tianyang Wang and Hongying Luo and Lu Chen and Junming Huang and Jibin Guan and Junfeng Hao and Junhao Song},
	year={2025},
	eprint={2508.12461},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2508.12461},
	}
	```