# OSS-20B-FFT-NUMINA-COT

This is a fine-tuned version of the unsloth/gpt-oss-20b-BF16 model, trained with the Numina Chain-of-Thought (CoT) methodology.
## Model Details

### Base Model
- Base Model: unsloth/gpt-oss-20b-BF16
- Architecture: GPT-OSS (Mixture of Experts)
- Parameters: ~20B
- Precision: bfloat16
### Model Architecture
- Hidden Size: 2880
- Number of Layers: 24
- Attention Heads: 64
- Key-Value Heads: 8
- Head Dimension: 64
- Intermediate Size: 2880
- Vocabulary Size: 201,088
- Max Position Embeddings: 131,072 tokens
- Mixture of Experts: 32 experts per layer, 4 active experts per token
- Attention Pattern: Alternating sliding window (128) and full attention layers
- RoPE Scaling: YaRN with factor 32.0, beta_fast 32.0, beta_slow 1.0
- RoPE Theta: 150,000
## Training Details
- Fine-tuning Method: Full Fine-Tuning (FFT)
- Training Dataset: Numina Chain-of-Thought dataset
- Context Length: Up to 131K tokens
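
For context, the sketch below shows one way such a full fine-tune could be set up. It uses TRL's `SFTTrainer` rather than the Unsloth stack credited below, and the dataset id (`AI-MO/NuminaMath-CoT`), the message mapping, and every hyperparameter are illustrative assumptions rather than the exact recipe behind this checkpoint; a 20B full fine-tune would also need multi-GPU sharding (e.g. FSDP or DeepSpeed) in practice.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical reproduction sketch: dataset id and hyperparameters are
# assumptions, not the exact recipe used to train this checkpoint.
dataset = load_dataset("AI-MO/NuminaMath-CoT", split="train")

def to_messages(example):
    # Turn each problem/solution pair into chat messages so SFTTrainer
    # can format them with the model's chat template.
    return {"messages": [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": example["solution"]},
    ]}

dataset = dataset.map(to_messages, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="unsloth/gpt-oss-20b-BF16",  # base model, full fine-tune (no adapters)
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="oss-20b-fft-numina-cot",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()
```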
## Usage

### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Kushalkhemka/OSS-20B-FFT-NUMINA-COT"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Using vLLM (Recommended for Production)
vLLM provides high-performance inference with optimized attention mechanisms and continuous batching. This model is fully compatible with vLLM.
#### Installation

```bash
pip install vllm
```
#### Basic Usage

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    tensor_parallel_size=1,  # Increase for multi-GPU
    dtype="bfloat16",
    trust_remote_code=True,
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate
prompts = [
    "Explain quantum computing:",
    "What is machine learning?",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
#### Multi-GPU Setup
For better performance with large models, use tensor parallelism:
```python
from vllm import LLM, SamplingParams

# Use multiple GPUs
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    tensor_parallel_size=4,  # Use 4 GPUs
    dtype="bfloat16",
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Your prompt here"], sampling_params)
```
#### vLLM Server (OpenAI-Compatible API)
Start a server with OpenAI-compatible API:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Kushalkhemka/OSS-20B-FFT-NUMINA-COT \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --trust-remote-code
```
Then use it with the OpenAI client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

completion = client.chat.completions.create(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    messages=[
        {"role": "user", "content": "Explain quantum computing:"}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)
```
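
The same endpoint also supports token streaming through the standard OpenAI client interface, which vLLM's server implements; a minimal sketch (same server and model name as above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    messages=[{"role": "user", "content": "Explain quantum computing:"}],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()
```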
#### vLLM Performance Tips

- Tensor Parallelism: Use `tensor_parallel_size` to distribute the model across multiple GPUs
- Paged Attention: Enabled by default in vLLM for efficient memory usage
- Continuous Batching: Automatically handles dynamic batch sizes
- Quantization: For even better performance, consider using AWQ or GPTQ quantization
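
As a concrete example of the memory-related knobs, the sketch below sets `gpu_memory_utilization` and `max_model_len` on the `LLM` constructor; both are standard vLLM parameters, but the values shown are illustrative and not tuned for this model:

```python
from vllm import LLM

# Illustrative tuning values, not benchmarked for this model.
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.90,  # fraction of VRAM for weights + KV cache
    max_model_len=8192,           # cap context length to shrink the KV-cache pool
)
```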
## Chat Template
The model includes a chat template. Use it as follows:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Kushalkhemka/OSS-20B-FFT-NUMINA-COT")

messages = [
    {"role": "user", "content": "What is Python?"}
]

# Apply chat template
formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(formatted_prompt)
```
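
To run generation with the template applied end-to-end, the formatted prompt can be fed straight into `generate`; a minimal sketch continuing the example above (sampling settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```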
## Model Configuration

### Generation Config
- EOS Token IDs: [200002, 199999, 200012]
- BOS Token ID: 199998
- PAD Token ID: 199999
- Do Sample: True (default)
### Special Tokens

- `<|startoftext|>` (199998): Beginning of text
- `<|endoftext|>` (199999): End of text / padding
- `<|return|>` (200002): Return/EOS token
- `<|call|>` (200012): Function call token
- `<|endofprompt|>` (200018): End of prompt
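
When overriding the saved generation config, the full terminator set can be passed to `generate` explicitly so decoding stops on any of the tokens above; a minimal sketch, assuming `model`, `tokenizer`, and `inputs` from the Transformers example earlier:

```python
# Stop on any of the model's terminator tokens; IDs taken from the list above.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=[200002, 199999, 200012],  # <|return|>, <|endoftext|>, <|call|>
    pad_token_id=199999,                    # <|endoftext|> doubles as the pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```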
## Hardware Requirements

### Minimum Requirements
- RAM: 40GB+ (for CPU inference)
- VRAM: 24GB+ (for single GPU inference)
- Storage: 40GB+ free space
### Recommended for vLLM
- GPU: NVIDIA A100 (40GB) or better
- VRAM: 40GB+ per GPU
- Multi-GPU: 2-4 GPUs recommended for optimal performance
## Performance

### Inference Speed (vLLM)
- Single A100 (40GB): ~50-100 tokens/second (depending on context length)
- Multi-GPU (4x A100): ~200-400 tokens/second
### Memory Usage
- Model Weights: ~40GB (bfloat16)
- KV Cache: Varies with batch size and sequence length (see the estimate below)
- Peak Memory: ~45-50GB for single GPU inference
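
The KV-cache term can be estimated from the architecture numbers above (24 layers, 8 KV heads, head dimension 64, bf16). The sketch below ignores the savings from the sliding-window layers, so it is an upper bound:

```python
# Upper-bound KV-cache estimate from the architecture specs above
# (ignores the smaller cache of the sliding-window attention layers).
layers, kv_heads, head_dim, bytes_per_val = 24, 8, 64, 2  # bf16 = 2 bytes

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(kv_bytes_per_token)                    # 49152 bytes = 48 KiB per token

seq_len = 131_072                            # full context window
print(kv_bytes_per_token * seq_len / 2**30)  # ~6.0 GiB per full-length sequence
```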
## Limitations
- The model may generate incorrect or biased information
- Performance depends on hardware and inference framework
- Long context generation (131K tokens) requires significant memory
- Mixture of Experts routing may occasionally select suboptimal experts
## Citation
If you use this model, please cite:
```bibtex
@misc{oss-20b-fft-numina-cot,
  author       = {Kushalkhemka},
  title        = {OSS-20B-FFT-NUMINA-COT: Fine-tuned GPT-OSS-20B with Numina Chain-of-Thought},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Kushalkhemka/OSS-20B-FFT-NUMINA-COT}}
}
```
## Acknowledgments
- Base model: unsloth/gpt-oss-20b-BF16
- Training framework: Unsloth
- Inference optimization: vLLM
## License
This model is licensed under Apache 2.0.