OSS-20B-FFT-NUMINA-COT

This is a full fine-tune of the unsloth/gpt-oss-20b-BF16 model, trained on the Numina Chain-of-Thought (CoT) dataset.

Model Details

Base Model

This model was fine-tuned from unsloth/gpt-oss-20b-BF16, a BF16 conversion of openai/gpt-oss-20b.

Model Architecture

  • Hidden Size: 2880
  • Number of Layers: 24
  • Attention Heads: 64
  • Key-Value Heads: 8
  • Head Dimension: 64
  • Intermediate Size: 2880
  • Vocabulary Size: 201,088
  • Max Position Embeddings: 131,072 tokens
  • Mixture of Experts: 32 experts per layer, 4 active experts per token
  • Attention Pattern: Alternating sliding window (128) and full attention layers
  • RoPE Scaling: YaRN with factor 32.0, beta_fast 32.0, beta_slow 1.0
  • RoPE Theta: 150,000
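
These values can be checked directly against the checkpoint's configuration. A minimal sketch using the Transformers AutoConfig API (the field names assume the standard gpt-oss config schema; run against the checkpoint to confirm):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Kushalkhemka/OSS-20B-FFT-NUMINA-COT")

# Spot-check a few of the values listed above
print(config.hidden_size)          # 2880
print(config.num_hidden_layers)    # 24
print(config.num_attention_heads)  # 64
print(config.num_key_value_heads)  # 8
print(config.num_local_experts)    # 32
print(config.num_experts_per_tok)  # 4
print(config.rope_theta)           # 150000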

Training Details

  • Fine-tuning Method: Full Fine-Tuning (FFT)
  • Training Dataset: Numina Chain-of-Thought dataset
  • Context Length: Up to 131K tokens
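
For orientation, a CoT supervised fine-tuning example pairs a problem with an assistant turn that writes out the full reasoning chain before the final answer. A minimal sketch of the general shape (the record and its fields are illustrative, not taken from the actual dataset):

# Illustrative only: invented record showing the general shape of a
# chat-formatted CoT example; actual Numina CoT fields may differ.
example = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {"role": "assistant",
         "content": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408."},
    ]
}

# Render to training text with the model's chat template
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Kushalkhemka/OSS-20B-FFT-NUMINA-COT")
print(tokenizer.apply_chat_template(example["messages"], tokenize=False))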

Usage

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Kushalkhemka/OSS-20B-FFT-NUMINA-COT"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Using vLLM (Recommended for Production)

vLLM provides high-performance inference with optimized attention mechanisms and continuous batching. This model is fully compatible with vLLM.

Installation

pip install vllm

Basic Usage

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    tensor_parallel_size=1,  # Increase for multi-GPU
    dtype="bfloat16",
    trust_remote_code=True
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

# Generate
prompts = [
    "Explain quantum computing:",
    "What is machine learning?",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Multi-GPU Setup

For better performance with large models, use tensor parallelism:

from vllm import LLM, SamplingParams

# Use multiple GPUs
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    tensor_parallel_size=4,  # Use 4 GPUs
    dtype="bfloat16",
    trust_remote_code=True
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Your prompt here"], sampling_params)

vLLM Server (OpenAI-Compatible API)

Start a server with OpenAI-compatible API:

python -m vllm.entrypoints.openai.api_server \
    --model Kushalkhemka/OSS-20B-FFT-NUMINA-COT \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --trust-remote-code

Then use it with OpenAI client:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

completion = client.chat.completions.create(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    messages=[
        {"role": "user", "content": "Explain quantum computing:"}
    ],
    temperature=0.7,
    max_tokens=512
)

print(completion.choices[0].message.content)

vLLM Performance Tips

  1. Tensor Parallelism: Use tensor_parallel_size to distribute the model across multiple GPUs
  2. Paged Attention: Enabled by default in vLLM for efficient memory usage
  3. Continuous Batching: Automatically handles dynamic batch sizes
  4. Quantization: For even better performance, consider using AWQ or GPTQ quantization
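
On tip 4: no quantized export of this checkpoint is published, but if an AWQ (or GPTQ) export were produced, vLLM could load it through its quantization argument. A sketch with a hypothetical repo id:

from vllm import LLM

# Hypothetical repo id; no AWQ export of this model actually exists
llm = LLM(
    model="your-org/OSS-20B-FFT-NUMINA-COT-AWQ",
    quantization="awq",
    dtype="float16",  # AWQ kernels typically run in fp16
)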

Chat Template

The model includes a chat template. Use it as follows:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Kushalkhemka/OSS-20B-FFT-NUMINA-COT")

messages = [
    {"role": "user", "content": "What is Python?"}
]

# Apply chat template
formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

print(formatted_prompt)
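
To run generation from the templated prompt, tokenize it directly and reuse the model loaded in the Transformers example above (a minimal sketch):

# Continues from the Transformers usage example (tokenizer and model loaded)
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))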

Model Configuration

Generation Config

  • EOS Token IDs: [200002, 199999, 200012]
  • BOS Token ID: 199998
  • PAD Token ID: 199999
  • Do Sample: True (default)
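
These defaults can be read back with the Transformers GenerationConfig API:

from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("Kushalkhemka/OSS-20B-FFT-NUMINA-COT")

print(gen_config.eos_token_id)  # [200002, 199999, 200012]
print(gen_config.bos_token_id)  # 199998
print(gen_config.pad_token_id)  # 199999
print(gen_config.do_sample)     # True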

Special Tokens

  • <|startoftext|> (199998): Beginning of text
  • <|endoftext|> (199999): End of text / Padding
  • <|return|> (200002): Return/EOS token
  • <|call|> (200012): Function call token
  • <|endofprompt|> (200018): End of prompt
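
The token-to-id mapping can be verified against the tokenizer itself:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Kushalkhemka/OSS-20B-FFT-NUMINA-COT")

for token in ["<|startoftext|>", "<|endoftext|>", "<|return|>",
              "<|call|>", "<|endofprompt|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))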

Hardware Requirements

Minimum Requirements

  • RAM: 40GB+ (for CPU inference)
  • VRAM: 40GB+ for the full BF16 weights on a single GPU; 24GB cards require quantization or CPU offload (see Memory Usage below)
  • Storage: 40GB+ free space

Recommended for vLLM

  • GPU: NVIDIA A100 (40GB) or better
  • VRAM: 40GB+ per GPU
  • Multi-GPU: 2-4 GPUs recommended for optimal performance

Performance

Inference Speed (vLLM)

  • Single A100 (40GB): ~50-100 tokens/second (depending on context length)
  • Multi-GPU (4x A100): ~200-400 tokens/second

Memory Usage

  • Model Weights: ~40GB (bfloat16)
  • KV Cache: Varies with batch size and sequence length
  • Peak Memory: ~45-50GB for single GPU inference
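
The weights figure follows directly from the parameter count: at 2 bytes per parameter in bfloat16, ~21B parameters come to roughly 42 GB, consistent with the estimate above. A quick check:

params = 21e9        # ~21B parameters
bytes_per_param = 2  # bfloat16
print(f"{params * bytes_per_param / 1e9:.0f} GB")  # ~42 GB of weights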

Limitations

  • The model may generate incorrect or biased information
  • Performance depends on hardware and inference framework
  • Long context generation (131K tokens) requires significant memory
  • Mixture of Experts routing may occasionally select suboptimal experts

Citation

If you use this model, please cite:

@misc{oss-20b-fft-numina-cot,
  author = {Kushalkhemka},
  title = {OSS-20B-FFT-NUMINA-COT: Fine-tuned GPT-OSS-20B with Numina Chain-of-Thought},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Kushalkhemka/OSS-20B-FFT-NUMINA-COT}}
}

License

This model is licensed under Apache 2.0.
