# OSS-20B-FFT-NUMINA-COT

This is a fine-tuned version of the unsloth/gpt-oss-20b-BF16 model, trained with the Numina Chain-of-Thought (CoT) methodology.
## Model Details

### Base Model
- Base Model: unsloth/gpt-oss-20b-BF16
- Architecture: GPT-OSS (Mixture of Experts)
- Parameters: ~20B
- Precision: bfloat16
### Model Architecture
- Hidden Size: 2880
- Number of Layers: 24
- Attention Heads: 64
- Key-Value Heads: 8
- Head Dimension: 64
- Intermediate Size: 2880
- Vocabulary Size: 201,088
- Max Position Embeddings: 131,072 tokens
- Mixture of Experts: 32 experts per layer, 4 active experts per token
- Attention Pattern: Alternating sliding window (128) and full attention layers
- RoPE Scaling: YaRN with factor 32.0, beta_fast 32.0, beta_slow 1.0
- RoPE Theta: 150,000
## Training Details
- Fine-tuning Method: Full Fine-Tuning (FFT)
- Training Dataset: Numina Chain-of-Thought dataset
- Context Length: Up to 131K tokens
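
For context, the sketch below shows one way such a full fine-tune could be set up. It uses TRL's `SFTTrainer` rather than the Unsloth stack credited below, and the dataset id (`AI-MO/NuminaMath-CoT`), the message mapping, and every hyperparameter are illustrative assumptions rather than the exact recipe behind this checkpoint; a 20B full fine-tune would also need multi-GPU sharding (e.g. FSDP or DeepSpeed) in practice.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical reproduction sketch: dataset id and hyperparameters are
# assumptions, not the exact recipe used to train this checkpoint.
dataset = load_dataset("AI-MO/NuminaMath-CoT", split="train")

def to_messages(example):
    # Turn each problem/solution pair into chat messages so SFTTrainer
    # can format them with the model's chat template.
    return {"messages": [
        {"role": "user", "content": example["problem"]},
        {"role": "assistant", "content": example["solution"]},
    ]}

dataset = dataset.map(to_messages, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model="unsloth/gpt-oss-20b-BF16",  # base model, full fine-tune (no adapters)
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="oss-20b-fft-numina-cot",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()
```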
## Usage

### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Kushalkhemka/OSS-20B-FFT-NUMINA-COT"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Generate text
prompt = "Explain quantum computing in simple terms:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Using vLLM (Recommended for Production)
vLLM provides high-performance inference with optimized attention mechanisms and continuous batching. This model is fully compatible with vLLM.
#### Installation

```bash
pip install vllm
```
#### Basic Usage

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    tensor_parallel_size=1,  # Increase for multi-GPU
    dtype="bfloat16",
    trust_remote_code=True,
)

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate
prompts = [
    "Explain quantum computing:",
    "What is machine learning?",
]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
#### Multi-GPU Setup
For better performance with large models, use tensor parallelism:
```python
from vllm import LLM, SamplingParams

# Use multiple GPUs
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    tensor_parallel_size=4,  # Use 4 GPUs
    dtype="bfloat16",
    trust_remote_code=True,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=512)
outputs = llm.generate(["Your prompt here"], sampling_params)
```
#### vLLM Server (OpenAI-Compatible API)
Start a server with OpenAI-compatible API:
```bash
python -m vllm.entrypoints.openai.api_server \
    --model Kushalkhemka/OSS-20B-FFT-NUMINA-COT \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --trust-remote-code
```
Then use it with the OpenAI client:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)

completion = client.chat.completions.create(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    messages=[
        {"role": "user", "content": "Explain quantum computing:"}
    ],
    temperature=0.7,
    max_tokens=512,
)
print(completion.choices[0].message.content)
```
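
The same endpoint also supports token streaming through the standard OpenAI client interface, which vLLM's server implements; a minimal sketch (same server and model name as above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    messages=[{"role": "user", "content": "Explain quantum computing:"}],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk may carry no content
        print(delta, end="", flush=True)
print()
```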
#### vLLM Performance Tips

- Tensor Parallelism: Use `tensor_parallel_size` to distribute the model across multiple GPUs
- Paged Attention: Enabled by default in vLLM for efficient memory usage
- Continuous Batching: Automatically handles dynamic batch sizes
- Quantization: For even better performance, consider using AWQ or GPTQ quantization
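
As a concrete example of the memory-related knobs, the sketch below sets `gpu_memory_utilization` and `max_model_len` on the `LLM` constructor; both are standard vLLM parameters, but the values shown are illustrative and not tuned for this model:

```python
from vllm import LLM

# Illustrative tuning values, not benchmarked for this model.
llm = LLM(
    model="Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.90,  # fraction of VRAM for weights + KV cache
    max_model_len=8192,           # cap context length to shrink the KV-cache pool
)
```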
## Chat Template
The model includes a chat template. Use it as follows:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Kushalkhemka/OSS-20B-FFT-NUMINA-COT")

messages = [
    {"role": "user", "content": "What is Python?"}
]

# Apply chat template
formatted_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(formatted_prompt)
```
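
To run generation with the template applied end-to-end, the formatted prompt can be fed straight into `generate`; a minimal sketch continuing the example above (sampling settings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Kushalkhemka/OSS-20B-FFT-NUMINA-COT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```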
## Model Configuration

### Generation Config
- EOS Token IDs: [200002, 199999, 200012]
- BOS Token ID: 199998
- PAD Token ID: 199999
- Do Sample: True (default)
### Special Tokens

- `<|startoftext|>` (199998): Beginning of text
- `<|endoftext|>` (199999): End of text / padding
- `<|return|>` (200002): Return/EOS token
- `<|call|>` (200012): Function call token
- `<|endofprompt|>` (200018): End of prompt
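
When overriding the saved generation config, the full terminator set can be passed to `generate` explicitly so decoding stops on any of the tokens above; a minimal sketch, assuming `model`, `tokenizer`, and `inputs` from the Transformers example earlier:

```python
# Stop on any of the model's terminator tokens; IDs taken from the list above.
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    eos_token_id=[200002, 199999, 200012],  # <|return|>, <|endoftext|>, <|call|>
    pad_token_id=199999,                    # <|endoftext|> doubles as the pad token
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```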
## Hardware Requirements

### Minimum Requirements
- RAM: 40GB+ (for CPU inference)
- VRAM: 24GB+ (for single GPU inference)
- Storage: 40GB+ free space
### Recommended for vLLM
- GPU: NVIDIA A100 (40GB) or better
- VRAM: 40GB+ per GPU
- Multi-GPU: 2-4 GPUs recommended for optimal performance
## Performance

### Inference Speed (vLLM)
- Single A100 (40GB): ~50-100 tokens/second (depending on context length)
- Multi-GPU (4x A100): ~200-400 tokens/second
### Memory Usage
- Model Weights: ~40GB (bfloat16)
- KV Cache: Varies with batch size and sequence length (see the estimate below)
- Peak Memory: ~45-50GB for single GPU inference
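
The KV-cache term can be estimated from the architecture numbers above (24 layers, 8 KV heads, head dimension 64, bf16). The sketch below ignores the savings from the sliding-window layers, so it is an upper bound:

```python
# Upper-bound KV-cache estimate from the architecture specs above
# (ignores the smaller cache of the sliding-window attention layers).
layers, kv_heads, head_dim, bytes_per_val = 24, 8, 64, 2  # bf16 = 2 bytes

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(kv_bytes_per_token)                    # 49152 bytes = 48 KiB per token

seq_len = 131_072                            # full context window
print(kv_bytes_per_token * seq_len / 2**30)  # ~6.0 GiB per full-length sequence
```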
## Limitations
- The model may generate incorrect or biased information
- Performance depends on hardware and inference framework
- Long context generation (131K tokens) requires significant memory
- Mixture of Experts routing may occasionally select suboptimal experts
## Citation
If you use this model, please cite:
```bibtex
@misc{oss-20b-fft-numina-cot,
  author       = {Kushalkhemka},
  title        = {OSS-20B-FFT-NUMINA-COT: Fine-tuned GPT-OSS-20B with Numina Chain-of-Thought},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Kushalkhemka/OSS-20B-FFT-NUMINA-COT}}
}
```
## Acknowledgments
- Base model: unsloth/gpt-oss-20b-BF16
- Training framework: Unsloth
- Inference optimization: vLLM
## License
This model is licensed under Apache 2.0.