GPT-OSS 20B ONNX CUDA (RTN Quantized)

This model is an optimized version of gpt-oss-20b for local inference on CUDA GPUs, using RTN (Round-to-Nearest) quantization.

Model Summary

Developed by: Microsoft

Model Type: ONNX

License: Apache-2.0

Optimization: RTN Quantization for GPU memory efficiency

Model Size: 11.8 GB (model.onnx.data)

Technical Description

This repository contains a conversion of the gpt-oss-20b model tailored for local inference on CUDA-enabled GPUs. By combining the ONNX format with RTN quantization, the model achieves a significant reduction in VRAM footprint while preserving the core capabilities of the base architecture.
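For intuition, round-to-nearest quantization maps each weight to the closest point on a uniform low-bit integer grid, storing only the integers plus a per-channel scale; it needs no calibration data, which keeps conversion simple at a small accuracy cost. Below is a minimal NumPy sketch of symmetric per-row RTN. The bit width, grouping, and symmetry here are illustrative assumptions; the card does not specify the exact scheme applied to this model.

```python
import numpy as np

def rtn_quantize(w: np.ndarray, bits: int = 4):
    """Symmetric per-row round-to-nearest quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)         # avoid divide-by-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def rtn_dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Reconstruct an approximation of the original weights at inference time.
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure reconstruction error.
w = np.random.randn(8, 16).astype(np.float32)
q, scale = rtn_quantize(w, bits=4)
err = np.abs(w - rtn_dequantize(q, scale)).max()
print(f"max abs error: {err:.4f}")
```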

Base Model Information

For detailed information regarding the architecture, training data, and intended use cases, please refer to the original gpt-oss-20b model on Azure AI Foundry.

Deployment

Compatible with ONNX Runtime (ORT) using the CUDAExecutionProvider.
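As a sketch, loading the model with the CUDA execution provider looks like the following. The `model.onnx` filename is an assumption based on the `model.onnx.data` file listed above, and the token-generation loop itself depends on your tooling (e.g. onnxruntime-genai) and is not shown here.

```python
import onnxruntime as ort

# Requires the onnxruntime-gpu package and a CUDA-capable GPU.
# Prefer the CUDA EP, falling back to CPU if CUDA is unavailable.
session = ort.InferenceSession(
    "model.onnx",  # assumed filename; external weights live in model.onnx.data
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Confirm that CUDAExecutionProvider was actually selected.
print(session.get_providers())
```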

Disclaimer

The model is an optimization of the base model; any risk associated with its use is the responsibility of the user. Please verify and test it for your specific scenarios. There may be slight differences in output compared to the base model due to the optimizations applied. Note that these optimizations are distinct from fine-tuning and do not alter the intended uses or capabilities of the model.
