Instructions for using ALGOTECH/QwQ-32B-TRT with libraries and local apps.
- Libraries
- Transformers
How to use ALGOTECH/QwQ-32B-TRT with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ALGOTECH/QwQ-32B-TRT")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("ALGOTECH/QwQ-32B-TRT", dtype="auto")
```
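For actual generation, the chat-template flow is usually more useful than the generic loader above. A minimal sketch using standard Transformers APIs (the prompt and sampling values are illustrative, not from this repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ALGOTECH/QwQ-32B-TRT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

# Build a chat prompt with the model's chat template
messages = [{"role": "user", "content": "How many r's are in the word \"strawberry\"?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample a response (values are illustrative)
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```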
- TensorRT
How to use ALGOTECH/QwQ-32B-TRT with TensorRT:
```
# No code snippets available yet for this library.
# To use this model, check the repository files and the library's documentation.
# Want to help? PRs adding snippets are welcome at:
# https://github.com/huggingface/huggingface.js
```
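In the absence of an official snippet, here is a minimal sketch of the standard TensorRT 8.x Python flow for building an FP16 engine from an ONNX export. The file names are hypothetical, and exporting a 32B-parameter LLM to a single ONNX graph is itself a non-trivial step; treat this as an outline of the API, not a turnkey recipe:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, engine_path: str) -> None:
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX export (hypothetical file produced beforehand)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)    # FP16, as recommended below
    # config.set_flag(trt.BuilderFlag.INT8)  # INT8 additionally needs a calibrator

    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

build_engine("qwq-32b.onnx", "qwq-32b.engine")
```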
- Local Apps
- vLLM
How to use ALGOTECH/QwQ-32B-TRT with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ALGOTECH/QwQ-32B-TRT"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ALGOTECH/QwQ-32B-TRT",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
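Since the vLLM server exposes an OpenAI-compatible API, you can also call it from Python with the official `openai` client instead of curl (the same pattern works for the SGLang server below, with the port changed to 30000):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is unused but required by the client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="ALGOTECH/QwQ-32B-TRT",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```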
- SGLang
How to use ALGOTECH/QwQ-32B-TRT with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ALGOTECH/QwQ-32B-TRT" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ALGOTECH/QwQ-32B-TRT",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "ALGOTECH/QwQ-32B-TRT" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ALGOTECH/QwQ-32B-TRT",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
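SGLang also ships an offline engine API if you want to skip the HTTP server entirely. A minimal sketch (API names as in recent SGLang releases; check the SGLang docs for your version):

```python
import sglang as sgl

# Offline engine: loads the model in-process instead of serving HTTP
llm = sgl.Engine(model_path="ALGOTECH/QwQ-32B-TRT")

outputs = llm.generate(
    ["Once upon a time,"],
    {"temperature": 0.5, "max_new_tokens": 512},
)
print(outputs[0]["text"])
llm.shutdown()
```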
- Docker Model Runner
How to use ALGOTECH/QwQ-32B-TRT with Docker Model Runner:
```shell
docker model run hf.co/ALGOTECH/QwQ-32B-TRT
```
# QwQ-32B TensorRT Optimized Version
## Model Introduction
This repository contains a TensorRT-optimized build of the original QwQ-32B model, with the following features:
- TensorRT Acceleration: Optimized for inference using NVIDIA TensorRT
- Performance Boost: Roughly 1.8× higher inference throughput than the original PyTorch implementation (see the benchmarks below)
- Hardware Optimization: Deeply optimized for NVIDIA GPUs
- Precision Retention: The FP16 build maintains the original model's inference accuracy
## System Requirements
### Hardware Requirements
- GPU: NVIDIA GPU (Ampere architecture or newer recommended, e.g., A100, H100, RTX 3090/4090)
- VRAM: At least 64GB GPU memory (FP16 precision)
### Software Requirements
- CUDA: Version 11.8 or higher
- TensorRT: Version 8.6 or higher
- Python: 3.8-3.10
- Dependencies:
```shell
pip install tensorrt transformers polygraphy
```
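Before building engines, it is worth sanity-checking that the installed versions meet the requirements above. A quick check (assumes PyTorch is also installed, e.g. as part of your Transformers setup):

```python
import tensorrt as trt
import torch

print("TensorRT:", trt.__version__)                 # should be >= 8.6
print("CUDA (PyTorch build):", torch.version.cuda)  # should be >= 11.8
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM: {total / 1e9:.0f} GB total")      # >= 64 GB needed for FP16
```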
## Performance Benchmarks
| Environment | Throughput (tokens/sec) | Latency (ms/token) | VRAM Usage |
|---|---|---|---|
| Original (A100 80GB) | 45 | 22 | 58GB |
| TensorRT (A100 80GB) | 80 | 12.5 | 52GB |
Test conditions: FP16 precision, input length 512, output length 128, batch size = 1. At batch size 1 the two columns are consistent, since latency is the reciprocal of throughput: 1000 / 80 = 12.5 ms/token for TensorRT, 1000 / 45 ≈ 22 ms/token for the original.
## Deployment Recommendations
Precision Selection:
- FP16: Recommended for most scenarios, balancing precision and performance
- INT8: Requires additional quantization calibration, further reducing VRAM usage
Optimization Configuration:
```python
# Recommended configuration when building the TRT engine
config = {
    "precision": "fp16",
    "max_input_length": 8192,
    "opt_batch_size": [1, 2, 4],
    "max_output_length": 2048,
}
```
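The `opt_batch_size` values above map naturally onto a TensorRT optimization profile, which is also how the pre-configured dynamic batch sizes mentioned under Notes get fixed at build time. A sketch of how such a profile might be attached to the builder config (the input tensor name `input_ids` and the (batch, sequence) shape layout are assumptions; match them to your actual network):

```python
import tensorrt as trt

def add_profile(builder: trt.Builder, config: trt.IBuilderConfig) -> None:
    # Dynamic shapes must be pinned to (min, opt, max) ranges at build time.
    # Batch range 1..4 with 2 as the optimum mirrors opt_batch_size=[1, 2, 4];
    # sequence range 1..8192 mirrors max_input_length=8192.
    profile = builder.create_optimization_profile()
    profile.set_shape(
        "input_ids",    # assumed input tensor name
        min=(1, 1),     # smallest batch, shortest sequence
        opt=(2, 4096),  # shapes TensorRT tunes kernels for
        max=(4, 8192),  # largest batch, longest sequence
    )
    config.add_optimization_profile(profile)
```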
Long Sequence Handling:
- If processing sequences longer than 8K, ensure the YaRN extension is enabled, as sketched below
- Set an appropriate `max_input_length` when building the TRT engine
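For reference, the upstream QwQ-32B card enables YaRN through a `rope_scaling` entry in `config.json`; the equivalent settings are shown here as a Python dict (values taken from the upstream Qwen card, so verify them against this repository before relying on them):

```python
# Equivalent of the rope_scaling entry in config.json (upstream QwQ-32B values)
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
```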
## Notes
Model Differences:
- This version is optimized for inference and does not support training or fine-tuning
- Some dynamic control features (e.g., dynamic batch size) must be pre-configured during engine building
Version Compatibility:
- Ensure the TensorRT version matches the CUDA version
- Different GPU architectures require separate engine builds
Quantization Information:
- FP16 version maintains the original model's precision
- INT8 version may have slight precision loss
## Acknowledgments
This optimized version is based on the following original work:
```bibtex
@misc{qwq32b,
  title  = {QwQ-32B: Embracing the Power of Reinforcement Learning},
  url    = {https://qwenlm.github.io/blog/qwq-32b/},
  author = {Qwen Team},
  month  = {March},
  year   = {2025}
}
```
## Issue Reporting
For technical issues, please submit an issue via:
- GitHub Issues
- Hugging Face Discussions
Note: Use of this model is subject to the original model's Apache 2.0 license.