Instructions for using ALGOTECH/QwQ-32B-TRT with libraries and local apps.
- Libraries
- Transformers
How to use ALGOTECH/QwQ-32B-TRT with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ALGOTECH/QwQ-32B-TRT")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("ALGOTECH/QwQ-32B-TRT", dtype="auto")
```
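For actual generation, the chat-template flow is usually more useful than the generic loader above. A minimal sketch using standard Transformers APIs (the prompt and sampling values are illustrative, not from this repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ALGOTECH/QwQ-32B-TRT"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")

# Build a chat prompt with the model's chat template
messages = [{"role": "user", "content": "How many r's are in the word \"strawberry\"?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample a response (values are illustrative)
outputs = model.generate(inputs, max_new_tokens=512, temperature=0.6, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```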
- TensorRT
How to use ALGOTECH/QwQ-32B-TRT with TensorRT:
```
# No code snippets available yet for this library.
# To use this model, check the repository files and the library's documentation.
# Want to help? PRs adding snippets are welcome at:
# https://github.com/huggingface/huggingface.js
```
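In the absence of an official snippet, here is a minimal sketch of the standard TensorRT 8.x Python flow for building an FP16 engine from an ONNX export. The file names are hypothetical, and exporting a 32B-parameter LLM to a single ONNX graph is itself a non-trivial step; treat this as an outline of the API, not a turnkey recipe:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str, engine_path: str) -> None:
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX export (hypothetical file produced beforehand)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)    # FP16, as recommended below
    # config.set_flag(trt.BuilderFlag.INT8)  # INT8 additionally needs a calibrator

    serialized = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(serialized)

build_engine("qwq-32b.onnx", "qwq-32b.engine")
```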
- Local Apps
- vLLM
How to use ALGOTECH/QwQ-32B-TRT with vLLM:
Install from pip and serve the model
```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ALGOTECH/QwQ-32B-TRT"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ALGOTECH/QwQ-32B-TRT",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
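Since the vLLM server exposes an OpenAI-compatible API, you can also call it from Python with the official `openai` client instead of curl (the same pattern works for the SGLang server below, with the port changed to 30000):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the API key is unused but required by the client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="ALGOTECH/QwQ-32B-TRT",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```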
- SGLang
How to use ALGOTECH/QwQ-32B-TRT with SGLang:
Install from pip and serve the model
```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ALGOTECH/QwQ-32B-TRT" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ALGOTECH/QwQ-32B-TRT",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
Use Docker images
```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "ALGOTECH/QwQ-32B-TRT" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ALGOTECH/QwQ-32B-TRT",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
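SGLang also ships an offline engine API if you want to skip the HTTP server entirely. A minimal sketch (API names as in recent SGLang releases; check the SGLang docs for your version):

```python
import sglang as sgl

# Offline engine: loads the model in-process instead of serving HTTP
llm = sgl.Engine(model_path="ALGOTECH/QwQ-32B-TRT")

outputs = llm.generate(
    ["Once upon a time,"],
    {"temperature": 0.5, "max_new_tokens": 512},
)
print(outputs[0]["text"])
llm.shutdown()
```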
- Docker Model Runner
How to use ALGOTECH/QwQ-32B-TRT with Docker Model Runner:
```shell
docker model run hf.co/ALGOTECH/QwQ-32B-TRT
```
# QwQ-32B TensorRT Optimized Version
## Model Introduction
This repository contains a TensorRT-optimized build of the original QwQ-32B model, with the following features:
- TensorRT Acceleration: Optimized for inference using NVIDIA TensorRT
- Performance Boost: Roughly 1.8× higher inference throughput than the original PyTorch implementation (see the benchmarks below)
- Hardware Optimization: Deeply optimized for NVIDIA GPUs
- Precision Retention: The FP16 build maintains the original model's inference accuracy
## System Requirements
### Hardware Requirements
- GPU: NVIDIA GPU (Ampere architecture or newer recommended, e.g., A100, H100, RTX 3090/4090)
- VRAM: At least 64GB GPU memory (FP16 precision)
### Software Requirements
- CUDA: Version 11.8 or higher
- TensorRT: Version 8.6 or higher
- Python: 3.8-3.10
- Dependencies:
```shell
pip install tensorrt transformers polygraphy
```
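Before building engines, it is worth sanity-checking that the installed versions meet the requirements above. A quick check (assumes PyTorch is also installed, e.g. as part of your Transformers setup):

```python
import tensorrt as trt
import torch

print("TensorRT:", trt.__version__)                 # should be >= 8.6
print("CUDA (PyTorch build):", torch.version.cuda)  # should be >= 11.8
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"VRAM: {total / 1e9:.0f} GB total")      # >= 64 GB needed for FP16
```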
## Performance Benchmarks
| Environment | Throughput (tokens/sec) | Latency (ms/token) | VRAM Usage |
|---|---|---|---|
| Original (A100 80GB) | 45 | 22 | 58GB |
| TensorRT (A100 80GB) | 80 | 12.5 | 52GB |
Test conditions: FP16 precision, input length 512, output length 128, batch size = 1. At batch size 1 the two columns are consistent, since latency is the reciprocal of throughput: 1000 / 80 = 12.5 ms/token for TensorRT, 1000 / 45 ≈ 22 ms/token for the original.
## Deployment Recommendations
Precision Selection:
- FP16: Recommended for most scenarios, balancing precision and performance
- INT8: Requires additional quantization calibration, further reducing VRAM usage
Optimization Configuration:
```python
# Recommended configuration when building the TRT engine
config = {
    "precision": "fp16",
    "max_input_length": 8192,
    "opt_batch_size": [1, 2, 4],
    "max_output_length": 2048,
}
```
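The `opt_batch_size` values above map naturally onto a TensorRT optimization profile, which is also how the pre-configured dynamic batch sizes mentioned under Notes get fixed at build time. A sketch of how such a profile might be attached to the builder config (the input tensor name `input_ids` and the (batch, sequence) shape layout are assumptions; match them to your actual network):

```python
import tensorrt as trt

def add_profile(builder: trt.Builder, config: trt.IBuilderConfig) -> None:
    # Dynamic shapes must be pinned to (min, opt, max) ranges at build time.
    # Batch range 1..4 with 2 as the optimum mirrors opt_batch_size=[1, 2, 4];
    # sequence range 1..8192 mirrors max_input_length=8192.
    profile = builder.create_optimization_profile()
    profile.set_shape(
        "input_ids",    # assumed input tensor name
        min=(1, 1),     # smallest batch, shortest sequence
        opt=(2, 4096),  # shapes TensorRT tunes kernels for
        max=(4, 8192),  # largest batch, longest sequence
    )
    config.add_optimization_profile(profile)
```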
Long Sequence Handling:
- If processing sequences longer than 8K, ensure the YaRN extension is enabled, as sketched below
- Set an appropriate `max_input_length` when building the TRT engine
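For reference, the upstream QwQ-32B card enables YaRN through a `rope_scaling` entry in `config.json`; the equivalent settings are shown here as a Python dict (values taken from the upstream Qwen card, so verify them against this repository before relying on them):

```python
# Equivalent of the rope_scaling entry in config.json (upstream QwQ-32B values)
rope_scaling = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
```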
## Notes
Model Differences:
- This version is optimized for inference and does not support training or fine-tuning
- Some dynamic control features (e.g., dynamic batch size) must be pre-configured during engine building
Version Compatibility:
- Ensure the TensorRT version matches the CUDA version
- Different GPU architectures require separate engine builds
Quantization Information:
- FP16 version maintains the original model's precision
- INT8 version may have slight precision loss
## Acknowledgments
This optimized version is based on the following original work:
```bibtex
@misc{qwq32b,
  title  = {QwQ-32B: Embracing the Power of Reinforcement Learning},
  url    = {https://qwenlm.github.io/blog/qwq-32b/},
  author = {Qwen Team},
  month  = {March},
  year   = {2025}
}
```
## Issue Reporting
For technical issues, please submit an issue via:
- GitHub Issues
- Hugging Face Discussions
Note: Use of this model is subject to the original model's Apache 2.0 license.