---
license: apache-2.0
language:
- en
tags:
- cuda
- tensorrt
- quantization
- tensor-cores
pipeline_tag: text-generation
---

# 🌌 MQ-Cognitive-Base // Quantised LLM Checkpoints

This repository hosts mixed-precision quantized model weight checkpoints produced by the **MAH Quantum Research Scholars** cohort. The weights are built for accelerated inference on **NVIDIA® CUDA®** GPUs with the **TensorRT™-LLM** runtime.

---


## ⚡ Architectural Specifications

* **Quantization Framework:** Post-Training Quantization (PTQ) / Activation-aware Weight Quantization (AWQ)
* **Target Precision:** INT8 / INT4 weight-only quantization
* **Hardware Optimization:** NVIDIA Compute Capability 8.0+ (Ampere, Hopper, Blackwell architectures)
* **Primary Infrastructure Node:** NVIDIA® NGC Org ID `0963318590610147`
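
As a rough illustration of what INT4 weight-only quantization does to a single group of weights, the Python sketch below applies plain symmetric round-to-nearest quantization with one shared scale per group. It is not the AWQ procedure (which additionally rescales weight channels using activation statistics) and it is not the pipeline used to produce these checkpoints; the function names and the sample values are hypothetical.

```python
from typing import List, Tuple


def quantize_group_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one weight group to INT4."""
    max_abs = max(abs(w) for w in weights)
    # Signed 4-bit integers span [-8, 7]; map the largest magnitude to 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_group(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point weights from INT4 codes."""
    return [q * scale for q in quantized]


# Hypothetical sample group; real groups usually hold 64 or 128 weights.
weights = [0.70, -0.42, 0.14, 0.00]
codes, scale = quantize_group_int4(weights)
restored = dequantize_group(codes, scale)

for w, q, r in zip(weights, codes, restored):
    print(f"original={w:+.2f}  code={q:+d}  restored={r:+.2f}")
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the error that activation-aware methods such as AWQ try to place where it affects the output least.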

---

## 🔬 Deployment & Performance Intent

These checkpoints are structured to maximize token throughput and minimize memory footprint under heavy production inference workloads. With the weights compressed to lower bit-widths, our distributed node network achieves sub-60 ms Time-To-First-Token (TTFT) on local compute clusters.
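
To make the memory claim concrete, the following Python sketch estimates weight storage for FP16 versus grouped INT4. It is an idealized back-of-the-envelope calculation, not a measurement of these checkpoints: the parameter count, the group size of 128, and the 16-bit per-group scale are all assumptions chosen for illustration.

```python
def weight_bytes(num_params: int, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in bytes."""
    return num_params * bits_per_weight / 8


def int4_grouped_bits(group_size: int, scale_bits: int = 16) -> float:
    """Effective bits per weight: a 4-bit code plus one shared scale per group."""
    return 4 + scale_bits / group_size


# Hypothetical model size; substitute the parameter count of the actual model.
num_params = 7_000_000_000

fp16_bytes = weight_bytes(num_params, 16)
int4_bytes = weight_bytes(num_params, int4_grouped_bits(group_size=128))
reduction = 1 - int4_bytes / fp16_bytes

gib = 2**30
print(f"FP16 weights: {fp16_bytes / gib:.1f} GiB")
print(f"INT4 weights: {int4_bytes / gib:.1f} GiB")
print(f"Reduction:    {reduction:.1%}")
```

Under these assumptions the reduction works out to roughly 74%. A real checkpoint typically keeps some tensors in higher precision, so its measured reduction can be lower than this idealized estimate.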

### 📊 Benchmark Logs

| ```json | |
| { | |
| "PERFORMANCE_METRICS": { | |
| "CompilationEngine": "TensorRT-LLM v0.10.x", | |
| "QuantizationType": "INT4-AWQ", | |
| "MemoryFootprintReduction": " ~72%", | |
| "TensorCoreUtilization": "Optimal" | |
| } | |
| } |