Instructions to use Abhimanue/TCompress with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use Abhimanue/TCompress with Transformers:

```python
# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("Abhimanue/TCompress", dtype="auto")
```

- Notebooks
- Google Colab
- Kaggle
TCompress Model Benchmark Report
- Quantization Type: QAT (Quantization-Aware Training)
- Precision: BF16
- Evaluation Dataset: Salesforce/wikitext (wikitext-2-raw-v1)
Performance Metrics
| Metric | Result |
|---|---|
| Total Tokens Evaluated | 1,000,000 |
| Latency (Mean) | 58.12 ms |
| Throughput | 17,206.3 tok/s |
| Peak GPU Memory | 1,174.4 MB |
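
The report does not say how latency and throughput were measured. As a minimal sketch of one common approach, the helper below times repeated calls to a batch function and derives mean latency and tokens per second. The names `benchmark`, `run_batch`, and `tokens_per_batch` are hypothetical and are not taken from the TCompress evaluation code.

```python
import time
from typing import Callable, Dict


def benchmark(
    run_batch: Callable[[], None],
    tokens_per_batch: int,
    n_runs: int = 20,
    warmup: int = 3,
) -> Dict[str, float]:
    """Time `run_batch` and report mean latency (ms) and throughput (tok/s).

    Assumption: each call to `run_batch` processes exactly
    `tokens_per_batch` tokens. On a GPU, `run_batch` should block until the
    device has finished (for example by calling torch.cuda.synchronize()),
    otherwise the timings only cover kernel launch.
    """
    if n_runs <= 0 or tokens_per_batch <= 0:
        raise ValueError("n_runs and tokens_per_batch must be positive")

    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        run_batch()

    latencies = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_batch()
        latencies.append(time.perf_counter() - start)

    total_seconds = sum(latencies)
    if total_seconds <= 0:
        raise RuntimeError("measured time was zero; use a heavier workload")

    return {
        "latency_mean_ms": 1000.0 * total_seconds / n_runs,
        "throughput_tok_s": tokens_per_batch * n_runs / total_seconds,
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a real forward pass.
    stats = benchmark(lambda: sum(range(100_000)), tokens_per_batch=1_000)
    print(stats)
```

Peak GPU memory is not covered by this sketch; with PyTorch it is typically read from `torch.cuda.max_memory_allocated()` after the timed runs.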
Model Accuracy
| Variant | Agreement vs Base | Flipped Tokens |
|---|---|---|
| TCompress (bf16) | 94.92% | 50,829 |
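
The agreement figure is consistent with the flipped-token count divided by the 1,000,000 evaluated tokens. The sketch below shows that arithmetic, assuming a "flipped" token is a position where the variant's top-1 prediction differs from the base model's. That definition and the helper names `count_flipped` and `agreement_rate` are assumptions, not part of the report.

```python
from typing import Sequence


def count_flipped(base_ids: Sequence[int], variant_ids: Sequence[int]) -> int:
    """Count positions where the two models predict different token ids."""
    if len(base_ids) != len(variant_ids):
        raise ValueError("prediction sequences must have the same length")
    return sum(1 for b, v in zip(base_ids, variant_ids) if b != v)


def agreement_rate(total_tokens: int, flipped_tokens: int) -> float:
    """Fraction of positions where the variant matches the base model."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0 <= flipped_tokens <= total_tokens:
        raise ValueError("flipped_tokens must lie in [0, total_tokens]")
    return 1.0 - flipped_tokens / total_tokens


if __name__ == "__main__":
    # Figures from the report: 1,000,000 tokens, 50,829 flipped.
    print(f"{agreement_rate(1_000_000, 50_829):.2%}")  # 94.92%

    # Toy example with hypothetical predictions: one of four positions differs.
    flipped = count_flipped([5, 7, 7, 2], [5, 7, 9, 2])
    print(agreement_rate(4, flipped))  # 0.75
```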
Storage & Packaging
| Asset | Size |
|---|---|
| Model Weight File | 298.0 MB |
Implementation Details
This model was optimized with Quantization-Aware Training (QAT) to maintain high fidelity with the FP32 base model (94.92% token agreement) while reducing memory footprint and increasing throughput on CUDA-enabled hardware. In the benchmark above, it reached 17,206.3 tok/s with a peak GPU memory of 1,174.4 MB.