license: cc-by-nc-nd-4.0
Introduction
TQCompressedGPT-2 is a compressed GPT-2 model built with a novel compression method based on improved tensor decompositions. It addresses the computational and storage demands of NLP models by adding a permutation-based enhancement to Kronecker decomposition, which significantly reduces model size while maintaining performance.
TQCompressedGPT2 © 2024 by Terra Quantum AG is licensed under CC BY-NC-ND 4.0. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/
Any entity that wishes to use this library for commercial purposes should contact info@terraquantum.swiss for more information.
Features
Model Size Reduction: Compresses the GPT-2 small model from 124 million to 81 million parameters.
Permutation-Based Enhancement: Introduces a new permutation algorithm for matrix factorization, minimizing performance degradation.
Efficient Training Strategy: Employs multi-step knowledge distillation with a fraction (3.1%) of the OpenWebText dataset.
Performance: Outperforms DistilGPT-2 in comparative evaluations.
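The size reduction quoted in the feature list can be checked with a few lines of arithmetic. The sketch below also shows why a Kronecker factorization saves parameters for a single matrix. The 768 x 3072 shape is that of a GPT-2 small feed-forward matrix, but the factor shapes are hypothetical and chosen only for illustration; they are not the shapes used in TQCompressedGPT-2.

```python
# Parameter counts quoted in the feature list above.
ORIGINAL_PARAMS = 124_000_000    # GPT-2 small
COMPRESSED_PARAMS = 81_000_000   # TQCompressedGPT-2


def compression_stats(original: int, compressed: int) -> dict:
    """Return the compression ratio and the percentage of parameters removed."""
    return {
        "ratio": original / compressed,
        "reduction_pct": 100.0 * (original - compressed) / original,
    }


def kronecker_param_count(m1: int, n1: int, m2: int, n2: int) -> int:
    """Parameters stored for factors A (m1 x n1) and B (m2 x n2),
    which together represent a dense (m1*m2) x (n1*n2) matrix kron(A, B)."""
    return m1 * n1 + m2 * n2


stats = compression_stats(ORIGINAL_PARAMS, COMPRESSED_PARAMS)
print(f"{stats['ratio']:.2f}x smaller, {stats['reduction_pct']:.1f}% fewer parameters")

# Hypothetical factor shapes for one 768 x 3072 feed-forward matrix.
dense = 768 * 3072
factored = kronecker_param_count(384, 1536, 2, 2)
print(f"dense: {dense}, factored: {factored}")
```

For the whole model, the quoted counts give a ratio of about 1.53x, or roughly 34.7% fewer parameters.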
Permutation-Based Enhancement
We employ a permutation-based algorithm that achieves a better decomposition approximation of the weight matrices.
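To illustrate why a permutation can help, the following NumPy sketch fits a single Kronecker product to a toy matrix twice: once directly, and once after reordering its rows. It uses the standard rearrangement-plus-SVD fit for the nearest Kronecker product, which is an assumption on our side and not necessarily the procedure used for TQCompressedGPT-2. The toy matrix, the row-only permutation, and all function names are hypothetical, and the permutation is known in advance here, whereas finding a good one is the actual problem the algorithm addresses.

```python
import numpy as np


def nearest_kronecker(W, m1, n1, m2, n2):
    """Best Frobenius-norm fit W ~ kron(A, B) with A (m1 x n1) and B (m2 x n2)."""
    # Rearrange W so that each row holds one flattened (m2 x n2) block.
    # For an exact Kronecker product this rearranged matrix has rank 1.
    R = W.reshape(m1, m2, n1, n2).transpose(0, 2, 1, 3).reshape(m1 * n1, m2 * n2)
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0].reshape(m2, n2)
    return A, B


def relative_error(W, m1, n1, m2, n2):
    """Relative Frobenius error of the best single Kronecker product fit."""
    A, B = nearest_kronecker(W, m1, n1, m2, n2)
    return np.linalg.norm(W - np.kron(A, B)) / np.linalg.norm(W)


rng = np.random.default_rng(0)
m1, n1, m2, n2 = 4, 3, 2, 5

# Toy "weight matrix": an exact Kronecker product whose rows were shuffled.
structured = np.kron(rng.standard_normal((m1, n1)), rng.standard_normal((m2, n2)))
shuffle = np.roll(np.arange(m1 * m2), 1)  # cyclic shift of the rows
W = structured[shuffle]

# Decompose W directly, then again after undoing the shuffle.
restore = np.argsort(shuffle)
err_direct = relative_error(W, m1, n1, m2, n2)
err_permuted = relative_error(W[restore], m1, n1, m2, n2)

print(f"direct: {err_direct:.3e}, after permutation: {err_permuted:.3e}")
```

In this toy case the reordered matrix is represented essentially exactly, while the direct fit leaves a clear residual. Real weight matrices are not exact Kronecker products, so a permutation can only be expected to reduce the error, not remove it.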
Methodology
For more details about the techniques behind TQCompressedGPT-2, refer to our paper, "TQCompressor: Improving Tensor Decomposition in Neural Networks via Permutations".
TQCompressed Decomposition: Focuses on optimal permutation of weight matrices followed by Kronecker decomposition.
Knowledge Distillation: Uses an iterative compression method coupled with knowledge distillation, enhancing performance.
Application: Demonstrated on the GPT-2 model, showing its versatility and applicability to various neural network architectures.
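The knowledge distillation step can be pictured with the standard temperature-scaled distillation loss, in which the compressed (student) model is trained to match the softened output distribution of the original (teacher) model. The sketch below is a generic formulation in plain Python and not necessarily the exact objective or multi-step schedule used for TQCompressedGPT-2; the temperature value and the function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T**2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]         # made-up logits for a 3-token vocabulary
student_close = [1.8, 0.6, -0.9]
student_far = [-1.0, 0.5, 2.0]

print(distillation_loss(student_close, teacher))
print(distillation_loss(student_far, teacher))
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge, so the second value printed is larger than the first.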
Usage
The model and code are publicly available; the model is hosted on the Hugging Face Hub as tq-ag/TQCompressedGPT2.
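A minimal sketch of text generation with the Transformers library, assuming the tq-ag/TQCompressedGPT2 checkpoint loads through the standard `AutoTokenizer` and `AutoModelForCausalLM` interfaces and that PyTorch is installed. The `generate` helper and its default arguments are illustrative names, not part of the released code.

```python
MODEL_ID = "tq-ag/TQCompressedGPT2"


def generate(prompt: str, max_new_tokens: int = 40) -> str:
    """Download the checkpoint (on first use) and continue the given prompt."""
    # Imported here so that defining the helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        # GPT-2 tokenizers define no padding token, so reuse end-of-sequence.
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the model weights):
# print(generate("Once upon a time,"))
```

The helper reloads the model on every call, which keeps the sketch short. For repeated use, load the tokenizer and model once and reuse them.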
Citation
If you find TQCompressedGPT-2 useful in your research, please cite the following paper:
@article{tqcompressedgpt2,
title={TQCompressor: Improving Tensor Decomposition in Neural Networks via Permutations},
author={Abronin, V. and Naumov, A. and Mazur, D. and Bystrov, D. and Tsarova, K. and Melnikov, Ar. and Oseledets, I. and Dolgov, S. and Brasher, R. and Perelshtein, M.},
journal={arXiv preprint arXiv:[insert_arxiv_id]},
year={2023}
}
Acknowledgments
- Terra Quantum AG, Kornhausstrasse 25, 9000 St. Gallen, Switzerland
- Project contributors and researchers.