# llama-65b-4bit
This model works with my fork of GPTQ-for-LLaMa: https://github.com/catid/GPTQ-for-LLaMa-65B-2GPU
To test it out on two RTX 4090 GPUs and 64 GB of RAM (it might also work with a large swap file; I haven't tested that):
```bash
# Install git and git-lfs
sudo apt install git git-lfs

# Clone the code
git clone https://github.com/catid/GPTQ-for-LLaMa-65B-2GPU
cd GPTQ-for-LLaMa-65B-2GPU

# Clone the model weights
git lfs install
git clone https://huggingface.co/catid/llama-65b-4bit

# Set up a conda environment
conda create -n gptq python=3.10
conda activate gptq

# Install script dependencies
pip install -r requirements.txt

# Work around a protobuf error
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

# Run a test
python llama_inference.py llama-65b-4bit --load llama-65b-4bit/llama65b-4bit-128g.safetensors --groupsize 128 --wbits 4 --text "I woke up with a dent in my forehead. " --max_length 128 --min_length 32
```
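As a rough sanity check on why this needs two 24 GB GPUs rather than one, here is a back-of-the-envelope VRAM estimate for the quantized weights. The parameter count and the `--wbits 4 --groupsize 128` values come from the command above; the per-group overhead sizes (an fp16 scale plus a packed 4-bit zero-point per group) are assumptions about the quantization format, not measurements.

```python
# Back-of-the-envelope estimate of quantized weight size for LLaMA-65B.
# Flag values taken from the inference command above; per-group overhead
# (fp16 scale + 4-bit zero-point) is an assumption about the GPTQ format.

PARAMS = 65e9      # LLaMA-65B parameter count
WBITS = 4          # --wbits 4
GROUPSIZE = 128    # --groupsize 128

# Packed 4-bit weights: half a byte per parameter.
weight_bytes = PARAMS * WBITS / 8

# Assumed per-group metadata: 2-byte fp16 scale + 0.5-byte zero-point
# for every group of 128 weights.
overhead_bytes = PARAMS / GROUPSIZE * (2 + 0.5)

total_gb = (weight_bytes + overhead_bytes) / 1e9
print(f"~{total_gb:.1f} GB of quantized weights")
```

Roughly 34 GB of weights alone exceeds a single RTX 4090's 24 GB, but fits comfortably across two cards with headroom left for activations and the KV cache.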
---
license: bsd-3-clause
---