LoRAs for improving the quality of quantized Llama 3 models.

LoRAs available

WikiText
| Model | PPL | KL-Div | Same top p |
|---|---|---|---|
| Llama 3 70B Instruct (baseline) | 5.282 | 0 | 100% |
| Llama 3 70B Instruct IQ2_XXS | 7.691 | 3.340 × 10⁻¹ | 79.3% |
| Llama 3 70B Instruct IQ2_M (without LoRA) | 7.765 | 3.430 × 10⁻¹ | 81.4% |
| Llama 3 70B Instruct IQ2_XS (without LoRA) | 9.320 | 5.502 × 10⁻¹ | 77.2% |
| Llama 3 70B Instruct IQ2_XXS (without LoRA) | 10.554 | 6.767 × 10⁻¹ | 73.8% |
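
The README does not say how the columns above were computed. The sketch below shows the usual definitions of these three metrics, assuming PPL is the exponentiated mean negative log-likelihood of the reference tokens, KL-Div is the mean per-token KL divergence from the baseline distribution to the quantized one, and "Same top p" is the fraction of positions where both models pick the same most likely token. The function names and the logit arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def compare_models(base_logits, quant_logits, target_ids):
    """Return (PPL, mean KL divergence, top-token agreement) for the quantized model.

    base_logits, quant_logits: arrays of shape (positions, vocab_size)
    target_ids: the reference next-token id at each position
    """
    base_lp = log_softmax(base_logits)
    quant_lp = log_softmax(quant_logits)
    positions = np.arange(len(target_ids))

    # Perplexity: exp of the mean negative log-likelihood of the reference tokens.
    ppl = float(np.exp(-quant_lp[positions, target_ids].mean()))

    # KL(base || quant) at each position, averaged over positions.
    kl = float((np.exp(base_lp) * (base_lp - quant_lp)).sum(axis=-1).mean())

    # Fraction of positions where both models agree on the most likely token.
    same_top = float((base_lp.argmax(axis=-1) == quant_lp.argmax(axis=-1)).mean())
    return ppl, kl, same_top
```

Under these definitions, comparing a model against itself gives a KL divergence of 0 and 100% agreement, which matches the baseline row.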

How to use this

Each subdirectory in this repo has a LoRA for a specific model and quant. Each LoRA works only with the exact quantized GGUF file it was trained on. See the README in each directory for details.
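
One way to see why a LoRA is tied to a specific quantized file is a toy example. The sketch below is not the training method used by quant-repair: it assumes a simple uniform rounding quantizer and fits the low-rank correction with a truncated SVD of the quantization error, purely for illustration. All function names are hypothetical.

```python
import numpy as np

def quantize(w, step):
    """Toy uniform quantizer: round each weight to the nearest multiple of `step`."""
    return np.round(w / step) * step

def fit_low_rank(err, rank):
    """Best rank-`rank` approximation of `err` (truncated SVD), as factors B and A."""
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # shape (rows, rank)
    a = vt[:rank, :]             # shape (rank, cols)
    return b, a

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))    # stand-in for an original weight matrix

q_a = quantize(w, 0.5)           # the quantization the correction is fitted to
q_b = quantize(w, 0.7)           # a different quantization of the same weights

# Fit a rank-8 correction to the error of quantization A only.
b, a = fit_low_rank(w - q_a, rank=8)

err_plain = np.linalg.norm(w - q_a)                  # A without correction
err_matched = np.linalg.norm(w - (q_a + b @ a))      # A with its own correction
err_mismatched = np.linalg.norm(w - (q_b + b @ a))   # B with A's correction
```

In this toy setup the correction reduces the error of the quantization it was fitted to, while applying it to the other quantization leaves a larger error than the matched case, because the low-rank term encodes the error pattern of one particular set of quantized weights.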

  1. Choose a subdirectory and download its LoRA GGUF and the matching quantized model GGUF.
  2. Download the quant-repair scripts: https://github.com/simsvml/quant-repair
  3. Apply the LoRA to the model:
    cd quant-repair
    # Install dependencies if needed:
    pip3 install numpy
    python3 combine_gguf.py lora.gguf model.gguf -o out.gguf
    
    For combine_gguf.py, the LoRA GGUF must be the first input because the script copies metadata entries from the first input.
  4. Build the llama-with-lora branch of llama.cpp: https://github.com/simsvml/llama.cpp/tree/llama-with-lora
  5. Run llama.cpp as usual using the combined out.gguf file.
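
Because the argument order in step 3 matters, a small wrapper can make it harder to swap the two inputs by accident. This is a minimal sketch: the function names `build_combine_command` and `combine_lora` are hypothetical, and it only assembles the same `combine_gguf.py` command line shown above, using keyword-only arguments so each file has to be named explicitly.

```python
import subprocess

def build_combine_command(*, lora_gguf, model_gguf, out_gguf):
    """Build the combine_gguf.py command line with the LoRA GGUF as the first input."""
    return [
        "python3", "combine_gguf.py",
        str(lora_gguf),    # first input: metadata entries are copied from this file
        str(model_gguf),   # second input: the matching quantized model
        "-o", str(out_gguf),
    ]

def combine_lora(*, lora_gguf, model_gguf, out_gguf, repo_dir="quant-repair"):
    """Run combine_gguf.py from inside the quant-repair checkout.

    Relative file paths are resolved against `repo_dir`, since the command
    runs with that directory as its working directory.
    """
    cmd = build_combine_command(
        lora_gguf=lora_gguf, model_gguf=model_gguf, out_gguf=out_gguf
    )
    subprocess.run(cmd, cwd=repo_dir, check=True)
```

For example, `combine_lora(lora_gguf="lora.gguf", model_gguf="model.gguf", out_gguf="out.gguf")` runs the same command as step 3.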

Training from a checkpoint

Some subdirectories contain LoRAs that haven't finished training. These have filenames like lora_ckpt17.gguf and come with a corresponding training checkpoint, lora_ckpt17.pt. If you have 24 GB of VRAM, you can continue the training yourself:

  1. Download the training checkpoint and quantized model GGUF.

  2. Download the original model in safetensors format. This is used as a reference during training.

  3. In the quant-repair repo, set up symlinks to the original and quantized models. Check the config file for the checkpoint to find the correct paths. The path given for orig_weights_safetensors_dir should be a symlink to the directory containing the original model's safetensors file, and the path given for quant_weights_gguf_path should be a symlink to the quantized GGUF.

    Alternatively, reconfigure the checkpoint with paths that are correct for your system. See the quant-repair README for instructions.

  4. Install dependencies for the quant-repair training scripts:

    cd quant-repair
    pip3 install -r requirements.txt
    # gguf-py is part of llama.cpp; this path assumes llama.cpp is checked out next to quant-repair.
    cd ../llama.cpp/gguf-py
    pip3 install .
    
  5. Train:

    cd quant-repair
    python3 train_repair_lora2.py train lora.pt
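
The symlink setup in step 3 can be sketched as follows. This is a minimal sketch, assuming you have already read the two paths (`orig_weights_safetensors_dir` and `quant_weights_gguf_path`) from the checkpoint's config. The function name `link_model_paths` and its parameters are hypothetical.

```python
from pathlib import Path

def link_model_paths(orig_dir, quant_gguf, orig_link, quant_link):
    """Create symlinks at the locations the checkpoint config expects.

    orig_dir:   directory holding the original model in safetensors format
    quant_gguf: the quantized model GGUF file
    orig_link:  path given for orig_weights_safetensors_dir in the config
    quant_link: path given for quant_weights_gguf_path in the config
    """
    pairs = (
        (Path(orig_link), Path(orig_dir)),
        (Path(quant_link), Path(quant_gguf)),
    )
    for link, target in pairs:
        link.parent.mkdir(parents=True, exist_ok=True)
        if link.is_symlink():
            link.unlink()  # replace a stale symlink from an earlier setup
        link.symlink_to(target.resolve(), target_is_directory=target.is_dir())
```

Run it from inside the quant-repair checkout if the paths in the config are relative.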
    