GGUF
conversational
sunil-pathak commited on
Commit
e35197e
·
verified ·
1 Parent(s): b6be8fd

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +570 -58
README.md CHANGED
@@ -1,84 +1,596 @@
1
- ---
2
- license: other
3
- tags:
4
- - gguf
5
- - llama.cpp
6
- - gemma
7
- - fp16
8
- - text-generation
9
- - cpu-inference
10
- pipeline_tag: text-generation
11
- ---
12
 
13
- # Gemma 4 E4B IT – GGUF (FP16)
14
 
15
- ![Format](https://img.shields.io/badge/format-GGUF-orange)
16
- ![Precision](https://img.shields.io/badge/precision-FP16-blue)
17
- ![Runtime](https://img.shields.io/badge/runtime-llama.cpp-red)
18
 
19
- ---
20
 
21
- ## 🔷 Model Overview
22
 
23
- This repository contains a **GGUF FP16 conversion** of:
24
 
25
- - **Base Model:** google/gemma-4-E4B-it
26
- - **Developer:** Google
27
- - **Format:** GGUF (optimized for llama.cpp)
28
- - **Precision:** FP16 (full precision weights)
29
 
30
- This model is designed for **high-quality local inference**.
31
 
32
- ---
 
 
 
 
 
 
 
 
 
33
 
34
- ## 📦 Files
35
 
36
- | File | Description |
37
- |------|------------|
38
- | `gemma-4-E4B-it-f16.gguf` | FP16 full-precision GGUF model |
39
 
40
- ---
41
 
42
- ## ⚙️ Technical Details
 
 
 
43
 
44
- | Parameter | Value |
45
- |----------|------|
46
- | Architecture | Gemma 4 E4B |
47
- | Format | GGUF |
48
- | Precision | FP16 |
49
- | Runtime | llama.cpp |
50
- | Use Case | High-quality inference |
51
 
52
- ---
53
 
54
- ## ⚡ Why GGUF?
 
 
55
 
56
- GGUF enables:
 
57
 
58
- - Efficient CPU inference via llama.cpp
59
- - Single-file model distribution
60
- - Fast loading using memory mapping
61
- - Cross-platform compatibility
62
 
63
- ---
64
 
65
- ## ⚠️ License & Usage
 
66
 
67
- This is a **converted derivative model**.
 
 
 
 
 
 
 
68
 
69
- You must comply with the original license:
70
- 👉 https://huggingface.co/google/gemma-4-E4B-it
71
 
72
- ### Important:
73
- - ❌ Not an official Google release
74
- - ❌ No additional rights granted
75
- - ✅ Original model ownership remains with Google
76
- - ⚠️ Use responsibly under original license terms
77
 
78
- ---
79
 
80
- ## 🚀 Quick Start (llama.cpp)
81
 
82
- ### Run inference:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
83
  ```bash
84
- ./llama-cli -m gemma-4-E4B-it-f16.gguf -p "Explain AI simply"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # llama.cpp
 
 
 
 
 
 
 
 
 
 
2
 
3
+ ![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)
4
 
5
+ [![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://opensource.org/licenses/MIT)
6
+ [![Release](https://img.shields.io/github/v/release/ggml-org/llama.cpp)](https://github.com/ggml-org/llama.cpp/releases)
7
+ [![Server](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml/badge.svg)](https://github.com/ggml-org/llama.cpp/actions/workflows/server.yml)
8
 
9
+ [Manifesto](https://github.com/ggml-org/llama.cpp/discussions/205) / [ggml](https://github.com/ggml-org/ggml) / [ops](https://github.com/ggml-org/llama.cpp/blob/master/docs/ops.md)
10
 
11
+ LLM inference in C/C++
12
 
13
+ ## Recent API changes
14
 
15
+ - [Changelog for `libllama` API](https://github.com/ggml-org/llama.cpp/issues/9289)
16
+ - [Changelog for `llama-server` REST API](https://github.com/ggml-org/llama.cpp/issues/9291)
 
 
17
 
18
+ ## Hot topics
19
 
20
+ - **Hugging Face cache migration: models downloaded with `-hf` are now stored in the standard Hugging Face cache directory, enabling sharing with other HF tools.**
21
+ - **[guide : using the new WebUI of llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/16938)**
22
+ - [guide : running gpt-oss with llama.cpp](https://github.com/ggml-org/llama.cpp/discussions/15396)
23
+ - [[FEEDBACK] Better packaging for llama.cpp to support downstream consumers 🤗](https://github.com/ggml-org/llama.cpp/discussions/15313)
24
+ - Support for the `gpt-oss` model with native MXFP4 format has been added | [PR](https://github.com/ggml-org/llama.cpp/pull/15091) | [Collaboration with NVIDIA](https://blogs.nvidia.com/blog/rtx-ai-garage-openai-oss) | [Comment](https://github.com/ggml-org/llama.cpp/discussions/15095)
25
+ - Multimodal support arrived in `llama-server`: [#12898](https://github.com/ggml-org/llama.cpp/pull/12898) | [documentation](./docs/multimodal.md)
26
+ - VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
27
+ - Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
28
+ - Hugging Face Inference Endpoints now support GGUF out of the box! https://github.com/ggml-org/llama.cpp/discussions/9669
29
+ - Hugging Face GGUF editor: [discussion](https://github.com/ggml-org/llama.cpp/discussions/9268) | [tool](https://huggingface.co/spaces/CISCai/gguf-editor)
30
 
31
+ ----
32
 
33
+ ## Quick start
 
 
34
 
35
+ Getting started with llama.cpp is straightforward. Here are several ways to install it on your machine:
36
 
37
+ - Install `llama.cpp` using [brew, nix or winget](docs/install.md)
38
+ - Run with Docker - see our [Docker documentation](docs/docker.md)
39
+ - Download pre-built binaries from the [releases page](https://github.com/ggml-org/llama.cpp/releases)
40
+ - Build from source by cloning this repository - check out [our build guide](docs/build.md)
41
 
42
+ Once installed, you'll need a model to work with. Head to the [Obtaining and quantizing models](#obtaining-and-quantizing-models) section to learn more.
 
 
 
 
 
 
43
 
44
+ Example command:
45
 
46
+ ```sh
47
+ # Use a local model file
48
+ llama-cli -m my_model.gguf
49
 
50
+ # Or download and run a model directly from Hugging Face
51
+ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
52
 
53
+ # Launch OpenAI-compatible API server
54
+ llama-server -hf ggml-org/gemma-3-1b-it-GGUF
55
+ ```
 
56
 
57
+ ## Description
58
 
59
+ The main goal of `llama.cpp` is to enable LLM inference with minimal setup and state-of-the-art performance on a wide
60
+ range of hardware - locally and in the cloud.
61
 
62
+ - Plain C/C++ implementation without any dependencies
63
+ - Apple silicon is a first-class citizen - optimized via ARM NEON, Accelerate and Metal frameworks
64
+ - AVX, AVX2, AVX512 and AMX support for x86 architectures
65
+ - RVV, ZVFH, ZFH, ZICBOP and ZIHINTPAUSE support for RISC-V architectures
66
+ - 1.5-bit, 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit integer quantization for faster inference and reduced memory use
67
+ - Custom CUDA kernels for running LLMs on NVIDIA GPUs (support for AMD GPUs via HIP and Moore Threads GPUs via MUSA)
68
+ - Vulkan and SYCL backend support
69
+ - CPU+GPU hybrid inference to partially accelerate models larger than the total VRAM capacity
70
 
71
+ The `llama.cpp` project is the main playground for developing new features for the [ggml](https://github.com/ggml-org/ggml) library.
 
72
 
73
+ <details>
74
+ <summary>Models</summary>
 
 
 
75
 
76
+ Typically finetunes of the base models below are supported as well.
77
 
78
+ Instructions for adding support for new models: [HOWTO-add-model.md](docs/development/HOWTO-add-model.md)
79
 
80
+ #### Text-only
81
+
82
+ - [X] LLaMA 🦙
83
+ - [x] LLaMA 2 🦙🦙
84
+ - [x] LLaMA 3 🦙🦙🦙
85
+ - [X] [Mistral 7B](https://huggingface.co/mistralai/Mistral-7B-v0.1)
86
+ - [x] [Mixtral MoE](https://huggingface.co/models?search=mistral-ai/Mixtral)
87
+ - [x] [DBRX](https://huggingface.co/databricks/dbrx-instruct)
88
+ - [x] [Jamba](https://huggingface.co/ai21labs)
89
+ - [X] [Falcon](https://huggingface.co/models?search=tiiuae/falcon)
90
+ - [X] [Chinese LLaMA / Alpaca](https://github.com/ymcui/Chinese-LLaMA-Alpaca) and [Chinese LLaMA-2 / Alpaca-2](https://github.com/ymcui/Chinese-LLaMA-Alpaca-2)
91
+ - [X] [Vigogne (French)](https://github.com/bofenghuang/vigogne)
92
+ - [X] [BERT](https://github.com/ggml-org/llama.cpp/pull/5423)
93
+ - [X] [Koala](https://bair.berkeley.edu/blog/2023/04/03/koala/)
94
+ - [X] [Baichuan 1 & 2](https://huggingface.co/models?search=baichuan-inc/Baichuan) + [derivations](https://huggingface.co/hiyouga/baichuan-7b-sft)
95
+ - [X] [Aquila 1 & 2](https://huggingface.co/models?search=BAAI/Aquila)
96
+ - [X] [Starcoder models](https://github.com/ggml-org/llama.cpp/pull/3187)
97
+ - [X] [Refact](https://huggingface.co/smallcloudai/Refact-1_6B-fim)
98
+ - [X] [MPT](https://github.com/ggml-org/llama.cpp/pull/3417)
99
+ - [X] [Bloom](https://github.com/ggml-org/llama.cpp/pull/3553)
100
+ - [x] [Yi models](https://huggingface.co/models?search=01-ai/Yi)
101
+ - [X] [StableLM models](https://huggingface.co/stabilityai)
102
+ - [x] [Deepseek models](https://huggingface.co/models?search=deepseek-ai/deepseek)
103
+ - [x] [Qwen models](https://huggingface.co/models?search=Qwen/Qwen)
104
+ - [x] [PLaMo-13B](https://github.com/ggml-org/llama.cpp/pull/3557)
105
+ - [x] [Phi models](https://huggingface.co/models?search=microsoft/phi)
106
+ - [x] [PhiMoE](https://github.com/ggml-org/llama.cpp/pull/11003)
107
+ - [x] [GPT-2](https://huggingface.co/gpt2)
108
+ - [x] [Orion 14B](https://github.com/ggml-org/llama.cpp/pull/5118)
109
+ - [x] [InternLM2](https://huggingface.co/models?search=internlm2)
110
+ - [x] [CodeShell](https://github.com/WisdomShell/codeshell)
111
+ - [x] [Gemma](https://ai.google.dev/gemma)
112
+ - [x] [Mamba](https://github.com/state-spaces/mamba)
113
+ - [x] [Grok-1](https://huggingface.co/keyfan/grok-1-hf)
114
+ - [x] [Xverse](https://huggingface.co/models?search=xverse)
115
+ - [x] [Command-R models](https://huggingface.co/models?search=CohereForAI/c4ai-command-r)
116
+ - [x] [SEA-LION](https://huggingface.co/models?search=sea-lion)
117
+ - [x] [GritLM-7B](https://huggingface.co/GritLM/GritLM-7B) + [GritLM-8x7B](https://huggingface.co/GritLM/GritLM-8x7B)
118
+ - [x] [OLMo](https://allenai.org/olmo)
119
+ - [x] [OLMo 2](https://allenai.org/olmo)
120
+ - [x] [OLMoE](https://huggingface.co/allenai/OLMoE-1B-7B-0924)
121
+ - [x] [Granite models](https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330)
122
+ - [x] [GPT-NeoX](https://github.com/EleutherAI/gpt-neox) + [Pythia](https://github.com/EleutherAI/pythia)
123
+ - [x] [Snowflake-Arctic MoE](https://huggingface.co/collections/Snowflake/arctic-66290090abe542894a5ac520)
124
+ - [x] [Smaug](https://huggingface.co/models?search=Smaug)
125
+ - [x] [Poro 34B](https://huggingface.co/LumiOpen/Poro-34B)
126
+ - [x] [Bitnet b1.58 models](https://huggingface.co/1bitLLM)
127
+ - [x] [Flan T5](https://huggingface.co/models?search=flan-t5)
128
+ - [x] [Open Elm models](https://huggingface.co/collections/apple/openelm-instruct-models-6619ad295d7ae9f868b759ca)
129
+ - [x] [ChatGLM3-6b](https://huggingface.co/THUDM/chatglm3-6b) + [ChatGLM4-9b](https://huggingface.co/THUDM/glm-4-9b) + [GLMEdge-1.5b](https://huggingface.co/THUDM/glm-edge-1.5b-chat) + [GLMEdge-4b](https://huggingface.co/THUDM/glm-edge-4b-chat)
130
+ - [x] [GLM-4-0414](https://huggingface.co/collections/THUDM/glm-4-0414-67f3cbcb34dd9d252707cb2e)
131
+ - [x] [SmolLM](https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966)
132
+ - [x] [EXAONE-3.0-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct)
133
+ - [x] [FalconMamba Models](https://huggingface.co/collections/tiiuae/falconmamba-7b-66b9a580324dd1598b0f6d4a)
134
+ - [x] [Jais](https://huggingface.co/inceptionai/jais-13b-chat)
135
+ - [x] [Bielik-11B-v2.3](https://huggingface.co/collections/speakleash/bielik-11b-v23-66ee813238d9b526a072408a)
136
+ - [x] [RWKV-7](https://huggingface.co/collections/shoumenchougou/rwkv7-gxx-gguf)
137
+ - [x] [RWKV-6](https://github.com/BlinkDL/RWKV-LM)
138
+ - [x] [QRWKV-6](https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1)
139
+ - [x] [GigaChat-20B-A3B](https://huggingface.co/ai-sage/GigaChat-20B-A3B-instruct)
140
+ - [X] [Trillion-7B-preview](https://huggingface.co/trillionlabs/Trillion-7B-preview)
141
+ - [x] [Ling models](https://huggingface.co/collections/inclusionAI/ling-67c51c85b34a7ea0aba94c32)
142
+ - [x] [LFM2 models](https://huggingface.co/collections/LiquidAI/lfm2-686d721927015b2ad73eaa38)
143
+ - [x] [Hunyuan models](https://huggingface.co/collections/tencent/hunyuan-dense-model-6890632cda26b19119c9c5e7)
144
+ - [x] [BailingMoeV2 (Ring/Ling 2.0) models](https://huggingface.co/collections/inclusionAI/ling-v2-68bf1dd2fc34c306c1fa6f86)
145
+
146
+ #### Multimodal
147
+
148
+ - [x] [LLaVA 1.5 models](https://huggingface.co/collections/liuhaotian/llava-15-653aac15d994e992e2677a7e), [LLaVA 1.6 models](https://huggingface.co/collections/liuhaotian/llava-16-65b9e40155f60fd046a5ccf2)
149
+ - [x] [BakLLaVA](https://huggingface.co/models?search=SkunkworksAI/Bakllava)
150
+ - [x] [Obsidian](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
151
+ - [x] [ShareGPT4V](https://huggingface.co/models?search=Lin-Chen/ShareGPT4V)
152
+ - [x] [MobileVLM 1.7B/3B models](https://huggingface.co/models?search=mobileVLM)
153
+ - [x] [Yi-VL](https://huggingface.co/models?search=Yi-VL)
154
+ - [x] [Mini CPM](https://huggingface.co/models?search=MiniCPM)
155
+ - [x] [Moondream](https://huggingface.co/vikhyatk/moondream2)
156
+ - [x] [Bunny](https://github.com/BAAI-DCAI/Bunny)
157
+ - [x] [GLM-EDGE](https://huggingface.co/models?search=glm-edge)
158
+ - [x] [Qwen2-VL](https://huggingface.co/collections/Qwen/qwen2-vl-66cee7455501d7126940800d)
159
+ - [x] [LFM2-VL](https://huggingface.co/collections/LiquidAI/lfm2-vl-68963bbc84a610f7638d5ffa)
160
+
161
+ </details>
162
+
163
+ <details>
164
+ <summary>Bindings</summary>
165
+
166
+ - Python: [ddh0/easy-llama](https://github.com/ddh0/easy-llama)
167
+ - Python: [abetlen/llama-cpp-python](https://github.com/abetlen/llama-cpp-python)
168
+ - Go: [go-skynet/go-llama.cpp](https://github.com/go-skynet/go-llama.cpp)
169
+ - Node.js: [withcatai/node-llama-cpp](https://github.com/withcatai/node-llama-cpp)
170
+ - JS/TS (llama.cpp server client): [lgrammel/modelfusion](https://modelfusion.dev/integration/model-provider/llamacpp)
171
+ - JS/TS (Programmable Prompt Engine CLI): [offline-ai/cli](https://github.com/offline-ai/cli)
172
+ - JavaScript/Wasm (works in browser): [tangledgroup/llama-cpp-wasm](https://github.com/tangledgroup/llama-cpp-wasm)
173
+ - Typescript/Wasm (nicer API, available on npm): [ngxson/wllama](https://github.com/ngxson/wllama)
174
+ - Ruby: [yoshoku/llama_cpp.rb](https://github.com/yoshoku/llama_cpp.rb)
175
+ - Rust (more features): [edgenai/llama_cpp-rs](https://github.com/edgenai/llama_cpp-rs)
176
+ - Rust (nicer API): [mdrokz/rust-llama.cpp](https://github.com/mdrokz/rust-llama.cpp)
177
+ - Rust (more direct bindings): [utilityai/llama-cpp-rs](https://github.com/utilityai/llama-cpp-rs)
178
+ - Rust (automated build from crates.io): [ShelbyJenkins/llm_client](https://github.com/ShelbyJenkins/llm_client)
179
+ - C#/.NET: [SciSharp/LLamaSharp](https://github.com/SciSharp/LLamaSharp)
180
+ - C#/VB.NET (more features - community license): [LM-Kit.NET](https://docs.lm-kit.com/lm-kit-net/index.html)
181
+ - Scala 3: [donderom/llm4s](https://github.com/donderom/llm4s)
182
+ - Clojure: [phronmophobic/llama.clj](https://github.com/phronmophobic/llama.clj)
183
+ - React Native: [mybigday/llama.rn](https://github.com/mybigday/llama.rn)
184
+ - Java: [kherud/java-llama.cpp](https://github.com/kherud/java-llama.cpp)
185
+ - Java: [QuasarByte/llama-cpp-jna](https://github.com/QuasarByte/llama-cpp-jna)
186
+ - Zig: [deins/llama.cpp.zig](https://github.com/Deins/llama.cpp.zig)
187
+ - Flutter/Dart: [netdur/llama_cpp_dart](https://github.com/netdur/llama_cpp_dart)
188
+ - Flutter: [xuegao-tzx/Fllama](https://github.com/xuegao-tzx/Fllama)
189
+ - PHP (API bindings and features built on top of llama.cpp): [distantmagic/resonance](https://github.com/distantmagic/resonance) [(more info)](https://github.com/ggml-org/llama.cpp/pull/6326)
190
+ - Guile Scheme: [guile_llama_cpp](https://savannah.nongnu.org/projects/guile-llama-cpp)
191
+ - Swift [srgtuszy/llama-cpp-swift](https://github.com/srgtuszy/llama-cpp-swift)
192
+ - Swift [ShenghaiWang/SwiftLlama](https://github.com/ShenghaiWang/SwiftLlama)
193
+ - Delphi [Embarcadero/llama-cpp-delphi](https://github.com/Embarcadero/llama-cpp-delphi)
194
+ - Go (no CGo needed): [hybridgroup/yzma](https://github.com/hybridgroup/yzma)
195
+ - Android: [llama.android](/examples/llama.android)
196
+
197
+ </details>
198
+
199
+ <details>
200
+ <summary>UIs</summary>
201
+
202
+ *(to have a project listed here, it should clearly state that it depends on `llama.cpp`)*
203
+
204
+ - [AI Sublime Text plugin](https://github.com/yaroslavyaroslav/OpenAI-sublime-text) (MIT)
205
+ - [BonzAI App](https://apps.apple.com/us/app/bonzai-your-local-ai-agent/id6752847988) (proprietary)
206
+ - [cztomsik/ava](https://github.com/cztomsik/ava) (MIT)
207
+ - [Dot](https://github.com/alexpinel/Dot) (GPL)
208
+ - [eva](https://github.com/ylsdamxssjxxdd/eva) (MIT)
209
+ - [iohub/collama](https://github.com/iohub/coLLaMA) (Apache-2.0)
210
+ - [janhq/jan](https://github.com/janhq/jan) (AGPL)
211
+ - [johnbean393/Sidekick](https://github.com/johnbean393/Sidekick) (MIT)
212
+ - [KanTV](https://github.com/zhouwg/kantv?tab=readme-ov-file) (Apache-2.0)
213
+ - [KodiBot](https://github.com/firatkiral/kodibot) (GPL)
214
+ - [llama.vim](https://github.com/ggml-org/llama.vim) (MIT)
215
+ - [LARS](https://github.com/abgulati/LARS) (AGPL)
216
+ - [Llama Assistant](https://github.com/vietanhdev/llama-assistant) (GPL)
217
+ - [LlamaLib](https://github.com/undreamai/LlamaLib) (Apache-2.0)
218
+ - [LLMFarm](https://github.com/guinmoon/LLMFarm?tab=readme-ov-file) (MIT)
219
+ - [LLMUnity](https://github.com/undreamai/LLMUnity) (MIT)
220
+ - [LMStudio](https://lmstudio.ai/) (proprietary)
221
+ - [LocalAI](https://github.com/mudler/LocalAI) (MIT)
222
+ - [LostRuins/koboldcpp](https://github.com/LostRuins/koboldcpp) (AGPL)
223
+ - [MindMac](https://mindmac.app) (proprietary)
224
+ - [MindWorkAI/AI-Studio](https://github.com/MindWorkAI/AI-Studio) (FSL-1.1-MIT)
225
+ - [Mobile-Artificial-Intelligence/maid](https://github.com/Mobile-Artificial-Intelligence/maid) (MIT)
226
+ - [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) (Apache-2.0)
227
+ - [nat/openplayground](https://github.com/nat/openplayground) (MIT)
228
+ - [nomic-ai/gpt4all](https://github.com/nomic-ai/gpt4all) (MIT)
229
+ - [ollama/ollama](https://github.com/ollama/ollama) (MIT)
230
+ - [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui) (AGPL)
231
+ - [PocketPal AI](https://github.com/a-ghorbani/pocketpal-ai) (MIT)
232
+ - [psugihara/FreeChat](https://github.com/psugihara/FreeChat) (MIT)
233
+ - [ptsochantaris/emeltal](https://github.com/ptsochantaris/emeltal) (MIT)
234
+ - [pythops/tenere](https://github.com/pythops/tenere) (AGPL)
235
+ - [ramalama](https://github.com/containers/ramalama) (MIT)
236
+ - [semperai/amica](https://github.com/semperai/amica) (MIT)
237
+ - [withcatai/catai](https://github.com/withcatai/catai) (MIT)
238
+ - [Autopen](https://github.com/blackhole89/autopen) (GPL)
239
+
240
+ </details>
241
+
242
+ <details>
243
+ <summary>Tools</summary>
244
+
245
+ - [akx/ggify](https://github.com/akx/ggify) – download PyTorch models from Hugging Face Hub and convert them to GGML
246
+ - [akx/ollama-dl](https://github.com/akx/ollama-dl) – download models from the Ollama library to be used directly with llama.cpp
247
+ - [crashr/gppm](https://github.com/crashr/gppm) – launch llama.cpp instances utilizing NVIDIA Tesla P40 or P100 GPUs with reduced idle power consumption
248
+ - [gpustack/gguf-parser](https://github.com/gpustack/gguf-parser-go/tree/main/cmd/gguf-parser) - review/check the GGUF file and estimate the memory usage
249
+ - [Styled Lines](https://marketplace.unity.com/packages/tools/generative-ai/styled-lines-llama-cpp-model-292902) (proprietary licensed, async wrapper of inference part for game development in Unity3d with pre-built Mobile and Web platform wrappers and a model example)
250
+ - [unslothai/unsloth](https://github.com/unslothai/unsloth) – 🦥 exports/saves fine-tuned and trained models to GGUF (Apache-2.0)
251
+
252
+ </details>
253
+
254
+ <details>
255
+ <summary>Infrastructure</summary>
256
+
257
+ - [Paddler](https://github.com/intentee/paddler) - Open-source LLMOps platform for hosting and scaling AI in your own infrastructure
258
+ - [GPUStack](https://github.com/gpustack/gpustack) - Manage GPU clusters for running LLMs
259
+ - [llama_cpp_canister](https://github.com/onicai/llama_cpp_canister) - llama.cpp as a smart contract on the Internet Computer, using WebAssembly
260
+ - [llama-swap](https://github.com/mostlygeek/llama-swap) - transparent proxy that adds automatic model switching with llama-server
261
+ - [Kalavai](https://github.com/kalavai-net/kalavai-client) - Crowdsource end to end LLM deployment at any scale
262
+ - [llmaz](https://github.com/InftyAI/llmaz) - ☸️ Easy, advanced inference platform for large language models on Kubernetes.
263
+ - [LLMKube](https://github.com/defilantech/llmkube) - Kubernetes operator for llama.cpp with multi-GPU and Apple Silicon Metal
264
+ support"
265
+ </details>
266
+
267
+ <details>
268
+ <summary>Games</summary>
269
+
270
+ - [Lucy's Labyrinth](https://github.com/MorganRO8/Lucys_Labyrinth) - A simple maze game where agents controlled by an AI model will try to trick you.
271
+
272
+ </details>
273
+
274
+
275
+ ## Supported backends
276
+
277
+ | Backend | Target devices |
278
+ | --- | --- |
279
+ | [Metal](docs/build.md#metal-build) | Apple Silicon |
280
+ | [BLAS](docs/build.md#blas-build) | All |
281
+ | [BLIS](docs/backend/BLIS.md) | All |
282
+ | [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
283
+ | [OpenVINO [In Progress]](docs/backend/OPENVINO.md) | Intel CPUs, GPUs, and NPUs |
284
+ | [MUSA](docs/build.md#musa) | Moore Threads GPU |
285
+ | [CUDA](docs/build.md#cuda) | Nvidia GPU |
286
+ | [HIP](docs/build.md#hip) | AMD GPU |
287
+ | [ZenDNN](docs/build.md#zendnn) | AMD CPU |
288
+ | [Vulkan](docs/build.md#vulkan) | GPU |
289
+ | [CANN](docs/build.md#cann) | Ascend NPU |
290
+ | [OpenCL](docs/backend/OPENCL.md) | Adreno GPU |
291
+ | [IBM zDNN](docs/backend/zDNN.md) | IBM Z & LinuxONE |
292
+ | [WebGPU [In Progress]](docs/build.md#webgpu) | All |
293
+ | [RPC](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) | All |
294
+ | [Hexagon [In Progress]](docs/backend/snapdragon/README.md) | Snapdragon |
295
+ | [VirtGPU](docs/backend/VirtGPU.md) | VirtGPU APIR |
296
+
297
+ ## Obtaining and quantizing models
298
+
299
+ The [Hugging Face](https://huggingface.co) platform hosts a [number of LLMs](https://huggingface.co/models?library=gguf&sort=trending) compatible with `llama.cpp`:
300
+
301
+ - [Trending](https://huggingface.co/models?library=gguf&sort=trending)
302
+ - [LLaMA](https://huggingface.co/models?sort=trending&search=llama+gguf)
303
+
304
+ You can either manually download the GGUF file or directly use any `llama.cpp`-compatible models from [Hugging Face](https://huggingface.co/) or other model hosting sites, by using this CLI argument: `-hf <user>/<model>[:quant]`. For example:
305
+
306
+ ```sh
307
+ llama-cli -hf ggml-org/gemma-3-1b-it-GGUF
308
+ ```
309
+
310
+ By default, the CLI would download from Hugging Face, you can switch to other options with the environment variable `MODEL_ENDPOINT`. The `MODEL_ENDPOINT` must point to a Hugging Face compatible API endpoint.
311
+
312
+ After downloading a model, use the CLI tools to run it locally - see below.
313
+
314
+ `llama.cpp` requires the model to be stored in the [GGUF](https://github.com/ggml-org/ggml/blob/master/docs/gguf.md) file format. Models in other data formats can be converted to GGUF using the `convert_*.py` Python scripts in this repo.
315
+
316
+ The Hugging Face platform provides a variety of online tools for converting, quantizing and hosting models with `llama.cpp`:
317
+
318
+ - Use the [GGUF-my-repo space](https://huggingface.co/spaces/ggml-org/gguf-my-repo) to convert to GGUF format and quantize model weights to smaller sizes
319
+ - Use the [GGUF-my-LoRA space](https://huggingface.co/spaces/ggml-org/gguf-my-lora) to convert LoRA adapters to GGUF format (more info: https://github.com/ggml-org/llama.cpp/discussions/10123)
320
+ - Use the [GGUF-editor space](https://huggingface.co/spaces/CISCai/gguf-editor) to edit GGUF meta data in the browser (more info: https://github.com/ggml-org/llama.cpp/discussions/9268)
321
+ - Use the [Inference Endpoints](https://ui.endpoints.huggingface.co/) to directly host `llama.cpp` in the cloud (more info: https://github.com/ggml-org/llama.cpp/discussions/9669)
322
+
323
+ To learn more about model quantization, [read this documentation](tools/quantize/README.md)
324
+
325
+ ## [`llama-cli`](tools/cli)
326
+
327
+ #### A CLI tool for accessing and experimenting with most of `llama.cpp`'s functionality.
328
+
329
+ - <details open>
330
+ <summary>Run in conversation mode</summary>
331
+
332
+ Models with a built-in chat template will automatically activate conversation mode. If this doesn't occur, you can manually enable it by adding `-cnv` and specifying a suitable chat template with `--chat-template NAME`
333
+
334
+ ```bash
335
+ llama-cli -m model.gguf
336
+
337
+ # > hi, who are you?
338
+ # Hi there! I'm your helpful assistant! I'm an AI-powered chatbot designed to assist and provide information to users like you. I'm here to help answer your questions, provide guidance, and offer support on a wide range of topics. I'm a friendly and knowledgeable AI, and I'm always happy to help with anything you need. What's on your mind, and how can I assist you today?
339
+ #
340
+ # > what is 1+1?
341
+ # Easy peasy! The answer to 1+1 is... 2!
342
+ ```
343
+
344
+ </details>
345
+
346
+ - <details>
347
+ <summary>Run in conversation mode with custom chat template</summary>
348
+
349
+ ```bash
350
+ # use the "chatml" template (use -h to see the list of supported templates)
351
+ llama-cli -m model.gguf -cnv --chat-template chatml
352
+
353
+ # use a custom template
354
+ llama-cli -m model.gguf -cnv --in-prefix 'User: ' --reverse-prompt 'User:'
355
+ ```
356
+
357
+ </details>
358
+
359
+ - <details>
360
+ <summary>Constrain the output with a custom grammar</summary>
361
+
362
+ ```bash
363
+ llama-cli -m model.gguf -n 256 --grammar-file grammars/json.gbnf -p 'Request: schedule a call at 8pm; Command:'
364
+
365
+ # {"appointmentTime": "8pm", "appointmentDetails": "schedule a a call"}
366
+ ```
367
+
368
+ The [grammars/](grammars/) folder contains a handful of sample grammars. To write your own, check out the [GBNF Guide](grammars/README.md).
369
+
370
+ For authoring more complex JSON grammars, check out https://grammar.intrinsiclabs.ai/
371
+
372
+ </details>
373
+
374
+
375
+ ## [`llama-server`](tools/server)
376
+
377
+ #### A lightweight, [OpenAI API](https://github.com/openai/openai-openapi) compatible, HTTP server for serving LLMs.
378
+
379
+ - <details open>
380
+ <summary>Start a local HTTP server with default configuration on port 8080</summary>
381
+
382
+ ```bash
383
+ llama-server -m model.gguf --port 8080
384
+
385
+ # Basic web UI can be accessed via browser: http://localhost:8080
386
+ # Chat completion endpoint: http://localhost:8080/v1/chat/completions
387
+ ```
388
+
389
+ </details>
390
+
391
+ - <details>
392
+ <summary>Support multiple-users and parallel decoding</summary>
393
+
394
+ ```bash
395
+ # up to 4 concurrent requests, each with 4096 max context
396
+ llama-server -m model.gguf -c 16384 -np 4
397
+ ```
398
+
399
+ </details>
400
+
401
+ - <details>
402
+ <summary>Enable speculative decoding</summary>
403
+
404
+ ```bash
405
+ # the draft.gguf model should be a small variant of the target model.gguf
406
+ llama-server -m model.gguf -md draft.gguf
407
+ ```
408
+
409
+ </details>
410
+
411
+ - <details>
412
+ <summary>Serve an embedding model</summary>
413
+
414
+ ```bash
415
+ # use the /embedding endpoint
416
+ llama-server -m model.gguf --embedding --pooling cls -ub 8192
417
+ ```
418
+
419
+ </details>
420
+
421
+ - <details>
422
+ <summary>Serve a reranking model</summary>
423
+
424
+ ```bash
425
+ # use the /reranking endpoint
426
+ llama-server -m model.gguf --reranking
427
+ ```
428
+
429
+ </details>
430
+
431
+ - <details>
432
+ <summary>Constrain all outputs with a grammar</summary>
433
+
434
+ ```bash
435
+ # custom grammar
436
+ llama-server -m model.gguf --grammar-file grammar.gbnf
437
+
438
+ # JSON
439
+ llama-server -m model.gguf --grammar-file grammars/json.gbnf
440
+ ```
441
+
442
+ </details>
443
+
444
+
445
+ ## [`llama-perplexity`](tools/perplexity)
446
+
447
+ #### A tool for measuring the [perplexity](tools/perplexity/README.md) [^1] (and other quality metrics) of a model over a given text.
448
+
449
+ - <details open>
450
+ <summary>Measure the perplexity over a text file</summary>
451
+
452
+ ```bash
453
+ llama-perplexity -m model.gguf -f file.txt
454
+
455
+ # [1]15.2701,[2]5.4007,[3]5.3073,[4]6.2965,[5]5.8940,[6]5.6096,[7]5.7942,[8]4.9297, ...
456
+ # Final estimate: PPL = 5.4007 +/- 0.67339
457
+ ```
458
+
459
+ </details>
460
+
461
+ - <details>
462
+ <summary>Measure KL divergence</summary>
463
+
464
+ ```bash
465
+ # TODO
466
+ ```
467
+
468
+ </details>
469
+
470
+ [^1]: [https://huggingface.co/docs/transformers/perplexity](https://huggingface.co/docs/transformers/perplexity)
471
+
472
+ ## [`llama-bench`](tools/llama-bench)
473
+
474
+ #### Benchmark the performance of the inference for various parameters.
475
+
476
+ - <details open>
477
+ <summary>Run default benchmark</summary>
478
+
479
+ ```bash
480
+ llama-bench -m model.gguf
481
+
482
+ # Output:
483
+ # | model | size | params | backend | threads | test | t/s |
484
+ # | ------------------- | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
485
+ # | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | pp512 | 5765.41 ± 20.55 |
486
+ # | qwen2 1.5B Q4_0 | 885.97 MiB | 1.54 B | Metal,BLAS | 16 | tg128 | 197.71 ± 0.81 |
487
+ #
488
+ # build: 3e0ba0e60 (4229)
489
+ ```
490
+
491
+ </details>
492
+
493
+ ## [`llama-simple`](examples/simple)
494
+
495
+ #### A minimal example for implementing apps with `llama.cpp`. Useful for developers.
496
+
497
+ - <details>
498
+ <summary>Basic text completion</summary>
499
+
500
+ ```bash
501
+ llama-simple -m model.gguf
502
+
503
+ # Hello my name is Kaitlyn and I am a 16 year old girl. I am a junior in high school and I am currently taking a class called "The Art of
504
+ ```
505
+
506
+ </details>
507
+
508
+
509
+ ## Contributing
510
+
511
+ - Contributors can open PRs
512
+ - Collaborators will be invited based on contributions
513
+ - Maintainers can push to branches in the `llama.cpp` repo and merge PRs into the `master` branch
514
+ - Any help with managing issues, PRs and projects is very appreciated!
515
+ - See [good first issues](https://github.com/ggml-org/llama.cpp/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) for tasks suitable for first contributions
516
+ - Read the [CONTRIBUTING.md](CONTRIBUTING.md) for more information
517
+ - Make sure to read this: [Inference at the edge](https://github.com/ggml-org/llama.cpp/discussions/205)
518
+ - A bit of backstory for those who are interested: [Changelog podcast](https://changelog.com/podcast/532)
519
+
520
+ ## Other documentation
521
+
522
+ - [cli](tools/cli/README.md)
523
+ - [completion](tools/completion/README.md)
524
+ - [server](tools/server/README.md)
525
+ - [GBNF grammars](grammars/README.md)
526
+
527
+ #### Development documentation
528
+
529
+ - [How to build](docs/build.md)
530
+ - [Running on Docker](docs/docker.md)
531
+ - [Build on Android](docs/android.md)
532
+ - [Performance troubleshooting](docs/development/token_generation_performance_tips.md)
533
+ - [GGML tips & tricks](https://github.com/ggml-org/llama.cpp/wiki/GGML-Tips-&-Tricks)
534
+
535
+ #### Seminal papers and background on the models
536
+
537
+ If your issue is with model generation quality, then please at least scan the following links and papers to understand the limitations of LLaMA models. This is especially important when choosing an appropriate model size and appreciating both the significant and subtle differences between LLaMA models and ChatGPT:
538
+ - LLaMA:
539
+ - [Introducing LLaMA: A foundational, 65-billion-parameter large language model](https://ai.facebook.com/blog/large-language-model-llama-meta-ai/)
540
+ - [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
541
+ - GPT-3
542
+ - [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
543
+ - GPT-3.5 / InstructGPT / ChatGPT:
544
+ - [Aligning language models to follow instructions](https://openai.com/research/instruction-following)
545
+ - [Training language models to follow instructions with human feedback](https://arxiv.org/abs/2203.02155)
546
+
547
+ ## XCFramework
548
+ The XCFramework is a precompiled version of the library for iOS, visionOS, tvOS,
549
+ and macOS. It can be used in Swift projects without the need to compile the
550
+ library from source. For example:
551
+ ```swift
552
+ // swift-tools-version: 5.10
553
+ // The swift-tools-version declares the minimum version of Swift required to build this package.
554
+
555
+ import PackageDescription
556
+
557
+ let package = Package(
558
+ name: "MyLlamaPackage",
559
+ targets: [
560
+ .executableTarget(
561
+ name: "MyLlamaPackage",
562
+ dependencies: [
563
+ "LlamaFramework"
564
+ ]),
565
+ .binaryTarget(
566
+ name: "LlamaFramework",
567
+ url: "https://github.com/ggml-org/llama.cpp/releases/download/b5046/llama-b5046-xcframework.zip",
568
+ checksum: "c19be78b5f00d8d29a25da41042cb7afa094cbf6280a225abe614b03b20029ab"
569
+ )
570
+ ]
571
+ )
572
+ ```
573
+ The above example is using an intermediate build `b5046` of the library. This can be modified
574
+ to use a different version by changing the URL and checksum.
575
+
576
+ ## Completions
577
+ Command-line completion is available for some environments.
578
+
579
+ #### Bash Completion
580
  ```bash
581
+ $ build/bin/llama-cli --completion-bash > ~/.llama-completion.bash
582
+ $ source ~/.llama-completion.bash
583
+ ```
584
+ Optionally this can be added to your `.bashrc` or `.bash_profile` to load it
585
+ automatically. For example:
586
+ ```console
587
+ $ echo "source ~/.llama-completion.bash" >> ~/.bashrc
588
+ ```
589
+
590
+ ## Dependencies
591
+
592
+ - [yhirose/cpp-httplib](https://github.com/yhirose/cpp-httplib) - Single-header HTTP server, used by `llama-server` - MIT license
593
+ - [stb-image](https://github.com/nothings/stb) - Single-header image format decoder, used by multimodal subsystem - Public domain
594
+ - [nlohmann/json](https://github.com/nlohmann/json) - Single-header JSON library, used by various tools/examples - MIT License
595
+ - [miniaudio.h](https://github.com/mackron/miniaudio) - Single-header audio format decoder, used by multimodal subsystem - Public domain
596
+ - [subprocess.h](https://github.com/sheredom/subprocess.h) - Single-header process launching solution for C and C++ - Public domain