Commit e54089f
Parent(s): 8f5811e
Upload folder using huggingface_hub

README.md CHANGED
@@ -68,7 +68,7 @@ The following clients/libraries will automatically download models for you, prov
 
 ### In `text-generation-webui`
 
-Under Download Model, you can enter the model repo: andrijdavid/MiniMA-2-3B-GGUF and below it, a specific filename to download, such as: MiniMA-2-3B
+Under Download Model, you can enter the model repo: andrijdavid/MiniMA-2-3B-GGUF and below it, a specific filename to download, such as: MiniMA-2-3B.gguf.
 
 Then click Download.
 
@@ -83,7 +83,7 @@ pip3 install huggingface-hub
 Then you can download any individual model file to the current directory, at high speed, with a command like this:
 
 ```shell
-huggingface-cli download andrijdavid/MiniMA-2-3B-GGUF MiniMA-2-3B
+huggingface-cli download andrijdavid/MiniMA-2-3B-GGUF MiniMA-2-3B.gguf --local-dir . --local-dir-use-symlinks False
 ```
 
 <details>
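The corrected CLI call in this hunk can also be composed programmatically before shelling out. A minimal sketch; the `hf_download_cmd` helper is hypothetical (not part of `huggingface_hub`), and only assembles the exact command shown above:

```python
import shlex

def hf_download_cmd(repo_id: str, filename: str, local_dir: str = ".") -> str:
    """Build the huggingface-cli invocation from the README (hypothetical helper)."""
    return shlex.join([
        "huggingface-cli", "download", repo_id, filename,
        "--local-dir", local_dir,
        "--local-dir-use-symlinks", "False",
    ])

print(hf_download_cmd("andrijdavid/MiniMA-2-3B-GGUF", "MiniMA-2-3B.gguf"))
```

Using `shlex.join` keeps the command safe to paste into a shell even if a path contains spaces.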
@@ -106,7 +106,7 @@ pip3 install hf_transfer
 And set environment variable `HF_HUB_ENABLE_HF_TRANSFER` to `1`:
 
 ```shell
-HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download andrijdavid/MiniMA-2-3B-GGUF MiniMA-2-3B
+HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download andrijdavid/MiniMA-2-3B-GGUF MiniMA-2-3B.gguf --local-dir . --local-dir-use-symlinks False
 ```
 
 Windows Command Line users: You can set the environment variable by running `set HF_HUB_ENABLE_HF_TRANSFER=1` before the download command.
@@ -118,7 +118,7 @@ Windows Command Line users: You can set the environment variable by running `set
 Make sure you are using `llama.cpp` from commit [d0cee0d](https://github.com/ggerganov/llama.cpp/commit/d0cee0d36d5be95a0d9088b674dbb27354107221) or later.
 
 ```shell
-./main -ngl 35 -m MiniMA-2-3B
+./main -ngl 35 -m MiniMA-2-3B.gguf --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "<PROMPT>"
 ```
 
 Change `-ngl 32` to the number of layers to offload to GPU. Remove it if you don't have GPU acceleration.
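When scripting `llama.cpp` rather than typing the command by hand, the flags in the new line can be assembled as an argument vector and handed to `subprocess`. A sketch under the README's flag values; `main_args` is a hypothetical helper, and dropping `-ngl` entirely (as the context line advises) is handled by `ngl=0`:

```python
def main_args(model: str, ngl: int = 35, ctx: int = 4096,
              temp: float = 0.7, prompt: str = "<PROMPT>") -> list[str]:
    """Argument vector for llama.cpp's ./main, mirroring the README example."""
    args = ["./main", "-m", model, "--color",
            "-c", str(ctx), "--temp", str(temp),
            "--repeat_penalty", "1.1", "-n", "-1", "-p", prompt]
    if ngl > 0:  # omit -ngl when there is no GPU acceleration
        args[1:1] = ["-ngl", str(ngl)]
    return args

# e.g. subprocess.run(main_args("MiniMA-2-3B.gguf"))
```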
@@ -169,7 +169,7 @@ pip install llama-cpp-python
 from llama_cpp import Llama
 # Set gpu_layers to the number of layers to offload to GPU. Set to 0 if no GPU acceleration is available on your system.
 llm = Llama(
-  model_path="./MiniMA-2-3B
+  model_path="./MiniMA-2-3B.gguf",  # Download the model file first
   n_ctx=32768,  # The max sequence length to use - note that longer sequence lengths require much more resources
   n_threads=8,  # The number of CPU threads to use, tailor to your system and the resulting performance
   n_gpu_layers=35  # The number of layers to offload to GPU, if you have GPU acceleration available
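The hard-coded `n_threads=8` and `n_gpu_layers=35` in this hunk are per-machine tuning knobs. As a sketch of deriving them from the host instead, assuming a `llama_kwargs` helper that is not part of `llama-cpp-python`:

```python
import os

def llama_kwargs(model_path: str, gpu_layers: int = 0) -> dict:
    """Heuristic keyword arguments for llama_cpp.Llama (hypothetical helper)."""
    return {
        "model_path": model_path,
        "n_ctx": 32768,                                   # max sequence length, as in the README
        "n_threads": max(1, (os.cpu_count() or 2) // 2),  # rough physical-core count
        "n_gpu_layers": gpu_layers,                       # 0 disables GPU offload
    }

# e.g. llm = Llama(**llama_kwargs("./MiniMA-2-3B.gguf", gpu_layers=35))
```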
@@ -182,7 +182,7 @@ output = llm(
   echo=True  # Whether to echo the prompt
 )
 # Chat Completion API
-llm = Llama(model_path="./MiniMA-2-3B
+llm = Llama(model_path="./MiniMA-2-3B.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
 llm.create_chat_completion(
   messages = [
     {"role": "system", "content": "You are a story writing assistant."},