Update README.md
Browse files
README.md
CHANGED
|
@@ -10,7 +10,11 @@ This repo contains the model weights for **Instinct**, [Continue](https://contin
|
|
| 10 |
|
| 11 |
## Serving the model
|
| 12 |
|
| 13 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 14 |
|
| 15 |
* SGLang: `python3 -m sglang.launch_server --model-path continuedev/instinct --load-format safetensors`
|
| 16 |
* vLLM : `vllm serve continuedev/instinct --served-model-name instinct --load-format safetensors`
|
|
|
|
| 10 |
|
| 11 |
## Serving the model
|
| 12 |
|
| 13 |
+
We've released a [Q4_K_M GGUF quantization of Instinct](https://huggingface.co/continuedev/instinct-GGUF), for efficient local inference.
|
| 14 |
+
|
| 15 |
+
[Ollama Instructions coming soon]
|
| 16 |
+
|
| 17 |
+
Besides Ollama, there are many ways to plug a local model into Continue; we internally used an endpoint served by [SGLang](https://github.com/sgl-project/sglang), which is one of the options below. Quantizing for faster inference is also an option that worked well for us.
|
| 18 |
|
| 19 |
* SGLang: `python3 -m sglang.launch_server --model-path continuedev/instinct --load-format safetensors`
|
| 20 |
* vLLM : `vllm serve continuedev/instinct --served-model-name instinct --load-format safetensors`
|