Error loading model
Hello,
I tried loading the Q8_0 quant in text-generation-webui on Windows and got this error:
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'deepseek2'
llama_load_model_from_file: failed to load model
19:48:02-513121 ERROR Failed to load the model.
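The llama-cpp-python bundled with text-generation-webui needs an update; its llama.cpp build does not recognize the 'deepseek2' architecture yet.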
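One way to see which llama-cpp-python build an environment is actually using is to ask Python's package metadata. This is a minimal sketch, meant to be run inside the same Python environment the web UI uses. The distribution name `llama-cpp-python` is an assumption: some front ends install the bindings under a different name, so pass that name instead if the lookup returns nothing. The thread does not say which release first supports 'deepseek2', so the sketch only reports the version.

```python
from importlib import metadata


def installed_version(package: str = "llama-cpp-python"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("llama-cpp-python is not installed in this environment")
else:
    print(f"llama-cpp-python {version}")
```

If the reported version is old, upgrading the front end (or its bundled bindings) is the fix suggested above.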
Turn off flash attention. This seems to be a known bug.
I would think that's a different error from 'unknown model architecture', but I may be wrong.
Loading some layers to GPU (-ngl) with latest llama.cpp returned "llama_init_from_gpt_params: error: failed to load model".
Using only CPU solved this for me (as mentioned here https://github.com/ggerganov/llama.cpp/pull/7519).
Using flash attention (-fa) gave error: "GGML_ASSERT: ggml.c:5716: ggml_nelements(a) == ne0*ne1".
@bartowski Thanks, good to know! In my case the card lacks sufficient RAM, so I'd set llama to load only a subset of the layers on the GPU, which is possible with a number of models, but seems not to be on this one.
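For llama-cpp-python users, the CPU-only workaround with flash attention off could be expressed roughly as below. This is a sketch, not a confirmed fix: `cpu_only_kwargs` and `load_cpu_only` are hypothetical helper names, the default `n_ctx` is arbitrary, and the model path is whatever file you downloaded. `n_gpu_layers=0` and `flash_attn=False` are existing `Llama` constructor arguments.

```python
def cpu_only_kwargs(model_path: str, n_ctx: int = 4096) -> dict:
    """Build Llama() arguments that keep every layer on the CPU."""
    return {
        "model_path": model_path,
        "n_gpu_layers": 0,    # no layers offloaded to the GPU (the -ngl workaround)
        "flash_attn": False,  # flash attention off (the -fa assertion workaround)
        "n_ctx": n_ctx,
    }


def load_cpu_only(model_path: str, n_ctx: int = 4096):
    """Load a GGUF model on the CPU only; requires llama-cpp-python."""
    from llama_cpp import Llama  # imported lazily so the helper above works without it

    return Llama(**cpu_only_kwargs(model_path, n_ctx))


# Example (hypothetical local path):
# llm = load_cpu_only("DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf")
```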
Hi all,
could you tell me how you got it running?
Right now I am using this cumbersome notebook (ipynb):
from llama_cpp import Llama

llm = Llama(
    model_path="/DeepSeek-Coder-V2-Lite-Instruct-GGUF/DeepSeek-Coder-V2-Lite-Instruct-Q8_1.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU; remove to run on CPU only
    # seed=1337,  # uncomment to set a specific seed
    n_ctx=8 * 2048,  # context window of 16384 tokens
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "give me quick sort in c++."},
    ]
)

# Print only the assistant's reply text
print(response["choices"][0]["message"]["content"])
Is there a more convenient way, using huggingface or anything else?
Thank you in advance!
(updated the name to Q8_0_L from Q8_1 just now fyi)
That looks like a fine implementation. Is there an issue you're running into, or are you just trying to find a better way?
alright, great, thank you very much!
I am just used to the transformers lingo and thought, maybe there's a better way.
and thanks for the fast reply!
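For readers coming from transformers, the closest equivalent in llama-cpp-python is probably `Llama.from_pretrained`, which downloads a GGUF file from the Hugging Face Hub (it needs the `huggingface-hub` package) and then loads it. The sketch below is one possible arrangement, not the recommended way: `build_messages` and `chat` are hypothetical helper names, and the filename is the IQ2_M file named on the model page, so swap in whichever quant you want.

```python
REPO_ID = "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF"
FILENAME = "DeepSeek-Coder-V2-Lite-Instruct-IQ2_M.gguf"  # swap in the quant you want


def build_messages(user_prompt: str,
                   system_prompt: str = "You are a helpful coding assistant.") -> list:
    """Return a chat history in the format create_chat_completion expects."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def chat(user_prompt: str, n_gpu_layers: int = -1, n_ctx: int = 8 * 2048) -> str:
    """Download the model from the Hub if needed, then answer one prompt."""
    from llama_cpp import Llama  # requires llama-cpp-python and huggingface-hub

    llm = Llama.from_pretrained(
        repo_id=REPO_ID,
        filename=FILENAME,
        n_gpu_layers=n_gpu_layers,  # extra keyword arguments are passed on to Llama()
        n_ctx=n_ctx,
    )
    response = llm.create_chat_completion(messages=build_messages(user_prompt))
    return response["choices"][0]["message"]["content"]


# Example:
# print(chat("give me quick sort in c++."))
```

The only real gain over the notebook above is that the download and local path handling disappear; the chat call itself is unchanged.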
Hi. I wanted to test a model of up to 8 GB, so I downloaded an IQ3 quant. It doesn't load in GPT4All or LM Studio. I'd appreciate it if you could help me get it up and running.
Getting this in LM Studio with flash attention off. I tried both GPU offload and CPU only and got the same message. Not sure what to do :/ The preset is DeepSeek Coder; maybe it needs a DeepSeek Coder Instruct preset?
error:
"llama.cpp error: 'error loading model architecture: unknown model architecture: 'deepseek2''"
Update LM Studio to 0.2.25 from the website (ignore this if you're already on it).
I'm running LM Studio 0.2.26, and it fails. I also tried GPT4All, Jan, and Ollama with ChatOllama; nothing will load this model. I tried Q4 and Q8, with flash attention disabled. How do I use this model?
OK, I figured it out. If you are running Ollama in Docker, pull the latest image:
docker pull ollama/ollama:latest
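The error messages reported in this thread map to three different workarounds, which can be collected in a small lookup for triaging a log. This is only an illustrative sketch: `KNOWN_ERRORS` and `triage` are hypothetical names, the suggestions simply restate what posters reported here, and the first matching entry wins because a single log can contain more than one of these strings.

```python
from typing import Optional

# (lowercase log substring, suggestion reported in this thread), most specific first.
KNOWN_ERRORS = [
    ("unknown model architecture",
     "Update the app or its bundled llama.cpp / llama-cpp-python; "
     "this build does not recognize the model architecture."),
    ("ggml_nelements(a) == ne0*ne1",
     "Turn off flash attention (-fa)."),
    ("failed to load model",
     "Try loading on CPU only, without offloading layers to the GPU (-ngl)."),
]


def triage(log_text: str) -> Optional[str]:
    """Return the first matching suggestion for a log, or None if nothing matches."""
    lowered = log_text.lower()
    for needle, suggestion in KNOWN_ERRORS:
        if needle in lowered:
            return suggestion
    return None


log = (
    "llama_model_load: error loading model: error loading model architecture: "
    "unknown model architecture: 'deepseek2'\n"
    "llama_load_model_from_file: failed to load model"
)
print(triage(log))
```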