Instructions to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF",
	filename="DeepSeek-Coder-V2-Lite-Instruct-IQ2_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M

Use Docker

docker model run hf.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M

vLLM

How to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
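The same request can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_chat_request` and `ask` are made up for illustration, and the base URL assumes the server is listening on `localhost:8000` as in the curl example.

```python
import json
import urllib.request

MODEL = "bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF"


def build_chat_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """Send the prompt to a running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(ask("What is the capital of France?"))` should print the reply. For a llama.cpp server, pass that server's own address as `base_url`.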

Use Docker

docker model run hf.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M

Ollama
How to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with Ollama:
```
ollama run hf.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M
```

Unsloth Studio

How to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF to start chatting

Docker Model Runner
How to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with Docker Model Runner:
```
docker model run hf.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M
```

Lemonade

How to use bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.DeepSeek-Coder-V2-Lite-Instruct-GGUF-Q4_K_M

List all available models

lemonade list

Error loading model

by sm54 - opened Jun 17, 2024

Discussion

sm54

Jun 17, 2024

•

edited Jun 17, 2024

Hello,

I tried loading the Q8_0 quant with text-generation-webui on Windows, and I get this error:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'deepseek2'
llama_load_model_from_file: failed to load model
19:48:02-513121 ERROR Failed to load the model.

bartowski

Owner Jun 17, 2024

The llama-cpp used by text-generation-webui needs an update.
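For context, the "unknown model architecture" message refers to the architecture name stored in the GGUF file's metadata under the key `general.architecture`; a loader built before support for that name was added rejects the file. The sketch below reads that key straight from the file header so you can see what a given file declares. It assumes the GGUF v2 or later layout (64-bit counts and string lengths), handles only string and fixed-size scalar values, and gives up if it meets an array before finding the key. The function names are hypothetical.

```python
import struct
from typing import BinaryIO, Optional

GGUF_STRING = 8  # metadata value-type id for strings
# Byte sizes of the fixed-size scalar value types (uint8 ... float64).
SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}


def _read_string(f: BinaryIO) -> str:
    """Read a GGUF string: little-endian uint64 length followed by UTF-8 bytes."""
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def read_gguf_architecture(f: BinaryIO) -> Optional[str]:
    """Return the `general.architecture` value, or None if this sketch cannot find it."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    # Header: uint32 version, uint64 tensor count, uint64 metadata key/value count.
    _version, _tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    for _ in range(kv_count):
        key = _read_string(f)
        (value_type,) = struct.unpack("<I", f.read(4))
        if value_type == GGUF_STRING:
            value = _read_string(f)
            if key == "general.architecture":
                return value
        elif value_type in SCALAR_SIZES:
            f.seek(SCALAR_SIZES[value_type], 1)  # skip over the scalar value
        else:
            return None  # arrays are not handled in this sketch
    return None
```

Usage, with a placeholder path: `with open("model.gguf", "rb") as f: print(read_gguf_architecture(f))`. If this prints `deepseek2` and the loader still rejects the file, the loader is the part that needs updating.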

saintjohnny

Jun 18, 2024

Turn off flash attention. This seems to be a known bug.

bartowski

Owner Jun 18, 2024

i would think that's a different error than 'unknown model architecture' but i may be wrong

wrtn2

Jun 19, 2024

Loading some layers to GPU (-ngl) with latest llama.cpp returned "llama_init_from_gpt_params: error: failed to load model".
Using only CPU solved this for me (as mentioned here https://github.com/ggerganov/llama.cpp/pull/7519).
Using flash attention (-fa) gave error: "GGML_ASSERT: ggml.c:5716: ggml_nelements(a) == ne0*ne1".

bartowski

Owner Jun 19, 2024

@wrtn2 you have to disable flash attention for this model to use GPU
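For llama-cpp-python users, the same advice can be written down as explicit constructor arguments. This is a sketch, assuming a llama-cpp-python release whose `Llama` constructor accepts `flash_attn`; the helper names `build_llama_kwargs` and `load_model` and the default context size are illustrative, not taken from the thread.

```python
from typing import Any, Dict


def build_llama_kwargs(model_path: str, gpu_layers: int = -1, n_ctx: int = 4096) -> Dict[str, Any]:
    """Keyword arguments for llama_cpp.Llama with flash attention explicitly off."""
    return {
        "model_path": model_path,
        "n_gpu_layers": gpu_layers,  # -1 offloads all layers, 0 keeps everything on the CPU
        "n_ctx": n_ctx,
        "flash_attn": False,  # the thread reports an assertion failure with flash attention on
    }


def load_model(model_path: str, **overrides: Any):
    """Load the model; llama_cpp is imported lazily so the helper above works without it."""
    from llama_cpp import Llama

    kwargs = build_llama_kwargs(model_path)
    kwargs.update(overrides)
    return Llama(**kwargs)
```

If GPU loading still fails, `load_model(path, n_gpu_layers=0)` falls back to the CPU-only setup that worked earlier in the thread.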

wrtn2

Jun 20, 2024

@bartowski Thanks, good to know! In my case the card lacks sufficient RAM, so I'd set llama to load only a subset of the layers on the GPU, which is possible with a number of models, but seems not to be on this one.

paolovic

Jun 20, 2024

Hi all,
could you tell me how you got it running?

Right now, I am using this cumbersome notebook (ipynb) code:

from llama_cpp import Llama

llm = Llama(
      model_path="/DeepSeek-Coder-V2-Lite-Instruct-GGUF/DeepSeek-Coder-V2-Lite-Instruct-Q8_1.gguf",
      n_gpu_layers=-1, # Offload all layers to the GPU; set to 0 for CPU only
      # seed=1337, # Uncomment to set a specific seed
      n_ctx=8*2048, # Context window of 16384 tokens
)

response = llm.create_chat_completion(
      messages = [
          {"role": "system", "content": "You are a helpful coding assistant."},
          {
              "role": "user",
              "content": "give me quick sort in c++."
          }
      ]
)
print(response["choices"][0]["message"]["content"])

Is there a more convenient way, using huggingface or anything else?

Thank you in advance!

bartowski

Owner Jun 20, 2024

(updated the name to Q8_0_L from Q8_1 just now fyi)

That looks like a fine implementation. Is there an issue you're running into, or are you just trying to find a better way?

paolovic

Jun 20, 2024

alright, great, thank you very much!

I am just used to the transformers lingo and thought, maybe there's a better way.

and thanks for the fast reply!

Konstantin89

Jun 22, 2024

Hi. I wanted to test a model of up to 8 GB, so I downloaded an IQ3 quant. It doesn't work in GPT4All or LM Studio. I'd appreciate it if you could help me get it up and running.

someuser44

Jun 23, 2024

Getting this in LM Studio with flash attention off. I tried both GPU offload and CPU only and got the same message. Not sure what to do :/ The preset is Deepseek Coder; maybe it needs a Deepseek Coder Instruct preset?

error:
"llama.cpp error: 'error loading model architecture: unknown model architecture: 'deepseek2''"

bartowski

Owner Jun 23, 2024

Update LM Studio to 0.2.25 from the website, or ignore this if you're already on it.

Honeywest

Jul 1, 2024

•

edited Jul 1, 2024

I'm running LM Studio 0.2.26, and it fails. I also tried GPT4All, Jan, and Ollama with ChatOllama. Nothing will load this model. I tried Q4 and Q8, and flash attention is disabled. How do I use this model?

OK, I figured it out. If you are using Ollama in Docker, pull the latest image:

docker pull ollama/ollama:latest
