Works well quantised to q8 on 2 x AMD 7900XTX cards

#11

by bkieser - opened Apr 7, 2024

Discussion

bkieser

Apr 7, 2024

I did the following:

Cloned the repo from HF.
Used llama.cpp to convert it to a native (F32) GGUF.
Used llama.cpp to quantise that to q8 (Q8_0, also GGUF output).
Ran the q8 version (with llama.cpp for now; ollama still to do).
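
For anyone wanting to retrace these steps, here is a rough sketch of the four commands, built as argument lists in Python so they can be inspected before running. The post does not give the exact commands used, so everything here is an assumption: the tool names (`convert_hf_to_gguf.py`, `llama-quantize`, `llama-cli`) follow recent llama.cpp checkouts and were named differently in older ones, and the directory layout, output file names, prompt, and `n_gpu_layers` value are hypothetical.

```python
from pathlib import Path

# Hugging Face repo for the model discussed in this thread.
REPO_URL = "https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-33B"


def build_commands(work_dir, llama_cpp_dir, n_gpu_layers=99):
    """Return the clone / convert / quantise / run commands as argv lists.

    Assumptions (not from the original post): tool names match a recent
    llama.cpp checkout, the built binaries sit directly in llama_cpp_dir,
    and the output file names below are placeholders.
    """
    work = Path(work_dir)
    llama = Path(llama_cpp_dir)
    model_dir = work / "OpenCodeInterpreter-DS-33B"
    f32_gguf = work / "model-f32.gguf"    # hypothetical file name
    q8_gguf = work / "model-q8_0.gguf"    # hypothetical file name

    return [
        # 1. Clone the repo from HF (needs git-lfs for the weight files).
        ["git", "clone", REPO_URL, str(model_dir)],
        # 2. Convert the checkpoint to an F32 GGUF.
        ["python3", str(llama / "convert_hf_to_gguf.py"), str(model_dir),
         "--outtype", "f32", "--outfile", str(f32_gguf)],
        # 3. Quantise the F32 GGUF to Q8_0.
        [str(llama / "llama-quantize"), str(f32_gguf), str(q8_gguf), "Q8_0"],
        # 4. Run the quantised model, offloading layers to the GPUs.
        [str(llama / "llama-cli"), "-m", str(q8_gguf),
         "-ngl", str(n_gpu_layers), "-p", "Write a snake game in Ruby."],
    ]


if __name__ == "__main__":
    for cmd in build_commands("work", "llama.cpp"):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths match your own setup. Check the script and binary names against your llama.cpp version first.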

In my experience it works as well as GPT-4 and Claude 3. It generated a snake game in Ruby, including threads that keep the snake moving while the keyboard stays responsive, in about 20 shots. In real-world use that is similar to GPT-4 or Claude 3 Opus, and much better than my real-world experience with Gemini.

Well done to the people who produced this model. It's the first truly usable model for coding, and good enough for use with agents such as CrewAI or AutoGen.

bkieser

Apr 7, 2024

I need to add: the 2 x Radeon 7900XTX cards were chosen because they are 50% (or more) cheaper than the Nvidia RTX 4090. The important thing is the VRAM. The 4090 GPU itself is way faster than the 7900XTX, but it has only 24GB of VRAM, which is too small for LLMs larger than 7B unless you quantise heavily, and that costs you some of the model's power. Each 7900XTX also has 24GB, so two of them give you twice the VRAM (48GB) for the same price as one 4090.

Another advantage is that a normal PC motherboard fits only 1 x 4090 because of the size of the card and its cooler, but you can squeeze in 2 x Radeon 7900XTX cards. To fit more of either card you need to step up to more specialised motherboards, which adds significantly to your costs.

So right now the best option is AMD Radeon 7900XTX cards on a standard motherboard (you need a larger power supply). With q8 quantisation there is almost no loss compared with the F32 or F16 versions of this model, and it will happily run at full context length across both GPUs, taking up 75% of the VRAM on each, giving you one heck of a powerful code generation capability at commodity hardware prices.
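
As a rough sanity check on the VRAM figures, here is a back-of-the-envelope estimate of the weight size at each precision. It is a sketch under stated assumptions: a nominal 33e9 parameters (taken from the "33B" in the model name, not an exact count), 8.5 bits per weight for llama.cpp's Q8_0 (blocks of 32 int8 values plus one 16-bit scale), and weights only, so the KV cache and runtime buffers needed at full context come on top.

```python
GB = 1e9  # decimal gigabytes, to match the "24GB" on the card specs


def weights_gb(n_params, bits_per_weight):
    """Approximate size of the model weights in GB at a given precision."""
    return n_params * bits_per_weight / 8 / GB


# Assumption: nominal parameter count taken from the "33B" in the model name.
N_PARAMS = 33e9

# Bits per weight. Q8_0 stores 32 int8 values plus one fp16 scale per block,
# i.e. 34 bytes per 32 weights = 8.5 bits per weight.
PRECISIONS = {"F32": 32.0, "F16": 16.0, "Q8_0": 8.5}

TOTAL_VRAM_GB = 2 * 24  # two 7900XTX cards with 24GB each

if __name__ == "__main__":
    for name, bits in PRECISIONS.items():
        size = weights_gb(N_PARAMS, bits)
        verdict = "fits" if size <= TOTAL_VRAM_GB else "does not fit"
        print(f"{name}: {size:.1f} GB of weights, "
              f"{size / TOTAL_VRAM_GB:.0%} of {TOTAL_VRAM_GB} GB, {verdict}")
```

Under these assumptions the Q8_0 weights come to roughly 35 GB, a little under three quarters of the combined 48GB, while F16 (66 GB) and F32 (132 GB) would not fit on the two cards at all.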

Very impressive.

aaabiao

Multimodal Art Projection org Apr 8, 2024

Thank you for sharing your experience and insights on using AMD Radeon 7900XTX cards for coding with the model. Your comparison with GPT-4 and Claude 3, along with the technical details on cost-effectiveness and performance, is highly appreciated. Thanks for contributing!

aaabiao changed discussion status to closed Apr 8, 2024
