Upload folder using huggingface_hub

by jburtoft - opened Apr 1, 2024

base: refs/heads/main ← from: refs/pr/4

Files changed: +634 -590

jburtoft

AWS Inferentia and Trainium org Apr 1, 2024 • edited Apr 1, 2024

Upload folder using huggingface_hub

Multi commit ID: 316934fa09af93adc0517f5ecf0eccdb2930b8e62bcfce4ceb35e7f1a75d88d9

Scheduled commits:

Upload 47 file(s) totalling 2.1G (24d0bb86ca5dc5a5652e6aad5becd44a8c3d7a236fc572dbf8869ef61b441b0a)
Upload 44 file(s) totalling 2.1G (78f45c2e4f4d4a1270dd6a054feb78a6b1a870533a015585866ba3d2046dba30)
Upload 50 file(s) totalling 2.1G (c6c758cd54e23662f0df24a6fbf9597e79b890c8990653a1224ab1a47f49649f)
Upload 46 file(s) totalling 2.1G (b18c4b6b25f4876fb074ad87a341156dd75696fd7ec74aa29b49c7a7b88da985)
Upload 44 file(s) totalling 2.1G (a25f81d3a81181e5b7b85d5911c297bbf70da65a9b2835e79e69084ddbd76d6b)
Upload 46 file(s) totalling 2.1G (21dae127a6432b6c51051e504231b180337cf2bf8ff7d7962e43bfa621b37e3e)
Upload 29 file(s) totalling 846.7M (2859057289238237110c14e24f6da0872277621881b8d6db4c95775174e5ebfb)
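
As a quick sanity check on the schedule above, the per-step file counts and sizes can be added up. This is only a sketch: the counts and sizes are copied from the list, the sizes are already rounded in the listing, and the conversion of `M` to `G` by a factor of 1000 is an assumption, so the total size is approximate.

```python
# (file count, size as printed) for each scheduled commit, copied from the list above.
steps = [
    (47, "2.1G"),
    (44, "2.1G"),
    (50, "2.1G"),
    (46, "2.1G"),
    (44, "2.1G"),
    (46, "2.1G"),
    (29, "846.7M"),
]

# Assumption: decimal units (1G = 1000M). The listing does not say which convention is used.
UNIT_TO_GB = {"G": 1.0, "M": 1.0 / 1000}


def size_in_gb(text: str) -> float:
    """Convert a size such as '2.1G' or '846.7M' to gigabytes."""
    return float(text[:-1]) * UNIT_TO_GB[text[-1]]


total_files = sum(count for count, _ in steps)
total_gb = sum(size_in_gb(size) for _, size in steps)

print(f"{len(steps)} commits, {total_files} files, about {total_gb:.1f}G")
```

Because each listed size is rounded to one decimal place, the real total may differ from this figure by a few hundred megabytes.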

This PR was opened using the huggingface_hub library as part of a multi-commit. You can comment on it like any other PR. However, manually updating the PR description, changing the PR status, or pushing new commits is not recommended, as it might corrupt the commit process. Learn more about multi-commits in this guide.

24d0bb86ca5dc5a5652e6aad5becd44a8c3d7a236fc572dbf8869ef61b441b0a 18895a45

78f45c2e4f4d4a1270dd6a054feb78a6b1a870533a015585866ba3d2046dba30 d8905375

c6c758cd54e23662f0df24a6fbf9597e79b890c8990653a1224ab1a47f49649f b07b007e

b18c4b6b25f4876fb074ad87a341156dd75696fd7ec74aa29b49c7a7b88da985 59c7c14c

a25f81d3a81181e5b7b85d5911c297bbf70da65a9b2835e79e69084ddbd76d6b 3d84501e

21dae127a6432b6c51051e504231b180337cf2bf8ff7d7962e43bfa621b37e3e f69d3a32

jburtoft

AWS Inferentia and Trainium org Apr 1, 2024

The multi-commit is now complete! You can ping the repo owner to review the changes. This PR can now be commented on or modified without risk of corrupting it.

This is a comment posted using the huggingface_hub library in the context of a multi-commit. Learn more about multi-commits in this guide.

jburtoft changed pull request status to open Apr 1, 2024

jburtoft

AWS Inferentia and Trainium org Apr 1, 2024

create_pr=False was passed, so the PR was merged automatically.

This is a comment posted using the huggingface_hub library in the context of a multi-commit. Learn more about multi-commits in this guide.
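
For context, a call along the following lines would produce a PR like this one. It is a hedged sketch, not the command the author actually ran: the `multi_commits` and `multi_commits_verbose` parameters are written as they existed (marked experimental) in huggingface_hub releases from around the time of this PR, and they may be absent from newer releases, so check the documentation for your installed version. The function name and its arguments are hypothetical.

```python
def upload_in_multiple_commits(folder_path: str, repo_id: str) -> None:
    """Upload a local folder to a model repo, split across several commits.

    Hypothetical helper for illustration only.
    """
    # Imported lazily so that defining this sketch does not require the package.
    from huggingface_hub import HfApi

    api = HfApi()  # uses the token from a previous login or the HF_TOKEN variable
    api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="model",
        # Assumption: experimental flags from early-2024 releases of huggingface_hub.
        multi_commits=True,          # push the files as a series of commits on a PR
        multi_commits_verbose=True,  # log progress for each scheduled commit
        create_pr=False,             # as in the comment above: the PR is merged at the end
    )
```

With `create_pr=False`, the intermediate PR is only a staging area, which matches the "automatically merged" message in the comment above.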

jburtoft changed pull request status to merged Apr 1, 2024

jburtoft changed pull request title from [WIP] Upload folder using huggingface_hub (multi-commit 316934fa09af93adc0517f5ecf0eccdb2930b8e62bcfce4ceb35e7f1a75d88d9) to Upload folder using huggingface_hub Apr 1, 2024

2859057289238237110c14e24f6da0872277621881b8d6db4c95775174e5ebfb e61ac3c1
