How to use from SGLang
Use Docker images
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Sakuna/LLaMaCoderAll" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Sakuna/LLaMaCoderAll",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'
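The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server from the Docker command is reachable at localhost:30000 and that the response follows the OpenAI-style completions schema (`choices[0].text`). The helper names `build_payload` and `complete` are illustrative and not part of SGLang.

```python
import json
import urllib.request

# Assumption: the SGLang server started above is listening locally on port 30000.
BASE_URL = "http://localhost:30000"


def build_payload(prompt, max_tokens=512, temperature=0.5):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": "Sakuna/LLaMaCoderAll",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def complete(prompt, base_url=BASE_URL, timeout=60):
    """POST the payload to /v1/completions and return the first completion's text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: OpenAI-style response shape with a "choices" list.
    return body["choices"][0]["text"]
```

With the server running, `print(complete("Once upon a time,"))` should print the generated continuation.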
LLaMaCoder
Model Description
LLaMaCoder is based on the LLaMa2 7B language model, fine-tuned using LoRA adapters.
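To give a sense of why LoRA is used for fine-tuning a 7B model, the sketch below compares the number of trainable parameters for one square projection matrix with and without a low-rank adapter. The hidden size of 4096 is the published LLaMa2 7B value; the rank of 8 is a hypothetical choice for illustration only, since this card does not state the rank or the target modules that were actually used.

```python
def full_param_count(d_out, d_in):
    """Parameters in a dense weight matrix W of shape (d_out, d_in)."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Parameters in a LoRA update B @ A, with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


hidden = 4096  # LLaMa2 7B hidden size
rank = 8       # hypothetical rank, not taken from this model card

print(full_param_count(hidden, hidden))        # 16777216
print(lora_param_count(hidden, hidden, rank))  # 65536
```

Under these assumptions the adapter trains 256 times fewer parameters than the matrix it modifies.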
Usage
Generate code with LLaMaCoder loaded in 4-bit precision using the following Python snippet:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "Sakuna/LLaMaCoderAll"
device = "cuda:0"

# 4-bit NF4 quantization with float16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

# Place the model on the GPU at load time: some transformers versions
# reject model.to(device) on bitsandbytes-quantized models.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map={"": device},
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
model.eval()

prompt = "Write a Java program to calculate the factorial of a given number k"
text = f"{prompt}\n### Solution:\n"
inputs = tokenizer(text, return_tensors="pt").to(device)
# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_length=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
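The decoded output of a causal language model includes the prompt as well as the generated text. A small sketch of how the prompt template above could be built and the generated code separated from it is shown below; `build_prompt` and `extract_solution` are hypothetical helper names, and the sketch assumes the model does not repeat the `### Solution:` marker before its answer.

```python
SOLUTION_MARKER = "### Solution:"


def build_prompt(task):
    """Format a task description using the template from the snippet above."""
    return f"{task}\n{SOLUTION_MARKER}\n"


def extract_solution(decoded):
    """Return the text after the first marker, or the whole string if the marker is absent."""
    _, sep, tail = decoded.partition(SOLUTION_MARKER)
    return tail.strip() if sep else decoded.strip()
```

For example, `extract_solution(tokenizer.decode(outputs[0], skip_special_tokens=True))` would return only the part of the output that follows the marker.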
Install from pip and serve model
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Sakuna/LLaMaCoderAll" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Sakuna/LLaMaCoderAll",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'