How to use from vLLM
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Sakuna/LLaMaCoderAll"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Sakuna/LLaMaCoderAll",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
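The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Assumed address of the vLLM server started above.
BASE_URL = "http://localhost:8000/v1/completions"
MODEL_NAME = "Sakuna/LLaMaCoderAll"


def build_request(prompt, max_tokens=512, temperature=0.5):
    """Build a POST request that mirrors the curl call (hypothetical helper)."""
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion choice."""
    with urllib.request.urlopen(build_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` should return the generated continuation as a string.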
Use Docker
docker model run hf.co/Sakuna/LLaMaCoderAll

LLaMaCoder

Model Description

LLaMaCoder is based on the Llama 2 7B language model, fine-tuned using LoRA adapters.
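For readers unfamiliar with LoRA: the base weights stay frozen and each adapted layer learns a low-rank update, so the effective weight is W + (alpha / r) * B A. The sketch below illustrates the idea on a tiny matrix in plain Python. The matrices, rank, and scaling are made-up values for illustration, since this model card does not state the adapter configuration used for LLaMaCoder.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha, r):
    """Return W + (alpha / r) * (B @ A), the weight a LoRA-adapted layer applies."""
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_count(d_out, d_in, r):
    """Number of trainable values in B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)


# Hypothetical frozen 2x2 base weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # shape 2x1
A = [[0.5, -0.5]]    # shape 1x2
merged = lora_effective_weight(W, B, A, alpha=2, r=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

As a sense of scale, a rank-8 adapter (an assumed rank) on a single 4096 x 4096 projection trains `lora_param_count(4096, 4096, 8)` = 65,536 values, compared with 16,777,216 in the full matrix.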

Usage

Generate code with LLaMaCoder in 4-bit precision using the following Python snippet:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
import torch

MODEL_NAME = "Sakuna/LLaMaCoderAll"
device = "cuda:0"


bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

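model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=bnb_config,
    device_map={"": device},  # place the quantized weights directly on the GPU
    trust_remote_code=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# bitsandbytes-quantized models should not be moved with .to(); device_map handles placement.
model.eval()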

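prompt = "Write a Java program to calculate the factorial of a given number k"
text = f"{prompt}\n### Solution:\n"  # avoid shadowing the built-in input()

inputs = tokenizer(text, return_tensors="pt").to(device)
# do_sample=True is needed for temperature to take effect;
# max_length counts the prompt tokens as well as the generated ones.
outputs = model.generate(**inputs, max_length=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))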
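
Because `generate` returns the prompt tokens followed by the new tokens, the decoded string printed above starts with the prompt itself. A small sketch of how the prompt template could be wrapped and the generated part separated out is shown here. The helper names `build_prompt` and `extract_solution` are hypothetical, and the sketch assumes the model does not repeat the `### Solution:` marker in its answer.

```python
SOLUTION_MARKER = "### Solution:"


def build_prompt(task):
    """Wrap a task description in the prompt template used in the snippet above."""
    return f"{task}\n{SOLUTION_MARKER}\n"


def extract_solution(decoded):
    """Return the text after the first solution marker, or the whole text if it is absent."""
    _, marker, rest = decoded.partition(SOLUTION_MARKER)
    return rest.strip() if marker else decoded.strip()
```

For example, `extract_solution(tokenizer.decode(outputs[0], skip_special_tokens=True))` would return only the generated program.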
