Instructions for using SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit")
model = AutoModelForCausalLM.from_pretrained("SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
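Because this repo ships GPTQ-quantized weights, loading it in Transformers typically requires a GPTQ backend (for example the `optimum` and `auto-gptq` packages) in addition to `torch` and `accelerate`; that is an assumption about your environment, not something the snippets above install for you. A minimal sketch under those assumptions, loading the model onto the GPU and generating with explicit sampling parameters:

```python
# Minimal sketch: GPU loading for a GPTQ checkpoint.
# ASSUMES `transformers`, `torch`, `accelerate`, and a GPTQ backend
# (e.g. `pip install optimum auto-gptq`) are installed.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the already-4-bit weights on the available GPU;
# the quantization config is read from the repo itself.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize GPTQ in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Explicit sampling parameters; tune these for your use case.
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```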
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit
```
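Since the vLLM server exposes an OpenAI-compatible API, you can also call it from Python with the `openai` client instead of curl. A minimal sketch, assuming the `vllm serve` command above is running on localhost:8000 (the API key is a placeholder; vLLM accepts any key unless one is configured):

```python
# Minimal sketch: query the vLLM server via its OpenAI-compatible API.
# ASSUMES `pip install openai` and the server started above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # placeholder; no key is required by default
)

response = client.chat.completions.create(
    model="SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```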
- SGLang
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
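The same chat-completions endpoint can be called from Python with plain `requests`, which is handy for scripting against the SGLang server. A minimal sketch, assuming the server above is listening on port 30000:

```python
# Minimal sketch: POST to the SGLang server's OpenAI-compatible endpoint.
# ASSUMES `pip install requests` and the launch_server command above is running.
import requests

payload = {
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```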
- Docker Model Runner
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with Docker Model Runner:
```sh
docker model run hf.co/SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit
```
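`docker model run` drops you into an interactive chat. Docker Model Runner can also serve an OpenAI-compatible API, so the model can be queried programmatically; the host URL below (port 12434 with the `/engines/v1` path) is an assumption based on Docker Desktop's documented defaults and may need adjusting for your installation:

```python
# Minimal sketch: call Docker Model Runner's OpenAI-compatible API.
# ASSUMPTION: host-side TCP access is enabled and uses Docker Desktop's
# default port 12434 with the /engines/v1 path; verify against your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="hf.co/SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```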
GPTQ 4-bit Quantized Llama-3 8B Instruct Model
Model Version: 1.0
Model Creator: CollAIborator (https://www.collaiborate.com)
Model Overview: This repo contains 4-bit GPTQ-quantized model files derived from meta-llama/Meta-Llama-3-8B-Instruct. The quantized model is optimized to run on lower-spec GPUs: it trades a small quality degradation relative to the original model for a much smaller memory footprint and improved latency and throughput, making Llama-3 usable on smaller GPUs.
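To make the "lower-spec GPUs" claim concrete: a rough estimate of weight memory is parameter count times bytes per parameter, so 4-bit quantization cuts the roughly 15 GB of fp16 weights down to around 4 GB (quantization metadata, activations, and the KV cache add overhead on top). A sketch of that arithmetic, using an approximate parameter count for illustration:

```python
# Back-of-the-envelope weight-memory estimate for Llama-3 8B.
# Rough figures for illustration, not measured numbers: real usage also
# includes quantization metadata, activations, and the KV cache.
params = 8.03e9  # approximate parameter count of Llama-3 8B

fp16_gb = params * 2 / 1024**3     # 2 bytes per weight in fp16
gptq4_gb = params * 0.5 / 1024**3  # 0.5 bytes per weight at 4 bits

print(f"fp16 weights : ~{fp16_gb:.1f} GB")   # ~15.0 GB
print(f"GPTQ 4-bit   : ~{gptq4_gb:.1f} GB")  # ~3.7 GB
```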
Intended Use: The GPTQ 4-bit Quantized Llama-3 8B Instruct Model is intended to be used for tasks involving instructional text comprehension, such as question answering, summarization, and instructional text generation. It can be deployed in applications where understanding and generating instructional content is crucial, including educational platforms, virtual assistants, and content recommendation systems.
Limitations and Considerations: While the GPTQ 4-bit Quantized Llama-3 8B Instruct Model demonstrates strong performance in tasks related to instructional text comprehension, it may not perform optimally in domains or tasks outside its training data distribution. Users should evaluate the model's performance on specific tasks and datasets before deploying it in production environments.
Ethical Considerations: As with any language model, the GPTQ 4-bit Quantized Llama-3 8B Instruct Model can potentially generate biased or inappropriate content based on the input it receives. Users are encouraged to monitor and evaluate the model's outputs to ensure they align with ethical guidelines and do not propagate harmful stereotypes or misinformation.
Disclaimer: The GPTQ 4-bit Quantized Llama-3 8B Instruct Model is provided by CollAIborator and is offered as-is, without any warranty or guarantee of performance. Users are solely responsible for the use and outcomes of the model in their applications.
Developed by: CollAIborator team
Model type: Text Generation
Language(s) (NLP): en
License: llama3
Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct