Instructions for using SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit")
model = AutoModelForCausalLM.from_pretrained("SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
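Because this repo ships GPTQ-quantized weights, loading it in Transformers typically requires a GPTQ backend (for example the `optimum` and `auto-gptq` packages) in addition to `torch` and `accelerate`; that is an assumption about your environment, not something the snippets above install for you. A minimal sketch under those assumptions, loading the model onto the GPU and generating with explicit sampling parameters:

```python
# Minimal sketch: GPU loading for a GPTQ checkpoint.
# ASSUMES `transformers`, `torch`, `accelerate`, and a GPTQ backend
# (e.g. `pip install optimum auto-gptq`) are installed.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" places the already-4-bit weights on the available GPU;
# the quantization config is read from the repo itself.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "Summarize GPTQ in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Explicit sampling parameters; tune these for your use case.
outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```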
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with vLLM:
Install from pip and serve the model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker
```sh
docker model run hf.co/SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit
```
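Since the vLLM server exposes an OpenAI-compatible API, you can also call it from Python with the `openai` client instead of curl. A minimal sketch, assuming the `vllm serve` command above is running on localhost:8000 (the API key is a placeholder; vLLM accepts any key unless one is configured):

```python
# Minimal sketch: query the vLLM server via its OpenAI-compatible API.
# ASSUMES `pip install openai` and the server started above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # placeholder; no key is required by default
)

response = client.chat.completions.create(
    model="SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```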
- SGLang
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with SGLang:
Install from pip and serve the model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
  --model-path "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
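The same chat-completions endpoint can be called from Python with plain `requests`, which is handy for scripting against the SGLang server. A minimal sketch, assuming the server above is listening on port 30000:

```python
# Minimal sketch: POST to the SGLang server's OpenAI-compatible endpoint.
# ASSUMES `pip install requests` and the launch_server command above is running.
import requests

payload = {
    "model": "SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```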
- Docker Model Runner
How to use SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit with Docker Model Runner:
```sh
docker model run hf.co/SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit
```
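`docker model run` drops you into an interactive chat. Docker Model Runner can also serve an OpenAI-compatible API, so the model can be queried programmatically; the host URL below (port 12434 with the `/engines/v1` path) is an assumption based on Docker Desktop's documented defaults and may need adjusting for your installation:

```python
# Minimal sketch: call Docker Model Runner's OpenAI-compatible API.
# ASSUMPTION: host-side TCP access is enabled and uses Docker Desktop's
# default port 12434 with the /engines/v1 path; verify against your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:12434/engines/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="hf.co/SrikanthChellappa/Meta-Llama-3-8B-Instruct-GPTQ-4Bit",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```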
GPTQ 4-bit Quantized Llama-3 8B Instruct Model
Model Version: 1.0
Model Creator: CollAIborator (https://www.collaiborate.com)
Model Overview: This repo contains 4-bit GPTQ-quantized model files derived from meta-llama/Meta-Llama-3-8B-Instruct. The quantized model is optimized to run on lower-spec GPUs: it trades a small quality degradation relative to the original model for a much smaller memory footprint and improved latency and throughput, making Llama-3 usable on smaller GPUs.
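To make the "lower-spec GPUs" claim concrete: a rough estimate of weight memory is parameter count times bytes per parameter, so 4-bit quantization cuts the roughly 15 GB of fp16 weights down to around 4 GB (quantization metadata, activations, and the KV cache add overhead on top). A sketch of that arithmetic, using an approximate parameter count for illustration:

```python
# Back-of-the-envelope weight-memory estimate for Llama-3 8B.
# Rough figures for illustration, not measured numbers: real usage also
# includes quantization metadata, activations, and the KV cache.
params = 8.03e9  # approximate parameter count of Llama-3 8B

fp16_gb = params * 2 / 1024**3     # 2 bytes per weight in fp16
gptq4_gb = params * 0.5 / 1024**3  # 0.5 bytes per weight at 4 bits

print(f"fp16 weights : ~{fp16_gb:.1f} GB")   # ~15.0 GB
print(f"GPTQ 4-bit   : ~{gptq4_gb:.1f} GB")  # ~3.7 GB
```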
Intended Use: The GPTQ 4-bit Quantized Llama-3 8B Instruct Model is intended to be used for tasks involving instructional text comprehension, such as question answering, summarization, and instructional text generation. It can be deployed in applications where understanding and generating instructional content is crucial, including educational platforms, virtual assistants, and content recommendation systems.
Limitations and Considerations: While the GPTQ 4-bit Quantized Llama-3 8B Instruct Model demonstrates strong performance in tasks related to instructional text comprehension, it may not perform optimally in domains or tasks outside its training data distribution. Users should evaluate the model's performance on specific tasks and datasets before deploying it in production environments.
Ethical Considerations: As with any language model, the GPTQ 4-bit Quantized Llama-3 8B Instruct Model can potentially generate biased or inappropriate content based on the input it receives. Users are encouraged to monitor and evaluate the model's outputs to ensure they align with ethical guidelines and do not propagate harmful stereotypes or misinformation.
Disclaimer: The GPTQ 4-bit Quantized Llama-3 8B Instruct Model is provided by CollAIborator and is offered as-is, without any warranty or guarantee of performance. Users are solely responsible for the use and outcomes of the model in their applications.
Developed by: CollAIborator team
Model type: Text Generation
Language(s) (NLP): en
License: llama3
Finetuned from model: meta-llama/Meta-Llama-3-8B-Instruct