Instructions to use ping98k/open_llama_3b_v2_4bit with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use ping98k/open_llama_3b_v2_4bit with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="ping98k/open_llama_3b_v2_4bit")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("ping98k/open_llama_3b_v2_4bit")
    model = AutoModelForCausalLM.from_pretrained("ping98k/open_llama_3b_v2_4bit")

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ping98k/open_llama_3b_v2_4bit with vLLM:
Install from pip and serve model

    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "ping98k/open_llama_3b_v2_4bit"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "ping98k/open_llama_3b_v2_4bit",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'

Use Docker

    docker model run hf.co/ping98k/open_llama_3b_v2_4bit

- SGLang
How to use ping98k/open_llama_3b_v2_4bit with SGLang:
Install from pip and serve model

    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "ping98k/open_llama_3b_v2_4bit" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "ping98k/open_llama_3b_v2_4bit",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'

Use Docker images

    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "ping98k/open_llama_3b_v2_4bit" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "ping98k/open_llama_3b_v2_4bit",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'

- Docker Model Runner
How to use ping98k/open_llama_3b_v2_4bit with Docker Model Runner:

    docker model run hf.co/ping98k/open_llama_3b_v2_4bit

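
The curl calls above target an OpenAI-compatible `/v1/completions` endpoint, so the same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and the response shape (`choices[0].text`) is assumed to follow the OpenAI completions format.

```python
import json
import urllib.request


def build_completion_request(base_url: str, prompt: str,
                             max_tokens: int = 512,
                             temperature: float = 0.5) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint (hypothetical helper)."""
    payload = {
        "model": "ping98k/open_llama_3b_v2_4bit",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, prompt: str) -> str:
    """Send the request and return the text of the first completion.

    Assumes the server replies in the OpenAI completions format.
    """
    request = build_completion_request(base_url, prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("http://localhost:8000", "Once upon a time,")` would address the vLLM port used above, and `http://localhost:30000` the SGLang one.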
---
library_name: transformers
base_model: openlm-research/open_llama_3b_v2
tags:
- 4bit
- bnb
- nf4
- qlora
- llama
quantized_by: ping98k
---

# Open LLaMA 3B v2: 4-bit NF4

4-bit quantized version of [openlm-research/open_llama_3b_v2](https://huggingface.co/openlm-research/open_llama_3b_v2) using bitsandbytes NF4, ready for QLoRA fine-tuning.

## Quantization Details

| Parameter | Value |
|---|---|
| Quant method | bitsandbytes NF4 |
| Double quant | Yes |
| Compute dtype | bfloat16 |
| Model size | ~1.93 GB |

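
The model size in the table can be sanity-checked with a back-of-the-envelope estimate. The sketch below is an approximation under stated assumptions, not a measurement. The architecture values (hidden size 3200, MLP size 8640, 26 layers, vocabulary 32000) are assumed for the base model and are not given in this card. It also assumes that only the attention and MLP projections are stored in 4 bits, that the embeddings and `lm_head` stay in 16 bits, and that double quantization uses 64-weight blocks with an 8-bit scale each plus one fp32 constant per 256 blocks. Normalization weights are ignored as negligible.

```python
# Assumed architecture values for open_llama_3b_v2 (not stated in this card).
HIDDEN_SIZE = 3200
INTERMEDIATE_SIZE = 8640
NUM_LAYERS = 26
VOCAB_SIZE = 32000

# Assumed double-quantization layout: one 8-bit scale per 64 weights,
# plus one fp32 constant per 256 such blocks.
BLOCK_SIZE = 64
SECOND_LEVEL_BLOCK = 256


def estimate_footprint_bytes() -> float:
    """Rough size estimate: 4-bit projections + scales + 16-bit embeddings."""
    attention = 4 * HIDDEN_SIZE * HIDDEN_SIZE        # q, k, v, o projections
    mlp = 3 * HIDDEN_SIZE * INTERMEDIATE_SIZE        # gate, up, down projections
    quantized_weights = NUM_LAYERS * (attention + mlp)
    unquantized_weights = 2 * VOCAB_SIZE * HIDDEN_SIZE  # embeddings + lm_head

    packed = quantized_weights * 0.5                 # 4 bits = half a byte per weight
    scales = quantized_weights * (
        1 / BLOCK_SIZE + 4 / (BLOCK_SIZE * SECOND_LEVEL_BLOCK)
    )
    sixteen_bit = unquantized_weights * 2            # 2 bytes per 16-bit weight
    return packed + scales + sixteen_bit


footprint = estimate_footprint_bytes()
print(f"{footprint / 1e9:.2f} GB, {footprint / 2**30:.2f} GiB")  # 2.07 GB, 1.93 GiB
```

Under these assumptions the estimate lands near the stated ~1.93 figure if that figure is read in GiB (2^30 bytes) rather than decimal gigabytes.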
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading the pre-quantized weights requires bitsandbytes; device_map="auto" requires accelerate.
model = AutoModelForCausalLM.from_pretrained("ping98k/open_llama_3b_v2_4bit", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ping98k/open_llama_3b_v2_4bit")
```
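
The snippet above only loads the model and tokenizer. As a minimal sketch of what a generation call might look like, the hypothetical helper below wraps loading and greedy decoding in one function. It assumes `torch`, `transformers`, `accelerate`, and `bitsandbytes` are installed and that a supported GPU is available; greedy decoding and the 64-token limit are illustrative choices, not settings recommended by this card.

```python
def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Load the 4-bit checkpoint and return the prompt plus a greedy continuation.

    Hypothetical helper for illustration; reloading the model on every call
    is wasteful and only done here to keep the example self-contained.
    """
    # Heavy imports are local so this module can be imported without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = "ping98k/open_llama_3b_v2_4bit"
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")

    # Move the tokenized prompt to the device that holds the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding (assumption, chosen for reproducibility)
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Once upon a time,")` would download the checkpoint on first use and return the decoded text.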
### QLoRA Fine-tuning

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# `model` is the 4-bit model loaded in the Usage section above.
# Gradient checkpointing trades extra compute for lower activation memory.
model.gradient_checkpointing_enable()
# Freeze the base weights and apply PEFT's k-bit training preparation.
model = prepare_model_for_kbit_training(model)

# Attach LoRA adapters to every attention and MLP projection.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
```
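
To get a feel for how many parameters this configuration trains, note that a rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies that formula to the seven target modules. The layer shapes (hidden size 3200, MLP size 8640, 26 layers, key/value projections as wide as the query projection) are assumptions about the base model and are not stated in this card.

```python
# Assumed shapes for open_llama_3b_v2 (not stated in this card).
HIDDEN_SIZE = 3200
INTERMEDIATE_SIZE = 8640
NUM_LAYERS = 26

# (in_features, out_features) for each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "k_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "v_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "o_proj": (HIDDEN_SIZE, HIDDEN_SIZE),
    "gate_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "up_proj": (HIDDEN_SIZE, INTERMEDIATE_SIZE),
    "down_proj": (INTERMEDIATE_SIZE, HIDDEN_SIZE),
}


def lora_trainable_params(rank: int) -> int:
    """Count LoRA weights: each adapter adds rank * (d_in + d_out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in TARGET_SHAPES.values())
    return NUM_LAYERS * per_layer


print(f"{lora_trainable_params(16):,}")  # 25,425,920
```

Under these assumptions, `r=16` trains roughly 25 million adapter weights, well under 1% of the base model's parameters. `model.print_trainable_parameters()` from PEFT reports the actual count on the loaded model.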