PermissionError: [Errno 13] Permission denied: '/gscratch'
Thank you very much for the valuable information. However, may I ask a question?
I encountered the following error when loading the model with vLLM:
----> 1 model = LLM("selfrag/selfrag_llama2_7b", download_dir="/gscratch/h2lab/akari/model_cache", dtype="half")
File ~/.conda/envs/vllm/lib/python3.9/site-packages/vllm/entrypoints/llm.py:123, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, **kwargs)
102 kwargs["disable_log_stats"] = True
103 engine_args = EngineArgs(
104 model=model,
105 tokenizer=tokenizer,
(...)
121 **kwargs,
122 )
--> 123 self.llm_engine = LLMEngine.from_engine_args(
124 engine_args, usage_context=UsageContext.LLM_CLASS)
125 self.request_counter = Counter()
File ~/.conda/envs/vllm/lib/python3.9/site-packages/vllm/engine/llm_engine.py:292, in LLMEngine.from_engine_args(cls, engine_args, usage_context)
289 executor_class = GPUExecutor
291 # Create the LLM engine.
--> 292 engine = cls(
293 **engine_config.to_dict(),
294 executor_class=executor_class,
295 log_stats=not engine_args.disable_log_stats,
296 usage_context=usage_context,
...
227 # Cannot rely on checking for EEXIST, since the operating system
228 # could give priority to other errors like EACCES or EROFS
229 if not exist_ok or not path.isdir(name):
PermissionError: [Errno 13] Permission denied: '/gscratch'
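
A note on what the traceback seems to show: the last frames (lines 227-229) come from Python's `os.makedirs`, so the error is probably raised while creating the `download_dir` passed to `LLM(...)`. The path `/gscratch/h2lab/akari/model_cache` looks specific to another machine, and the current user apparently cannot create `/gscratch`. Below is a minimal sketch of one possible workaround, assuming the fix is simply to point `download_dir` at a directory you can write to. The helper `pick_writable_dir` and the fallback paths are hypothetical, not part of vLLM or the Self-RAG code.

```python
import os
import tempfile
from pathlib import Path


def pick_writable_dir(candidates):
    """Return the first candidate directory that can be created (or already exists) and is writable.

    Hypothetical helper for illustration; not part of vLLM.
    """
    for candidate in candidates:
        path = Path(os.path.expanduser(candidate))
        try:
            # Same kind of call that fails in the traceback (os.makedirs under the hood).
            path.mkdir(parents=True, exist_ok=True)
        except OSError:
            # Covers PermissionError as well as other failures such as a read-only file system.
            continue
        if os.access(path, os.W_OK):
            return str(path)
    raise RuntimeError("no writable cache directory among the candidates")


# Assumed candidate list; replace the fallbacks with paths that suit your machine.
download_dir = pick_writable_dir([
    "/gscratch/h2lab/akari/model_cache",  # path from the failing call
    "~/model_cache",                      # hypothetical fallback in the home directory
    os.path.join(tempfile.gettempdir(), "model_cache"),  # hypothetical last resort
])
print(download_dir)

# The failing call could then be retried with the chosen directory:
# from vllm import LLM
# model = LLM("selfrag/selfrag_llama2_7b", download_dir=download_dir, dtype="half")
```

If the first candidate cannot be created, the helper silently moves on to the next one, so the printed path shows which directory will actually hold the downloaded weights.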