Hyde LLaMa 2 7B Legal
Model Details
Backbone Model: meta-llama/Llama-2-7b-chat
Input: Models input text only.
Output: Models generate text only.
Inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def hyde_gen(
    topic: str,
    model: object,
    tokenizer: object,
    device: object
):
    prompt = (
        f"Write legal facts about the following topic:\n{topic}\n"
    )
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        return_token_type_ids=False
    ).to(device)
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        early_stopping=True,
        do_sample=True,
        top_k=10,
        top_p=0.98,
        no_repeat_ngram_size=3,
        eos_token_id=2,
        repetition_penalty=1.1,
        num_beams=3,
    )
    # Return only the newly generated tokens. Slicing the decoded string by
    # len(prompt) is unreliable because the decoded text also contains
    # special tokens such as the BOS marker.
    prompt_len = inputs["input_ids"].shape[1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)


def hyde_infer(input_topic):
    device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
    model_pth = "akkshay/hyde-llama-7b"
    model = AutoModelForCausalLM.from_pretrained(
        model_pth,
        # Place the whole model on the selected device (GPU 0 or CPU).
        device_map={"": device},
        # FP16 on GPU; fall back to FP32 on CPU.
        torch_dtype=torch.float16 if device.type == "cuda" else torch.float32,
        low_cpu_mem_usage=True
    )
    tokenizer = AutoTokenizer.from_pretrained(model_pth, use_fast=True)
    model.eval()
    model.config.use_cache = True
    tokenizer.pad_token = tokenizer.eos_token
    output = hyde_gen(
        topic=input_topic,
        model=model,
        tokenizer=tokenizer,
        device=device
    )
    return output


if __name__ == "__main__":
    fact = hyde_infer("VW emissions scandal")
    print(fact)
Hyde Llama 2 uses the fast tokenizer provided by the Hugging Face tokenizers library, not the sentencepiece package, so pass use_fast=True when initializing the tokenizer.
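As a minimal sketch of the tokenizer note above, the helper below passes use_fast=True and then checks the is_fast attribute of the returned tokenizer. The helper name load_fast_tokenizer and its loader parameter are hypothetical, introduced only so the check can be tried without downloading the model; by default it falls back to AutoTokenizer.from_pretrained.

```python
from typing import Any, Callable, Optional

MODEL_ID = "akkshay/hyde-llama-7b"


def load_fast_tokenizer(
    model_id: str = MODEL_ID,
    loader: Optional[Callable[..., Any]] = None,
) -> Any:
    """Load a tokenizer with use_fast=True and verify a fast one came back.

    `loader` is a hypothetical injection point (not part of transformers);
    when omitted, AutoTokenizer.from_pretrained is used.
    """
    if loader is None:
        # Imported lazily so the helper can be exercised with a stub loader.
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained
    tokenizer = loader(model_id, use_fast=True)
    # Fast (Rust-backed) tokenizers report is_fast == True.
    if not getattr(tokenizer, "is_fast", False):
        raise RuntimeError(f"{model_id} did not load as a fast tokenizer")
    return tokenizer
```

Calling load_fast_tokenizer() with no arguments downloads the tokenizer from the Hub, so it needs network access and the transformers package.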
Lastly, Apple M1/M2 chips do not support BF16 computation, so run the model on CPU instead.
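One way to act on the hardware note above is to pick the device and dtype from two capability flags. The sketch below is an assumption, not the author's method: select_device_and_dtype is a hypothetical helper, and the FP32 fallback on CPU is a conservative choice rather than something the model card specifies.

```python
from typing import Tuple


def select_device_and_dtype(cuda_available: bool, bf16_supported: bool) -> Tuple[str, str]:
    """Return (device, dtype name) for loading the model.

    Hypothetical helper: uses the GPU when CUDA is available (BF16 if the
    GPU supports it, otherwise FP16) and falls back to CPU with FP32.
    """
    if cuda_available:
        return "cuda:0", "bfloat16" if bf16_supported else "float16"
    # No CUDA device (for example an Apple M1/M2 machine): stay on CPU.
    return "cpu", "float32"


# Usage with PyTorch (assumes torch is installed):
#   import torch
#   cuda = torch.cuda.is_available()
#   device, dtype_name = select_device_and_dtype(
#       cuda, cuda and torch.cuda.is_bf16_supported()
#   )
#   torch_dtype = getattr(torch, dtype_name)
```

The returned dtype name can be turned into a torch dtype with getattr(torch, dtype_name) and passed as torch_dtype when loading the model.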