Libraries

How to use chargoddard/llama-2-26b-trenchcoat-stack with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="chargoddard/llama-2-26b-trenchcoat-stack")

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("chargoddard/llama-2-26b-trenchcoat-stack")
model = AutoModelForCausalLM.from_pretrained("chargoddard/llama-2-26b-trenchcoat-stack")
```

Local Apps

vLLM

How to use chargoddard/llama-2-26b-trenchcoat-stack with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "chargoddard/llama-2-26b-trenchcoat-stack"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "chargoddard/llama-2-26b-trenchcoat-stack",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```
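The same completion request can be issued from Python instead of curl. The following is a minimal sketch using only the standard library; the helper names `build_completion_request` and `complete` are hypothetical, and it assumes a server is already listening at the given base URL and returns the usual OpenAI-style `choices[0].text` field.

```python
import json
import urllib.request

MODEL_ID = "chargoddard/llama-2-26b-trenchcoat-stack"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the text of the first completion.

    Assumes the response follows the OpenAI completions format.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("Once upon a time,")` sends the same payload as the curl command. Passing `base_url="http://localhost:30000"` points the helper at a server on a different port.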

Use Docker

```bash
docker model run hf.co/chargoddard/llama-2-26b-trenchcoat-stack
```

SGLang

How to use chargoddard/llama-2-26b-trenchcoat-stack with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "chargoddard/llama-2-26b-trenchcoat-stack" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "chargoddard/llama-2-26b-trenchcoat-stack",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "chargoddard/llama-2-26b-trenchcoat-stack" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "chargoddard/llama-2-26b-trenchcoat-stack",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
```

Docker Model Runner
How to use chargoddard/llama-2-26b-trenchcoat-stack with Docker Model Runner:
```
docker model run hf.co/chargoddard/llama-2-26b-trenchcoat-stack
```

Llama 2 13b is a pretty decent language model. You know what's probably better? Two Llama 2 13b models. In a trenchcoat.

Produced by bakllama.py with this config file:

```yaml
layer_slices:
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40
```
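To make the config concrete, here is a rough sketch of the layer plan it describes. This is not bakllama.py's actual code: the `LayerSlice` type and `stack_layers` function are hypothetical, and `end` is assumed to be exclusive, so each slice contributes all 40 decoder layers of Llama 2 13b.

```python
from typing import List, NamedTuple, Tuple


class LayerSlice(NamedTuple):
    model: str
    start: int  # first source layer, inclusive
    end: int    # last source layer, assumed exclusive


def stack_layers(slices: List[LayerSlice]) -> List[Tuple[str, int]]:
    """List (source model, source layer index) for each layer of the stacked model."""
    plan = []
    for layer_slice in slices:
        for index in range(layer_slice.start, layer_slice.end):
            plan.append((layer_slice.model, index))
    return plan


slices = [
    LayerSlice("TheBloke/Llama-2-13B-fp16", 0, 40),
    LayerSlice("TheBloke/Llama-2-13B-fp16", 0, 40),
]
plan = stack_layers(slices)

print(len(plan))  # 80 layers in total
print(plan[39])   # last layer of the first copy: source layer 39
print(plan[40])   # first layer of the second copy: source layer 0
```

Under that assumption, layer 40 of the stacked model is a copy of source layer 0, so the second copy takes the first copy's final hidden states as its input.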

No fine-tuning was done on this model. Yes, it's still coherent somehow.

Benchmark results:

| Benchmark | Llama2-13b | Llama2-26b-tcs | Percent Change |
|---|---|---|---|
| ARC | 59.3 | 55.03 | -7.2% |
| HellaSwag | 82.15 | 79.9 | -2.74% |
| MMLU | 55.67 | 53.73 | -3.48% |
| TruthfulQA | 37.39 | 40.48 | +8.26% |
| Average | 58.63 | 57.29 | -2.29% |
| Average Minus TQA | 65.70 | 62.85 | -4.34% |
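The percent-change column is the relative difference between the stacked model's score and the base model's score. A small sketch of that arithmetic, using the four per-benchmark scores from the table (the `percent_change` helper is a hypothetical name):

```python
def percent_change(baseline: float, stacked: float) -> float:
    """Relative change from the baseline score to the stacked score, in percent."""
    return (stacked - baseline) / baseline * 100.0


# (Llama2-13b score, Llama2-26b-tcs score), copied from the table.
scores = {
    "ARC": (59.3, 55.03),
    "HellaSwag": (82.15, 79.9),
    "MMLU": (55.67, 53.73),
    "TruthfulQA": (37.39, 40.48),
}

for name, (baseline, stacked) in scores.items():
    print(f"{name}: {percent_change(baseline, stacked):+.2f}%")

# Unweighted mean over the four benchmarks for each model.
baseline_avg = sum(b for b, _ in scores.values()) / len(scores)
stacked_avg = sum(s for _, s in scores.values()) / len(scores)
print(f"Average: {baseline_avg:.2f} -> {stacked_avg:.2f} "
      f"({percent_change(baseline_avg, stacked_avg):+.2f}%)")
```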

This tells us two very important things:

1. TruthfulQA is a perfect benchmark in every way.
2. Llama models are amazingly robust to being fed their own output.

Model size: 26B params (Safetensors)
Tensor type: F16
