LCKV

This is a pretrained model for research purposes, described in the paper "Layer-Condensed KV Cache for Efficient Inference of Large Language Models".

About

Layer-Condensed KV Cache (LCKV) is a variant of transformer decoders in which the queries of all layers are paired with the keys and values of just the top layer. It reduces memory and computation cost, reduces the number of parameters, and significantly improves inference throughput while achieving comparable or better task performance. See our GitHub repo for more details: https://github.com/whyNLP/LCKV
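As a rough illustration of the idea above, the following toy sketch runs one decoding step in which every layer's query reads the same shared list of keys and values. It is not the repository's implementation: the helper names `attend` and `decode_step` are hypothetical, projections are replaced by the identity, and warmup layers as well as the way the top layer produces new keys and values are left out.

```python
import math


def attend(query, keys, values):
    """Single-head softmax attention over cached keys/values (plain lists)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]


def decode_step(hidden, shared_keys, shared_values, num_layers):
    """Pass one token through `num_layers` toy layers that all read one shared KV cache."""
    for _ in range(num_layers):
        # Assumption: identity query projection, so the hidden state is the query.
        context = attend(hidden, shared_keys, shared_values)
        # Residual connection; a real layer would also apply an MLP and normalization.
        hidden = [h + c for h, c in zip(hidden, context)]
    return hidden


# One previously cached token; in this toy its key/value stand in for top-layer states.
shared_keys = [[0.0, 0.0]]
shared_values = [[1.0, 1.0]]
print(decode_step([1.0, 0.0], shared_keys, shared_values, num_layers=3))
```

A standard decoder would instead keep a separate key/value list for each layer, so its cache grows with the number of layers as well as with the sequence length.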

Quick Start

# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-lckv-w2-ft-100b", trust_remote_code=True)

Sample text generation script:

# This is consistent with the `run_generation.py` script in the GitHub repo: https://github.com/whyNLP/LCKV
import torch
from accelerate.utils import set_seed

from transformers import pipeline


set_seed(42)

pipe = pipeline(
    "text-generation",
    model="whynlp/tinyllama-lckv-w2-ft-100b",
    torch_dtype=torch.bfloat16,
    device="cuda",
    trust_remote_code=True,
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

response = pipe(
    "the meaning of life is",
    add_special_tokens=False,
    max_new_tokens=50,
    temperature=1.0,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
)

print(response[0]["generated_text"])
# the meaning of life is the tension that this presence gives rise to each moment of the thought to let live out the moment of my appearance. For Sarkar, sense is what has also forgotten: It is forgets.
# On kiu3/ this is and

The LCKV Collection

The model has 2 warmup layers, i.e., its KV cache is 3/22 the size of that of a standard TinyLlama.
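The 3/22 figure can be reproduced with a little arithmetic. The sketch below assumes that the cached layers are the warmup layers plus the single shared top layer, and that a standard TinyLlama caches keys and values for all 22 layers. The function names, the sequence length, and the `kv_heads` and `head_dim` values are illustrative assumptions, so check them against the model config before relying on the byte counts.

```python
from fractions import Fraction


def lckv_cache_fraction(warmup_layers, total_layers):
    """Fraction of per-layer KV caches kept, assuming warmup layers + 1 shared top layer."""
    return Fraction(warmup_layers + 1, total_layers)


def kv_cache_bytes(cached_layers, seq_len, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed for one sequence; the leading 2 counts keys and values."""
    return 2 * cached_layers * seq_len * kv_heads * head_dim * bytes_per_value


TOTAL_LAYERS = 22   # from the "3/22" statement above
WARMUP_LAYERS = 2   # from the model description

# Assumed values for illustration only (not stated in this model card).
SEQ_LEN, KV_HEADS, HEAD_DIM = 2048, 4, 64

fraction = lckv_cache_fraction(WARMUP_LAYERS, TOTAL_LAYERS)
standard = kv_cache_bytes(TOTAL_LAYERS, SEQ_LEN, KV_HEADS, HEAD_DIM)
condensed = kv_cache_bytes(WARMUP_LAYERS + 1, SEQ_LEN, KV_HEADS, HEAD_DIM)

print(f"cache fraction: {fraction}")
print(f"standard: {standard / 2**20:.1f} MiB, condensed: {condensed / 2**20:.1f} MiB")
```

Whatever values are used for the sequence length and head sizes, the ratio between the two byte counts stays at the same fraction, because only the number of cached layers differs.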

This model was first initialized from the TinyLlama 2.5T checkpoint and then further pre-trained on 100B tokens from SlimPajama.

Because the model structure has changed, the initialization cannot inherit the performance of the TinyLlama checkpoint, but it still boosts training effectively compared with pre-training from scratch.

The evaluation follows that of TinyLlama. Refer to our paper for more details.

| Model | Paper Section | Dev ppl. | Common-sense Reasoning |
| --- | --- | --- | --- |
| whynlp/tinyllama-lckv-w10-ft-250b | -- | 7.939 | 50.86 |
| whynlp/tinyllama-lckv-w2-ft-100b | Appendix C.1, Table 7 (line 5) | 8.514 | 49.55 |
| whynlp/tinyllama-lckv-w10-100b | Section 3.2, Table 2 (line 3) | 9.265 | 46.84 |
| whynlp/tinyllama-lckv-w2-100b | Section 3.2, Table 2 (line 2) | 9.746 | 45.45 |
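
To make the comparison in the table concrete, the short sketch below computes the gap between two rows. The numbers are copied from the table; the `compare` helper is hypothetical, and reading the rows without `ft` in their names as the models pre-trained from scratch is an assumption based on the description above.

```python
# (dev perplexity, common-sense reasoning score), copied from the table.
results = {
    "whynlp/tinyllama-lckv-w10-ft-250b": (7.939, 50.86),
    "whynlp/tinyllama-lckv-w2-ft-100b": (8.514, 49.55),
    "whynlp/tinyllama-lckv-w10-100b": (9.265, 46.84),
    "whynlp/tinyllama-lckv-w2-100b": (9.746, 45.45),
}


def compare(model_a, model_b):
    """Return (perplexity drop, reasoning gain) of model_a relative to model_b."""
    ppl_a, score_a = results[model_a]
    ppl_b, score_b = results[model_b]
    return round(ppl_b - ppl_a, 3), round(score_a - score_b, 2)


# Same warmup setting (w2) and token budget (100B), with and without the TinyLlama init.
ppl_drop, score_gain = compare(
    "whynlp/tinyllama-lckv-w2-ft-100b",
    "whynlp/tinyllama-lckv-w2-100b",
)
print(f"dev ppl. lower by {ppl_drop}, common-sense reasoning higher by {score_gain}")
```

Lower perplexity and a higher reasoning score are both better, so positive values from `compare` mean the first model is ahead on both metrics.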

Safetensors: model size 1B params, tensor type BF16

Collection including whynlp/tinyllama-lckv-w2-ft-100b: LCKV-1.1B-reproduction (4 items, updated Dec 3, 2024)

Paper for whynlp/tinyllama-lckv-w2-ft-100b: Layer-Condensed KV Cache for Efficient Inference of Large Language Models (arXiv:2405.10637, published May 17, 2024)