Hello, may I ask what dataset you are using? Is it open source or self-made?

by xttttttttt - opened Dec 27, 2023

base: refs/heads/main ← from: refs/pr/1

Files changed (6): +22 -75

README.md +13 -43
config.json +3 -4
generation_config.json +1 -1
model.safetensors +0 -3
special_tokens_map.json +3 -21
tokenizer_config.json +2 -3

README.md CHANGED

@@ -1,6 +1,3 @@
----
-{}
----
 Small dummy LLama2-type Model useable for Unit/Integration tests. Suitable for CPU only machines, see [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio/blob/main/tests/integration/test_integration.py) for an example integration test.
 Model was created as follows:
@@ -11,7 +8,7 @@ repo_name = "MaxJeblick/llama2-0b-unit-test"
 model_name = "h2oai/h2ogpt-4096-llama2-7b-chat"
 config = AutoConfig.from_pretrained(model_name)
 config.hidden_size = 12
-config.max_position_embeddings = 1024
 config.intermediate_size = 24
 config.num_attention_heads = 2
 config.num_hidden_layers = 2
@@ -27,44 +24,17 @@ tokenizer.push_to_hub(repo_name, private=False)
 config.push_to_hub(repo_name, private=False)
 ```
-Below is a small example that will run in ~ 1 second.
-```python
-import torch
-from transformers import AutoModelForCausalLM
-def test_manual_greedy_generate():
-    max_new_tokens = 10
-    # note this is on CPU!
-    model = AutoModelForCausalLM.from_pretrained("MaxJeblick/llama2-0b-unit-test").eval()
-    input_ids = model.dummy_inputs["input_ids"]
-    y = model.generate(input_ids, max_new_tokens=max_new_tokens)
-    assert y.shape == (3, input_ids.shape[1] + max_new_tokens)
-    for _ in range(max_new_tokens):
-        with torch.no_grad():
-            outputs = model(input_ids)
-        next_token_logits = outputs.logits[:, -1, :]
-        next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
-        input_ids = torch.cat([input_ids, next_token_id], dim=-1)
-    assert torch.allclose(y, input_ids)
-```
-Tipp:
-Use fixtures with session scope to load the model only once. This will decrease test runtime further.
-```python
-import pytest
-from transformers import AutoModelForCausalLM
-@pytest.fixture(scope="session")
-def model():
-    return AutoModelForCausalLM.from_pretrained("MaxJeblick/llama2-0b-unit-test").eval()
-```

 Small dummy LLama2-type Model useable for Unit/Integration tests. Suitable for CPU only machines, see [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio/blob/main/tests/integration/test_integration.py) for an example integration test.
 Model was created as follows:
 model_name = "h2oai/h2ogpt-4096-llama2-7b-chat"
 config = AutoConfig.from_pretrained(model_name)
 config.hidden_size = 12
+config.max_position_embeddings = 32
 config.intermediate_size = 24
 config.num_attention_heads = 2
 config.num_hidden_layers = 2
 config.push_to_hub(repo_name, private=False)
 ```
+Use the following configuration in [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio) to run a complete experiment in **5 seconds** using the default dataset and default settings otherwise:
+```yaml
+Validation Size: 0.1
+Data Sample: 0.1
+Max Length Prompt: 32
+Max Length Answer: 32
+Max Length: 64
+Backbone Dtype: float16
+Gradient Checkpointing: False
+Batch Size: 8
+Max Length Inference: 16
+```

config.json CHANGED

@@ -4,14 +4,13 @@
     "LlamaForCausalLM"
   ],
   "attention_bias": false,
-  "attention_dropout": 0.0,
   "bos_token_id": 1,
   "eos_token_id": 2,
   "hidden_act": "silu",
   "hidden_size": 12,
   "initializer_range": 0.02,
   "intermediate_size": 24,
-  "max_position_embeddings": 1024,
   "model_type": "llama",
   "num_attention_heads": 2,
   "num_hidden_layers": 2,
@@ -21,8 +20,8 @@
   "rope_scaling": null,
   "rope_theta": 10000.0,
   "tie_word_embeddings": false,
-  "torch_dtype": "float16",
-  "transformers_version": "4.38.1",
   "use_cache": true,
   "vocab_size": 32000
 }

     "LlamaForCausalLM"
   ],
   "attention_bias": false,
   "bos_token_id": 1,
   "eos_token_id": 2,
   "hidden_act": "silu",
   "hidden_size": 12,
   "initializer_range": 0.02,
   "intermediate_size": 24,
+  "max_position_embeddings": 32,
   "model_type": "llama",
   "num_attention_heads": 2,
   "num_hidden_layers": 2,
   "rope_scaling": null,
   "rope_theta": 10000.0,
   "tie_word_embeddings": false,
+  "torch_dtype": "float32",
+  "transformers_version": "4.34.0",
   "use_cache": true,
   "vocab_size": 32000
 }

generation_config.json CHANGED

@@ -2,5 +2,5 @@
   "_from_model_config": true,
   "bos_token_id": 1,
   "eos_token_id": 2,
-  "transformers_version": "4.38.1"
 }

   "_from_model_config": true,
   "bos_token_id": 1,
   "eos_token_id": 2,
+  "transformers_version": "4.34.0"
 }

model.safetensors DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:5108f9b61c4c32b2ae72fd11c85535054ea4ffef80fa0fb8a2cd7c5d0e7de717
-size 3085952
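
For context on the size in the LFS pointer above, the parameter count implied by the values in config.json can be worked out by hand. The following is a rough sketch, not the author's method: it assumes a standard Llama layout with bias-free projections, as many key/value heads as attention heads (the `num_key_value_heads` value is not visible in this diff), and untied input/output embeddings as `tie_word_embeddings: false` suggests. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_hidden_layers, tie_word_embeddings=False):
    """Estimate the parameter count of a Llama-style model (assumed layout)."""
    embed = vocab_size * hidden_size                      # token embedding matrix
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    # q, k, v, o projections; assumes kv heads == attention heads, no bias
    attention = 4 * hidden_size * hidden_size
    mlp = 3 * hidden_size * intermediate_size             # gate, up, down projections
    norms = 2 * hidden_size                               # two RMSNorm weights per layer
    per_layer = attention + mlp + norms
    final_norm = hidden_size
    return embed + lm_head + num_hidden_layers * per_layer + final_norm


# Values taken from config.json in this diff
params = llama_param_count(vocab_size=32000, hidden_size=12,
                           intermediate_size=24, num_hidden_layers=2)
print(params)        # 770940
print(params * 2)    # bytes at 2 bytes per parameter: 1541880
print(params * 4)    # bytes at 4 bytes per parameter: 3083760
```

Under these assumptions the 4-byte figure lands within a few kilobytes of the 3085952 bytes listed in the pointer, but the diff does not state the dtype the weights were stored in, so treat this only as a sanity check. Nearly all parameters sit in the two 32000 x 12 embedding matrices; the two transformer layers contribute under 3000.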

special_tokens_map.json CHANGED

@@ -1,23 +1,5 @@
 {
-  "bos_token": {
-    "content": "<s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "eos_token": {
-    "content": "</s>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  },
-  "unk_token": {
-    "content": "<unk>",
-    "lstrip": false,
-    "normalized": false,
-    "rstrip": false,
-    "single_word": false
-  }
 }

 {
+  "bos_token": "<s>",
+  "eos_token": "</s>",
+  "unk_token": "<unk>"
 }
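
The two sides of this file differ only in form: one gives each special token as an object with a `content` field plus stripping and normalization flags, the other as a plain string. A minimal sketch of how the two relate, using a hypothetical helper `flatten_special_tokens`; this is an illustration, not a claim about how either file was generated, and flattening discards the `lstrip`, `normalized`, `rstrip` and `single_word` flags.

```python
def flatten_special_tokens(token_map):
    """Reduce verbose special-token entries to their content strings."""
    return {
        name: spec["content"] if isinstance(spec, dict) else spec
        for name, spec in token_map.items()
    }


# Verbose form, as on the removed side of the diff
verbose = {
    "bos_token": {"content": "<s>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False},
    "eos_token": {"content": "</s>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False},
    "unk_token": {"content": "<unk>", "lstrip": False, "normalized": False,
                  "rstrip": False, "single_word": False},
}

flat = flatten_special_tokens(verbose)
print(flat)
```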

tokenizer_config.json CHANGED

@@ -1,6 +1,4 @@
 {
-  "add_bos_token": true,
-  "add_eos_token": false,
   "added_tokens_decoder": {
     "0": {
       "content": "<unk>",
@@ -27,6 +25,7 @@
       "special": true
     }
   },
   "bos_token": "<s>",
   "clean_up_tokenization_spaces": false,
   "eos_token": "</s>",
@@ -37,5 +36,5 @@
   "sp_model_kwargs": {},
   "tokenizer_class": "LlamaTokenizer",
   "unk_token": "<unk>",
-  "use_default_system_prompt": false
 }

 {
   "added_tokens_decoder": {
     "0": {
       "content": "<unk>",
       "special": true
     }
   },
+  "additional_special_tokens": [],
   "bos_token": "<s>",
   "clean_up_tokenization_spaces": false,
   "eos_token": "</s>",
   "sp_model_kwargs": {},
   "tokenizer_class": "LlamaTokenizer",
   "unk_token": "<unk>",
+  "use_default_system_prompt": true
 }