Hadamard-Walsh 1.7B is an experimental model using a new positional encoder. The encoder represents absolute positions by using a combination of rows from the Hadamard-Walsh matrix (https://en.wikipedia.org/wiki/Hadamard_code). Each row corresponds to a binary digit in the positional code, where the presence of a row codes for a one and its absence for a zero. During training, the base offset in the sequence is randomly chosen for each batch. The result is that the model is very proficient at sequences much longer than those seen in training.
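The description above can be illustrated with a minimal NumPy sketch. This is not the model's actual implementation. The function names, the choice of which matrix rows stand for which bit (rows 1 through `bits`, skipping the all-ones row 0), and the unscaled sum are all assumptions made for illustration only.

```python
import numpy as np


def hadamard(n):
    """Build an n x n Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h


def walsh_position_codes(positions, dim, bits=14):
    """Encode each position as the sum of the matrix rows selected by its binary digits.

    Assumption: bit k of the position selects row k + 1 of the matrix.
    """
    if bits + 1 > dim:
        raise ValueError("dim must be at least bits + 1")
    rows = hadamard(dim)[1:bits + 1].astype(np.float32)   # (bits, dim)
    positions = np.asarray(positions)
    digits = (positions[:, None] >> np.arange(bits)) & 1   # (n, bits), 0 or 1
    return digits.astype(np.float32) @ rows                # (n, dim)


def random_offset_positions(seq_len, bits=14, rng=None):
    """Pick a random base offset so the sequence still fits in the 2**bits position range."""
    if rng is None:
        rng = np.random.default_rng()
    offset = rng.integers(0, (1 << bits) - seq_len + 1)
    return np.arange(offset, offset + seq_len)


# Example: one training batch of 512 tokens, encoded at a random base offset.
positions = random_offset_positions(512, bits=14)
codes = walsh_position_codes(positions, dim=64, bits=14)
print(codes.shape)
```

Because the rows of a Hadamard matrix are mutually orthogonal, each distinct set of selected rows sums to a distinct vector, so every position in the 14-bit range receives a unique code under these assumptions.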

The encoding scheme was devised when I was performing experiments to determine the degree to which various positional encoding schemes interfere with the information-carrying capacity of embeddings. This particular scheme did exceptionally well. It was both highly resilient to interference and minimally interfering with the embeddings. As a follow-on experiment, I adapted the encoder to work with my transformer implementation and found that it performs exceptionally well when directly compared with other popular positional encoding schemes.

The model has had approximately three weeks of pretraining on six RTX 4090s. It seems to be doing remarkably well when compared with my other model of similar size and training, which uses ALiBi positional encodings and SwiGLU. I have also noted an unusual loss pattern, where evaluation loss shows large punctuated drops. I can only speculate, but my suspicion is that the random offset of the positional encoder may have a regularizing effect on training. The attention patterns are also quite "interesting." [TODO: add images of attention patterns].

Model Details:

Model Dimension: 2048
Hidden Layers: 32
Attention Heads: 32
Feedforward Dimension: 8192
Feedforward Network Type: Conventional MLP with GELU activation
Vocabulary Size: 32000
Max Sequence Length: 16K (14-bit absolute positional encoding via Walsh matrix)
Weight Initialization: DeepNet, https://arxiv.org/abs/2203.00555
Pretraining Datasets: RedPajama-Data-1T, mostly "books" and some Wikipedia.
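As a rough sanity check on the figures above, the following sketch estimates the parameter count and the position range. It assumes a standard transformer layout (four d x d attention projections and a two-matrix MLP per layer) and ignores biases and normalization weights. Whether the input and output embeddings are tied is not stated here, so both cases are shown.

```python
# Values taken from the Model Details list.
d_model = 2048
n_layers = 32
d_ff = 8192
vocab_size = 32000
position_bits = 14

# Assumed layout: Q, K, V and output projections, plus up and down MLP matrices.
attention_params = 4 * d_model * d_model
mlp_params = 2 * d_model * d_ff
block_params = n_layers * (attention_params + mlp_params)
embedding_params = vocab_size * d_model

total_tied = block_params + embedding_params
total_untied = block_params + 2 * embedding_params
max_positions = 2 ** position_bits

print(f"max positions:  {max_positions:,}")   # 16,384
print(f"layers only:    {block_params:,}")    # 1,610,612,736
print(f"tied total:     {total_tied:,}")      # 1,676,148,736
print(f"untied total:   {total_untied:,}")    # 1,741,684,736
```

Either way the estimate lands near 1.7B parameters, and 14 position bits give the 16K maximum sequence length.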

Loading:

The model implementation is all my own, so you will need to use "trust_remote_code" to load the model.

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
)

model_id = "dinalt/walsh-1-7b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # Required because the model code is loaded from the repository
    trust_remote_code=True,
    # flash_attention_2 requires bfloat16 or float16
    torch_dtype=torch.bfloat16,
    # One of ["flash_attention_2", "sdpa", "eager"]
    attn_implementation="flash_attention_2",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

The model has been tested with text-generation-webui, which needs to be started with the "--trust-remote-code" flag.

Safetensors:

Model size: 2B params
Tensor type: BF16

Paper:

DeepNet: Scaling Transformers to 1,000 Layers, arXiv:2203.00555, published Mar 1, 2022