Passing new external memories without re-loading model

by xmrt - opened Mar 5, 2024

Discussion

xmrt

Mar 5, 2024

edited Mar 5, 2024

Hello,

I'm using extended-mind-mpt-7b to answer multiple questions, each with a different set of external memories from a dataset. Is there a way to feed new memories to the model without having to call AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", external_memories=memory_ids, trust_remote_code=True) each time?

Thanks!

phoebeklett

Mar 5, 2024

Absolutely. Simply set:

model.empty_memories()  # or: model.memories = None
model._memories = memory_ids

where memory_ids are your new tokenized memories (like you'd pass to the .from_pretrained() call).

When .generate() is called, it checks:

if self._memories is not None and self.memories is None:  # init memories once on first call
    self.memories = self.generate_cache(self._memories, cache_type=self.memory_type)

We'll likely make this more user-friendly in upcoming versions!
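The two assignments above can be wrapped in a small helper for looping over a dataset. This is a minimal sketch, not part of the model's API: swap_memories is a hypothetical name, and it relies only on the memories and _memories attributes mentioned in the reply.

```python
def swap_memories(model, memory_ids):
    """Point an already-loaded model at a new set of tokenized memories.

    Hypothetical helper: it only touches the `memories` and `_memories`
    attributes described in the reply above.
    """
    # Clear the cached memories so the next generate() call rebuilds them.
    model.memories = None
    # Store the new token ids; per the reply, generate() builds the cache lazily.
    model._memories = memory_ids
    return model
```

Calling swap_memories(model, memory_ids) before each .generate() call should then reuse the loaded weights, assuming memory_ids is tokenized the same way as for the .from_pretrained() call.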

xmrt

Mar 6, 2024

Perfect, thanks a lot! It's a super interesting model you've developed.

I actually have another question: in the article you write, "The choice of which memories to attend to is made using cosine similarity within each decoder layer and attention head." I'm having a hard time finding where this happens in the code. Is it at line 112 in scaled_multihead_dot_product_attention in attention.py (https://huggingface.co/normalcomputing/extended-mind-mpt-7b/blob/main/attention.py)?

phoebeklett

Mar 6, 2024

Indeed! When we compute the inner product (sim = q_n.matmul(k_n)) of the queries with the keys from our external memories, we've already normalized them (see lines 109 and 110), so this is exactly the cosine similarity. We choose to normalize (regular attention uses an unnormalized inner product) to mimic the way vectors are usually retrieved from a vector database.
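To illustrate why the inner product of normalized vectors is the cosine similarity, here is a dependency-free sketch on plain Python lists. It is not the code from attention.py: the function names are hypothetical, and the top-k selection step is an assumption about how the most similar memories might be picked.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Textbook cosine similarity, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k_memories(query, memory_keys, k):
    """Return indices of the k memory keys most similar to the query.

    Assumption: a simple top-k over cosine similarities.
    """
    q_n = l2_normalize(query)
    # Inner product of unit vectors equals their cosine similarity.
    sims = [dot(q_n, l2_normalize(key)) for key in memory_keys]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
```

In the model this comparison happens on tensors, per decoder layer and attention head; the list version only shows the arithmetic.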

Thanks for the questions!

xmrt

Mar 6, 2024

Great, I see. Thanks for the fast replies!
