Passing new external memories without re-loading model
Hello,
I'm using extended-mind-mpt-7b to answer multiple questions, each with a different set of external memories from a dataset. Is there a way to feed new memories to the model without having to call AutoModelForCausalLM.from_pretrained("normalcomputing/extended-mind-mpt-7b", external_memories=memory_ids, trust_remote_code=True) each time?
Thanks!
Absolutely. Simply set:
model.empty_memories()  # or: model.memories = None
model._memories = memory_ids
where memory_ids are your new tokenized memories (like you'd pass to the .from_pretrained() call).
When .generate() is called, it checks:
if self._memories is not None and self.memories is None:  # init memories once on first call
    self.memories = self.generate_cache(self._memories, cache_type=self.memory_type)
We will likely make this more user-friendly in upcoming versions!
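The swap described in this reply could be wrapped in a small helper, sketched below. The function name `generate_with_new_memories` and its arguments are hypothetical and not part of the model's code. The sketch assumes that `model.empty_memories()` and `model._memories` behave as described above, and that memories are tokenized with `return_tensors="pt"` just as they would be for the `external_memories=` argument.

```python
def generate_with_new_memories(model, tokenizer, prompt, memory_text, **generate_kwargs):
    """Hypothetical helper: swap in new external memories, then generate.

    Relies on the behaviour described in the reply above: clearing
    `model.memories` and setting `model._memories` makes the next
    `.generate()` call rebuild the memory cache.
    """
    # Assumption: memories are tokenized the same way as for `external_memories=`.
    memory_ids = tokenizer(memory_text, return_tensors="pt")["input_ids"]

    # Drop the cached memories and register the new token ids.
    model.empty_memories()
    model._memories = memory_ids

    # The cache is rebuilt lazily inside generate() on this first call.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    output_ids = model.generate(input_ids, **generate_kwargs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

With this, a loop over (question, memory) pairs from a dataset would load the model once and call the helper per pair, passing generation options such as `max_new_tokens` through `generate_kwargs`.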
Perfect, thanks a lot! It's a super interesting model you've developed.
I actually have another question. In the article you write, "The choice of which memories to attend to is made using cosine similarity within each decoder layer and attention head." I'm having a hard time finding this in the code. Is it at line 112 in scaled_multihead_dot_product_attention in attention.py (https://huggingface.co/normalcomputing/extended-mind-mpt-7b/blob/main/attention.py)?
Indeed! When we compute the inner product (sim = q_n.matmul(k_n)) of the queries with the keys from our external memories, we've already normalized both (see lines 109 and 110), so this is exactly the cosine similarity. We choose to normalize (regular attention uses an unnormalized inner product) to mimic the way vectors are usually retrieved from a vector database.
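To illustrate why the inner product of normalized vectors is the cosine similarity, here is a minimal NumPy sketch for a single attention head. It is not the code from attention.py: the function name, the array shapes, and the top-k selection step are assumptions made for illustration.

```python
import numpy as np

def topk_memories_by_cosine(q, k_mem, topk=2):
    """Pick the top-k memory keys per query by cosine similarity.

    q:     (n_queries, d)  query vectors for one attention head
    k_mem: (n_memories, d) key vectors from the external memories
    """
    # L2-normalize along the feature dimension so every vector has unit length.
    q_n = q / np.linalg.norm(q, axis=-1, keepdims=True)
    k_n = k_mem / np.linalg.norm(k_mem, axis=-1, keepdims=True)

    # Inner product of unit vectors equals cosine similarity.
    # Shape: (n_queries, n_memories).
    sim = q_n @ k_n.T

    # Assumed selection rule: keep the k most similar memories, best first.
    idx = np.argsort(-sim, axis=-1)[:, :topk]
    return idx, np.take_along_axis(sim, idx, axis=-1)

q = np.array([[1.0, 0.0]])
k_mem = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
idx, scores = topk_memories_by_cosine(q, k_mem, topk=2)
print(idx)     # memory 0 ranks first, then memory 2
print(scores)  # approximately 1.0 and 0.7071
```

Note that the key magnitudes (2.0 and 3.0 here) have no effect on the ranking, which is the property that distinguishes this from the unnormalized inner product used in regular attention.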
Thanks for the questions!
Great, I see. Thanks for the fast replies!