Instructions for using ISTA-MLCV/llama_2_13b_single_emb with libraries, inference providers, notebooks, and local apps. Follow the links below to get started.
- Libraries
- Transformers
How to use ISTA-MLCV/llama_2_13b_single_emb with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="ISTA-MLCV/llama_2_13b_single_emb")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ISTA-MLCV/llama_2_13b_single_emb")
model = AutoModelForCausalLM.from_pretrained("ISTA-MLCV/llama_2_13b_single_emb")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use ISTA-MLCV/llama_2_13b_single_emb with vLLM:
Install from pip and serve the model:

```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ISTA-MLCV/llama_2_13b_single_emb"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ISTA-MLCV/llama_2_13b_single_emb",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
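Once the server is running, you can also call the OpenAI-compatible endpoint from Python with the `openai` client instead of curl. This is a minimal sketch, assuming the server was started as shown above and that the `openai` package is installed; the same pattern works for the SGLang server below on port 30000.

```python
# Minimal sketch: query the local vLLM server through its OpenAI-compatible API.
# Assumes `pip install openai` and a server started with the command above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # use port 30000 for the SGLang server
    api_key="EMPTY",  # local servers typically ignore the key, but the client requires one
)

response = client.chat.completions.create(
    model="ISTA-MLCV/llama_2_13b_single_emb",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(response.choices[0].message.content)
```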
- SGLang
How to use ISTA-MLCV/llama_2_13b_single_emb with SGLang:
Install from pip and serve the model:

```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ISTA-MLCV/llama_2_13b_single_emb" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ISTA-MLCV/llama_2_13b_single_emb",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use the Docker image:

```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ISTA-MLCV/llama_2_13b_single_emb" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ISTA-MLCV/llama_2_13b_single_emb",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
- Docker Model Runner
How to use ISTA-MLCV/llama_2_13b_single_emb with Docker Model Runner:
```bash
docker model run hf.co/ISTA-MLCV/llama_2_13b_single_emb
```
Llama 2 13B Vanilla
This is the Llama 2 13B model fine-tuned as the vanilla (unmodified) baseline, trained and evaluated in the paper ASIDE: Architectural Separation of Instructions and Data in Language Models.
Model Description
This is the vanilla (unmodified) baseline: it is fine-tuned with the same training data and procedure as the ASIDE models, but without any embedding modification.
Usage
To use this model, first clone the official ASIDE Repository and follow its installation instructions.
Inside the repository, run the following code snippet (also provided as a script) to run inference with this model.
```python
import torch
import deepspeed
import json
import os
from huggingface_hub import login

from model_api import CustomModelHandler  # custom model handler from the ASIDE repository
from model_api import format_prompt       # prompt formatting helper from the ASIDE repository

# Define your instruction and data
instruction_text = "Translate to German."
data_text = "Who is Albert Einstein?"

# Model configuration
hf_token = os.environ["HUGGINGFACE_HUB_TOKEN"]
login(token=hf_token)

embedding_type = "single_emb"
base_model = "meta-llama/Llama-2-13b-hf"
model_path = "ISTA-MLCV/llama_2_13b_single_emb"  # this model

# Initialize the model handler
handler = CustomModelHandler(
    model_path,
    base_model,
    base_model,
    model_path,
    None,
    0,
    embedding_type=embedding_type,
    load_from_checkpoint=True,
)

# Initialize the DeepSpeed inference engine
engine = deepspeed.init_inference(
    model=handler.model,
    mp_size=torch.cuda.device_count(),  # number of GPUs
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=False,
)
handler.model = engine.module

# Load prompt templates
with open("./data/prompt_templates.json", "r") as f:
    templates = json.load(f)
template = templates[0]

instruction_text = format_prompt(instruction_text, template, "system")
data_text = format_prompt(data_text, template, "user")

# Generate output
output, inp = handler.call_model_api_batch([instruction_text], [data_text])
print(output)
```
Citation
If you use this model, please cite our paper:
```bibtex
@inproceedings{
  zverev2026aside,
  title={{ASIDE}: Architectural Separation of Instructions and Data in Language Models},
  author={Egor Zverev and Evgenii Kortukov and Alexander Panfilov and Alexandra Volkova and Rush Tabesh and Sebastian Lapuschkin and Wojciech Samek and Christoph H. Lampert},
  booktitle={The Fourteenth International Conference on Learning Representations},
  year={2026},
  url={https://openreview.net/forum?id=C81TnwHiRM}
}
```