How to use souvik18/Roy with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="souvik18/Roy")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("souvik18/Roy")
model = AutoModelForCausalLM.from_pretrained("souvik18/Roy")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
How to use souvik18/Roy with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "souvik18/Roy"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "souvik18/Roy",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
# Or run with Docker:
docker model run hf.co/souvik18/Roy
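The curl request above can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server is running locally on port 8000 as in the command above. The helper names `build_chat_payload` and `chat` and the `max_tokens` default are illustrative choices, not part of the model card.

```python
import json
import urllib.request

# Assumption: the vLLM server started above is listening on this address.
BASE_URL = "http://localhost:8000/v1"


def build_chat_payload(prompt, model="souvik18/Roy", max_tokens=128):
    """Build the JSON body for an OpenAI-style chat completion request."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL, timeout=60):
    """POST the prompt to /chat/completions and return the reply text."""
    body = json.dumps(build_chat_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    # OpenAI-compatible servers return the text under choices[0].message.content.
    return result["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(chat("What is the capital of France?"))
```

The same client should work against the SGLang server described next by passing `base_url="http://localhost:30000/v1"`, since both expose an OpenAI-compatible endpoint.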
How to use souvik18/Roy with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "souvik18/Roy" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "souvik18/Roy",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
# Or start the SGLang server with Docker:
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "souvik18/Roy" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "souvik18/Roy",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
How to use souvik18/Roy with Docker Model Runner:
docker model run hf.co/souvik18/Roy
Roy is a fine-tuned large language model based on mistralai/Mistral-7B-Instruct-v0.2.
The model was trained using QLoRA with a resumable streaming pipeline and later merged into the base model to produce a single standalone checkpoint (no LoRA adapter required at inference time).
This model is optimized for:
The model was trained on a custom tokenized dataset:
Dataset: mistral_tokenized_2048_fixed_v2
Field: input_ids
LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
After training, the LoRA adapter was merged into the base model weights to create this final model.
This model can be used directly without any LoRA adapter.
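To make the merge step concrete, here is a toy sketch of what folding a LoRA adapter into a weight matrix means: each adapted layer ends up with W + (alpha / r) * B A. The matrices, `alpha`, and `rank` below are made-up values for illustration, and `merge_lora` is a hypothetical helper, not the pipeline used to build Roy. Libraries such as peft typically perform this step via `merge_and_unload()`.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 base weight with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]      # shape (rank, in_features)
B = [[0.5], [0.0]]    # shape (out_features, rank)

merged = merge_lora(W, A, B, alpha=2, rank=1)
print(merged)  # [[2.0, 2.0], [0.0, 1.0]]
```

Once the update is added into W, the adapter matrices are no longer needed at inference time, which is why the merged checkpoint loads like any ordinary causal language model.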
!pip uninstall -y transformers peft accelerate torch safetensors numpy
!pip install numpy==1.26.4
!pip install torch==2.2.2
!pip install transformers==4.41.2
!pip install peft==0.11.1
!pip install accelerate==0.30.1
!pip install safetensors==0.4.3
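After installing, it can help to confirm that the environment matches the pins. The sketch below reads installed versions with the standard library's `importlib.metadata`; the `PINNED` table is copied from the pip commands above, and `find_mismatches` is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip commands above.
PINNED = {
    "numpy": "1.26.4",
    "torch": "2.2.2",
    "transformers": "4.41.2",
    "peft": "0.11.1",
    "accelerate": "0.30.1",
    "safetensors": "0.4.3",
}


def find_mismatches(pins):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        # Compare the release part only; some builds report e.g. "2.2.2+cu121".
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


for name, (expected, found) in find_mismatches(PINNED).items():
    print(f"{name}: expected {expected}, found {found}")
```

If nothing is printed, every pinned package is installed at the expected version.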
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# -----------------------------
# CONFIG
# -----------------------------
MODEL_ID = "souvik18/Roy"
DTYPE = torch.float16 # use float16 for GPU
# -----------------------------
# LOAD TOKENIZER & MODEL
# -----------------------------
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=DTYPE,
    device_map="auto",
)
model.eval()
print("\nModel loaded successfully")
print("Type 'exit' or 'quit' to stop\n")
# -----------------------------
# CHAT LOOP
# -----------------------------
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ["exit", "quit"]:
        print("Bye!")
        break
    prompt = f"[INST] {user_input} [/INST]"
    inputs = tokenizer(
        prompt,
        return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=200,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            repetition_penalty=1.1,
            eos_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens so the prompt is not echoed back.
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    print(f"\n Roy: {response}\n")
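The loop above sends each message on its own, so the model sees no earlier turns. One way to carry context is to fold the history into the prompt string. This is a minimal sketch that assumes the `[INST] ... [/INST]` layout described for Mistral instruct models; `build_mistral_prompt` is a hypothetical helper, and the exact whitespace may differ from the tokenizer's own chat template, so `tokenizer.apply_chat_template` remains the safer route when available.

```python
def build_mistral_prompt(history, user_input, eos_token="</s>"):
    """Format past (user, assistant) turns plus the new message as one prompt.

    The BOS token is left out on purpose: the tokenizer call in the chat
    loop adds it when encoding the string.
    """
    parts = []
    for user_turn, assistant_turn in history:
        # Each finished exchange is closed with the end-of-sequence token.
        parts.append(f"[INST] {user_turn} [/INST] {assistant_turn}{eos_token}")
    # The new message is left open so the model generates the next reply.
    parts.append(f"[INST] {user_input} [/INST]")
    return "".join(parts)


history = [("Who are you?", "I am Roy.")]
print(build_mistral_prompt(history, "What can you do?"))
# [INST] Who are you? [/INST] I am Roy.</s>[INST] What can you do? [/INST]
```

To use it in the loop, keep a `history` list, build the prompt with this helper instead of the single-turn f-string, and append `(user_input, response)` after each reply.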