Veyra 30M Instruct
veyra-30m-instruct is the first public instruction-tuned checkpoint in the Veyra 30M line.
It is built on top of veyra-ai/veyra-30m-base-5b-tokens and is intended to test how far a 30M-parameter English base model can be pushed with masked ChatML supervised fine-tuning.
This release is experimental, but packaged as a Transformers model with custom model code. It is meant to be runnable, inspectable, and easy to compare against other tiny instruct models.
Model Details
- Base model: veyra-ai/veyra-30m-base-5b-tokens
- Parameters: approximately 36.2M
- Language: English
- Context length: 1024 tokens
- Architecture: decoder-only causal language model
- Chat format: ChatML
- License: Apache 2.0
- Status: experimental instruct release
Architecture
Veyra-30M is a compact decoder-only transformer-style model.
- Vocabulary size: 8,192
- Hidden size: 512
- Layers: 8
- Layer pattern: alternating attention and MLP-only blocks
- Query heads: 8
- Key/value heads: 2
- Attention type: grouped-query attention
- MLP: SwiGLU
- Normalization: RMSNorm
- Positional encoding: RoPE
- Context length: 1024 tokens
- Tied input/output embeddings
- KV cache support for generation
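The attention figures above imply some simple shape arithmetic, which the sketch below works through. It rests on two assumptions the card does not state: that the per-head dimension is the hidden size divided by the number of query heads, and that "alternating attention and MLP-only blocks" means half of the 8 layers contain attention. Treat the results as illustrative rather than as the model's actual configuration.

```python
# Shape arithmetic for the architecture values listed above.
# Assumptions (not stated in the card): head_dim = hidden_size / query_heads,
# and exactly half of the layers are attention blocks.

HIDDEN_SIZE = 512
NUM_LAYERS = 8
NUM_QUERY_HEADS = 8
NUM_KV_HEADS = 2
CONTEXT_LENGTH = 1024

head_dim = HIDDEN_SIZE // NUM_QUERY_HEADS              # assumed convention
queries_per_kv_head = NUM_QUERY_HEADS // NUM_KV_HEADS  # query heads sharing one KV head
attention_layers = NUM_LAYERS // 2                     # assumed from "alternating"

# Each cached token stores one key and one value vector
# per KV head, per attention layer.
kv_values_per_token = 2 * NUM_KV_HEADS * head_dim * attention_layers
kv_values_full_context = kv_values_per_token * CONTEXT_LENGTH
kv_bytes_16bit = kv_values_full_context * 2  # 2 bytes per value at 16-bit precision

print(f"head_dim: {head_dim}")
print(f"query heads per KV head: {queries_per_kv_head}")
print(f"KV cache values per token: {kv_values_per_token}")
print(f"KV cache at full context (16-bit): {kv_bytes_16bit / 2**20:.1f} MiB")
```

Under these assumptions, sharing each key/value head across four query heads makes the cache a quarter of the size it would be with 8 key/value heads.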
Training
This checkpoint was trained with masked ChatML supervised fine-tuning.
- Base checkpoint: veyra-ai/veyra-30m-base-5b-tokens
- Selected checkpoint: sft_masked_chatml_0100M.pt
- SFT tokens: 100M non-padding tokens
- Supervised assistant tokens: approximately 35M
- Loss masking: system and user turns were used as context but masked from loss; assistant responses were supervised.
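The loss masking described above can be sketched as follows. This is a minimal illustration, not the training code used for this checkpoint, which the card does not publish. The `build_labels` helper, the per-token role annotation, and the token ids are all hypothetical. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions carrying it contribute nothing to the loss.

```python
# Minimal sketch of assistant-only loss masking.
# build_labels and the per-token role list are hypothetical illustrations.

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(token_ids, roles):
    """Keep assistant tokens as labels and mask every other position."""
    if len(token_ids) != len(roles):
        raise ValueError("token_ids and roles must have the same length")
    return [
        token if role == "assistant" else IGNORE_INDEX
        for token, role in zip(token_ids, roles)
    ]


# Toy example with made-up token ids.
token_ids = [11, 12, 21, 22, 23, 31, 32]
roles = ["system", "system", "user", "user", "user", "assistant", "assistant"]

labels = build_labels(token_ids, roles)
print(labels)  # [-100, -100, -100, -100, -100, 31, 32]
```

The system and user tokens still appear in the input, so the model conditions on them, but only the assistant positions produce gradient.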
The released checkpoint was selected from intermediate SFT milestones based on qualitative output quality rather than lowest training loss alone. Later SFT checkpoints reached lower loss but showed more generic refusal and template behavior.
Intended Use
- Tiny-model instruction-following experiments
- Local assistant experiments
- ChatML behavior testing
- Simple formatting and JSON experiments
- Basic Python/helpfulness prompts
- Studying SFT behavior in very small language models
Not Intended For
- Production applications
- Safety-critical use cases
- Medical, legal, financial, or security advice
- Reliable factual QA
- Long-context reasoning
- Robust arithmetic
- User-facing deployment without additional safeguards
Known Limitations
This is a very small model (approximately 36.2M parameters) and has significant limitations.
Known failure modes include:
- Weak variable binding
- Weak arithmetic
- Weak multi-turn recall
- Occasional repetition
- Confident but incorrect answers
- Generic refusal or disclaimer behavior
- Tool-call or reasoning-template contamination
- Sensitivity to prompt wording
- Fluent nonsense on unfamiliar prompts
This model is not fully safety-tuned. It may refuse some harmful requests, but refusal behavior is not reliable.
Prompt Format
This model uses ChatML.
<|im_start|>system
You are Veyra, a tiny local instruction model. Be concise, useful, casual, and lightly playful. Correct mistakes gently.<|im_end|>
<|im_start|>user
What does a tokenizer do?<|im_end|>
<|im_start|>assistant
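For cases where the prompt has to be built by hand, the layout above can be sketched as a small function. The `render_chatml` helper is hypothetical, and the exact newline placement (including the newline after the final assistant header) is an assumption read off the example. The tokenizer's own chat template remains the authoritative source.

```python
# Hand-rolled ChatML rendering that mirrors the layout shown above.
# render_chatml is a hypothetical helper; newline placement is assumed.

def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are Veyra, a tiny local instruction model."},
    {"role": "user", "content": "What does a tokenizer do?"},
]
print(render_chatml(messages))
```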
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "veyra-ai/veyra-30m-instruct"

# trust_remote_code=True is needed because the model ships custom model code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {
        "role": "system",
        "content": "You are Veyra, a tiny local instruction model. Be concise, useful, casual, and lightly playful."
    },
    {
        "role": "user",
        "content": "Explain what an API is using a simple analogy."
    },
]

# Render the conversation as ChatML and open the assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The rendered prompt already contains the ChatML special tokens.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=120,
        temperature=0.3,
        top_k=40,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        # Stop on either the end-of-sequence token or the end of the assistant turn.
        eos_token_id=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|im_end|>")],
        use_cache=True,
    )

# Prints the prompt and the reply, including special tokens.
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
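Because the decoded string contains the whole prompt as well as the reply, a small post-processing step can be useful. The sketch below pulls out the text of the last assistant turn from a decoded ChatML string. The `extract_reply` helper is hypothetical, and the sample string is hand-written for illustration, not actual model output.

```python
# Extract the last assistant turn from a decoded ChatML string.
# extract_reply is a hypothetical helper, not part of the model's code.

def extract_reply(decoded, start="<|im_start|>assistant", end="<|im_end|>"):
    """Return the text between the last assistant header and the next end marker."""
    _, found, tail = decoded.rpartition(start)
    if not found:
        # No assistant header present: return the input unchanged apart from whitespace.
        return decoded.strip()
    reply, _, _ = tail.partition(end)
    return reply.strip()


# Hand-written sample string, not real model output.
sample = (
    "<|im_start|>user\nHi<|im_end|>\n"
    "<|im_start|>assistant\nHello there.<|im_end|>"
)
print(extract_reply(sample))  # Hello there.
```

If generation stops without emitting `<|im_end|>`, the function returns everything after the assistant header, which may include other special tokens.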
Example Prompts
The examples below are suggested prompts for trying the model. They are not benchmark results and should not be treated as representative guarantees.
Explain what an API is using a simple analogy.
Return only JSON for a book with title='Dune', author='Frank Herbert', year=1965.
Write a Python function that checks whether a number is even.
What does FileNotFoundError usually mean in Python?
Given these facts: color=blue, animal=otter, number=17. What animal was mentioned?
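For the JSON prompt, a reply can be checked mechanically instead of by eye. The sketch below is one way to do that with the standard library. The `parse_json_reply` helper and the required keys are hypothetical, and the sample strings are hand-written, not model outputs.

```python
import json

# Check whether a reply is a JSON object containing the expected keys.
# parse_json_reply is a hypothetical helper for experiments.

def parse_json_reply(reply, required_keys=()):
    """Return the parsed dict if reply is valid JSON with all required keys, else None."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    if any(key not in parsed for key in required_keys):
        return None
    return parsed


# Hand-written samples, not real model output.
good = '{"title": "Dune", "author": "Frank Herbert", "year": 1965}'
bad = 'Sure! Here is the JSON: {"title": "Dune"}'

print(parse_json_reply(good, required_keys=("title", "author", "year")))
print(parse_json_reply(bad, required_keys=("title", "author", "year")))  # None
```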
Special Tokens
Important tokenizer special tokens include:
<|bos|>
<|eos|>
<|pad|>
<|unk|>
<|im_start|>
<|im_end|>
<|tool_call|>
<|tool_result|>
<|context|>
<|reasoning|>
<|end_reasoning|>
<|answer|>
<|fim_prefix|>
<|fim_middle|>
<|fim_suffix|>
For this checkpoint, standard ChatML with <|im_start|> and <|im_end|> is the recommended format.
Benchmarks
Benchmarks will be added in a later update.
Citation / Attribution
If you use or build on this model, please retain attribution to Veyra AI.
License
Apache 2.0.