Rish AI
Model Description
Rish AI is a Mixture of Experts (MoE) transformer model for language understanding and generation. It uses sparse routing that activates 5 of its 7 experts per token, rotary position embeddings with dynamic scaling, and grouped query attention.
Key Features
- Sparse Mixture of Experts: 7 experts, of which 5 are activated per token
- Rotary Position Embeddings: Dynamic RoPE scaling for better long-context handling
- Grouped Query Attention: Efficient attention with reduced key/value heads
- RMSNorm: Root-mean-square normalization for stable training
- Load Balancing: Automatic expert load balancing during training
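The sparse routing in the first bullet can be illustrated with a small, dependency-free sketch. It assumes the common recipe of a softmax over router scores, a top-k selection, and renormalization of the kept weights; this card does not confirm that Rish AI's router works exactly this way. The function name `route_top_k` and the example scores are made up for illustration.

```python
import math

def route_top_k(router_logits, k=5):
    """Pick the k highest-scoring experts for one token and renormalize their weights."""
    # Softmax over all experts (subtract the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k most probable experts, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the selected experts' weights sum to 1.
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]

# One token's router scores over 7 experts (hypothetical numbers).
logits = [0.2, 1.5, -0.3, 0.9, 2.1, -1.0, 0.4]
selected = route_top_k(logits, k=5)
```

Each entry of `selected` is an `(expert_index, weight)` pair; the token's output would be the weighted sum of those five experts' outputs, and the remaining two experts are skipped for that token.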
Usage
Installation
pip install transformers
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model_name = "your-org/RishAI-1B-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
# Generate response (max_new_tokens counts generated tokens only, not the prompt)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Advanced Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model with specific configuration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/RishAI-1B-7B",
    torch_dtype=torch.bfloat16,  # For memory efficiency
    device_map="auto",  # Automatic device placement
)
tokenizer = AutoTokenizer.from_pretrained("your-org/RishAI-1B-7B")
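# Multi-turn conversation
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
    {"role": "user", "content": "Can you give a practical example?"},
]
# Format conversation and append the assistant prompt so the model writes the next turn
formatted_input = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
# The template already inserts special tokens, so do not add them a second time;
# move the tensors to the device chosen by device_map="auto"
inputs = tokenizer(
    formatted_input, return_tensors="pt", add_special_tokens=False
).to(model.device)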
# Generate with controlled parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,  # Limit on generated tokens only, not prompt plus output
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
Model Configuration
from transformers import RishAIConfig, RishAIModel
# Create custom configuration
config = RishAIConfig(
    vocab_size=100352,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_experts=7,  # Number of experts
    num_experts_per_tok=5,  # Experts activated per token
    max_position_embeddings=4096,
    rope_scaling={"rope_type": "dynamic", "factor": 1.0},
)
# Initialize model with config (weights are randomly initialized, not pretrained)
model = RishAIModel(config)
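To make the `rope_scaling={"rope_type": "dynamic", ...}` entry more concrete, here is a rough sketch of the dynamic NTK-style rule as it is commonly implemented: the RoPE base stays fixed inside the trained window and grows once the sequence exceeds `max_position_embeddings`. Whether Rish AI applies exactly this formula is an assumption. `head_dim=128` follows from `hidden_size / num_attention_heads` in the config above; `base=10000.0` is an assumed default that this card does not state, and both function names are hypothetical.

```python
def dynamic_rope_base(seq_len, max_position_embeddings=4096, factor=1.0,
                      head_dim=128, base=10000.0):
    """Return the RoPE base used for a sequence of seq_len tokens."""
    # Inside the trained window the base is left untouched.
    if seq_len <= max_position_embeddings:
        return base
    # Beyond it, the base grows with sequence length (dynamic NTK-style rule).
    scale = (factor * seq_len / max_position_embeddings) - (factor - 1)
    return base * scale ** (head_dim / (head_dim - 2))

def inverse_frequencies(rope_base, head_dim=128):
    """Per-pair rotation frequencies: rope_base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [rope_base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

short_base = dynamic_rope_base(2048)      # within the 4096-token window
extended_base = dynamic_rope_base(8192)   # twice the window
freqs = inverse_frequencies(extended_base)
```

A larger base lowers the rotation frequencies, so positions beyond the trained window map to angles closer to those seen during training.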
Model Architecture
Sparse Mixture of Experts (MoE)
- Experts: 7 specialized sub-networks
- Routing: Top-5 expert selection per token
- Load Balancing: Automatic expert utilization optimization
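One widely used way to encourage balanced expert utilization is an auxiliary loss of the form `num_experts * sum_i(f_i * p_i)`, where `f_i` is the share of routing slots sent to expert `i` and `p_i` is the mean router probability for expert `i`. The sketch below shows that formulation in plain Python. It is an illustration under that assumption, not a description of Rish AI's actual training objective, and the function name and toy batches are hypothetical.

```python
def load_balancing_loss(router_probs, assignments, num_experts=7):
    """Auxiliary loss: num_experts * sum_i(f_i * p_i).

    router_probs: per-token list of probabilities over all experts.
    assignments:  per-token list of the expert indices chosen for that token.
    """
    num_tokens = len(router_probs)
    slots = sum(len(chosen) for chosen in assignments)

    # f_i: share of routing slots that went to expert i.
    load = [0.0] * num_experts
    for chosen in assignments:
        for i in chosen:
            load[i] += 1.0 / slots

    # p_i: mean router probability given to expert i.
    mean_prob = [sum(p[i] for p in router_probs) / num_tokens
                 for i in range(num_experts)]

    return num_experts * sum(f * p for f, p in zip(load, mean_prob))

# Balanced toy batch: 7 tokens, uniform probabilities,
# each token takes 5 consecutive experts (wrapping around).
uniform = [[1 / 7] * 7 for _ in range(7)]
balanced = [[(t + j) % 7 for j in range(5)] for t in range(7)]
balanced_loss = load_balancing_loss(uniform, balanced)

# Skewed toy batch: every token favors, and is routed to, experts 0-4.
skewed_probs = [[0.19] * 5 + [0.025] * 2 for _ in range(7)]
skewed = [[0, 1, 2, 3, 4] for _ in range(7)]
skewed_loss = load_balancing_loss(skewed_probs, skewed)
```

The balanced batch gives the minimum value of 1, while the skewed batch scores higher, so adding a small multiple of this term to the training loss pushes the router toward even usage.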
Attention Mechanism
- Grouped Query Attention: Efficient key/value head reduction
- Rotary Embeddings: Position-aware attention with dynamic scaling
- RMSNorm: Root-mean-square normalization for stable training
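The key/value head reduction in grouped query attention can be sketched as a simple index mapping plus a cache-size calculation. The value `num_key_value_heads=8` is hypothetical, since this card does not state how many key/value heads Rish AI uses; `head_dim=128` follows from `hidden_size / num_attention_heads` in the example configuration, and 2 bytes per value assumes bfloat16. Both function names are made up for illustration.

```python
def kv_head_for_query_head(query_head, num_attention_heads=32, num_key_value_heads=8):
    """Index of the key/value head shared by this query head."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly into key/value groups")
    group_size = num_attention_heads // num_key_value_heads
    return query_head // group_size

def kv_cache_bytes(seq_len, num_layers=32, num_key_value_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (2 tensors per layer)."""
    return 2 * num_layers * num_key_value_heads * head_dim * seq_len * bytes_per_value

# Query heads 0-3 share K/V head 0, heads 4-7 share K/V head 1, and so on.
mapping = [kv_head_for_query_head(q) for q in range(32)]

full_cache = kv_cache_bytes(4096, num_key_value_heads=32)    # one K/V head per query head
grouped_cache = kv_cache_bytes(4096, num_key_value_heads=8)  # hypothetical grouped setting
```

Under these assumptions the cache for a 4096-token sequence shrinks by a factor of four, from 2 GiB to 512 MiB, which is where most of the efficiency gain of grouped query attention comes from at inference time.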
Training Features
- Gradient Checkpointing: Memory-efficient training
- Flash Attention: Optimized attention computation
- Expert Parallelism: Distributed expert training
Performance
Speed
- Inference: Optimized for fast generation
- Training: Efficient MoE routing and load balancing
- Memory: Only 5 of 7 experts run per token, which lowers per-token compute; the weights of all 7 experts must still be held in memory
Quality
- Perplexity: Competitive with state-of-the-art models
- Long Context: Effective handling of 4K+ token sequences
- Multitask: Strong performance across diverse tasks
Limitations
- Requires significant computational resources for training
- Weight memory scales with the total number of experts, while per-token compute scales with the number of active experts
- Best performance on modern GPUs with ample VRAM
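The gap between stored and active expert weights can be estimated with a short calculation. The sketch uses `hidden_size`, `num_hidden_layers`, `num_experts`, and `num_experts_per_tok` from the example configuration; `intermediate_size=11008` and the three-matrix gated feed-forward layout are assumptions, since this card does not describe the expert architecture. The function name is hypothetical, and attention, embedding, and router weights are left out.

```python
def moe_ffn_params(hidden_size=4096, intermediate_size=11008, num_layers=32,
                   num_experts=7, num_experts_per_tok=5):
    """Return (stored, active_per_token) expert feed-forward parameter counts."""
    # Assumed gated FFN: gate, up, and down projections,
    # each of shape hidden_size x intermediate_size.
    per_expert = 3 * hidden_size * intermediate_size
    stored = num_layers * num_experts * per_expert
    active = num_layers * num_experts_per_tok * per_expert
    return stored, active

stored, active = moe_ffn_params()
stored_gib_bf16 = stored * 2 / 2**30   # bfloat16 uses 2 bytes per parameter
active_fraction = active / stored      # share of expert weights used per token
```

Whatever the real expert size is, `active_fraction` stays at 5/7 for this routing setup: roughly 71% of the expert weights take part in each token's forward pass, but 100% of them have to fit in memory.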
Citation
@misc{rishailabs_2026,
author = { RishAILabs },
title = { RLLM-Base (Revision 552ee30) },
year = 2026,
url = { https://huggingface.co/RishAILabs/RLLM-Base },
doi = { 10.57967/hf/7560 },
publisher = { Hugging Face }
}
License
This model is released under the Apache 2.0 license.