Rish AI

Model Description

Rish AI is a cutting-edge Mixture of Experts (MoE) transformer model designed for efficient and scalable language understanding and generation. It features sparse routing with 7 experts per token, advanced rotary position embeddings, and optimized attention mechanisms.

Key Features

Sparse Mixture of Experts: 7 experts with 5 experts activated per token for optimal efficiency
Rotary Position Embeddings: Dynamic RoPE scaling for better long-context handling
Grouped Query Attention: Efficient attention with reduced key/value heads
RMSNorm: Improved normalization for stable training
Load Balancing: Automatic expert load balancing during training

Usage

Installation

pip install transformers

Basic Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "your-org/RishAI-1B-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_length=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Advanced Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model with specific configuration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/RishAI-1B-7B",
    torch_dtype=torch.bfloat16,  # For memory efficiency
    device_map="auto"  # Automatic device placement
)

tokenizer = AutoTokenizer.from_pretrained("your-org/RishAI-1B-7B")

# Multi-turn conversation
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
    {"role": "user", "content": "Can you give a practical example?"}
]

# Format conversation
formatted_input = tokenizer.apply_chat_template(conversation, tokenize=False)
inputs = tokenizer(formatted_input, return_tensors="pt")

# Generate with controlled parameters
outputs = model.generate(
    **inputs,
    max_length=200,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

Model Configuration

from transformers import RishAIConfig

# Create custom configuration
config = RishAIConfig(
    vocab_size=100352,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_experts=7,           # Number of experts
    num_experts_per_tok=5,   # Experts activated per token
    max_position_embeddings=4096,
    rope_scaling={"rope_type": "dynamic", "factor": 1.0}
)

# Initialize model with config
from transformers import RishAIModel
model = RishAIModel(config)

Model Architecture

Sparse Mixture of Experts (MoE)

Experts: 7 specialized sub-networks
Routing: Top-5 expert selection per token
Load Balancing: Automatic expert utilization optimization

Attention Mechanism

Grouped Query Attention: Efficient key/value head reduction
Rotary Embeddings: Position-aware attention with dynamic scaling
RMSNorm: Stable layer normalization

Training Features

Gradient Checkpointing: Memory-efficient training
Flash Attention: Optimized attention computation
Expert Parallelism: Distributed expert training

Performance

Speed

Inference: Optimized for fast generation
Training: Efficient MoE routing and load balancing
Memory: Sparse activation reduces memory footprint

Quality

Perplexity: Competitive with state-of-the-art models
Long Context: Effective handling of 4K+ token sequences
Multitask: Strong performance across diverse tasks

Limitations

Requires significant computational resources for training
Memory usage scales with number of active experts
Best performance on modern GPUs with ample VRAM

Citation

@misc{rishailabs_2026,
    author       = { RishAILabs },
    title        = { RLLM-Base (Revision 552ee30) },
    year         = 2026,
    url          = { https://huggingface.co/RishAILabs/RLLM-Base },
    doi          = { 10.57967/hf/7560 },
    publisher    = { Hugging Face }
}

License

This model is released under the Apache 2.0 license.