# Rish AI

## Model Description

Rish AI is a cutting-edge Mixture of Experts (MoE) transformer model designed for efficient and scalable language understanding and generation. It features sparse routing over 7 experts (5 active per token), rotary position embeddings with dynamic scaling, and optimized attention mechanisms.

## Key Features

- **Sparse Mixture of Experts**: 7 experts, with 5 activated per token, for an efficient quality/compute trade-off
- **Rotary Position Embeddings**: Dynamic RoPE scaling for better long-context handling
- **Grouped Query Attention**: Efficient attention with reduced key/value heads
- **RMSNorm**: Improved normalization for stable training
- **Load Balancing**: Automatic expert load balancing during training

## Usage

### Installation

```bash
pip install transformers
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "your-org/RishAI-1B-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate response (max_new_tokens bounds the generated text, not the prompt)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Advanced Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model with specific configuration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/RishAI-1B-7B",
    torch_dtype=torch.bfloat16,  # For memory efficiency
    device_map="auto"            # Automatic device placement
)
tokenizer = AutoTokenizer.from_pretrained("your-org/RishAI-1B-7B")

# Multi-turn conversation
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
    {"role": "user", "content": "Can you give a practical example?"}
]

# Format the conversation and append the assistant generation prompt
formatted_input = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_input, return_tensors="pt")

# Generate with controlled parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Model Configuration

```python
from transformers import RishAIConfig, RishAIModel

# Create custom configuration
config = RishAIConfig(
    vocab_size=100352,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_experts=7,            # Number of experts
    num_experts_per_tok=5,    # Experts activated per token
    max_position_embeddings=4096,
    rope_scaling={"rope_type": "dynamic", "factor": 1.0}
)

# Initialize model with config
model = RishAIModel(config)
```

## Model Architecture

### Sparse Mixture of Experts (MoE)

- **Experts**: 7 specialized sub-networks
- **Routing**: Top-5 expert selection per token
- **Load Balancing**: Automatic expert utilization optimization (a minimal sketch of the routing and balancing loss follows below)
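To make the routing concrete, here is a minimal, self-contained sketch of top-k expert routing with a Switch-Transformer-style load-balancing loss. This is an illustration of the general technique under stated assumptions, not Rish AI's released implementation: the names `SparseMoELayer` and `load_balancing_loss`, the `intermediate_size=11008` expert width, the SiLU expert MLP, and the exact loss formulation are all assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Illustrative top-k MoE feed-forward block: 7 experts, 5 active per token.
    (Hypothetical sketch; not Rish AI's actual implementation.)"""

    def __init__(self, hidden_size=4096, intermediate_size=11008,
                 num_experts=7, num_experts_per_tok=5):
        super().__init__()
        self.num_experts_per_tok = num_experts_per_tok
        # Router: produces one logit per expert for every token.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Each expert is an independent feed-forward sub-network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states):
        batch, seq_len, hidden = hidden_states.shape
        tokens = hidden_states.reshape(-1, hidden)        # (batch*seq, hidden)
        router_logits = self.router(tokens)               # (batch*seq, num_experts)

        # Select the top-k experts per token and renormalize their weights.
        topk_logits, selected = torch.topk(
            router_logits, self.num_experts_per_tok, dim=-1)
        topk_weights = F.softmax(topk_logits, dim=-1)     # (batch*seq, k)

        output = torch.zeros_like(tokens)
        for i, expert in enumerate(self.experts):
            # Find every (token, slot) pair routed to expert i.
            token_idx, slot = torch.where(selected == i)
            if token_idx.numel() == 0:
                continue  # this expert received no tokens in this batch
            weights = topk_weights[token_idx, slot].unsqueeze(-1)
            output.index_add_(0, token_idx, weights * expert(tokens[token_idx]))
        return output.view(batch, seq_len, hidden), router_logits, selected


def load_balancing_loss(router_logits, selected, num_experts=7):
    """Switch-style auxiliary loss that penalizes uneven expert utilization."""
    probs = F.softmax(router_logits, dim=-1)              # router prob per expert
    # Fraction of routing slots dispatched to each expert (sums to 1).
    dispatch_frac = F.one_hot(selected, num_experts).float().mean(dim=(0, 1))
    return num_experts * torch.sum(probs.mean(dim=0) * dispatch_frac)


# Tiny smoke test with small dimensions
layer = SparseMoELayer(hidden_size=64, intermediate_size=128)
x = torch.randn(2, 10, 64)
y, logits, selected = layer(x)
print(y.shape, load_balancing_loss(logits, selected).item())
```

The sparsity pays off because only the five selected experts run for each token, so per-token compute scales with the number of active experts rather than the total parameter count, while the auxiliary loss keeps all seven experts in use during training.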
### Attention Mechanism

- **Grouped Query Attention**: Efficient key/value head reduction
- **Rotary Embeddings**: Position-aware attention with dynamic scaling
- **RMSNorm**: Stable layer normalization

### Training Features

- **Gradient Checkpointing**: Memory-efficient training
- **Flash Attention**: Optimized attention computation
- **Expert Parallelism**: Distributed expert training

## Performance

### Speed

- **Inference**: Optimized for fast generation
- **Training**: Efficient MoE routing and load balancing
- **Memory**: Sparse activation reduces the memory footprint

### Quality

- **Perplexity**: Competitive with state-of-the-art models
- **Long Context**: Effective handling of 4K+ token sequences
- **Multitask**: Strong performance across diverse tasks

## Limitations

- Requires significant computational resources for training
- Memory usage scales with the number of active experts
- Best performance on modern GPUs with ample VRAM

## Citation

```bibtex
@misc{rishailabs_2026,
  author    = {RishAILabs},
  title     = {RLLM-Base (Revision 552ee30)},
  year      = 2026,
  url       = {https://huggingface.co/RishAILabs/RLLM-Base},
  doi       = {10.57967/hf/7560},
  publisher = {Hugging Face}
}
```

## License

This model is released under the Apache 2.0 license.