# Rish AI
## Model Description
Rish AI is a cutting-edge Mixture of Experts (MoE) transformer model designed for efficient, scalable language understanding and generation. It features sparse routing across 7 experts (5 activated per token), rotary position embeddings with dynamic scaling, and optimized attention mechanisms.
## Key Features
- **Sparse Mixture of Experts**: 7 experts, with 5 activated per token for compute efficiency
- **Rotary Position Embeddings**: Dynamic RoPE scaling for better long-context handling
- **Grouped Query Attention**: Efficient attention with reduced key/value heads
- **RMSNorm**: Improved normalization for stable training
- **Load Balancing**: Automatic expert load balancing during training
## Usage
### Installation
```bash
pip install torch transformers
```
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model_name = "your-org/RishAI-1B-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
# Generate response
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model with specific configuration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/RishAI-1B-7B",
    torch_dtype=torch.bfloat16,  # halves memory vs. float32
    device_map="auto",           # automatic device placement
)
tokenizer = AutoTokenizer.from_pretrained("your-org/RishAI-1B-7B")
# Multi-turn conversation
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
    {"role": "user", "content": "Can you give a practical example?"},
]
# Format conversation
formatted_input = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(formatted_input, return_tensors="pt")
# Generate with controlled parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,  # cap on newly generated tokens, excluding the prompt
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Model Configuration
```python
from transformers import RishAIConfig
# Create custom configuration
config = RishAIConfig(
    vocab_size=100352,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_experts=7,            # total number of experts
    num_experts_per_tok=5,    # experts activated per token
    max_position_embeddings=4096,
    rope_scaling={"rope_type": "dynamic", "factor": 1.0},
)
# Initialize model with config
from transformers import RishAIModel
model = RishAIModel(config)
```
## Model Architecture
### Sparse Mixture of Experts (MoE)
- **Experts**: 7 specialized sub-networks
- **Routing**: Top-5 expert selection per token
- **Load Balancing**: Automatic expert utilization optimization
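The top-5-of-7 routing described above can be sketched in a few lines of PyTorch. This is a generic illustration of top-k expert selection, not Rish AI's actual router code; the `topk_route` function and its shapes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def topk_route(hidden, gate_weight, num_experts_per_tok=5):
    """Generic top-k MoE routing sketch (illustrative, not the model's router).

    hidden:      (num_tokens, hidden_size) token representations
    gate_weight: (num_experts, hidden_size) router projection
    """
    logits = hidden @ gate_weight.t()                          # (tokens, experts)
    weights, selected = torch.topk(logits, num_experts_per_tok, dim=-1)
    weights = F.softmax(weights, dim=-1)                       # renormalize over chosen experts
    return weights, selected                                   # each (tokens, num_experts_per_tok)

torch.manual_seed(0)
hidden = torch.randn(4, 16)   # 4 tokens, toy hidden size
gate = torch.randn(7, 16)     # 7 experts, as in this model
weights, selected = topk_route(hidden, gate)
```

Each token's output is then the weighted sum of its 5 selected experts' outputs; the remaining 2 experts are skipped entirely, which is where the sparse-activation compute savings come from.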
### Attention Mechanism
- **Grouped Query Attention**: Efficient key/value head reduction
- **Rotary Embeddings**: Position-aware attention with dynamic scaling
- **RMSNorm**: Stable layer normalization
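As an illustration of the normalization choice, RMSNorm can be written in a few lines. This is a generic sketch of the technique as used in Llama-style models, not necessarily this model's exact implementation.

```python
import torch

class RMSNorm(torch.nn.Module):
    """Root-mean-square normalization: rescale by 1/RMS(x), no mean subtraction."""
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        # Mean of squares over the hidden dimension, then inverse-sqrt rescale
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)

norm = RMSNorm(16)
out = norm(torch.randn(2, 3, 16))
```

Unlike LayerNorm, RMSNorm skips mean-centering and the bias term, saving one reduction per call while remaining stable in practice.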
### Training Features
- **Gradient Checkpointing**: Memory-efficient training
- **Flash Attention**: Optimized attention computation
- **Expert Parallelism**: Distributed expert training
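Gradient checkpointing trades compute for memory by recomputing activations during the backward pass instead of storing them. A minimal sketch with `torch.utils.checkpoint` on a toy layer (with a Hugging Face model, the equivalent switch is `model.gradient_checkpointing_enable()`):

```python
import torch
from torch.utils.checkpoint import checkpoint

# A toy stand-in for a transformer layer
layer = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.GELU(),
    torch.nn.Linear(64, 16),
)
x = torch.randn(4, 16, requires_grad=True)

# Intermediate activations inside `layer` are recomputed on backward
# rather than kept in memory during forward
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
```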
## Performance
### Speed
- **Inference**: Optimized for fast generation
- **Training**: Efficient MoE routing and load balancing
- **Memory**: Sparse activation reduces memory footprint
### Quality
- **Perplexity**: Competitive with state-of-the-art models
- **Long Context**: Effective handling of 4K+ token sequences
- **Multitask**: Strong performance across diverse tasks
## Limitations
- Requires significant computational resources for training
- Memory usage scales with number of active experts
- Best performance on modern GPUs with ample VRAM
## Citation
```bibtex
@misc{rishailabs_2026,
  author    = {RishAILabs},
  title     = {RLLM-Base (Revision 552ee30)},
  year      = {2026},
  url       = {https://huggingface.co/RishAILabs/RLLM-Base},
  doi       = {10.57967/hf/7560},
  publisher = {Hugging Face}
}
```
## License
This model is released under the Apache 2.0 license.