# Rish AI

## Model Description

Rish AI is a Mixture of Experts (MoE) transformer model designed for efficient, scalable language understanding and generation. It uses sparse routing that activates 5 of its 7 experts per token, rotary position embeddings with dynamic scaling, and grouped query attention.

## Key Features

- **Sparse Mixture of Experts**: 7 experts, with 5 activated per token, to balance quality and compute
- **Rotary Position Embeddings**: Dynamic RoPE scaling for better long-context handling
- **Grouped Query Attention**: Efficient attention with reduced key/value heads
- **RMSNorm**: Improved normalization for stable training
- **Load Balancing**: Automatic expert load balancing during training

## Usage

### Installation

```bash
pip install transformers
```

### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "your-org/RishAI-1B-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Advanced Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model with specific configuration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/RishAI-1B-7B",
    torch_dtype=torch.bfloat16,  # For memory efficiency
    device_map="auto"  # Automatic device placement
)

tokenizer = AutoTokenizer.from_pretrained("your-org/RishAI-1B-7B")

# Multi-turn conversation
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
    {"role": "user", "content": "Can you give a practical example?"}
]

# Format conversation
formatted_input = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_input, return_tensors="pt")

# Generate with controlled parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Model Configuration

```python
from transformers import RishAIConfig

# Create custom configuration
config = RishAIConfig(
    vocab_size=100352,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_experts=7,           # Number of experts
    num_experts_per_tok=5,   # Experts activated per token
    max_position_embeddings=4096,
    rope_scaling={"rope_type": "dynamic", "factor": 1.0}
)

# Initialize model with config
from transformers import RishAIModel
model = RishAIModel(config)
```

## Model Architecture

### Sparse Mixture of Experts (MoE)
- **Experts**: 7 specialized sub-networks
- **Routing**: Top-5 expert selection per token
- **Load Balancing**: Automatic expert utilization optimization
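
The top-5-of-7 routing above can be sketched as a softmax gate followed by a top-k selection and renormalization. This is an illustrative NumPy sketch with made-up shapes and names, not the model's actual router implementation:

```python
import numpy as np

NUM_EXPERTS = 7   # total experts (from the model card)
TOP_K = 5         # experts activated per token (from the model card)

def route(hidden, gate_weight):
    """Pick the TOP_K highest-scoring experts per token and renormalize their gate scores."""
    logits = hidden @ gate_weight                       # (tokens, NUM_EXPERTS)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)               # softmax over all experts
    top_idx = np.argsort(probs, -1)[:, -TOP_K:]         # indices of the top-5 experts
    top_p = np.take_along_axis(probs, top_idx, -1)
    top_p /= top_p.sum(-1, keepdims=True)               # renormalize over the chosen experts
    return top_idx, top_p

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 16))                   # 4 tokens, toy hidden size 16
gate_w = rng.standard_normal((16, NUM_EXPERTS))
idx, w = route(hidden, gate_w)
print(idx.shape, w.shape)                               # (4, 5) (4, 5)
```

Each token's output is then the gate-weighted sum of its 5 selected experts' FFN outputs.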

### Attention Mechanism
- **Grouped Query Attention**: Efficient key/value head reduction
- **Rotary Embeddings**: Position-aware attention with dynamic scaling
- **RMSNorm**: Stable layer normalization
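
RMSNorm rescales each hidden vector by its root mean square and a learned gain, skipping the mean-subtraction of LayerNorm. A minimal sketch (the epsilon value here is assumed, not taken from the model):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square (plus eps for stability), then scale by a learned gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.array([[1.0, 2.0, 3.0, 4.0]])
out = rms_norm(x, np.ones(4))
print(np.sqrt(np.mean(out * out)))  # ≈ 1.0: unit RMS after normalization
```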

### Training Features
- **Gradient Checkpointing**: Memory-efficient training
- **Flash Attention**: Optimized attention computation
- **Expert Parallelism**: Distributed expert training
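
Load balancing during MoE training is typically enforced with an auxiliary loss that penalizes uneven expert utilization. The sketch below uses the standard Switch-Transformer-style formula from the MoE literature; it is not necessarily the exact loss this model uses:

```python
import numpy as np

def load_balance_loss(router_probs, expert_idx, num_experts):
    """aux = N * sum_i(fraction_of_tokens_routed_to_i * mean_router_prob_for_i); minimum 1.0 when uniform."""
    frac = np.bincount(expert_idx.ravel(), minlength=num_experts) / expert_idx.size
    mean_p = router_probs.mean(axis=0)                  # (num_experts,)
    return num_experts * float(np.dot(frac, mean_p))

# Perfectly uniform routing over 4 experts attains the minimum value of 1.0.
uniform_p = np.full((8, 4), 0.25)
uniform_idx = np.arange(8) % 4
loss = load_balance_loss(uniform_p, uniform_idx, 4)
print(loss)  # 1.0
```

Adding a small multiple of this term to the language-modeling loss nudges the router toward spreading tokens evenly across experts.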

## Performance

### Speed
- **Inference**: Optimized for fast generation
- **Training**: Efficient MoE routing and load balancing
- **Memory**: Sparse activation reduces memory footprint
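
The memory claim follows from the routing arithmetic: with 5 of 7 experts active, only 5/7 of the expert (FFN) parameters are touched per token. A back-of-the-envelope check with an assumed illustrative expert size:

```python
# Sizes here are illustrative, not the model's real parameter counts.
num_experts, experts_per_tok = 7, 5
params_per_expert = 100_000_000          # assumed 100M parameters per expert
total = num_experts * params_per_expert
active = experts_per_tok * params_per_expert
print(f"active fraction of expert params: {active / total:.3f}")  # 0.714
```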

### Quality
- **Perplexity**: Competitive with state-of-the-art models
- **Long Context**: Effective handling of 4K+ token sequences
- **Multitask**: Strong performance across diverse tasks

## Limitations

- Requires significant computational resources for training
- Memory usage scales with number of active experts
- Best performance on modern GPUs with ample VRAM

## Citation

```bibtex
@misc{rishailabs_2026,
    author       = { RishAILabs },
    title        = { RLLM-Base (Revision 552ee30) },
    year         = 2026,
    url          = { https://huggingface.co/RishAILabs/RLLM-Base },
    doi          = { 10.57967/hf/7560 },
    publisher    = { Hugging Face }
}
```

## License

This model is released under the Apache 2.0 license.