# Rish AI

## Model Description

Rish AI is a Mixture of Experts (MoE) transformer model designed for efficient and scalable language understanding and generation. It features sparse routing over 7 experts (5 active per token), rotary position embeddings with dynamic scaling, and grouped-query attention.
## Key Features

- **Sparse Mixture of Experts**: 7 experts, with 5 activated per token for optimal efficiency
- **Rotary Position Embeddings**: Dynamic RoPE scaling for better long-context handling
- **Grouped Query Attention**: Efficient attention with reduced key/value heads
- **RMSNorm**: Improved normalization for stable training
- **Load Balancing**: Automatic expert load balancing during training
## Usage

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "your-org/RishAI-1B-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate response (max_new_tokens bounds the generated text, not prompt + text)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Advanced Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model with specific configuration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/RishAI-1B-7B",
    torch_dtype=torch.bfloat16,  # For memory efficiency
    device_map="auto"            # Automatic device placement
)
tokenizer = AutoTokenizer.from_pretrained("your-org/RishAI-1B-7B")

# Multi-turn conversation
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
    {"role": "user", "content": "Can you give a practical example?"}
]

# Format the conversation; add_generation_prompt appends the assistant prefix
# so the model continues as the assistant rather than extending the user turn
formatted_input = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_input, return_tensors="pt").to(model.device)

# Generate with controlled parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Model Configuration

```python
from transformers import RishAIConfig, RishAIModel

# Create custom configuration
config = RishAIConfig(
    vocab_size=100352,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_experts=7,           # Number of experts
    num_experts_per_tok=5,   # Experts activated per token
    max_position_embeddings=4096,
    rope_scaling={"rope_type": "dynamic", "factor": 1.0}
)

# Initialize model with config
model = RishAIModel(config)
```
## Model Architecture

### Sparse Mixture of Experts (MoE)

- **Experts**: 7 specialized sub-networks
- **Routing**: Top-5 expert selection per token
- **Load Balancing**: Automatic expert utilization optimization
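To make the routing step concrete, here is a minimal top-k gating sketch in plain Python. It is illustrative only: in the real model the per-expert scores come from a learned router layer, and the names here (`route_token`, `softmax`) are ours, not part of the model's API.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=5):
    """Select the top-k experts for one token and renormalize their gates.

    router_logits: one score per expert (7 here, matching the model card).
    Returns (expert_indices, gate_weights) with the weights summing to 1.
    """
    probs = softmax(router_logits)
    # Pick the k experts with the highest gate probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    gates = [probs[i] / total for i in top]  # renormalize over selected experts
    return top, gates

experts, gates = route_token([0.2, 1.5, -0.3, 0.8, 0.1, 2.0, -1.0], top_k=5)
print(experts)  # 5 expert indices, highest gate probability first
print(sum(gates))  # gates sum to 1 after renormalization
```

The token's output is then the gate-weighted sum of the selected experts' outputs; the auxiliary load-balancing loss mentioned above discourages the router from collapsing onto a few favorite experts.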
### Attention Mechanism

- **Grouped Query Attention**: Efficient key/value head reduction
- **Rotary Embeddings**: Position-aware attention with dynamic scaling
- **RMSNorm**: Stable layer normalization
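As a concrete illustration of the normalization step, here is a minimal RMSNorm sketch in plain Python. The production layer is a learned module operating on tensors; this only shows the math.

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm: divide by the root mean square; no mean subtraction.

    Unlike LayerNorm, RMSNorm skips the mean-centering step, which is
    cheaper and in practice just as stable for transformer training.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [v / rms for v in x]
    if weight is not None:  # learned per-dimension gain
        out = [w * v for w, v in zip(weight, out)]
    return out

y = rms_norm([1.0, 2.0, 3.0, 4.0])
# By construction the output has root mean square ~1.
print(math.sqrt(sum(v * v for v in y) / len(y)))
```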
### Training Features

- **Gradient Checkpointing**: Memory-efficient training
- **Flash Attention**: Optimized attention computation
- **Expert Parallelism**: Distributed expert training
## Performance

### Speed

- **Inference**: Optimized for fast generation
- **Training**: Efficient MoE routing and load balancing
- **Compute**: Sparse activation reduces per-token compute; only 5 of 7 experts run for each token, though all expert weights still reside in memory
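The effect of sparse activation can be seen with a back-of-the-envelope calculation. The expert FFN width (`intermediate_size`) and the three-matrix gated layout below are assumptions for illustration; the card does not specify them.

```python
# Illustrative only: intermediate_size and the gated-FFN layout (3 weight
# matrices per expert, as in SwiGLU) are assumptions, not from the card.
hidden_size = 4096          # from the Model Configuration section
intermediate_size = 11008   # assumed; not specified in the card
num_experts = 7
active_experts = 5

params_per_expert = 3 * hidden_size * intermediate_size  # gate, up, down projections
total_expert_params = num_experts * params_per_expert
active_expert_params = active_experts * params_per_expert

print(f"expert params per layer: {total_expert_params / 1e6:.0f}M total, "
      f"{active_expert_params / 1e6:.0f}M active "
      f"({active_experts / num_experts:.0%} of expert compute per token)")
```

Whatever the true expert width, the ratio is the point: each token pays for 5/7 of the expert compute while the model retains the capacity of all 7 experts.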
### Quality

- **Perplexity**: Competitive with state-of-the-art models
- **Long Context**: Effective handling of 4K+ token sequences
- **Multitask**: Strong performance across diverse tasks
## Limitations

- Requires significant computational resources for training
- Memory usage scales with the number of active experts
- Best performance on modern GPUs with ample VRAM
## Citation

```bibtex
@misc{rishailabs_2026,
  author    = {RishAILabs},
  title     = {RLLM-Base (Revision 552ee30)},
  year      = 2026,
  url       = {https://huggingface.co/RishAILabs/RLLM-Base},
  doi       = {10.57967/hf/7560},
  publisher = {Hugging Face}
}
```
## License

This model is released under the Apache 2.0 license.