# Rish AI

## Model Description

Rish AI is a Mixture of Experts (MoE) transformer model designed for efficient and scalable language understanding and generation. It features sparse routing over 7 experts (5 active per token), rotary position embeddings with dynamic scaling, and grouped-query attention.
## Key Features

- **Sparse Mixture of Experts**: 7 experts, with 5 activated per token for optimal efficiency
- **Rotary Position Embeddings**: Dynamic RoPE scaling for better long-context handling
- **Grouped Query Attention**: Efficient attention with reduced key/value heads
- **RMSNorm**: Improved normalization for stable training
- **Load Balancing**: Automatic expert load balancing during training
## Usage

### Installation

```bash
pip install transformers torch
```
### Basic Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model_name = "your-org/RishAI-1B-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Prepare input
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")

# Generate response (max_new_tokens bounds the generated text, not prompt + text)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Advanced Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model with specific configuration
model = AutoModelForCausalLM.from_pretrained(
    "your-org/RishAI-1B-7B",
    torch_dtype=torch.bfloat16,  # For memory efficiency
    device_map="auto"            # Automatic device placement
)
tokenizer = AutoTokenizer.from_pretrained("your-org/RishAI-1B-7B")

# Multi-turn conversation
conversation = [
    {"role": "user", "content": "What is machine learning?"},
    {"role": "assistant", "content": "Machine learning is a subset of AI..."},
    {"role": "user", "content": "Can you give a practical example?"}
]

# Format the conversation; add_generation_prompt appends the assistant prefix
# so the model continues as the assistant rather than extending the user turn
formatted_input = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_input, return_tensors="pt").to(model.device)

# Generate with controlled parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Model Configuration

```python
from transformers import RishAIConfig, RishAIModel

# Create custom configuration
config = RishAIConfig(
    vocab_size=100352,
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_experts=7,           # Number of experts
    num_experts_per_tok=5,   # Experts activated per token
    max_position_embeddings=4096,
    rope_scaling={"rope_type": "dynamic", "factor": 1.0}
)

# Initialize model with config
model = RishAIModel(config)
```
## Model Architecture

### Sparse Mixture of Experts (MoE)

- **Experts**: 7 specialized sub-networks
- **Routing**: Top-5 expert selection per token
- **Load Balancing**: Automatic expert utilization optimization
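To make the routing step concrete, here is a minimal top-k gating sketch in plain Python. It is illustrative only: in the real model the per-expert scores come from a learned router layer, and the names here (`route_token`, `softmax`) are ours, not part of the model's API.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_token(router_logits, top_k=5):
    """Select the top-k experts for one token and renormalize their gates.

    router_logits: one score per expert (7 here, matching the model card).
    Returns (expert_indices, gate_weights) with the weights summing to 1.
    """
    probs = softmax(router_logits)
    # Pick the k experts with the highest gate probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in top)
    gates = [probs[i] / total for i in top]  # renormalize over selected experts
    return top, gates

experts, gates = route_token([0.2, 1.5, -0.3, 0.8, 0.1, 2.0, -1.0], top_k=5)
print(experts)  # 5 expert indices, highest gate probability first
print(sum(gates))  # gates sum to 1 after renormalization
```

The token's output is then the gate-weighted sum of the selected experts' outputs; the auxiliary load-balancing loss mentioned above discourages the router from collapsing onto a few favorite experts.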
### Attention Mechanism

- **Grouped Query Attention**: Efficient key/value head reduction
- **Rotary Embeddings**: Position-aware attention with dynamic scaling
- **RMSNorm**: Stable layer normalization
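As a concrete illustration of the normalization step, here is a minimal RMSNorm sketch in plain Python. The production layer is a learned module operating on tensors; this only shows the math.

```python
import math

def rms_norm(x, weight=None, eps=1e-6):
    """RMSNorm: divide by the root mean square; no mean subtraction.

    Unlike LayerNorm, RMSNorm skips the mean-centering step, which is
    cheaper and in practice just as stable for transformer training.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = [v / rms for v in x]
    if weight is not None:  # learned per-dimension gain
        out = [w * v for w, v in zip(weight, out)]
    return out

y = rms_norm([1.0, 2.0, 3.0, 4.0])
# By construction the output has root mean square ~1.
print(math.sqrt(sum(v * v for v in y) / len(y)))
```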
### Training Features

- **Gradient Checkpointing**: Memory-efficient training
- **Flash Attention**: Optimized attention computation
- **Expert Parallelism**: Distributed expert training
## Performance

### Speed

- **Inference**: Optimized for fast generation
- **Training**: Efficient MoE routing and load balancing
- **Compute**: Sparse activation reduces per-token compute; only 5 of 7 experts run for each token, though all expert weights still reside in memory
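The effect of sparse activation can be seen with a back-of-the-envelope calculation. The expert FFN width (`intermediate_size`) and the three-matrix gated layout below are assumptions for illustration; the card does not specify them.

```python
# Illustrative only: intermediate_size and the gated-FFN layout (3 weight
# matrices per expert, as in SwiGLU) are assumptions, not from the card.
hidden_size = 4096          # from the Model Configuration section
intermediate_size = 11008   # assumed; not specified in the card
num_experts = 7
active_experts = 5

params_per_expert = 3 * hidden_size * intermediate_size  # gate, up, down projections
total_expert_params = num_experts * params_per_expert
active_expert_params = active_experts * params_per_expert

print(f"expert params per layer: {total_expert_params / 1e6:.0f}M total, "
      f"{active_expert_params / 1e6:.0f}M active "
      f"({active_experts / num_experts:.0%} of expert compute per token)")
```

Whatever the true expert width, the ratio is the point: each token pays for 5/7 of the expert compute while the model retains the capacity of all 7 experts.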
### Quality

- **Perplexity**: Competitive with state-of-the-art models
- **Long Context**: Effective handling of 4K+ token sequences
- **Multitask**: Strong performance across diverse tasks
## Limitations

- Requires significant computational resources for training
- Memory usage scales with the number of active experts
- Best performance on modern GPUs with ample VRAM
## Citation

```bibtex
@misc{rishailabs_2026,
  author    = {RishAILabs},
  title     = {RLLM-Base (Revision 552ee30)},
  year      = 2026,
  url       = {https://huggingface.co/RishAILabs/RLLM-Base},
  doi       = {10.57967/hf/7560},
  publisher = {Hugging Face}
}
```
## License

This model is released under the Apache 2.0 license.