Nepalaya-R
Nepalaya-R is a large language model project that ships full source code, configuration files, and deployment tooling for local use and for Hugging Face.
About This Model
This repository contains the Nepalaya-R model implementation with:
- ✅ Full source code and inference implementations
- ✅ Tokenizer configuration adapted for Nepalaya-R
- ✅ Easy-to-use inference scripts
- ✅ Documentation and setup guides
Quick Start
Installation
pip install -r requirements.txt
Download & Setup
Option 1: Download from Hugging Face
export HF_TOKEN=your_token
python download_model.py --model-id your-username/Nepalaya-R --local-dir ./model_weights
Option 2: Run Quick Inference
python quick_inference.py --prompt "Your prompt here"
Mirror Setup
To create your own Nepalaya-R repo mirror:
export HF_TOKEN=your_token
python mirror_to_hf.py \
--source source-org/source-model \
--dest your-username/Nepalaya-R
Documentation
- SETUP.md - Detailed setup and configuration guide
- GITHUB_DEPLOY.md - Deployment instructions
- inference/README.md - Inference code documentation
Model Architecture
Nepalaya-R architecture summary:
- Parameters: 671B
- Context Length: Extended via sparse attention
- Training: Sparse-attention-based training pipeline
- Architecture: Optimized transformer with mixture-of-experts
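The mixture-of-experts layer mentioned above routes each token to a small subset of expert networks rather than running every expert. The repository's actual router lives in inference/model.py; as an illustration only, here is a minimal top-k softmax router in NumPy (the expert count, k, and dimensions are arbitrary assumptions, not Nepalaya-R's):

```python
import numpy as np

def topk_moe(x, w_gate, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:       (tokens, d_model) token activations
    w_gate:  (d_model, n_experts) gating weights
    experts: list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ w_gate                        # (tokens, n_experts) gating scores
    top = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        gates = np.exp(sel - sel.max())
        gates /= gates.sum()                   # softmax over the selected experts only
        for g, e in zip(gates, top[t]):
            out[t] += g * experts[e](x[t])
    return out

# Toy demo: 4 scaling "experts", 3 tokens, d_model=8
rng = np.random.default_rng(0)
experts = [lambda v, s=s: s * v for s in (1.0, 2.0, 3.0, 4.0)]
x = rng.standard_normal((3, 8))
w_gate = rng.standard_normal((8, 4))
y = topk_moe(x, w_gate, experts, k=2)
print(y.shape)
```

Because only k of the experts run per token, compute per token stays roughly constant as total parameter count grows, which is how a 671B-parameter model can remain practical to serve.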
Key Features
- Multi-expert routing for efficient inference
- Sparse attention for long-context processing
- Chat template support
- Distributed inference capabilities
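Sparse attention for long contexts typically restricts each query to a subset of keys instead of the full quadratic pattern. The exact pattern Nepalaya-R uses is defined in the inference code; the causal sliding-window mask below is just one common scheme, shown for intuition:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Boolean mask: query i may attend to keys j with i - window < j <= i (causal)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
print(mask.sum())  # each row attends to at most `window` positions
```

With such a mask, attention cost grows linearly in sequence length (seq_len × window) rather than quadratically, which is what makes extended contexts affordable.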
System Requirements
- GPU Memory: 48GB+ VRAM recommended
- RAM: 64GB+ system memory
- Storage: ~300GB of fast SSD storage recommended for the full model weights
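As a rough sanity check on the storage figure above: on-disk weight size is parameter count times bytes per parameter, so ~300GB for 671B parameters implies weights stored at around 4 bits each (the precision here is an assumption; the repo's configs are authoritative):

```python
def weight_gigabytes(n_params, bits_per_param):
    """Approximate on-disk size of the weights in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

print(weight_gigabytes(671e9, 4))   # ~335 GB at 4-bit
print(weight_gigabytes(671e9, 16))  # ~1342 GB at bf16
```

Note this counts weights only; KV cache and activations at inference time add to the GPU memory requirement on top of whatever portion of the weights is resident.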
Usage Examples
Basic Generation
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"your-username/Nepalaya-R",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("your-username/Nepalaya-R")
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
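The generate() call above uses default decoding; in practice, sampling parameters such as temperature and top_p are often passed as well. For intuition, nucleus (top-p) sampling keeps the smallest set of tokens whose cumulative probability reaches p and renormalizes over them (a generic sketch, not Nepalaya-R-specific code):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Zero out tokens outside the nucleus, then renormalize."""
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cdf = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cdf, p) + 1]  # smallest prefix with mass >= p
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
filtered = top_p_filter(probs, p=0.8)
print(filtered)
```

In the transformers API the equivalent would be `model.generate(**inputs, do_sample=True, top_p=0.8, ...)`; the filtering itself happens inside the library.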
Chat Mode
messages = [
{"role": "user", "content": "What is machine learning?"}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
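apply_chat_template renders the message list into the model's prompt format using the template shipped with the tokenizer (see assets/). As an illustration only, with made-up role markers that will differ from Nepalaya-R's real template, a hand-rolled renderer might look like:

```python
def render_chat(messages):
    """Toy chat-template renderer; the actual markers are model-specific."""
    parts = []
    for m in messages:
        parts.append(f"<|{m['role']}|>\n{m['content']}")
    parts.append("<|assistant|>\n")  # open the assistant turn so the model replies
    return "\n".join(parts)

prompt = render_chat([{"role": "user", "content": "What is machine learning?"}])
print(prompt)
```

The trailing assistant marker is what `add_generation_prompt=True` adds in the real API: it cues the model to produce the assistant's reply rather than continue the user turn.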
Repository Structure
Nepalaya-R/
├── README.md # This file
├── SETUP.md # Setup guide
├── GITHUB_DEPLOY.md # Deployment guide
├── requirements.txt # Python dependencies
├── config.json # Model configuration
├── tokenizer.json # Tokenizer
├── quick_inference.py # Quick inference script
├── download_model.py # Model downloader
├── mirror_to_hf.py # HF mirroring tool
├── inference/ # Inference code
│ ├── generate.py # Generation script
│ ├── model.py # Model implementation
│ ├── convert.py # Weight converter
│ └── config_671B_nepalaya.json # Inference config
└── assets/ # Chat templates
Files Included
- Source Code: Full inference implementation
- Configuration: Model and generation configs
- Tokenizer: Complete tokenizer setup
- Documentation: Setup and usage guides
- Utilities: Download and mirror scripts
License
MIT License - See LICENSE file
Support
For setup and configuration, see SETUP.md. For deployment, see GITHUB_DEPLOY.md.
Nepalaya-R model card and repository maintained by the Nepalaya-R project.