# HebrewGPT-1B-Instruct

A 1.08-billion-parameter Hebrew instruction-tuned language model, fine-tuned from HebrewGPT-1B on 61K balanced Hebrew instruction examples.

## Model Details
| Property | Value |
|---|---|
| Parameters | 1.08B |
| Architecture | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| Base Model | HebrewGPT-1B (pretrained with Muon optimizer + SWA) |
| Context Length | 2,048 tokens |
| Tokenizer | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| License | Apache 2.0 |
| Language | Hebrew (he) |
## Architecture
HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:
- Width: 1024, Depth: 8 layers, Heads: 8 (head_dim=128)
- Interleaved blocks: Alternating RoPE multi-head attention and Mamba SSM layers
- MLP: SwiGLU activation
- Positional encoding: Rotary Position Embeddings (RoPE)
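To make the RoPE component concrete, here is a minimal NumPy sketch of how rotary position embeddings rotate channel pairs of a query/key head by position-dependent angles. This is an illustrative implementation of the general RoPE formulation, not code from the HebrewGPT repository; the shapes simply reuse the head_dim=128 figure from the table above.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim).

    Channel pairs (i, i + head_dim//2) are rotated by an angle that grows
    with token position and shrinks with channel index, so attention scores
    become a function of relative offsets between tokens.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)       # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(4, 128)        # 4 tokens, head_dim=128 as in the table above
q_rot = rope(q, np.arange(4.0))
```

Because each channel pair undergoes a pure rotation, per-position vector norms are preserved, and position 0 is left unchanged (angle zero).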
## Base Model: HebrewGPT-1B

Built on HebrewGPT-1B, a 1.08B parameter model trained from scratch on Hebrew text.

### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)
| Dataset | Share | Description |
|---|---|---|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |
### Pre-Training Details

- Tokens: 9.8B (3.9 epochs over 2.48B unique)
- Hardware: 8× H100 80GB (p5.48xlarge), 8 hours
- Optimizer: Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- Perplexity: 29.75 (SWA)
- Research: 200 autonomous experiments across 4 versions, 100% hit rate in v4
- Paper: Autonomous AI-Driven Hebrew Language Model Research
- Ablation: HebrewGPT-1B-AdamW (same architecture, AdamW optimizer)
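Perplexity and cross-entropy loss are two views of the same quantity (perplexity = exp(loss)), so the reported pre-training perplexity can be sanity-checked against an equivalent per-token loss in nats:

```python
import math

ppl = 29.75             # reported SWA pre-training perplexity
loss = math.log(ppl)    # equivalent per-token cross-entropy in nats
print(round(loss, 3))   # ~3.393 nats/token
```

Note this conversion assumes the 29.75 figure is a token-level perplexity under this tokenizer; perplexities are not directly comparable across tokenizers with different vocabularies (which is why the Muon comparison above is stated in bits per byte).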
## Training

### SFT Configuration
- Method: Full Supervised Fine-Tuning (SFT)
- Training steps: 3,000
- Best validation loss: 2.9598
- Hardware: Single NVIDIA A10G GPU (AWS g5.2xlarge)
- Training time: ~6.5 hours
- SFT fine-tuning tokens: ~20.3M
- Base model pre-training: 9.8B tokens (12 diverse Hebrew datasets including Wikipedia, Supreme Court, Ben Yehuda, C4, CC100)
### Instruction Dataset (61K examples)
The model was fine-tuned on a balanced mix of Hebrew instruction-following tasks:
| Category | Examples | Description |
|---|---|---|
| QA (HeQ) | 15,000 | Hebrew question answering |
| Sentiment | 10,000 | Hebrew sentiment analysis |
| NLI | 2,938 | Natural language inference |
| Summarization (HeSum) | 10,000 | Hebrew text summarization |
| Translation | 15,000 | Hebrew-English translation |
| Alpaca | 5,000 | General instruction following (translated) |
| Dolly | 2,000 | Open-domain instruction following |
| Chat | 1,000 | Conversational Hebrew |
| Winograd | 278 | Coreference resolution |
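The category counts above can be tallied to confirm the headline 61K figure; combined with the ~20.3M SFT tokens reported earlier, they also imply an average example length of roughly 330 tokens (assuming, as a rough check, that the token count covers each of the examples once):

```python
# Per-category example counts from the instruction dataset table
counts = {
    "QA (HeQ)": 15_000, "Sentiment": 10_000, "NLI": 2_938,
    "Summarization (HeSum)": 10_000, "Translation": 15_000,
    "Alpaca": 5_000, "Dolly": 2_000, "Chat": 1_000, "Winograd": 278,
}
total = sum(counts.values())
print(total)                       # 61,216 examples, i.e. the "61K" headline
print(round(20_300_000 / total))   # ~332 tokens per example on average
```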
## Usage

```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the fine-tuned model weights
state_dict = torch.load("model.pt", map_location="cpu")

# Initialize the model architecture (see HebrewGPT-1B for the model class
# definition), then load the weights:
# model.load_state_dict(state_dict)
```
### Prompt Format

The model was trained with a structured instruction format (section headers in Hebrew: instruction / input / response):

```
### הוראה:
{instruction}

### קלט:
{input}

### תשובה:
{response}
```
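A small helper (hypothetical, not from the model repository) can assemble this template, ending at the response header so the model completes the answer. It assumes that examples without an input field simply omit the input block, which is a common convention for this Alpaca-style format but is not confirmed by the card:

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Format an instruction (and optional input) in the SFT template.

    The returned string ends at the Hebrew response header so that the
    model's continuation is the answer itself.
    """
    prompt = f"### הוראה:\n{instruction}\n\n"
    if input_text:
        prompt += f"### קלט:\n{input_text}\n\n"
    return prompt + "### תשובה:\n"

# Illustrative example: "translate to English" / "hello world"
p = build_prompt("תרגם לאנגלית", "שלום עולם")
```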
## Evaluation

Evaluation on Hebrew benchmarks requires GPU inference, and the instruction-tuned model has not yet been scored. Base model (HebrewGPT-1B) results are shown for comparison:
| Task | Base Model | Instruct (SFT) |
|---|---|---|
| SNLI | 50% | Pending |
| Sentiment | 33% | Pending |
| QA | 20% | Pending |
| Trivia | 13% | Pending |
| Average | 29.2% | Pending |
SFT evaluation will be run on GPU and updated here. The instruction-tuned model is expected to show significant improvements on structured tasks (QA, sentiment, NLI) that were part of the SFT training mix.
## Infrastructure
- Research Orchestration: Amazon Bedrock (Claude) via OpenClaw
- Training Compute: AWS EC2 g5.2xlarge (NVIDIA A10G)
- Data Pipeline: Automated dataset collection, translation, and balancing
## Files

- `model.pt` – SFT fine-tuned model state dict (2.1 GB)
- `tokenizer.model` – SentencePiece BPE tokenizer (8,192 vocab)
## Citation

```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct}
}
```
## Limitations
- Small vocabulary (8,192 tokens) may limit performance on rare words
- 2,048 context window limits long-document tasks
- Trained primarily on structured instruction tasks; open-ended generation quality may vary
- Hebrew-specific model – limited multilingual capability beyond Hebrew-English translation
## License
Apache 2.0