---
language:
- he
license: apache-2.0
tags:
- hebrew
- instruction-tuning
- sft
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-1B-Instruct
  results: []
---

# HebrewGPT-1B-Instruct

A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) on 61K balanced Hebrew instruction examples.

## Model Details

| Property | Value |
|----------|-------|
| **Parameters** | 1.08B |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 8,192-token vocabulary, Hebrew morphology-aware with prefix splitting |
| **License** | Apache 2.0 |
| **Language** | Hebrew (he) |

## Architecture

HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:

- **Width:** 1024, **Depth:** 8 layers, **Heads:** 8 (head_dim=128)
- **Interleaved blocks:** alternating RoPE multi-head attention and Mamba SSM layers
- **MLP:** SwiGLU activation
- **Positional encoding:** Rotary Position Embeddings (RoPE)
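
The interleaving described above can be sketched as a simple layer schedule. This is an illustrative sketch only: the alternation order (attention first vs. Mamba first) is an assumption, and the block names are hypothetical, not the model's actual class names.

```python
# Hypothetical sketch of the interleaved layer schedule; names are
# illustrative, not the actual model code (see the base model repo).

WIDTH = 1024
DEPTH = 8
HEADS = 8

def layer_schedule(depth: int) -> list[str]:
    """Alternate RoPE attention and Mamba SSM blocks (order assumed)."""
    return ["rope_attention" if i % 2 == 0 else "mamba_ssm" for i in range(depth)]

schedule = layer_schedule(DEPTH)
head_dim = WIDTH // HEADS  # 128, matching the table above
print(schedule)
print("head_dim =", head_dim)
```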

## Base Model: HebrewGPT-1B

Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B-parameter model trained from scratch on Hebrew text.

### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)

| Dataset | Share | Description |
|---------|-------|-------------|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | Filtered CommonCrawl |
| Task-specific | 4% | QA, NLI, sentiment prompts |
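
The shares above sum to 100%. As a rough sanity check, they can be turned into approximate per-source token counts over the 9.8B-token budget; the counts below are estimates derived from the table, not reported figures.

```python
TOTAL_TOKENS = 9.8e9  # total pre-training tokens, from the table above

shares = {
    "Hebrew Wikipedia": 0.12,
    "Supreme Court Rulings": 0.22,
    "Ben Yehuda Project": 0.23,
    "C4 Hebrew": 0.20,
    "CC100 Hebrew": 0.19,
    "Task-specific": 0.04,
}

# The mix should cover the whole budget
assert abs(sum(shares.values()) - 1.0) < 1e-9

approx_tokens = {name: share * TOTAL_TOKENS for name, share in shares.items()}
for name, tokens in approx_tokens.items():
    print(f"{name}: ~{tokens / 1e9:.2f}B tokens")
```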

### Pre-Training Details

- **Tokens:** 9.8B (3.9 epochs over 2.48B unique tokens)
- **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- **Optimizer:** Muon + SWA (12.3% better bits-per-byte than AdamW at 1B scale)
- **Perplexity:** 29.75 (with SWA)
- **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
- **Paper:** [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- **Ablation:** [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) (same architecture, AdamW optimizer)

## Training

### SFT Configuration

- **Method:** full supervised fine-tuning (SFT)
- **Training steps:** 3,000
- **Best validation loss:** 2.9598
- **Hardware:** single NVIDIA A10G GPU (AWS g5.2xlarge)
- **Training time:** ~6.5 hours
- **SFT fine-tuning tokens:** ~20.3M
- **Base model pre-training:** 9.8B tokens across 12 diverse Hebrew datasets (including Wikipedia, Supreme Court rulings, Ben Yehuda, C4, CC100)
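
Two back-of-envelope numbers follow from the figures above: validation perplexity from the loss (assuming the loss is per-token cross-entropy in nats), and the average token throughput per optimizer step.

```python
import math

val_loss = 2.9598      # best validation loss, from the configuration above
steps = 3_000
sft_tokens = 20.3e6

perplexity = math.exp(val_loss)       # assumes nats-based cross-entropy
tokens_per_step = sft_tokens / steps  # average tokens per optimizer step

print(f"val perplexity ≈ {perplexity:.1f}")
print(f"tokens/step ≈ {tokens_per_step:,.0f}")
```

This puts validation perplexity around 19.3 and throughput around 6,800 tokens per step, under the stated assumptions.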

### Instruction Dataset (61K examples)

The model was fine-tuned on a balanced mix of Hebrew instruction-following tasks:

| Category | Examples | Description |
|----------|----------|-------------|
| QA (HeQ) | 15,000 | Hebrew question answering |
| Sentiment | 10,000 | Hebrew sentiment analysis |
| NLI | 2,938 | Natural language inference |
| Summarization (HeSum) | 10,000 | Hebrew text summarization |
| Translation | 15,000 | Hebrew-English translation |
| Alpaca | 5,000 | General instruction following (translated) |
| Dolly | 2,000 | Open-domain instruction following |
| Chat | 1,000 | Conversational Hebrew |
| Winograd | 278 | Coreference resolution |
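
The per-category counts in the table sum to 61,216 examples, consistent with the "61K" headline figure; the check below simply re-adds the table.

```python
# Example counts copied from the instruction-dataset table above
counts = {
    "QA (HeQ)": 15_000,
    "Sentiment": 10_000,
    "NLI": 2_938,
    "Summarization (HeSum)": 10_000,
    "Translation": 15_000,
    "Alpaca": 5_000,
    "Dolly": 2_000,
    "Chat": 1_000,
    "Winograd": 278,
}
total = sum(counts.values())
print(total)  # 61216
```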

## Usage

```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the fine-tuned weights
state_dict = torch.load("model.pt", map_location="cpu")

# Initialize the model architecture (see HebrewGPT-1B for the model class
# definition), then load the weights:
# model.load_state_dict(state_dict)
# model.eval()
```

### Prompt Format

The model was trained with a structured instruction format (the Hebrew section headers mean "Instruction", "Input", and "Response"):

```
### הוראה:
{instruction}

### קלט:
{input}

### תשובה:
{response}
```
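
A small helper can assemble prompts in this format. The function name is illustrative (the card ships no official formatting utility), and dropping the קלט block entirely when there is no input is an assumption about how input-less examples were formatted.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble the structured SFT prompt shown above, ending at the
    response marker so the model completes the answer.

    Hypothetical helper; the exact handling of empty inputs is assumed.
    """
    parts = [f"### הוראה:\n{instruction}"]
    if input_text:
        parts.append(f"### קלט:\n{input_text}")
    parts.append("### תשובה:\n")
    return "\n\n".join(parts)

# Example: "Translate to English" / "Hello world"
prompt = build_prompt("תרגם לאנגלית", "שלום עולם")
print(prompt)
```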

## Evaluation

Evaluation on Hebrew benchmarks requires GPU inference. Base model (HebrewGPT-1B) results are shown for comparison:

| Task | Base Model | Instruct (SFT) |
|------|-----------|----------------|
| SNLI | 50% | *Pending* |
| Sentiment | 33% | *Pending* |
| QA | 20% | *Pending* |
| Trivia | 13% | *Pending* |
| **Average** | **29.2%** | *Pending* |

SFT evaluation will be run on GPU and reported here. The instruction-tuned model is expected to improve markedly on the structured tasks (QA, sentiment, NLI) that were part of the SFT training mix.

## Infrastructure

- **Research orchestration:** Amazon Bedrock (Claude) via OpenClaw
- **Training compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
- **Data pipeline:** automated dataset collection, translation, and balancing

## Files

- `model.pt` — SFT fine-tuned model state dict (2.1 GB)
- `tokenizer.model` — SentencePiece BPE tokenizer (8,192-token vocabulary)

## Citation

```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct}
}
```

## Limitations

- Small vocabulary (8,192 tokens) may limit performance on rare words
- 2,048-token context window limits long-document tasks
- Trained primarily on structured instruction tasks; open-ended generation quality may vary
- Hebrew-specific model — limited multilingual capability beyond Hebrew-English translation

## License

Apache 2.0