---
language:
- he
license: apache-2.0
tags:
- hebrew
- instruction-tuning
- sft
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-1B-Instruct
  results: []
---

# HebrewGPT-1B-Instruct

A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) on 61K balanced Hebrew instruction examples.

## Model Details

| Property | Value |
|----------|-------|
| **Parameters** | 1.08B |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| **License** | Apache 2.0 |
| **Language** | Hebrew (he) |

## Architecture

HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:

- **Width:** 1024, **Depth:** 8 layers, **Heads:** 8 (head_dim=128)
- **Interleaved blocks:** Alternating RoPE multi-head attention and Mamba SSM layers
- **MLP:** SwiGLU activation
- **Positional encoding:** Rotary Position Embeddings (RoPE)

## Base Model: HebrewGPT-1B

Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B-parameter model trained from scratch on Hebrew text.
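The dimensions above can be captured in a small configuration sketch. This is illustrative only: the class and field names are not the actual model code, and which block type comes first in the interleaving is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HybridConfig:
    """Dimensions from the model card; names are illustrative only."""
    vocab_size: int = 8192    # SentencePiece BPE vocabulary
    d_model: int = 1024       # width
    n_layers: int = 8         # depth (interleaved attention / Mamba blocks)
    n_heads: int = 8
    head_dim: int = 128       # n_heads * head_dim == d_model
    max_seq_len: int = 2048   # context length

    def block_types(self) -> list:
        # Alternating RoPE attention and Mamba SSM layers
        # (which block type comes first is an assumption).
        return ["attention" if i % 2 == 0 else "mamba"
                for i in range(self.n_layers)]


cfg = HybridConfig()
assert cfg.n_heads * cfg.head_dim == cfg.d_model  # 8 * 128 == 1024
```

Note that the head configuration is internally consistent: 8 heads of dimension 128 exactly span the 1024-wide residual stream.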
### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)

| Dataset | Share | Description |
|---------|-------|-------------|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |

### Pre-Training Details

- **Tokens:** 9.8B (~3.9 epochs over 2.48B unique tokens)
- **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- **Optimizer:** Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- **Perplexity:** 29.75 (SWA)
- **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
- **Paper:** [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- **Ablation:** [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) (same architecture, AdamW optimizer)

## Training

### SFT Configuration

- **Method:** Full supervised fine-tuning (SFT)
- **Training steps:** 3,000
- **Best validation loss:** 2.9598
- **Hardware:** Single NVIDIA A10G GPU (AWS g5.2xlarge)
- **Training time:** ~6.5 hours
- **SFT fine-tuning tokens:** ~20.3M
- **Base model pre-training:** 9.8B tokens (12 diverse Hebrew datasets including Wikipedia, Supreme Court, Ben Yehuda, C4, CC100)

### Instruction Dataset (61K examples)

The model was fine-tuned on a balanced mix of Hebrew instruction-following tasks:

| Category | Examples | Description |
|----------|----------|-------------|
| QA (HeQ) | 15,000 | Hebrew question answering |
| Sentiment | 10,000 | Hebrew sentiment analysis |
| NLI | 2,938 | Natural language inference |
| Summarization (HeSum) | 10,000 | Hebrew text summarization |
| Translation | 15,000 | Hebrew-English translation |
| Alpaca | 5,000 | General instruction following (translated) |
| Dolly | 2,000 | Open-domain instruction following |
| Chat | 1,000 | Conversational Hebrew |
| Winograd | 278 | Coreference resolution |

## Usage

```python
import torch
import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the model weights
state_dict = torch.load("model.pt", map_location="cpu")

# Initialize the model architecture (see HebrewGPT-1B for the model class
# definition), then load the fine-tuned weights:
# model.load_state_dict(state_dict)
```

### Prompt Format

The model was trained with a structured instruction format:

```
### הוראה:
{instruction}

### קלט:
{input}

### תשובה:
{response}
```

## Evaluation

Evaluation on Hebrew benchmarks requires GPU inference. Base model (HebrewGPT-1B) results for comparison:

| Task | Base Model | Instruct (SFT) |
|------|-----------|----------------|
| SNLI | 50% | *Pending* |
| Sentiment | 33% | *Pending* |
| QA | 20% | *Pending* |
| Trivia | 13% | *Pending* |
| **Average** | **29.2%** | *Pending* |

SFT evaluation will be run on GPU and updated here. The instruction-tuned model is expected to show significant improvements on structured tasks (QA, sentiment, NLI) that were part of the SFT training mix.
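At inference time, the prompt format shown under "Prompt Format" can be assembled with a small helper. This is a sketch: `build_prompt` is a hypothetical name, and treating the קלט (input) section as optional for instruction-only examples is an assumption not stated in the card.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble an SFT-style prompt, leaving the response section
    (### תשובה:) empty for the model to complete."""
    parts = [f"### הוראה:\n{instruction}\n"]
    if input_text:  # assumed optional for instruction-only examples
        parts.append(f"### קלט:\n{input_text}\n")
    parts.append("### תשובה:\n")
    return "\n".join(parts)


# Example: a translation-style prompt ("Translate the sentence to
# English" / "Hello world"), matching the Translation SFT category.
prompt = build_prompt("תרגם את המשפט לאנגלית", "שלום עולם")
```

Generation would then continue from the trailing `### תשובה:` marker, mirroring how the response followed that header during fine-tuning.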
## Infrastructure

- **Research Orchestration:** Amazon Bedrock (Claude) via OpenClaw
- **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
- **Data Pipeline:** Automated dataset collection, translation, and balancing

## Files

- `model.pt` — SFT fine-tuned model state dict (2.1 GB)
- `tokenizer.model` — SentencePiece BPE tokenizer (8,192 vocab)

## Citation

```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct}
}
```

## Limitations

- Small vocabulary (8,192 tokens) may limit performance on rare words
- The 2,048-token context window limits long-document tasks
- Trained primarily on structured instruction tasks; open-ended generation quality may vary
- Hebrew-specific model with limited multilingual capability beyond Hebrew-English translation

## License

Apache 2.0