---
language:
- he
license: apache-2.0
tags:
- hebrew
- instruction-tuning
- sft
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-1B-Instruct
  results: []
---

# HebrewGPT-1B-Instruct

A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) on 61K balanced Hebrew instruction examples.

## Model Details

| Property | Value |
|----------|-------|
| **Parameters** | 1.08B |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| **License** | Apache 2.0 |
| **Language** | Hebrew (he) |

## Architecture

HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:

- **Width:** 1024, **Depth:** 8 layers, **Heads:** 8 (head_dim=128)
- **Interleaved blocks:** Alternating RoPE multi-head attention and Mamba SSM layers
- **MLP:** SwiGLU activation
- **Positional encoding:** Rotary Position Embeddings (RoPE)

## Base Model: HebrewGPT-1B

Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B-parameter model trained from scratch on Hebrew text.
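The dimensions above can be captured in a small configuration sketch. This is illustrative only: the class and field names are not the actual model code, and which block type comes first in the interleaving is an assumption.

```python
from dataclasses import dataclass


@dataclass
class HybridConfig:
    """Dimensions from the model card; names are illustrative only."""
    vocab_size: int = 8192    # SentencePiece BPE vocabulary
    d_model: int = 1024       # width
    n_layers: int = 8         # depth (interleaved attention / Mamba blocks)
    n_heads: int = 8
    head_dim: int = 128       # n_heads * head_dim == d_model
    max_seq_len: int = 2048   # context length

    def block_types(self) -> list:
        # Alternating RoPE attention and Mamba SSM layers
        # (which block type comes first is an assumption).
        return ["attention" if i % 2 == 0 else "mamba"
                for i in range(self.n_layers)]


cfg = HybridConfig()
assert cfg.n_heads * cfg.head_dim == cfg.d_model  # 8 * 128 == 1024
```

Note that the head configuration is internally consistent: 8 heads of dimension 128 exactly span the 1024-wide residual stream.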
### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)

| Dataset | Share | Description |
|---------|-------|-------------|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |

### Pre-Training Details

- **Tokens:** 9.8B (~3.9 epochs over 2.48B unique tokens)
- **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- **Optimizer:** Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- **Perplexity:** 29.75 (SWA)
- **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
- **Paper:** [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- **Ablation:** [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) (same architecture, AdamW optimizer)

## Training

### SFT Configuration

- **Method:** Full supervised fine-tuning (SFT)
- **Training steps:** 3,000
- **Best validation loss:** 2.9598
- **Hardware:** Single NVIDIA A10G GPU (AWS g5.2xlarge)
- **Training time:** ~6.5 hours
- **SFT fine-tuning tokens:** ~20.3M
- **Base model pre-training:** 9.8B tokens (12 diverse Hebrew datasets including Wikipedia, Supreme Court, Ben Yehuda, C4, CC100)

### Instruction Dataset (61K examples)

The model was fine-tuned on a balanced mix of Hebrew instruction-following tasks:

| Category | Examples | Description |
|----------|----------|-------------|
| QA (HeQ) | 15,000 | Hebrew question answering |
| Sentiment | 10,000 | Hebrew sentiment analysis |
| NLI | 2,938 | Natural language inference |
| Summarization (HeSum) | 10,000 | Hebrew text summarization |
| Translation | 15,000 | Hebrew-English translation |
| Alpaca | 5,000 | General instruction following (translated) |
| Dolly | 2,000 | Open-domain instruction following |
| Chat | 1,000 | Conversational Hebrew |
| Winograd | 278 | Coreference resolution |

## Usage

```python
import torch
import sentencepiece as spm

# Load the tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the model weights
state_dict = torch.load("model.pt", map_location="cpu")

# Initialize the model architecture (see HebrewGPT-1B for the model class
# definition), then load the fine-tuned weights:
# model.load_state_dict(state_dict)
```

### Prompt Format

The model was trained with a structured instruction format:

```
### הוראה:
{instruction}

### קלט:
{input}

### תשובה:
{response}
```

## Evaluation

Evaluation on Hebrew benchmarks requires GPU inference. Base model (HebrewGPT-1B) results for comparison:

| Task | Base Model | Instruct (SFT) |
|------|-----------|----------------|
| SNLI | 50% | *Pending* |
| Sentiment | 33% | *Pending* |
| QA | 20% | *Pending* |
| Trivia | 13% | *Pending* |
| **Average** | **29.2%** | *Pending* |

SFT evaluation will be run on GPU and updated here. The instruction-tuned model is expected to show significant improvements on structured tasks (QA, sentiment, NLI) that were part of the SFT training mix.
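At inference time, the prompt format shown under "Prompt Format" can be assembled with a small helper. This is a sketch: `build_prompt` is a hypothetical name, and treating the קלט (input) section as optional for instruction-only examples is an assumption not stated in the card.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble an SFT-style prompt, leaving the response section
    (### תשובה:) empty for the model to complete."""
    parts = [f"### הוראה:\n{instruction}\n"]
    if input_text:  # assumed optional for instruction-only examples
        parts.append(f"### קלט:\n{input_text}\n")
    parts.append("### תשובה:\n")
    return "\n".join(parts)


# Example: a translation-style prompt ("Translate the sentence to
# English" / "Hello world"), matching the Translation SFT category.
prompt = build_prompt("תרגם את המשפט לאנגלית", "שלום עולם")
```

Generation would then continue from the trailing `### תשובה:` marker, mirroring how the response followed that header during fine-tuning.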
## Infrastructure

- **Research Orchestration:** Amazon Bedrock (Claude) via OpenClaw
- **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
- **Data Pipeline:** Automated dataset collection, translation, and balancing

## Files

- `model.pt` — SFT fine-tuned model state dict (2.1 GB)
- `tokenizer.model` — SentencePiece BPE tokenizer (8,192 vocab)

## Citation

```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct}
}
```

## Limitations

- Small vocabulary (8,192 tokens) may limit performance on rare words
- The 2,048-token context window limits long-document tasks
- Trained primarily on structured instruction tasks; open-ended generation quality may vary
- Hebrew-specific model with limited multilingual capability beyond Hebrew-English translation

## License

Apache 2.0