File size: 6,710 Bytes

---
language:
- he
license: apache-2.0
tags:
- hebrew
- instruction-tuning
- sft
- lora
- curriculum-distillation
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-1B-Instruct
  results:
  - task:
      type: text-generation
      name: Language Modeling
    metrics:
    - name: Perplexity
      type: perplexity
      value: 15.78
    - name: Instruction Following
      type: accuracy
      value: 97.3
    - name: Repetition Rate
      type: custom
      value: 0.001
---

# HebrewGPT-1B-Instruct (LoRA Phase 2) 🇮🇱

A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) using **LoRA Phase 2 curriculum distillation** on 65K Hebrew instruction examples.

This is the latest and best instruct variant — achieving **PPL 15.78** (↓47% from base) with **97.3% instruction following** and **zero repetition**, trained for ~$12 on a single A10G GPU.

- 📄 **Paper**: [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
- 🏗️ **Base Model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B)

## Model Details

| Property | Value |
|----------|-------|
| **Parameters** | 1.08B (44.7M trainable via LoRA, 4%) |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
| **Fine-Tuning** | LoRA SFT (rank=64, alpha=128) |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| **License** | Apache 2.0 |
| **Language** | Hebrew (he) |

## Architecture

HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:

- **Width:** 1024, **Depth:** 8 layers, **Heads:** 8 (head_dim=128)
- **Interleaved blocks:** Alternating RoPE multi-head attention and Mamba SSM layers
- **MLP:** SwiGLU activation
- **Positional encoding:** Rotary Position Embeddings (RoPE)

## Training: LoRA Phase 2

### Method
- **LoRA SFT** with rank=64, alpha=128
- **Target modules:** qkv, proj, gate, up, down
- **Trainable parameters:** 44.7M / 1.08B (4%)

### Data
- **65K examples** combined from two-phase curriculum:
  - **Phase 1 (ELI5 simple):** 28.5K examples — simple explanations for foundational instruction following
  - **Phase 2 (Sonnet/Nemotron complex):** 36.5K examples — advanced, diverse instruction data

### Two-Phase Curriculum
The training uses a curriculum distillation approach: starting with simple ELI5-style examples to establish instruction-following behavior, then progressing to complex Sonnet/Nemotron-generated examples for advanced capabilities.

### Training Details
| Property | Value |
|----------|-------|
| **Hardware** | NVIDIA A10G (AWS g5.2xlarge) |
| **Training time** | ~8 hours |
| **Best validation loss** | 2.4768 (BPB 3.57) |
| **Early stopping** | Step ~1000 (patience 5) |
| **Total cost** | ~$12 |

## Evaluation Results

| Metric | Base Model | LoRA Phase 2 | Delta |
|--------|-----------|-------------|-------|
| Perplexity | 25.14 | **15.78** | **-37%** |
| Instruction Following | — | **97.3%** | — |
| MCQA | — | 10% | — |
| Repetition Rate | 0.006 | **0.001** | **-83%** |
| High-rep Outputs | — | **0%** | — |

## Key Improvements

- **Perplexity:** 29.75 → 15.78 (**-47%** from base pretrained model)
- **Zero repetition** — Phase 1 distillation had severe repetition loops; LoRA Phase 2 eliminates them entirely
- **Fluent Hebrew generation** across diverse topics
- **97.3% instruction following rate** — the model reliably follows the instruction format
- **Total post-training cost:** ~$12 on a single NVIDIA A10G GPU

## Usage

```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load model weights
state_dict = torch.load("model.pt", map_location="cpu")
# Initialize model architecture (see HebrewGPT-1B for model class definition)
# model.load_state_dict(state_dict)
```

### Prompt Format

The model was trained with a structured instruction format:

```
### הוראה:
{instruction}

### קלט:
{input}

### תשובה:
{response}
```

For inference, provide the instruction and input, then let the model generate after `### תשובה:`.

## Files

- `model.pt` — LoRA Phase 2 merged clean weights (2.1 GB)
- `tokenizer.model` — SentencePiece BPE tokenizer (8,192 vocab)

## Limitations

- **Factual accuracy limited** — expected for a 1B parameter model
- **HTML entity artifacts** from training data contamination (e.g., `&#8230;` appearing in outputs)
- **MCQA still weak (10%)** — needs MCQA-specific training data to improve
- **2,048 context window** limits long-document tasks
- **Small vocabulary (8,192 tokens)** may limit performance on rare words
- Hebrew-specific model — limited multilingual capability

## Base Model: HebrewGPT-1B

Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B parameter model trained from scratch on 9.8B tokens of Hebrew text.

### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)

| Dataset | Share | Description |
|---------|-------|-------------|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |

### Pre-Training Details

- **Tokens:** 9.8B (3.9 epochs over 2.48B unique)
- **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- **Optimizer:** Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- **Perplexity:** 29.75 (SWA)
- **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4

## Infrastructure

- **Research Orchestration:** Amazon Bedrock (Claude) via OpenClaw
- **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
- **Data Pipeline:** Automated dataset collection, translation, and balancing

## Citation

```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model via LoRA Curriculum Distillation},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct},
  note={Paper: https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```

## License

Apache 2.0