---
language:
- he
license: apache-2.0
tags:
- hebrew
- instruction-tuning
- sft
- lora
- curriculum-distillation
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-1B-Instruct
results:
- task:
type: text-generation
name: Language Modeling
metrics:
- name: Perplexity
type: perplexity
value: 15.78
- name: Instruction Following
type: accuracy
value: 97.3
- name: Repetition Rate
type: custom
value: 0.001
---
# HebrewGPT-1B-Instruct (LoRA Phase 2) 🇮🇱
A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) using **LoRA Phase 2 curriculum distillation** on 65K Hebrew instruction examples.
This is the latest and best-performing instruct variant, achieving **PPL 15.78** (-47% from the base pretrained model) with **97.3% instruction following** and effectively zero repetition, trained for ~$12 on a single A10G GPU.
- 📄 **Paper**: [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
- 🏗️ **Base Model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B)
## Model Details
| Property | Value |
|----------|-------|
| **Parameters** | 1.08B (44.7M trainable via LoRA, 4%) |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
| **Fine-Tuning** | LoRA SFT (rank=64, alpha=128) |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| **License** | Apache 2.0 |
| **Language** | Hebrew (he) |
## Architecture
HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:
- **Width:** 1024, **Depth:** 8 layers, **Heads:** 8 (head_dim=128)
- **Interleaved blocks:** Alternating RoPE multi-head attention and Mamba SSM layers
- **MLP:** SwiGLU activation
- **Positional encoding:** Rotary Position Embeddings (RoPE)
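The shape arithmetic above can be made explicit in a small configuration sketch (field names are hypothetical, not taken from the released code): 8 heads at head_dim 128 give the model width of 1024.

```python
from dataclasses import dataclass

@dataclass
class HebrewGPTConfig:
    # Values from this model card; field names are illustrative only.
    d_model: int = 1024      # hidden width
    n_layers: int = 8        # interleaved attention / Mamba blocks
    n_heads: int = 8
    head_dim: int = 128      # d_model == n_heads * head_dim
    vocab_size: int = 8192   # SentencePiece BPE vocab
    max_seq_len: int = 2048  # context window

cfg = HebrewGPTConfig()
assert cfg.n_heads * cfg.head_dim == cfg.d_model
```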
## Training: LoRA Phase 2
### Method
- **LoRA SFT** with rank=64, alpha=128
- **Target modules:** qkv, proj, gate, up, down
- **Trainable parameters:** 44.7M / 1.08B (4%)
### Data
- **65K examples** combined from two-phase curriculum:
  - **Phase 1 (ELI5 simple):** 28.5K examples of simple explanations for foundational instruction following
  - **Phase 2 (Sonnet/Nemotron complex):** 36.5K examples of advanced, diverse instruction data
### Two-Phase Curriculum
The training uses a curriculum distillation approach: starting with simple ELI5-style examples to establish instruction-following behavior, then progressing to complex Sonnet/Nemotron-generated examples for advanced capabilities.
### Training Details
| Property | Value |
|----------|-------|
| **Hardware** | NVIDIA A10G (AWS g5.2xlarge) |
| **Training time** | ~8 hours |
| **Best validation loss** | 2.4768 (BPB 3.57) |
| **Early stopping** | Step ~1000 (patience 5) |
| **Total cost** | ~$12 |
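The loss and BPB figures above are mutually consistent under the standard nats-to-bits conversion (bits = nats / ln 2), assuming the card's "BPB" is computed directly from the per-token validation loss:

```python
import math

val_loss = 2.4768                      # best validation loss (nats per token)
bits_per_token = val_loss / math.log(2)
print(round(bits_per_token, 2))        # 3.57, matching the reported BPB

# Perplexity implied directly by this loss; the reported eval perplexity
# of 15.78 presumably comes from a different evaluation set.
print(round(math.exp(val_loss), 1))    # 11.9
```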
## Evaluation Results
| Metric | Base Model | LoRA Phase 2 | Delta |
|--------|-----------|-------------|-------|
| Perplexity | 25.14 | **15.78** | **-37%** |
| Instruction Following | – | **97.3%** | – |
| MCQA | – | 10% | – |
| Repetition Rate | 0.006 | **0.001** | **-83%** |
| High-rep Outputs | – | **0%** | – |
## Key Improvements
- **Perplexity:** 29.75 → 15.78 (**-47%** from the base pretrained model)
- **Zero repetition:** Phase 1 distillation suffered severe repetition loops; LoRA Phase 2 eliminates them entirely
- **Fluent Hebrew generation** across diverse topics
- **97.3% instruction-following rate:** the model reliably follows the instruction format
- **Total post-training cost:** ~$12 on a single NVIDIA A10G GPU
## Usage
```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer (8,192-token Hebrew BPE vocab)
sp = spm.SentencePieceProcessor()
sp.Load("tokenizer.model")

# Load the merged LoRA Phase 2 weights
state_dict = torch.load("model.pt", map_location="cpu")

# Instantiate the model architecture first (see the HebrewGPT-1B base
# repository for the model class definition), then load the weights:
# model.load_state_dict(state_dict)
```
### Prompt Format
The model was trained with a structured instruction format:
```
### הוראה:
{instruction}
### קלט:
{input}
### תשובה:
{response}
```
For inference, provide the instruction and input, then let the model generate after `### תשובה:`.
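A small helper assembling this format (hypothetical, not from the repo; the section headers are Hebrew for "instruction", "input", and "response"). Whether an empty input section is included or omitted is an assumption here:

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble the three-section Hebrew instruction prompt."""
    prompt = f"### הוראה:\n{instruction}\n\n"
    if input_text:
        prompt += f"### קלט:\n{input_text}\n\n"
    # Generation continues after the response header:
    return prompt + "### תשובה:\n"

print(build_prompt("תרגם לאנגלית", "שלום עולם"))
```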
## Files
- `model.pt` – merged, clean LoRA Phase 2 weights (2.1 GB)
- `tokenizer.model` – SentencePiece BPE tokenizer (8,192 vocab)
## Limitations
- **Limited factual accuracy:** expected for a 1B-parameter model
- **HTML entity artifacts** from training-data contamination (e.g., `…` appearing in outputs)
- **MCQA still weak (10%):** needs MCQA-specific training data to improve
- **2,048-token context window** limits long-document tasks
- **Small vocabulary (8,192 tokens)** may limit performance on rare words
- **Hebrew-specific model** with limited multilingual capability
## Base Model: HebrewGPT-1B
Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B parameter model trained from scratch on 9.8B tokens of Hebrew text.
### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)
| Dataset | Share | Description |
|---------|-------|-------------|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |
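The mixture above could be reproduced with simple share-weighted sampling; a minimal sketch (the actual pipeline's balancing logic is not published, and the dataset keys are illustrative):

```python
import random

# Percentage shares from the pre-training data table (sum to 100).
shares = {
    "wikipedia": 12, "supreme_court": 22, "ben_yehuda": 23,
    "c4_he": 20, "cc100_he": 19, "task_specific": 4,
}
assert sum(shares.values()) == 100

# Draw training documents with probability proportional to each share.
random.seed(0)
picks = random.choices(list(shares), weights=shares.values(), k=10)
print(picks)
```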
### Pre-Training Details
- **Tokens:** 9.8B (3.9 epochs over 2.48B unique)
- **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- **Optimizer:** Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- **Perplexity:** 29.75 (SWA)
- **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
## Infrastructure
- **Research Orchestration:** Amazon Bedrock (Claude) via OpenClaw
- **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
- **Data Pipeline:** Automated dataset collection, translation, and balancing
## Citation
```bibtex
@misc{hebrewgpt1b-instruct-2026,
title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model via LoRA Curriculum Distillation},
author={Slasky, Ronnen},
year={2026},
url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct},
note={Paper: https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
}
```
## License
Apache 2.0