---
language:
- he
license: apache-2.0
tags:
- hebrew
- instruction-tuning
- sft
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-1B-Instruct
results: []
---
# HebrewGPT-1B-Instruct
A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) on 61K balanced Hebrew instruction examples.
## Model Details
| Property | Value |
|----------|-------|
| **Parameters** | 1.08B |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| **License** | Apache 2.0 |
| **Language** | Hebrew (he) |
## Architecture
HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:
- **Width:** 1024, **Depth:** 8 layers, **Heads:** 8 (head_dim=128)
- **Interleaved blocks:** Alternating RoPE multi-head attention and Mamba SSM layers
- **MLP:** SwiGLU activation
- **Positional encoding:** Rotary Position Embeddings (RoPE)
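A minimal sketch of the layer layout described above. The card states that RoPE attention and Mamba SSM blocks are interleaved across 8 layers; the exact ordering (strict alternation, attention first) is an assumption here, not confirmed by the card.

```python
# Hybrid stack dimensions from the Architecture section above.
DEPTH = 8       # number of layers
WIDTH = 1024    # model dimension
N_HEADS = 8     # attention heads
HEAD_DIM = 128  # per-head dimension


def layer_schedule(depth: int) -> list[str]:
    """Alternate attention and Mamba blocks across the stack.

    Assumes strict alternation starting with attention; the card only
    says the two block types are interleaved.
    """
    return ["attention" if i % 2 == 0 else "mamba" for i in range(depth)]


schedule = layer_schedule(DEPTH)
assert N_HEADS * HEAD_DIM == WIDTH  # the 8 heads of dim 128 tile the 1024 width
```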
## Base Model: HebrewGPT-1B
Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B parameter model trained from scratch on Hebrew text.
### Pre-Training Data (12 Hebrew datasets, 9.8B tokens)
The table below groups the 12 source datasets by category; the *Task-specific* row aggregates several smaller QA, NLI, and sentiment datasets.
| Dataset | Share | Description |
|---------|-------|-------------|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |
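The shares in the table define a sampling mixture over the pre-training corpus. A minimal illustrative sketch (the dataset keys and proportions come from the table; the sampler itself is not the actual training pipeline):

```python
import random

# Mixture shares from the table above, as fractions of the 9.8B
# pre-training tokens.
MIX = {
    "hebrew_wikipedia": 0.12,
    "supreme_court": 0.22,
    "ben_yehuda": 0.23,
    "c4_hebrew": 0.20,
    "cc100_hebrew": 0.19,
    "task_specific": 0.04,
}


def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mix."""
    names, weights = zip(*MIX.items())
    return rng.choices(names, weights=weights, k=1)[0]


# The six shares account for the full corpus.
assert abs(sum(MIX.values()) - 1.0) < 1e-9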
### Pre-Training Details
- **Tokens:** 9.8B (3.9 epochs over 2.48B unique)
- **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- **Optimizer:** Muon + stochastic weight averaging (SWA); 12.3% better bits-per-byte (BPB) than AdamW at 1B scale
- **Perplexity:** 29.75 (SWA)
- **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
- **Paper:** [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- **Ablation:** [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) (same architecture, AdamW optimizer)
## Training
### SFT Configuration
- **Method:** Full Supervised Fine-Tuning (SFT)
- **Training steps:** 3,000
- **Best validation loss:** 2.9598
- **Hardware:** Single NVIDIA A10G GPU (AWS g5.2xlarge)
- **Training time:** ~6.5 hours
- **SFT fine-tuning tokens:** ~20.3M
- **Base model pre-training:** 9.8B tokens (12 diverse Hebrew datasets including Wikipedia, Supreme Court, Ben Yehuda, C4, CC100)
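A back-of-envelope check on the SFT numbers above: ~20.3M tokens over 3,000 steps implies roughly 6.8K tokens per optimizer step, i.e. about three full 2,048-token sequences per step. The actual batch shape is not stated in this card; this is only a consistency check.

```python
# Figures from the SFT Configuration list above.
SFT_TOKENS = 20.3e6  # total fine-tuning tokens
STEPS = 3_000        # training steps
CONTEXT = 2_048      # context length in tokens

tokens_per_step = SFT_TOKENS / STEPS
seqs_per_step = tokens_per_step / CONTEXT
print(f"{tokens_per_step:.0f} tokens/step "
      f"≈ {seqs_per_step:.1f} full-context sequences per step")
```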
### Instruction Dataset (61K examples)
The model was fine-tuned on a balanced mix of Hebrew instruction-following tasks:
| Category | Examples | Description |
|----------|----------|-------------|
| QA (HeQ) | 15,000 | Hebrew question answering |
| Sentiment | 10,000 | Hebrew sentiment analysis |
| NLI | 2,938 | Natural language inference |
| Summarization (HeSum) | 10,000 | Hebrew text summarization |
| Translation | 15,000 | Hebrew-English translation |
| Alpaca | 5,000 | General instruction following (translated) |
| Dolly | 2,000 | Open-domain instruction following |
| Chat | 1,000 | Conversational Hebrew |
| Winograd | 278 | Coreference resolution |
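Summing the category counts in the table reproduces the headline figure: 61,216 examples, i.e. the "61K" quoted above. A small sketch that also derives each category's share of the mix:

```python
# Per-category example counts from the instruction-dataset table above.
SFT_MIX = {
    "qa_heq": 15_000,
    "sentiment": 10_000,
    "nli": 2_938,
    "summarization_hesum": 10_000,
    "translation": 15_000,
    "alpaca": 5_000,
    "dolly": 2_000,
    "chat": 1_000,
    "winograd": 278,
}

total = sum(SFT_MIX.values())
shares = {name: count / total for name, count in SFT_MIX.items()}

print(total)  # 61216 examples in total
```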
## Usage
```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer (8,192-token vocab)
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Load the fine-tuned weights on CPU
state_dict = torch.load("model.pt", map_location="cpu")

# Instantiate the model architecture first (see HebrewGPT-1B for the model
# class definition), then load the weights:
# model.load_state_dict(state_dict)
# model.eval()
```
### Prompt Format
The model was trained with a structured instruction format:
```
### הוראה:
{instruction}
### קלט:
{input}
### תשובה:
{response}
```
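A small helper that renders the format above. The Hebrew section headers mean "Instruction", "Input", and "Answer"; the prompt should end right after the answer header so the model completes the response. Whether the input section is dropped when there is no input is an assumption, not stated in this card.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Render the SFT instruction format described above.

    Omitting the input section when input_text is empty is an assumption.
    """
    parts = [f"### הוראה:\n{instruction}"]
    if input_text:
        parts.append(f"### קלט:\n{input_text}")
    parts.append("### תשובה:\n")
    return "\n".join(parts)


# Example: a translation request ("translate to English" / "hello world").
prompt = build_prompt("תרגם לאנגלית", "שלום עולם")
```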
## Evaluation
Evaluation on Hebrew benchmarks requires GPU inference. Base model (HebrewGPT-1B) results for comparison:
| Task | Base Model | Instruct (SFT) |
|------|-----------|----------------|
| SNLI | 50% | *Pending* |
| Sentiment | 33% | *Pending* |
| QA | 20% | *Pending* |
| Trivia | 13% | *Pending* |
| **Average** | **29.2%** | *Pending* |
SFT evaluation will be run on GPU and updated here. The instruction-tuned model is expected to show significant improvements on structured tasks (QA, sentiment, NLI) that were part of the SFT training mix.
## Infrastructure
- **Research Orchestration:** Amazon Bedrock (Claude) via OpenClaw
- **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
- **Data Pipeline:** Automated dataset collection, translation, and balancing
## Files
- `model.pt` — SFT fine-tuned model state dict (2.1 GB)
- `tokenizer.model` — SentencePiece BPE tokenizer (8,192 vocab)
## Citation
```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct}
}
```
## Limitations
- Small vocabulary (8,192 tokens) may limit performance on rare words
- 2,048 context window limits long-document tasks
- Trained primarily on structured instruction tasks; open-ended generation quality may vary
- Hebrew-specific model — limited multilingual capability beyond Hebrew-English translation
## License
Apache 2.0