---
language:
- he
license: apache-2.0
tags:
- hebrew
- instruction-tuning
- sft
- language-model
- text-generation
- mamba
- transformer
pipeline_tag: text-generation
model-index:
- name: HebrewGPT-1B-Instruct
results: []
---
# HebrewGPT-1B-Instruct
A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) on 61K balanced Hebrew instruction examples.
## Model Details
| Property | Value |
|----------|-------|
| **Parameters** | 1.08B |
| **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
| **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
| **Context Length** | 2,048 tokens |
| **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
| **License** | Apache 2.0 |
| **Language** | Hebrew (he) |
## Architecture
HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:
- **Width:** 1024, **Depth:** 8 layers, **Heads:** 8 (head_dim=128)
- **Interleaved blocks:** Alternating RoPE multi-head attention and Mamba SSM layers
- **MLP:** SwiGLU activation
- **Positional encoding:** Rotary Position Embeddings (RoPE)
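A minimal sketch of the layer layout described above. The card states that RoPE attention and Mamba SSM blocks are interleaved across 8 layers; the exact ordering (strict alternation, attention first) is an assumption here, not confirmed by the card.

```python
# Hybrid stack dimensions from the Architecture section above.
DEPTH = 8       # number of layers
WIDTH = 1024    # model dimension
N_HEADS = 8     # attention heads
HEAD_DIM = 128  # per-head dimension


def layer_schedule(depth: int) -> list[str]:
    """Alternate attention and Mamba blocks across the stack.

    Assumes strict alternation starting with attention; the card only
    says the two block types are interleaved.
    """
    return ["attention" if i % 2 == 0 else "mamba" for i in range(depth)]


schedule = layer_schedule(DEPTH)
assert N_HEADS * HEAD_DIM == WIDTH  # the 8 heads of dim 128 tile the 1024 width
```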
## Base Model: HebrewGPT-1B
Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B parameter model trained from scratch on Hebrew text.
### Pre-Training Data (12 Hebrew datasets, 9.8B tokens)
The table below groups the 12 source datasets by category; the *Task-specific* row aggregates several smaller QA, NLI, and sentiment datasets.
| Dataset | Share | Description |
|---------|-------|-------------|
| Hebrew Wikipedia | 12% | Encyclopedia articles |
| Supreme Court Rulings | 22% | Israeli legal corpus |
| Ben Yehuda Project | 23% | Classic Hebrew literature |
| C4 Hebrew | 20% | Web-crawled text (cleaned) |
| CC100 Hebrew | 19% | CommonCrawl filtered |
| Task-specific | 4% | QA, NLI, sentiment prompts |
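The shares in the table define a sampling mixture over the pre-training corpus. A minimal illustrative sketch (the dataset keys and proportions come from the table; the sampler itself is not the actual training pipeline):

```python
import random

# Mixture shares from the table above, as fractions of the 9.8B
# pre-training tokens.
MIX = {
    "hebrew_wikipedia": 0.12,
    "supreme_court": 0.22,
    "ben_yehuda": 0.23,
    "c4_hebrew": 0.20,
    "cc100_hebrew": 0.19,
    "task_specific": 0.04,
}


def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mix."""
    names, weights = zip(*MIX.items())
    return rng.choices(names, weights=weights, k=1)[0]


# The six shares account for the full corpus.
assert abs(sum(MIX.values()) - 1.0) < 1e-9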
### Pre-Training Details
- **Tokens:** 9.8B (3.9 epochs over 2.48B unique)
- **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- **Optimizer:** Muon + stochastic weight averaging (SWA); 12.3% better bits-per-byte (BPB) than AdamW at 1B scale
- **Perplexity:** 29.75 (SWA)
- **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
- **Paper:** [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- **Ablation:** [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) (same architecture, AdamW optimizer)
## Training
### SFT Configuration
- **Method:** Full Supervised Fine-Tuning (SFT)
- **Training steps:** 3,000
- **Best validation loss:** 2.9598
- **Hardware:** Single NVIDIA A10G GPU (AWS g5.2xlarge)
- **Training time:** ~6.5 hours
- **SFT fine-tuning tokens:** ~20.3M
- **Base model pre-training:** 9.8B tokens (12 diverse Hebrew datasets including Wikipedia, Supreme Court, Ben Yehuda, C4, CC100)
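A back-of-envelope check on the SFT numbers above: ~20.3M tokens over 3,000 steps implies roughly 6.8K tokens per optimizer step, i.e. about three full 2,048-token sequences per step. The actual batch shape is not stated in this card; this is only a consistency check.

```python
# Figures from the SFT Configuration list above.
SFT_TOKENS = 20.3e6  # total fine-tuning tokens
STEPS = 3_000        # training steps
CONTEXT = 2_048      # context length in tokens

tokens_per_step = SFT_TOKENS / STEPS
seqs_per_step = tokens_per_step / CONTEXT
print(f"{tokens_per_step:.0f} tokens/step "
      f"≈ {seqs_per_step:.1f} full-context sequences per step")
```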
### Instruction Dataset (61K examples)
The model was fine-tuned on a balanced mix of Hebrew instruction-following tasks:
| Category | Examples | Description |
|----------|----------|-------------|
| QA (HeQ) | 15,000 | Hebrew question answering |
| Sentiment | 10,000 | Hebrew sentiment analysis |
| NLI | 2,938 | Natural language inference |
| Summarization (HeSum) | 10,000 | Hebrew text summarization |
| Translation | 15,000 | Hebrew-English translation |
| Alpaca | 5,000 | General instruction following (translated) |
| Dolly | 2,000 | Open-domain instruction following |
| Chat | 1,000 | Conversational Hebrew |
| Winograd | 278 | Coreference resolution |
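Summing the category counts in the table reproduces the headline figure: 61,216 examples, i.e. the "61K" quoted above. A small sketch that also derives each category's share of the mix:

```python
# Per-category example counts from the instruction-dataset table above.
SFT_MIX = {
    "qa_heq": 15_000,
    "sentiment": 10_000,
    "nli": 2_938,
    "summarization_hesum": 10_000,
    "translation": 15_000,
    "alpaca": 5_000,
    "dolly": 2_000,
    "chat": 1_000,
    "winograd": 278,
}

total = sum(SFT_MIX.values())
shares = {name: count / total for name, count in SFT_MIX.items()}

print(total)  # 61216 examples in total
```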
## Usage
```python
import torch
import sentencepiece as spm

# Load the SentencePiece tokenizer (8,192-token vocab)
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")

# Load the fine-tuned weights on CPU
state_dict = torch.load("model.pt", map_location="cpu")

# Instantiate the model architecture first (see HebrewGPT-1B for the model
# class definition), then load the weights:
# model.load_state_dict(state_dict)
# model.eval()
```
### Prompt Format
The model was trained with a structured instruction format:
```
### הוראה:
{instruction}
### קלט:
{input}
### תשובה:
{response}
```
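A small helper that renders the format above. The Hebrew section headers mean "Instruction", "Input", and "Answer"; the prompt should end right after the answer header so the model completes the response. Whether the input section is dropped when there is no input is an assumption, not stated in this card.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Render the SFT instruction format described above.

    Omitting the input section when input_text is empty is an assumption.
    """
    parts = [f"### הוראה:\n{instruction}"]
    if input_text:
        parts.append(f"### קלט:\n{input_text}")
    parts.append("### תשובה:\n")
    return "\n".join(parts)


# Example: a translation request ("translate to English" / "hello world").
prompt = build_prompt("תרגם לאנגלית", "שלום עולם")
```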
## Evaluation
Evaluation on Hebrew benchmarks requires GPU inference. Base model (HebrewGPT-1B) results for comparison:
| Task | Base Model | Instruct (SFT) |
|------|-----------|----------------|
| SNLI | 50% | *Pending* |
| Sentiment | 33% | *Pending* |
| QA | 20% | *Pending* |
| Trivia | 13% | *Pending* |
| **Average** | **29.2%** | *Pending* |
SFT evaluation will be run on GPU and updated here. The instruction-tuned model is expected to show significant improvements on structured tasks (QA, sentiment, NLI) that were part of the SFT training mix.
## Infrastructure
- **Research Orchestration:** Amazon Bedrock (Claude) via OpenClaw
- **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
- **Data Pipeline:** Automated dataset collection, translation, and balancing
## Files
- `model.pt` — SFT fine-tuned model state dict (2.1 GB)
- `tokenizer.model` — SentencePiece BPE tokenizer (8,192 vocab)
## Citation
```bibtex
@misc{hebrewgpt1b-instruct-2026,
  title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct}
}
```
## Limitations
- Small vocabulary (8,192 tokens) may limit performance on rare words
- 2,048 context window limits long-document tasks
- Trained primarily on structured instruction tasks; open-ended generation quality may vary
- Hebrew-specific model — limited multilingual capability beyond Hebrew-English translation
## License
Apache 2.0