# ismail - DeepSeek-V3 Inspired Turkish LLM Implementation
![Status](https://img.shields.io/badge/Status-Untrained_Architecture-yellow)<br>
**ismail** is a from-scratch Turkish language model implementation designed for low-end hardware, built and being trained on a single RTX 5070 (12GB). This is my first LLM project, heavily inspired by [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) and built with guidance from [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch). ismail uses Ali Bayram's [Turkish Tiktokenizer](https://huggingface.co/spaces/alibayram/turkish_tiktokenizer), a morphology-based tokenizer that achieves significantly better compression for agglutinative languages than standard BPE.
**Language Focus**: ismail is trained exclusively on Turkish datasets using a custom morphology-aware tokenizer optimized for Turkish's agglutinative structure.
> **Status**: Pretraining is currently ongoing on Turkish text with a single RTX 5070 GPU. This will take a while!
## Architecture Highlights
ismail implements several advanced techniques optimized for memory-constrained environments:
- **Multi-Head Latent Attention (MLA)**: DeepSeek-inspired attention mechanism with LoRA-style compression
  - KV cache compression via low-rank projection (`kv_lora_rank`: 512 full-scale / 256 current)
  - Separate RoPE and non-RoPE attention heads
  - Reduced memory footprint for longer sequences
- **Mixture of Experts (MoE)**: Efficient sparse expert routing
  - Routed experts: 4-6 experts with top-2 activation
  - Shared experts for common knowledge
  - Sequential expert training for limited VRAM
  - Configurable expert rotation during training
- **YaRN RoPE**: Extended context length support
  - Dynamic frequency scaling based on sequence length
  - Smooth interpolation for position embeddings
  - Support for sequences beyond training length
- **Custom Kernels**: Triton-based GPU kernels for FP8 quantization
  - Optimized matrix multiplication
  - Activation and weight quantization
  - Memory-efficient inference
- **Turkish Morphological Tokenizer**: Custom hybrid tokenizer designed for Turkish
  - Combines rule-based morphological analysis with BPE
  - Preserves linguistic structure (roots, suffixes, phonological rules)
  - Based on research: ["Tokens with Meaning"](https://arxiv.org/abs/2508.14292)
  - 32,768 vocabulary size optimized for Turkish
## Model Configuration
**Current Training Config** (512-dim model for 12GB GPU):
```json
{
  "vocab_size": 32768,
  "dim": 512,
  "n_layers": 16,
  "n_heads": 12,
  "n_routed_experts": 4,
  "n_activated_experts": 2,
  "max_seq_len": 512,
  "kv_lora_rank": 256
}
```
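As a rough illustration of why the low-rank KV cache matters at this scale, the sketch below estimates per-token cache size from the config above. The formulas are simplifying assumptions (full K/V caching vs. latent-only caching, ignoring the separate RoPE head dimensions), not measurements from the repo:

```python
import json

# The current 512-dim training config, as shown above.
cfg = json.loads("""
{
  "vocab_size": 32768,
  "dim": 512,
  "n_layers": 16,
  "n_heads": 12,
  "max_seq_len": 512,
  "kv_lora_rank": 256
}
""")

def kv_cache_per_token(cfg, mla=True, bytes_per_elem=2):
    """Rough per-token KV cache size in bytes across all layers.

    Standard attention caches full K and V (2 * dim per layer);
    MLA caches only the compressed latent (kv_lora_rank per layer).
    A simplifying estimate that ignores the RoPE head split.
    """
    per_layer = cfg["kv_lora_rank"] if mla else 2 * cfg["dim"]
    return per_layer * cfg["n_layers"] * bytes_per_elem

standard = kv_cache_per_token(cfg, mla=False)  # full K/V cache
latent = kv_cache_per_token(cfg, mla=True)     # compressed latent cache
print(standard, latent, standard / latent)     # 32768 8192 4.0
```

Under these assumptions the latent cache is 4x smaller per token, which is where the "~4x reduction" claim for MLA comes from.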
**Full-Scale Config** (1024-dim model):
- 1024 hidden dimensions
- 20 layers (3 dense + 17 MoE)
- 6 routed experts per MoE layer
- Support for 2048+ token sequences
## Project Structure
```
ismail/
├── Model_Architecture/
│   ├── model.py           # Core model implementation
│   ├── train.py           # Training loop with expert rotation
│   ├── generation.py      # Text generation and sampling
│   ├── data.py            # Dataset and data loading
│   ├── kernel.py          # Custom Triton kernels for FP8
│   ├── config.json        # Model and training configuration
│   └── requirements.txt   # Dependencies
├── LiteratureReview/
│   ├── Deepseek-V3/       # DeepSeek architecture analysis
│   ├── GPT-2/             # GPT-2 baseline implementations
│   ├── Llama/             # Llama 3 architecture study
│   ├── Mistral/           # Mistral architecture analysis
│   └── Qwen3/             # Qwen 3 architecture study
└── turkish_tiktokenizer/  # Custom Turkish morphological tokenizer
    ├── app.py             # Gradio demo interface
    └── README.md          # Tokenizer documentation
```
## Installation
### Requirements
- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (tested on RTX 5070 12GB)
- 16GB+ system RAM recommended
### Setup
```bash
# Clone the repository
git clone https://github.com/yourusername/ismail.git
cd ismail
# Install dependencies
cd Model_Architecture
pip install -r requirements.txt
# Optional: Install W&B for experiment tracking
pip install wandb
# Optional: Install bitsandbytes for 8-bit Adam optimizer
pip install bitsandbytes
```
## Usage
### Training
```bash
cd Model_Architecture
# Train with default config
python train.py
# Train with custom config
python train.py --config config.json
# Resume from checkpoint
python train.py --resume checkpoints/step_10000.pt
```
**Training Features**:
- Gradient accumulation for effective larger batch sizes
- Expert rotation for memory-efficient MoE training
- Mixed precision training (FP32/BF16/FP8)
- Automatic checkpointing
- W&B integration for tracking
- Validation during training
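For readers new to the trick, gradient accumulation can be sketched in a few lines. This is a minimal illustration, not the repo's `train.py`: losses from small micro-batches are scaled and accumulated so that the optimizer sees an effective batch of `micro_batch_size * grad_accum` samples.

```python
import torch

# Minimal gradient-accumulation sketch (illustrative, not the repo's train.py).
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
grad_accum = 8
optimizer_steps = 0

for step in range(32):                  # 32 micro-batches
    x = torch.randn(4, 8)               # micro-batch of 4 samples
    loss = model(x).pow(2).mean()
    (loss / grad_accum).backward()      # scale so accumulated grads average
    if (step + 1) % grad_accum == 0:    # update only every grad_accum steps
        opt.step()
        opt.zero_grad()
        optimizer_steps += 1

print(optimizer_steps)  # 32 micro-batches / 8 accumulation steps = 4 updates
```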
### Generation
```bash
# Generate text
python generation.py --checkpoint checkpoints/latest.pt --prompt "Your prompt here"
```
### Model Configuration
Edit [config.json](Model_Architecture/config.json) to customize:
- Model architecture (dimensions, layers, experts)
- Training hyperparameters (learning rate, batch size)
- Data paths and tokenizer
- Logging and checkpointing
## Turkish Language Support
ismail uses a custom hybrid tokenizer specifically designed for Turkish:
- **Morphological Awareness**: Understands Turkish word structure (roots + suffixes)
- **Efficient Encoding**: 32K vocabulary with ~3.5x compression ratio
- **Linguistic Preservation**: Maintains grammatical information in token boundaries
- **Research-Based**: Implements hybrid approach from [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)
The tokenizer handles Turkish's rich morphology better than standard BPE, preserving linguistic meaning while maintaining vocabulary efficiency. See [turkish_tiktokenizer/README.md](turkish_tiktokenizer/README.md) for details.
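To make the idea concrete, here is a hand-segmented example of the kind of split a morphological tokenizer aims for. The segmentation and the `compression_ratio` helper are illustrative only; the real turkish_tiktokenizer's output may differ:

```python
# Illustrative morphological split, not actual tokenizer output.
def compression_ratio(text, tokens):
    """Characters per token: higher means a denser encoding."""
    return len(text) / len(tokens)

word = "evlerimizde"                     # "in our houses"
morphemes = ["ev", "ler", "imiz", "de"]  # root + plural + possessive + locative
print(compression_ratio(word, morphemes))  # 11 chars / 4 tokens = 2.75
```

Because each suffix carries grammatical meaning, splitting at morpheme boundaries keeps tokens interpretable while still compressing well.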
## Key Features for Low-End Hardware
1. **Sequential Expert Training**: Train one expert at a time to fit in 12GB VRAM
2. **Gradient Checkpointing**: Trade compute for memory
3. **8-bit Optimizer**: bitsandbytes Adam optimizer reduces memory by ~40%
4. **Small Batch Training**: Gradient accumulation enables large effective batch sizes
5. **FP8 Inference**: Custom kernels for efficient inference
6. **Flexible Configuration**: Easy to scale down for smaller GPUs
## Inspiration & References
This project draws heavily from:
- **[DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3)**: MLA and MoE architecture
- **[LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)**: Educational foundation and best practices
- **GPT-2/3**: Transformer baseline architecture
- **Llama 3**: RoPE and normalization techniques
## Technical Details
### Multi-Head Latent Attention (MLA)
The MLA mechanism compresses KV cache using low-rank projections:
- Query: Standard multi-head projection
- Key/Value: Compressed via LoRA-style down/up projection
- Split heads: RoPE-enabled (64d) + Non-RoPE (128d)
- Memory savings: ~4x reduction in KV cache size
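A minimal sketch of the compression path (class and variable names are illustrative, not the repo's; it omits the RoPE/non-RoPE head split and the attention computation itself):

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Sketch of MLA-style KV compression: cache the small latent,
    reconstruct full keys/values on the fly."""
    def __init__(self, dim=512, kv_lora_rank=256, n_heads=12, head_dim=64):
        super().__init__()
        self.down = nn.Linear(dim, kv_lora_rank, bias=False)            # compress
        self.up_k = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)
        self.up_v = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)

    def forward(self, x):
        latent = self.down(x)   # this small tensor is what gets cached
        k = self.up_k(latent)   # keys reconstructed from the latent
        v = self.up_v(latent)   # values reconstructed from the latent
        return latent, k, v

m = LatentKV()
x = torch.randn(1, 16, 512)     # (batch, seq, dim)
latent, k, v = m(x)
print(latent.shape, k.shape)    # cache stores 256 dims per token, not full K/V
```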
### Mixture of Experts (MoE)
- Top-K routing (K=2) with learned router
- Shared experts for common features
- Load balancing loss to prevent expert collapse
- Sequential training mode for VRAM constraints
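The routing step can be sketched as follows. This is a toy illustration with made-up module names; it omits the shared experts and the load-balancing loss:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Sketch of top-2 expert routing (illustrative, not the repo's MoE)."""
    def __init__(self, dim=64, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.top_k = top_k

    def forward(self, x):                       # x: (tokens, dim)
        logits = self.router(x)                 # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)       # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):          # each token's k chosen experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e        # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TinyMoE()
y = moe(torch.randn(10, 64))
print(y.shape)
```

Only 2 of the 4 expert FFNs run per token, which is what makes MoE parameters cheap at inference time.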
### YaRN Positional Encoding
- Extends context beyond training length
- Smooth frequency interpolation
- Maintains performance on short sequences
- Configurable extrapolation factors
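A simplified sketch of the frequency-scaling idea (the cutoff constants here are illustrative assumptions; real YaRN derives its ramp from rotation wavelengths relative to the original context length):

```python
import math

def yarn_like_inv_freq(head_dim=64, base=10000.0, scale=4.0,
                       low=0.25, high=0.75):
    """Simplified YaRN-style frequency scaling (illustrative cutoffs).

    High-frequency dims (short-range position info) are left untouched;
    low-frequency dims are interpolated by 1/scale to stretch over the
    extended context; dims in between get a linear blend.
    """
    freqs = []
    half = head_dim // 2
    for i in range(half):
        inv = base ** (-2 * i / head_dim)    # standard RoPE frequency
        t = i / (half - 1)                   # 0 = highest freq, 1 = lowest
        if t < low:
            ramp = 0.0                       # keep: local detail
        elif t > high:
            ramp = 1.0                       # fully interpolate
        else:
            ramp = (t - low) / (high - low)  # smooth blend in between
        freqs.append(inv * ((1 - ramp) + ramp / scale))
    return freqs

f = yarn_like_inv_freq()
print(f[0], f[-1])  # first dim unchanged; last dim divided by scale
```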
## Current Status & Roadmap
**Current**:
- ✅ Core architecture implemented
- ✅ Training pipeline functional
- ✅ Custom Turkish morphological tokenizer
- ✅ Turkish dataset preparation
- 🔄 Pretraining on Turkish text with a single RTX 5070 (ongoing)
**Planned**:
- [ ] Complete initial pretraining run
- [ ] Evaluation on Turkish benchmarks (TurkishBench, etc.)
- [ ] Fine-tuning pipeline for instruction following
- [ ] Model release (if not too lame!)
- [ ] Multi-GPU training support
- [ ] Inference optimization and quantization
## Performance
Training on RTX 5070 (12GB):
- **512-dim model**: ~3.5 tokens/sec with batch_size=16, grad_accum=8
- **Memory usage**: ~11.5GB during training
- **Estimated pretraining**: Several weeks for 100K steps
*Performance will improve significantly with better hardware!*
## Acknowledgments
Special thanks to:
- [DeepSeek AI](https://github.com/deepseek-ai) for the innovative MLA and MoE architectures
- [Sebastian Raschka](https://github.com/rasbt) for the excellent LLMs-from-scratch educational resource
- The broader open-source LLM community for making this possible
## Contributing
This is primarily a learning project, but suggestions and feedback are welcome! Feel free to open issues or PRs.
## Contact
For questions or discussions, please open an issue on GitHub.
---
*Built with determination and limited VRAM* 🚀