# ismail - DeepSeek-V3 Inspired Turkish LLM Implementation

![Status](https://img.shields.io/badge/Status-Untrained_Architecture-yellow)

**ismail** is a from-scratch Turkish language model implementation designed for low-end hardware, built and being trained on a single RTX 5070 (12GB). This is my first LLM project, heavily inspired by [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) and built with guidance from [LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch).

ismail uses Ali Bayram's [Turkish Tiktokenizer](https://huggingface.co/spaces/alibayram/turkish_tiktokenizer), a morphology-based tokenizer that achieves significantly better compression for agglutinative languages than standard BPE.

**Language Focus**: ismail is trained exclusively on Turkish datasets using a custom morphology-aware tokenizer optimized for Turkish's agglutinative structure.

> **Status**: Pretraining on Turkish text is currently ongoing on a single RTX 5070 GPU. This will take a while!

## Architecture Highlights

ismail implements several advanced techniques optimized for memory-constrained environments:

- **Multi-Head Latent Attention (MLA)**: DeepSeek-inspired attention mechanism with LoRA-style compression
  - KV cache compression via low-rank projection (`kv_lora_rank`: 512/256)
  - Separate RoPE and non-RoPE attention heads
  - Reduced memory footprint for longer sequences
- **Mixture of Experts (MoE)**: Efficient sparse expert routing (see the sketch after this list)
  - Routed experts: 4-6 experts with top-2 activation
  - Shared experts for common knowledge
  - Sequential expert training for limited VRAM
  - Configurable expert rotation during training
- **YaRN RoPE**: Extended context length support
  - Dynamic frequency scaling based on sequence length
  - Smooth interpolation for position embeddings
  - Support for sequences beyond training length
- **Custom Kernels**: Triton-based GPU kernels for FP8 quantization
  - Optimized matrix multiplication
  - Activation and weight quantization
  - Memory-efficient inference
- **Turkish Morphological Tokenizer**: Custom hybrid tokenizer designed for Turkish
  - Combines rule-based morphological analysis with BPE
  - Preserves linguistic structure (roots, suffixes, phonological rules)
  - Based on research: ["Tokens with Meaning"](https://arxiv.org/abs/2508.14292)
  - 32,768-token vocabulary optimized for Turkish
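To make the routed-plus-shared MoE idea concrete, here is a minimal, self-contained sketch of top-2 routing. It follows the description above, but the class names (`TopKRouter`, `MoELayer`) and the 4x feed-forward expansion are illustrative assumptions, not the actual implementation in `model.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Scores each token against every routed expert and keeps the top-2."""

    def __init__(self, dim: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, n_experts, bias=False)

    def forward(self, x):
        logits = self.gate(x)                                   # (tokens, n_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        return F.softmax(weights, dim=-1), indices              # renormalize the chosen 2


class MoELayer(nn.Module):
    """Every token passes through the shared expert; the router adds a
    weighted combination of that token's top-2 routed experts on top."""

    def __init__(self, dim: int = 512, n_routed: int = 4):
        super().__init__()
        def ffn():  # illustrative 4x-expansion feed-forward expert
            return nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.router = TopKRouter(dim, n_routed)
        self.routed = nn.ModuleList(ffn() for _ in range(n_routed))
        self.shared = ffn()

    def forward(self, x):
        out = self.shared(x)                       # shared "common knowledge" path
        weights, indices = self.router(x)
        for k in range(weights.size(-1)):          # the two routing slots per token
            for e, expert in enumerate(self.routed):
                mask = indices[:, k] == e          # tokens whose slot k picked expert e
                if mask.any():
                    out[mask] = out[mask] + weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(8, 512)       # 8 tokens, dim=512 as in the current config
print(MoELayer()(tokens).shape)    # torch.Size([8, 512])
```

Because only 2 of the routed experts run per token, compute per token stays roughly constant as experts are added; only parameter count (and VRAM) grows, which is why the repo pairs this with sequential expert training.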
## Model Configuration

**Current Training Config** (512-dim model for 12GB GPU):

```json
{
  "vocab_size": 32768,
  "dim": 512,
  "n_layers": 16,
  "n_heads": 12,
  "n_routed_experts": 4,
  "n_activated_experts": 2,
  "max_seq_len": 512,
  "kv_lora_rank": 256
}
```

**Full-Scale Config** (1024-dim model):

- 1024 hidden dimensions
- 20 layers (3 dense + 17 MoE)
- 6 routed experts per MoE layer
- Support for 2048+ token sequences

## Project Structure

```
ismail/
├── Model_Architecture/
│   ├── model.py           # Core model implementation
│   ├── train.py           # Training loop with expert rotation
│   ├── generation.py      # Text generation and sampling
│   ├── data.py            # Dataset and data loading
│   ├── kernel.py          # Custom Triton kernels for FP8
│   ├── config.json        # Model and training configuration
│   └── requirements.txt   # Dependencies
├── LiteratureReview/
│   ├── Deepseek-V3/       # DeepSeek architecture analysis
│   ├── GPT-2/             # GPT-2 baseline implementations
│   ├── Llama/             # Llama 3 architecture study
│   ├── Mistral/           # Mistral architecture analysis
│   └── Qwen3/             # Qwen 3 architecture study
└── turkish_tiktokenizer/  # Custom Turkish morphological tokenizer
    ├── app.py             # Gradio demo interface
    └── README.md          # Tokenizer documentation
```

## Installation

### Requirements

- Python 3.8+
- PyTorch 2.0+
- CUDA-capable GPU (tested on RTX 5070 12GB)
- 16GB+ system RAM recommended

### Setup

```bash
# Clone the repository
git clone https://github.com/yourusername/ismail.git
cd ismail

# Install dependencies
cd Model_Architecture
pip install -r requirements.txt

# Optional: Install W&B for experiment tracking
pip install wandb

# Optional: Install bitsandbytes for 8-bit Adam optimizer
pip install bitsandbytes
```

## Usage

### Training

```bash
cd Model_Architecture

# Train with default config
python train.py

# Train with custom config
python train.py --config config.json

# Resume from checkpoint
python train.py --resume checkpoints/step_10000.pt
```

**Training Features**:

- Gradient accumulation for effective larger batch sizes
- Expert rotation for memory-efficient MoE training
- Mixed precision training (FP32/BF16/FP8)
- Automatic checkpointing
- W&B integration for tracking
- Validation during training

### Generation

```bash
# Generate text
python generation.py --checkpoint checkpoints/latest.pt --prompt "Your prompt here"
```

### Model Configuration

Edit [config.json](Model_Architecture/config.json) to customize:

- Model architecture (dimensions, layers, experts)
- Training hyperparameters (learning rate, batch size)
- Data paths and tokenizer
- Logging and checkpointing

## Turkish Language Support

ismail uses a custom hybrid tokenizer specifically designed for Turkish:

- **Morphological Awareness**: Understands Turkish word structure (roots + suffixes)
- **Efficient Encoding**: 32K vocabulary with ~3.5x compression ratio
- **Linguistic Preservation**: Maintains grammatical information in token boundaries
- **Research-Based**: Implements the hybrid approach from [arXiv:2508.14292](https://arxiv.org/abs/2508.14292)

The tokenizer handles Turkish's rich morphology better than standard BPE, preserving linguistic meaning while maintaining vocabulary efficiency. See [turkish_tiktokenizer/README.md](turkish_tiktokenizer/README.md) for details.

## Key Features for Low-End Hardware

1. **Sequential Expert Training**: Train one expert at a time to fit in 12GB VRAM
2. **Gradient Checkpointing**: Trade compute for memory
3. **8-bit Optimizer**: bitsandbytes Adam optimizer reduces memory by ~40%
4. **Small-Batch Training**: Gradient accumulation enables large effective batch sizes (see the sketch after this list)
5. **FP8 Inference**: Custom kernels for efficient inference
6. **Flexible Configuration**: Easy to scale down for smaller GPUs
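To illustrate feature 4, here is a minimal sketch of gradient accumulation. The stand-in model, data generator, and hyperparameters are placeholders so the snippet runs on its own; the actual loop lives in `train.py`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in model and data so the snippet is self-contained;
# the real training loop is in Model_Architecture/train.py.
VOCAB, DIM, GRAD_ACCUM = 32768, 512, 8
model = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def micro_batches(n, micro_bs=2, seq_len=64):
    for _ in range(n):
        ids = torch.randint(0, VOCAB, (micro_bs, seq_len + 1))
        yield ids[:, :-1], ids[:, 1:]               # next-token prediction pair

for step, (inputs, targets) in enumerate(micro_batches(16)):
    logits = model(inputs)                          # (micro_bs, seq_len, VOCAB)
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), targets.reshape(-1))
    # Scale so the summed gradient matches one big-batch step:
    # effective batch size = GRAD_ACCUM * micro_bs.
    (loss / GRAD_ACCUM).backward()
    if (step + 1) % GRAD_ACCUM == 0:
        optimizer.step()                            # one update per 8 micro-batches
        optimizer.zero_grad(set_to_none=True)
```

Feature 1 would compose with the same loop: setting `requires_grad_(False)` on the parameters of inactive experts means only the active expert accumulates gradients and optimizer state, keeping peak VRAM within the 12GB budget.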
## Inspiration & References

This project draws heavily from:

- **[DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3)**: MLA and MoE architecture
- **[LLMs-from-scratch](https://github.com/rasbt/LLMs-from-scratch)**: Educational foundation and best practices
- **GPT-2/3**: Transformer baseline architecture
- **Llama 3**: RoPE and normalization techniques

## Technical Details

### Multi-Head Latent Attention (MLA)

The MLA mechanism compresses the KV cache using low-rank projections, as sketched below:

- Query: Standard multi-head projection
- Key/Value: Compressed via LoRA-style down/up projection
- Split heads: RoPE-enabled (64d) + non-RoPE (128d)
- Memory savings: ~4x reduction in KV cache size
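The following is a simplified sketch of just the key/value compression path. It omits the RoPE / non-RoPE head split and the attention computation itself, and the names and dimensions (`CompressedKV`, `head_dim=128`) are illustrative rather than the exact `model.py` code:

```python
import torch
import torch.nn as nn


class CompressedKV(nn.Module):
    """Sketch of MLA's KV path: cache one small latent per token and
    expand it to full per-head keys/values on the fly."""

    def __init__(self, dim: int = 512, kv_lora_rank: int = 256,
                 n_heads: int = 12, head_dim: int = 128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.down = nn.Linear(dim, kv_lora_rank, bias=False)                 # compress
        self.up_k = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)  # expand to K
        self.up_v = nn.Linear(kv_lora_rank, n_heads * head_dim, bias=False)  # expand to V

    def forward(self, x):
        b, s, _ = x.shape
        # Only c_kv needs to live in the KV cache: kv_lora_rank floats per
        # token instead of 2 * n_heads * head_dim for conventional K/V caches.
        c_kv = self.down(x)
        k = self.up_k(c_kv).view(b, s, self.n_heads, self.head_dim)
        v = self.up_v(c_kv).view(b, s, self.n_heads, self.head_dim)
        return c_kv, k, v


kv = CompressedKV()
c_kv, k, v = kv(torch.randn(1, 16, 512))
print(c_kv.shape, k.shape)   # cached: (1, 16, 256) vs expanded: (1, 16, 12, 128)
```

The memory saving comes entirely from caching `c_kv` instead of `k` and `v`; the up-projections are recomputed as needed, trading a small amount of compute for a much smaller cache on long sequences.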
### Mixture of Experts (MoE)

- Top-K routing (K=2) with a learned router
- Shared experts for common features
- Load-balancing loss to prevent expert collapse
- Sequential training mode for VRAM constraints

### YaRN Positional Encoding

- Extends context beyond training length
- Smooth frequency interpolation
- Maintains performance on short sequences
- Configurable extrapolation factors

## Current Status & Roadmap

**Current**:

- ✅ Core architecture implemented
- ✅ Training pipeline functional
- ✅ Custom Turkish morphological tokenizer
- ✅ Turkish dataset preparation
- 🔄 Pretraining on Turkish text with a single RTX 5070 (ongoing)

**Planned**:

- [ ] Complete initial pretraining run
- [ ] Evaluation on Turkish benchmarks (TurkishBench, etc.)
- [ ] Fine-tuning pipeline for instruction following
- [ ] Model release (if not too lame!)
- [ ] Multi-GPU training support
- [ ] Inference optimization and quantization

## Performance

Training on an RTX 5070 (12GB):

- **512-dim model**: ~3.5 tokens/sec with batch_size=16, grad_accum=8
- **Memory usage**: ~11.5GB during training
- **Estimated pretraining**: Several weeks for 100K steps

*Performance will improve significantly with better hardware!*

## Acknowledgments

Special thanks to:

- [DeepSeek AI](https://github.com/deepseek-ai) for the innovative MLA and MoE architectures
- [Sebastian Raschka](https://github.com/rasbt) for the excellent LLMs-from-scratch educational resource
- The broader open-source LLM community for making this possible

## Contributing

This is primarily a learning project, but suggestions and feedback are welcome! Feel free to open issues or PRs.

## Contact

For questions or discussions, please open an issue on GitHub.

---

*Built with determination and limited VRAM* 🚀