---
license: gpl-3.0
language:
- az
base_model:
- Yusiko/Khazri
tags:
- aze
- mini
- yusiko
---
# Khazri – Azerbaijani Language Model

**A lightweight, efficient, and fully custom Azerbaijani language model designed for text generation, chat applications, education, and research.**

Khazri is trained from scratch on a custom 10M-sample Azerbaijani dataset and optimized to run on consumer GPUs while maintaining competitive performance.

## Features

- 🇦🇿 Native Azerbaijani language support
- ⚡ Lightweight architecture (≈36M parameters)
- Supports fast inference with GGUF + llama.cpp (see the sketch after this list)
- Available on Hugging Face
- Optimized for chatbots, WebRTC real-time assistants, and low-latency deployment
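For the GGUF + llama.cpp path, a minimal sketch using the `llama-cpp-python` bindings might look like the following. It assumes you have a GGUF export of Khazri available locally; the file name `khazri-36m.gguf` is a hypothetical placeholder, since this card does not state where an official GGUF file is published.

```python
# Minimal sketch: running a GGUF export of Khazri with the llama-cpp-python
# bindings (pip install llama-cpp-python). The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="khazri-36m.gguf", n_ctx=1024)  # hypothetical local path

out = llm("Salam, necəsən?", max_tokens=64)
print(out["choices"][0]["text"])
```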
## Model Architecture

| Version | Parameters | Type | Context Length | Notes |
|---------|------------|------|----------------|-------|
| Khazri-36M | ~36.6M | GPT-2 Small variant | 1024 | Higher quality |

Architecture (a rough sizing sketch follows the list):

- Transformer decoder-only
- Multi-head self-attention
- Rotary positional embeddings (RoPE)
- GELU activation
- Layer normalization
- Tied embeddings
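As a sanity check on the ~36M figure, here is a small back-of-the-envelope script. The dimensions used (vocabulary size, hidden size, layer count, feed-forward width) are assumptions chosen for illustration, not the published Khazri configuration, which this card does not spell out.

```python
# Rough parameter-count estimate for a small decoder-only Transformer with tied
# embeddings, ignoring biases. RoPE adds no learned positional parameters.
# All dimensions below are illustrative assumptions, not Khazri's actual config.
def count_params(vocab=32_000, d_model=512, n_layers=6, d_ff=2048):
    emb = vocab * d_model            # token embeddings (tied with the LM head)
    attn = 4 * d_model * d_model     # Q, K, V and output projections per layer
    mlp = 2 * d_model * d_ff         # up- and down-projection per layer
    norms = 4 * d_model              # two LayerNorms (weight + bias) per layer
    return emb + n_layers * (attn + mlp + norms)

print(f"{count_params() / 1e6:.1f}M parameters")  # ≈ 35.3M with these assumptions
```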
## Dataset

Khazri is trained on a 10-million-sample Azerbaijani dataset including:

- News, books, conversations, social media, web articles, educational content

Preprocessing (a minimal sketch of the pipeline follows):

- Unicode normalization, deduplication, tokenizer preprocessing, length filtering
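The exact preprocessing code is not published in this card, so the snippet below is only a minimal sketch of the steps listed above (Unicode normalization, deduplication, and length filtering); the thresholds and the hash-based deduplication are assumptions.

```python
# Minimal sketch of the preprocessing steps listed above. Thresholds and the
# hash-based exact deduplication are illustrative, not Khazri's exact pipeline.
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # Unicode normalization (NFC) plus whitespace cleanup.
    return " ".join(unicodedata.normalize("NFC", text).split())

def preprocess(corpus, min_chars=32, max_chars=8_192):
    seen = set()
    for raw in corpus:
        text = normalize(raw)
        if not (min_chars <= len(text) <= max_chars):  # length filtering
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:                              # exact deduplication
            continue
        seen.add(digest)
        yield text

print(list(preprocess(["Salam  dünya!", "Salam dünya!", "qısa"], min_chars=8)))
# -> ['Salam dünya!']  (duplicate and too-short samples dropped)
```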
## Training Details

### Hardware

- NVIDIA RTX 3090 (24 GB)
- PyTorch 2.x + CUDA 12
- bf16 mixed precision

### Hyperparameters

```
epochs = 1
batch_size = 32
gradient_accumulation = 4
learning_rate = 3e-4
warmup_steps = 500
weight_decay = 0.1
sequence_length = 512
optimizer = AdamW
precision = bf16
```
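The training code itself is not included in this card; the loop below is only a sketch of how the hyperparameters above combine (AdamW with weight decay 0.1, linear warmup, bf16 autocast, gradient accumulation of 4). The tiny model and random token batches are stand-ins so the snippet runs on its own.

```python
# Sketch of the configuration above: AdamW, linear warmup, bf16 autocast, and
# gradient accumulation of 4. The model and data are placeholders, not Khazri.
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).to(device)  # placeholder

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / 500)  # linear warmup over 500 steps
)
grad_accum, seq_len, batch_size = 4, 512, 32
loss_fn = nn.CrossEntropyLoss()

for step in range(8):  # a few demo steps; the real run is one epoch over the corpus
    tokens = torch.randint(0, 1000, (batch_size, seq_len), device=device)
    with torch.autocast(device_type=device, dtype=torch.bfloat16, enabled=device == "cuda"):
        logits = model(tokens[:, :-1])                        # next-token prediction
        loss = loss_fn(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
    (loss / grad_accum).backward()       # accumulate gradients over 4 micro-batches
    if (step + 1) % grad_accum == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```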
## Training Challenges & Solutions

### Bottleneck: Memory Bandwidth

With a model this small, training is bound by VRAM memory bandwidth rather than compute, and throughput plateaued at roughly 4.2 it/s.

Solution: shrink the model size, tune the batch size and gradient accumulation, and optimize data loading (see the sketch below).
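The specific data-loading changes are not described here, so the snippet below is only a generic illustration of the "optimize data loading" point: pinned host memory, multiple worker processes, and non-blocking host-to-device copies. The dataset and all settings are placeholders.

```python
# Generic illustration of keeping the GPU fed: pinned memory, worker processes,
# and asynchronous host-to-device copies. Values are placeholders, not the
# exact settings used for Khazri.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randint(0, 1000, (4096, 512)))  # stand-in token batches

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,          # overlap batch preparation with GPU work
    pin_memory=True,        # page-locked buffers enable async copies to the GPU
    persistent_workers=True,
    prefetch_factor=2,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
for (tokens,) in loader:
    tokens = tokens.to(device, non_blocking=True)  # async copy when pin_memory=True
    break  # the training step would go here
```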
## Usage

### Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Yusiko/Khazri")
model = AutoModelForCausalLM.from_pretrained("Yusiko/Khazri")
```
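A short generation example building on the snippet above; the prompt and sampling settings are arbitrary choices, not recommended defaults from this card.

```python
# Generate a short continuation with the tokenizer and model loaded above.
inputs = tok("Salam, necəsən?", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```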
## Hugging Face

Available at: https://huggingface.co/Yusiko/Khazri

## License

GPL-3.0

## Future Plans

- 1B+ model
- Better tokenizer
- Instruction tuning
- WebGPU inference
- Community fine-tuning tools

## Contact

Created by **Yusiko**

GitHub: [Yusiko99](https://github.com/Yusiko99)

Website: https://yusi.xo.je

Hugging Face: [Yusiko](https://huggingface.co/Yusiko)